At the very least, this is what I will be using as a data backup mechanism (moving away from SQL dumps). I hope other people gather the time to implement it and post their data in this format as well.
It is generic: it allows for not just FRC, but FTC and FLL as well. It allows for team number or information changes, since it keeps data per season, not once for all of history. It allows for score components to be posted and attributed to multiple teams. It is flexible enough to allow for more than two alliances, or varying numbers of teams per alliance. In many events, I have both the score and penalty components posted, and for some events I have more fine-grained data: scored points, bonus, and penalties, all separate (which will appear in the dump when I get around to importing the score components into my database).
The specification for the format appears at http://docs.google.com/Doc?id=dcz67k4q_38f2rp7sfc. If anyone is interested in helping engineer the details of the format, PM me or visit me on IRC (see my footer), and I can add you as an editor to the document. You should have good problem-solving skills, think about different situations that the format might be used in, know about XML and its related technologies (XQuery, XPath, XSLT, namespaces), and be familiar with RFC 2119 and when to use each keyword. Basically, be able to think: is there a reason a certain feature should be required, is there any reason that someone might need to leave an attribute out, how should parsers deal with missing data, etc.
Talking with other people about this, there was some concern over the selection of XML as the data format. While XML is certainly not appropriate for everything (including RPC calls imo), it really stands out for FIRST data because it is human readable, it can be queried with XQuery (if you want to do statistics, for example), it can be easily transformed into other formats (you could generate static HTML pages using XSL), and it has namespace support, so it can be extended. Using namespaces, you could link to game videos (<v:video href="http://firstvideoarchive.com/2008Archive/index.php?dir=Arizona/&file=az_qf2m3.wmv"/> for example) or add game data specific to one season: a s2008:laps="4" or s2007:lift="498:12 330:4" attribute on each team element under each match could record the laps a robot made or whether it lifted any other robots.
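To make the namespace idea concrete, here is a minimal Python sketch of reading a season-specific attribute. The namespace URI, the sample team numbers, and the exact element layout (match, alliance, team) are assumptions for illustration only; the spec document is the authority on the real names.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the thread does not define one.
S2008 = "http://example.org/first/2008"

# Made-up sample data, shaped like the match/alliance/team nesting described above.
SAMPLE = f"""
<match xmlns:s2008="{S2008}" number="12">
  <alliance name="red" score="40">
    <team number="842" s2008:laps="4"/>
    <team number="1290" s2008:laps="2"/>
  </alliance>
  <alliance name="blue" score="28">
    <team number="39"/>
  </alliance>
</match>
""".strip()


def laps_by_team(match):
    """Map team number to laps for every team that carries the extension attribute."""
    laps = {}
    for team in match.iterfind("./alliance/team"):
        # ElementTree exposes namespaced attributes as "{uri}localname".
        value = team.get(f"{{{S2008}}}laps")
        if value is not None:
            laps[int(team.get("number"))] = int(value)
    return laps


if __name__ == "__main__":
    print(laps_by_team(ET.fromstring(SAMPLE)))
```

A parser that knows nothing about the 2008 namespace can still read the alliances and scores, since it simply never asks for the extra attribute.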
I think you’ve done some good work so far, but there are a few rough spots. The biggest thing I’ve seen so far is that the document itself contains no meta data about the date of creation, or the time span where it was valid.
Another thing, which may just be a matter of preference, is that you use one-letter abbreviations in several places. Even if for no other reason than readability, full words seem like a better choice.
If anyone is interested I’ve attached a file, which almost conforms to this spec, that was generated using the TBA Api.
I don’t have a major issue with it being XML (there are better formats for this, though). My question is why this is not two files, one for team data and one for match data; you have a lot of duplicate data in that file. I am not sure why you list all the teams at a comp at the top level of it, and then again in each of the matches.
It is a good start; it just needs some major layout work, IMHO.
Some comments, although I have not given the document a thorough read:
Avoid abbreviations where possible. “Red” and “Blue” are better than “R” and “B”. You’re thinking very FIRST-centric; think abstractly.
I’d like to see alliance/team have an optional “captain” attribute.
Calculating scores from penalties is nice, but a luxury. Can we have a “final score” attribute that is required, and “unpenalized score” and “penalties” attributes that are optional? What happens if in the future penalties are assessed by giving the opposing team points?
The “competition” attribute should have canonical names. “FRC” “FTC” “FLL” “VEX” “BEST” “OCCRA” spring to mind.
What is done for dates in other XML formats? RSS does this: “<pubDate>Fri, 05 Sep 2008 00:51:30 -0400</pubDate>”, so is that an easy format to generate/parse compared to YYYY-MM-DD?
Location is “e.g. “Phoenix, AZ, USA””. We’re throwing away resolution with that. <location><street /><city /><state /><country /></location>?
We should give some serious thought to the representation of match numbers. “21” for Quarter Finals 2 Match 1 makes some sense, but what happens if there is ever a series of 8 ties and we go to game 10? The Blue Alliance represents this data poorly now.
I used abbreviations for alliance color because that is how I stored it in my database (a CHAR(1)), and the keep names descriptive, values small philosophy followed me into XML. I think you are right on that however, so I’ll change it. I think all lowercase works, is that a good idea? Or Title Case?
I think meta data is outside of the scope of the format? There is RDF, which is aimed at adding metadata to XML documents, I think that might be the solution if that is completely necessary.
The idea of this is to have a file format that can store everything in one location. It is flexible enough to keep teams and matches separately, or upload data for only one event for example, but I can’t think of a reason to do that, when you could just merge them, when you are trying to store multiple events or seasons of data.
As for redundant data, you could figure out which teams are in each event by scanning each child <alliance> tag, but that is harder to do. It also makes semantic sense, a team element under an event tag means the team was part of that event, the same way it does for the alliance tag. Similar thinking was behind adding the score=“” attribute to the alliance tag, it is redundant, but it simplifies it down, and it allows you to not know each component, and have only the score.
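As a rough illustration of the two ways to get an event's team list, here is a Python sketch. The element and attribute names (event, match, alliance, team, number) and the sample data are assumptions based on the description in this thread, not taken from the spec.

```python
import xml.etree.ElementTree as ET

# Made-up event: three teams are listed, but only two appear in a match.
EVENT = """
<event name="Arizona Regional">
  <team number="39"/>
  <team number="842"/>
  <team number="1290"/>
  <match number="1">
    <alliance name="red" score="40"><team number="842"/></alliance>
    <alliance name="blue" score="28"><team number="39"/></alliance>
  </match>
</event>
""".strip()


def listed_teams(event):
    """Teams attached directly to the event element."""
    return {int(t.get("number")) for t in event.findall("./team")}


def playing_teams(event):
    """Teams found by scanning every alliance of every match."""
    return {int(t.get("number")) for t in event.findall("./match/alliance/team")}


if __name__ == "__main__":
    event = ET.fromstring(EVENT)
    print("listed:", sorted(listed_teams(event)))
    print("listed but never played:", sorted(listed_teams(event) - playing_teams(event)))
```

The scan is not much code, but the direct list also carries teams that never show up in a match, which the scan alone cannot recover.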
Abbreviations, that is a good point too.
For elimination matches, a captain designation would be very useful. It didn’t cross my mind, I don’t keep any elimination data at all (I should be, because it is the only time the same teams replay each other under the same conditions, useful for statistics work).
Good point about the penalties. The score attribute was added as a convenience, and I don’t know how programs would deal with discrepancies between the score/penalty elements and score attribute. If there was a penalty system that awarded points to the other team/alliance, it would be hard to implement because the data format follows the premise that data belongs to other data, e.g. a score belongs to an alliance, which is a part of a match, etc. In that case, you would be faced with the issue of who owns the penalty. Off the top of my head, you would give a <penalty value=“0” name=“somepenalty for -10”/> to the offending side, then award <score name=“somepenalty” value=“10”/> to the other side.
Making the score attribute mandatory seems like a good idea. That would allow the listed components to not be comprehensive, that is, the listed score components are not 100% of the score. It might follow that you could have “points” and “penalties” attributes that are also authoritative over the respective score and penalty components.
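One way a program could deal with discrepancies is to treat the score attribute as authoritative and only report how much of it the listed components explain. This is a sketch under assumptions: the component names are made up, and penalties are assumed to be positive values subtracted from the points, which the format may or may not end up specifying.

```python
import xml.etree.ElementTree as ET

# Made-up alliance with a mandatory score attribute and optional components.
ALLIANCE = """
<alliance name="red" score="34">
  <score name="laps" value="32"/>
  <score name="bonus" value="12"/>
  <penalty name="interference" value="10"/>
</alliance>
""".strip()


def component_total(alliance):
    """Sum of listed score components minus listed penalties (assumed convention)."""
    points = sum(int(s.get("value")) for s in alliance.findall("./score"))
    penalties = sum(int(p.get("value")) for p in alliance.findall("./penalty"))
    return points - penalties


def unexplained_points(alliance):
    """Part of the authoritative score that the listed components do not account for."""
    return int(alliance.get("score")) - component_total(alliance)


if __name__ == "__main__":
    print(unexplained_points(ET.fromstring(ALLIANCE)))
```

A result of zero means the components are comprehensive; anything else just means the list is partial, not that the document is invalid.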
There are way too many date formats. YYYY-MM-DD is the international standard, defined in ISO 8601 (not a public document, but information about it is out there). It helped tremendously that this is what SQL uses for dates as well. As for converting between date formats, well, it isn’t simple anywhere, I think. I would like to get timezone data in there somewhere.
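For comparison, here is how the two formats parse with nothing but the Python standard library. This only shows that both are tractable in one language; it says nothing about how easy they are elsewhere.

```python
from datetime import date
from email.utils import format_datetime, parsedate_to_datetime

# ISO 8601 calendar date, as used by the format (and by SQL).
iso_day = date.fromisoformat("2008-09-05")

# RFC 822 style timestamp, as used by RSS <pubDate>; the result keeps the UTC offset.
rss_stamp = parsedate_to_datetime("Fri, 05 Sep 2008 00:51:30 -0400")

if __name__ == "__main__":
    print(iso_day.isoformat())         # back to YYYY-MM-DD
    print(rss_stamp.isoformat())       # ISO 8601 with the offset attached
    print(format_datetime(rss_stamp))  # back to the RSS style
```

The RSS style does carry a UTC offset, which is one way timezone data could get in; an ISO 8601 timestamp with an offset suffix would do the same while staying sortable as text.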
The location attribute is simply what FIRST gives teams, it isn’t readily available in any other format. It shouldn’t be hard to split locations by commas I think, if you really need to parse the data, or use a library or API like Google Local which doesn’t have a problem with mixed data. Location is for human reference mostly, so making it atomic like that isn’t really necessary, and it might make it harder to display a simple location. Can you think of a situation where atomic location data is necessary?
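Splitting on commas could look like the sketch below. The function name and the handling of two-part strings are assumptions, and it deliberately keeps the raw string, since real location strings will not always have three parts.

```python
def split_location(location):
    """Best-effort split of a 'City, State, Country' string; the raw text is always kept."""
    parts = [part.strip() for part in location.split(",")]
    result = {"raw": location, "city": None, "state": None, "country": None}
    if len(parts) == 3:
        result["city"], result["state"], result["country"] = parts
    elif len(parts) == 2:
        # Assumption: two parts mean "City, Country" with no state or province.
        result["city"], result["country"] = parts
    return result


if __name__ == "__main__":
    print(split_location("Phoenix, AZ, USA"))
```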
FIRST gives match IDs to elimination matches too; I was thinking that those would be used, along with the name attribute to name them with dashes. A problem I didn’t consider (again because I don’t store elimination data) is that you have multiple matches with the same ID. I think previously, to keep practice and qualification data, I just kept two separate events, but that doesn’t really make much sense in a more semantic format like this one. Probably, each match number should be allowed only once per match type.
You should figure out whether this is raw data or processed data. If it is raw data, then out with the duplicates; the processing to regenerate them is tiny, and raw data is just that, raw. If it is not, then there really is no reason not to break it into separate files, just as a database has multiple tables.
That brings us back to the question: why not just make the raw SQL database readable to the world, so we can run queries against it? XML just is not a database replacement when you have relational data, which is why there are dups.
I hadn’t considered it to be used as a format for before matches, but that could very well be useful, if it is used as the format for a native XML database or something. How would you specify a match as unplayed? An attribute that specifies that might get redundant, since you don’t usually look at the data during an event, but before or after, that would mean having to deal with a bunch of unplayed=“unplayed” attributes. Is leaving out the score attribute enough to imply it is scheduled and not played yet?
If you want a use that is any better than a text find, like “find all teams not in North America” (a query I have been interested in myself before), then even fine-grained, separate country/province/city fields are not going to cut it. It seems like a dedicated API (Google Maps comes to mind) would be the best solution if you really need to interact with the location field. There shouldn’t be too much difference between a simple string and multiple atomic fields, plus the single location attribute is simpler.
I added times for each match, something which I assumed I added but did not somehow. It is in “YYYY-MM-DD HH:mm:ss” format, local timezone (FIRST-specified time).
For names of matches and alliances, what case should be used? Lowercase seems to fit with me, just keep things ultra-consistent. “red” “blue” “elimination” “qualification” etc.
It was brought to my attention that I was having encoding problems, MySQL was sending the data in iso-8859-1 (apparently I don’t want latin1_swedish_ci collation). I don’t know much about character encoding, but I think I figured out how to set the connection encoding to UTF-8, the XML default encoding (“SET NAMES utf8”).
For TBA, we set scores to -1 to indicate “unknown”. I don’t think this is a good strategy for the XML format. Maybe a “status” attribute? “future” “playing” “finished”? I am not sure entirely here how much information is too much.
Why do we not make a standard SQLite database, and then allow data from a standard XML file to be uploaded to it? It could be written in Java or Python and, tada, it’s cross-compatible magic. I fear inconsistencies will stem from duplicate data.
We are not confined to integers here; wouldn’t a literal “?” work for a played match with an unknown score? It would mean an extra sanity check that would have to be made when parsing the document, though. Then again, if you want any support for unknown scores, it is an extra if/then you have to make. Most languages would convert a “?” to 0 or some legal value that doesn’t throw an error (atoi in C, and PHP, at the very least) if no support for unknown scores is added.
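The extra check is small. Here is a Python sketch that combines the two conventions floated so far, a missing score attribute for an unplayed match and a literal “?” for a played match with an unknown score. Neither convention is settled, and the status names are made up.

```python
def read_score(raw):
    """Classify a score attribute as (status, value).

    raw is the attribute text, or None when the attribute is absent.
    """
    if raw is None:
        return ("unplayed", None)
    if raw.strip() == "?":
        return ("unknown", None)
    # int() raises ValueError on garbage instead of silently returning 0 like atoi.
    return ("final", int(raw))


if __name__ == "__main__":
    for raw in (None, "?", "40"):
        print(repr(raw), "->", read_score(raw))
```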
This isn’t intended as a data storage mechanism, I would suggest reading why I went with XML back in my first post again. A relational database like SQLite also doesn’t solve two of the requirements I am addressing: a human readable format, and an extendable format. XML would allow you to add your own data, like how many laps a team made, say, to each team element under an alliance element. To do the same in SQLite you would have to create a new table (how I am planning on doing it for my database), if you wanted to keep the data in a relational format (there are other ways you could store the data, but it wouldn’t be a strict relational format).
And listing the teams in an event isn’t redundant: it shows the teams that went to an event, regardless of if they played any matches or not. You have the exact same thing in a relational database, a table that links teams to an event.
CREATE TABLE `team_event` (
`teamnum` int(10) unsigned NOT NULL,
`eventid` int(10) unsigned NOT NULL,
PRIMARY KEY (`teamnum`,`eventid`)
)
Even in a relational database, you have to put in some redundant data from time to time simply to speed things up. It is much faster to keep a column that keeps track of the number of private messages that you have, and update it every time you receive or delete a message, than it is to count every message in your inbox every time you want to know.
// Actually, after looking through the list, there are a lot of missing teams. 230 (Gaelhawks) was the next to come to mind. 1071 (Max) is another, along with at least a dozen other teams, many of them from Connecticut…
You should be in the 2008 season team list. I don’t have the complete 2007 season team list (the first one in the file), for that season I only have the teams that went to championship or AZ regional. Hopefully I will get that data later on from somewhere.
Everyone, however, should be in there about half way down, even if I don’t have the nicknames (I should spend some time importing those sooner or later).
Sorry, that file is not really human readable for most people. To leverage it for any scouting data or otherwise, a script is going to be parsing it.
I would look at a many-to-many relationship before saying it can’t be done.
I see this as a great transitional format, i.e. database to database. I would be more than willing to write SQLite Python scripts to both export and import data in this format.
I am not trying to trash your work; you have obviously put some time into this. I am just pointing out some weaknesses. As for adding your own extra types of data, I would frown on that, as it would pull away from your standard. Let’s include what we want and make revisions as a group; then in the metadata you should be able to write what version of the format is being used, and the scripts will all be happy.
If there is one thing I cannot stand… well, if there is one thing I cannot stand about programming, it is Python.
SQLite, Postgres, MySQL, relational databases in general are for storing data. They are not for transferring it. Even when backing up relational databases, do you copy the binary files? No, you export the SQL statements that can re-create the database from scratch as a backup. SQLite is not a data format, it is a database. There is a distinct difference.
Absolutely not a problem
Yes, XML has weaknesses. Relational data also has weaknesses. There is no one killer data format, and you have to choose what makes sense and understand it won’t always be the correct choice. Again, for sharing data, XML is well suited because you can put comments in it, it is human readable. I can open it up in a web browser and inspect the data - huge plus, at least one person has already done so. Namespaces are a plus, it is extendable, and elements are inherited from other elements, that is to say there is an ancestry. It is made very clear that teams can belong to seasons, events, and matches, scores can belong to a team or a match, and so on. Such flexibility is hard with relational data, which is fairly strict. Shipping around data in a relational format confines DBAs to a straitjacket, I am re-working my own database right now (have been for two weeks on and off, and there isn’t a real end in sight) because I described alliances, scores, and events in a bad way, and am really paying for it with my time.
I am wondering why you are bashing Python, as it is an excellent language for scripting, which is what this is all about. Do you really want to write your database scripts in C? I don’t think so. I was speaking about writing a TurboGears or Django app, which is fully Python, CSS, and HTML (none of the PHP mumbo jumbo; I already went through that stage of my life). One of the cool things about these apps is the model.py file, which defines your database very simply and in an easy-to-read way.
If you read what I wrote, I was not wanting to ship a SQLite file around, but to have a standard model that people can create on their computer, with scripts to take your XML format and place it in a database, and to export the database to your XML format.
It seems to me this could be powerful. People can write scripts on the database in whatever they want (Python makes this easy), and they can see the data in the XML format if they want during transition.
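As a rough idea of what the import half could look like, here is a minimal sketch that loads team attendance into a team_event table like the one posted earlier in the thread. The XML element names, the event id attribute, and the sample data are assumptions, not part of the spec.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up season data; element and attribute names are assumed, not from the spec.
SEASON = """
<season year="2008">
  <event id="1" name="Arizona Regional">
    <team number="39"/>
    <team number="842"/>
  </event>
  <event id="2" name="Championship">
    <team number="842"/>
  </event>
</season>
""".strip()


def load_team_events(conn, season):
    """Insert one (teamnum, eventid) row per team listed under each event."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS team_event ("
        "teamnum INTEGER NOT NULL, eventid INTEGER NOT NULL, "
        "PRIMARY KEY (teamnum, eventid))"
    )
    rows = [
        (int(team.get("number")), int(event.get("id")))
        for event in season.findall("./event")
        for team in event.findall("./team")
    ]
    # OR IGNORE makes re-importing the same file harmless.
    conn.executemany("INSERT OR IGNORE INTO team_event VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_team_events(conn, ET.fromstring(SEASON))
    print(conn.execute("SELECT eventid FROM team_event WHERE teamnum = 842").fetchall())
```

The export direction would be the mirror image: select the rows and build the elements back up.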
As for XML as a database format… come on. Even xindice says on their front page that it does not serve as a general database. I have played the XML database game in past years and know where it leads. I still use XML for data transmission all the time.
Nobody is saying you can’t have a SQLite/Python parser, but ultimately not everybody is going to use the same language framework. Once this format is documented and stable you can write your Python scripts, I can write my .NET apps, and some guy over there can write his Ruby scripts or Java apps or PHP site.
I don’t think anyone is intending to use this format for long term storage, not that somebody won’t ;), but for inter-team or even intra-team data transportation this could be a very useful format.
The problem with that statement is that not everyone will agree on the exact data for each match, event, or even season. I think the way this should be approached is that the format will provide all of the functionality for 95% of the people using it, but the data from the 5% who want other data included should still be usable so long as they don’t leave out any required fields.
You could still include a format version, but it would only be the base format (i.e. not specifying the 5% usages).
If you don’t allow this, people will either branch the format and create many similar but incompatible formats, or the format will become a mess, with 95% of the spec being for the 5% use cases (insert generic Windows bloat joke here).
I’m not very savvy in this field, but this whole thing seems interesting. I’m currently working on creating a regional intraweb that teams can connect to.
At first I was going to implement one of the other solutions already floating around, but if this’ll be ready before Week 1, I’ll be happy in trying it out.
Of course, I’ll need help integrating it into a CMS/Website. Keep up the good work!