FIRST Event Data in XML format

Good Afternoon,

A recent project required me to parse the team history and event pages from the USFirst.org website. Then another project forced me to redo the same task. I reused most of the code, but this became quite tiresome because handling everything as text strings has some major drawbacks, foremost among them that I have to use regular expressions for everything. As a result, I decided to create some scripts that scrape the site and return XML data for various things. For example, one of the scripts pulls the ranking data from an event. The following is a small example from the Lansing event. (I truncated the results; the actual output contains all the teams.)


<Event>
        <Ranking>
                <Rank>1</Rank>
                <Team_Number>67</Team_Number>
                <Wins>12</Wins>
                <Losses>0</Losses>
                <Ties>0</Ties>
                <Plays>12</Plays>
                <QS>24.00</QS>
                <RS>51.75</RS>
                <MP>117</MP>
        </Ranking>
        <Ranking>
                <Rank>2</Rank>
                <Team_Number>1</Team_Number>
                <Wins>10</Wins>
                <Losses>2</Losses>
                <Ties>0</Ties>
                <Plays>12</Plays>
                <QS>20.00</QS>
                <RS>46.83</RS>
                <MP>95</MP>
        </Ranking>
</Event>
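
For anyone who wants to consume output in this shape, here is a minimal sketch using the standard library's ElementTree. It is written for Python 3, and the names parse_rankings, INT_FIELDS and FLOAT_FIELDS are made up for illustration; only the element names come from the sample above.

```python
import xml.etree.ElementTree as ET

# Truncated sample in the same shape as the output above.
SAMPLE = """<Event>
    <Ranking>
        <Rank>1</Rank>
        <Team_Number>67</Team_Number>
        <Wins>12</Wins>
        <Losses>0</Losses>
        <Ties>0</Ties>
        <Plays>12</Plays>
        <QS>24.00</QS>
        <RS>51.75</RS>
        <MP>117</MP>
    </Ranking>
</Event>"""

# Assumed typing of the fields, based only on how the sample values look.
INT_FIELDS = ("Rank", "Team_Number", "Wins", "Losses", "Ties", "Plays", "MP")
FLOAT_FIELDS = ("QS", "RS")


def parse_rankings(xml_text):
    """Return one dict per <Ranking> element, with numeric fields converted."""
    rankings = []
    for node in ET.fromstring(xml_text).findall("Ranking"):
        row = {name: int(node.findtext(name)) for name in INT_FIELDS}
        row.update({name: float(node.findtext(name)) for name in FLOAT_FIELDS})
        rankings.append(row)
    return rankings


rows = parse_rankings(SAMPLE)
print(rows[0]["Team_Number"], rows[0]["QS"])
```

Once the data is XML, no regular expressions are needed on the consuming side, which is the point of the scripts.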

My primary question is: would the FIRST community be interested in these scripts? If so, what pages would you like to see (so I can prioritize writing them)? They are being written in Python, but the heavy lifting is all done by regular expressions, so they should be adaptable to any language.

I say open source whenever you can!

Here is the Python script for the ranking of the teams. It SHOULD work for all of the regionals for which the event page has data.

In the spirit of freedom, all code is licensed under the GPL.

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

Obviously, I can’t enforce the license but I ask that you make any programs that utilize portions of this code available to the community in a timely fashion.


#!/usr/bin/env python
#The above MUST be the first line in order to be able to execute on *nix systems. 
#To exec this you must have permission to do so
#chmod +x [name]

import re
import urllib2
import sys

if len(sys.argv) < 2:
	sys.exit("Must provide event code")

event_code = sys.argv[1]

def removeHTML(string):
	string = re.sub("<.*?>","",string)
	string = re.sub("</.*>\n","",string)
	string = re.sub("\n","",string)
	return string
try:
	url_buffer = urllib2.urlopen('http://www2.usfirst.org/2009comp/events/'+event_code+'/rankings.html')
	page_data = url_buffer.read()


	#	sys.exit("This is not a valid event code. If you believe this to be invalid please report a bug.")

	page_list = list(re.findall("(<TD .*?>[0-9].*</TD>\n)",page_data))

	if len(page_list) == 0: #This is the case with the oddly formatted pages
		#We strip out the stuff that is mucking us up
		page_data = re.sub("<p.*\n.*style.*?>","",page_data);
		page_data = re.sub("<o:p.*/p>","",page_data);
		#and parse again
		page_list = list(re.findall("<td .*?>\n.*\n.*</td>",page_data))
		if len(page_list) == 0:#if it is still 0 there is something else wrong and we need to report a bug
			sys.exit("This is not a valid event code. If you believe this to be invalid please report a bug.")

	print "<Event>"
	for i in range(0,len(page_list)/9):
		print "	<Ranking>"
		print "		<Rank>"+removeHTML(page_list[9*i])+"</Rank>"
		print "		<Team_Number>"+removeHTML(page_list[9*i+1])+"</Team_Number>"
		print "		<Wins>"+removeHTML(page_list[9*i+2])+"</Wins>"
		print "		<Losses>"+removeHTML(page_list[9*i+3])+"</Losses>"
		print "		<Ties>"+removeHTML(page_list[9*i+4])+"</Ties>"
		print "		<Plays>"+removeHTML(page_list[9*i+5])+"</Plays>"
		print "		<QS>"+removeHTML(page_list[9*i+6])+"</QS>"
		print "		<RS>"+removeHTML(page_list[9*i+7])+"</RS>"
		print "		<MP>"+removeHTML(page_list[9*i+8])+"</MP>"
		print "	</Ranking>"
	print "</Event>"
except:
	sys.exit("This is not a valid event code. If you believe this to be invalid please report a bug.")
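
The core of the script is the step that treats the flat list of table cells as records of nine fields. Below is a rough Python 3 sketch of just that step, without the network fetch. The names cells_to_xml and FIELDS are hypothetical, and the escape call is an addition that the original script does not make.

```python
from xml.sax.saxutils import escape

# Column order taken from the print statements in the script above.
FIELDS = ("Rank", "Team_Number", "Wins", "Losses", "Ties",
          "Plays", "QS", "RS", "MP")


def cells_to_xml(cells):
    """Group a flat list of cell strings into <Ranking> records of nine."""
    lines = ["<Event>"]
    # Integer division drops any trailing partial row, like the script's loop.
    for i in range(len(cells) // len(FIELDS)):
        row = cells[i * len(FIELDS):(i + 1) * len(FIELDS)]
        lines.append("\t<Ranking>")
        for name, value in zip(FIELDS, row):
            # escape() guards against stray &, < or > in a cell (an addition).
            lines.append("\t\t<%s>%s</%s>" % (name, escape(value.strip()), name))
        lines.append("\t</Ranking>")
    lines.append("</Event>")
    return "\n".join(lines)


print(cells_to_xml(["1", "67", "12", "0", "0", "12", "24.00", "51.75", "117"]))
```

Keeping the field names in one tuple means a change in column order on the source page only needs a change in one place.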

Attached is the output of this script when run with glr as an option. (Really Brandon, no XML format allowed?)

I make no claims as to the efficiency; this is my first foray into Python.

RankingExample.txt (7.73 KB)


An update to this: the corrections I made for Dallas and Connecticut did not work. I will be trying to fix those later tonight.

As a consolation prize, I tossed together a quick (read: simple and not pretty) page you can grab XML data from using scripts, though I would prefer you run the script on your own machines. If you want a one-off piece of data, feel free to use it.

http://schreiaj.ath.cx/share/FRC_Parsers/ranking.php

The page takes a couple of arguments. Event_Code is the event code used by FIRST; these can be found on frclinks.com. HTML_Display is either true or false. A true value will encode the page such that the tags for the XML show up in the browser; otherwise they will not. HTML_Display is optional, but without an Event_Code the page will not load anything.

An example is

http://schreiaj.ath.cx/share/FRC_Parsers/ranking.php?HTML_Display=true&Event_Code=GLR

It will load the Lansing District event for display in the browser. If you have any questions, feel free to ask.
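
As an illustration of what HTML_Display appears to do: the description suggests that a true value HTML-escapes the XML so the browser shows the tags as text. That is an assumption about the PHP page, not something taken from its source. In Python 3 the equivalent escaping, and its reversal for anyone who copied escaped output, looks roughly like this:

```python
import html

xml_text = "<Event><Ranking><Rank>1</Rank></Ranking></Event>"

# What HTML_Display=true presumably does: escape the markup for display.
displayed = html.escape(xml_text)
print(displayed)

# Reversing the escaping recovers XML that a parser can consume again.
restored = html.unescape(displayed)
print(restored == xml_text)
```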

I will be making the updated script available as soon as possible. Sorry about that.

EDIT: The Championship division rankings do work. FRClinks has the wrong code for them; the code is the full name of the division, i.e., Newton is Newton.

FIRST’s system is inconsistent about the Divisions. They also had their match results posted differently than other events this season. I am not sure if this is indicative of an overall change in the system, or just Divisions being weird.

After parsing FIRST web pages myself for the Regional Twitter Accounts, I now have a newly found respect for what the team at The Blue Alliance has done to gather data :slight_smile:

Seemed like every week was a scramble to adapt to something new. And then when I found out that Einstein’s data was not posted real time, well I put the NASA feed projected on a wall at home and posted match scores by hand.

What a joy it would be if FIRST offered some way to get to this information besides parsing their web sites. Not 100% sure what I would expect; maybe a web service that made the data available?

The FRCFMS Twitter feed came close to offering some data in a real-time feed format, and maybe that is the answer. But now it is tough to go back and scrape all that data from Twitter pages if you didn’t get it during the real-time feed.

Richard, I will be making the entire FRCFMS feed (what Twitter still lets me grab) available as soon as I get time to do it. I have a script written to do it. If someone would like to run it, I can give instructions.

Yes, FIRST would make all of our lives simpler if they would find a standard and stick to it. Either let us have an API we can make calls to (published well before kickoff), or at least have a standardized page layout and don’t change it without warning us and telling us about the changes. One of the additional reasons for this project is so that we have a STANDARD way of accessing data.

If anyone would like to offer assistance feel free to shoot me a PM.

On my to-do list depending on how things go…can’t say much more than that right now…I might be able to share more info in PM, if you ask nicely…

Nate, I was just grumbling. I’m already providing XML information for a couple of the pages and am working on the others.

As an update:

http://schreiaj.ath.cx/share/FRC_Parsers/qualschedule.php will provide the qualification schedules for the regionals that are not bizarre.

http://schreiaj.ath.cx/share/FRC_Parsers/ranking.php will provide the qualification ranking data.

Both pages take the following options:

Event_Code - Event Code from frclinks.com. Since I now use frclinks to find the pages the exact codes given on there are what need to be used.

Year - 2008, 2009 are currently supported.

HTML_Display [true,false] - This decides whether to escape the tags so that they display in the browser. If you are parsing the xml in a script I would suggest leaving this false (or blank). If you plan on copying and pasting the xml anywhere from the browser use true.
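
A small sketch of building a request URL from those three options in Python 3. The helper name parser_url is hypothetical; the parameter names and base URL are the ones given in this thread. It only builds the string and does not contact the server.

```python
from urllib.parse import urlencode

BASE = "http://schreiaj.ath.cx/share/FRC_Parsers/"


def parser_url(page, event_code, year=None, html_display=None):
    """Build a query URL for ranking.php or qualschedule.php."""
    params = {"Event_Code": event_code}
    if year is not None:
        params["Year"] = str(year)
    if html_display is not None:
        params["HTML_Display"] = "true" if html_display else "false"
    # urlencode percent-escapes values, so odd event codes stay URL-safe.
    return BASE + page + "?" + urlencode(params)


print(parser_url("ranking.php", "GLR", year=2009))
print(parser_url("qualschedule.php", "GLR", year=2008, html_display=False))
```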

Currently I am working on parsing the team history pages and will post that as soon as I am done.

I believe that bizarre page formatting comes from the pages being opened in Microsoft Word, edited, and saved again. Some result pages are full of Word HTML markup.

Don’t get me started on Word and HTML :mad:

On an unrelated note, in the spirit of open source, all the code is available at http://schreiaj.ath.cx/share/FRC_Parsers/ and the current versions I am working on at the moment are at http://schreiaj.ath.cx/share/FRC_Parsers/Parsers_Beta/

Where did you find frclinks.com? That’s a nifty idea.

I don’t know how well one parsing method holds up against the others. A regex will keep working as long as they don’t add a new table to the document and don’t add non-numerical data. An HTML parser will cope with those changes too, and will also properly handle entities like &lt;, but any change in structure will break it (though the fix is a simple parameter change telling it the new path to the data). I just use the DOM and SimpleXML parsers in PHP; Python (eewww, Python) must have something similar.
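
Python does have something similar in its standard library. Here is a minimal Python 3 sketch using html.parser to collect table cells; the class name CellCollector and the sample fragment are made up for illustration and are not taken from a real FIRST page.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &lt;
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":  # tag names arrive lowercased, so <TD> matches too
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags (<p>, <span>) still belongs to the open cell.
        if self._in_cell:
            self.cells[-1] += data


# Made-up fragment imitating a rankings row wrapped in Word-style markup.
SAMPLE = """<table><tr>
<TD align="center"><p class="MsoNormal">1</p></TD>
<td><span style="color:black">67</span></td>
<td>&lt;n/a&gt;</td>
</tr></table>"""

collector = CellCollector()
collector.feed(SAMPLE)
collector.close()
cells = [cell.strip() for cell in collector.cells]
print(cells)
```

Because the parser ignores markup nested inside a cell, extra wrapper tags do not need to be stripped with regular expressions first.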

I have an initiative to standardize how FIRST data is published: an XML interchange format. Here is an example that mixes the rankings and schedule:


<event season="2009" code="GLR">
        <team number="67" game:rank="1" game:win="12" game:lost="0" game:tie="0" game:plays="12" game:qs="24.00" game:rs="51.75" game:mp="117" />
        ...

        <match type="qualification" number="1" time="11:45">
            <alliance name="red">
                <team position="1" number="1940"/>
                <team position="2" number="216"/>
                <team position="3" number="123"/>
            </alliance>
            <alliance name="blue">
                <team position="1" number="1896"/>
                <team position="2" number="468"/>
                <team position="3" number="894"/>
            </alliance>
        </match>
        ...
</event>

Where game is some XML namespace (if you like).
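
To show how a consumer might read this format, here is a Python 3 sketch with ElementTree. The game prefix has to be declared somewhere for the document to parse, so the sketch declares it on the root with a placeholder namespace URI (GAME_NS is purely hypothetical; the proposal leaves the namespace open). The sample is a cut-down copy of the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the proposal leaves the namespace open.
GAME_NS = "http://example.org/frc/game/2009"

SAMPLE = """<event xmlns:game="%s" season="2009" code="GLR">
    <team number="67" game:rank="1" game:win="12" game:qs="24.00"/>
    <match type="qualification" number="1" time="11:45">
        <alliance name="red">
            <team position="1" number="1940"/>
            <team position="2" number="216"/>
            <team position="3" number="123"/>
        </alliance>
        <alliance name="blue">
            <team position="1" number="1896"/>
            <team position="2" number="468"/>
            <team position="3" number="894"/>
        </alliance>
    </match>
</event>""" % GAME_NS

event = ET.fromstring(SAMPLE)

# Namespaced attributes are looked up as "{uri}localname".
# findall("team") only sees direct children, i.e. the ranking entries.
ranks = {team.get("number"): int(team.get("{%s}rank" % GAME_NS))
         for team in event.findall("team")}

# Map each match number to its alliances' team numbers.
matches = {
    match.get("number"): {
        alliance.get("name"): [t.get("number") for t in alliance.findall("team")]
        for alliance in match.findall("alliance")
    }
    for match in event.findall("match")
}
print(ranks)
print(matches)
```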

As for licensing, as a rule of thumb, if the code is shorter than the license would be, I put it in the public domain.