UPDATE: We experienced some data loss on The Blue Alliance. I’m working to restore the data, and it should be fully restored by the end of April 4th, 2012.
Those curious about technical details can read more in this Google Doc, where I am recording what I am doing to fix it.
Most of 2012’s data is recovered already. Not all events have been, so some pages will continue to generate errors. Whatever doesn’t heal itself with cronjobs will get fixed tomorrow night.
At this point, nearly all data has been recovered. 2012 Events and Matches are partially missing. 2011 Events and Matches are entirely missing. This data will be recovered when FIRST’s pages come back online.
Some teams who have not competed in 2012 have lost their details like nickname. These will be restored when FIRST’s pages come back online.
I will write more in the document about backup measures we should take in the future, but not tonight.
FIRST’s servers are rejecting our scraping attempts from our production server. I’ve emailed [email protected] to attempt to resolve the issue. Does anyone know anyone else I can get in touch with?
“Google access to this page has been blocked due to repeated failure to respect robots.txt”
What are you violating? The 10 sec page request delay or something else?
You’ll want to assure them you’ll respect current settings and future changes in robots.txt.
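For anyone wondering what "respecting robots.txt" could look like in a scraper, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `tba-scraper` user agent, and the example.org URLs are all made up for illustration; I don't know what FIRST's actual robots.txt contains beyond the 10 second delay mentioned above, and this is not necessarily how TBA's scraper works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the real file on FIRST's servers may differ.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, user_agent, urls, default_delay=10.0):
    """Return (urls we may fetch, seconds to wait between requests)."""
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        # No Crawl-delay given for this agent: fall back to a conservative default.
        delay = default_delay
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, float(delay)


if __name__ == "__main__":
    policy = build_policy(SAMPLE_ROBOTS)
    urls = [
        "https://example.org/frc/teams",
        "https://example.org/private/admin",
    ]
    allowed, delay = polite_fetch_plan(policy, "tba-scraper", urls)
    print(allowed, delay)
```

A real scraper would re-read robots.txt periodically (so future changes are picked up) and call `time.sleep(delay)` between requests to the allowed URLs.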
I’ve talked to their IT dept before, but normal staff turnover almost guarantees that it’s different people by now.
New employees are publicized in the FIRST Newsletter. I’ve attached a list, but beware, the earlier people may be gone or re-positioned within the organization as time passed.
I got in touch with the FIRST web team, and we’re going to make some changes to how we request pages. Things should be back up and running later tonight.
I believe all data is restored now. We lost some metadata that had been manually edited, but the Events and Matches should be back. We’ll monitor our logs for errors in the next few days, and fix anything else that crops up.
Hmm, trying to figure out why we’re not re-grabbing these right. Thanks for the pointers. FIRST’s page structure makes scraping this information tricky (you can’t just look up a team with its team number), so I think something may have changed in the meantime.
Oddly enough, it looks like you only have team info for defunct team numbers like 40, 47 & 65. Wonder if that is stale backup data because it is not overloading them in failed scrapes.
All the team info TBA uses is available in one (easy to parse) tab-delimited page: https://my.usfirst.org/frc/scoring/index.lasso?page=teamlist
It would be easier to just scrape that page. Plus, that is only 1 page request instead of thousands.
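As a rough sketch of how little code a tab-delimited list needs, something like the following would do, using only Python's standard `csv` module. The column names (`team_number`, `nickname`, `city`), the presence of a header row, and the sample rows are assumptions made for illustration; the real page's layout would need to be checked before relying on this.

```python
import csv
import io

# Made-up sample in the assumed layout: a header row, then one team per line.
SAMPLE_TSV = (
    "team_number\tnickname\tcity\n"
    "9991\tExample Team A\tSpringfield\n"
    "9992\tExample Team B\tShelbyville\n"
    "not-a-number\tBad Row\tNowhere\n"
)


def parse_team_list(tsv_text, key_column="team_number"):
    """Index rows of a tab-delimited team list by integer team number."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    teams = {}
    for row in reader:
        key = (row.get(key_column) or "").strip()
        if not key.isdigit():
            # Skip blank or malformed rows rather than failing the whole import.
            continue
        teams[int(key)] = {
            name: (value or "").strip()
            for name, value in row.items()
            if name is not None  # ignore surplus cells beyond the header
        }
    return teams


if __name__ == "__main__":
    teams = parse_team_list(SAMPLE_TSV)
    print(sorted(teams))
    print(teams[9991]["nickname"])
```

The text of the page would be fetched once and passed in as `tsv_text`, so the whole team table costs a single request.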
Alternatively, you could ask 358 for its database (especially if you want to fill in defunct team data).