The Blue Alliance - Data Loss

UPDATE: We experienced some data loss on The Blue Alliance. I’m working to restore the data, and it should be fully restored by the end of April 4th, 2012.

Those curious about technical details can read more in this Google Doc, where I am recording what I am doing to fix it.

Most of 2012’s data is recovered already. Not all events have been, so some pages will continue to generate errors. Whatever doesn’t heal itself with cronjobs will get fixed tomorrow night.

Thanks,
Greg

my.usfirst.org appears to be down, which is preventing The Blue Alliance from rescraping data from FIRST.

http://www.usfirst.org/whatsgoingon fails to load its iframe, for example.

At this point, nearly all data has been recovered. 2012 Events and Matches are partially missing. 2011 Events and Matches are entirely missing. This data will be recovered when FIRST’s pages come back online.

Some teams who have not competed in 2012 have lost their details like nickname. These will be restored when FIRST’s pages come back online.

Will write more about backup measures we should take in the future in the document, but not tonight.

FIRST’s servers are rejecting our scraping attempts from our production server. I’ve emailed [email protected] to attempt to resolve the issue. Does anyone know anyone else I can get in touch with?

“Google access to this page has been blocked due to repeated failure to respect robots.txt”

What are you violating? The 10 sec page request delay or something else?
You’ll want to assure them you’ll respect current settings and future changes in robots.txt.
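For anyone wondering what "respecting robots.txt" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text, the `tba-scraper` user agent, and the example.org URLs are all made up for illustration; I have not seen what FIRST's actual file contains.

```python
import urllib.robotparser

# Hypothetical robots.txt content; the real file on FIRST's servers may differ.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def polite_fetch_plan(parser, user_agent, urls, default_delay=10):
    """Return (url, seconds_to_wait) for every URL the rules allow us to fetch."""
    # Fall back to a conservative delay when the file does not specify one.
    delay = parser.crawl_delay(user_agent) or default_delay
    plan = []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            plan.append((url, delay))
    return plan
```

A scraper would then sleep for the returned number of seconds before each request, and re-read robots.txt periodically so future changes get picked up.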

I’ve talked to their IT dept before, but normal changeover almost assures that it’s different people by now.
New employees are publicized in the FIRST Newsletter. I’ve attached a list, but beware, the earlier people may be gone or re-positioned within the organization as time passed.

FIRST-Bios-2011.zip (2.75 MB)

I got in touch with the FIRST web team, and we’re going to make some changes to how we request pages. Things should be back up and running later tonight :slight_smile:

Thank you for keeping such a website up and running. It’s a great tool to use and FIRST wouldn’t be the same without it.

Thanks Greg for all your hard work. The Blue Alliance is a true asset to the community.

I believe all data is restored now. We lost some metadata that had been manually edited, but the Events and Matches should be back. We’ll monitor our logs for errors in the next few days, and fix anything else that crops up.

We’re now investigating backup options :slight_smile:

Thanks for your hard work Greg! I needed this back up ASAP to do some scouting.

:smiley: Glad we managed to fix everything midweek instead of over a weekend.

What event is 2012oj? We don’t seem to have it, and I can’t figure out what it is. People are trying to get to it though, and it’s throwing errors.

Considering this throws an error, I suspect there is no such event.

I’ll just put another thing in the list of “why we should make proper 404 pages”. Thanks!
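As a rough illustration of the idea (not how TBA's handlers are actually written), an unknown event key can be turned into a 404 response instead of an unhandled error. The `KNOWN_EVENTS` dict and the `2012demo` key are hypothetical stand-ins for the real datastore lookup.

```python
# Hypothetical in-memory store standing in for the real datastore lookup.
KNOWN_EVENTS = {"2012demo": "Demo Event"}

def event_response(event_key, events=KNOWN_EVENTS):
    """Return an (HTTP status, body) pair instead of raising on an unknown key."""
    name = events.get(event_key)
    if name is None:
        # Unknown keys such as "2012oj" become a clean 404 rather than a server error.
        return 404, "No event found for key '%s'." % event_key
    return 200, name
```

With something like this in place, a request for a key that doesn't exist shows up in the logs as a 404 rather than as an error to chase down.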

Many team names have been lost and replaced with just the number.

3 Apr 2012, 21:27 - Reporting user replies that methods were unknowingly called during probe of security hole. Thus, data was lost in prod.

This sounds very unethical, especially with TBAv4 being open source. Said user should have run tests on a local machine.

Can you expand a little more on this portion?

Yea, we don’t have fully up to date team names for teams that didn’t compete in 2012. If you find any missing for 2012 teams, let me know.

For pre-2012 teams, I’ve opened an issue to fix this (we’ve never done it right): https://github.com/gregmarra/the-blue-alliance/issues/108

There was no malicious intent. This was a combination of curiosity and us having a big bug in our configurations. Everyone learned something.

Team 801 is missing name and sponsors.

Hmm, trying to figure out why we’re not re-grabbing these right. Thanks for the pointers. FIRST’s page structure makes scraping this information tricky (you can’t just look up a team with its team number), so I think something may have changed in the meantime.

Will dig in later this week, thanks.

You could use frclinks.com/t/####.

Oddly enough, it looks like you only have team info for defunct team numbers like 40, 47 & 65. I wonder if that is stale backup data that survives because the failed scrapes are not overwriting it.

All the team info TBA uses is available in one (easy to parse) tab-delimited page:
https://my.usfirst.org/frc/scoring/index.lasso?page=teamlist
It would be easier to just scrape that page. Plus, that is only 1 page request instead of thousands.
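A minimal sketch of parsing a tab-delimited team list with Python's `csv` module is below. The column names (`Team Number`, `Team Nickname`, `City`) and the sample rows are assumptions made up for illustration; the real page's headers and order would need to be checked first.

```python
import csv
import io

# Hypothetical sample; the real page's column names and values will differ.
SAMPLE = (
    "Team Number\tTeam Nickname\tCity\n"
    "9998\tPlaceholder A\tSpringfield\n"
    "9999\tPlaceholder B\tShelbyville\n"
)

def parse_team_list(text):
    """Map team number -> details from tab-delimited text with a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    teams = {}
    for row in reader:
        number = int(row["Team Number"])
        teams[number] = {
            "nickname": row["Team Nickname"].strip(),
            "city": row["City"].strip(),
        }
    return teams
```

Here `text` would be the body of the single page request, so the whole team list comes from one fetch instead of one request per team.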

Alternatively, you could ask 358 for its database (especially if you want to fill in defunct team data).

Great job getting it back up, Greg!