[FRC Blog] Regional Registration Issues

Posted on the FRC Blog, 10/6/16: http://www.firstinspires.org/robotics/frc/blog/2017-regional-registration-issues

Regional Registration Issues

Written by Frank Merrick, 2016 OCT 06.

I know that apologies are not enough, but I am sincerely sorry for the registration issues we experienced today. As I have said before, we know we ask much of our teams and our teams have the right to expect a high quality experience from us in return. This is not something we delivered today.

We intentionally scheduled District event registration before Regional event registration, knowing that District registration would be the lighter load and could help point out any weaknesses before Regional registration opened. Yesterday, the system performed just as expected, with no appreciable delay, and no indication of problems to come. Today, the number of simultaneous connection attempts to the database apparently reached a tipping point, at which the system essentially stopped responding and some elements crashed. This is a different issue than the one we saw during our initial registration attempt on September 22.

Noting that several hundred teams had already registered at the point the system first stopped responding, we re-started elements of the system to see if that would clear the issue. It did not. Eventually, we significantly reduced the number of simultaneous connections allowed to the database. This allowed the system to continue processing registrations without internal errors, but at a relatively slow rate. Many teams attempting to register for events during this period were seeing error messages. Eventually the errors cleared and teams could register, but in many cases this took numerous attempts.

During the period when the system was struggling, we were having an on-going conversation among *FIRST* management – do we shut down the system completely and try again another day, or continue to process registrations in a slow start-and-stop manner? The answer was not obvious. The ‘try again’ system would likely have been very different from the system we used yesterday and today, and would have been untested in a real-world environment. Also, there was the question of how to handle the teams that managed to get registered before we shut down. Would it be fair to allow them to keep their registration, or would we have had to bring everyone back to the starting line again? While the answers to these questions were not obvious, eventually we came to the conclusion that we could not be sufficiently confident the ‘try again’ system we put in place for next week would have been any better than the existing system we had chugging along slowly. The only thing we could be sure of with the new system is that we would have lost another week or so of time in getting teams registered. Primarily for this reason, we decided to continue.

Our decision to continue processing registrations slowly clearly affected some teams more than others. Teams that got in quickly went about their day. Teams that did not get in quickly had to wait around, and we recognize some folks had limited time and likely needed to get back to their lives before they were able to register. Once again, I am very sorry for the pain this caused.

There are a few good things that came out of this, though, in the end. First, as of about 3:30 PM today, we had over 2,000 teams registered for events, between our totals yesterday and today. This exceeds the count we had in the first 24 hours last season. Also, we are even more committed to transitioning to a preference system for the 2018 season that will eliminate the need for everyone to feel like they need to be part of this crazy horserace. Under current plans for 2018, as long as preferences are entered by the deadline, there will be no advantage to being first.

We still feel, despite these challenges, that we have a great season ahead of us!

More information, to include specific dates for second registrations, to follow before the weekend.

Frank

Disclaimer: This is all my opinion. Take it with a grain of NaCl.

How and why does an organization with the breadth of FIRST have a failure like this? The size of their org, in terms of both scale and volunteer force, would make a proper rewrite very easy (if only considering software).

I personally believe that FIRST needs to take a page out of the book of TBA and rewrite their whole registration system, top to bottom in a high-performing language like C++ or as a module to the “big 2” web servers. Keeping on ASP.NET does them no favors, and the more dependent on the old software they are the messier it’ll be when it breaks again (not if.)

Alternatively, can HP or Dell donate some nice servers to FIRST? Sounds like they need more dedicated RAM. :rolleyes:

I wish the FIRST team the best of luck moving forward, any system failure on this scale is messy…

It seems like it’s more of a software problem than a capacity problem. FIRST may be a large organization, but even then, they can only do so much to handle so many simultaneous requests. Even companies as big as Apple go down with big product launches.

That being said, I feel that I need to remind everyone that MongoDB is web scale! :slight_smile:

This definitely sounds like a software and/or caching (probably lack thereof) problem.

Keep in mind per Frank’s stats some 2,000 teams registered in the last 2 days. With proper scaling and caching there’s no reason a well-developed stack couldn’t handle all 2,000 registrations at the same time.

I’d be willing to wager their ASP.NET server still uses a MS SQL server on the backend somewhere, and each addition to the API has just been stacked on top of the old one.

Old datasets can be a pain to manage, but MS SQL does have a dump feature to export the DB to SQL / CSV that could then be imported into a DB lang of choice.

FIRST is a non-profit operating on a shoestring.

The launch of the new website last year caused me consternation.
I don’t like change, but I can adapt.
I let my feelings be known about how I felt the new website to be juvenile.

On the + side, I received a personal phone call last week about my emailed complaints.
When I clicked the “contact us” link from my Dashboard, I ended up in a “do loop”.
The person who contacted me seemed to be responsive. We’ll see if that issue gets fixed. They have bigger fish to fry today.

IMHO, FIRST has subcontracted duties to parties that are not top tier teams.

I just want to throw an idea out for a second… Just let me know where this falls on the spectrum of Great Idea <–> Bat #$% Crazy…

Why not open source TIMS/STIMS and the FMS Web Backend/API and let the teams hack* on it?

The devs that put together and manage TBA have done a great job managing it and handling load and caching. I’m sure as a community we could do far better, and provide more resources (eyeballs, time, talent) than what is available to FIRST IT.

* Hack - as in the old/original definition

(And yes, clearly, game elements for the unreleased games would need to be managed in a closed repo…)

Closer to the crazy side, I’m afraid.

It’s not that FIRST couldn’t use something like that. But for a variety of reasons, it would just about have to be locked down much of the time, as far as editing/access. Which… let’s go with that’s about where we’re at now.

There are some really, really good reasons to NOT let a lot of people at this problem. If I start with names/addresses of many FIRST folks, I think you’ll understand.

While something built by teams could be a good base, I don’t see it being the final product by any means.

Oh, and then there’s the (lack of) comment problems…

Have you ever heard of a back door?

Suppose someone on team XYZ puts in a (relatively obscure) back door. It’s in open source, but members of teams ABCD, EF, and GHIJ discover it. What do they do? Blow the whistle, or be one of the handful of teams that jump to the head of the line? Put my vote down for B$C.

Simply put, The website is a continuous black eye for our wonderful organization. I hate to say this, as I respect so many in FIRST, but the whole website just plainly sucks.

I hate it. I cannot find things easily, it is very frustrating, not user friendly, and I just grit my teeth and do everything possible to avoid the website and system. When I have to use it, I would rather write an essay, speak in front of my peers, wear only my underwear to work, and go without Star Trek for a year. I have even considered getting my wisdom teeth removed instead of using the website.

Give us back our old system. At least I could find my way through it.

Sorry FIRST, but I’m with weberr here. We really need the old website back. I can’t imagine what it’s like for new teams, when I at least know what I am looking for by my previous experiences using the site.

The best registration system was the one that had all of the events listed on one set of pages (iirc). 2 categories in boxes. Event capacity, no. of slots left. The box was color coded. It turned a certain color once no. of slots left became low or another color when event was full.
In a snapshot, you could see all the information you needed without having to navigate endlessly just to find out 1 piece of information.
You could click on links to see who was registered for the event and not have to use the search bar every time you needed to find something.

FIRST has been going through website redesigns for many years.
The old website had a slew of problems as well and was far from perfect.
They still need time to get the bugs worked out on the latest design.
If you have suggestions on how to improve the system, I recommend submitting your feedback to FIRST.

Meh. Sure, there’s more they could have done. Here’s the thing about “software”, coming from someone who’s done “it” and only “it” for 14 years professionally.

“Why didn’t ABC use XYZ?” completely ignores any actual tradeoffs or decision making an organization does based upon things we as teams do not know are considerations. No, FIRST doesn’t need to be so transparent in their minute decisions. In aggregate I bet teams spend more money on coffee and pizza than on their robots and yet we do not expect any transparency from coffee growers or milk farmers about their products and services.

Web tech changes every 9 months or so. Most of the new stuff is crap, full of bugs, full of leveraged software’s bugs, and doesn’t integrate well without a bit of very specific, specialized knowledge. Stuff that’s been around a few years is just now being analyzed for tradeoffs versus the older technology. Sometimes a new tech is great, sometimes it’s crap. It’s an early-adopter’s common fallacy that old = irrelevant and useless.

“They’re THIS big, why isn’t this process better” plays right into the software fallacy that adding more people to a project will get it done sooner or better. Since testing is usually the #1 thing that has its schedule cut when approaching a deadline, I’m going to guess they ran out of time to test large quantities of simultaneous connections.

“My high school students could have done this better” (heard elsewhere on social media) is a complete lie or pipe dream. Even the best high school teams lack what it takes to get something out of “sandbox” mode and into production at scale before a deadline - unless they have help from seasoned veterans. 90% of college kids have the same issue. Even most of the ‘black box’ stuff I received from post-grad researchers made so many assumptions about environment and lacked any kind of deployment or scale consideration.

Their re-structuring of the problem for 2018 seems like a much better way to scale registration.
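To make the "no advantage to being first" idea concrete, here is a minimal sketch of one way a preference system could work. FIRST has not published an algorithm, so everything here is an assumption: the lottery ordering, the function name `assign_events`, and the team and event names are all hypothetical.

```python
import random

def assign_events(preferences, capacity, seed=None):
    """Assign each team its highest-ranked event that still has room.

    preferences: {team: [event, ...]} ranked best-first (hypothetical shape)
    capacity:    {event: number_of_slots}
    Teams are processed in a random lottery order after the deadline, so
    the time at which a preference was submitted plays no role.
    """
    rng = random.Random(seed)
    remaining = dict(capacity)
    # Sort first so the lottery depends only on the seed, not on the
    # order in which preferences happened to arrive.
    teams = sorted(preferences)
    rng.shuffle(teams)

    assignment = {}
    for team in teams:
        assignment[team] = None  # stays None if every choice is full
        for event in preferences[team]:
            if remaining.get(event, 0) > 0:
                remaining[event] -= 1
                assignment[team] = event
                break
    return assignment

if __name__ == "__main__":
    prefs = {"team_a": ["East", "West"], "team_b": ["East"], "team_c": ["East", "West"]}
    print(assign_events(prefs, {"East": 1, "West": 1}, seed=2017))
```

The point of the design is that the whole batch is processed once, offline, so there is no spike of simultaneous connections at an opening bell.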

Maybe I’m missing something here, but what does the “FMS Web Backend/API” have to do with registration/the website?

I think they can do better.

Migrating the application to Amazon Web Services would be a start. AWS has the services in place to auto-scale and to auto-descale based on the load. So FIRST would pay when the scaling is needed, and not pay when the scaling is no longer needed. AWS is one example of several cloud services available.

Scale testing is easily doable, both from a software and a money standpoint. I was using JMeter years ago to do load testing on my websites. JMeter is still used, and I would bet several others are available as well (both commercial and opensource). It is not hard to do load testing.
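For anyone who hasn't seen one, the core of a load test is small. The sketch below is not JMeter; it is a plain-Python stand-in for the same idea, firing many calls at once and recording errors and latency. `fake_register` is a hypothetical placeholder for a real request to a test server, and the numbers are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(call, n_requests, concurrency):
    """Run call(i) n_requests times with up to `concurrency` in flight."""
    def timed(i):
        start = time.perf_counter()
        try:
            call(i)
            ok = True
        except Exception:
            ok = False  # count any failure as an error response
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(n_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": n_requests,
        "errors": sum(1 for ok, _ in results if not ok),
        "p50_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }

def fake_register(i):
    """Hypothetical stand-in for one registration request."""
    time.sleep(0.001)  # pretend the server takes a moment
    if i % 10 == 0:
        raise RuntimeError("simulated server error")

if __name__ == "__main__":
    print(load_test(fake_register, n_requests=200, concurrency=50))
```

Raising `concurrency` step by step and watching where errors or latency climb is how you would look for the kind of tipping point described in the blog post.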

I can’t agree more with this post. It reminds me of a good article I saw last week about the idea of “<thing> is nowhere near that hard, I could do it myself!”: http://danluu.com/sounds-easy/

Scalability problems also are rarely solved by throwing more resources at them. In fact, application performance can often decrease when more cores are added, due to the increased need for synchronization and hammering of shared cache lines in a write-heavy workload (plus added network latency in distributed systems). There is a huge amount of coordination required to make sure the database status stays in sync across all the machines involved. If people are interested, PM me and I can link you to some papers on the subject. These kinds of challenges take lots of time and engineering effort and like Peter said above, FIRST has limited engineering resources available (including writing all the software for next year’s game).
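One common way to model this effect is the Universal Scalability Law, which adds a contention term and a coherency (keep-in-sync) term to ideal linear scaling. The post above does not name a model, so treating USL as the right one is an assumption, and the coefficients below are made up purely for illustration.

```python
def usl_throughput(n, sigma, kappa):
    """Relative throughput at n workers under the Universal Scalability Law.

    sigma: contention, the share of work that must run serialized
    kappa: coherency, the pairwise cost of keeping workers in sync
    """
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Illustrative coefficients only (assumed, not measured from any real system).
SIGMA = 0.05
KAPPA = 0.001

if __name__ == "__main__":
    for n in (1, 8, 32, 128):
        print(n, round(usl_throughput(n, SIGMA, KAPPA), 2))
```

With these assumed numbers, throughput rises up to roughly 30 workers and then falls, so 128 workers get less done than 32. That is the "more hardware made it slower" shape in miniature.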

There are also a couple of comparisons to TBA in the thread and I’d like to point out it’s not an apt comparison to make in this case. TBA, as a system, has the benefit of an almost exclusively read-based workload and can do all writes asynchronously. TBA is cached very heavily, but that means latency of updates is sacrificed (that’s why it’s often a little out of date in the offseason). These assumptions would fail miserably in a registration environment, since you need up to the second information about event status (whereas a TBA page can be from anywhere between 1 minute and 1 day old).

At the end of the day, FIRST just realized they have to iterate their system. They employ engineers just like the rest of us who work incredibly hard to build and maintain their tools. Sure, not everything will work 100% of the time, but that’s why we iterate and make things better. And it seems like a priority-based approach will make the entire process a lot less stressful for everyone involved.

The TBA caching I’m referring to is more the memcache caching, and less the content cache (à la Cloudflare). Properly cached, event data is always served from the cache, and registration data is until it is changed and invalidated. For many of these pages (eg every dashboard page load once you register) this would reduce the number of DB calls dramatically, for a normalized schema.
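Roughly, the pattern looks like the sketch below: reads go through a cache, and a write invalidates the affected entry so the next read is fresh. This is only an illustration. Plain dicts stand in for both memcache and the database, and `CachedRegistrations` is a hypothetical name, not anything from TBA or FIRST.

```python
class CachedRegistrations:
    """Read-through cache with invalidate-on-write (illustrative only)."""

    def __init__(self):
        self.db = {}       # dict standing in for the database
        self.cache = {}    # dict standing in for memcache
        self.db_reads = 0  # counts how often a read actually hits the DB

    def get(self, team):
        """Return the team's events, touching the DB only on a cache miss."""
        if team in self.cache:
            return self.cache[team]
        self.db_reads += 1
        events = tuple(self.db.get(team, ()))
        self.cache[team] = events
        return events

    def register(self, team, event):
        """Write to the DB, then drop the stale cache entry for that team."""
        self.db.setdefault(team, []).append(event)
        self.cache.pop(team, None)

if __name__ == "__main__":
    store = CachedRegistrations()
    store.register("team_a", "East")
    for _ in range(100):  # e.g. 100 dashboard page loads
        store.get("team_a")
    print(store.db_reads)
```

Because the entry is invalidated on every write, registration data never goes stale the way a time-based content cache can, while repeated dashboard loads stop costing a DB call each.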

Frank has stated they believe the issue to be with their MySQL database (sorry I can’t find which blog of his said it recently) and its responsiveness under load. Having worked w/ MySQL databases that serve far more simultaneous read/write requests, it’s highly likely the problem lies between design, and cache. (Or a really poorly configured DB server)

I agree with Glenn and weberr, having been around for a long time, I found the flaws with the old website something I could work around. The layout for the new one is not user friendly for FIRST teams. I tried for 30 minutes the other day to find the “What is FIRST” video … I could only find it in Spanish. I ended up going to YouTube to find it. It is a great video and it should be EASY to find on the website.

We will continue to send in our comments regarding website use. I feel that perhaps the website was designed more for people learning about FIRST than the FIRST community itself.