[TBA] Postmortem – 16 March 2019 Site Outage

#1

Postmortem – TBA 16 March 2019 Site Outage

On March 16th, 2019 (Pacific Time), The Blue Alliance went down for about 3 hours, from ~21:00 to 23:59, due to a combination of overrunning a daily budget cap set on the tbatv-prod-hrd Google App Engine project and an error in the Google Cloud Console that prevented us from increasing the budget. This prevented usage of the website, the API, and signed-in usage of the iOS app.

This document is a postmortem of what went wrong and what steps are being taken to prevent similar issues in the future. https://github.com/the-blue-alliance/the-blue-alliance/issues/2464 is the parent issue tracking follow-ups from this incident. We are sharing it publicly so others can see a postmortem process and better understand how TBA operates.

This is also shared on our blog at https://blog.thebluealliance.com/2019/03/18/postmortem-16-march-2019-site-outage/

Impact

For around 3 hours, users were unable to use The Blue Alliance website or API. The Blue Alliance for iOS app crashed on launch for users signed into the MyTBA feature, resulting in ~1500 crashes for ~300 users during the outage.

Fortunately, we overran quota late in the evening, limiting user impact.

Timeline (all times Pacific Time)

  • Mar 16th, 9:21pm – First 500 error due to quota overrun
  • 9:30pm – Google Cloud Monitoring sends an alert to #monitor on Slack about 500s spiking
  • 9:32pm – 500s peak at 46 qps (queries per second).
  • 9:34pm – Eugene becomes aware of quota overrun, asks Greg on Slack and Facebook Messenger to increase GAE quota, with instructions on how to do so.
  • 9:37pm – Greg attempts to increase quota, but gets error in Google Cloud Console when accessing GAE settings. Greg only has his phone with him, so debugging is challenging.
  • 9:49pm – Greg creates #site-down on Slack to coordinate response.
  • 9:53pm – Zach shares that iOS app is crashing on launch due to failure to receive an expected response from server.
  • 10:26pm – Greg determines no further action is possible, and goes to sleep.
  • Mar 17th, 12:00am – Quota resets every 24h. After reset, site comes back up.
  • Later that morning – Greg increases daily quota from $20 to $100. Incident fully resolved.

Figures

Google App Engine response time series (shown in Eastern Time). Observe that all responses become 500 errors for 3 hours in the early morning of Sunday, March 17th.

Over Quota Error. This is what visitors to The Blue Alliance site saw during the outage. The 3rd party and mobile app APIs were also down. Some pages would continue to serve if users hit CloudFlare caches instead of reaching the Google App Engine server.

Google Cloud Console App Engine Settings Error Screen. This error message presented itself when Greg attempted to increase the quota, with different tracking numbers over time. The errors stopped displaying after our quota reset.

Daily Google App Engine costs. Google enforces quota limits based on estimated costs, so the amount billed for March 16th did not fully reach $20.

Detection

How did we know something was broken? Did we have alerting to find it for us, or did we need to rely on humans catching/reporting it?

  • 9:30pm – The Google Cloud Monitor bot sent a message to #monitor about a spike in 500 errors. However, its messages do not generate push notifications to site admins.
  • 9:34pm – Eugene noticed that the site was erroring while attempting to use it for scouting purposes.

No proactive “approaching quota” alerts fired. No alerting sent push notifications to site admins.

Escalation

How hard was it to get the right people in the room to fix the problem?

Eugene messaged Greg on Slack and Facebook Messenger. Greg had Slack notifications on his phone turned off, and did not have his laptop with him. It was a fluke that Greg was awake, given the late hour.

Remediation

What did we do to fix the issue? How hard was it? What could we do better?

Greg attempted to increase the GAE daily quota, but was stopped by the erroring settings page.

We waited 3 hours until midnight Pacific time, when our Google App Engine quota reset. This restored the site. The GAE settings page no longer errored, and Greg increased the daily quota.

Eugene should have had permissions to fix this problem himself.

Prevention

What tooling/code/process improvements should we make to have prevented this in the first place?

We need proactive alerting for quotas. Alerts need to notify site admins with push notifications. All admins need appropriate permissions.

Poor Detection

We were surprised that the site went down, and it was detected by an admin experiencing an error, not by proactive alerting.

Recommended Follow-ups:

  • Alerting – We should connect a Google Cloud PubSub topic to quota alerts, and have it automatically post in Slack (see the sketch after this list). This will allow us to proactively catch quota issues, instead of discovering them from user reports. Tracked in https://github.com/the-blue-alliance/the-blue-alliance/issues/2466
  • Notifications – The Google Cloud Monitor bot in Slack should alert the channel. It doesn’t seem to have this functionality, so we may need to hack something together ourselves.
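As a rough sketch of the alerting idea (the function name, topic wiring, and Slack webhook URL are placeholders, not our actual configuration), a small Pub/Sub-triggered Cloud Function could forward Cloud Billing budget notifications into Slack:

```python
# Sketch only: a Pub/Sub-triggered Cloud Function (Python) that forwards
# Cloud Billing budget notifications to Slack. The webhook URL and names
# here are placeholders, not TBA's real configuration.
import base64
import json
import os

import requests  # declared in the function's requirements.txt

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # incoming webhook for #monitor


def notify_budget_alert(event, context):
    """Entry point for messages published to the budget alert topic."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    cost = payload.get("costAmount", 0)
    budget = payload.get("budgetAmount", 0)
    # <!channel> is what makes Slack push-notify channel members.
    text = ("<!channel> Budget alert for {}: ${:.2f} of ${:.2f} spent".format(
        payload.get("budgetDisplayName", "GAE"), cost, budget))
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
```

The <!channel> mention is the piece that actually generates push notifications, which is what the existing Google Cloud Monitor bot messages lack.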

Google App Engine Daily Quota Exceeded

On Saturday, March 16th, 2019, The Blue Alliance exceeded its $20 daily Google App Engine budget (based on Google’s predicted costs) for the first time. March 16th was ultimately billed at $18.28, while the previous Saturday was billed at $17.31. The Blue Alliance incurs Google Cloud Platform costs other than Google App Engine, but the App Engine budget quota is what took the site offline.

Recommended Follow-ups:

  • DONE Increase Quota – We have increased the daily quota to $100.

Admins Did Not Have Permissions

Eugene discovered the issue, but only Greg and Phil (who were in Eastern Time, where it was after midnight) had permissions to increase GAE quota. It was a fluke that Greg was awake and available, as he was away from home for the weekend.

Recommended Follow-ups:

  • DONE Give Eugene permissions – Eugene administers the TBA services, so should have permission to change budgets. Eugene has been given these permissions.

Google App Engine Settings Errors

We were unable to increase the quota once we had overrun it. We believe there is a bug in the Google Cloud App Engine Settings console that makes the settings page inaccessible once quota is overrun, which in turn prevents increasing the quota. After our quota reset, we were able to increase the quota.

Recommended Follow-ups:

  • Understand Bug – We do not know whether others have encountered this issue before. Tracked in https://github.com/the-blue-alliance/the-blue-alliance/issues/2467
  • DONE Bug Reporting (via GAE) – We used the GAE bug reporting tool to report this issue.
  • Bug Reporting (to a human) – We should tell a person we know who works on Google Cloud about this.

No Google Technical Escalation

We have no human contact at Google to escalate technical issues to. In the event our GAE settings continued to error, we would have no remediation path to fix the problem. This could present a problem for similar issues in the future that might require intervention by a Google employee to fix.

Recommended Follow-ups:

  • Find an FRC alum who works on Google Cloud – We should find someone inside of Google to whom we can escalate future issues that we are unable to self-remediate.

iOS App Crashing Due To Unexpected Response

The TBA mobile apps cache data locally, so they should continue to work even if the server is down. However, on launch the iOS app attempts to sync push notification tokens for signed-in users with the MyTBA endpoints. The iOS app expected valid responses from the MyTBA endpoints and did not handle the error responses, which resulted in crashes on launch.

Recommended Follow-ups:

  • Handle MyTBA errors gracefully – The iOS app should handle error responses from the MyTBA endpoints without crashing on launch.

48 Likes
#2

Thanks for writing this up, @Greg_Marra!

I want to stress the value of a productive and open incident management culture and hope that we can turn it into a useful learning experience for the rest of the community. Many of the TBA devs do this at our day jobs. Many of us run large production services at scale and try to apply some of those lessons to TBA.

So, if anyone has any questions or would like to discuss what it’s like to run TBA operationally, please don’t be shy - we’d love to have those conversations.

12 Likes
#3

Thank you for posting this! It was really interesting, insightful, and informative to read through.

And thank you for what you do to keep this site running!

3 Likes
#4

A few thoughts:

  • Is there a public breakdown of expenses (I’m thinking GCP/GAE, marketing), along with revenue coming in?
  • Have you identified a root cause for the spike? Was it the load from the new iOS app, API queries from the community, or simply that TBA is getting more popular?
  • I use the TBA API every once in a while, and it would be nice if there were some feature to help make sure I’m being a good citizen and not hogging things - for example, seeing my TBA footprint at a per-key level.
#5

Nope, although that’s probably something we can publish (cc @Greg_Marra).

I don’t think we have yet, although I’m willing to buy the theory that it’s due to added load from the iOS app this year (nice work, @Zach_O :slight_smile:). Unfortunately, I’m not sure we have good enough attribution to really know for sure.

This is an excellent idea - we should be able to at least instrument some basic stuff, like the overall number of requests per key. The API keys already live in the db and need to be loaded on every call, so we’d just have to add in a write (and maybe blacklist this for some really heavily used keys as a savings). I bet we can also track some more advanced db usage stats (or something that at least roughly correlates to TBA cost, like the total number of bytes in the response). The big caveat is that this can never be perfectly accurate due to the CloudFlare edge cache (most requests don’t actually hit a TBA webserver).
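To make the per-key instrumentation concrete, here is a minimal sketch of the kind of write being described. The helper name, blacklist, and counter scheme are illustrative only (not actual TBA code), and it assumes the memcache API available in the GAE standard runtime:

```python
# Illustrative sketch, not TBA's actual code: count API calls per key in
# memcache so the totals can later be flushed to the datastore or logs.
from google.appengine.api import memcache  # GAE standard runtime

HEAVY_KEYS = frozenset()  # hypothetical blacklist of very busy keys to skip


def track_api_call(api_key, endpoint):
    """Increment a per-key, per-endpoint counter for usage reporting."""
    if api_key in HEAVY_KEYS:
        return  # skip counting for the heaviest users to save on writes
    counter_key = "apiv3_calls:{}:{}".format(api_key, endpoint)
    # initial_value=0 creates the counter on first use instead of failing.
    memcache.incr(counter_key, initial_value=0)
```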

#6

There wasn’t a particular spike – you can see in the cost per day graph that our usage has just continued to grow steadily, and we eventually crossed the $20/day point I had set our GAE budget to previously. We’ve seen growth each year in total visits to TBA – the iPhone app is extremely exciting to have available, but doesn’t create a large increase in total costs.

1 Like
#7

This is something that could be fun to compile at the end of this competition season as a recap.

#8

Thank you for posting a good post-mortem. While the details are not important to me, the post shows students not yet in the real world what should happen (most) every time a problem is reported. In business, a glitch like that can lose you money, lots of it, so understanding the ‘smoking gun’ and archiving what was learned (as opposed to keeping it as Tribal Knowledge) is important.

Read, and Learn, young’uns.

#9

Did you guys just make up this whole story to guilt me into adding better caching to my event simulator? :grin:

Actually though, if you had some way to let the API users know when they are hogging things with unnecessary queries, that might kick some of us into better shape.

6 Likes
#10

We may work on better API insights at the API key level, both for efficiency reasons (although we’re not worried about the cost of providing the API to the community right now) and to better understand how people use it so we can improve it!

4 Likes
#11

I am one of those people who unnecessarily queries the API - but I’m trying to improve, I promise! A few weeks ago, I was thinking that it would be great to know how much I am using up, especially by API key. I split up all of my various projects into different v3 keys, so knowing which projects are querying a lot would be great for leak detection and would tell me if I need to kill an old key or application.

Would love a leaderboard of those who least efficiently query the API, I’m sure I’m near the top and could use some public humiliation. /s, sort of

#12

In case you’re unaware, Slack has per-channel notification settings, so a simple solution to this could be for all relevant parties to enable push notifications for all messages to your #monitor channel. I’m assuming that there’s nothing you’re sending in that channel that shouldn’t ping you.

1 Like
#13

Thanks for pointing this out! I’ve enabled them for myself and set the channel topic to encourage others to do the same.

This is a good example of the benefits of open incident reviews :slight_smile:

4 Likes
#14

Additional impact - 604’s scouting meeting was interrupted while Eugene tried to figure out why TBA was down. :frowning_face:

More constructively, there was a blog post in 2017 about reducing the impact of API queries on TBA. It would be helpful to know if that advice and those tips are still relevant and if there’s anything else we API consumers can do to be good citizens.
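For reference, one tip from that post that presumably still applies is ETag caching: keep the ETag from each response and send it back as If-None-Match so unchanged data comes back as a cheap 304. A rough sketch of that pattern (the auth key, event key, and in-memory cache are placeholders):

```python
# Rough sketch of ETag caching against the TBA v3 API. The auth key and
# in-memory cache are placeholders; a real client would persist them.
import requests

BASE = "https://www.thebluealliance.com/api/v3"
AUTH = {"X-TBA-Auth-Key": "YOUR_KEY_HERE"}  # placeholder key

_etags, _cache = {}, {}


def get(path):
    headers = dict(AUTH)
    if path in _etags:
        headers["If-None-Match"] = _etags[path]
    resp = requests.get(BASE + path, headers=headers, timeout=10)
    if resp.status_code == 304:
        return _cache[path]  # unchanged on the server; reuse cached data
    resp.raise_for_status()
    _etags[path] = resp.headers.get("ETag", "")
    _cache[path] = resp.json()
    return _cache[path]


matches = get("/event/2019casj/matches/simple")
```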

+1

1 Like