Should be back now - my bad. I’m testing out some stuff around moving off the old python2 runtime with … destructive consequences.
Testing in production. Classic.
…where does TBA find these clowns?
Facebook engineers using Google software interfacing with whatever Microsoft product HQ purchased repeatedly asked when there’ll be an app that runs on Apple hardware.
I think this is also a good opportunity to try and apply some of the things I do at work to stuff I do outside of it. My company has a process we call “SEV Review” (SEV stands for Site Event - something being broken enough to have real production impact). The most important part is that the process is used not as a vehicle of blame, but as a means of making the site better. With that in mind, let’s write up an incident report for how I broke TBA for 20 minutes.
As many people may know, The Blue Alliance is hosted using the Google App Engine standard runtime environment. It’s a project written in python that dates to ~2010, so it’s always been python2-based. We’ve wanted to migrate to the python3-based runtime for a while now, but it’s a difficult project, as there are a number of breaking changes in the underlying environment. Add in the fact that the TBA application is pretty tightly coupled to the runtime and you end up with a pretty complex migration. The theme running through the work is that the python3 GAE runtime is very barebones, with minimal bundled services (as opposed to the python2 version, which had many). Any external dependencies can be managed by the application itself through standard python requirements.txt files, which will end up being a more manageable solution for the long term. We’ve set the goal of having the site powered by the python3 runtime for the 2021 season, so there’s no time like the present to get started.
TBA’s underlying database uses the App Engine ndb framework to store everything. The traditional NDB library is bundled with the python2 runtime, so it needs to be migrated to a standalone library that we can “vendor” into the runtime. But more importantly, it’ll begin the process of weaning TBA off the bundled GAE services.
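One way to picture the first step of that kind of migration is a small import shim that application code goes through instead of importing an NDB library directly. This is only a sketch and not TBA’s actual code: the helper name `import_first_available` and the candidate ordering are assumptions made for illustration, while `google.cloud.ndb` and `google.appengine.ext.ndb` are the import paths of the standalone and bundled libraries respectively.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that imports cleanly.

    Hypothetical helper: it lets the rest of the codebase depend on one
    name while the underlying library is swapped out.
    """
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why each candidate failed so the final error is useful.
            errors.append("%s: %s" % (name, exc))
    raise ImportError("no candidate could be imported: " + "; ".join(errors))


# Assumed ordering: prefer the standalone library, fall back to the bundled one.
NDB_CANDIDATES = ["google.cloud.ndb", "google.appengine.ext.ndb"]
```

Application modules would then do something like `ndb = import_first_available(NDB_CANDIDATES)` in one place and import `ndb` from there, so a codemod only has to rewrite imports to point at that single module.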
The 500s were accompanied by the following stack trace. The site’s alert was open for a total of 22 minutes, during which the site would have been in a near-total outage (unless a request was served by the edge cache).
(/base/alloc/tmpfs/dynamic_runtimes/python27g/96042b99a0720f74/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py:269)
Traceback (most recent call last):
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/96042b99a0720f74/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 240, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/96042b99a0720f74/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 311, in _LoadHandler
    handler, path, err = LoadObject(self._handler)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/96042b99a0720f74/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 85, in LoadObject
    obj = __import__(path)
  File "/base/data/home/apps/s~tbatv-prod-hrd/prod-2.426421983306964781/main.py", line 9, in <module>
    from controllers.account_controller import AccountEdit, AccountLoginRequired, AccountLogin, AccountLogout, \
  File "/base/data/home/apps/s~tbatv-prod-hrd/prod-2.426421983306964781/controllers/account_controller.py", line 11, in <module>
    from base_controller import LoggedInHandler
  File "/base/data/home/apps/s~tbatv-prod-hrd/prod-2.426421983306964781/controllers/base_controller.py", line 20, in <module>
    from stackdriver.profiler import trace_context, TraceContext
  File "/base/data/home/apps/s~tbatv-prod-hrd/prod-2.426421983306964781/stackdriver/profiler.py", line 10, in <module>
    from googleapiclient import discovery
  File "/base/data/home/apps/s~tbatv-prod-hrd/prod-2.426421983306964781/lib/googleapiclient/discovery.py", line 49, in <module>
    import google.api_core.client_options
  File "/base/data/home/apps/s~tbatv-prod-hrd/prod-2.426421983306964781/lib/google/api_core/__init__.py", line 23, in <module>
    __version__ = get_distribution("google-api-core").version
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/96042b99a0720f74/python27/python27_lib/versions/third_party/setuptools-0.6c11/pkg_resources.py", line 311, in get_distribution
    if isinstance(dist,Requirement): dist = get_provider(dist)
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/96042b99a0720f74/python27/python27_lib/versions/third_party/setuptools-0.6c11/pkg_resources.py", line 197, in get_provider
    return working_set.find(moduleOrReq) or require(str(moduleOrReq))
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/96042b99a0720f74/python27/python27_lib/versions/third_party/setuptools-0.6c11/pkg_resources.py", line 666, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/base/alloc/tmpfs/dynamic_runtimes/python27g/96042b99a0720f74/python27/python27_lib/versions/third_party/setuptools-0.6c11/pkg_resources.py", line 565, in resolve
    raise DistributionNotFound(req) # XXX put more info here
DistributionNotFound: google-api-core
The 500s were caused by this commit. The commit’s intent was to begin building abstractions that make it possible to codemod the application from one NDB library to another. However, it required adding some dependencies to the deployment, which is often tricky to get right because the particular runtime configuration in prod is very difficult to replicate in a dev setup (much of the “magic” provided by the GAE runtime is exactly the thing that the py3 runtime will be getting rid of!). This is the root cause of the issue: not every third-party python library is available in the python2 runtime. Some come bundled with it and anything else has to be shipped with the application, whereas for development you install them on the development system using the standard means. The stack trace shows where the two environments diverged: importing google.api_core asks pkg_resources for the installed “google-api-core” distribution, and in prod that lookup raised DistributionNotFound. This is why the change worked on my local machine and not in prod.
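A check along these lines could catch this class of failure before a deploy, as long as it runs against the same set of files that actually gets shipped. It is a minimal sketch, not something TBA runs today: the function name `find_missing_distributions` is made up for illustration, and it uses Python 3’s `importlib.metadata` rather than the `pkg_resources` module that appears in the trace.

```python
from importlib import metadata


def find_missing_distributions(names):
    """Return the names whose installed-distribution metadata cannot be found.

    A library can be importable and still have no discoverable metadata,
    which is the situation a DistributionNotFound error points at.
    """
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

For example, a CI step could call `find_missing_distributions(["google-api-core"])` and fail the build if the returned list is not empty.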
A common pattern for incident reports uses the DERP format - let’s go through that next.
Detection (how did we discover the issue) - we have automated alerting on site 500s in our Slack channel. It fired even before this thread was started, so I reverted the offending commit as soon as I saw it.
Escalation (how did the right people get involved) - between monitoring the site while risky commits were being pushed and the couple of pings in the Slack channel, I was aware there were issues pretty quickly.
Remediation (how did fixing the actual issue go) - I reverted the commit and it deployed using TBA’s standard continuous deployment system. Once the push finished, the fatals stopped.
Prevention (how can this not happen again) - While changes of this nature are inherently risky, there’s some tooling we can invest in so we can catch issues like this more easily in the future.
- Make it easier to set up and run TBA on App Engine instances that are not the prod site. I actually have one of these, but some of the tooling is a bit kludgy. This will give us the benefit of having a prod-configured environment that doesn’t have prod-level impact when it breaks.
- Better integration testing in CI on pull requests. An interesting idea would be to spin up the development container and make some HTTP requests at it to sanity check the application as a whole (instead of isolated unit tests). The tricky thing would be seeding the datastore used here with something meaningful that won’t require constant upkeep as we add new events.
- Run smoke tests as part of continuous deployment. We could deploy to a staging “version” (which GAE supports) that runs the full prod configuration, and make requests against it using the prod db as well.
- The continuous deployment system should automatically roll back if the smoke checks in the last step fail.
- We should think about the tradeoffs in having a fully separate, long-running py3 branch and deploying it to a separate (non-default) set of modules, which would be accessible at py3.thebluealliance.com (or whatever). This will be workable for some changes, but not all (for example, codemods that require changing all of the project’s imports would be difficult to test in this way).
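The smoke test and automatic rollback items above could fit together roughly like this. It is a sketch under assumptions rather than a description of TBA’s deployment system: `deploy` and `rollback` are hypothetical callables standing in for whatever the CD pipeline actually does, and the list of paths to request is made up.

```python
import urllib.error
import urllib.request


def make_http_fetcher(base_url, timeout=10):
    """Build a function that maps a path to the HTTP status it returns."""
    def fetch_status(path):
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses arrive as exceptions; report their status code.
            return err.code
    return fetch_status


def run_smoke_checks(fetch_status, paths):
    """Return (path, status) pairs for every path that did not answer 200."""
    failures = []
    for path in paths:
        try:
            status = fetch_status(path)
        except Exception as exc:  # connection errors count as failures too
            status = repr(exc)
        if status != 200:
            failures.append((path, status))
    return failures


def deploy_with_rollback(deploy, rollback, fetch_status, paths):
    """Deploy, smoke check the result, and roll back if any check fails."""
    deploy()
    failures = run_smoke_checks(fetch_status, paths)
    if failures:
        rollback()
    return failures
```

Passing the fetcher in as a parameter keeps the rollback logic testable without a network: a test can hand in a fake that returns 500 for one path and assert that `rollback` was called.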
Hopefully this will provide some insight into what the TBA team is up to these days, as well as why we consider publicly accountable incident reports to be an important part of what we do.
If you like reading TBA postmortems, why not try this other one (although that one has prettier pictures).
And finally, if helping TBA migrate to a more modern runtime environment sounds like fun to you, we’re always looking for new contributors. Drop me a line and I’ll be glad to help get you started.