Yesterday (2014-04-08), from 22:39 to 23:16 UTC, Hosted Enterprise Chef search API requests were returning HTTP 502 response codes. One of the “killer features” of using a Chef Server is the search capability, so I know many of our customers rely on that API. I’m sorry that the search API was not available.
Then, from 2014-04-09 03:55 to 04:17 UTC, the search API began returning errors again, this time HTTP 500s. This affected approximately 10% of search requests, and I’m sorry about this degradation of service.
In this post, I’d like to provide some background on the infrastructure, what went wrong, how we responded, and the remediation steps that we’re taking to ensure this doesn’t happen in the future.
On April 7, 2014, the “Heartbleed” SSL vulnerability CVE-2014-0160 was announced. As many of you know, OpenSSL is widely used across the Internet, and we include OpenSSL and link against it in the web front end services for Open Source Chef, Enterprise Chef, and Hosted Enterprise Chef. We announced yesterday an update about what is affected in our Chef Server stack, and the steps we were taking to mitigate the vulnerability.
Hosted Enterprise Chef is not affected, as none of the external services using SSL are linked against a vulnerable version of OpenSSL. However, as a precautionary measure, we decided we would update OpenSSL packages in our infrastructure that were affected.
Our Search Infrastructure
For those unfamiliar with the internals of the Chef Server API, the reference implementation uses Apache Solr for indexing the JSON data, such as information about the nodes being managed. In our Hosted Enterprise Chef environment, we run Solr 1.4, and it is made highly available for redundancy. Due to the age of Solr in our environment, we are currently working to update Solr to version 4 to take advantage of various new features and bug fixes. The Erchef API sends all search queries to Solr 4 and Solr 1 simultaneously so we can measure performance and accuracy before we upgrade: the results from Solr 4 are compared against Solr 1 and the differences are logged, and the request times are aggregated to measure performance.
The Solr 1 systems that currently serve all your search queries are a highly available pair using pacemaker. However, since we’re only sending queries to Solr 4 for comparison and rely on Solr 1 as the primary service, we haven’t set up Solr 4 under HA yet. This became an issue because our alert monitoring for the Solr service goes through the VIP managed by pacemaker, so Solr 4, which has no VIP, was not covered by that monitoring.
The services we care most about – the ones that run the API service – are almost all managed by runit. For the services that are also “HA”, such as Solr, we don’t start them by default on the nodes; instead, we create the /etc/service/SERVICENAME/down file. This lets the HA software (pacemaker) manage the service using sv control commands.
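As a sketch of that convention (the service name and directory here are illustrative stand-ins; the real service directories live under /etc/service):

```shell
# Sketch of the runit "down file" convention. We use a temp directory as a
# stand-in for /etc/service so this is safe to run anywhere.
svdir=$(mktemp -d)              # stand-in for /etc/service
mkdir -p "$svdir/solr4"         # service directory (name is illustrative)
touch "$svdir/solr4/down"       # a ./down file tells runsv not to auto-start
ls "$svdir/solr4"               # shows: down
# In production, pacemaker then starts and stops the service explicitly:
#   sv start solr4
#   sv stop solr4
```

With the down file in place, runit supervises the service but leaves it stopped until something (pacemaker, or an operator) starts it deliberately.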
We had two separate outages on Tuesday, and I’ll cover each one in turn as they have different root causes.
The package upgrades were done using apt-get upgrade, as experience has shown that this is a (generally) non-impactful way to upgrade all packages. Our production environment runs Ubuntu 10.04 and Ubuntu 12.04, and systems are launched at different points in time as we add features to the service. Not all the systems needed the same package updates, but the runit package in particular was common to several of them. In the past, upgrading runit has been an uneventful experience. During this package installation, however, the runsvdir master supervisor was stopped. We manage the runsvdir.conf upstart file with Chef in order to set filehandle limits for services managed by runit. During the package install, the local copy was preserved; it contains the following lines:
    pre-stop script
        set +e
        killall -HUP runsvdir
        exit 0
    end script
According to the runsvdir manual, “if runsvdir receives a TERM signal, it exits with 0 immediately” and “If runsvdir receives a HUP signal, it sends a TERM signal to each runsv(8) process it is monitoring and then exits with 111.”
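To make that behavior concrete, here is a toy stand-in – purely illustrative, not runsvdir itself – that mimics the documented signal handling:

```shell
# Toy supervisor mimicking runsvdir's documented signals: on HUP it TERMs
# its child and exits 111; on TERM it would exit 0 immediately.
cat > /tmp/toy_supervisor.sh <<'EOF'
#!/bin/sh
sleep 30 & child=$!
trap 'kill -TERM "$child" 2>/dev/null; exit 111' HUP
trap 'exit 0' TERM
wait "$child"
EOF
sh /tmp/toy_supervisor.sh & sup=$!
sleep 1                              # give it time to install its traps
kill -HUP "$sup"                     # this is what the pre-stop script sends
status=0
wait "$sup" || status=$?
echo "toy supervisor exited $status" # expect 111, per the manual's description
```

The key point for our outage is the cascade: HUP doesn’t just stop the supervisor, it tears down every supervised child first.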
What this means is that the service stop performed by the package upgrade stopped runsvdir with the HUP signal, causing all of the runsv processes it was monitoring – and the services they supervise – to stop as well. Then, when the package installation started runsvdir again, the services with a down file in their service directory were not restarted. In the past this wasn’t an issue, because we didn’t manage limits and used the default runsvdir.conf from the package.
After the packages were updated, Solr 4 was not restarted because its down file tells runit to leave it stopped, and traffic from the Erchef API servers wasn’t able to reach it, causing the HTTP 502 response codes.
The second outage, where search queries were returning 500s approximately 10% of the time, was simpler. The Erchef service on one of the nodes was running out of Erlang processes, which caused memory spikes and caused requests to that system to fail. Other Erchef nodes didn’t appear to be affected, so the service was degraded rather than completely down.
So how did we stabilize the service to resolve the problems?
In the first outage, with the failing Solr requests, we modified the Erchef configuration so it wouldn’t send requests to Solr 4. We also restarted the Solr 4 service, but the actual fix was in Erchef.
In the second, we restarted Erchef on the affected node. Then, we performed a rolling restart of Erchef across the remaining nodes to mitigate the process leak. An issue here was that we didn’t have sufficient monitoring of the Erlang VM in Erchef, so we didn’t know it was approaching the Erlang process limit.
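As an illustration of the kind of check we were missing, here is a sketch of a process-limit alert. All values and names are hypothetical stand-ins; in a real check the count and limit would be polled from the running VM (Erlang exposes them as erlang:system_info(process_count) and erlang:system_info(process_limit)):

```shell
# Hypothetical alert check for Erlang process exhaustion. The numbers are
# stand-ins for values polled from the VM.
process_count=30000   # stand-in for the polled process count
process_limit=32768   # common default limit on older Erlang releases (+P flag)
pct=$(( process_count * 100 / process_limit ))
if [ "$pct" -ge 75 ]; then
  echo "WARN: erchef at ${pct}% of the Erlang process limit"
fi
```

A check like this, wired into alerting, would have flagged the leaking node well before requests started failing.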
How Will We Improve?
One of the most important things about doing a post-mortem analysis of these kinds of problems is learning from mistakes, and learning how individual component failures can cascade. We have several corrective actions that the team will take to ensure these kinds of outages don’t happen in the future.
We now better understand the behavior of runsvdir when it is restarted. Upgrading the runit package will be done in a more isolated way to ensure it doesn’t impact production services outside a maintenance window.
We will complete the migration to Solr 4. This means we’ll bring up a second HA node, and get the aforementioned VIP monitoring in place so we’re aware of problems with the service itself. In the interim we’ll also ensure that failure of Erchef to connect to Solr 4 doesn’t affect search queries, by ensuring they go to Solr 1 as normal. Once Solr 4 migration is complete, that will be a non-issue.
We need to improve our monitoring of Erlang in general, build statistical models of our upstream services (not just Solr), and alert when an abnormal number of 500s occurs. In addition, we need to spread more knowledge about debugging and managing Erlang services, as our team has grown in the last several months.
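A minimal sketch of the 500-rate idea (the sample status codes, file path, and 10% threshold are all illustrative; a real check would consume recent response codes from the load balancer or API logs):

```shell
# Hypothetical 5xx-rate check: given the status codes from a recent window,
# alert when an abnormal share of them are server errors.
printf '%s\n' 200 500 200 500 200 500 200 200 200 200 > /tmp/recent_codes
total=$(wc -l < /tmp/recent_codes)
errors=$(grep -c '^5' /tmp/recent_codes)   # count 5xx responses
pct=$(( errors * 100 / total ))
if [ "$pct" -ge 10 ]; then
  echo "ALERT: ${pct}% of recent responses were 5xx"
fi
```

The statistical models we’re planning go further than a fixed threshold like this, but even a simple windowed error-rate check would have caught the 10% failure rate in the second outage.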
We plan to improve our incident response, assigning incident commanders and notifying the appropriate parties to get communications out to the status site and Twitter.
Again, I’m sorry about the outages. We weren’t able to get everything restored to normal state as quickly as we’d like. I know our customers rely on the search feature to dynamically build out their infrastructure, and we let you down.