Hosted Chef Service Interruptions

On Friday, April 17th Hosted Chef experienced an average error rate of approximately 30% for a period of 4 hours, followed by intermittent brief periods of similar error rates until April 22nd. During these periods the majority of chef-client runs failed, and even successful requests saw greatly increased latency.
I know we have many users relying on Hosted Chef to manage and deploy systems and I’m sorry for this incident. In this post I will explain how this failure occurred and what steps we have taken and will take to ensure it doesn’t happen again.

At 08:27 UTC on April 17th the on-call engineer was paged due to HTTP 500 errors on the API and an exceptionally high load on the database. At 08:49, after a brief investigation into the database load, both the load and the error rate fell back to normal levels on their own. No further problems were seen until 20:00 UTC, when a similar spike in database CPU usage, system load, and error rate was seen. At that time the status page was updated and engineering staff began investigating the cause of the increased load. From 20:00 to 23:50 engineering staff investigated several potential causes. We decided to temporarily shut off access to the API from Berkshelf due to an exceptionally high number of ongoing Berkshelf requests (later discovered to be caused by aggressive retry behavior on failure). At 23:50 we discovered that quick restarts of all application servers temporarily resolved the symptoms.

Over the weekend of April 18th to April 20th, Chef engineers set up an alert and checked database load every hour, restarting services as necessary to avoid a prolonged outage. On April 20th around 16:00 UTC, after discussing the investigations performed over the weekend, we identified the following contributing factors:

* Increasing traffic on Hosted Chef
* A poorly optimized database query and loop in the Chef Server code making an excessive number of database calls
* Spikes in traffic at the top of the hour
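
The post does not show the actual Chef Server query, but the common shape of a "loop making an excessive number of database calls" is the N+1 pattern: one query per item inside a loop where a single batched query would do. A hypothetical illustration, with an in-memory stand-in for the database:

```ruby
# Hypothetical illustration of the N+1 query pattern. DB is a stand-in
# that counts how many queries it receives; it is not Chef Server code.
class DB
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @query_count = 0
  end

  # One row per call -- the per-iteration query shape.
  def fetch_one(id)
    @query_count += 1
    @rows[id]
  end

  # All requested rows in a single call -- the batched shape.
  def fetch_many(ids)
    @query_count += 1
    ids.map { |id| @rows[id] }
  end
end

rows = { 1 => "a", 2 => "b", 3 => "c" }
ids  = rows.keys

slow = DB.new(rows)
ids.each { |id| slow.fetch_one(id) } # N queries for N ids

fast = DB.new(rows)
fast.fetch_many(ids)                 # 1 query for the same N ids
```

Under steady traffic the per-iteration version is merely wasteful; under a top-of-the-hour spike it multiplies the query volume exactly when the database can least afford it.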

A group of engineers started on a fix for the database queries and loop in the Chef Server code. Another group began work on a script to reproduce these conditions and confirm the cause. By 05:19 UTC on April 21st the reproduction script was completed, and tests performed with it validated the chosen course of action.
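
The reproduction script itself is not published, but the core idea behind reproducing a top-of-the-hour spike is to wake many clients at the same instant, the way synchronized chef-client runs do. A hypothetical sketch (the client count and request block are illustrative, not the script Chef used):

```ruby
# Hypothetical sketch of a load-reproduction harness: start many
# threads that all fire their request at the same moment, mimicking
# the top-of-the-hour spike of synchronized chef-client runs.
# The block supplies the actual request, e.g. an HTTP call against
# a test server.
def synchronized_burst(clients: 50, &request)
  start = Time.now + 0.1 # common start time shared by all threads
  threads = clients.times.map do
    Thread.new do
      delay = start - Time.now
      sleep(delay) if delay > 0 # every thread wakes at the same moment
      request.call
    end
  end
  threads.map(&:value)
end
```

Comparing server behavior under a synchronized burst versus the same number of requests spread evenly over time is what distinguishes a load problem from a synchronization problem.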

On April 22nd a build of Chef Server was completed and tested in a non-production environment. At 17:28 UTC the build was deployed to Hosted Chef. The number of queries, database load, and database response time all dropped dramatically, and the top-of-the-hour service interruptions ceased. Engineers continued to monitor the error rate and database load throughout the day, and at 22:10 UTC the status page was updated and the incident was marked as resolved.

In addition to the outage on April 17th, there were brief interruptions ranging from 1 to 5 minutes at the following times:

* 4/18 04:00
* 4/18 14:00
* 4/18 15:00
* 4/18 16:00
* 4/18 23:00
* 4/19 09:00
* 4/19 13:00
* 4/19 16:00
* 4/19 17:00
* 4/19 21:00
* 4/19 22:00
* 4/19 23:00
* 4/20 17:00
* 4/20 20:00
* 4/20 21:00
* 4/20 22:00
* 4/20 23:00
* 4/21 00:00
* 4/21 01:00
* 4/21 02:00
* 4/21 08:00
* 4/21 12:00
* 4/21 15:00
* 4/21 16:00
* 4/21 17:00
* 4/21 18:00
* 4/21 19:00
* 4/22 05:00

The potential for this self-reinforcing error loop has existed in Chef Server for some time. While the conditions that led to the loop starting have been addressed, engineers are still working on the architectural concerns that prevent these errors from self-resolving under certain conditions. In the meantime, additional monitoring and internal documentation to detect and resolve this problem are in place.

I’m sorry for the interruption in service and the inconvenience that it caused. Chef engineering takes the trust that our customers have in our products very seriously and we are committed to improving the user experience of Hosted Chef. We are also committed to learning from our mistakes and doing everything we can to prevent this issue in the future.

Paul Mooring
Operations Lead
