Hosted Chef Degraded Performance Post Mortem

On Friday, October 26th, between 12:28 and 20:46 UTC, Opscode Hosted Chef was in a degraded state, with significantly reduced throughput and a high rate of timeout errors for API requests. We deeply regret the inconvenience to our customers and want to explain the cause of the reduction in performance.

This outage was directly related to a single API call that triggered a bug in the API endpoint responsible for cookbook dependency resolution. The bug caused the worker processing the request to become unresponsive. Once a majority of workers had received the problematic request, API responsiveness degraded.

The symptoms of the bug, high CPU usage with no crash or stack trace, led us to rule out external factors first, before returning to investigate the code paths being executed by the worker processes using strace. The system call traces revealed a problematic API call from a single organization. Blocking this call restored the platform to normal operation.

We are working with the affected customer to address the cookbook dependency issues that triggered the bug. We are deploying code fixes for the specific bug, which was triggered by circular cookbook dependencies, as well as more general measures to prevent a similar bug from impacting the entire system. We have also expanded monitoring and alerting to provide greater detail for a number of API calls.
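
To illustrate the class of problem described above, the following is a minimal sketch of how a circular dependency between cookbooks could be detected before any resolution work begins. It is not the Hosted Chef implementation or the fix being deployed; the function name, the dictionary-based dependency map, and the cookbook names are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical helper: `deps` maps a cookbook name to the names it depends on.
def find_dependency_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}

    for root in deps:
        if state.get(root, UNVISITED) != UNVISITED:
            continue
        # Iterative depth-first search, so deep graphs cannot exhaust the call stack.
        path = [root]
        stack = [(root, iter(deps.get(root, [])))]
        state[root] = IN_PROGRESS
        while stack:
            node, children = stack[-1]
            for child in children:
                child_state = state.get(child, UNVISITED)
                if child_state == IN_PROGRESS:
                    # The child is already on the current path: a cycle exists.
                    return path[path.index(child):] + [child]
                if child_state == UNVISITED:
                    state[child] = IN_PROGRESS
                    path.append(child)
                    stack.append((child, iter(deps.get(child, []))))
                    break
            else:
                # Every dependency of this cookbook has been fully explored.
                state[node] = DONE
                stack.pop()
                path.pop()
    return None


if __name__ == "__main__":
    # Made-up cookbook names: app -> db -> base -> app forms a cycle.
    cookbooks = {"app": ["db"], "db": ["base"], "base": ["app"]}
    print(find_dependency_cycle(cookbooks))
```

A check along these lines would let a request with circular dependencies be rejected with an error up front instead of occupying a worker indefinitely.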

Again, we are very sorry for any inconvenience this caused you, and we remain committed to providing the highest quality product and experience on the net.

Pauly Comtois