On Thursday, March 26th, Hosted Chef experienced a degradation in service in which logging in to oc-id, Hosted Chef’s identity service, periodically failed. This failure made it difficult to log in to Supermarket, Hosted Chef’s profile page, and oc-id itself, since each of these systems relies on oc-id for its authentication tokens. During this degradation all other systems functioned normally and there was no impact on any Chef client runs that use Hosted Chef. Users who had already successfully logged in to one of these systems also saw no impact, as this degradation only affected new logins.
I’m sorry this impacted your ability to log in to Hosted Chef and other services. In this post I’m going to talk about why this degradation occurred and what steps Chef is taking to ensure this does not happen again.
I’d also like to apologize for the lateness of this post. The incident occurred on Thursday, March 26th and was resolved the same day. Chef held an internal postmortem on Friday, March 27th, and one of the actions decided at that postmortem was to write this blog post. Ideally, this post would have gone up the following week, but it did not because we were all very busy at ChefConf. In the future we will do our best to be more timely in posting information such as this.
So what happened? Some Chef engineers noticed odd behavior when trying to log in to supermarket.chef.io. They consulted with the engineers responsible for oc-id, who verified that something was amiss with logging in to oc-id in Hosted Chef. An incident was declared and an investigation began. It determined that oc-id was periodically receiving 500s when it attempted to communicate with the Erchef nodes (the Erlang component of the Chef server) in Hosted Chef. oc-id communicates with the Chef server to authenticate users, since user data is stored on the Chef server. oc-id functions as an OAuth provider: once it verifies the user data with the Chef server, it issues an OAuth token. oc-id was periodically failing in this handshake with the Chef server, but not on every request.
The team stepped back to examine what had changed recently that might be causing these failures. There had been no recent deploys of oc-id. However, there had been changes to Erchef. A series of upgrades had been under way in Hosted Chef to move from Ubuntu 10.04 to Ubuntu 12.04. Erchef, being a long-running component in Hosted Chef, had lived on Ubuntu 10.04 for a long time. To make the upgrade to Ubuntu 12.04 as smooth as possible, Erchef running on Ubuntu 12.04 was being canary deployed into Hosted Chef. Initially, a single Erchef node running on Ubuntu 12.04 had been introduced into Hosted Chef. This node was monitored and no issues were observed. At the time the oc-id failures were observed, half the Erchef nodes were running on Ubuntu 12.04.
This was recognized as the most likely cause, even though it was not yet understood exactly why. The Erchef nodes on 12.04 were therefore pulled out of rotation, and the idle Erchef nodes on 10.04 were added back to the rotation so that capacity did not change. Tests of oc-id’s login capability verified that it was fully operational again: there were no more login errors and no more 500s from talking to the Chef servers. The Erchef servers running on 12.04 had been the issue. But why?
After investigation, it was determined that the new servers were missing the private key that oc-id uses to authenticate itself with Erchef. The cookbooks had been updated at the same time the servers were provisioned. In that update the private key was removed from the servers, because it was thought that the key was not needed and the goal was to remove any unused keys.
Once the first Erchef server running on 12.04 was put into the rotation, oc-id began returning 500s. This was missed at the time because the monitoring around oc-id itself was insufficient. As more of the new servers were added to the rotation the 500s increased, since it became more likely that a request would hit one of the new servers. This still did not trigger any alerts, however, because traffic to oc-id is such a small share of the overall traffic in Hosted Chef that the failures did not reach the alerting threshold for elevated 500 levels. Only when the behavior was directly observed by Chef employees was the issue finally discovered (we did not receive any customer reports of this issue).
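To illustrate why a threshold on overall 500 levels can miss a failure confined to a low-traffic service, here is a minimal sketch. The request counts, the 1% threshold, and the function names are all hypothetical; they are not taken from Hosted Chef’s actual monitoring.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of requests that returned a 500; 0.0 when there is no traffic."""
    return errors / total if total else 0.0


def should_alert(errors: int, total: int, threshold: float = 0.01) -> bool:
    """Alert when the 500 rate meets or exceeds the (hypothetical) threshold."""
    return error_rate(errors, total) >= threshold


# Hypothetical numbers: oc-id handles a tiny share of overall traffic.
oc_id_total, oc_id_errors = 200, 100        # half of oc-id requests fail
other_total, other_errors = 1_000_000, 50   # everything else is healthy

# Aggregate view: the oc-id failures are diluted by the rest of the traffic.
aggregate_alert = should_alert(oc_id_errors + other_errors,
                               oc_id_total + other_total)

# Per-service view: the same failures stand out when oc-id is measured alone.
oc_id_alert = should_alert(oc_id_errors, oc_id_total)

print(f"aggregate alert: {aggregate_alert}, oc-id alert: {oc_id_alert}")
```

With these made-up numbers the aggregate check stays quiet while the oc-id check fires, which is the pattern described above.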
The possibility of encountering this issue existed in Hosted Chef for twenty-one hours, as that was the length of time the first Erchef node running on 12.04 had been in the rotation before it was pulled. oc-id remained largely functional for most of that time, because the number of canaried Erchef nodes on 12.04 was low. It was only for the last hour of the issue, when the number of canaried Erchef nodes on 12.04 had been increased to 50% of the cluster, that oc-id failed for most login requests.
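The relationship between the canary fraction and login failures can be sketched roughly as follows. This assumes requests are spread evenly across Erchef nodes and that a login fails if any one of its requests to Erchef lands on a node without the key. The node counts and the number of requests per login are hypothetical, not measured values from the incident.

```python
def login_failure_probability(new_nodes: int, total_nodes: int,
                              requests_per_login: int = 1) -> float:
    """Chance that at least one request in a login hits a node missing the key.

    Assumes uniform load balancing and independent requests (both are
    simplifying assumptions made for illustration only).
    """
    if total_nodes <= 0 or not 0 <= new_nodes <= total_nodes:
        raise ValueError("need 0 <= new_nodes <= total_nodes and total_nodes > 0")
    if requests_per_login < 1:
        raise ValueError("requests_per_login must be at least 1")
    p_bad = new_nodes / total_nodes
    return 1 - (1 - p_bad) ** requests_per_login


# Hypothetical cluster of 10 Erchef nodes.
early_canary = login_failure_probability(1, 10)        # single canary node
half_cluster = login_failure_probability(5, 10)        # 50% on 12.04
half_cluster_two_requests = login_failure_probability(5, 10, requests_per_login=2)

print(early_canary, half_cluster, half_cluster_two_requests)
```

Under these assumptions a single canary node affects only a small share of logins, while a half-canaried cluster fails half of all single-request logins and three out of four logins that need two requests.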
Since the resolution of this incident, the cookbook used to deploy Erchef has been updated to ensure the needed key is placed on the system. All Erchef servers running on 12.04 have also been added back into the rotation. Monitoring is being added around the Erchef servers to verify that the key needed by oc-id is in place. Additional monitoring is being added around oc-id to verify that it is functioning properly.
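A check of the kind described above could look something like the following minimal sketch. The function name and the key file name are hypothetical, and the demonstration uses a temporary directory rather than a real Erchef server; the actual monitoring added to Hosted Chef may work quite differently.

```python
import os
import stat
import tempfile


def key_is_in_place(path: str) -> bool:
    """True if path is a regular, non-empty file readable by this process."""
    try:
        info = os.stat(path)
    except OSError:
        # Missing file, unreadable parent directory, and so on.
        return False
    return (stat.S_ISREG(info.st_mode)
            and info.st_size > 0
            and os.access(path, os.R_OK))


# Demonstration against a temporary directory (file names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "oc-id-key.pem")
    with open(present, "w") as handle:
        handle.write("placeholder key material\n")
    missing = os.path.join(tmp, "missing-key.pem")

    present_ok = key_is_in_place(present)
    missing_ok = key_is_in_place(missing)

print(f"present: {present_ok}, missing: {missing_ok}")
```

A monitoring system could run a check like this on each Erchef server and raise an alert whenever it returns False, so that a key missing from a newly provisioned node is caught before that node takes traffic.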
In addition to these fixes, we’re trying to be more transparent, as with this blog post. When functionality such as Supermarket is affected, we plan to try to hold more public postmortems that the community can attend, although we failed to hold a public postmortem in this case.
We’re sorry for the degradation of service around oc-id in Hosted Chef. We know that you depend on Hosted Chef to manage your infrastructure and that you’ve placed a lot of trust in us. We’ll continually work to ensure that we earn your trust.
Thank you for your support.