Last Tuesday (04-08-2014) at 7:12 PM, we released Chef Client 11.12.0, which contained 3 regressions (CHEF-5198, CHEF-5199 and OHAI-562). We released a new version of Chef Client that addressed these issues on Wednesday (04-09-2014) at 6:28 PM. However, during the roughly 24 hours in between, users of Chef were not able to:
- Mitigate the Heartbleed vulnerability.
- Complete a Chef run that includes a file download from a server that serves gzipped content without chunked transfer. (A very common scenario.)
- Reload custom Ohai 6 plugins via the ohai resource. (Particularly impactful to the users of the nginx cookbook.)
- Upload a cookbook that has a return statement in it.
I’m aware that you rely on Chef in extremely critical situations. I’m sorry that Chef Client was not able to live up to your expectations during this time window, especially in the middle of the critical Heartbleed vulnerability response.
In this blog post I’ll share the results of our postmortem about this incident.
Events & Response
Based on our original plans, Chef Client 11.12.0 was expected to ship on Friday (04-11). However, with the announcement of the Heartbleed vulnerability, and in order to deliver the mitigation for this vulnerability more quickly, we pulled this release forward by 3 days.
CHEF-5198 was filed at 8:00 PM on Tuesday. At that moment we were working on the 10.32.0 release, again in the context of the Heartbleed vulnerability. We’d seen some signs of the regressions, but it was not until Wednesday morning that we were able to regroup and start investigating the reports. We responded to the filed issues at 9:48 AM on Wednesday. By this time we had reproduced the test cases and started working on fixes for the issues.
At 1:00 PM we finished local testing of the fixes and hit the build button in our Continuous Integration (CI) pipelines. Due to some stability issues in our CI infrastructure, it took 5.5 hours to deliver this release to you, at 6:30 PM.
Mistakes and Remediations
Vulnerability Mitigation Releases
The first mistake in the chain was to ship 11.12.0 in response to the reported vulnerability. If we had made a patch release (e.g. 11.10.6) containing only the mitigation for the vulnerability, the first impact we caused would have been avoided. We chose 11.12.0 for speed: we were pretty confident in it, and it also included a mitigation for another vulnerability, in libyaml (CVE-2014-2525).
To prevent this from happening again, as a policy we will make smaller-scoped patch releases when shipping mitigations for security vulnerabilities.
Extensive Test Coverage
Obviously if we’d had test coverage for the specific regressions we encountered, all of the errors would have been prevented. We will be filling the exposed holes in our functional test suite; however given the variety of platforms Chef Client is running on and the depth of its functionality, we will continue to have holes in our functional test coverage.
This incident made us re-think our testing strategy for Chef Client and as a result we’ve decided to bump the priority of one of the improvements that we have been discussing for a while: “End to End integration tests for Chef Client”.
We will be working on building tests that cover the common end-to-end scenarios and ensure that they work in all of our releases. Ideally we would also like to hook these tests into Travis so that they run automatically for your contributions as well.
One of the other major gates before our releases has been “Dogfooding”. We use release candidates of our software as much as possible, and we update our pre-production infrastructure with all the Chef Client release candidates. However, for the 11.12.0 release we skipped this step, since we needed to cut off access to our pre-production infrastructure because of Heartbleed.
In order to help with the test coverage, moving forward we will increase the depth of our dogfooding efforts.
CI Infrastructure Stability
Lastly, taking 5.5 hours to release a build is unacceptable to us. Our awesome Release Engineering team have been re-engineering our CI infrastructure and building a world class Continuous Delivery system which will enable us to ship daily Chef Client releases with high confidence. We will talk more about this at ChefConf.
In the meantime, to improve our response time in incidents like this, we will be taking the steps below to improve our CI infrastructure:
- Improve the stability of our CI cluster with Chef.
- Improve our Solaris testing speed with better hardware.
- Investigate intelligent ways of running tests. (For instance, we don’t need to run the full knife test suite if the Solaris package provider is changed.)
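To make the last item a bit more concrete, here is a minimal sketch of change-based test selection. It is only an illustration of the idea, not how our CI pipelines work: the path patterns, the suite names and the select_suites helper are all hypothetical, and the sketch assumes we can get a list of changed file paths for a build.

```python
from fnmatch import fnmatch

# Hypothetical mapping from source path patterns to the test suites they
# affect. Both the patterns and the suite names are illustrative assumptions.
SUITE_RULES = [
    ("lib/chef/knife/*", {"knife"}),
    ("lib/chef/provider/package/solaris.rb", {"solaris_package"}),
    ("lib/chef/provider/*", {"providers"}),
]

# Hypothetical list of every suite; used as the safe fallback.
ALL_SUITES = {"knife", "solaris_package", "providers", "core"}


def select_suites(changed_paths):
    """Return the set of test suites to run for the given changed paths.

    Any path that matches no rule triggers the full set of suites, so an
    unmapped change is never silently left untested.
    """
    selected = set()
    for path in changed_paths:
        matched = [suites for pattern, suites in SUITE_RULES
                   if fnmatch(path, pattern)]
        if not matched:
            # Unknown territory: fall back to running everything.
            return set(ALL_SUITES)
        for suites in matched:
            selected |= suites
    return selected


if __name__ == "__main__":
    # A change to the (assumed) Solaris package provider path selects the
    # provider-related suites but not the knife suite.
    print(sorted(select_suites(["lib/chef/provider/package/solaris.rb"])))
```

The important design choice in a scheme like this is the fallback: when a changed file is not covered by any rule, the full test suite runs, so the speedup never comes at the cost of skipping tests for code nobody thought to map.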
Again, I’m sorry about the regressions and delaying our response to the Heartbleed vulnerability. We let you down at a critical time.
In addition to continuously striving to improve our incident response, we will take these steps to prevent the same things from happening again:
- Vulnerability mitigation release policy
- End-To-End Chef Client integration tests
- Even more dogfooding
- Stabilization of our existing CI cluster while building a new cluster for continuous releases
As usual we’re listening anytime you want to reach out to us:
Chef IRC, Chef-Hacking IRC, Chef Mailing Lists, File an issue for Chef