
Remediating “escaped defects” within continuous delivery

Today I’m going to share a story about an incident we had here at Chef on September 9, 2015. Normally we don’t make a public blog post about an internal-only incident, but this particular issue had contributing factors related to our ChefDK project and affected the release schedule for version 0.8.0. The incident provided some excellent learning opportunities for us and the community, and it also shows how a continuous delivery model helps us build better software, faster.

What Happened?

We run an internal Chef Delivery server that is used to deploy several of our sites, such as Learn Chef and Downloads. These sites are generated and then stored in S3 buckets whose content is managed by a Chef Provisioning recipe. The recipe is run by a Chef Delivery build node that has a nightly ChefDK package installed from our current channel repository. On September 9th, the Learn Chef and Downloads site pipelines in Delivery started failing during the “publish” phase with the following error:

[code]
superclass mismatch for class Notification
[/code]

In this post, I will explain the background of the incident, the contributing factors that led up to the failure, what we did to stabilize and remediate the problem, and the corrective actions we’re taking to reduce the chance of this kind of incident occurring in the future. We call this kind of incident an “escaped defect” because it stems from a software bug that could have been caught earlier. In this case, we caught it before it reached customers!

Background

It will be helpful to briefly explain how Chef Delivery works. Each project has a pipeline that moves through several stages, each stage runs a series of phases, and each phase has a corresponding Chef recipe that Delivery runs. In the “Verify” stage, Delivery runs the lint, syntax, and unit phases. The “Build” stage runs those same phases again, followed by quality, security, and finally the “publish” phase. For our web site pipelines, such as Learn Chef and Downloads, the “publish” phase pushes the generated content out to an S3 bucket as an artifact. The recipe looks like this:

[code language="ruby"]
require 'chef/provisioning/aws_driver'

artifact_bucket = "PROJECT_NAME-artifacts"

aws_s3_bucket artifact_bucket do
  # other resource properties...
end

execute 'build the site' do
  command 'bundle exec middleman build --clean --verbose'
  # other properties...
end

execute "create the tarball" do
  command "tar cvzf #{build_name}.tar.gz --exclude .git --exclude .delivery build"
end

execute "upload the tarball" do
  command "aws s3 cp #{build_name}.tar.gz s3://#{artifact_bucket}/"
end
[/code]

Delivery uses “build nodes” that run these recipes. In our case these are EC2 instances running Ubuntu 14.04 that use ChefDK’s bundled Chef. We use ChefDK because many of the pipelines in Chef Delivery are Chef cookbooks, and we want all of the development tools available for running the aforementioned phases: lint (rubocop, foodcritic), syntax (knife), and unit (chefspec). We have a Chef recipe that installs the latest ChefDK nightly build from our current package repository, but this recipe is applied only when someone manually runs chef-client; we don’t have Chef running as a service on these build nodes.
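For illustration, a minimal sketch of that kind of install recipe is below. The attribute name, package URL, and version are hypothetical placeholders, not our actual internal current-channel repository:

[code language="ruby"]
# Minimal sketch only: the attribute, URL, and version are placeholders.
chefdk_version = node['chefdk']['version'] # e.g. a nightly such as '0.8.0+20150908'
chefdk_deb     = "chefdk_#{chefdk_version}-1_amd64.deb"

# Download the nightly package into Chef's file cache...
remote_file "#{Chef::Config[:file_cache_path]}/#{chefdk_deb}" do
  source "https://packages.example.com/current/ubuntu/14.04/#{chefdk_deb}"
  mode '0644'
end

# ...and install it with dpkg.
dpkg_package 'chefdk' do
  source "#{Chef::Config[:file_cache_path]}/#{chefdk_deb}"
  action :install
end
[/code]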

And therein lies the first part of our contributing factors.

Contributing Factors

Given enough time, all systems fail, and complex systems fail in complex ways. It is very rare for any incident to have a single root cause, so it is important to look at the contributing factors, the “hows” behind the failure, rather than try to answer the five “whys”. Due to the complexity of our software and build infrastructure, a number of factors led up to this escaped defect.

We install the Ruby applications in the ChefDK omnibus package with a utility called appbundler, which “reads a Gemfile.lock and generates code with gem "some-dep", "= VERSION" statements to lock the app’s dependencies to the versions selected by bundler.” In ChefDK, we have an “appbundle” for the chef gem. This means we have both a source checkout, /opt/chefdk/embedded/apps/chef, and a RubyGems installation, /opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/chef-VERSION, where VERSION is the desired version for a particular ChefDK release.
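To make that concrete, an appbundler-generated executable looks roughly like the sketch below. This is illustrative only; the gem names and versions are made up rather than taken from an actual ChefDK build:

[code language="ruby"]
#!/opt/chefdk/embedded/bin/ruby
# Illustrative sketch of an appbundled binstub; versions here are made up.
# Every dependency from the app's Gemfile.lock is pinned to an exact version...
gem "chef", "= 12.4.1"
gem "ohai", "= 8.5.1"
gem "mixlib-shellout", "= 2.1.0"

# ...and then the app's own executable is loaded.
load Gem.bin_path("chef", "chef-client")
[/code]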

As part of the Omnibus build for ChefDK, we also include RubyGems, and a few days before the incident we merged a change to bump the RubyGems version from 2.4.4 to 2.4.8. It appears that a change introduced in RubyGems 2.4.5 alters how requires resolve for the app-bundled “chef” application: when another gem that depends on “chef” is loaded, its requires resolve to the gem install rather than the source checkout, so the same files get loaded a second time. Loading a file multiple times is bad, but normally it only emits a lot of warnings and isn’t fatal. However, the class in question subclasses Struct.new, and because Struct.new returns a new anonymous class on every call, re-evaluating that class definition is fatal.
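Here is a small, self-contained Ruby example, not taken from the Chef code base, that shows why the second load blows up:

[code language="ruby"]
# Standalone demonstration, not Chef's actual Notification class.
# The first definition subclasses one anonymous Struct class...
class Notification < Struct.new(:resource, :action)
end

begin
  # ...and re-evaluating the "same" definition, as happens when the file is
  # loaded again from a second location, subclasses a *different* anonymous
  # Struct class, which Ruby rejects.
  class Notification < Struct.new(:resource, :action)
  end
rescue TypeError => e
  puts e.message # => superclass mismatch for class Notification
end
[/code]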

The exception we saw, superclass mismatch for class Notification, came from the resource notification class in Chef. It manifested when a recipe in a Chef Delivery build pipeline performed a require 'chef/provisioning/aws_driver' in order to use one of the AWS resources. It’s not clear why the superclass mismatch only occurred when the class was required via chef/provisioning/aws_driver, but that’s the only place we have seen it happen so far.

To sum up, the contributing factors:

  1. A change in RubyGems between versions 2.4.4 and 2.4.8 modified the way it selects which file to require
  2. Applications installed by appbundler could have files required twice: once from their source location and once from the gem install (see the sketch after this list)
  3. A nightly build of ChefDK that included the new RubyGems was deployed to the build nodes
  4. Doing a require 'chef/provisioning/aws_driver' in a recipe caused the exception superclass mismatch for class Notification
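The double require itself comes down to how Ruby’s require tracks loaded files: by resolved absolute path, not by library name. The sketch below is illustrative only; the paths and the chef version are placeholders:

[code language="ruby"]
# Illustrative only; the paths and the chef version are placeholders.
# require records each loaded file in $LOADED_FEATURES by absolute path, so the
# same logical library found at two different paths is evaluated twice.
$LOAD_PATH.unshift '/opt/chefdk/embedded/apps/chef/lib'
require 'chef/resource/notification'   # resolves to the appbundled source checkout

$LOAD_PATH.unshift '/opt/chefdk/embedded/lib/ruby/gems/2.1.0/gems/chef-12.4.1/lib'
require 'chef/resource/notification'   # resolves to the gem install and loads again
[/code]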

Stabilization

To reproduce the issue, we installed the affected nightly build of ChefDK 0.8.0 on Ubuntu 14.04 and wrote a simple recipe that looks like this:

[code]
require 'chef/provisioning/aws_driver'
[/code]

Then we ran Chef with that recipe:

[code]
chef-client -z recipe.rb
[/code]

On a system where this was reproduced, we verified that downgrading the RubyGems version to 2.4.4 resolved the issue.

[code]
sudo chef gem update --system 2.4.4
[/code]

To remediate this on the Delivery build nodes, we downgraded RubyGems in the ChefDK package from 2.4.8 to 2.4.4. We tested on one system in isolation to verify the fix, and then merged the pull request. Once the pull request was merged, our Jenkins-based build pipeline for the package triggered and started the build process. It takes a while to run, but when it finished we updated the ChefDK package on the build nodes and restarted one of the affected pipelines. Everything worked fine, and we rejoiced.
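For readers unfamiliar with Omnibus, a version pin like this lives in a software definition. The following is a loose sketch of what that kind of change can look like, not the actual definition from the ChefDK build project:

[code language="ruby"]
# Loose sketch of an Omnibus software definition, not the real ChefDK one.
name "rubygems"
default_version "2.4.4" # pinned back from 2.4.8 until appbundler is fixed

dependency "ruby"

build do
  # Update the embedded Ruby's RubyGems to the pinned version.
  gem "update --system 2.4.4"
end
[/code]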

Corrective Actions

Downgrading RubyGems isn’t the right long-term solution, however. We had an internal postmortem meeting to discuss what other corrective actions we needed to take to reduce the chance of this kind of escaped defect in the future. Failure is inevitable; we cannot prevent escaped defects entirely, but we can reduce their likelihood and their impact. We want to upgrade RubyGems to version 2.4.8 because it contains bug fixes we need, so one of the corrective actions from our meeting is to get that upgrade back in. To do that, we also needed to patch appbundler so it loads the applications from their gem-installed location instead of the source-tree location. Both changes, RubyGems 2.4.8 and appbundler 0.5.0, are in the nightly builds for ChefDK, so those fixes will be available when ChefDK 0.8.0 is released.

Another corrective action to come out of this incident is to run chef-client as a service on the Delivery build nodes so they always have the latest ChefDK nightly build installed. This reinforces that working in a continuous delivery model of building and releasing software is the right thing to do. It is a huge win for us and our customers because we can find this kind of issue before it is released, and avoid the escaped defects that force us to ship x.y.1 and x.y.2 releases in short order.
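As a sketch of what keeping the nodes converging could look like, the core cron resource is one simple option (the interval, binary path, and log location below are placeholders; running chef-client as a true daemon or via the chef-client cookbook are alternatives):

[code language="ruby"]
# Sketch only: the interval, binary path, and log location are placeholders.
directory '/var/log/chef'

cron 'chef-client' do
  minute '*/30'
  command '/opt/chefdk/bin/chef-client --logfile /var/log/chef/client.log'
end
[/code]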

Conclusion

It took some time to detect the problem. The ChefDK build with these bugs was published to the current channel on September 8 at 18:53 UTC, but it wasn’t installed on the build nodes until September 9 at 14:00 UTC. Had our chef-client runs been daemonized on the build nodes, we would have installed it, and caught the issue, sooner. Because we’ve moved the deployment of our public-facing web sites to Chef Delivery, we have a much shorter feedback loop when there are issues. And because these pipelines exercise a lot of different code paths in our products (ChefDK, Chef Provisioning, and Chef itself), we have a large coverage area for issues to surface. This is a win for our customers, because we find these issues before they ever reach a general public release. The result is that we deliver higher quality software and are a company that organizations enjoy working with.


Joshua Timberman

Joshua Timberman is a Code Cleric at CHEF, where he Cures Technical Debt Wounds for 1d8+5 lines of code, casts Protection from Yaks, and otherwise helps continuously improve internal technical process.