Supermarket Berkshelf Incident Post Mortem

We at Chef believe it is important to conduct public post mortems whenever possible. We recently conducted one for a Supermarket/Berkshelf incident that occurred on May 16, 2016. I was the incident commander and would like to share both the video and the write-up.

Video Recording

Write Up


On May 16 we experienced a brief SSL issue between Supermarket and Berkshelf.


This incident began at 21:56 UTC on Monday, May 16, 2016. It was declared closed at 22:49 UTC that same day.

**Time to detect**: 13 minutes (21:56 UTC - 22:09 UTC on Monday, May 16, 2016)
**Time to resolve**: 40 minutes (21:56 UTC - 22:36 UTC on Monday, May 16, 2016)
All times UTC
21:56  -   Nell Shamrell-Harrington upgraded 2 of the 4 Supermarket Prod nodes from Supermarket 2.5.2 to Supermarket 2.6.0.  She also upgraded the cookbook versions of oc-omnibus-supermarket and supermarket-omnibus-cookbook
22:09  -   Nell Shamrell-Harrington ran berks install to pull cookbooks from the public Supermarket and received this error:
           OpenSSL::SSL::SSLError: hostname "" does not match the server certificate
          She asked in the internal Chef Slack if someone else would run berks install to confirm what she was seeing
22:24  -  Lamont Grandquist confirmed that he was seeing the same error in Travis builds
22:32  -  Nell Shamrell-Harrington declared an incident
22:36  -  Nell Shamrell-Harrington moved the two upgraded Supermarket prod nodes out of the Supermarket prod ELB and confirmed that she no longer saw the error when running berks install
22:38  -  SaintAardvark in the #chef IRC channel reported SSL issues when running berks install. Noah Kantrowitz mentioned that kisoku (#chef IRC handle) was reporting the same thing
22:39  -  Noah Kantrowitz DM'd Nell Shamrell-Harrington to let her know that users in the Chef IRC channel were reporting issues with berks and Supermarket
22:43  -  Lamont Grandquist reported that Travis runs were working again
22:46  -  Nell Shamrell-Harrington entered #chef IRC
22:47  -  kisoku reported that his CI jobs were working again in #chef IRC
22:49  -  Nell Shamrell-Harrington declared the incident closed
22:50  -  SaintAardvark reported that his Jenkins jobs were working again in #chef IRC

Contributing Factor(s)

The 2.6.0 release of Supermarket included a commit which changed the AWS S3 URLs used to access cookbook artifacts in S3 storage. Prior to this change, Supermarket (through the Paperclip plugin) used a path-style S3 URL, in which the bucket name is part of the URL path. The one for public Supermarket looked like this:

The problem was that this URL style only worked if the S3 bucket was in N. Virginia. To fix this, we changed our config to use a hosted-style URL, in which the bucket name is part of the hostname, like this:

When this change was merged and deployed, this error appeared when someone attempted to run berks install using the public Supermarket as the cookbook source:

OpenSSL::SSL::SSLError: hostname "" does not match the server certificate

This was due to there being “.” characters in the bucket name “”. Although the previous path-style URL worked with dots in the bucket name, the hosted-style URL did not: the dots split the bucket name into several hostname labels, and a wildcard certificate only covers a single label.
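As a rough illustration of why dots in a bucket name can trigger a hostname mismatch, the sketch below builds both S3 URL styles and applies a simplified wildcard certificate check. It is not Supermarket's or Paperclip's code. The bucket names, the object key, the helper names, and the certificate names `s3.amazonaws.com` and `*.s3.amazonaws.com` are all assumptions chosen for the example.

```python
from urllib.parse import urlparse

# Assumed certificate names for the S3 endpoint (illustrative only).
ASSUMED_CERT_NAMES = ("s3.amazonaws.com", "*.s3.amazonaws.com")


def s3_url(bucket: str, key: str, style: str = "path") -> str:
    """Build an HTTPS URL for an object in the given addressing style."""
    if style == "path":
        # Bucket name lives in the URL path; hostname is fixed.
        return f"https://s3.amazonaws.com/{bucket}/{key}"
    if style == "virtual":
        # Bucket name becomes part of the hostname.
        return f"https://{bucket}.s3.amazonaws.com/{key}"
    raise ValueError(f"unknown style: {style!r}")


def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified TLS name check: a leading '*' covers exactly one label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] != "*" and p_labels[0] != h_labels[0]:
        return False
    return p_labels[1:] == h_labels[1:]


def hostname_covered(url: str, cert_names=ASSUMED_CERT_NAMES) -> bool:
    """Return True if the URL's hostname matches any certificate name."""
    host = urlparse(url).hostname
    return any(wildcard_matches(name, host) for name in cert_names)


# Hypothetical bucket names: one without dots, one with dots.
for bucket in ("example-bucket", "files.example.com"):
    for style in ("path", "virtual"):
        url = s3_url(bucket, "cookbook.tgz", style)
        print(style, bucket, hostname_covered(url))
```

Under these assumptions, only the combination of a dotted bucket name and a URL that places the bucket in the hostname fails the check, which is the kind of case a configurable URL style would let an operator avoid.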

Stabilization Steps

Fortunately, we had only upgraded 2 of the 4 prod nodes, so we removed the 2 upgraded nodes from the ELB and then downgraded them back to Supermarket 2.5.2.


For approximately 53 minutes, anyone using berks install saw the SSL error.

Corrective Actions

  • Make S3 URL style configurable in Supermarket
  • Make sure the staging bucket has a similarly formatted name to the production bucket
  • Ensure that berks install is part of smoke tests in both staging and production
  • Add documentation around considerations when naming an S3 bucket
  • Investigate adding a monitor that does a simple berks install and executes on a regular basis
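
The smoke test and monitor items above could start from something like the following sketch. It is only an illustration, not an existing Chef tool: the function name, the injectable `run` parameter, the five-minute timeout, and the assumption that the working directory contains a Berksfile pointing at the Supermarket under test are all hypothetical.

```python
import subprocess

# Substring of the error seen during the incident.
SSL_ERROR_MARKER = "OpenSSL::SSL::SSLError"


def check_berks_install(run=None, workdir="."):
    """Run `berks install` once and return (ok, detail).

    `run` is a hypothetical hook so the check can be exercised without
    Berkshelf installed; by default it shells out to `berks install`.
    """
    if run is None:
        def run():
            return subprocess.run(
                ["berks", "install"],
                cwd=workdir,          # assumed to contain a Berksfile
                capture_output=True,
                text=True,
                timeout=300,          # assumed limit; raises on timeout
            )

    result = run()
    output = (result.stdout or "") + (result.stderr or "")
    if SSL_ERROR_MARKER in output:
        return False, "SSL error talking to Supermarket"
    if result.returncode != 0:
        return False, f"berks install exited with {result.returncode}"
    return True, "ok"
```

A scheduler or monitoring agent could call `check_berks_install()` at a regular interval and alert when the first element of the result is `False`.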

Nell Shamrell-Harrington