We at Chef believe it is important to conduct public post mortems whenever possible. We recently conducted one around a Supermarket/Berkshelf incident that occurred on May 16, 2016. I was the incident commander and would like to share both the video and the write-up.
On May 16 we experienced a brief SSL issue between Supermarket and Berkshelf.
This incident began at 21:56 UTC on Monday, May 16, 2016. It was resolved at 22:49 UTC that same day.
**Time to detect**: 13 minutes (21:56 UTC - 22:09 UTC on Monday, May 16, 2016)

**Time to resolve**: 40 minutes (21:56 UTC - 22:36 UTC on Monday, May 16, 2016)

All times UTC

- 21:56 - Nell Shamrell-Harrington upgraded 2 of the 4 Supermarket prod nodes from Supermarket 2.5.2 to Supermarket 2.6.0. She also upgraded the cookbook versions of oc-omnibus-supermarket and supermarket-omnibus-cookbook.
- 22:09 - Nell Shamrell-Harrington ran berks install to pull cookbooks from the public Supermarket and received this error: OpenSSL::SSL::SSLError: hostname "community-files.opscode.com.s3.amazonaws.com" does not match the server certificate. She asked in the internal Chef Slack if someone else would run berks install to confirm what she was seeing.
- 22:24 - Lamont Grandquist confirmed that he was seeing the same error in Travis builds.
- 22:32 - Nell Shamrell-Harrington declared an incident.
- 22:36 - Nell Shamrell-Harrington moved the two upgraded Supermarket prod nodes out of the Supermarket prod ELB and confirmed that she no longer saw the error when running berks install.
- 22:38 - SaintAardvark reported SSL issues with running berks install in the #chef IRC channel. Noah Kantrowitz mentioned that kisoku (#chef IRC handle) was reporting the same thing.
- 22:39 - Noah Kantrowitz DM'd Nell Shamrell-Harrington to let her know that users in the #chef IRC channel were reporting issues with berks and Supermarket.
- 22:43 - Lamont Grandquist reported that Travis runs were working again.
- 22:46 - Nell Shamrell-Harrington entered #chef IRC.
- 22:47 - kisoku reported in #chef IRC that his CI jobs were working again.
- 22:49 - Nell Shamrell-Harrington declared the incident closed.
- 22:50 - SaintAardvark reported in #chef IRC that his Jenkins jobs were working again.
The 2.6.0 release of Supermarket included a commit that changed the AWS S3 URLs used to access cookbook artifacts in S3 storage. Prior to this change, Supermarket (through the Paperclip plugin) used a path-style S3 URL, in which the bucket name appears in the URL path. The one for public Supermarket looked like this:
The problem was that this URL style only worked if an S3 bucket was in N. Virginia. To fix this, we changed our config to use a virtual-hosted-style URL, in which the bucket name is part of the hostname, like this:
When this change was merged and deployed, this error appeared when someone ran berks install using public Supermarket as the cookbook source:
OpenSSL::SSL::SSLError: hostname "community-files.opscode.com.s3.amazonaws.com" does not match the server certificate
This was due to the dots in the bucket name, “community-files.opscode.com”. The previous URL style worked with dots in the bucket name, but with the new style the bucket name becomes part of the hostname, and the S3 wildcard certificate does not match a hostname with extra dot-separated labels.
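The certificate mismatch can be illustrated by comparing the hostnames that the two S3 URL styles produce. The following is a minimal sketch, not Supermarket's or Paperclip's actual code: the function names, the `style` parameter, the example object key, and the assumed certificate names (`*.s3.amazonaws.com` and `s3.amazonaws.com`) are all illustrative assumptions, and the wildcard check is a simplified version of what a TLS client does.

```python
from urllib.parse import urlsplit

# Assumption: names presented by the S3 certificate for the global endpoint.
CERT_NAMES = ["*.s3.amazonaws.com", "s3.amazonaws.com"]


def s3_url(bucket: str, key: str, style: str = "path") -> str:
    """Build an S3 object URL in the requested addressing style."""
    if style == "path":
        # The bucket name goes in the URL path; the hostname is fixed.
        return f"https://s3.amazonaws.com/{bucket}/{key}"
    if style == "virtual-hosted":
        # The bucket name becomes part of the hostname.
        return f"https://{bucket}.s3.amazonaws.com/{key}"
    raise ValueError(f"unknown S3 URL style: {style}")


def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified name check: a '*' label matches exactly one DNS label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    return all(p == "*" or p == h for p, h in zip(p_labels, h_labels))


def certificate_covers(url: str) -> bool:
    """Return True if any assumed certificate name matches the URL's host."""
    host = urlsplit(url).hostname
    return any(wildcard_matches(name, host) for name in CERT_NAMES)


if __name__ == "__main__":
    # "example/key" is a hypothetical object key used only for illustration.
    for style in ("path", "virtual-hosted"):
        url = s3_url("community-files.opscode.com", "example/key", style)
        print(style, url, certificate_covers(url))
```

Under these assumptions, the path-style URL is covered by the certificate, while the virtual-hosted-style URL for a bucket name containing dots is not, because the wildcard can stand in for only one label. A bucket name without dots would be covered in either style.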
Fortunately, we had only upgraded 2 of the 4 prod nodes, so we removed the 2 upgraded nodes from the ELB and then downgraded them back to Supermarket 2.5.2.
For approximately 53 minutes, anyone using berks install saw the SSL error.
- Make S3 URL style configurable in Supermarket
- Make sure the staging bucket name is formatted similarly to the production bucket name
- Ensure that berks install is part of smoke tests in both staging and production
- Add documentation around considerations when naming an S3 bucket
- Investigate adding a monitor that does a simple berks install and executes on a regular basis
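The smoke test and monitor items above could be approached with a small script that runs berks install in a scratch directory and reports whether it succeeded. This is a rough sketch under stated assumptions, not an existing Chef tool: the function name, the choice of the `apt` cookbook, the Supermarket source URL in the Berksfile, and the timeout are all illustrative, and pointing `BERKSHELF_PATH` at a temporary directory is intended to keep a local cookbook cache from hiding a download failure.

```python
import os
import subprocess
import tempfile
from pathlib import Path

# Assumption: a minimal Berksfile that pulls one cookbook from public Supermarket.
BERKSFILE = '''source "https://supermarket.chef.io"

cookbook "apt"
'''


def berks_install_ok(run=subprocess.run, timeout=300):
    """Run `berks install` in a scratch directory and return (ok, output).

    `run` is injectable so the check can be exercised without Berkshelf installed.
    """
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "Berksfile").write_text(BERKSFILE)
        # Use a fresh Berkshelf cache so cookbooks are actually downloaded.
        env = dict(os.environ, BERKSHELF_PATH=str(Path(workdir, ".berkshelf")))
        try:
            result = run(
                ["berks", "install"],
                cwd=workdir,
                env=env,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # berks is missing, could not start, or hung past the timeout.
            return False, str(exc)
    output = result.stdout + result.stderr
    # Treat a non-zero exit or an SSL error in the output as a failure.
    ok = result.returncode == 0 and "SSLError" not in output
    return ok, output
```

A scheduler or monitoring agent could call `berks_install_ok()` at a regular interval and alert when the first element of the result is false. The same function could run as a post-deploy smoke test in staging and production.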