The Chef 11 Server provides significant improvements in terms of compute efficiency, scalability, and operability. We achieved these improvements by rewriting the API server in Erlang and switching the data store from CouchDB to PostgreSQL. This post provides a behind-the-scenes look at some of the engineering decisions and development approaches involved in creating Erchef.
The rewrite was motivated by challenges faced while scaling Opscode Hosted Chef for a growing customer base and Private Chef for some very large customers. For the business, reducing the cost and complexity of operating Hosted Chef or a large Private Chef installation was essential, and hence worth significant investment.
Our Ruby-based Hosted Chef API deployment uses Ruby Enterprise Edition and unicorn. As our service grew, we encountered two main challenges in scaling out the API service. First, the throughput of the service is gated by the number of unicorn workers we are able to run. Second, each unicorn worker carries the memory overhead of Ruby, the server code, and the space needed to process requests. Before we began the migration work, our front-end layer consisted of eight servers each running 12 unicorn workers. These unicorns consumed a total of 19.2GB of RAM (204MB per worker under load). The unicorn RAM use is exacerbated by the behavior of MRI's garbage collector, which makes it difficult for the interpreter to return RAM to the OS. As a result, processing a single large request will bloat a unicorn worker until it is restarted.
On the backend, CouchDB turned out not to be a good fit for our data size and load. Hosted Chef is currently running on a combination of CouchDB and MySQL. The engineering team is actively working on migration tooling to move the relational data to PostgreSQL (more on this choice below) and the remaining CouchDB data to PostgreSQL. The metrics for these backend systems summarize the situation rather well. For CouchDB we see 90th percentile latencies of around 125ms with regular peaks of over 1000ms. The 90th percentile latencies for MySQL are just below 50ms with peaks around 150ms.
Erlang is a fantastic environment for building robust, high-throughput web services. We chose Erlang because it fit the problem space, Opscode's engineering team had experience with it, and because we were pleased with the performance and operational characteristics of an internal service that had already been implemented in Erlang. We now have additional data on how our Erlang/SQL-based Chef Server behaves at scale and the results confirm that our reasoning for making the switch was sound.
The features of Erlang which make it a good fit for high-volume web services are its share-nothing memory and process model, its multi-core scalability, and the soft real-time performance that results from Erlang's per-process garbage collection model.
In Erchef, each HTTP request is handled by a separate lightweight Erlang process. With few exceptions, the code that handles a given HTTP request is entirely serial. This made the code quick to write and easy to read and reason about. The Erlang VM handles the concurrency for us so that we don’t have to configure a fixed worker set ahead of time. We only need a single Erlang VM on a given host so the memory overhead of the VM and our code is only paid once. For simple tasks like processing HTTP requests, it often feels like one is getting concurrency for free.
The code that handles a request does not have to do any explicit locking because Erlang processes do not share memory. While we tackled our fair share of bugs during the development process, we have yet to encounter a deadlock or race condition bug. Deadlocks and race conditions are possible under Erlang, but you have to work a bit harder to create one than in many other languages.
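The process-per-request pattern described above can be sketched in a few lines of Erlang. This is an illustrative toy, not Erchef's actual request path, and the module and function names here are invented: each "request" gets its own isolated process, and the reply comes back via message passing rather than shared memory.

```erlang
%% Hypothetical sketch of process-per-request handling. Erchef itself
%% builds on a full HTTP stack; this only illustrates the pattern of
%% spawning one share-nothing process per request.
-module(req_sketch).
-export([handle/2, demo/0]).

%% Spawn an isolated process for each request; all state inside the
%% fun is private to that process, so no locks are needed.
handle(Caller, Request) ->
    spawn(fun() ->
        Response = {ok, process_request(Request)},
        %% Reply via message passing, the only way processes interact.
        Caller ! {self(), Response}
    end).

%% Stand-in for the real, mostly serial request-handling code.
process_request({get, Path}) ->
    <<"body for ", Path/binary>>.

demo() ->
    Pid = handle(self(), {get, <<"/nodes">>}),
    receive
        {Pid, {ok, Body}} -> Body
    after 1000 ->
        timeout
    end.
```

Calling `req_sketch:demo()` spawns a worker, waits for its reply, and returns the response body. Because the worker's heap is private and collected independently, a single oversized request can at most bloat one short-lived process, not the whole VM.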
Erlang’s garbage collection has substantially improved the server’s memory consumption and dramatically improved overall throughput. Erlang’s per-process garbage collection avoids global stop-the-world collection pauses commonly seen in environments featuring a single shared heap and garbage collector. Based on our experience working with the JVM, one of the things that surprised us is that we have not, thus far, had to tune Erlang’s garbage collector. We’ve seen reliable performance without VM tuning and with little attention paid to code optimization.
While the primary motivation for choosing Erlang was based around performance (and we have not been disappointed there), we’ve also benefited from the debuggability of the Erlang platform. For example, see this post describing how we used dynamic tracing to quickly identify the cause of a bug. We expect these features of the Erlang platform to continue to pay dividends as we improve Erchef.
As Opscode's Hosted Chef customer base grew, it became apparent that CouchDB was not a good fit as the backing store for the service. Our write-heavy load rapidly created garbage in Couch's append-only log, which meant that we had to run compactions on a tight schedule. Compactions introduced additional contention for the (at the time) single-threaded access to the underlying Couch databases. A file handle resource leak required us to restart CouchDB on a regular basis to reclaim RAM. With thousands of databases, we weren't able to take advantage of CouchDB's replication. We were not happy with either the latency metrics we were seeing or the reliability. It was time to find a different solution.
When we evaluated our options, we realized that a traditional SQL-based solution provided a number of benefits given the overall size of our data and the types of queries we need to run. The Chef API needs more than a key/value store since it must maintain lists of objects of a given type. Secondary indices are a must, and they are trivial in an RDBMS.
Our production system in Hosted Chef started off on MySQL because we had deeper experience with MySQL on the team and already had MySQL running in our production environment. As we productized Private Chef, we realized we could not ship MySQL with Chef due to licensing issues. Using MySQL meant there would need to be an additional manual step to link Chef with the database. Initially, we worked around this by making PostgreSQL the default and making it possible for customers to wire in MySQL if desired. After some months of this approach, we observed that all of our customers had chosen to stick with the default PostgreSQL and that we were spending a non-trivial amount of engineering effort maintaining code, queries, and tests for two database systems. Moreover, as development progressed, we encountered a number of situations where the features in PostgreSQL made solving our problems easier.

We decided to focus our efforts on PostgreSQL and drop support for MySQL. This was not an easy decision because we knew there would be members of our community who would prefer to run MySQL as the backing store. Supporting a single database engine allows us to put more time into making the Chef Server better because we don't have to put engineering time into cross-database compatibility and testing. In addition, we hope that the switch to the Omnibus-based packaging for the server makes some of the underlying technology choices less of an issue, since installation is now a one-shot, single-package, batteries-included affair.
Mitigating the Risk of The Big Rewrite
We reduced the risk inherent in any large rewrite by breaking the project down into smaller pieces that could be delivered independently. Chef's REST API made the decomposition straightforward. We started with search and then began implementing the endpoints that handle each Chef object type. We used a feature-flipping mechanism to allow the systems to pull data out of the appropriate data store (CouchDB or PostgreSQL).
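As a rough sketch of the idea (the module, flag name, and data-access stubs below are invented for illustration, not Erchef's real feature-flipping code), a per-endpoint flag consulted at read time lets a single object type be cut over to the new data store independently of the others:

```erlang
%% Hypothetical sketch of feature-flipping between data stores.
%% A proplist of per-endpoint flags is consulted on every read, so an
%% endpoint can be switched to SQL without touching the other endpoints.
-module(flip_sketch).
-export([fetch_node/2]).

fetch_node(Flags, Name) ->
    case proplists:get_value(nodes_sql, Flags, false) of
        true  -> fetch_from_sql(Name);
        false -> fetch_from_couchdb(Name)
    end.

%% Stand-ins for the real data-access layers.
fetch_from_sql(Name)     -> {sql, Name}.
fetch_from_couchdb(Name) -> {couchdb, Name}.
```

With a scheme like this, a rollback is just flipping the flag back; the old read path stays in place until the new one has proven itself in production.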
As part of the porting process, we invested in a custom integration test framework built on top of RSpec, which we dubbed chef-pedant, along with a corresponding set of tests. Before embarking on the Erlang implementation for a given endpoint, we first ensured we had test coverage of the existing Ruby-based API. This gave us confidence in the backwards compatibility of the new server and allowed us to make measured decisions for the few places where we decided to deviate from the Ruby server's behavior. The current test suite comprises 2712 "examples".
Where Did We End Up?
We are very proud of what the Chef 11 Server has already achieved in terms of scale. Check out the work Cycle Computing did building a 10K node cluster on a single Chef Server instance, or read about how Facebook is using Chef.
One very pleasant surprise is that we’ve already received patches for Erchef from the Chef community. We weren’t sure if the move to Erlang would stifle community involvement with the server project and we recognize that Erlang is less approachable, in general, than Ruby. Thus far, the lesson seems to be: do not underestimate the Chef community. So keep those patches coming and we’ll keep working to make Chef Server faster, more reliable, and easier to install and use.