Using Chef to automate systems configuration at massive scale
Chef provided an automation solution flexible enough to bend to our scale dynamics without requiring us to change our workflow.
For many years, Facebook managed its systems with cfengine2. With many individual clusters over 10k nodes in size, a slew of constantly changing system configurations, and small teams, the system was showing its age, and its steadily increasing complexity limited its effectiveness and usability. It was difficult to integrate with internal systems, testing was often impractical, and it provided no isolation of configurations, among many other problems.
“When your environment is big, doing things one-off doesn’t scale,” said Phil Dibowitz, Production Engineer, Facebook. “At some point you have to deal with reality. You can postpone automation for a long time and make your life really, really difficult. But at some point your life goes from difficult to impossible.”
Facebook’s infrastructure team uses Chef to manage thousands of servers, configurations, and administrative access policies in its dynamic compute environment.
After an extensive evaluation of the tools and paradigms in modern systems configuration management (open source, proprietary, and a potential home-grown solution), Facebook built a system based on Chef. The evaluation process involved understanding the direction the company wanted to take in managing future iterations of its systems, clusters, and teams. Facebook also evaluated the various paradigms behind effective configuration management and the different kinds of scale they provide.
Using Chef, Phil and his team at Facebook built an extremely flexible system that lets a tiny team manage an incredibly large number of systems with a variety of unique configuration needs.
“There are three dimensions of scale we generally look at for infrastructure – the number of servers, the volume of different configurations across those systems, and the number of people required to maintain those configurations,” said Phil Dibowitz, Production Engineer, Facebook. “Chef provided an automation solution flexible enough to bend to our scale dynamics without requiring us to change our workflow. Chef provided top-flight support, earlier access to upcoming changes, and additional rich features on top of the functionality in open-source Chef. Further, Chef’s basis on open-source Chef also aligns with our own open philosophy allowing us to contribute back to the greater Chef community.”
“Facebook’s infrastructure is both truly unique and a model for the future of enterprise computing. Their use of Chef to automate and manage this large-scale infrastructure illustrates the power of Chef in solving some of the most critical and complicated IT challenges on the planet,” said Adam Jacob, Chef’s Chief Customer Officer. “But, more importantly, Facebook’s use of Chef demonstrates Facebook’s continuing commitment to open source and sharing their own best practices to help anyone trying to manage just a few servers, or ten thousand.”