We are building and operating larger, more complex systems, and those systems degrade and fail in new and unexpected ways. We must learn to observe, respond to, and learn from these failures. The Chaos Engineering track at ChefConf is the place to share stories of improving mean-time-to-detect and mean-time-to-resolve, improving post-mortems and learning reviews, practicing game days, and, in short, learning how to be more responsive to systemic failures. I’m personally inviting you to submit your stories for our Chicago event through the ChefConf call for presenters (CFP). This blog post is designed to give you some thoughts and inspiration for building a proposal for the Chaos Engineering track.
The Simian Army is Coming
Perhaps you have heard of Netflix’s Simian Army or Chaos Monkey. Maybe you are practicing Game Days in your own environment. These tools and practices are all about introducing chaos into the system and understanding how the system responds. It’s important to remember that the “system” in this case includes not only the infrastructure and applications we are running but also the people responsible for running them. You will get a chance to practice your response to failure. Using these tools and practices can help you practice that response in a slightly more controlled way.
- What have you done to intentionally disrupt your production environments?
- What role does automation play in building resiliency and self-healing into your applications and infrastructure?
- What systems, training, and practices have you introduced to improve your team’s incident response?
Better Learning Through Failure
All systems fail. It is not a question of “if” but “when.” When systems do fail, we have a chance to learn a lot and identify areas for improvement. These failures offer a great opportunity for collaboration and can help provide better context about the customers and users of our systems. Failures are tremendous learning opportunities. Do not waste them!
- What tools and practices have you put in place to improve your time-to-detect service degradations and failures?
- What tools and practices have had the largest impact on your ability to recover faster?
- How are you ensuring incidents and outages are handled in a humane way?
- In hindsight, what do you wish you knew before your project went sideways that would have allowed you to avoid the failure in the first place?
- What improvements have you made to your post-mortems and learning reviews?
Formal Chaos Engineering
Chaos Engineering, as outlined on http://principlesofchaos.org/, is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. This community is bringing a formal approach to experimentation and espousing the need for automation. How are you putting some of these things to work today?
- How are you monitoring the state and changes within your systems?
- What telemetry and observability are you building into your applications?
- How are you automating chaos in your environments?
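If you are looking for a starting point, the experimental approach described above can be sketched in a few lines of Python. This is only a toy illustration, not a real tool: the `Cluster` class, its `serve` and `kill` methods, and the "every request succeeds" steady-state check are all hypothetical stand-ins for your own services, fault injection, and monitoring.

```python
import random


class Cluster:
    """Hypothetical in-memory stand-in for a set of service replicas."""

    def __init__(self, replicas):
        self.healthy = {f"node-{i}": True for i in range(replicas)}

    def serve(self):
        # Assumption: a request succeeds while at least one replica is healthy.
        return any(self.healthy.values())

    def kill(self, name):
        # Simulated fault injection: mark one replica as down.
        self.healthy[name] = False


def steady_state(cluster, requests=100):
    """Steady-state hypothesis for this sketch: every request succeeds."""
    return all(cluster.serve() for _ in range(requests))


def run_experiment(cluster, rng):
    """Check steady state, inject one failure, then check steady state again."""
    if not steady_state(cluster):
        return "aborted: system unhealthy before the experiment"
    victim = rng.choice(sorted(cluster.healthy))
    cluster.kill(victim)
    if steady_state(cluster):
        return f"passed: survived loss of {victim}"
    return f"failed: losing {victim} broke the steady state"


if __name__ == "__main__":
    print(run_experiment(Cluster(replicas=3), random.Random(42)))
```

The shape is what matters here: state a hypothesis about normal behavior, refuse to start if the system is already unhealthy, introduce one controlled failure, and compare the result against the hypothesis. A talk about how you did this against real infrastructure would be far more interesting than the toy.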
Capture Chaos in a Framework
Incident response can be improved and time-to-recover reduced when we use formal approaches or frameworks. The incident command system (ICS), the observe, orient, decide, act (OODA) loop, and the Cynefin framework are just some of the frameworks that can help teams understand and respond to chaos and failures. Perhaps you have experience with one or more of these frameworks and can provide some insight and inspiration to your peers.
- What formal or informal training have you put in place to help deal with chaotic systems?
- How have frameworks impacted your team’s ability to respond to issues?
- Could you give us an introduction to the framework you are using?
Schadenfreude is pleasure derived from another person’s misfortune. Let’s face it, in our industry we love to talk about the epic failures and disasters we have experienced. The “other person” is sometimes, nay often, our own past self. Laughing, or crying, about our own past failures can be cathartic and can really help others learn.
- Walk us through your most epic outage.
- What unlikely series of events converged to cause the most epic outage you have seen?
- What ridiculous policies were put in place because of an outage?
Pay it Forward
Your story and experiences are worth sharing with the community. Help others learn and further your own knowledge through sharing. The ChefConf CFP is open now. Use some of the questions posed here to help form a talk proposal for the Chaos Engineering track. We are encouraging submissions across all of our tracks:
- Infrastructure Automation
- Compliance Automation
- Application Automation
- People, Process, and Team
- Delivering Delight
- Chaos Engineering
- Don’t Label Me!
Submit your talk proposal now! The deadline is Wednesday, January 10, 2018 at 11:59 PM Pacific time.