
Chaos Engineering lets you compare what you think will happen in your systems with what actually happens. You literally “break things on purpose” to learn how to build more resilient systems.

By proactively testing how a system responds under stress, you can identify and correct failures before they end up in the news.

Keep reading if you want to learn more about it.

 

What Is Meant By Chaos Engineering?

Chaos Engineering is preventative medicine for software: the discipline of intentionally injecting faults into systems to measure their resilience.

Like any application of the scientific method, Chaos Engineering starts from a hypothesis, runs an experiment, and compares the results with a control. The quintessential example in a distributed system is taking down random services to see how the remaining components respond and how the user journey degrades.
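As a rough illustration of that experiment, the sketch below removes one random instance from a simulated system and compares the result with an untouched control. The service names, replica counts, and the rule that a journey needs at least one replica of every service are assumptions made for the example, not a description of any real tool.

```python
import random

# Hypothetical topology (an assumption): service name -> running replicas.
TOPOLOGY = {"frontend": 3, "cart": 2, "payments": 1}


def journey_succeeds(replicas):
    """Assumed rule: the user journey works if every service has a replica left."""
    return all(count > 0 for count in replicas.values())


def remove_random_instance(replicas, rng):
    """Return the affected service and a copy of the topology minus one instance."""
    instances = [name for name, count in replicas.items() for _ in range(count)]
    victim = rng.choice(instances)
    degraded = dict(replicas)
    degraded[victim] -= 1
    return victim, degraded


def run_experiment(replicas, trials=1000, seed=0):
    """Compare the control (no fault) with repeated single-instance removals."""
    rng = random.Random(seed)
    control_ok = journey_succeeds(replicas)
    failures = {}  # service name -> number of trials where the journey broke
    for _ in range(trials):
        victim, degraded = remove_random_instance(replicas, rng)
        if not journey_succeeds(degraded):
            failures[victim] = failures.get(victim, 0) + 1
    return control_ok, failures


if __name__ == "__main__":
    control_ok, failures = run_experiment(TOPOLOGY)
    print("control healthy:", control_ok)
    print("services whose loss broke the journey:", failures)
```

In this toy model only the service with a single replica can break the journey, which is the kind of weak point a real experiment is meant to surface.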

If you take a cross-section of what an application needs to run (compute, storage, networking, and application infrastructure), injecting a fault or turbulent conditions anywhere in that stack is a valid Chaos Engineering experiment.

Network saturation and storage that suddenly becomes volatile are well-known failure modes in the tech industry, but Chaos Engineering lets you test them in a much more controlled way.
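One way to picture "controlled" fault injection is a wrapper that adds a fixed delay and a configurable error rate to a single call. This is a minimal sketch, not any particular tool's API: `inject_faults`, `InjectedFault`, and `read_block` are hypothetical names introduced for the example.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Stands in for a real network or storage error (hypothetical name)."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Decorator that delays each call and fails a fraction of them.

    latency_s  -- seconds of delay added before every call
    error_rate -- probability in [0, 1] that a call raises InjectedFault
    rng, sleep -- injectable for repeatable runs and fast tests
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a saturated network
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical storage call with 50 ms extra latency and 20% injected errors.
@inject_faults(latency_s=0.05, error_rate=0.2)
def read_block(block_id):
    return f"contents of block {block_id}"
```

Because the delay and error rate are explicit parameters, the same experiment can be repeated at different intensities and switched off by removing the decorator.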

Because such a wide swath of infrastructure can be affected, Chaos Engineering practitioners can be just about anyone who supports the infrastructure and application stack.

 

Benefits of Chaos Engineering

The main benefits of Chaos Engineering are:

a) For customers: higher availability and durability of service mean fewer outages interrupting their daily lives.

b) For businesses: it can help prevent very large losses in revenue and maintenance costs, improve on-call training for engineering teams, and strengthen the SEV (incident) management program across the entire enterprise.

c) For engineers: insights from chaos experiments can mean fewer incidents, a lighter on-call burden, a better understanding of system failure modes, improved system design, faster mean time to detection for SEVs, and fewer repeat SEVs.

 

What Companies Use Chaos Engineering?

Many larger technology companies practice Chaos Engineering to better understand their distributed systems and microservices architectures.

The list includes Netflix, LinkedIn, Facebook, Google, Microsoft, Amazon, and many others.

More traditional industries such as banking and finance have also realized the importance of Chaos Engineering.

For example, in 2014, the National Australia Bank migrated from physical infrastructure to Amazon Web Services and used Chaos Engineering to dramatically reduce the incident count.

 

How to Start Chaos Engineering

Have you ever asked yourself: how do I start Chaos Engineering?

Then you’re in the right place…

 

  1. Plan an Experiment – One of the most powerful questions in Chaos Engineering is “What could go wrong?”
  2. Create a Hypothesis – You have an idea of what can go wrong and have chosen the exact failure to inject. What do you expect to happen next? Answering this is an excellent exercise to do as a team.
  3. Measure the Impact – To understand how your system behaves under stress, you need to measure its availability and durability. It’s good to have a key performance metric that correlates with customer success (such as orders per minute or stream starts per second).
  4. Have a Backup Plan – Know how you will stop the experiment and undo the fault if something goes wrong. If you are running commands by hand, be careful not to break SSH or control-plane access to your instances.
  5. Fix It – After running your first experiment, you should see one of two outcomes: either you have verified that your system is resilient to the failure you introduced, or you have found a problem that you need to fix.
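
The five steps above can be sketched as a small harness. This is only an illustration under stated assumptions: `ChaosExperiment` and `run_chaos_experiment` are hypothetical names, the hypothesis is expressed as a maximum tolerated drop in one key metric, and the inject, rollback, and measure callables are placeholders for whatever your environment provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    name: str                     # step 1: the planned failure
    max_drop: float               # step 2: hypothesis, max tolerated KPI drop (0..1)
    measure: Callable[[], float]  # step 3: reads the KPI, e.g. orders per minute
    inject: Callable[[], None]    # starts the fault
    rollback: Callable[[], None]  # step 4: backup plan that undoes the fault


def run_chaos_experiment(exp):
    """Run one experiment and report whether the hypothesis held."""
    baseline = exp.measure()      # control reading before any fault
    try:
        exp.inject()
        observed = exp.measure()  # reading while the fault is active
    finally:
        exp.rollback()            # always undo the fault, even on errors
    drop = (baseline - observed) / baseline if baseline else 0.0
    return {
        "name": exp.name,
        "baseline": baseline,
        "observed": observed,
        "drop": drop,
        # step 5: False means there is a problem to fix
        "hypothesis_holds": drop <= exp.max_drop,
    }
```

Putting the rollback in a `finally` block is a deliberate choice: the fault is removed even if measuring or injecting raises an error partway through.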