What is Chaos Testing?

Application chaos testing aims to counteract Murphy’s Law: anything that can go wrong will go wrong, and at the worst possible time.

Chaos testing and engineering is a proactive test methodology that identifies system errors that are prone to misuse before they can cause damage or raise security concerns for an application. This style of testing was developed and made popular by digital streaming service providers. Streaming providers rely heavily on their online applications running smoothly for their customers, with high uptime and high availability. Any service disruption can significantly damage their reputation and their bottom line.

Sharing this same necessity, more industries and companies have adopted chaos testing and are implementing chaos engineering plans. It is quickly becoming standard procedure for online applications.

Chaos Testing vs Chaos Engineering

Two methods of chaos have developed as the approach has grown in popularity.

Chaos testing is a reactive method that tests a launched application. This technique verifies that applications and systems are working as expected in real time and that everything is running smoothly and online. It is a stress test for an active application, set up by security teams to monitor a system’s response in a real-life environment. The chaos bombardment can be measured in real time to determine how the application handles load and stress.

Chaos engineering is planned prior to launch by DevOps or DevSecOps and is used during the development phase. The focus is on engineering scans and tests that deliberately crash a system or bring it offline to model misuse. The goal is to find and deliberately exploit potential points of failure.

Chaos Testing Principles and Philosophy

The Chaos Philosophy has two parts to it.  First, make sure there isn’t a single point of failure within a system – one vulnerability that can take down an entire application and lead to unnecessary downtime. 

The second part of Chaos Philosophy is to never be 100% confident that your application is free of single points of failure. Always assume that one error can take down your entire application. This creates a diligent security cycle and a continual, consistent monitoring and testing loop.
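One way to make the "single point of failure" idea concrete is to model an application as a graph of components and check what happens when each one is removed. The following is a minimal Python sketch under stated assumptions: the component names (client, load_balancer, app_1, app_2, database) and the helper functions are hypothetical illustrations, not part of any particular tool, and a real system would need a far richer model than simple reachability.

```python
from collections import deque


def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, skipping the removed node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Return every component whose loss cuts start off from goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start}  # the client itself is not a component
    return sorted(n for n in candidates
                  if not reachable(graph, start, goal, n))


# Hypothetical topology: one load balancer, two app servers, one database.
topology = {
    "client": ["load_balancer"],
    "load_balancer": ["app_1", "app_2"],
    "app_1": ["database"],
    "app_2": ["database"],
}

spofs = single_points_of_failure(topology, "client", "database")
print(spofs)  # ['database', 'load_balancer']
```

In this toy topology the two app servers back each other up, but the load balancer and the database do not. A check like this only covers the failures you thought to model, which is why the second part of the philosophy still applies.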

Based on the Chaos Testing Philosophy, these 5 principles make up the core of chaos engineering and testing.

The 5 Chaos Testing Principles:

  • Determine a system’s regular behavior. Define your system’s “steady state” by measuring overall output, error rates, system latency, and any other indicators of normal operating behavior. Any operating behavior outside of these parameters can be used as a warning of an abnormal system state.
  • Hypothesize that your steady state is sturdy. When you experiment on your steady state, assume that no matter what you do to disrupt your system, nothing will happen. This is the working hypothesis of chaos engineering: the actions injected into the system will not change its state.
  • Run designed experiments to create failures. After analyzing your system infrastructure to determine possible failure scenarios, design failure experiments to test these scenarios. Running these chaos experiments in a controlled environment allows you to run a backout program: a recovery record of what failed, so that it can be reversed.
  • Analyze and document the results. Running an experiment is only the first half; proper documentation and analysis are the second half. Check whether the experiment changed how the system operated relative to its steady state. Identify whether there was an impact on the continuity of service or experience, or whether the service remained unfazed.
  • Continually monitor and repeat different tests.  Cybersecurity is rarely a “one-and-done” scan or test.  Chaos testing and engineering is no different.  Continual system monitoring with a mix of chaos testing and chaos engineering is the strongest way to maintain a reliable system process.
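
The first four principles above can be sketched as a small experiment loop. This is a minimal Python illustration under stated assumptions, not a production chaos tool: ToyService, SteadyState, the threshold values, and the inject/rollback helpers are all hypothetical names invented for this example, and the "service" is a simulated pair of replicas rather than a real system.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Thresholds describing normal behavior (hypothetical values)."""
    max_error_rate: float
    max_median_latency_ms: float


class ToyService:
    """Hypothetical stand-in for a real system: two replicas tried in order."""

    def __init__(self, rng):
        self.rng = rng
        self.replica_up = [True, True]

    def handle_request(self):
        """Return (ok, latency_ms); fall through to the next replica on failure."""
        latency = 0.0
        for up in self.replica_up:
            latency += self.rng.uniform(10, 30)
            if up:
                return True, latency
        return False, latency


def measure(service, requests=500):
    """Principle 1: collect error rate and median latency."""
    results = [service.handle_request() for _ in range(requests)]
    error_rate = sum(1 for ok, _ in results if not ok) / requests
    median_latency = statistics.median(lat for _, lat in results)
    return error_rate, median_latency


def within_steady_state(metrics, steady):
    """Principle 2: the hypothesis is that metrics stay inside the thresholds."""
    error_rate, median_latency = metrics
    return (error_rate <= steady.max_error_rate
            and median_latency <= steady.max_median_latency_ms)


def run_experiment(service, steady, inject, rollback):
    """Principles 3 and 4: inject a failure, then record what happened."""
    baseline = measure(service)
    inject(service)
    try:
        during = measure(service)
    finally:
        rollback(service)  # backout step: always undo the injected failure
    return {
        "baseline": baseline,
        "during": during,
        "hypothesis_held": within_steady_state(during, steady),
    }


def kill_first_replica(service):
    service.replica_up[0] = False


def restore_first_replica(service):
    service.replica_up[0] = True


service = ToyService(random.Random(42))
steady = SteadyState(max_error_rate=0.01, max_median_latency_ms=70.0)
report = run_experiment(service, steady,
                        kill_first_replica, restore_first_replica)
print(report["hypothesis_held"])  # True: the second replica absorbs the failure
```

The fifth principle corresponds to running experiments like this repeatedly with different injected failures. In this toy model, taking both replicas down at once would make every request fail, and the report would show that the hypothesis did not hold.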

Who Uses Chaos Engineering and Testing?

Streaming services, or any companies that rely almost exclusively on their application functioning 100% of the time, are prime candidates for chaos engineering and chaos testing. Netflix originally “invented” this cybersecurity testing method to ensure their subscribers wouldn’t lose access to their platform. The method has performed so well that other large companies, including Amazon, Google, Facebook, and Microsoft, have implemented and evolved it to proactively prevent disruptions and outages.

Chaos engineering can also help meet compliance and regulatory standards when used during the DevOps phase along with continual post-deployment chaos testing. Even though nothing in the cybersecurity world is a guarantee, adding a chaos security loop is a great way to fortify application security.

Additional Resources:

BeSTORM

Plan Your DevOps Controlled Attack

Fortra VM

Monitor and Document Potential Vulnerabilities

Pen Testing Services

Test the Depth of Application Weaknesses