How Chaos Engineering can improve your Cybersecurity

Chaos Engineering intentionally introduces failures into a system to test its resilience and identify weaknesses. The aim is to build confidence in the system’s ability to withstand unexpected events and to improve its reliability. You do this by running “chaos experiments” that simulate failures or disruptions and by observing how the system responds. By proactively testing for failures, organizations can identify and fix vulnerabilities before they cause significant issues in production.

In this post, I show you how Chaos Engineering can contribute to the vigilance of your ICT systems. I do this with a six-step guide. If you follow this guide, you will have a solid foundation for a Chaos Engineering practice in your organization.

Step 1: Identify the most critical components and prioritize for testing

To identify the most vital components of a system and prioritize them for testing, first determine the purpose of your IT systems and the essential functions they perform. This tells you which components the systems cannot do without to fulfill their intended purpose.

Next, focus on the components that are most likely to fail or cause issues. These components may be heavily used, have complex functionality, or be critical to the system’s operation.

After you have identified the most critical components, evaluate the impact of a failure in each one. Some components, if they fail, may bring down the entire system or significantly degrade its performance; give these components a higher testing priority.

Take into account any external dependencies that each component has. Components that interact with other systems or rely on other components may be more critical, and you should prioritize them for testing. You should also consult experts, including developers and technical leads, to get their input on which components are most critical. Use all this information to create a testing priority list for the components of the system. This list guides the testing process and ensures that you thoroughly test the most critical components.

Step 2: Develop hypotheses about the causes of failures and test these hypotheses

When you have completed the identification process, you can start developing hypotheses about the causes of failures in the system and creating experiments to test them.

Identify a specific component of the system on which you want to focus. Determine the expected behavior of the component and how it should respond to different inputs and conditions. Then develop a list of potential failure modes for the component. Make sure you consider both external factors (such as network outages or power disruptions) and internal factors (such as code bugs or resource constraints). For each potential failure mode, hypothesize what might cause the failure and how it would manifest in your system. Next, design an experiment to test each hypothesis, including which inputs or conditions you will manipulate and how you will measure the results. Run the experiments and analyze the results to determine whether your hypotheses were correct and to identify any issues or weaknesses in the system. Finally, based on those results, develop and implement measures to prevent or mitigate the failures you identified.
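The hypothesis-and-experiment loop above can be written down as a small structure. The sketch below is a toy, not a real chaos tool: `Experiment`, `run`, and the `ToyLookup` service are hypothetical names I introduce for illustration, and the “fault” is simply a flag that switches off a pretend database.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str
    inject_fault: Callable[[], None]        # the condition you manipulate
    remove_fault: Callable[[], None]        # always undo the fault afterwards
    steady_state_holds: Callable[[], bool]  # how you measure the result


def run(experiment: Experiment) -> str:
    """Check the expected behavior, inject the fault, check again, clean up."""
    if not experiment.steady_state_holds():
        return "aborted: system unhealthy before the experiment"
    experiment.inject_fault()
    try:
        confirmed = experiment.steady_state_holds()
    finally:
        experiment.remove_fault()
    return "hypothesis confirmed" if confirmed else "hypothesis refuted"


class ToyLookup:
    """Pretend service: reads from a database, falls back to a small cache."""

    def __init__(self) -> None:
        self.database_up = True
        self.database = {"user-1": "Alice", "user-2": "Bob"}
        self.cache = {"user-1": "Alice"}

    def get(self, key: str):
        source = self.database if self.database_up else self.cache
        return source.get(key)


service = ToyLookup()
experiment = Experiment(
    hypothesis="If the database is down, cached users are still served",
    inject_fault=lambda: setattr(service, "database_up", False),
    remove_fault=lambda: setattr(service, "database_up", True),
    steady_state_holds=lambda: service.get("user-1") == "Alice",
)
result = run(experiment)
print(f"{experiment.hypothesis}: {result}")
```

The `try`/`finally` is the important design choice here: the fault is removed even if the measurement itself raises an error, so a broken experiment does not leave the system in a degraded state.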

Step 3: Set up monitoring & alerting systems to detect and report on failures in real-time

To set up a robust monitoring and alerting system, first identify the key metrics and events you want to monitor and be alerted on. These could include system performance, availability, and error rates.

After identifying the key metrics, choose a suitable monitoring and alerting tool that meets your needs; many open-source and commercial options are available. You also need tooling to inject the failures that your monitoring is supposed to detect. Some popular chaos engineering tools and practices include:

  • Chaos Monkey. Chaos Monkey is an open-source tool initially developed by Netflix. It randomly terminates instances in a production environment to test the system’s ability to recover.
  • Gremlin. Gremlin is a commercial (paid) tool that provides various ways to inject failures into a system, such as network partitions and resource exhaustion.
  • Chaos Kong. Chaos Kong is a Netflix exercise that simulates the failure of an entire cloud region to verify that traffic can fail over to another region.
  • AWS Fault Injection Service. Amazon Web Services offers this managed service for running fault injection (chaos) experiments on workloads in the AWS cloud.
  • Azure Chaos Studio. Azure Chaos Studio is a managed service from Microsoft. You can use it to inject failures into Azure-based systems.

Of course, you can’t start experimenting with these tools right away. First, you must configure your monitoring and alerting to track the metrics and events you identified. This may require you to set up custom monitors, integrate with existing monitoring systems, or configure alerts to be sent to the appropriate individuals or teams. Then test the monitoring and alerting system to ensure it is working as expected, for example by manually triggering alerts or simulating failures to see how the system responds.
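To make “simulating failures to see how the system responds” concrete, here is a minimal sketch of an error-rate alert that you feed with simulated requests. The class name `ErrorRateMonitor`, the window of 10 requests, and the 20% threshold are assumptions for illustration; in practice your monitoring tool provides this logic, and the alert would go to a pager or chat channel rather than a list.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the last `window_size` request outcomes and records alerts."""

    def __init__(self, window_size: int, threshold: float) -> None:
        self.outcomes = deque(maxlen=window_size)  # True = success, False = error
        self.threshold = threshold
        self.alerts: list[str] = []

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.error_rate() > self.threshold:
            self.alerts.append(
                f"error rate {self.error_rate():.0%} exceeds {self.threshold:.0%}"
            )


# Simulated traffic: ten healthy requests, then three failures in a row.
monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for _ in range(10):
    monitor.record(success=True)
for _ in range(3):
    monitor.record(success=False)

print(monitor.alerts)
```

In this run the first two failures stay at or below the threshold, and only the third one pushes the error rate in the window above 20% and triggers an alert. A test like this tells you whether your thresholds fire when you expect them to.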

Make sure to establish procedures for responding to alerts and failures. This may mean creating runbooks or other documents that outline the steps to take in the event of a failure, and establishing a chain of command that defines who is responsible for responding to which alerts.

The chaos tools listed above can test your systems’ reliability and fault tolerance. They can also reveal potential issues that you might not discover with traditional testing. Once everything is set up, regularly review and update the monitoring and alerting system to ensure that it still meets your needs and provides timely and accurate alerts.

Step 4: Create a process for responding to failures

The next step is to create a strategy for responding to failures, including identifying the root cause, implementing fixes, and rolling out updates.

First, identify the root cause. This is the first and most crucial step of the process: you need to determine what caused the failure to know how to fix it. This may involve gathering log files, analyzing system performance data, and talking to the team members involved in the experiment.

Once you have identified the root cause of the failure, you can start working on a fix, which may involve updating configuration files, deploying new code, or modifying the system architecture. It is essential to test the fix thoroughly to ensure it resolves the issue.

After testing the fix, you can roll out an update to the production system. It is wise to first deploy the update to a small group of users and gradually expand it to the entire user base. Monitoring the system closely after rolling out the update is crucial to ensure it is working as intended. Finally, it is essential to document the entire process, including the root cause, the implemented fix, and any lessons learned. Documentation helps your team learn from the experience and improve the system’s resilience in the future. It also contributes to standardization, which reduces your dependence on individuals. I know it might not be the sexiest of jobs, but it is crucial to do.
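One common way to deploy “to a small group of users and gradually expand” is to assign each user a stable bucket from 0 to 99 and compare it with the current rollout percentage. The sketch below assumes string user IDs and uses a SHA-256 hash for the bucketing; the function names are hypothetical, and real feature-flag or deployment tools offer the same idea out of the box.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket between 0 and 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def receives_update(user_id: str, rollout_percent: int) -> bool:
    """A user gets the update once the rollout percentage passes their bucket."""
    return rollout_bucket(user_id) < rollout_percent


# Illustrative user base; the IDs are made up.
users = [f"user-{n}" for n in range(1000)]

for percent in (5, 25, 100):
    exposed = sum(receives_update(user, percent) for user in users)
    print(f"rollout at {percent:>3}%: {exposed} of {len(users)} users get the update")
```

Because the bucket depends only on the user ID, a user who received the update at 5% keeps it at 25% and 100%. That makes it easier to link a problem you see in monitoring to a specific rollout stage.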

Step 5: Regularly conduct chaos experiments

Make chaos experiments a recurring part of your routine, so that you identify and address potential failures before they occur in production.

To implement a structured system of periodic chaos experiments, you need to define your system’s desired behavior and the conditions under which it should operate normally. This definition serves as the baseline for your experiments. After this, identify potential failure points in your system, such as network connections, servers, or dependencies on external services.

Then you can design experiments that test the system’s resilience by inducing failures or disruptions at these points. Be sure to start small and gradually increase the scale and complexity of your experiments. Make sure you run the experiments in a controlled environment, such as a staging or test environment, rather than in production, and observe the system’s behavior during the experiments. Compare this to the desired behavior you defined at the start. Look for any deviations or failures, and take note of them.
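Comparing observed behavior with the desired behavior is easier when the baseline is written down as allowed ranges per metric. The following is a minimal sketch under that assumption; the metric names, the limits, and the observed values are invented for illustration and would come from your own monitoring in practice.

```python
# Desired behavior (baseline): allowed (minimum, maximum) per metric.
# All metric names and limits are illustrative assumptions.
steady_state = {
    "p95_latency_ms": (0.0, 300.0),
    "error_rate": (0.0, 0.01),
    "orders_per_minute": (50.0, float("inf")),
}


def find_deviations(steady_state: dict, observed: dict) -> list[str]:
    """List every metric that is missing or outside its allowed range."""
    deviations = []
    for metric, (low, high) in steady_state.items():
        value = observed.get(metric)
        if value is None:
            deviations.append(f"{metric}: no measurement")
        elif not low <= value <= high:
            deviations.append(f"{metric}: {value} outside [{low}, {high}]")
    return deviations


# Measurements taken in the staging environment while a fault was active.
observed = {"p95_latency_ms": 410.0, "error_rate": 0.004}

for deviation in find_deviations(steady_state, observed):
    print(deviation)
```

Note that a missing measurement is reported as a deviation too. If a metric disappears during an experiment, that is usually a finding in itself.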

Then, analyze the results of the experiments and use them to identify and address potential failures or weaknesses in the system. Regularly repeat all these steps to improve the resilience of your system continually.

Step 6: Use the results of your experiments to improve your system’s operation and design

If you want to improve your systems, you need to analyze the results of your experiments. You do this by reviewing the data and observations you collected during your experiments. Based on this data, you look for patterns and trends and try to identify any areas where your system performed poorly or experienced failure.
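Looking for patterns and trends can start very simply: group your experiment results and compute a failure rate per group. The sketch below assumes that you log each experiment as a record with a component, an injected fault, and a pass/fail outcome; the records shown are invented examples, not real results.

```python
from collections import defaultdict

# Hypothetical experiment log; replace with your own recorded results.
results = [
    {"component": "payment-gateway", "fault": "network latency", "passed": False},
    {"component": "payment-gateway", "fault": "instance shutdown", "passed": True},
    {"component": "login-service", "fault": "network latency", "passed": False},
    {"component": "login-service", "fault": "instance shutdown", "passed": True},
    {"component": "report-export", "fault": "network latency", "passed": True},
]


def failure_rates(results: list[dict], group_by: str) -> dict[str, float]:
    """Share of failed experiments per value of the `group_by` field."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in results:
        key = record[group_by]
        totals[key] += 1
        if not record["passed"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


for field in ("component", "fault"):
    print(f"Failure rate by {field}:")
    for key, rate in failure_rates(results, field).items():
        print(f"  {key}: {rate:.0%}")
```

In this made-up log, grouping by fault is more revealing than grouping by component: every failure involves network latency, which points to a shared weakness rather than one bad component.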

Once you have a good understanding of what went wrong during your experiments, try to identify the root causes of the failures. Was it a problem with the system’s design, or was it due to an external factor like network congestion or a third-party service outage? When you have answered these questions, you can develop an improvement plan based on your analysis of the experiment results. Your plan might include changing the system’s design, implementing new tools or processes, or training staff to respond better to failures. When the plan is ready, implement it, and test your changes to ensure that they work as intended. After that, set up a process for ongoing monitoring and assessment of the system’s performance, which is key to ensuring that it remains resilient and able to handle failures.

Final Thoughts

It is important to note that Chaos Engineering should be conducted in a controlled and safe environment with appropriate safeguards to prevent unintended consequences. Make sure to do all your experiments in a fully isolated environment, and never do this in production.

When you consistently follow the six steps in this post, I can guarantee your systems will be more robust, and your defense against cyber threats will improve. There is one keyword you need to breathe: routine. It is fun to set up “Dexter’s lab,” but routinely using it and sticking to the fixed principles you have set up might be the biggest challenge. Please don’t make Chaos Engineering a hobby project but turn it into a professional pillar of your cybersecurity strategy, with dedicated resources and a fixed reporting structure that shows your stakeholders the importance of Chaos Engineering. Otherwise, the available budget will dry up, and you will have to abandon your lab: a missed opportunity!

Feel free to contact me if you have questions or additional advice or tips about this subject. You can also contact me if you need support or other recommendations for setting up your Chaos Engineering lab. If you want to stay in the loop when I upload a new post, make sure to subscribe so you receive a notification by e-mail.

Gijs Groenland

I live in San Diego, USA together with my wife, son, and daughter. I work as Chief Financial and Information Officer (CFIO) at a mid-sized company.