Getting Started with the Chaos Toolkit

When you practice chaos engineering, you discover how your system reacts to conditions that you inject. By doing this in a controlled fashion, you learn how to change the system accordingly.

This tutorial gives you a quick tour of the basic elements of an experiment.

Tip

You will find many more in-depth labs on GitHub.

Get Ready

Before you can run the experiment against your system, you will need to have it set up.

Get the Code

You can fetch the code as follows:

git clone https://github.com/chaostoolkit/chaostoolkit-documentation-code

This particular tutorial is under tutorials/a-simple-walkthrough.

cd chaostoolkit-documentation-code/tutorials/a-simple-walkthrough

Third-party binaries

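The experiment will use the following binaries, which are the ones invoked by the experiment declared later in this tutorial. Make sure you have them in your PATH:

  • cp
  • openssl
  • pkill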

Install the Application dependencies

Our application is made of a simple set of two microservices that converse with each other over HTTPS.

Tip

It’s recommended that you use a Python virtual environment to keep your project dependencies isolated from any other projects.

Since the toolkit itself requires Python 3, we assume you already have it set up. Please install the application dependencies:

pip install -U -r requirements.txt

Install the Chaos Toolkit

You must install the Chaos Toolkit and its dependencies for the purpose of this tutorial. While we suggest you go through the install section, it boils down to:

pip install -U chaostoolkit

Your First Experiment

Explore a Potential Weakness

So, looking at our application, what would be of interest to discover through an experiment? Well, we are using HTTPS between the two services in our application. What if the certificate expired?

Tip

In such a simple setup, we appreciate that things will likely break and, in general, there is no point in running an experiment when you already know the outcome. However, let’s humor ourselves for the sake of learning the basics behind the Chaos Toolkit.

We can intuit that things will fall apart if the sunset service cannot talk to the astre service. However, if the chain were part of a much more complex graph of services, it could be difficult to fully understand which branches would be affected, and how badly. So, it may still be useful to experiment by injecting an expired certificate.

Define the Steady State Hypothesis

You can only learn if you know where you start from and what a good baseline for your application is.

Here we assume two things:

  • the services are running
  • we can call the sunset service to retrieve the sunset time for a given city

During our experiment, we will vary the conditions of the system and expect the state to remain valid.

Changing the Conditions

In this tutorial, we are going to set an expired certificate and restart the services. We will then call our application and see how it responds.

Declare an Experiment to Observe the Weakness

At this stage, we can create an experiment that tells how the system behaves when a certificate expires.

```json
{
    "title": "What is the impact of an expired certificate on our application chain?",
    "description": "If a certificate expires, we should gracefully deal with the issue.",
    "tags": ["tls"],
    "steady-state-hypothesis": {
        "title": "Application responds",
        "probes": [
            {
                "type": "probe",
                "name": "the-astre-service-must-be-running",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "os.path",
                    "func": "exists",
                    "arguments": {
                        "path": "astre.pid"
                    }
                }
            },
            {
                "type": "probe",
                "name": "the-sunset-service-must-be-running",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "os.path",
                    "func": "exists",
                    "arguments": {
                        "path": "sunset.pid"
                    }
                }
            },
            {
                "type": "probe",
                "name": "we-can-request-sunset",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "timeout": 3,
                    "verify_tls": false,
                    "url": "https://localhost:8443/city/Paris"
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "swap-to-expired-cert",
            "provider": {
                "type": "process",
                "path": "cp",
                "arguments": "expired-cert.pem cert.pem"
            }
        },
        {
            "type": "probe",
            "name": "read-tls-cert-expiry-date",
            "provider": {
                "type": "process",
                "path": "openssl",
                "arguments": "x509 -enddate -noout -in cert.pem"
            }
        },
        {
            "type": "action",
            "name": "restart-astre-service-to-pick-up-certificate",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": "--echo -HUP -F astre.pid"
            }
        },
        {
            "type": "action",
            "name": "restart-sunset-service-to-pick-up-certificate",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": "--echo -HUP -F sunset.pid"
            },
            "pauses": {
                "after": 1
            }
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "swap-to-valid-cert",
            "provider": {
                "type": "process",
                "path": "cp",
                "arguments": "valid-cert.pem cert.pem"
            }
        },
        {
            "ref": "restart-astre-service-to-pick-up-certificate"
        },
        {
            "ref": "restart-sunset-service-to-pick-up-certificate"
        }
    ]
}
```

```yaml
title: What is the impact of an expired certificate on our application chain?
description: If a certificate expires, we should gracefully deal with the issue.
tags:
- tls
steady-state-hypothesis:
  title: Application responds
  probes:
  - type: probe
    name: the-astre-service-must-be-running
    tolerance: true
    provider:
      type: python
      module: os.path
      func: exists
      arguments:
        path: astre.pid
  - type: probe
    name: the-sunset-service-must-be-running
    tolerance: true
    provider:
      type: python
      module: os.path
      func: exists
      arguments:
        path: sunset.pid
  - type: probe
    name: we-can-request-sunset
    tolerance: 200
    provider:
      type: http
      timeout: 3
      verify_tls: false
      url: https://localhost:8443/city/Paris
method:
- type: action
  name: swap-to-expired-cert
  provider:
    type: process
    path: cp
    arguments: expired-cert.pem cert.pem
- type: probe
  name: read-tls-cert-expiry-date
  provider:
    type: process
    path: openssl
    arguments: x509 -enddate -noout -in cert.pem
- type: action
  name: restart-astre-service-to-pick-up-certificate
  provider:
    type: process
    path: pkill
    arguments: "--echo -HUP -F astre.pid"
- type: action
  name: restart-sunset-service-to-pick-up-certificate
  provider:
    type: process
    path: pkill
    arguments: "--echo -HUP -F sunset.pid"
  pauses:
    after: 1
rollbacks:
- type: action
  name: swap-to-valid-cert
  provider:
    type: process
    path: cp
    arguments: valid-cert.pem cert.pem
- ref: restart-astre-service-to-pick-up-certificate
- ref: restart-sunset-service-to-pick-up-certificate
```

The Various Sections of an Experiment

Let’s now go through the experiment blocks.

The steady state hypothesis declares the various probes that will be applied as part of the hypothesis check.

The hypothesis is played twice. The first time is before we do anything else, to ensure the system is indeed in a normal state: here we check that the services are running by looking up their PID files, and we call the sunset service, which should respond OK. The second time is after the conditions were changed in the system, to validate that it is still in a normal state.

The method is the block which changes the conditions of our system/application. Here, we swap the valid certificate for an expired one and restart the services by sending them a SIGHUP signal.

Finally, the rollbacks section (which is optional) tries to remediate the changes we made, in this case by swapping back to the valid certificate and restarting the services.
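The ordering of these blocks can be sketched in a few lines of Python. This is only an illustration of the flow described above, not the Chaos Toolkit's own implementation: `run_experiment` and its parameters are hypothetical names, and the status strings are simplified.

```python
from typing import Callable, List


def run_experiment(
    hypothesis: Callable[[], bool],
    method: List[Callable[[], None]],
    rollbacks: List[Callable[[], None]],
) -> str:
    """Illustrative ordering only: hypothesis, method, hypothesis, rollbacks."""
    status = "failed"
    try:
        # First pass: make sure the system starts from a normal state.
        if not hypothesis():
            return status
        # Change the conditions of the system, one activity at a time.
        for activity in method:
            activity()
        # Second pass: is the system still in a normal state?
        if hypothesis():
            status = "completed"
        return status
    finally:
        # Rollbacks are applied even when the hypothesis was not met.
        for rollback in rollbacks:
            rollback()


if __name__ == "__main__":
    # A toy system whose only state is which certificate is installed.
    state = {"cert": "valid"}
    outcome = run_experiment(
        hypothesis=lambda: state["cert"] == "valid",
        method=[lambda: state.update(cert="expired")],
        rollbacks=[lambda: state.update(cert="valid")],
    )
    print(outcome, state["cert"])
```

In the toy run, the second hypothesis check fails because the method swapped in the expired certificate, and the rollback restores the valid one.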

Different Kinds of Activities

It is interesting to notice that the hypothesis uses probes while rollbacks are made of actions only. The method may use both. The reason is that the hypothesis is only about querying the system while rollbacks act on it. Finally, it is often useful to query the system while we change the conditions, for future analysis.

Probes and actions are activities that work in the same way; only their goal differs.

You can create activities that make HTTP calls, execute processes or perform more complex operations through extensions (often implemented in Python). This tutorial showcases a bit of all of those.
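As an illustration of an activity implemented in Python, the sketch below defines a function that could play the role `os.path.exists` plays in the experiment, while also checking that the recorded process is still alive. The function name `pid_file_is_alive` is hypothetical, the code targets Unix-like systems, and it assumes the `python` provider passes the declared `arguments` to the function as keyword arguments.

```python
import os


def pid_file_is_alive(path: str) -> bool:
    """Return True when `path` holds the PID of a process that still exists."""
    try:
        with open(path) as pid_file:
            pid = int(pid_file.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable file, or content that is not a number.
        return False
    if pid <= 0:
        # Zero and negative values address process groups, not one process.
        return False
    try:
        # Signal 0 only performs error checking; nothing is delivered.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

Such a function would be declared like the existing probes, with its own module and function names in place of `os.path` and `exists`.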

Tolerances in the Hypothesis

Hypothesis probes expect a tolerance property which tells the Chaos Toolkit how to validate a certain aspect of the state. In our example, tolerances are rather simple: we check that a file exists and that an HTTP request returns a status code of 200. Richer tolerances can be created by using regex or jsonpath.
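To make the idea concrete, here is a minimal sketch of how a probe's output might be compared with a tolerance. It is not the toolkit's implementation: `within_tolerance` is a hypothetical helper, and both the dictionary shape used for the regex case and the `status` key of the HTTP result are assumptions made for this example.

```python
import re
from typing import Any


def within_tolerance(tolerance: Any, value: Any) -> bool:
    """Illustrative check of a probe's output against a tolerance."""
    # bool is tested first because bool is a subclass of int in Python.
    if isinstance(tolerance, bool):
        return value is tolerance
    if isinstance(tolerance, int):
        # For an HTTP probe, compare against the response status code.
        if isinstance(value, dict):
            return value.get("status") == tolerance
        return value == tolerance
    if isinstance(tolerance, dict) and tolerance.get("type") == "regex":
        # Assumed shape for this sketch: {"type": "regex", "pattern": "..."}
        return re.search(tolerance["pattern"], str(value)) is not None
    raise ValueError(f"unsupported tolerance: {tolerance!r}")


if __name__ == "__main__":
    print(within_tolerance(True, True))
    print(within_tolerance(200, {"status": 503}))
    print(within_tolerance({"type": "regex", "pattern": r"^notAfter="},
                           "notAfter=May 14 12:00:00 2018 GMT"))
```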

Run the Experiment

Start the Experiment without the Application

To run the experiment, simply execute the following command:

chaos run experiment.json
[2018-05-14 18:38:04 INFO] Validating the experiment's syntax
[2018-05-14 18:38:04 INFO] Experiment looks valid
[2018-05-14 18:38:04 INFO] Running experiment: What is the impact of an expired certificate on our application chain?
[2018-05-14 18:38:04 INFO] Steady state hypothesis: Application responds
[2018-05-14 18:38:04 INFO] Probe: the-astre-service-must-be-running
[2018-05-14 18:38:04 CRITICAL] Steady state probe 'the-astre-service-must-be-running' is not in the given tolerance so failing this experiment
[2018-05-14 18:38:04 INFO] Let's rollback...
[2018-05-14 18:38:04 INFO] Rollback: swap-to-valid-cert
[2018-05-14 18:38:04 INFO] Action: swap-to-valid-cert
[2018-05-14 18:38:04 INFO] Rollback: None
[2018-05-14 18:38:04 INFO] Action: restart-astre-service-to-pick-up-certificate
[2018-05-14 18:38:04 INFO] Rollback: None
[2018-05-14 18:38:04 INFO] Action: restart-sunset-service-to-pick-up-certificate
[2018-05-14 18:38:04 INFO] Pausing after activity for 1s...
[2018-05-14 18:38:05 INFO] Experiment ended with status: failed

Because we ran this command before we even started our application, the steady-state hypothesis failed and the experiment bailed out immediately.

Note that the rollbacks run anyway. They are bypassed only when you send a SIGINT or SIGTERM signal to the chaos process, because the toolkit assumes you may want to review your system.

Start the Application

You may now run the application.

First, copy the valid certificate as follows:

cp valid-cert.pem cert.pem

Next, start the services, in one terminal:

python3 astre.py

[14/May/2018:16:11:09] ENGINE Listening for SIGTERM.
[14/May/2018:16:11:09] ENGINE Listening for SIGHUP.
[14/May/2018:16:11:09] ENGINE Listening for SIGUSR1.
[14/May/2018:16:11:09] ENGINE Bus STARTING
[14/May/2018:16:11:09] ENGINE Serving on https://127.0.0.1:8444
[14/May/2018:16:11:09] ENGINE Bus STARTED

Then, in another terminal:

python3 sunset.py
[14/May/2018:16:13:58] ENGINE Listening for SIGTERM.
[14/May/2018:16:13:58] ENGINE Listening for SIGHUP.
[14/May/2018:16:13:58] ENGINE Listening for SIGUSR1.
[14/May/2018:16:13:58] ENGINE Bus STARTING
[14/May/2018:16:13:58] ENGINE Serving on https://127.0.0.1:8443
[14/May/2018:16:13:58] ENGINE Bus STARTED

Now you may perform a simple call:

curl -k https://localhost:8443/city/Paris
The sunset will occur at 2018-05-14T21:23:09+02:00 in Paris

What happens is that the sunset service calls the astre service for the data and simply renders it to you as plain text. Both services are chained together over HTTPS.

Run the Experiment

Now that your application is running, execute the experiment once again:

chaos run experiment.json
[2018-05-14 18:41:09 INFO] Validating the experiment's syntax
[2018-05-14 18:41:09 INFO] Experiment looks valid
[2018-05-14 18:41:09 INFO] Running experiment: What is the impact of an expired certificate on our application chain?
[2018-05-14 18:41:09 INFO] Steady state hypothesis: Application responds
[2018-05-14 18:41:09 INFO] Probe: the-astre-service-must-be-running
[2018-05-14 18:41:09 INFO] Probe: the-sunset-service-must-be-running
[2018-05-14 18:41:09 INFO] Probe: we-can-request-sunset
[2018-05-14 18:41:09 INFO] Steady state hypothesis is met!
[2018-05-14 18:41:09 INFO] Action: swap-to-expired-cert
[2018-05-14 18:41:09 INFO] Probe: read-tls-cert-expiry-date
[2018-05-14 18:41:09 INFO] Action: restart-astre-service-to-pick-up-certificate
[2018-05-14 18:41:09 INFO] Action: restart-sunset-service-to-pick-up-certificate
[2018-05-14 18:41:09 INFO] Pausing after activity for 1s...
[2018-05-14 18:41:10 INFO] Steady state hypothesis: Application responds
[2018-05-14 18:41:10 INFO] Probe: the-astre-service-must-be-running
[2018-05-14 18:41:10 INFO] Probe: the-sunset-service-must-be-running
[2018-05-14 18:41:10 INFO] Probe: we-can-request-sunset
[2018-05-14 18:41:10 CRITICAL] Steady state probe 'we-can-request-sunset' is not in the given tolerance so failing this experiment
[2018-05-14 18:41:10 INFO] Let's rollback...
[2018-05-14 18:41:10 INFO] Rollback: swap-to-valid-cert
[2018-05-14 18:41:10 INFO] Action: swap-to-valid-cert
[2018-05-14 18:41:10 INFO] Rollback: None
[2018-05-14 18:41:10 INFO] Action: restart-astre-service-to-pick-up-certificate
[2018-05-14 18:41:10 INFO] Rollback: None
[2018-05-14 18:41:10 INFO] Action: restart-sunset-service-to-pick-up-certificate
[2018-05-14 18:41:10 INFO] Pausing after activity for 1s...
[2018-05-14 18:41:11 INFO] Experiment ended with status: failed

Each activity is run in the order it appears in the experiment. Notice how the hypothesis is no longer met after we swapped the certificates. We also learn something interesting, even if expected: using an expired certificate does not even prevent our services from starting.

Report on your Findings

Review the Journal of the Run

You may now review the journal generated by the run:

cat journal.json

It contains the activities that were run and the output of each of them.
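A short script can make the journal easier to skim than raw JSON. The sketch below assumes the journal has a top-level `status` plus `run` and `rollbacks` lists whose entries carry an `activity` (with `name` and `type`) and a `status`. These field names are assumptions for this example, so check them against your own `journal.json`; the helper names are hypothetical too.

```python
import json
from typing import Any, Dict, List


def summarize_journal(journal: Dict[str, Any]) -> List[str]:
    """Build one summary line per activity recorded in the journal."""
    lines = [f"experiment status: {journal.get('status', 'unknown')}"]
    # Assumed layout: "run" and "rollbacks" are lists of executed activities.
    for section in ("run", "rollbacks"):
        for entry in journal.get(section, []):
            activity = entry.get("activity", {})
            lines.append(
                f"{section}: {activity.get('type', '?')} "
                f"{activity.get('name', '?')} -> {entry.get('status', '?')}"
            )
    return lines


def load_journal(path: str = "journal.json") -> Dict[str, Any]:
    """Read a journal file from disk."""
    with open(path, encoding="utf-8") as journal_file:
        return json.load(journal_file)
```

Calling `print("\n".join(summarize_journal(load_journal())))` from the directory holding the journal would print one line per activity. Missing fields show up as `?` rather than raising an error.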

Generate a Report

You can generate a PDF (or HTML, markdown…) report from the journal if you install the chaostoolkit-reporting plugin first:

chaos report --export-format=pdf journal.json report.pdf

Learnings and Responses

In this experiment, we proved what we guessed initially, that an expired certificate will create trouble and break the application for our users. What could be the responses?

  • Use a circuit-breaker to provide a more meaningful, and controlled, answer to the caller
  • Prevent the service from starting when the certificate it uses is expired
  • Put some monitoring in place on our certificates and trigger an alert when they get close to their end date
  • Move to Let’s Encrypt and renew our certs automatically

For each of these potential responses, you could create an experiment should they unearth potential new questions.
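The monitoring response, for instance, could start from the same `openssl x509 -enddate -noout` output that the experiment already reads. The sketch below parses such a line and decides whether an alert is due. The function names and the 30-day threshold are assumptions, and the date layout is assumed to be the usual `notAfter=May 14 12:00:00 2018 GMT` form with English month abbreviations.

```python
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse a line such as 'notAfter=May 14 12:00:00 2018 GMT' as UTC."""
    key, _, value = line.strip().partition("=")
    if key != "notAfter" or not value:
        raise ValueError(f"unexpected openssl output: {line!r}")
    if value.endswith(" GMT"):
        value = value[: -len(" GMT")]
    # %b relies on English month abbreviations (the default C locale).
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y")
    return parsed.replace(tzinfo=timezone.utc)


def days_until_expiry(line: str, now: datetime) -> float:
    """Days left before the end date; negative once the certificate expired."""
    return (parse_not_after(line) - now).total_seconds() / 86400


def should_alert(line: str, now: datetime, threshold_days: float = 30) -> bool:
    """Alert when the end date is within `threshold_days` (an assumed value)."""
    return days_until_expiry(line, now) <= threshold_days
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the functions easy to exercise with fixed dates.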

Next?

An experiment is never the end game. The flow should be continuous and you should create and run experiments regularly.