Running Chaos Toolkit experiments as AWS Batch Jobs¶
It is common when using AWS for hosting your infrastructure that you’ll have strict security policies in place. These policies will usually only allow for internal traffic within AWS, amongst various other things. A question we’re asked a lot is: can I run Chaos Toolkit from AWS, to run against AWS? The answer is simply: yes, you can.
Why Batch?¶
You may have followed our Running Chaos Toolkit from an EC2 instance guide and wonder why we would write a guide for Batch. Batch has some benefits over EC2:
- Your infrastructure isn’t running all the time
- You can use Docker images to encapsulate your experiment environment
- You can submit multiple jobs to run different experiments rather than blocking on one experiment in an EC2 instance
Why not ECS and Fargate?¶
We sometimes get asked how to run Chaos Toolkit on ECS with Fargate. Whilst we understand why you might want to do this, Chaos Toolkit experiments aren’t analogous to something like a microservice: we don’t run Chaos Toolkit continuously and request it to run jobs, rather we invoke Chaos Toolkit when we want to use it.
Because of this difference in thinking, we recommend you use Batch (with Fargate as the compute provider) to invoke Chaos Toolkit experiments.
The Steps¶
For the purposes of this guide, we’ll run you through setting up your Chaos Toolkit experiments manually. If however, you’re familiar with the AWS Cloud Development Kit (CDK), we have an example repository deploying the same infrastructure using CDK here.
There are a few prerequisites for following this guide:
- You’ll need access to the AWS Console (We’re assuming you’re comfortable here)
- You’ll need AWS CLI installed and configured
- You’ll need to be able to create EC2 instances (Or have someone do this for you)
- You’ll need to be able to create IAM Roles and Policies (Or have someone do this for you)
- You’ll need to be able to create Batch Compute Environments, Jobs, and Queues (Or have someone do this for you)
- You’ll need to be able to create ECR Repositories and push to them (Or have someone do this for you)
1. Create your system (an EC2 instance)¶
Similar to our Running Chaos Toolkit from an EC2 instance guide, we’ll be using an EC2 instance as our ‘system’ to run our experiment against. We’ll set up our steady-state hypothesis to ensure that the EC2 instance is in a running state.
- Navigate to the EC2 console and select `Launch Instance`
- For this guide, we’ll select the `Amazon Linux 2 AMI` at the top of the list
- For this guide, we’ll select a `t2.micro` (But you can choose a larger one)
- Go onto `Configure Instance Details`
- Select the VPC to deploy into via the `Network` dropdown
- Select the Subnet to deploy into via the `Subnet` dropdown
- Go onto `Add Storage` - For now, the defaults will be fine
- Go onto `Add Tags` - We recommend at minimum, adding a tag `{"OWNER": "your-name"}`
- Go onto `Configure Security Group` - Click the `X` to the right of the SSH rule, you won’t need this
- Go onto `Review and Launch`
- Select `Launch`
- Select `Proceed without a key pair`, check the tickbox, and click `Launch Instances`
Alternatively, create the EC2 instance for our ‘system’ with the AWS CLI, replacing `YOUR_NAME` with your name:

aws ec2 run-instances \
    --image-id ami-0d26eb3972b7f8c96 \
    --instance-type t2.micro \
    --count 1 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=OWNER,Value=YOUR_NAME}]' \
    --no-cli-pager
You can leave this instance up for the duration of this guide.
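If you created the instance via the CLI, you’ll need its instance ID for the next step. One way to look it up (assuming the `OWNER` tag value you used above) is:

aws ec2 describe-instances \
    --filters "Name=tag:OWNER,Values=YOUR_NAME" "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].InstanceId" \
    --output text \
    --no-cli-pager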
2. Create your experiment¶
In an empty directory, create a folder named `experiments`:
mkdir experiments
Create a file named `experiment-1.json` inside `experiments/` with the following contents:
{
"title": "Running Chaos Toolkit from AWS Batch",
"description": "N/A",
"tags": [],
"steady-state-hypothesis": {
"title": "EC2 is RUNNING",
"probes": [
{
"type": "probe",
"name": "instance_state",
"provider": {
"type": "python",
"module": "chaosaws.ec2.probes",
"func": "instance_state",
"arguments": {
"state": "running",
"instance_ids": [
"<INSTANCE_ID>"
],
"filters": []
}
},
"tolerance": true
}
]
},
"method": [],
"configuration": {
"aws_region": "<REGION>"
}
}
Replace the value of `<INSTANCE_ID>` with the ID of the deployed instance. Replace `<REGION>` with the name of the region the instance is deployed in.
3. Setup your Docker image¶
AWS Batch requires us to have a Docker image within the Job definition; this container will be what does the work for our Batch Job.
Make a file named `Dockerfile` alongside `experiments/` with the following contents:
FROM chaostoolkit/chaostoolkit:latest
RUN pip install chaostoolkit-aws
RUN mkdir /home/svc/experiments
COPY experiments /home/svc/experiments
WORKDIR /home/svc/experiments
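Before pushing the image anywhere, you can optionally sanity-check it locally. This is just a convenience, not a required step of the guide; it assumes you have AWS credentials exported in your shell and relies on the image’s `chaos` entrypoint:

# Build the image locally
docker build -t ctk-batch .

# Run the experiment once, passing your local AWS credentials through
# (these are the standard AWS variable names; adjust to however you authenticate)
docker run --rm \
    -e AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY \
    -e AWS_SESSION_TOKEN \
    ctk-batch run experiment-1.json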
4. Create your ECR repository and push the image¶
- Navigate to the ECR console and select `Repositories`
- Select `Create Repository`
- Leave the repository set to `Private` and enter a name for your repository
    - For the purpose of this guide, we’ll be using `ctk-batch`
- Select `Create Repository`
- Select `<repository-name>` from the table
- Select `View push commands`
- Follow the commands outlined there (We’ll show them below as an example)
Alternatively, create the ECR repository with the AWS CLI:

aws ecr create-repository \
    --repository-name ctk-batch \
    --no-cli-pager
Logging in to ECR with Docker
aws ecr get-login-password --region eu-west-2 | docker login --username AWS --password-stdin <your-aws-account-id>.dkr.ecr.eu-west-2.amazonaws.com
Login Succeeded
Building the image
docker build -t ctk-batch .
[+] Building 1.8s (10/10) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 220B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/chaostoolkit/chaostoolkit:latest 1.5s
=> [1/5] FROM docker.io/chaostoolkit/chaostoolkit:latest@sha256:3801eda37de7e8f00fb556220fff7935fea45d248881f4253cd9c29b4d3023f3 0.0s
=> => resolve docker.io/chaostoolkit/chaostoolkit:latest@sha256:3801eda37de7e8f00fb556220fff7935fea45d248881f4253cd9c29b4d3023f3 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 959B 0.0s
=> CACHED [2/5] RUN pip install chaostoolkit-aws 0.0s
=> CACHED [3/5] RUN mkdir /home/svc/experiments 0.0s
=> [4/5] COPY experiments /home/svc/experiments 0.0s
=> [5/5] WORKDIR /home/svc/experiments 0.0s
=> exporting to image 0.1s
=> => exporting layers 0.0s
=> => writing image sha256:4a3ce8f2824518bffa47ff3d293488f18f83e25711bedc32e13611a5c7e7e0af 0.0s
=> => naming to docker.io/library/ctk-batch 0.0s
Tagging the image
docker tag ctk-batch:latest <your-aws-account-id>.dkr.ecr.eu-west-2.amazonaws.com/ctk-batch:latest
Pushing the image
docker push <your-aws-account-id>.dkr.ecr.eu-west-2.amazonaws.com/ctk-batch:latest
The push refers to repository [<your-aws-account-id>.dkr.ecr.eu-west-2.amazonaws.com/ctk-batch]
5f70bf18a086: Pushed
ed804ed04ee1: Pushed
8ac8250b5bff: Pushed
65bb6a66824b: Pushed
381a8a9c329b: Pushed
7a767cefe1f5: Pushed
011386fb6049: Pushed
ac4086fc0a4e: Pushed
065eb9ef9cc4: Pushed
93ee5bc36b87: Pushed
9cc956b239dd: Pushed
bc276c40b172: Pushed
latest: digest: sha256:9702b9cf63a6e4961689a661340fc0573d28d0e7f506b90fa5d080e4e7c9d275 size: 2826
5. Create your Batch Compute environment¶
To actually run your Jobs, Batch needs a Compute environment configured. This is where you tell AWS what runs the jobs (i.e. EC2 instances, Fargate, etc.).
- Navigate to the Batch console and select `Compute environments`
- Select `Create`
- Leave `Managed` selected and provide a name
    - For the purpose of this guide we’ll use `ctk-batch-comp-env`
- Leave `Fargate` selected under `Instance configuration` but set `Maximum vCPUs` to `1`
- If you want to use a specific VPC, Subnets, and Security Group, select those in `Networking`
    - For this guide, we’ll use the values AWS filled in
- Add a tag - We recommend at minimum, adding a tag `{"OWNER": "your-name"}`
- Select `Create compute environment`
Alternatively, you can create the Compute environment with the AWS CLI:

- To create your Compute environment, you’ll need to choose which VPC to deploy into. Take note of the VPC ID of the VPC you want to deploy into from this command:

    aws ec2 describe-vpcs \
        --no-cli-pager

- You’ll also need to know which subnets to provide. Note down the subnet IDs you want to use from the output of the following command, replacing `VPC_ID` with the VPC ID from above:

    aws ec2 describe-subnets \
        --filter Name=vpc-id,Values=VPC_ID \
        --no-cli-pager

- You’ll also need to know which security groups to assign to your Compute environment. Note down the security group IDs you want to use from the output of the following command, replacing `VPC_ID` with the VPC ID you’re deploying into:

    aws ec2 describe-security-groups \
        --filter Name=vpc-id,Values=VPC_ID \
        --no-cli-pager

- Create the Batch Compute environment, replacing `YOUR_NAME` with your name, `SUBNET_IDS` with a comma-separated list of subnets, and `SECURITY_GROUP_IDS` with a comma-separated list of security groups from the above commands:

    aws batch create-compute-environment \
        --compute-environment-name ctk-batch-comp-env \
        --type MANAGED \
        --state ENABLED \
        --compute-resources type=FARGATE,maxvCpus=1,subnets=SUBNET_IDS,securityGroupIds=SECURITY_GROUP_IDS \
        --tags OWNER=YOUR_NAME \
        --no-cli-pager
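Whichever route you took, you can check that the Compute environment was created and is usable; its `status` should eventually report `VALID`:

aws batch describe-compute-environments \
    --compute-environments ctk-batch-comp-env \
    --no-cli-pager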
6. Create your Batch Job queue¶
When you submit Jobs, Batch uses a Job queue to manage what is running, what still needs to run, and where it should run.
- Navigate to the Batch console and select `Job queues`
- Select `Create`
- Give a name for the Job queue
    - For the purpose of this guide we’ll use `ctk-batch-job-queue`
- Leave priority as `1`
- Add a tag - We recommend at minimum, adding a tag `{"OWNER": "your-name"}`
- In the `Connected compute environments` section, select your Compute environment from the dropdown
- Select `Create`
Alternatively, create the Batch Job queue with the AWS CLI, replacing `YOUR_NAME` with your name:

aws batch create-job-queue \
    --job-queue-name ctk-batch-job-queue \
    --state ENABLED \
    --priority 1 \
    --compute-environment-order order=1,computeEnvironment=ctk-batch-comp-env \
    --tags OWNER=YOUR_NAME \
    --no-cli-pager
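As a quick check, you can confirm the Job queue exists and is attached to your Compute environment:

aws batch describe-job-queues \
    --job-queues ctk-batch-job-queue \
    --no-cli-pager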
7. Create your Batch Job execution role¶
Because you’ve set up an ECR repository with your Docker image in it, you need to provide Batch with an execution role that will allow it to pull the image from ECR. It will also enable Batch to output the logs of the container to CloudWatch.
Don’t be confused when we refer to `Elastic Container Service`; Batch is using it under the hood.
- Navigate to the IAM console and select `Roles`
- Select `Create role`
- Select `Elastic Container Service` from the list
- Select `Elastic Container Service Task` from the `Select your use case` list
- Select `Next: Permissions`
- In the search bar, type `AmazonECSTaskExecutionRolePolicy` and select it
- Select `Next: Tags`
- Add a tag - We recommend at minimum, adding a tag `{"OWNER": "your-name"}`
- Select `Next: Review`
- Provide a name for the Role
    - For the purpose of this guide we’ll use `ctk-batch-execution-role`
- Select `Create Role`
Alternatively, using the AWS CLI:

- Create a file named `execution-assume-role.json` with the following contents:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "ecs-tasks.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

- Create the execution role, replacing `YOUR_NAME` with your name:

    aws iam create-role \
        --role-name ctk-batch-execution-role \
        --assume-role-policy-document file://execution-assume-role.json \
        --tags Key=OWNER,Value=YOUR_NAME \
        --no-cli-pager

- Attach the `AmazonECSTaskExecutionRolePolicy` policy to the role:

    aws iam attach-role-policy \
        --role-name ctk-batch-execution-role \
        --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy \
        --no-cli-pager
8. Create your Batch Job job role¶
The nature of `chaostoolkit-aws` means that we use `boto3` to make AWS requests within our experiment. To be able to make these calls, the container that is running our experiment needs credentials and permissions to do so.
By creating a job role for our Job, we can:
- Provide our Job with credentials for AWS
- Outline exactly what our Job is allowed to do
- Navigate to the IAM console and select `Roles`
- Select `Create role`
- Select `Elastic Container Service` from the list
- Select `Elastic Container Service Task` from the `Select your use case` list
- Select `Next: Permissions`
- Select `Create policy` (A new window will open)
- Move to the `JSON` tab and paste the following:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeInstance*"
                ],
                "Resource": "*"
            }
        ]
    }

- Select `Next: Tags`
- Add a tag - We recommend at minimum, adding a tag `{"OWNER": "your-name"}`
- Select `Next: Review`
- Provide a name for the Policy
    - For the purpose of this guide we’ll use `ctk-batch-job-policy`
- Select `Create Policy`
- Navigate back to the Role tab
- In the search bar, type `ctk-batch-job-policy` and select it (You may have to click the refresh button)
- Select `Next: Tags`
- Add a tag - We recommend at minimum, adding a tag `{"OWNER": "your-name"}`
- Select `Next: Review`
- Provide a name for the Role
    - For the purpose of this guide we’ll use `ctk-batch-job-role`
- Select `Create Role`
Alternatively, using the AWS CLI:

- Create a file named `job-assume-role.json` with the following contents:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "ecs-tasks.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

- Create the job role, replacing `YOUR_NAME` with your name:

    aws iam create-role \
        --role-name ctk-batch-job-role \
        --assume-role-policy-document file://job-assume-role.json \
        --tags Key=OWNER,Value=YOUR_NAME \
        --no-cli-pager

- Create a file named `job-policy.json` with the following contents:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeInstance*"
                ],
                "Resource": "*"
            }
        ]
    }

- Create a policy for the job role, replacing `YOUR_NAME` with your name:

    aws iam create-policy \
        --policy-name ctk-batch-job-policy \
        --policy-document file://job-policy.json \
        --tags Key=OWNER,Value=YOUR_NAME \
        --no-cli-pager

- Attach the policy to the job role, replacing `AWS_ACCOUNT_ID` with your AWS account ID:

    aws iam attach-role-policy \
        --role-name ctk-batch-job-role \
        --policy-arn arn:aws:iam::AWS_ACCOUNT_ID:policy/ctk-batch-job-policy \
        --no-cli-pager
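A quick way to confirm both roles ended up with the policies you expect is to list their attachments:

aws iam list-attached-role-policies \
    --role-name ctk-batch-execution-role \
    --no-cli-pager

aws iam list-attached-role-policies \
    --role-name ctk-batch-job-role \
    --no-cli-pager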
9. Create your Batch Job definition¶
This is where we tell AWS what our Job is and what it needs.
- Navigate to the Batch console and select `Job definitions`
- Select `Create`
- Provide a name
    - For the purpose of this guide we’ll use `ctk-batch-job-def`
- Leave `Fargate` selected
- Scroll down to `Container properties`
- For `Image`, provide the URI of your ECR image from earlier
    - For example, when writing this guide, ours is: `<our-aws-account-id>.dkr.ecr.eu-west-2.amazonaws.com/ctk-batch:latest`
- Leave `Bash` selected and inside command put: `run experiment-1.json`
- Select `0.25` for `vCpus` and `0.5 GB` for `Memory`
- Select `ctk-batch-execution-role` from the `Execution role` dropdown
- If your Compute environment is using Public subnets, select `Assign public IP`
    - If you’re using Private subnets, please follow this StackOverflow post on what best setup to have to enable your jobs to communicate with ECR
- Select `Additional configuration`
- Select `ctk-batch-job-role` from the `Job role` dropdown
- Scroll down to the `Log configuration` section
- Select `awslogs` from the `Log driver` dropdown
- Scroll down to the `Tags` section
- Add a tag - We recommend at minimum, adding a tag `{"OWNER": "your-name"}`
- Select `Enable` under `Propagate Tags`
- Select `Create`
Alternatively, using the AWS CLI:

- Create a file named `container-properties.json` with the following contents:

    {
        "image": "ECR_IMAGE_URI",
        "command": ["run", "experiment-1.json"],
        "jobRoleArn": "arn:aws:iam::AWS_ACCOUNT_ID:role/ctk-batch-job-role",
        "executionRoleArn": "arn:aws:iam::AWS_ACCOUNT_ID:role/ctk-batch-execution-role",
        "resourceRequirements": [
            { "value": "512", "type": "MEMORY" },
            { "value": "0.25", "type": "VCPU" }
        ],
        "logConfiguration": {
            "logDriver": "awslogs"
        },
        "networkConfiguration": {
            "assignPublicIp": "ENABLED"
        },
        "fargatePlatformConfiguration": {
            "platformVersion": "LATEST"
        }
    }

- Replace `ECR_IMAGE_URI` in the file with the URI of the image you pushed to ECR. Replace `AWS_ACCOUNT_ID` with your AWS account ID.
- If you’re running your jobs in public subnets, leave `"assignPublicIp": "ENABLED"` as it is; if you are not running them in public subnets, we recommend you look at this StackOverflow post on what best setup to have to enable your jobs to communicate with ECR.
- Create your Batch Job definition, replacing `YOUR_NAME` with your name:

    aws batch register-job-definition \
        --job-definition-name ctk-batch-job-def \
        --type container \
        --container-properties file://container-properties.json \
        --platform-capabilities FARGATE \
        --tags OWNER=YOUR_NAME \
        --propagate-tags \
        --no-cli-pager
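You can confirm the Job definition registered correctly (and see the revision number Batch assigned to it) with:

aws batch describe-job-definitions \
    --job-definition-name ctk-batch-job-def \
    --status ACTIVE \
    --no-cli-pager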
10. Run your experiment¶
Now that you have:
- Setup your ‘system’
- Setup your container with its dependencies and experiments
- Setup your ECR repository
- Setup your required IAM Roles and Policies
- Setup your Batch Job Compute Environment, Queue, and Definition
It’s a great time to try and run it!
- Navigate to the Batch console and select `Jobs`
- Select `ctk-batch-job-queue` from the `Please select a job queue` dropdown
- Select `Submit new job`
- Enter a name (This can be anything)
- Select `ctk-batch-job-def` from the `Job definition` dropdown
- Select `ctk-batch-job-queue` from the `Job queue` dropdown
- Scroll down to the `Tags` section
- Add a tag - We recommend at minimum, adding a tag `{"OWNER": "your-name"}`
- Select `Submit`
- Select `ctk-batch-job-queue` from the `Please select a job queue` dropdown
- You can see your job, select it
- Under `Job status` you’ll see the different states move along (Hit the refresh button)
- Once `Succeeded` is reached, you can select the link under `Log stream name`

Here you’ll find the CloudWatch logs of the experiment:
No older events at this moment. Retry
[2021-08-19 14:02:42 INFO] Validating the experiment's syntax
[2021-08-19 14:02:42 INFO] Experiment looks valid
[2021-08-19 14:02:42 INFO] Running experiment: Running Chaos Toolkit from AWS Batch
[2021-08-19 14:02:42 INFO] Steady-state strategy: default
[2021-08-19 14:02:42 INFO] Rollbacks strategy: default
[2021-08-19 14:02:42 INFO] Steady state hypothesis: EC2 is RUNNING
[2021-08-19 14:02:42 INFO] Probe: instance_state
[2021-08-19 14:02:43 INFO] Steady state hypothesis is met!
[2021-08-19 14:02:43 INFO] Playing your experiment's method now...
[2021-08-19 14:02:43 INFO] No declared activities, let's move on.
[2021-08-19 14:02:43 INFO] Steady state hypothesis: EC2 is RUNNING
[2021-08-19 14:02:43 INFO] Probe: instance_state
[2021-08-19 14:02:43 INFO] Steady state hypothesis is met!
[2021-08-19 14:02:43 INFO] Let's rollback...
[2021-08-19 14:02:43 INFO] No declared rollbacks, let's move on.
[2021-08-19 14:02:43 INFO] Experiment ended with status: completed
No newer events at this moment. Auto retry paused. Resume
Alternatively, you can submit and inspect the Job with the AWS CLI:

- Run your Job, replacing `YOUR_NAME` with your name, and take note of `jobId` in the output:

    aws batch submit-job \
        --job-name ctk-batch-example-1 \
        --job-queue ctk-batch-job-queue \
        --job-definition ctk-batch-job-def \
        --propagate-tags \
        --tags OWNER=YOUR_NAME \
        --no-cli-pager

- Describe your job, making note of the value of `logStreamName`, replacing `JOB_ID` with the `jobId` from above:

    aws batch describe-jobs \
        --jobs JOB_ID \
        --no-cli-pager

- Check out the logs of your job, replacing `LOG_STREAM` with the log stream name from above:

    aws logs get-log-events \
        --log-group-name /aws/batch/job \
        --log-stream-name LOG_STREAM \
        --output text \
        --no-cli-pager
You’ll then see the CloudWatch logs of the experiment:
b/36345990587319449028074048616721581140117476148622393344 f/36345990607367818961553078820962191867228353142498787343
EVENTS 1629810588018 [2021-08-24 13:09:45 INFO] Validating the experiment's syntax 1629810585420
EVENTS 1629810588018 [2021-08-24 13:09:45 INFO] Experiment looks valid 1629810585571
EVENTS 1629810588018 [2021-08-24 13:09:45 INFO] Running experiment: Running Chaos Toolkit from AWS Batch 1629810585572
EVENTS 1629810588018 [2021-08-24 13:09:45 INFO] Steady-state strategy: default 1629810585577
EVENTS 1629810588018 [2021-08-24 13:09:45 INFO] Rollbacks strategy: default 1629810585577
EVENTS 1629810588018 [2021-08-24 13:09:45 INFO] Steady state hypothesis: EC2 is RUNNING 1629810585577
EVENTS 1629810588018 [2021-08-24 13:09:45 INFO] Probe: instance_state 1629810585578
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] Steady state hypothesis is met! 1629810586238
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] Playing your experiment's method now... 1629810586238
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] No declared activities, let's move on. 1629810586238
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] Steady state hypothesis: EC2 is RUNNING 1629810586238
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] Probe: instance_state 1629810586239
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] Steady state hypothesis is met! 1629810586318
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] Let's rollback... 1629810586319
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] No declared rollbacks, let's move on. 1629810586319
EVENTS 1629810588018 [2021-08-24 13:09:46 INFO] Experiment ended with status: completed 1629810586319
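If you’d rather not poll by hand before fetching the logs, a small shell loop like the following (purely a convenience sketch, not part of the guide’s required steps) waits for the Job to reach a terminal state; replace `JOB_ID` as before:

# Poll the job status every 10 seconds until Batch reports SUCCEEDED or FAILED
while true; do
    STATUS=$(aws batch describe-jobs --jobs JOB_ID --query 'jobs[0].status' --output text)
    echo "Job status: ${STATUS}"
    if [ "${STATUS}" = "SUCCEEDED" ] || [ "${STATUS}" = "FAILED" ]; then
        break
    fi
    sleep 10
done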
Summary¶
Like our Running Chaos Toolkit from an EC2 instance guide, our experiment was extremely simple. Again, this guide was not meant to teach you to write experiments. The purpose of the guide was to show you how you might run Chaos Toolkit from AWS to interact with your AWS infrastructure, in a more reactive process.
Rather than having an EC2 instance running and not doing any work, you have the ability now to fire off Chaos Toolkit experiments and only use the compute you need.
You should now have an appreciation of, and the ability to:
- Create a containerised setup for Chaos Toolkit
- Setup IAM Roles and Policies to restrict and enable your Batch Jobs to run
- Create and run Batch Jobs which will carry out your Chaos Toolkit experiments
- More importantly, do all of this within your infrastructure’s networking limits
Extras¶
Whilst the above guide will tell you all you need to know to get started with AWS Batch and running Chaos Toolkit experiments with jobs, it is very manual and has a few shortcomings that are easily fixed with some more work.
AWS Cloud Development Kit (CDK)¶
As mentioned near the start of this guide, we have this repository which contains an AWS CDK project which deploys almost the same infrastructure as this guide.
The infrastructure differs slightly in:
- We create our own new VPC
- We specifically place our Compute environment in private subnets
- AWS CDK auto-magically sets up networking infrastructure for us to communicate with ECR without having to give our Jobs public IP addresses
We also modify the experiment-1.json
file to accept an environment variable for the EC2 instance ID as this will be provided by CDK.
If you wish to try this project out, clone the repository and ensure you install all of the requirements first.
Once you’re set up with the requirements, you can check what infrastructure will be deployed with:
make diff
...
Resources
[+] AWS::S3::Bucket journal-bucket-your-name-dev journalbucketyour-namedev58D204DE
[+] AWS::S3::BucketPolicy journal-bucket-your-name-dev/Policy journalbucketyour-namedevPolicyDFBFADE1
[+] Custom::S3AutoDeleteObjects journal-bucket-your-name-dev/AutoDeleteObjectsCustomResource journalbucketyour-namedevAutoDeleteObjectsCustomResourceB5FE1104
[+] AWS::IAM::Role Custom::S3AutoDeleteObjectsCustomResourceProvider/Role CustomS3AutoDeleteObjectsCustomResourceProviderRole3B1BD092
[+] AWS::Lambda::Function Custom::S3AutoDeleteObjectsCustomResourceProvider/Handler CustomS3AutoDeleteObjectsCustomResourceProviderHandler9D90184F
[+] AWS::EC2::VPC vpc-your-name-dev vpcyour-namedev8A672852
[+] AWS::EC2::Subnet vpc-your-name-dev/PublicSubnet1/Subnet vpcyour-namedevPublicSubnet1Subnet4E80B3F7
[+] AWS::EC2::RouteTable vpc-your-name-dev/PublicSubnet1/RouteTable vpcyour-namedevPublicSubnet1RouteTable3BA26768
[+] AWS::EC2::SubnetRouteTableAssociation vpc-your-name-dev/PublicSubnet1/RouteTableAssociation vpcyour-namedevPublicSubnet1RouteTableAssociationF3D844E7
[+] AWS::EC2::Route vpc-your-name-dev/PublicSubnet1/DefaultRoute vpcyour-namedevPublicSubnet1DefaultRouteD7E793DD
[+] AWS::EC2::EIP vpc-your-name-dev/PublicSubnet1/EIP vpcyour-namedevPublicSubnet1EIP712EAA5B
[+] AWS::EC2::NatGateway vpc-your-name-dev/PublicSubnet1/NATGateway vpcyour-namedevPublicSubnet1NATGatewayCC6D84C6
[+] AWS::EC2::Subnet vpc-your-name-dev/PublicSubnet2/Subnet vpcyour-namedevPublicSubnet2Subnet3E3E0046
[+] AWS::EC2::RouteTable vpc-your-name-dev/PublicSubnet2/RouteTable vpcyour-namedevPublicSubnet2RouteTable1AB520E0
[+] AWS::EC2::SubnetRouteTableAssociation vpc-your-name-dev/PublicSubnet2/RouteTableAssociation vpcyour-namedevPublicSubnet2RouteTableAssociation2FEAAF25
[+] AWS::EC2::Route vpc-your-name-dev/PublicSubnet2/DefaultRoute vpcyour-namedevPublicSubnet2DefaultRouteC103C9D2
[+] AWS::EC2::EIP vpc-your-name-dev/PublicSubnet2/EIP vpcyour-namedevPublicSubnet2EIP6AD92B60
[+] AWS::EC2::NatGateway vpc-your-name-dev/PublicSubnet2/NATGateway vpcyour-namedevPublicSubnet2NATGatewayA09AEBA1
[+] AWS::EC2::Subnet vpc-your-name-dev/PrivateSubnet1/Subnet vpcyour-namedevPrivateSubnet1SubnetB315A65A
[+] AWS::EC2::RouteTable vpc-your-name-dev/PrivateSubnet1/RouteTable vpcyour-namedevPrivateSubnet1RouteTableA5FAAF1C
[+] AWS::EC2::SubnetRouteTableAssociation vpc-your-name-dev/PrivateSubnet1/RouteTableAssociation vpcyour-namedevPrivateSubnet1RouteTableAssociationE3B5D7DD
[+] AWS::EC2::Route vpc-your-name-dev/PrivateSubnet1/DefaultRoute vpcyour-namedevPrivateSubnet1DefaultRoute9152FB24
[+] AWS::EC2::Subnet vpc-your-name-dev/PrivateSubnet2/Subnet vpcyour-namedevPrivateSubnet2Subnet414716F0
[+] AWS::EC2::RouteTable vpc-your-name-dev/PrivateSubnet2/RouteTable vpcyour-namedevPrivateSubnet2RouteTable225072CD
[+] AWS::EC2::SubnetRouteTableAssociation vpc-your-name-dev/PrivateSubnet2/RouteTableAssociation vpcyour-namedevPrivateSubnet2RouteTableAssociationF9EA82A2
[+] AWS::EC2::Route vpc-your-name-dev/PrivateSubnet2/DefaultRoute vpcyour-namedevPrivateSubnet2DefaultRoute7BE0AFBF
[+] AWS::EC2::InternetGateway vpc-your-name-dev/IGW vpcyour-namedevIGW70FB840E
[+] AWS::EC2::VPCGatewayAttachment vpc-your-name-dev/VPCGW vpcyour-namedevVPCGWB8F53F81
[+] AWS::EC2::SecurityGroup instance-your-name-dev/InstanceSecurityGroup instanceyour-namedevInstanceSecurityGroup50C02701
[+] AWS::IAM::Role instance-your-name-dev/InstanceRole instanceyour-namedevInstanceRoleF653EE93
[+] AWS::IAM::InstanceProfile instance-your-name-dev/InstanceProfile instanceyour-namedevInstanceProfile6799F951
[+] AWS::EC2::Instance instance-your-name-dev instanceyour-namedev8DD0F85A
[+] AWS::IAM::Role batch-service-role-your-name-dev batchserviceroleyour-namedevB064BF84
[+] AWS::IAM::Role batch-execution-role-your-name-dev batchexecutionroleyour-namedev39DE3188
[+] AWS::IAM::Policy batch-execution-policy-your-name-dev batchexecutionpolicyyour-namedev8BCEF321
[+] AWS::IAM::Role batch-job-role-your-name-dev batchjobroleyour-namedevAC17F802
[+] AWS::IAM::Policy batch-job-role-your-name-dev/DefaultPolicy batchjobroleyour-namedevDefaultPolicy44B9665C
[+] AWS::IAM::Policy batch-job-policy-your-name-dev batchjobpolicyyour-namedev9C329AAB
[+] AWS::Batch::ComputeEnvironment compute-env-your-name-dev computeenvyour-namedev
[+] AWS::Batch::JobQueue job-queue-your-name-dev jobqueueyour-namedev
[+] AWS::Batch::JobDefinition job-def-your-name-dev jobdefyour-namedev
To deploy the infrastructure, run:
make deploy
...
ChaosToolkitBatchExampleStack-your-name-dev: creating CloudFormation changeset...
[███████████████████████████████████████████████▏··········] (35/43)
09:24:07 | CREATE_IN_PROGRESS | AWS::CloudFormation::Stack | ChaosToolkitBatchExampleStack-your-name-dev
09:24:52 | CREATE_IN_PROGRESS | AWS::IAM::InstanceProfile | instance-your-name-dev/InstanceProfile
09:25:13 | CREATE_IN_PROGRESS | AWS::EC2::NatGateway | vpc-your-name-dev/PublicSubnet2/NATGateway
09:25:13 | CREATE_IN_PROGRESS | AWS::EC2::NatGateway | vpc-your-name-dev/PublicSubnet1/NATGateway
...
You can then navigate to the Batch console and run jobs as previously outlined in the guide above.
Storing your journal¶
You might have noticed that being able to view the Chaos Toolkit experiment journal presents a pickle of a situation. As Batch Job containers are ephemeral, once the Job has run and terminated (either successfully or not), the place your experiment just ran in is destroyed for good.
There are likely several ways you could get around this issue: you could implement an extension to upload journals somewhere, or you could extend `chaostoolkit-aws` to upload journal runs to S3 or to an EBS volume (if you’d deployed that too).
In the CDK example, we actually snuck in a way to store journals; we use a wrapper script which calls Chaos Toolkit and then uses `boto3` in a Python script to upload the journal to S3, into a bucket we also deploy in the stack.
Take a look at our `Dockerfile` compared to the one in the guide above:
FROM chaostoolkit/chaostoolkit:latest
RUN pip install chaostoolkit-aws
RUN mkdir /home/svc/experiments
COPY experiments /home/svc/experiments
COPY run_experiment.sh /home/svc/experiments/
COPY upload_journal.py /home/svc/experiments/
WORKDIR /home/svc/experiments
ENTRYPOINT [ "sh", "run_experiment.sh" ]
We’ve added two new `COPY` statements, moving our wrapper script and our upload script into the container. We’ve also added an override for the container’s `ENTRYPOINT` value, which in `chaostoolkit/chaostoolkit:latest` is `chaos`.
The wrapper script is very simple; it just looks like:
#!/bin/bash
chaos run $1 --journal-path=/home/svc/experiments/journal.json
python3 upload_journal.py
Our upload script is also very basic:
import os
from datetime import datetime

import boto3


def upload_journal():
    s3 = boto3.client("s3")
    with open("/home/svc/experiments/journal.json", "rb") as journal:
        s3.upload_fileobj(
            journal,
            os.environ["JOURNAL_BUCKET"],
            f"{datetime.now().strftime('%Y%m%d-%H%M%S')}.json",
        )


if __name__ == "__main__":
    upload_journal()
With these changes and a small modification to the `command` of our Job definition, we can now invoke our experiment, specify a location for our journal, and then upload the journal with a suitable name - we set our journal name to the current `datetime`. You could also include your experiment name if you have many of them.
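To make the `command` change concrete, here is a rough sketch (not taken from the CDK project, and with a placeholder bucket name) of how the container properties might look for the wrapper image: the wrapper’s `$1` is the experiment file, and `upload_journal.py` reads the bucket name from the `JOURNAL_BUCKET` environment variable. The job role would also need permission to put objects into that bucket.

# Hypothetical container-properties.json for the wrapper image; the entrypoint is
# now `sh run_experiment.sh`, so the command is only the experiment file, and
# JOURNAL_BUCKET tells upload_journal.py where to put the journal.
cat > container-properties.json <<'EOF'
{
    "image": "ECR_IMAGE_URI",
    "command": ["experiment-1.json"],
    "environment": [
        { "name": "JOURNAL_BUCKET", "value": "YOUR_JOURNAL_BUCKET" }
    ],
    "jobRoleArn": "arn:aws:iam::AWS_ACCOUNT_ID:role/ctk-batch-job-role",
    "executionRoleArn": "arn:aws:iam::AWS_ACCOUNT_ID:role/ctk-batch-execution-role",
    "resourceRequirements": [
        { "value": "512", "type": "MEMORY" },
        { "value": "0.25", "type": "VCPU" }
    ],
    "logConfiguration": { "logDriver": "awslogs" },
    "networkConfiguration": { "assignPublicIp": "ENABLED" },
    "fargatePlatformConfiguration": { "platformVersion": "LATEST" }
}
EOF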