Kuldeep Chowhan

@kuldeep

Chaos Engineering using Amazon EC2 Systems Manager

Background

I have been looking at tools for introducing chaos into applications (Chaos Engineering) so that I can test whether those applications are resilient. While exploring different tools, I wondered whether I could use the Amazon EC2 Systems Manager suite of tools, which is already available in AWS, to introduce chaos into applications.

What Is Amazon EC2 Systems Manager?

Amazon EC2 Systems Manager is a collection of capabilities that helps you automate management tasks such as collecting system inventory, applying operating system patches, automating the creation of Amazon Machine Images (AMIs), and configuring operating systems and applications at scale. Systems Manager lets you remotely and securely manage the configuration of your managed instances.

More Info at: http://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html

With this idea in mind, I looked at the different ways I could introduce chaos and at which tools EC2 already provides, so that I would not have to build my own.

These are the Amazon EC2 Systems Manager tools that I picked to perform Chaos Engineering:

  1. SSM Document
  2. Run Command

What is an SSM Document?

An Amazon EC2 Systems Manager Document defines the actions that Systems Manager performs on your managed instances. Systems Manager includes more than a dozen pre-configured documents that you can use by specifying parameters at runtime. Documents use JavaScript Object Notation (JSON), and they include steps and parameters that you specify. Steps execute in sequential order.

More info at: http://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-ssm-docs.html

What is Run Command?

Systems Manager Run Command lets you remotely and securely manage the configuration of your managed instances. A managed instance is any Amazon EC2 instance or on-premises machine in your hybrid environment that has been configured for Systems Manager. Run Command enables you to automate common administrative tasks and perform ad hoc configuration changes at scale. You can use Run Command from the EC2 console, the AWS Command Line Interface, Windows PowerShell, or the AWS SDKs.

More info at: https://aws.amazon.com/ec2/run-command/

Setup Walkthrough

Let’s walk through the setup required to run the Chaos Engineering experiment.

Setting Up Systems Manager

To get started with Amazon EC2 Systems Manager, verify prerequisites, configure AWS Identity and Access Management (IAM) roles, and install the SSM Agent on managed instances.

This document talks about how to configure the IAM roles and the installation steps for SSM Agent: http://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-setting-up.html

Using the steps above, install the SSM Agent on the Amazon Linux EC2 instances on which you want to run the Chaos Experiment.
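
Once the agent is installed, it can be useful to confirm that the instance has actually registered with Systems Manager before going further. The sketch below is one way to check this from Python. It assumes the boto3 SDK and its describe_instance_information call; the helper name is_managed_and_online is hypothetical, and the SSM client is passed in as an argument so the function does not depend on how credentials or regions are configured.

```python
def is_managed_and_online(ssm_client, instance_id):
    """Return True if Systems Manager lists the instance with PingStatus 'Online'.

    ssm_client is assumed to be a boto3 SSM client, e.g. boto3.client("ssm").
    """
    response = ssm_client.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
    )
    for info in response.get("InstanceInformationList", []):
        if info.get("InstanceId") == instance_id:
            # Other statuses (for example "ConnectionLost") mean the agent
            # registered at some point but is not reachable right now.
            return info.get("PingStatus") == "Online"
    # The instance is not registered as a managed instance at all.
    return False
```

With boto3 installed and credentials configured, a call might look like is_managed_and_online(boto3.client("ssm"), "i-0123456789abcdef0"), where the instance ID is a placeholder. A False result usually points at the IAM role or the agent installation, which are the two setup steps described above.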

Create SSM Document

We will create an SSM Document to run our Chaos Engineering experiments. Let’s start with one document that blackholes a specified port on an instance for a specified amount of time. I will follow up with additional blog posts containing SSM Document templates for other Chaos Engineering experiments.

This page explains how to create an SSM Document: http://docs.aws.amazon.com/systems-manager/latest/userguide/create-ssm-doc.html

To blackhole a port on an instance, create the Document using the following information:

Name: Chaos-Blackhole

Document Type: Command

Content:

Content used for Blackhole SSM Document
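
For reference, a document along these lines could look like the sketch below, which builds the content in Python and prints it as JSON. This is a reconstruction under assumptions, not necessarily the exact content used for this post: it assumes schema version 2.2 with a single aws:runShellScript step, and the step name, descriptions, and allowedPattern values are my own. Only the three shell commands and the two parameter names, port and duration, come from this post.

```python
import json

# The three commands described in this post; {{ port }} and {{ duration }}
# are placeholders that Systems Manager fills in from the parameters.
BLACKHOLE_COMMANDS = [
    "iptables -A INPUT -p tcp --destination-port {{ port }} -j DROP",
    "sleep {{ duration }}",
    "iptables -D INPUT -p tcp --destination-port {{ port }} -j DROP",
]


def build_blackhole_document():
    """Build a Command document as a dict (assumed layout: schemaVersion 2.2)."""
    return {
        "schemaVersion": "2.2",
        "description": "Blackhole a TCP port for a fixed duration (sketch).",
        "parameters": {
            "port": {
                "type": "String",
                "description": "TCP port to blackhole.",
                "allowedPattern": "^[0-9]{1,5}$",  # assumption: digits only
            },
            "duration": {
                "type": "String",
                "description": "Seconds to keep the port blackholed.",
                "allowedPattern": "^[0-9]+$",  # assumption: digits only
            },
        },
        "mainSteps": [
            {
                "action": "aws:runShellScript",
                "name": "blackholePort",  # hypothetical step name
                "inputs": {"runCommand": BLACKHOLE_COMMANDS},
            }
        ],
    }


# The printed JSON can be pasted into the Content field of the document.
print(json.dumps(build_blackhole_document(), indent=2))
```

Restricting both parameters to digits is a design choice in this sketch: the values end up inside shell commands, so it seems safer to reject anything that could contain shell metacharacters.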

The document uses the aws:runShellScript action to execute commands on the instance that blackhole a port. Let’s walk through the commands that I’m using.

iptables -A INPUT -p tcp --destination-port {{ port }} -j DROP

Adds an iptables rule that drops incoming TCP packets on the specified port

sleep {{ duration }}

Waits for the specified duration

iptables -D INPUT -p tcp --destination-port {{ port }} -j DROP

Deletes the iptables rule that drops the packets on the specified port

Both port and duration are parameters of the Document. They are filled in with the values you supply when the Run Command is executed.
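
To make the substitution concrete, the sketch below mimics it locally in Python. It is only an illustration of the idea, not how Systems Manager implements it; the render helper is hypothetical, and the values 80 and 30 are the ones used later in this post.

```python
import re

TEMPLATE = [
    "iptables -A INPUT -p tcp --destination-port {{ port }} -j DROP",
    "sleep {{ duration }}",
    "iptables -D INPUT -p tcp --destination-port {{ port }} -j DROP",
]

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(lines, params):
    """Replace each {{ name }} placeholder with params[name]."""
    def substitute(match):
        # A KeyError here means a placeholder has no matching parameter.
        return str(params[match.group(1)])

    return [PLACEHOLDER.sub(substitute, line) for line in lines]


for command in render(TEMPLATE, {"port": "80", "duration": "30"}):
    print(command)
```

With those values the rendered commands add a DROP rule for port 80, sleep for 30 seconds, and then delete the same rule, which is the sequence the instance is expected to run.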

Run Chaos Engineering Experiment

Now that the SSM Agent is installed on an EC2 instance and the SSM Document is created, it is time to run our Chaos Engineering experiment. Let’s look at how to run the Chaos Blackhole experiment. We will install nginx on an EC2 instance that has the SSM Agent installed, then run the experiment to blackhole port 80, which nginx listens on. While the experiment is running, we should be unable to reach nginx on that instance from a browser on port 80. If that is the case, the experiment is successful.

Preparing EC2 instance for our Chaos Engineering Experiment

  1. SSH to the instance on which you have installed the SSM Agent
  2. Install nginx by running the command yum install nginx -y (as root or with sudo)
  3. Start nginx by running the command service nginx restart
  4. Verify that nginx is running on the instance by running curl http://localhost and checking that you get a response back
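
The verification in the last step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name is_reachable and the default timeout are my own choices, not part of the original setup.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP request to url gets any response within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status (4xx/5xx).
        return True
    except OSError:
        # Covers connection refused, timeouts, and DNS failures.
        return False
```

On the instance, is_reachable("http://localhost") should return True once nginx is up. Run from your own machine against the instance's address, the same check is expected to return False while port 80 is blackholed, because dropped packets show up as a timeout rather than a refused connection.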

Running the experiment

Now that the EC2 instance has nginx running on port 80, we can run our Chaos Blackhole experiment and see if it succeeds. For this walkthrough I’m using the AWS Management Console; however, all the steps mentioned in this post can also be run using the AWS CLI or an AWS SDK.

  • Navigate to the AWS Console and select the EC2 service from the list of services. In the EC2 dashboard, select Run Command from the menu on the left-hand side
  • In the commands window, click the Run a command button
  • In the Run a command window, filter the available commands to Owned by me and select the Chaos-Blackhole document
Selecting Chaos-Blackhole Document from the list
  • From Target instances, select the instance on which you have configured the SSM Agent
  • Enter 80 as the port number to blackhole
  • Enter 30 (seconds) as the duration to blackhole
  • After you have filled in these details, click the Run button at the bottom to start the experiment
  • Before the experiment, I was able to access nginx via the browser
  • During the experiment, when I tried to hit the IP of the EC2 instance from the browser, it stopped responding and the wheel kept spinning
Nginx was not accessible during the execution of the run command
  • Once the Run Command execution was complete, I was able to access nginx via the browser without any issues
  • You can also look at the status of the Run Command in the Command list window
  • By leveraging EC2 Systems Manager, I was able to successfully run a Chaos Engineering experiment that blackholes a port on an instance.
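
The same run can be driven from an AWS SDK instead of the console. The following is a minimal sketch in Python, assuming boto3 and its send_command and get_command_invocation calls. The helper names, the comment string, the polling interval, and the maximum wait are my own choices for illustration, and the SSM client is passed in as an argument.

```python
import time

# Statuses after which the invocation will not change any more.
TERMINAL_STATUSES = {"Success", "Failed", "Cancelled", "TimedOut"}


def build_send_command_request(instance_id, port, duration_seconds):
    """Build keyword arguments for the SSM SendCommand API call."""
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "Chaos-Blackhole",
        # Parameter values are passed as lists of strings.
        "Parameters": {
            "port": [str(port)],
            "duration": [str(duration_seconds)],
        },
        "Comment": "Blackhole chaos experiment",
    }


def run_blackhole(ssm_client, instance_id, port=80, duration_seconds=30,
                  poll_seconds=5, max_wait_seconds=300):
    """Start the experiment and poll until the invocation reaches a terminal status."""
    request = build_send_command_request(instance_id, port, duration_seconds)
    command_id = ssm_client.send_command(**request)["Command"]["CommandId"]

    deadline = time.monotonic() + max_wait_seconds
    while time.monotonic() < deadline:
        # Sleep first: the invocation may not be visible right after sending.
        time.sleep(poll_seconds)
        invocation = ssm_client.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if invocation["Status"] in TERMINAL_STATUSES:
            return invocation["Status"]
    raise TimeoutError("Command %s did not finish in time" % command_id)
```

With boto3 installed and credentials configured, a call might look like run_blackhole(boto3.client("ssm"), "i-0123456789abcdef0"), where the instance ID is a placeholder. A returned status of Success only says that the commands ran on the instance; whether nginx was actually unreachable during the run still has to be checked from outside, as in the browser test above.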

Feel free to provide feedback on the approach and on ways to improve it.
