My first attempt to log EC2 instance names to PagerDuty and Airbrake broke most of our infrastructure. I failed to account for unpublished AWS rate limits. When an unexpected volume of errors caused my code to hit those limits, insufficient error handling led to an infinite loop as errors were thrown inside our exception loggers.
I hope that this tutorial can save you some of my headache. I’ll walk you through how to use the boto3 Python client to access the name of a running EC2 instance from that instance, and along the way I’ll include caveats and gotchas that will help you avoid some of my mistakes.
Most information about the instance is accessible with the boto3 Instance resource. To create that resource, we first need to retrieve the instance id and instance region.
AWS provides Instance Metadata and User Data via the URL http://169.254.169.254, which you can request from any running EC2 instance. In particular, we are interested in the Instance Identity Document, which is accessible at http://169.254.169.254/latest/dynamic/instance-identity/document.
import requests
r = requests.get("http://169.254.169.254/latest/dynamic/instance-identity/document")
response_json = r.json()
region = response_json.get('region')
instance_id = response_json.get('instanceId')
If you are not familiar with the requests library, I would recommend checking out Response Status Codes, particularly the raise_for_status function, as a starting point for error handling.
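As a starting point, here is a minimal sketch of that request with basic error handling. The function name fetch_identity, the http_get parameter, and the two-second timeout are illustrative choices, not anything AWS or requests prescribes. It returns (None, None) instead of raising, so a failed metadata lookup cannot crash whatever is calling it.

```python
import requests

IDENTITY_URL = "http://169.254.169.254/latest/dynamic/instance-identity/document"


def fetch_identity(http_get=requests.get, timeout=2.0):
    """Return (region, instance_id), or (None, None) if the lookup fails."""
    try:
        # requests has no default timeout, so set one explicitly.
        response = http_get(IDENTITY_URL, timeout=timeout)
        # Turn 4xx/5xx responses into requests.HTTPError.
        response.raise_for_status()
        document = response.json()
    except (requests.RequestException, ValueError):
        # RequestException covers connection errors, timeouts and HTTPError;
        # ValueError covers a body that is not valid JSON.
        return None, None
    if not isinstance(document, dict):
        return None, None
    return document.get("region"), document.get("instanceId")
```

The http_get parameter exists only so the function can be exercised off EC2 with a stand-in; in normal use you would call fetch_identity() with no arguments.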
We can then use the instance id and region to retrieve the boto3 Instance resource.
import boto3
ec2 = boto3.resource('ec2', region_name=region)
instance = ec2.Instance(instance_id)
Validate region and instance_id before passing them to boto3
The first step of boto3 error handling is to catch ClientError and BotoCoreError, both found in the botocore.exceptions module.
In my experience, the boto3 client has pretty confusing error handling for invalid or None region or instance ids. In addition to the errors mentioned above, None values in either field will raise the Python built-in ValueError. I would recommend that you do not attempt to use the boto3 client if either region or instance_id is None or an empty string.
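One way to apply that advice is a small guard that runs before any boto3 call. This is a sketch: the helper name has_usable_identity is hypothetical, and the instance id pattern (i- followed by 8 or 17 hexadecimal characters) is an assumption based on the id formats I have seen, so loosen it if your ids look different.

```python
import re

# Assumption: instance ids are "i-" plus 8 (older) or 17 (newer) hex characters.
_INSTANCE_ID_PATTERN = re.compile(r"i-(?:[0-9a-f]{8}|[0-9a-f]{17})")


def has_usable_identity(region, instance_id):
    """Return True only if both values are plausible enough to hand to boto3."""
    # Reject None, non-strings, and empty or whitespace-only regions.
    if not isinstance(region, str) or not region.strip():
        return False
    if not isinstance(instance_id, str):
        return False
    # fullmatch requires the whole string to fit the pattern.
    return _INSTANCE_ID_PATTERN.fullmatch(instance_id) is not None
```

A guard like this only filters out obviously bad input. The boto3 calls that follow it should still sit inside a try block that catches ClientError, BotoCoreError, and ValueError.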
An instance’s “Name” is really an instance tag with the key “Name”. You can retrieve tags from the instance resource, and filter for Name tags.
tags = instance.tags or []
names = [tag.get('Value') for tag in tags if tag.get('Key') == 'Name']
name = names[0] if names else None
Because attributes are lazy-loaded, some invalid instance ids throw errors here
According to the boto3 documentation, resource attributes are lazy-loaded, meaning that the first API call is made when an attribute is first accessed. This means that while None or empty strings are validated when creating the ec2.Instance resource, non-empty string ids that are the right type but the wrong value are only validated here, by the first DescribeInstances call. To handle this, catch the botocore.exceptions exceptions from the previous section around this attribute access as well.
From the Open Guide To AWS section on EC2 gotchas and limitations:
❗If the EC2 API itself is a critical dependency of your infrastructure (e.g. for automated server replacement, custom scaling algorithms, etc.) and you are running at a large scale or making many EC2 API calls, make sure that you understand when they might fail (calls to it are rate limited and the limits are not published and subject to change) and code and test against that possibility.
The boto3 client loads information about an instance with the DescribeInstances API call. If, for example, you make this API call to retrieve the instance name every time you log an error, you could easily hit the DescribeInstances rate limit.
In addition to the error handling mentioned above, you will want to consolidate your calls to the AWS API to avoid hitting the unpublished AWS rate limits. Our solution was to fetch the instance name once at the startup of the API server, and cache the result in a global data structure. Instead of calling the EC2 API every time we need to log an error, we now call it only once when deploying new code to a machine.
Here is an example of what a final get_instance_name function could look like.