When it comes to strategically managing your website’s traffic, the AWS Route53 weighted routing policy emerges as a powerful option, offering a controlled way to distribute traffic unevenly across resources:
It lets you associate multiple resources with a single domain name (example.com) or subdomain name (acme.example.com) and choose how much traffic is routed to each resource. This can be useful for a variety of purposes, including load balancing and testing new versions of software.
Imagine you’ve developed an exciting new version of your software, and you want to ensure its performance and scalability before rolling it out to all users. With Route53 weighted routing policy, you can gradually shift a portion of the production traffic to the new version, effectively “baking” it while keeping the majority of users on the stable version. This controlled transition minimizes risks and ensures a smooth user experience throughout the process.
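Conceptually, a weighted record set behaves like weighted random selection: each answer is returned with probability proportional to its weight. Here is a minimal Python sketch of that behavior (the record names and weights are hypothetical, chosen only to illustrate an 80/20 split):

```python
import random

# Hypothetical weighted records: each has a set identifier and a
# weight between 0 and 255, as in Route53.
records = {
    "new-version": 200,
    "old-version": 50,
}

def resolve(records):
    """Pick one record with probability weight / sum(weights),
    mirroring how weighted records split answers."""
    names = list(records)
    return random.choices(names, weights=[records[n] for n in names], k=1)[0]

# Over many resolutions the split approaches 200:50, i.e. about 80%/20%.
counts = {name: 0 for name in records}
for _ in range(10_000):
    counts[resolve(records)] += 1
print(counts)
```

Note that this models the name server’s behavior only; as discussed later, caching between the client and Route53 means real-world traffic can deviate from these proportions.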
Step 1: Open the Route53 section of the AWS console and navigate to the hosted zones page to see the hosted zones associated with your account (there is no region to select, since Route53 is a global service). Then click on “create hosted zone”.
Step 2: On this page, enter the domain name you want to use for the weighted records. Here I configure "beta.devgrowthdemo.com", which will split traffic between an older version and a newer version of a service using weighted DNS records.
Step 3: After creating the new hosted zone, NS and SOA records are automatically generated for you. Typically, no changes are required for these record types, so you can proceed by clicking on “create record”.
Step 4: When creating a new record, you can change its routing policy as shown below. Once you select weighted, more options appear, such as the weight, which must be a number between 0 and 255. Here I create a weighted record called mynewserviceaddress.devgrowthdemo.com with a weight of 200, and I will point another record at the old version of the service with a weight of 55, giving the new version roughly an 80% share. Note that you can also select the record type: for example, an A record, a CNAME record, or an alias record pointing to an AWS resource such as an ELB.
Result: we now have two weighted records behind beta.devgrowthdemo.com, one with a weight of 200 and the other with 55. In theory, traffic hitting this domain will be distributed between these two records according to the configured weights.
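For a quick sanity check on the numbers: each record’s expected share of traffic is its weight divided by the sum of all weights in the set. A small Python calculation using the weights from this walkthrough makes the “roughly 80%” figure concrete:

```python
# Expected traffic share for a weighted record = weight / sum of all weights.
weights = {"new version": 200, "old version": 55}
total = sum(weights.values())  # 255
shares = {name: w / total for name, w in weights.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# 200/255 is about 78.4%, which is why a weight of 200 against 55
# gives the new version "roughly an 80%" share.
```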
While setting up Route53 weighted routing policy may seem straightforward, there are some important caveats to consider before fully embracing this feature:
1. DNS is heavily cached → DNS queries may not reach the Route53 name server level as expected
DNS responses are heavily cached at various steps in the resolution path, which can pose a problem when working with Route53 weighted records. When a DNS query is initiated, it doesn’t always reach Route53 directly; instead, it passes through intermediate resolvers that cache the answers. This can lead to discrepancies in traffic distribution, especially if you have a limited number of clients residing in a single AWS region.
The actual traffic ratio may deviate from your intended configuration due to the influence of caching DNS resolvers. In extreme cases where all clients are concentrated in one VPC and one AZ, you might find only one of the records receiving traffic. There are mitigation measures, such as lowering the TTL value, but these come with trade-offs: increased Route53 costs, higher latency, and reduced protection from DNS outages.
However, if your clients are distributed, as with consumer browsers or mobile apps, you should observe traffic ratios that align with the Route53 configuration. It’s therefore essential to evaluate your specific use case before relying solely on Route53, especially if precise weight ratio control is crucial.
2. The weight distribution is hard to observe at small scale → Not easy to test locally
Weighted records don’t show the behavior you expect at a small scale, such as local testing, because they function statistically rather than as a precise, active load-balancing mechanism. When testing with a script on your local machine, the DNS query will resolve to one weighted record and remain constant throughout the test run. Only by generating a large number of queries from multiple hosts, over a period longer than the TTL, will you start to see the configured distribution reflected.
The ideal method for testing the policy is to query the Route53 name server associated with the hosted zone directly, using a command like dig A example.com @nameserver. A substantial number of requests, such as 10,000, is necessary before the configured distribution becomes apparent.
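One way to script that check is to shell out to dig against the zone’s name server and tally the answers. A Python sketch follows; the domain and name server are placeholders, and the live loop is commented out since it needs network access, so the tally logic is demonstrated on stand-in data:

```python
import subprocess
from collections import Counter

def query_answer(name, nameserver):
    """Run `dig +short A <name> @<nameserver>` and return the first answer
    line. Querying the hosted zone's own name server bypasses the
    intermediate caching resolvers described above."""
    out = subprocess.run(
        ["dig", "+short", "A", name, f"@{nameserver}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0] if out else None

def tally(answers):
    """Count how often each answer was returned across many queries."""
    return Counter(a for a in answers if a)

# In practice (requires network access and your real zone's name server):
# answers = (query_answer("beta.devgrowthdemo.com", "ns-xxx.awsdns-xx.com")
#            for _ in range(10_000))
# print(tally(answers))

# The tally logic itself, shown on stand-in answers:
print(tally(["1.2.3.4", "1.2.3.4", "5.6.7.8", "1.2.3.4"]))
```

With enough queries, the resulting counts should approximate the configured weight ratio.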
3. Updating the TTL value may not be easy with CDK & changing the TTL value can take a long time to become effective
In the previous section, we saw that setting a record’s TTL with the AWS console is easy, and changing it later is easy too: if you change the TTL for one record, the TTLs of the other related weighted records are automatically updated to the same value.
However, if you use CDK to update the TTL of a weighted record, you may see an error like “RRSet with DNS name xxxx, SetIdentifier live cannot be created as weighted sets must contain the same TTL”. There is a workaround: remove one of the records, update the TTL, then recreate both. This may cause downtime, so if you want to avoid it, you can always fall back to the AWS console, where the TTL update is applied to all weighted records simultaneously.
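If you’d rather script the change than click through the console, another option outside CDK is the Route53 API itself: a change_resource_record_sets call applies its whole ChangeBatch atomically, so both weighted records can be UPSERTed with the new TTL in one request, avoiding the delete/recreate downtime. Below is a hedged boto3-style sketch; the domain, IPs, and set identifiers are placeholders:

```python
def build_ttl_update_batch(name, new_ttl, records):
    """Build a Route53 ChangeBatch that UPSERTs every weighted A record
    under `name` with the same new TTL.
    records: list of (set_identifier, weight, ip) tuples describing the
    existing weighted records."""
    return {
        "Comment": f"Update TTL to {new_ttl} on all weighted records",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "SetIdentifier": set_id,
                    "Weight": weight,
                    "TTL": new_ttl,
                    "ResourceRecords": [{"Value": ip}],
                },
            }
            for set_id, weight, ip in records
        ],
    }

# Placeholder values for illustration:
batch = build_ttl_update_batch(
    "beta.devgrowthdemo.com.", 60,
    [("new", 200, "1.2.3.4"), ("old", 55, "5.6.7.8")],
)
# To apply (needs AWS credentials and your real hosted zone ID):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z...", ChangeBatch=batch)
```

Because the batch is applied as a single atomic change, both records end up with matching TTLs, which is effectively what the console does for you.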
Before deciding how to handle a TTL update, keep in mind that:
Not all recursive DNS resolvers strictly follow TTL values. You cannot force them to update their caches; some resolvers don’t respect TTLs at all and simply implement whatever caching policy they wish. It may therefore take considerable time, from one to several days, for a TTL update to become fully effective. You may be tempted to set the TTL to a very low value such as 60 seconds before making DNS changes, but in practice, the common approach is to lower the TTL to 300 seconds before the change and, once it is done, raise it back to 3600 seconds (1 hour) or 86,400 seconds (24 hours).
Hopefully, this helps you make better decisions to choose the right tech stack for traffic management 😀!