✨ an AWS experiment ✨
We started a big project a few months ago: migrating our application to GraphQL.
With 300+ React components and 20 API endpoints exposing many nested resources, the maintainability and performance of the application were getting worse every month.
Once this project was finished, we wanted to ensure that GraphQL fulfilled its promises, so we decided to track the migrated components’ “perceived performance”.
We wanted to watch the performance of our GraphQL Ruby API over time and avoid UX/performance regressions.
Considering that we already use a lot of external SaaS products (New Relic, Cloudinary, AWS, Segment, Auth0, …), instead of building a full ELK data stack on AWS, we decided to build a simple and low-cost performance “dashboard”.
Since our Segment account saves all raw data in an S3 bucket, we decided to leverage this data by sending a custom Performance.QueryLoadTime event.
Note: Segment is a SaaS analytics API that provides integrations with 200+ services like HubSpot, custom webhooks or AWS.
Of course, Segment could easily be replaced by AWS Kinesis Firehose and AWS API Gateway.
The front-end React application sends a custom Performance.QueryLoadTime event to Segment each time a specific GraphQL request ends. The event is then stored in S3 as raw JSON data files.
Then, every week, the data is fetched from S3 and consolidated, ready to be used in Google Sheets.
Let’s see how to make it work!
Apollo Client is a GraphQL client compatible with React.
For this, we need to write a custom link that intercepts some filtered queries and sends performance data to Segment.
You’ll see below a JasonBourne service. Don’t worry, this is just our Segment client wrapper.
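Here is a minimal sketch of what such a link can look like. The list of monitored operations and the JasonBourne.track signature are assumptions based on our setup; the tracked properties match the ones we query later in Athena.

```js
import { ApolloLink } from 'apollo-link';
// JasonBourne is our Segment client wrapper (the import path is an assumption)
import JasonBourne from './services/JasonBourne';

// Hypothetical list of the GraphQL operations we want to monitor
const MONITORED_OPERATIONS = ['Dashboard', 'ProjectList'];

export const perfMonitorLink = new ApolloLink((operation, forward) => {
  const startedAt = Date.now();

  return forward(operation).map(response => {
    const { operationName } = operation;

    if (MONITORED_OPERATIONS.includes(operationName)) {
      // Send the perceived load time of this query to Segment
      JasonBourne.track('Performance.QueryLoadTime', {
        operationName,
        ellapsedMs: Date.now() - startedAt,
      });
    }

    return response;
  });
});
```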
That’s all. All you have to do is add this link to your ApolloClient instance.
export const client = new ApolloClient({
  link: ApolloLink.from([perfMonitorLink, networkLink]),
  cache: apolloCache
});
You’re all set for the front part.
Athena is an Amazon service that allows you to run SQL queries against files stored in S3.
Configuring Athena is quite easy: you mainly need to create a database and a table that points to the Segment bucket on S3.
Here is the raw data structure that Segment stores to S3 (for the Performance.QueryLoadTime event):
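For reference, here is a trimmed-down sketch of a raw Segment track call (exact fields and values vary depending on your sources and integrations):

```json
{
  "anonymousId": "some-anonymous-id",
  "channel": "client",
  "context": {
    "library": { "name": "analytics.js", "version": "3.x" },
    "page": { "path": "/dashboard" },
    "userAgent": "Mozilla/5.0 ..."
  },
  "event": "Performance.QueryLoadTime",
  "messageId": "some-message-id",
  "properties": {
    "ellapsedMs": 512,
    "operationName": "Dashboard"
  },
  "receivedAt": "2018-01-01T10:12:31.000Z",
  "sentAt": "2018-01-01T10:12:30.000Z",
  "timestamp": "2018-01-01T10:12:30.000Z",
  "type": "track",
  "userId": "42"
}
```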
You’ll notice that many fields are not relevant for data analysis, so we’re only going to keep a subset.
Note: The “Create table” UI available in AWS Athena does not allow creating tables with complex data structures (nested fields, union type fields, etc.), so we need to write the create table query by hand ✍️
Athena uses the Apache Hive language to describe data; here is the wanted CREATE TABLE query:
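A simplified sketch of what it can look like; the column list is trimmed down and the bucket path is a placeholder:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS segment.perf (
  event string,
  sentAt string,
  properties struct<
    ellapsedMs: int,
    operationName: string
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-segment-bucket/segment-logs/';
```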
The query exposes the data format (JSON), the data structure, the source and the destination.
To generate it, we can use a wonderful tool called [hive-json-schema](https://github.com/quux00/hive-json-schema). This is an open source project (recommended by the official AWS documentation) that generates HIVE queries from raw JSON data.
In short, given an example JSON file, it generates the corresponding CREATE TABLE query.
Read the doc, it’s really simple (it takes 5 minutes maximum).
The table is now ready, so we can write a SELECT query. Here’s ours:
SELECT properties.ellapsedMs, properties.operationName, sentAt
FROM "segment"."perf"
WHERE event = 'Performance.QueryLoadTime'
ORDER BY sentAt DESC;
Since it’s plain SQL, you can select any of the fields available on this table.
Run the query and save it.
NB: each time a query is executed, Athena stores the CSV result on your S3. To know where (which bucket), go to Settings.
Athena does not offer a way to run a saved query periodically. However, we can use the AWS Athena API to run a query remotely. Next step: the Lambda.
AWS Lambda is a service that allows you to create serverless functions.
Serverless architectures refer to applications that significantly depend […] on custom code that’s run in ephemeral containers (Function as a Service or “FaaS”)[…] . By using these ideas, […], such architectures remove the need for the traditional ‘always on’ server system sitting behind an application. Depending on the circumstances, such systems can significantly reduce operational cost and complexity at a cost of vendor dependencies and (at the moment) immaturity of supporting services.
Martin Fowler — Serverless Architectures
We don’t want to go to the AWS Athena UI every week to run the saved query manually, so we need a Lambda that runs our query every week.
Here’s how to do so:
Go to “Create Function”, select “Node 6.10” and “Create a custom role”. You’ll be redirected to AWS IAM; click on “Allow”. Please write down the name given to the IAM role, it will be used later.
Then click on “Create Function”.
You’ll arrive at a page like this.
In order to run, a Lambda needs to be invoked. We want to invoke our Lambda on a weekly basis, like a CRON task.
AWS CloudWatch offers this feature.
On the left, select “CloudWatch Events”, then use this configuration:
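As a sketch, a rate-based schedule expression is enough (the rule name is just an example):

```
Rule name: weekly-athena-perf-query
Schedule expression: rate(7 days)
```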
The Lambda will be invoked by a CloudWatch event every 7 days.
This is the nifty part: you’ll need to go to IAM and find the role you just created.
Then ensure that the role has the following permissions:
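As a rough sketch, the policy attached to the role needs at least the following actions (the exact list and the resource scoping are assumptions; restrict the resources to your own buckets):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:GetNamedQuery",
        "athena:StartQueryExecution",
        "glue:GetDatabase",
        "glue:GetTable",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": "*"
    }
  ]
}
```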
This will allow the Lambda to run our query and store its results on S3.
NB: I highly encourage you to dig in the AWS IAM documentation in order to understand all the implications.
The Lambda must fetch our saved query and run it through the AWS Athena API, storing the results on S3.
NB: To get your Athena query ID, open the query from Athena “Saved Queries” and copy the id from the URL.
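Here is a minimal sketch of such a Lambda, written with the aws-sdk available in the Node.js runtime. The query ID and the output bucket are placeholders for your own values:

```js
// Minimal sketch: fetch our saved (named) query and start its execution on Athena.
const AWS = require('aws-sdk');

const athena = new AWS.Athena();

// Placeholders: your saved query id (see the NB above) and your output bucket
const ATHENA_QUERY_ID = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx';
const OUTPUT_LOCATION = 's3://your-perf-dashboard-bucket/athena-results/';

exports.handler = (event, context, callback) => {
  // 1. Retrieve the SQL of the saved query
  athena.getNamedQuery({ NamedQueryId: ATHENA_QUERY_ID }, (err, data) => {
    if (err) return callback(err);

    // 2. Run it, storing the CSV result at a known S3 location
    const params = {
      QueryString: data.NamedQuery.QueryString,
      QueryExecutionContext: { Database: data.NamedQuery.Database },
      ResultConfiguration: { OutputLocation: OUTPUT_LOCATION },
    };

    athena.startQueryExecution(params, (err, result) => {
      if (err) return callback(err);
      callback(null, result.QueryExecutionId);
    });
  });
};
```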
Now that the consolidated data is fresh and available, we need to import it into Google Sheets.
You’ll find the file in the bucket specified in the Lambda source code. Remember, we specified a custom output location in the AWS Athena start query call.
Copy the content of the file, and paste it in a new sheet.
In case you’re dealing with date fields, here’s a trick.
Google Sheets does not understand the ISO date format, so we need to “format” it. Here is the formula; apply it to a whole new column for each needed field.
= DATEVALUE(MID(C2,1,10)) + TIMEVALUE(MID(C2,12,8))
Here you’ll get a “MM/DD/YYYY HH:MM:SS” format ⤴️
Remember to change the column format to “Date time”
The consolidated CSV data is not yet usable. Since the data uses many dimensions (sentAt, operationName) for a single value (ellapsedMs), we need to build a “matrix/pivot table”.
For this, create a new sheet and go to “Data>Pivot table …”
Here is an example of pivot table configuration, it’s very “data-specific”.
The configuration of the chart is very personal and depends on what type of data you have; here’s an example with time-based performance data.
My favorite chart type for time-based performance data is the scatter chart ⤵️
Or the classic smooth line chart with average values instead of detailed values ⤵️
This helps us to see the trend and the maximum values at a glance ✨
Cons
This step is hard to automate because AWS Athena creates a unique .csv file in S3 each time a query ends. A workaround could be to do 2 things:
- copy the latest result to a fixed latest.csv file and make it publicly readable on S3
- use the IMPORTDATA() function in Google Spreadsheet
This way, the data will always be the freshest one!
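For example, assuming a hypothetical public bucket, the sheet could then load the data with:

```
=IMPORTDATA("https://s3.amazonaws.com/your-perf-dashboard-bucket/latest.csv")
```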
We may want to create many Athena queries (for example: BI or advanced cross-tool analytics reporting).
For this, we can update the Lambda function to run all or many of these queries.
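A minimal sketch of that change, reusing the athena client and OUTPUT_LOCATION constant from the Lambda sketch above; the query IDs are placeholders:

```js
// Hypothetical list of saved query ids to refresh every week
const ATHENA_QUERY_IDS = ['performance-query-id', 'bi-report-query-id'];

exports.handler = (event, context, callback) => {
  const executions = ATHENA_QUERY_IDS.map(id =>
    athena.getNamedQuery({ NamedQueryId: id }).promise()
      .then(({ NamedQuery }) =>
        athena.startQueryExecution({
          QueryString: NamedQuery.QueryString,
          QueryExecutionContext: { Database: NamedQuery.Database },
          ResultConfiguration: { OutputLocation: OUTPUT_LOCATION },
        }).promise()
      )
  );

  // Wait for every query to be started before ending the Lambda
  Promise.all(executions)
    .then(results => callback(null, results.map(r => r.QueryExecutionId)))
    .catch(callback);
};
```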
Of course, there are a lot of battle-tested and professional solutions. This blog post exposes a solution to a specific context, with particular financial and technological constraints. We could totally configure Datadog or pay for a Tableau licence, but it would not be that fun!
This is of course an experiment and temporary solution.
I hope you learnt some things about AWS or GraphQL. Please feel free to drop a comment if I missed anything!