Charly Poly

@wittydeveloper

Build performance reports for your Apollo React Components in 30 minutes. ⏲️

✨ an AWS experiment ✨

Motivation

We started a big project a few months ago: migrating our application to GraphQL.

With 300+ React components and 20 API endpoints with many nested resources, the maintainability and performance of the application were getting worse every month.

Once this project was finished, we wanted to ensure that GraphQL fulfilled its promises, so we decided to track the “perceived performance” of migrated components.

We wanted to watch the performance of our GraphQL Ruby API over time and avoid UX/performance regressions.

Considering that we already use a lot of external SaaS (NewRelic, Cloudinary, AWS, Segment, OAuth0, …), we decided to build a simple and low-cost performance “dashboard” instead of a full ELK data stack on AWS.

Since our Segment account saves all raw data in an S3 bucket, we decided to use this data by sending a custom event, Performance.QueryLoadTime.

Note: Segment is a SaaS analytics API that provides integrations with 200+ services like Hubspot, custom webhooks or AWS.

The Architecture

Of course, Segment can easily be replaced by AWS Kinesis Firehose and AWS API Gateway.

The front-end React application sends a custom Performance.QueryLoadTime event to Segment each time a specific GraphQL request ends.
The event is then stored in S3 as raw JSON data files.

Then, every week, the data is fetched from S3 and consolidated, ready to be used in Google Sheets.

Let’s see how to make it work!

Send performance data — setup Apollo with Segment

Apollo Data is a GraphQL client compatible with React

For this, we need to write a custom link that intercepts a filtered set of queries and sends performance data to Segment.

Our link relies on a JasonBourne service.
Don’t worry, this is just our Segment client wrapper.
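As an illustration, here is a minimal sketch of what such a link handler could look like. It is not our exact code: `createPerfMonitorHandler`, `trackedOperations` and the `track` callback are hypothetical names, and `track` stands in for the Segment wrapper, whose API is not shown here. The handler has the `(operation, forward)` shape that the apollo-link package expects.

```javascript
// Sketch of a performance-monitoring link handler.
// `track` is a hypothetical stand-in for the Segment client wrapper;
// `now` is injectable so the handler can be exercised with a fake clock.
function createPerfMonitorHandler({ track, trackedOperations, now = () => Date.now() }) {
  return function perfMonitorHandler(operation, forward) {
    // Let untracked operations pass through untouched.
    if (!trackedOperations.includes(operation.operationName)) {
      return forward(operation);
    }
    const startedAt = now();
    // `forward` returns an Observable; `map` runs when a result arrives.
    return forward(operation).map((result) => {
      track('Performance.QueryLoadTime', {
        operationName: operation.operationName,
        ellapsedMs: now() - startedAt, // spelling matches the Athena SELECT query
      });
      return result;
    });
  };
}

// Assumed wiring with the apollo-link package:
//   const perfMonitorLink = new ApolloLink(createPerfMonitorHandler({
//     track: (name, props) => { /* call your Segment wrapper here */ },
//     trackedOperations: ['GetUser'],
//   }));
```

Filtering on `operationName` keeps the volume of events low: only the queries you care about are measured.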

That’s all.
All you have to do is add this link to your ApolloClient instance.

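// Imports assume the apollo-client and apollo-link packages (Apollo Client 2.x).
import { ApolloClient } from 'apollo-client';
import { ApolloLink } from 'apollo-link';
// perfMonitorLink, networkLink and apolloCache are defined elsewhere in the app.
export const client = new ApolloClient({
  link: ApolloLink.from([perfMonitorLink, networkLink]),
  cache: apolloCache
});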

You’re all set for the front-end part.

Build consolidated data — Setup AWS Athena

Athena is an Amazon service that lets you run SQL queries against files stored in S3.

Configuring Athena is quite easy. You need to:

  1. create a database
  2. create a table by specifying a source and the data structure
  3. create a named query and save it

The data

Here is the raw data structure that Segment stores in S3
(for the Performance.QueryLoadTime event)

You’ll notice that many fields are not relevant for data analysis, so we’re only going to keep a subset.
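To make the idea of a “subset” concrete, here is a small sketch. The event below is a hypothetical, abridged example with made-up values (the field names follow the Segment track call format), and `toPerfRecord` is an illustrative helper, not part of our pipeline.

```javascript
// Hypothetical, abridged Segment "track" event; all values are made up.
const rawEvent = {
  type: 'track',
  event: 'Performance.QueryLoadTime',
  messageId: 'example-message-id',
  anonymousId: 'example-anonymous-id',
  context: { library: { name: 'analytics.js' } },
  properties: { operationName: 'GetUser', ellapsedMs: 250 },
  sentAt: '2018-03-12T09:30:15.123Z',
};

// Keep only the fields needed for the performance report.
function toPerfRecord(event) {
  return {
    event: event.event,
    operationName: event.properties.operationName,
    ellapsedMs: event.properties.ellapsedMs,
    sentAt: event.sentAt,
  };
}

console.log(toPerfRecord(rawEvent));
```

In practice this projection is done by Athena itself: only the selected fields end up in the consolidated CSV.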

Create a database and a table

Note: The “Create table” UI available on AWS Athena does not allow creating a table with a complex data structure (nested fields, union type fields, etc.), so we need to write a CREATE TABLE query by hand ✍️

Athena uses Apache Hive DDL to describe data. Here is the CREATE TABLE query we want:

The query declares the data format (JSON), the data structure, the source and the destination.

To write it, we can use a wonderful tool called hive-json-schema.
This open source project, recommended by the official AWS documentation, generates Hive queries from raw JSON data.

In short, given a JSON example file, it generates the corresponding CREATE TABLE query.

Read the doc, it’s really simple (it takes 5 minutes at most).
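To give a feel for what this kind of generation involves, here is a simplified sketch that infers Hive column types from a JSON sample and assembles a CREATE TABLE statement. It is not the hive-json-schema tool itself: `hiveType` and `createTableQuery` are hypothetical helpers, the S3 location is a placeholder, and the SerDe shown (the OpenX JSON SerDe, which Athena supports) is an assumption about the setup.

```javascript
// Map a sample JSON value to a Hive column type (simplified).
function hiveType(value) {
  if (Array.isArray(value)) {
    // Assume arrays are homogeneous; fall back to string when empty.
    return 'array<' + (value.length ? hiveType(value[0]) : 'string') + '>';
  }
  if (value !== null && typeof value === 'object') {
    const fields = Object.keys(value).map((key) => key + ':' + hiveType(value[key]));
    return 'struct<' + fields.join(',') + '>';
  }
  if (typeof value === 'number') return Number.isInteger(value) ? 'int' : 'double';
  if (typeof value === 'boolean') return 'boolean';
  return 'string'; // strings, nulls and anything unknown
}

// Build a CREATE TABLE statement for JSON files stored under `location`.
function createTableQuery(tableName, sample, location) {
  const columns = Object.keys(sample)
    .map((key) => '  `' + key + '` ' + hiveType(sample[key]))
    .join(',\n');
  return [
    'CREATE EXTERNAL TABLE IF NOT EXISTS ' + tableName + ' (',
    columns,
    ')',
    "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'",
    "LOCATION '" + location + "'",
  ].join('\n');
}

// Hypothetical sample and bucket, for illustration only.
console.log(createTableQuery(
  'segment.perf',
  { event: 'Performance.QueryLoadTime', properties: { ellapsedMs: 250, operationName: 'GetUser' }, sentAt: '2018-03-12T09:30:15.123Z' },
  's3://example-bucket/segment-logs/'
));
```

Dates are kept as strings here, which is why the spreadsheet step later has to convert them.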

Write and save a SELECT Query

The table is now ready, we can write a SELECT query.
Here’s ours:

SELECT
  properties.ellapsedMs, properties.operationName, sentAt
FROM "segment"."perf"
WHERE event = 'Performance.QueryLoadTime'
ORDER BY sentAt DESC;

Since it’s plain SQL, you can select any of the fields available on this table.

Run the query and save it.

NB: each time a query is executed, Athena stores the CSV result in your S3.
To find out where (which bucket), go to Settings.

Problem

Athena does not offer a way to run a saved query periodically.
However, we can use the AWS Athena API to run a query remotely.
Next step: the lambda.

Keep data fresh — Setup a periodic AWS Lambda

AWS Lambda is a service that lets you create serverless functions.

Serverless architectures refer to applications that significantly depend […] on custom code that’s run in ephemeral containers (Function as a Service or “FaaS”)[…] . By using these ideas, […], such architectures remove the need for the traditional ‘always on’ server system sitting behind an application. Depending on the circumstances, such systems can significantly reduce operational cost and complexity at a cost of vendor dependencies and (at the moment) immaturity of supporting services.

Martin Fowler — Serverless Architectures

We don’t want to go to the AWS Athena UI every week to run the saved query manually, so we need a Lambda that runs our query every week.

Here’s how to do so:

  1. Create a periodic lambda
  2. Ensure the lambda has sufficient rights to call Athena and store results in S3

Create a “periodic lambda”

Go to “Create Function” and select “Node 6.10” and “Create a custom role”.
You’ll be redirected to AWS IAM; click on “Allow”.
Please write down the name given to the IAM role, as it will be used later.

Then click on “Create Function”.

Select a trigger

You’ll arrive on the function’s configuration page.

In order to run, a lambda needs to be invoked.
We want to invoke our lambda on a weekly basis, like a CRON task.

AWS CloudWatch offers this feature.

On the left, select “CloudWatch Events”, then use this configuration:

The lambda will be invoked by a CloudWatch event every 7 days.

The lambda IAM role

This is the nifty part: you’ll need to go to IAM and find the role you just created.

Then ensure that the role has the following permissions:

  • Athena: GetNamedQuery, StartQueryExecution
  • S3: ListBucket, CreateBucket, PutObject, ListAllMyBuckets

This will allow the lambda to find and run a query and store its results in S3.
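For reference, the permissions listed above could be expressed as an IAM policy document along these lines. This is a sketch, not our actual policy: `Resource: '*'` is a placeholder assumption that you should narrow down to your own buckets and Athena resources, and your setup may need additional permissions.

```javascript
// Policy document granting the actions listed above.
// Resource '*' is a placeholder assumption; scope it down in a real setup.
const lambdaPolicy = {
  Version: '2012-10-17',
  Statement: [
    {
      Effect: 'Allow',
      Action: ['athena:GetNamedQuery', 'athena:StartQueryExecution'],
      Resource: '*',
    },
    {
      Effect: 'Allow',
      Action: ['s3:ListBucket', 's3:CreateBucket', 's3:PutObject', 's3:ListAllMyBuckets'],
      Resource: '*',
    },
  ],
};

// JSON that can be pasted into the IAM policy editor.
console.log(JSON.stringify(lambdaPolicy, null, 2));
```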

NB: I highly encourage you to dig into the AWS IAM documentation in order to understand all the implications.

The lambda code

The lambda must:

  1. Get the named query by id
  2. Start the query
  3. Terminate lambda execution

NB: To get your Athena query ID, open the query from Athena “Saved Queries” and copy the ID from the URL.
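The three steps above can be sketched as follows. This is a minimal sketch rather than our exact Lambda: it assumes the AWS SDK for JavaScript v2 (`getNamedQuery` and `startQueryExecution` on the Athena client), sticks to syntax available on Node 6.10, and the names `buildStartQueryParams`, `createHandler`, `namedQueryId` and `outputLocation` are hypothetical. The Athena client is injected so the logic can be tried without AWS credentials.

```javascript
// Pure helper: turn a named query into StartQueryExecution parameters.
function buildStartQueryParams(namedQuery, outputLocation) {
  return {
    QueryString: namedQuery.QueryString,
    QueryExecutionContext: { Database: namedQuery.Database },
    ResultConfiguration: { OutputLocation: outputLocation },
  };
}

// Build the Lambda handler around an injected Athena client.
function createHandler(athena, config) {
  return function handler(event, context, callback) {
    athena
      .getNamedQuery({ NamedQueryId: config.namedQueryId })
      .promise()
      // 1. the named query holds the SQL and the database name
      .then((res) =>
        athena
          .startQueryExecution(buildStartQueryParams(res.NamedQuery, config.outputLocation))
          .promise()
      )
      // 2. the query now runs on its own; 3. end the Lambda execution
      .then((res) => callback(null, res.QueryExecutionId))
      .catch(callback);
  };
}

// Assumed wiring inside the Lambda (placeholders left for you to fill in):
//   const AWS = require('aws-sdk');
//   exports.handler = createHandler(new AWS.Athena(), {
//     namedQueryId: '<your-named-query-id>',
//     outputLocation: 's3://<your-results-bucket>/<prefix>/',
//   });
```

The Lambda does not wait for the query to finish: Athena writes the CSV to the output location by itself, so the function stays short-lived and cheap.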

Displaying data — Configure Google Spreadsheet

Now that the consolidated data is fresh and available, we need to:

  1. get the consolidated data
  2. format date field — if any
  3. build a pivot table
  4. build a chart

Copy the content of the last CSV file created by AWS Athena

You’ll find the file in the bucket specified in the Lambda source code.
Remember, we specified a custom output location in the AWS Athena start query call.

Copy the content of the file, and paste it into a new sheet.

Format data

In case you’re dealing with date fields, here’s a trick.

Google Sheets does not understand the ISO date format, so we need to convert it.
Here is the formula; apply it in a new column for each date field you need.

= DATEVALUE(MID(C2,1,10)) + TIMEVALUE(MID(C2,12,8))

Here you’ll get a “MM/DD/YYYY HH:MM:SS” format ⤴️

Remember to change the column format to “Date time”.
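If the MID() offsets look cryptic, here is the same slicing written in JavaScript. The timestamp is a made-up example and `splitIsoDate` is a hypothetical helper; it only shows which parts of the ISO string the formula feeds to DATEVALUE and TIMEVALUE.

```javascript
// Mirror of the spreadsheet formula: MID is 1-based, slice is 0-based.
function splitIsoDate(iso) {
  return {
    datePart: iso.slice(0, 10),  // MID(C2, 1, 10)
    timePart: iso.slice(11, 19), // MID(C2, 12, 8)
  };
}

console.log(splitIsoDate('2018-03-12T09:30:15.123Z'));
// { datePart: '2018-03-12', timePart: '09:30:15' }
```

The “T” separator and the milliseconds are simply skipped.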

Build the “pivot table”

The consolidated CSV data is not yet usable.
Since the data has several dimensions (sentAt, operationName) for a single value (ellapsedMs), we need to build a “matrix/pivot table”.

For this, create a new sheet and go to “Data>Pivot table …”

Here is an example of a pivot table configuration; it’s very “data-specific”.
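To picture what the pivot table computes, here is the same idea in a few lines of JavaScript. This is only an illustration under assumptions: rows are grouped by day, columns are operation names, and each cell is an average; `pivotAverages` is a hypothetical helper and your own pivot may use other dimensions.

```javascript
// Pivot rows into { day: { operationName: average ellapsedMs } }.
function pivotAverages(rows) {
  const cells = {}; // day -> operationName -> { sum, count }
  rows.forEach((row) => {
    const day = row.sentAt.slice(0, 10);
    cells[day] = cells[day] || {};
    const cell = cells[day][row.operationName] || { sum: 0, count: 0 };
    cell.sum += row.ellapsedMs;
    cell.count += 1;
    cells[day][row.operationName] = cell;
  });
  const pivot = {};
  Object.keys(cells).forEach((day) => {
    pivot[day] = {};
    Object.keys(cells[day]).forEach((op) => {
      pivot[day][op] = cells[day][op].sum / cells[day][op].count;
    });
  });
  return pivot;
}

// Made-up rows shaped like the consolidated CSV.
console.log(pivotAverages([
  { sentAt: '2018-03-12T09:30:15.123Z', operationName: 'GetUser', ellapsedMs: 200 },
  { sentAt: '2018-03-12T10:00:00.000Z', operationName: 'GetUser', ellapsedMs: 300 },
  { sentAt: '2018-03-13T08:00:00.000Z', operationName: 'GetPosts', ellapsedMs: 120 },
]));
```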

Configure the chart

The configuration of the chart is very personal and depends on the type of data you have. Here’s an example for time-based performance data.

My favorite chart type for time-based performance data is the Scatter chart ⤵️

Or the classic Smooth line chart with average values
instead of detailed values ⤵️

This helps us see the trend and maximum values at a glance ✨

Conclusion

Pros

  • only pay for Lambda execution and Athena query once a week
  • very flexible configuration
  • geeky 🤓

Cons

  • Solutions already exist: ELK, AWS Quicksight, Tableau
  • Copy-pasting data every week
  • not configurable by non-technical people

Future and improvements 📈

CSV file data pasting step

This step is hard to automate because AWS Athena creates a uniquely named .csv file in S3 each time a query ends.
A workaround could be to do 2 things:

  1. Update the lambda to save the result of the query in a file named latest.csv and make it publicly readable on S3.
  2. Then use the awesome IMPORTDATA() function in Google Sheets.

This way, the data will always be the freshest!
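As a rough sketch of the first point, the result file could be copied to a stable key with the S3 `copyObject` call of the AWS SDK for JavaScript v2. This is untested in our setup and rests on assumptions: `buildLatestCopyParams` is a hypothetical helper, the bucket and prefix are placeholders, the result file is assumed to be named after the query execution ID, and the copy can only happen once the query has finished (so the Lambda would have to wait, or a second invocation would do the copy). The role would also need permission to read the result object and set its ACL.

```javascript
// Build S3 copyObject parameters that duplicate an Athena result file
// under a stable, publicly readable key.
function buildLatestCopyParams(bucket, prefix, queryExecutionId) {
  return {
    Bucket: bucket,
    CopySource: bucket + '/' + prefix + queryExecutionId + '.csv',
    Key: prefix + 'latest.csv',
    ACL: 'public-read',
  };
}

// Assumed usage, once the query has finished:
//   new AWS.S3().copyObject(buildLatestCopyParams(bucket, prefix, id)).promise();
console.log(buildLatestCopyParams('example-results-bucket', 'perf/', 'example-execution-id'));
```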

Support for many queries

We may want to create many Athena queries (examples: BI or advanced cross analytics-tools reporting).

For this, we can update the Lambda function to run all or several of the queries.

What about Tableau, AWS Quicksight or Datadog?

Of course, there are a lot of battle-tested and professional solutions.
This blog post presents a solution for a specific context, with particular financial and technological constraints.
We could totally configure Datadog or pay for a Tableau licence, but it would not be as fun!

This is of course an experiment and a temporary solution.

Thanks for reading! 🌞

I hope you learnt a few things about AWS or GraphQL.
Please feel free to drop a comment if I missed anything!
