by Luke Gotszling
At intermix.io, we use Amazon Redshift as part of our stack. Amazon Redshift is an OLAP database, and a valuable tool for data teams due to its low cost and speed for analytical queries. We have a particular use case though. We’re using Amazon Redshift in an OLTP scenario, i.e. we’ve built an analytical application on top of Redshift.
The challenge of using Redshift as an OLTP database is that queries can lack the low latency that a traditional RDBMS provides for transactional workloads. There are also features of a traditional RDBMS that would often be useful when querying Redshift (e.g. indices) but that Redshift does not offer.
In this post we detail a method of using Amazon Redshift in an OLTP scenario. We tell the story of how we off-loaded OLTP workloads from Amazon Redshift to Amazon RDS, without writing any complex ETL or changing our data processing strategy.
intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their critical data flows.
Image 1: Simplified architecture of Amazon Redshift in an OLTP scenario
There are three steps for acquiring the data we need to provide our analytics to our customers:
1. We extract the raw data from our customers' systems.
2. We move the data into the intermix.io S3 environment.
3. From S3, we copy data into Redshift.
We use Amazon Redshift for:
In the early stage of intermix.io, our dashboard pages would pull data from Redshift directly. But as we added more and more customers, we realized the dashboard pages were not performing well. This was due to two factors:
We did follow best practices in managing our Redshift cluster, including the proper set-up of our workload management (WLM). But we also knew that using Redshift for this OLTP use case — serving data to the dashboard — was not going to work long term. So we started looking for solutions.
Our ideal solution would:
Amazon Redshift is a column-oriented, petabyte-scale data warehouse. It is an OLAP database suitable for analytical queries and applications. It uses a modified version of PostgreSQL 8.0.2.
The solution was to use Amazon RDS as a ‘cache’ by leveraging the Dblink feature of RDS. This allows us to make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. Dblink handles moving the data at the block level.
This solution had a bunch of benefits:
Before we go over the installation instructions, we’ll cover some of the features that using Dblink with Redshift provides.
The main features presented in this post require a Postgres instance with Postgres version >= 9.4 (some features listed in this section require more recent versions of Postgres). We use an RDS-hosted version of Postgres but other types of instances will work as long as they meet the minimum requirements.
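If you want to confirm the version your Postgres instance is running before proceeding, a quick check:

SHOW server_version;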
1. Run procedures on your data in other languages (PL/pgSQL, PL/V8, PL/Perl, PL/Tcl), for example querying your data using JavaScript
2. Use full text search
3. Standardize mailing addresses (address_standardizer and address_standardizer_data_us extensions)
4. Create a tree structure out of your data using ltree
5. Take advantage of PostGIS features when interacting with geospatial data
6. Run Oracle functions via the orafce extension
Now on to the installation. On the Redshift side, run the following command to grant read access to the user that Dblink will connect as:

GRANT SELECT ON ALL TABLES IN SCHEMA data TO <amazon_redshift_username>;
Enable dblink and postgres_fdw in the Postgres RDS instance (only needs to be done once per database).
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;
Create the server link and user mapping.
CREATE SERVER redshift_server FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '<amazon_redshift_ip>', port '<port>', dbname '<database_name>', sslmode 'require');

CREATE USER MAPPING FOR <rds_postgresql_username> SERVER redshift_server
OPTIONS (user '<amazon_redshift_username>', password '<password>');
That’s it, now you’re ready to query your Redshift data in RDS!
This approach allows you to query Redshift from inside a Postgres instance. It bridges the data but doesn't confer any performance benefits by itself. For example, to get a list of users that were updated this year:
SELECT * FROM dblink('redshift_server', $REDSHIFT$
    SELECT id FROM users WHERE updated_on >= '2018-01-01'
$REDSHIFT$) AS t1 (id int);
To create a queryable cache of the Redshift data, we create a materialized view. You can also select a subset of rows and a subset of columns if you don’t require the full table.
CREATE MATERIALIZED VIEW mv_users AS
SELECT * FROM dblink('redshift_server', $REDSHIFT$
    SELECT id, updated_on FROM users
$REDSHIFT$) AS t1 (id int, updated_on timestamp);
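Once created, the materialized view can be queried like any local table. For example, a quick count against the cached data:

SELECT count(*) FROM mv_users WHERE updated_on >= '2018-01-01';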
A materialized view will not self-update. So we need an approach to refresh the data. If you only wish to execute live queries (no materialized views) then you can ignore this section.
Refresh approach 1 (non-concurrent): easier and faster, but concurrent reads are blocked.
REFRESH MATERIALIZED VIEW mv_users;
This will block concurrent reads, so if you need to be able to always read from the view follow the next approach.
Refresh approach 2 (concurrent): non-blocking refreshes (this requires a unique index on the materialized view).
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_users;
For example, after creating the view:
CREATE UNIQUE INDEX mv_user_id_updated_on_idx ON mv_users (id, updated_on);
Periodic refresh: if the data you retrieve takes more than 5 minutes to transfer, or you don't mind refreshing synchronously, you can issue the REFRESH commands above inside your code or in a periodic task as needed.
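One possible sketch of such a periodic task, assuming the pg_cron extension is available on your Postgres instance (an assumption, not a requirement of this setup):

-- Assumes pg_cron is installed; schedules a concurrent refresh
-- of the cached view every 10 minutes
CREATE EXTENSION pg_cron;
SELECT cron.schedule('*/10 * * * *', 'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_users');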
Refresh Lambdas: if retrieving the data from Redshift takes fewer than 5 minutes, AWS Lambda is a good approach to use here. Be mindful of concurrency; for example, it's bad practice to issue multiple refreshes on a single table while another refresh is still running on that same table. We'll have more about how we approach this in a subsequent blog post.
Incremental refreshes: the refresh methods above query the full set of data each time (even if nothing has changed). If you don't have updates on your data, you may want to use an incremental approach like the one sketched below instead.
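Here's a minimal sketch of that approach, assuming the Redshift users table is append-only. We use a plain table rather than a materialized view, since Postgres materialized views can't be inserted into; cache_users is a name we've made up for illustration:

-- One-time: materialize the current snapshot into a plain table
CREATE TABLE cache_users AS
SELECT * FROM dblink('redshift_server', $REDSHIFT$
    SELECT id, updated_on FROM users
$REDSHIFT$) AS t1 (id int, updated_on timestamp);

-- Periodic: pull only the rows newer than what is already cached
INSERT INTO cache_users
SELECT * FROM dblink('redshift_server', format($REDSHIFT$
    SELECT id, updated_on FROM users WHERE updated_on > %L
$REDSHIFT$, (SELECT max(updated_on) FROM cache_users)))
AS t1 (id int, updated_on timestamp);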
Dblink ingestion incurs regular data transfer costs, so if you don't want to be charged for the network traffic of synchronizing the data, place your Postgres RDS instance in the same AZ as your Redshift leader node.
You need to ensure that the permissions of users querying data via Dblink match those of the same users querying directly on Redshift, so that querying via Dblink doesn't grant any additional access.
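For example, a sketch of locking the cached view down so that only a role with the equivalent Redshift access can read it (reporting_role is a hypothetical role name):

-- Hypothetical example: restrict reads on the cached view to a role
-- that already has SELECT on the underlying Redshift table
REVOKE ALL ON mv_users FROM PUBLIC;
GRANT SELECT ON mv_users TO reporting_role;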
In the AWS console, keep an eye on the following Postgres RDS metrics to ensure that the instance isn’t overwhelmed:
And then of course we use our own intermix.io dashboard to monitor the performance of our Amazon Redshift clusters and our data pipelines. A few select key metrics we pay attention to are:
Image 2: intermix.io dashboard — query memory footprint
For example, in image 2 you can see a distribution of all queries for a certain time interval. We can see how much memory each query consumes, if it’s disk-based or not, and what the total query return time (execution time + queue time) is.
The total on-disk size limit for Postgres on RDS is 16TB, so the data you link cannot be greater than that (such a transfer would take at least 3.6 hours on a 10Gbit connection). The total table size on Redshift can be greater than 16TB, but your materialized view query cannot select a combination of rows, columns, and index overhead that exceeds 16TB.
A current limitation of DNS resolution on RDS is that lookups performed inside of Dblink don't resolve to private IPs. Hence, if you're running in a VPC without a publicly routable Redshift cluster, you'll need to set up your own DNS to resolve to the private IP, or hardcode the Redshift leader node's private IP when provisioning Dblink.
With the approach in this post, we showed you how it’s possible to use Amazon Redshift in an OLTP scenario. By using Amazon RDS as a ‘cache’ and leveraging its Dblink feature, we can make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over.
So if you're using Amazon Redshift in an OLTP scenario for building data apps, and you want to monitor the performance of those data apps, schedule a call with us. We'd love to look at your set-up, and help you with a free trial of intermix.io to tune your Amazon Redshift cluster.
Originally published at www.intermix.io on July 5, 2018.