by Luke Gotszling
At intermix.io, we use Amazon Redshift as part of our stack. Amazon Redshift is an OLAP database, and a valuable tool for data teams due to its low cost and speed for analytical queries. We have a particular use case though. We’re using Amazon Redshift in an OLTP scenario, i.e. we’ve built an analytical application on top of Redshift.
The challenge of using Redshift as an OLTP is that queries can lack the low-latency that would exist on a traditional RDBMS and transactional queries. Also, there are features of a traditional RDBMS that would often be useful when querying Redshift (e.g. indices).
In this post we detail a method of using Amazon Redshift as an OLTP. We will tell the story of how we off-load OLTP workloads from Amazon Redshift to Amazon RDS. We were able to do this without writing any complex ETL or changing our data processing strategy.
How We Use Amazon Redshift
intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their critical data flows.
Image 1: Simplified architecture of Amazon Redshift in an OLTP scenario
There are three steps for acquiring the data we need to provide our analytics to our customers.
- We extract raw event data from the customer databases via the intermix.io Collector. The Collector runs as a Docker container in the customer’s environment.
2. We move the data into the intermix.io S3 environment.
3. From S3, we copy data into Redshift.
We use Amazon Redshift for:
- storing raw data before transformation (we prune frequently and we only keep data until it is processed)
- transforming data, with a multi-stage data pipeline to reduce, combine, and aggregate data
- moving data to a serving layer (initially it was only Elasticsearch)
- serving data to our product dashboard (this is the piece that was giving us trouble)
In the early stage of intermix.io, our dashboard pages would pull data from Redshift directly. But as we added more and more customers, we realized the dashboard pages were not performing well. This was due to two factors:
- more customers using our dashboard meant more concurrency
- more processing in Redshift meant more workloads were competing for query resources
We did use the best practices to manage our Redshift cluster, including the proper set-up of our workload management (WLM). But we also knew that using Redshift for this OLTP use case — serving data to the dashboard — was not going to work long term. So we started looking for solutions.
Our ideal solution would:
- use Amazon RDS for driver compatibility. We already used RDS for other transactional workloads and there are RDS features we wanted.
- avoid new ETL to copy data off of Redshift.
- keep Amazon Redshift as the single place where transformation logic lives (i.e. “source of truth”).
What is Amazon Redshift?
Amazon Redshift is a columnar-oriented petabyte-scale data warehouse. It is an OLAP database suitable for analytical queries and applications. It uses a modified version of PostgreSQL 8.
The solution was to use Amazon RDS as a ‘cache’ by leveraging the Dblink feature of RDS. This allows us to make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. Dblink handles moving the data at the block level.
This solution had a bunch of benefits:
- API porting was trivial because the underlying schema and drivers were the same
- Amazon Redshift could remain the source of truth
- ETL was done “automatically” via Dblink so we didn’t need to worry about moving data around
- For disaster recovery and HA purposes, we could treat the RDS instances as fungible cache nodes.
The Magic — 6 Steps
Before we go over the installation instructions, we’ll cover some of the features that using Dblink with Redshift provides.
The main features presented in this post require a Postgres instance with Postgres version >= 9.4 (some features listed in this section require more recent versions of Postgres). We use an RDS-hosted version of Postgres but other types of instances will work as long as they meet the minimum requirements.
- Automatic pruning (via refreshing a materialized view with a WHERE or LIMIT clause)
- Leverage indices
- Use a materialized view for caching
- Enable cross-region replication (via an RDS instance in another region or a cross-region read replica)
- Query your Redshift data locally/offline by using dblink to synchronize it to a locally running Postgres instance
- High availability: use a multi-AZ instance to have automatic failover in case of hardware failure
- Have more than 500 connections and run hundreds of queries concurrently
- Independently scale read capacity through the use of read replicas
- Manage the data with extended JSON support
- Run procedures on your data in other languages: PL/pgsql, PL/v8, PL/perl, PL/Tcl
- Utilize parallel workers
- Leverage other Postgres extensions
1.Use full text search
2. Standardize mailing addresses (address_standardizer and address_standardizer_data_us extensions)
3. Create a tree structure out of your data using ltree
4. Take advantage of PostGIS features when interacting with geospatial data
5. Run Oracle functions via the orafce extension
- Query your data on a single node with 64 vCPUs and 488GB RAM (2x the size of the largest single node in Redshift)
- Run a single query that combines data from Redshift with data on Postgres
- Query data from multiple Redshift databases and clusters in one place
- Use just about any feature on Postgres that can query data from a materialized view
Step 1: Setup on Redshift cluster
Now onto the installation. For installing on Redshift, perform the following commands:
- Create a new read-only user for Dblink (or use an existing one). We reference this user below as <amazon_redshift_username>.
- Grant access to the tables you will want to query.
GRANT SELECT ON all TABLES IN SCHEMA data to <amazon_redshift_username>;
- Ensure that the Postgres RDS instance has a network route to the Redshift instance (see the ‘Caveats and limits’ section at the bottom of this post if it has to use a private IP address).
Step 2: Setup on Postgres RDS instance
Enable dblink and postgres_fdw in the Postgres RDS instance (only needs to be done once per database).
CREATE EXTENSION postgres_fdw; CREATE EXTENSION dblink;
Create the server link and user mapping.
CREATE SERVER redshift_server FOREIGN DATA WRAPPER redshift_fdw
OPTIONS (host ‘<amazon_redshift_ip>’, port ‘<port>’, dbname ‘<database_name>’, sslmode ‘require’);
CREATE USER MAPPING FOR <rds_postgresql_username> SERVER redshift_server
OPTIONS (user ‘<amazon_redshift_username>’, password ‘<password>’);
That’s it, now you’re ready to query your Redshift data in RDS!
Step 3: Run some queries
This approach allows you to query Redshift while inside of a Postgres instance. It bridges data but doesn’t confer any performance benefits. For example, to get a list of users that were updated this year.
SELECT * FROM dblink(‘redshift_server’, $REDSHIFT$ SELECT id FROM users WHERE updated_on >= ‘2018–01–01’; $REDSHIFT$) AS t1 (id int);
Step 4: Create a cached view into the data
To create a queryable cache of the Redshift data, we create a materialized view. You can also select a subset of rows and a subset of columns if you don’t require the full table.
CREATE MATERIALIZED VIEW mv_users AS SELECT * FROM dblink(‘redshift_server’, $REDSHIFT$ SELECT id, updated_on FROM users; $REDSHIFT$) AS t1 (id int, updated_on timestamp);
Step 5: Keep it up to date
A materialized view will not self-update. So we need an approach to refresh the data. If you only wish to execute live queries (no materialized views) then you can ignore this section.
Refresh approach 1 (non-concurrent)
Easier and faster but with concurrent reads blocked.
REFRESH MATERIALIZED VIEW mv_users;
This will block concurrent reads, so if you need to be able to always read from the view follow the next approach.
Refresh approach 2 (concurrent)
Non-blocking refreshes (this requires a unique index on the materialized view).
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_users;
This requires a unique index to work. For example, after creating the view:
CREATE UNIQUE INDEX mv_user_id_updated_on_idx ON data.mv_users (id, updated_on);
If the data you retrieve takes more than 5 minutes to transfer or you don’t mind refreshing synchronously, you can issue the REFRESH commands above inside your code or in a periodic task as needed.
If retrieving the data from Redshift takes fewer than 5 minutes, AWS’s Lambda is a good approach to use here. Be mindful of concurrency, for example it’s bad practice to issue multiple concurrent refreshes on a single table while another refresh is currently happening on the same table. We’ll have more about how we approach this in a subsequent blog post.
The above refresh methods query the full set of data each time (even if nothing has changed), if you don’t have updates on your data you may want to use the following approach instead:
- As above but create a table (we call this a mirrored table) instead of a materialized view.
- Use a refresh procedure that inserts into the table which selects only the rows that have a primary key greater than what currently exists in the mirrored table.
Step 6: Operations and monitoring
Dblink ingestion incurs regular data transfer costs, hence if you don’t want to be charged for the network traffic of synchronizing the data please have your Postgres RDS instance in the same AZ as your Redshift leader node.
You need to ensure that the user permissions for the users querying data via Dblink matches that of the same users querying directly on Redshift, so that additional permissions aren’t granted by querying Dblink.
In the AWS console, keep an eye on the following Postgres RDS metrics to ensure that the instance isn’t overwhelmed:
- Free disk space
- Network bandwidth
- Lock wait times
And then of course we use our own intermix.io dashboard to monitor the performance of our Amazon Redshift clusters and our data pipelines. A few select key metrics we pay attention to are:
- Concurrency levels and queue wait time
- Disk-based query volume
- Query latency trends
- Table growth
For example, in image 2 you can see a distribution of all queries for a certain time interval. We can see how much memory each query consumes, if it’s disk-based or not, and what the total query return time (execution time + queue time) is.
Caveats and limits
The total on-disk size limit for Postgres on RDS is 16TB, so the data you link cannot be greater than that (such a transfer would take at least 3.6 hours on a 10Gbit connection). The total table size on Redshift can be greater than 16TB but your materialized view query cannot select a quantity of rows and columns combined with index overhead that would exceed 16TB.
A current limitation of the DNS resolving on RDS is such that lookups don’t resolve to private IPs for DNS queries performed inside of Dblink. Hence if you’re running in a VPC without publicly routable Redshift then you’ll need to setup your own DNS to resolve to the private IP or hardcode Redshift Leader node private IPs when provisioning Dblink.
With the approach in this post, we showed you how it’s possible to use Amazon Redshift in an OLTP scenario. By using Amazon RDS as a ‘cache’ and leveraging its Dblink feature, we can make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over.
So if your using Amazon Redshift in an OLTP scenario for building data apps, and you want to monitor the performance of those data apps — schedule a call with us. We’d love to look at your set-up, and help you with a free trial of intermix.io to tune your Amazon Redshift cluster.
Originally published at www.intermix.io on July 5, 2018.