by Luke Gotszling At , we use Amazon Redshift as part of our stack. Amazon Redshift is an OLAP , and a valuable tool for data teams due to its low cost and speed for analytical queries. We have a particular use case though. We’re using Amazon Redshift in an OLTP scenario, i.e. we’ve built an analytical application on top of Redshift. intermix.io database The challenge of using Redshift as an OLTP is that queries can lack the low-latency that would exist on a traditional RDBMS and transactional queries. Also, there are features of a traditional RDBMS that would often be useful when querying Redshift (e.g. indices). In this post we detail a method of using Amazon Redshift as an OLTP. We will tell the story of how we off-load OLTP workloads from Amazon Redshift to Amazon RDS. We were able to do this without writing any complex ETL or changing our data processing strategy. How We Use Amazon Redshift intermix.io is an analytics platform that provides a single monitoring dashboard for data engineers to keep an eye on their critical data flows. Image 1: Simplified architecture of Amazon Redshift in an OLTP scenario There are three steps for acquiring the data we need to provide our analytics to our customers. We extract raw event data from the customer databases via the . The Collector runs as a Docker container in the customer’s environment. intermix.io Collector 2. We move the data into the intermix.io S3 environment. 3. From S3, we copy data into Redshift. We use Amazon Redshift for: before transformation (we prune frequently and we only keep data until it is processed) storing raw data , with a multi-stage data pipeline to reduce, combine, and aggregate data transforming data to a serving layer (initially it was only Elasticsearch) moving data to our product dashboard (this is the piece that was giving us trouble) serving data Issues In the early stage of intermix.io, our dashboard pages would pull data from Redshift directly. But as we added more and more customers, we realized the dashboard pages were not performing well. This was due to two factors: more customers using our dashboard meant more concurrency more processing in Redshift meant more workloads were competing for query resources We did use the best practices to manage our Redshift cluster, including the ). But we also knew that using Redshift for this OLTP use case — serving data to the dashboard — was not going to work long term. So we started looking for solutions. proper set-up of our workload management (WLM Our ideal solution would: use Amazon RDS for driver compatibility. We already used RDS for other transactional workloads and there are RDS features we wanted. avoid new ETL to copy data off of Redshift. keep Amazon Redshift as the single place where transformation logic lives (i.e. “source of truth”). What is Amazon Redshift? Amazon Redshift is a columnar-oriented petabyte-scale data warehouse. It is an OLAP database suitable for analytical queries and applications. It uses a modified version of PostgreSQL 8. Our Approach The solution was to use Amazon RDS as a ‘cache’ by leveraging the . This allows us to make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. handles moving the data at the block level. Dblink feature of RDS Dblink This solution had a bunch of benefits: API porting was trivial because the underlying schema and drivers were the same Amazon Redshift could remain the source of truth ETL was done “automatically” via Dblink so we didn’t need to worry about moving data around For disaster recovery and HA purposes, we could treat the RDS instances as fungible cache nodes. The Magic — 6 Steps Before we go over the installation instructions, we’ll cover some of the features that using Dblink with Redshift provides. The main features presented in this post require a Postgres instance with Postgres version >= 9.4 ( ). We use an RDS-hosted version of Postgres but other types of instances will work as long as they meet the minimum requirements. some features listed in this section require more recent versions of Postgres Automatic pruning (via refreshing a materialized view with a WHERE or LIMIT clause) Leverage indices Use a materialized view for caching Enable cross-region replication (via an RDS instance in another region or a cross-region read replica) Query your Redshift data locally/offline by using dblink to synchronize it to a locally running Postgres instance High availability: use a multi-AZ instance to have automatic failover in case of hardware failure Have more than 500 connections and run hundreds of queries concurrently Independently scale read capacity through the use of read replicas Manage the data with extended JSON support Run procedures on your data in other languages: PL/pgsql, PL/v8, PL/perl, PL/Tcl — For example: query your data using Javascript Utilize parallel workers Leverage other Postgres extensions 1.Use full text search 2. Standardize mailing addresses (address_standardizer and address_standardizer_data_us extensions) 3. Create a tree structure out of your data using ltree 4. Take advantage of PostGIS features when interacting with geospatial data 5. Run Oracle functions via the orafce extension Query your data on a single node with 64 vCPUs and 488GB RAM (2x the size of the largest single node in Redshift) Run a single query that combines data from Redshift with data on Postgres Query data from multiple Redshift databases and clusters in one place Use just about any feature on Postgres that can query data from a materialized view Step 1: Setup on Redshift cluster Now onto the installation. For installing on Redshift, perform the following commands: Create a new read-only user for Dblink (or use an existing one). We reference this user below as <amazon_redshift_username>. Grant access to the tables you will want to query. GRANT SELECT ON all TABLES IN SCHEMA data to <amazon_redshift_username>; Ensure that the Postgres RDS instance has a network route to the Redshift instance (see the ‘Caveats and limits’ section at the bottom of this post if it has to use a private IP address). Step 2: Setup on Postgres RDS instance Enable dblink and postgres_fdw in the Postgres RDS instance (only needs to be done once per database). CREATE EXTENSION postgres_fdw; CREATE EXTENSION dblink; Create the server link and user mapping. CREATE SERVER redshift_server FOREIGN DATA WRAPPER redshift_fdw OPTIONS (host ‘<amazon_redshift_ip>’, port ‘<port>’, dbname ‘<database_name>’, sslmode ‘require’); CREATE USER MAPPING FOR <rds_postgresql_username> SERVER redshift_server OPTIONS (user ‘<amazon_redshift_username>’, password ‘<password>’); That’s it, now you’re ready to query your Redshift data in RDS! Step 3: Run some queries This approach allows you to query Redshift while inside of a Postgres instance. It bridges data but doesn’t confer any performance benefits. For example, to get a list of users that were updated this year. SELECT * FROM dblink(‘redshift_server’, $REDSHIFT$ SELECT id FROM users WHERE updated_on >= ‘2018–01–01’; $REDSHIFT$) AS t1 (id int); Step 4: Create a cached view into the data To create a queryable cache of the Redshift data, we create a . You can also select a subset of rows and a subset of columns if you don’t require the full table. materialized view CREATE MATERIALIZED VIEW mv_users AS SELECT * FROM dblink(‘redshift_server’, $REDSHIFT$ SELECT id, updated_on FROM users; $REDSHIFT$) AS t1 (id int, updated_on timestamp); Step 5: Keep it up to date A materialized view will not self-update. So we need an approach to refresh the data. If you only wish to execute live queries (no materialized views) then you can ignore this section. Easier and faster but with concurrent reads blocked. Refresh approach 1 (non-concurrent) REFRESH MATERIALIZED VIEW mv_users; This will block concurrent reads, so if you need to be able to always read from the view follow the next approach. Non-blocking refreshes (this requires a unique index on the materialized view). Refresh approach 2 (concurrent) REFRESH MATERIALIZED VIEW CONCURRENTLY mv_users; This a unique index to work. For example, after creating the view: requires CREATE UNIQUE INDEX mv_user_id_updated_on_idx ON data.mv_users (id, updated_on); If the data you retrieve takes more than 5 minutes to transfer or you don’t mind refreshing synchronously, you can issue the REFRESH commands above inside your code or in a periodic task as needed. Periodic refresh If retrieving the data from Redshift takes fewer than 5 minutes, AWS’s Lambda is a good approach to use here. Be mindful of concurrency, for example it’s bad practice to issue multiple concurrent refreshes on a single table while another refresh is currently happening on the same table. We’ll have more about how we approach this in a subsequent blog post. Refresh Lambdas The above refresh methods query the full set of data each time (even if nothing has changed), if you don’t have updates on your data you may want to use the following approach instead: Incremental refreshes As above but create a table (we call this a mirrored table) instead of a materialized view. Use a refresh procedure that inserts into the table which selects only the rows that have a primary key greater than what currently exists in the mirrored table. Step 6: Operations and monitoring Dblink ingestion incurs regular data transfer costs, hence if you don’t want to be charged for the network traffic of synchronizing the data please have your Postgres RDS instance in the same AZ as your Redshift leader node. You need to ensure that the user permissions for the users querying data via Dblink matches that of the same users querying directly on Redshift, so that additional permissions aren’t granted by querying Dblink. In the console, keep an eye on the following Postgres RDS metrics to ensure that the instance isn’t overwhelmed: AWS Free disk space IOPS Network bandwidth Lock wait times And then of course we use our own intermix.io dashboard to monitor the performance of our Amazon Redshift clusters and our data pipelines. A few select key metrics we pay attention to are: Concurrency levels and queue wait time Disk-based query volume Query latency trends Table growth Image 2: intermix.io dashboard — query memory footprint For example, in image 2 you can see a distribution of all queries for a certain time interval. We can see how much memory each query consumes, if it’s disk-based or not, and what the total query return time (execution time + queue time) is. Caveats and limits The total on-disk size limit for Postgres on RDS is 16TB, so the data you link cannot be greater than that (such a transfer would take at least 3.6 hours on a 10Gbit connection). The total table size on Redshift can be greater than 16TB but your materialized view query cannot select a quantity of rows and columns combined with index overhead that would exceed 16TB. A current limitation of the DNS resolving on RDS is such that lookups don’t resolve to private IPs for DNS queries performed inside of Dblink. Hence if you’re running in a VPC without publicly routable Redshift then you’ll need to setup your own DNS to resolve to the private IP or hardcode Redshift Leader node private IPs when provisioning Dblink. Wrapping up With the approach in this post, we showed you how it’s possible to use Amazon Redshift in an OLTP scenario. By using Amazon RDS as a ‘cache’ and leveraging its Dblink feature, we can make any Amazon Redshift table available as a relation in RDS, without the need to explicitly copy that data over. So if your using Amazon Redshift in an OLTP scenario for building data apps, and you want to monitor the performance of those data apps — schedule a call with us. We’d love to look at your set-up, and help you with a free trial of intermix.io to tune your Amazon Redshift cluster. Originally published at www.intermix.io on July 5, 2018.

Amazon

Have Your Postgres Cake with Amazon Redshift and eat it, too.

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

How Successful Founders Will Raise Money in 2018

6 Reasons to Use Amazon Redshift

Cost Effective Data Warehousing: Delta View and Partitioned Raw Table

How Costly is AWS Redshift Serverless?

Spark and Redshift: Which is better for big data?

Understanding How Data Warehousing on AWS Works

How Successful Founders Will Raise Money in 2018

6 Reasons to Use Amazon Redshift

Cost Effective Data Warehousing: Delta View and Partitioned Raw Table

How Costly is AWS Redshift Serverless?

Spark and Redshift: Which is better for big data?

Understanding How Data Warehousing on AWS Works

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps