I recently joined Plaid as a data engineer and was getting ramped up on Airflow, a workflow tool that we used to manage ETL pipelines internally. If you’re totally new to Airflow, imagine it as a souped-up crontab with a much better UI.

It’s fast! It’s flexible! It’s free! It’s Airflow!

Around the time that I was joining, Plaid was migrating onto Periscope Data for visualizing SQL queries, and my immediate mission became to get more of the data people relied on for analytics insights into our nascent Redshift cluster, the data warehouse we query from Periscope. My colleague had spent some time harvesting our MongoDB tables into Redshift by tailing the oplogs and doing some post-processing in Spark, as well as parsing through a bunch of log dumps in S3 to create meaningful analytics tables.

Plaid ETL pipeline circa early 2018

Motivation

Kindly, my coworker left a more straightforward task to me to help me get ramped up with Airflow: moving data regularly from MySQL to Redshift. We had recently begun using Amazon Aurora instances on RDS, and needed to harvest the data from RDS and load it into Redshift to establish KPIs for these new datasets.

Our goal was to timebox this exercise to a week, so we set ourselves some constraints:

- We were going to use Aurora MySQL’s ability to select into S3, and Redshift’s COPY command.
- We would only perform complete table copies, once every day, for every table. Since our largest table took only a few hours to copy, we decided to accept pulling all of the data every night as both easier to implement and more resilient to changes or incongruities in the two data sources.
- We would work on getting the most important data, and not worry about details like deeply nested JSON columns or binary image files stored in the database.

Designing a workflow

We wanted to design a translation from MySQL to Redshift, and knew that there would have to be a translation of the schema.
Fortunately, AWS provides a resource comparing MySQL and Redshift types. We can pull the schema out of MySQL using a straightforward query:

    select table_name, column_name, data_type, ordinal_position, character_maximum_length
    from information_schema.columns
    where table_schema = database();

We handled a few types manually. For example, instead of moving binary data over, we would detect a binary type and return a boolean indicating whether the column was null or non-null. This avoided copying a large amount of binary data over the wire that would be unusable for analytics. We wrapped the functionality into some Python scripts that generate translation configurations.

Example MySQL configuration

We then created dag_mysql_processor.py to take in these database configurations and generate the associated DAGs. It iterates through each entry and generates the corresponding step using a series of SQL templates that we wrote using the AWS Docs as a guide.

The result is a batch of Airflow DAGs, one for each table in a MySQL database. Currently we have each of these DAGs running once daily, completely rebuilding the table each time, which provides good-enough latency for our current use cases.

Our linear DAG pulling data from MySQL to S3 to Redshift

Bumps in the Road

The first bit of trouble came about from trying to do a hot-swap. We wanted to ensure that the table looked correct before replacing the old data, so we added a step to validate the row counts. In the first version of our DAG, we executed each statement as a separate Airflow task, which left a window between dropping the old table and recreating it, so tables could occasionally disappear. Combining the swap step into a single, transactional task prevents any table downtime for our style of full-table replacement.

V1 of the project had a race condition in dropping and recreating the table.
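The validate-then-swap idea can be sketched as follows. This is a minimal illustration rather than our actual Airflow task: it uses Python's built-in sqlite3 module as a stand-in for Redshift, and the function name `swap_if_valid`, the `_old` suffix, the `min_rows` threshold, and the table names in `demo` are all hypothetical. It assumes the target database lets table renames and drops take part in a transaction, which is what the single-task fix relies on.

```python
import sqlite3


def swap_if_valid(conn, table, staging, min_rows):
    """Replace `table` with `staging` in one transaction, or change nothing.

    `conn` must be in autocommit mode (isolation_level=None) so that the
    explicit BEGIN/COMMIT/ROLLBACK below define the transaction boundaries.
    Table names are interpolated directly, so they must be trusted values.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        # Validate the freshly loaded copy before touching the live table.
        (count,) = cur.execute(f'SELECT COUNT(*) FROM "{staging}"').fetchone()
        if count < min_rows:
            raise ValueError(f"{staging} has {count} rows, expected >= {min_rows}")
        # Rename and drop inside the same transaction, so other sessions
        # see either the old table or the new one once it commits.
        cur.execute(f'ALTER TABLE "{table}" RENAME TO "{table}_old"')
        cur.execute(f'ALTER TABLE "{staging}" RENAME TO "{table}"')
        cur.execute(f'DROP TABLE "{table}_old"')
        cur.execute("COMMIT")
    except Exception:
        # Any failure, including a failed validation, undoes every step.
        cur.execute("ROLLBACK")
        raise
    return count


def demo():
    """Load a hypothetical staging table and swap it in; returns the new row count."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (id INTEGER)")
    conn.execute("INSERT INTO users VALUES (1)")
    conn.execute("CREATE TABLE users_staging (id INTEGER)")
    conn.executemany("INSERT INTO users_staging VALUES (?)", [(1,), (2,), (3,)])
    swap_if_valid(conn, "users", "users_staging", min_rows=1)
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Because validation happens inside the same transaction as the renames, a failed row-count check rolls back cleanly and leaves the existing table in place for anyone querying it.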
We anticipated that full copies would be troublesome, as some of our tables are many billions of rows and a global “success or fail” (a consequence of only doing full table copies) would be hard to recover from in real time. We ran a few experiments on some of our largest tables to understand where errors might be occurring, and stl_load_errors quickly became our go-to debugging tool.

    select * from stl_load_errors order by starttime desc limit 1;

Some of these issues were in translating columns from MySQL types to Redshift types. We were able to resolve a good handful of them by adding some logic to account for specific types during the conversion and for patterns in our schema, which we were able to match based on column name and type.

    potential_binary_column = 'md5' in column_name or 'uuid' in column_name

Another source of errors was newlines present in the data, which cause a single row to be split into two (or more) lines that then fail to parse. The solution here was to use a consistent and purposeful delimiter and escape character on both sides of the transfer. From MySQL:

    select {{ info.columns }}
    from {{ info.table_name }}
    INTO OUTFILE s3 '{{ info.s3_path }}'
    FIELDS TERMINATED BY ','
    enclosed by '"'
    escaped by '\\'
    LINES TERMINATED BY '\n'
    MANIFEST ON
    OVERWRITE ON;

And pairing it with the Redshift query:

    TRUNCATE "{{ info.schema_name }}"."{{ info.table_name }}";
    COPY "{{ info.schema_name }}"."{{ info.table_name }}"
    from '{{ info.s3_path }}.manifest'
    iam_role '{{ PLAID_REDSHIFT_ROLE }}'
    DELIMITER ','
    TRUNCATECOLUMNS
    EMPTYASNULL
    ACCEPTINVCHARS
    IGNOREBLANKLINES
    NULL AS '\\N'
    MANIFEST
    REMOVEQUOTES
    STATUPDATE ON
    MAXERROR {{ info.max_error }}
    ESCAPE;

Dawn of a New Day: The Intern Arrives

Instead of hard-coded schema files, we wanted to move the configuration files into a persistent storage layer, which we planned on calling dbdb, or database-database, a moniker we inherited from members of our engineering team who worked on a similar system at Square.
We received the feedback to not call it dbdb around the same time that Michael Troute joined us for the summer as a software engineering intern. He identified a need for Plaid to support ad-hoc uploading of CSVs to our Redshift cluster, and we agreed to combine CSV uploads, DBDB, and other planned ETL improvements under the broader umbrella of Data Warehouse Management.

This paid immediate dividends for our development team. When we first started working with Airflow there were a handful of DAGs governed by a pair of configuration files, but the addition of several MySQL databases and a growing suite of use cases led to the team having to open several pull requests just for configuration changes. Storing the state in the data warehouse manager instead lets us modify the system more easily: automatically adding and removing tables and columns as they get added to or dropped from the upstream database, adding custom functionality like setting Redshift sort and distribution keys, and deploying better methodology for our database ingestion.

Some of our tables are billions of rows, and many of these tables receive either “append-only” or “sliding-window” updates. For append-only tables, the proverbial low-hanging fruit is to only query new rows from the database. For other tables, the rows that are updated are usually ones created within the past few days, so a mixture of full, partial, and incremental updates will make sense here.

Iterative loading, easily controlled by maintaining state in the Data Warehouse Manager

Alternative Solutions

We took a foray into using existing solutions that would save us the engineering effort. Our favorite tool out of this investigation was Matillion, and if you’re looking for a drag-n-drop style data pipeline tool, I would definitely check them out. They offer an Amazon Machine Image (AMI) that you can deploy inside of your AWS infrastructure and have great support for these kinds of problems.
Ultimately, we identified that our use case required too much external processing for Matillion to be the right solution for us.

If you’re looking for a cloud provider, we’ve had success using Stitch Data for data processing pipelines, this post from Amazon recommends additional vendors, and Panoply recommends a similar method using mysqldump.

Conclusion

Airflow has been a reliable tool for us and is an important part of our in-house ETL efforts. Plaid works with many different data sources, and for non-sensitive datasets and third-party data, Stitch and Segment have been instrumental in building up data workflows. For the data under our on-premise security umbrella, Airflow has shown itself to be reliable, informative, and accessible to new members of the team getting ramped up on data ingestion at Plaid.

Plaid has a blossoming data ecosystem spanning many billions of rows. If you’re an Airflow veteran or live and breathe ETL pipelines, we’d love to chat! We’d especially love to open-source our efforts: if you like working on these kinds of problems or think this solution could help your own use case, we’d love to get in touch in the comments below.

Getting Ramped-Up on Airflow with MySQL → S3 → Redshift
