In quickly becoming one of the preeminent companies in the background check space, Checkr (YC S14) has experienced a steady and amazing amount of growth. With that rise in volume has come the demand for better insight into the operation of running background checks and, in turn, the need for reliable and scalable data processing and storage.
Thus, in 2017, we started a significant push to building out our data infrastructure. What that meant depended on the demand we received both internally and externally, which we identified as the following:
1. Providing customer-facing analytics
Many enterprise customers care deeply about the efficiency and accuracy of their background checks. Since a number of these customers operate at a high volume (some via extensive use of the Checkr API), questions like…
1) “What’s the average county record turnaround time for our applicants?”
2) “What are the most common reasons for motor vehicle reports to not clear?”
3) “What’s the daily background check volume by type?”
… come up frequently as customer feedback.
2. Guiding product decisions
Since the early days at Checkr, we’ve been fortunate to find great product/market fit, a position that’s served us well. As an example, our motor vehicle record check has been consistently remarked as simple to set up and comprehensive in its results. However, as our applicant base has grown, there has come a more pressing need to monitor applicant behavior, and to do so with ease. Access to disparate sources of data, be it application logs or events in a report lifecycle, offers an opportunity to construct a more holistic picture into the system’s health and inner workings.
3. Facilitating eng development
During engineering development, we frequently need to query our production data to fetch and verify records. Nonetheless, querying against our transactional stores has historically been slow, as well as stressful on and potentially destabilizing to our production environment.
4. Enabling data scientists and machine learning engineers
Beyond the above, looking forward, a key differentiator between Checkr and other background check services is our keen focus on automation, something we’ve managed to accomplish in part by building custom machine learning models for tasks such as name matching. Reducing burden for development of machine learning models — and as a result, the downstream operational burden — can assist in easing even the most mundane of tasks, such as data extraction.
In this blog post, you’ll see an overview of how we established a stable data infrastructure to fuel our growth in the matter of a few months. We hope our insights can be useful for companies facing similar challenges at scale.
In the beginning, we implemented some quick and dirty tools to enable infrequent querying of data. Our solution was by no means scalable, but it was sufficient to answer some key business questions.
Here at Checkr, we rely heavily on MySQL (Amazon RDS) and MongoDB (Atlas). We use MySQL to store volatile and structured data including users, reports and package, while leveraging MongoDB to store largish or more static content such as post-processed provider results or webhook logs. For simplicity, we won’t go into detail on some of the less frequently used storage at Checkr.
To fulfill the need to query production data, we created a temporary solution, as shown in the graph above. For our MySQL cluster, we used replication and allowed clients to open direct connections to our replica set. Since MongoDB is a schemaless key-value store, we built a streaming service on top of MoSQL by Stripe, which tailed MongoDB’s oplog and stored the resulting data — mapped by a schema defined per Mongo collection — in a PostgreSQL instance. From there, we allowed clients to open direct connections to our PostgreSQL instance. This simple architecture served us well at the time for ad-hoc data queries. However, using this approach, there were some fairly significant and key disadvantages:
- Joining across stores was difficult to impossible. Since some data resided in MySQL and other data resided in MongoDB, complex reports often required a federated query, something that wasn’t easily manageable given the above architecture.
- MoSQL, mo’ like mo’ maintenance. Infrequently, we ran into MoSQL stability issues, and had to kick off processes to “catch up” and/or “restart” replication. If anything ever changed in an upstream collection, we also needed to update schemas defined in our streaming service. As many fields were in JSON, parsing for analysis was also complicated. Lastly, with extensive use of embedded documents in MongoDB, seemingly simple queries were slow and difficult to author.
- To no surprise, building complex logic to support analytics quickly grew out of control.
To address the above challenges, it became clear that 2017 would be the year we’d double down on our data infrastructure investment.
In the course of a few months, we established more robust and stable data infrastructure. To complete the entire picture, we needed:
- federated querying engine
- schema solution, reliable data storage
- workflow engine
- data analysis tools
As a rapidly growing startup, we had limited engineering resources, so, in our design decisions, we generally favored well-adopted and pre-existing solutions, be it a popular OSS project or enterprise solution. In being an early adopter of a microservices architecture, it made this transition easier than originally anticipated. As we’d already invested in simple and fast introduction of new services into our ecosystem, we had the flexibility to adeptly select different components, choosing what we determined to be the best tool for each respective job.
For the foundational layer of our data infrastructure, we required a tool to “stitch” all our production data together. The solution could have been a consolidated database or a federated querying engine with access to various sources of production data. Given the nature of our initial needs (i.e. analytical data could be latent), we agreed to first take the traditional ETL approach and focus on creating stable batch snapshots of our data. In the end, we chose Presto as our query engine to provide federated access to numerous data sources, as we saw several advantages:
- Presto is extremely fast, not to mention in use at large scale at Facebook, Airbnb, Netflix, and more. Given the size of our data (hundreds of gigabytes), we knew Presto could handle our use at all levels.
- Presto is not a database, which, at minimum, saves us the trouble of maintaining yet another database.
- Presto has extensive support for different connectors, covering MySQL, MongoDB, and more.
- Presto has a renowned community and is actively maintained.
That said, we did experience some incompatibilities with connectors. As one example, at one point, we needed to add SSL support for the MongoDB connector. Shortly thereafter, for the MySQL connector, we found that we had to introduce the ability to parse certain data types as varchar. Even with our hiccups though, setting up Presto was as straightforward as it gets, with the speed and reliability far trumping any challenges posed in getting our cluster up and running.
And one of the main reasons for that being the case was through the use of our beloved deployment tool Codeflow. As it stands today, our cluster has 20 nodes with 2 CPU cores and 24G of memory, with each node running in a Docker container in Kubernetes. Sounds complicated, but Codeflow made all of that easily to define and manage.
You can see our Presto deployment setup from Codeflow below:
We intentionally skipped some other popular data infrastructure solutions, such as Hadoop or Redshift. Hadoop requires a high level of maintenance, and as we currently aren’t concerned about operating at petabyte scale, it was a non-starter. We also elected Presto over Redshift because a) we’d identified it as a potentially expensive solution for our use-case and b) we didn’t want to maintain and update our schema (covered in the next section).
MongoDB is great in that it provides a schemaless solution for easy storage of diverse document types, while also scaling horizontally “out-of-the-box” and at relatively low cost. Nonetheless, it also presents challenges for querying and performing data analysis, both obstacles when data warehousing. As a storage schema is the heart of any ETL process, to extract data successfully, we needed a process for inferring a target schema from our MongoDB data. After some research, we landed on implementing a MongoDB schema analyzer, built atop an open-source tool, Variety. With the analyzer, we could regularly update collection schemas used by the PrestoDB connector with little to no maintenance cost.
With a way to neatly map all fields to a target table schema, we had a method for extracting data from MongoDB. We then needed a means to load data into our data warehouse of choice. For this part of the process, we chose Hive and its metastore, in part because it’s natively supported by Presto. Given that Hive can be deployed as a standalone service without any Hadoop infrastructure, using the Hive connector, we were able to efficiently extract data from MySQL and MongoDB by just writing Presto queries, and perform some simple data transforms as needed before outputting processed data into our data warehouse.
Checkr is a cloud-based company, so to no surprise, a lot of our data is stored in the cloud, specifically on AWS. With that, we wanted a cost effective way to store large chunks of data for data analysis and S3 was the obvious choice. While many traditional data warehouse solutions use distributed file systems such as HDFS, our Presto + Hive approach integrated particularly well with S3, because
- As Presto handles all data manipulation in memory, there is limited “temporary storage” on disk needed.
- Since we configured a large memory footprint for our Presto cluster, disk IO to S3 is limited as a large amount of data is cached.
It’s worth noting that there are also limitations with this approach, as the Presto cluster needs to have enough memory to hold all data in-memory for any given query. While there are some possible optimizations available — including splitting up data into separate queries -, those approaches can be tedious to implement. By keeping data on S3, we achieved high reliability and were also able to keep multiple snapshots with relatively low cost.
There are many existing scheduling engine solutions out there, including Pinball and Luigi. We decided to choose Airflow because it has an active community and is well maintained. There are many great features that come shipped with Airflow, the following are a few that we like in particular:
- Presto connector. It’s easy to setup connection to Presto. Since we have already consolidated all data storages with Presto + Hive, connection with Presto means we can query any data with Airflow.
- Easy to create operators. Airflow uses operators to complete different type of tasks, and it’s easy to write your own. For example, we implemented a Presto query operator with custom condition checks and retention rules. All presto query tasks can then reuse the operator with just a few lines of code.
- Monitoring and alerting. With Airflow, it’s easy to create workflow notifications and alerts with email, slack or pagerduty. Airflow can be integrated well with existing engineering services in your organization.
- Airflow also has OAuth support.
We will have another blog post with details about how we use Airflow to create workflows here at Checkr, so stay tuned!
With all the ETL pieces in place, we need to open up our dataset for data scientists and analysts, so we can extract values out of it.
For basic querying needs, we provided support for SQL clients to connect to Presto. There is no built in OAuth support for Presto, so we had to create our own LDAP service to enable authentication for Presto. It’s good to enable authentication to Presto not only for security reasons, but also for monitoring usage and adoption ratio within the organization.
For more advanced data users, we also created a Jupyter notebook service. We used S3 as the file system with the open source Jupyter S3 content manager. This made it easier to share analysis within the organization as we store all notebooks within the same bucket. The plugin also loads the directories in a passive way, so the notebook file explorer is fairly responsive even though all files are on S3. We also created code repo sync workflows to load Github repos onto S3 periodically. This makes it easier for notebooks to reuse logic from other repos. These workflows are controlled by Airflow and run on schedule. With Jupyter, data scientists can access any repo and any dataset we have through their browsers, which makes it a powerful analysis tool.
As mentioned in the earlier section, one motivation for building this infrastructure was to provide analytics to our customers. We’re lucky enough to be in a world with many mature analytics providers. We picked Chartio as our analytics provider, primarily for the following reasons:
- Easy to setup. Has drag and drop dashboard building.
- Good embedding support. Fit naturally in our customer-facing web application.
- Ability to build custom dashboard for every customer.
- Compatible with Presto.
The ease of setup saved us a significant amount of development time. We got a lot of positive feedback from our customers and were able to iterate quickly as our dashboards were highly configurable.
One thing worth mentioning is the fact that we power customer-facing analytics directly with Presto. This is an unconventional approach and might not be a working solution for everyone. A more traditional approach is usually to use NoSQL to precompute possible permutations to achieve O(1) lookup, or to use a relational database with aggressive caching and limit segmentation and filtering functionality. We can use Presto directly because we generate semi-precomputed datasets and we have most data loaded in memory before a request. We make sure the data tables powering analytics only contain the columns necessary, and since currently our analytics customers are employers, we do not need to prepare analytics for hundreds of millions of users.
We’ve also focused on improving user event logging on our site. In addition to logging with MongoDB, we’re adding more extensive user logging, sending those events to Segment to share with services like Amplitude for funnel analysis. In exploring internal BI, we also connected with Looker and others. We won’t go into too much detail on these integrations in this blog post.
And there you have it, the complete picture of Checkr’s data infrastructure as of today:
In the last few months, Checkr data infrastructure has come a long way and, already, we’re starting to see the benefits from our labor. Now we have daily dumps of our production data in S3, fast querying with Presto, powerful analysis tools including Jupyter and Looker, and application funnel analysis. We were able to complete all major components shown above in a relatively short amount of time thanks to the rapidly growing open source community and the microservices architecture we adopted early on.
Additionally, our development progress has been greatly expedited by our open source deployment framework Codeflow (much more to come about that later!).
If you too are interested in building data infrastructure at a rapidly growing startup — come join us in the next part of our journey! We’re hiring!
Acknowledgements: Checkr data infrastructure was built in collaboration with Kurt Ruppel, Stewart Park, Grace Lee, Kevin Woods and the rest of the customer report and money team. Kurt Ruppel, and Michelle Berry gave valuable edits to this post. Additionally, many engineers and product managers across the company provided feedback. Huge appreciation to everyone involved in the project!