Open-source data integration is not new. It started 16 years ago with Talend. But since then, the whole industry has changed. The likes of Snowflake, BigQuery, and Redshift have changed how data is being hosted, managed, and accessed, while making it easier and a lot cheaper. But the data integration industry has evolved as well.
On one hand, new open-source projects emerged, such as Singer.io in 2017. This enabled more data integration connectors to become accessible to more teams, even though it still required a significant amount of manual work.
On the other hand, data integration was made accessible to more teams (analysts, scientists, business intelligence teams). Indeed, companies like Fivetran benefited from Snowflake’s rise, empowering non-engineering teams to set up and manage their data integration connectors by themselves, so they can use and work on the data in an autonomous way.
But even with this progress, a large majority of teams still build their own connectors in-house. The build vs. buy leans strongly on the build. That’s why we think it’s time to have a fresh new look at the landscape of the open-source technologies around data integration.
However, the idea for this article came from an awesome debate on DBT’s Slack last week. The discussion centered around two things:
Even Fivetran’s CEO was involved in the debate.
We already synthetized the second point in a previous article. In this article, we want to analyze the first point: the landscape of open-source data integration technologies.
Here is a table summarizing our analysis.
In orange is what we’re currently building at Airbyte in the next few weeks.
To better understand this table, we invite you to read below the details of our analysis on the landscape.
Singer
Singer was launched in 2017, and was until now the most popular open-source project. It was initiated by StitchData, which was founded in 2016. Over the years, Singer grew to support 96 taps and targets.
In the end, a lot of teams will use StitchData for the connectors that work well, and will build their own integration connectors if they don’t work out of the box. Editing a Singer connector is not easier than building and maintaining the connector yourself. This defeats the purpose of open source.
Airbyte was born in July 2020, so it is still new. It was born out of frustration with Singer and other open-source projects. It was built by a team of data integration veterans from Liveramp, who individually built and maintained more than 1,000 integrations, so 8 times more than Singer. Their ambition is to support 50+ connectors by the end of 2020, so in just 5 months since the inception of the project.
Airbyte’s mission is to commoditize data integration, and we have made several significant choices towards this goal:
The number of connectors supported by Airbyte and its community is growing fast. Their team anticipates that it will outgrow Singer’s by early 2021. Note that Airbyte’s data protocol is compatible with Singer’s. So it is easy to migrate a Singer tap onto Airbyte, too.
PipelineWise is an open-source project by Transferwise that was built with the primary goal to serve their own needs. They support 21 connectors, and add new ones based on the needs of the mother company. There is no business model attached to the project, and no apparent interest from the company in growing the community.
Meltano is an orchestrator dedicated to data integration, built by Gitlab on top of Singer’s taps and targets. Since 2019, they have been iterating on several approaches. They now have one maintainer for this project that is CLI-first. After one year, they now support 19 connectors.
Here are some other open-source projects that you might have heard of, as they’re often used by data engineering teams. We thought they deserved to be mentioned.
We see a lot of teams building their own data integration connectors using Airflow for the orchestration and scheduling. Airflow wasn’t built with data integration in mind. But a lot of teams use it to build workflows. Airbyte is the only open-source project to offer an API so teams can include data integration jobs in their workflows.
DBT is the most widely used data transformation open-source project. You need to be proficient in SQL to use it properly, but a lot of data engineering / integration teams use it to normalize raw data coming into the warehouses or databases.
Both Airbyte and Meltano are compatible with DBT. Airbyte will offer teams the ability to choose between raw or normalized data for each connection they need, which addresses the needs of both data engineering and data analyst teams. Meltano doesn’t provide normalized schemas, and relies solely on DBT for that.
Apache Camel is an open-source rule-based routing and mediation engine. That means you get smart completion of routing rules in your IDE whether in your Java, Scala, or XML editor. It uses URIs to enable easier integration with all kinds of transport and messaging models including HTTP, ActiveMQ, JMS, JBI, SCA, MINA and CXF, together with working with pluggable Data Format options.
Streamsets is a data-ops platform that includes a low-level open-source data collection engine named DataCollector. This open-source project is not supported by any community, and is mostly used by the company to assure their enterprise clients that they will still have access to the code whatever happens.
Let us know if we missed any open-source projects or any valuable information on the listed ones. We will try to keep this list up to date and precise.
Also published on: https://airbyte.io/articles/data-engineering-thoughts/the-state-of-open-source-data-integration-and-etl/