How to Build a Data Stack from Scratch

by Castor, July 7th, 2021

After 7 years working as Senior Data Manager at Criteo and as Head of Data at Payfit, and after interviewing 200+ data leaders for user research, I have a good overview of data stacks in action. So, by popular demand, I'm sharing some advice based on the trends I've observed.

Earlier in March, I had two different calls: one from a former colleague, now head of data at Swile, and one from a friend, now head of data at Cajoo. Both had just taken a new job and had to build a complete data stack, from scratch! I figured this could interest others, so here is the advice I gave them.

Before we start, here are the criteria of the decision matrix you can use to benchmark each tool:

  • Ease of Use
  • Integration in Cloud environment
  • Community & Documentation
  • Governance Capabilities
  • Pricing

To keep things simple, we’ll just follow the data, from source to reporting.

Data Warehouse

Where you store your data for analytics needs

A data warehouse is the place where you store and organize your data before analysts use it.

My recommendation is simple:

  • If your dev stack runs on Google Cloud, then go for BigQuery (BQ)
  • If your dev stack runs on AWS, then go for Snowflake (SF)

Here are a few reasons.

  1. Redshift was indeed the first big player in the cloud data warehouse market, but that also means it made the mistakes that followers could avoid, and it hasn't changed much over the last few years.
  2. Redshift has a poor UI/UX, especially for end-users (data consumers).
  3. BigQuery and Snowflake are quite similar in terms of features and pricing models (both usage-based: BQ bills you by data scanned, Snowflake by query time). A slight difference is that BigQuery is stronger when the whole stack runs on Google Cloud, while Snowflake is slightly more “analytics” focused, making it a perfect match for AWS users.

Snowflake is heading towards being a full data platform (multi-cloud, Snowpipe, jobs, data sharing, Python/JS functions) with a friendly user interface.

BigQuery's strength is its integration within the Google Cloud ecosystem: read/write from GCS, IAM rights management, billing, native Google Sheets integration, and more.
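Since BQ bills by data scanned, it pays to estimate a query's footprint before running it. Here is a minimal sketch using the google-cloud-bigquery client's dry-run mode; the project, dataset, and table names are placeholders:

```python
from google.cloud import bigquery

# Assumes GCP credentials are configured (e.g. GOOGLE_APPLICATION_CREDENTIALS)
client = bigquery.Client(project="my-analytics-project")  # placeholder project

# A dry run validates the query and reports bytes scanned without executing it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT user_id, event_name FROM analytics.events WHERE event_date = '2021-07-01'",
    job_config=job_config,
)

# On-demand pricing is per TB scanned (~$5/TB at the time of writing)
tb_scanned = job.total_bytes_processed / 1e12
print(f"Would scan {job.total_bytes_processed:,} bytes, roughly ${tb_scanned * 5:.4f}")
```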

Ingestion

Get Data in your Data Warehouse

Stitch and Fivetran extract data from all your business sources and bring it into your data warehouse.

For data originating from a SaaS, use StitchData or Fivetran. These two companies do a great job of maintaining these data dumps for their thousands of clients, and their pricing is linear as well as predictable. The sources they can retrieve data from are numerous; check them here and here.

You may want to cherry-pick some tables to ingest yourself, for frequency reasons or whatever. Make sure to leverage the numerous open-source taps made available here.

Want a mix between fully managed and fully homemade? Check out Airbyte. You'll have to host it yourself, but it comes with all the necessary features, sources, and destinations.

Segment provides similar features, depending on the plan you're on, but focuses more on event tracking; it's a must-have to sync your marketing campaigns with user activity. Think of Segment as the big ears listening to what is happening on your website/app. It also demands an ongoing tagging plan. An alternative is Contentsquare, which does not integrate as well with other data sources but does not require a tagging plan.
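To make the tagging plan concrete, here is a minimal sketch of event tracking with Segment's analytics-python library; the write key, user ID, and event properties are placeholders:

```python
import analytics

# Write key from your Segment source settings (placeholder)
analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"

# Tie events to a known user...
analytics.identify("user_42", {"email": "jane@example.com", "plan": "pro"})

# ...then record what they do; Segment forwards these events to your
# warehouse and marketing destinations
analytics.track("user_42", "Signed Up", {"referrer": "newsletter"})

# Flush the internal queue before the process exits so events are actually sent
analytics.flush()
```

The tagging plan is essentially the agreed-upon list of event names and properties ("Signed Up", "Checkout Started", ...) that everyone instruments consistently.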

Do not worry, there is still plenty of DE work to do: you've got ingestion pipelines to build for all your production data. If your PROD is NoSQL (such as Mongo), then you've got transformations to do before ingesting the data into your data warehouse. If your PROD is already in a SQL format, then you can dump all the needed data into your analytics warehouse and move directly on to the next section.
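For the NoSQL case, the bulk of the work is flattening nested documents into tabular rows. A minimal sketch with pymongo, assuming a hypothetical orders collection with an embedded customer sub-document:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
orders = client["prod"]["orders"]  # hypothetical database and collection

def flatten(order: dict) -> dict:
    """Turn a nested order document into a flat, warehouse-ready row."""
    customer = order.get("customer", {})
    return {
        "order_id": str(order["_id"]),
        "amount": order.get("amount"),
        "customer_id": customer.get("id"),
        "customer_country": customer.get("country"),
        "created_at": order.get("created_at"),
    }

rows = [flatten(doc) for doc in orders.find()]
# `rows` can now be serialized (CSV, JSON) and loaded into BigQuery or Snowflake
```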

Transformation

From raw data to model/reports ready

Like any raw material, data must be transformed to deliver value. Use DBT to transform your data.

First things first: ETL is dead, long live ELT (note the L is now in the middle). In recent years, storage costs have decreased to a point (< $30 per TB/month) where you can load all your raw data into the warehouse and then do the transformation in SQL. Keep in mind that storage in SF and BQ is as cheap as S3 storage; it's compute that costs money.
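In practice, ELT means the "T" is just SQL executed inside the warehouse. A minimal sketch with the BigQuery client, assuming a raw events table has already been loaded untouched; dataset and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # placeholder project

# Extract/Load already happened: raw data sits as-is in raw.events.
# Transform: plain SQL run by the warehouse itself, no external compute needed.
client.query(
    """
    CREATE OR REPLACE TABLE datamart.daily_signups AS
    SELECT DATE(created_at) AS day, COUNT(*) AS signups
    FROM raw.events
    WHERE event_name = 'Signed Up'
    GROUP BY day
    """
).result()  # block until the transformation finishes
```

This is exactly the kind of SQL you would hand over to DBT, which is where we go next.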

Now, putting the spotlight on what has become the obvious tool in 90% of the data teams I've talked to: the notorious DBT. There is even a new job title that comes with it: Analytics Engineer. DBT is open source, quite easy to set up, and comes with a bunch of interesting quality-oriented features, such as built-in data tests.

The general idea is to load all your data into your warehouse, then use DBT to create your datamart, i.e., refined data ready to be used for analytics purposes.

As a side note, DBT is not a scheduler; you'll still have to schedule it. It's your scheduler that will launch the DBT executions… which naturally leads me to the scheduler section below. One last note: at the end of last year, Google acquired Dataform; it's now free and well integrated with BigQuery, but the community is not that big.

Scheduler

Run job 1 then run job 2

In any transformation process, there is a schedule. Things have to happen in a certain order. Use Airflow for data transformation scheduling.

The market standard for scheduling really boils down to Airflow; the real question is which deployment setup you want. You can host it yourself, use your cloud provider's managed offering (e.g. Cloud Composer on GCP), or go with a dedicated vendor.

Astronomer, for example, helps you get off the ground faster. But there is a cost for that…
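To tie the scheduler and DBT together, here is a minimal sketch of an Airflow DAG that runs DBT right after an ingestion task; the schedule and project path are placeholders, and it assumes DBT is installed on the Airflow workers:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2021, 7, 1),
    schedule_interval="0 6 * * *",  # every day at 06:00
    catchup=False,
) as dag:
    # Placeholder for your ingestion step (Stitch/Fivetran syncs, custom pipelines...)
    ingest = BashOperator(task_id="ingest", bash_command="echo 'ingestion done'")

    # Run DBT only once the raw data has landed in the warehouse
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt_project && dbt run",  # placeholder project path
    )

    ingest >> dbt_run  # run job 1, then run job 2
```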

Data Viz

Make your shiny dashboards

Once you have a high-value dataset, you will want to show the output in the sexiest and most understandable way. Data visualization with Tableau, Looker, or Metabase is the best option.

This one is tough, as competition is fierce within this segment. This will also be the only part of the whole stack you are building that your business users will see. Strong options are Tableau, Looker, or Metabase. I know there are a lot of other options out there; that's just my opinion.

I'll write a dedicated article on the matter, but to start, go for Metabase. First, on the data viz side, it's simple and straightforward: it takes a few minutes to go from a table to a dashboard. Keep in mind that for a while your dashboards will be simple, as you need to educate the different teams. Oh, and it's free as long as you host it yourself.

Data Discovery

Find, understand and use data faster

High-value datasets or analyses are useless if no one knows they exist. Make sure people do. Castor is the tool for that.

Do you think data discovery is only for companies with huge amounts of data or lots of people? Did you know that after a few data dumps from Stitch or Segment, you'll already have hundreds of tables?

On top of that, I'm quite sure the first data analyst or engineer you hire will write their own data documentation in a Google Doc or somewhere similar. Wouldn't it be better to have that info accessible to everyone, and especially to the second hire?

At Castor, we provide just that!

Castor is a data documentation tool for the Notion-Figma-Slack generation. Or data-wise, for the Fivetran-Looker-Snowflake-DBT aficionados. We designed our catalog to be easy to use, automated, delightful, and friendly.

Want to check it out? Reach out to us and we will show you a demo.