I started my career as a first-generation analyst focusing on writing SQL scripts, learning R, and publishing dashboards. As things progressed, I graduated into Data Science and Data Engineering, where my focus shifted to managing the life-cycle of ML models and data pipelines. 2022 is my 16th year in the data industry and I am still learning new ways to be productive and impactful. Today, I head the data science & data engineering function at one of the unicorns, and I would like to share my findings and where I am heading next.

When I look at the big picture, I realize that the problems most companies face are quite similar. Their vision of being data-driven has turned into a BHAG, pronounced “bee hag” (Big Hairy Audacious Goal).

We data folks like patterns, so here are my findings:

During 5 out of 10 review meetings, I have witnessed people question the reliability of the data, report, or dashboard. Additionally, heads of department (HODs) will try to convince others that their data is the most accurate or reliable :)
A lot of times, an HOD comes and says that the data is not updated, and the data team is already working to fix the report or data table.
A new product launched the week before, yet we still cannot see how it is performing. The data team is working on a query change and will soon update the CXO team.
Everyone has built expertise in writing complicated ML (machine learning) models, but very few talk about or deploy inference monitoring. Without efficient monitoring, there is a high probability of model drift or performance drift in the coming weeks or months.
Very few companies deploy solutions or models to detect performance anomalies.

The list is long, and I am sure you can relate or add more to it. In a nutshell, I found that data reliability is a BIG challenge, and there is a need for a solution that is easy to use, understand, and deploy, and that is not heavy on investment.
I am Jatin Solanki, and I am on a mission to build and develop a solution to make your data reliable.

What is needed to make your data more reliable?

Complexities around data infrastructure are surging as companies gear up for a competitive edge and out-of-the-box offerings. Every company goes through a data maturity matrix. To reach a level where you deploy AI models or self-service models, you need to invest in a robust foundation.

In my opinion, the foundation begins with a reliable data source or a defined source of truth. Your data models won’t be impactful if they are fed bad data. You know it’s garbage in, garbage out.

On a high level, here are a few checks you can implement to ensure data reliability:

Volume: Ensures that all the rows/events are captured or ingested.
Freshness: Recency of the data. If your data gets updated every xx mins, this test ensures it is updated and raises an incident if not.
Schema Change: If there is a schema change or a new feature is launched, your data team needs to know so it can update the scripts.
Distribution: All the events are in an acceptable range, e.g., if a critical field shouldn’t contain null values, this test raises an alert for any null or missing values.
Lineage: This is a must-have module, yet we always underplay it. Lineage gives the data team handy information about what sits upstream and downstream.
Reconciliation: I would say recon, or finding deltas between two given datasets. This could be used to understand the difference between staging and production, or between source and destination. It could also be effective for running financial recon, such as payment gateway to the sales table.

What next? How do we implement this?

The most common question people face: Build versus Buy

I am a big fan of open-source tech; however, for some critical modules, I prefer buying an out-of-the-box solution because it’s scalable and already tested in the market.
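To give a sense of what the build route involves, here is a minimal Python sketch of five of the checks listed above (volume, freshness, schema change, null values, and reconciliation). It is an illustration under assumptions, not a production design: the function names, the `order_id` and `amount` fields, and the thresholds are all hypothetical, and rows are assumed to be plain dictionaries already loaded into memory.

```python
from datetime import datetime, timedelta, timezone


def check_volume(row_count, expected_min):
    """Volume: True if at least the expected number of rows arrived."""
    return row_count >= expected_min


def check_freshness(last_updated, max_age, now=None):
    """Freshness: True if the last update is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age


def check_schema(actual_columns, expected_columns):
    """Schema change: return (missing, unexpected) column names."""
    actual, expected = set(actual_columns), set(expected_columns)
    return sorted(expected - actual), sorted(actual - expected)


def check_not_null(rows, column):
    """Distribution: indexes of rows where the column is None or absent."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def reconcile(source, destination, key):
    """Reconciliation: deltas between two datasets, matched on a key."""
    src = {row[key]: row for row in source}
    dst = {row[key]: row for row in destination}
    return {
        "missing_in_destination": sorted(src.keys() - dst.keys()),
        "unexpected_in_destination": sorted(dst.keys() - src.keys()),
        "mismatched": sorted(
            k for k in src.keys() & dst.keys() if src[k] != dst[k]
        ),
    }


if __name__ == "__main__":
    # Hypothetical sample data: payment gateway records vs. the sales table.
    payments = [
        {"order_id": 1, "amount": 100},
        {"order_id": 2, "amount": 250},
        {"order_id": 3, "amount": 75},
    ]
    sales = [
        {"order_id": 1, "amount": 100},
        {"order_id": 2, "amount": 200},
    ]
    now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)

    print(check_volume(len(sales), expected_min=3))
    print(check_freshness(now - timedelta(minutes=45), timedelta(minutes=30), now))
    print(check_schema(sales[0].keys(), ["order_id", "amount", "currency"]))
    print(check_not_null([{"order_id": 1, "amount": None}], "amount"))
    print(reconcile(payments, sales, key="order_id"))
```

In practice each check would run on a schedule against the warehouse and raise an incident on failure. That scheduling, alerting, and lineage work is where most of the in-house effort tends to go.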
Developing in-house might cost you around US$2k per month, which includes a few hours of an engineer’s time along with cloud costs. If you are inclined toward buying an out-of-the-box solution, here are a few factors that should be part of your checklist:

It should connect to popular sources with minimal config.
It should extract information automatically, without the need for additional code.
No-code or CLI (I leave it to you).
Lineage and Catalog module.
Data Reconciliation along with a scheduling feature.
Anomaly detection.
Of course, all the tests we discussed earlier, along with alerts that can tell you where to debug.
A robust platform provides easy access to all the incidents and also evaluates the data health.
It should automatically detect my critical data assets and apply hygiene checks.
It should group alerts instead of pushing 100+ individual alerts.

In the end, the solution should help you reduce data quality incidents and make your data more reliable.

So, do I need a data observability platform?

If your answer to any of the questions or scenarios below is “Yes”, then you should procure or deploy a data observability solution right away.

Is your dashboard not getting updated on a regular basis?
Do you not know which report is accurate?
Are business stakeholders the first to learn about data incidents?
Do questions about the performance stats come up during meetings?
Do you have at least 2 members in the data team?
Have you deployed a business intelligence tool?

Just as software developers rely on solutions like Datadog and Dynatrace to ensure web/app uptime, data leaders should invest in data observability solutions to ensure data reliability.

Also published here.

Data Observability: The First Step Towards Being Data-Driven
