Why Businesses Need Data Governance (by @xavierdeboisredon)


by Castor, October 9th, 2021

Too Long; Didn't Read

"Data governance is a measure of a company's control over its data" People think data governance equals data compliance & privacy. Well, that's point-blank wrong. Data governance = controlled business impact driven by data.


And Why Governance is the Gordian Knot to all Your Business Problems



"Data governance is a measure of a company's control over its data"



Data governance is a data management concept. It is a measure of the control a company has over its data. This control can be achieved through high-quality data, visibility on data pipelines, actionable rights management, and clear accountability. Data governance encompasses the people, processes, and tools required to create consistent and proper handling of a company's data. By consistent and proper handling of data, I mean ensuring availability, usability, consistency, understandability, data integrity, and data security.


The most comprehensive governance model — say, for a global bank — will have a robust data-governance council (often with C-suite leaders involved) to drive it; a high degree of automation with metadata recorded in an enterprise dictionary or data catalog; data lineage traced back to the source for many data elements; and a broader domain scope with ongoing prioritization as enterprise needs shift.


A good data governance and privacy model is a mix of people, processes, and software.


Data Governance has a Direct Business Impact


Data governance isn't just a rusty process that companies have to deploy in order to comply with regulation. Of course, part of it is a legal obligation, and thank god for that, but clean governance can also deliver significant business outcomes.


Here are the main goals of data governance:

Data Governance Business Impact - by Xavier de Boisredon

When Did Data Governance Become a Thing?


Timeline and key milestones in the space.

For the past twenty years, the challenge around data has been to build an infrastructure to store and consume data efficiently and at scale. Producing data has become cheaper and easier over the years with the emergence of cloud data warehouses and transformation tools like dbt. Access to data has been democratized thanks to BI tools like Looker, Tableau, or Metabase. Now, building nice dashboards is the new norm in Ops and Marketing teams. This gave rise to a new problem: decentralized, untrustworthy & irrelevant data and dashboards.

Even the most data-driven companies still struggle to get value from data - up to 73% of all enterprise data goes unused.

→ 1990-2010: the emergence of the 1st regulation on data privacy


In the 1970s, the first data protection regulation in the world was passed in Hessen, Germany. Since then, data regulation has kept increasing. The 1990s marked the first regulations regarding data privacy with the EU directive on data protection.

Yet compliance with regulation really became a worldwide challenge in the second half of the 2010s, when GDPR and other regional regulations on personal data privacy joined older sector-specific rules such as HIPAA. These regulations drove data governance for large enterprises and created an urgency to build tools to handle the new requirements.

→ 2010 - 2020: 1st tools to comply with the regulation. C-level executives realize data governance can be a strategic advantage to drive business value.


With the increasing complexity of data resources/processes on the one hand and the first fines for GDPR infringement on the other, companies started to build regulatory compliance processes. The 1st pieces of software to organize Governance and Privacy were born with companies like Alation and Collibra.


The challenge was simple: enforce traceability across the various data infrastructures in the company. Data governance was then a privilege of enterprise-level companies, the only ones able to afford those tools. On-premise data storage made it expensive to deploy this software. Indeed, companies like Alation and Collibra had to deploy technology specialists in the field to connect the data to their software. The first version of data governance tools aimed at collecting and referencing data resources across the organization's departments.


There were several forces at play in this period. It became easier to collect data, cheaper to store it, and simpler to analyze it. This led to a Cambrian explosion in the number of data resources. As a result, large companies struggled to have visibility over the work done with data. Data was decentralized, untrustworthy & irrelevant. This chaos brought a new strategic dimension to data governance. More than a compliance obligation, data governance became a key lever to bring about business value.

→ 2020+: Towards an automated and actionable data governance


With the standardization of the cloud data stack, the paradigm changed. It is easier to connect to the data infrastructure and gather metadata. Where it took 6 months to deploy a data governance tool on a multitude of siloed on-premise data centers in 2012, it can take as little as 10 minutes in 2021 on the modern data stack (for example: Snowflake, Looker, and dbt).

This gave rise to new challenges: automation and collaboration. Doing data governance in Excel means manually maintaining 100+ fields on thousands of tables and dashboards. This is impossible. Doing data governance with a non-automated tool means maintaining 10+ fields on thousands of tables: this is time-consuming. Doing data governance with a fully automated tool means maintaining only 1 or 2 fields on thousands of tables (namely the table and column/field descriptions). For that last part of manual work, you want to leverage the community: prioritize work based on data consumption (a high documentation SLA for popular resources) and democratize usage through a friendly UX.
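To make "prioritize work based on data consumption" concrete, here is a minimal Python sketch. It assumes you can export, for each table, a recent query count and whether a description exists; the `TableStats` fields and the sample table names are hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Usage and documentation state for one table (hypothetical schema)."""
    name: str
    queries_last_30d: int   # how often the table was queried recently
    has_description: bool   # whether someone has already documented it


def documentation_backlog(tables: list[TableStats]) -> list[str]:
    """Return the names of undocumented tables, most-queried first."""
    undocumented = [t for t in tables if not t.has_description]
    undocumented.sort(key=lambda t: t.queries_last_30d, reverse=True)
    return [t.name for t in undocumented]


# Made-up sample data for illustration only.
tables = [
    TableStats("orders", 1200, False),
    TableStats("tmp_export", 3, False),
    TableStats("users", 950, True),
    TableStats("payments", 400, False),
]
backlog = documentation_backlog(tables)
print(backlog)  # ['orders', 'payments', 'tmp_export']
```

Popular but undocumented tables float to the top, so the small amount of manual documentation effort goes where the most people will benefit from it.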


Additionally, you want that data governance tool to be integrated into the rest of the data stack. Define something once and find it everywhere: whether this is a table definition, a tag, a KPI, a dashboard, access rights, or data quality results.

Data Governance Challenges Are Not the Same for Everyone


Diverse governance's use-cases based on industry needs and company size


There are two main drivers for data governance programs:


  • Level of regulation needed in the industry

    Data regulation pushes the minimum bar of data governance processes higher. It requires businesses to add controls, reporting, and documentation. These are needed to ensure transparency over processes that are sometimes unclear.


  • Level of complexity of the data assets

    Having strong governance becomes increasingly important with the exponential growth of data resources, tools, and people in a company.


The level of complexity increases with the scope of business operations (number of lines of business and geographies covered), the velocity of data creation, or the level of automation (decision-making, processes) based on data.

How Do You Set Up a Good Data Governance and Privacy Model?


Several bricks are needed to enforce data management.


Several bricks are needed to enforce data management - Image by Xavier de Boisredon


  • Data Architecture (Storage, Modeling, Visualization)


    Before even talking about data governance, a company needs the basics: a good infrastructure to begin with. Based on business needs and the company's data maturity, the nature of the data architecture can change a lot. Regarding storage: do you go for on-premise or cloud? A data warehouse or a data lake? Regarding modeling: Spark or dbt? In a data warehouse or a BI tool? Real-time or batch? Regarding visualization: do you allow anyone to build dashboards, or data teams only? Etc.


  • Search and Discovery


    The first level of any data governance strategy is making sure relevant people can find the relevant datasets to do their analysis or build their AI model. Companies that skip this step end up with a lot of questions on Slack and useless meetings with the engineering teams. They also end up with a lot of duplicate tables, analyses, and dashboards. All of this takes valuable time away from the engineering resources needed to perform the next steps.


  • Metadata and Documentation


    Once you can efficiently find the data, you need to understand it quickly in order to assess if it is going to be useful. For example, you are looking at a dataset called "active_users_revenue_2021". There is a column "payment." Is this column in € or $? Has it been refreshed this morning, last week, or last year? Does it contain all the data on active users or just the ones in Europe? If I remove a column, will this break important dashboards for the marketing or finance team? And so on.


  • Data Quality


    Now that you have data stored in scalable infrastructure that everyone can find and understand, you need to trust that what is inside is of high quality. This is why so many data observability and reliability tools were born in the last five years. Data observability is the general concept of using automated monitoring, alerting, and triaging to eliminate data downtime. The two main approaches to data quality are: declarative (manually define thresholds and behavior) or ML-driven (detecting sudden changes in distribution).


  • Security and Access Rights


    Some data might be more private or strategic than others. Let's say you are a bank; you don't want to give access to the transaction logs to just anyone in the company. You need to define access rights, and managing them efficiently can quickly become a struggle as the number and type of people working with data grow. Sometimes, you want to give access to someone for a specific mission and not for anything else. What happens when one of your employees is in the finance department but moves to marketing? You need to manage these rights thoroughly and efficiently.


  • Compliance and Regulation


    This one is self-explanatory. You need to list all assets and report on personal information and its usage to comply with regulations. For now, regulators mostly target enterprise companies, but it is just a question of time before smaller companies start receiving fines.
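Going back to the Data Quality brick above, the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the declarative check uses a hand-picked null-rate threshold, and a simple z-score on daily row counts stands in for the ML-driven approach (real observability tools use richer models). The function names and sample values are hypothetical.

```python
from statistics import mean, stdev


def null_rate_ok(values: list, max_null_rate: float) -> bool:
    """Declarative check: pass if the share of None values is within the threshold."""
    if not values:
        return True
    null_rate = sum(v is None for v in values) / len(values)
    return null_rate <= max_null_rate


def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Statistical check: flag `latest` if it sits far from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical points
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Made-up sample data for illustration only.
payments = [10.0, None, 12.5, 9.9, 11.0, 10.4, 13.1, 9.5, 10.8, 11.7]
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

payments_ok = null_rate_ok(payments, max_null_rate=0.05)      # 1 null in 10 rows
drop_flagged = is_anomalous(daily_row_counts, latest=400)     # sudden drop in volume
normal_flagged = is_anomalous(daily_row_counts, latest=1002)  # an ordinary day
print(payments_ok, drop_flagged, normal_flagged)  # False True False
```

The declarative check requires someone to choose the threshold, while the statistical one learns what "normal" looks like from history, which is the trade-off between the two approaches.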

Where Does Data Governance Fit in the Modern Data Stack?


Data governance brings trust from the raw data sources to domain expert dashboards.

Modern data stack governance - Image by Xavier de Boisredon

The typical data flow is the following:


  • You collect data from various sources across your business. It can be product logs, marketing and website data, payment and sales logs, etc. You extract that information with tools like Fivetran, Stitch, or Airbyte.


  • You then store this data in a data warehouse (Snowflake, Redshift, BigQuery, Firebolt, to name the most popular). The data warehouse is a place both to store your data and to transform it in order to refine it.


  • The trending transformation layer for the past 3 years has been dbt. It lets you perform data transformations in SQL within the data warehouse while applying good software engineering practices.


  • Finally, the transformation helps you build your "data mart," the gold standard in terms of refined data. The visualization brick helps domain experts visualize this gold-level data to share insights throughout the whole organization.


    These steps happen in different tools with a high level of abstraction. It is hard to keep a bird's-eye view of what happens under the hood. This is what data governance brings to the table. You can see how the data flows, where the pipeline breaks, where risks lie, where to put your energy as a data manager, etc.
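One way to picture "seeing how the data flows" is to treat lineage as a graph and ask which assets sit downstream of a given source. The Python sketch below is illustrative only: the asset names and the `LINEAGE` mapping are hypothetical, and real tools extract this graph from warehouse, transformation, and BI metadata rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets built directly from it.
LINEAGE = {
    "raw.payments": ["staging.payments"],
    "raw.users": ["staging.users"],
    "staging.payments": ["marts.active_users_revenue_2021"],
    "staging.users": ["marts.active_users_revenue_2021"],
    "marts.active_users_revenue_2021": ["dashboard.finance_kpis", "dashboard.marketing"],
}


def downstream(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream("raw.payments", LINEAGE)
print(impacted)
# ['staging.payments', 'marts.active_users_revenue_2021',
#  'dashboard.finance_kpis', 'dashboard.marketing']
```

If the raw payments feed breaks, this kind of traversal tells you which tables and dashboards are at risk, so you know where to look first.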



Also published at: https://www.castordoc.com/blog/what-is-data-governance-and-privacy.