Repaying the GDPR Data Governance Debt with Metadata and Semantics

Being GDPR compliant means having robust Data Governance in place. And robust Data Governance is built on metadata and semantics.

GDPR is portrayed as somewhat of a boogeyman. Most of the talk on GDPR has been focusing on cost, rather than benefits. Yes, fines occurring as a consequence of failure to comply with GDPR can be substantial. And yes, becoming GDPR compliant incurs costs to organizations. But if GDPR did not exist, it would be necessary to invent it.

It may be hard to see this now, with organizations scrambling in the final stretch of what has been a frantic race to compliance. But this race extends beyond the finish line. And that’s not just because, according to surveys, a significant share of organizations will not be GDPR-compliant on May 25th.

To be GDPR compliant is to have a robust Data Governance process and infrastructure in place. This is what enables organizations to be in control of the data they manage. If this is in place, GDPR compliance can be built on top of it. Identifying data lineage and use across the organization are typical examples of this.

But there are far too many organizations in which data is managed in a haphazard fashion. Why? Lack of know-how, tools, or budget / time are some typical reasons. Just think of the average user’s personal computer storage, multiply this at organizational scale, and you get the picture.

Still, it all comes down to priorities, as the race to GDPR compliance goes to show. Data Governance should have been there all along, and GDPR is a good opportunity to repay the Data Governance debt.

Technical debt is a term used to describe what happens in software development when choosing an easy solution now instead of using a better approach that would take longer. Over time, technical debt accumulates, until it gets to the point where it becomes really hard or even impossible to resolve / repay.

The same concept applies to Data Governance too. As many organizations have accumulated massive amounts of Data Governance debt, GDPR is the equivalent of a creditor making that debt repayable immediately. Let’s take a look at some fundamentals that can save organizations from going bankrupt: metadata and semantics.

Metadata for Data Governance

When we talk about Data Governance, what we really talk about is the ability to keep track of data, to be able to know things such as: What is that data about? What are the entities referenced in that data? Where is the data coming from? When was it acquired, and how? Who is using it, and where?

Without being able to answer those questions, there’s no such thing as Data Governance, and no policy can possibly be enforced. What all of that really is, is data about data, or metadata. Depending on the point of view, metadata can be classified according to different taxonomies.

According to Getty’s classification, metadata can be classified as Administrative, Descriptive, Preservation, Technical, and Use metadata. Getty’s classification is among the most complete ones, as it stems from the cultural heritage world, one of professional asset and content managers.

Each type of metadata helps keep track of a different aspect of data. Image: Ontotext
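To make this concrete, here is a minimal sketch of how the five categories in Getty’s classification could be used to tag the metadata kept about a dataset. The field names, the sample values, and the assignment of each field to a category are illustrative assumptions, not something prescribed by the classification itself.

```python
from dataclasses import dataclass
from enum import Enum


class MetadataType(Enum):
    """The five categories named in Getty's classification."""
    ADMINISTRATIVE = "administrative"
    DESCRIPTIVE = "descriptive"
    PRESERVATION = "preservation"
    TECHNICAL = "technical"
    USE = "use"


@dataclass(frozen=True)
class MetadataField:
    name: str
    value: str
    kind: MetadataType


def group_by_type(fields):
    """Group field names by the metadata category they were tagged with."""
    grouped = {}
    for f in fields:
        grouped.setdefault(f.kind, []).append(f.name)
    return grouped


def missing_types(fields):
    """Return the categories for which no metadata has been captured."""
    present = {f.kind for f in fields}
    return [t for t in MetadataType if t not in present]


# Hypothetical metadata for a customer table (all names and values are made up).
customer_table = [
    MetadataField("subject", "customer contact details", MetadataType.DESCRIPTIVE),
    MetadataField("source_system", "crm_export", MetadataType.ADMINISTRATIVE),
    MetadataField("acquired_on", "2018-03-01", MetadataType.ADMINISTRATIVE),
    MetadataField("file_format", "parquet", MetadataType.TECHNICAL),
    MetadataField("used_by", "marketing_reports", MetadataType.USE),
]

print(group_by_type(customer_table))
print(missing_types(customer_table))
```

A check like `missing_types` is one simple way to spot which aspects of a dataset are not being tracked at all; in this made-up record, nothing has been captured about preservation.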

Organizations such as museums and libraries have been on the forefront of metadata research, development and use, as the need to manage assets has always been integral for them. With the proliferation of Big Data however, the average organization today may very well have more assets to manage than the average museum a few years back.

Increasingly, the physical world has come to be represented in the digital realm. In addition, many processes and functions today have become primarily, or exclusively, digital. So although the types of assets and data most organizations have to manage are typically different from those of museums or libraries, they could benefit from their approach to using metadata.

Traditionally, metadata use in the cultural heritage domain has been focused on manual annotation. That approach won’t scale for most organizations, but it does not have to be that way. Much of the metadata capturing process can be automated, achieving greater efficiency. But there’s something about the cultural heritage approach to metadata everyone could reuse.

Semantics for Metadata

When we talk about metadata, what we really talk about is the ability to capture and manage additional data about data. The most common form of metadata is a schema. A schema describes the structure and meaning of a dataset, thus specifying its semantics.

The advent of technologies such as NoSQL stores or Hadoop, and the proliferation of unstructured or semi-structured data such as documents and multimedia, has brought the use of schema under critique. But while schema-free, or schema-on-read approaches, have their merits, eventually even their proponents have come to terms with the realities of data management.

It can be argued that schema-free and schema-on-read are another form of data debt. As most data have structure and relationships, not dealing with them upfront can speed up storage and use, but it will eventually come back to hinder data (re)usability. The structure and semantics in data are useful even for machine learning approaches.

In similar fashion, semantics is also useful for metadata. Having well-defined, curated, and shared vocabularies to describe various types of metadata pays off. By using such vocabularies, organizations have a better picture of what the meaning of their metadata is.

Semantics help define the meaning of metadata. Image: Sung-Kook Han

They benefit from the accumulated experience that has gone into designing them. They avoid reinventing the wheel, and they can achieve interoperability and cross-linking among datasets, both internal and external.

Some examples of such metadata vocabularies are Dublin Core, DCAT and SKOS. They all come from the Semantic Web world, and have years of modeling expertise and use in projects behind them. They also work in tandem, showcasing how semantics-powered vocabularies enable users to mix and match, customize and extend them to suit their needs.
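As a rough illustration of how these vocabularies work in tandem, the sketch below describes a single dataset as a JSON-LD document: DCAT provides the dataset type and theme link, Dublin Core terms provide the title and description, and SKOS describes the theme as a concept. The namespace URIs are the published ones; the `example.org` identifiers, the sample values, and the `describe_dataset` helper are made up for the example.

```python
import json

# Published namespace URIs for the three vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def describe_dataset(dataset_id, title, description, theme_id, theme_label):
    """Build a JSON-LD description that mixes DCAT, Dublin Core and SKOS terms."""
    return {
        "@context": CONTEXT,
        "@id": dataset_id,
        "@type": "dcat:Dataset",           # DCAT: what kind of thing this is
        "dct:title": title,                # Dublin Core: descriptive fields
        "dct:description": description,
        "dcat:theme": {                    # DCAT links the dataset to a theme...
            "@id": theme_id,
            "@type": "skos:Concept",       # ...which SKOS models as a concept
            "skos:prefLabel": theme_label,
        },
    }


# Hypothetical identifiers and values, for illustration only.
doc = describe_dataset(
    "https://example.org/datasets/customers",
    "Customer records",
    "Contact details of active customers",
    "https://example.org/themes/personal-data",
    "Personal data",
)
print(json.dumps(doc, indent=2))
```

Because the theme is a shared SKOS concept rather than a free-text label, every dataset tagged with it can be found in one place, which is the kind of cross-linking a request about personal data would rely on.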

Ironically, while the Semantic Web itself has gotten a rather bad name in terms of approachability and applicability, its grand vision seems closer than ever: going from a web of documents, to a web of data.

A big part of the web is already powered by Semantic Web vocabularies, most prominently schema.org. And the use of knowledge graphs and semantic data lakes is getting more widespread too.

But how could a metadata, semantics-powered solution for Data Governance work in the real world?

A Metadata- and Semantics-powered Data Governance solution

Here’s a roadmap of sorts:

Build a comprehensive internal data model
Use vocabularies such as Dublin Core, DCAT and SKOS to capture and manage metadata
Expand to add more concrete / domain-specific models to make it easier for tools of different types to use the metadata

If that sounds interesting, but too abstract or hard, you’ll be glad to know it’s already under way in the Apache Atlas project.

Atlas is a Data Governance and metadata framework for Hadoop, and has been around for a while. Now, however, Atlas has gotten a new lease on life: there is a team of around 20 people in total working on it, it has support from IBM and Hortonworks, and it is taking center stage at one of the most important big data events in the world.

Atlas is based on the recipe above. If you’re interested in the technical details, you can check the in-depth presentation given at Dataworks EMEA, or the Atlas project wiki. Having done that, we turned to IBM Distinguished Engineer Mandy Chessel, who is among Atlas’ masterminds, to discuss further. Chessel explains that:

“Many data standards have a more limited scope than what we need to describe and govern data. We have built out a comprehensive metadata type system that gives us the base metamodel for the metadata repository and the metadata exchange protocols (the Open Metadata Repository Service).

This was seeded by stitching details of open standards and adding in other metadata elements that we know are needed for many client engagements. The plan is to tag our model with mappings to the open standards we have used. We are also working with The Open Group on building an O-DEF tag library for metadata to tie down the semantics of the elements in our metamodel.

This is very low-level and uses concepts like entity, relationship, classification and attribute. You could think of the objects it exchanges as metadata atoms. The domain specific models/APIs are like metadata molecules. They describe larger, domain specific objects like assets, data base table, policy etc”.
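The atoms and molecules analogy can be sketched in a few lines. The code below is not the Atlas type system or API; it is a toy model, with hypothetical type names such as `Table`, `Column`, `TableHasColumn` and `PersonalData`, showing how low-level entities, relationships, classifications and attributes could be composed into a larger domain-specific object.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An 'atom': a typed node with attributes and classifications."""
    guid: str
    type_name: str
    attributes: dict = field(default_factory=dict)
    classifications: list = field(default_factory=list)


@dataclass
class Relationship:
    """An 'atom' linking two entities by their identifiers."""
    type_name: str
    end1: str
    end2: str


@dataclass
class TableAsset:
    """A 'molecule': a domain-level object assembled from several atoms."""
    table: Entity
    columns: list
    links: list


def build_table_asset(table_name, column_names):
    """Assemble a table, its columns, and the relationships between them."""
    table = Entity(f"table:{table_name}", "Table", {"name": table_name})
    columns, links = [], []
    for col in column_names:
        column = Entity(f"column:{table_name}.{col}", "Column", {"name": col})
        columns.append(column)
        links.append(Relationship("TableHasColumn", table.guid, column.guid))
    return TableAsset(table, columns, links)


# Hypothetical usage: classify one column as holding personal data.
asset = build_table_asset("customers", ["id", "email"])
asset.columns[1].classifications.append("PersonalData")
print(asset)
```

Tools that exchange metadata would deal with the small `Entity` and `Relationship` objects, while a governance application would more naturally work with something like `TableAsset`.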

When discussing what led them to use this approach, Chessel says it was “partly to show how the open standards have been combined so we can verify the open metadata model is compatible with the open standards used by different communities.

But the real practical reason for maintaining the mapping to these standards is to enable metadata exchange with tools that use these standards. In particular, many open data sites use the dcat standard. It would be good to be able to exchange and consume metadata from these sites”.
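One way to picture what maintaining such a mapping buys you is the sketch below. It assumes a hypothetical internal model with attribute names like `display_name` and `summary`, tags them with the DCAT and Dublin Core terms they correspond to, and uses that mapping to translate records in both directions. It is an illustration of the idea, not how Atlas implements it.

```python
# Hypothetical internal attribute names mapped to open-standard terms.
STANDARD_MAPPINGS = {
    "display_name": "dct:title",
    "summary": "dct:description",
    "tags": "dcat:keyword",
}


def to_dcat(internal):
    """Export an internal record using standard terms; report what has no mapping."""
    exported = {"@type": "dcat:Dataset"}
    unmapped = []
    for key, value in internal.items():
        term = STANDARD_MAPPINGS.get(key)
        if term is None:
            unmapped.append(key)
        else:
            exported[term] = value
    return exported, unmapped


def from_dcat(external):
    """Import a record expressed in standard terms into the internal model."""
    reverse = {term: key for key, term in STANDARD_MAPPINGS.items()}
    return {reverse[t]: v for t, v in external.items() if t in reverse}


record = {"display_name": "Customers", "tags": ["crm"], "retention_days": 30}
exported, unmapped = to_dcat(record)
print(exported)
print(unmapped)
print(from_dcat(exported))
```

The `unmapped` list also makes the gaps visible: attributes an organization needs for governance, such as the made-up `retention_days` here, that the chosen standard terms do not cover.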

Above and beyond GDPR, and Hadoop

That’s all well and good in the (linked) open data world, but what about beyond that? Chessel acknowledges there has been limited uptake of open standards in commercial governance products, and says this is partly due to gaps in the standards and a lack of pressure from customers.

She adds however that there is a lot of interesting work around library systems and similar content management systems that they can learn from. Plus, there are metadata standards associated with different types of data, such as photographs or location data: “the standards can capture deep expertise which is helpful in validating what we are doing”.

Chessel says that designing, implementing and documenting this is done, at least as far as part 1 is concerned. The OMRS is practically complete and the Open Metadata Access Services are in progress. There are some pieces of the puzzle that are not quite there yet, such as lineage, for which they want to step back and evaluate their options.

When discussing more concrete, domain-specific models, Chessel notes that there is no metadata standard for describing a data model:

“Regulators and industry consortia end up resorting to commercial products (eg ERWIN or Power designer) or spreadsheets to distribute glossaries and data models. We have interest from a regulator to build out the area 5 model to be able to capture data models and reference data so they can publish their regulations in an open format that is directly consumable by tools”.

Inside open metadata — the deep dive from DataWorks Summit

But what is the target audience for Atlas, and what about its real-life adoption?

“Atlas is the open source project that is where we are developing the source code — this includes the APIs, frameworks, protocol implementations that are packaged together. There is also an “Atlas Server” which is a graph based metadata repository that is delivered by the Apache Atlas project too.

This Atlas Server is part of the Hortonworks HDP product — which is where the Hadoop association comes from. The open metadata libraries are designed to be embedded in the Atlas server of course, but also other vendor products. Also the Atlas server runs outside of Hadoop. So we are not tied to Hadoop.

Then we have developers from our products taking the open metadata libraries and incorporating them into our products. There is another large vendor looking at integration and others waiting in the wings for us to finish the domain-specific interfaces before starting their integration work.

The technology we have built uses a connector framework to allow it to run on different platforms. However, adoption is key. This is where the ODPi comes from. It aims to help vendors with adoption and brings practitioners together to develop open metadata content”.

Connected Data — Bringing it all together

Even though Atlas is not the only solution in this space, and needs to work in a broader context including tools and processes, this development is important. Atlas may be the lever for metadata and semantics to get a head start in Data Governance in the enterprise, similar to what schema.org has done for data on the web.

Even though additional analysis is warranted, in the end, it’s all about connecting data. This is exactly what we’ll be doing at Connected Data in London this November. If that got your attention, there’s more where that came from. We are putting together a program featuring la creme de la creme in:

Enterprise Knowledge Graphs & Schema Management
Linked Data & Semantic Publishing
AI, Machine Learning & Graphs
Graph Databases

Come join us while it’s still early!

Originally published at connected-data.london on May 21, 2018.