
Towards a Rosetta Stone for (meta)data: What Makes a Term a Good Term and a Schema a Good Schema?

Too Long; Didn't Read

This paper explores a machine-actionable Rosetta Stone Framework for (meta)data, which uses reference terms and schemata as an interlingua to minimize mappings and crosswalks.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Vogt, Lars, TIB Leibniz Information Centre for Science and Technology;

(2) Konrad, Marcel, TIB Leibniz Information Centre for Science and Technology;

(3) Prinz, Manuel, TIB Leibniz Information Centre for Science and Technology.

What makes a term a good term and a schema a good schema?

First and foremost, a good schema for a (meta)data statement must cover all the information that needs to be documented, stored, and represented for the corresponding type of statement. However, beyond that, there are many other criteria for evaluating schemata. Most of these relate to the different operations one wants to perform on the (meta)data, and the formats required by the corresponding tools, which determine the degree of machine-actionability of the (meta)data. These include search operations (i.e., the findability in FAIR), but also reasoning and all kinds of data transformations, such as unit conversion for measurement data. Communicating with humans is another set of operations that needs to be considered when evaluating (meta)data schemata, as it relates to cognitive interoperability and thus the human-actionability of (meta)data.


Unfortunately, different operations are likely to have different requirements on a schema, and the tools that execute these operations may have their own requirements. For example, optimizing the findability of measurement data requires a different data schema than optimizing reasoning over them or their reusability. A given schema therefore needs to be evaluated in terms of the operations to be performed and the tools to be used on the (meta)data, often involving trade-offs between different operations that are prioritized differently in order to achieve an overall optimum. An example is the trade-off between reasoning and human-readability, as discussed above in the context of the dilemma between machine-actionable and human-actionable (meta)data schemata (Fig. 1).


As a consequence, a given type of statement is likely to need more than one corresponding schema. Besides historical reasons, this is mainly because different research communities often have different frames of reference and thus emphasize different aspects of a given type of entity, which creates a need for different terms for the same type of entity (causing issues with ontological and thus terminological interoperability, but not necessarily with referential interoperability). It is also because different research communities want to perform different operations on the (meta)data and therefore need different types of schemata. Since operations on (meta)data can be performed with different sets of tools, not only the structure of the schema matters, but also the format in which it can be communicated to such tools. For example, some tools require (meta)data in RDF/OWL, others in JSON, as CSV, or as a Python or Java data class.
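To make the point about formats concrete, the following is a minimal sketch of one statement schema expressed as a Python data class and serialized to JSON and CSV. The class `WeightMeasurement`, its slot names, and the sample values are hypothetical illustrations, not part of the framework described in the paper.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class WeightMeasurement:
    """Hypothetical schema for a weight measurement statement."""
    subject: str   # identifier of the measured object (illustrative)
    value: float   # measured value
    unit: str      # unit label; a real schema would likely use an ontology term


def to_json(m: WeightMeasurement) -> str:
    """Serialize the statement for tools that expect JSON."""
    return json.dumps(asdict(m), sort_keys=True)


def to_csv(m: WeightMeasurement) -> str:
    """Serialize the statement for tools that expect CSV (header plus one row)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(m)],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerow(asdict(m))
    return buffer.getvalue()


measurement = WeightMeasurement(subject="apple_1", value=204.56, unit="gram")
json_text = to_json(measurement)
csv_text = to_csv(measurement)
```

The information content is the same in all three representations; only the format handed to the consuming tool differs.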


Obviously, FAIRness alone is not a sufficient indicator of high-quality (meta)data: the use of (meta)data often depends on their fitness-for-use, i.e., their availability in appropriate formats that conform to established standards and protocols and allow their direct use, e.g., when a specific analysis software requires data in a specific format.


Therefore, although agreement on a common vocabulary and a common set of schemata would be a solution for semantic interoperability and machine-actionability of (meta)data across different research domains, this is unlikely to happen. We have to think pragmatically and emphasize the need for ontological and referential term mappings and for schema crosswalks to achieve terminological and schematic interoperability.
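A schema crosswalk routed through a reference schema, in the spirit of the interlingua idea mentioned in the summary, might look like the sketch below. The schema names, slot names, and the `TO_REFERENCE` table are assumptions made for illustration; real crosswalks would also have to map terms and values, not only slot names.

```python
# Hypothetical slot mappings from each community schema to a reference schema.
TO_REFERENCE = {
    "schema_a": {"obj": "subject", "val": "value", "uom": "unit"},
    "schema_b": {"specimen": "subject", "weight": "value", "weight_unit": "unit"},
}


def to_reference(record: dict, schema: str) -> dict:
    """Rename the slots of a community record to the reference schema's slots."""
    mapping = TO_REFERENCE[schema]
    return {mapping[slot]: value for slot, value in record.items()}


def from_reference(record: dict, schema: str) -> dict:
    """Rename reference slots back to the slots of a community schema."""
    inverse = {ref: local for local, ref in TO_REFERENCE[schema].items()}
    return {inverse[slot]: value for slot, value in record.items()}


def crosswalk(record: dict, source: str, target: str) -> dict:
    """Convert between two community schemata via the reference schema."""
    return from_reference(to_reference(record, source), target)


record_a = {"obj": "apple_1", "val": 204.56, "uom": "gram"}
record_b = crosswalk(record_a, "schema_a", "schema_b")
```

With this design, each community schema needs only one mapping to the reference schema, rather than a separate crosswalk to every other schema.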