Towards a Rosetta Stone for (meta)data: Abstract & Introduction

Too Long; Didn't Read

This paper explores a machine-actionable Rosetta Stone Framework for (meta)data, which uses reference terms and schemata as an interlingua to minimize mappings.

This paper is available on arXiv under CC 4.0 license.

Authors:

(1) Vogt, Lars, TIB Leibniz Information Centre for Science and Technology;

(2) Konrad, Marcel, TIB Leibniz Information Centre for Science and Technology;

(3) Prinz, Manuel, TIB Leibniz Information Centre for Science and Technology.

Abstract

In order to effectively manage the overwhelming influx of data, it is crucial to ensure that data is findable, accessible, interoperable, and reusable (FAIR). While ontologies and knowledge graphs have been employed to enhance FAIRness, challenges remain regarding semantic and cognitive interoperability. We explore how English facilitates reliable communication of terms and statements, and transfer our findings to a framework of ontologies and knowledge graphs, while treating terms and statements as minimal information units. We categorize statement types based on their predicates, recognizing the limitations of modeling non-binary predicates with multiple triples, which negatively impacts interoperability. Terms are associated with different frames of reference, and different operations require different schemata. Term mappings and schema crosswalks are therefore vital for semantic interoperability. We propose a machine-actionable Rosetta Stone Framework for (meta)data, which uses reference terms and schemata as an interlingua to minimize mappings and crosswalks. Modeling statements rather than a human-independent reality ensures cognitive familiarity and thus better interoperability of data structures. We extend this Rosetta modeling paradigm to reference schemata, resulting in simple schemata with a consistent structure across statement types, empowering domain experts to create their own schemata using the Rosetta Editor, without requiring knowledge of semantics. The Editor also allows specifying textual and graphical display templates for each schema, delivering human-readable data representations alongside machine-actionable data structures. The Rosetta Query Builder derives queries based on completed input forms and the information from corresponding reference schemata. This work sets the conceptual ground for the Rosetta Stone Framework that we plan to develop in the future.


Introduction

In the current era, we are witnessing an exponential growth in the generation and consumption of data. Statista’s projections from 2020 indicate that global data creation, capture, copying, and consumption will reach an estimated 97 zettabytes in 2022, with a daily volume of approximately 328.77 million terabytes by 2023, reflecting a 23.71% increase over the preceding year, and with videos accounting for over 50% of global traffic (1,2). Additionally, in 2016, the data produced in the preceding two years accounted for a striking 90% of all data generated worldwide (3), illustrating the rapid pace of data expansion. The doubling of the overall data volume every three years (4) further accentuates the magnitude of this growth. Simultaneously, the scholarly domain is witnessing a surge in publications, with an annual output of over 7 million academic papers (5). These figures emphasize the urgency of harnessing machine support: without it, the sheer volume of (meta)data threatens to overwhelm researchers and impede meaningful insights.


With this in mind, it is essential to facilitate machine-actionable (meta)data in scientific research, so that machines assist researchers in identifying relevant (meta)data pertaining to a specific research question. Moreover, enhancing the machine-actionability of (meta)data offers a potential solution to the reproducibility crisis in science (6), enabling the availability, findability, and usability of raw (meta)data (7).


Considering the substantial global investment in research and development, which reached a staggering sum of $1,609,214 billion in 2021 (8), failure to render the corresponding output of (meta)data machine-actionable could potentially result in redundant research efforts. By enabling machine-actionable (meta)data, substantial savings can be achieved, redirecting resources towards novel and non-repetitive research endeavors. However, the concept of machine-actionable (meta)data requires clarification.



Looking at the definition of machine-actionable (see Box 1), it is evident that this attribute cannot be simplified as a mere Boolean property. Instead, it exists on a spectrum, allowing for degrees of machine-actionability. Numerous operations can potentially be applied to a given set of (meta)data, and the ability to apply even a single operation would suffice to classify the (meta)data as machine-actionable. Consequently, specifying the set of operations that can be applied to the (meta)data, along with the corresponding tools or code, is more meaningful than labeling the entire set as machine-actionable. For instance, one could state that “Dataset A is machine-actionable with respect to operation X using tool Y”.
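
The statement form above can be sketched as a small data structure. This is a minimal illustration and not part of the framework described in the paper: the class `ActionabilityClaim`, the helpers `operations_for` and `is_actionable`, and all dataset, operation, and tool names are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionabilityClaim:
    """One claim of the form: dataset D supports operation X using tool Y."""
    dataset: str
    operation: str
    tool: str


# Hypothetical claims; the names are placeholders, not real datasets or tools.
claims = {
    ActionabilityClaim("Dataset A", "unit conversion", "tool Y"),
    ActionabilityClaim("Dataset A", "schema validation", "tool Z"),
    ActionabilityClaim("Dataset B", "unit conversion", "tool Y"),
}


def operations_for(dataset: str) -> set[str]:
    """Return the set of operations for which a dataset is actionable."""
    return {c.operation for c in claims if c.dataset == dataset}


def is_actionable(dataset: str, operation: str) -> bool:
    """Answer the question per operation instead of as one global flag."""
    return operation in operations_for(dataset)
```

Under these assumptions, asking whether "Dataset B" is machine-actionable has no single answer: it is actionable with respect to unit conversion but not with respect to schema validation, which is the sense in which the property is a spectrum rather than a Boolean.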


It is worth noting that reading a dataset could be considered an operation itself. Therefore, datasets documented in formats such as PDF, XML, or even ASCII files can be considered machine-readable and, to some extent, already machine-actionable. Moreover, if a dataset is machine-readable, search operations can be performed on it, enabling the identification of specific elements through label matching, for example. The success of search operations serves as a measure of the findability of (meta)data. Like machine-actionability, findability cannot be characterized as a Boolean property. Machine-readable (meta)data can be found through label matching, while interpretable (meta)data can be found through their meaning, referent, or contextual information. Thus, (meta)data that are readable but not interpretable possess limited findability. It is important to emphasize that our definition of machine-actionability, as outlined in Box 1, strictly depends on machine-interpretability. Consequently, machine-reading of a dataset and machine-searching based on label matching are not considered proper examples of operations that fulfill the requirements of machine-actionability.
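
The difference between finding (meta)data by label and finding them by meaning can be sketched as follows. The records, labels, and `https://example.org/` concept identifiers are invented for illustration, and both search functions are hypothetical helpers rather than part of any existing tool.

```python
# Each record carries a human-readable label and an identifier for its meaning.
# The identifiers use a placeholder namespace, not a real vocabulary.
HOUSE_MOUSE = "https://example.org/concept/house_mouse"
COMPUTER_MOUSE = "https://example.org/concept/computer_mouse"

records = [
    {"label": "mouse", "concept": HOUSE_MOUSE},
    {"label": "mouse", "concept": COMPUTER_MOUSE},
    {"label": "Mus musculus", "concept": HOUSE_MOUSE},
]


def find_by_label(label: str) -> list[dict]:
    """Label matching: works on any readable record but ignores meaning."""
    return [r for r in records if r["label"].lower() == label.lower()]


def find_by_concept(concept: str) -> list[dict]:
    """Meaning-based lookup: needs records that carry an interpretable identifier."""
    return [r for r in records if r["concept"] == concept]
```

In this toy dataset, `find_by_label("mouse")` returns a record with the wrong sense and misses the synonym, whereas `find_by_concept(HOUSE_MOUSE)` returns exactly the two records about the animal. This is one way to picture why readable but uninterpretable (meta)data have limited findability.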


In 2016, the FAIR Guiding Principles for (meta)data were introduced, providing a framework to assess the extent to which (meta)data are Findable, Accessible, Interoperable, and Reusable for both machines and humans alike (11). These principles have gained increasing attention from the research, industry, and knowledge management tool development communities in recent years (11–16). Furthermore, stakeholders in science and research policy have recognized the significance of the FAIR Principles. The economic impact of FAIR research (meta)data was estimated by the European Union (EU) in 2018, revealing that the lack of FAIR (meta)data costs the EU economy at least 10.2 billion Euros annually. Taking into account the positive effects of FAIR (meta)data on data quality and machine-readability, an additional 16 billion Euros were estimated (17). Consequently, the High Level Expert Group of the European Open Science Cloud (EOSC) recommended the establishment of an Internet of FAIR Data and Services (IFDS) (18). The IFDS aims to enable data-rich institutions, research projects, and citizen-science initiatives to make their (meta)data accessible in accordance with the FAIR Guiding Principles, while retaining control over ethical, privacy, and legal aspects of their (meta)data (following Barend Mons’ data visiting as opposed to data sharing (19)). Achieving this goal requires the provision of rich machine-actionable (meta)data, their organization into FAIR Digital Objects (20,21), each identifiable by a Unique Persistent and Resolvable Identifier (UPRI), and the development of suitable concepts and tools for human-readable interface outputs and search capabilities. Although progress has been made toward building the IFDS (see the GO FAIR Initiative and EOSC), the current state of the FAIRness of (meta)data in many data-rich institutions and companies is still far from ideal.


The increasing volume, velocity, variety, and complexity of data present significant challenges that traditional methods and techniques for handling, processing, analyzing, managing, storing, and retrieving (meta)data struggle to address effectively within a reasonable timeframe (22). However, knowledge graphs, in conjunction with ontologies, offer a promising technical solution for implementing the FAIR Guiding Principles, thanks to their transparent semantics, highly structured syntax, and standardized formats (23,24). Knowledge graphs represent instances, classes, and their relationships as resources with their own UPRIs. These UPRIs are employed to denote relationships between entities using the triple syntax of Subject-Predicate-Object. Each particular relationship is thus modeled as a structured set of three distinct data points. In contrast, relational databases model entity relationships between data table columns and not between individual data points. Consequently, knowledge graphs outperform relational databases in handling complex queries on densely connected data, which is often the case with research (meta)data. Therefore, knowledge graphs are particularly well suited for FAIR research and development, as well as any task requiring efficient and detailed retrieval of (meta)data.
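
The Subject-Predicate-Object syntax described above can be illustrated with plain Python tuples. This is a sketch under stated assumptions: the `https://example.org/` identifiers stand in for real UPRIs, the sample statement is invented, and `match` is a hypothetical helper, not the API of any triple store.

```python
from typing import Optional

Triple = tuple[str, str, str]  # (subject, predicate, object)

EX = "https://example.org/"  # placeholder namespace, not a real vocabulary

# A tiny graph: one apple with a weight measurement.
triples: list[Triple] = [
    (EX + "apple_1", EX + "instance_of", EX + "Apple"),
    (EX + "apple_1", EX + "has_weight", EX + "weight_1"),
    (EX + "weight_1", EX + "has_value", "204.56"),
    (EX + "weight_1", EX + "has_unit", EX + "gram"),
]


def match(
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> list[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]
```

For example, `match(s=EX + "apple_1")` returns the two triples whose subject is the apple. Note that the single statement "this apple weighs 204.56 grams" is spread over three triples here, which illustrates the point made in the abstract about modeling non-binary predicates with multiple triples.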


Nonetheless, employing a knowledge graph to document (meta)data does not guarantee adherence to the FAIR Principles. Achieving FAIRness necessitates additional guidelines, such as consistent usage of the same semantic data model for identical types of (meta)data statements to ensure schematic interoperability (10), as well as organizing (meta)data into FAIR Digital Objects (20,21). Knowledge graphs, being a relatively new concept and technology, introduce their own specific technical, conceptual, and societal challenges. This is evident in the somewhat ambiguous nature of the knowledge graph concept (23) and the absence of commonly accepted standards, given the diverse technical and conceptual incarnations ranging from labeled property graphs like Neo4j to approaches based on the Resource Description Framework (RDF), employing RDF stores and applications of description logic using the Web Ontology Language (OWL).


Regardless of these considerations, it can be argued that the demand for FAIR and machine-actionable (meta)data presupposes their successful communication between machines and between humans and machines. During such communication processes, preserving the meaning and reference of the message between sender and receiver is crucial, requiring both parties to share the same background knowledge, encompassing lexical competences, syntax and grammar rules, and relevant contextual knowledge.


In this paper, we lay the conceptual groundwork for a future machine-actionable Rosetta Stone Framework designed to achieve cognitive and semantic interoperability of (meta)data. We begin by examining the notion of interoperability and emphasizing the necessity of cognitive interoperability. Drawing inspiration from natural languages like English, we explore how semantic interoperability can be understood by analyzing the way terms and statements convey meaning and information. Recognizing terms and statements as basic units of information, we investigate the linguistic structures that ensure reliable information communication and draw parallels to the structures found in (meta)data schemata. This analysis yields insights into achieving semantic interoperability of terms and statements.


Our main argument centers around the requirement for a machine-actionable Rosetta Stone Framework for (meta)data, which addresses terminological and propositional interoperability. This framework leverages reference terms and reference schemata as an interlingua to establish term mappings and schema crosswalks, facilitating cognitive and semantic interoperability. The emphasis lies on a modeling paradigm that enables machine-interpretability of (meta)data, prioritizing their findability and interoperability over reasoning capabilities. This prioritization opens up new avenues for modeling by shifting away from the paradigm frequently applied in science that focuses on modeling a human-independent reality. Instead, the Rosetta Modeling Paradigm models the structure of statements to enhance efficient and reliable information communication.
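
To see why an interlingua can reduce the number of mappings, compare direct pairwise crosswalks with crosswalks to a single reference schema. The counting below is a general hub-and-spoke argument offered as an illustration, not a figure taken from the paper; the function names are hypothetical, and each crosswalk is assumed to be usable in both directions.

```python
def pairwise_crosswalks(n_schemata: int) -> int:
    """Crosswalks needed when every schema is mapped directly to every other."""
    return n_schemata * (n_schemata - 1) // 2


def interlingua_crosswalks(n_schemata: int) -> int:
    """Crosswalks needed when every schema is mapped only to one reference schema."""
    return n_schemata


# Compare both strategies for a few community sizes.
for n in (3, 10, 50):
    print(n, pairwise_crosswalks(n), interlingua_crosswalks(n))
```

Under these assumptions, ten schemata need 45 pairwise crosswalks but only 10 crosswalks to a reference schema, and the gap widens quadratically as more schemata are added.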


We propose two alternative versions of a Rosetta Modeling Approach: a minimal light version and a full version that supports versioning of statements and tracks the editing history of each statement within a knowledge graph. Furthermore, we discuss the specifications for a Rosetta Editor, a tool that will allow domain experts to create new reference schemata without requiring expertise in semantics or computer science. The Rosetta Editor will also enable the specification of display templates for human-readable text-based and graph-based representation of (meta)data.


Additionally, we propose a Rosetta Query Builder that allows users to write queries without the need for a graph query language. Finally, we provide an overview of the Rosetta Framework and explore related work in the field before presenting our concluding remarks.


It is important to note that this paper represents conceptual work and serves as a foundation for future development, testing, and implementation of the envisioned Rosetta Stone Framework.