Parallels Between the Structure of Natural Language Statements and Data Schemata

Too Long; Didn't Read

This paper explores a machine-actionable Rosetta Stone Framework for (meta)data, which uses reference terms and schemata as an interlingua to minimize mappings and crosswalks.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Vogt, Lars, TIB Leibniz Information Centre for Science and Technology;

(2) Konrad, Marcel, TIB Leibniz Information Centre for Science and Technology;

(3) Prinz, Manuel, TIB Leibniz Information Centre for Science and Technology.

Parallels between the structure of natural language statements and data schemata with implications for semantic interoperability

Each datum can be thought of as a somewhat formalized representation of a natural language statement, structured so that it can be easily compared with statements of the same type and easily read and operationalized by machines (cf. Fig. 2A with 2D,E). A data schema can therefore be thought of as a formalization of a particular type of natural language statement that supports its machine-actionability. In other words, data schemata are to machines what syntax trees are to humans: both define positions with semantic roles. When we compare data schemata with their corresponding natural language statements, we can thus see similarities between the structure of a sentence, as defined by the syntax and grammar of a natural language, and the structure of the corresponding schema (Fig. 2).

As discussed above, the syntactic positions of terms in a natural language sentence take on specific semantic roles and contribute significantly to the meaning of the statement. For a data schema to have the same meaning as its corresponding natural language statements, it must, as a minimum requirement, functionally and semantically provide a similar structure with the same elements as the corresponding syntax tree. That is, the schema must represent all relevant syntactic positions (in schemata often called slots) and their associated semantic roles in the form of constraint specifications, with terms and values populating the slots (see Fig. 2D,E). After all, humans need to be able to understand these data schemata and to translate a datum represented in a given data schema back into a natural language statement. Data schemata can therefore be seen as attempts to translate the structure of natural language sentences into machine-actionable data structures.
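To make the correspondence between slots and syntactic positions concrete, the following is a minimal sketch, not the authors' implementation. The `Slot` and `Schema` classes, the slot names, the constraints, and the apple weight statement are all hypothetical and chosen only for illustration. It shows a schema whose slots carry semantic roles and constraint specifications, and a datum being translated back into a natural language statement.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Slot:
    """One syntactic position of a statement type (hypothetical model)."""
    name: str                              # slot name used in a datum
    role: str                              # semantic role of the position
    constraint: Callable[[object], bool]   # stand-in for a constraint specification


@dataclass
class Schema:
    """A statement type: its slots plus a template for reading a datum back."""
    slots: List[Slot]
    template: str

    def validate(self, datum: Dict[str, object]) -> bool:
        # A datum conforms if it fills exactly the schema's slots
        # and every value satisfies the constraint of its slot.
        if set(datum) != {slot.name for slot in self.slots}:
            return False
        return all(slot.constraint(datum[slot.name]) for slot in self.slots)

    def to_sentence(self, datum: Dict[str, object]) -> str:
        # Translate a conforming datum back into a natural language statement.
        if not self.validate(datum):
            raise ValueError("datum does not conform to the schema")
        return self.template.format(**datum)


# Hypothetical schema for a weight measurement statement.
weight_schema = Schema(
    slots=[
        Slot("object", "measured entity", lambda v: isinstance(v, str)),
        Slot("value", "measured quantity",
             lambda v: isinstance(v, (int, float)) and not isinstance(v, bool) and v > 0),
        Slot("unit", "unit of measurement", lambda v: v in {"gram", "kilogram"}),
    ],
    template="This {object} has a weight of {value} {unit}s.",
)

datum = {"object": "apple", "value": 212.45, "unit": "gram"}
sentence = weight_schema.to_sentence(datum)
print(sentence)
```

In this sketch the template plays the part of the sentence's syntax, while the constraints play the part of the semantic roles: a value in the wrong slot, or a missing slot, makes the datum fail validation.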


Looking at these interdependencies, we can distinguish different causes for the lack of semantic interoperability. And since we distinguish between terms and propositions as two different types of meaning-carrying entities when we communicate (meta)data, we can distinguish causes for the lack of terminological interoperability from causes for the lack of propositional interoperability.


With regard to terminological interoperability, we can distinguish between ontological and referential causes for the lack of interoperability of terms (10). When two given terms are compared semantically, they can either (i) differ in both their meaning and their referent (e.g., ‘apple’ and ‘car’), (ii) differ only in their meaning but share the same referent (e.g., ‘Morning Star’ and ‘Evening Star’, both referring to the planet Venus), or (iii) share the same meaning and the same referent.


If two terms share their meaning and their referent, they are referentially and ontologically interoperable. They are strict synonyms and can be used interchangeably. Since no two terms can share their meaning but not their referent, ontological interoperability always implies referential interoperability, but not vice versa. Controlled vocabularies may differ in their ontological commitments, so two terms can have the same referent without having the same meaning. In that case, their ontological interoperability is violated, but not necessarily their referential interoperability, since both terms can still be used to refer to the same entity. For example, the COVID-19 Vocabulary Ontology (COVoC) defines ‘viruses’ as a subclass of ‘organism’, while the Virus Infectious Diseases Ontology (VIDO) defines ‘virus’ as a subclass of ‘acellular structure’, and thus as an object that is not an organism. These two terms are therefore not ontologically interoperable, even though they have the same referent (i.e., the same extension). In other words, the set of ontologically interoperable terms is a subset of the set of referentially interoperable terms.
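The distinction can be sketched in a few lines of code. This is an illustrative toy model only: the `Term` class and its string-valued `meaning` and `referent` fields are assumptions that stand in for ontological definitions and extensions, and they are not taken from COVoC or VIDO themselves.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Term:
    label: str
    meaning: str    # stand-in for the ontological definition
    referent: str   # stand-in for the entities referred to (extension)


def referentially_interoperable(a: Term, b: Term) -> bool:
    # Both terms can be used to refer to the same entities.
    return a.referent == b.referent


def ontologically_interoperable(a: Term, b: Term) -> bool:
    # Same meaning and same referent: strict synonyms.
    return a.meaning == b.meaning and a.referent == b.referent


# Simplified placeholders for the example in the text.
covoc_virus = Term("viruses (COVoC)", "subclass of organism", "viruses")
vido_virus = Term("virus (VIDO)", "subclass of acellular structure", "viruses")
apple = Term("apple", "fruit of an apple tree", "apples")

terms = [covoc_virus, vido_virus, apple]

# Ontological interoperability implies referential interoperability,
# so the ontologically interoperable pairs form a subset.
for a, b in combinations(terms, 2):
    if ontologically_interoperable(a, b):
        assert referentially_interoperable(a, b)

print(referentially_interoperable(covoc_virus, vido_virus))
print(ontologically_interoperable(covoc_virus, vido_virus))
```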


As far as terminological interoperability is concerned, we can therefore conclude that although ontological interoperability is preferred, referential interoperability is the minimum requirement for the interoperability of terms, since when we communicate information, we at least want to know that we are referring to the same entities.


In the context of knowledge graphs and ontologies, if two terms share the same meaning and referent despite having different UPRIs, we can express their terminological interoperability by specifying a corresponding term mapping using the ‘same as’ (owl:sameAs) property. If two terms have only the same referent but not the same meaning, we can express their referential interoperability by specifying a corresponding term mapping using the ‘equivalent class’ (owl:equivalentClass) property. We can thus distinguish between ontological (i.e., same-as) and referential (i.e., equivalent-class) term mappings. Both types of mappings are homogeneous definition mappings (49), where there is only one vocabulary element to be mapped on the left side, and several others on the right side of the definition that do not need to be mapped.
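As a rough illustration of the two kinds of term mapping, the sketch below records each mapping as a plain subject-property-object triple and picks the property according to the convention stated in the paragraph above. The helper `term_mapping` and the prefixed identifiers (`ex1:`, `ex2:`) are hypothetical placeholders, not identifiers from any real vocabulary, and no RDF library is assumed.

```python
from typing import Optional, Tuple

Triple = Tuple[str, str, str]

OWL_SAME_AS = "owl:sameAs"
OWL_EQUIVALENT_CLASS = "owl:equivalentClass"


def term_mapping(source: str, target: str,
                 same_meaning: bool, same_referent: bool) -> Optional[Triple]:
    """Return a mapping triple following the convention described in the text.

    Ontological mapping (same meaning and referent) -> owl:sameAs
    Referential mapping (same referent only)        -> owl:equivalentClass
    No shared referent                              -> no mapping
    """
    if not same_referent:
        return None
    prop = OWL_SAME_AS if same_meaning else OWL_EQUIVALENT_CLASS
    return (source, prop, target)


# Hypothetical identifiers from two vocabularies.
ontological = term_mapping("ex1:Gram", "ex2:gram", same_meaning=True, same_referent=True)
referential = term_mapping("ex1:Virus", "ex2:Virus", same_meaning=False, same_referent=True)
unmapped = term_mapping("ex1:Apple", "ex2:Car", same_meaning=False, same_referent=False)

for mapping in (ontological, referential, unmapped):
    print(mapping)
```

Keeping the two mapping types as distinct properties lets a consumer decide later whether a referential mapping is sufficient for its purpose or whether it needs the stricter ontological one.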


With respect to propositional interoperability, we can distinguish between logical and schematic causes for the lack of interoperability of statements of the same type (10). Data and metadata statements are logically interoperable if they have been modeled on the basis of the same logical framework (e.g., OWL-Full and thus on full description logics), so that one can reason over them using appropriate reasoners (e.g., an OWL-Full reasoner). When we talk about logically interoperable terms, we are actually referring to the ontological definitions of these terms and thus to their class axioms, which are universal statements that can be placed in the same logical framework if the terms are logically interoperable.


Schematic interoperability is achieved when statements of the same type are documented using the same (meta)data schema. If statements of the same type were represented using different schemata, corresponding (meta)data would no longer be interoperable. In such cases, one would have to specify schema crosswalks by aligning slots (i.e., syntactic positions) that share the same constraint specification (i.e., semantic roles) across different schemata modeling the same type of statement, in order to regain schematic interoperability (see Fig. 3). If the schemata use different vocabularies to populate their slots (i.e., the constraint specifications refer to different vocabularies), then corresponding term mappings must be included in the crosswalk to ensure terminological interoperability (see red bordered slots in Fig. 3). Consequently, we can distinguish between ontological and referential schema crosswalks. A schema crosswalk is a set of rules that specifies how (meta)data elements or attributes (i.e., slots) from one schema and format can be aligned and mapped to the equivalent (meta)data elements or attributes in another schema and format.
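A schema crosswalk of this kind might be sketched as follows. All names here are assumptions made for illustration: the two schemata, their slot names, the role labels that stand in for constraint specifications, and the `vocabA:`/`vocabB:` terms are hypothetical and do not reproduce the schemata shown in Fig. 3.

```python
from typing import Dict

# Hypothetical schemata for the same statement type (a weight measurement).
# Each maps a slot name to its semantic role, a stand-in for a constraint specification.
schema_a = {"subject": "measured entity", "weight_value": "quantity value", "weight_unit": "unit"}
schema_b = {"specimen": "measured entity", "magnitude": "quantity value", "uom": "unit"}


def build_crosswalk(source: Dict[str, str], target: Dict[str, str]) -> Dict[str, str]:
    """Align slots of two schemata that share the same semantic role."""
    role_to_target_slot = {role: slot for slot, role in target.items()}
    return {
        slot: role_to_target_slot[role]
        for slot, role in source.items()
        if role in role_to_target_slot
    }


# Hypothetical term mappings, needed because the schemata use different vocabularies.
term_map = {"vocabA:apple": "vocabB:apple_fruit", "vocabA:gram": "vocabB:g"}


def apply_crosswalk(datum: Dict[str, object],
                    crosswalk: Dict[str, str],
                    terms: Dict[str, str]) -> Dict[str, object]:
    """Move each value to its aligned slot and map terms where a mapping exists."""
    converted: Dict[str, object] = {}
    for slot, value in datum.items():
        if slot not in crosswalk:
            continue  # slot has no counterpart in the target schema
        if isinstance(value, str):
            value = terms.get(value, value)
        converted[crosswalk[slot]] = value
    return converted


crosswalk = build_crosswalk(schema_a, schema_b)
datum_a = {"subject": "vocabA:apple", "weight_value": 212.45, "weight_unit": "vocabA:gram"}
datum_b = apply_crosswalk(datum_a, crosswalk, term_map)
print(crosswalk)
print(datum_b)
```

The slot alignment handles schematic interoperability, and the term map handles terminological interoperability; whether the resulting crosswalk is ontological or referential depends on which kind of term mappings it contains.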


In other words, to achieve schematic interoperability between two given statements, the subject, predicate, and object slots of their (meta)data schemata need to be aligned, and their terms need to be mapped across controlled vocabularies. To do this, the schemata must first be formally specified, e.g., as graph patterns in the form of SHACL shapes. Shapes that share the same statement-type referent then need to be aligned and mapped. These are ontology pattern alignments (i.e., TBox alignments) (49) or ABox alignments, where several vocabulary elements must first be aligned and then mapped in schema crosswalks.


As far as propositional interoperability is concerned, we can therefore conclude that although a combination of logical interoperability and schematic interoperability is preferred, for the same reasons as for terms, schematic interoperability using referential schema crosswalks is the minimum requirement for the interoperability of statements.


In summary, the interoperability of (meta)data statements depends not only on the number of applicable operations, and thus on machine-actionability, but also on the completeness of the ontological and referential term mappings and schema crosswalks relevant to the statements.


Figure 3: Crosswalk from one schema to another for a weight measurement statement. The same weight measurement statement is modeled using two different schemata. Top: The weight measurement according to the schema of the