The abundance of tabular data being accumulated by enterprises internally and online represents a big challenge for data integration and analysis for companies that strive to be data-driven.
While the main focus of AI/ML research and community during the last several years has been on processing unstructured data, tabular data still is where the most time and money are spent in the Data Integration world.
These days, every company, whether big or small, has dozens of different systems (data sources).
Many of those have great integration capabilities, but still, every day people export data from various systems and ingest it into proprietary enterprise Data Lakes, Data Warehouses, Reporting databases, etc.
That requires a lot of time and qualified Data Engineers.
To keep up with quickly changing business requirements and growing data size, there have been many great technologies and tools developed by companies and communities: big data stack, data wrangling tools, quality and observability, visualization, and NLU interfaces to the data.
All those technologies allowed the collection and processing of unprecedented amounts of data. Though it also makes the modern data engineering stack very complex, which, in turn, results in long and expensive development cycles that don’t meet business requirements
Tools assume that the user understands the data, knows how it’s structured, and how to cross-reference data that comes from different sources.
And that represents probably the biggest problem in today’s corporate world. In order to do something with the data you have to understand it, know where it is, and how it’s structured.
There is a misconception that there are people who fully understand all the nuances of companies' data and can efficiently manage it.
And it becomes exponentially more difficult when a company grows and people change. In its work, Datuum specifically targets this challenge by applying state-of-the-art AI models to help Data Workers handle the variety and complexity of tabular data.
Datuum speeds up and optimizes the integration of multiple data sources by helping data workers find the data they need and connect the dots between source and destination.
As a result of interacting with your data, our AI learns more with every iteration and that data so the institutional knowledge is not lost.
Imagine that you have two different data sources with seemingly similar data stored in them. For instance, data from medical or insurance domains, in which some tables could contain hundreds or even thousands of columns.
Our major task is to unify this data into a single source of truth.
The cornerstone of this unification is the process of column matching – discovering combinations of columns from different data sources which store the same information (Fig. 1).
While it could be trivial to match columns with generalized data like name, email, or address, it's not the case for columns containing domain-specific information – to map this kind of data, we usually need domain expertise and manual intervention.
The other important phase of the unification pipeline would be record matching – the process in which records from different data sources are merged into a single entity. For highly regulated domains like healthcare, this step could be extremely important.
And it's even harder to get right when large datasets are being processed, even for human annotators.
Thankfully, the latest research in large language models gives us the key to the simplification of the matching process. It doesn't yet fully remove the need for human experts (especially in complex domains) but could give them a substantial boost compared to standard settings.
By definition, language modeling is the usage of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models train by analyzing text to correctly predict the next or masked word in a sequence (Fig. 2).
The task of language modeling is not new, but in recent years, it has become one of the main areas for research and development in the machine learning field. There are 3 reasons why this sub-field experienced such enormous progress:
As a result, transformer models have found use in a wide range of tasks, ranging from question answering to video understanding. Naturally, people also looked at applying this architecture to tabular data as well.
Tabular data is textual in its nature. One could argue that numeric columns do not constitute text, but in fact, numerical values are nothing other than sequences of specialized text characters.
This means that tabular data, similar to textual one, could be effectively modeled as a language using Transformers.
Modeling tabular data with Transformers facilitates the solution of other downstream tasks, such as column type prediction or column name recovery. Some of the proposed models could also learn efficient dense representations of the table parts: columns and rows.
Machine learning approaches using tabular data can hardly be called new. For many years, tree-based solutions, such as random forests or gradient boosting algorithms, were used for various classification and regression tasks, such as credit score prediction or fraud detection.
However, the complexity of the aforementioned models lets us learn and understand patterns of data on the observation (row) level.
For a much deeper understanding of data, as relations between columns in a table or between tables in a database, neural networks became very handy.
In the pre-transformer era, deep learning models had a limited capability in processing sequential data like column values or names and, hence, learning contextual information from it.
As a result, they often relied on manually engineered features for model input and processed each column separately.
One prominent example of such a model is Sherlock [1] – the first successful application of deep learning to the task of semantic type classification.
Sherlock is a regular feed-forward neural network that uses a set of statistical features calculated from column values to predict one of 78 semantic column types.
It was the first deep-learning model used to improve column matching in Datuum.
Despite being quite effective for common column types, like name, address, or zip code, Sherlock had a set of limitations, such as high reliance on annotated data, especially in previously unseen domains, or inability to take table context and column name into account.
Obviously, that represented an obstacle to successfully applying Datuum in new verticals fast. As that architecture required additional training on the new data.
Fortunately enough, the Transformer-based model proved to be a perfect alternative to Sherlock.
Our history with the application of transformers to the task of column matching started with extensive research of related work. As a result, we chose TABBIE [2] and TaBERT [3] models for further experiments.
It should be noted that at the beginning of 2022, almost all SOTA approaches to the task of column-type classification were based on Transformers (Fig. 3).
By fine-tuning these models on multiple relevant datasets, we proved the superiority of transformer-based methods over Sherlock. For example, the TABBIE F1-macro score on the TURL dataset [4] was almost 20% higher than Sherlock’s.
In fact, the performance of transformers on classification tasks wasn’t what we were looking for. The property that caught our attention most was the representational ability of these models.
For instance, by training transformer-based models like TABBIE on domain-specific datasets for the task of semantic column type classification, we were able to learn dense representations of the elements of these tables: rows and columns.
This, in turn, allowed us to extend the application of the model to previously unseen domains without the need to annotate new data.
Our column-matching pipeline is specifically designed for a system that has a regular inflow of new data from previously unseen domains (Fig. 4).
Generally, our approach works the following way:
This pipeline has a few important benefits compared to the pipeline based on the Sherlock model, namely:
Combined, these advantages set a new standard in the accuracy and adaptivity of column matching employed in Datuum.
The representational ability of Transformer-based language models is bringing new capabilities to almost all tasks that rely on textual data, and table cross-reference is no exception.
By employing a Transformer model in our column-matching pipeline, we were able to increase the matching accuracy and generalization ability of the system that doesn’t depend on annotated data, which is especially useful in domains where data access is restricted.
https://hci.stanford.edu/\~cagatay/projects/table-understanding/Sherlock-KDD19.pdf
https://arxiv.org/pdf/2105.02584.pdf
https://arxiv.org/pdf/2005.08314.pdf
https://arxiv.org/pdf/2006.14806.pdf
Also published here