Context matters. Let’s take, for instance, the following list of strings:
If we assume all items in the list above have the same semantic value, what is it exactly?
The obvious answer to this question is “geographical place”, right? Let’s look at it in a wider context.
Now we know that we were looking at a list of surnames, not cities. We had no way of telling the difference without widening the context.
In this particular case, a larger sample size would barely help: nearly any city name can also be someone's surname, and many cities are named after people.
In normal situations, our brain doesn’t even notice how tricky this kind of distinction is, because, as humans, we rarely operate without a rich context.
NLP techniques derive some of the required context for token classification from the text surrounding a particular token, and use word embeddings to make a "best guess" at what that word might mean.
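To make the idea concrete, here is a minimal sketch of embedding-based disambiguation. Everything in it is invented for illustration: the tiny 3-dimensional "embeddings", the prototype vectors, and the context words are toy stand-ins for the dense vectors a real system would learn from data.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real systems learn dense vectors from large corpora.
EMBEDDINGS = {
    "mr":      (0.9, 0.1, 0.0),
    "mrs":     (0.9, 0.1, 0.1),
    "visited": (0.1, 0.9, 0.1),
    "flew":    (0.0, 0.8, 0.2),
}

# Prototype vectors for two candidate semantic types (also invented).
TYPE_PROTOTYPES = {
    "surname": (1.0, 0.0, 0.0),
    "city":    (0.0, 1.0, 0.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(context_words):
    """Guess a token's semantic type from its surrounding words."""
    vectors = [EMBEDDINGS[w] for w in context_words if w in EMBEDDINGS]
    if not vectors:
        return "unknown"
    # Average the context vectors, then pick the closest prototype.
    mean = tuple(sum(c) / len(vectors) for c in zip(*vectors))
    return max(TYPE_PROTOTYPES, key=lambda t: cosine(mean, TYPE_PROTOTYPES[t]))

print(classify(["mr", "mrs"]))        # title words   -> "surname"
print(classify(["visited", "flew"]))  # travel verbs  -> "city"
```

The same ambiguous token gets a different label purely because its neighbors pull the averaged context vector toward one prototype or the other, which is the "best guess" mechanism described above in miniature.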
When we analyze columns of data (let’s say, from a CSV file), we don’t have any sentence to derive a context from. Instead, we have other columns and, more often than not, other files.
datuum.ai's technology utilizes all the context available in a data source: it takes the full set of columns, their order, the full set of files/tables, etc. into account in order to determine the semantic type of the data.
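A rule-based sketch can illustrate why the surrounding columns matter, even though the real system uses a trained neural network over many such signals rather than hand-written rules. The sample values, header names, and rules below are all hypothetical, chosen only to show the same values receiving different semantic types depending on their neighbors.

```python
# Hypothetical ambiguous values: strings that could be cities or surnames.
AMBIGUOUS = {"Houston", "Jackson", "Austin"}

def guess_semantic_type(values, other_headers):
    """Toy rule-based guess at a column's semantic type, using the
    headers of sibling columns as context (illustration only)."""
    other = {h.lower() for h in other_headers}
    if not AMBIGUOUS.intersection(values):
        return "unknown"
    # Context clue 1: person-related sibling columns suggest surnames.
    if other & {"first_name", "email", "phone"}:
        return "surname"
    # Context clue 2: geography-related sibling columns suggest cities.
    if other & {"country", "state", "zip_code"}:
        return "city"
    return "ambiguous"

# Identical values, different verdicts, purely from column context:
print(guess_semantic_type({"Houston", "Jackson"}, ["first_name", "email"]))
print(guess_semantic_type({"Houston", "Jackson"}, ["state", "zip_code"]))
```

In practice this decision is learned rather than hard-coded, and the feature set extends beyond sibling headers to column order and the other tables in the source, as described above.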
It took a lot of time and countless iterations to arrive at a neural network architecture and feature set that would let us achieve what we did. There are still plenty of improvements ahead.
And this is just the tip of the iceberg: one rather simple problem among the many we are solving to make our platform work.
Thanks to Dmytro Zhuk, Founder & CTO at @datuum for the story!
Previously published here.