By Jake Neely
At Forge.AI, we capture events from unstructured data and represent them in a manner suitable for machine learning, decision making, and other algorithmic tasks for our customers (for a broad technical overview, see this blog post). To do this, we employ a suite of state-of-the-art machine learning and natural language understanding technologies, many of which are supervised learning systems. For our business to scale aggressively, we need an economically viable way to acquire training data quickly for those supervised learners. We use natural language generation to do just that, supplementing human annotations with annotated synthetic language in an agile fashion.
The cost and time required to obtain quality training datasets for supervised learning are a chief bottleneck in the deployability of machine learning. To minimize those impacts, much of the work in the broader research community has focused on reducing the inherent training data requirements of the model itself. This has proven effective in a variety of models and use cases (e.g., the success of attention networks for document classification 1), but the difficulty of implementation and the fairly narrow focus on benchmarks suggest that a complementary approach, tackling the problem from the direction of the training data itself, could be fruitful.
There is precedent for augmenting supervised learning with artificially expanded training datasets. In image and speech recognition this is called training data augmentation, and it has been very successful 2 3. This is because there are natural metric spaces with transformations, such as image rotation or changes in the speed or tone of speech, to which a model should be invariant. In natural language, however, no such intrinsic space exists. This makes the task of expanding a text training corpus considerably more complex. Below are three examples of research aimed at addressing training data scarcity in natural language:
2. Data counterfeiting 5: delexicalize all annotated slots and values in each annotated utterance, randomly replace the delexicalized slots with similar slots under some clustering, and yield a set of pseudo-annotated utterances. For example, a basic subject-object utterance would be turned into the pseudo-annotation:
“SLOT_SUBJECT called SLOT_OBJECT yesterday afternoon to talk about work”
3. Weak supervision 6: approximately and noisily annotate documents to produce low-quality training data coupled with associated scores, then use those scores within an appropriate loss function to enable “noise-aware” training. For a great state-of-the-art framework for weak supervision, look at data programming 7 8.
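The delexicalization step in data counterfeiting can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the cited work: the slot clusters, the helper names `delexicalize` and `counterfeit`, and the example utterance with "Alice" and "Bob" are all assumptions invented for this sketch.

```python
import random

# Hypothetical clustering: each slot maps to the slots considered "similar" to it.
SLOT_CLUSTERS = {
    "SLOT_SUBJECT": ["SLOT_SUBJECT", "SLOT_AGENT"],
    "SLOT_OBJECT": ["SLOT_OBJECT", "SLOT_RECIPIENT"],
}


def delexicalize(utterance, annotations):
    """Replace each annotated surface value with its slot name.

    `annotations` maps slot name -> surface string found in the utterance.
    Plain string replacement is enough for a toy example; real data would
    need span-based replacement to avoid accidental substring matches.
    """
    for slot, value in annotations.items():
        utterance = utterance.replace(value, slot)
    return utterance


def counterfeit(delexicalized, clusters, rng):
    """Swap each slot token for a randomly chosen slot from its cluster."""
    tokens = delexicalized.split()
    return " ".join(
        rng.choice(clusters[tok]) if tok in clusters else tok for tok in tokens
    )


utterance = "Alice called Bob yesterday afternoon to talk about work"
annotations = {"SLOT_SUBJECT": "Alice", "SLOT_OBJECT": "Bob"}

template = delexicalize(utterance, annotations)
print(template)
# SLOT_SUBJECT called SLOT_OBJECT yesterday afternoon to talk about work

pseudo = counterfeit(template, SLOT_CLUSTERS, random.Random(0))
print(pseudo)
```

Passing a seeded `random.Random` keeps the sketch reproducible. Only the slot tokens change between pseudo-annotations, which is why this family of methods needs no heavy linguistic modeling.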
These approaches inflate existing training corpora in analogy to data augmentation and do not rely on heavy linguistic modeling. They are shown to be effective at domain-constrained use cases ranging from text classification to spoken dialogue systems. For our use case, we need to be able to rapidly expand our event coverage to new domains and believe that deep linguistic modeling without domain-specific rules is a powerful and effective way to do that.
It is easy to fall into a chicken-and-egg dilemma with training data generation: it is difficult to design these systems such that they themselves do not require large quantities of training data. As a result, construction of a necessarily robust and domain-agnostic system can end up going so far as to solve the very problem reserved for the supervised learner. Domain-specific rule sets, grammars, and constraints on the supervised learning models are ways to get around this, but can sometimes result in over-specialized and rigid systems.
A second key challenge to training data generation is controlling the bias induced on a supervised learning system. Approaches to data generation must produce a varied enough corpus that the supervised model isn’t just learning the generative model of training data. It is difficult to define what is effective and sufficient for a training corpus. Later, we will discuss a few paths of inquiry into intrinsic properties of training corpora that may be predictive of training behavior. These properties can be a measure of sufficient variability that is not dependent on the resulting performance of the trained supervised model.
The primary design constraints of our system are that it must be robust and easily extensible to new events and semantic structures; this means that we cannot rely on hand-coded domain-specific rules or language. We are designing the system such that everything should be learned from the exemplar data, auxiliary data sources, and a priori knowledge of English language constructs.
Our system architecture takes some inspiration from traditional natural language generation systems 9. The initial design is split into three key components that we are working to prove out (see Figure 1):
We look forward to sharing changes and improvements in much greater depth as we prove out this technology.
Imagine, as a customer, that you want to start receiving events about lawsuits. You can define a minimal structured representation of a lawsuit with four fields:
We call the collection of these fields the semantic event frame. With this frame in hand, we acquire a few hundred human-annotated examples of these events in unstructured natural language (see Figure 2 for a pictorial of a human annotation). From there, we can combine the frame with the human annotations and any auxiliary data sources, internal or external (e.g., a knowledge base or a repository of filing dates), together with our natural language generation system. This produces a synthetic training corpus much larger than the few hundred examples we sourced from humans. Since our extraction technology is fairly sophisticated, it requires a good deal of training data. By reducing the number of human annotations we have to source, we can cut the training and deployment time for this new lawsuit event down by a sizable factor.
Here is a very simple example of a synthetically generated lawsuit event:
Party X on Tuesday filed a lawsuit against Party Y for $50,000.
And a slightly more complicated one:
Party X is suing Party Y. The lawsuit, which was filed Tuesday, is for total damages of $50,000.
Of course, real data is often more complicated: events are expressed over a span of more than a few sentences and sometimes the document expressing the event has a lot of less semantically relevant text. Our system is being built to produce documents that display such features.
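To make the frame-to-text idea concrete, the two simple lawsuit examples can be reproduced with a deliberately naive template filler. This is not the system described in this post, which avoids hand-coded templates; it is a stand-in that shows why synthetic text arrives already annotated. The field names (`plaintiff`, `defendant`, `filing_date`, `damages`) are guesses inferred from the example sentences, and `LawsuitFrame`, `TEMPLATES`, and `realize` are hypothetical names.

```python
from dataclasses import asdict, dataclass


@dataclass
class LawsuitFrame:
    # Hypothetical field names, inferred from the example sentences only.
    plaintiff: str
    defendant: str
    filing_date: str
    damages: str


# Hand-written templates used purely for illustration.
TEMPLATES = [
    "{plaintiff} on {filing_date} filed a lawsuit against {defendant} for {damages}.",
    "{plaintiff} is suing {defendant}. The lawsuit, which was filed "
    "{filing_date}, is for total damages of {damages}.",
]


def realize(frame, template):
    """Fill a template and return the text plus character spans per field.

    The spans act as annotations: because the generator knows which value
    it placed where, no human labeling is needed for the synthetic text.
    """
    values = asdict(frame)
    text = template.format(**values)
    spans = {}
    for name, value in values.items():
        start = text.find(value)  # first occurrence; fine for this toy case
        spans[name] = (start, start + len(value))
    return text, spans


frame = LawsuitFrame("Party X", "Party Y", "Tuesday", "$50,000")
for template in TEMPLATES:
    text, spans = realize(frame, template)
    print(text)
    print(spans)
```

Each generated sentence comes with the exact location of every frame field, which is the property that lets a synthetic corpus serve directly as supervised training data.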
While our natural language generation system is only in its infancy, we are already getting very promising results. Figure X, below, illustrates some examples of our system producing grammatical variations of a simple product recall event, with a focus on changing tense and voice. This is one of the features that enables more complicated language generation to produce structurally diverse documents.
In Figure Y, we see an example of a much more linguistically complicated event. This generated text contains a good deal of variation, complex content and structure, and domain specificity. We are able to generate semantically relevant but role-implicit sentences and clauses (see for example the last sentence), and are able to describe this event as being related to a game despite that alias not existing in the event frame.
Our road toward highly robust and rich natural language generation is just beginning. Below are some advancements on our research horizon that we intend to pursue.
One of the most challenging aspects of training data generation is ensuring that the correct depth and breadth of semantics are captured. We are currently exploring path traversals on our budding knowledge base (see this blog post) as a means for re-ranking semantic plans. This holds the promise of enabling the generation of larger and more consistently relevant documents.
Currently, domain specificity is learned from classified, non-annotated documents and a small number of human-annotated exemplars. We have been exploring reinforcement learning as a means to further refine domain-specific generated documents. We envision implementing this as a user-in-the-loop mode of interaction to speed up development time.
As an initial proof of concept, we are exploring neural machine translation 10 approaches to translate from English to the target language at the end of the generation pipeline. This has the potential pitfall of amplifying the systematic error introduced in the grammatical models, so we will need to be mindful of that.
Many industrial practitioners, especially those in resource-constrained organizations, have found that many published approaches have limited applicability to the highly noisy and varied real-world natural language training data faced in commercial use cases. To understand how to collect and produce training data reliably and sustainably, we must understand the intrinsic properties of training corpora that predict supervised learning behavior from less-than-optimal training data.
We can think of these properties as fitting into three classes: lexical, syntactic, and semantic.
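As one illustration of the lexical class, the sketch below computes two simple corpus-level diversity measures. The post does not say which properties are being studied, so the choice of type-token ratio and distinct-bigram ratio is an assumption, as are the function names and the regex tokenizer. Syntactic and semantic properties would need a parser or a semantic model and are not covered here.

```python
import re


def tokenize(text):
    """Lowercase and split on word characters (a crude, assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(corpus):
    """Unique tokens divided by total tokens across the whole corpus."""
    toks = [tok for doc in corpus for tok in tokenize(doc)]
    return len(set(toks)) / len(toks) if toks else 0.0


def distinct_bigram_ratio(corpus):
    """Unique bigrams divided by total bigrams; bigrams never cross documents."""
    bigrams = []
    for doc in corpus:
        toks = tokenize(doc)
        bigrams.extend(zip(toks, toks[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0


repetitive = ["Party X sued Party Y."] * 3
varied = [
    "Party X sued Party Y.",
    "Party X is suing Party Y.",
    "A lawsuit was filed against Party Y by Party X.",
]

for name, corpus in (("repetitive", repetitive), ("varied", varied)):
    print(name, type_token_ratio(corpus), distinct_bigram_ratio(corpus))
```

Both ratios come out higher for the varied corpus than for the repetitive one. Measures like these can be computed before any model is trained, which is what makes them candidates for a variability check that does not depend on downstream performance.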
We have shown how we are using natural language generation to collapse the time and cost of training usually associated with supervised learning, which is allowing us to accelerate our event coverage and scale deeply and horizontally. We are excited to share our progress as we continue to build this foundational technology for language understanding.
Note: This post was originally published on our blog: https://www.forge.ai/blog/how-we-are-using-natural-language-generation-to-scale-at-forgeai