# On real world data, Deep Learning performance can be shallow

I often get pitched with a superior **deep learning** solution for Natural Language Understanding (**NLU**). The plan appears prudent. After all, deep learning is the disruptive new force in AI. A better NLU AI entices many useful advancements, ranging from smarter chat bots and virtual assistants to news categorization, with an ultimate promise of better language comprehension.

## State of the Practice

Let's assume this superior deep learning (DL) "product" is called **"(dot)AI"**. Their pitch deck will invariably have a bar chart that looks something like this — the claim being that the new DL topic classifier/tagger of (dot)AI is better than state of the art methods.

*In many industries, production grade ML classifiers are expected to exceed 90% accuracy for quality assurance and a decent user experience. This is the expected tolerance level for news categorization or conversational bots.*

The chart presents an interesting proposition, even though the performance is only slightly superior to the state of the art. In any product, what constitutes "good enough" depends on the tolerance for error specific to that industry. For example, a model's best accuracy score might be reasonable for video recommenders or image transcription, yet outside the tolerance limits for news categorization.

You don't have to be a sceptic to ask the question: in the realm of natural language text classification, do DL techniques significantly outperform shallow methods, e.g. TF-IDF or bag of words (BoW) based approaches?

The assumption often is a confident Yes — that DL obliterates shallow methods in NLU. But does it? Three recent trends underpin this illusion:

1. In industry AI conferences, deep learning talks overwhelmingly relate to image/audio/video data, with almost zero talks on production level natural language tasks. Why?
2. The media and others continuously hype deep learning as a silver bullet, without perusing the actual results in papers. This can lead to confusion for practitioners trying to evaluate DL's utility in their domain.
3. A lot of results just squeeze a few percent of performance out of some artificial benchmark, whereas robustness and applicability matter more.

While DL has taken the computing world by storm, its impact on certain fundamental NLU tasks remains uncertain and its performance is not always superior. To understand why, let me first describe the NLU task, then the state of the art models trying to solve it, and then how DL underperforms.

## A Fundamental NLU Task

A critical task in natural language understanding is to comprehend the topic of a sentence. The topic could be a tag (such as politics, music, gaming, immigration, or adventure-sports), but it usually isn't a named entity based task, i.e. merely extracting something such as a person's name or a location.

A topic tagger will attempt to tag the first WFTV article as "sports" and the second WFTV article as "animals", although both mention `Tiger`. This can get complicated quickly due to things like word sense disambiguation, as shown in the example on the right.

This type of software is called a topic tagger, and its utility cannot be overstated. Topics are key in extracting intent and formulating automated responses. Consider chat bots — the most common problem bot companies face is the lack of any automated way to capture what their users are messaging about. The only way to estimate user intent from bot messages is either via human eye-balling or whatever matches pre-built regex scripts. Both methods are suboptimal and cannot cover a larger topic space.
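To make that contrast concrete, here is a minimal sketch of the two approaches side by side: a pre-built regex matcher and a shallow TF-IDF based tagger of the kind compared later in this post. The patterns, example messages, and labels are all made up for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Approach 1: pre-built regex scripts -- only messages that happen to match
# a hand-written pattern get an intent; everything else falls through.
INTENT_PATTERNS = {
    "sports":  re.compile(r"\b(score|game|match|tournament)\b", re.I),
    "animals": re.compile(r"\b(zoo|wildlife|endangered)\b", re.I),
}

def regex_intent(message: str) -> str:
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(message):
            return intent
    return "unknown"

# Approach 2: a shallow tagger -- TF-IDF bag-of-words features plus a linear model.
# In practice this would be trained on thousands of labeled examples per topic.
train_texts = [
    "Tiger Woods wins the tournament with a final round 65",
    "A rescued tiger cub was welcomed at the big cat sanctuary",
]
train_labels = ["sports", "animals"]

tagger = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
tagger.fit(train_texts, train_labels)

print(regex_intent("Did you catch the match last night?"))      # -> sports
print(regex_intent("The sanctuary took in another tiger cub"))  # -> unknown, regex misses it
print(tagger.predict(["The sanctuary took in another tiger cub"])[0])  # likely "animals"
```

The point is not that two training sentences make a good classifier; it is that the shallow tagger generalizes from vocabulary, while the regex approach only covers whatever patterns someone remembered to write.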
In fact, topic tagging at various semantic resolutions is a gateway solution to NLU for two reasons: (1) text classification into topics is a precursor to most higher level NLU tasks such as sentiment detection, discourse analysis, episodic memory, and even question answering (the quintessential NLU task); (2) NLP pipelines are considerably prone to error propagation, i.e. an error in topic classification can jeopardize downstream analysis such as episodic memory modeling, discourse analysis, or even sentiment analysis. Thus, finding the right topic is crucial for NLU.

*What good is the sentiment of a piece of news unless we know what exactly this sentiment is about? Incorrect topic tagging can adversely affect sentiment utility.*

In reality, topic classification is a hard problem, one that has at times been underestimated and overlooked by the AI community. Comprehending the topic is the first step in taking meaningful action.

## State of the Art

Over the years, several technologies have tried to tackle the topic classification problem. There was LSA, LDA, and others like PLSI, Explicit Semantic Analysis, etc. Half of these are either not production grade or don't scale well with messy real-world data. The other half has poor interpretability or needs considerable post-processing of whatever it outputs.

**New world models:** Today, two main solutions appear overwhelmingly in topic classification performance comparisons. (1) First is the very deep Convolutional Neural Net [DCNN] model from 2016, which proposes the use of a very deep neural network architecture, a "state of the art in computer vision". (2) Second is the [FastText](https://arxiv.org/abs/1607.01759) approach (also 2016). Its performance is almost as good as DCNN, but it is orders of magnitude faster to train and evaluate than DCNN. Some call FastText the Tesla of NLP — whatever that means.

Both methods are elegant in their own way. The big difference is that DCNN is a 29-layer deep neural net, whereas FastText does not fall in the "stereotype" of fancy deep neural nets: it is a shallow network that uses word embeddings to solve the tag prediction task. FastText extends the basic Word2Vec idea to predict a topic label, instead of predicting the middle/missing word (which, recall, is the original Word2Vec task).

*This [link] visualizes Word2Vec word embeddings.*

**Old world models:** Holding the ground for older/naive models are n-gram/bag of words based models and TF-IDF, which still find value in large-scale implementations.

**Benchmark data:** A final component in an examination of the state of the art is the datasets on which these models are tested. Benchmark datasets are key for reproducibility and comparative analysis. In topic classification tasks, three popular datasets are [AG news](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), Sogou news and Yahoo! answers. They differ in corpus size and in the number of topics (classes) present in the data.

*The three datasets marked in red rectangles are specifically used for topic classification. Shown with an arrow is one instance in the dataset; the task is to predict the label by analyzing the sample.*
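To make the comparison tangible, here is roughly what training and evaluating the FastText classifier on one of these datasets looks like, using the open-source `fasttext` Python bindings. The file names and hyperparameters below are illustrative assumptions, not the exact configuration benchmarked in the paper.

```python
import fasttext  # pip install fasttext

# FastText expects one training example per line, with the topic prefixed by
# "__label__" (the library's default label prefix), e.g.:
#   __label__sports Tiger Woods wins the tournament with a final round 65
#   __label__animals A rescued tiger cub was welcomed at the sanctuary
TRAIN_FILE = "ag_news.train"  # hypothetical preprocessed train/test files
TEST_FILE = "ag_news.test"

model = fasttext.train_supervised(
    input=TRAIN_FILE,
    epoch=5,          # a handful of passes is typically enough
    lr=0.5,
    wordNgrams=2,     # add bigram features on top of unigrams
    dim=100,          # dimensionality of the shallow word embeddings
)

# test() returns (number of samples, precision@1, recall@1); with one label
# per example, precision@1 is the classification accuracy.
n, p_at_1, r_at_1 = model.test(TEST_FILE)
print(f"accuracy on {n} test samples: {p_at_1:.3f}")

# Tag a single piece of text.
labels, probs = model.predict("Stocks rally as tech earnings beat expectations")
print(labels[0], probs[0])  # e.g. __label__business 0.97 (illustrative output)
```

The shallow "old world" baselines in the tables below (bag of words, n-grams, TF-IDF) can be reproduced with a similarly small amount of code, which is part of why they remain attractive in production.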
## Deep (Learning) Impact?

First, let's look at the results from the DCNN paper and compare them with the naive models. The numbers below indicate error rates when running a particular configuration of a model on the topic classification datasets.

*This is Table 4 from the [DCNN] paper. The topic classification datasets (mentioned above) are marked with rectangles; the corresponding comparable error values are marked with red ellipses.*

Four main observations here:

1. In 2/3 topic classification datasets (AG + Sogou), the naive/shallow methods perform better than deep learning.
2. In the 3rd dataset (Yah. Ans.), DL reduces the error by just **~1.63** points.
3. The accuracy of the best model on the **Yah. Ans.** dataset is still only **~73%**, which is significantly below the tolerance level of most production-quality systems.

An important thing to note: all 3 datasets have a topic space of fewer than **11** topics, which is still somewhat synthetic. In real world natural language data (news streams or conversational messaging), topic spaces could easily exceed 20 or 25 different topics (or intents). This is key, because the next point hints that topic space cardinality can have a huge impact on accuracy.

4. **Accuracy degradation:** notice that when the topic space grows from 4 to 10 (AG vs. Yah. Ans.), the error skyrockets from _7.64_ to _28.26_ with the same model. While it's possible this is caused by spurious factors, such as imbalanced datasets, there is a good chance that a four-fold increase in error is due to the complexities involved in generalizing over larger topic spaces.

Finally, let's look at FastText's performance on these datasets and compare it to the DL (DCNN) and naive approaches:

*This is Table 1 from the FastText paper, showing accuracy values on the three topic classification datasets and comparing FastText with naive methods.*

Three further observations with comparative results:

5. Once again, in 2/3 datasets FastText performs better than the deep learning model. On the **Yah. Ans.** dataset, FastText is inferior by only **~1.1** points.
6. The DCNN deep learning method actually performs worse than the naive models in the first two datasets (AG and Sogou).
7. And again, in 2/3 datasets, the naive models' performance is comparable to or better than FastText's.

In addition to these (stunning) results, recall that non-DL models are usually orders of magnitude faster to train and much, much more interpretable.

## Why is this Unreasonable?

Well, it looks like when it comes to topic classifiers, the old world models (naive/shallower) aren't ready to give up their throne just yet!

This ineffectiveness of deep learning is somewhat unexpected. It is counterintuitive: given that the new world DL models were produced at a company with tons of data, performance should be significantly better. However, we observe little difference in accuracy. Naive/older models are better than or comparable to DL models when classifying text into topics.

*From "What Data Scientists should know about Deep Learning": deep learning performance beats older algorithms given sufficient data.*

Deep learning might have deep problems in classifying language, but the objective here isn't to disparage it or to have anything to do with a deep learning conspiracy. I think its impact is clear and promising. In computer vision, speech recognition and game playing, DNNs have taken us where we have never been before.

But the reality is that your mileage may vary when using deep learning for a basic natural language task like text classification. Why this gap in performance between image/video/audio and language data? Perhaps it has to do with the patterns of biological signal processing required to "perceive" the former versus the patterns of cultural context required to "comprehend" the latter?
In any case, there is still much we have to learn about the intricacies of learning itself, especially with different forms of multimedia.