Natural Language Processing with Python: A Detailed Overview

by @princekumar036, February 21st, 2022

NLP is an emerging field at the intersection of linguistics, computer science and artificial intelligence that enables computers to understand and generate human language. Python is the most preferred programming language for NLP, along with libraries like NLTK, spaCy, CoreNLP, TextBlob and Gensim.



Artificial Intelligence is currently one of the most sought-after technologies, and so is NLP, a subset of AI. Recent years have been marked by tremendous research and development in AI. NLP is an important technology, and we use it every day. This article will give us some perspective on this emerging technology. We will look at its history, some of its typical applications, and how and where we can learn NLP to build real-life projects. We will also briefly look at Python, arguably the most favorable programming language for NLP, and some of its essential NLP libraries. Along the way, some learning resources are also recommended.

What is NLP?

First things first: what exactly is NLP? We all know that computers only understand 0s and 1s, but we humans communicate in thousands of languages; 7,139, to be exact. This creates a deadlock in our interaction with computers. The traditional way around this deadlock was methodical interaction with computers, clicking through various options to perform the desired task. But nowadays we can simply talk to our computers through Cortana or Siri in our own language, and they can carry out our tasks and even talk back to us in the same language. The underlying technology that makes this happen is Natural Language Processing, or NLP.

Popular voice assistants (Image credit: Michael Heller)


NLP is an emerging interdisciplinary field of linguistics, computer science and artificial intelligence that enables computers to understand and generate human language. This is done by feeding the computer a large amount of data, called a corpus in the case of language data, which it analyses to derive and generate meaning.

History of NLP

Although NLP has become very popular in recent years, its history goes back to the 1950s. The genesis of all AI technologies lies with the computer scientist Alan M. Turing and his seminal paper ‘Computing Machinery and Intelligence’ (1950). The central question of his article was ‘Can machines think?’. In the paper, Turing proposed a benchmark to test the intelligence of computers in comparison to humans. He called it the Imitation Game, which later became known as the Turing Test. Among its criteria was the machine’s ability to ‘understand and speak’ natural languages. The Georgetown experiment of 1954 was another early development in machine translation: it demonstrated a fully automatic translation from Russian to English. The field of NLP, along with AI, has undergone continuous development since then. This growth can be divided into three phases based on the underlying method or approach used to solve NLP problems, described in the next section.

NLP Methods

Rule-based NLP (1950s — 1990s)

The earliest methodology for natural language processing by computers, also called Symbolic NLP, was based on a pre-defined set of rules. A collection of hand-coded rules was fed into the computer, which yielded results based on them.


The early research in NLP focused primarily on machine translation. Rule-based machine translation (RBMT) required a thorough linguistic description of the source and the target languages. The basic approach involved two steps: 1) finding a structural equivalent of the source sentence in the target language, using a parser and an analyzer for the source language and a generator for the target language, and 2) using a bilingual dictionary for word-to-word translation to finally produce the output sentence.
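To make the dictionary-lookup step concrete, here is a toy sketch in Python. The tiny Spanish-to-English vocabulary and the single noun-adjective reordering rule are invented purely for illustration; real RBMT systems encode thousands of hand-written grammar rules.

    # A toy illustration of rule-based, dictionary-driven translation.
    # The bilingual dictionary and the reordering rule are invented for
    # illustration only.
    bilingual_dict = {
        "yo": "I",
        "veo": "see",
        "el": "the",
        "gato": "cat",
        "negro": "black",
    }

    def translate(sentence):
        words = sentence.lower().split()
        # Rule: Spanish places adjectives after nouns, so swap this
        # noun-adjective pair before looking the words up.
        if "gato" in words and "negro" in words:
            i, j = words.index("gato"), words.index("negro")
            words[i], words[j] = words[j], words[i]
        # Word-for-word lookup; unknown words pass through unchanged.
        return " ".join(bilingual_dict.get(w, w) for w in words)

    print(translate("yo veo el gato negro"))  # -> I see the black cat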



Architecture of a Rule-based machine translation (Image credit: Budditha Hettige)


Given that human language is vast and ambiguous, infinitely many such rules are possible. Obviously, hand-coding such a large number of rules is not feasible. Thus, these systems were very narrow and could only produce results in certain given scenarios. For example, the much-celebrated Georgetown experiment could only translate some 60-odd sentences from Russian to English.

Statistical NLP (1990s — 2010s)

With the advent of more powerful computers with higher processing speeds, it became possible to process large amounts of data. Statistical NLP took advantage of this, and new algorithms based on machine learning came into being. These algorithms are based on statistical models which make soft, probabilistic decisions to yield outputs. Instead of hand-coding rules as in Rule-based NLP, these systems automatically learn such rules by analyzing large amounts of data, such as real-world parallel corpora. A Statistical NLP system does not generate one final output; instead, it outputs several possible answers with relative probabilities. The drawback, however, was that these algorithms were challenging to build: they required a complex pipeline of separate sub-tasks like tokenization, parts-of-speech tagging, word sense disambiguation, and many more to finally produce the output.
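The core idea of making soft, probabilistic decisions from counts can be sketched in a few lines of plain Python. The three-sentence "corpus" below is invented for illustration; a real statistical system learns from millions of sentences.

    from collections import Counter, defaultdict

    # A tiny invented corpus.
    corpus = [
        "the cat sat on the mat",
        "the cat ate the fish",
        "the dog sat on the rug",
    ]

    # Count bigrams (pairs of adjacent words).
    bigram_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            bigram_counts[w1][w2] += 1

    def next_word_probabilities(word):
        """Estimate P(next word | word) from relative bigram frequencies."""
        counts = bigram_counts[word]
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    # The model returns several candidates with probabilities rather than
    # a single hard-coded answer.
    print(next_word_probabilities("cat"))  # {'sat': 0.5, 'ate': 0.5}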


A basic Statistical machine translation pipeline (Image credit: Karan Singla)

Neural NLP (2010s — present)

As deep learning became popular during the 2010s, it was applied to NLP as well, and deep neural network-style machine learning methods became widespread. This approach, too, uses statistical models to predict the likelihood of an output. However, unlike Statistical NLP, it processes the entire sentence in a single integrated model, removing the need for the complex pipeline of intermediate subtasks used in statistical models. In this approach, the system uses an artificial neural network, which, in theory, tries to mimic the neural network of the human brain.


A simple neural network (Image credit: Wikipedia)


An artificial neural network consists of three kinds of layers: input, hidden, and output. The input layer receives the input and passes it to the hidden layer, where the computations on the data are performed. The result is then transferred to the output layer. Each connection between neurons carries a value called a weight. The weights are initially set randomly and are adjusted as the network learns. These weights are crucial in determining the probability of an output.
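A minimal forward pass through such a three-layer network can be sketched with NumPy. The layer sizes, the input vector and the random initial weights below are arbitrary; in a real system the weights would be adjusted by training on data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Arbitrary sizes: 4 input features, 5 hidden neurons, 2 output classes.
    W_hidden = rng.normal(size=(4, 5))  # weights: input layer -> hidden layer
    W_output = rng.normal(size=(5, 2))  # weights: hidden layer -> output layer

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def forward(x):
        hidden = sigmoid(x @ W_hidden)      # computations in the hidden layer
        scores = hidden @ W_output          # raw scores in the output layer
        return np.exp(scores) / np.exp(scores).sum()  # softmax -> probabilities

    x = np.array([0.5, -1.2, 3.0, 0.7])     # a made-up input vector
    print(forward(x))                       # probability of each output class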

Common NLP Tasks

Here is a non-exhaustive list of some of the most common tasks in natural language processing. Note that some of these tasks may not be ends in themselves but serve as subtasks in solving other tasks that have real-world applications. A short NLTK sketch below the list illustrates a few of them.

  1. Tokenisation — separate a continuous text into words
  2. Parts-of-speech tagging — determine the parts of speech of each word in a sentence
  3. Stopword removal — filter out high-frequency words like to, at, the, for, etc
  4. Stemming — remove inflections (prefix and suffix) (e.g., driving → driv)
  5. Lemmatization — remove inflections and return base form of the word (e.g., driving → drive)
  6. Coreference resolution — determine which words in a sentence/text refer to the same entity
  7. Parsing — determine and visualize the parse tree of a sentence
  8. Word sense disambiguation — select contextual meaning of a polysemic word
  9. Named entity recognition — determine the proper nouns in a sentence/text
  10. Relationship extraction — identify relationships among named entities
  11. Optical character recognition (OCR) — determine the text printed in an image
  12. Speech Recognition — convert speech into text
  13. Speech Segmentation — separate a speech into words
  14. Text-to-speech — convert text to speech
  15. Automatic summarisation — produce a summary of a larger text
  16. Grammatical error correction — detect and correct grammatical errors in a text
  17. Machine translation — automatically translate a text from one language to another
  18. Natural language understanding (NLU) — convert text into a machine-readable representation of meaning
  19. Natural language generation (NLG) — make machine produce natural language


SYSTRAN is one of the oldest machine translation companies
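To make a few of the tasks above concrete, here is a minimal NLTK sketch covering tokenisation, POS tagging, stopword removal, stemming and lemmatization. It assumes the listed NLTK data packages can be downloaded; exact outputs and package names may vary between NLTK versions.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # One-time download of the required data packages.
    for pkg in ("punkt", "averaged_perceptron_tagger", "stopwords", "wordnet"):
        nltk.download(pkg, quiet=True)

    text = "The children were driving their bikes to the old schools"

    tokens = nltk.word_tokenize(text)                  # tokenisation
    tagged = nltk.pos_tag(tokens)                      # parts-of-speech tagging
    content = [w for w in tokens
               if w.lower() not in stopwords.words("english")]   # stopword removal
    stems = [PorterStemmer().stem(w) for w in content]            # stemming
    lemmas = [WordNetLemmatizer().lemmatize(w) for w in content]  # lemmatization

    print(tagged)
    print(stems)
    print(lemmas)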


Python for NLP

Python is a preferred programming language for NLP. As of February 2022, it is the most popular programming language. Python’s ubiquitous nature and its application to a wide array of fields make it so popular.


While programming languages like Java and R are also used for NLP, Python is a clear winner. Python is easy to learn and understand because of its transparent and straightforward syntax. Python offers perhaps the largest community of developers, which can be really helpful in case the code needs some debugging. In addition, Python seamlessly integrates with other programming languages and tools. Most importantly, Python is backed by an extensive collection of libraries that enables developers to quickly solve NLP tasks.

Resources

  • Python for Everybody: Start with this five-part specialization program on Coursera. It will provide a complete overview of Python programming.
  • Automate the Boring Stuff with Python: Read this free online book by Al Sweigart for step-by-step instructions and guided projects.
  • Python tutorial: the official Python tutorial and documentation

NLTK

NLTK — the Natural Language Toolkit — is a suite of open-source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing.


A parse tree generated with Python library NLTK (Image credit: nltk.org)


Natural Language Toolkit, or NLTK, is the most popular library for NLP. It has a massive, active community, with over 10.4k stars and 2.5k forks on its GitHub repository. It was developed at the University of Pennsylvania by Steven Bird and Edward Loper and was released in 2001. NLTK is freely available for Windows, Mac OS X, and Linux. It has built-in support for more than 100 corpora and trained models. NLTK comes with a free ebook written by its creators, a comprehensive guide to writing Python programs and working with NLTK. They also have an active discussion forum on Google Groups.
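For example, one of the bundled corpora can be downloaded and queried in a few lines; a minimal sketch, assuming an internet connection for the one-time download:

    import nltk

    # Fetch one of the 100+ bundled corpora: the Brown corpus of American English.
    nltk.download("brown", quiet=True)
    from nltk.corpus import brown

    print(len(brown.words()))      # total number of word tokens in the corpus
    print(brown.words()[:10])      # the first few tokens
    print(brown.categories()[:5])  # genres the corpus is organised into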

Resources:

  • Documentation: Official NLTK documentation and the free online NLTK book

spaCy

Industrial-strength Natural Language Processing (NLP) in Python.


spaCy’s components (Image credit: spacy.io)

spaCy is relatively young but very hot right now. Its GitHub repository has more than 22.4k stars and 3.7k forks, much higher than NLTK's. It is written in Python and Cython, making it fast and efficient at handling large corpora. It is an industry-ready library designed for production use.

Some features of spaCy are listed below, followed by a short usage sketch:

  • It provides support for linguistically-motivated tokenization in more than 60 languages.
  • It has 64 pre-trained pipelines in 19 different languages.
  • It provides pretrained transformers like BERT.
  • It provides functionalities for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more.
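A minimal usage sketch, assuming the small English pipeline en_core_web_sm has been installed with python -m spacy download en_core_web_sm:

    import spacy

    # Load a pretrained English pipeline (tokenizer, tagger, parser, NER, ...).
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

    # Token-level annotations: text, part of speech, lemma, dependency relation.
    for token in doc:
        print(token.text, token.pos_, token.lemma_, token.dep_)

    # Named entities recognised in the text.
    for ent in doc.ents:
        print(ent.text, ent.label_)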

Resources:

  • Documentation: Official spaCy documentation and the free online course ‘Advanced NLP with spaCy’

CoreNLP

A Java suite of core NLP tools.

Working of CoreNLP pipeline (Image credit: CoreNLP)


CoreNLP was developed at Stanford University and is written in Java, but it comes with wrappers for other languages like Python, R, JavaScript, etc., so it can be used from most programming languages. It is a one-stop destination for core NLP functionality: linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian and Spanish. It has 8.3k stars and 2.6k forks on its GitHub repository.
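One common way to call CoreNLP from Python is through the client that ships with Stanford's Stanza package. The sketch below is only a rough outline: it assumes Stanza is installed, the CoreNLP distribution has been downloaded, and the CORENLP_HOME environment variable points to it.

    from stanza.server import CoreNLPClient

    text = "Stanford University is located in California."

    # Start a CoreNLP server in the background and annotate the text.
    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "ner"],
                       timeout=30000, memory="4G") as client:
        ann = client.annotate(text)
        for sentence in ann.sentence:
            for token in sentence.token:
                print(token.word, token.pos, token.ner)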

Resources:

  • Documentation: Official CoreNLP documentation and usage guides

TextBlob

Simplified Text Processing


TextBlob is a Python library for processing textual data. It provides a simple API for the most common NLP tasks like POS tagging, tokenization, n-grams, sentiment analysis, etc. It is beginner-friendly, quick to get started with, and built on top of NLTK and Pattern. It has 8k stars and 1.1k forks on its GitHub repository at the time of writing this article.
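A minimal sketch; it assumes TextBlob is installed and its corpora have been fetched with python -m textblob.download_corpora.

    from textblob import TextBlob

    blob = TextBlob("TextBlob makes common NLP tasks painless. I really enjoy using it!")

    print(blob.words)          # tokenisation
    print(blob.tags)           # parts-of-speech tags
    print(blob.noun_phrases)   # noun-phrase extraction
    print(blob.sentiment)      # polarity and subjectivity scores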

Resources:

  • Documentation: Official TextBlob documentation and quick start guide

Gensim

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.


Gensim was created by Radim Řehůřek in 2009. It is implemented in Python and Cython, making it incredibly fast. Plus, all its algorithms are memory-independent, i.e. they can process inputs larger than the available RAM. It is mainly used to identify semantic similarity between documents through vector space and topic modelling. It supports algorithms like Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) and word2vec deep learning. It works on vast data collections and provides clear insight into their topical structure. Gensim’s GitHub repository has 12.9k stars and 4.2k forks.
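A minimal sketch of document similarity with Gensim's vector-space tools; the three pre-tokenised "documents" are invented for illustration.

    from gensim import corpora, models, similarities

    # A tiny invented corpus of pre-tokenised documents.
    documents = [
        ["human", "machine", "interface", "for", "computer", "applications"],
        ["graph", "of", "trees", "and", "minors"],
        ["user", "interface", "management", "system"],
    ]

    dictionary = corpora.Dictionary(documents)            # word <-> id mapping
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

    # Project the bag-of-words vectors into a 2-dimensional latent (LSI) space.
    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
    index = similarities.MatrixSimilarity(lsi[bow_corpus])

    # Rank all documents by similarity to a new query.
    query = dictionary.doc2bow(["computer", "interface"])
    print(list(index[lsi[query]]))   # one similarity score per document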

Resources:

  • Documentation: Official Gensim documentation and tutorials

Conclusion

Natural language processing is a sub-field of Artificial Intelligence under active research and development, and we can see its practical applications all around us: automatic captioning on YouTube videos, Chrome automatically translating webpages in foreign languages for us, assistive writing with Grammarly, our iPhone’s keyboard predicting the next word for us, and so on. The possibilities are limitless. Natural language processing is indispensable for artificial intelligence and for our future technologies.

References

  1. Bird, Steven, et al. Natural Language Processing with Python. 1st ed, O’Reilly, 2009.
  2. Budditha Hettige. A Computational Grammar of Sinhala for English-Sinhala Machine Translation. 2011. DOI.org (Datacite), https://doi.org/10.13140/RG.2.1.2330.6968.
  3. Heller, Michael. ‘Study Claims Siri and Google Assistant Are Equal’. Phone Arena, https://www.phonearena.com/news/Study-claims-Siri-and-Google-Assistant-are-equal_id115667. Accessed 8 Feb. 2022.
  4. Singla, Karan. Methods for Leveraging Lexical Information in SMT. 2015. DOI.org (Datacite), https://doi.org/10.13140/RG.2.1.2138.7367.
  5. Turing, A. M. ‘Computing Machinery and Intelligence’. Mind, vol. LIX, no. 236, Oct. 1950, pp. 433–60. DOI.org (Crossref), https://doi.org/10.1093/mind/LIX.236.433.