Preprocessing a text corpus is a mandatory step for almost any NLP application. Some standard steps apply to most applications, while others require customized preprocessing. In this blog, I will discuss a few data transformation steps that I personally use while working with textual data. We will also talk about the limitations of such transformations.
P.S. These preprocessing steps are not mandatory for all applications; they are simply the ones I have found useful over time.
Below are a few transformations that I usually apply, depending on how the problem statement is defined.
Lowercasing: Computers are not humans: to a machine, COMPUTER, computer and Computer do not mean the same thing unless it is trained to treat them that way. A simple word count will give a count of 1 for each of the three variants, which is wrong when dealing with semantics. Training a system to learn this equivalence requires a large dataset where the context of such instances remains the same. A much simpler and effective technique is to normalize everything to a single casing style. In practice, people usually lowercase the words.
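The effect on word counts can be sketched in a few lines. This is a minimal illustration, not a full pipeline:

```python
from collections import Counter

# Without normalization, casing variants are counted as distinct words.
words = ["COMPUTER", "computer", "Computer"]
raw_counts = Counter(words)
lower_counts = Counter(w.lower() for w in words)

print(raw_counts)    # three separate keys, each with count 1
print(lower_counts)  # a single key "computer" with count 3
```

After lowercasing, all three variants collapse into one vocabulary entry.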
Punctuation Removal: Punctuation can be tricky. In general, boundary punctuation can be removed without any issues, but the same does not hold when punctuation occurs within a word. Such cases do not work well with tokenizers. You also lose word structure, e.g. don’t -> dont, after which your contraction-expansion function can no longer match the word.
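One safe way to do this is to strip punctuation only at token boundaries, leaving intra-word marks intact. A minimal sketch using the standard library:

```python
import string

def strip_boundary_punct(token):
    # str.strip removes the given characters only from the ends of the
    # string, so the apostrophe inside "don't" is left untouched.
    return token.strip(string.punctuation)

print(strip_boundary_punct('"Hello!"'))  # Hello
print(strip_boundary_punct("don't"))     # don't
```

Compare this with a blanket `str.translate` removal, which would turn don’t into dont and break downstream contraction expansion.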
Adding extra space: This is a precautionary step: add a white space after the end of a lexical unit so that tokenization works properly. Existing sentence tokenizers, despite high accuracy, fail to capture some edge cases. For example, This is a good book.I like reading it. will be segmented as just one sentence by the NLTK sentence tokenizer. Such cases should be resolved to This is a good book. I like reading it. before sending the text to NLTK.
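A simple heuristic for this repair is a regex that inserts a space after a period sandwiched between a lowercase letter and a capital letter. This is a sketch with obvious limitations (it will not touch abbreviations like U.S.A., but it also will not catch every missing space):

```python
import re

def fix_missing_space(text):
    # Insert a space after "." only when it follows a lowercase letter
    # and precedes a capital, which usually indicates a sentence break.
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)

print(fix_missing_space("This is a good book.I like reading it."))
# This is a good book. I like reading it.
```

After this repair, the NLTK sentence tokenizer segments the text into two sentences as expected.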
Stripping ends: This solves the same kind of problem that lowercasing solves. Think of a scenario where there are consecutive spaces or stray punctuation between lexical units such as words, sentences or paragraphs. While doing word/sentence/paragraph segmentation, we do not want our system to differentiate between computer and computer<space>.
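In Python this is typically a combination of trimming the ends and collapsing internal whitespace runs. A minimal sketch:

```python
import re

def normalize_space(text):
    # Collapse any run of whitespace to a single space, then trim the
    # ends, so "computer " and "computer" become identical tokens.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_space("  computer   science  "))  # computer science
```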
Expanding Contractions: Expanding contractions simply means normalizing don’t -> do not, doesn’t -> does not, etc. But this is not a trivial task, as there are ambiguous cases that need to be handled, e.g. he’ll -> he shall / he will. Wikipedia has an exhaustive list of common English-language contractions; you can find it here. A simple resolution is to greedily pick one expansion from the Wikipedia list for every contraction you encounter.
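The greedy approach can be implemented as a dictionary lookup driven by a regex. The mapping below is a tiny illustrative subset, not the full Wikipedia list, and ambiguous forms like he’ll are greedily mapped to one expansion:

```python
import re

# Illustrative subset; ambiguous contractions get one fixed expansion.
CONTRACTIONS = {
    "don't": "do not",
    "doesn't": "does not",
    "can't": "cannot",
    "he'll": "he will",  # could also mean "he shall" - greedy choice
}

_pattern = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    # Look up each matched contraction (lowercased) in the table.
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("we don't know if he'll come"))
# we do not know if he will come
```

Note that this sketch drops the original capitalization of the contraction; a production version would need to restore it.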
Stemming: Stemming is the process of converting a word to its root form by removing boundary morphemes. In my personal experience, stemming should be avoided when both the input and output of the model are natural language, for example in abstractive summarization systems: you would not want to see stemmed words in the summary. Stemming plays a good role in classification tasks, because the vector representation you choose will not create redundant dimensions for unstemmed variants of the same word.
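To make the idea of stripping boundary morphemes concrete, here is a deliberately naive suffix-stripping sketch. A real system would use something like NLTK's PorterStemmer instead; this toy only handles a handful of suffixes:

```python
def naive_stem(word):
    # Toy stemmer: strip a few common suffixes, keeping at least a
    # three-letter stem. Real stemmers apply ordered rule sets.
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "plays", "player"]:
    print(naive_stem(w))  # all four map to "play"
```

In a bag-of-words classifier, these four surface forms would otherwise occupy four separate dimensions.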
Corpus specific Stop-words: Stop-words are not always generic. Apart from common language-specific terms, there are corpus-specific repetitive words that may not be useful in the analysis and act as noise. The frequency threshold for treating such words as stop-words is usually chosen by analyzing the corpus and finding the percentage occurrence of each word. I will not say we should always remove language-specific stop-words either, because there are use cases like grammar correction where your system has to appropriately add articles to a sentence; there you cannot put a, an, the in the stop list.
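The document-frequency analysis can be sketched as follows. The threshold value here is an arbitrary assumption for illustration; in practice it is tuned per corpus:

```python
from collections import Counter

def corpus_stopwords(docs, threshold=0.5):
    # Flag words that appear in more than `threshold` fraction of
    # documents as corpus-specific stop-word candidates.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    return {w for w, c in doc_freq.items() if c / len(docs) > threshold}

docs = ["patient shows fever", "patient is stable", "patient has cough"]
print(corpus_stopwords(docs))  # {'patient'}
```

In a clinical corpus like this toy example, "patient" carries almost no discriminative information, even though no generic stop-word list contains it.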
Spelling Correction: Spelling correction is one of the important transformations to apply when implementing a search engine. The implementation can be as traditional as Levenshtein distance or as sophisticated as a sequence-to-sequence model. Existing spelling-correction implementations tend to work quite poorly on nouns while working decently on verbs.
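The traditional end of that spectrum can be sketched with the classic dynamic-programming edit distance and a nearest-word lookup over a vocabulary. The vocabulary here is a made-up example:

```python
def levenshtein(a, b):
    # Row-by-row dynamic programming: prev[j] holds the edit distance
    # between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def correct(word, vocabulary):
    # Return the vocabulary word with the smallest edit distance.
    return min(vocabulary, key=lambda v: levenshtein(word, v))

vocab = ["computer", "science", "language"]
print(correct("compuyter", vocab))  # computer
```

For a real search engine you would restrict candidates by length or prefix and break distance ties by word frequency.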
Apart from choosing the transformations, one also needs to stack them in the pipeline in a sensible order. If an application requires both punctuation removal and contraction expansion, then punctuation removal must come after contraction expansion; otherwise the contractions are destroyed before they can be expanded.
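The ordering issue can be demonstrated with two tiny stand-in functions (the single-entry contraction table is just for illustration):

```python
import string

CONTRACTIONS = {"don't": "do not"}

def expand(text):
    for contraction, full in CONTRACTIONS.items():
        text = text.replace(contraction, full)
    return text

def remove_punct(text):
    # Drop every punctuation character, including apostrophes.
    return text.translate(str.maketrans("", "", string.punctuation))

text = "don't worry!"
print(remove_punct(expand(text)))  # "do not worry"  - correct order
print(expand(remove_punct(text)))  # "dont worry"    - expansion fails
```

In the wrong order, don’t has already become dont, so the contraction table no longer matches it.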
Having said that, feel free to correct me wherever you see fit by commenting down below.