Preprocessing a text corpus is a mandatory step for almost any NLP application. Some standard steps apply to most applications, while others require customized preprocessing. In this blog, I will discuss a few data transformation steps that I personally use while working with textual data. We will also talk about the limitations of such transformations.
P.S. These preprocessing steps are not mandatory for all applications; they are simply the ones I have found useful over time.
Below are a few transformations that I usually apply, depending on how the problem statement is defined.
Lowercasing: Computers are not humans: to a machine, COMPUTER, computer and Computer do not mean the same thing unless it is trained to treat them that way. A simple word count will give a count of 1 for each of the three variants, which is wrong when dealing with semantics. Training a system to learn this equivalence requires a large dataset where the context of such instances remains the same. A much simpler and effective technique is to normalize everything to a single casing style. In practice, people usually lowercase the words.
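The effect on word counts can be sketched in a few lines. This is a minimal illustration, not a full pipeline:

```python
from collections import Counter

# Without normalization, casing variants are counted as distinct words.
words = ["COMPUTER", "computer", "Computer"]
raw_counts = Counter(words)
lower_counts = Counter(w.lower() for w in words)

print(raw_counts)    # three separate keys, each with count 1
print(lower_counts)  # a single key "computer" with count 3
```

After lowercasing, all three variants collapse into one vocabulary entry.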
Punctuation Removal: Punctuation can be tricky. In general, boundary punctuation can be removed without any issues, but the same does not hold when punctuation occurs within a word. Such cases do not work well with tokenizers. You also lose word structure, e.g. don’t -> dont, after which your contraction-expansion function can no longer match the word.
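One safe way to do this is to strip punctuation only at token boundaries, leaving intra-word marks intact. A minimal sketch using the standard library:

```python
import string

def strip_boundary_punct(token):
    # str.strip removes the given characters only from the ends of the
    # string, so the apostrophe inside "don't" is left untouched.
    return token.strip(string.punctuation)

print(strip_boundary_punct('"Hello!"'))  # Hello
print(strip_boundary_punct("don't"))     # don't
```

Compare this with a blanket `str.translate` removal, which would turn don’t into dont and break downstream contraction expansion.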
Adding extra space: This is a precautionary step: add a white space after the end of a lexical unit so that tokenization works properly. Existing sentence tokenizers, despite high accuracy, fail to capture some edge cases. For example, This is a good book.I like reading it. will be segmented as just one sentence by the NLTK sentence tokenizer. Such cases should be resolved to This is a good book. I like reading it. before sending the text to NLTK.
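A simple heuristic for this repair is a regex that inserts a space after a period sandwiched between a lowercase letter and a capital letter. This is a sketch with obvious limitations (it will not touch abbreviations like U.S.A., but it also will not catch every missing space):

```python
import re

def fix_missing_space(text):
    # Insert a space after "." only when it follows a lowercase letter
    # and precedes a capital, which usually indicates a sentence break.
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)

print(fix_missing_space("This is a good book.I like reading it."))
# This is a good book. I like reading it.
```

After this repair, the NLTK sentence tokenizer segments the text into two sentences as expected.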
Stripping ends: This solves the same kind of problem that lowercasing solves. Think of a scenario where there are consecutive spaces or stray punctuation between lexical units such as words, sentences or paragraphs. While doing word/sentence/paragraph segmentation, we do not want our system to differentiate between computer and computer<space>.
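In Python this is typically a combination of trimming the ends and collapsing internal whitespace runs. A minimal sketch:

```python
import re

def normalize_space(text):
    # Collapse any run of whitespace to a single space, then trim the
    # ends, so "computer " and "computer" become identical tokens.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_space("  computer   science  "))  # computer science
```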
Expanding Contractions: Expanding contractions simply means normalizing don’t -> do not, doesn’t -> does not, etc. But this is not a trivial task, as there are ambiguous cases that need to be handled, e.g. he’ll -> he shall / he will. Wikipedia has an exhaustive list of common English-language contractions; you can find it here. A simple resolution is to greedily pick one expansion from the Wikipedia list for every contraction you encounter.
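The greedy approach can be implemented as a dictionary lookup driven by a regex. The mapping below is a tiny illustrative subset, not the full Wikipedia list, and ambiguous forms like he’ll are greedily mapped to one expansion:

```python
import re

# Illustrative subset; ambiguous contractions get one fixed expansion.
CONTRACTIONS = {
    "don't": "do not",
    "doesn't": "does not",
    "can't": "cannot",
    "he'll": "he will",  # could also mean "he shall" - greedy choice
}

_pattern = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    # Look up each matched contraction (lowercased) in the table.
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("we don't know if he'll come"))
# we do not know if he will come
```

Note that this sketch drops the original capitalization of the contraction; a production version would need to restore it.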
Stemming: Stemming is the process of converting a word to its root form by removing boundary morphemes. In my personal experience, stemming should be avoided when both the input and output of the model are natural language, for example in abstractive summarization systems: you would not want to see stemmed words in the summary. Stemming plays a good role in classification tasks, because the vector representation you choose will not create redundant dimensions for unstemmed variants of the same word.
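To make the idea of stripping boundary morphemes concrete, here is a deliberately naive suffix-stripping sketch. A real system would use something like NLTK's PorterStemmer instead; this toy only handles a handful of suffixes:

```python
def naive_stem(word):
    # Toy stemmer: strip a few common suffixes, keeping at least a
    # three-letter stem. Real stemmers apply ordered rule sets.
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "plays", "player"]:
    print(naive_stem(w))  # all four map to "play"
```

In a bag-of-words classifier, these four surface forms would otherwise occupy four separate dimensions.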
Corpus specific Stop-words: Stop-words are not always generic. Apart from common language-specific terms, there are corpus-specific repetitive words that may not be useful in the analysis and act as noise. The frequency threshold for treating such words as stop-words is usually chosen by analyzing the corpus and finding the percentage occurrence of each word. I will not say we should always remove language-specific stop-words either, because there are use cases like grammar correction where your system has to appropriately add articles to a sentence; there you cannot put a, an, the in the stop list.
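The document-frequency analysis can be sketched as follows. The threshold value here is an arbitrary assumption for illustration; in practice it is tuned per corpus:

```python
from collections import Counter

def corpus_stopwords(docs, threshold=0.5):
    # Flag words that appear in more than `threshold` fraction of
    # documents as corpus-specific stop-word candidates.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    return {w for w, c in doc_freq.items() if c / len(docs) > threshold}

docs = ["patient shows fever", "patient is stable", "patient has cough"]
print(corpus_stopwords(docs))  # {'patient'}
```

In a clinical corpus like this toy example, "patient" carries almost no discriminative information, even though no generic stop-word list contains it.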
Spelling Correction: Spelling correction is one of the important transformations to apply when implementing a search engine. The implementation can be as traditional as Levenshtein distance or as sophisticated as a sequence-to-sequence model. Existing spelling-correction implementations tend to work quite poorly on nouns while working decently on verbs.
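The traditional end of that spectrum can be sketched with the classic dynamic-programming edit distance and a nearest-word lookup over a vocabulary. The vocabulary here is a made-up example:

```python
def levenshtein(a, b):
    # Row-by-row dynamic programming: prev[j] holds the edit distance
    # between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def correct(word, vocabulary):
    # Return the vocabulary word with the smallest edit distance.
    return min(vocabulary, key=lambda v: levenshtein(word, v))

vocab = ["computer", "science", "language"]
print(correct("compuyter", vocab))  # computer
```

For a real search engine you would restrict candidates by length or prefix and break distance ties by word frequency.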
Apart from choosing the transformations, one also needs to stack them in the pipeline in a sensible order. If an application requires both punctuation removal and contraction expansion, then punctuation removal must come after contraction expansion; otherwise the contractions are destroyed before they can be expanded.
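The ordering issue can be demonstrated with two tiny stand-in functions (the single-entry contraction table is just for illustration):

```python
import string

CONTRACTIONS = {"don't": "do not"}

def expand(text):
    for contraction, full in CONTRACTIONS.items():
        text = text.replace(contraction, full)
    return text

def remove_punct(text):
    # Drop every punctuation character, including apostrophes.
    return text.translate(str.maketrans("", "", string.punctuation))

text = "don't worry!"
print(remove_punct(expand(text)))  # "do not worry"  - correct order
print(expand(remove_punct(text)))  # "dont worry"    - expansion fails
```

In the wrong order, don’t has already become dont, so the contraction table no longer matches it.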
Having said that, feel free to correct me wherever you see fit by commenting down below.