Natural Language Processing (NLP) is a subfield of Linguistics and Computer Science that deals with processing and understanding huge volumes of unstructured text data.
To leverage NLP to analyze textual data, you’ll have to come up with a numeric representation of the text corpus, or the collection of documents. This tutorial will teach you all about Document-Term Matrix (DTM)—a matrix representation of the text corpus.
By the end of this tutorial, you'll have learned enough to understand the Document-Term Matrix representation, and the significance of the TF-IDF score. And you'll also be able to generate Document-Term matrices for any text corpus in scikit-learn.
In the context of NLP, a text corpus is simply a collection of documents. And a document can be a sentence, a group of sentences , or even a phrase—depending upon the use case.
Here’s a simple example of a corpus with only 3 documents
D1
, D2
and D3
.D1: NLP is super cool to learn and apply.
D2: I am currently learning NLP, you can too!
D3: CS224n at Stanford is the best NLP class you can ever take!
Let's suppose the text corpus contains M documents, and N terms in all. The general structure of the Document-Term Matrix is shown below:
Document-Term Matrix (Image by the author)
Let's parse the matrix representation:
D1
, D2
,..., DM
are the M
documentsT1
, T2
,..., TN
are the N
terms.One of the simplest ways of populating the Document-Term Matrix is using the number of occurrences of the
N
terms across all the M
documents.w11
denotes the number of times the term T1
occurs in the document D1
, w12
denotes the number of times the term T2
occurs in the document D1
, and so on.wij
denotes the number of times the term Tj
occurs in the document Di
.This way, a term that occurs frequently in the text corpus will have a much higher score, and a term that occurs rarely in the text corpus will have a much lower score. The ordering and semantic relationship between the terms is lost.
Populating the Document-Term Matrix as shown above—with the number of times a term occurs in the documents is both simple and intuitive. However, are there any caveats you should be aware of?
Head over to the next section to find out.
In populating the Document-Term Matrix with the number of occurrences, frequently occurring terms are assigned a higher score than the rarely occurring terms.
Now, take a step back, and ask yourself the question: "Is a frequently occurring term necessarily important?"
Well, this may not always be the case.
For example, when you're reading a piece of text on astronomy—you'll come across words like 'sky' and 'stars' very often. But, would you not like to see beyond these words to understand what the text is talking about?
This is exactly the motivation behind TF-IDF score—another useful metric that you can use to populate the Document-Term Matrix.
TF-IDF score is a combination of two metrics: the Term Frequency (TF) and the Inverse Document Frequency (IDF).
The TF-IDF score is given by the following equation:
where,
TF_ij
is the number of times the term Tj
occurs in the document Di
dfj
is the number of documents containing the term Tj
Let's parse the above expression:
TF_ij
is called the Term Frequency (TF). It is identical to the count we had in the previous section: the number of times the term Tj
occurs in the document Di
. So, the more frequently a term occurs in a particular document, the higher its TF score.Here,
•
M
is the total number of documents, and •
df_j
is the number of documents containing the term TF_ij
Let's first understand the significance of the IDF score.
If the term
Tj
occurs in all the documents, the value of df_j = M
. This would leave you with log(M/M) = log(1)
Tj
occurs in only one of the M
documents, you'll have log(M/1) = log(M)
. Notice how this is the maximum value that the IDF score can take.
Putting it all together:
In simple terms, the TF-IDF score says, "Hey! I believe terms that occur frequently in a particular document are likely more important than terms that occur frequently in all documents in the corpus!"
Though the TF-IDF score fails to capture contextual relationships between terms in the corpus, it's still considered an effective metric and is preferred over counting only the number of occurrences.
Now that you’ve learned the basics of Document-Term Matrix and the TF-IDF scores, let’s now see how you can generate the DTM and populate it with the TF-IDF score in scikit-learn.
1️⃣ Let's first get our text corpus
The following code cell contains a piece of text on computer programming—and this will be our corpus of interest.
text=["Computer programming is the process of designing and building an executable computer program to accomplish a specific computing result or to perform a specific task.",
"Programming involves tasks such as: analysis, generating algorithms, profiling algorithms' accuracy and resource consumption, and the implementation of algorithms in a chosen programming language (commonly referred to as coding).",
"The source code of a program is written in one or more languages that are intelligible to programmers, rather than machine code, which is directly executed by the central processing unit.",
"The purpose of programming is to find a sequence of instructions that will automate the performance of a task (which can be as complex as an operating system) on a computer, often for solving a given problem.",
"Proficient programming thus often requires expertise in several different subjects, including knowledge of the application domain, specialized algorithms, and formal logic.",
"Tasks accompanying and related to programming include: testing, debugging, source code maintenance, implementation of build systems, and management of derived artifacts, such as the machine code of computer programs.",
"These might be considered part of the programming process, but often the term software development is used for this larger process with the term programming, implementation, or coding reserved for the actual writing of code.",
"Software engineering combines engineering techniques with software development practices.",
"Reverse engineering is a related process used by designers, analysts and programmers to understand and re-create/re-implement"]
2️⃣ The next step is to import the
TfidfVectorizer
class from scikit-learn's feature extraction module for text data. This TfidfVectorizer
helps generate the Document-Term Matrix populated with the TF-IDF score.from sklearn.feature_extraction.text import TfidfVectorizer
3️⃣ Let's instantiate a
Tfidfvectorizer
object. Let's call it vectorizer
vectorizer = TfidfVectorizer(stop_words='english',smooth_idf=True)
▶️ Now, call the
fit_transform()
method on the vectorizer
with text
as the argument.input_matrix = vectorizer.fit_transform(text)
print(input_matrix)
# [truncated view]
# sparse representation of the DTM
# non-zero indices and TF-IDF score
(0, 83) 0.21527315443847012
(0, 55) 0.25487697925239533
(0, 73) 0.25487697925239533
(0, 20) 0.25487697925239533
(0, 80) 0.5097539585047907
(0, 1) 0.25487697925239533
(0, 63) 0.21527315443847012
(0, 33) 0.25487697925239533
(0, 11) 0.25487697925239533
(0, 27) 0.25487697925239533
As you can see, the
input_matrix
has the scores, and the indices corresponding to the non-zero entries only—and not the entire matrix.To cast it into a matrix of dimensions
M x N
, you can use the to_dense()
method, as shown in the code snippet below:input_matrix = vectorizer.fit_transform(text).todense()
# Truncated view of the entire matrix
[[0. 0.25487698 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.25487698
0. 0. 0. 0. 0. 0.
0. 0.37434759 0.25487698 0. 0. 0.
0. 0. 0. 0.25487698 0. 0.
0. 0. 0. 0.25487698 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0.25487698 0. 0. 0. 0.18717379
0. 0. 0. 0.21527315 0. 0.13251329
0. 0. 0. 0. 0. 0.
0. 0.25487698 0. 0. 0. 0.
0. 0. 0.50975396 0. 0. 0.21527315
0. 0. 0. 0. 0. 0.
0. 0. 0. ]
[0. 0. 0.22103745 0. 0.56007525 0.22103745
0. 0. 0. 0. 0. 0.
0. 0.22103745 0. 0.18669175 0. 0.22103745
0. 0. 0. 0. 0.22103745 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0.22103745 0. 0. 0.16232309 0.
0. 0. 0. 0.22103745 0. 0.22103745
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0.22103745 0. 0. 0.22983952
0. 0. 0.22103745 0. 0. 0.
0.22103745 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.18669175 0. 0. 0. 0. 0.
0. 0. 0. ]
I hope this tutorial helped you understand what the Document-Term Matrix (DTM) is, and the significance of the TF-IDF score.
You can use the above code to generate the DTM for your favorite piece of text—simply replace the
text
with your text corpus, and you're all ready to go!Generating the DTM is often the first step in many unsupervised NLP tasks such as topic modeling.
▶️ Confusion Matrix in Machine Learning: Everything You Need to Know
▶️ Learn K-Means Clustering by Quantizing Color Images in Python
▶️ 9 Best Data Engineering Courses You Should Take in 2022
See you all soon in another post. Until then, happy coding!🎉