One hot encoding is a process that converts categorical variables into a numeric form that ML algorithms can use to do a better job at prediction.
So, you're playing with ML models and you encounter this "One hot encoding" term all over the place. You see the sklearn documentation for one hot encoder and it says "Encode categorical integer features using a one-hot aka one-of-K scheme." It's not all that clear, right? Or at least it was not for me. So let's look at what one hot encoding actually is.
Suppose the dataset is as follows:
╔═════════════╦══════════════════╦════════╗
║ CompanyName ║ CategoricalValue ║ Price  ║
╠═════════════╬══════════════════╬════════╣
║ VW          ║ 1                ║ 20000  ║
║ Acura       ║ 2                ║ 10011  ║
║ Honda       ║ 3                ║ 50000  ║
║ Honda       ║ 3                ║ 10000  ║
╚═════════════╩══════════════════╩════════╝
The categorical value is the number assigned to each company name in the dataset. For example, if there were another company in the dataset, it would have been given the categorical value 4. As the number of unique entries increases, so does the range of categorical values.
The previous table is just a representation. In reality, for N categories the categorical values start from 0 and go all the way up to N-1.
As you probably already know, the categorical value assignment can be done using sklearn's LabelEncoder.
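To make the 0 to N-1 assignment concrete, here is a minimal sketch in plain Python. It is a stand-in for what a label encoder does, not sklearn's implementation. The helper name `label_encode` is made up for illustration, and it assumes labels are handed out in sorted order of the category names (which is how sklearn's LabelEncoder orders its classes), so the numbers differ from the illustrative table above.

```python
def label_encode(values):
    """Map each distinct category to an integer from 0 to N-1 (hypothetical helper)."""
    # Sort the unique categories so the mapping is deterministic.
    classes = sorted(set(values))
    mapping = {category: index for index, category in enumerate(classes)}
    return [mapping[value] for value in values], classes


companies = ["VW", "Acura", "Honda", "Honda"]
codes, classes = label_encode(companies)
print(classes)  # ['Acura', 'Honda', 'VW']
print(codes)    # [2, 0, 1, 1]
```

With three unique companies, the codes run from 0 to 2, and both Honda rows share the same code.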
Now let's get back to one hot encoding. Say we follow the instructions in sklearn's documentation for one hot encoding and do a little cleanup afterwards. We end up with the following:
╔════╦═══════╦═══════╦═══════╗
║ VW ║ Acura ║ Honda ║ Price ║
╠════╬═══════╬═══════╬═══════╣
║ 1  ║ 0     ║ 0     ║ 20000 ║
║ 0  ║ 1     ║ 0     ║ 10011 ║
║ 0  ║ 0     ║ 1     ║ 50000 ║
║ 0  ║ 0     ║ 1     ║ 10000 ║
╚════╩═══════╩═══════╩═══════╝
0 indicates that the category does not apply to that row, while 1 indicates that it does.
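The step from the first table to the second can be sketched in a few lines of plain Python. This is an illustration rather than sklearn's OneHotEncoder. The function name `one_hot` is hypothetical, and the column order (order of first appearance) is an assumption chosen to match the table above.

```python
def one_hot(values):
    """Return (columns, rows); each row has a single 1 marking its category."""
    # Keep categories in order of first appearance: VW, Acura, Honda.
    columns = list(dict.fromkeys(values))
    rows = [[1 if value == column else 0 for column in columns] for value in values]
    return columns, rows


companies = ["VW", "Acura", "Honda", "Honda"]
prices = [20000, 10011, 50000, 10000]

columns, rows = one_hot(companies)
print(columns + ["Price"])
for row, price in zip(rows, prices):
    print(row + [price])
```

Every row contains exactly one 1, which is where the name "one hot" comes from.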
Before we proceed further, could you think of one reason why label encoding alone is not sufficient for training the model? Why do you need one hot encoding?
The problem with label encoding is that it implies an order: the higher the categorical value, the better the category. "Wait, what!?"
Let me explain: what this form of organization presupposes is VW < Acura < Honda, based on the categorical values. Suppose your model internally calculates an average; then we get (1 + 3) / 2 = 2. This implies that the average of VW and Honda is Acura. This is definitely a recipe for disaster. This model's predictions would have a lot of errors.
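The averaging problem is easy to reproduce. The sketch below hard-codes the label values from the first table and a hand-written one hot mapping (both only for illustration) to contrast the two.

```python
labels = {"VW": 1, "Acura": 2, "Honda": 3}

# Averaging the integer codes of VW and Honda lands exactly on Acura's code.
mean_code = (labels["VW"] + labels["Honda"]) / 2
print(mean_code == labels["Acura"])  # True

one_hot_vectors = {"VW": [1, 0, 0], "Acura": [0, 1, 0], "Honda": [0, 0, 1]}

# Averaging the one hot vectors gives half VW, half Honda, and no Acura.
mean_vector = [
    (a + b) / 2 for a, b in zip(one_hot_vectors["VW"], one_hot_vectors["Honda"])
]
print(mean_vector)  # [0.5, 0.0, 0.5]
```

With one hot vectors, the average no longer points at an unrelated category.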
This is why we use one hot encoder to perform "binarization" of the category and include it as a feature to train the model.
Another example: suppose you have a "flower" feature which can take the values "daffodil", "lily", and "rose". One hot encoding converts the "flower" feature into three features, "is_daffodil", "is_lily", and "is_rose", which are all binary.
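As a rough sketch of the flower example, the snippet below builds the three binary features by hand. The `is_` prefix follows the text; the sample rows and the helper name `binarize` are invented for illustration.

```python
def binarize(values, categories, prefix="is_"):
    """Turn one categorical column into one 0/1 column per category."""
    return {
        prefix + category: [int(value == category) for value in values]
        for category in categories
    }


flowers = ["rose", "daffodil", "lily", "rose"]  # made-up sample data
features = binarize(flowers, ["daffodil", "lily", "rose"])
for name, column in features.items():
    print(name, column)
# is_daffodil [0, 1, 0, 0]
# is_lily [0, 0, 1, 0]
# is_rose [1, 0, 0, 1]
```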
One hot encoding explained in an image
Lead image via https://i.stack.imgur.com/mfsNd.png