What is One Hot Encoding? Why and When Do You Have to Use it?

One hot encoding is a process by which categorical variables are converted into a form that can be provided to ML algorithms so that they do a better job at prediction.

So, you’re playing with ML models and you encounter this “one hot encoding” term all over the place. You look at the sklearn documentation for the one hot encoder and it says “Encode categorical integer features using a one-hot aka one-of-K scheme.” It’s not all that clear, right? Or at least it was not for me. So let’s look at what one hot encoding actually is.

Suppose the dataset is as follows:

╔═════════════╦═══════════════════╦════════╗
║ CompanyName ║ Categorical value ║ Price  ║
╠═════════════╬═══════════════════╬════════╣
║ VW          ║         1         ║ 20000  ║
║ Acura       ║         2         ║ 10011  ║
║ Honda       ║         3         ║ 50000  ║
║ Honda       ║         3         ║ 10000  ║
╚═════════════╩═══════════════════╩════════╝


The categorical value is the number assigned to the company in each entry. For example, if there were another company in the dataset, it would be given the categorical value 4. As the number of unique companies increases, the largest categorical value increases with it.

The previous table is just an illustration. In reality, the categorical values start from 0 and go all the way up to N-1, where N is the number of categories.

As you probably already know, the categorical value assignment can be done using sklearn’s LabelEncoder.
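Here is a minimal sketch of that step, assuming scikit-learn is installed. The list `companies` is just the CompanyName column from the table above, typed in by hand. Note that LabelEncoder sorts the categories and starts counting at 0, so the integers it produces are not the same as the illustrative 1, 2, 3 in the table.

```python
from sklearn.preprocessing import LabelEncoder

# The CompanyName column from the example table (entered by hand for this sketch)
companies = ["VW", "Acura", "Honda", "Honda"]

encoder = LabelEncoder()
labels = encoder.fit_transform(companies)

# classes_ holds the unique categories in sorted order;
# each label is simply an index into that array
print(list(encoder.classes_))
print(labels.tolist())
```

Each company gets one integer, and both Honda rows share the same value.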

Now let’s get back to one hot encoding. Say we follow the instructions given in sklearn’s documentation for one hot encoding and then do a little cleanup. We end up with the following:

╔════╦═══════╦═══════╦════════╗
║ VW ║ Acura ║ Honda ║ Price  ║
╠════╬═══════╬═══════╬════════╣
║ 1  ║ 0     ║ 0     ║ 20000  ║
║ 0  ║ 1     ║ 0     ║ 10011  ║
║ 0  ║ 0     ║ 1     ║ 50000  ║
║ 0  ║ 0     ║ 1     ║ 10000  ║
╚════╩═══════╩═══════╩════════╝


A 0 indicates that the row does not belong to that company, while a 1 indicates that it does.
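As a rough sketch of how a table like this could be produced, here is one way to do it with sklearn’s OneHotEncoder, assuming scikit-learn and NumPy are installed. The variable names are my own, and the columns come out in sorted order (Acura, Honda, VW) rather than the order drawn in the table, which is part of the “little cleanup” you would do afterwards.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input: one row per sample, one column per feature
companies = np.array([["VW"], ["Acura"], ["Honda"], ["Honda"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array
one_hot = encoder.fit_transform(companies).toarray().astype(int)

# categories_[0] lists the categories of the first (and only) feature,
# in the same order as the columns of one_hot
columns = list(encoder.categories_[0])

print(columns)
print(one_hot)
```

Every row contains exactly one 1, in the column of the company that row belongs to.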

Before we proceed further, could you think of one reason why label encoding alone is not sufficient to provide to the model for training? Why do you need one hot encoding?

The problem with label encoding is that the model may assume that the higher the categorical value, the better the category. “Wait, what!?”

Let me explain: what this form of organization presupposes is an ordering, VW < Acura < Honda, based on the categorical values 1, 2, and 3. Suppose your model internally calculates an average; then we get (1 + 3) / 2 = 2. This implies that the average of VW and Honda is Acura. This is definitely a recipe for disaster. Such a model’s predictions would have a lot of errors.
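To make that arithmetic concrete, here is a small plain-Python sketch. The dictionaries `labels` and `one_hot` are hand-written copies of the two tables above, not the output of any library.

```python
# Label-encoded values from the first table
labels = {"VW": 1, "Acura": 2, "Honda": 3}

# Averaging the labels of VW and Honda lands exactly on Acura's label
mean_label = (labels["VW"] + labels["Honda"]) / 2
print(mean_label, mean_label == labels["Acura"])

# One hot vectors from the second table, in (VW, Acura, Honda) order
one_hot = {"VW": [1, 0, 0], "Acura": [0, 1, 0], "Honda": [0, 0, 1]}

# Averaging the one hot vectors gives "half VW, half Honda" and no Acura at all
mean_vector = [(a + b) / 2 for a, b in zip(one_hot["VW"], one_hot["Honda"])]
print(mean_vector)
```

With labels, the average collides with an unrelated category. With one hot vectors, the Acura component of the average stays at 0.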

This is why we use a one hot encoder to perform “binarization” of the category and include the resulting columns as features to train the model.

Another example: suppose you have a ‘flower’ feature which can take the values ‘daffodil’, ‘lily’, and ‘rose’. One hot encoding converts the ‘flower’ feature into three binary features: ‘is_daffodil’, ‘is_lily’, and ‘is_rose’.
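The flower example can be sketched with pandas as well, assuming pandas is installed. The four sample rows are made up for illustration, and `get_dummies` is just one convenient way to do this, not the only one.

```python
import pandas as pd

# A made-up 'flower' column for illustration
df = pd.DataFrame({"flower": ["daffodil", "lily", "rose", "lily"]})

# prefix="is" names the new columns is_daffodil, is_lily, is_rose;
# dtype=int gives 0/1 values instead of booleans
encoded = pd.get_dummies(df["flower"], prefix="is", dtype=int)

print(encoded)
```

Each row has a 1 in exactly one of the three columns, matching the flower it started with.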

One hot encoding explained in an image

Lead image via https://i.stack.imgur.com/mfsNd.png