The Concept Behind "Mean Target Encoding" in AI & ML

by Sanjay Kumar, January 19th, 2023

Too Long; Didn't Read

Mean target encoding is a special kind of categorical data encoding technique, widely accepted as one of the pre-eminent approaches to treating categorical variables. In this article, we will discuss the intuition, examples, pros and cons of this technique.


Introduction

Categorical encoding is undoubtedly an integral part of data pre-processing in machine learning. Although the most common categorical encoding techniques are label and one-hot encoding, there are many other efficient methods that students and beginners often overlook when preparing data for a statistical model. Target encoding is one such efficient technique, widely accepted as one of the pre-eminent approaches to treating categorical variables. In this article, we will discuss its intuition, examples, pros and cons.


The basic intuition and a real-life example

Mean target encoding is a special kind of categorical data encoding technique followed as a part of the feature engineering process in machine learning.


Machine learning algorithms do not understand categorical variables in the form of strings. A categorical variable is one that has two or more categories; for example, gender is a categorical variable with two categories (male and female).


Any machine learning algorithm has to represent these variables as vectors in multidimensional space for purposes like hyperplane generation, distance calculation, and grouping data points.


When the variables are strings, as in the examples above, they cannot be placed in multi-dimensional space at uniquely identifiable axis points. Therefore, we have to convert (encode) them into numerical values corresponding to each category. Mean target encoding is one such technique. Let us understand the procedure behind it through a real-life example.


Consider that we need to predict whether a person is male or female using three independent variables-

  1. Age
  2. Profession
  3. Salary


Our data looks like this -

Image source: Illustrated by the author


Here, “Age” and “Salary” are numerical columns. Hence, we don't need any encoding for those columns. However, “Profession” is a  categorical column with 3 categories (Engineer, Lawyer and Nurse). Hence, we need to do the target mean encoding for this column.


In the target column, i.e. “Gender”, we have 2 classes - “Male” and “Female”. We can consider “Male” as the positive class and “Female” as the negative class for this problem.


First of all, we need to measure the frequency of the “Profession” categories for the positive and negative target classes.


For example,

Let us take the case of Engineers.

Image source: Illustrated by the author

We have a total of 5 engineers, and 4 of them are male (the positive class). Hence, we take the fraction of the positive class for this category, i.e. 4/5 = 0.8


Similarly, we can do this task for the Lawyers

Image source: Illustrated by the author
Fraction = 1/2 = 0.5


The same process has to be done for the Nurses

Image source: Illustrated by the author
Fraction = 1/3 ≈ 0.33

The final inference after this procedure will look like this-

Image source: Illustrated by the author


Now, we can just replace these numerical values for each category in the feature column-

Image source: Illustrated by the author


We can eliminate the original column since we have the encoded values with us-

Image source: Illustrated by the author

Now, we can see that all the features are turned into numerical columns. Hence, we can proceed with this data for implementing any machine learning algorithm.
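The per-category computation above can be sketched in pandas. The table below is a hypothetical reconstruction of the article's toy data (the column names and exact rows are assumptions, built only from the stated counts: 5 engineers of whom 4 are male, 2 lawyers of whom 1 is male, 3 nurses of whom 1 is male):

```python
import pandas as pd

# Hypothetical reconstruction of the article's table
df = pd.DataFrame({
    "Profession": ["Engineer"] * 5 + ["Lawyer"] * 2 + ["Nurse"] * 3,
    "Gender": ["Male", "Male", "Male", "Male", "Female",
               "Male", "Female",
               "Male", "Female", "Female"],
})

# Positive class = "Male"; the encoding is the fraction of positives per category
df["target"] = (df["Gender"] == "Male").astype(int)
means = df.groupby("Profession")["target"].mean()

# Replace each category with its learned mean
df["Profession_encoded"] = df["Profession"].map(means)

print(means)  # Engineer 0.8, Lawyer 0.5, Nurse 0.333...
```

The `groupby(...).mean()` call is exactly the "fraction of the positive class per category" step worked out by hand above.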


Now, we have used the target column values to encode our feature column. Naturally, you may wonder how we can encode the feature column of the test data, since in the test data we will not have access to the target column values.


The answer is simple.


We do all of the above on the training set. Once we have derived the mean target value for each category, like this-

Image source: Illustrated by the author


We can simply replace these categories with the same mean target values in the test data as well.
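Applying the learned encoding to the test data is just a lookup. A minimal sketch (the mapping values come from the worked example above; the "Doctor" row and the 0.6 global-mean fallback are hypothetical additions to show one common way of handling a category unseen in training):

```python
import pandas as pd

# Category -> mean mapping learned on the training set (values from the article)
train_means = {"Engineer": 0.8, "Lawyer": 0.5, "Nurse": 1/3}
global_mean = 0.6  # hypothetical overall positive rate of the training target

test = pd.DataFrame({"Profession": ["Nurse", "Engineer", "Doctor"]})

# "Doctor" has no learned mean; fall back to the global training mean
test["Profession_encoded"] = (
    test["Profession"].map(train_means).fillna(global_mean)
)
```

No target column is needed at test time: the encoding is frozen at training time and reused as a plain dictionary lookup.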


Advantages

  • Empirically, mean target encoding often performs much better than label encoding and one-hot encoding when the train and test data have approximately the same class distribution.


  • Compared with other techniques such as one-hot encoding, target encoding considerably reduces the dimensionality of the data: each categorical column becomes a single numerical column rather than one column per category.


Disadvantages

  • On the flip side, this technique depends heavily on the distribution of the target variable in the data. It may be very appropriate for some datasets but not for others, so the decision to use it should be taken only after carefully inspecting the characteristics of the dataset.


  • Also, this technique is prone to overfitting, so practitioners should apply proper regularisation techniques alongside model building.


  • Since the encoded values are derived from the target column, there is a risk of target leakage. Practitioners should take appropriate precautions so that the full target information is not exposed to the feature variable. Many prefer cross-validation to overcome this challenge.


    For example,

    Let’s assume that we split our data into 4 folds. We will keep one fold as the holdout set and calculate the mean target values using the remaining 3 folds.


    Then, substitute these values for the categories present in the holdout set.

Image source: Illustrated by the author


Repeat this procedure for all other folds-

Image source: Illustrated by the author

Finally, calculate the mean value of each category.

Let's say the data looks like this after the k-fold cross-validation-

Image source: Illustrated by the author


Take the mean value for each of the categories -

Engineer : (0.7 + 0.8 + 0.8 + 0.9 + 0.9 )/5 = 0.82

Nurse : (0.4 + 0.4 + 0.3 )/3 = 0.37

Lawyer : (0.5 + 0.6 )/2 = 0.55


Substitute these values for the corresponding categories-

Image source: Illustrated by the author


Now, this data can be used for further model-building procedures.
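The fold-wise procedure above can be sketched as a small helper. This is a sketch, not the article's exact code: the function name is mine, and using scikit-learn's `KFold` for the splitting is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(categories, y, n_splits=4, seed=0):
    """Out-of-fold mean target encoding: each row is encoded with
    category means computed on the remaining folds only."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, hold_idx in kf.split(categories):
        # Mean target per category, computed on the fitting folds only
        fold_means = y.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        encoded.iloc[hold_idx] = categories.iloc[hold_idx].map(fold_means).to_numpy()
    # Categories never seen in the fitting folds fall back to the global mean
    return encoded.fillna(y.mean())

# Toy usage: two categories with a binary (0/1) target
cats = pd.Series(["a"] * 4 + ["b"] * 4)
y = pd.Series([1, 1, 0, 0, 1, 0, 0, 0])
enc = kfold_target_encode(cats, y)
```

Because each row's encoding is computed without its own fold, no row ever "sees" its own target value, which is the leakage protection this cross-validation scheme provides.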


  • Another issue we may face is an uneven distribution of categories between the train and test data. In such a case, rare categories may take extreme encoded values. Therefore, the target mean of each category is often blended with the marginal (global) mean of the target.


Conclusion

To summarize, encoding categorical data is an unavoidable part of feature engineering. It is essential to understand all types of encoding, not just the popular ones, because in real-world problems we often have to choose one specific encoding method for the model to work properly. Note that different encoders can yield different model results; hence, selecting the encoding type for a particular dataset/problem is one of the critical decisions a researcher makes.


I hope this article gave you a basic understanding of the concept behind target encoding. The technique is available as a built-in library in most data-science-oriented programming languages such as Python and R, so it is easy to implement once you understand the theoretical intuition. I have added links to some advanced materials in the references section, where you can dive deeper into the calculations if you are interested.

References

  1. Target encoding - H2O documentation
  2. Target encoding tutorial - Kaggle Blog
  3. Category Encoders - scikit-learn contrib documentation
  4. NIST/SEMATECH (2008). Handbook of Statistical Methods.
  5. Friendly, Michael (2000). Visualizing Categorical Data. SAS Institute.
  6. Christensen, Ronald (1997). Log-linear Models and Logistic Regression. Springer Texts in Statistics (2nd ed.). New York: Springer-Verlag. pp. xvi+483. ISBN 0-387-98247-7. MR 1633357.