In data science, encoding is the process of converting data from one form to another, usually from a human-readable format to a machine-readable format. Decoding is the reverse process, converting data from a machine-readable format back into a human-readable format.
Take machine learning and numeric data as an example. Machine learning models require numeric input in order to operate, so categorical features that are not numeric must be converted into numeric values before a model can use them to generate predictions.
This section covers the various methods for encoding data:
Label Encoding: represents each category of a feature as a number.
For example, suppose a dataset describes Taiwanese cities, and the categories encountered during analysis include Taipei City, New Taipei City, and Taoyuan City. Each category is represented by a number: Taipei City can be converted to 1, New Taipei City to 2, and Taoyuan City to 3.
The numeric results are convenient for programs to work with, but the numbers themselves carry no meaning: a model could mistakenly read Taoyuan City (3) as "greater than" Taipei City (1), so these values cannot be used directly for training or prediction.
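As a minimal sketch of Label Encoding (assuming pandas and scikit-learn are available; the column name city and the sample rows are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative dataset of Taiwanese cities
df = pd.DataFrame({"city": ["Taipei City", "New Taipei City",
                            "Taoyuan City", "Taipei City"]})

encoder = LabelEncoder()
df["city_label"] = encoder.fit_transform(df["city"])
print(df)
# Note: LabelEncoder assigns 0-based integers in alphabetical order,
# so the exact numbers differ from the 1/2/3 mapping described above.
```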
One Hot Encoding: converts each category into its own feature with a value of 0 or 1. When a piece of data belongs to a category, that category's feature is 1 and all the others are 0.
Using the Taiwanese city example again, One-Hot Encoding creates one feature for each of the 3 categories, named "Is this Taipei City", "Is this New Taipei City", and "Is this Taoyuan City." If a given piece of data's city is Taipei City, then the feature value of "Is this Taipei City" is 1 and the rest are 0.
This method is relatively straightforward to implement, but it increases dimensionality so training becomes slower and more complex. It can also create sparse data since most entries in the new columns will be zero. Additionally, one-hot encoding takes more space but adds no new information since it only changes data representation.
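A minimal one-hot sketch using pandas (the column name city is again illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Taipei City", "New Taipei City", "Taoyuan City"]})

# One 0/1 column per category; dtype=int keeps the values as 0 and 1
one_hot = pd.get_dummies(df["city"], prefix="is", dtype=int)
print(one_hot)
```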
Ordinal Encoding: similar to Label Encoding, each category of a feature is converted into a numeric value. The difference is that the feature has an inherent order, such as size, length, or ranking, so the numeric values can express how the categories relate to one another.
For example, suppose a dataset has an age-group feature whose categories are <20, 20~40, 40~60, and >60. These four categories have a natural order, so with Ordinal Encoding you can convert <20 to 1, 20~40 to 2, 40~60 to 3, and >60 to 4. The values can then be used directly in calculations.
This method does not inflate the number of feature dimensions, and the values themselves are meaningful for training. However, the encoding captures only the rank order of the categories, not the magnitude of the differences between them.
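A minimal Ordinal Encoding sketch (the column name age_group is illustrative; an explicit dictionary is used so the numbers follow the natural order of the groups rather than alphabetical order):

```python
import pandas as pd

df = pd.DataFrame({"age_group": ["<20", "40~60", "20~40", ">60"]})

# Explicit mapping that preserves the natural order of the age groups
order = {"<20": 1, "20~40": 2, "40~60": 3, ">60": 4}
df["age_group_ord"] = df["age_group"].map(order)
print(df)
```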
Frequency Encoding: the number of times a category appears in the variable is used as that category's feature value. This value carries the meaning of "occurrence frequency," and dividing it by the total number of records turns it into a "probability of occurrence"; either way the result is numeric and can be calculated with directly.
For example, if "Red" appears twice in the data, it is converted to 2 during encoding; if "Green" appears three times, it is converted to 3. And so on.
With this method, the number of feature dimensions does not increase and the values themselves have meaning. However, categories that appear with the same frequency receive the same numeric value, so categories with completely different meanings can be unintentionally represented by the same number, losing the meaning of the original category.
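A minimal Frequency Encoding sketch (the column name color and the sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Red",
                             "Green", "Green", "Blue"]})

# Map each category to how many times it appears in the column
counts = df["color"].value_counts()
df["color_freq"] = df["color"].map(counts)

# Dividing by the number of rows turns the count into a probability
df["color_prob"] = df["color_freq"] / len(df)
print(df)
```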
Feature Combining: two or more categorical variables are crossed to form a new categorical variable.
Suppose a dataset has two categorical variables, gender and smoking status, whose categories are male/female and smoking/non-smoking respectively. Crossing these two variables produces a new variable, "the interaction between gender and smoking," whose four categories are male smoker, male non-smoker, female smoker, and female non-smoker.
Feature Combining allows interactions between variables to be taken into account, although when the source variables have many categories, some combined categories can easily end up with very few records.
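A minimal Feature Combining sketch (the column names gender and smoking are illustrative; string concatenation is one simple way to form the crossed category):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "smoking": ["smoker", "non-smoker", "non-smoker", "smoker"],
})

# Concatenate the two variables to form the combined categorical feature
df["gender_smoking"] = df["gender"] + " " + df["smoking"]
print(df["gender_smoking"].unique())
```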