Belief brews doubt. When there is too much of it, mix in facts and truths; someday you may craft a fine tonic to carry you toward what you desire.
Different features have different value ranges. Feature scaling limits all features to the same range, which can improve model performance by ensuring the model is not biased towards features with larger value ranges. Here are some common scaling methods (a short sketch follows the list):
Max-Abs Scaling: rescales the feature to the range [-1, 1] by dividing each value by the feature's maximum absolute value.
Min-Max Scaling: rescales the feature to the range [0, 1].
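As a minimal sketch of both scalers, assuming scikit-learn and a small made-up feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# Toy feature matrix whose two columns have very different value ranges (made-up values).
X = np.array([[ 1.0, 200.0],
              [ 5.0, 800.0],
              [-2.0, 400.0]])

# Max-Abs scaling: divide each feature by its maximum absolute value -> range [-1, 1].
X_maxabs = MaxAbsScaler().fit_transform(X)

# Min-Max scaling: map each feature linearly onto [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_maxabs)
print(X_minmax)
```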
Different features also follow different numerical distributions. Feature normalization converts each feature's distribution into a common, unified one. Here are some normalization methods (a short sketch follows the list):
Z-Score Normalization: rescales values so that the feature has mean 0 and standard deviation 1, as in a standard normal distribution.
Quantile Normalization: maps values onto a continuous uniform distribution ranging from 0 to 1, based on the feature's quantiles.
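A minimal sketch of both normalizations with scikit-learn, applied to a made-up skewed feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, QuantileTransformer

# Hypothetical right-skewed feature (100 samples, 1 column).
X = np.random.default_rng(0).exponential(scale=2.0, size=(100, 1))

# Z-score normalization: shift and rescale to mean 0, standard deviation 1.
X_z = StandardScaler().fit_transform(X)

# Quantile normalization: map values onto a uniform distribution on [0, 1].
X_q = QuantileTransformer(n_quantiles=100, output_distribution="uniform").fit_transform(X)

print(X_z.mean(), X_z.std())   # approximately 0 and 1
print(X_q.min(), X_q.max())    # approximately 0 and 1
```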
Feature scaling and feature normalization serve similar functions and purposes, and different articles classify them in different ways. Focus on the underlying ideas and the mathematical formulas rather than the terminology.
Beyond adjusting the value range or distribution, features can be transformed through many different functions, either to simplify a complex relationship between the feature and the target or to enlarge or shrink the overall values. Here are different types of transformations to pick from:
Log Transformation: applies the natural logarithm to the feature. Values less than 1 become negative and values greater than 1 are compressed, so the overall range of the feature is pulled toward 0.
Root Transformation: applies a root function (such as the square root) to the feature. Values less than 1 are amplified and values greater than 1 are compressed, so the overall range of the feature is pulled toward 1.
Exponential Transformation: applies the natural exponential function to the feature. Smaller values are enlarged only slightly, while larger values are enlarged dramatically, which emphasizes large values.
In my assignment I applied these transformations in code; a minimal sketch of the idea is shown below.
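The sketch assumes numpy and pandas; the DataFrame and its "income" column are hypothetical stand-ins, and the values are assumed to be positive so the logarithm and the root are well defined.

```python
import numpy as np
import pandas as pd

# Hypothetical positive-valued feature used purely for illustration.
df = pd.DataFrame({"income": [0.5, 1.0, 3.0, 10.0, 100.0]})

# Log transformation: values < 1 become negative, large values are compressed toward 0.
df["income_log"] = np.log(df["income"])

# Root (square-root) transformation: values < 1 are amplified, values > 1 are compressed toward 1.
df["income_sqrt"] = np.sqrt(df["income"])

# Exponential transformation: large values are enlarged far more than small ones.
df["income_exp"] = np.exp(df["income"])

print(df)
```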
The feature binning technique transforms continuous numerical values into discrete, categorical features: each value is compared against a set of cut points and the data points are sorted into a number of bins.
Several cutting rules are described below (followed by a short sketch):
Numeric range (equal width): given the number of bins, divide the value range into equally wide intervals, and each interval becomes one category.
Category frequency (equal frequency): given the number of bins, choose the interval boundaries so that each category contains roughly the same number of data points.
Professional knowledge: domain experts define the cut points according to the conventions or theories of the field. For example, to bin the Body Mass Index (BMI), medical practice defines BMI < 18 as "underweight", 18 ≤ BMI < 23 as "moderate", 23 ≤ BMI < 27 as "slightly overweight", 27 ≤ BMI < 30 as "overweight", and everything above as "obese".
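A minimal sketch of the three cutting rules with pandas, using made-up BMI values and the medical cut points above:

```python
import pandas as pd

# Hypothetical BMI values.
bmi = pd.Series([16.5, 19.2, 22.8, 25.4, 28.1, 31.7])

# Numeric range (equal width): 3 equally wide bins over the observed range.
equal_width = pd.cut(bmi, bins=3)

# Category frequency (equal frequency): 3 bins, each holding roughly the same number of points.
equal_freq = pd.qcut(bmi, q=3)

# Professional knowledge: the medical BMI cut points described above (left-closed intervals).
labels = ["underweight", "moderate", "slightly overweight", "overweight", "obese"]
domain = pd.cut(bmi, bins=[0, 18, 23, 27, 30, float("inf")], labels=labels, right=False)

print(pd.DataFrame({"bmi": bmi, "equal_width": equal_width,
                    "equal_freq": equal_freq, "domain": domain}))
```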
AI models need data to make good predictions, and what is the opposite of something? Nothing: missing values. There are many reasons for missing values, but they mainly fall into the following two categories:
Random Missingness: the data is lost due to non-specific factors with no particularly meaningful cause, so missing values occur randomly in the data, for example through careless manual entry when digitizing questionnaires, machine failure, or failures in data recording or retrieval. If a numerical feature has too many missing values, you can also choose to drop the feature.
Non-Random Missingness: the absence of the data is itself meaningful. For example, when a patient does not suffer from high blood pressure, there is no record of which blood-pressure medication they take. Because each missing entry carries information, the meaning of the missing value must be defined explicitly during preprocessing, for example by assigning it its own category that states the reason it is missing, as in the sketch below.
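A minimal sketch, assuming a hypothetical pandas column for blood-pressure medication in which a missing entry means the patient takes none:

```python
import pandas as pd

# Hypothetical medication records; None means the patient is not on any blood-pressure medication.
df = pd.DataFrame({"bp_medication": ["ACE inhibitor", None, "beta blocker", None]})

# Make the meaning of the missing value explicit by giving it its own category.
df["bp_medication"] = df["bp_medication"].fillna("none (no hypertension)")

print(df)
```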
When faced with missing values in continuous features, the following methods can be used (a sketch comparing them follows the list):
Statistical Value Filling: statistics are representative values derived from all the data, which makes them suitable for filling in missing values; the mean and the median are the most common choices. This method is easy to apply, conceptually simple, and suitable for datasets with little data or few features, but it ignores the relationships between features and changes the probability distribution of the feature.
Regression Imputation: treat the feature with missing values as the prediction target and the features without missing values as inputs, fit a regression model, and predict the missing values. Because the values are predicted by a model, the filled-in results can differ from one run to the next. This method takes the relationships between features into account and is less likely to distort the feature's distribution, but the choice of model affects subsequent predictions and increases the complexity of the feature engineering.
K-Nearest Neighbors (KNN) Filling: similar to regression imputation, but uses the KNN algorithm to derive the fill values from the nearest complete neighbors.
Model-Driven Imputation: extends regression and KNN imputation to many other models. In the 2018 "Applied Machine Learning" course, Microsoft researcher Andreas C. Müller compared three filling techniques and found that filling with random forests was very effective for the subsequent classification task.
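A minimal sketch comparing the filling methods above with scikit-learn, on a small made-up matrix; the random-forest variant uses scikit-learn's IterativeImputer, which is still flagged as experimental:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (exposes IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature matrix with missing entries.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [7.0, 8.0, 12.0]])

# Statistical value filling: replace NaN with the column median.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN filling: fill each NaN from the 2 nearest complete neighbors.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Model-driven filling: iteratively predict each missing value with a random forest.
X_rf = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
                        random_state=0).fit_transform(X)

print(X_median)
print(X_knn)
print(X_rf)
```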