We welcome the ascension of our artificial overlords!

16th June 2024 - Feature Engineering

Work 6 days a week, earn more than none to eat…

Feature engineering (feature as in an individual measurable property within a recorded dataset) is a crucial step in the machine learning workflow that involves transforming raw data into features that are most suitable for training a machine learning model.

Imagine you are giving a recipe to a chef who has never made the dish before. You do not just hand them a list of raw ingredients; you also give them instructions on how to prepare them. Feature engineering is similar: it takes raw data and transforms it into a format that a machine learning model can easily understand and use to make predictions.

Urgent-important matrix: Taiwanese version

The Taiwanese have an additional quadrant in their urgent-important matrix: time needed. It is similar to urgency yet distinct from it, since urgency is not always related to the amount of time a task takes. The top row of text in the matrix below (left to right) reads: importance, urgency, and time needed.

When the facts in a row of the matrix are converted into scores via a condition metric (the red category in the middle), a feature is made. While the final score of each feature may not directly correspond to a total price or a probability, it can serve as the basis for subsequent evaluation.

Applications of feature engineering

The Taiwanese urgent-important (plus time needed) matrix is not strictly limited to the three aforementioned quadrants, as the examples below will demonstrate.

The left graph is a feature matrix for training a price prediction model for properties in Taiwan. The fact row shows a property with an area of 20 square meters in New Taipei City. Based on this matrix's conversion metrics, each square meter grants 50 points, and a further 500 points are granted if the district is specifically Taipei City. Overall, a 20-square-meter property in New Taipei City scores (50 * 20) + 0 = 1000 + 0 = 1000 points.
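The property scoring above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name, the constant names, and the district strings are made up for this example and do not come from the original matrix.

```python
# Conversion metrics taken from the property matrix described above.
POINTS_PER_SQUARE_METER = 50
TAIPEI_CITY_BONUS = 500


def property_score(area_sqm: int, district: str) -> int:
    """Turn one fact row (area, district) into a single feature score.

    Hypothetical helper: the name and signature are assumptions.
    """
    score = POINTS_PER_SQUARE_METER * area_sqm
    # Only Taipei City itself earns the bonus; New Taipei City does not.
    if district == "Taipei City":
        score += TAIPEI_CITY_BONUS
    return score


print(property_score(20, "New Taipei City"))  # 1000
```

Under the same metrics, an otherwise identical property in Taipei City would score 1500 points.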

The right graph is a feature matrix for training a model that predicts survival on the famous Titanic. The fact row shows a male passenger aged 50. Based on this matrix's conversion metrics, males lose 2 points and females earn 1 point, while passengers above age 60 get 0 points, passengers aged 30 to 59 get 1 point, and passengers below age 29 (which I assume was mistyped) get 2 points. Overall, a male passenger aged 50 scores (-2) + 1 = (-1) points.
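Here is a rough Python sketch of the Titanic scoring. The age bands in the matrix leave gaps at exactly 29 and 60, so the cut-offs below (60 and over, 30 to 59, under 30) are my own assumption about what was intended. The function name is hypothetical as well.

```python
# Sex points taken from the Titanic matrix described above.
SEX_POINTS = {"male": -2, "female": 1}


def titanic_score(sex: str, age: int) -> int:
    """Turn one fact row (sex, age) into a single feature score.

    Assumed age bands: 60 and over -> 0, 30 to 59 -> 1, under 30 -> 2.
    The original matrix does not say where ages 29 and 60 belong.
    """
    if age >= 60:
        age_points = 0
    elif age >= 30:
        age_points = 1
    else:
        age_points = 2
    return SEX_POINTS[sex] + age_points


print(titanic_score("male", 50))  # -1
```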

Prediction model training workflow

Here is a breakdown of the steps involved in training prediction models:

Data Reading: the program starts by reading data from CSV files df_train and df_test.
Data Splitting and Transformation: the data is split into two DataFrames of the same names: df_train for training the model and df_test for testing the model's performance. There are separate branches for processing the training and testing data.
Training Data Processing: a label encoder is applied to convert categorical variables in df_train into numerical values that machine learning algorithms can handle more effectively. A MinMax encoder is then applied, likely to scale numerical features in df_train to a specific range (typically between 0 and 1) to improve model performance.
Model Training and Prediction: a machine learning model is trained using the processed df_train data (train_X and train_Y). The trained model is then used to make predictions on the processed df_test data. The predictions are stored in the pred variable.
Output Generation: a submission file containing the predicted values from the model is created.
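The steps above can be sketched with pandas and scikit-learn. Treat this as a minimal illustration under several assumptions rather than the actual program behind the flowchart: the column names, the tiny inline CSV data, and the choice of LogisticRegression are all invented for the example. The "MinMax encoder" is taken here to mean scikit-learn's MinMaxScaler.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical stand-ins for the two CSV files; the real data is not shown.
TRAIN_CSV = """Sex,Age,Survived
male,22,0
female,38,1
female,26,1
male,35,0
male,54,0
female,4,1
"""
TEST_CSV = """Sex,Age
male,50
female,30
"""

# Data reading: load the training and testing data into DataFrames.
df_train = pd.read_csv(io.StringIO(TRAIN_CSV))
df_test = pd.read_csv(io.StringIO(TEST_CSV))

# Separate the target column (assumed name) from the training features.
train_Y = df_train["Survived"]
train_X = df_train.drop(columns=["Survived"])
test_X = df_test.copy()

# Label encoding: learn the category-to-integer mapping on the training
# data, then reuse the same mapping on the testing data.
label_encoder = LabelEncoder()
train_X["Sex"] = label_encoder.fit_transform(train_X["Sex"])
test_X["Sex"] = label_encoder.transform(test_X["Sex"])

# Min-max scaling: fit the ranges on the training data only, so training
# features end up between 0 and 1 and test features use the same ranges.
scaler = MinMaxScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)

# Model training and prediction (the model type is an assumption).
model = LogisticRegression()
model.fit(train_X, train_Y)
pred = model.predict(test_X)

# Output generation: build the submission as CSV text instead of a file.
submission = pd.DataFrame({"Survived": pred})
submission_csv = submission.to_csv(index=False)
print(submission_csv)
```

Fitting the encoder and scaler on the training data and only applying them to the testing data keeps the two branches consistent. Note that LabelEncoder raises an error if the testing data contains a category it never saw during fitting.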

Overall, this flowchart outlines a typical machine learning workflow where data is loaded, preprocessed, used to train a model, and then used to make predictions on unseen data.
