Spent too much time digging too deeply into the last chapter. My goal, after all, is to understand how machines learn to the best of my brainpower.
Today's topic relates back to categorical feature engineering, specifically its sub-topic of feature combination. Knowing how to reduce your model's workload without sacrificing its predictive accuracy is one thing; knowing what features the system must ditch to predict better is another issue entirely.
Two or more features can be combined through basic calculations to obtain new information and overcome collinearity problems: situations where two or more independent/predictor variables in a model are highly correlated, making it difficult to isolate the independent effect of each variable on the dependent/target variable. For example, you can calculate someone's BMI from their weight and height.
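Below is a minimal pandas sketch of that idea, using a hypothetical toy DataFrame with made-up weight_kg and height_m columns:

```python
import pandas as pd

# Hypothetical toy data: weight in kilograms, height in meters
df = pd.DataFrame({
    "weight_kg": [70, 55, 90],
    "height_m": [1.75, 1.62, 1.80],
})

# Combine the two correlated features into one new feature: BMI = weight / height^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

print(df)
```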
Multiple categorical features can be combined with the interaction technique to generate a new feature in a new category. In a cardiovascular disease dataset, for instance, the two preexisting columns gender and "has had heart disease" can be copied and fused into a third column dubbed "gender and has had heart disease", whose value in each index/row combines the values of the two originals.
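Here is a rough sketch of that interaction in pandas, assuming hypothetical gender and heart_disease columns rather than any specific dataset:

```python
import pandas as pd

# Hypothetical cardiovascular-style toy data
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "heart_disease": ["yes", "no", "yes", "no"],
})

# Interaction: concatenate the two categorical columns row by row
# into a third column that captures both values at once.
df["gender_and_heart_disease"] = df["gender"] + "_" + df["heart_disease"]

print(df["gender_and_heart_disease"].value_counts())
```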
When a continuous and a categorical variable interact strongly, you may encode the categorical variable where it is most appropriate. But be wary of creating an artificial order, as in the example below, where gender is label encoded into 0 and 1 and then multiplied by age to create an illogical combined feature.
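A small sketch of that pitfall, using made-up gender and age values and scikit-learn's LabelEncoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "age": [34, 52, 29, 41],
})

# Label-encode gender into 0/1 (this already imposes an arbitrary order)
df["gender_encoded"] = LabelEncoder().fit_transform(df["gender"])

# Multiplying by age zeroes out every row whose gender was encoded as 0
# and leaves a column whose magnitude has no real-world meaning.
df["gender_times_age"] = df["gender_encoded"] * df["age"]

print(df)
```

The interaction technique above (concatenating the raw category strings) avoids this problem, since no numeric order is invented.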
When you generate a new feature from a dataset, how do you determine its value or its impact on your goals? There are a number of evaluation methods that can help a data analyst or scientist determine whether a new feature contributes to or interferes with a model's predictive power. Every project has its own unique conditions, and under those conditions the features are ranked by importance.
The tree algorithm is a model with a root node (the entire dataset) that splits recursively into more distinguishing child nodes (subsets) until a stopping criterion is met. The end results are leaf nodes (terminal nodes) representing the final predictions/classifications, based on the dominant class or the average target value in each node. Below is a list of its characteristics with respect to feature importance:
The earlier a feature is picked as a splitting criterion, the more important it is.
The more often a feature is picked as a splitting criterion, the more important it is.
If a feature is used to split more than one node, its contribution at each node can be calculated and summed up.
The larger the resulting metric, the higher the importance of the feature. If the metric is negative, the feature cannot split the node effectively.
The higher the total reduction in entropy or Gini impurity a feature achieves, the more important it is.
Adding to the last list item: the Gini impurity measures how often a randomly chosen element of the dataset would be mislabeled if it were labeled at random according to the class distribution in the node. The optimum split is the one with the lowest Gini index, down to a minimum value of 0. Entropy is a measure of information that indicates how mixed the feature's values are with respect to the target. As with the Gini index, the optimum split is the one with the lowest entropy. Entropy reaches its maximum when the two classes are equally probable, and a node is pure when the entropy hits its minimum value of 0.
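To see these importances in practice, here is a minimal scikit-learn sketch on synthetic data; feature_importances_ reports the normalized total impurity (Gini or entropy) decrease each feature contributes across the tree:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 4 features, only 2 of them actually informative
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# criterion can be "gini" (Gini impurity) or "entropy"
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Impurity-based importance: total (weighted) impurity decrease per feature,
# normalized to sum to 1 -- the larger, the more important.
for i, imp in enumerate(tree.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```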
Permutation importance is a model inspection technique that measures the contribution of each feature to a fitted model’s statistical performance on a given dataset.
A critical step in permutation importance is randomly shuffling the values of a single feature and checking whether the model's score changes. If the score drops noticeably, the model relies on that feature, indicating that the feature is important; if the score barely changes, the feature is unimportant.
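A short sketch with scikit-learn's permutation_importance on synthetic data (the random forest and the toy data are arbitrary placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data again: 4 features, 2 informative
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times and measure how much the test score drops;
# a large drop means the model relies heavily on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature_{i}: {mean:.3f} +/- {std:.3f}")
```

Computing the importances on held-out test data, as above, is a common choice because it reflects how much each feature matters for generalization rather than for memorizing the training set.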