"Tis the basic right for the individual to choose, and to be disagreed by the outsider."
It is time for a more hands-on approach to training your own machine learning model. Modern models are already implemented in ready-made libraries, so you only need to install one.
Seaborn is a library that uses Matplotlib under the hood to plot graphs. It can be used to visualize random distributions, producing plots that are easier for a human audience to read.
seaborn.relplot() is a figure-level interface that provides access to several different axes-level functions showing the relationship between two variables, with semantic mappings of subsets.
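As a minimal sketch of what relplot does, here is a scatter of two hypothetical features, with hue as the semantic mapping that splits the data into subsets (the data here is randomly generated just for illustration, not the breast cancer dataset):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two normally distributed features plus a class label
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(0, 1, 200),
    "feature_b": rng.normal(5, 2, 200),
    "group": rng.choice(["x", "y"], 200),
})

# relplot draws a figure-level scatter plot; hue maps each subset to a color
g = sns.relplot(data=df, x="feature_a", y="feature_b", hue="group")
```

Swapping in kind="line" would draw a line plot instead; both are axes-level functions that relplot wraps.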
Before diving into model training, data exploration is a crucial step: analyzing the data to understand its characteristics and identify potential relationships between features. The goal is to understand the distribution of values within each feature, identify potential outliers, and assess the relationships between features (e.g., correlations).
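A quick exploration pass can be sketched with pandas, assuming the copy of the Wisconsin breast cancer dataset that ships with scikit-learn (the column names "mean radius" and "mean texture" are specific to that copy):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer dataset bundled with scikit-learn
data = load_breast_cancer(as_frame=True)
df = data.frame

# Summary statistics reveal each feature's distribution and possible outliers
print(df[["mean radius", "mean texture"]].describe())

# Pairwise correlations with the target help flag informative vs. redundant features
print(df[["mean radius", "mean texture", "target"]].corr())
```

High correlation between two features suggests redundancy; correlation with the target suggests the feature may help separate the classes.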
During data exploration, we analyze features to understand their role in distinguishing between data points. Features that show clear differences between data points belonging to different classes are likely to be informative for the model. Such features help the model learn the patterns that separate the classes.
Not every feature that varies across data points is helpful, however. Irrelevant features may capture noise or unrelated patterns that confuse the model and lead to overfitting. Choosing the right features is a balancing act: we want features that are informative and relevant to the task at hand, while avoiding redundant or irrelevant ones.
In my breast cancer model, I use the features radius mean and texture mean to predict the diagnosis of benign or malignant breast cancer (the target). The model is a logistic regression classifier, which classifies data using a sigmoid function. In layman's terms, a sigmoid function is a math function that squashes any real number, from negative infinity to positive infinity, into a value between 0 and 1. A value close to 0 indicates a low probability of belonging to the positive class; a value close to 1 indicates the opposite.
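The squashing behavior is easy to verify directly. A minimal sketch of the standard logistic sigmoid, 1 / (1 + e^(-z)):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # exactly 0.5: the decision boundary
print(sigmoid(10.0))   # close to 1: high probability of the positive class
print(sigmoid(-10.0))  # close to 0: low probability of the positive class
```

Logistic regression feeds a weighted sum of the input features into this function, and the model thresholds the resulting probability (typically at 0.5) to pick a class.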
Accuracy of around 89% is not a bad score for a prediction model, but it is substandard in domains where errors are costly, such as breast cancer diagnosis.
After completing your prediction model, feed it data and read the predictions. It is more meaningful to evaluate on data that was not used for training, such as a held-out test set. You can then compare the predictions against the true labels to see how well the model generalizes beyond the training data.