We welcome the ascension of our artificial overlords!


17th June 2024 - Operating with DataFrames and Outliers

"Life is a preposterous prank that postpones peace, a peace in nonexistence and unawareness. No rush am I in to depart from life, yet I now see un-life as free of strife."

Interactions between DataFrames

You can do many things with DataFrames: combine several of them into one, split them up, or add new columns.
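As a minimal sketch of these three operations, assuming pandas is installed, the snippet below concatenates two small frames, adds a derived column, and splits the result with boolean masks. The data and the column names (city, temp_c, temp_f) are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
first = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.0, 19.5]})
second = pd.DataFrame({"city": ["Cairo"], "temp_c": [28.0]})

# Combine the two frames row-wise; ignore_index renumbers the rows 0..n-1.
combined = pd.concat([first, second], ignore_index=True)

# Add a new column derived from an existing one.
combined["temp_f"] = combined["temp_c"] * 9 / 5 + 32

# Split the frame into two pieces with boolean masks.
warm = combined[combined["temp_c"] >= 15]
cold = combined[combined["temp_c"] < 15]

print(combined)
```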

Getting subsets out of DataFrames

A subset is a selection of rows and/or columns from the original DataFrame. Depending on how the selection is made, pandas may return either a view that references the underlying data of the original DataFrame or a new copy of that data in memory, so you should not rely on one or the other. If you need a subset that is guaranteed to be independent of the original, call .copy() on it.
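A small sketch of taking a subset, assuming pandas and a made-up DataFrame (the names df, name, score and group are hypothetical). It selects rows and columns with .loc and then makes an explicit copy so that later changes cannot affect the original.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [10, 25, 40, 55],
    "group": ["x", "y", "x", "y"],
})

# Rows where score exceeds 20, keeping only two of the columns.
subset = df.loc[df["score"] > 20, ["name", "score"]]

# An explicit copy: edits to it never reach df.
independent = subset.copy()
independent["score"] = independent["score"] * 2

print(independent)
print(df)  # the original scores are unchanged
```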

Group operations with subsets

As for separating columns from the master DataFrame:
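One possible way to do this, sketched with invented data (master, species, weight_kg and age are hypothetical names, not the ones used in the original post): select the columns you need, then run a group operation on that subset.

```python
import pandas as pd

# Hypothetical "master" DataFrame for illustration.
master = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog"],
    "weight_kg": [4.0, 20.0, 5.0, 30.0],
    "age": [2, 5, 3, 7],
})

# A single column name returns a Series.
weights = master["weight_kg"]

# A list of column names returns a DataFrame with just those columns.
species_weight = master[["species", "weight_kg"]]

# Group operation on the subset: mean weight per species.
mean_weight = species_weight.groupby("species")["weight_kg"].mean()

print(mean_weight)
```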

Removing outliers

Outliers are data points that differ significantly from the other observations. They may reflect genuine variability in the data or errors in it. In either case, the source of the outlier should be investigated.

In three of the many box plots I generated below, there is one point (drawn on top of itself many times) that strays far from the rest of the data. Such points are considered outliers. Studying them can help us understand the data better, or they can be removed to make the data more consistent.
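The post does not say which rule was used to remove outliers, so the following is only a sketch of one common choice: the 1.5 × IQR rule, which is also the usual convention for where box-plot whiskers end. The readings are invented for illustration.

```python
import pandas as pd

# Hypothetical readings; 95 is the obvious stray value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="reading")

# Interquartile range: the spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

outliers = values[is_outlier]
cleaned = values[~is_outlier]

print(outliers)
print(cleaned)
```

Inspect the flagged points before dropping them, since an outlier can be a real observation rather than an error.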

What is an Empirical Cumulative Distribution Function (ECDF) plot?

An ECDF plot is a visualization tool used to depict the distribution of a dataset. For each value on the x-axis, it shows the proportion of data points that are less than or equal to that value. An ECDF is built in the following steps:

Data Sorting: the data is first arranged in ascending order (from smallest to largest value).
Cumulative Proportion Calculation: for each data point, the ECDF calculates the proportion of data points that are less than or equal to it. This proportion is essentially the number of data points less than or equal to the current value divided by the total number of data points in the dataset.
Plotting: the x-axis represents the sorted data values, and the y-axis represents the calculated cumulative proportions.
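The three steps above can be sketched in a few lines, assuming NumPy is available. The function name ecdf and the sample data are hypothetical and chosen only for illustration.

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and cumulative proportions y for an ECDF."""
    # Step 1: sort the data in ascending order.
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    # Step 2: the i-th sorted point (1-based) has i points at or below it,
    # so its cumulative proportion is i / n. For tied values, the last of
    # the tied points carries the proportion for that value.
    y = np.arange(1, n + 1) / n
    return x, y

# Step 3 would plot x against y; here we just print the pairs.
x, y = ecdf([3, 1, 4, 1, 5])
for value, proportion in zip(x, y):
    print(value, proportion)
```

To draw the plot, x and y can be passed to a plotting library, for example as a step plot or as unconnected markers.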

Since an ECDF plot depicts how the data is distributed across the entire range of values, data scientists can use it to spot outliers. A sudden steep rise means many points are packed into a narrow range, while a long, nearly flat stretch followed by an isolated step means a few points sit far from the rest.

Normal and abnormal distributions

In ECDF plots, abnormal distributions ascend very quickly, generally before reaching the halfway point of the graph's x-axis. Once they reach the top of the y-axis, they flatten out just as abruptly.

In ECDF plots, normal distributions rise more smoothly across the x-axis, and they flatten out more slowly and gradually as they approach the top of the y-axis. The example below has a moderate spike at the beginning, but it is not steep enough to be considered abnormal.

If you want overall comparisons between distributions…
