When life gives you lemons, you juice them, squeeze them on others' wounds, or plant them to grow lemons. All for the hell of it!
Data visualization is the graphical representation of information and data. Many people process visuals faster than text, which leads to quicker comprehension. Additionally, because a visual relies on less text, less information is lost when the material is translated.
The most common way to store data is the table. Its convenience, manageability, and practicality trump pictures, text, and other formats. But when users see dense tables like the one below, they will inevitably find them difficult to read, and it will also be difficult to summarize the information quickly.
Visualization helps present tabular data with low legibility in the form of diagrams, which not only increases readability but also summarizes key information. In practice, visual dashboards are popular in various fields to help users improve data reading efficiency and extract more critical information in a visual way.
Matplotlib is a Python module that can generate a variety of plots such as bar charts, line charts, and scatter plots.
Bar charts show one value for every category, drawn as a bar whose length matches that value. This type of chart is best used for datasets with a limited number of nonnumerical categories.
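To make that concrete, here is a minimal sketch of a bar chart. The fruit names and sales numbers are made up for illustration and are not from any real dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical data: one value per nonnumerical category.
categories = ["lemons", "limes", "oranges"]
units_sold = [12, 7, 9]

fig, ax = plt.subplots()
bars = ax.bar(categories, units_sold)  # draws one bar per category
ax.set_xlabel("Fruit")
ax.set_ylabel("Units sold")
ax.set_title("Units sold per fruit")
# Call plt.show() to display the figure in an interactive session.
```

With only three categories the chart stays readable; dozens of categories would crowd the axis.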
Histograms use bars to depict how frequently certain values or ranges of values appear within a dataset. Imagine lining up all the data points on a number line, then dividing the line into sections (bins) and counting how many data points fall into each bin. The height of each bar in the histogram corresponds to the number of data points within that particular bin.
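The binning idea can be sketched in a few lines. The values below are invented, and the choice of four bins is just an assumption to keep the arithmetic easy to follow.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements (made up for illustration).
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

fig, ax = plt.subplots()
# bins=4 splits the range from 1 to 9 into four equal-width sections.
# ax.hist returns the count per bin and the bin edges it used.
counts, bin_edges, _ = ax.hist(values, bins=4, edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Number of data points")

print(counts)     # how many values fell into each bin
print(bin_edges)  # where each bin starts and ends
# Call plt.show() to display the figure in an interactive session.
```

Changing `bins` changes the shape of the histogram, so it is worth trying a few values before drawing conclusions.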
Box plots are a way of displaying the distribution of data based on a five-number summary: “minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”. These quartiles can be used to determine whether a distribution is skewed and whether there are potential outliers in the dataset.
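Here is a rough sketch of how the quartiles relate to outliers. The data is invented, and the 1.5 × IQR fence is the common convention (it is also Matplotlib's default whisker rule), which is why "minimum" and "maximum" are in quotes: the whiskers stop at the last point inside the fences, not necessarily at the true extremes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data; 30 is a deliberate outlier.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# Quartiles and interquartile range (IQR).
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are usually flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

fig, ax = plt.subplots()
ax.boxplot(data)  # box spans Q1 to Q3, line marks the median
ax.set_ylabel("Value")
# Call plt.show() to display the figure in an interactive session.
```

If the median sits noticeably closer to one edge of the box, or one whisker is much longer than the other, the distribution is probably skewed.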
Line graphs connect data points with line segments, often producing a zigzag pattern that clearly shows how values trend across an ordered axis, most often time.
Keep in mind that numerical data read from a CSV file is not always converted to a numeric type automatically; a column can arrive as strings. Convert such values to numbers first, or Matplotlib will treat them as category labels and place them on the axis in the order they appear instead of by their actual values. The worst part of this case study? It took me hours of debugging to figure out this simple error.
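A minimal sketch of the pitfall and one way around it, assuming the CSV is loaded with pandas. The CSV text and column names are made up; the stray "unknown" entry is what forces the whole column to be read as strings.

```python
import io

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV content; one non-numeric entry spoils the "sales" column.
csv_text = "month,sales\n1,100\n2,unknown\n3,90\n4,250\n"
df = pd.read_csv(io.StringIO(csv_text))

# Before conversion the column holds strings, not numbers.
was_numeric_before = pd.api.types.is_numeric_dtype(df["sales"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
df["sales"] = pd.to_numeric(df["sales"], errors="coerce")

fig, ax = plt.subplots()
ax.plot(df["month"], df["sales"], marker="o")  # NaN leaves a gap in the line
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
# Call plt.show() to display the figure in an interactive session.
```

Checking `df.dtypes` right after loading is a cheap habit that would have saved those hours of debugging.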
Scatter plots are graphs that show the relationship between two continuous variables. They plot points on a graph, where each point represents a single observation. Fitting a straight line to the points gives a slope that describes the relationship between the two variables. Compared to line graphs, scatter plots are better at showing this relationship because they do not connect the points with lines, which can obscure it.
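One way to get that slope is a least-squares line fit. The sketch below uses `np.polyfit` for this, which is my choice rather than anything the learning material prescribes, and the observations are invented.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observations that roughly follow y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A degree-1 polynomial fit returns the slope and intercept of the best line.
slope, intercept = np.polyfit(x, y, deg=1)

fig, ax = plt.subplots()
ax.scatter(x, y, label="observations")
ax.plot(x, slope * x + intercept, color="red",
        label=f"fitted line (slope = {slope:.2f})")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
# Call plt.show() to display the figure in an interactive session.
```

A positive slope means the two variables tend to rise together; a slope near zero suggests little linear relationship.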
If you want to add a categorical variable to explore how the relationship differs between groups, you can use color to distinguish the groups.
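As a small sketch of coloring by group, the snippet below maps each category to a color and passes the result to a single scatter call. The DataFrame, its column names, and the palette are all hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 175, 165],
    "weight": [50, 58, 68, 80, 72, 60],
    "group": ["a", "b", "a", "b", "a", "b"],
})

# Map each category to a color (the palette is an arbitrary choice).
palette = {"a": "tab:blue", "b": "tab:orange"}
point_colors = df["group"].map(palette).tolist()

fig, ax = plt.subplots()
ax.scatter(df["height"], df["weight"], c=point_colors)
ax.set_xlabel("Height")
ax.set_ylabel("Weight")
# Call plt.show() to display the figure in an interactive session.
```

This colors the points correctly, but the plot does not yet say which color belongs to which group.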
The learning website does not teach how to add a legend to the plot, so use the examples below to tell the approaches apart. They draw the points in two ways: with a single plt.scatter() call, and with a for loop that calls plt.scatter() once per category. In these examples, only the loop produces a legend with an entry for every category.
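A sketch of the for-loop approach, under the assumption that the data lives in a pandas DataFrame with a categorical column. The column names and values are made up. Each pass through the loop draws one category with its own `label`, and `ax.legend()` collects those labels.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 175, 165],
    "weight": [50, 58, 68, 80, 72, 60],
    "group": ["a", "b", "a", "b", "a", "b"],
})

fig, ax = plt.subplots()
# One scatter call per category; Matplotlib picks a new color each time.
for name, rows in df.groupby("group"):
    ax.scatter(rows["height"], rows["weight"], label=name)

ax.set_xlabel("Height")
ax.set_ylabel("Weight")
ax.legend(title="group")  # one legend entry per labeled scatter call
# Call plt.show() to display the figure in an interactive session.
```

The trade-off is a few extra lines of code in exchange for a legend that names every group.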
To split a box plot by category, use .loc to select the rows for each non-numerical category in your DataFrame, and store each selection in its own variable.
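Roughly, that could look like the sketch below. The DataFrame and its column names are invented. I set the tick labels separately instead of passing them to `boxplot`, because the name of that keyword argument has changed between Matplotlib versions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: one numeric column and one categorical column.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa",
                "virginica", "virginica", "virginica"],
    "length": [1.4, 1.5, 1.3, 5.5, 6.0, 5.1],
})

# One .loc selection per category: rows by condition, then the column.
setosa = df.loc[df["species"] == "setosa", "length"].to_numpy()
virginica = df.loc[df["species"] == "virginica", "length"].to_numpy()

fig, ax = plt.subplots()
ax.boxplot([setosa, virginica])  # one box per selection, at x = 1 and x = 2
ax.set_xticks([1, 2])
ax.set_xticklabels(["setosa", "virginica"])
ax.set_ylabel("Length")
# Call plt.show() to display the figure in an interactive session.
```

With many categories, a loop over `df["species"].unique()` would avoid writing one variable per category by hand.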
When you run code written in older syntax from educational websites, it will benefit you in the long term to learn the current syntax as well, especially if you want a professional coding career. Doing so can head off readability and compatibility issues in your software, just to be on the safe side.
Sometimes at work, a DataFrame might be too large or impractical to be displayed in a single plot. In this case, you can limit the number of data points to be displayed by sampling the DataFrame.
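A minimal sketch of sampling before plotting, using pandas' `DataFrame.sample`. The DataFrame here is randomly generated stand-in data, and the sample size of 500 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical large DataFrame: 10,000 random points.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=10_000),
    "y": rng.normal(size=10_000),
})

# Draw 500 rows at random without replacement.
# random_state makes the same rows come back on every run.
sample = df.sample(n=500, random_state=42)

fig, ax = plt.subplots()
ax.scatter(sample["x"], sample["y"], s=5)
ax.set_title("Random sample of 500 rows")
# Call plt.show() to display the figure in an interactive session.
```

You can also pass `frac=0.05` instead of `n` to keep a fixed fraction of the rows.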
Seaborn: a library built on Matplotlib. Its default styling is widely considered more attractive, and it is more convenient to use than the library that inspired it.
Plotnine: a package that brings the ggplot2 architecture of the R language to Python. Its syntax closely mirrors that of ggplot2.
Plotly: a package that generates interactive charts. The generated charts let you zoom in, zoom out, or select specific areas.
Dash: a framework that pairs with Plotly to build interactive visual dashboards. By adding interactive controls such as drop-down menus and timelines, you can filter a specific range of data to present specific results.