Back to work again!
Today's focus is example-driven: I use real-life topics to illustrate the value of 4 data science problems.
Valid reasons for answering a data science question include the following:
Entertainment purposes.
Addressing core issues of an enterprise.
Public interest or influencing policy direction.
Contributing to the wider world.
In practice, data science can be used to build prediction models for everyday applications such as battleground survival games, taxi trip durations, user advertising, and pneumonia detection.
The source of data directly influences its quality. By reviewing the literature on different data sources, one can estimate more accurately where data anomalies come from and how often they occur.
There are many data sources on the internet for anyone to draw from:
Web traffic
Web crawlers
Crowdsourcing
Paper-to-electronic files
Knowing what kind of data you have obtained helps you judge its relevance or value to your field(s) of interest.
Structured data: info organized in neat rows and columns, where each column has a defined name and meaning. Reviewing these definitions helps you understand what info is stored in each column and ensures you are interpreting it correctly.
Unstructured data: info that does not follow a predefined format. Since it lacks a rigid structure, you will need to consider data conversion and standardization methods. The goal is to transform the data into a format that machine learning models can analyze.
I have already covered examples of structured and unstructured data in the last blog section, so head back if you want to learn about them.
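To make the conversion step for unstructured data a bit more concrete, here is a minimal sketch of one common approach, a bag-of-words count table, written in plain Python. The function names (`tokenize`, `bag_of_words`) and the two sample notes are made up for illustration and are not taken from any particular dataset or library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Turn free-text documents into a table of word counts.

    Returns (vocabulary, rows): vocabulary is the sorted list of column
    names, and each row holds one document's counts in that column order.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical free-text notes, invented for this example.
notes = ["Patient has a cough", "No cough, no fever"]
vocabulary, rows = bag_of_words(notes)

print(vocabulary)  # ['a', 'cough', 'fever', 'has', 'no', 'patient']
for row in rows:
    print(row)     # [1, 1, 0, 1, 0, 1] then [0, 1, 1, 0, 2, 0]
```

Each note is now a fixed-length row of numbers under named columns, which is the kind of structured input most machine learning models expect. Real projects usually need more than this (stop-word handling, normalization, or learned embeddings), so treat it as a starting point only.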
Each data science question should be verifiable using mathematical evaluation metrics. Common metrics for different problem types are:
Classification problems: accuracy, area under the ROC curve (AUC), mean average precision (MAP)
Regression problems: mean absolute error (MAE), root mean squared error (RMSE)
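As a rough illustration of how these metrics are computed, here is a small sketch in plain Python covering accuracy, AUC, MAE, and RMSE (MAP is left out to keep it short). The function names and the toy labels, scores, and values are invented for this example. The AUC here uses the pairwise definition: the fraction of (positive, negative) pairs where the positive gets the higher score, with ties counted as half.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Chance that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Toy classification data (made up): true labels, hard predictions, scores.
labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.5]
print(accuracy(labels, predictions))  # 0.8
print(round(auc(labels, scores), 4))  # 0.8333

# Toy regression data (made up): actual vs. predicted values.
actual = [3, 5, 8, 10]
predicted = [2, 5, 10, 9]
print(mae(actual, predicted))             # 1.0
print(round(rmse(actual, predicted), 4))  # 1.2247
```

In a real project you would more likely reach for a library implementation; the point of writing them out is to see that MAE and RMSE differ only in how strongly they weight large errors, and that AUC depends on the ranking of scores rather than on a fixed threshold.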
Supplementary information: metrics