Simpler technologies killed off many old jobs and created new ones in return… Will AI repeat, or even surpass, this trend?
Data science is a discipline that uses data to answer scientific questions. It comprises a series of elements: collecting data, exploring data, formulating hypotheses, verifying them against the data, and confirming or rejecting them. In the early stages of a project, data scientists need to first define what questions to ask, then use those questions as a guide for which elements to pursue next.
A diagram from Harvard University proposes an iterative process:
Asking relevant questions: what is your objective? If you had the relevant information, what would you do with it? What do you want to predict?
Acquiring data: how is the data obtained or generated? What information is relevant? Are there any privacy concerns?
Exploring data: pre-process the data and draw preliminary plots to pull information out of it. Are there any outliers? Any recurring patterns in the data?
Building a model: select a model, pick its hyperparameters, train it, and validate it (a minimal sketch follows this list).
Presenting results: how can you present your data science results? What did you learn from doing the work? Are the results reasonable? How can you tell a good story?
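To make the model-building stage concrete, here is a minimal sketch in Python using scikit-learn. The dataset is synthetic and the hyperparameter values are illustrative assumptions, not recommendations:

```python
# A minimal sketch of the build-a-model stage with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data you have already collected and explored.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out part of the data so validation reflects unseen examples.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Select a model and pick its hyperparameters (values are illustrative).
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)

# Train, then validate.
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```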
It is worth noting that a data science project can move forward or backward between these steps; there is no single straight path to completion. At each stage you must judge whether that stage's tasks have achieved their goals, and you will continually review and revise earlier steps so that the whole effort runs more smoothly. Just as often, you have to reflect on whether the problems and goals you have set are still appropriate. You can revise the entire problem, direction, and goals, but such changes should be grounded in what the gathered data shows rather than in intuition alone.
Defining the problem is very important, but it may be impossible (or impractical) to define a complete problem all at once. Before you can arrive at one, you must go through many attempts and experiments that gradually refine the framing of the problem. You will need deeper insight and a sufficient understanding of your field(s) of interest before you can formulate a guiding question.
The first questions you need to answer about the problem are:
Why is this problem worth attending to? What impact does it have on the field?
If you solve this problem, what do you gain? What else will you come to understand more deeply?
How difficult is it for you to solve this problem?
An insightful question is a good question. If you can answer the following, you will know more about what contributes to creating a good question:
Can this problem be solved with a predictive model? If so, what exactly do you want to predict?
Can the problem be framed as a regression or a classification problem? (A minimal sketch of both follows this list.)
Regression: models the relationship between one dependent variable and one or more independent variables; it predicts continuous values.
Classification: assigns data points to predefined classes; predicted labels enable near-instant, automated responses.
What kind of data do you need to train your prediction model?
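As an illustration of the regression/classification distinction, here is a minimal Python sketch using scikit-learn; the data is synthetic and the class threshold is an arbitrary assumption:

```python
# Regression predicts a continuous value; classification predicts a
# predefined class. Both datasets below are synthetic illustrations.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))        # one independent variable

# Regression: the target is continuous (e.g., a weight in grams).
y_continuous = 3.0 * X[:, 0] + rng.normal(0, 1, 200)
reg = LinearRegression().fit(X, y_continuous)
print("predicted value:", reg.predict([[5.0]]))

# Classification: the target is one of two predefined classes.
y_class = (X[:, 0] > 5).astype(int)          # 0 = "small", 1 = "large"
clf = LogisticRegression().fit(X, y_class)
print("predicted class:", clf.predict([[5.0]]))
```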
At the start, even if you have no way to gain insight into your chosen field(s), you can still collect relevant information widely in order to build it. Such information can not only be used to solve specific problems, but can also be explored further to yield data that inspires new problem questions and statements.
There are so many sources of information nowadays that finding ones which actually answer your questions can be hard. The data could be scraped from the Internet, collected by a company, drawn from observations of customer behavior, or taken from some large database. The point is, as long as the data can answer your data science questions and be used to train your predictive models, it is good data.
The first thing to think about is: "What kind of data can we use to solve the problem?" For example, if you want to answer "Can we use the characteristics of fruits to classify fruits?", you may want to collect the properties of fruits and predict each fruit's type. If the question becomes "Can we use photos of fruits to classify fruits?", then you need image data of fruits. Where you can collect data that meets these conditions is itself an important question.
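To make the contrast concrete, here is a small Python sketch of the two kinds of input each question implies; the feature names, values, and image size are hypothetical:

```python
# Two ways to represent a fruit for a classifier; all values are invented.
import numpy as np

# "Characteristics of fruits": one row of tabular features.
tabular_example = {"weight_g": 180.0, "diameter_cm": 7.5,
                   "color_red": 0.8, "color_yellow": 0.1}

# "Photos of fruits": an image becomes an array of pixel values,
# e.g. 64x64 pixels with 3 color channels.
image_example = np.zeros((64, 64, 3), dtype=np.uint8)

# The question you ask determines which representation, and hence
# which kind of data, you must collect.
print(len(tabular_example), "tabular features vs",
      image_example.size, "pixel values")
```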
After settling on the conditions the data must meet, research sources that contain it. Where can you obtain data easily and feasibly? Once that is done, consider the types of data collected, which can be roughly divided as follows:
Structured data: organized, tabular data. Examples include Excel and CSV files.
Unstructured data: anything other than tables. Examples include JSON files, images, video, audio, and free text.
Generally speaking, multimedia material calls for its own processing procedures. The loading sketch below shows how differently the two types come into a program.
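Here is a Python sketch; the file names (fruits.csv, fruits.json, apple.jpg) are hypothetical stand-ins for files you would have collected:

```python
# Loading structured vs unstructured data; file names are hypothetical
# and each call assumes the corresponding file exists on disk.
import json
import pandas as pd
from PIL import Image

# Structured: tabular data loads directly into rows and columns.
table = pd.read_csv("fruits.csv")        # or pd.read_excel("fruits.xlsx")
print(table.shape)

# Unstructured: JSON carries nested structure rather than a flat table.
with open("fruits.json") as f:
    records = json.load(f)

# Unstructured: an image must be decoded into pixels before modeling.
photo = Image.open("apple.jpg")
print(photo.size)
```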
Data quantity is another important issue. Training a predictive model requires a large amount of data. Generally speaking, traditional statistical models are relatively simple and can be trained properly with on the order of 5 to 10 data points; traditional machine learning models require at least 100 to 1,000 points, while modern deep learning models require at least 10,000 data points. When considering data sources, also think about whether they can supply the quantity of data your prediction model needs.
How data is processed depends on its type, the tasks to be completed, and the purpose it serves. Just as in culinary arts, a chef must first know what dish they want to prepare before deciding how to process the ingredients, and special ingredients call for special processing methods; the same is true of data. Hence there is no single universal method for processing data.
One method is to process tabular data with packaged (off-the-shelf) software. This suits users who are not familiar with computers. However, many datasets are very large, and processing them with packaged software can produce huge amounts of unwieldy output and eventually crash the software. I therefore discourage using this method to prepare data for training a machine learning prediction model.
The other way is to manipulate data with a programming language, generally seen as a more advanced method than packaged software and suited to users familiar with programming. It can handle very large datasets, and the language's ecosystem of packages lets you handle computing, modeling, visualization, and more in one place. A pandas sketch of this approach follows.
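The Python/pandas snippet below processes a CSV too large to open comfortably in a spreadsheet; the file name and column names are hypothetical:

```python
# A sketch of manipulating data with a programming language (Python +
# pandas); "sales.csv" and its columns are hypothetical.
import pandas as pd

# Read a large CSV in chunks so files that would crash a spreadsheet
# stay manageable in memory.
totals = []
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    cleaned = chunk.dropna(subset=["price"])           # drop incomplete rows
    totals.append(cleaned.groupby("region")["price"].sum())

# Combine the per-chunk results; computation, modeling, and
# visualization can all happen in this same environment.
result = pd.concat(totals).groupby(level=0).sum()
print(result)
```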
The data exploration step uses a large number of data manipulation and data visualization methods. When you begin exploring, stay curious, and not only about the questions you want to ask: anything intriguing that is relevant to your field of interest is worth a look.
The purpose of exploring data is to understand it. By doing so, you can extract information relevant to your field of interest from the data and deepen your understanding of that field. A minimal exploration sketch follows.
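Here is a small Python sketch with pandas and matplotlib; the toy weight column is invented purely to show summary statistics, a rule-of-thumb outlier check, and a quick plot:

```python
# A minimal data exploration sketch; the DataFrame is a toy stand-in.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"weight_g": [150, 160, 155, 148, 152, 900]})

# Summary statistics: a first pass at "understanding" the data.
print(df.describe())

# Flag outliers with the interquartile-range rule of thumb.
q1, q3 = df["weight_g"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["weight_g"] < q1 - 1.5 * iqr) |
              (df["weight_g"] > q3 + 1.5 * iqr)]
print("possible outliers:\n", outliers)

# Visualize the distribution to spot recurring patterns.
df["weight_g"].hist()
plt.xlabel("weight (g)")
plt.show()
```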
Once you have a certain understanding of your data and field, you will look for potentially relevant or predictive information with which to build a predictive model. Eventually, this prediction model can become an artificial intelligence model deployed online.
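As a rough sketch of that last step, the snippet below saves a trained scikit-learn model with joblib and loads it back the way a serving process might; the toy training data and file name are assumptions:

```python
# Persisting a trained model so it can later be served online.
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])  # toy training

dump(model, "fruit_classifier.joblib")   # save the trained model

# An online service would load the saved model and answer requests.
served = load("fruit_classifier.joblib")
print(served.predict([[0.8]]))
```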