The elective defines artificial intelligence (AI) as a broad field of science, covering the following topics:
Artificial intelligence (AI): the science and engineering of making intelligent machines.
Machine learning (ML): a field of study focusing on algorithms that enable computers to learn from data, and even improve themselves, without being explicitly programmed.
Deep learning (DL): a subset of machine learning methods based on artificial neural networks (ANNs) with representation learning.
Machine learning algorithms aim to uncover patterns in a set of training data (e.g., a mapping between inputs and outputs). A machine learning model is built from the uncovered patterns, and its performance is evaluated on test data. If performance is satisfactory, the model can be used to make predictions on new inputs.
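As an illustration, below is a minimal sketch of this workflow. The choice of scikit-learn, the Iris dataset, and a decision tree classifier are assumptions made for the example only.

```python
# Minimal sketch of the train/test/predict workflow described above.
# The dataset and classifier choices are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                  # inputs and known outputs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42)    # uncover patterns in the training data
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                     # evaluate on unseen test data
print("Test accuracy:", accuracy_score(y_test, y_pred))

# If performance is satisfactory, the model can be used on new inputs:
print(model.predict(X_test[:1]))
```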
Several types of machine learning include:
Supervised learning: trains a model for a specific task using labelled data.
Unsupervised learning: trains a model to discover patterns/relationships in unlabeled data.
Reinforcement learning: mimics the human trial-and-error learning process, training software to make decisions that achieve optimal results.
Self-supervised learning: makes the model train itself by generating its own labels from unlabeled data.
A 'classical approach' in machine learning refers to traditional, well-established algorithms that predate the current wave of deep learning, including techniques like linear regression, support vector machines (SVMs), decision trees, and Naive Bayes, which rely heavily on manual feature engineering and are generally considered easier to interpret than complex neural networks.
Essentially, these are the older, more established methods of machine learning, in contrast to newer deep learning models.
Deep learning, the successor to 'classical' machine learning, performs feature engineering and selection itself by learning hierarchical representations of data through multiple layers of a neural network. This capability allows deep learning models to automatically extract relevant features from raw data, such as images, text, and audio, without requiring explicit human intervention.
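As a minimal sketch of this idea, the example below trains a small multi-layer network directly on raw pixel values, with no manual feature engineering; the use of scikit-learn's MLPClassifier and the digits dataset are assumptions made for illustration.

```python
# Minimal sketch of a multi-layer neural network learning directly from raw pixels,
# with no manual feature engineering (library and dataset are illustrative assumptions).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)            # raw 8x8 pixel intensities, flattened
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two hidden layers: each layer learns a progressively more abstract representation.
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```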
The remarkable success of deep learning, particularly with deep neural networks (DNNs), over the past few decades has been instrumental in driving the advancement of modern AI systems. Many of the most significant recent developments in AI rely heavily on deep learning and DNNs, with large language models (LLMs) being a prime example.
However, please note that the profound advancement in AI via deep learning does not diminish the value of classical machine learning. Classical machine learning algorithms continue to be valuable tools for a wide range of applications, particularly when dealing with smaller datasets, simpler problems, or when model interpretability is crucial.
The choice between classical and deep learning depends on the specific problem, the available data, and the desired trade-off between performance, interpretability, and computational cost. Often, a combination of both approaches can be the most effective strategy for solving complex AI problems.
Logically, different tasks demand different types of data. Practically, grouping distinct types of data based on their attributes helps AI systems extract valuable insights from complex data by structuring it into manageable units.
Imagine a library where books are organized into different sections. Each section contains books that are similar in genre or topic. This is analogous to grouping forms of data separately, where data is organized into different groups based on their characteristics, such as data type, source, or purpose.
Data attributes in tabular data can have two main types (illustrated in the sketch after this list):
Categorical: Is typically used to store qualitative data.
Nominal: Can fit different categories (e.g., animals being mammal, fish, etc.)
Ordinal: Fits in ranked/ordered categories (e.g., bad/average/good/excellent)
Binary: Contains two options (e.g., True/False).
Numerical: Is typically used to store quantitative data.
Continuous: Can take any numeric value (e.g., age).
Discrete: Can take only specific, countable values (e.g., number of cars owned).
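The sketch below illustrates these attribute types in a small tabular dataset; the column names and values are hypothetical, and the use of pandas is an assumption.

```python
# Illustrative pandas DataFrame with the attribute types listed above
# (column names and values are hypothetical).
import pandas as pd

df = pd.DataFrame({
    "animal_class": pd.Categorical(["mammal", "fish", "bird"]),          # nominal
    "rating": pd.Categorical(["bad", "good", "excellent"],
                             categories=["bad", "average", "good", "excellent"],
                             ordered=True),                              # ordinal
    "is_active": [True, False, True],                                    # binary
    "age": [23.5, 41.0, 35.2],                                           # continuous
    "num_cars": [1, 0, 2],                                               # discrete
})
print(df.dtypes)
```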
Data imbalance happens when one or several classes or categories are overrepresented (majority classes) in the data, while some other classes are underrepresented (minority classes).
It is an important problem to address: when a dataset has significantly uneven class distributions, machine learning models tend to perform poorly on the minority class, resulting in biased predictions and unreliable results, especially when model performance is evaluated using standard metrics like accuracy.
Situations where data imbalance becomes a problem include:
When splitting your data into different sets (e.g., train, test, and validation sets): you need to keep the same distributions for the target and important attributes across all sets (see the stratified-split sketch after this list).
When training a model: your model might become very good at predicting the majority class(es), but not the minority class(es).
When getting insights from a trained model: because of the previous point, the insights you get from your model once trained might not reflect the actual patterns in the data.
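One common safeguard for the first point is stratified sampling when splitting, which preserves class proportions in every set. The sketch below assumes scikit-learn and uses a hypothetical imbalanced dataset.

```python
# Sketch of keeping class distributions consistent across splits via stratified
# sampling (scikit-learn usage and the generated dataset are assumptions).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: ~90% majority class, ~10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# stratify=y preserves the 90/10 class ratio in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```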
To fix data imbalance, you can apply techniques such as the following (a short sketch of two of them follows this list):
Oversampling: duplicates existing minority class data points or generates synthetic data points.
Undersampling: randomly removes data points from the majority class to achieve a more balanced distribution.
Data augmentation: creates new, slightly modified versions of existing minority class data to expand the dataset.
Cost-sensitive learning: assigns higher penalties to misclassifications of the minority class, forcing the model to focus on learning from these data points.
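The sketch below illustrates two of these techniques, oversampling and cost-sensitive learning, on a hypothetical imbalanced dataset; the scikit-learn usage is an assumption (synthetic oversampling such as SMOTE would require an additional library like imbalanced-learn).

```python
# Sketch of two ways to address class imbalance (hypothetical data, scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Oversampling: duplicate minority-class points until both classes are equally represented.
X_min, y_min = X[y == 1], y[y == 1]
X_min_over, y_min_over = resample(
    X_min, y_min, replace=True, n_samples=int((y == 0).sum()), random_state=42
)
X_balanced = np.vstack([X[y == 0], X_min_over])
y_balanced = np.concatenate([y[y == 0], y_min_over])

# Cost-sensitive learning: weight misclassifications of the minority class more heavily.
model = LogisticRegression(class_weight="balanced").fit(X, y)
```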
Data exploration is the process of analyzing and summarizing a dataset to understand its structure, patterns, and relationships, using techniques tailored to data types. The goal is to uncover insights, identify anomalies, and inform subsequent steps in analysis or modeling.
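A typical first exploration pass might look like the following sketch; the dataset and its column names are hypothetical, and the use of pandas is an assumption.

```python
# First exploration pass on a small tabular dataset (pandas assumed;
# the data itself is hypothetical and only serves as an illustration).
import pandas as pd

df = pd.DataFrame({
    "age": [23, 41, 35, 52, 29],
    "income": [32000, 58000, None, 74000, 41000],
    "target": ["no", "yes", "no", "no", "yes"],
})

print(df.shape)                                   # number of rows and columns
df.info()                                         # attribute types and non-null counts
print(df.describe())                              # summary statistics for numerical attributes
print(df["target"].value_counts(normalize=True))  # class distribution of the target
print(df.corr(numeric_only=True))                 # relationships between numerical attributes
```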
Despite being one of the most time-consuming tasks when preparing data, data cleaning is essential for ensuring data accuracy, reliability, and consistency, which are crucial for effective data analysis, informed decision-making, and compliance with regulations.
Cleaning can involve removing duplicates, fixing formatting issues, dealing with missing values or outliers, or a combination of these. Here are some techniques for handling some of these tasks (a short sketch follows this list):
Missing values: Remove an instance (row) or a full attribute (column); or use imputation methods to replace the missing values.
Outliers: Identify any that result from errors. If an outlier is clearly not representative of the population you are studying, removal might be appropriate.
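The sketch below shows how some of these cleaning steps might look in practice; the data and column names are hypothetical, and the pandas/scikit-learn usage is an assumption.

```python
# Common cleaning steps on a hypothetical dataset (pandas/scikit-learn assumed).
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [23, 23, None, 52, 29],
    "income": [32000, 32000, 58000, 74000, 900000],  # last value is a suspect outlier
})

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Missing values: impute with the median instead of dropping the rows.
imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# Outliers: flag values far outside the interquartile range for inspection;
# remove them only if they are clearly errors.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```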
Data preprocessing involves transforming raw data into a consistently usable format for analysis, including cleaning, organizing, and summarizing it to identify patterns, make predictions, and inform decision-making.
Data preprocessing involves converting text and categorical attributes to numerical attributes (one-hot encoding or embedding), scaling data (normalization or standardization), applying selected transformations to data distribution (if not close to Gaussian), under/oversampling data (if imbalanced), reducing the total number of dimensions, etc.
Transformation pipelines are structured frameworks for data transformation, processing data through multiple stages to make it compatible for ingestion or analysis. In common practice, we embed all the cleaning and preprocessing steps into a pipeline, which can be reused on new data as needed, as in the sketch below.
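A possible sketch of such a pipeline, assuming scikit-learn and hypothetical column names, combines the imputation, scaling, and encoding steps mentioned above with a final model:

```python
# Sketch of a reusable transformation pipeline (scikit-learn assumed;
# column names and the final estimator are hypothetical).
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric_features = ["age", "income"]
categorical_features = ["animal_class", "rating"]

preprocessing = ColumnTransformer([
    # Numerical attributes: impute missing values, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_features),
    # Categorical attributes: one-hot encode, ignoring unseen categories.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# The same pipeline object can be reused: fit it on training data, then apply it to new data.
model = Pipeline([("preprocess", preprocessing),
                  ("classifier", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train); model.predict(X_new)
```

Fitting the pipeline once on training data and reusing it on new data ensures that exactly the same transformations are applied in both cases.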