Like creator, like creation: both are allergic to extremes.
When training machine learning models, the size of the dataset plays a crucial role in determining the model's performance and generalizability. Large datasets provide abundant information and tend to improve accuracy, but they can be unwieldy to process; small datasets, conversely, make it hard to train models that generalize well.
While datasets like CIFAR-10 are relatively small and can fit into memory, larger datasets may exceed the available RAM, causing out-of-memory (OOM) errors. To handle large datasets efficiently, the data can be divided into smaller batches. Each batch is processed separately by the model, and the results are aggregated to update the model's parameters. GPUs can process the samples within a batch in parallel, significantly accelerating training.
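As a rough sketch of the idea (the arrays and batch size here are placeholders, not code from this exercise), slicing a dataset into fixed-size batches can look like this:

```python
import numpy as np

# Placeholder dataset: 50,000 CIFAR-10-sized samples (32x32 RGB images).
x_train = np.zeros((50000, 32, 32, 3), dtype=np.float32)
batch_size = 32

# Walk through the data one slice at a time instead of all at once.
for start in range(0, len(x_train), batch_size):
    batch = x_train[start:start + batch_size]
    # ...forward pass, loss computation, and parameter update go here...
```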
Imagine a bakery (neural network) that produces a large quantity of bread (model outputs). Instead of baking all the loaves at once, which might overwhelm the oven (architecture), the bakery divides the dough (training data) into batches. Each batch is baked separately, and the finished loaves are combined to form the final product.
In Python, a generator is a special type of function that returns an iterator. Unlike a regular function, which computes a single value and then terminates, a generator produces a sequence of values one at a time using the yield keyword, pausing between values until the next one is requested.
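As a minimal placeholder for the kind of generator we are talking about (list_generator and the numbers list are made-up names for this sketch):

```python
def list_generator(items):
    """Yield the elements of a list one at a time."""
    for item in items:
        yield item

numbers = [10, 20, 30]
gen = list_generator(numbers)  # creates a generator object; nothing runs yet
```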
Every time the next() function is called on the generator above, the next item of the list is fetched. In the example below, the first call retrieves the first item; after that, the generator remembers its position, and each subsequent call picks up where the last one left off.
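Continuing the placeholder generator from the sketch above, repeated next() calls step through the list:

```python
print(next(gen))  # first call: 10
print(next(gen))  # second call: 20 -- resumes where the last call left off
print(next(gen))  # third call: 30
```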
When our generator function reaches the end of the list, with no new indexes left to access, it raises a StopIteration exception.
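With all three placeholder items above already consumed, one more call makes this visible:

```python
try:
    next(gen)
except StopIteration:
    print("the generator is exhausted")
```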
For those who want their generators to loop back to the start after reaching the end of a list, wrap the original yielding loop inside a while True loop.
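One way to write that, again with placeholder names, is the following; the while True loop restarts the inner loop every time it runs out of items:

```python
def looping_generator(items):
    """Yield the elements of a list forever, restarting after the last one."""
    while True:
        for item in items:
            yield item

gen = looping_generator([10, 20, 30])
print([next(gen) for _ in range(7)])  # [10, 20, 30, 10, 20, 30, 10]
```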
No more StopIteration exceptions.
Generators can also be used to create plots. In the following script, we build a function that fetches images from a dataset and lays them out on a single figure as a grid of axis-less subplots.
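The original plotting script isn't reproduced here, but a sketch of such a function might look like the following; the img_combine name comes from later in this exercise, while the 4x8 grid shape is an assumption chosen to match a batch size of 32:

```python
import numpy as np
import matplotlib.pyplot as plt

def img_combine(images, labels=None, rows=4, cols=8):
    """Display a batch of images as a grid of axis-less subplots."""
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 1.5, rows * 1.5))
    for i, ax in enumerate(axes.flat):
        ax.imshow(images[i].astype("uint8"))
        if labels is not None:
            ax.set_title(int(np.ravel(labels[i])[0]), fontsize=8)
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```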
We will once again be borrowing CIFAR-10's images to run our functions. In the script below, a new CIFAR-10 generator function extracts images and labels from the training and test datasets in batches of 32.
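The script itself isn't shown here, but a generator along those lines might be sketched as follows; the cifar_generator name and the decision to concatenate the train and test sets are assumptions:

```python
import numpy as np
from tensorflow.keras.datasets import cifar10

def cifar_generator(batch_size=32):
    """Yield (images, labels) batches from the CIFAR-10 train and test sets."""
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()
    images = np.concatenate([x_train, x_test])
    labels = np.concatenate([y_train, y_test])
    for start in range(0, len(images), batch_size):
        yield images[start:start + batch_size], labels[start:start + batch_size]

cifar_gen = cifar_generator(batch_size=32)
```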
Afterwards, the output of next(cifar_gen) is unpacked into images and labels; passing these to img_combine produces a figure containing 32 CIFAR-10 images, the same number as the generator's batch size.
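Putting the two sketches together, the call described above would be roughly:

```python
images, labels = next(cifar_gen)  # first batch of 32 images and labels
img_combine(images, labels)       # plot them as a 4x8 grid of subplots
```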
Run it a second consecutive time and another 32 images are fetched and shown on the figure.
In CNNs, generators can be used to implement batch processing effectively. The training loop of the CNN iterates over the generator, fetching batches of data one at a time. This avoids loading the entire dataset into memory at once, which is impractical for large datasets.
For this exercise, we are running the generator with our last CIFAR-10 CNN model architecture.
The new generator below fetches a maximum of 32 images and labels per batch to train the CNN with.
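The exact CNN from the earlier exercise isn't reproduced here, so the architecture below is a small stand-in; the batch_generator function is likewise a sketch of how a looping generator can be handed to model.fit:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

def batch_generator(images, labels, batch_size=32):
    """Loop over the data forever, yielding one batch at a time."""
    while True:
        for start in range(0, len(images), batch_size):
            yield (images[start:start + batch_size],
                   labels[start:start + batch_size])

# Stand-in CNN; the actual architecture from the earlier exercise may differ.
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

batch_size = 32
history = model.fit(
    batch_generator(x_train, y_train, batch_size),
    steps_per_epoch=len(x_train) // batch_size,
    validation_data=(x_test, y_test),
    epochs=10,
)
```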
Compared to the outputs of the traditional CNN, implementing the generator seems to have slightly hurt final convergence, resulting in lower accuracy and higher loss for both the training and validation sets.
Visualized as line plots, both training curves trace a smooth, asymmetrical parabola-like shape, the loss curve more so than the accuracy curve. The validation curves, by contrast, are erratic, swinging between converging and diverging from one epoch to the next. The widening gap between the accuracy curves suggests intensifying overfitting.
In practice, when carrying out various machine learning projects, we may often encounter situations where the amount of data is insufficient, the common reasons being difficult data collection, labor-intensive data labelling, noisy or poor data quality, etc.
To address the shortcomings of small datasets in machine learning, data augmentation artificially expands the dataset by creating new samples from existing ones. This can take several forms (a minimal sketch of the image case follows the list):
Image Augmentation: flipping, rotating, cropping, and adding noise to images.
Text Augmentation: synonym replacement, backtranslation, and adding noise to text data.
Time Series Augmentation: time shifting, scaling, and adding noise to time series data.
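As a minimal sketch of the image case, using plain NumPy so no particular augmentation library is assumed:

```python
import numpy as np

def augment_image(img, rng=None):
    """Apply simple augmentations: random horizontal flip, shift, and noise."""
    rng = rng or np.random.default_rng()
    out = img.astype(np.float32)
    if rng.random() < 0.5:                      # random horizontal flip
        out = out[:, ::-1, :]
    shift = int(rng.integers(-3, 4))            # small horizontal shift
    out = np.roll(out, shift, axis=1)
    out = out + rng.normal(0, 5.0, out.shape)   # add Gaussian pixel noise
    return np.clip(out, 0, 255).astype(np.uint8)
```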
While data augmentation can be a valuable technique for improving model performance, it is not a universal solution. Overusing or misapplying augmentation can degrade performance. In practice, the choice of augmentation methods should be tailored to the specific task and dataset.
Augmenting the entire dataset can also introduce bias and hinder the model's ability to generalize. It is crucial to split the data into training and validation sets before applying data augmentation, so that the validation set contains only unmodified samples.
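In code terms, the ordering matters: split first, then augment only the training portion. A rough sketch (the 20% validation fraction and the x_train/y_train names are assumptions, and augment_image comes from the sketch above):

```python
import numpy as np

# x_train / y_train stand for whatever arrays were loaded earlier.
idx = np.random.permutation(len(x_train))       # shuffle before splitting
x_train, y_train = x_train[idx], y_train[idx]

n_val = int(0.2 * len(x_train))                 # hold out 20% for validation
x_val, y_val = x_train[:n_val], y_train[:n_val]
x_tr, y_tr = x_train[n_val:], y_train[n_val:]

# Augment only the training split; the validation set stays unmodified.
x_tr_aug = np.stack([augment_image(img) for img in x_tr])
```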
For this exercise, we will run image data augmentation on CIFAR-10 with the ImageDataGenerator class from TensorFlow's Keras API. We will build a data augmentation generator that will be used to fit the CNN model from the exercises on 11th September.
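The generator we built isn't reproduced here, but a typical ImageDataGenerator setup looks roughly like the following; the specific transform ranges are assumptions chosen to match the flips, scalings, and shifts described below:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Augmentation is applied on the fly to training batches only.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)

train_flow = datagen.flow(x_train, y_train, batch_size=32)

# `model` would be the CNN built in the earlier exercise, e.g.:
# history = model.fit(train_flow,
#                     steps_per_epoch=len(x_train) // 32,
#                     validation_data=(x_test / 255.0, y_test),
#                     epochs=10)
```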
Compare the images in the figure below with the one above. Notice how each image has either been flipped, scaled, or repositioned.
Analyzing the output, the simple data augmentation generator seems to have heavily destabilized the CNN's performance. While its results are still a net improvement over our past CIFAR-10 DNN, they fall significantly short of our regular and batch-processed CNNs.
The subplots below show the training curves converging steadily, with occasional plateaus. The validation curves initially follow the same trend, but from around the 6th epoch the validation loss keeps increasing and the validation accuracy starts to drop toward the end. By the final epochs, a large gap has opened between the training and validation curves, indicating increasing overfitting.