Data Augmentation Techniques for Your Next Data Science Model

Here at SpringML, we work on sophisticated models to solve complex business problems for our clients.  But sometimes a bit of augmentation is necessary to get the job done. Definitions of data augmentation can be vague, but to me, it is primarily any alteration of the dataset that improves the model's predictions.

 

Objective & Prerequisites:

By the end of this read, you will know how to use several data augmentation techniques in your next data science model.  Some of these ideas apply to modeling in general, while most are specific to deep learning.

Before beginning you will need:

  • Some basic understanding of machine learning and deep learning.
  • Prior experience in training and running deep learning models would be helpful, but not necessary.  

Difficulty Level: Medium

Why do Data Augmentation?

Reason 1: Imbalanced Classes

In my college classes, I worked with curated datasets that had known problems, known outcomes, and balanced classes.  In the real world, this is hardly the case.  Below are three examples of how I used data augmentation to fix imbalanced classes.

Example A: Not Enough Documents

When working on a document classifier to identify government forms, I found that some forms were not processed as often as others.  I realized that aside from a few signatures and addresses, these government forms were almost identical. So I simply duplicated a blank copy of the under-represented form, and it worked!

In this example, I added a few blank forms to balance out the class in the dataset.
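As a rough sketch of that duplication trick, assuming your samples live in a pandas DataFrame with a label column (the file names and column names here are made up):

import pandas as pd

# Hypothetical table of document samples with a 'label' column
df = pd.DataFrame({
    'doc_path': ['form_a_1.png', 'form_a_2.png', 'form_a_3.png', 'form_b_1.png'],
    'label': ['form_a', 'form_a', 'form_a', 'form_b'],
})

# Work out how many extra copies the smallest class needs
counts = df['label'].value_counts()
minority = counts.idxmin()
n_needed = counts.max() - counts.min()

# Duplicate minority-class rows (with replacement) to balance the classes
extra = df[df['label'] == minority].sample(n=n_needed, replace=True, random_state=42)
balanced = pd.concat([df, extra], ignore_index=True)
print(balanced['label'].value_counts())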

 

Example B: Not Enough Objects in the Photo

(Deep Learning Specific for Object Detection)

While I was working on an object detection project for a client in the retail space, some objects were represented more often than others.  Certain cans and bottles appeared more frequently due to popularity and sales, which left other products under-represented. In my experience, the number of objects in a photo mattered more than the number of photos in the dataset itself.  For less represented objects, I simply opened the picture on my MacBook, then cropped, copied, and pasted the objects to create more of them in the existing photo.

In the example above, one can see that I artificially created more of these objects for the object detection model.  In the image to the right, one can also see that I stretched the cropped figure so the model learns to detect the object at different sizes.
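Here is a minimal sketch of that crop-copy-paste idea in OpenCV, assuming you already know the bounding box of an under-represented object (the file name and coordinates below are made up, and the destination spots are chosen by hand so they do not cover anything else):

import cv2

img = cv2.imread('shelf_photo.jpg')  # hypothetical retail shelf photo

# Bounding box of an under-represented can (x, y, width, height) -- made-up values
x, y, w, h = 120, 200, 60, 150
obj = img[y:y+h, x:x+w].copy()

# Paste the cropped object into an empty spot elsewhere in the photo
px, py = 400, 220  # hypothetical destination
img[py:py+h, px:px+w] = obj

# Paste a stretched copy as well, so the model sees the object at a new size
obj_big = cv2.resize(obj, None, fx=1.3, fy=1.3)
bh, bw = obj_big.shape[:2]
img[50:50+bh, 500:500+bw] = obj_big

cv2.imwrite('shelf_photo_augmented.jpg', img)

If you do this for object detection, remember to add matching bounding-box annotations for every pasted copy.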

 

Before you try technique A or B, you should ask yourself:

If I collected more data, would it look the same anyway? If the answer is yes, then you can use these techniques.  I would not rely on this as the basis of your dataset, but if you are in a pinch, it can help you fill out your classes and create balance for your model.

Just don’t be like Thanos and eliminate half the population in the universe to create balance (You can find a Quora discussion about it here).

But on a serious note, I found that the downside of doing this is overfitting, since there is not enough variety in how the object appears from different angles.  Government documents stay the same, but cans and bottles rotate, meaning the back side of the object has to be considered as well. I also noticed that artificially creating more objects in a photo unnecessarily crowds the image, making the model harder to validate.

 

Example C: Delete some data

This is not my preferred solution, but instead of augmenting data for under-represented classes, one can simply remove some data from the over-represented classes. If there are more than enough examples to represent a class, removing some of them lets the model detect the other classes.
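A quick sketch of that idea as random undersampling with pandas (the labels here are made up):

import pandas as pd

# Hypothetical labeled dataset with one heavily over-represented class
df = pd.DataFrame({'label': ['can'] * 900 + ['bottle'] * 100})

# Downsample every class to the size of the smallest class
n_min = df['label'].value_counts().min()
balanced = df.groupby('label').sample(n=n_min, random_state=42)
print(balanced['label'].value_counts())  # each class now has 100 rows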

 

Reason 2: Make the model more robust

(Deep Learning Specific)

Sometimes, we purposely ‘corrupt’ the data just to throw a wrench into the deep learning model.  These changes make the model better suited to handle ‘dirty’ data, since data in the wild is rarely perfect. Obviously, clean data should be the foundation of your dataset, but if it is too perfect, the model cannot handle scenarios outside of the dataset.  It is like memorizing all the answers to a test, only for the test’s format, order, and wording to change, making your memorization futile.

A change in lighting, a different angle, or a different image size can throw off a model’s predictions. During training, the loss will temporarily jump when the model runs into a ‘corrupted’ image, but later iterations smooth things out; that recovery signifies that the model is learning.  This obviously increases training time, but the model will be more robust in the long run.

Here are three examples of how I used data augmentation to make a model more robust.

Example A: You Can Apply Blur to an Image

Code to Blur an Image in Python:

import cv2
from matplotlib import pyplot as plt
%matplotlib inline

# Read the image (OpenCV loads in BGR order) and convert to RGB for matplotlib
img = cv2.imread('IMG-0115.JPG')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Apply an averaging blur with a 50x50 kernel (larger kernel = stronger blur)
blurImg = cv2.blur(img, (50, 50))
plt.imshow(blurImg)

That way, the model will work harder to try to detect these cans.

 

Example B: Brighten or Darken an Image

Code to Lighten/Darken an Image in Python:

from PIL import Image, ImageEnhance

input_image = 'your_image.jpg'
factor = 1.5  # less than 1 darkens, greater than 1 brightens

# Open the image and scale its brightness by the chosen factor
image = Image.open(input_image)
enhancer_object = ImageEnhance.Brightness(image)
out = enhancer_object.enhance(factor)
out.save('image.jpg')

I noticed that when I was training on images, a slight change in lighting threw off the model.  To fix this, you can change the brightness of the images so the model can detect objects in different lighting.

This is just one example of what is possible with image augmentation; there are many other ways to adjust an image. Take a look at this article to get a glimpse of some of the methods.
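For instance, horizontal flips and small rotations are two more common adjustments. A minimal OpenCV sketch (the file names are placeholders):

import cv2

img = cv2.imread('your_image.jpg')  # placeholder file name

# Flip horizontally (flipCode=1); use 0 for a vertical flip
flipped = cv2.flip(img, 1)

# Rotate 15 degrees around the image center, keeping the same canvas size
h, w = img.shape[:2]
M = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)
rotated = cv2.warpAffine(img, M, (w, h))

cv2.imwrite('flipped.jpg', flipped)
cv2.imwrite('rotated.jpg', rotated)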

 

Example C: You Can Apply the Idea of Image Data Obscuring to a Time Series Dataset

Data Augmentation - Sine Wave with Noise

This is a sine wave to which I added noise as an example.  When I was training an LSTM network for anomaly detection, I deliberately avoided making the time series data too clean. If the model was always trained on an ideal dataset, it could overfit: it would fail to generalize across time series and start modeling the noise instead of seeing the overall picture.  Through reconstruction error, the model should learn what the ‘normal’ time series looks like beneath the noise.  The blue line in this picture represents what the model’s output should look like once it is finally trained.

 

Python Code to Add Noise to a Sine Wave:

import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

# Generate a clean sine wave
x = np.linspace(-5, 5, 100)
y = np.sin(x)

# Add a small random offset to each point to create the 'augmented' series
y_1 = [i + np.random.choice(np.linspace(-.15, .15)) for i in y]

plt.figure(figsize=(15, 7))
plt.plot(x, y, label='clean')        # what the trained model should recover
plt.plot(x, y_1, label='augmented')  # noisy training input
plt.legend();
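Once the model is trained, reconstruction error is simply the gap between the noisy input and what the model reconstructs. As a minimal sketch building on the block above, where the clean sine wave stands in for a trained LSTM’s reconstruction (that stand-in is an assumption, not real model output):

# Pretend the trained model reconstructs the clean sine wave
reconstruction = np.sin(x)

# Reconstruction error per point: squared difference from the noisy series
errors = (np.array(y_1) - reconstruction) ** 2

# Flag points whose error is far above typical as anomalies
threshold = errors.mean() + 3 * errors.std()
anomalies = np.where(errors > threshold)[0]
print(f'{len(anomalies)} points flagged as anomalous')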

 

Conclusion

These are just some of the things that are possible with data augmentation.  Next time you are in a pinch, you can apply these ideas to your data science model.

At SpringML, we can also create custom deep learning models, such as Using Object Detection to Find Potholes, and augment data if necessary to achieve your business goals.  If you want to contact us about our services, you can reach us at info@springml.com.