Chapter 2

Prepare the Data


The data you’ll be working with is heterogeneous. It comes from multiple sources (sensors, databases, audio files, and so on), in different formats, from different domains, and with different time intervals. And it is noisy.

Effective preparation of all this data is critical. To be of use in an AI system, the data must be filtered, cleaned, and labeled.

Easy to say; difficult and time-consuming to do. For example:

  • What if the dataset is too large to load into memory?
  • How do you preprocess the data so that the network will give accurate results?
  • What is the quickest way to label all the data?
  • What if there isn’t enough data to train a network?

Let’s look at how MATLAB helps you handle these challenges.

Large datasets can be in the form of files that do not fit into available memory or files that take a long time to process. A large dataset can also be a collection of numerous small files.

MATLAB provides tools for handling large datasets. These include:

Datastores. Rather than loading all your data into memory, datastores load in data only as you need it. The datastore acts as a pointer to the data.
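As a minimal sketch of this on-demand behavior, the code below creates an image datastore and reads one image at a time, so only the current image is held in memory. It assumes the sample image folders that ship with MATLAB, used here purely for illustration; point the datastore at your own folder in practice.

```matlab
% Sample image folders shipped with MATLAB (an assumption for this sketch;
% replace with the location of your own data).
sampleFolders = fullfile(matlabroot, 'toolbox', 'matlab', {'demos', 'imagesci'});

% Creating the datastore records file locations; no pixel data is loaded yet.
imds = imageDatastore(sampleFolders, 'FileExtensions', {'.jpg', '.png', '.tif'});
numFiles = numel(imds.Files);
fprintf('Datastore points to %d image files.\n', numFiles);

% Read one image per iteration; only the current image is in memory.
numRead = 0;
while hasdata(imds)
    [img, info] = read(imds);
    numRead = numRead + 1;
    fprintf('%s: %d x %d pixels\n', info.Filename, size(img, 1), size(img, 2));
end

% Rewind so the datastore can be read again from the first file.
reset(imds);
```

The same read loop works whether the folder holds ten files or ten million, because the datastore never needs the whole collection in memory at once.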

Tall arrays. Tall arrays let you work with numeric data stored in arrays that have too many rows to fit into memory.
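The sketch below shows one way tall arrays are typically used: a datastore points to the file, `tall` wraps it, and `gather` triggers the actual computation. It assumes the `airlinesmall.csv` sample file that ships with MATLAB as a stand-in for a file too large for memory; the variable names come from that file, not from this ebook.

```matlab
% airlinesmall.csv is a sample file shipped with MATLAB; it stands in here
% for a dataset that would not fit into memory.
ds = tabularTextDatastore('airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay', 'DepDelay'});

% Create a tall table backed by the datastore.
tt = tall(ds);

% Operations on tall arrays are deferred: this line describes the
% computation but does not run it yet.
meanArrDelay = mean(tt.ArrDelay, 'omitnan');

% gather executes the deferred computation, reading the file in chunks,
% and returns an ordinary in-memory value.
meanArrDelay = gather(meanArrDelay);
fprintf('Mean arrival delay: %.2f minutes\n', meanArrDelay);
```

Because evaluation is deferred until `gather`, several calculations can be queued and then computed together with fewer passes through the data.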

bigimage. A bigimage object represents a large image as smaller blocks of data that can be loaded and processed independently. (In recent MATLAB releases, blockedImage supersedes bigimage.)

In a perfect world, all your images would be clean, sharp, and in pristine condition, and no preprocessing would be necessary. But the data used for visual inspection systems is unlikely to be pristine. Lighting conditions might be suboptimal, obscuring key features. Conversely, your images might be cluttered with features that the network doesn't need. For example, an image of a semiconductor contains many shapes and patterns that make detecting anomalies difficult.

Time spent on preprocessing your data is time well spent. Clean, clear images can markedly improve the predictive accuracy of your algorithms. For example, Figure 1 shows the network results with and without preprocessing.

Figure 1

Each image run through the network receives a score that indicates how likely the part is to be defective. With the preprocessed data, shown on the right, all the defective units are clustered on the left. This is a clear indication that the network distinguishes defective from non-defective parts more easily when the images have been preprocessed.

Depending on your data and classification goals, you can preprocess your images using any or all of these techniques:

  • Register misaligned images. Images are easier to classify when they are all aligned in the same way.
  • Adjust image intensity. Enhancing contrast makes the object of interest stand out from the background.
  • Segment or threshold the image. Use techniques such as clustering to separate the object of interest from a busy background, pieces of scrap metal, or other visual clutter.
  • Perform region analysis. Define defects by shape, size, color, and so on.
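The four techniques above can be chained into a single pipeline. The following is a minimal sketch, assuming Image Processing Toolbox and its `coins.png` sample image; the simulated misalignment, the 50-pixel speck size, and the area rule used to flag regions are all hypothetical values chosen for illustration, and simple thresholding stands in for the clustering approach mentioned in the list.

```matlab
% coins.png is a grayscale sample image from Image Processing Toolbox.
fixed = imread('coins.png');

% Simulate a misaligned capture by shifting and rotating a copy
% (illustrative offsets only).
moving = imrotate(imtranslate(fixed, [6 -4]), 3, 'bilinear', 'crop');

% 1. Register the misaligned image to the reference image.
[optimizer, metric] = imregconfig('monomodal');
registered = imregister(moving, fixed, 'rigid', optimizer, metric);

% 2. Adjust intensity: stretch contrast across the full range.
adjusted = imadjust(registered);

% 3. Segment by thresholding, then clean up the mask.
bw = imbinarize(adjusted);      % global threshold (Otsu's method by default)
bw = imfill(bw, 'holes');       % fill holes inside objects
bw = bwareaopen(bw, 50);        % drop specks under 50 pixels (illustrative)

% 4. Region analysis: measure each connected region.
stats = regionprops('table', bw, 'Area', 'Eccentricity', 'Centroid');

% Hypothetical rule: flag regions much smaller than the typical region.
isSuspect = stats.Area < 0.75 * median(stats.Area);
fprintf('%d regions found, %d flagged for review.\n', ...
    height(stats), nnz(isSuspect));
```

Which steps you need, and in what order, depends on your images; for instance, registration is unnecessary when parts are already fixtured consistently.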

Figure 1

Deep learning requires a lot of labeled data. The more thorough and accurate the labeling, the better the performance of the network.

Labeling defines the features in the data that you want the network to recognize and classify. But data labeling is extremely time-consuming and error-prone: imagine manually drawing bounding boxes around objects in thousands of images, or having to define the class of every single pixel in those images.

With MATLAB you can automate the more time-consuming parts of this process by using interactive tools. For example:

The Image Labeler and Video Labeler apps let you label regions of an image and automatically apply those labels through every frame of a video.

The Signal Labeler and Audio Labeler apps, like the Image Labeler, have built-in automation capabilities, in this case to speed up labeling of signal data.

The big-image labeler app lets you label large images interactively. You don't have to worry about extracting patches, labeling each patch, and reconstructing the image. Instead, you can move around the image and label different parts of it.

You may need additional data to achieve higher accuracy or to apply your model to a wider set of signal types or scenarios. You might need to use a different (possibly more complicated) model that simply requires more data and labels.

Instead of collecting new data, a common strategy is to generate additional data from your original set by means of data augmentation, usually with simple geometric transformations.

The data transformation command in MATLAB applies random transformations (such as cropping, rotating, resizing, translating, and flipping) to the dataset. You can also apply color transformations, such as hue and contrast jitter, to expand your dataset.

Figure 1

You can write the transformed images out as new files and add them to your dataset.
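The text does not name a specific command, so the sketch below is one possible approach rather than the definitive one. It assumes `imageDataAugmenter` from Deep Learning Toolbox for the random geometric transformations and `jitterColorHSV` from Image Processing Toolbox for color jitter. The `peppers.png` sample image, the transformation ranges, the output folder, and the file names are all illustrative choices; random cropping is not shown.

```matlab
% Random geometric transformations; the ranges are illustrative only.
augmenter = imageDataAugmenter( ...
    'RandRotation', [-15 15], ...       % degrees
    'RandXReflection', true, ...        % random left-right flip
    'RandXTranslation', [-10 10], ...   % pixels
    'RandYTranslation', [-10 10], ...   % pixels
    'RandScale', [0.9 1.1]);            % resize factor

% peppers.png is an RGB sample image shipped with MATLAB.
original = imread('peppers.png');

% Hypothetical output location for the augmented copies.
outFolder = fullfile(tempdir, 'augmented');
if ~exist(outFolder, 'dir')
    mkdir(outFolder);
end

numCopies = 4;
for k = 1:numCopies
    % Each call draws a new random combination of the transformations.
    augmented = augment(augmenter, original);

    % Add random hue and contrast jitter.
    augmented = jitterColorHSV(augmented, 'Hue', 0.1, 'Contrast', 0.4);

    % Write the result out as a new image file.
    outFile = fullfile(outFolder, sprintf('peppers_aug_%d.png', k));
    imwrite(augmented, outFile);
end
```

Keep the transformation ranges within what your inspection setup could plausibly produce; for example, allow flips only if a mirrored part is still a valid example of its class.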