Data Preprocessing and the Short-Time Fourier Transform | Deep Learning for Engineers, Part 3
From the series: Deep Learning for Engineers
Brian Douglas
Data in its raw form may not be ideal for training a network. There are changes you can make to the data that are often desired or sometimes necessary to make training faster and simpler, and to ensure that it converges on a solution.
This video covers three reasons why preprocessing is important for deep learning systems:
- Transform the data into a form that is suitable for the network architecture
- Reduce the dimensions of your data and make patterns more obvious
- Adjust the training data to ensure the entire solution space is covered
Published: 1 Apr 2021
In the last video, we talked about how we need data to train a classification network. However, in many cases, data in its raw form, the form in which it was collected, might not be ideal for training a network. There are some changes we can make to the data that are often desired or sometimes necessary in order to make training faster, simpler, or to ensure that it converges on a solution in the first place. And so in this video that’s what we’re going to talk about. Specifically, I want to highlight a few different ways to preprocess data, and why it’s so important to the deep learning workflow.
Now, like every video in this series, this isn’t intended to cover everything you need to know. But hopefully, this will at least get you thinking about your own particular engineering problem and the preprocessing that you may have to do in order to successfully implement deep learning. I hope you stick around for it. I’m Brian, and welcome to a MATLAB Tech Talk.
Data preprocessing is a pretty broad term. It’s basically anything you do to the raw data prior to inputting it into your specific machine learning operations. And it’s hugely important for at least three reasons. One, preprocessing can transform the data into a form that is suitable for the network architecture. Two, it can help reduce the dimensions of your data and make patterns more obvious. And three, it can adjust the training data to ensure the entire solution space is covered.
Let’s walk through an example of each of these so that hopefully, they make a little more sense. And we’ll start with transforming the data.
The input into a network is fixed in terms of the number of elements you feed into it. This means that your data needs to be formed into discrete packets that all have the same number of elements. If you’re working with images, each image needs to be the same size which might mean that part of preprocessing is to crop, pad out or resize images that don’t have the correct dimensions.
It’s the same if you’re working with signals and not images. The length and sample rate of the signals need to be consistent, or again cropping, padding, and resampling is required. That’s just one quick example to highlight how the network architecture and the data need to be consistent with each other, but there are many other things as well, like making sure units are correct. This sort of data reformatting is one reason to do preprocessing.
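To make that a little more concrete, here is a minimal MATLAB sketch of what this kind of resizing and padding could look like for images. The 224-by-224 target size and the file name are illustrative assumptions, not values from any particular network.

```matlab
% Minimal sketch: force every image to a fixed network input size.
% The 224x224 target and the file name are illustrative assumptions.
img    = imread('part01.png');          % hypothetical raw image
target = [224 224];

if any(size(img, [1 2]) > target)
    img = imresize(img, target);                 % shrink images that are too big
else
    padAmt = target - size(img, [1 2]);
    img = padarray(img, padAmt, 0, 'post');      % zero-pad images that are too small
end
```

The same idea applies to signals, where you would pad, crop, or resample each recording to a common length and sample rate, like I do with the audio example later in this video.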
For the second reason, dimensional reduction, it’s helpful to remember that deep learning trains a network to recognize patterns in data. And so any information that isn’t needed to recognize the patterns you’re looking for can be removed without impacting the overall classification. Removing extraneous data helps make the remaining pattern more obvious and that’s going to help the learning process. In general, if the patterns are more obvious to a human, they are going to be more obvious to the deep learning algorithms as well.
But the other reason to reduce the dimensionality of your data is the so-called curse of dimensionality, where more dimensions mean more features and more variations of each feature, and therefore more training data is needed to cover all possible combinations. So, not only is the data itself larger with higher dimensions, but you need more of it to train the network. So, overall it takes more network complexity, more data storage, and more time to train.
For example, take these two images of the number four, scaled to two different resolutions. We can tell that both of these are the number four despite the one on the right having fewer pixels, or having a lower dimensionality. This is because the defining features that make up the number four still exist at both resolutions. In this way, a network trained on the lower resolution image set could still function as a number classifier just as well as one trained on the higher resolution images.
Now let’s look at a slightly different four, again at both resolutions. You can see that the two instances differ slightly in their details, which is picked up nicely at the higher resolution, but is mostly lost in the lower resolution.
And this is where the curse of dimensionality comes into play. A network trained on the higher resolution images may have converged on a solution that thinks this little tail is a defining feature of a 4 and therefore misclassify this one since it doesn’t have that tail. But since that detail is missing in the lower resolution, the network almost has no choice but to converge on broad details that truly define the number. And you might be thinking, well, why wouldn’t the higher resolution network learn to recognize the broader features also? Well it can! It just takes more training data for the network to figure out that these small details that we’re feeding into it don’t affect the larger classification.
Now a drawback to dimensionality reduction is that you have to understand your data well enough to reduce the dimensions without accidentally removing critical information from your data set.
For example, say we want to train a network that can visually identify manufacturing defects in hex nuts. It wouldn’t be a good idea to reduce the size of these images by just scaling them down. The flaws, or the patterns we’re looking for, are quite small and we’d lose the detail that distinguishes them. In this case, a better dimensional reduction approach might be to crop the images instead.
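Here is a rough sketch of that idea in MATLAB. The file name and the 64-pixel crop window are made-up values for illustration; the point is just that cropping keeps the original pixel detail while scaling throws some of it away.

```matlab
% Sketch: reduce image dimensions by cropping instead of scaling, so that
% small defects keep their pixel-level detail. File name and sizes are assumed.
img = imread('hexnut01.png');                   % hypothetical inspection image

scaled = imresize(img, [64 64]);                % scaling down blurs the small flaws

win = 64;                                       % crop window size (assumed)
[r, c, ~] = size(img);
r0 = floor((r - win)/2) + 1;                    % top-left corner of a center crop
c0 = floor((c - win)/2) + 1;
cropped = img(r0:r0+win-1, c0:c0+win-1, :);     % same detail, far fewer pixels
```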
There are a lot of different ways to reduce dimensionality, like removing noise or removing trends from the data, but however you do it, the key takeaway is that we want to remove any components from our dataset that we know aren’t important, so that the network doesn’t have to learn that they aren’t important.
Alright, the last benefit of data preprocessing that I want to talk about is using it to cover a larger portion of the solution space. A network can only learn from the training data you give it. And so, if you want a network that can recognize the number 4 in various different handwriting styles, it makes sense that you need several examples of the different ways that people write the number 4. However, writing style isn’t the only variable that can affect a good classification. Sometimes the number might be written rotated in one way or another, or maybe it’s a little bigger or smaller. And the problem is that a network trained on all of these 4’s would fail to classify these rotated and scaled 4’s, since it wasn’t trained on them.
Now, rather than collecting real labeled data for every variation, we can simply duplicate the training data and preprocess it by rotating and scaling across the entire solution space. Or, on the flip side we may take unlabeled data, which could be scaled and rotated in any way, and then preprocess it to put it into an orientation and size that the network has been trained for. So, in these examples we are able to manipulate the data to cover a larger solution space.
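As a sketch of what this could look like in MATLAB, here I set up random rotation and scaling as an on-the-fly augmentation. This assumes the Deep Learning Toolbox, and the folder name, rotation range, scale range, and 28-by-28 image size are all illustrative choices, not values from a specific example.

```matlab
% Sketch: cover more of the solution space by generating rotated and scaled
% variations of the training images. Folder name and ranges are assumptions.
imds = imageDatastore('digits/', 'IncludeSubfolders', true, ...
                      'LabelSource', 'foldernames');   % hypothetical labeled digit images
aug  = imageDataAugmenter( ...
    'RandRotation', [-20 20], ...                      % random rotation, in degrees
    'RandScale',    [0.8 1.2]);                        % random shrinking or enlarging
augimds = augmentedImageDatastore([28 28], imds, 'DataAugmentation', aug);
% augimds hands the network a differently rotated and scaled copy of each
% image every epoch, so we never have to store the duplicates ourselves.
```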
Ok, that’s a quick overview of data preprocessing. The main takeaway here is that you want to use some of your expertise and knowledge to change the raw data in a way that makes learning possible, faster, and more accurate.
To give you a sense of what this could look like in one case, I want to talk about the preprocessing that is done on the audio data within the MATLAB example, Speech Command Recognition Using Deep Learning.
This example shows you how to train a convolutional neural network to recognize a given set of verbal commands. You can go through the whole thing yourself if you want to see all of the details, but what I want to show you is how much preprocessing is done in order to prepare the audio signal for training. The convolutional neural network is looking for patterns in an image, so we have to transform this audio signal into a 2D image that contains recognizable patterns that can be used to distinguish between each of the possible verbal commands.
This particular image is called a spectrogram. And I think it’s worth explaining how this image is created in some detail because I think it nicely demonstrates what you might think about when you are preprocessing your own data.
Alright, what I have here is an audio waveform of me saying the word ALLOW. It’s recorded at 44.1 kHz and is about 0.8 seconds long. The first thing we want to do is make sure that this audio segment is the same length as all of the other audio that we may want to classify with our network. Since some words may take longer than 0.8 seconds to say, I decided to extend this out to exactly 1 second. And I did that by padding the beginning and end of the signal with zeros.
If we take the FFT of this signal, we can get a sense of where the most frequency content is. Since this was recorded at 44.1 kHz, there is frequency information up to about 22 kHz, but as you can see there isn’t a whole lot of information in the higher frequencies. So, I’m going to resample the audio signal at 16 kHz, which will capture frequencies up to 8 kHz and won’t cause any major loss of quality.
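Here is a rough sketch of that padding and resampling in MATLAB. The file name is a placeholder, and I’m assuming a mono recording made at 44.1 kHz like the one described above.

```matlab
% Sketch of the audio preprocessing above: pad the clip out to exactly one
% second, then resample from 44.1 kHz down to 16 kHz. File name is assumed.
[x, fs] = audioread('allow.wav');        % hypothetical 0.8 s recording, fs = 44100
x = mean(x, 2);                          % fold stereo down to mono, just in case

targetLen = fs;                          % exactly 1 second at the original rate
padTotal  = max(targetLen - numel(x), 0);
x = [zeros(floor(padTotal/2), 1); x; zeros(ceil(padTotal/2), 1)];   % pad both ends

X = abs(fft(x));                         % quick check: little energy in the high frequencies

targetFs = 16e3;
[p, q] = rat(targetFs / fs);             % integer ratio for resample (160/441)
x  = resample(x, p, q);                  % now 16000 samples spanning 1 second
fs = targetFs;
```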
Alright, this is the frequency content for the entire signal, which isn’t quite what we want. As we speak, the frequency content changes based on the sounds and syllables we’re saying, and we want to be able to pick out those individual sounds in words, so we need to see how the frequency content changes over time. And we can do that with a short-time Fourier transform.
We start by selecting a window size that is smaller than the full signal, and then running an FFT just on that subset of data. I’m choosing a window that is a little over 180 milliseconds and you can see what that bit of audio looks like here.
Now, there is a problem with running an FFT on this audio segment exactly as it is, and that’s that the FFT is expecting the signal to be repeating. And if I line a few of them up one after another, you’ll see that there is this discontinuity that we’ve created. This jump is going to artificially add a bunch of high frequency content and make our spectrogram much noisier than it should be.
So, to fix this we apply a window function. There are a bunch of different window functions, but in this example I’m using a Hann function. The details don’t really matter because the general idea behind all window functions is mostly the same. They start and end at zero and have some kind of scaling in between. And since the window function starts and ends at zero, when I multiply it with the audio segment it guarantees that the resulting signal also starts and ends at zero, which means there won’t be that discontinuity when it repeats. Now, this scaling fixes the discontinuity, but we do lose some information near the edges of the window. As I’ll show you shortly, that is why the windows overlap.
Ok, now we can take the FFT of this scaled signal, the red line, and we can see that there isn’t a whole lot of content, which we knew because the word hasn’t started yet. But if we scroll back up, we can move on to the next window, which I’m overlapping with the previous one by 50%. Once again we take the time data and scale it with the Hann function. Notice that with the overlapping windows, the part of the signal that was lost in the first window is present in the second. So, we are capturing that information.
Alright, we keep this up by hopping the window across the entire signal, applying the window function, and then taking the FFT. I’m just showing the first 4 windows here so you get the idea, but if we go all the way across the whole signal what we’re left with is the frequency content by window. But here, once again we have more information than we actually need. Each FFT produces a spectrum with thousands of values and we don’t need that level of granularity.
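Just to make those steps explicit, here is a hand-rolled version of that hopping-window process in MATLAB. I’m reusing the 16 kHz signal x from the earlier sketch, and the roughly 180 millisecond window length comes from what I described above; everything else is an arbitrary illustrative choice. In practice, the spectrogram or stft functions in the Signal Processing Toolbox do this windowing and hopping for you.

```matlab
% Hand-rolled short-time FFT: hop a Hann-weighted window across the signal
% with 50% overlap and take an FFT of each frame. Assumes x and fs from above.
winLen = round(0.18 * fs);               % samples per window (~180 ms)
hop    = winLen / 2;                     % 50% overlap between windows
win    = hann(winLen, 'periodic');       % Hann window function

numFrames = floor((numel(x) - winLen) / hop) + 1;
S = zeros(winLen/2 + 1, numFrames);      % one-sided magnitude spectrum per frame
for k = 1:numFrames
    s      = (k-1)*hop + 1;              % first sample of this window
    frame  = x(s : s+winLen-1) .* win;   % taper to zero at the edges: no discontinuity
    F      = fft(frame);
    S(:,k) = abs(F(1:winLen/2 + 1));     % keep only the one-sided spectrum
end
```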
One common way to reduce the amount of information here is by splitting up the spectrum into a number of bins, and then scaling and summing the frequencies in each bin with a mel filter bank - which is a set of triangular bandpass filters that are spaced closer together at the lower frequencies and then gradually get wider and further apart as frequency increases. Essentially, this models the sensitivity of the human ear, which is more sensitive to lower frequencies than higher ones. So, we are basically just capturing finer resolution at the lower frequencies. Also, with the triangular filters, the frequency content in the middle of each bin is weighted more heavily in the summation than the frequencies near the edge of the bin. To capture the information near the edges, once again we overlap the bins.
What we end up with after all of this binning and filtering and summing is something that looks like this. One value per bin that represents the frequency content for that small bit of spectrum. Here I’ve colored these squares based on the magnitude of the frequency, but they all look black since there isn’t much information here. But if we apply this binning and scaling to each of our windows you can start to see that there is some interesting content here and that this frequency content changes from one window to the next.
And now that we have all of this information by bin and window, the last thing to do is to put all of this into an image. The first window is placed on the left side of the image with the lowest frequency bin on the bottom. And then we put the next window beside it, and the next, until we’ve gone across the entire signal and we’ve created a spectrogram. Which is pretty cool right? It’s kind of neat that we can create an image of audio signals like this.
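And here is a sketch of those last two steps, the mel binning and the image assembly, continuing from the matrix S of one-sided FFT magnitudes in the earlier sketch. I’m assuming the Audio Toolbox function designAuditoryFilterBank here, and the 40 bands are an arbitrary illustrative choice rather than the value used in the shipped example. If you would rather not build this by hand, the melSpectrogram function in the same toolbox goes from the raw audio to this kind of image in a single call.

```matlab
% Sketch: sum each window's spectrum into mel-spaced triangular bins and
% stack the columns into an image. Assumes S, fs, and winLen from above.
numBands = 40;                                             % illustrative choice
fb = designAuditoryFilterBank(fs, 'FFTLength', winLen, ...
         'NumBands', numBands, 'FrequencyScale', 'mel');   % triangular, mel-spaced filters

melSpec = fb * S;                        % numBands-by-numFrames grid of binned energy
melSpec = log10(melSpec + 1e-6);         % compress the dynamic range for display

imagesc(melSpec)                         % one column per window, one row per band
axis xy                                  % lowest frequency bin at the bottom
xlabel('Window'); ylabel('Mel band')
```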
Ok, so, hopefully you can see when we go back to the MATLAB example, how this blue image represents the frequency content of this audio signal over time.
But more than that, hopefully, you can see obvious patterns in this spectrogram - much more so than you can in the waveform. And they’re so unique in fact, that I bet that even you could use patterns like these to determine what word is being said. And to prove it, check out this spectrogram of me saying allow, aloe, alloy, and ally.
Despite how close these words are to each other, the patterns they make in this spectrogram all have defining features that make them different from each other. If I say one of these words a second time, you could probably determine what I said from just the spectrogram alone. And this is exactly the uniqueness in patterns that a deep learning algorithm and a convolutional neural network could thrive on.
Alright, that’s where I’ll leave it for now. If you want to learn more, I’ve left a bunch of links in the description to different MATLAB tools that help with feature extraction and data preprocessing, and several examples that show these tools in action on images, audio, and other signals.
In the next video, I want to talk about how we can build off of an existing pre-trained network with transfer learning. So, if you don’t want to miss that or any other Tech Talk video, don’t forget to subscribe to this channel. Also, if you want to check out my channel, Control System Lectures, I cover more control topics there as well. Thanks for watching, and I’ll see you next time.