Deep Learning for Computer Vision
Overview
While deep learning can achieve state-of-the-art accuracy for object recognition and object detection, it can be difficult to train, evaluate and compare deep learning models. Deep learning also requires a significant amount of data and computational resources.
In this webinar, we will explore how MATLAB® addresses the most common deep learning challenges and gain insight into the procedure for training accurate deep learning models. We will cover new capabilities for deep learning and computer vision for object recognition and object detection.
Highlights
We will use real-world examples to demonstrate:
- Accessing and managing large sets of images
- Using visualization to gain insight into the training process
- Leveraging pretrained networks to perform new recognition tasks using transfer learning
- Speeding up the training process using GPUs and Parallel Computing Toolbox™
About the Presenter
Johanna Pingel joined the MathWorks team in 2013, specializing in Image Processing and Computer Vision applications with MATLAB. She has an M.S. degree from Rensselaer Polytechnic Institute and a B.A. degree from Carnegie Mellon University. She has been working in the Computer Vision application space for over 5 years, with a focus on object detection and tracking.
Recorded: 2 Aug 2017
Hello, my name is Johanna, here with Gabriel, and we're going to talk about deep learning for computer vision. We've got some great new demos and capabilities to show you. So let's get started.
Yeah, so we'll start off by setting some context. We've got other deep learning videos up on our website which are much shorter than this webinar, and you should definitely watch them as well. But the main thing is that we'll be going into much more depth in this webinar compared to those other videos. We're talking about deep learning for computer vision. What is deep learning? It's a type of machine learning that learns features and tasks directly from data, which could be images, text, or sounds.
Since we're discussing computer vision, we'll naturally be looking at image data. But just keep in mind that deep learning applies to many other tasks that don't deal with images.
Right. So let's look at a quick workflow of how deep learning works. Let's say we have a set of images where each image contains one of four different kinds of objects. And we want something that can automatically recognize which object is in each image. We start with labeled images, which just means that we tell the deep learning algorithm what each image contains. And with that information, it starts to understand each object's specific features and associate them with the corresponding category.
You'll note that the task is learned directly from the data, which also means that we don't have any influence over what features are being learned. You might hear this being referred to as end-to-end learning, but in any case, just keep in mind that deep learning learns features directly from the data.
So that's the basic workflow of deep learning. While the concept of deep learning has been around for a while, it's become much more popular in recent times due to techniques that have massively improved the accuracy of these classifiers, to the point where they outperform people in classifying images. There are also several factors that enable deep learning, including large sets of labeled data, powerful GPUs to speed up training, and the ability to use other people's work as a starting point for training your own deep neural network, which we will talk about later.
Yes, we will. So right before we dive into things, we want to give you some background and framing for why we're doing this webinar. Deep learning is difficult. It's cutting-edge technology, and it can get complicated, whether you're dealing with network architectures, figuring out how to train an accurate model, or incorporating thousands of training images.
Yeah, not to mention everyone's favorite task—trying to figure out why something isn't working.
We want MATLAB to make deep learning easy and accessible to everyone. In this webinar, along with other resources on our website, we'll explain how you can quickly get started with deep learning using MATLAB. The examples in our webinar will also demonstrate how to handle large sets of images, easily integrate GPUs to train deep learning models faster, understand what's happening inside a model as it's training, and build on models from experts in the field so you don't have to start from scratch. And with that, let's get into it.
Yeah. Let's do it. So we're going to cover three examples of deep learning—image classification using a pretrained network, transfer learning to classify new objects, and object detection in images and video. So first up is image classification using a pretrained network. So I have an image here of peppers that I want to be able to classify. And believe it or not, I can do it with MATLAB in four basic lines of code.
One, import a pretrained model. Two, bring in the image. Three, resize the image. And four, classify the image.
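In code, those four lines might look like this (a minimal sketch; peppers.png is one of the sample images that ships with MATLAB):

```matlab
net = alexnet;                      % 1. import a pretrained model
img = imread('peppers.png');        % 2. bring in the image
img = imresize(img, [227 227]);     % 3. resize the image to the network's input size
label = classify(net, img)          % 4. classify the image
```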
Nice.
So that's it.
Pretty cool.
All right, so moving on to the second demo—
He's kidding.
Yeah, I'm kidding. So we'll talk about what's going on here.
So what's this AlexNet in the first line of code? Who is Alex and why are we using his net?
So to directly answer your question, AlexNet is a convolutional neural network designed by various people, including one Alex Krizhevsky. But I should probably provide some context. So there's this independent project not related to MATLAB that's been around for a while called the ImageNet project. And its goal is to have a massive repository of visual content, like images, for people to use to do research and design in visual object recognition.
So starting in 2010, they ran an annual competition called the ImageNet Large Scale Visual Recognition Challenge.
Oh, yeah. The old ILSVRC.
Yeah, that competition. So competitors submit software programs which compete to correctly classify and detect objects in the ImageNet dataset. Now, up until 2012, the standard way to implement computer vision was through a process called feature engineering, as opposed to AlexNet, which used and improved on methods based on deep learning. So as you can probably guess, AlexNet was submitted to the 2012 ILSVRC under the team name Supervision, one word. And it blew the competition out of the water, which I guess could refer to both the competitors and the competition itself.
And there was a lot of hype around it because people were realizing deep learning's not just theoretical. It's really practical and it does things way better than what we've been doing before. So history lesson aside, AlexNet is trained to recognize exactly 1,000 different objects, which I'm guessing had something to do with the victory conditions of the ILSVRC 2012. It's one of several pretrained networks you can access from MATLAB, which also include VGG-16 and VGG-19.
Do we have a history lesson for that?
I will not go into a history lesson for those. So let's bring it back to our four lines of code. First, check out how MATLAB makes it dead easy to import a pretrained model. Like, it doesn't get easier than that. If you don't have AlexNet on your computer, you just need to download it once, either through the Add-On Manager or using the link in the error message you'll see if you run the code without it. And then you can use it for this demo and for anything else you want.
So in the second line, you're bringing in the image. That seems pretty straightforward. But why did you resize the image? So the first time I did this, I tried to be all clever and do it in three lines of code.
Without the resizing?
Yeah. And I got this error, which mentioned something about size, which means, yay, I get to figure out why it's not working.
Everyone's favorite thing to do.
So if I look at net.Layers, it'll show me the architecture of the network. And it looks intimidating at first, but the first layer, the input layer, has a size of 227 by 227 pixels. The ×3 at the end is for the RGB channels, since this is a color photo. So seeing that, I'm like, oh, OK. Just use MATLAB to resize the image so it doesn't error out when it passes to the network. And our final line of code can now classify the image.
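For reference, here's a minimal sketch of that inspection; the input layer's InputSize property is what tells you the required dimensions:

```matlab
net.Layers                              % display the full architecture
inputSize = net.Layers(1).InputSize;    % [227 227 3] for AlexNet
img = imresize(img, inputSize(1:2));    % resize height and width to match
label = classify(net, img);
```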
So you mentioned earlier that AlexNet is a convolutional neural network. What does that mean, and can I please call it CNN for short?
I mean, as long as viewers don't confuse this webinar with a certain cable news network—Cable News—oh. That's what CNN stands for, doesn't it? Well, in addition to CNN being a self-referential cable news network, it's a popular architecture in deep learning for image and computer vision problems. And independent of AlexNet, the three main things to understand about CNNs are convolution, activation, and pooling.
Convolution is a mathematical operation which you might remember from whatever college course introduced you to Fourier and Laplace transforms, for better or for worse. The idea is we put our input images through multiple transformations, and each of them extracts certain features from the image. Activation applies a transformation to the output of the convolution. One popular activation function is ReLU, or ReLU (tomato, tomahto), which simply maps negative values to zero and passes positive values through unchanged. And finally, pooling is a process where we simplify the output by carrying only one value from each region forward to the next layer, which helps reduce the number of parameters that the model needs to learn.
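As a quick illustration of the activation step, ReLU is just max(x, 0) applied elementwise:

```matlab
relu = @(x) max(x, 0);    % negatives become zero, positives pass through
relu([-2 -0.5 0 1 3])     % returns [0 0 0 1 3]
```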
So these three steps are repeated to form the entire CNN architecture, which can have tens or hundreds of layers, each of which learns to detect different features. So one neat thing about MATLAB is that it enables you to look at the feature maps. So if you compare features closer to the initial layer versus features closer to the final layer, they get more and more complex, going from colors and edges to something that seems more detailed.
Let's take a look again at the layers of AlexNet. You can see the convolutions, activations, and pooling. Other networks will have a different configuration of these layers, but at the very end, they'll all have a final layer which performs the classification. With a few more lines of code, we can repeatedly display an image along with what AlexNet thinks it is. Sometimes it gets it right, sometimes it doesn't. But it's pretty good, as long as the object was in the original set of 1,000.
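Those few lines might look something like this, assuming a folder of test images (the folder name here is hypothetical):

```matlab
files = dir(fullfile('testImages', '*.jpg'));    % hypothetical folder of images
for k = 1:numel(files)
    img = imread(fullfile('testImages', files(k).name));
    label = classify(net, imresize(img, [227 227]));
    imshow(img); title(char(label));             % show the image and its prediction
    pause(1);
end
```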
Which begs the question, what can you do if it's not?
Well, allow me to answer that by saying, that was image classification using a pretrained model. Let's move on to our second demo.
All right. In the next demo, we have video of cars driving down a highway. And we want to be able to classify these as cars, trucks, or SUVs. We're going to use AlexNet and fine-tune the network for just our categories of objects, a process called transfer learning, which can be used to classify objects not in the original network.
And there's our answer to the previous question. Quick follow-up for you. So if you had a classification task where your objects happened to be one of the 1,000, is there any reason you wouldn't just use AlexNet?
Good question. The main benefit to transfer learning in that case is to have a classifier specific to your data. If you train on fewer categories, you can potentially improve the accuracy.
Makes sense.
So I took this video from my cell phone, and I was able to automatically bring it into MATLAB using an IP webcam. That setup allowed me to record hours of video of cars traveling outside the office window. Now, using MATLAB and computer vision, I'm able to extract the cars from each frame of video based on their motion, using a relatively simple process called background subtraction.
And that's just a matter of looking at the pixel difference between two consecutive images and pulling out the stuff that's different enough.
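One way to sketch that in MATLAB is with the Computer Vision System Toolbox's foreground detector, a Gaussian-mixture flavor of background subtraction (the video file name and blob-size threshold are assumptions):

```matlab
detector = vision.ForegroundDetector('NumTrainingFrames', 50);
blobber  = vision.BlobAnalysis('AreaOutputPort', false, ...
    'CentroidOutputPort', false, 'BoundingBoxOutputPort', true, ...
    'MinimumBlobArea', 1000);                   % ignore small motion blobs
reader = VideoReader('highwayTraffic.mp4');     % hypothetical file name
while hasFrame(reader)
    frame  = readFrame(reader);
    mask   = detector(frame);                   % foreground (moving) pixels
    bboxes = blobber(mask);                     % boxes around each moving vehicle
end
```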
Now, as vehicles are passing by, we want to classify them as either a car, truck, or SUV. And that's not what AlexNet thinks we're looking at. So if our current model doesn't work on our data, we need a new model. So let's say we want to classify five different kinds of vehicles—cars, trucks, large trucks, SUVs, and vans. Our plan is to use AlexNet as a starting point and use transfer learning to create a model specific to these five categories.
So for what reason would you use transfer learning as opposed to, say, train a network from scratch?
So training from scratch is definitely something you can try. And we give you all the tools in MATLAB to do this. But there are a few very practical reasons to do transfer learning instead. For example, you don't have to set up the network architecture by yourself, which requires a lot of trial and error to find a good combination of layers. Also, transfer learning doesn't require nearly as many images to build an accurate model compared to training from scratch. And finally, you can leverage knowledge and expertise from top researchers in the deep learning field who have spent much more time training models than we have.
Sounds good.
So here are five folders containing lots of images of our five categories. We want a simple way to bring in this data to pass to our deep learning algorithm. Earlier, Gabriel used imread as a way to bring in the image of peppers. But we don't want to have to do this for every image. Instead, I'm going to use a function called imageDatastore, which is an efficient way of bringing in data.
And we should note that there are many different kinds of datastores within MATLAB for different big data and data analytics tasks. So it's not just for images. If you have lots of data, datastores are your friend.
So once I point imageDatastore at my folders, it's going to automatically label all my data based on the names of the folders containing the images. So there's no need to do it one by one. Once I've done that, I have access to useful functionality, like seeing how many images I have for each category, and being able to quickly split my images into a training set and a test set.
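A sketch of that setup (the parent folder name is an assumption):

```matlab
imds = imageDatastore('vehicleImages', ...    % parent folder of the five folders
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
countEachLabel(imds)                          % how many images per category
[trainImds, testImds] = splitEachLabel(imds, 0.8, 'randomized');
```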
If you need to, you can also specify a custom read function. imageDatastore uses imread by default to read in all the images, which is great for standard image formats. But if you happen to have non-standard image formats that imread doesn't know how to handle, you just write your own function, pass it into imageDatastore, and then you're good to go.
And even if you do have standard image formats, you can make a custom read function that does image preprocessing, like resizing, sharpening, or denoising. In our case, using AlexNet, we need to resize them to 227 by 227. So we use this custom read function here.
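Here's a hedged sketch of such a read function, padding to a square before resizing so the aspect ratio is preserved (one plausible version, not necessarily the webinar's exact code):

```matlab
imds.ReadFcn = @readAndPad;    % plug the custom reader into the datastore

function img = readAndPad(filename)
    img = imread(filename);
    [h, w, ~] = size(img);
    side = max(h, w);
    % pad to a square so the resize below doesn't distort the vehicle
    img = padarray(img, [side - h, side - w], 0, 'post');
    img = imresize(img, [227 227]);    % AlexNet's input size
end
```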
So I notice that you're not doing a straight-up resize. It looks like you're padding the image. What's the reason for that?
So this was just from personal experience. I tried resizing the images, and the network wasn't doing very well. And when I looked at the images myself, I couldn't tell the difference between cars and SUVs. So I did something that has the same effect as cropping the image while maintaining the aspect ratio. And since that helps maintain the structural differences, I figured it might help the network. So earlier you saw that AlexNet does a poor job of classifying our cars and trucks on their own. So we need to fine-tune the network.
If we look at the layers, you can see the final fully connected layer representing the 1,000 categories that AlexNet was trained on. To perform transfer learning, we replace the 1,000 with five for our five categories of objects. And then this line resets the classification, which means forget about those names of the 1,000 objects you learned. You only care about these five new ones.
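In code, that surgery is only a couple of lines (the layer indices here assume AlexNet's 25-layer architecture):

```matlab
layers = net.Layers;
layers(23) = fullyConnectedLayer(5);    % 1,000 ImageNet classes become our 5 vehicle classes
layers(25) = classificationLayer;       % reset the output layer, forgetting the old class names
```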
And this is the only core change you need to make?
Yep. That's all the network manipulation you need to do. If you ran this, you would get a classifier which would output one of those five objects.
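If you did run it, the training call might look like this (the solver settings are assumptions, not the webinar's exact values):

```matlab
opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...   % small rate so pretrained weights change slowly
    'MaxEpochs', 20, ...
    'MiniBatchSize', 64);
vehicleNet = trainNetwork(trainImds, layers, opts);

% Evaluate on the held-out test set
predicted = classify(vehicleNet, testImds);
accuracy  = mean(predicted == testImds.Labels)
```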
So I guess the question is, how well does it do?
So we trained this network beforehand, and it actually got really good results, like 97% accuracy.
That's pretty impressive for, like, two minor changes to the code.
But let's be honest, you might not get to that point right away. Remember that AlexNet was trained on millions of images, including some vehicles. So it's reasonable to assume that it happened to transfer over very smoothly to our data. But if you were to do transfer learning on other images that are very different from the original set, you might have to make some more changes.
Makes sense. So what are some things people can try if they find themselves with subpar accuracy?
There are a lot of things that you can try. And we'll go into rapid-fire mode. You can follow along with this slide. First of all, there are some things you can do before you even start changing parameters. Check your data. I can't emphasize this enough. Initially, my trained model was misclassifying a lot of images. And I realized some of my data was in the wrong folders. Obviously, if your setup isn't accurate, whether it's wrong folders or bad training data, you're not going to get very far.
Next, try getting more data. Sometimes the classifier needs more images to understand the problem better. And finally, try a different network. We're working with AlexNet, but as we mentioned, there are other networks that are available to you. And it's possible that a different CNN may offer better results.
Sounds good. So let's say I'm pretty sure I have my setup correct. What can I do now?
So now it's a matter of altering the network and the training process. Let's start with the network. Changing the network means adding, removing, or modifying layers. You could add another fully connected layer to the network, which increases its non-linearity and, depending on the data, could help increase accuracy. You can also modify the learning rates of your new layers so that they learn faster than the earlier, original layers of the network. This is useful if you want to preserve the rich features the network previously learned from its original data.
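For example, one common pattern is to give the new fully connected layer larger learn-rate multipliers than the original layers:

```matlab
layers(23) = fullyConnectedLayer(5, ...
    'WeightLearnRateFactor', 10, ...    % the new layer learns 10x faster
    'BiasLearnRateFactor', 10);         % than the pretrained layers
```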
As for changing the training process, it's a matter of changing training options. You can try more epochs, fewer epochs, and other options as well, for which you can find documentation on our website.
So is it fair for me to say this: all the options seem to be, like, you treat the network like a black box. If you train it and it's not very good, then you throw one of these modifications at it, tell it to start training, wait out the full training time, and then you find out if it actually made things better or worse. So is there anything we can do, say, in the middle of the process?
Absolutely. We have a set of output functions that can show us what's happening in the network as it's training. The first one plots the accuracy of the network as it trains. Ideally, you want to see the accuracy trend upward over time. And if that's not what you see, you can stop the training and try to fix it before you potentially waste hours training on something that isn't improving. You can also stop the training early, based on certain conditions. Here I'm telling the network to stop if I reach an accuracy of 99.5%.
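A sketch of such an output function; returning true from it halts training, and the 99.5% threshold matches the one just mentioned:

```matlab
opts = trainingOptions('sgdm', 'OutputFcn', @stopAtAccuracy);

function stop = stopAtAccuracy(info)
    % Called by trainNetwork at every iteration; returning true stops training
    stop = strcmp(info.State, 'iteration') && ...
        ~isempty(info.TrainingAccuracy) && info.TrainingAccuracy >= 99.5;
end
```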
And I'm guessing that's so you don't overtrain/overfit the network.
Yep. We also have the concept of checkpoints. You can stop the network training at a specific point, see how well it does on a test set, and then if you decide it needs more training, you don't have to start from the beginning. You can just pick up the training where you left off. And as you might expect, there is documentation on our website for our many different training options. If you take a look here, you can see the options I just outlined—plot training accuracy, and here, stopping at a specified accuracy. So definitely try out these examples.
Yes, please. Copy-paste this code. There are people out there who are like, never copy-paste code you find on the internet. And I get what they mean: don't blindly copy stuff and expect it to just work. But seriously, guys, let he who is without copy-pasted internet code cast the first error message.
You should definitely copy our code. It's nice not having to write out all that code yourself, and to have some great starting points for getting better control over the training process.
So let's say that I'm really hardcore about getting my network fine tuned and I want to remove the black box aspect of the network as much as possible. So I imagine you probably can't directly see what the network sees. But how can we start getting a more intimate understanding of our network?
One thing you can do is visualize what the network is finding as features in our images. We can look at the filters and we can look at the result of an image after those filters have been applied. In the first convolution, we see we're extracting out edges, dark and light patterns. They might be very apparent, or not so much. And it all depends on how strong those features are in the image.
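One way to do that in code, using the activations function on AlexNet's first convolutional layer and tiling the result one channel per image:

```matlab
act1 = activations(net, img, 'conv1', 'OutputAs', 'channels');
sz   = size(act1);
act1 = reshape(act1, [sz(1) sz(2) 1 sz(3)]);    % one grayscale image per channel
montage(mat2gray(act1))                         % view all the channels at once
```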
So you can do this with any layer of your network?
Yep. Let's take a look at another one. The output of the fourth convolution for this image produces more abstract, but still interesting, features. You could make the assumption that this particular channel is finding the wheels and the bumper of the car as features. To test our theory, let's try out another image where the back wheel isn't visible on the left side of the image. If our assumption is correct, then the output of this channel shouldn't activate as much on the left side of the image. And that's what we're seeing.
Nice. So if any of you want to debug your network, this technique gives you a visual representation of what your network sees and might help you get a better understanding of what's going on.
Yes. And all the code is in the documentation. The example on the website goes through finding features in a face, but it's the same concept. We'll look at one more tool that you might find useful called deep dream. Deep dream can be used to make very interesting, artsy images that you might have seen online. But it's another tool we can use to understand the network. Deep dream is going to output an image representing the features it has learned throughout the training process.
So one way of understanding this is to say, instead of giving the network an image and having it connect to a class, let's go in reverse. We give the network a class and we have it give us an image. So why is this helpful?
So let's look at the documentation. Neural Network Toolbox has a great page on deep learning. One of the concepts here is deep dream, and there's an example of using AlexNet with deep dream. We can see here I'm asking for a hen, one of the categories AlexNet was trained on. And deep dream gives me a somewhat abstract version of what a hen looks like to it. And we can create deep dream images for any of the categories in our network.
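A sketch of that call; layer 23 is AlexNet's final fully connected layer, and the iteration count is an assumption:

```matlab
classNames = net.Layers(end).ClassNames;        % the 1,000 ImageNet categories
henChannel = find(strcmp(classNames, 'hen'));   % channel index for 'hen'
I = deepDreamImage(net, 23, henChannel, 'NumIterations', 50);
imshow(I)
```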
So if we were to see something that doesn't look like the category, we can assume our network might not be learning our categories correctly.
Yes, it might be an issue with the training data. Let me give you an example. In AlexNet's original 1,000 categories, it has a squirrel category. And I happen to have a bunch of pictures of squirrels, so we can try them out on our network. We see all the predictions are correct, except this one. If we look at deep dream for squirrel, what do we see? And how about for hare, which it was mistaken for? There are some vibrant colors that correspond well to the first few images we tried out. You can see features associated with the tail. And these are strong features that this one image doesn't have.
And from that I guess we could add more test images that contain those types of features or lack thereof to our network.
So now you have enough to get started with deep learning, and more specifically, transfer learning. But we're not completely done with our example. Remember that video we showed a while back of cars driving down the road, which we couldn't classify with AlexNet? That's why we went through all the trouble to create our own custom model. Using the same algorithm as before to detect the cars in each frame, I can now classify them using our model. And we can see what our model thinks they are and the confidence of that prediction.
Very nice.
So that was getting started with transfer learning and a lot of tips and tricks for understanding your network and making improvements. And we hope you've seen how MATLAB makes it easy to handle large sets of images, access models from experts in the field, visualize and debug the network, and accelerate deep learning with GPUs.
Wait, you totally didn't cover that last one.
Ah, so you were paying attention.
Yes, I was.
Yeah, we didn't explicitly cover it. But if you look carefully at the training clips, the output messages indicated that we were training on a single GPU: an NVIDIA® GPU with compute capability 3.0, which is the minimum requirement for using a GPU for deep learning. And the beauty of GPU computing with MATLAB is it's all handled behind the scenes. You, as a user, don't have to worry about it. MATLAB uses a GPU by default if you have one, and none of the functions change whether you're using a GPU, a cluster of GPUs, GPUs in the cloud, or even a CPU.
Can you use, like, a CPU for training? I like how you went from big, bigger, biggest, and then shrank down to bare bones computation.
Yes, technically you can use a CPU. But take a look at this time-lapse video of trying to train the same deep learning algorithm on a CPU versus a GPU.
Wow. That's very unimpressive.
Yeah. And all of this applies to any part of the deep learning process, whether training, testing, or visualizing a network. So if a CPU is your only option, then go for it. But we encourage you to use a GPU for training, or at least to make sure you go for a long coffee break while training models.
All right. So for our final demo, we'll talk about a somewhat more challenging problem that's often been brought to our attention. Take a look at this image here. If we present it to our network, what will it think it is? In any case, up till now, we've only shown examples of classifying the entire image into one category. But in this image, clearly there's multiple kinds of vehicles in multiple locations. And the network we trained isn't able to tell us that.
So this classic problem is called object detection, or locating objects in a scene. So in this example, we're looking at the backs of several vehicles. And our goal is to detect them. So we need to create an object detector that recognizes the objects we care about. Now, how should we go about doing that?
Well, the theme of this webinar has been deep learning, so how about deep learning?
Fantastic. So if we're going to train a vehicle detector to recognize cars from behind, it'll need lots of images for training. Now, the issue is our image data hasn't been cropped down to the individual cars, which means at first glance we'll have to go through the tedious task of cropping and labeling all of our images from scratch. How long is this webinar supposed to be?
30 minutes or less.
I don't think we can do that. Unless we have MATLAB. Yay. I'm sorry. So MATLAB has built-in apps to help you with this process. For one, you can quickly go through all your data and draw bounding boxes around the objects in the scene. Now, even though that's better than manual cropping, you don't want to have to do that 100 or 1,000 times. So if you have a video or an image sequence, MATLAB can automate the process of labeling objects in the scene.
In the first frame of the video, I specify where the object is. And now MATLAB will track it throughout the entire video. And just like that, I have hundreds of new labeled backs of cars without having to do it 100 times. So now we have all of our images with the bounding box of the object we care about. And again, for real world and robust solutions, you'll need thousands or millions of examples of the objects. So imagine trying to do that manually without the app.
Back to deep learning. We're going to use a CNN to train the object detector. We could totally import a pretrained CNN like we did before, and that'll totally work. But to show you guys something new, we're going to create a CNN architecture from scratch. So we won't type out everything in real time, but creating a CNN from scratch in MATLAB is just a matter of convolution, activation, and pooling layers—three things you talked about before.
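Something along these lines, where the input size and filter counts are illustrative assumptions:

```matlab
layers = [
    imageInputLayer([32 32 3])                  % small training patches
    convolution2dLayer(3, 32, 'Padding', 1)     % convolution
    reluLayer                                   % activation
    maxPooling2dLayer(3, 'Stride', 2)           % pooling
    convolution2dLayer(3, 64, 'Padding', 1)     % ...and repeat
    reluLayer
    maxPooling2dLayer(3, 'Stride', 2)
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(2)                      % vehicle vs. background
    softmaxLayer
    classificationLayer];
```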
And that's what we have right here in sequence. You get to decide on the number of filters to use. And since we'll make all this code available to you, feel free to use it and get your feet wet with creating your own CNN from scratch. So now it's time to train our detector. With MATLAB's computer vision tools, we actually have a couple of object detectors you can choose from. And what's nice is that you can use the same training data for any one of them that you choose. So as you can see from this code, you can try out all of them very simply and see how they do.
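In sketch form, assuming trainingData is a table of image file names plus vehicle bounding boxes exported from the labeling app:

```matlab
% Each trainer accepts the same labeled table, network layers, and options
rcnn   = trainRCNNObjectDetector(trainingData, layers, opts);
fast   = trainFastRCNNObjectDetector(trainingData, layers, opts);
faster = trainFasterRCNNObjectDetector(trainingData, layers, opts);
```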
And we have documentation for these detectors, which will provide recommendations for which one to use in certain scenarios. So be sure to look at that if you plan to utilize object detection.
Yeah. So we've trained our detector. And we'll try it out on a sample image. You can see the results right here. Looks pretty good. But for a more impressive demo, let's try it out on a video. There it goes, as you can see, driving down the highway. And it's classifying all cars. It's pretty nifty. And for the advanced user, you have access to helper functions to get a better understanding of its performance.
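For reference, running a trained detector on a frame and overlaying the results takes only a couple of lines (the detector and frame variable names are assumptions):

```matlab
[bboxes, scores] = detect(rcnn, frame);    % returns boxes and confidence scores
annotated = insertObjectAnnotation(frame, 'rectangle', bboxes, scores);
imshow(annotated)
```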
So that's how MATLAB makes it easy to do object detection, from quickly labeling your data with built-in apps to training your algorithms with deep learning and other tools in computer vision. To wrap things up, keep in mind that, although we used a lot of vehicles in our examples, MATLAB and deep learning are not limited to classifying vehicles. So whether it's people's faces, dog breeds, or a giant squirrel collection, you can do it easily with MATLAB.
I want to quickly call out our support for solving regression problems with deep learning, which means instead of outputting a class or category, you can output a numeric value. We have some examples of this, where you can detect lane boundaries on the road. And for those of you tired of hearing about cars, we have one where we predict facial key points, which could be used to predict a person's facial expressions.
So today we saw some of the new things you can do with MATLAB and deep learning. And we hope you were able to clearly see how MATLAB makes the daunting task of deep learning much easier. So be sure to check out all the code used in our webinar and try it out on your own data.
And if you go to the Add-On Manager where you get our pretrained networks, you can find some other resources in the same place to get up and running with deep learning, including a video that shows how to use MATLAB to quickly classify objects with a webcam.
Check out our other resources on our website for getting started with deep learning, and feel free to email us with any questions at image-processing@mathworks.com.