Prototype and Adjust a Deep Learning Network on FPGA
Deep learning inferencing is so computationally intensive that using it in FPGA-based applications often requires significant customization of the hardware architecture and the deep learning network itself. Being able to iterate during early exploration is vital to converging on your goals during implementation.
Deep Learning HDL Toolbox™ works closely with Deep Learning Toolbox™ within MATLAB® to let deep learning practitioners and hardware designers prototype and make large-scale changes to meet system requirements. This video shows how to:
- Prototype a pretrained industrial defect detection network on an FPGA
- Analyze network performance running on the FPGA
- Adjust the network design and quickly prototype to see the results
- Quantize the network and parameters to int8 data types
Published: 2 Sep 2020
FPGAs are popular devices for deep learning inferencing in edge applications because programming custom hardware delivers speed and power efficiency. But implementing deep learning on fixed hardware resources is challenging.
If you look at a simple network like AlexNet, in just the first convolutional layer there are 96 filters of 11x11x3. Single-precision data is 32 bits, so that’s about 140 KB of RAM for the filter parameters. But as those filters stride across the image, each output position is a multiply-accumulate over the full filter volume, so that’s roughly 105 million multiply-accumulate operations. These multiply-accumulates map to specialized resources on FPGAs, but even high-end devices only have a few thousand of those. And those filter parameters really add up as you go deeper in the network.
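As a quick sanity check, here is that back-of-the-envelope arithmetic in MATLAB. The 55x55 output size assumes the standard 227x227 AlexNet input with a stride of 4:

```matlab
% Rough arithmetic for AlexNet's first convolution layer
numFilters    = 96;            % filters in conv1
filterVolume  = 11 * 11 * 3;   % 11x11 kernel across 3 input channels
bytesPerParam = 4;             % single precision, 32 bits

paramBytes = numFilters * filterVolume * bytesPerParam;    % ~139 KB of filter weights

outputPositions = 55 * 55;     % spatial positions for an 11x11 filter, stride 4, 227x227 input
numMACs = outputPositions * numFilters * filterVolume;     % ~105 million multiply-accumulates

fprintf('conv1 weights: %.0f KB, MACs: %.1f million\n', paramBytes/1e3, numMACs/1e6);
```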
Deep Learning HDL Toolbox comes with a generic deep learning processor that’s architected to time-multiplex convolution layers so you can target a wide variety of networks to an FPGA. But activations and parameters need to be shuttled into and out of memory, and both that data movement and the calculations themselves add latency.
To illustrate, this example inspects nuts and bolts for defects, classifying them with a trained AlexNet network. The pre-processing resizes the image and selects a region of interest, the network classifies whether the part is defective or ok, and the post-processing annotates the result.
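A minimal sketch of that pipeline in MATLAB might look like the following; the MAT-file, frame source, and region-of-interest values are placeholders rather than the shipped example’s code:

```matlab
load('trainedDefectNet.mat','net')        % placeholder MAT-file holding the retrained AlexNet

frame  = imread('nut_example.png');       % one camera frame (placeholder file name)
roi    = [100 100 226 226];               % assumed region of interest [x y width height]
imgROI = imresize(imcrop(frame, roi), net.Layers(1).InputSize(1:2));   % pre-process

label = classify(net, imgROI);            % defective vs. ok

annotated = insertText(frame, [10 10], string(label));   % post-process: annotate the result
imshow(annotated)
```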
Let’s say that this is inspecting a production line, so we need to use a high-speed camera processing at 120 frames per second, which leaves a latency budget of roughly 8.3 milliseconds per frame.
We can add about five lines of MATLAB code to prototype the network running on an FPGA from within our MATLAB algorithm.
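Those few lines look roughly like this; the board, interface, and bitstream names are assumptions based on the ZCU102 board used later in the video:

```matlab
hTarget = dlhdl.Target('Xilinx', 'Interface', 'Ethernet');        % connection to the board
hW = dlhdl.Workflow('Network', net, 'Bitstream', 'zcu102_single', 'Target', hTarget);
hW.compile;                               % compile: reports the memory needed for the parameters
hW.deploy;                                % program the FPGA and load the weights
[prediction, speed] = hW.predict(imgROI, 'Profile', 'on');        % run on hardware and profile latency
```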
Note that after compiling, the parameters require 140 megabytes; we will come back to that later. After deploying and running, the results look good, but the profile shows that there’s too much latency to keep up with the incoming frame rate.
If the system requirements are such that we can’t drop frames, we need to make some major changes, which is going to take some collaboration between the designers of the deep learning network and the processor.
The first thing we can look at is the network design. Each of those layers adds latency to the network. If, for instance, the deep learning engineer can get assurance from the system architect that all the images will be as simple as the ones in the test set, then they can remove some of the convolutional layers in the latter part of the network. This results in a new custom network almost half the size of the original AlexNet. We retrained this new network and confirmed only a small drop in accuracy.
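A hypothetical sketch of that trimming and retraining is shown below; the layer indices, the re-sized fully connected layers, and the training options are all assumptions, not the exact changes made in the video:

```matlab
layers = net.Layers;                        % Layer array of the AlexNet-based classifier
layers(12:15) = [];                         % drop conv4/relu4/conv5/relu5 (assumed indices)
layers(end-8) = fullyConnectedLayer(4096);  % reinitialize the first FC layer for the new feature size
layers(end-2) = fullyConnectedLayer(2);     % two classes: defective vs. ok
layers(end)   = classificationLayer;        % fresh classification output layer

opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...
    'MaxEpochs', 10, ...
    'ValidationData', imdsValidation);      % imdsValidation: assumed validation datastore

customNet = trainNetwork(imdsTrain, layers, opts);   % imdsTrain: assumed training datastore
```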
So we can load the new network and prototype again on the same FPGA bitstream. The first thing you will notice is the reduction in memory required, from 140 MB to 84 MB. Running on the FPGA shows that we still get correct classification results with a significant speed improvement. But it’s still short of our goal.
Another area that can make a big difference is quantizing the network. If we quantize to int8 data types, for instance, we greatly reduce the amount of data that needs to be stored in, and retrieved from, off-chip RAM, which also reduces latency. And it reduces the hardware resources required for the multiply-accumulates, so we can parallelize more to increase throughput.
Download the deep learning quantization support package from the Add-On Explorer and open the Deep Network Quantizer app (deepNetworkQuantizer). Load the network you want to quantize, then calibrate using a datastore that we already set up. Calibration runs a set of images through the existing network to collect the required ranges for the weights, biases, and activations. In these histograms, gray shows the data that cannot be represented by the quantized types, blue shows what can be represented, and darker shades are higher-frequency bins. If this is acceptable, quantize the parameters, load a datastore to validate the accuracy with the quantized data (you can see the results here), then export the network.
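The same steps can also be scripted with the dlquantizer workflow; the datastore names and validation options below are assumptions:

```matlab
quantObj   = dlquantizer(customNet, 'ExecutionEnvironment', 'FPGA');

calResults = calibrate(quantObj, calData);            % calData: calibration image datastore;
                                                      % collects ranges for weights, biases, activations

valOpts    = dlquantizationOptions('Bitstream', 'zcu102_int8', 'Target', hTarget);
valResults = validate(quantObj, valData, valOpts);    % valData: validation datastore; checks int8 accuracy
```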
This quantized network needs to run on a processor that’s also configured for int8. We could work with the hardware team to customize the deep learning processor to add more parallel threads and possibly increase the clock frequency, but for now let’s see how this performs on the int8 version of our downloaded zcu102 bitstream. After compiling, the parameters have been reduced to 68 megabytes. We still get the right prediction results, and the latency is reduced to the point where it can run at 139 frames per second, which exceeds our goal!
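Re-targeting the quantized network is the same handful of calls as before, just pointed at the int8 bitstream. As above, the names are assumptions for a ZCU102 setup, and depending on the release you may pass the exported quantized network instead of the dlquantizer object:

```matlab
hW = dlhdl.Workflow('Network', quantObj, 'Bitstream', 'zcu102_int8', 'Target', hTarget);
hW.compile;                               % parameter memory drops to roughly half the single-precision size
hW.deploy;
[prediction, speed] = hW.predict(imgROI, 'Profile', 'on');   % profile now reports the achievable frame rate
```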
So by getting feedback on FPGA performance metrics from within MATLAB, we were able to make some large adjustments, first by removing some unnecessary layers, then by quantizing to int8. Now we know we have a deep learning network that will run on a deep learning processor on an FPGA with the performance our application requires.