Custom Deep Learning Processor Generation to Meet Performance Requirements
This example shows how to create a custom processor configuration and estimate the performance of a pretrained series network. You can then modify parameters of the custom processor configuration and re-estimate the performance. Once the estimate meets your performance requirements, you can generate a custom bitstream by using the custom processor configuration.
Prerequisites
Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Deep Learning Toolbox Model Quantization Library
MATLAB Coder Interface for Deep Learning
Load Pretrained Series Network
To load the pretrained series network LogoNet, enter:
net = getLogoNetwork;
Define Training and Validation Data Sets
This example uses the logos_dataset data set. The data set consists of 320 images. Create an imageDatastore object and split it into training and validation sets.
curDir = pwd;
unzip('logos_dataset.zip');
imds = imageDatastore('logos_dataset', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
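As a rough check on the split sizes, the following sketch works out how many images land in each set. It assumes the 320 images are spread evenly over the 32 logo classes reported by the network's classification layer, and that the 0.7 fraction is rounded to a whole number of files per label. Both are assumptions, and the variable names are illustrative only.

```matlab
% Assumption: the data set is balanced (same number of images per class).
numImages     = 320;   % stated size of logos_dataset
numClasses    = 32;    % 'adidas' and 31 other classes
trainFraction = 0.7;   % proportion passed to splitEachLabel

imagesPerClass = numImages / numClasses;                 % 10 if balanced
trainPerClass  = round(trainFraction * imagesPerClass);  % assumed rounding
valPerClass    = imagesPerClass - trainPerClass;

fprintf('Training images:   %d\n', trainPerClass * numClasses);
fprintf('Validation images: %d\n', valPerClass * numClasses);
```

If the real data set is not balanced, countEachLabel on imdsTrain and imdsValidation reports the actual per-class counts.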
Create Custom Processor Configuration
To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.
hPC = dlhdl.ProcessorConfig;
hPC.TargetFrequency = 220;
hPC
hPC = 
    Processing Module "conv"
        ModuleGeneration: 'on'
        LRNBlockGeneration: 'off'
        SegmentationBlockGeneration: 'on'
        ConvThreadNumber: 16
        InputMemorySize: [227 227 3]
        OutputMemorySize: [227 227 3]
        FeatureSizeLimit: 2048

    Processing Module "fc"
        ModuleGeneration: 'on'
        SoftmaxBlockGeneration: 'off'
        FCThreadNumber: 4
        InputMemorySize: 25088
        OutputMemorySize: 4096

    Processing Module "custom"
        ModuleGeneration: 'on'
        Addition: 'on'
        MishLayer: 'off'
        Multiplication: 'on'
        Resize2D: 'off'
        Sigmoid: 'off'
        SwishLayer: 'off'
        TanhLayer: 'off'
        InputMemorySize: 40
        OutputMemorySize: 120

    Processor Top Level Properties
        RunTimeControl: 'register'
        RunTimeStatus: 'register'
        InputStreamControl: 'register'
        OutputStreamControl: 'register'
        SetupControl: 'register'
        ProcessorDataType: 'single'
        UseVendorLibrary: 'on'

    System Level Properties
        TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
        TargetFrequency: 220
        SynthesisTool: 'Xilinx Vivado'
        ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
        SynthesisToolChipFamily: 'Zynq UltraScale+'
        SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
        SynthesisToolPackageName: ''
        SynthesisToolSpeedValue: ''
Estimate LogoNet Performance
To estimate the performance of the LogoNet series network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).
hPC.estimatePerformance(net)
### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware.
### The network includes the following layers:
     1   'imageinput'    Image Input             227×227×3 images with 'zerocenter' normalization and 'randfliplr' augmentations  (SW Layer)
     2   'conv_1'        2-D Convolution         96 5×5×3 convolutions with stride [1 1] and padding [0 0 0 0]  (HW Layer)
     3   'relu_1'        ReLU                    ReLU  (HW Layer)
     4   'maxpool_1'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding [0 0 0 0]  (HW Layer)
     5   'conv_2'        2-D Convolution         128 3×3×96 convolutions with stride [1 1] and padding [0 0 0 0]  (HW Layer)
     6   'relu_2'        ReLU                    ReLU  (HW Layer)
     7   'maxpool_2'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding [0 0 0 0]  (HW Layer)
     8   'conv_3'        2-D Convolution         384 3×3×128 convolutions with stride [1 1] and padding [0 0 0 0]  (HW Layer)
     9   'relu_3'        ReLU                    ReLU  (HW Layer)
    10   'maxpool_3'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    11   'conv_4'        2-D Convolution         128 3×3×384 convolutions with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    12   'relu_4'        ReLU                    ReLU  (HW Layer)
    13   'maxpool_4'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    14   'fc_1'          Fully Connected         2048 fully connected layer  (HW Layer)
    15   'relu_5'        ReLU                    ReLU  (HW Layer)
    16   'fc_2'          Fully Connected         2048 fully connected layer  (HW Layer)
    17   'relu_6'        ReLU                    ReLU  (HW Layer)
    18   'fc_3'          Fully Connected         32 fully connected layer  (HW Layer)
    19   'softmax'       Softmax                 softmax  (SW Layer)
    20   'classoutput'   Classification Output   crossentropyex with 'adidas' and 31 other classes  (SW Layer)

### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
              Deep Learning Processor Estimator Performance Results

                     LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------               -------------          ---------     ---------     ---------
Network                     39245087                    0.17839                 1         39245087         5.6
    imageinput_norm           275732                    0.00125
    conv_1                   6836112                    0.03107
    maxpool_1                3706776                    0.01685
    conv_2                  10461413                    0.04755
    maxpool_2                1174098                    0.00534
    conv_3                   9392181                    0.04269
    maxpool_3                1230834                    0.00559
    conv_4                   1768564                    0.00804
    maxpool_4                  24482                    0.00011
    fc_1                     2651287                    0.01205
    fc_2                     1696631                    0.00771
    fc_3                       26977                    0.00012
 * The clock frequency of the DL processor is: 220MHz
The estimated performance is 5.6 Frames/s. To improve the network performance, modify the custom processor convolution module kernel data type, convolution processor thread number, fully connected module kernel data type, and fully connected module thread number. For more information about these processor parameters, see getModuleProperty and setModuleProperty.
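The Frames/s figure follows directly from the network cycle count and the processor clock reported in the results table. The sketch below reproduces that arithmetic and also shows the cycle budget for a target frame rate. The 20 Frames/s target is a hypothetical requirement chosen for illustration, and the variable names are not part of the toolbox.

```matlab
% Values copied from the single-precision estimate above.
clockHz       = 220e6;      % DL processor clock, 220 MHz
networkCycles = 39245087;   % LastFrameLatency(cycles) for the network

latencySeconds  = networkCycles / clockHz;   % about 0.17839 s
framesPerSecond = 1 / latencySeconds;        % about 5.6 Frames/s
fprintf('Latency: %.5f s, throughput: %.1f Frames/s\n', ...
    latencySeconds, framesPerSecond);

% Hypothetical requirement: how many cycles per frame are allowed
% at this clock to reach the target frame rate?
targetFps      = 20;
maxCyclesAtFps = clockHz / targetFps;        % 11,000,000 cycles
fprintf('Budget for %d Frames/s: %d cycles (%.1fx fewer than now)\n', ...
    targetFps, maxCyclesAtFps, networkCycles / maxCyclesAtFps);
```

Read this way, a performance requirement becomes a cycle budget: either the cycle count per frame has to fall, the clock has to rise, or both.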
Create Modified Custom Processor Configuration
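To create the modified configuration, create a new dlhdl.ProcessorConfig object, then raise the target frequency to 300 MHz, set the processor data type to int8, and increase the convolution and fully connected thread numbers to 64 and 16. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.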
hPCNew = dlhdl.ProcessorConfig;
hPCNew.TargetFrequency = 300;
hPCNew.ProcessorDataType = 'int8';
hPCNew.setModuleProperty('conv', 'ConvThreadNumber', 64);
hPCNew.setModuleProperty('fc', 'FCThreadNumber', 16);
hPCNew
hPCNew = 
    Processing Module "conv"
        ModuleGeneration: 'on'
        LRNBlockGeneration: 'off'
        SegmentationBlockGeneration: 'on'
        ConvThreadNumber: 64
        InputMemorySize: [227 227 3]
        OutputMemorySize: [227 227 3]
        FeatureSizeLimit: 2048

    Processing Module "fc"
        ModuleGeneration: 'on'
        SoftmaxBlockGeneration: 'off'
        FCThreadNumber: 16
        InputMemorySize: 25088
        OutputMemorySize: 4096

    Processing Module "custom"
        ModuleGeneration: 'on'
        Addition: 'on'
        MishLayer: 'off'
        Multiplication: 'on'
        Resize2D: 'off'
        Sigmoid: 'off'
        SwishLayer: 'off'
        TanhLayer: 'off'
        InputMemorySize: 40
        OutputMemorySize: 120

    Processor Top Level Properties
        RunTimeControl: 'register'
        RunTimeStatus: 'register'
        InputStreamControl: 'register'
        OutputStreamControl: 'register'
        SetupControl: 'register'
        ProcessorDataType: 'int8'
        UseVendorLibrary: 'on'

    System Level Properties
        TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
        TargetFrequency: 300
        SynthesisTool: 'Xilinx Vivado'
        ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
        SynthesisToolChipFamily: 'Zynq UltraScale+'
        SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
        SynthesisToolPackageName: ''
        SynthesisToolSpeedValue: ''
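Rather than scanning the full display, individual module settings can be read back with getModuleProperty. The following is a minimal sketch that rebuilds the same configuration under the illustrative name hPCCheck and queries the two thread numbers. It needs Deep Learning HDL Toolbox to run.

```matlab
% hPCCheck is an illustrative name; the settings mirror hPCNew above.
hPCCheck = dlhdl.ProcessorConfig;
hPCCheck.TargetFrequency = 300;
hPCCheck.ProcessorDataType = 'int8';
hPCCheck.setModuleProperty('conv', 'ConvThreadNumber', 64);
hPCCheck.setModuleProperty('fc', 'FCThreadNumber', 16);

% Read the values back from the configuration object.
convThreads = hPCCheck.getModuleProperty('conv', 'ConvThreadNumber');
fcThreads   = hPCCheck.getModuleProperty('fc', 'FCThreadNumber');
fprintf('ConvThreadNumber = %d, FCThreadNumber = %d\n', convThreads, fcThreads);
```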
Quantize LogoNet Series Network
To quantize the LogoNet network, enter:
imageData = imageDatastore(fullfile(curDir,'logos_dataset'), ...
    'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames');
imageData_reduced = imageData.subset(1:20);
dlquantObj = dlquantizer(net,'ExecutionEnvironment','FPGA');
dlquantObj.calibrate(imageData_reduced)
ans=35×5 table
       Optimized Layer Name       Network Layer Name    Learnables / Activations     MinValue       MaxValue 
    __________________________    __________________    ________________________    ___________    __________

    "conv_1_Weights"              {'conv_1'    }        "Weights"                     -0.048978      0.039352
    "conv_1_Bias"                 {'conv_1'    }        "Bias"                          0.99996        1.0028
    "conv_2_Weights"              {'conv_2'    }        "Weights"                     -0.055518      0.061901
    "conv_2_Bias"                 {'conv_2'    }        "Bias"                      -0.00061171       0.00227
    "conv_3_Weights"              {'conv_3'    }        "Weights"                     -0.045942      0.046927
    "conv_3_Bias"                 {'conv_3'    }        "Bias"                       -0.0013998     0.0015218
    "conv_4_Weights"              {'conv_4'    }        "Weights"                     -0.045967         0.051
    "conv_4_Bias"                 {'conv_4'    }        "Bias"                         -0.00164     0.0037892
    "fc_1_Weights"                {'fc_1'      }        "Weights"                     -0.051394      0.054344
    "fc_1_Bias"                   {'fc_1'      }        "Bias"                      -0.00052319    0.00084454
    "fc_2_Weights"                {'fc_2'      }        "Weights"                      -0.05016      0.051557
    "fc_2_Bias"                   {'fc_2'      }        "Bias"                       -0.0017564     0.0018502
    "fc_3_Weights"                {'fc_3'      }        "Weights"                     -0.050706       0.04678
    "fc_3_Bias"                   {'fc_3'      }        "Bias"                         -0.02951      0.024855
    "imageinput"                  {'imageinput'}        "Activations"                         0           255
    "imageinput_normalization"    {'imageinput'}        "Activations"                   -139.34        198.72
      ⋮
Estimate LogoNet Performance
To estimate the performance of the quantized LogoNet series network, pass the calibrated dlquantizer object to the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).
hPCNew.estimatePerformance(dlquantObj)
### The network includes the following layers:
     1   'imageinput'    Image Input             227×227×3 images with 'zerocenter' normalization and 'randfliplr' augmentations  (SW Layer)
     2   'conv_1'        2-D Convolution         96 5×5×3 convolutions with stride [1 1] and padding [0 0 0 0]  (HW Layer)
     3   'relu_1'        ReLU                    ReLU  (HW Layer)
     4   'maxpool_1'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding [0 0 0 0]  (HW Layer)
     5   'conv_2'        2-D Convolution         128 3×3×96 convolutions with stride [1 1] and padding [0 0 0 0]  (HW Layer)
     6   'relu_2'        ReLU                    ReLU  (HW Layer)
     7   'maxpool_2'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding [0 0 0 0]  (HW Layer)
     8   'conv_3'        2-D Convolution         384 3×3×128 convolutions with stride [1 1] and padding [0 0 0 0]  (HW Layer)
     9   'relu_3'        ReLU                    ReLU  (HW Layer)
    10   'maxpool_3'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    11   'conv_4'        2-D Convolution         128 3×3×384 convolutions with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    12   'relu_4'        ReLU                    ReLU  (HW Layer)
    13   'maxpool_4'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding [0 0 0 0]  (HW Layer)
    14   'fc_1'          Fully Connected         2048 fully connected layer  (HW Layer)
    15   'relu_5'        ReLU                    ReLU  (HW Layer)
    16   'fc_2'          Fully Connected         2048 fully connected layer  (HW Layer)
    17   'relu_6'        ReLU                    ReLU  (HW Layer)
    18   'fc_3'          Fully Connected         32 fully connected layer  (HW Layer)
    19   'softmax'       Softmax                 softmax  (SW Layer)
    20   'classoutput'   Classification Output   crossentropyex with 'adidas' and 31 other classes  (SW Layer)

### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
              Deep Learning Processor Estimator Performance Results

                     LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------               -------------          ---------     ---------     ---------
Network                     13791559                    0.04597                 1         13791559        21.8
    conv_1                   3488979                    0.01163
    maxpool_1                1852524                    0.00618
    conv_2                   2940919                    0.00980
    maxpool_2                 586833                    0.00196
    conv_3                   2584863                    0.00862
    maxpool_3                 615201                    0.00205
    conv_4                    612220                    0.00204
    maxpool_4                  12217                    0.00004
    fc_1                      665265                    0.00222
    fc_2                      425425                    0.00142
    fc_3                        7113                    0.00002
 * The clock frequency of the DL processor is: 300MHz
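To decide which processor parameter to adjust next, it can help to see where the cycles go. The sketch below ranks the layers of the int8 estimate by cycle count and groups them by module. The numbers are copied from the results above; the grouping of layers into conv, maxpool, and fc by name is an illustrative choice, not toolbox output.

```matlab
% Per-layer cycle counts from the int8 estimate at 300 MHz.
layerNames  = {'conv_1','maxpool_1','conv_2','maxpool_2','conv_3', ...
               'maxpool_3','conv_4','maxpool_4','fc_1','fc_2','fc_3'};
layerCycles = [3488979 1852524 2940919 586833 2584863 ...
               615201 612220 12217 665265 425425 7113];

totalCycles = sum(layerCycles);             % equals the Network row
sharePct    = 100 * layerCycles / totalCycles;

% Three most expensive layers.
[~, order] = sort(layerCycles, 'descend');
for k = order(1:3)
    fprintf('%-10s %8d cycles  %5.1f%%\n', layerNames{k}, layerCycles(k), sharePct(k));
end

% Share per group, matched on the layer name prefix.
isConv = strncmp(layerNames, 'conv', 4);
isPool = strncmp(layerNames, 'maxpool', 7);
isFc   = strncmp(layerNames, 'fc', 2);
fprintf('conv %.1f%%, maxpool %.1f%%, fc %.1f%%\n', ...
    sum(sharePct(isConv)), sum(sharePct(isPool)), sum(sharePct(isFc)));
```

In this estimate the convolution layers account for roughly 70% of the cycles, so changes to the convolution module are likely to have the largest effect on frame rate.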
The estimated performance is 21.8 Frames/s.
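The two estimates can be compared directly. This sketch separates the overall speedup into the part that comes from fewer cycles per frame and the part that comes from the higher clock. The inputs are the Network rows of the two results tables; the variable names are illustrative.

```matlab
% Network rows from the two estimates.
cyclesSingle = 39245087;  clockSingle = 220e6;   % single, 220 MHz
cyclesInt8   = 13791559;  clockInt8   = 300e6;   % int8, 300 MHz

latencySingle = cyclesSingle / clockSingle;      % seconds per frame
latencyInt8   = cyclesInt8 / clockInt8;

cycleRatio = cyclesSingle / cyclesInt8;          % gain from fewer cycles
clockRatio = clockInt8 / clockSingle;            % gain from faster clock
speedup    = latencySingle / latencyInt8;        % equals cycleRatio * clockRatio

fprintf('Cycle reduction: %.2fx, clock increase: %.2fx, overall: %.2fx\n', ...
    cycleRatio, clockRatio, speedup);
```

Most of the gain comes from the reduced cycle count (int8 data type and more threads); the clock increase contributes the remaining factor of about 1.36.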
Generate Custom Processor and Bitstream
Use the new custom processor configuration to build and generate a custom processor and bitstream. Use the custom bitstream to deploy the LogoNet network to your target FPGA board.
hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.2\bin\vivado.bat');
dlhdl.buildProcessor(hPCNew);
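Because a bitstream build can take a long time, it may be worth checking the estimated resource use of the configuration before starting one. The following is a minimal sketch using the estimateResources function listed under See Also. It rebuilds the configuration under the illustrative name hPCRes, needs Deep Learning HDL Toolbox to run, and the exact report contents depend on the installed release.

```matlab
% hPCRes is an illustrative name; the settings mirror hPCNew above.
hPCRes = dlhdl.ProcessorConfig;
hPCRes.TargetFrequency = 300;
hPCRes.ProcessorDataType = 'int8';
hPCRes.setModuleProperty('conv', 'ConvThreadNumber', 64);
hPCRes.setModuleProperty('fc', 'FCThreadNumber', 16);

% Print the estimated resource use for this configuration.
hPCRes.estimateResources
```

If the estimate exceeds what the target device offers, reducing the thread numbers is one way to trade performance for area.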
To learn how to use the generated bitstream file, see Generate Custom Bitstream.
The generated bitstream in this example is similar to the zcu102_int8 bitstream. To deploy the quantized LogoNet network using the zcu102_int8 bitstream, see Classify Images on FPGA Using Quantized Neural Network.
See Also
dlhdl.ProcessorConfig | estimatePerformance | estimateResources