Optimize Deep Learning Processor Configuration for Network Performance
This example shows how to generate a deep learning processor configuration and estimate the performance of a pretrained network. Generate a deep learning processor configuration optimized for the target frames-per-second value of the network, then generate a custom bitstream by using the optimized processor configuration.
Load Pretrained Network and Create Processor Configuration
To load a pretrained ResNet-18 network, enter:
net = resnet18;
Create a custom deep learning processor configuration. For more information, see dlhdl.ProcessorConfig.
hPC = dlhdl.ProcessorConfig;
Estimate Network Performance
To establish a baseline, estimate the performance of the ResNet-18 network by using the estimatePerformance method of the dlhdl.ProcessorConfig object. The method returns the estimated layer latency, network latency, and network performance in frames per second.
estimatePerformance(hPC,net);
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### The network includes the following layers:
     1   'data'                 Image Input       224×224×3 images with 'zscore' normalization   (SW Layer)
     2   'conv1'                2-D Convolution   64 7×7×3 convolutions with stride [2 2] and padding [3 3 3 3]   (HW Layer)
     3   'conv1_relu'           ReLU              ReLU   (HW Layer)
     4   'pool1'                2-D Max Pooling   3×3 max pooling with stride [2 2] and padding [1 1 1 1]   (HW Layer)
     5   'res2a_branch2a'       2-D Convolution   64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
     6   'res2a_branch2a_relu'  ReLU              ReLU   (HW Layer)
     7   'res2a_branch2b'       2-D Convolution   64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
     8   'res2a'                Addition          Element-wise addition of 2 inputs   (HW Layer)
     9   'res2a_relu'           ReLU              ReLU   (HW Layer)
    10   'res2b_branch2a'       2-D Convolution   64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    11   'res2b_branch2a_relu'  ReLU              ReLU   (HW Layer)
    12   'res2b_branch2b'       2-D Convolution   64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    13   'res2b'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    14   'res2b_relu'           ReLU              ReLU   (HW Layer)
    15   'res3a_branch2a'       2-D Convolution   128 3×3×64 convolutions with stride [2 2] and padding [1 1 1 1]   (HW Layer)
    16   'res3a_branch2a_relu'  ReLU              ReLU   (HW Layer)
    17   'res3a_branch2b'       2-D Convolution   128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    18   'res3a_branch1'        2-D Convolution   128 1×1×64 convolutions with stride [2 2] and padding [0 0 0 0]   (HW Layer)
    19   'res3a'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    20   'res3a_relu'           ReLU              ReLU   (HW Layer)
    21   'res3b_branch2a'       2-D Convolution   128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    22   'res3b_branch2a_relu'  ReLU              ReLU   (HW Layer)
    23   'res3b_branch2b'       2-D Convolution   128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    24   'res3b'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    25   'res3b_relu'           ReLU              ReLU   (HW Layer)
    26   'res4a_branch2a'       2-D Convolution   256 3×3×128 convolutions with stride [2 2] and padding [1 1 1 1]   (HW Layer)
    27   'res4a_branch2a_relu'  ReLU              ReLU   (HW Layer)
    28   'res4a_branch2b'       2-D Convolution   256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    29   'res4a_branch1'        2-D Convolution   256 1×1×128 convolutions with stride [2 2] and padding [0 0 0 0]   (HW Layer)
    30   'res4a'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    31   'res4a_relu'           ReLU              ReLU   (HW Layer)
    32   'res4b_branch2a'       2-D Convolution   256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    33   'res4b_branch2a_relu'  ReLU              ReLU   (HW Layer)
    34   'res4b_branch2b'       2-D Convolution   256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    35   'res4b'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    36   'res4b_relu'           ReLU              ReLU   (HW Layer)
    37   'res5a_branch2a'       2-D Convolution   512 3×3×256 convolutions with stride [2 2] and padding [1 1 1 1]   (HW Layer)
    38   'res5a_branch2a_relu'  ReLU              ReLU   (HW Layer)
    39   'res5a_branch2b'       2-D Convolution   512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    40   'res5a_branch1'        2-D Convolution   512 1×1×256 convolutions with stride [2 2] and padding [0 0 0 0]   (HW Layer)
    41   'res5a'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    42   'res5a_relu'           ReLU              ReLU   (HW Layer)
    43   'res5b_branch2a'       2-D Convolution   512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    44   'res5b_branch2a_relu'  ReLU              ReLU   (HW Layer)
    45   'res5b_branch2b'       2-D Convolution   512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    46   'res5b'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    47   'res5b_relu'           ReLU              ReLU   (HW Layer)
    48   'pool5'                2-D Global Average Pooling   2-D global average pooling   (HW Layer)
    49   'fc1000'               Fully Connected   1000 fully connected layer   (HW Layer)
    50   'prob'                 Softmax           softmax   (SW Layer)
    51   'ClassificationLayer_predictions'   Classification Output   crossentropyex with 'tench' and 999 other classes   (SW Layer)
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'ClassificationLayer_predictions' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.

              Deep Learning Processor Estimator Performance Results

                          LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                               -------------              -------------          ---------     ---------     ---------
Network                          21328236                   0.10664                  1         21328236         9.4
    ____data_norm_add              210750                   0.00105
    ____data_norm                  210750                   0.00105
    ____conv1                     2164124                   0.01082
    ____pool1                      515064                   0.00258
    ____res2a_branch2a             966221                   0.00483
    ____res2a_branch2b             966221                   0.00483
    ____res2a                      210750                   0.00105
    ____res2b_branch2a             966221                   0.00483
    ____res2b_branch2b             966221                   0.00483
    ____res2b                      210750                   0.00105
    ____res3a_branch1              540861                   0.00270
    ____res3a_branch2a             540749                   0.00270
    ____res3a_branch2b             919117                   0.00460
    ____res3a                      105404                   0.00053
    ____res3b_branch2a             919117                   0.00460
    ____res3b_branch2b             919117                   0.00460
    ____res3b                      105404                   0.00053
    ____res4a_branch1              503405                   0.00252
    ____res4a_branch2a             509261                   0.00255
    ____res4a_branch2b             905421                   0.00453
    ____res4a                       52724                   0.00026
    ____res4b_branch2a             905421                   0.00453
    ____res4b_branch2b             905421                   0.00453
    ____res4b                       52724                   0.00026
    ____res5a_branch1              744525                   0.00372
    ____res5a_branch2a             751693                   0.00376
    ____res5a_branch2b            1415373                   0.00708
    ____res5a                       26368                   0.00013
    ____res5b_branch2a            1415373                   0.00708
    ____res5b_branch2b            1415373                   0.00708
    ____res5b                       26368                   0.00013
    ____pool5                       54594                   0.00027
    ____fc1000                     207351                   0.00104
 * The clock frequency of the DL processor is: 200MHz
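The numbers in the Network row appear to be related by simple arithmetic: the latency in seconds is the cycle count divided by the clock frequency, and for a single frame the frames-per-second value is the reciprocal of that latency. The following sketch checks this relationship using values copied from the output above. The variable names are illustrative and are not part of the toolbox.

```matlab
% Values copied from the estimator output above.
clockHz       = 200e6;      % DL processor clock frequency (200 MHz)
latencyCycles = 21328236;   % LastFrameLatency(cycles) for the whole network

% Assumed relationship: seconds = cycles / clock frequency.
latencySeconds  = latencyCycles / clockHz;
% With FramesNum = 1, throughput is the reciprocal of the frame latency.
framesPerSecond = 1 / latencySeconds;

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n', ...
    latencySeconds, framesPerSecond);
% Prints: Latency: 0.10664 s, throughput: 9.4 frames/s
```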
The estimated performance is 9.4 frames per second. To improve the network performance, you can modify the properties of the custom deep learning processor configuration hPC or use the optimizeConfigurationForNetwork method. In this example, you use the optimizeConfigurationForNetwork method. To learn about modifying the properties manually, see Effects of Custom Deep Learning Processor Parameters on Performance and Resource Utilization.
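For comparison, the manual route could look like the following sketch. It assumes that Deep Learning HDL Toolbox is installed and that net is in the workspace. The variable hPC_manual and the thread count 64 are illustrative choices, not values taken from this example, and a larger thread count is expected to consume more FPGA resources.

```matlab
% Sketch only: start from a fresh default configuration so that hPC is untouched.
hPC_manual = dlhdl.ProcessorConfig;

% Inspect the current number of parallel convolution threads.
getModuleProperty(hPC_manual,'conv','ConvThreadNumber')

% Hypothetical change: raise the convolution thread count.
setModuleProperty(hPC_manual,'conv','ConvThreadNumber',64);

% Re-estimate latency and frames per second, then check resource use.
estimatePerformance(hPC_manual,net);
estimateResources(hPC_manual);
```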
Generate Optimized Processor Configuration
Optimize the processor configuration by using the optimizeConfigurationForNetwork method. Use the optional FramesPerSecond name-value argument to set the target performance, which in this example is 10 frames per second.
hPC_optimized = optimizeConfigurationForNetwork(hPC,net,FramesPerSecond=10);
### Optimizing processor configuration for deep learning network...

              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)
                        -------------    -------------    -------------
Available                     2520              912            274080
                        -------------    -------------    -------------
Total                          438( 18%)        600( 66%)      270396( 99%)
ReferenceDesign                  3(  1%)         78(  9%)       35000( 13%)
DL_Processor                   435( 18%)        522( 58%)      235396( 86%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices

### Note: Processing module "conv" property "InputMemorySize" changed from "[227 227 3]" to "[217 217 3]".
### Note: Processing module "conv" property "OutputMemorySize" changed from "[227 227 3]" to "[217 217 3]".
### Note: Processing module "conv" property "SegmentationBlockGeneration" changed from "true" to "false".
### Note: Processing module "fc" property "FCThreadNumber" changed from "4" to "8".
### Note: Processing module "fc" property "WeightAXIDataBitwidth" changed from "128" to "256".
### Note: Processing module "fc" property "SoftmaxBlockGeneration" changed from "false" to "true".

                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'off'
                            ConvThreadNumber: 16
                             InputMemorySize: [217 217 3]
                            OutputMemorySize: [217 217 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'on'
                      SigmoidBlockGeneration: 'off'
                              FCThreadNumber: 8
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

### Optimizing processor configuration for deep learning network complete.
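To see which resource limits the optimized configuration, you can recompute the utilization from the Available and Total rows of the resource table. The sketch below uses only numbers copied from that table, and the variable names are illustrative. The integer percentages in the table appear to be rounded up, which is an observation from these figures rather than documented behavior.

```matlab
% Values copied from the "Available" and "Total" rows of the resource table.
resourceNames = {'DSPs','Block RAM','LUTs'};
available = [2520 912 274080];
used      = [438  600 270396];

utilizationPct = 100 * used ./ available;   % exact percentage per resource
remaining      = available - used;          % unused units per resource

for k = 1:numel(resourceNames)
    fprintf('%-10s %5.1f%% used, %d remaining\n', ...
        resourceNames{k}, utilizationPct(k), remaining(k));
end
```

With these values, LUTs are the tightest resource: only 3684 of 274080 remain.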
Estimate performance of the ResNet-18 network by using the new optimized deep learning processor configuration.
estimatePerformance(hPC_optimized,net);
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### The network includes the following layers:
     1   'data'                 Image Input       224×224×3 images with 'zscore' normalization   (SW Layer)
     2   'conv1'                2-D Convolution   64 7×7×3 convolutions with stride [2 2] and padding [3 3 3 3]   (HW Layer)
     3   'conv1_relu'           ReLU              ReLU   (HW Layer)
     4   'pool1'                2-D Max Pooling   3×3 max pooling with stride [2 2] and padding [1 1 1 1]   (HW Layer)
     5   'res2a_branch2a'       2-D Convolution   64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
     6   'res2a_branch2a_relu'  ReLU              ReLU   (HW Layer)
     7   'res2a_branch2b'       2-D Convolution   64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
     8   'res2a'                Addition          Element-wise addition of 2 inputs   (HW Layer)
     9   'res2a_relu'           ReLU              ReLU   (HW Layer)
    10   'res2b_branch2a'       2-D Convolution   64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    11   'res2b_branch2a_relu'  ReLU              ReLU   (HW Layer)
    12   'res2b_branch2b'       2-D Convolution   64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    13   'res2b'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    14   'res2b_relu'           ReLU              ReLU   (HW Layer)
    15   'res3a_branch2a'       2-D Convolution   128 3×3×64 convolutions with stride [2 2] and padding [1 1 1 1]   (HW Layer)
    16   'res3a_branch2a_relu'  ReLU              ReLU   (HW Layer)
    17   'res3a_branch2b'       2-D Convolution   128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    18   'res3a_branch1'        2-D Convolution   128 1×1×64 convolutions with stride [2 2] and padding [0 0 0 0]   (HW Layer)
    19   'res3a'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    20   'res3a_relu'           ReLU              ReLU   (HW Layer)
    21   'res3b_branch2a'       2-D Convolution   128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    22   'res3b_branch2a_relu'  ReLU              ReLU   (HW Layer)
    23   'res3b_branch2b'       2-D Convolution   128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    24   'res3b'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    25   'res3b_relu'           ReLU              ReLU   (HW Layer)
    26   'res4a_branch2a'       2-D Convolution   256 3×3×128 convolutions with stride [2 2] and padding [1 1 1 1]   (HW Layer)
    27   'res4a_branch2a_relu'  ReLU              ReLU   (HW Layer)
    28   'res4a_branch2b'       2-D Convolution   256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    29   'res4a_branch1'        2-D Convolution   256 1×1×128 convolutions with stride [2 2] and padding [0 0 0 0]   (HW Layer)
    30   'res4a'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    31   'res4a_relu'           ReLU              ReLU   (HW Layer)
    32   'res4b_branch2a'       2-D Convolution   256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    33   'res4b_branch2a_relu'  ReLU              ReLU   (HW Layer)
    34   'res4b_branch2b'       2-D Convolution   256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    35   'res4b'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    36   'res4b_relu'           ReLU              ReLU   (HW Layer)
    37   'res5a_branch2a'       2-D Convolution   512 3×3×256 convolutions with stride [2 2] and padding [1 1 1 1]   (HW Layer)
    38   'res5a_branch2a_relu'  ReLU              ReLU   (HW Layer)
    39   'res5a_branch2b'       2-D Convolution   512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    40   'res5a_branch1'        2-D Convolution   512 1×1×256 convolutions with stride [2 2] and padding [0 0 0 0]   (HW Layer)
    41   'res5a'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    42   'res5a_relu'           ReLU              ReLU   (HW Layer)
    43   'res5b_branch2a'       2-D Convolution   512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    44   'res5b_branch2a_relu'  ReLU              ReLU   (HW Layer)
    45   'res5b_branch2b'       2-D Convolution   512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]   (HW Layer)
    46   'res5b'                Addition          Element-wise addition of 2 inputs   (HW Layer)
    47   'res5b_relu'           ReLU              ReLU   (HW Layer)
    48   'pool5'                2-D Global Average Pooling   2-D global average pooling   (HW Layer)
    49   'fc1000'               Fully Connected   1000 fully connected layer   (HW Layer)
    50   'prob'                 Softmax           softmax   (HW Layer)
    51   'ClassificationLayer_predictions'   Classification Output   crossentropyex with 'tench' and 999 other classes   (SW Layer)
### Notice: The layer 'ClassificationLayer_predictions' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.

              Deep Learning Processor Estimator Performance Results

                          LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                               -------------              -------------          ---------     ---------     ---------
Network                          19966252                   0.09983                  1         19966252        10.0
    ____data_norm_add              210750                   0.00105
    ____data_norm                  210750                   0.00105
    ____conv1                     2224339                   0.01112
    ____pool1                      632402                   0.00316
    ____res2a_branch2a            1038708                   0.00519
    ____res2a_branch2b            1038708                   0.00519
    ____res2a                      210750                   0.00105
    ____res2b_branch2a            1038708                   0.00519
    ____res2b_branch2b            1038708                   0.00519
    ____res2b                      210750                   0.00105
    ____res3a_branch1              630228                   0.00315
    ____res3a_branch2a             625092                   0.00313
    ____res3a_branch2b             919117                   0.00460
    ____res3a                      105404                   0.00053
    ____res3b_branch2a             919117                   0.00460
    ____res3b_branch2b             919117                   0.00460
    ____res3b                      105404                   0.00053
    ____res4a_branch1              503405                   0.00252
    ____res4a_branch2a             509261                   0.00255
    ____res4a_branch2b             905421                   0.00453
    ____res4a                       52724                   0.00026
    ____res4b_branch2a             905421                   0.00453
    ____res4b_branch2b             905421                   0.00453
    ____res4b                       52724                   0.00026
    ____res5a_branch1              506957                   0.00253
    ____res5a_branch2a             514125                   0.00257
    ____res5a_branch2b             940237                   0.00470
    ____res5a                       26368                   0.00013
    ____res5b_branch2a             940237                   0.00470
    ____res5b_branch2b             940237                   0.00470
    ____res5b                       26368                   0.00013
    ____pool5                       54594                   0.00027
    ____fc1000                     103438                   0.00052
    ____prob                         1262                   0.00001
 * The clock frequency of the DL processor is: 200MHz
The new estimated performance is 10.0 frames per second, which meets the target of 10 frames per second.
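To quantify the improvement, you can compare the two Network rows directly. The following sketch uses cycle counts copied from the baseline and optimized estimates and assumes that throughput equals the clock frequency divided by the per-frame cycle count. The variable names are illustrative.

```matlab
% Cycle counts copied from the two estimator reports.
clockHz         = 200e6;      % DL processor clock frequency (200 MHz)
baselineCycles  = 21328236;   % default configuration
optimizedCycles = 19966252;   % optimized configuration

baselineFps  = clockHz / baselineCycles;
optimizedFps = clockHz / optimizedCycles;
speedup      = baselineCycles / optimizedCycles;

fprintf('Baseline: %.2f frames/s, optimized: %.2f frames/s, speedup: %.3fx\n', ...
    baselineFps, optimizedFps, speedup);
```

The result is a speedup of roughly 7 percent, which is enough to move the estimate from just under 10 frames per second to just over it.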
This image shows the comparison between the original processor configuration and the optimized processor configuration:
The optimized processor configuration has:
SegmentationBlockGeneration turned off.
InputMemorySize and OutputMemorySize reduced to [217 217 3].
SoftmaxBlockGeneration turned on.
FCThreadNumber increased to 8.
Generate Optimized Custom Bitstream
Use the optimized custom deep learning processor configuration to build and generate a custom bitstream. Use the custom bitstream to deploy the pretrained ResNet-18 network to your target FPGA board.
hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2023.1\bin\vivado.bat');
dlhdl.buildProcessor(hPC_optimized);
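After the build completes, deployment could look like the following sketch. It assumes a connected ZCU102 board reachable over Ethernet and a generated bitstream file named dlprocessor.bit. The file name and location, the interface choice, and the random test image are assumptions, so adjust them to match your build output and data.

```matlab
% Sketch only: requires Deep Learning HDL Toolbox and a connected board.
hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');   % assumed connection type

% Assumed bitstream path: point this at the file that buildProcessor generated.
bitstreamFile = 'dlprocessor.bit';

% Pair the network with the custom bitstream and the target board.
hW = dlhdl.Workflow('Network',net,'Bitstream',bitstreamFile,'Target',hTarget);
compile(hW);   % generate weights, biases, and instructions for the processor
deploy(hW);    % program the FPGA and load the network

% Hypothetical input: a random 224-by-224-by-3 image standing in for real data.
img = single(255*rand(224,224,3));
[scores,speed] = predict(hW,img,'Profile','on');   % run inference and profile it
```

Comparing the profiled frames-per-second value against the estimate of 10.0 indicates how closely the estimator tracks the hardware.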
See Also
dlhdl.ProcessorConfig | estimatePerformance