Generate Custom Bitstream to Meet Custom Deep Learning Network Requirements
Deploy a custom network that has only layers with the convolution module output format, or only layers with the fully connected module output format, by generating a resource-optimized custom bitstream that satisfies your performance and resource requirements. A bitstream generated by using the default deep learning processor configuration contains the convolution (conv), fully connected (fc), and adder modules. The default bitstream can exceed your resource utilization budget, which can drive up costs. To generate a bitstream that contains only the modules required by the layers in your custom deep learning network, modify the deep learning processor configuration by using the setModuleProperty function of the dlhdl.ProcessorConfig object.
In this example, the network has only layers with the fully connected module output format. Generate a custom bitstream that contains only the fully connected module by removing the convolution and adder modules from the deep learning processor configuration. To remove the convolution and adder modules, use one of these approaches (both are sketched briefly after this list):
Turn off the ModuleGeneration property for the individual modules in the deep learning processor configuration.
Use the optimizeConfigurationForNetwork function. The function takes the deep learning network object as the input and returns an optimized custom deep learning processor configuration.
Rapidly verify the resource utilization of the optimized deep learning processor configuration by using the estimateResources function.
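The following is a minimal sketch of the two approaches, using only functions that appear later in this example. The variable yourNetwork stands in for a dlnetwork object, such as the fcnet network created later in this example.

% Approach 1 (sketch): turn off unused modules manually.
hPC_manual = dlhdl.ProcessorConfig;
hPC_manual.setModuleProperty('conv','ModuleGeneration','off');
hPC_manual.setModuleProperty('adder','ModuleGeneration','off');

% Approach 2 (sketch): tailor the configuration to a specific network.
hPC_auto = dlhdl.ProcessorConfig;
hPC_auto.optimizeConfigurationForNetwork(yourNetwork);

% Quickly check the estimated resource footprint of either configuration.
hPC_manual.estimateResources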
Prerequisites
Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC Devices
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Setup Synthesis Toolpath
To set up the Xilinx® Vivado™ tool path, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2022.1\bin\vivado.bat');
Create Custom Processor Configuration
Create a custom processor configuration. Save the configuration to hPC.
hPC = dlhdl.ProcessorConfig
hPC = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                            ConvThreadNumber: 16
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                                   MishLayer: 'off'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                  SwishLayer: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''
Optimize Processor Configuration for a Custom Fully Connected (FC) Layer Only Network
To optimize your processor configuration, create a custom network whose only hardware layer is a fully connected layer. Call the custom network fcnet.
layers = [ ...
    imageInputLayer([28 28 3],'Normalization','none','Name','input')
    fullyConnectedLayer(10,'Name','fc')];
layers(2).Weights = rand(10,28*28*3);
layers(2).Bias = rand(10,1);
fcnet = dlnetwork(layers);
plot(fcnet);
Retrieve the resource utilization for the default custom processor configuration by using estimateResources. Retrieve the performance for the custom network fcnet by using estimatePerformance.
hPC.estimateResources
              Deep Learning Processor Estimator Resource Results

                              DSPs      Block RAM*   LUTs(CLB/ALUT)
                          ------------- ------------- -------------
Available                      2520           912         274080
                          ------------- ------------- -------------
DL_Processor                389( 16%)     508( 56%)   216119( 79%)

* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
hPC.estimatePerformance(fcnet)
### An output layer called 'Output1_fc' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### The network includes the following layers:
     1   'input'        Image Input         28×28×3 images               (SW Layer)
     2   'fc'           Fully Connected     10 fully connected layer     (HW Layer)
     3   'Output1_fc'   Regression Output   mean-squared-error           (SW Layer)

### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'Output1_fc' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.

              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------              -------------         ---------    ---------      ---------
Network                      16127                    0.00008                 1          16127         12401.6
    fc                       16127                    0.00008
 * The clock frequency of the DL processor is: 200MHz
The target device resource counts are:
Digital signal processor (DSP) slice count — 240
Block random access memory (BRAM) count — 128
The estimated performance is 12401.6 frames per second (FPS). The estimated resource use counts are:
Digital signal processor (DSP) slice count — 389
Block random access memory (BRAM) count — 508
The estimated DSP slice and BRAM counts exceed the target device resource budget. Customize the bitstream configuration to reduce resource use by customizing the processor configuration.
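As a quick sanity check, this minimal sketch compares the estimated counts against the target device budget used in this example. The numbers are copied from the estimateResources report and from the budget listed above; they are not queried programmatically.

% Hypothetical target device budget used in this example.
budgetDSP  = 240;
budgetBRAM = 128;

% Estimates reported by hPC.estimateResources for the default configuration.
estDSP  = 389;
estBRAM = 508;

fprintf('DSP slices over budget:  %d\n', estDSP  > budgetDSP);
fprintf('BRAM tiles over budget:  %d\n', estBRAM > budgetBRAM);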
Customize Processor Configuration by Using ModuleGeneration Property
Create a deep learning network processor configuration object. Save it to hPC_moduleoff.
Turn off the convolution and adder modules in the custom deep learning processor configuration.
hPC_moduleoff = dlhdl.ProcessorConfig;
hPC_moduleoff.setModuleProperty('conv','ModuleGeneration','off');
hPC_moduleoff.setModuleProperty('adder','ModuleGeneration','off');
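Optionally, confirm that module generation is turned off before estimating resources by querying the configuration with the getModuleProperty function. This check is a minimal sketch and is not part of the original workflow.

% Optional check (sketch): confirm that the conv and adder modules are disabled.
hPC_moduleoff.getModuleProperty('conv','ModuleGeneration')
hPC_moduleoff.getModuleProperty('adder','ModuleGeneration')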
Retrieve the resource utilization for the customized processor configuration hPC_moduleoff by using estimateResources. Retrieve the performance for the custom network fcnet by using estimatePerformance.
hPC_moduleoff.estimateResources
              Deep Learning Processor Estimator Resource Results

                              DSPs      Block RAM*   LUTs(CLB/ALUT)
                          ------------- ------------- -------------
Available                      2520           912         274080
                          ------------- ------------- -------------
DL_Processor                 17(  1%)      44(  5%)    25760( 10%)

* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
hPC_moduleoff.estimatePerformance(fcnet)
### An output layer called 'Output1_fc' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### The network includes the following layers:
     1   'input'        Image Input         28×28×3 images               (SW Layer)
     2   'fc'           Fully Connected     10 fully connected layer     (HW Layer)
     3   'Output1_fc'   Regression Output   mean-squared-error           (SW Layer)

### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'Output1_fc' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.

              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------              -------------         ---------    ---------      ---------
Network                      16127                    0.00008                 1          16127         12401.6
    fc                       16127                    0.00008
 * The clock frequency of the DL processor is: 200MHz
The target device resource counts are:
Digital signal processor (DSP) slice count — 240
Block random access memory (BRAM) count — 128
The estimated performance is 12401.6 frames per second (FPS). The estimated resource use counts are:
Digital signal processor (DSP) slice count — 17
Block random access memory (BRAM) count — 44
The estimated resource use of the customized bitstream is within the target device resource budget. The estimated performance meets the target network performance.
Customize Processor Configuration by Using optimizeConfigurationForNetwork
Create a deep learning network processor configuration object. Save it to hPC_optimized. Generate an optimized deep learning processor configuration by using the optimizeConfigurationForNetwork function.
hPC_optimized = dlhdl.ProcessorConfig;
hPC_optimized.optimizeConfigurationForNetwork(fcnet);
### Optimizing processor configuration for deep learning network...
### An output layer called 'Output1_fc' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Note: Processing module "conv" property "ModuleGeneration" changed from "true" to "false".
### Note: Processing module "fc" property "InputMemorySize" changed from "25088" to "2352".
### Note: Processing module "fc" property "OutputMemorySize" changed from "4096" to "128".
### Note: Processing module "custom" property "ModuleGeneration" changed from "true" to "false".

                    Processing Module "conv"
                            ModuleGeneration: 'off'

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 2352
                            OutputMemorySize: 128

                  Processing Module "custom"
                            ModuleGeneration: 'off'

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

### Optimizing processor configuration for deep learning network complete.
Retrieve the resource utilization for the optimized processor configuration hPC_optimized by using estimateResources. Retrieve the performance for the custom network fcnet by using estimatePerformance.
hPC_optimized.estimateResources
              Deep Learning Processor Estimator Resource Results

                              DSPs      Block RAM*   LUTs(CLB/ALUT)
                          ------------- ------------- -------------
Available                      2520           912         274080
                          ------------- ------------- -------------
DL_Processor                 17(  1%)      20(  3%)    25760( 10%)

* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
hPC_optimized.estimatePerformance(fcnet)
### An output layer called 'Output1_fc' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### The network includes the following layers:
     1   'input'        Image Input         28×28×3 images               (SW Layer)
     2   'fc'           Fully Connected     10 fully connected layer     (HW Layer)
     3   'Output1_fc'   Regression Output   mean-squared-error           (SW Layer)

### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'Output1_fc' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.

              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------              -------------         ---------    ---------      ---------
Network                      16127                    0.00008                 1          16127         12401.6
    fc                       16127                    0.00008
 * The clock frequency of the DL processor is: 200MHz
The target device resource counts are:
Digital signal processor (DSP) slice count — 240
Block random access memory (BRAM) count — 128
The estimated performance is 12401.6 frames per second (FPS). The estimated resource use counts are:
Digital signal processor (DSP) slice count — 17
Block random access memory (BRAM) count — 20
The estimated resource use of the customized bitstream is within the target device resource budget. The estimated performance meets the target network performance.
Generate Custom Bitstream
Generate a custom bitstream using the processor configuration that matches your performance and resource requirements.
To deploy fcnet by using the bitstream generated from the processor configuration customized with the ModuleGeneration property (hPC_moduleoff), uncomment this line of code:
% dlhdl.buildProcessor(hPC_moduleoff)
To deploy fcnet by using the bitstream generated from the processor configuration returned by the optimizeConfigurationForNetwork function (hPC_optimized), uncomment this line of code:
% dlhdl.buildProcessor(hPC_optimized)
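After the bitstream build completes, you typically program the board and run the network by using a dlhdl.Workflow object. The following is a minimal sketch, assuming an Ethernet connection to the target board and a generated bitstream file named dlprocessor.bit in the build output folder; adjust the interface, file name, and path for your setup, then uncomment the lines to run them.

% Deployment sketch (assumptions: Ethernet target; bitstream file name and
% location are those produced by dlhdl.buildProcessor on your machine).
% hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');
% hW = dlhdl.Workflow('Network',fcnet, ...
%     'Bitstream','dlprocessor.bit', ...
%     'Target',hTarget);
% hW.compile;
% hW.deploy;
% prediction = hW.predict(rand(28,28,3),'Profile','on');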
See Also
dlhdl.ProcessorConfig | getModuleProperty | setModuleProperty | estimatePerformance | estimateResources