Deep Learning Processor IP Core External Memory Data Format

Deep Learning HDL Toolbox™ uses an AXI4 interface to read and write data from the external memory. The AXI4 interface has a bit width of 32-bits. When you use the dlhdl.Workflow object to read data from and write data to the external memory, Deep Learning HDL Toolbox formats the input data, writes it to external memory, and then reads data from external memory and processes it before presenting you the data. If you do not use the dlhdl.Workflow object, then you must format the input data, store it in external memory, read the output data, and process the formatted output data.

To format the convolution module input data, see Convolution Module External Memory Data Format. To read the formatted convolution module output data from external memory, see Read Convolution Module Output Data from External Memory. To format the fully connected module input data, see Fully Connected Module External Memory Data Format.

The padding format for the external memory depends on the:

Parallel transfer data number — Number of pixels transferred every clock cycle through the AXI master interface. To calculate the parallel transfer data number, use the formula power(2,nextpow2(sqrt(ConvThreadNumber))). For example, if the convolution thread number is nine, the calculated value of the parallel data transfer number is four. See ConvThreadNumber.
Feature number — Value of the z-dimension of an x-by-y-by-z matrix. For example, most input images are of dimension x-by-y-by-3, with three referring to the red, green, and blue channels of an image.
Thread number — Number of channels of the input that a convolution style layer operates on simultaneously. To calculate the thread number, use the formula sqrt(ConvThreadNumber). For example, if the convolution thread number is nine, the calculated thread number is three. See ConvThreadNumber.

Convolution Module External Memory Data Format

The single data type inputs of the deep learning processor convolution module are typically three-dimensional (3-D).The external memory stores the data in a one-dimensional (1-D) vector. To convert the 3-D input image into 1-D to for a thread number,C, parallel data transfer number, N, and a feature number, Z:

Send N number of data along the z-dimension of the matrix.
Send the image information along the x-dimension of the input image.
Send the image information along the y-dimension of the input image.
After completing the first NXY block, send the next NXY block along the z-dimension of the matrix.

The image demonstrates how data stored in a 3-by-3-by-4 matrix translates into a 1-by-36 vector for storage in the external memory.

Data Padding for Power of Two Thread Numbers

When the image feature number, Z is not a multiple of the parallel data transfer thread number, N then you must pad a zeroes matrix of size x-by-y along the z-dimension of the matrix to make the image feature number, Z a multiple of the parallel data transfer number, N.

For example, if your input image is an x-by-y matrix that has a feature number, Z value of three and a parallel data transfer number, N value of four, pad the image with a zeros matrix of size x-by-y along the z-dimension to make the input to the external memory an x-by-y-by-4 matrix.

This image is the input image format before padding.

This image is the input image format after zero padding.

This image shows the external memory data format for the input matrix after the zero padding. In the image, A, B, and C are the three features of the input image and G is the zero-padded data to make the input image feature number, Z value four, which is a multiple of the parallel data transfer number, N.

If your deep learning processor consists of only a convolution processing module, the output external data uses the convolution module external data format, which contains padded data if your output feature number, Z is not a multiple of the parallel data transfer number, N. When you use an dlhdl.Workflow object, Deep Learning HDL Toolbox removes the padded data. If you do not use the dlhdl.Workflow object and directly read the output from the external memory, remove the padded data.

This image shows the external memory data format for a sample 3-by-3-by-5 input matrix that has a thread number 64. The calculated parallel data transfer number, N is eight and the input array is padded with three additional 3-by-3 array of zeroes, converted to a 1-D 1-by-72 vector and stored in external memory. Deep Learning HDL Toolbox completes one row before moving on to the next row.

The int8 data type outputs of the deep learning processor convolution module are typically three-dimensional (3-D).The external memory stores the data in a one-dimensional (1-D) vector. The external memory uses one byte for a value. Deep Learning HDL Toolbox uses an AXI4 interface to read and write data from the external memory. The AXI4 interface has a bit width of 32-bits. For example, if the convolution thread number is four, then you must concatenate one byte per thread into a 32-bit wide vector and store it in the external memory. When the int8 data type is stored in the memory the addresses are incremented in steps of one.

This image shows a sample 3-by-3-by-5 matrix passed as an input to a deep learning processor configuration that has a thread number , C of three and a parallel data transfer number, N value of four. Deep Learning HDL Toolbox first processes the first three features and inserts a zero padding, then processes the fourth and fifth features and inserts two additional zero-padded matrices of size 3-by-3. Deep Learning HDL Toolbox writes the data in little-endian format starting with the least significant bit and ending with the most significant bit.

Data Padding for Non-Power of Two Thread Numbers

When the thread number, C is not a power of two and is lower than the parallel data transfer number, N, then you must pad a zeroes matrix of size x-by-y along the z-dimension of the matrix. Insert the zeroes matrix after every C number of elements along the z-dimension of the matrix to make the feature number, Z value a multiple of N.

For example, if your input image is an x-by-y matrix that has a thread number,C value of three, and a parallel data transfer number, N and feature number,Z are four, pad the image with a zeroes matrix of size x-by-y after the third channel and three zeroes matrices of x-by-y after the fourth channel to make the input to the external memory an x-by-y-by-eight matrix.

This image is the input image format before padding.

Input image matrix before padding

This image is the input image format after zero padding.

Input image matrix after zero padding

This image shows how a sample 3-by-3-by-5 matrix passed as an input to a deep learning processor configuration with a parallel thread number, C of three and a thread number, N value of four is stored in external memory. Deep Learning HDL Toolbox first processes the first three features, inserts a zero padding, then processes the fourth and fifth features and inserts two additional zero-padded matrices of size 3-by-3 along the3 z-dimension of the matrix

When the values of C and N are equal, padding is required only when Z is not a multiple of C.

This image shows the external memory data format for a 3-by-3-by-5 input matrix that has a thread number of 64. Because C is 64, the parallel data transfer number is eight and the input array is padded with three additional 3-by-3 arrays of zeroes and then converted to a 1-D vector and stored in external memory. Deep Learning HDL Toolbox completes one row before moving on to the next row. Deep Learning HDL Toolbox writes the data in little-endian format starting with the least significant bit and ending with the most significant bit.

Read Convolution Module Output Data from External Memory

When you read the convolution module output data from external memory you must process the data to remove the zero-padded data. To read the 1-D data from external memory and convert it back into 3-D output matrix, you must:

Send C number of data in the z-dimension of the matrix.
Send the image information along the x-dimension of the input image.
Send the image information along the y-dimension of the input image.
After completing the first CXY block, send the next CXY block along the z-dimension of the matrix.

For example, if you have a 3-by-3-by-5 matrix that has a thread number of four, you pad the input data with a 3-by-3 matrix of zeroes along the z-dimension to convert it into a 3-by-3-by-8 matrix and then store it in memory as a 1-by-72 vector. When you read this data convert the 1-by-72 matrix back into a 3-by-3-by-8 matrix, remove the zero-padded data, and return a 3-by-3--by-5 matrix. This image shows to translate the 1-by-72 vector data in external memory back into a 3-by-3-by-8 matrix.

Calculation of Output Memory Size

The size of the output for a deep learning processor IP core depends on the feature number, Z, the thread number, C, and the parallel data transfer number, N. To calculate the output memory size, use the formula dimension1 * dimension2 * ceil(Z/C) * N. For example, for an input matrix of size 3-by-3-by-4 the output memory size for a C and N value of four is 3 *3 *ceil(4/4) *4 = 36. In this example the output is written four values at a time because the value of N is four.

For a 3-by-3-by-4 matrix with a C value of three and N value of four, the output size is 3 *3 *ceil(4/3) *4 =72. In this example, even when the output is written four values at a time, only the first three values are valid because the fourth value is a zero-padded value.

Fully Connected Module External Memory Data Format

If your deep learning network includes both convolution and fully connected layers, the output of the deep learning processor uses the fully connected module external memory data format.

The input of the fully connected layer is a 1-by-X matrix where X is the number of output features of the fully connected layer. If X is not a multiple of the parallel data transfer number, then you must pad the memory with zeroes to align the memory with the parallel data transfer number, N.

This image shows the external memory format for single data type with an output feature number of six and a parallel data transfer number, N of eight. You must pad the zeroes at the end.

This image shows the external memory format for int8 data type data that has an output feature number of six and a parallel data transfer number, N of eight. The data is padded with two zeroes at the end. The data is written in little-endian format starting with the LSB and ending with the MSB.

Custom Layer Module External Memory Data Format

If the last layer is a sigmoid layer and the input to the sigmoid layer is a convolution layer, the custom layer module format is the same as the convolution module external memory data format. See Convolution Module External Memory Data Format.

If the last layer is a sigmoid layer and the input to the sigmoid layer is a fully connected layer, the custom layer module format is the same as the fully connected module external memory data format. See Fully Connected Module External Memory Data Format.