## MathWorks<sup>®</sup>

INNOVATIVE GSPS SIGNAL PROCESSING SOLUTION USING MATLAB & SIMULINK FOR FPGA/SoC

#### Nadereh Rooein

Principal ASIC/FPGA Application Engineer, MathWorks nrooein@mathworks.com

**Nadereh Rooein** has many years of experience in ASIC/FPGA design and verification.

Nadereh is part of the Application Engineering team at MathWorks, specializing in HDL modeling and verification tools across various application areas. This presentation showcases a MathWorks solution that significantly eases implementing Giga Sample Per Second (GSPS), or Super-Sample-Rate (SSR), digital signal processing algorithms in MATLAB and Simulink using DSP HDL Toolbox.

## Agenda

- Implementation of high-throughput (GSPS) signal processing
- GSPS DDC example
- Seamless architecture selection for optimized implementation
- Simulink simulation with hardware latency



## Era of High-Bandwidth GSPS Data Processing

[Figure 2: The 5G spectrum, which includes the FR1 and FR2 bands. Frequency axis: <1 GHz, 3 GHz, 5 GHz, 24 GHz, 39 GHz, 100 GHz. Low-band and mid-band fall in FR1; high-band (e.g., mmWave) falls in FR2.]

## Emerging Technologies Demanding Advanced DSP

In recent years, advances in analog-to-digital converters (ADCs) have enabled high-bandwidth applications such as radar, 5G (FR2), and software-defined radio to deliver data from over-the-air interfaces to DSP algorithms at "Giga Sample Per Second" (GSPS) rates.





## **GSPS** Design Challenges for FPGA/ASIC

Scalar signal processing at GSPS throughput requires a GHz clock.

However, the maximum clock frequency of FPGAs and ASICs is limited and cannot keep up with these demands.

In addition, a higher clock frequency is undesirable because of its power consumption.

How can a design reach GSPS throughput without a GHz clock?

## Parallel (Frame) Processing

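[Figure: Scalar processing at a GHz clock feeds samples 1 to 10 into the DSP logic one at a time; frame processing at a MHz clock feeds several samples into the DSP logic in parallel. Moving from one to the other requires redesigning the HDL (the slide shows a VHDL `ENTITY FilterProgrammable` port list).]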

The hardware architecture of the signal processing application should be changed to <u>process more data in parallel.</u>
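The trade-off between clock rate and parallelism can be sketched with a little arithmetic. The Python snippet below is only an illustration: the function names and the 2 GSPS / 400 MHz figures are hypothetical and do not come from the slides.

```python
import math


def required_clock_hz(sample_rate_sps: float, frame_size: int) -> float:
    """Clock rate needed when `frame_size` samples are processed per cycle."""
    if frame_size < 1:
        raise ValueError("frame_size must be at least 1")
    return sample_rate_sps / frame_size


def min_frame_size(sample_rate_sps: float, max_clock_hz: float) -> int:
    """Smallest frame size that keeps the clock at or below `max_clock_hz`."""
    return math.ceil(sample_rate_sps / max_clock_hz)


# Hypothetical target: 2 GSPS on a device assumed to close timing at 400 MHz.
frame = min_frame_size(2e9, 400e6)
print(frame, required_clock_hz(2e9, frame) / 1e6, "MHz")
```

A scalar design (frame size 1) would need the full 2 GHz clock, while processing several samples per cycle brings the clock back into a range an FPGA can reach.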


## Frame-Based Design Challenges

- A frame-based version of the algorithm may not exist or may not be well known, as with IIR, CIC, or biquad filters.
- Frame-based DSP, such as a filter or FFT, requires extensive verification and rewriting of the code and test bench for each input frame size.
- Design exploration for area and speed optimization is costly because the algorithm must be redesigned.



## Scalar vs. Frame-Based FIR Architecture



#### Frame-Based: Polyphase
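A polyphase decomposition is one common way to build a frame-based FIR filter. The Python sketch below is a behavioral illustration only, not the toolbox implementation: it splits the taps into `frame_size` subfilters and the input into `frame_size` lanes, then checks the result against a scalar reference. All function names are hypothetical.

```python
def fir_scalar(h, x):
    """Reference FIR, one sample per clock: y[n] = sum_j h[j] * x[n - j]."""
    return [sum(h[j] * x[n - j] for j in range(len(h)) if n - j >= 0)
            for n in range(len(x))]


def fir_frame_polyphase(h, x, frame_size):
    """Frame-based FIR: `frame_size` outputs per clock via polyphase subfilters."""
    M = frame_size
    if len(x) % M:
        raise ValueError("input length must be a multiple of the frame size")
    h = list(h) + [0] * (-len(h) % M)         # pad taps to a multiple of M
    sub = [h[p::M] for p in range(M)]         # subfilter p: h[p], h[p+M], ...
    lanes = [x[q::M] for q in range(M)]       # lane q: x[q], x[q+M], ...
    y = [0] * len(x)
    for n in range(len(x) // M):              # one iteration models one clock
        for k in range(M):                    # output lane k
            acc = 0
            for p in range(M):
                q = (k - p) % M               # input lane feeding subfilter p
                delay = 1 if p > k else 0     # lane wraps into previous frame
                for i, tap in enumerate(sub[p]):
                    m = n - i - delay
                    if m >= 0:
                        acc += tap * lanes[q][m]
            y[n * M + k] = acc
    return y


taps = [1, 2, 3, 4, 5, 6, 7]
data = list(range(1, 17))
print(fir_frame_polyphase(taps, data, 4) == fir_scalar(taps, data))
```

Each of the `M` output lanes uses all `M` subfilters, so a fully parallel version needs about `M` times as many multipliers as the scalar filter. That growth is why high-rate filters consume significant DSP resources.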



## Designing High-Rate Filter Requires Significant DSP Resources



DSP HDL Toolbox provides DSP algorithms (FFT, IIR, and FIR filters) that automatically select a suitable hardware architecture based on the required throughput.

## **FFT Implementation Exploration**



Implementation metrics are post-synthesis results, targeting Xilinx Zynq<sup>®</sup> UltraScale+, speed grade -2.

### Frame-Based Digital Down Converter Example

# Showcasing ease of use & seamless algorithm adaptation

## DSP HDL Toolbox – DDC Example

- This example shows how to design a digital downconverter (DDC) for radio communication applications such as LTE, and how to generate HDL code for it.
- It also shows how easily you can change the throughput and explore the functionality and hardware resources of a DDC filter chain.





## **Digital Down Converter Structure**



## HDL implementation of DDC with Frame Processing



## Frame-Based DDC example



| MATLAB assignment  | Description                                                        |
|--------------------|--------------------------------------------------------------------|
| `FsIn = 122.88e6;` | Sampling rate of DDC input (Hz)                                    |
| `FsOut = 1.92e6;`  | Sampling rate of DDC output (Hz)                                   |
| `Fc = 32e6;`       | Carrier frequency (Hz)                                             |
| `Fpass = 540e3;`   | Passband frequency (Hz), equivalent to 36 x 15 kHz LTE subcarriers |
| `Fstop = 700e3;`   | Stopband frequency (Hz)                                            |
| `Ap = 0.1;`        | Passband ripple (dB)                                               |
| `Ast = 60;`        | Stopband attenuation (dB)                                          |

FrameSize = 4;
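The parameters above fix the overall decimation factor and, together with the frame size, the clock rate the design must meet. The Python sketch below only re-derives those numbers; the helper name `ddc_summary` is hypothetical and is not part of the MathWorks example.

```python
FS_IN = 122.88e6   # DDC input sampling rate (Hz), from the parameter list
FS_OUT = 1.92e6    # DDC output sampling rate (Hz)
F_PASS = 540e3     # passband frequency (Hz)


def ddc_summary(fs_in: float, fs_out: float, frame_size: int) -> dict:
    """Overall decimation factor and clock rate for a given frame size."""
    ratio = fs_in / fs_out
    decimation = round(ratio)
    if abs(ratio - decimation) > 1e-9:
        raise ValueError("fs_in must be an integer multiple of fs_out")
    return {"decimation": decimation, "clock_hz": fs_in / frame_size}


print(ddc_summary(FS_IN, FS_OUT, frame_size=4))
print(36 * 15e3 == F_PASS)  # 36 LTE subcarriers of 15 kHz span the passband
```

With `FrameSize = 4`, the same 122.88 MSPS input stream is carried on a clock running at a quarter of the sampling rate, and the filter chain must decimate by 64 in total.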

#### Digital-Down-Converter-for-LTE





| MATLAB assignment  | Description                                                        |
|--------------------|--------------------------------------------------------------------|
| `FsIn = 122.88e6;` | Sampling rate of DDC input (Hz)                                    |
| `FsOut = 1.92e6;`  | Sampling rate of DDC output (Hz)                                   |
| `Fc = 32e6;`       | Carrier frequency (Hz)                                             |
| `Fpass = 540e3;`   | Passband frequency (Hz), equivalent to 36 x 15 kHz LTE subcarriers |
| `Fstop = 700e3;`   | Stopband frequency (Hz)                                            |
| `Ap = 0.1;`        | Passband ripple (dB)                                               |
| `Ast = 60;`        | Stopband attenuation (dB)                                          |

FrameSize = 1;

## Use Frame-size to Parameterize the Model

[Figure: Block parameter dialogs for the NCO block and the CIC Compensation Decimation (FIR Decimator) block, with the sampling rate equal to the clock rate (Fclk).]

- NCO: Phase increment is `nco.PhaseInc`, Phase offset is `0`, the output signal type is Complex exponential, and the block mask reports Latency = 9.
- CIC Compensation Decimation (FIR Decimator): Coefficients are `compFilt.Numerator`, Filter structure is Direct form systolic, Decimation factor is `2`, and Minimum number of cycles between valid input samples is `ceil(8/FrameSize)`.

## **Resource utilization**

| Resource | Available | Scalar processing: utilization | Scalar processing: utilization % | Frame processing: utilization |
|----------|-----------|--------------------------------|----------------------------------|-------------------------------|
| LUT      | 425280    | 2632                           | 0.62                             | 5139                          |
| LUTRAM   | 213600    | 74                             | 0.03                             | 171                           |
| FF       | 850560    | 6145                           | 0.72                             | 11024                         |
| BRAM     | 1080      | 0.50                           | 0.05                             | 2                             |
| DSP      | 4272      | 18                             | 0.42                             | 38                            |

Resource usage grows by a factor of almost 2 while the input frame size grows by a factor of 4.

## Performance of the Frame-Processing DDC on Zynq UltraScale+ RFSoC

| Processing | Max. Clock Frequency (MHz) | Throughput (MSPS) |
|------------|----------------------------|-------------------|
| Scalar     | 412                        | 412               |
| Frame      | 408                        | 408 × 4 = 1632    |

Achieved 1.6 Giga Sample Per Second
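The utilization and performance figures above can be compared directly. The Python snippet below is only a small calculation over the numbers already shown on the slides; the dictionary and function names are hypothetical.

```python
# Post-synthesis utilization from the resource table (scalar vs. frame size 4).
SCALAR = {"LUT": 2632, "LUTRAM": 74, "FF": 6145, "BRAM": 0.5, "DSP": 18}
FRAME = {"LUT": 5139, "LUTRAM": 171, "FF": 11024, "BRAM": 2, "DSP": 38}


def growth_factors(scalar: dict, frame: dict) -> dict:
    """Ratio of frame-processing to scalar-processing usage per resource."""
    return {name: frame[name] / scalar[name] for name in scalar}


def throughput_msps(clock_mhz: float, frame_size: int) -> float:
    """Throughput in MSPS when `frame_size` samples are processed per clock."""
    return clock_mhz * frame_size


growth = growth_factors(SCALAR, FRAME)
gain = throughput_msps(408, 4) / throughput_msps(412, 1)
for name, factor in growth.items():
    print(f"{name}: x{factor:.2f}")
print(f"throughput: x{gain:.2f}")
```

LUT, FF, and DSP usage roughly doubles and BRAM usage quadruples, while throughput rises by nearly a factor of 4 at almost the same clock frequency.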

## Reduce the Excessive Number of Multipliers



In a DDC with frame input, optimizing the first decimation filter is essential to minimize multiplier usage.

A frame-based CIC decimator provides a simple way to reduce the sampling rate substantially with few resources and no multipliers.

DSP HDL offers an innovative approach for designing frame-based CIC or IIR filter chains, optimized for GSPS sampling frequencies.
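To see why a CIC decimator needs no multipliers, consider the textbook scalar structure: `N` integrators at the input rate, a downsampler by `R`, and `N` combs at the output rate. The Python sketch below is a behavioral model under those assumptions, not the frame-based toolbox architecture, and its function names are hypothetical. It checks the result against an equivalent cascade of moving sums.

```python
def cic_decimate(x, R, N):
    """Textbook CIC decimator: N integrators, keep 1 of R samples, N combs.

    Only additions and subtractions are used. Python integers do not
    overflow; hardware would rely on wrap-around arithmetic instead.
    """
    stage = list(x)
    for _ in range(N):                  # integrators run at the input rate
        acc, out = 0, []
        for v in stage:
            acc += v
            out.append(acc)
        stage = out
    stage = stage[R - 1::R]             # downsample by R
    for _ in range(N):                  # combs run at the output rate
        prev, out = 0, []
        for v in stage:
            out.append(v - prev)
            prev = v
        stage = out
    return stage


def boxcar_reference(x, R, N):
    """Same response computed as N cascaded length-R moving sums."""
    stage = list(x)
    for _ in range(N):
        stage = [sum(stage[max(0, n - R + 1):n + 1]) for n in range(len(stage))]
    return stage[R - 1::R]


samples = [(7 * n) % 11 - 5 for n in range(64)]
print(cic_decimate(samples, R=8, N=3) == boxcar_reference(samples, R=8, N=3))
```

For a constant input of 1, the output settles at `R**N`, which is the gain that a gain-correction stage would normalize.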

## Algorithm Adaptation based on Frame Size



| Frame size | CIC input: processing | CIC input: sharing | CIC compensation input: processing | CIC compensation input: sharing | Half-band input: processing | Half-band input: sharing | Final filter input: processing | Final filter input: sharing |
|------------|-----------------------|--------------------|------------------------------------|---------------------------------|-----------------------------|--------------------------|--------------------------------|-----------------------------|
| 1          | Scalar in-Scalar out  | 1                  | Scalar in-Scalar out               | 8                               | Scalar in-Scalar out        | 16                       | Scalar in-Scalar out           | 32                          |
| 4          | Frame in-Scalar out   | 1                  | Scalar in-Scalar out               | 2                               | Scalar in-Scalar out        | 4                        | Scalar in-Scalar out           | 8                           |
| 16         | Frame in-Frame out    | 1                  | Frame in-Scalar out                | 1                               | Scalar in-Scalar out        | 1                        | Scalar in-Scalar out           | 2                           |
| 32         | Frame in-Frame out    | 1                  | Frame in-Frame out                 | 1                               | Frame in-Scalar out         | 1                        | Scalar in-Scalar out           | 1                           |
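The pattern in this table follows from the sample rate at each stage input. The Python sketch below reproduces it under two assumptions that are inferred rather than stated on the slide: the four stages decimate by 8, 2, 2, and 2 (consistent with the sharing factors 8, 16, 32 and the overall factor of 64), and "sharing" is the number of clock cycles available per input sample. The names are hypothetical.

```python
from fractions import Fraction

# Assumed per-stage decimation factors (inferred, not stated on the slide).
STAGES = [("CIC", 8), ("CIC compensation", 2), ("Half band", 2), ("Final filter", 2)]


def adaptation_plan(frame_size: int):
    """Return (stage, processing mode, sharing factor) for each stage."""
    rows = []
    rate = Fraction(frame_size)         # samples per clock at the stage input
    for name, decimation in STAGES:
        rate_out = rate / decimation
        kind_in = "Frame" if rate > 1 else "Scalar"
        kind_out = "Frame" if rate_out > 1 else "Scalar"
        sharing = max(1, int(1 / rate))  # clock cycles per input sample
        rows.append((name, f"{kind_in} in-{kind_out} out", sharing))
        rate = rate_out
    return rows


for size in (1, 4, 16, 32):
    print(size, adaptation_plan(size))
```

As the frame size grows, frame processing propagates deeper into the chain and the opportunity to share multipliers across clock cycles shrinks.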

## FIR Decimator Optimization Technique for Automatic Architecture Selection




## **Discrete FIR Filter – Automatic Architecture Selection**



DSP engineers can use the **same algorithm for both scalar and frame processing** and easily explore area, speed, and throughput trade-offs when targeting FPGAs and SoCs.





## **DSP HDL IPs**

- Provide hardware-optimized algorithms that model streaming data interfaces, hardware latency, and control signals in MATLAB and Simulink®.
- Process several samples in parallel to achieve high throughput, such as gigasample-per-second (GSPS) rates.
- Let you change block parameters to explore different hardware implementations.
- Support HDL code generation and deployment to FPGAs with HDL Coder<sup>™</sup>.

## Simulink Simulation with Hardware Latency

## DSP HDL Toolbox IP Blocks



## Wireless HDL Toolbox



## Vision HDL Toolbox

- The blocks simulate with hardware latency, but the latency is not shown on the block mask.
- The vision blocks use a control bus signal, which includes valid-in and valid-out signals.



## Fixed-Point Designer HDL Support Library



| Implementation                          | Throughput  | Latency      | Area                      |
|-----------------------------------------|-------------|--------------|---------------------------|
| Systolic                                | C           | O(<i>n</i>)  | O(<i>mn</i><sup>2</sup>)  |
| Partial-Systolic                        | C           | O(<i>m</i>)  | O(<i>n</i><sup>2</sup>)   |
| Partial-Systolic with Forgetting Factor | C           | O(<i>n</i>)  | O(<i>n</i><sup>2</sup>)   |
| Burst                                   | O(<i>n</i>) | O(<i>mn</i>) | O(<i>n</i>)               |

#### Choose a Block for HDL-Optimized Fixed-Point Matrix Operation

## **Cycle-Accurate Hardware Latency Simulation**

The blocks model architectural latency, including pipeline registers and resource sharing.

See: <u>FIR Filter Architectures for FPGAs and ASICs</u>



#### **Fully Parallel Systolic Architecture**

## **Cycle-Accurate Hardware Latency Simulation**

The latency between valid input data and the corresponding valid output data depends on these block parameters:

- Block architecture
- Input frame size
- Spacing between validIn samples
- Filter/FFT length
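What "latency between valid input and valid output" means can be shown with a toy cycle-based model. The Python sketch below is a hypothetical stand-in for a streaming block, not a model of any specific toolbox block: it delays data and its valid flag by a fixed number of clock cycles, and a helper measures that delay. The value 9 is borrowed from the latency shown on the NCO block mask in the DDC example.

```python
from collections import deque


class PipelinedBlock:
    """Toy streaming block: applies `fn`, then delays (data, valid) by `latency` cycles."""

    def __init__(self, fn, latency: int):
        self.fn = fn
        self.pipe = deque([(0, False)] * latency)  # pipeline starts with bubbles

    def step(self, data, valid: bool):
        """Advance one clock cycle and return (data_out, valid_out)."""
        self.pipe.append((self.fn(data) if valid else 0, valid))
        return self.pipe.popleft()


def measure_latency(block: PipelinedBlock, max_cycles: int = 1000) -> int:
    """Cycles from the first valid input to the first valid output."""
    for cycle in range(max_cycles):
        _, valid_out = block.step(1, cycle == 0)   # one valid sample at cycle 0
        if valid_out:
            return cycle
    raise RuntimeError("no valid output observed")


print(measure_latency(PipelinedBlock(lambda v: 2 * v, latency=9)))
```

In a real block the delay is not a free parameter; it follows from the architecture, frame size, valid spacing, and filter or FFT length listed above.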

## **CIC** Decimator Latency

The latency of the block changes depending on the type of input, the decimation you specify, the number of sections, and the value of the **Gain correction** parameter. This table shows the latency of the block. *N* is the number of sections and *vecLen* is the length of the vector.

| Input Data | Output Data | Decimation Type | Gain Correction | Latency in Clock Cycles                                                           |
|------------|-------------|-----------------|-----------------|-----------------------------------------------------------------------------------|
| Scalar     | Scalar      | Fixed           | off             | $3 + N$. When $R = 1$, $2 + N$.                                                   |
|            |             |                 | on              | $3 + N + 9$. When $R = 1$, $2 + N + 9$.                                           |
| Scalar     | Scalar      | Variable        | off             | $4 + N$. When $R_{max} = 1$, $3 + N$.                                             |
|            |             |                 | on              | $4 + N + 9$. When $R_{max} = 1$, $3 + N + 9$.                                     |
| Vector     | Scalar      | Fixed           | off             | $floor((vecLen - 1) \times (N/vecLen)) + 1 + N + (2 + (vecLen + 1) \times N)$     |
|            |             |                 | on              | $floor((vecLen - 1) \times (N/vecLen)) + 1 + N + (2 + (vecLen + 1) \times N) + 9$ |
| Vector     | Vector      | Fixed           | off             | $floor((vecLen - 1) \times (N/vecLen)) + 1 + N + (2 + (vecLen + 1) \times N)$     |
|            |             |                 | on              | $floor((vecLen - 1) \times (N/vecLen)) + 1 + N + (2 + (vecLen + 1) \times N) + 9$ |
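The latency table can be turned into a small calculator. The Python function below simply transcribes the formulas in the table; the function and parameter names are hypothetical, and $R$ is read as the decimation factor ($R_{max}$ for variable decimation), which is an interpretation of the table rather than something it states.

```python
GAIN_CORRECTION_CYCLES = 9  # extra cycles when Gain correction is on


def cic_decimator_latency(n_sections: int, decimation: int = 2, vec_len: int = 1,
                          variable: bool = False, gain_correction: bool = False) -> int:
    """Latency in clock cycles, transcribed from the table above.

    `decimation` is R for fixed decimation and R_max for variable decimation.
    `vec_len` = 1 means scalar input; larger values mean vector input.
    """
    N = n_sections
    if vec_len > 1:
        if variable:
            raise ValueError("the table lists vector input for fixed decimation only")
        # floor((vecLen - 1) * (N / vecLen)) evaluated in integer arithmetic
        latency = ((vec_len - 1) * N) // vec_len + 1 + N + (2 + (vec_len + 1) * N)
    else:
        latency = (4 if variable else 3) + N
        if decimation == 1:
            latency -= 1
    if gain_correction:
        latency += GAIN_CORRECTION_CYCLES
    return latency


print(cic_decimator_latency(3, decimation=8))
print(cic_decimator_latency(3, decimation=8, vec_len=4))
```

For three sections, moving from scalar input to a four-sample vector input raises the latency from 6 to 23 clock cycles according to these formulas.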



**Takeaway**

**DSP HDL Blocks:**

- Support GSPS processing
- Seamless algorithm adaptation
- Simulate with hardware latency
- Facilitate design optimization for speed, area, and throughput





## **DSP for FPGA Training**

This three-day course will review DSP fundamentals from the perspective of implementation within the FPGA fabric. Particular emphasis will be given to highlighting the cost, with respect to both resources and performance, associated with the implementation of various DSP techniques and algorithms.

Topics include:

- Introduction to FPGA hardware and technology for DSP applications
- DSP fixed-point arithmetic
- Signal flow graph techniques
- HDL code generation for FPGAs
- Fast Fourier Transform (FFT) Implementation
- Design and implementation of FIR, IIR and CIC filters
- CORDIC algorithm
- Design and implementation of adaptive algorithms such as LMS and QR algorithm
- Techniques for synchronization and digital communications timing recovery

https://www.mathworks.com/learn/training/dsp-for-fpgas.html