Unleashing the Power of FPGAs Through Model-Based Design
Nabeel Shirazi, Ph.D., Xilinx Inc.
Model-Based Design has long been the de facto standard for algorithm developers exploring and implementing applications such as software-defined radios, embedded vision, motor control systems, and medical devices. Many of these applications require high-performance compute and benefit significantly from the massively parallel architecture of FPGAs. But to leverage FPGAs, developers first had to bridge the gap between the algorithm-centric world of MATLAB® and Simulink® and the hardware-centric world of FPGAs, which once required fairly arduous manual translation steps.
Almost 20 years ago, Xilinx pioneered the solution to this problem with System Generator for DSP, which enabled a Model-Based Design flow that could map directly to FPGAs. It has been used successfully in thousands of designs. But a lot has changed over the last two decades. New applications such as ADAS, 5G, and machine learning have placed increasing performance demands on systems and driven the evolution of FPGAs into new device classes such as programmable SoCs and, just recently, adaptive compute acceleration platforms (ACAPs). Along with that, the model-based programming model has also evolved, moving to higher levels of abstraction in order to manage the massive increase in system complexity.
This talk draws inspiration from the past 20 years of Model-Based Design to lay a foundation for the next 20 years of innovation. We describe how the market trends, programmable devices, and model-based development have changed over the past decade and how they are likely to evolve in the years to come.
Recorded: 6 Nov 2018
Well, thank you, Rich, for the introduction. It's really my pleasure to be here today. I have had the opportunity to work with MathWorks for the last 20 years on developing a tool flow from MATLAB and Simulink to our devices. However, this is the very first time I've had a chance to talk to so many MATLAB and Simulink users in the audience. So it's really my pleasure to be here and my honor to be here.
Twenty years is a long time in this industry, and there's been a lot of change. So I'll give you a glimpse of the journey that we've had with MATLAB and Simulink, the collaboration that we've had with MathWorks, some of the challenges that we have with today's applications, and where we may be going in the future.
For starters, we have seen three generations of wireless infrastructure emerge over this time frame, starting with 3G, then 4G LTE, and now you're seeing 5G being rolled out. Also, we're at the dawn of AI adoption. It's anybody's guess, as you've seen in Rich's talk, where AI is going to be used. But one thing we know for sure is that the complexity of these algorithms will grow tremendously.
So in order to meet the needs of these different applications, we've had to come up with new classes of devices over the years. So we started with FPGAs, field programmable gate arrays. So can I see a show of hands of how many people have used an FPGA? Okay. Xilinx folks have to put their hands down. Okay. Great! There's a good number out there.
So I'll give you a brief summary of what an FPGA is. It's an array of configurable logic blocks and programmable interconnect. These configurable logic blocks can be cascaded together to create larger functions like FFTs and filters, and the programmable interconnect can be used to create custom data paths between those configurable logic blocks. Back then, we had thousands of these configurable logic blocks in our devices. Now, we have millions. You can instantiate many of these functions in parallel with each other, and that way, you can outperform CPUs, DSP processors, and now GPUs.
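To make that concrete, here's a minimal MATLAB sketch of the kind of DSP function that gets mapped onto those configurable logic blocks; the filter length and cutoff are assumed values for illustration, and fir1 comes from Signal Processing Toolbox. On a CPU the taps are computed one after another, while on an FPGA all 16 multiply-accumulates can be laid out side by side, producing one output per clock.

```matlab
% Illustrative sketch only: a 16-tap FIR filter, the sort of function that
% maps onto CLBs and DSP slices. Filter order and cutoff are assumptions.
h = fir1(15, 0.25);      % 16 coefficients, low-pass with assumed cutoff
x = randn(1, 1024);      % stand-in input sample stream
y = filter(h, 1, x);     % in hardware: 16 parallel MACs, one output per clock
```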
So these FPGAs have evolved over the years, and they've truly changed into systems on a chip. Most recently, we've had RFSoC and MPSoC devices rolled out. RFSoC is used for the 5G market. MPSoC, or multiprocessor system on a chip, is used specifically in embedded vision and AI applications.
And most recently, we announced our ACAP device, or adaptive compute acceleration platform. And this is a really exciting device. I've been at Xilinx for 20 years, and this is the one I'm most excited about by far. So my job at Xilinx has been to connect the dots between these applications and these devices by ratcheting up the level of abstraction in our tools, in order to provide more productivity to users like you.
So let's take a look at the very beginning, at the dawn of 3G wireless. It was 1998, which just so happens to be when I first joined Xilinx. We had a major communications customer come to us, and they wanted to implement a key portion of their 3G wireless radio, a digital baseband predistortion algorithm. The primary reason was that they wanted to reduce the cost of their system. They wanted to take advantage of FPGAs in order to exploit the parallelism they offer, but also to create custom memory hierarchies to feed the compute engines on the device.
There was one problem. Their designs were written in MATLAB and Simulink, and our tools at that time only took in hardware description languages like VHDL and Verilog, RTL-level code. So we had this huge gap between Simulink and FPGAs. And there was really only one way, back then, to bridge that gap. And that was to hire a hardware designer.
So this poor hardware designer would bang away on his state-of-the-art computer with a CRT monitor and do a manual translation from the Simulink diagram to RTL code. And this was extremely tedious and error prone. So there had to be a better way to get to 3G first.
And that better way that Xilinx invented was a tool called System Generator for DSP. What this allowed them to do was go from a DSP-oriented blockset in Simulink to RTL code and highly customized IP cores. We did this mapping, essentially, between the blocks in Simulink and the IP cores that Xilinx provided. That gave you really good quality of results.
So this customer's design is actually shown over here on the right-hand side. It was a 13,000-block Simulink model, but it only took them a couple of months to design it and verify it. And they credit us with helping them get to the 3G wireless market first.
So it's the consensus of our customers that Model-Based Design is much more than just code generation. It is simulation, prototyping, and code generation to go to production. Simulink provides a very natural way to specify parallelism and custom data paths between your compute engines. It lets you debug and test at the model level, which simulates orders of magnitude faster than RTL code. And it reduces the number of times you have to go to hardware, because debugging your design in hardware can be a much more tedious process. Finally, you're creating an executable spec, and this executable spec can be handed off to different groups in your organization: your FPGA engineer, your RF engineer, as well as your comms engineer.
So let's take a look at some hard customer data here. We were working with BAE Systems back then, and they had a VHDL expert creating a software-defined radio. It took him 645 hours to build that software-defined radio. A different engineer picked up MATLAB, Simulink, and System Generator for DSP, and it only took him 46 hours to build the same design. That is a 14x productivity gain. That is huge.
Okay. So let's fast forward to today, where we're trying to address different applications. Now we're trying to address 5G and AI. In order to do that, we've had to create new blocksets in System Generator. We've created what's called a super-sample rate processing blockset in System Generator to address the 5G market. And we've had to create a completely new tool called Model Composer to address the embedded vision and AI market. These take advantage of our devices, the SoC devices, and our future devices.
So as many of you may know, 5G wireless is a real bear. Here's a chart from an ETRI workshop, where the consensus was that it's 100x more complex than 4G. There are many key performance indicators here, like peak data rate, capacity, and latency, and each one of them is at least an order of magnitude more aggressive than it was in 4G.
And on top of that, there are new technologies in 5G like multi-user massive MIMO, or multiple input, multiple output antennas. There's new beamforming technology to talk to all these antennas. And they're communicating at a millimeter wave frequency between 30 gigahertz and 300 gigahertz. And on top of all of that, the standard is still evolving. So it's a perfect fit for our devices.
So let's take a look at the applications that we're trying to target. First of all, the MIMO communication will be done in the remote radio head. We're looking at baseband processing and the wireless backhaul, which will use the millimeter-wave frequencies. And the perfect device to implement this in is the RFSoC device. I mentioned this is an SoC device, and it includes the traditional FPGA fabric that you see in yellow over there. But it also has a processing subsystem where you have a quad-core ARM Cortex-A53 application processor, you have real-time ARM processors, and you have a hardened memory controller.
But what is really unique about this device is the A-to-D converters and the D-to-A converters that are included inside the device. They're operating at a very fast rate: up to 4 gigasamples per second on the A-to-D, and 6.4 gigasamples per second on the D-to-A. This simplifies the design considerably, because previously you had to have the A-to-Ds and the D-to-As outside your device. Now you pull them into your device. It reduces complexity as well as power.
So how do you model 5G designs in MATLAB and Simulink? Well, fortunately, MathWorks has recently come out with a 5G Toolbox that lets you do end-to-end link-level simulation and generate waveforms. You can download these waveforms to actual hardware using the Avnet RFSoC Explorer. You can take that stimulus and feed it to the A-to-Ds and D-to-As, you can set up the device, and you can bring data back from the device and compare it to a golden reference model back in MATLAB.
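As an illustration only, here's a minimal MATLAB sketch of that golden-reference comparison. The waveform and the captured samples are stand-in data, not output from the 5G Toolbox or the RFSoC Explorer, and the variable names are placeholders.

```matlab
% Sketch of comparing captured samples against a golden reference in MATLAB.
% Both signals are synthetic stand-ins; a real flow would use a 5G Toolbox
% waveform and samples read back from the RFSoC.
refWave    = exp(1j*2*pi*0.01*(0:4095)).';                       % placeholder golden waveform
rxCaptured = refWave + 0.01*(randn(4096,1) + 1j*randn(4096,1));  % placeholder hardware capture
errVec = rxCaptured(1:numel(refWave)) - refWave;                 % assumes capture is already aligned
evmRms = 100 * norm(errVec) / norm(refWave);                     % RMS error vector magnitude, in percent
fprintf('EVM against the golden reference: %.2f%%\n', evmRms);
```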
Now, these really fast A-to-Ds and D-to-As pose a challenge for the DSP data path that you have to implement in the programmable logic. In the original System Generator, what you would have to do is time-demultiplex the input, because it's coming in at—in this example—1.5 gigahertz, while the FPGA fabric runs relatively slower. You had to parallelize your data path, and then time-multiplex the output. Okay?
So in order to make this simpler, what we did was create a super-sample rate processing blockset that takes a vector of data into each one of the blocks. So this design, which was about 30-something blocks, is simplified down to just nine blocks with the super-sample rate blockset. It takes vectors in, produces vectors out, and makes things much, much simpler to construct.
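Here's a rough MATLAB sketch of the super-sample rate idea, with an assumed factor of four samples per fabric clock; the blockset handles this inside Simulink, so this is only meant to show the arithmetic behind it.

```matlab
% Super-sample rate processing in miniature: a 1.5 GSPS stream is delivered as
% vectors of SSR samples so the fabric only has to run at 1.5 GHz / SSR.
fs  = 1.5e9;                  % converter sample rate
SSR = 4;                      % samples per fabric clock (assumed for illustration)
fabricClk = fs / SSR;         % 375 MHz fabric clock
x = randn(1, 4096);           % stand-in sample stream
xVec = reshape(x, SSR, []);   % each column is the vector consumed in one clock cycle
fprintf('Fabric clock: %g MHz, %d samples per cycle\n', fabricClk/1e6, SSR);
```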
Okay. Let's switch gears and take a look at an embedded vision ADAS example. We had a customer come to us recently, and they wanted to build a module that goes behind the rear-view mirror of your car, for collision avoidance. And their power budget was really strict: it was only five watts. I mean, there's not a lot of compute you can do at five watts. And their cost budget was only $10 to $40. Rich talked about some of the challenges in ADAS.
So let's take an example here. Imagine you're driving down the road—or maybe in the future, the car is driving you down the road—and there is a front-facing camera that is collecting data. It's probably sampling anywhere between 30 and 60 frames per second. There are HD cameras, and sometimes you have stereo vision cameras, right? You have a deep learning network there that is trying to classify the images that are coming in, to figure out what kind of objects are in front of you and which direction they're going, in order to plan a path to avoid those objects. And finally, you need to communicate with the braking system to apply the brakes.
Now, the requirement that came in from this customer said that the latency from the input sensor to applying the brakes needs to be 30 milliseconds. The less time it takes to do this computation, the more time you have to apply the brakes. That's kind of important. So the perfect device to implement this was the Zynq MPSoC. Now, this has a different collection of hardened components in the device. It doesn't have the A-to-Ds and the D-to-As, but it has a video codec in it.
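Just to put that 30 milliseconds in perspective, here's a back-of-the-envelope MATLAB calculation; the vehicle speed is an assumed example value, not something from the customer's spec.

```matlab
% How far the car travels while the sensor-to-brake pipeline is still computing.
latency = 30e-3;                 % s, sensor input to brake command (from the requirement)
v_kmh   = 100;                   % km/h, assumed highway speed for illustration
v_ms    = v_kmh * 1000 / 3600;   % about 27.8 m/s
d = v_ms * latency;              % distance covered before braking even begins
fprintf('At %d km/h, the car travels %.2f m during the 30 ms pipeline.\n', v_kmh, d);
```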
So now we had to rethink: how do we want to do the design for the programmable logic in Simulink? That's where we came up with Model Composer. Like System Generator, it's a blockset that fits into Simulink. However, it simulates at a much higher level of abstraction. You have blocks that do computer vision, linear algebra, and DSP. Here, I've opened up the computer vision portion of the library, which leverages the OpenCV open-source library. And I'll show you an example with a couple of those blocks, a dilation block and an erosion block.
So this example is street sign detection. It uses color to detect the sign. One of the guys in my group decided to take a video as he was driving in to Xilinx. At least he was paying attention to the speed signs when he was driving in. He recorded this video and played it back in Simulink using blocks from the Computer Vision System Toolbox. So you can have the images come in, and you can look at the resulting images, using the blocks from the Computer Vision System Toolbox.
The first block that you see in that pipeline is actually imported C code. Somebody had written C code to do color space conversion and brought it in. And then you see, at the end of the pipeline, those two blocks from the OpenCV library. Now, if you take a close look at the data types that are being used here, they are frames of video data. Back in the System Generator days, we had to use pixel-level processing. Here, we're doing frame-level processing. And also, these blocks are untimed. The combination of those two lets you simulate orders of magnitude faster than we could with System Generator.
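For readers who want to try the same idea outside Model Composer, here's a minimal MATLAB sketch using Image Processing Toolbox; the color thresholds and the image file name are assumptions for illustration, not the blocks or settings from the actual demo.

```matlab
% Color-based sign detection in miniature: threshold on a red hue, then clean
% the mask with erosion and dilation, the same morphological operations the
% OpenCV-backed blocks perform. File name and thresholds are placeholders.
rgb = imread('sign_frame.png');                % placeholder video frame
hsv = rgb2hsv(rgb);
mask = (hsv(:,:,1) > 0.95 | hsv(:,:,1) < 0.05) & hsv(:,:,2) > 0.5;  % saturated red pixels
mask = imerode(mask,  strel('square', 3));     % erosion removes isolated noise pixels
mask = imdilate(mask, strel('square', 5));     % dilation grows the surviving blobs back
```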
Now, the output product from Model Composer is completely different as well. We generate C code that gets further optimized by our Vivado High-Level Synthesis technology. And we also generate RTL code that leverages the high-performance IP cores that we have at Xilinx, like those for filters and FFTs. So it's kind of the best of both worlds.
We also have a path to go to hardware. I don't want to give too much away about that, because I'd like you to go into the room next door and take a look at it. There's an optical flow example that they've built, and it's running on an Avnet Ultra96 board over there.
Okay. So I showed you a little bit of how we started with Model-Based Design and some of the challenges that we have today with 5G and ML. Now let's take a look at the road ahead. One thing we know for sure is that there's going to be AI, AI, and more AI in future applications. Even in 5G, it's used for threat detection, for instance. I introduced Model Composer, and we plan to increase its capabilities to meet the needs of AI. And we had to come up with a new class of device, the ACAP device.
So let's take a look at that. With ML, as many of you probably know, there are two distinct phases. There is training, where you have a large dataset and you train a neural network. This can take hours, maybe days, within a data center. In a data center, how much latency you have to compute the results, or how much power you're consuming, is not as important. But it is when you deploy.
To make the training process much easier, MathWorks has a Deep Learning Toolbox that you'll hear more about today. But then you have to deploy that network onto actual hardware, and that's where inference comes into play. Here, power and latency are extremely important, like I showed you in that ADAS example.
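As a small illustration of the inference step in MATLAB, here's a sketch using Deep Learning Toolbox; it assumes the GoogLeNet support package is installed, and the image file name is a placeholder.

```matlab
% Inference with a pretrained network: this is the step that ultimately gets
% deployed onto the device. Support package and image file are assumptions.
net = googlenet;                                 % pretrained GoogLeNet (support package)
img = imread('street_scene.jpg');                % placeholder input image
img = imresize(img, net.Layers(1).InputSize(1:2));  % resize to the network's input size
label = classify(net, img);                      % predicted class label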
So at Xilinx, we realize that training is very important. It's got a lot of press, because you know, they have networks now that are better than the human eye. However, we think there's a larger market opportunity in inference. So that's where we're going to be focusing.
But there are many, many challenges in inference. For instance, the pace of AI innovation is staggering. We've gone from AlexNet to GoogLeNet to, now, ResNet. By the time you finish your design, there might be a new state-of-the-art network. Again, performance at low latency is critical. Power consumption is also critical. And you want to accelerate the entire application, not just the inference engine: everything from cleaning up the image coming in from the sensor, to the inference engine, to being able to do the decision making. And you want to take advantage of all the resources on these devices.
So we believe that adaptable hardware can address these challenges. The reason is that you can create custom data flows for your application, you can create custom memory hierarchies to feed the compute in your algorithms, and you can use custom precision. To bring all three of these together, Xilinx—and many other people in the industry—has coined the term domain-specific architecture, or DSA. That's how you bring all of these things together into a single design. And I'll show you an example of that.
So Xilinx has an image classification solution that's available on cloud platforms like Amazon F1, where they've customized the data flow for the latest DNNs. They've customized the memory hierarchy, so you're using caches within the device and you minimize the amount of time that you communicate with external DDR memory. And you also customize the precision for these networks. In this example, we're using 8-bit integers. But this is not enough for tomorrow's AI applications.
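Here's a minimal MATLAB sketch of what 8-bit integer quantization of a layer's weights can look like; this is a generic symmetric-scaling example, not Xilinx's actual quantization flow.

```matlab
% Generic symmetric int8 quantization of one weight tensor (illustrative only).
W = randn(64, 64);                      % stand-in for a layer's floating-point weights
scale = max(abs(W(:))) / 127;           % one scale factor for the whole tensor
Wq  = int8(round(W / scale));           % quantize to 8-bit integers
Wdq = double(Wq) * scale;               % dequantize to inspect the error introduced
fprintf('Max absolute quantization error: %g\n', max(abs(W(:) - Wdq(:))));
```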
So that's where we believe that our new adaptive compute acceleration platform, or ACAP, devices will really shine. I don't have enough time to talk about all the key new innovations in this device. I mean, there are things like new ARM processors, there's a network on chip to move data all around the device, there are hardened memory controllers, and there are things like PCIe Gen 4. The one area that I'd like to focus on is the AI engines.
So the AI engines are an array of VLIW/SIMD processors. In our very first device, there are 300 of them, and there's massive interconnect between these VLIW processors. You can go off into the fabric to create custom memory hierarchies to feed this beast. And the way we're going to end up programming this is using MATLAB and Simulink, with Model Composer to make that all happen.
So what do you get with an ACAP device? Here's a benchmark on a GoogLeNet network with a constraint of a two-millisecond latency. We've done a comparison. The state-of-the-art GPUs are here. And our existing solution, the xDNN DSA that I mentioned, classifies 4,000 images per second. Okay? That outperforms the existing GPUs right there. But with the ACAP device, you go up to 22,000 images per second. That's quite an improvement. But it even gets better than that.
So Xilinx has what we call pruning technology that looks at a network, figures out different branches of the network that it can cut off, and quantizes different portions of that algorithm. And with that we can get even better performance: a 1.3x to 8x further improvement. So you can get up to 30,000 images per second on an ACAP device. That's pretty good.
Okay. So how do the tools need to adapt in order to take advantage of all this technology and meet the needs of these applications? Well, first thing, I would love to be able to do cloud-based design entry. So recently, I had a chance to use MATLAB online. And it was amazing. Three seconds, you're on, you're using MATLAB. It can't get any simpler than that. Right? But there was one thing missing, and that was Simulink. So I was really looking forward to that.
Also, you saw in the ADAS example that there were system-level constraints that came in from the customer, right? Things like bandwidth, latency, and power. These all need to be inputs to the compiler. Right now, the input to the compiler is clock frequency, right? Also, you need to know whether you're going to the cloud or the edge, because we might make decisions based on that. So we need to improve the compilers.
And since you'll be doing cloud-based entry, you could leverage FPGA hardware that's already on the cloud, like Amazon F1—many other cloud vendors are deploying FPGAs as well. Imagine you're profiling your code and you hit a hotspot. Maybe MATLAB could say, oh, we have an accelerated version of that already available on the cloud, and you go ahead and use that.
We need higher levels of abstraction in Simulink. For example, I showed you that digital predistortion design at the beginning. That was a 13,000-block design. We believe that's going to be a single block in Model Composer in the future, one that takes advantage of the AI engines.
And lastly, we'd like you to be able to take advantage of a wide range of these domain-specific architectures. Like the image classification engine I showed you: you should be able to add your special sauce to it so you don't have to reinvent it, and you don't have to deal with all the nitty-gritty I/O of the device. I think all this can happen through a joint collaboration with MathWorks. And at the end, what we can provide to you is a deployable design that uses all the compute resources of an ACAP device, for both the edge and the cloud.
So needless to say, we've seen a lot of innovation over the last 20 years. Model-Based Design is more important than ever to manage complexity and improve productivity for you. Xilinx will continue to invest heavily in Model-Based Design, because we believe that it's a natural and productive on-ramp to our devices. For reasons I mentioned earlier, adaptable devices have a clear advantage in ML, ADAS, and 5G to meet the performance, latency, and power requirements of our customers.
And lastly, we believe that the intersection of tools, silicon, and these platforms will provide an inflection point for more AI adoption. So whether you're an AI expert or are just learning about AI for the very first time and were inspired by Rich's talk, we look forward to working with you and our partners, MathWorks, on your next design. Thank you.
[APPLAUSE]