Video length is 17:42

Working with Synthetic Data | Deep Learning for Engineers, Part 2

From the series: Deep Learning for Engineers

Brian Douglas

This video covers the first step in deep learning: ensuring you have data to train the network. Learn if deep learning is right for your project based on the type and amount of data you have for training. Also discover how you can use synthetic data for training.

Published: 1 Apr 2021

In the first video, we covered how deep learning can be used to solve practical engineering problems, specifically problems where you’re trying to classify complex patterns in data. And we left off having briefly introduced the deep learning design workflow.

In this video, we’re going to focus on the first step in this workflow: having access to data. Now, I want to reiterate the caveat for this series: I’m not trying to explain everything you need to know about deep learning; I just want to introduce some concepts and get you thinking about the scenarios in which deep learning might be the right choice to solve your engineering problems. And part of making that decision comes down to the type and amount of data that you have access to.

And to ground this conversation with some realism, we’re going to use a practical engineering example: waveform identification in radar and communication applications.  And that’s a mouthful! But if you stick around I’ll explain what that means and how access to data and deep learning can play an important part in solving this problem.  I’m Brian, and welcome to a MATLAB Tech Talk.

I’m going to cover two use cases for why we may need to identify RF waveforms in the first place: these are for communication and radar.

We’ll begin by describing the overall problem - you know, what’s the need? Then we’ll talk about why traditional methods for classification could be difficult, and why it might make sense to use data and deep learning to solve the problem. Then I want to end by talking about how we can synthesize the training data through simulation rather than going out and collecting it directly with field measurements.

Keep in mind, as we walk through this very specific problem of classifying RF waveforms, that the general idea of getting something useful out of confusing and messy time series data is a common engineering problem; for example, this is something that comes up often in predictive maintenance. So, even if you’re not into radar or communications, most of what we’ll cover is applicable to many other engineering problems.

Alright, with that out of the way let’s get back to communication.

To communicate at a distance we need to be able to send information from a transmitter to a receiver. This is done by choosing a carrier frequency that is modulated in some way that encodes the information. For example, we may use a digital modulation process like Binary Phase Shift Keying (or BPSK). Simply put, if the transmitter is sending a 0, then the phase of the carrier signal is unaffected, and when a 1 is sent, the phase is shifted by 180 degrees. As long as both the receiver and the transmitter are expecting the same setup - that is, the same carrier frequency, modulation scheme, and waveform parameters - then the information can be decoded correctly at its destination.
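
As a quick illustration of that idea, here is a minimal BPSK sketch in MATLAB. The carrier frequency, sample rate, and bit stream are arbitrary values chosen for illustration, not anything from the example discussed later.

```matlab
% Minimal BPSK sketch: a 0 leaves the carrier phase alone,
% while a 1 shifts it by 180 degrees (pi radians).
fc   = 1e3;                 % carrier frequency, Hz (illustrative)
fs   = 20e3;                % sample rate, Hz
spb  = 200;                 % samples per bit
bits = [0 1 1 0 1];         % example bit stream

t     = (0:numel(bits)*spb-1)/fs;
phase = pi * repelem(bits, spb);    % 0 rad for a 0, pi rad for a 1
x     = cos(2*pi*fc*t + phase);     % BPSK-modulated carrier

plot(t, x); xlabel('Time (s)'); ylabel('Amplitude');
```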

However, it isn’t always the case that the receiver is aware of all of this ahead of time. For example, signal intelligence and surveillance systems might just be listening for signals - any signals that happen to be out there - and by detecting the modulation scheme, they could characterize and identify the type of transmitter that is sending it. Or in the case of 5G and other wireless schemes, if there is interference and noise swamping the signal, it is beneficial to understand and label the source of the interference so that the radio knows what frequency and modulation scheme to switch to in order to minimize this interference. Even if avoidance isn’t the end goal, it is still helpful to understand whether the noise is coming from a specific external signal so that, at the very least, the operators know that it isn’t a hardware problem they’re seeing.

The situation is similar for radar systems. With radar tracking and target identification, we send out radio waves to probe the environment and listen for their reflection. A common way to do this is to pulse the radar signal, alternating between transmitting and listening. And the pulse itself can have different waveforms - rectangular, linear frequency modulation, and Barker codes, as a few examples. But instead of you sending out the pulses, and therefore knowing what the reflected waveforms will look like, let’s say you wanted to detect radar signatures before you yourself are detected. Then it might be helpful to have a system that can search across, say, a 4 GHz bandwidth, find and classify any of the known radar pulse waveforms within the incoming signal, and then, again, determine the type of radar that is emitting the signal.

This is what is meant by modulation identification, or waveform identification. We have this need for a function or a set of functions that take in raw IQ signals from the antenna and label the waveform and its parameters.  So, you can start to see how this is shaping up to be a deep learning problem, where we could learn this classification model using labeled waveform data.

However, I think a good rule of thumb is to not start with deep learning or other machine learning techniques when traditional, rule-based methods will work. So, the question is, can we rely on traditional methods to classify waveforms? For example, why not use our knowledge and expertise in RF signals to write some code or build hardware that processes the incoming signal in some way that makes feature recognition more obvious, then pick out certain features within the processed data, and from that write code that determines the waveform? And it kind of feels like building a system that can distinguish between, say, these two waveforms is pretty straightforward - they look quite distinct from each other. Unfortunately, things can get pretty complicated in a hurry, which can make pattern recognition with traditional methods difficult or at the very least time consuming.

To show you an example, the left plot is the time domain signal for a linear frequency modulated pulse with a random sweep bandwidth, pulse width, pulse repetition frequency, and sweep direction. The right plot is the frequency domain representation of it. And the LFM waveform is modulating a carrier signal at a higher frequency, also randomized, somewhere around the 20 MHz region. So this is what one version of an ideal LFM waveform looks like.
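
To make “randomized” concrete, a single ideal LFM pulse could be synthesized with something like the sketch below; the parameter ranges are my own assumptions for illustration, not the ones used in the video.

```matlab
% One ideal LFM pulse with randomized parameters (ranges are illustrative).
fs  = 100e6;                          % sample rate, Hz
pw  = (1 + 4*rand)*1e-6;              % pulse width: 1 to 5 us
bw  = (5 + 15*rand)*1e6;              % sweep bandwidth: 5 to 20 MHz
dir = sign(rand - 0.5);               % sweep direction: up or down
fc  = (18 + 4*rand)*1e6;              % carrier near the 20 MHz region

t = (0:round(pw*fs)-1)/fs;
k = dir*bw/pw;                        % chirp rate, Hz/s
x = cos(2*pi*(fc*t + 0.5*k*t.^2));    % phase integrates fc + k*t

% Frequency-domain view, analogous to the right plot
plot(linspace(-fs/2, fs/2, numel(x)), abs(fftshift(fft(x))));
xlabel('Frequency (Hz)'); ylabel('|X(f)|');
```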

However, there are many things that can impair this signal. Weather and physical obstacles can affect different frequencies in different ways, and this can change the shape of the waveform as it propagates through them. There are hardware distortions from the radio electronics which cause white noise and other phase and frequency offsets, which I’m modeling here by adding some white Gaussian noise. There are also reflections off obstacles near the antenna, which can lead to the signal interacting with slightly out-of-phase versions of itself, as well as many other sources of noise and error which can affect the received signal.
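
Continuing that sketch, a couple of these impairments might be layered onto the ideal pulse like this; the SNR, delay, and attenuation values are assumptions, and the two-ray multipath model is deliberately crude.

```matlab
% White Gaussian noise at a chosen SNR (base MATLAB, no toolboxes).
snrdB  = 10;                                    % illustrative SNR, dB
sigPow = mean(x.^2);
xNoisy = x + sqrt(sigPow/10^(snrdB/10)) * randn(size(x));

% Crude two-ray multipath: the signal sums with a delayed,
% attenuated copy of itself.
d   = 25;                                       % delay in samples (assumed)
xRx = xNoisy + 0.4*[zeros(1,d), xNoisy(1:end-d)];
```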

So, this is one noisy LFM waveform, but here is another, and another. And this is really the crux of our problem. Our waveform classifier needs to be able to recognize all of these as linear frequency modulations. And more than that, it also needs to recognize other modulations that look very similar to LFM, take up the same frequency bandwidths, and are subjected to the same noise sources and errors.

Therefore, our solution space, that is the entire set of conditions and scenarios under which our classification algorithm needs to work is massive.  And designing a classification model that uses a rule-based approach that can handle all of these variations might not be easy.  But finding complex patterns in large, messy, and confusing data sets is exactly the type of problem where deep learning approaches can be beneficial … but to accomplish this you need access to training data.

As an oversimplification of the deep learning problem, you could set up a network architecture that will accomplish all of the data processing, feature extraction, and waveform identification tasks.  Then if you have access to enough labeled data, you could employ a deep learning algorithm that will tune this network to accurately classify waveforms in unlabeled data.  This is the goal. But what does it mean to have enough labeled data, and where does that data come from?

Well, to answer that I want to start by saying that no matter what method you choose to design your classification algorithm, you need data. Even if you are building a rule-based algorithm, you have to understand your system and the signals that it will see well enough to be able to write those rules.

The difference between that set of data and what a deep learning system needs is mostly a question of quantity.

When a person is designing an algorithm, they are bringing along years of experience and knowledge about the problem, which helps them quickly dismiss certain approaches or ideas that are obviously not the solution. For example, they understand what white noise is, and so they can more quickly recognize it in a frequency plot. However, unless you’re starting with a partially trained model, the classification network we’re designing with deep learning has no experience or existing knowledge to draw from. It doesn’t know what is obvious.

Therefore, it takes many more examples of the labeled data for the network to understand even basic concepts like rising edges in a signal, let alone combining those edges into more abstract concepts like waveforms and noise.

So, in this way we’re using more data to offset the experience and knowledge that humans would normally bring to the problem. Now, what I’m talking about here is a full end-to-end deep learning approach, where raw signals are fed into the network and a label comes out. This requires the most data to train since we aren’t supplementing the network with any human knowledge. However, this isn’t always the case. For example, a person could use their knowledge to preprocess the data by, say, filtering it first, or by transforming it in a way that makes some distinguishing features more obvious, or by going so far as to extract the obvious features themselves before a machine learning algorithm determines the classification. In this way, we are using human knowledge to shrink the remaining classification problem - the part that needs to be learned - which therefore, in general, requires less training data.
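
As one concrete example of that kind of preprocessing, each received frame could be transformed into a spectrogram, so an LFM sweep shows up as a diagonal ridge the network can latch onto. This is a sketch: the window and overlap values are assumptions, and spectrogram requires Signal Processing Toolbox.

```matlab
% Time-frequency transform as a human-designed preprocessing step,
% applied to the impaired pulse xRx from the earlier sketches.
[s, f, tt] = spectrogram(xRx, 64, 48, 128, fs);
imagesc(tt, f, 20*log10(abs(s) + eps));    % log-magnitude image
axis xy; xlabel('Time (s)'); ylabel('Frequency (Hz)');
```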

Now, regardless of where your problem fits on this curve, the bottom line is that you need some amount of good labeled data that covers the entire solution space that the classification algorithm needs to handle. In our case, the data needs to span all of the modulation schemes, at many different carrier frequencies, noise conditions, bandwidths and so on.

So, the next question is, how do you acquire this labeled data? One way, if you’re lucky, is to just use existing databases. If you’re working with images, for example, you can start from an image database like ImageNet and then add to it with your own labeled data to fill in any missing gaps. However, at the moment I think most engineering problems are unique enough that augmenting or extending an existing database is just about as large a problem as creating your own database from scratch.

So, that leaves the other option, which is to collect your own data. This can be done by placing a transmitter and receiver out in the field and sending various waveforms with different parameters while simultaneously adjusting the environment - that’s things like noise parameters and other RF sources. But you can imagine that this could be fairly difficult and time consuming to do, especially if you’re trying to control for weather or different propagation distances.

In some cases, like with autonomous vehicles, the field is still the best way to collect data because there are over a billion cars on the road. The approach is to place sensors on cars that are driven by people in a vast array of scenarios and conditions. Then, over time - millions of driven miles and countless hours of labeling - a database is built up.

This could also be the approach we use for our waveform database.  After all, there are billions of receivers in the world and it would be nice if they recorded their received signal, labeled it with the modulation scheme they were designed for, and then saved it off in a global database.

However, in this particular case, there’s a faster and cheaper way to access labeled data; and that is by generating it through simulation.

As long as you understand the scope of the solution space that you want to solve for, then you can build a simulation that takes all of that into account.

For example, we can list out the modulation schemes we want to classify and their specific parameters, the types of impairments we want to be able to handle, the variations in the hardware, and anything else we deem important.   And using all of that, we can build a simulation that will generate realistic received signals across the entire solution space. As long as you trust your simulation to represent the important features and characteristics of the real signals, then generating millions of test cases is relatively quick and easy.

And a benefit of synthesized data is that the label practically comes for free since you need the label to generate the data in the first place.
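
A minimal sketch of that generation loop is below. Here, genWaveform and addImpairments are hypothetical helper functions standing in for the waveform and channel models discussed above; the point is that the label is chosen before the signal is synthesized, so the data arrives pre-labeled.

```matlab
% Synthetic, pre-labeled training data (sketch; helpers are hypothetical).
modTypes  = ["LFM" "Rect" "Barker" "BPSK"];   % subset for illustration
numFrames = 1000;                             % frames per modulation type

N      = numel(modTypes)*numFrames;
frames = cell(N, 1);
labels = strings(N, 1);
idx    = 1;
for m = 1:numel(modTypes)
    for n = 1:numFrames
        sig = genWaveform(modTypes(m));   % random parameters drawn inside
        sig = addImpairments(sig);        % noise, fading, offsets, etc.
        frames{idx} = sig;
        labels(idx) = modTypes(m);        % label known by construction
        idx = idx + 1;
    end
end
```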

Now it’s important to understand when simulating data makes sense and when it doesn’t.  Like, if you wanted to build a network that could classify words in audio signals, then simulating people saying words is probably much harder than just collecting a lot of real audio.  But for this particular problem, where the physics are well understood, it makes sense to build a model and generate the data.

And this is exactly what is being demonstrated in this MATLAB example. I encourage you to check this out if you really want to understand what’s going on, but right now I just want to quickly highlight a few things.

This first section is using a pre-trained network to recognize 11 different modulation types.  This is an example of "hey, if you can find a pre-trained network that already does what you want, then you’re done".  But if you can’t, you need to train one yourself.  So, if we scroll past that, the next section is where waveform data is generated which can be used to train a new model.  You can see here that it’s generating 10,000 frames for each modulation type, so we’re going to get 110,000 signals in just a few minutes.

For each signal, it’s adding random amounts of white noise, multipath fading, and hardware offsets. So the idea is that we’re covering the entire expected solution space. A little further down, it plots a few random signals so you can see what they look like in the time domain and as a spectrogram.

Now, we can use this simulated data to train the network, which took about an hour for this particular case and ended up correctly labeling about 95% of the simulated validation data. Which is nice, except who really cares how well this network can label simulated data? I mean, it was trained on simulated data, so of course it learned to do a good job at recognizing it. The real test is how well this network can label real RF data.
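
Before we get to that real-data test, here is roughly what the training step looks like in code, following the usual Deep Learning Toolbox pattern. The layer sizes, frame length, and options are my own assumptions for illustration - not the architecture from the linked example - and XTrain, YTrain, XVal, and YVal are placeholders for the generated data.

```matlab
% Sketch of training a CNN on I/Q frames (requires Deep Learning Toolbox).
% XTrain: 1-by-1024-by-2-by-N array of I/Q frames; YTrain: categorical labels.
layers = [
    imageInputLayer([1 1024 2])
    convolution2dLayer([1 8], 16, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer([1 2], 'Stride', [1 2])
    fullyConnectedLayer(11)               % 11 modulation classes
    softmaxLayer
    classificationLayer];

options = trainingOptions('sgdm', ...
    'MaxEpochs', 12, ...
    'ValidationData', {XVal, YVal}, ...
    'Plots', 'training-progress');

net = trainNetwork(XTrain, YTrain, layers, options);
```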

In the last section of the example, that real-data test is exactly what is done. A software-defined radio is transmitting various waveforms, and a receiver is recording the signal and using the trained network to label the waveform. And according to the confusion matrix, it does a really good job - about 99% overall accuracy. Which seems amazing, but we have to consider that in this test, the two radios were stationary and placed 2 feet from each other, which would limit the amount of propagation noise and multipath fading, and maybe other things. So, these were relatively clean signals.

The real test would be to validate this network on hardware that is put in a more realistic scenario. The bottom line, though, is that it is possible, and sometimes preferred, to train a network using simulated data. And it will work in real situations as long as the simulation generates signals that closely match the conditions of the real system, or that are more strenuous than the real conditions so that they bound them.

Alright, that’s where I want to leave this video for now. Hopefully, you have an idea of how you would go about collecting labeled data for the particular engineering problem you’re trying to solve - whether that means pulling from an existing database, collecting real data yourself, or simulating it.