Cleaning and Analyzing Real-World Sensor Data with MATLAB
Sensor data provides valuable insight into how products and systems perform in the real world. However, this data often contains artifacts that make it challenging to analyze, such as missing data, outliers, noise, and non-uniform sampling rates. These challenges can be compounded by large amounts of data from multiple data loggers, tests, systems, etc.
Watch this webinar to learn about new MATLAB features for working with sensor data, including:
- MATLAB data types for working with time series sensor data
- Working with large collections of telemetry data (big data)
- Detecting and handling outliers, using preprocessing functions and Live Tasks
- Smoothing and filtering noisy data, including spectral analysis with the Signal Analyzer app
- Documenting analyses with the MATLAB Live Editor
Recorded: 25 Jun 2020
Hi, and welcome to this webinar on cleaning and analyzing real-world sensor data with MATLAB. My name is Sudheer Nuggehali, and I'm an application engineer with The MathWorks. Today, we'll be looking at a variety of techniques for analyzing and managing telemetry data.
So, a brief overview of the agenda. To start, we will take some real-world data and discuss methods for cleaning it, such as managing missing data points, identifying outliers, and working with noisy data. Now, our original data set does fit into memory. However, we often have to deal with large collections of files that do not. So, after we've completed our analysis on the in-memory data, I will show you how we can scale our algorithm up to handle those larger collections.
OK. So when we're talking about data analysis, I like to think about this workflow in three separate phases. The first phase is data access. At a high level, this is: what is our data, and where does our data live? Does our data live in files such as CSV, text, or Excel? Is it coming in from software packages or databases? Or are we streaming it in from hardware?
Now, once we have our data, the second stage of this workflow is really where we want to spend the majority of our time: exploration and discovery. This is our data analysis and modeling, our algorithm development, our application development. Once we've finished this and we have results, the final part is sharing, right? Generating reports and documents that we may want to send to colleagues, or even deploying our results into a production-type environment, for example.
Today, we're really going to focus on this first part: bringing data into MATLAB and then getting it ready, cleaning it, so we can do further analysis, modeling, and algorithm development.
But what are some of the real challenges when working with this real telemetry data? The first challenge that we often see is just data access. So being able to go ahead and access data in different formats or managing heterogeneous data.
Now, once we have this data in, oftentimes the quality is not what we want. We often have to combine data with different sampling rates or identify missing data points. And if we're dealing with real-world sensor data, we can see things like sensor dropouts. We'll see today how we can identify and tackle this problem.
And finally, being able to identify bad data. So identifying outliers and possibly smoothing out some of that noise that we often see. Now, once we've done this and we cleaned our data, the last step is really the analysis, going in and now being able to analyze and generate some results from the data that we have.
And so today, we're going to be looking and working through a lot of these challenges here through the lens of a demo. Now the primary source of data is a recreation of a NASA white paper where we will be analyzing the data from a flight test, particularly a windup turn test. Now this is just a test to determine the longitudinal stability of an airplane, but the goal today is to bring that data into MATLAB and prepare it for further statistical analysis.
And with that, let's move over to MATLAB. The first step in our workflow is data access, and the data that we have today lives in a Microsoft Excel worksheet. So let's open it up. We see that the data contains many of the variables that we want, such as airspeed, angle of attack, and elevator load factor. We also see that it's uniformly sampled, and one common analysis we may want to do with this data set to assess stability is to look at the relationship between load factor and stick force.
Now, with this specific data set-- this is what I like to call a Candyland data set, or a perfect data set. So here you'll notice that this is uniformly sampled. We don't have outliers. It's not noisy. And so, with this type of data, it's really easy to come up with an analysis. For example, calculate the stability here. However, for the majority of the presentation today, we're actually not going to be using this perfect data set. We're working with the slightly more messy data set.
So before we move on to analyzing and cleaning a messier data set, one common question that comes up is: how do I go about reading different data sets from different sources, as well as aligning data sets that may be sampled at different rates?
And so, in the presentation today, I'm not going to focus on reading or aligning these data sets. However, our documentation is a great resource for more information on how to do this, and we also have another webinar titled "Data Science with MATLAB" that takes a deeper dive into how to read in and align different data sets.
So let's go ahead and look at an example of cleaning a messier data set. So what I've done here is I'm opening up a live script here in MATLAB. So the Live Editor is our rich text editor that allows us to edit and execute our code just like you would using the regular MATLAB editor. However, we can do some neat things here like add titles, add text. And as we'll see, as we execute our code here, the outputs will be directly embedded in the editor itself.
So first up here, I'm going to load in my corrupted data set, and, now that I have my data set in, I may want to find out is my data good or not. And so, one way to do that, is to use a visualization here like a stackedplot.
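In code, those first steps might look something like this (a minimal sketch; the file and variable names are assumptions):

```matlab
% Load the corrupted flight-test data (file and variable names are hypothetical)
load flightTestCorrupted.mat      % assumed to provide a timetable called flightData
head(flightData)                  % peek at the first few rows
stackedplot(flightData)           % shared time axis, one subplot per variable
```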
And so, what this function allows me to do is to quickly see how the variables in my data set vary with time. So if we look at this visualization, we have here on our x-axis, a common x-axis of time, and we see all of our variables.
Now, for our stick force, load factor, all the way to angle of attack, this is what we expect to see in terms of our signals. But for airspeed, we see these dips, and that could be a flag to us to say, hey, maybe the airspeed is what's corrupted here. To confirm this, we can dig a little bit deeper and focus in on just the airspeed. And what we quickly see is that we do have a corrupted airspeed variable, where the missing values are represented as zeros.
And so having these numeric values represented as zeros, or, in general, having missing data represented as numeric values, can often be confusing. Because we see here that our y-axis, our actual plot is skewed. We don't actually see what the airspeed looks like because we're dipping down to zero.
And so one thing we may want to do here is to standardize all these missing values and maybe not have it be a numeric value. And so to do that, we could traditionally go ahead and logically index, find these numeric values and change them. However, we do have some convenient functions provided for you to do this.
And so one of these functions is called standardizeMissing. What this function does is, you can specify all the different values that represent missing data, and it'll standardize and convert all those values to NaNs, or Not-a-Numbers. And what's really powerful about standardizing these to NaN values is that, in MATLAB, NaN values are plotted as blanks. So here, we can clearly see the missing data points without skewing the y-axis.
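A minimal sketch of that call, assuming the timetable is named flightData and the corrupted variable is Airspeed:

```matlab
% Convert the sentinel zeros in the airspeed channel to NaN
flightData = standardizeMissing(flightData, 0, 'DataVariables', 'Airspeed');
plot(flightData.Time, flightData.Airspeed)   % NaNs plot as gaps, so the y-axis is no longer skewed
```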
And if we want to see the two plots together, we can check, within the Live Editor, that those missing values up top correspond to our blanks. These two smaller missing values here correspond to the smaller blank over here, which we can zoom in on to check.
So now that we've gone ahead and identified these missing values, there are a few different things we can do. One option for dealing with these missing values is to just completely remove them. And if you wanted to do that in MATLAB, it's very easy. We can call the function rmmissing on our data set, and it will go ahead and remove all those missing values for us. And so, now, if we look at our stacked plot, we see that our airspeed doesn't contain those dips.
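The removal option is a one-liner (same assumed names as above):

```matlab
flightDataTrimmed = rmmissing(flightData);   % drops every row that contains a missing value
stackedplot(flightDataTrimmed)
```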
But you often have to be careful when just removing these missing values, because we could be throwing out some valuable information. We may not want to discard all of that data when removing the missing values. So, again, you have to be cautious about doing this right away.
However, there are some alternatives. Instead of removing the missing values, we could fill them. There are a lot of different filling options or methods as well, and, again, we should be careful about which method we choose.
And so, as an example, if we want to uniformly fill our missing data with some sort of interpolation, or possibly a regression, we can use the function fillmissing. It will automatically find the missing data points and fill them in with our chosen method; in this case, a linear interpolation.
So if I scroll down, we'll see that this interpolation method did not do a great job, because we are probably expecting the airspeed to have this second-order, parabolic-type shape down here instead of this linear behavior. And so, again, if you have some knowledge about your data set, or how it evolves with time, you may have a little more intuition about which method to use, because linear may not always work for every data set.
So one option here is, I can go ahead and delete this, and we'll see there are a lot of other methods we can choose in fillmissing. I may want to use a spline, because we have that second-order, parabolic behavior, and we see it does a much better job here with our interpolation. So, again, be careful about what you choose. Oftentimes, you may not be able to just pick any arbitrary method.
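A sketch of both fills as they might appear in the script (assumed names); only the method argument changes:

```matlab
filledLinear = fillmissing(flightData, 'linear', 'DataVariables', 'Airspeed');
filledSpline = fillmissing(flightData, 'spline', 'DataVariables', 'Airspeed');
plot(filledSpline.Time, filledSpline.Airspeed)   % spline follows the parabolic shape
```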
Now, this is just a simple method where we're uniformly filling our data. We may want to use a slightly more advanced maneuver, where we specify different fill behaviors for different sections of missing data. We can think of this as that sensor-dropout type of behavior. So, for example, let's say our sensor dropped out for a few minutes in one day's worth of data, and maybe that same sensor dropped out for multiple hours or even multiple days somewhere else.
So, again, we may not want to treat data that dropped out for a few minutes the same as data that dropped out for a few hours. In MATLAB, we can do some of these advanced maneuvers as well. I've created a function, fillsections, which again uses fillmissing under the hood. But here, we're saying: if there are fewer than five missing data points, fill them in with a linear interpolation, and if there are longer sections of missing data, use a spline.
And so if we go ahead and we visualize this, we see a nice interpolation behavior here. Now, in this specific data set, we could have just used a spline like we did before uniformly, and it would do a good job. But again, depending on your data set, that might not always be the case, and if you have to specify different interpolation behaviors for different dropped sections, you can do that as well.
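The webinar doesn't show the body of that fillsections helper, but here is a minimal sketch of one way to implement the behavior described above for a vector, treating runs of fewer than five missing points as "short":

```matlab
function filled = fillSections(x, maxShortGap)
% Fill short gaps of missing data linearly and longer gaps with a spline.
% maxShortGap: the longest run of NaNs still treated as a "short" gap.
    mask = ismissing(x(:));
    d = diff([0; mask; 0]);
    runStarts = find(d == 1);        % first index of each missing run
    runEnds   = find(d == -1) - 1;   % last index of each missing run
    runLens   = runEnds - runStarts + 1;

    linearFill = fillmissing(x, 'linear');
    filled     = fillmissing(x, 'spline');   % default: spline everywhere

    for k = find(runLens(:)' <= maxShortGap) % overwrite the short gaps linearly
        filled(runStarts(k):runEnds(k)) = linearFill(runStarts(k):runEnds(k));
    end
end
```

With the threshold from the webinar, the call would be something like fillSections(flightData.Airspeed, 4).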
So we've talked a lot about how to identify missing data and fill missing data, and one thing that might be a little bit harder to do is actually identify outliers or even fill these outliers. So we're going to talk about ways where we can automate this process of identifying and filling outliers.
So, to start off with, let me go ahead and actually load in a clean data set, add some normally distributed random noise, and go ahead and bias just a few points. And so if I go ahead and visualize our data set now, we'll notice that it's a lot messier than it was before. We have this noise, and we've biased a few of our points here. And so, the goal is, how do we go about identifying outliers in this noisier data set?
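That setup step might look something like the following (a hedged recreation; the noise level, the number of biased points, and the bias magnitude are all assumptions):

```matlab
rng(0)                                                    % make the corruption repeatable
airspeed = cleanAirspeed + 2*randn(size(cleanAirspeed));  % normally distributed noise
biasIdx  = randperm(numel(airspeed), 5);                  % pick a few points at random
airspeed(biasIdx) = airspeed(biasIdx) + 25;               % bias them off the trend
```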
So there are a bunch of ways to go about detecting outliers in MATLAB. But, in R2019b, we released a new set of tools called Live Tasks in MATLAB. And this is a nice, easy, interactive way to develop a simple outlier detection algorithm, but also to do some common pre-processing type tasks.
So, if we go up to our Live Editor and go to Tasks, you'll notice there are a bunch of different options here for pre-processing: Clean Outlier Data, Smooth Data, and so on. We'll look at those in a little bit. We could have also used the Clean Missing Data task earlier, instead of programmatically playing around with the fillmissing function. Here, we'll look at how we can use a Live Task for our simple outlier detection.
You'll also notice that we have Live Tasks for other domains, for example, control system design and analysis, predictive maintenance, and system identification. So, if you want to tune a PID controller, and you don't want to open up a separate app or work purely from code and documentation, these Live Tasks are directly embedded in the editor itself for you to go ahead and change your PID gains, for example.
So one of the real powers of using these Live Tasks is avoiding context switching. You don't have to switch over to documentation or open up a separate app. You can do this all within your algorithm development workflow, inside the editor.
So what I've done here is, I've gone ahead and inserted the Clean Outlier Data Live Task. Within this Live Task, I've input my data and selected airspeed. And you'll notice that the visualization pops up here as I'm executing this Live Task. Next, I'm going to specify whether I want to fill or remove the outliers, and then choose our cleaning method.
Again, there are a lot of different options for us. In this case, I want to use a spline. But if I switch to linear, you'll notice that the Live Task automatically updates, and the visualization updates as well. This is what's so powerful: that iterative process of determining which method to use or which parameters to tune.
So now that I've defined my method, I can choose the detection method. What if I want to use a moving median or a moving mean? Again, the visualization changes, and you'll see the number of outliers may have changed as well. And I can go ahead and specify different threshold factors.
And this, at least for me, is what was often tedious: figuring out which parameters within each specific function I have to fine-tune. It took a lot of iteration, and doing it programmatically was tedious. With these Live Tasks, it's much, much easier. I don't have to worry about how to plot the visualization; it does it for me.
So, in this case, I know that a spline fill and a moving median with a threshold factor of 3 and a moving window of 20 actually does a pretty good job. It identified a few outliers here.
But you may be saying, OK, I don't want to have to do this every single time for different data sets. I want to automate this. I want to figure out what the code is doing under the hood. We can do that. If I switch from the controls to the code view, we can look at all the MATLAB code it took to create the visualizations as well as identify and fill those outliers. And the workhorse doing this under the hood is the filloutliers function.
So now, from here, you can go to the documentation and figure out, oh, what does this function do. Or you can actually convert this task to editable code, which now you can go ahead and use like normal MATLAB code. I can go ahead and start playing around and editing this, just like I would with my normal code.
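Given the settings chosen above, the heart of the generated code is likely a filloutliers call along these lines (a sketch, with the variable name assumed):

```matlab
% Spline fill, moving-median detection, window of 20, threshold factor of 3
[airspeedClean, outlierIdx] = filloutliers(airspeed, 'spline', ...
    'movmedian', 20, 'ThresholdFactor', 3);
nnz(outlierIdx)   % how many points were flagged as outliers
```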
In this case, we've gone ahead and developed a simple outlier detection algorithm, but we could use or do more advanced methods in MATLAB as well. And so one of these advanced methods is what we call our non-parametric residual bootstrapping. And so this is a technique where we assume a normal distribution for residuals and simply look at the probability of a given residual to determine if it's an outlier. So, by bootstrapping this data set, we can generate confidence intervals and use a three sigma-type approach to identify the outliers.
Now, today, I'm not going to go into a whole lot of detail about this specific approach, but, as we go down here, we see that we're using functions from our Statistics and Machine Learning Toolbox, like bootstrp, to generate the bootstrapped data. Then we're generating our confidence intervals and, finally, detecting outliers using a three-sigma-like approach.
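The webinar doesn't walk through that code line by line, so here is only a rough sketch of the idea, assuming a moving-median fit as the baseline:

```matlab
smoothed  = smoothdata(airspeed, 'movmedian', 20);  % baseline fit (an assumption)
residuals = airspeed - smoothed;

nBoot  = 1000;                                % number of bootstrap resamples
bootSD = bootstrp(nBoot, @std, residuals);    % bootstrap distribution of the residual std
sigma  = mean(bootSD);

isOutlier = abs(residuals) > 3*sigma;         % three-sigma-style detection
```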
And so if I go ahead and plot this, we see that this method actually identified a few more outliers than our simple algorithm, our Live Task, did. This bootstrap approach is often beneficial when you don't have a lot of data; it lets you use an almost Monte Carlo-type technique, and it can be a little more robust.
So we've talked about how to fill missing data, identify missing data, and work with outliers. But one thing that may actually be even harder is dealing with noisy data. We'll look at a few different techniques: a time-domain approach, and then some frequency-domain approaches to smoothing, or otherwise dealing with, this noisy data.
So the first thing I'm going to do is actually load in a noisy data set, and what I've done is use one of those Live Tasks that I mentioned before. There was a Smooth Data Live Task there, so I've gone ahead and I've implemented that. And we see that, using this Live Task-- I'm using just a moving median approach here-- it does a pretty good job of smoothing my data.
Now, again, if the moving median is not the right time-domain approach, within the Live Task I can quickly change it. Maybe I want to use a local quadratic regression. I can do that, and we see that it maybe does a slightly better job. Or maybe it doesn't, and I can change it right back. Again, that iterative process is there.
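Under the hood, the Smooth Data task emits calls to smoothdata; a sketch of the two methods tried here (the window length is an assumption):

```matlab
smoothedA = smoothdata(noisyAirspeed, 'movmedian', 15);  % moving-median smoothing
smoothedB = smoothdata(noisyAirspeed, 'loess');          % local quadratic regression
```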
But, again, this is a time domain approach. We're actually going to be looking at a frequency domain approach as well using some signal processing techniques. But instead of doing this programmatically, we're going to be using the Signal Analyzer app.
So, to open up the Signal Analyzer app, I'm going to go over to the Apps tab here. You'll notice there are a lot of different interactive apps for your own domain-specific applications: for example, for signal processing and communications, there are a whole bunch of different apps you can use. Today, we're just going to focus on the Signal Analyzer.
So, just like most apps in MATLAB, what I always like to say is: we move from left to right. The first step here is to bring our data in, and we can quickly see what our data looks like. The x-axis here represents samples, but I really want to work with time values. So I'm going to go ahead and input a time vector from my workspace, and we'll notice that we now have time on our x-axis.
So, if we keep moving from left to right, we can go ahead and look at that pre-processing section. There are a ton of different pre-processing options we can use. There are some of those time-domain techniques here, like smoothing data, resampling, and detrending. However, we're going to focus on the frequency-domain techniques, using a lowpass or a highpass filter.
Now, oftentimes, before we just apply a filter, we may want to figure out what those cutoff frequencies are. To do that, we would want to look at the power spectrum, or some other frequency representation of our time-domain data. And so we can go ahead and quickly generate a power spectrum, a persistence spectrum, or even some time-frequency plots as well, like a spectrogram or a scalogram.
So if I go ahead and look at my power spectrum, we can see that a lot of the high-frequency noise here is happening from about, let's say, 1 hertz on. Now, this data set was somewhat contrived, since we added in the noise ourselves, so there isn't a lot of frequency-domain structure to look at. But we can get an idea, just from the power spectrum, that our cutoff frequency should be around maybe 0.6 or 0.7 hertz.
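Outside the app, a comparable spectrum can be generated programmatically; a minimal sketch, assuming the sample rate is stored in fs:

```matlab
pspectrum(noisyAirspeed, fs)                  % power spectrum, frequency axis in Hz
pspectrum(noisyAirspeed, fs, 'spectrogram')   % or a time-frequency view
```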
However, we may also want to go and look at just subsections of our signal. So we can also do that in the app using the Panner. So here, I can say, look, I only want to look at a specific region of interest. And using the Panner, I can go and look at my signal at specific points or regions of interest. So again, a really powerful technique for doing some quick analysis on your signal data.
So if we go back to our pre-processing, what I want to do is apply a lowpass filter. Here I can enter my passband frequency: we came up with about 0.7 hertz. We apply the lowpass filter, and we can see our results. I'm going to collapse some of these spectra here.
So you see some kind of interesting results here. We see this ringing behavior at the beginning of our signal and at the end, and it took me a little time to figure out what was going on. But if we zoom in on the middle portion, we see that our data was actually filtered pretty well.
Now, the ringing behavior has to do with the filter the app is using under the hood: it's generating an FIR filter. Sometimes, when you use these types of filters, we see this kind of ringing behavior.
Now, another option, instead of just using the built-in lowpass, is to use your own custom filters, which you can bring into the app. So, using my custom filter, which is an IIR filter, a Butterworth, I can go ahead and filter the signal. And you see that it does a pretty good job, and we don't see that ringing effect.
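For reference, a rough programmatic equivalent of the two approaches, again assuming a sample rate fs; the built-in lowpass function designs an FIR filter by default, while the custom route here is a Butterworth IIR design (the filter order is an assumption):

```matlab
firFiltered = lowpass(noisyAirspeed, 0.7, fs);  % FIR lowpass with a 0.7 Hz passband

[b, a] = butter(4, 0.7/(fs/2));                 % 4th-order Butterworth IIR
iirFiltered = filtfilt(b, a, noisyAirspeed);    % zero-phase filtering reduces edge ringing
```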
So, again, a nice way to go about quickly eliminating some of that noise and exploring and understanding the data that you have in both the time domain and the frequency domain.
Now, one of the questions you may have is, why do I need to keep coming back to this app every single time with new data that's coming in? We don't want to have to constantly come back and do this. We want a way to automate this process.
And so, just like almost every app in MATLAB, we have this generate-code function. We can generate the MATLAB code that does the pre-processing, and also the code that generates the power spectrum. If we look at this portion of code, it's everything for setting up our signal, and then the function that actually generates the spectrum. So now I can use this script to automate the analysis for new data sets coming in.
So that's exactly what I've done in my analysis here, where I have taken that generated code, pasted it into my analysis, and now, with the new data sets, I can automatically come up with that same power spectrum, as well as use that pre-processing function to filter and eliminate that noise. So just different approaches we can take for both the time domain and frequency domain for eliminating the noise in our signals.
All right. So let me go ahead and flip back here to our slides. We talked a lot during this presentation today about cleaning in-memory data sets, assuming that our data sets will fit into memory.
However, oftentimes, like I mentioned in the beginning, we have large collections of files and a lot of telemetry or sensor data that may not fit into memory at once. So we may need techniques to chunk through data sets, doing our processing on one file at a time, or on just a subset of the data. And so now, I'll show you how our algorithm would change to scale up to larger data sets.
I've come up with two different files here: one for cleaning the in-memory data set, and one changed so we can scale up to our big data. I'm going to do a quick code comparison to see what the actual differences are in our algorithm when we're looking to scale it up.
So on the left hand side is going to be our script here for the big data version, and the script on the right hand side here is going to be our in-memory script that we were looking at before. And so, the first thing that we'll see that's different is that instead of the load function, we're going to be using a datastore.
And so, a datastore in MATLAB is, at a high level, just a reference to a collection of files. So it basically points to where the files live and doesn't bring the data into memory right away. It stores information about the file locations and metadata associated with these files.
Now, the next thing we're going to do is create a tall array, because, even though we have this reference to the collection of files, we still need a way to work with the full out-of-memory data set. To do that, we're going to use these tall arrays.
And so this is a relatively new data type in MATLAB that is designed for data with too many rows to fit into memory, hence the tall structure. It lets us use the data inside the datastore like a normal MATLAB variable that just happens to have too many rows to fit into memory.
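A sketch of the scaled-up data access (the file location is an assumption):

```matlab
ds = datastore('telemetry/*.csv');   % a reference to the collection of files
tt = tall(ds);                       % a tall table backed by the datastore
tt.Time = seconds(tt.Time);          % same functions as the in-memory code
tt = table2timetable(tt);            % now a tall timetable
```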
And so, from here, we're going to do our analysis just like we would do on normal in-memory variables in MATLAB. So, as we continue scrolling down, you'll notice that we're using those similar functions, seconds, table2timetable, on our full out of memory data set.
Now, one thing that has changed here as well is that I commented out the stackedplot command. So the stackedplot visualization is actually not supported for tall arrays. So not all visualizations are supported for these tall data types. However, we do have some visualizations that are supported, like basic line plots, scatter plots, histograms, and we're continuously adding support for visualizations for these out-of-memory data types. And you can find a list of these in our documentation.
And, as we scroll down, again, you'll see that we use the plot command with our tall data types. Even the common pre-processing functions, like standardizeMissing and rmmissing, are supported; not a whole lot has changed in my code. And when we get to the missing data, fillmissing is also supported for these full out-of-memory tall arrays.
Now, the other thing we're going to see that's different is a call to this function, gather. What gather does is, once we want an answer, once we want to pull a result into memory, it performs an optimization under the hood that allows MATLAB to run through our full data set in the fewest number of passes. It makes sure that MATLAB doesn't run through the full data set every single time we execute a command.
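A sketch of that deferred-evaluation pattern:

```matlab
meanAirspeed = mean(tt.Airspeed, 'omitnan');  % no data is read yet; the operation is deferred
meanAirspeed = gather(meanAirspeed);          % one optimized pass over all the files
```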
And if you want to go a little deeper into datastores and tall arrays, we do have another webinar, titled "Tackling Big Data with MATLAB," that covers exactly how to use them in much more detail.
So let me flip back to the slides, and let's get to the summary. If we go back to our data analysis workflow, as I said, we really focused on this initial section: getting data into MATLAB and getting it ready for that further statistical analysis, modeling, and algorithm development.
MATLAB has a wide variety of tools to help you get your raw data ready for analysis. High-level functions such as fillmissing and filloutliers, along with Live Tasks, help you quickly apply well-known pre-processing techniques and see how they impact your data.
But sometimes, you might need a little bit more customization, and for that you can use more advanced techniques like statistical tools and even signal processing tools-- using the Signal Analyzer App, for example. So the great thing is, you can try all these in the same environment and figure out what works best for your particular data set.
Now, once we have done our analysis on our in-memory data set, we can easily scale our algorithm up using things like datastores and tall arrays to process and analyze our full out-of-memory data set on the desktop, on traditional clusters, or even on big data systems like Spark and Hadoop.
So working with messy data can be a tedious and oftentimes unenjoyable task, and we hope that, with these tools, you can get the data cleanup out of the way quickly so you can focus on your analysis and your results.
And so with that, that's all for me. Thanks for tuning in.