Speeding Up Data Preprocessing for Machine Learning
Chapter 3
MATLAB Apps for Preprocessing
The previous chapter discussed using MATLAB commands and the Live Editor to perform common preprocessing techniques. MATLAB also has apps specifically designed to make complicated tasks easier. This chapter introduces two apps particularly useful for data exploration and preprocessing when working with time-series data or using parametric equations for data cleaning.
Preprocessing Data with the Signal Analyzer App
To successfully train a machine learning model, you need to preprocess your time-series data and use the features that matter, which effectively comes down to reducing the signal data dimensionality and variability.
When creating a preprocessing script for time-series data, it is good practice to start with a few signals to represent your wider dataset.
The Signal Analyzer app makes it easy to then:
- Explore your data
- Determine which preprocessing techniques and methods are appropriate
- Create or update a script to run against the full dataset
Explore Your Time-Series Data
Understanding your data starts with visualization. The Signal Analyzer app is a useful tool to look at multiple signals at once and recognize patterns, as well as explore signals in the time, frequency, and time-frequency domains simultaneously.
Time, Frequency, and Time-Frequency views
The app is accessed through the Apps tab in the MATLAB toolstrip or by typing signalAnalyzer in the MATLAB command window.
Note: Missing data must be cleaned before it can be used in the Signal Analyzer app. See the following section for handling missing data.
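As a minimal sketch of cleaning missing data before loading a signal into the app, the base MATLAB function fillmissing can interpolate across gaps. The signal x and its values are hypothetical, and linear interpolation is only one possible choice; the right fill method depends on your data.

```matlab
% Hypothetical signal with two missing samples (NaN), for illustration only
x = [1.0 1.2 NaN 1.8 2.1 NaN 2.9]';

% Count the gaps, then fill them by linear interpolation between neighbors
numMissing = sum(ismissing(x));
xClean = fillmissing(x,'linear');

% Check that no missing values remain before opening Signal Analyzer
stillMissing = any(ismissing(xClean));
```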
Here you have a single signal that’s been duplicated three times. Each copy has been preprocessed with either a highpass, bandpass, or lowpass filter. You can see what each processed signal looks like in the time and frequency domains, alongside a spectrogram of the original signal.
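A comparable set of filtered copies could be produced at the command line with the Signal Processing Toolbox functions lowpass, bandpass, and highpass. The sketch below is illustrative only: the test signal, sample rate, and cutoff frequencies are assumptions and are not the values used in the example above.

```matlab
% Synthetic test signal (an assumption): 5 Hz and 50 Hz tones plus noise
fs = 1000;                       % sample rate in Hz
t = (0:1/fs:1-1/fs)';            % one second of samples
x = sin(2*pi*5*t) + 0.5*sin(2*pi*50*t) + 0.1*randn(size(t));

% Three filtered copies of the same signal; cutoffs are arbitrary choices
xLow  = lowpass(x,20,fs);        % passband up to 20 Hz
xBand = bandpass(x,[40 60],fs);  % passband from 40 Hz to 60 Hz
xHigh = highpass(x,100,fs);      % passband from 100 Hz upward
```

Keeping the original x alongside the filtered copies mirrors the app workflow of duplicating a signal before preprocessing it.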
As noted above, missing data must be cleaned before you use the Signal Analyzer app. Additionally, the app expects the row times of a timetable to be durations, so row times stored in datetime format must be converted to durations first.
To analyze timetables with time values stored as a datetime array, convert the array to a relative duration array by subtracting the first element from all the others. The following example creates a timetable with datetime row times and converts it to a timetable readable by Signal Analyzer.
tt = timetable(datetime(2016,11,9,2,30,1:10)',randn(10,1));
dt = tt.Time - tt.Time(1);
tn = timetable(dt,tt.Var1);
See a full example:
Determine Which Preprocessing Techniques and Methods Are Appropriate
Exploring your data will help uncover what preprocessing techniques you need to apply. A good habit is to make a duplicate of the signal so you can see how preprocessing tasks have affected your signal.
Here are two preprocessing tasks:
- Resampling to create uniform data, generally performed early in data cleaning
- Smoothing to find patterns in your data, performed after resampling
Time-series data is not always sampled uniformly. Sensors might record signals at different intervals, or they might only record when triggered by an event. In those cases, sample rates could fluctuate wildly. In other cases, you may deal with multiple regularly sampled signals, each using a different sample rate. Regardless of the specific situation, resampling your data into uniformly sampled signals will make downstream tasks a lot easier.
With uniformly sampled signals, you can apply many signal processing algorithms to:
- Extract features to use the data with a wide variety of machine learning algorithms
- Transform signal data into time-frequency maps such as spectrograms and wavelet transforms for visualization or use with models that work on 2D data representations
- Identify regions that are (and are not) of interest
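The first two items in the list above can be sketched with Signal Processing Toolbox functions. The chirp test signal, the sample rate, and the choice of features are assumptions made for illustration; which features matter depends on your application.

```matlab
% Synthetic, uniformly sampled signal (an assumption): chirp from 1 Hz to 20 Hz
fs = 100;                        % sample rate in Hz
t = (0:1/fs:10-1/fs)';           % ten seconds of samples
x = chirp(t,1,10,20);

% A few scalar features that could feed a machine learning model
feat = [rms(x), meanfreq(x,fs), bandpower(x,fs,[0 10])];

% Time-frequency map: rows of p correspond to f, columns to tSpec
[p,f,tSpec] = pspectrum(x,fs,'spectrogram');
```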
Resampling signals is a straightforward task in the Signal Analyzer app. Drag the data you want to use from the Workspace Browser up to the Signal Table or over to the Display. Right-click on the signal and define Time Values to describe the sample rate, sample time, or time values vector of selected signals.
This simple example uses non-uniform data sampled once a day over six weeks. See the full documentation for this example.
In the Analyzer tab, expand the Preprocessing gallery and click the Resample icon. Signal Analyzer uses the Signal Processing Toolbox™ function resample to perform the resampling.
If your data is not uniformly sampled, you can use the app to interpolate it onto a uniform grid, specifying the method (linear, shape-preserving piecewise cubic, or cubic spline interpolation) and sample rate.
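A rough command-line equivalent uses resample with a vector of sample times. In this sketch the irregular time stamps and signal values are synthetic stand-ins, time is measured in days, and the target rate of one sample per day is an assumption chosen to match that unit.

```matlab
rng(0)                           % reproducible synthetic data

% Irregular sample times in days, about one reading per day over six weeks
tx = sort(42*rand(40,1));
x = sin(2*pi*tx/7);              % hypothetical weekly pattern

% Interpolate onto a uniform grid at one sample per day
fs = 1;
[y,ty] = resample(x,tx,fs,'linear');
```

Replacing 'linear' with 'pchip' or 'spline' selects the shape-preserving piecewise cubic or cubic spline methods mentioned above.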
If you want to see how other interpolation methods or resample settings perform, click Undo Preprocessing and resample the signal again.
For uniformly sampled signals, you can use the app to change the sample rate. You can specify either the desired sample rate or the factor by which you want to interpolate or decimate the signal.
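For a uniformly sampled signal, the same resample function accepts integer interpolation and decimation factors. The signal and the factors below are assumptions for illustration.

```matlab
% Uniformly sampled synthetic signal (an assumption): 3 Hz sine at 100 Hz
fsOld = 100;
t = (0:1/fsOld:1-1/fsOld)';
x = sin(2*pi*3*t);

% Interpolate by p and decimate by q, changing the rate by a factor of p/q
p = 3;
q = 2;
y = resample(x,p,q);
fsNew = fsOld*p/q;               % resulting sample rate in Hz
```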
Smoothing uses filtering to bring out the patterns in your data while suppressing components that are unimportant. The goal of smoothing is to remove noisy data components, making slow changes and trends easier to see.
Say you have a set of temperature readings in Celsius taken every hour for an entire month.
You can use moving average filters and resampling to isolate the effect of periodic components of the time of day on hourly temperature readings, as well as remove unwanted line noise from an open-loop voltage measurement.
You can see the effect that the time of day has on the temperature readings. If you are interested in the daily temperature variation over the month, the hourly fluctuations only contribute noise, which can make the daily variations difficult to discern. To remove the effect of the time of day quickly, choose the Smooth function from the Preprocessing section of the Signal Analyzer app.
There are eight smoothing methods to choose from.
You can see how using moving mean with the default settings affects the line.
For instructions on how to accomplish this programmatically, see the Signal Smoothing example.
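As a brief sketch of the idea, a 24-hour moving mean can be applied with the base MATLAB function smoothdata. The temperature series below is synthetic, and the window length is an assumption; the app's default window may differ.

```matlab
rng(0)                           % reproducible synthetic data

% Hypothetical hourly readings in Celsius for 30 days:
% slow trend + daily cycle + measurement noise
hoursPerDay = 24;
numDays = 30;
t = (0:hoursPerDay*numDays-1)';  % time in hours
tempC = 15 + 0.1*t/hoursPerDay + 5*sin(2*pi*t/hoursPerDay) + randn(size(t));

% Moving mean over a 24-hour window averages out the time-of-day effect
tempSmooth = smoothdata(tempC,'movmean',hoursPerDay);
```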
Create or Update the Script to Run Against the Full Dataset
When you are happy that you know which preprocessing techniques you want to apply to your full dataset, you can click Generate Function in the Analyzer tab.
Generate Function produces the code for all the preprocessing tasks that you applied as a single function that can be saved as a MATLAB script. Preprocessing an arbitrarily large dataset can now be automated by applying that function iteratively on all relevant data recordings by means of a short MATLAB program.
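Such a short program might look like the sketch below. The anonymous function preprocessFcn is a hypothetical stand-in for the generated function, and the cell array of recordings is synthetic; in practice you would call your saved function and load each recording from your own data source.

```matlab
% Hypothetical stand-in for the function created by Generate Function
preprocessFcn = @(x) smoothdata(x,'movmean',5);

% Hypothetical dataset: three recordings of different lengths
rng(0)
recordings = {randn(100,1), randn(250,1), randn(80,1)};

% Apply the same preprocessing to every recording
cleaned = cellfun(preprocessFcn,recordings,'UniformOutput',false);
```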
Another way to clean outliers in MATLAB is to use the Curve Fitting app. In some cases, the underlying process captured in the data might be well-approximated by a parametric equation. In this situation you can use curve fitting to fit an equation to the data, and then identify outliers by examining residuals.
In the example used earlier of a survey that asks for years of experience and age, you saw missing age data that was more prevalent in the higher age bands. Assuming you filled the missing data with a moving mean, it is reasonable to expect that a linear fit would work well when age and experience are plotted against each other. To test this, you need an array of each variable. These arrays were created when the data was cleaned, so you have a 499x1 double array of survey respondents’ ages and a 499x1 double array of survey respondents’ years of experience.
Start by opening the Curve Fitting app from the MATLAB toolstrip.
This opens up a user interface in a separate window. In the top left of the UI are drop-down menus to select what it is you want to plot. What you have in your workspace will determine what variables are available to you here.
The top-center drop-down menu offers a series of methods to fit a curve to the data. In the example, the linear fit works well. Under View, there’s an option to plot the residuals, which produces the second plot shown. The results closely match what you might expect. The younger ages have less variation in their years of experience; it would be difficult for a 20-year-old to have 15 years of experience. Conversely, as age goes up, so does the tendency for bigger swings from the fit.
You can try out different fits on your data. Once you have settled on a fit, select File > Generate Code to create your function, which you can then add to your preprocessing script.
You can also use the app to select and exclude outliers, to see the effect that removing outliers has on the fitted model.
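The same workflow can be sketched in code with the Curve Fitting Toolbox function fit. The age and experience arrays below are synthetic stand-ins for the survey data, and flagging outliers with isoutlier on the residuals is one possible rule, not necessarily what the app does.

```matlab
rng(0)                           % reproducible synthetic data

% Synthetic stand-ins for the 499x1 survey arrays (not the real data);
% the spread in experience grows with age
age = 20 + 45*rand(499,1);
experience = max(0, 0.7*(age-20) + 0.1*(age-20).*randn(499,1));

% Linear fit and its residuals
linFit = fit(age,experience,'poly1');
residuals = experience - linFit(age);

% Flag large residuals, then refit with those points excluded
isOut = isoutlier(residuals);
linFitClean = fit(age,experience,'poly1','Exclude',isOut);
```

Comparing the coefficients of linFit and linFitClean shows how much the flagged points influence the fitted line.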
Once you’ve done some initial data cleaning, try your data in the algorithm and see how it performs. It’s important to be able to get fast feedback on whether your data needs further preprocessing.
Repeat the Process
Cleaning your data for a machine learning algorithm is an iterative process. You explore your data to find which techniques are required, perform them, and then see how the algorithm performs with your data. If your model yields poor accuracy, you might need to go back and see whether further preprocessing is needed.
Try this for yourself using the Data Science: Predict Damage Costs of Weather Events example.