Speeding Up Data Preprocessing for Machine Learning
Chapter 2
Common Preprocessing Tasks
The preprocessing tasks required for your machine learning application will vary based on what kind of data you’re working with as well as the type of algorithm you choose. However, some tasks are more common than others. This chapter explores five of the most common data preprocessing techniques as well as how to perform them in MATLAB.
Formatting
Many preprocessing tasks involve making datasets uniform in one way or another; this includes formatting, especially in relation to timestamped data.
Formatting issues can quietly introduce major problems that waste time if they aren’t addressed early on. These problems may not necessarily throw an error, so they can decrease the accuracy of your model without being obviously wrong. Two MATLAB functions make formatting timestamped data easier:
datetime
duration
Time Standardization and Offsets
If you work with timestamped data, it is worth checking that dates and times are written in a consistent format. This should not be assumed.
datetime arrays in MATLAB represent points in time and spare you the work of reckoning with time zones, daylight saving time, and leap seconds.
Syntax
t = datetime
t = datetime(relativeDay)
t = datetime(DateStrings)
t = datetime(DateVectors)
t = datetime(Y,M,D)
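As a minimal sketch of the consistency check described above, the following assumes two hypothetical logs (logA and logB, invented for illustration) that record timestamps in different text formats. Stating the input format explicitly avoids silent misreads such as swapped day and month.

```matlab
% Hypothetical timestamps logged in two different text formats.
logA = ["2023-03-01 14:30:00"; "2023-03-02 09:15:00"];
logB = ["01/03/2023 16:45"; "02/03/2023 08:00"];   % day/month/year

% State the input format explicitly rather than letting MATLAB guess.
tA = datetime(logA, 'InputFormat', 'yyyy-MM-dd HH:mm:ss');
tB = datetime(logB, 'InputFormat', 'dd/MM/yyyy HH:mm');

% Combine, sort, and display in a single consistent format.
t = sort([tA; tB]);
t.Format = 'yyyy-MM-dd HH:mm:ss';
disp(t)
```

Once every timestamp is a datetime value, comparisons and sorting no longer depend on how the text was originally written.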
Time Durations
In MATLAB, data that represents an elapsed amount of time is represented by duration. Different data logging systems, variance in start times, and other factors can create problems for relative timestamps that measure elapsed time.
Use duration to compare how long tasks with different start times lasted, set a threshold, or create a consistent length of time (e.g., signal data at 10-second intervals).
See more information on working with dates, times, and calendar durations.
Syntax
D = duration(H,MI,S)
D = duration(H,MI,S,MS)
D = duration(X)
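The uses listed above can be sketched as follows. The start and end times, the 40-minute threshold, and the variable names are hypothetical values chosen for illustration.

```matlab
% Hypothetical start and end times for three tasks on the same day.
startTimes = datetime(2023,3,1,[8 9 13],[0 30 15],0)';
endTimes   = datetime(2023,3,1,[8 11 13],[45 0 50],0)';

% Subtracting datetime arrays yields a duration array.
elapsed = endTimes - startTimes;

% Flag tasks that ran longer than a 40-minute threshold.
threshold = duration(0,40,0);
ranLong = elapsed > threshold;

% Build a consistent 10-second sampling grid covering one minute.
sampleTimes = duration(0,0,0:10:60);
```

Because elapsed is a duration array, tasks can be compared directly even though they started at different times of day.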
Merging
Combining data from multiple locations or datasets can be the key to a successful machine learning model. For example, augmenting a retail dataset with weather data could greatly improve a machine learning model’s chances at predicting sales of sunscreen. Two common ways of joining datasets are:
- Merge based on some key value – a unique identifier such as an email address
- Merge based on time
Merging based on a key can be accomplished using the command for the desired type of merge (join, innerjoin, outerjoin), or by using the Join Tables Live Editor task.
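A key-based merge might look like the sketch below. The two tables and their contents are hypothetical; Email serves as the key shared by both.

```matlab
% Hypothetical retail and customer tables sharing an Email key.
sales = table(["a@x.com"; "b@x.com"; "c@x.com"], [20; 35; 12], ...
    'VariableNames', {'Email','Spend'});
profile = table(["a@x.com"; "c@x.com"; "d@x.com"], [34; 51; 27], ...
    'VariableNames', {'Email','Age'});

% Keep only customers present in both tables.
both = innerjoin(sales, profile, 'Keys', 'Email');

% Keep every customer; entries without a match are filled with missing values.
everyone = outerjoin(sales, profile, 'Keys', 'Email', 'MergeKeys', true);
```

The choice between the two depends on whether unmatched rows carry useful information: innerjoin drops them, while outerjoin keeps them and leaves gaps to handle in a later step.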
Merging based on time can be accomplished in MATLAB with timetables and the synchronize command. Note that unless the timetables being synchronized have exactly the same time step, you will need to address what happens at time steps when a particular variable was not measured. The synchronize command provides many built-in methods to deal with this issue, such as performing linear or spline interpolation, or copying the value from the nearest time step where a measurement was made.
Merge Timetables
% Load two sample timetables from a file, then synchronize their data.
load smallTT

% Display the timetables. TT1 has row times that are out of order.
% TT1 and TT2 have different variables.
TT1

TT1=3×1 timetable
            Time            Temp
    ____________________    ____
    18-Dec-2015 12:00:00    42.3
    18-Dec-2015 08:00:00    37.3
    18-Dec-2015 10:00:00    39.1

TT2

TT2=3×1 timetable
            Time            Pressure
    ____________________    ________
    18-Dec-2015 09:00:00      30.1
    18-Dec-2015 11:00:00     30.03
    18-Dec-2015 13:00:00      29.9

% Synchronize TT1 and TT2. The output timetable, TT, contains all the
% row times from both timetables, in sorted order. In TT, Temp contains
% NaN for row times from TT2, and Pressure contains NaN for row times
% from TT1.
TT = synchronize(TT1,TT2)

TT=6×2 timetable
            Time            Temp    Pressure
    ____________________    ____    ________
    18-Dec-2015 08:00:00    37.3       NaN
    18-Dec-2015 09:00:00     NaN      30.1
    18-Dec-2015 10:00:00    39.1       NaN
    18-Dec-2015 11:00:00     NaN     30.03
    18-Dec-2015 12:00:00    42.3       NaN
    18-Dec-2015 13:00:00     NaN      29.9
Try this example by pasting this command in MATLAB:
openExample('matlab/SynchronizeTimetablesAndInsertMissingDataIndicatorsExample')
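The example above leaves NaN wherever a variable was not measured. As a sketch of the interpolation options mentioned earlier, the following rebuilds the same two timetables by hand (values copied from the output shown, so that smallTT is not needed) and fills the gaps during synchronization. Sorting TT1 first is a precaution taken here, not something the original example does.

```matlab
% Rebuild the two sample timetables (values copied from the example output).
Time1 = datetime(2015,12,18,[12 8 10],0,0)';
Time2 = datetime(2015,12,18,[9 11 13],0,0)';
TT1 = timetable([42.3; 37.3; 39.1], 'RowTimes', Time1, 'VariableNames', {'Temp'});
TT2 = timetable([30.1; 30.03; 29.9], 'RowTimes', Time2, 'VariableNames', {'Pressure'});

% Sort TT1 by row time before interpolating (a precaution in this sketch).
TT1 = sortrows(TT1);

% Union of row times, filling gaps by linear interpolation.
% Row times outside a variable's measured range are extrapolated by default.
TTlinear = synchronize(TT1, TT2, 'union', 'linear');

% Alternative: regular hourly grid, copying the nearest measurement.
TThourly = synchronize(TT1, TT2, 'hourly', 'nearest');
```

Whether interpolated or copied values are appropriate depends on how quickly the measured quantity changes between samples.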
Missing Data
Machine learning algorithms make predictions based on the features within the data and how those features relate to each other. When data is missing within an observation, finding the relationships between features becomes more difficult. The decision to remove or replace missing data might depend on which variables in the observation are missing.
When preparing data for machine learning, you have three options for what you should do with missing data: delete (the observation, or perhaps the whole column), replace, or do nothing and let the model deal with it. Your decision depends on why that data is missing.
To see if your missing data is random, plot each of the variables to look for a pattern.
Missing values are represented in MATLAB according to their data type. For instance, a missing number shows up as NaN, a missing datetime as NaT, a missing categorical as <undefined>, and a missing string as <missing>.
The following example looks for missing data within a table of 500 survey responses (SurveyAges) with three columns: Respondent ID, Age, and Years of Experience. You want to see if there are missing values and their locations.
Find Missing Data
A = SurveyAges(:,2)
Tf = ismissing(A)
G = sum(Tf)
This results in a logical array of 1s and 0s, where 1 means the data is missing. In this case G = 36, the total number of missing responses.
Syntax
Tf = ismissing(A)
Tf = ismissing(A,indicator)
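Both syntaxes can be tried on a small stand-in table. The six-row SurveyAges below is hypothetical (the original 500-response data is not shown), and the -99 sentinel is an assumed convention used only to illustrate the indicator argument.

```matlab
% Hypothetical stand-in for the survey table (the original data is not shown).
SurveyAges = table((1:6)', [34; NaN; 51; 29; NaN; 62], [10; 3; NaN; 5; 2; 40], ...
    'VariableNames', {'RespondentID','Age','YearsOfExperience'});

% Logical mask of missing entries, one column per table variable.
Tf = ismissing(SurveyAges);

% Count missing entries per variable and locate the affected rows.
missingPerVariable = sum(Tf, 1);
rowsWithMissing = find(any(Tf, 2));

% Treat a sentinel such as -99 as missing too, using the indicator syntax.
% NaN is listed explicitly so that it is still counted.
ages = [34 -99 51 NaN];
TfSentinel = ismissing(ages, [-99 NaN]);
```

Counting per variable and per row helps decide whether to drop a few observations or an entire column.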
In general, if the data is missing at random, it is usually safe to delete it. By plotting the data in this example, you can see that missing data is more prevalent in the higher age brackets.
When missing data is not random (e.g., older employees may not want to disclose their age), then deleting that data can introduce bias. In this case you could fill those gaps with generated data. Which method to use to fill those values will depend on your data. Common techniques include mean, median, mode, linear regression, and K nearest neighbor. If you are working with time-series data, additional techniques include last observation carried forward, next observation carried backward, and linear interpolation.
MATLAB has many functions you can use to delete or fill missing values. fillmissing is a versatile function discussed in more detail later.
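Several of the fill techniques named above map directly onto fillmissing options. The vector x below is hypothetical data chosen so the results are easy to check by hand.

```matlab
% Hypothetical sensor readings with gaps.
x = [10 12 NaN 16 NaN NaN 22];

% Replace gaps with the mean or median of the observed values.
xMean   = fillmissing(x, 'constant', mean(x, 'omitnan'));
xMedian = fillmissing(x, 'constant', median(x, 'omitnan'));

% Time-series style fills.
xLocf   = fillmissing(x, 'previous');  % last observation carried forward
xNocb   = fillmissing(x, 'next');      % next observation carried backward
xLinear = fillmissing(x, 'linear');    % linear interpolation

% Or drop the missing entries altogether.
xRemoved = rmmissing(x);
```

Comparing the filled vectors side by side is a quick way to see how strongly each method shapes the data before committing to one.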
Clean missing data
If a column is missing a significant proportion of values, it may be best to remove the entire column from the dataset.
This is a simple step using a task in the Live Editor. As with other tasks, select New Live Script in the Home toolstrip.
Under the Live Editor tab, click Task, then select Clean Missing Data.
That will open the user interface. The options available to you to select data will depend on the variables in your workspace.
Use the drop-down menu to choose the data you want to search for missing data, then choose what you want to do with those entries, whether it’s fill them or remove them.
Set the cleaning method to Remove missing. The workspace now has three variables: the original table, an index of the missing entries, and a double array of cleaned data.
You can tell the data is cleaned because the graph on the right marks the removed entries, and its title gives the number of missing entries.
Clicking the down arrow at the bottom of the task window exposes the code so you can see what MATLAB is doing.
% Remove missing data
[cleanedData,missingIndices] = rmmissing(offerab.Actions);

% Visualize results
clf
plot(find(~missingIndices),cleanedData,'Color',[0 114 189]/255,'LineWidth',1.5,'DisplayName','Cleaned data')
hold on

% Plot removed missing entries
x = repelem(find(missingIndices),3);
y = repmat([ylim(gca) NaN]',nnz(missingIndices),1);
plot(x,y,'Color',[145 145 145]/255,'DisplayName','Removed missing entries')
title(['Number of removed missing entries: ' num2str(nnz(missingIndices))])

hold off
legend
clear x y
This code can be saved as part of a preprocessing script.
The steps to replace missing data in the Live Editor begin the same as if you were removing the data. This time, under Specify Method, select Fill missing then select one of the 10 methods.
When you look at the code generated to fill missing entries, in this example using moving mean, you can see that this is done in a single line of code; the rest of the code plots the cleaned data and filled entries.
% Fill missing data
[cleanedData,missingIndices] = fillmissing(offerab.Actions,'movmean',3);

% Visualize results
clf
plot(cleanedData,'Color',[0 114 189]/255,'LineWidth',1.5,'DisplayName','Cleaned data')
hold on

% Plot filled missing entries
plot(find(missingIndices),cleanedData(missingIndices),'.','MarkerSize',12,...
    'Color',[217 83 25]/255,'DisplayName','Filled missing entries')
title(['Number of filled missing entries: ' num2str(nnz(missingIndices))])

hold off
legend
clear missingIndices
Outliers
The presence and frequency of outliers tell you a lot about the overall quality of your data. Whether outliers are uniformly distributed, clustered, or exist in some other pattern influences how you deal with them. If they make up a very small proportion of your dataset and they look inaccurate or are the result of poor data collection, it may be safe to remove them as bad data. However, the first step is still identifying them in your data.
You can apply a number of different statistical models to find outliers in data. Popular methods include:
- Grubbs’ test
- Median
- Moving means
- Quartiles
No matter which method you use, you should become familiar with the min, max, mean, and standard deviation of your data.
MATLAB has a number of built-in functions you can use to find outliers in data; files are also available in File Exchange to simplify this task, such as Tests to Identify Outliers in Data Series.
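One of those built-in functions is isoutlier, which supports the detection methods listed above. The sketch below uses a short hypothetical vector with a single obvious spike and starts with the summary statistics worth knowing beforehand.

```matlab
% Hypothetical measurements with one obvious spike at position 4.
A = [57 59 60 100 59 58 57 58 61 62 60 62 58 57];

% Basic statistics worth knowing before choosing a method.
stats = [min(A) max(A) mean(A) std(A)];

% Flag outliers with several detection methods.
tfMedian    = isoutlier(A);                  % default: scaled median absolute deviation
tfMean      = isoutlier(A, 'mean');          % more than 3 standard deviations from the mean
tfQuartiles = isoutlier(A, 'quartiles');     % more than 1.5 interquartile ranges outside the quartiles
tfGrubbs    = isoutlier(A, 'grubbs');        % Grubbs' test
tfMoving    = isoutlier(A, 'movmedian', 5);  % moving median over a 5-sample window
```

On real data the methods often disagree, so it is worth comparing the flagged points from each before removing anything.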
While removing outliers is a risky business, leaving them untouched can significantly skew results. There are many ways to handle outliers in MATLAB, whether programmatically using the command line, or interactively through the Live Editor. Here’s a brief run-through of each approach.
Command Line
The filloutliers function in MATLAB is a handy tool for identifying outliers and replacing them via some automatic method.
Methods to fill outliers include 'previous', 'linear', and 'spline'.
Here’s a simple example:
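% Anoise (noisy data containing an outlier) and t (its sample points)
% are assumed to already exist in the workspace.
Afill = filloutliers(Anoise,'next');
plot(t,Anoise,t,Afill)
axis tight
legend('Noisy Data with Outlier','Noisy Data with Filled Outlier')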
Syntax
B = filloutliers(A,fillmethod)
B = filloutliers(A,fillmethod, findmethod)
B = filloutliers(A,fillmethod, 'percentiles', threshold)
B = filloutliers(A,fillmethod, movmethod, window)
B = filloutliers(___,dim)
B = filloutliers(___,Name,Value)
[B,TF,L,U,C] = filloutliers(___)
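The multi-output syntax is useful for inspecting what was changed. In the sketch below, t and Anoise are hypothetical data with a single spike, and the plot overlays the lower and upper thresholds returned by filloutliers.

```matlab
% Hypothetical noisy signal with a single spike at sample 6.
t = 1:10;
Anoise = [5.1 4.9 5.0 5.2 4.8 12.0 5.1 5.0 4.9 5.2];

% Replace outliers by linear interpolation, detecting them with the
% default method (scaled median absolute deviation).
% TF marks the replaced entries; L, U, and C are the lower threshold,
% upper threshold, and center value used for detection.
[Afill, TF, L, U, C] = filloutliers(Anoise, 'linear');

% Overlay original and filled data with the detection thresholds.
plot(t, Anoise, 'o-', t, Afill, '.-')
hold on
plot(t([1 end]), [L L], '--', t([1 end]), [U U], '--')
hold off
legend('Original', 'Filled', 'Lower threshold', 'Upper threshold')
```

Plotting the thresholds alongside the data makes it easier to judge whether the detection method is too aggressive or too lenient.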
Given the many approaches to filling outliers, it is important to visually inspect the results before trying the filled data in a machine learning algorithm.
Better data cleaning up front can reduce the amount of time you spend tweaking your algorithm later.
Live Editor Tasks
One useful aspect of using a Live Editor task to clean outlier data is that the user interface will update the number of outliers each time you pick a detection method.
Here’s the same data with three detection methods. Grubbs and mean both return fewer than 100 outliers, while quartiles is somewhat aptly returning 1103 outliers, which is slightly more than one-fourth of the entries.
To clean the outliers, select a cleaning method—these are the same options you have in the command line.
The script will run as you make changes, so you can see how it will work immediately.
Add the outlier method you want to use to your preprocessing script and save your script.
Normalization
Normalization is a useful technique when features have dissimilar ranges of values, but you want to give each feature equal consideration.
Imagine you want to create a predictive maintenance model. In order to identify fault predictors, you use sensors that record vibration frequency and pressure. In simple terms, a change from normal range might indicate a problem, and the distance from normal may indicate the severity of the problem. The healthy range for vibration frequency may be 10–800 Hz, while the healthy range for pressure might look more like 10–30 psi. To be able to use features derived from each of these measurements in the same model, you would normalize the datasets to create an even playing field.
When should you normalize your data? In general, when the absolute values or ranges of different variables differ by an order of magnitude or more. However, it also depends on whether you still need the units for analysis. Non-normalized data is useful during other preprocessing steps since you can still “reason” on the data. Normalization is often one of the last steps you would do before feature engineering.
Syntax
N = normalize(A)
N = normalize(A,dim)
N = normalize(___, method)
N = normalize(___, method, methodtype)
N = normalize(___, 'DataVariables', datavars)
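As a sketch of the predictive maintenance scenario described above, the following normalizes two hypothetical sensor features recorded on very different scales. The variable names and values are invented for illustration.

```matlab
% Hypothetical sensor features on very different scales.
vibrationHz = [120; 450; 300; 780; 60];
pressurePsi = [12; 25; 18; 29; 11];
features = table(vibrationHz, pressurePsi);

% Default: z-score (zero mean, unit standard deviation) per variable.
Nz = normalize(features);

% Rescale each variable to the interval [0, 1].
Nr = normalize(features, 'range');

% Normalize only selected variables, leaving the others untouched.
Np = normalize(features, 'zscore', 'DataVariables', 'pressurePsi');
```

Keeping the original features table alongside the normalized copies preserves the units for any analysis that still needs them.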