What Is Data Cleaning?
3 things you need to know
Data cleaning, also known as data cleansing or data wrangling, is the process of identifying and addressing anomalies in a given data set. Various techniques can be employed to cleanse data, including managing outliers, estimating missing data, or filtering out noise.
By cleaning the data, engineers and data scientists can improve the quality of their results and avoid making erroneous conclusions based on flawed or incomplete data.
MATLAB® provides functions and apps that simplify data cleaning, allowing you to focus on your analysis and problem-solving.
Why Does Data Cleaning Matter?
Data cleaning techniques are used across various domains:
- Signal Processing
Signals obtained from analog sensors are exposed to a variety of noise sources resulting in data inconsistencies. Preprocessing involves using data cleaning techniques such as removing outliers and smoothing to ensure that the analysis is performed on clean and meaningful data.
Advanced denoising techniques such as adaptive filters and wavelet transforms help remove noise while preventing the loss of underlying signal information in systems with wideband signals affected by white noise.
Figure 1 shows a plot of the daily weight of a person—with missing values—recorded for 365 days. The plot shows original and interpolated readings. The plot, zoomed in on days 200 through 250, shows identified missing values and their replacements derived using interpolation.
- Artificial Intelligence (AI)
The performance and reliability of AI models depends directly on the quality of input data. Data cleaning eliminates noise, inconsistencies, missing values, and outliers that can mislead AI models, leading to biased or erroneous outcomes. Clean input data ensures that AI algorithms learn valid patterns and provide accurate predictions.
Figure 2 shows a MATLAB plot of load consumption data cleansed of missing and outlier data and noise using the fillmissing
, filloutliers
, and smooothdata
functions, which is then inputted into an AI model to produce an accurate prediction of load consumption. Figure 3 plots the results of a prediction model alongside the actual data and the amount of error between them in MATLAB. Without using a cleansed data set to train the model, the prediction error could be higher.
- Perform Data Cleaning Using Deep Learning Models
Traditional data cleaning methods work well with data that can be modeled with commonly known statistical and mathematical models. But for complex data sets that do not fit standard models well, like human speech, EEG signals, etc., we can leverage deep learning models to perform data cleaning.
In this example shown in Figure 4, speech signals are riddled with noise from a washing machine running in the background. Data cleaning methods, such as smoothing or outlier removal, cannot effectively remove the noise from the washing machine data as it has an audio spectrum that overlaps with the speech signal. Deep learning networks, such as fully connected and convolutional, are able to clean or denoise the speech signal, thus removing the noise and leaving the underlying signal.
- Computational Finance
Accurate financial models, risk assessments, and investment strategies rely on clean data. Financial analysts spend a significant amount of their time on data cleaning instead of actual modeling due to challenges in their data. Missing data, large data, or varying data sources can lead to biased predictions and flawed decision-making. Data cleaning is an effective technique for improving accuracy and reliability of financial analysis. Figure 5 shows the missing values in the average annual stock prices of a company. The fillmissing
function in MATLAB is used to fill the missing data using linear interpolation.
How Does Data Cleaning Work?
Data cleaning is an iterative process that involves different techniques depending on your data set, the objectives of the final analysis, and the available tools and software. Data cleaning typically involves one or more of following steps:
Fill Missing Data
Missing data refers to the absence of values or information in a data set resulting in NULL, 0, empty strings, or invalid (NaN) data points. Values can be missing because of several reasons such as data acquisition, data transmission, and data conversion. Missing data can have a significant impact on the quality and validity of data analysis and modeling; hence, it is important to address it appropriately during the data cleaning process.
Missing data can be classified into three categories and identifying the right category can help you select an appropriate fill method:
- Missing at random (MAR) — In this category, the variable with missing values is dependent on other variables in the data set. For instance, a rooftop solar installation relaying telemetry data such as irradiance level, grid voltage, frequency, etc., would have missing values at night or during rainy days because there isn’t enough solar irradiance to power up the system and thus the missing values of grid voltage or frequency are caused by poor irradiance levels.
- Missing completely at random (MCAR) — In this category, the underlying cause of missing values is completely unrelated to any other variable in the data set. For example, missing packets in weather telemetry could result from malfunctioning sensors or high channel noise.
- Missing not at random (MNAR) — This scenario applies to variables where the underlying cause of missing data is related to the variable itself. For example, if a sensor relaying temperature information has reached its measurement limits, it would result in missing values in the form of its saturated thresholds.
Identifying missing data sounds straightforward, but replacing it with a suitable estimate is an involved process. You can start by spotting missing values using visualization or searching for invalid values. Replacing the missing values involves generating values that are likely to be close to the actual values. Based on the nature of the data, the technique of filling these missing values can vary. For example:
- A slow varying data like temperature could simply use the nearest valid value.
- Data sets exhibiting seasonality and reduced randomness like weather could use statistical methods like moving average, median, or K-nearest neighbors.
- Data sets exhibiting strong dependencies on its previous values like stock prices or economic indicators are well-suited for interpolation-based techniques to generate missing data.
The plot below shows raw solar irradiance data with its missing values filled using the fillmissing
function. In this instance, a moving median window-based technique is used to fill in the missing values.
Managing Outliers
Outliers are data points that deviate significantly from most observations within a data set. They can be unusually high or low values that do not seem to follow the general pattern of the data. Outliers can distort the statistical analysis and interpretation of a data set, potentially leading to misleading results. Outliers can arise due to various reasons, including measurement errors, data entry mistakes, natural variability, or genuine anomalies in the underlying process being studied.
Managing outliers involves two configurable steps:
- Detection
- Filling outliers
Detecting outliers involves defining a valid operating range outside of which any data point is identified as an outlier. Methods used in defining the valid operating range are related to the attribute, source, and purpose of the data set. These methods range from simple techniques like visualization-based or fixed threshold–based outlier detection to statistical methods like median absolute deviation, to distance-based methods, such as Euclidean and Mahalanobis.
After identifying the outliers, they can be replaced with generated values. Generating techniques used in replacing outliers are similar to the ones used for filling missing data.
Figure 7 shows input data with two outliers that are detected and filled using the linear interpolation median detection method.
Smoothing
Smoothing is a data analysis technique used to reduce noise, variability, or irregularities in a data set to reveal underlying patterns or trends more clearly. It is commonly applied in various fields including statistics, signal processing, time-series analysis, and image processing.
Like other data cleaning methods, the smoothing technique is also highly dependent on the nature and the domain of the data. You can use simple statistical methods like moving average filter, weighted moving average filter, or moving median-based filter to more complex techniques like splines, Fourier transform smoothing, and Kalman filtering. The smoothing function requires the data set to be ordered and sampled at a fixed interval.
Data Cleaning with MATLAB
Data cleaning is an important first step in many engineering and scientific workflows. MATLAB provides several functions and interactive low-code techniques (apps and Live Editor tasks) to perform data cleaning efficiently.
Interactive Capabilities
Writing code using high-level functions can be very efficient, particularly when you’re familiar with the language syntax and different function options. However, identifying the right data cleaning technique is typically iterative, and spending time getting the code just right for each iteration can slow things down. With the interactive tools in MATLAB, you can experiment with different data cleaning methods quickly, visualizing the impact of each approach on the data set, and then determining the optimal cleaning approach. Live Editor tasks and apps, such as the Data Cleaner app, make it easy to explore, clean, prepare, and complete data analysis tasks with less coding. The tools can also automatically generate MATLAB code for reuse.
Live Editor Tasks
Live Editor tasks are simple point-and-click interfaces or small apps that can be embedded into a live script. Tasks represent a series of MATLAB commands that automatically run as you explore the parameters interactively; results are displayed immediately. Tasks are useful because they can help reduce development time, errors, and time spent in plotting. You can interactively perform different data cleaning operations using a sequence of Live Editor tasks such as Clean Missing Data, Clean Outlier Data, Smooth Data, and so on.
Data Cleaner App
The Data Cleaner app allows you to process and clean column-oriented raw data. You can interactively iterate and visualize the impact of various data cleaning choices on your raw data.
Using this app, you can:
- Access column-oriented data from the MATLAB workspace or from a file.
- Explore data by using visualization, data, and summary views.
- Sort by a variable and rename or remove variables.
- Retime data in a timetable, stack or unstack table variables, clean missing data, clean outlier data, smooth data, or normalize data.
- Edit and reorder previously performed data cleaning steps based on how you want to analyze your data.
- Export cleaned data to the MATLAB workspace, or export MATLAB code for data cleaning as a script or function.
Programmatic Approach
You can use a combination of command line functions available in MATLAB to efficiently perform data cleaning of your data set and prepare it for analysis. By using functions like readmatrix
, readtable
, readtimetable
, and importdata
, raw data can be imported into MATLAB. You can then visualize your data set to identify anomalies. Once you have established the inconsistencies with your data set, you can use data cleaning functions in MATLAB to fill missing values (fillmissing
, ismissing
, rmmissing
), remove outliers (filloutliers
, rmoutliers
), and filter any noise (smoothdata
, movmean
, movmedian
) in the data set.
Data cleaning is an important first step in data analysis to make your data suitable for further analysis. For more information, check the resources below.