How to Preprocess Time Series Data with MATLAB
In this video we demonstrate how you can process and clean time series data stored in Excel sheets, in multiple formats, and with multiple sampling rates in MATLAB®. We start with importing data from Excel sheets using the Import Tool. Next, we focus on how to prepare the data to convert to the timetable datatype.
We then explore the preprocessing functions available with timetables including synchronizing the data sets to a common time reference, assessing data quality, and dealing with duplicate and missing data. At the end, we show the stacked plot of variables with row times.
Introduction
In data science it is always said that 80% of the time is spent in preparing data and 20% of time is spent complaining about the need to prepare data. So let’s try solving this issue by preparing and preprocessing time series data in less time and with a few lines of code in MATLAB.
DEMO
For this example, we will be using data from Phasor Measurement Units (PMUs)
- We have 15 PMUs which are measuring the voltage magnitude, voltage angle, and frequency synchronized to a GPS clock system
- They are in 3 time zones in Australia – Eastern Daylight Savings time zone, Central Daylight Savings time zone, and Eastern Standard time
- The PMUs were sampled at 3 different sampling rates: 30, 60, 120Hz
- This occurred at different time offsets as the PMUs did not start recording at the same time
So yes, we have data that needs to be aligned and cleaned for any further analysis.
Our data is in three Excel spreadsheets, one for each of the three time zones, and we will be importing the data using the Import Tool. If we open the spreadsheet in MATLAB it, by default, opens in the Import Tool. Click on import data. The data is then stored in the workspace as a table.
For processing the time series data and working on time-specific functions, MATLAB has a datatype timetable
that makes the processing easier. To convert the table into timetable our time should be in datetime
format.
So now we will extract the time column from each table and convert it into datetime using the datetime
function. In the same line of code, we are addressing the first challenge of our dataset; that the data is in three different time zones with different offsets. So, in the same step we align the time zones by giving the time zones as the options: Australia/Sydney for Eastern Daylight time, Australia/Adelaide for Central Daylight time and Australia/Brisbane for Eastern Standard time.
Let’s plot our frequency measurement with the time from each PMU to visually confirm the time alignment. As we see frequencies overlap. Hence, we have our time alignment correct.
Once our time columns are in datetime format and aligned we convert all our three tables into timetables by using the function table2timetable
. We now have our three tables in the format of timetables.
Next, we proceed to cleaning the data. As we can see in this linked MATLAB documentation that we can handle and clean timetables using various workflows like finding and removing the missing data, sorting and removing duplicates, etc.
Here we demonstrate one way of working with duplicate times. First, with the unique function we create a vector UniqueTime
which returns the unique and sorted rows of the time variable. This vector does not contain the duplicate row times. Just for verification we check here that we have seven duplicate row rimes. To handle that we use the retime
function to calculate the mean for all other variables with duplicate row times. Hence now we have a new table UniquedataACDT with no duplicate row times. We do a similar process for the other two tables. We will now work with the three new tables.
Missing data can be processed in different ways. For a single timetable we can use the rmissing
function and remove all the rows with missing data. As we do here for the UniquedataACDT table.
The other method is using interpolation while joining the tables.
We join all the three tables into one timetable using the synchronize
function. Synchronize collects the variables from all input timetables and outputs a single timetable with all the variables concatenated horizontally.
Here we specify union
to synchronize on a time vector that includes times from all three timetables. And linear
to fill in missing timetable elements with linear interpolation.
Hence now we have a Total table consisting of all the three clean timetables.
Once we have our data all cleaned up, we can do some visual analysis by plotting the data.
Here we show one example using the stackedplot
function. Stackedplot
plots the variables of a timetable against row times. We plot the voltage magnitudes VM6, VM1, and VM11 against the row times from the total combined table.
Please look at the description to see more resources on data preprocessing and download the code to try it with your datasets.
Thank you.