Apply Generated MATLAB Function to Expanded Data Set
This example shows how to use a small set of measurement data in Diagnostic Feature Designer to develop a feature set, generate and run code to compute those features for a larger set of measurement data, and compare model accuracies in Classification Learner.
Using a smaller data set at first has several advantages, including faster feature extraction and cleaner visualization. Generating code then lets you automate the same feature computations for an expanded set of members, which increases the number of feature samples and therefore improves classification model accuracy.
The example is based on Analyze and Select Features for Pump Diagnostics, uses the pump fault data from that example, and computes the same features. For more detail on the steps and the rationale behind the feature development operations, see Analyze and Select Features for Pump Diagnostics. This example assumes that you are familiar with the layout and operations in the app. For more information on working with the app, see the three-part tutorial in Design Condition Indicators for Predictive Maintenance Algorithms.
Load Data and Create Reduced Data Set
Load the data set pumpData. pumpData is a 240-member ensemble table that contains simulated measurements for flow and pressure. pumpData also contains categorical fault codes that represent combinations of three independent faults. For example, a fault code of 0 represents data from a system with no faults. A fault code of 111 represents data from a system with all three faults.
load savedPumpData pumpData
View a histogram of original fault codes. The histogram shows the number of ensemble members associated with each fault code.
fcCat = pumpData{:,3};
histogram(fcCat)
title('Fault Code Distribution for Full Pump Data Set')
xlabel('Fault Codes')
ylabel('Number of Members')
Create a subset of this data set that contains 10% of the data, or 24 members. Because simulation data is often clustered, generate a randomized index with which to select the members. For the purposes of this example, first use rng to set a repeatable random seed.
rng('default')
Compute a randomized 24-element index vector idx. Sort the vector so that the indices are in order.
pdh = height(pumpData);
nsel = 24;
idx = randi(pdh,nsel,1);
idx = sort(idx);
Use idx to select member rows from pumpData.
pdSub = pumpData(idx,:);
View a histogram of the fault codes in the reduced data set.
fcCatSub = pdSub{:,3};
histogram(fcCatSub)
title('Fault Code Distribution for Reduced Pump Data Set')
xlabel('Fault Codes')
ylabel('Number of Members')
All the fault combinations are represented.
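You can also confirm programmatically that every fault code appears in the subset by comparing the unique codes in the two sets. This check is a small sketch and is not part of the original example.

% Sketch: list any fault codes present in the full set but missing
% from the subset (an empty result means full coverage).
missing = setdiff(unique(fcCat),unique(fcCatSub));
if isempty(missing)
    disp('All fault codes are represented in the subset.')
else
    disp('Missing fault codes:')
    disp(missing)
end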
Import Reduced Data Set into Diagnostic Feature Designer
Open Diagnostic Feature Designer by using the diagnosticFeatureDesigner command.
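diagnosticFeatureDesigner

Import pdSub into the app.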
Extract Time-Domain Features
Extract the time-domain signal features from both the flow and pressure signals. For each signal, first select the signal. Then, in the Feature Designer tab, select Time Domain Features > Signal Features and select all features.
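For intuition, the following sketch computes a few representative time-domain features for a single ensemble member at the command line. The data layout (each flow measurement stored as a timetable in a variable named flow) is an assumption; the app computes these features for every member automatically.

% Sketch: representative time-domain features for one member's flow
% signal. Assumes pdSub stores each flow measurement as a timetable
% in a variable named flow.
flowTT = pdSub.flow{1};    % flow timetable for the first subset member
x = flowTT{:,1};           % signal values
featMean = mean(x);        % mean
featRMS  = rms(x);         % root mean square
featStd  = std(x);         % standard deviation
featKurt = kurtosis(x);    % kurtosis (Statistics and Machine Learning Toolbox)
featP2P  = peak2peak(x);   % peak-to-peak value (Signal Processing Toolbox)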
Extract Frequency-Domain Features
As Analyze and Select Features for Pump Diagnostics describes, computing the frequency spectrum of the flow highlights the cyclic nature of the flow signal. Estimate the frequency spectrum by selecting Spectral Estimation > Autoregressive model and using the options shown for both flow and pressure.
From the derived flow and pressure spectra, compute spectral features in the band 23–250 Hz, using the options shown.
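As a rough command-line analogue of these steps, the sketch below estimates an autoregressive (Burg) power spectrum for one member's flow signal and computes the band power over 23–250 Hz. The AR model order and the assumption of regular sampling are illustrative choices, not the app's exact settings.

% Sketch: AR spectrum and spectral features for one member's flow
% signal. The model order (20) is an assumed value.
flowTT = pdSub.flow{1};
x  = flowTT{:,1};
fs = 1/seconds(flowTT.Properties.TimeStep);   % sample rate (regular sampling)
[pxx,f] = pburg(x,20,[],fs);                  % Burg AR PSD estimate
bandPow = bandpower(pxx,f,[23 250],'psd');    % band power over 23-250 Hz
peakAmp = max(pxx(f >= 23 & f <= 250));       % peak amplitude in the band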
Rank Features
Rank your features by selecting Rank Features > FeatureTable1. Because faultCode contains multiple possible values, the app defaults to the One-Way ANOVA ranking method.
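One-way ANOVA scores each feature by how strongly its mean differs across the fault-code groups. The app performs this ranking internally; the sketch below illustrates the idea on a hypothetical feature table featureTableSub that contains the features plus faultCode.

% Sketch of one-way ANOVA feature ranking. featureTableSub is a
% hypothetical table; the app's exact scoring may differ.
featNames = setdiff(featureTableSub.Properties.VariableNames,{'faultCode'});
score = zeros(numel(featNames),1);
for k = 1:numel(featNames)
    p = anova1(featureTableSub.(featNames{k}), ...
        featureTableSub.faultCode,'off');   % 'off' suppresses the ANOVA figure
    score(k) = -log10(p);                   % larger score = better separation
end
[~,rankIdx] = sort(score,'descend');
rankedNames = featNames(rankIdx);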
Export Features to Classification Learner
Export the feature set to Classification Learner so that you can train a classification model. In the Feature Ranking tab, click Export > Export Features to the Classification Learner. Select the top 15 features by selecting Select top features and typing 15.
Train Models in Classification Learner
Once you click Export, Classification Learner opens a new session using the data you exported. Start the session by clicking Start Session.
Train all available models by clicking All in the Classification Learner tab, and then clicking Train All.
Classification Learner trains all the models and initially sorts them by name. Use the Sort by menu to sort by Accuracy (Validation). For this session, the highest scoring model, KNN, has an accuracy of about 63%. Your results may vary. Click Confusion Matrix to view the confusion matrix for this model.
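As an optional command-line check, you can estimate a comparable cross-validated accuracy directly. This sketch assumes the exported features are in the hypothetical table featureTableSub with the faultCode response; Classification Learner's KNN presets differ, so the number will not match exactly.

% Sketch: 5-fold cross-validated KNN accuracy on the exported features.
mdl = fitcknn(featureTableSub,'faultCode');
cvmdl = crossval(mdl,'KFold',5);
accuracy = 1 - kfoldLoss(cvmdl)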
Generate Code to Compute Feature Set
Now that you have completed your interactive feature work with a small data set, you can apply the same computations to the full data set using generated code. In Diagnostic Feature Designer, generate a function to calculate the features. To do so, in the Feature Ranking tab, select Export > Generate Function for Features. Select the same 15 features that you exported to Classification Learner.
When you click OK, a function appears in the editor.
Save the function to your local folder as diagnosticFeatures.
Apply the Function to Full Data Set
Execute diagnosticFeatures with the full pumpData ensemble to get the 240-member feature set. Use the following command.
feature240 = diagnosticFeatures(pumpData);
feature240 is a 240-by-16 table. The table includes the condition variable faultCode and the 15 features.
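You can preview the result to confirm its dimensions and contents, for example:

head(feature240)                % preview the first rows of the feature table
summary(feature240.faultCode)   % member counts per fault code (categorical)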
Train Models in Classification Learner with Larger Feature Table
Train classification models again in Classification Learner, using feature240 this time. Open a new session window using the following command.
classificationLearner
In the Classification Learner window, click New Session > From Workspace. In the New Session window, in Data Set > Data Set Variable, select feature240.
Repeat the steps you performed with the 24-member data set. Start the session and then train all models. Sort the models by Accuracy (Validation). In this session, the highest scoring model is Bagged Trees, with an accuracy of about 73%, roughly 10 percentage points higher than the best model trained on the reduced data set. Again, your results may vary, but they should still reflect the increase in best accuracy.