cvpartition with different selection of input features per fold

Question

Louis 2023-1-23

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1899295-cvpartition-with-different-selection-of-input-features-per-fold

回答： Drew 2024-9-3，22:25

I am currently utilizing cvpartition to perform k-fold cross validation.

A simplified code I am using looks like below:

% create cross validation object
cvo = cvpartition(y, 'KFold', 5); % 5-fold cross validation
% perform cross validation using logistic regression
logisticRegression = fitglm(x, y, 'CVPartition', cvo, 'ClassNames', classOrder);

The issue I have is that I now want to perform some feature selection only using the training fold of each cross validation split to prevent data leakage (using some statistics from the training data in order to select which features to use for model training).

Procedure (very rough pseudo code) for what I am describing above looks like below:

create cvo

for each fold f:

perform feature selection using training portion of the fold (using some stats calculated based on the training portion of the fold)

train the model with reduced set of features

evaluate (validate) the model using the validation portion of the fold

end

Is there a way achieve above (using different feature subsets for each fold) with cvo? I have already built a pretty extensive pipeline around using cvo, so I'd like to keep using it to prevent myself from re-doing most of the work.

Any help will be greatly appreciated. Thank you,

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Drew 2024-9-3，22:25

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1899295-cvpartition-with-different-selection-of-input-features-per-fold#answer_1510139

在 MATLAB Online 中打开

The Classification Learner app will generate code that does exactly that. Here is an example using two classes from the Fisher Iris data, to demonstrate how to generate the code:

t=readtable("fisheriris.csv");
t100=t(51:150,:); % Select a subset of the data with only two classes
% Start Classification Learner
classificationLearner

In the app,

(1) Load t100, choose 5-fold cross-validation.

(2) Select "Binary GLM Logistic Regression" from the models gallery

(3) Choose Feature Selection, pick a ranking method, and choose "top 3 predictors". In the example here, the MRMR ranking algorithm (fscmrmr) was selected.

(4) Train the model

(5) Select "Generate Function" in the export section

The generated code is below. This code is exactly as generated by the Classification Learner app. In the resulting model training function, the validation accuracy is calculated as specified in the question. There is a loop over the validation folds, feature selection on training data in each fold, model building on the selected features for that fold, and evaluation of the model for each fold. Note that the generated code could be simplified if desired. The code is longer due to things like normalization, possible handling of categorical predictors and missing values, the use of a function handle for the predict function to handle feature selection, etc.

function [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% Returns a trained classifier and its accuracy. This code recreates the
% classification model trained in Classification Learner app. Use the
% generated code to automate training the same model with new data, or to
% learn how to programmatically train models.
%
%  Input:
%      trainingData: A table containing the same predictor and response
%       columns as those imported into the app.
%
%
%  Output:
%      trainedClassifier: A struct containing the trained classifier. The
%       struct contains various fields with information about the trained
%       classifier.
%
%      trainedClassifier.predictFcn: A function to make predictions on new
%       data.
%
%      validationAccuracy: A double representing the validation accuracy as
%       a percentage. In the app, the Models pane displays the validation
%       accuracy for each model.
%
% Use the code to train the model with new data. To retrain your
% classifier, call the function from the command line with your original
% data or new data as the input argument trainingData.
%
% For example, to retrain a classifier trained with the original data set
% T, enter:
%   [trainedClassifier, validationAccuracy] = trainClassifier(T)
%
% To make predictions with the returned 'trainedClassifier' on new data T2,
% use
%   yfit = trainedClassifier.predictFcn(T2)
%
% T2 must be a table containing at least the same predictor columns as used
% during training. For details, enter:
%   trainedClassifier.HowToPredict
% Auto-generated by MATLAB on 03-Sep-2024 18:02:04
% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'};
predictors = inputTable(:, predictorNames);
response = inputTable.Species;
isCategoricalPredictor = [false, false, false, false];
classNames = {'versicolor'; 'virginica'};
% Feature Ranking and Selection
% Replace Inf/-Inf values with NaN to prepare data for normalization
predictors = standardizeMissing(predictors, {Inf, -Inf});
% Normalize data for feature ranking
predictorMatrix = normalize(predictors, "DataVariable", ~isCategoricalPredictor);
% Rank features using MRMR algorithm
featureIndex = fscmrmr(...
    predictorMatrix, ...
    response);
numFeaturesToKeep = 3;
includedPredictorNames = predictors.Properties.VariableNames(featureIndex(1:numFeaturesToKeep));
predictors = predictors(:,includedPredictorNames);
isCategoricalPredictor = isCategoricalPredictor(featureIndex(1:numFeaturesToKeep));
% Train a classifier
% This code specifies all the classifier options and trains the classifier.
% For logistic regression, the response values must be converted to zeros
% and ones because the responses are assumed to follow a binomial
% distribution.
% 1 or true = 'successful' class
% 0 or false = 'failure' class
% NaN - missing response.
successClass = 'versicolor';
failureClass = 'virginica';
% Compute the majority response class. If there is a NaN-prediction from
% fitglm, convert NaN to this majority class label.
numSuccess = sum(strcmp(strtrim(response), successClass));
numFailure = sum(strcmp(strtrim(response), failureClass));
if numSuccess > numFailure
    missingClass = successClass;
else
    missingClass = failureClass;
end
successFailureAndMissingClasses = {successClass; failureClass; missingClass};
isMissing = cellfun(@(x) isempty(strtrim(x)), response);
zeroOneResponse = double(strcmp(strtrim(response), successClass));
zeroOneResponse(isMissing) = NaN;
% Prepare input arguments to fitglm.
concatenatedPredictorsAndResponse = [predictors, table(zeroOneResponse)];
% Train using fitglm.
GeneralizedLinearModel = fitglm(...
    concatenatedPredictorsAndResponse, ...
    'Distribution', 'binomial', ...
    'link', 'logit');
% Convert predicted probabilities to predicted class labels and scores.
convertSuccessProbsToPredictions = @(p) successFailureAndMissingClasses( ~isnan(p).*( (p<0.5) + 1 ) + isnan(p)*3 );
returnMultipleValuesFcn = @(varargin) varargin{1:max(1,nargout)};
scoresFcn = @(p) [p, 1-p];
predictionsAndScoresFcn = @(p) returnMultipleValuesFcn( convertSuccessProbsToPredictions(p), scoresFcn(p) );
% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
featureSelectionFcn = @(x) x(:,includedPredictorNames);
logisticRegressionPredictFcn = @(x) predictionsAndScoresFcn( predict(GeneralizedLinearModel, x) );
trainedClassifier.predictFcn = @(x) logisticRegressionPredictFcn(featureSelectionFcn(predictorExtractionFcn(x)));
% Add additional fields to the result struct
trainedClassifier.RequiredVariables = {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'};
trainedClassifier.GeneralizedLinearModel = GeneralizedLinearModel;
trainedClassifier.SuccessClass = successClass;
trainedClassifier.FailureClass = failureClass;
trainedClassifier.MissingClass = missingClass;
trainedClassifier.ClassNames = {successClass; failureClass};
trainedClassifier.About = 'This struct is a trained model exported from Classification Learner R2024b.';
trainedClassifier.HowToPredict = sprintf('To make predictions on a new table, T, use: \n  yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n  c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appclassification_exportmodeltoworkspace'')">How to predict using an exported model</a>.');
% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'};
predictors = inputTable(:, predictorNames);
response = inputTable.Species;
isCategoricalPredictor = [false, false, false, false];
classNames = {'versicolor'; 'virginica'};
% Perform cross-validation
KFolds = 5;
cvp = cvpartition(response, 'KFold', KFolds);
% Initialize the predictions to the proper sizes
validationPredictions = response;
numObservations = size(predictors, 1);
numClasses = 2;
validationScores = NaN(numObservations, numClasses);
for fold = 1:KFolds
    trainingPredictors = predictors(cvp.training(fold), :);
    trainingResponse = response(cvp.training(fold), :);
    foldIsCategoricalPredictor = isCategoricalPredictor;
    % Feature Ranking and Selection
    % Replace Inf/-Inf values with NaN to prepare data for normalization
    trainingPredictors = standardizeMissing(trainingPredictors, {Inf, -Inf});
    % Normalize data for feature ranking
    predictorMatrix = normalize(trainingPredictors, "DataVariable", ~foldIsCategoricalPredictor);
    % Rank features using MRMR algorithm
    featureIndex = fscmrmr(...
        predictorMatrix, ...
        trainingResponse);
    numFeaturesToKeep = 3;
    includedPredictorNames = trainingPredictors.Properties.VariableNames(featureIndex(1:numFeaturesToKeep));
    trainingPredictors = trainingPredictors(:,includedPredictorNames);
    foldIsCategoricalPredictor = foldIsCategoricalPredictor(featureIndex(1:numFeaturesToKeep));
    % Train a classifier
    % This code specifies all the classifier options and trains the classifier.
    % For logistic regression, the response values must be converted to zeros
    % and ones because the responses are assumed to follow a binomial
    % distribution.
    % 1 or true = 'successful' class
    % 0 or false = 'failure' class
    % NaN - missing response.
    successClass = 'versicolor';
    failureClass = 'virginica';
    % Compute the majority response class. If there is a NaN-prediction from
    % fitglm, convert NaN to this majority class label.
    numSuccess = sum(strcmp(strtrim(trainingResponse), successClass));
    numFailure = sum(strcmp(strtrim(trainingResponse), failureClass));
    if numSuccess > numFailure
        missingClass = successClass;
    else
        missingClass = failureClass;
    end
    successFailureAndMissingClasses = {successClass; failureClass; missingClass};
    isMissing = cellfun(@(x) isempty(strtrim(x)), trainingResponse);
    zeroOneResponse = double(strcmp(strtrim(trainingResponse), successClass));
    zeroOneResponse(isMissing) = NaN;
    % Prepare input arguments to fitglm.
    concatenatedPredictorsAndResponse = [trainingPredictors, table(zeroOneResponse)];
    % Train using fitglm.
    GeneralizedLinearModel = fitglm(...
        concatenatedPredictorsAndResponse, ...
        'Distribution', 'binomial', ...
        'link', 'logit');
    % Convert predicted probabilities to predicted class labels and scores.
    convertSuccessProbsToPredictions = @(p) successFailureAndMissingClasses( ~isnan(p).*( (p<0.5) + 1 ) + isnan(p)*3 );
    returnMultipleValuesFcn = @(varargin) varargin{1:max(1,nargout)};
    scoresFcn = @(p) [p, 1-p];
    predictionsAndScoresFcn = @(p) returnMultipleValuesFcn( convertSuccessProbsToPredictions(p), scoresFcn(p) );
    % Create the result struct with predict function
    featureSelectionFcn = @(x) x(:,includedPredictorNames);
    logisticRegressionPredictFcn = @(x) predictionsAndScoresFcn( predict(GeneralizedLinearModel, x) );
    validationPredictFcn = @(x) logisticRegressionPredictFcn(featureSelectionFcn(x));
    % Add additional fields to the result struct
    % Compute validation predictions
    validationPredictors = predictors(cvp.test(fold), :);
    [foldPredictions, foldScores] = validationPredictFcn(validationPredictors);
    % Store predictions in the original order
    validationPredictions(cvp.test(fold), :) = foldPredictions;
    validationScores(cvp.test(fold), :) = foldScores;
end
% Compute validation accuracy
correctPredictions = strcmp( strtrim(validationPredictions), strtrim(response));
isMissing = cellfun(@(x) all(isspace(x)), response, 'UniformOutput', true);
correctPredictions = correctPredictions(~isMissing);
validationAccuracy = sum(correctPredictions)/length(correctPredictions);