Formatting input data for a linear regression model in leave-one-out validation testing

Hello there. I have data from 10 trials stored in a 10x1 cell (Predictors) and the corresponding response variables stored in a 10x1 cell (Response). I am trying to train a simple linear regression model and make predictions by leaving one trial out: the other 9 trials train the model, and the held-out trial tests it by producing an RMSE value. I am unsure how to format my input to the "fitlm" function, as the code below keeps producing the error that follows it:
% Train the network
for i = 1:length(Predictors) %iterate over all data points
validationdataX = Predictors(i);
validationdataY = Response(i);
%Exclude the current index (i) for training
trainingIndices = setdiff(1:length(Predictors),i);
traningdataX = Predictors(trainingIndices)
trainingdataY = Response(trainingIndices)
net = fitlm(traningdataX,trainingdataY)
ypred = predict(net,validationdataX);
TrueVal = validationdataY;
TrueValue = cell2mat(TrueVal);
Predvalue = {Predval};
PredictedValue = cell2mat(Predvalue);
RMSE = rmse(PredictedValue,TrueValue)
end
Error using classreg.regr.TermsRegression/handleDataArgs (line 589)
Predictor variables must be numeric vectors, numeric matrices, or categorical vectors.
Error in LinearModel.fit (line 1000)
[X,y,haveDataset,otherArgs] = LinearModel.handleDataArgs(X,paramNames,varargin{:});
Error in fitlm (line 134)
model = LinearModel.fit(X,varargin{:});
Any suggestions on how to fix this, get the model to work correctly, and make predictions using the leave-one-out validation approach would be greatly appreciated!

9 Comments

Hi Isabelle,

Sounds like an interesting project. In your code, you are passing cell arrays as predictors, which is causing the error. To resolve this, you need to convert your cell arrays to numeric arrays before fitting the linear model. I have updated the code accordingly, keeping the leave-one-out validation approach. Here is the updated example:

% Define sample data for 'data' and 'responseData'
data = {1, 2, 3, 4, 5}; % Sample predictor data
responseData = {10, 20, 30, 40, 50}; % Sample response data
% Populate the 'Predictors' variable with the sample data
Predictors = cell(1, length(data));
for i = 1:length(data)
    Predictors{i} = data{i};
end
% Populate the 'Response' variable with the sample data
Response = cell(1, length(responseData));
for i = 1:length(responseData)
    Response{i} = responseData{i};
end
% Train the linear regression model with leave-one-out cross-validation
for i = 1:length(Predictors)
    % Extract validation data for the current iteration
    validationdataX = cell2mat(Predictors(i));
    validationdataY = cell2mat(Response(i));
    % Exclude the current index (i) for training
    trainingIndices = setdiff(1:length(Predictors), i);
    % Convert to numeric column vectors (one row per observation)
    trainingdataX = cell2mat(Predictors(trainingIndices))';
    trainingdataY = cell2mat(Response(trainingIndices))';
    % Train the linear regression model
    mdl = fitlm(trainingdataX, trainingdataY);
    % Make predictions on the validation data
    ypred = predict(mdl, validationdataX);
    % Calculate RMSE for the current iteration
    RMSE = sqrt(mean((ypred - validationdataY).^2));
    % Display RMSE for each iteration
    disp(['RMSE for iteration ', num2str(i), ': ', num2str(RMSE)]);
end

I hope this is what you are looking for. Please see the attached results.

Please let me know if you have any further questions.
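
As a side note on the cell-array error, here is a minimal sketch of the difference between parenthesis and brace indexing, which is the distinction behind `Predictors(i)` in the original loop. The sample sizes below are made up for illustration and are not the poster's data.

```matlab
% Hypothetical sample data: a 3x1 cell with one numeric matrix per trial
Predictors = {rand(2, 4); rand(2, 4); rand(2, 4)};

byParens = Predictors(1);   % parentheses return a 1x1 cell (still a cell)
byBraces = Predictors{1};   % braces return the contents (a 2x4 numeric matrix)

disp(class(byParens))       % cell
disp(class(byBraces))       % double

% fitlm needs numeric (or categorical) predictors, so only the brace form qualifies
disp(isnumeric(byParens))   % 0
disp(isnumeric(byBraces))   % 1
```

Passing `byParens` (or any cell array) to fitlm triggers the "Predictor variables must be numeric vectors..." error shown in the question.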

Hi Umar,
Thank you so much for getting back to me. I applied your suggestions to my data and I am getting the following error:
Error using classreg.regr.FitObject/assignData (line 134)
Predictor and response variables must have the same length.
Error in classreg.regr.TermsRegression/assignData (line 240)
model = assignData@classreg.regr.ParametricRegression(model,X,y,w,asCat,varNames,excl);
Error in LinearModel.fit (line 1012)
model = assignData(model,X,y,weights,asCatVar,dummyCoding,model.Formula.VariableNames,exclude);
Error in fitlm (line 134)
model = LinearModel.fit(X,varargin{:});
This happens because trainingdataX is a 567x541 matrix, while trainingdataY is a 9x541 matrix, which is what it should be. Do you know what might be causing trainingdataX to be so far off?
Thank you so much!
Hi Isabelle,
This is a common issue when working with regression models. In your case, trainingdataX is of size 567x541, while trainingdataY is of size 9x541. fitlm raises the error because the number of observations (rows) in the predictor matrix must match the number of observations in the response. To resolve this, make sure the predictor and response matrices have the same number of observations. One thing to try is transposing trainingdataY. Here is an example illustrating this:
% Example: transpose trainingdataY and compare its size with trainingdataX
trainingdataX = randn(567, 541); % Example predictor matrix
trainingdataY = randn(9, 541); % Example response matrix
% Transpose trainingdataY so that time steps run along the rows
trainingdataY = trainingdataY'; % Transpose the matrix
% Verify the dimensions after transposing
size(trainingdataX) % returns [567 541]
size(trainingdataY) % returns [541 9]
Transposing trainingdataY puts the time steps along its rows. Compare the two sizes printed above and confirm that the row counts match before calling fitlm. Please let me know if this helps resolve your problem.
Hi Umar,
Thank you so much for your response. Unfortunately that did not fix the issue. However, I think it is because there are 63 input features that I am trying to use to predict the one continuous variable, as you can see here in the predictors and response cells:
This may be causing the differences in dimensions seen here in the training X and Y data:
Is there a way to account for this in the code and use the 63 features x 541 time steps for each trial to properly train and predict the continuous variable in the response cell and obtain an RMSE value for each trial?
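
For reference, the shapes described here can be reproduced with a short sketch. It assumes each training cell holds a 63x541 predictor matrix and a 1x541 response (random placeholders below, not the real data), and compares how cell2mat stacks the trials as they are with how they look once each trial is transposed so that rows are time steps.

```matlab
% Assumed layout: 9 training trials, each 63 features x 541 time steps
numTrials = 9; numFeatures = 63; numSteps = 541;
X = arrayfun(@(k) randn(numFeatures, numSteps), (1:numTrials)', 'UniformOutput', false);
Y = arrayfun(@(k) randn(1, numSteps), (1:numTrials)', 'UniformOutput', false);

% cell2mat on an N-by-1 cell stacks the blocks vertically as they are
size(cell2mat(X))   % 567 541 (9 trials * 63 features along the rows)
size(cell2mat(Y))   % 9 541

% Transpose each trial first so rows are time steps (observations), then stack
Xobs = cell2mat(cellfun(@transpose, X, 'UniformOutput', false));
Yobs = cell2mat(cellfun(@transpose, Y, 'UniformOutput', false));
size(Xobs)          % 4869 63 (9 trials * 541 time steps, 63 features)
size(Yobs)          % 4869 1
```

In the second layout both arrays have the same number of rows, which is the condition the fitlm error message refers to.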
Hi Isabelle,
Consider applying a dimensionality reduction technique such as Principal Component Analysis (PCA), or a feature selection method, to reduce the number of input features while retaining the relevant information. This may improve the accuracy of your predictions. You can also reshape your input data so that each trial has the intended layout of 63 features x 541 time steps, so that the model receives the correct input dimensions during training and prediction. After making these adjustments, train your model on the modified data and evaluate it by calculating the root mean squared error (RMSE) for each trial. The RMSE shows how well the model predicts the continuous variable from the input features. I hope this answers your question. Please let me know if you have any further questions.
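
A rough sketch of what the PCA suggestion could look like, using the `pca` function from the Statistics and Machine Learning Toolbox. The data sizes and the 95% variance threshold are assumptions chosen for illustration, not values from this thread.

```matlab
% Hypothetical pooled data: rows are observations, columns are the 63 features
rng(0);                          % make the random example reproducible
Xtrain = randn(500, 63);
Xtest  = randn(50, 63);

% Fit PCA on the training data only
[coeff, scoreTrain, ~, ~, explained, mu] = pca(Xtrain);

% Keep the fewest components that explain at least 95% of the variance
% (the 95% cutoff is an arbitrary choice for this sketch)
numComp = find(cumsum(explained) >= 95, 1);

% Project both sets using the training mean and loadings
XtrainReduced = scoreTrain(:, 1:numComp);
XtestReduced  = (Xtest - mu) * coeff(:, 1:numComp);
```

The reduced matrices can then be passed to fitlm and predict in place of the full feature matrices. Note that PCA changes the number of columns (features), not the number of rows (observations).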
Hi Umar, is there any way to train the model and make predictions while keeping all the features, without having to reduce the dimensions of the data?
Hi Isabelle,
That is a very good question you asked. While dimensionality reduction techniques like Principal Component Analysis (PCA) can be beneficial for simplifying the input space and improving model performance, it is possible to train a model without reducing the number of features. However, keeping all features may lead to challenges such as overfitting, increased computational complexity, and potential noise in the data.
Hi Umar, I appreciate your response. I am interested in comparing a simple linear regression model to other models that I have built, and in order to compare them fairly I want to keep the number of features the same regardless of the challenges of overfitting, computational complexity, and noise. Could you guide me on how to achieve this within my code? How can I input the predictor data (63 features x 541 time steps from each of the 9 trials) and the response data (1 response variable x 541 time steps from each of the 9 trials) into a linear model without getting errors from the dimensions not matching?

Hi @Isabelle Museck,

To input the predictor and response data into a linear model without dimension mismatch errors, you have to make sure that the dimensions of the data align correctly. In the provided code snippet, you can modify the data handling part as follows:

% Train the linear model
% Assumes Predictors (features x time steps) and Response (1 x time steps) are numeric matrices
numSteps = size(Predictors, 2);
for i = 1:numSteps % iterate over all data points
    validationdataX = Predictors(:, i); % Use all features for the current time step
    validationdataY = Response(:, i); % Use the response variable for the current time step
    % Exclude the current index (i) for training
    trainingIndices = setdiff(1:numSteps, i);
    trainingdataX = Predictors(:, trainingIndices); % Use all features for training data
    trainingdataY = Response(:, trainingIndices); % Use response variable for training data
    net = fitlm(trainingdataX', trainingdataY'); % Fit linear model
    ypred = predict(net, validationdataX'); % Predict using the model
    TrueValue = validationdataY';
    PredictedValue = ypred';
    RMSE = rmse(PredictedValue, TrueValue); % Calculate RMSE
end

Please bear in mind that this is an example code snippet, and you will have to adapt it to your own data. Please let me know if you have any further questions.
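
For the cell-array layout described earlier in the thread, one possible adaptation is sketched below. It assumes each entry of Predictors is a 63x541 numeric matrix and each entry of Response is a 1x541 vector (random placeholders stand in for the real data), holds out one whole trial per iteration, and pools the remaining trials so that every time step becomes one observation with 63 features.

```matlab
% Hypothetical sample data standing in for the real 10x1 cells
numTrials = 10;
Predictors = cell(numTrials, 1);
Response = cell(numTrials, 1);
for k = 1:numTrials
    Predictors{k} = randn(63, 541);   % features x time steps
    Response{k} = randn(1, 541);      % 1 x time steps
end

RMSE = zeros(numTrials, 1);           % one RMSE value per held-out trial
for i = 1:numTrials
    trainIdx = setdiff(1:numTrials, i);

    % Transpose each training trial to time steps x features, then stack the trials
    Xtrain = cell2mat(cellfun(@transpose, Predictors(trainIdx), 'UniformOutput', false));
    Ytrain = cell2mat(cellfun(@transpose, Response(trainIdx), 'UniformOutput', false));

    % Held-out trial in the same orientation (541x63 and 541x1)
    Xval = Predictors{i}';
    Yval = Response{i}';

    mdl = fitlm(Xtrain, Ytrain);      % all 63 features, no dimensionality reduction
    ypred = predict(mdl, Xval);
    RMSE(i) = sqrt(mean((ypred - Yval).^2));
end
disp(RMSE)
```

Pooling time steps across trials treats every time step as an independent observation, which is a modeling choice rather than a requirement of fitlm.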


Answers (1)

I ran the code in MATLAB R2024a. The issue of passing a cell array as input has been resolved in the comments. From the comments, I understand that the remaining issue is passing the data into the "fitlm" function without reducing the number of features through dimensionality reduction.
My understanding is that your training data consists of nine arrays of dimension 63x541, each with a corresponding response of dimension 1x541. Since the cell array issue has already been solved in the comments, I take "Predictors" and "Response" to be two random matrices drawn from a normal distribution. Each 63x541 block is transposed before it is passed to "fitlm", so that rows are time steps, and the matching 1x541 response is passed as a numeric column vector. This fits the data as it is, without any dimensionality reduction.
Please see the code below for reference.
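% Sample data: 9 trials stacked vertically, each 63 features x 541 time steps
numTrials = 9;
numFeatures = 63;
Predictors = randn(numFeatures*numTrials, 541);
Response = randn(numTrials, 541);
for i = 1:numTrials % iterate over all trials
    % Rows of the held-out trial, transposed to time steps x features
    valRows = numFeatures*(i-1)+1 : numFeatures*i;
    validationdataX = Predictors(valRows,:)';
    validationdataY = Response(i,:)';
    % Pool the remaining trials so that one model is trained on all of them
    trainingdataX = [];
    trainingdataY = [];
    for j = setdiff(1:numTrials, i)
        trainRows = numFeatures*(j-1)+1 : numFeatures*j;
        trainingdataX = [trainingdataX; Predictors(trainRows,:)'];
        trainingdataY = [trainingdataY; Response(j,:)'];
    end
    model = fitlm(trainingdataX, trainingdataY);
    ypred = predict(model, validationdataX);
    RMSE = rmse(ypred, validationdataY)
end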
