After exporting an exponential-kernel Gaussian process regression (GPR) model from the MATLAB regression toolbox (Regression Learner), I trained it again on the same dataset with the rows in a different order. Is there a data leakage problem? My code is below.

Question

鹏程 2025-4-25

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/2176638-matlab-gpr-after-exporting-the-gaussia

Commented: 鹏程 on 2025-7-21

Import data

% res = xlsread('数据集.xlsx');

res = table2array(data1);

% Set the random number seed for reproducibility

rng(1000);

Split into training and test sets

n = size(data1,1);

temp = randperm(n); % Shuffle the dataset by generating a random permutation of the row indices

n1 = round(0.8*n);

% TreeBagger expects samples as rows and features as columns

% The transpose here makes the per-feature normalization below easier

P_train = res(temp(1: n1), 1: 10)'; % Features

T_train = res(temp(1: n1), 11)'; % Target variable

M = size(P_train, 2); % Number of training samples

P_test = res(temp(n1+1: end), 1: 10)';

T_test = res(temp(n1+1: end), 11)';

N = size(P_test, 2); % Number of test samples

Normalize the data

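% mapminmax performs min-max normalization. The first output is the normalized data; the second output is a settings struct used to apply the same normalization to the test data later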

[p_train, ps_input] = mapminmax(P_train, 0, 1);

p_test = mapminmax('apply', P_test, ps_input); % Scale the test set with the training set's normalization parameters to avoid data leakage

[t_train, ps_output] = mapminmax(T_train, 0, 1);

t_test = mapminmax('apply', T_test, ps_output);

Transpose to fit the model

% This transpose puts the data back into the input format the model requires (one sample per row)

p_train = p_train'; p_test = p_test';

t_train = t_train'; t_test = t_test';

n_features = size(p_train, 2);

% Create a 5-fold partition

cv = cvpartition(size(p_train,1), 'KFold', 5);

% Initialize the predictions and the per-fold scores

validationPredictions = zeros(size(t_train));

fold_R2 = zeros(1, cv.NumTestSets);

for i = 1:cv.NumTestSets

% Get the training/validation indices for the current fold

trainIdx = training(cv, i); % Training indices (about 4/5 of the samples)

testIdx = test(cv, i); % Validation indices (about 1/5 of the samples)

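% Extract the data for the current fold (samples as rows)

X_train_fold = p_train(trainIdx, :); % Features: training samples x 10

Y_train_fold = t_train(trainIdx); % Targets: training samples x 1

% Retrain the model inside the current fold (using only this fold's training data)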

regressionGP_fold = fitrgp( ...

X_train_fold, ...

Y_train_fold, ...

'BasisFunction', 'constant', ...

'KernelFunction', 'exponential', ...

'Standardize', true ... % Standardize the features of the current fold automatically

);

% Predict the held-out part of the current fold

Y_pred_fold = predict(regressionGP_fold, p_train(testIdx, :));

% Store the predictions (for computing metrics later)

validationPredictions(testIdx) = Y_pred_fold;

% Compute R² for the current fold (optional)

SS_res = sum((t_train(testIdx) - Y_pred_fold).^2);

SS_tot = sum((t_train(testIdx) - mean(t_train(testIdx))).^2);

fold_R2(i) = 1 - SS_res / SS_tot;

end

% Compute the mean cross-validated R²

mean_cv_R2 = mean(fold_R2);

disp(['5-fold cross-validation mean R²: ', num2str(mean_cv_R2)]);

5-fold cross-validation mean R²: 0.86416

regressionGP_final = fitrgp( ...

p_train, ...

t_train, ...

'BasisFunction', 'constant', ...

'KernelFunction', 'exponential', ...

'Standardize', true...

);

% Create the result struct with a predict function

predictorExtractionFcn = @(t) t;

gpPredictFcn = @(x) predict(regressionGP_final, x);

trainedModel.predictFcn = @(x) gpPredictFcn(predictorExtractionFcn(x));

% Add further fields to the result struct

trainedModel.RequiredVariables = {'AN1', 'VR1', 'ARE1', 'EF2', 'E12', 'ST_1', 'St_1', 'CRR_1', 'AT_1', 'At_1'};

trainedModel.RegressionGP = regressionGP_final;

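trainedModel.About = 'This struct is a trained model exported from Regression Learner R2024b.';

trainedModel.HowToPredict = sprintf('To make predictions on a new table, T, use: \n yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appregression_exportmodeltoworkspace'')">How to predict using an exported model</a>.');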

Simulation test: prediction

% Predictions used for the error calculation below

t_sim1 = trainedModel.predictFcn(p_train);

t_sim2 = trainedModel.predictFcn(p_test);

%% Reverse the normalization

T_sim1 = mapminmax('reverse', t_sim1, ps_output);

T_sim2 = mapminmax('reverse', t_sim2, ps_output);

%% Root mean square error (RMSE)

error1 = sqrt(sum((T_sim1' - T_train).^2) ./ M);

error2 = sqrt(sum((T_sim2' - T_test ).^2) ./ N);

Plotting

figure

plot(1: M, T_train, 'r-*', 1: M, T_sim1, 'b-o', 'LineWidth', 1)

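legend('Actual', 'Predicted')

xlabel('Sample')

ylabel('Predicted value')

% Named titleStr rather than string to avoid shadowing the built-in string function
titleStr = {'Training set: predicted vs. actual'; ['RMSE=' num2str(error1)]};

title(titleStr)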

xlim([1, M])

grid

figure

plot(1: N, T_test, 'r-*', 1: N, T_sim2, 'b-o', 'LineWidth', 1)

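legend('Actual', 'Predicted')

xlabel('Sample')

ylabel('Predicted value')

% Named titleStr rather than string to avoid shadowing the built-in string function
titleStr = {'Test set: predicted vs. actual'; ['RMSE=' num2str(error2)]};

title(titleStr)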

xlim([1, N])

grid

% %% Plot the error curve

% figure

% plot(1: trees, oobError(net), 'b-', 'LineWidth', 1)

% legend('Error curve')

% xlabel('Number of decision trees')

% ylabel('Error')

% xlim([1, trees])

% grid

% %% Plot feature importance

% figure

% bar(importance)

% legend('Importance')

% xlabel('Feature')

% ylabel('Importance')

1 Comment

Image Analyst 2025-4-25

Do you have a question? If so, ask it.


Answer 1

Ronit 2025-7-16

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2176638-matlab-gpr-after-exporting-the-gaussia#answer_1568028

Hello,

Data leakage does not occur when the dataset is shuffled before being divided into training and test sets. In fact, randomly ordering the samples before splitting is common and recommended, because it helps ensure that both sets are independent and representative.
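As a small illustration of this point, the sketch below shuffles a set of row indices with `randperm` and checks that the resulting training and test subsets share no samples. The sample count and the 80/20 ratio are assumptions chosen to mirror the question, not values taken from the actual dataset.

```matlab
% Hypothetical sizes, chosen only for illustration
rng(1000);                      % fixed seed so the split is reproducible
n  = 50;                        % number of samples (assumed)
n1 = round(0.8 * n);            % 80% of the samples go to training

idx      = randperm(n);         % random permutation of 1..n
trainIdx = idx(1:n1);           % first 80% of the shuffled indices
testIdx  = idx(n1+1:end);       % remaining 20%

% No sample may appear in both subsets, and none may be lost
noOverlap = isempty(intersect(trainIdx, testIdx));
allUsed   = isequal(sort([trainIdx, testIdx]), 1:n);
fprintf('No overlap: %d, all samples used: %d\n', noOverlap, allUsed);
```

Because a permutation contains every index exactly once, any split of it into two contiguous pieces is disjoint, regardless of the order the permutation happens to produce.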

Data leakage occurs when information from the test set affects the training procedure, for example when test set statistics are used for normalization or model fitting. Your workflow avoids this: the normalization parameters are computed from the training data only and are then applied to the test data.
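The same idea can be sketched without `mapminmax`, using only base MATLAB. The data here are synthetic and the variable names are hypothetical; the point is only that the minimum and maximum come from the training rows and are reused unchanged for the test rows.

```matlab
rng(1000);
X   = rand(50, 3) .* [1 10 100];   % synthetic features, one sample per row (assumed)
idx = randperm(50);
Xtrain = X(idx(1:40), :);
Xtest  = X(idx(41:end), :);

% Scaling parameters are computed from the training rows only
xmin = min(Xtrain, [], 1);
xmax = max(Xtrain, [], 1);

% Min-max scaling to [0, 1] using the training statistics
scale   = @(A) (A - xmin) ./ (xmax - xmin);
XtrainN = scale(Xtrain);
XtestN  = scale(Xtest);   % reuses the training min/max; values may fall slightly outside [0, 1]
```

Test values outside [0, 1] are expected and harmless here. Recomputing the minimum and maximum on the full dataset, or on the test set, is what would leak test information into training.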

Furthermore, predictions on the test set are made without any exposure to test data during training, and the model is only trained on the training subset.

As a result, shuffling the data before splitting makes the model evaluation more robust and does not introduce leakage.

I hope this clarifies your doubt.

1 Comment

鹏程 2025-7-21

Thank you for your answer. I have already discovered the problem existing in the code. After the training was completed, I mistakenly used the training data to calculate R2, which led to an excessively high R2.
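A minimal sketch of the difference described in this comment, assuming synthetic data and the same `fitrgp` options as in the question: R² is computed once on the rows the model was trained on and once on held-out rows. The helper `r2` and all data below are hypothetical and only illustrate the comparison.

```matlab
rng(1000);
n = 100;
X = rand(n, 2);                                        % synthetic predictors (assumed)
y = sin(4 * X(:, 1)) + X(:, 2) + 0.3 * randn(n, 1);    % synthetic noisy target (assumed)

idx      = randperm(n);
trainIdx = idx(1:80);
testIdx  = idx(81:end);

% Same model options as in the question, fitted on the training rows only
mdl = fitrgp(X(trainIdx, :), y(trainIdx), ...
    'BasisFunction', 'constant', ...
    'KernelFunction', 'exponential', ...
    'Standardize', true);

% Coefficient of determination
r2 = @(yTrue, yPred) 1 - sum((yTrue - yPred).^2) / sum((yTrue - mean(yTrue)).^2);

R2_train = r2(y(trainIdx), predict(mdl, X(trainIdx, :)));  % optimistic: the model has seen these rows
R2_test  = r2(y(testIdx),  predict(mdl, X(testIdx, :)));   % estimate on unseen rows
fprintf('R2 (train) = %.3f, R2 (test) = %.3f\n', R2_train, R2_test);
```

The training-set value mainly measures how closely the model reproduces data it has already seen, so the held-out value is the one to report as generalization performance.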
