Trainnet with parallel-CPU mode giving incorrect results

Collin Rich 2024-5-25

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/2122571-trainnet-with-parallel-cpu-mode-giving-incorrect-results

Commented: Collin Rich 2024-5-25

I'm using trainnet to train a convolutional regression network to find the X-Y centroid of a subtle gradient region in an input image. The training data consist of paired 130x326 grayscale images and ground-truth output coordinates. Both the RMSE and the loss reach very small values (e.g. 10^-3) after a few minutes of training on a small dataset. The trained network gives the expected results when trained in single-CPU mode, but when trained in parallel-CPU mode, the predictions are significantly off.

To debug this, I scaled back to a very simple network, disabled input normalization, and trained with only two data points, fully expecting the network to memorize the training data perfectly. After training in single-CPU mode, the network yields perfect predictions on the training data, as expected. After training in parallel-CPU mode, the network does not predict correctly on the same training data. I added a more verbose loss function and confirmed that the reported losses (i.e. those shown during training) are consistent with the (Y,T) pairs, and that the T values are being read correctly from the training data.

It seems that the network returned at the end of training in parallel-CPU mode may not correctly capture the results of the training.

I'm running R2024a on a MacBook Pro (M2 Max), using Apple Accelerate BLAS. (The default BLAS persistently crashed in parallel mode with trainnet.)

Code snippet below...

layers = [
    imageInputLayer([130 326 1],"Name","imageinput","Normalization","none")
    convolution2dLayer([10 10],8,"DilationFactor",[2 2],"Name","conv_1")
    maxPooling2dLayer([2 2],"Name","maxpool_4")
    batchNormalizationLayer
    reluLayer("Name","relu_1")
    convolution2dLayer([2 2],16,"Name","conv_2")
    fullyConnectedLayer(2,"Name","fc")];
% Assumption: net is built from the layers above (this step is not shown in the original snippet)
net = dlnetwork(layers);
opts = trainingOptions('sgdm', ...
    'InitialLearnRate',1e-7, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropPeriod',500, ...
    'LearnRateDropFactor',0.25, ...
    'MaxEpochs',1000, ...
    'Verbose',false, ...
    'ExecutionEnvironment','parallel', ... % single-CPU runs give correct results
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'OutputNetwork','last-iteration');
FOVCnet = trainnet(trainingData,net,@modelLoss,opts);

function loss = modelLoss(Y,T) % verbose loss function used for debugging
    % No semicolons on purpose: print predictions, targets, and loss each iteration
    Y
    T
    loss = mse(Y,T)
end
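
The memorization check described above (predict on the training images and compare with the targets) could be sketched roughly as follows. This is only an illustration under stated assumptions: X, T, and standInNet are hypothetical stand-ins with made-up values, and in practice standInNet would be replaced by the network returned by trainnet (FOVCnet). It also assumes that minibatchpredict returns one row per observation, matching the layout of T.

```matlab
% Hypothetical stand-ins: replace with the two training images, their
% targets, and the trained network (FOVCnet in the post).
X = rand(130,326,1,2,"single");     % two 130x326 grayscale images
T = single([120 45; 210 80]);       % one [x y] row per image (made-up values)
standInNet = dlnetwork([
    imageInputLayer([130 326 1],"Normalization","none")
    convolution2dLayer(3,4)
    reluLayer
    fullyConnectedLayer(2)]);

% Predict on the training images. Assumption: predictions come back with
% one row per observation, the same layout as T.
Ypred = minibatchpredict(standInNet,X);

% Euclidean distance between predicted and true centroid, per image.
err = vecnorm(Ypred - T,2,2);
disp(err)
```

Running the same check on a network trained with 'ExecutionEnvironment','cpu' and on one trained with 'parallel' would put the two sets of errors side by side.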

3 Comments

Matt J 2024-5-25

We can't run the code without trainingData. Please attach your two data point test case in a .mat file (as an arrayDatastore).

Collin Rich 2024-5-25

Here are the two test images and coordinates. (Sorry for not putting in an arrayDatastore; I'm not sure how to put both in a single arrayDatastore. Still learning the ropes...)
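
For reference, one way to hold both the images and the coordinates in a single datastore is to create one arrayDatastore for each and join them with combine. A minimal sketch, assuming the images are stacked in a 130x326x1xN array and the coordinates in an N-by-2 matrix; the variable names and the random stand-in data are hypothetical, not the attached test case:

```matlab
% Hypothetical stand-ins for the two test images and their X-Y centroids.
X = rand(130,326,1,2,"single");     % images stacked along the 4th dimension
T = single([120 45; 210 80]);       % one [x y] row per image (made-up values)

dsX = arrayDatastore(X,"IterationDimension",4);  % one image per read
dsT = arrayDatastore(T);                         % one row per read (dimension 1 by default)
trainingData = combine(dsX,dsT);                 % each read returns {image, [x y]}

sample = read(trainingData);   % 1-by-2 cell array: image and target
reset(trainingData);           % rewind before passing the datastore to trainnet
```

The combined datastore can then be passed to trainnet as trainingData or saved to a .mat file with save.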
