trainNetwork error unable to read file

Hi all,
I am learning to train a convolutional network for image classification in the cloud. As a first step, I am following the MathWorks example "Train Network in the Cloud Using Automatic Parallel Support".
I have started my cluster successfully and uploaded the cifar10 image library to my Amazon S3 bucket.
I then successfully create the datastore using:
imdsTrain = imageDatastore('s3://mybucket/cifar10/train', ...
'IncludeSubfolders',true, ...
'LabelSource','foldernames');
My problem arises at the training step, where I use:
options = trainingOptions('sgdm', ...
'ExecutionEnvironment','parallel', ... % Turn on automatic parallel support.
'InitialLearnRate',initialLearnRate, ... % Set the initial learning rate.
'MiniBatchSize',miniBatchSize, ... % Set the MiniBatchSize.
'Verbose',true, ... % Print training progress to the command window.
'Plots','training-progress', ... % Turn on the training progress plot.
'L2Regularization',1e-10, ...
'MaxEpochs',50, ...
'Shuffle','every-epoch', ...
'ValidationData',imdsTest, ...
'ValidationFrequency',floor(numel(imdsTrain.Files)/miniBatchSize), ...
'LearnRateSchedule','piecewise', ...
'LearnRateDropFactor',0.1, ...
'LearnRateDropPeriod',45);
net = trainNetwork(augmentedImdsTrain,layers,options);
Training starts, and the display shows the message "Initializing input data normalization".
However, it stops quickly with the error message:
Error in test_parallel_cloud (line 77)
net = trainNetwork(augmentedImdsTrain,layers,options);
Caused by:
Error using nnet.internal.cnn.DistributedDispatcher/computeInParallel (line 193)
Error detected on worker 1.
Error using matlab.io.datastore.ImageDatastore/read (line 77)
Unable to read file: 's3://mybucket/cifar10/train/deer/image35398.png'.
Error using matlab.io.datastore/DsFileReader (line 113)
Could not find file : s3://mybucket/cifar10/train/deer/image35398.png
Every time I rerun the code, it seems to stop on a different image that it cannot read. However, the image is always in the bucket and does not seem to be corrupt when I check it using imshow.
Can you see where the problem is?
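A minimal sketch of one way to narrow this down, assuming the failures are intermittent reads rather than missing objects: loop over the datastore's file list, call imread on each entry, and collect the paths that fail. The temporary folder and the two file names below are hypothetical stand-ins so that the snippet runs on its own; in the real script the list would be imdsTrain.Files.

```matlab
% Build a tiny stand-in file list so the sketch is self-contained.
% In the real script this would be: files = imdsTrain.Files;
tmpDir = tempname;
mkdir(tmpDir);
goodFile = fullfile(tmpDir,'ok.png');            % hypothetical readable image
imwrite(uint8(zeros(32,32,3)),goodFile);
files = {goodFile; fullfile(tmpDir,'missing.png')};  % second entry does not exist

% Try to read each file and remember the ones that raise an error.
failed = {};
for k = 1:numel(files)
    try
        imread(files{k});
    catch err
        failed{end+1,1} = files{k}; %#ok<AGROW>
        fprintf('Could not read %s: %s\n',files{k},err.message);
    end
end
fprintf('%d of %d files failed to read.\n',numel(failed),numel(files));
```

If the loop reports no failures on the client while training still fails, that would suggest the problem is specific to how the workers access the bucket rather than to the images themselves.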
  7 Comments
Fouzia Adjailia 2020-5-1
Hello,
I'm having a similar problem to yours, and I would highly appreciate it if you could help me.
I created an image datastore with a customized read function called @formoccupancygrid. When I run my code in parallel, I get this error:
Error using classifyData (line 33)
Error detected on worker 1.
Caused by:
Error using matlab.io.datastore.ImageDatastore/readall (line 42)
Error using ReadFcn @UNKNOWN Function for file
D:\--*******************************
Undefined function handle.
I solved this problem using parfevalOnAll, which executes a function on all the workers. After that I got another error, which states that the files don't exist. I added the files to the attached files and their folder to the additional paths in the Cluster Profile Manager, but with no luck.
Looking forward to your reply.
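For readers wondering how parfevalOnAll can be used in this situation, here is a hedged sketch. It assumes Parallel Computing Toolbox is installed and that a pool is open or can be started with the default profile. formoccupancygrid is the read function named above, and filePath is a hypothetical placeholder because the real path is not shown. The snippet only checks what each worker can see; it is not necessarily what was run in the comment above.

```matlab
pool = gcp;                          % reuse the current pool, or start the default one

fcnName  = 'formoccupancygrid';      % custom ReadFcn name from the comment above
filePath = fullfile(tempdir,'example.png');   % hypothetical placeholder path

% Ask every worker whether it can resolve the function name.
% exist returns 0 when nothing with that name is found on the worker.
fFcn = parfevalOnAll(pool,@exist,1,fcnName);
fcnCodes = fetchOutputs(fFcn);       % one value per worker
fcnCodes = fcnCodes(:);

% Ask every worker whether the file is reachable from its own file system.
fFile = parfevalOnAll(pool,@isfile,1,filePath);
fileSeen = fetchOutputs(fFile);
fileSeen = fileSeen(:);

fprintf('Workers that can resolve %s: %d of %d\n', ...
    fcnName,nnz(fcnCodes > 0),numel(fcnCodes));
fprintf('Workers that can see the file: %d of %d\n', ...
    nnz(fileSeen),numel(fileSeen));
```

If some workers report 0 for the function, attaching its file to the pool with addAttachedFiles is a common next step. If the file check fails, the path is probably local to the client (a D:\ drive, for example) and not visible from the workers' machines.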
Daniel Csata 2022-10-29
Hi!
I just ran into this exact same problem. Could you please tell me exactly how you solved it with the parpool function? It seems that didn't work for me, or I did something wrong.
Thank you,
Daniel


Answers (1)

Harsha Priya Daggubati
  1 Comment
Fred 2020-4-7
Hi,
Thanks for the help!
Yes, I carefully followed all the steps mentioned, one by one.
The only deviation is that I had to set the number of workers to 1 instead of 8. That is because AWS limits the number of vCPUs I can use, and the instance I am using (p2.xlarge) has only one GPU.
The problem occurs when running the trainNetwork function from the "Train Network in the Cloud Using Automatic Parallel Support" page.
Fred
