Cluster multi-gpu training Error: Current pool is not local.

Question

Christopher McCausland 2023-1-12

https://ww2.mathworks.cn/matlabcentral/answers/1893110-cluster-multi-gpu-training-error-current-pool-is-not-local

Commented: Edric Ellis 2023-1-16

Hello,

I am trying to scale up to a multi-GPU cluster for deep learning. I can run the model on a single GPU on the cluster with no issues; however, when I try to switch to multiple GPUs I get this error:

Current pool is not local. Use 'delete(gcp)' to close parallel pool and run again.

My cluster submission function looks like this:

function job = submit_train_script()
cluster = parcluster();
cluster.AdditionalProperties.AdditionalSubmitArgs = '--gres=gpu:4'; % Request 4 GPUs with sbatch
cluster.AdditionalProperties.AdditionalSubmitArgs = '--mail-type=ALL'; % Send me an email if anything happens
cluster.AdditionalProperties.AdditionalSubmitArgs = '--mail-user=myemail@mydomain.ac.uk';
cluster.AdditionalProperties.AdditionalSubmitArgs = '--nodelist=Node002'; % Use node002
% Submit the job, ask for 4 CPU workers, one for each GPU
job = cluster.batch('train_fun', ...
    "AutoAddClientPath",false, "CaptureDiary",true, ...
    "CurrentFolder",".", "Pool",4);
end
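A side note on the submission function above: each assignment to AdditionalSubmitArgs replaces the previous value, so as far as can be told from the code shown, only the last one ('--nodelist=Node002') would reach sbatch and the '--gres=gpu:4' request would be lost. A minimal sketch of one way to combine the flags into a single string, assuming all four are meant to apply together (the variable name submitArgs is only illustrative):

```matlab
% Collect every sbatch flag from the question in one cell array, then join
% them with spaces so that a single assignment carries all of them.
flags = { ...
    '--gres=gpu:4', ...                        % request 4 GPUs
    '--mail-type=ALL', ...                     % email on any job event
    '--mail-user=myemail@mydomain.ac.uk', ...
    '--nodelist=Node002'};                     % run on Node002
submitArgs = strjoin(flags, ' ');
disp(submitArgs)
```

The resulting string would then be assigned once, in place of the four separate assignments: cluster.AdditionalProperties.AdditionalSubmitArgs = submitArgs;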

With the network options below, I request four GPUs and four CPU workers to match, then set the execution environment to "multi-gpu". This appears to be the recommended configuration for this type of work, and I cannot work out what is causing the error.

% Iteration = Number of (files*cells) / Minibatchsize
options = trainingOptions("adam", ...
    ExecutionEnvironment="multi-gpu", ... % "cpu", "gpu" and "multi-gpu" options available
    GradientThreshold=1, ...
    InitialLearnRate=0.001, ...
    MaxEpochs=50, ... % 50
    MiniBatchSize=10, ... % 25 miniBatchSize; 10 for 16 GB card
    SequenceLength="longest", ...
    Shuffle="never", ...
    Verbose=0, ...
    Plots="training-progress");
net = trainNetwork(ds,layers,options);

Thanks in advance,

Christopher


Answer 1

Edric Ellis 2023-1-13

https://ww2.mathworks.cn/matlabcentral/answers/1893110-cluster-multi-gpu-training-error-current-pool-is-not-local#answer_1147735

I think you need to specify ExecutionEnvironment="parallel" for this situation. According to the trainingOptions reference page, "multi-gpu" is only for "multiple GPUs on one machine, using a local parallel pool based on your default cluster profile."
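As a sketch of that suggestion, the options from the question could be kept as they are with only the execution environment changed. This assumes Deep Learning Toolbox and a release that accepts the name=value syntax used in the question; newer releases may offer additional parallel variants, so the trainingOptions reference page for the installed version is the place to check.

```matlab
% Same options as in the question; only ExecutionEnvironment differs.
% "parallel" trains on whatever pool is current, which inside a batch job
% submitted with "Pool",4 is the cluster pool rather than a local one.
options = trainingOptions("adam", ...
    ExecutionEnvironment="parallel", ...
    GradientThreshold=1, ...
    InitialLearnRate=0.001, ...
    MaxEpochs=50, ...
    MiniBatchSize=10, ...
    SequenceLength="longest", ...
    Shuffle="never", ...
    Verbose=0, ...
    Plots="training-progress");
```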


Christopher McCausland 2023-1-15

Hi Edric,

That seems to work. I hadn't even considered the "parallel" option, as I believed that the batch submission would have made the parallel pool local with respect to the cluster. Lesson learned there, thank you!

One strange outcome is a new error (bearing in mind that this code runs without error on a single GPU). The error relates to the 'eq' function, which I believe is the built-in function behind the == operator.

The only place the == operator is used in the entire submission is to identify rows (within the cell variable fridges) whose labels and data I want to exclude. I can do this before I read in the data; however, I was wondering if there is anything obvious that would cause this to fail with "parallel" when it works with "gpu"?

% Exclude labels that we don't care about
includeSet = {'N1_to_N2' 'N2_to_N1' 'N1_to_W' 'W_to_N1' 'N2_to_N3' 'N3_to_N2'};
for j = 1:length(fridges)
    % Generate index for where to keep the labels
    setidx(j) = sum(fridges{j,2} == includeSet);
end
% Remove labels that are not of interest
fridges(~setidx',:) = [];

Kind regards,

Christopher

Edric Ellis 2023-1-16

I can't see quite why this would change behaviour. Do you have an error stack from the failure indicating this is where the problem is coming from? I would be wary of using == to compare char-vectors (single-quote "strings"). This performs an elementwise comparison of the characters, and can fail if the vectors aren't the same length. You might be better off using strcmp.
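A minimal sketch of the strcmp approach, using a small made-up fridges cell array as a stand-in for the real data (the first-column names and the 'REM_to_W' label are hypothetical). It assumes the labels in the second column are char vectors or string scalars; strcmp compares whole labels, so differing lengths are not a problem.

```matlab
% Hypothetical stand-in data: column 1 is an identifier, column 2 a label.
fridges = { ...
    'rec1', 'N1_to_N2'; ...
    'rec2', 'REM_to_W'; ...
    'rec3', 'W_to_N1'};
includeSet = {'N1_to_N2' 'N2_to_N1' 'N1_to_W' 'W_to_N1' 'N2_to_N3' 'N3_to_N2'};

% Preallocate a logical mask with one entry per row of fridges.
keep = false(size(fridges, 1), 1);
for j = 1:size(fridges, 1)
    % strcmp returns one logical per entry of includeSet; any() collapses
    % them into "this label is in the set".
    keep(j) = any(strcmp(fridges{j,2}, includeSet));
end

% Drop the rows whose label is not in includeSet.
fridges(~keep, :) = [];
```

Here size(fridges, 1) is used instead of length(fridges) so the loop always runs over rows, even if the array ever has more columns than rows.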
