Question about RandStream and Cross Validation in a parfor loop

Question

Martin Randau 2023-12-2

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/2055309-question-about-randstream-and-cross-validation-in-a-parfoor-loop

Commented: Martin Randau 2023-12-7

Dear MathWorks

I use the following code to run 100 seeded repetitions of a 10-5 nested cross-validation (10 outer folds, 5 inner folds) in a parfor loop.

I start the cluster and assign a RandStream:

parallel.defaultClusterProfile('dcc R2021b');
clust=parcluster('dcc R2021b');
clust.AdditionalProperties.ProcsPerNode = 24;
clust.AdditionalProperties.MemUsage = '2GB';
clust.AdditionalProperties.WallTime = '60:00';
clust.AdditionalProperties.QueueName = 'compute';
numw=100;
parpool(clust, numw);
sc = parallel.pool.Constant(RandStream('Threefry'))

Then I create a CV partition in each outer loop and one in the inner loop to be used for hyperparameter optimization:

parfor seeds = 1:100
    
    stream = sc.Value;      
    stream.Substream = seeds;
    
    ...
        
    CV = cvpartition(y, 'Kfold', 10);
    
    ...
    % generate y_train from y
    ...
    
    for k = 1:10
        cv_in = cvpartition(y_train,'Kfold',5); % used for hyperparameter optimization

        % and e.g.
        mdl_LL_hp_opts = struct('AcquisitionFunctionName','expected-improvement-plus',...
            'Optimizer','bayesopt','CVPartition',cv_in,'MaxObjectiveEvaluations',100,...
            'UseParallel',0,'ShowPlots',0,'Verbose',0,'Repartition',0);
        [mdl_all, mdl_LL_all_fitinfo, mdl_LL_all_HyperparameterOptimizationResults] = fitclinear(X_sel_train,y_train, ...
            'learner', 'logistic', 'Regularization', 'ridge', ...
            'OptimizeHyperparameters',{'Lambda'},'HyperparameterOptimizationOptions',mdl_LL_hp_opts, ...
            'CategoricalPredictors', "gender", 'PredictorNames', predVars_all);

The code works, but the CV partitions do not seem very random, as the Balanced Accuracy results below show (only the first ten rows, one per iteration). Never mind the similar values (in this case there were too few cases of the minority class). The problem is that every iteration finds NaNs in the 9th outer fold (because that fold lacks the minority class). If cvpartition used random seeds, I would expect the NaN column to be located randomly. The columns are the outer folds and the rows are iterations (seeds).

>> ERPres(1:10,:)
ans =
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.3333    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000       NaN    0.5000
    
    

So, is something wrong with the RandStream approach?

BW

Martin


Answer 1

Drew 2023-12-5

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/2055309-question-about-randstream-and-cross-validation-in-a-parfoor-loop#answer_1366149

Edited: Drew 2023-12-6

Is this code running in R2021b or R2023b? Your code references 'dcc R2021b', but the MATLAB answers sidebar says R2023b.

(1) Since cvpartition uses the global stream of random numbers, if you are running in R2021b, remember to set the global stream to the Threefry RandStream that you have created. That is, in your parfor loop, after

stream = sc.Value;      
stream.Substream = seeds;

add

RandStream.setGlobalStream(stream);

(2) Note, as you may already know, starting in R2023a, each parallel worker has an independent random number stream by default. See https://www.mathworks.com/help/parallel-computing/control-random-number-streams-on-workers.html.

"By default, the MATLAB client and MATLAB workers use different random number generators, even if the workers are part of a local cluster on the same machine as the client."
"By default, each worker in a cluster working on the same job has an independent random number stream. If rand, randi, or randn are called in parallel, each worker produces a unique sequence of random numbers."

So, if you want independent random number streams on each worker, just accept the defaults (as of R2023a or later), with no need to call rng or RandStream functions. If you want some other behavior with repeatable random number sequences, see the instructions at https://www.mathworks.com/help/parallel-computing/control-random-number-streams-on-workers.html .

See the "Version History" section of the page https://www.mathworks.com/help/matlab/ref/rng.html for notes about the random number generation changes in R2023b and R2023a.
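
To check suggestion (1) on a particular release, a small sketch along the following lines may help. It runs on the client without a pool and uses hypothetical labels y (not the poster's data). The idea is that if cvpartition draws from the global stream, rewinding that stream to the same substream should reproduce the same partition, while a different substream will most likely give a different one.

```matlab
% Hypothetical labels for illustration only: 95 majority, 5 minority cases
y = [zeros(95,1); ones(5,1)];

stream = RandStream('Threefry');            % same generator type as in the question
prev = RandStream.setGlobalStream(stream);  % make it global; keep the previous stream

stream.Substream = 1;                       % start of substream 1
cA = cvpartition(y, 'KFold', 10);

stream.Substream = 2;                       % start of substream 2
cB = cvpartition(y, 'KFold', 10);

stream.Substream = 1;                       % setting Substream rewinds to its start
cA2 = cvpartition(y, 'KFold', 10);

RandStream.setGlobalStream(prev);           % restore the previous global stream

% Compare test-set membership of fold 1 across the three partitions
sameSeedRepeats = isequal(test(cA, 1), test(cA2, 1));  % expected: true
seedsDiffer     = ~isequal(test(cA, 1), test(cB, 1));  % very likely, not guaranteed
fprintf('same substream repeats: %d, substreams differ: %d\n', ...
    sameSeedRepeats, seedsDiffer);
```

If sameSeedRepeats comes back false, cvpartition is probably not reading from the stream that was set, which would point back at the stream setup rather than at the data.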

If this answer helps you, please remember to accept the answer.


Martin Randau 2023-12-7

Dear Drew,

Thanks for the answer. The version is R2021b, sorry for the confusion.

I added RandStream.setGlobalStream(stream); at the location you suggested. However, the NaNs are still occurring in the same column across the seeds. Of course, it's possible that, given the distribution of cases/non-cases, every iteration finds the NaNs in the same order, even though the starting points are different.

The parallel job is in a distributed environment, i.e., MATLAB Parallel Server (https://www.hpc.dtu.dk/?page_id=2021). So I start the job with a .sh script where I specify "# -- Number of cores requested -- #BSUB -n 1".

BW

Martin
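
One way to test the hypothesis in the comment above (that the class distribution, not the random stream, fixes where the NaNs land) is to count minority-class members per outer fold across several substreams. The following is only a diagnostic sketch: the labels y, the class sizes, and the variable names are hypothetical, and it does not assert what cvpartition's stratification will actually do on any given release. MATLAB may also warn that some folds lack a class when the minority count is below the number of folds.

```matlab
% Hypothetical labels: 9 minority cases spread over 10 folds,
% so at least one fold must end up without a minority case.
y = [zeros(91,1); ones(9,1)];
K = 10;
nSeeds = 5;

stream = RandStream('Threefry');
prev = RandStream.setGlobalStream(stream);  % cvpartition uses the global stream

minorityPerFold = zeros(nSeeds, K);         % rows: seeds, columns: outer folds
for s = 1:nSeeds
    stream.Substream = s;                   % independent substream per seed
    c = cvpartition(y, 'KFold', K);         % stratified by class labels by default
    for k = 1:K
        minorityPerFold(s, k) = sum(y(test(c, k)) == 1);
    end
end

RandStream.setGlobalStream(prev);           % restore the previous global stream

% A zero in the same column on every row means the empty fold is
% determined by the class counts rather than by the seed.
disp(minorityPerFold);
```

If the zero always sits in the same column here, the fixed NaN column in ERPres would be consistent with stratification rather than with a problem in the RandStream setup. Comparing test(c, k) across seeds would then show whether the fold membership itself still changes.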
