Working with big tall arrays in an HPC environment
Hi all,
I have recently started using tall arrays to process large data arrays, with relatively good results. Currently I have 25 arrays of 40,000x35,677 each, saved in separate files, that I want to feed to a training algorithm on an HPC cluster. I already ran this training satisfactorily using only 3 of these 25 sets. However, when I use all 25 arrays the HPC job crashes with an out-of-memory message.
I am requesting 37 cores with 20 GB of RAM each, which in theory should be enough to bring the whole data set into physical memory. I decided not to do that because, in later stages of the code, I use multiprocessing (parfor loops) where each worker has to work independently with large chunks of data. I found that this speeds up the code significantly, so I would rather keep a similar memory allocation.
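For reference, a rough back-of-the-envelope estimate of the sizes involved (assuming full double-precision arrays):
% Rough memory estimate, assuming full double-precision arrays
bytesPerArray = 40000*35677*8;          % ~10.6 GiB per file
totalDataGiB = 25*bytesPerArray/2^30    % ~266 GiB for all 25 files together
totalRamGB = 37*20                      % 740 GB requested from SLURM (37 cores x 20 GB each)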
I isolated the problem to a self-contained case which is detailed as follows:
% Case to test the OOM problem
% I am loading the data into tall arrays and trying to process it
% I got an OOM problem when bringing a few of the observations into memory
% File path and number of files that I will load
folder_path = '/some/random/folder/path/'; %Linux
file_name = 'training_data';
end_num = 25; % Number of files
% Obtaining the files names in cell array
file_cell = cell(end_num,1);
data_per_file = zeros(end_num,1);
for i=1:end_num
file_cell{i} = strcat(folder_path,file_name,'_',num2str(i),'_out.mat');
file_data = whos('-file',file_cell{i},'data');
data_per_file(i) = file_data.size(1);
end
P = sum(data_per_file); % Number of observations
N = file_data.size(2); % Spectral resolution
% Create a file set to load in fileDatastore
fs = matlab.io.datastore.FileSet(file_cell);
ds = fileDatastore(fs,"ReadFcn",@load_atm_db);
% Create tall array and randomize observations order
Ncores = 36;
parpool(Ncores)
A = cell2mat(tall(ds)); % Create tall array with all the observations
selectedBlocks = A(randperm(P),:); % Randomize the order of the observations
% Obtain initial dictionary atoms
disp('Calculating initial dictionary')
Natoms = 40000;
Dictionary = gather(selectedBlocks(1:Natoms,:));
disp('Finished running the code')
The "@load_atm_db" function is detailed below, which I use because the arrays in the .mat files are inside struct constructions.
function data = load_atm_db(filepath)
%load_atm_db Load the database in a datastore
% Load the data array into a datastore without having to pass through a struct
struct_array = load(filepath);
data = struct_array.data;
end
The code crashes when it reaches the "gather" function. From the log file, I can see that gather finishes loading the data (it completes all 3 passes at 100%) and then crashes. The SLURM script that I use on the HPC is the following:
#!/bin/bash
#SBATCH --partition=queue
#SBATCH --job-name=Matlab_batch
#SBATCH --nodes=1
#SBATCH --cpus-per-task=37
#SBATCH --mem-per-cpu=20Gb
module add matlab/2024a
cd /another/random/folder/path/
matlab -nodisplay -r self_contained_case_test -logfile /still/another/random/folder/outputSelfContainedCase.out
Is there something that I can do better to avoid this out-of-memory error?
Answers (1)
Mike Croucher
2024-12-9
Hi Sebastian
I can't comment on the Tall Array situation right now but I'm zooming in on this comment:
I am requesting 37 cores with 20 GB of RAM each, which in theory should be enough to bring the whole data set into physical memory. I decided not to do that because, in later stages of the code, I use multiprocessing (parfor loops) where each worker has to work independently with large chunks of data. I found that this speeds up the code significantly, so I would rather keep a similar memory allocation.
If you have enough physical RAM, you should be able to ditch tall arrays completely. To get around the parfor memory issue, use a Threads pool. E.g. to create one with 8 workers:
parpool("Threads",8)
This uses shared memory wherever possible and so memory requirements are generally lower. More details on the choice between pool types are at Choose Between Thread-Based and Process-Based Environments.
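As a rough sketch of how this could look with the data fully in memory (the chunking scheme and the process_chunk function are illustrative placeholders, not part of your code):
% Illustrative sketch: load everything into memory, then use a thread-based
% pool so that workers share the array instead of receiving copies
pool = parpool("Threads",8);
nW = pool.NumWorkers;
allData = cell(numel(file_cell),1);
for k = 1:numel(file_cell)
allData{k} = load_atm_db(file_cell{k}); % reuses the reader function from the question
end
A = cell2mat(allData); % in-memory equivalent of the tall array
edges = round(linspace(1,size(A,1)+1,nW+1));
results = cell(nW,1);
parfor w = 1:nW
chunk = A(edges(w):edges(w+1)-1,:); % thread workers access A in shared memory
results{w} = process_chunk(chunk); % process_chunk is a hypothetical placeholder
end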
Also, I wonder if your application could make use of single precision? Your matrices, if full rather than sparse, would be about 10.6 GB each in double precision:
sizeDouble = 40000*35677*8/(1024^3)
but only half that in single, since each entry takes 4 bytes instead of 8. Of course there might be (serious!) numerical issues if you do this, but it works in many applications. Deep learning, for example, generally does fine in single precision.
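For example, a hypothetical single-precision variant of your read function (load_atm_db_single is just an illustration, not part of your code) could cast on load:
function data = load_atm_db_single(filepath)
%load_atm_db_single Load the data array and cast it to single precision
% Hypothetical variant of load_atm_db: roughly halves the in-memory footprint
% (about 5.3 GB per 40,000x35,677 array instead of about 10.6 GB)
struct_array = load(filepath);
data = single(struct_array.data);
end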
Finally, why 37 cores? That seems like a strange number!