Is matfile read speed affected by how the file is constructed?

I have a dataset that is 259000x94000x6 of int16 data. Obviously, this is way too big to fit into memory (about 276 GB) or load at once. The main issue is that the data can only be downloaded in 94000 separate chunks that are 259000x6 each, but I need to analyze the data in 259000 separate chunks of 94000x6 arrays.
For the past two weeks I have been trying various big-data techniques in MATLAB to optimize how I read all of this data. The fastest way seems to be to turn it into one large file containing all the data, which MUST be built by appending 94000 files of 259000x6 arrays (and not the other way around, due to the native structure of the data). However, one very peculiar thing I have found is that no matter how I build my giant .mat file (e.g. 259000x94000x6 or 94000x259000x6), the read speed using matfile is ALWAYS an order of magnitude faster when reading it in 259000x6 chunks rather than 94000x6 chunks. I've tried using '-v7.3' with and without compression, I've tried splitting it into smaller files of 3 GB each and for-looping through those files, and I've tried turning it into a fileDatastore, and nothing lets me read the data in 94000x6 chunks as fast as I can in 259000x6 chunks! Has anyone else experienced this, and does anyone know why it happens or know a workaround?
Thanks!
1 Comment
Rik 2018-11-2
Is it possible to either share some of the data or to write some code that generates representative data?
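A minimal sketch of what such representative data could look like, scaled down by a factor of 100 in the two large dimensions so that it runs quickly. The names nRows, nCols, nVars, and demoFile are made up for this illustration, and the random values only stand in for the real measurements. The file is built the way the question describes (one 259000x6-style slab appended at a time), and the two read directions are then timed.

```matlab
% Scaled-down stand-in for the 259000x94000x6 int16 dataset.
% nRows, nCols, nVars, and demoFile are hypothetical names for this sketch.
nRows = 2590;   % stands in for 259000
nCols = 940;    % stands in for 94000
nVars = 6;
demoFile = fullfile(tempdir, 'representative_demo.mat');
if exist(demoFile, 'file')
    delete(demoFile);
end

% matfile creates a v7.3 MAT-file on the first write
m = matfile(demoFile, 'Writable', true);

% Preallocate the variable on disk by writing its last element
m.alldata(nRows, nCols, nVars) = int16(0);

% Append one nRows-by-nVars slab at a time, as the data arrives
for c = 1:nCols
    slab = randi([-1000 1000], nRows, 1, nVars, 'int16');
    m.alldata(:, c, :) = slab;
end

% Time one read in each direction
tic; a = squeeze(m.alldata(:, 1, :)); tLong  = toc;   % nRows x nVars
tic; b = squeeze(m.alldata(1, :, :)); tShort = toc;   % nCols x nVars
fprintf('read along dim 1: %.4f s, read along dim 2: %.4f s\n', tLong, tShort);
```

Whether the timing gap at this reduced size resembles the gap on the full dataset is not something the sketch can guarantee, so the sizes may need to be raised to reproduce the effect.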


Answers (1)

Cameron Lee 2018-12-5
Edited: Cameron Lee 2018-12-5
I thought I'd follow this up. The short answer to the question is that read speed must be affected by the way the file is constructed. However, I found a way around this. First, I had to build the data files in chunks (I used 80 chunk files, since 80 x 1175 = 94000) that were 1175x259000x6 each. Once all of these were finished, I used the matfile command in a for-loop to bring each chunk in and permute its first two dimensions:
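% Permute each of the 80 chunk files in place (takes some time; cannot use parfor)
for x=1:80
    xstr=num2str(x);
    filename=strcat('location\ChunkFolder\AlldataCH',xstr,'.mat');
    m=matfile(filename,'Writable',true);
    % Load the whole 1175x259000x6 variable, swap the first two dimensions,
    % and write it back to the same file as 259000x1175x6
    m.alldata=permute(m.alldata,[2 1 3]);
end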
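To check on a small scale what the permute step does to a file, here is a sketch that uses a tiny 4x5x6 array in a temporary file. The file name permute_demo.mat is invented for the example; only matfile, permute, and save are taken from the approach above.

```matlab
% Tiny stand-in for one chunk file; demoFile is a hypothetical name.
demoFile = fullfile(tempdir, 'permute_demo.mat');
if exist(demoFile, 'file')
    delete(demoFile);
end

alldata = reshape(int16(1:4*5*6), 4, 5, 6);   % 4x5x6 test array
save(demoFile, 'alldata', '-v7.3');

% Same in-place permute as in the loop over the chunk files
m = matfile(demoFile, 'Writable', true);
m.alldata = permute(m.alldata, [2 1 3]);

% Query the stored size without loading the variable
sz = size(m, 'alldata');          % sz should now be [5 4 6]

% Element (i,j,k) of the original is element (j,i,k) of the stored array
v = m.alldata(2, 3, 4);           % equals alldata(3, 2, 4)
```

Note that m.alldata on the right-hand side loads the entire variable into memory, so each real chunk has to fit in RAM for this step to work.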
I was then able to read the data in and analyze it in a more timely fashion:
%% Build m (cell array of matfile connections to use repeatedly below)
m=cell(1,80);   % also clears the matfile object left in m by the permute loop
for x=1:80
    xstr=num2str(x);
    filename=strcat('location\ChunkFolder\AlldataCH',xstr,'.mat');
    m{x}=matfile(filename);
end
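% Read data into MATLAB in 94000x6 form & in optimized time, and analyze
parfor y=1:259920
    newdata=cell(1,80);
    for x=1:80
        % Row y of each permuted chunk is 1x1175x6; squeeze makes it 1175x6
        newdata{x}=squeeze(m{x}.alldata(y,:,:));
    end
    % Stacking the 80 pieces gives 94000x6; the trailing ' transposes it to 6x94000
    finaldata=vertcat(newdata{:})';
    %%%% DO ALL ANALYSIS HERE %%%%
end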
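The shapes produced by the squeeze, vertcat, and transpose steps in the read loop can be checked with small in-memory arrays. This is only an illustration: nChunks, chunkLen, nRows, and nVars are made-up sizes standing in for 80, 1175, 259000, and 6, and ordinary arrays replace the matfile objects.

```matlab
% Hypothetical small sizes standing in for 80 chunks of 259000x1175x6
nChunks  = 4;
chunkLen = 3;
nRows    = 5;
nVars    = 6;

% In-memory stand-ins for the permuted chunk files
chunks = cell(1, nChunks);
for x = 1:nChunks
    chunks{x} = randi([-100 100], nRows, chunkLen, nVars, 'int16');
end

y = 2;                                   % one iteration of the outer loop
pieces = cell(1, nChunks);
for x = 1:nChunks
    % 1 x chunkLen x nVars slice, squeezed to chunkLen x nVars
    pieces{x} = squeeze(chunks{x}(y, :, :));
end

stacked   = vertcat(pieces{:});          % (nChunks*chunkLen) x nVars
finaldata = stacked';                    % nVars x (nChunks*chunkLen)
```

So if the analysis expects the 94000x6 orientation, it is the stacked array (before the transpose) that has it.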
For whatever reason, this is the only way I could find that let me read the data into MATLAB the way I needed to, and in a timely manner (about a 30x improvement vs. reading it without permuting the dimensions).
As a side note, I tried doing the permute BEFORE saving the original chunks, and that still did not work. As I mentioned in my original post, I also tried saving the data as 259000x94000x6 (and in 259000x1175x6 chunks), and that did not work either. It only worked after I made the chunks, closed the file, brought the file back into MATLAB, and permuted it. Anyway, I hope this helps anyone out there with a similar problem. Also, if anyone can find an even speedier way to do this, please let me know.
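One possible way to dig into the difference, offered as a guess and not something confirmed in this thread: v7.3 MAT-files are HDF5-based, so the on-disk layout of a variable can be inspected with h5info. The sketch below writes a small file and prints the stored dimensions and chunk size; running the same h5info call on a chunk file before and after the permute might show whether the layout differs. The file name layout_demo.mat is made up for the example, and the assumption that chunk layout explains the speed gap is untested.

```matlab
% Hypothetical demo file; substitute one of the real chunk files to compare.
demoFile = fullfile(tempdir, 'layout_demo.mat');
if exist(demoFile, 'file')
    delete(demoFile);
end

alldata = zeros(200, 300, 6, 'int16');
save(demoFile, 'alldata', '-v7.3');

% A MAT-file variable is stored as an HDF5 dataset under its own name
info = h5info(demoFile, '/alldata');

disp('Stored dimensions:');
disp(info.Dataspace.Size);

disp('Chunk size (empty if the dataset is not chunked):');
disp(info.ChunkSize);
```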
