Loading Large CSV files
Hi Everyone,
I have CSV files of size 6GB, and I tried using the import function in MATLAB to load them, but it failed due to a memory issue. Is there a way to reduce the size of the files?
I think the number of columns is causing the problem. I have 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and MATLAB could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I could read the data column-wise, so that I end up with 2329 column vectors of length 133076, that would be great.
I am using MATLAB 2014a.
Answers (2)
Cedric
2015-7-31
Edited: Cedric
2015-7-31
If the file contains numbers that you ultimately want in a numeric array of doubles in MATLAB, that array will be around 2.5GB. The problem probably comes from the fact that loading the whole file as text, plus processing it, plus allocating this array requires more memory than your machine has available.
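As a rough check of that figure, the size can be estimated from the dimensions given in the question, assuming every value is stored as an 8-byte double (the variable names here are only for illustration):

```matlab
% Rough memory estimate for the full numeric array (dimensions from the question).
nRows = 133076 ;
nCols = 2329 ;
bytesPerDouble = 8 ;                        % A MATLAB double takes 8 bytes.
nBytes = nRows * nCols * bytesPerDouble ;
fprintf( 'Approx. %.2f GB\n', nBytes / 1e9 ) ;
```

This comes out at about 2.48GB for the array alone, before counting the text buffer and any temporary copies made while parsing.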
You can always process the file line by line, or in chunks smaller than the whole file, and populate a 2.5GB numeric array. Something along the following lines:
chunk_nRows = 2e4 ;
% - Open file.
fId = fopen( 'largeFile.csv' ) ;
% - Read first line, convert to double, determine #columns.
line = fgetl( fId ) ;
row = sscanf( line, '%f,' )' ;
nCols = numel( row ) ;
% - Prealloc data, copy first row, init loop counter.
data = zeros( chunk_nRows, nCols ) ;
data(1,:) = row ;
rowCnt = 1 ;
% - Loop over rest of the file.
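while ~feof( fId )
    % - Read line; stop if fgetl signals end of file (returns -1).
    line = fgetl( fId ) ;
    if ~ischar( line )
        break ;
    end
    % - Skip blank lines (e.g. a trailing empty line).
    if isempty( strtrim( line ) )
        continue ;
    end
    rowCnt = rowCnt + 1 ;
    % - Realloc + a chunk if rowCnt larger than data array.
    if rowCnt > size( data, 1 )
        fprintf( 'Realloc ..\n' ) ;
        data(size(data, 1)+chunk_nRows, nCols) = 0 ;
    end
    % - Convert and store.
    data(rowCnt,:) = sscanf( line, '%f,' )' ;
end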
% - Truncate data to last row (truncate last chunk).
data = data(1:rowCnt,:) ;
% - Close file.
fclose( fId ) ;
There are plenty of other ways to read the file by blocks, e.g. 500MB at a time.
Here is one that is likely to be more efficient:
blockSize = 500e6 ; % Choose large enough so not too many blocks.
tailSize = 100 ; % Choose large enough so larger than one value representation.
% - Open file.
fId = fopen( 'largeFile.csv' ) ;
% - Read first line, convert to double, determine #columns.
line = fgetl( fId ) ;
data = sscanf( line, '%f,' ) ;
nCols = numel( data ) ;
lastBit = '' ;
while ~feof( fId )
    % - Read and pre-process block.
    buffer = fread( fId, blockSize, '*char' ) ;
    isLast = length( buffer ) < blockSize ;
    buffer(buffer==10) = ',' ;
    buffer(buffer==13) = '' ;
    % - Pre-pend last bit of last block.
    if ~isempty( lastBit )
        buffer = [lastBit; buffer] ; %#ok<AGROW>
    end
    % - Truncate to last ',' and keep last bit for next iteration.
    if ~isLast
        n = find( buffer(end-tailSize:end)==',', 1, 'last' ) ;
        cutAt = length(buffer) - tailSize + n ;
        lastBit = buffer(cutAt:end) ;
        buffer(cutAt:end) = [] ;
    end
    % - Parse.
    data = [data; sscanf( buffer, '%f,' )] ; %#ok<AGROW>
end
% - Close file.
fclose( fId ) ;
% - Reshape data vector -> array.
data = reshape( data, nCols, [] )' ;
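The final reshape needs the transpose because the parsed values arrive in file order, i.e. row by row, whereas reshape fills arrays column by column. A small illustration with made-up values (not taken from the actual file; tmp is a name introduced here only for the example):

```matlab
% Values as sscanf would return them for a 2-row, 3-column file:
%   1,2,3
%   4,5,6
nCols = 3 ;
data = [1; 2; 3; 4; 5; 6] ;
% reshape fills column-wise, so build an nCols-by-nRows array first ...
tmp = reshape( data, nCols, [] ) ;          % [1 4; 2 5; 3 6]
% ... and transpose it to recover the original row layout.
data = tmp' ;                               % [1 2 3; 4 5 6]
```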
6 Comments
Walter Roberson
2015-7-31
Reading the full file doesn't require that you store the contents.
function numlines = count_remaining_lines(fid)
    numlines = 0;
    while true
        if ~ischar(fgets(fid)); break; end %end of file
        numlines = numlines + 1;
    end
end
So you can do:
fid = fopen('largeFile.csv', 'rt');
headerline = fgetl(fid);
headerfields = regexp(headerline, ',', 'split');
numcol = length(headerfields);
numrow = count_remaining_lines(fid);
frewind(fid);
fgetl(fid); %re-read and discard the headerline
data = zeros(numrow, numcol);
Now you can loop reading a row at a time, storing it into data(RowNum, :) without having to expand the data buffer.
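A minimal sketch of that loop is given below. To keep it runnable on its own, it writes a tiny sample file first (the temporary file and its contents are made up for illustration), counts the lines inline instead of calling count_remaining_lines, and assumes every data row holds numcol purely numeric, comma-separated values:

```matlab
% Tiny sample file standing in for the large CSV (hypothetical contents).
fname = [tempname, '.csv'] ;
fid = fopen( fname, 'wt' ) ;
fprintf( fid, 'a,b,c\n1,2,3\n4,5,6\n' ) ;
fclose( fid ) ;

% Count columns from the header and rows from the remaining lines.
fid = fopen( fname, 'rt' ) ;
headerline = fgetl( fid ) ;
numcol = length( regexp( headerline, ',', 'split' ) ) ;
numrow = 0 ;
while ischar( fgets( fid ) )
    numrow = numrow + 1 ;
end

% Rewind, skip the header again, and fill the preallocated array.
frewind( fid ) ;
fgetl( fid ) ;
data = zeros( numrow, numcol ) ;
for RowNum = 1 : numrow
    tline = fgetl( fid ) ;
    % Assumes numcol numeric values per line; sscanf returns a column.
    data(RowNum, :) = sscanf( tline, '%f,' )' ;
end
fclose( fid ) ;
delete( fname ) ;
```

The file is read twice this way, but the array is allocated once at its final size, so no reallocation happens during the second pass.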
Cedric
2015-8-1
Edited: Cedric
2015-8-1
Alan, if the data is not too confidential, could you run the following code on your file (replace 'largeFile.csv' with your file name) and post a comment with the dump file dump.csv attached?
fId = fopen( 'largeFile.csv', 'r' ) ;
data = fread( fId, 1e6, '*char' ) ;
fclose( fId ) ;
fId = fopen( 'dump.csv', 'w' ) ;
fwrite( fId, data ) ;
fclose( fId ) ;
That will extract a small chunk of your file that we will be able to use (after truncation) to perform tests, without you having to open a 6GB file to take a slice.
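The truncation mentioned here could be done, for instance, by cutting the chunk at its last line feed so that only complete rows remain. This is just one possible approach, shown on a made-up character chunk rather than on the real dump:

```matlab
% Made-up chunk standing in for the characters read by fread.
chunk = sprintf( '1,2,3\n4,5,6\n7,8' ) ;    % last line is cut mid-row
% Keep everything up to and including the last line feed (char 10).
lastLF = find( chunk == 10, 1, 'last' ) ;
if ~isempty( lastLF )
    chunk = chunk(1:lastLF) ;
end
```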