Loading Large CSV files

130 views (last 30 days)
Alan Kong 2015-7-31
Hi Everyone,
I have CSV files of about 6 GB and I tried using the import function in MATLAB to load them, but it failed due to a memory issue. Is there a way to reduce the size of the files?
I think the number of columns is causing the problem. The file is 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and MATLAB could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I can read the data column-wise, so that I get 2329 column vectors of 133076 values each, that would be great.
I am using MATLAB R2014a.
2 Comments
Walter Roberson 2015-7-31
Are the fields all numeric?
Alan Kong 2015-7-31
Yes, only the first row contains string IDs like 'R1' to 'R2329'.

Sign in to comment.

Answers (2)

Cedric 2015-7-31
Edited: Cedric 2015-7-31
If the file contains numbers that you ultimately want in a numeric array of doubles in MATLAB, that array alone will be around 2.5 GB. The problem probably comes from the fact that loading the whole file as text, plus processing it, plus allocating this array, exceeds what your machine can handle.
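For reference, that estimate is just 8 bytes per double times the number of elements:
% Approximate size of the full 133076-by-2329 double array, in GB:
arrayGB = 133076 * 2329 * 8 / 1e9   % about 2.48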
You can always process the file line by line, or by chunks smaller than the whole file, and populate the 2.5 GB numeric array as you go. Something along the following lines:
chunk_nRows = 2e4 ;
% - Open file.
fId = fopen( 'largeFile.csv' ) ;
% - Read first line, convert to double, determine #columns.
line  = fgetl( fId ) ;
row   = sscanf( line, '%f,' )' ;
nCols = numel( row ) ;
% - Prealloc data, copy first row, init loop counter.
data = zeros( chunk_nRows, nCols ) ;
data(1,:) = row ;
rowCnt = 1 ;
% - Loop over rest of the file.
while ~feof( fId )
   rowCnt = rowCnt + 1 ;
   % - Realloc + a chunk if rowCnt larger than data array.
   if rowCnt > size( data, 1 )
      fprintf( 'Realloc ..\n' ) ;
      data(size(data,1)+chunk_nRows, nCols) = 0 ;
   end
   % - Read line, convert and store.
   line = fgetl( fId ) ;
   data(rowCnt,:) = sscanf( line, '%f,' )' ;
end
% - Truncate data to last row (truncate last chunk).
data = data(1:rowCnt,:) ;
% - Close file.
fclose( fId ) ;
And we can imagine plenty of other ways to read the file by blocks of, e.g., 500 MB.
Well, here is another way, which is likely to be more efficient:
blockSize = 500e6 ;  % Choose large enough so there are not too many blocks.
tailSize  = 100 ;    % Choose larger than the text representation of one value.
% - Open file.
fId = fopen( 'largeFile.csv' ) ;
% - Read first line, convert to double, determine #columns.
line  = fgetl( fId ) ;
data  = sscanf( line, '%f,' ) ;
nCols = numel( data ) ;
lastBit = '' ;
while ~feof( fId )
   % - Read and pre-process block.
   buffer = fread( fId, blockSize, '*char' ) ;
   isLast = length( buffer ) < blockSize ;
   buffer(buffer==10) = ',' ;
   buffer(buffer==13) = '' ;
   % - Prepend last bit of previous block.
   if ~isempty( lastBit )
      buffer = [lastBit; buffer] ;  %#ok<AGROW>
   end
   % - Truncate to last ',' and keep last bit for next iteration.
   if ~isLast
      n = find( buffer(end-tailSize:end)==',', 1, 'last' ) ;
      cutAt = length(buffer) - tailSize + n ;
      lastBit = buffer(cutAt:end) ;
      buffer(cutAt:end) = [] ;
   end
   % - Parse.
   data = [data; sscanf( buffer, '%f,' )] ;  %#ok<AGROW>
end
% - Close file.
fclose( fId ) ;
% - Reshape data vector -> array.
data = reshape( data, nCols, [] )' ;
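If textscan is an option for your file, the same chunked read can be written more compactly. A minimal sketch (file name and chunk size are placeholders), assuming the first line is your row of IDs and everything below it is numeric:
chunk_nRows = 2e4 ;                               % placeholder chunk size
fid = fopen( 'largeFile.csv', 'rt' ) ;
header = fgetl( fid ) ;                           % line of IDs 'R1',...,'R2329'
nCols  = numel( strfind( header, ',' ) ) + 1 ;
fmt    = repmat( '%f', 1, nCols ) ;
data   = zeros( 0, nCols ) ;
while ~feof( fid )
   C = textscan( fid, fmt, chunk_nRows, 'Delimiter', ',', 'CollectOutput', true ) ;
   if isempty( C{1} ), break ; end                % nothing left to parse
   data = [data; C{1}] ;                          %#ok<AGROW>
end
fclose( fid ) ;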
6 Comments
Walter Roberson 2015-7-31
Reading the full file doesn't require that you store the contents.
function numlines = count_remaining_lines(fid)
   numlines = 0;
   while true
      if ~ischar(fgets(fid)); break; end  % end of file
      numlines = numlines + 1;
   end
end
So you do something like this:
fid = fopen('largeFile.csv', 'rt');
headerline = fgetl(fid);
headerfields = regexp(headerline, ',', 'split');
numcol = length(headerfields);
numrow = count_remaining_lines(fid);
frewind(fid);
fgetl(fid); %re-read and discard the headerline
data = zeros(numrow, numcol);
Now you can loop reading a row at a time, storing it into data(RowNum, :) without having to expand the data buffer.
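A minimal sketch of that loop, reusing fid, numrow and data from above (the sscanf parsing is an assumption; any per-line numeric parser would do):
for RowNum = 1 : numrow
   line = fgetl( fid ) ;
   data(RowNum, :) = sscanf( line, '%f,' ).' ;   % one comma-separated row
end
fclose( fid ) ;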
Cedric 2015-8-1
Edited: Cedric 2015-8-1
Alan, if the data is not too confidential, could you run the following code on your file (replacing 'largeFile.csv' with your file name) and post a comment with the dump file dump.csv attached?
fId = fopen( 'largeFile.csv', 'r' ) ;
data = fread( fId, 1e6, '*char' ) ;
fclose( fId ) ;
fId = fopen( 'dump.csv', 'w' ) ;
fwrite( fId, data ) ;
fclose( fId ) ;
That will extract a small chunk of your file that we can use (after truncating it to whole lines) to perform tests, without you having to open the 6 GB file just to take a slice.

Sign in to comment.


Amr Hashem 2015-7-31
You can split the file with any free file-splitting program.
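If you go that route, the pieces can then be read back and concatenated in MATLAB. A minimal sketch, assuming the file was split row-wise into parts named largeFile_part1.csv, largeFile_part2.csv, ... (placeholder names, chosen so they list in order) and that the ID header row is only in the first part:
parts = dir( 'largeFile_part*.csv' ) ;
data  = [] ;
for k = 1 : numel( parts )
   if k == 1
      chunk = dlmread( parts(k).name, ',', 1, 0 ) ;   % skip the ID header row
   else
      chunk = dlmread( parts(k).name, ',' ) ;
   end
   data = [data; chunk] ;                             %#ok<AGROW>
end
Note that the concatenated result is still the full ~2.5 GB array of doubles, so splitting only helps with the text-loading step, not with holding the final array in memory.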
