Why would csvread read all data into a single column

13 views (last 30 days)
I am trying to read in a csv file to matlab. It has 1 million columns and 2 rows. When I use csvread it reads the file in as a 1 column by 2 million row matrix. Why would it do this?
  1 Comment
dpb 2017-5-12
Edited: dpb 2017-5-12
Dunno...would seem there's either
  1. something in the file that's confusing textscan's row count, or
  2. there is an actual bug/limitation inside textscan
csvread is just a wrapper to dlmread which in turn simply parses the inputs and calls textscan. For a .csv file, the call boils down to
delimiter = sprintf(delimiter);
whitespace = setdiff(sprintf(' \b\t'),delimiter);
result = textscan(fid,'',nrows, ...
'delimiter',delimiter,'whitespace',whitespace, ...
'headerlines',r,'headercolumns',c,...
'returnonerror',0,'emptyvalue',0,'CollectOutput', true);
where, of course, delimiter is ','.
The "magic" occurs inside textscan: as you notice, there is no explicit format string, just an empty-string placeholder. That is the internal cue instructing it to return the array in the same shape as the records appear externally, without the user having to count fields and build a format string.
Since we can't see inside textscan, this is as far as we can go.
You could try building a test file and parsing it to see if you can replicate the problem at a specific record length, or, along the way, perhaps determine that such a file works correctly and the fault is in this particular data file.

Sign in to comment.

Answers (2)

Matthew Eicholtz 2017-5-12
I think dpb's comment addresses potential csvread issues well, so I'll just add an alternative option that may work for you: readtable.
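For what it's worth, a minimal sketch of that route might look like the following (the filename here is a placeholder, and as the comments below note, this may not fit the 2-row-by-1-million-column orientation any better):

```matlab
% Sketch: read the CSV as a table, then convert to a numeric matrix.
% 'data.csv' is an assumed filename.
T = readtable('data.csv','ReadVariableNames',false);
M = table2array(T);   % numeric matrix with the file's rows and columns
```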
  3 Comments
Matthew Eicholtz 2017-5-12
Ah yes, after re-reading the question, I agree. I was thinking 1 million instances of 2 variables, not the other way around. Good catch.
dpb 2017-5-12
Actually, for data files of this size it would seem far better to use .mat files, a binary stream, or somesuch...there's certainly no looking at them usefully by hand anyway.
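As a sketch of that suggestion (the variable and file names are illustrative), a .mat round trip sidesteps the text-parsing limits entirely:

```matlab
% Illustrative only: store the wide matrix in binary .mat form instead of CSV.
d = randi(127,2,1e6);     % 2-by-1,000,000 test data, as in the tests below
save('data.mat','d');     % binary save -- no per-record field limit
S = load('data.mat');     % S.d is the 2-by-1000000 matrix again
```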

Sign in to comment.


dpb 2017-5-12
Edited: dpb 2017-5-12
Expanding upon the above comments, I did a test that looked like--
N=1E6; % the long row length
csvwrite('test.csv',randi(127,2,N)) % write a 2-row file of same N (@)
d=csvread('test.csv');
while isvector(d)
    N=N/2;
    csvwrite('test.csv',randi(127,2,N)) % write a 2-row file of same N
    d=csvread('test.csv');
end
disp(N)
The result was N=62500, which seems to prove conclusively there's an internal limit in textscan; probably some sort of buffer limit, one would guess, when the format string isn't provided.
I didn't try to refine the result to find where between 62,500 and 125,000 it breaks, but that definitely seems to be the cause of the issue.
I tried the venerable textread; it never completed the 1E6 case before I gave up, so that's not a workaround.
While it would be butt-ugly as a solution, I also tried an explicit format string:
>> d=textscan(fid,fmt,'delimiter',',','collectoutput',1);
Out of memory. Type HELP MEMORY for your options.
>>
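For reference, fmt isn't shown above; presumably it was built as one '%f' conversion per column, along these lines (this is the call that produced the out-of-memory error):

```matlab
% Presumed construction of the explicit format string: one %f per field.
N = 1e6;
fmt = repmat('%f',1,N);
fid = fopen('test.csv','r');
d = textscan(fid,fmt,'Delimiter',',','CollectOutput',1);
fclose(fid);
```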
Looks like a support request to TMW to see if they can find a workaround, or put it on the enhancement list to resolve. It certainly seems as though Matlab should be able to read any file in whatever form it is on disk, as long as it can actually fit in memory, without gyrations by the user.
(@) Just to be sure, I did scan the long-record file by reading it as a stream character file and confirmed csvwrite wrote the linefeeds where it should have, so that, in fact, the file was actually two records on disk.
ADDENDUM
I hate it when I get fixated on something... :(
But I did a couple of additional tests and confirmed there's a hard limit, apparently buried inside the textscan code, at 100000--
>> N=100000;
>> csvwrite('test.csv',randi(127,2,N))
>> isvector(csvread('test.csv'))
ans =
1
>> N=N-1
N =
99999
>> csvwrite('test.csv',randi(127,2,N))
>> isvector(csvread('test.csv'))
ans =
0
>>
It fails beginning at 100,000 elements per record; 99,999 is OK. You're just not supposed to have a file with records any longer than that, it seems.
  1 Comment
dpb 2017-5-12
Edited: dpb 2017-5-13
Well, here's one way to make it work, albeit slowly--
>> fid=fopen('test.csv','r');
>> dd=str2num(fread(fid,'*char').');
>> whos dd
  Name      Size            Bytes       Class     Attributes
  dd        2x1000000       16000000    double
>> fid=fclose(fid);
If you know the size a priori, it would be better to just read and reshape. If the size isn't known, the two-step solution is probably still significantly faster, as str2num uses eval internally. But it is interesting that the interpreter can deal with that long an internal input record while textscan can't handle that long an external record.
fid=fopen('test.csv','r');
n=length(find(fread(fid,'*char')==10)); % count linefeeds --> number of rows
fid=fclose(fid);
d=reshape(csvread('test.csv'),[],n).';  % column-major reshape, then transpose to n rows
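If the row count is known a priori (here it's 2), the linefeed scan can be skipped entirely; a sketch:

```matlab
% Known row count: fix up the misread column vector directly.
% csvread returns the values in file order (row 1's fields, then row 2's),
% so reshape column-major into nrows columns, then transpose.
nrows = 2;
d = reshape(csvread('test.csv'),[],nrows).';   % 2-by-1000000
```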

Sign in to comment.
