Is there a way to efficiently read a .csv file into a dataset in MATLAB?
Ok, so here is the deal.
I have a 2.5 GB csv file. I'd like to load it as a dataset so that I can use the dataset indexing functionality (for example, grabbing the rows where a column has a certain value).
Here are some sample lines:
rs180759811,1,83977,0.0078454,0.99052,0.512,'0000','1010',0.45,.,.,F,.,.,.,.,.,.,imputed,
rs188652299,1,84156,0.0012772,0.99851,0.50381,'0000','1100',0.65,.,.,R,.,.,.,1,.,.,imputed,
rs192830046,1,86282,0.00080435,0.99911,0.59506,'0000','1111',0,.,.,R,.,1,.,.,.,.,imputed,
rs146027550,1,88429,0.018998,0.97847,0.53261,'0000','1001',0.2,.,.,R,.,.,.,1,.,.,imputed,
rs187571096,1,114699,0.010444,0.98884,0.5583,'0000','1000',0.65,.,.,R,.,.,.,1,.,.,imputed,
rs191891026,1,171529,0.011039,0.98724,0.51818,'0000','1001',0.2,.,.,R,.,.,.,1,.,.,imputed,
But, as I see it, there is not a good way to go from csv --> dataset.
Here are the options I've been considering:
fgetl --> regexp --> cell array --> cell2dataset
I know I can get that to work, but it can't be the most efficient way.
textscan --> lets me specify commas as the delimiter, which is useful, but I am not even sure whether I can read 1 line at a time with textscan.
csvread --> will not work because most of the values are not numeric.
Is there another option that will turn a csv directly into an array or dataset without having to treat it as strings, regexp it, the whole 9 yards?
Thanks very much.
Answers (1)
Walter Roberson
2013-9-11
You can read a line at a time with textscan(), by specifying a count of 1 right after the format. But why not read it all with textscan() and then cell2dataset() the result, possibly after a horzcat() ?
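cellinput = textscan(fid, '%s%f%f%f%f%f%s%s%f%s%s%s%s%s%s%s%s%s%s%s', 'delimiter', ',');   % fid comes from fopen() on the csv file
isnum = cellfun(@isnumeric, cellinput);   % the %f columns come back as numeric vectors, not cell columns
cellinput(isnum) = cellfun(@num2cell, cellinput(isnum), 'UniformOutput', false);   % wrap them so horzcat() can combine them with the %s columns
ds = cell2dataset( horzcat(cellinput{:}), 'ReadVarNames', false );   % the file has no header row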
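The horzcat() takes the result from being a cell row vector with each member being a column, into being a single row-and-column cell array. Note that textscan() returns the %f columns as numeric column vectors rather than cell column vectors, so those have to be wrapped with num2cell() before they can be concatenated with the %s columns.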
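As a small illustration of that reshaping (a sketch on made-up values rather than the real file; the names c, flat, n and flat2 exist only for this example), horzcat() on a 1-by-3 cell row of 2-by-1 cell columns gives a 2-by-3 cell array, and a numeric column goes through num2cell() first:

```matlab
% Toy stand-in for textscan() output: a 1-by-3 cell row vector whose
% members are 2-by-1 cell column vectors (values are made up).
c = { {'rs1'; 'rs2'}, {1; 1}, {'F'; 'R'} };

flat = horzcat(c{:});        % same as [c{1}, c{2}, c{3}]
disp(size(c));               % 1 3
disp(size(flat));            % 2 3

% A numeric column vector is wrapped cell-by-cell before concatenation.
n = [83977; 84156];
flat2 = horzcat(c{:}, num2cell(n));
disp(size(flat2));           % 2 4
```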
For lack of better information, each column after the last consistently numeric column is read in as a separate string. If you know that a certain column there will always hold a useless ".", then switch the corresponding %s to %*s so that it is skipped. But for a column that is either 1 or ".", do not switch it to %f, as %f will not gracefully match a "." in that column.
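A sketch of that format tweak, assuming (from the sample lines only) that fields 10 and 11 are always "." and that field 16 is one of the columns holding either 1 or ".". The temporary file, the variable names and the str2double() post-processing step are illustrative choices, not something stated in the answer:

```matlab
% Write two of the sample lines to a temporary file so the sketch is self-contained.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'rs180759811,1,83977,0.0078454,0.99052,0.512,''0000'',''1010'',0.45,.,.,F,.,.,.,.,.,.,imputed,');
fprintf(fid, '%s\n', 'rs188652299,1,84156,0.0012772,0.99851,0.50381,''0000'',''1100'',0.65,.,.,R,.,.,.,1,.,.,imputed,');
fclose(fid);

% Same 20-field layout as the original format, with fields 10 and 11 skipped via %*s.
fmt = ['%s%f%f%f%f%f%s%s%f' '%*s%*s' '%s%s%s%s%s%s%s%s%s'];
fid = fopen(fname, 'r');
cols = textscan(fid, fmt, 'Delimiter', ',');
fclose(fid);
delete(fname);

% Field 16 of the file is output column 14, because two fields were skipped.
% It was read as text; str2double() maps '1' to 1 and '.' to NaN.
flag = str2double(cols{14});
```

Skipped columns are never stored, which should also help with memory on a file of this size.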