How to read large text data into MATLAB

Question

Fan Li 2018-1-5

https://ww2.mathworks.cn/matlabcentral/answers/375714-how-to-read-large-text-data-into-matlab

Commented: Abdullahi Samantar 2018-12-12

Hi everyone, I have a text file of up to 10 GB that has to be read into MATLAB. Part of the data is listed below:

ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935
2242 -25.5898 -24.3081 -29.3729 0.00671099 0.00205397 -0.0108453
2243 9.2867 27.1493 -37.9274 -0.00115821 0.00912371 -0.00178601
2244 3.89714 -48.5019 70.5903 0.0041159 -0.00255481 -0.0029498
2245 49.8803 -40.1819 -5.30361 -0.0106695 0.0224494 0.00918698
2246 0.22115 -19.9758 -2.30173 0.0190817 0.0262146 -0.0153229
2247 -53.6289 50.5517 -23.5032 0.00388499 -0.00559089 0.000787281
.
.
.
.
.
.
.
.
ITEM: TIMESTEP
10
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 -50.0606 -93.6118 -70.4534 0.000504085 -0.00684199 -0.00394166
2242 -14.4928 20.0993 3.55963 0.00244236 0.00203074 -0.0162865
2243 -2.64823 8.26566 23.6457 -0.000503352 0.0140246 -0.00909782
2244 -153.189 40.6383 -12.0141 0.00192712 -0.00177534 -0.00194966
2245 35.0712 -14.4107 6.31868 0.00668828 0.012556 0.00468532
2246 22.0675 -14.7867 61.4774 0.0182799 0.0194239 -0.00942033
2247 -3.80959 -88.6786 1.61222 0.00459477 -0.00577238 0.000324204
2248 -18.4777 -9.35017 -1.12766 0.0146401 0.00924069 -0.00730373
2249 16.2354 -7.34658 -25.1694 -0.0169203 0.0249397 0.0085598
2250 110.508 19.9749 -4.95758 -0.00500049 0.000961677 0.00667405
2251 -7.46059 3.35324 -41.665 0.0175383 -0.00791068 -0.00702065

Basically, the file consists of many blocks, each starting with "ITEM: TIMESTEP". For each block I have to skip the first 9 lines and then read the remaining lines.

I tried the textscan function (maybe I misused it), but it is very slow. Is there a faster way to do this in MATLAB?


Cedric 2018-1-6

Edited: Cedric 2018-1-6

How much RAM do you have available?


Answer 1

Cedric 2018-1-6

https://ww2.mathworks.cn/matlabcentral/answers/375714-how-to-read-large-text-data-into-matlab#answer_298981

Edited: Cedric 2018-1-6

If you have enough RAM for this, the following could run a little faster. It is way less versatile than Per's solution though, and exploits specific characters present in the header. You may have to adapt it a bit if there are e.g. other types of header content:

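content = fileread( 'data.txt' ) ;                      % whole file as one char vector
% Each block ends just before the next 'ITEM: T(IMESTEP)'; the last one ends at EOF.
blockEnds   = strfind( content, 'ITEM: T' ) - 1 ;
blockEnds   = [blockEnds(2:end), numel( content )] ;
% Data start after '6]', which closes the 'ITEM: ATOMS ... c_water_force[6]' line.
blockStarts = strfind( content, '6]' ) + 3 ;
nBlocks     = numel( blockStarts ) ;
data        = cell( nBlocks, 1 ) ;
fprintf( '%d blocks found.\n', nBlocks ) ;
for bId = 1 : nBlocks
    % Parse all numbers of the block, then reshape to one row of 7 values per atom.
    data{bId} = reshape( sscanf( content(blockStarts(bId):blockEnds(bId)), '%f' ), 7, [] ).' ;
end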

PS: this takes < 20s for a 1GB data file on a small laptop (with 32GB RAM though).
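
For anyone who wants to check the approach before pointing it at a large file, here is a minimal sketch that writes a tiny two-block sample and parses it the same way. The sample values, the temporary file name and the variables headerEnd, blockMark and nCols are made up for illustration; the two marker strings are the only parts tied to this particular header and would have to be changed for a different dump format.

```matlab
% Write a tiny two-block sample in the format shown in the question.
% File name and numbers are invented for this illustration.
fname = [tempname, '.txt'] ;
hdr   = ['ITEM: TIMESTEP\n%d\nITEM: NUMBER OF ATOMS\n2\n', ...
         'ITEM: BOX BOUNDS pp pp ff\n0 1\n0 1\n0 1\n', ...
         'ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] ', ...
         'c_water_force[4] c_water_force[5] c_water_force[6]\n'] ;
fid = fopen( fname, 'w' ) ;
fprintf( fid, hdr, 0 ) ;
fprintf( fid, '%d %g %g %g %g %g %g\n', [1 2 3 4 5 6 7; 8 9 10 11 12 13 14].' ) ;
fprintf( fid, hdr, 10 ) ;
fprintf( fid, '%d %g %g %g %g %g %g\n', [1 -2 -3 -4 -5 -6 -7; 8 -9 -10 -11 -12 -13 -14].' ) ;
fclose( fid ) ;

% Markers that delimit a block (assumed names, adapt to your header).
headerEnd = 'c_water_force[6]' ;   % last token of the 'ITEM: ATOMS' line
blockMark = 'ITEM: TIMESTEP' ;     % first line of every block
nCols     = 7 ;                    % id plus six force components

content     = fileread( fname ) ;
% Data begin right after the last header token; sscanf skips the newline.
blockStarts = strfind( content, headerEnd ) + numel( headerEnd ) ;
% A block ends just before the next block marker, the last one at EOF.
blockEnds   = [strfind( content, blockMark ) - 1, numel( content )] ;
blockEnds   = blockEnds(2:end) ;

data = cell( numel( blockStarts ), 1 ) ;
for bId = 1 : numel( blockStarts )
    txt       = content(blockStarts(bId):blockEnds(bId)) ;
    data{bId} = reshape( sscanf( txt, '%f' ), nCols, [] ).' ;
end
delete( fname ) ;
```

Each cell of data then holds one block as an nAtoms-by-7 matrix. Measuring the offset with numel( headerEnd ) instead of a fixed +3 is only a convenience so that a longer marker can be swapped in without recounting characters.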


Fan Li 2018-1-8

Thanks. This way is faster.

Abdullahi Samantar 2018-12-12

Hi Fan Li,

How did you adapt Cedric's code to get it to run on your large txt file (LAMMPS)?

I couldn't figure it out.

Thank you


Answer 2

per isakson 2018-1-5

https://ww2.mathworks.cn/matlabcentral/answers/375714-how-to-read-large-text-data-into-matlab#answer_298908

Edited: per isakson 2018-1-6

Given:

All headers consist of 9 lines
All data blocks consist of 7 columns of numerical data
The blocks of numerical data should be converted to double. (Added later.)
The columns of the data are separated by space, char(32)
There is enough RAM to store the parsed data. Nearly 10 GB is needed to store it as double; single would introduce a rounding error.
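
The last assumption can be checked with a little arithmetic on one data line from the question. This is only a rough sketch: it assumes one byte per character plus a single newline per line, and the variable names are invented here.

```matlab
% One data line copied from the question.
sampleLine  = '2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935' ;

% Storage per row: text on disk versus parsed numbers in memory.
textBytes   = numel( sampleLine ) + 1 ;   % characters plus one newline (assumed)
doubleBytes = 7 * 8 ;                     % 7 values, 8 bytes each
singleBytes = 7 * 4 ;                     % 7 values, 4 bytes each
fprintf( 'double/text size ratio: %.2f\n', doubleBytes / textBytes ) ;

% Rounding error introduced by a round trip through single.
vals   = sscanf( sampleLine, '%f' ) ;
back   = double( single( vals ) ) ;
relErr = abs( back - vals ) ./ abs( vals ) ;
fprintf( 'largest relative error in single: %.3g\n', max( relErr ) ) ;
```

A parsed row in double takes somewhat less space than its text form, which is why the memory need is close to the file size. The relative error in single stays below eps('single'), and the integer id column survives unchanged, so whether single is acceptable depends on how many significant digits the later analysis needs.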

Try:

>> cac = cssm( 'cssm.txt' );
>> whos cac
  Name      Size            Bytes  Class    Attributes
  cac       1x6              3696  cell               
>> cac
cac = 
    [7x7 double]    [11x7 double]    [7x7 double]    [11x7 double]    [7x7 double]    [11x7 double]
>>   
>> cac{1}
ans =
   1.0e+03 *
    2.2410    0.0515   -0.0480   -0.0556    0.0000   -0.0000   -0.0000
    2.2420   -0.0256   -0.0243   -0.0294    0.0000    0.0000   -0.0000
    2.2430    0.0093    0.0271   -0.0379   -0.0000    0.0000   -0.0000
    2.2440    0.0039   -0.0485    0.0706    0.0000   -0.0000   -0.0000
    2.2450    0.0499   -0.0402   -0.0053   -0.0000    0.0000    0.0000
    2.2460    0.0002   -0.0200   -0.0023    0.0000    0.0000   -0.0000
    2.2470   -0.0536    0.0506   -0.0235    0.0000   -0.0000    0.0000
>>

where

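function  cac = cssm( ffs )
    % Read every data block of the file ffs into one cell per block.
    fid = fopen( ffs );
    assert( fid >= 3, 'cssm:cannotOpen', 'Cannot open file: %s', ffs )
    cac = cell(1,0);
    while not( feof( fid ) )
        % Skip the 9 header lines, then read 7 numeric columns until the
        % next header stops the match; CollectOutput returns one matrix.
        cac(1,end+1) = textscan( fid, '%f%f%f%f%f%f%f'      ...
                     , 'Headerlines',9, 'CollectOutput',true  );
    end
    fclose( fid );
end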

and cssm.txt contains three copies of the data of the question

"textscan [...] is very slow. Is there a faster way [...] Matlab?" AFAIK: No, not significantly faster. However, I don't agree that it's very slow.


Fan Li 2018-1-6

Hi per isakson

I am using the fgetl function, which is faster.

per isakson 2018-1-6

Edited: per isakson 2018-1-6

That's comparing apples to oranges. I assumed without stating it that the "numerical blocks" should be parsed.
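
To compare like with like, a line-oriented reader has to convert the numbers as well. Here is a minimal sketch of one way to do that, assuming the 9-line header shown in the question: fgetl reads the header lines, and fscanf parses exactly the number of atoms announced in the header. The sample file and all variable names are invented for illustration.

```matlab
% Build a small two-block sample file (made-up values).
fname = [tempname, '.txt'] ;
hdr   = ['ITEM: TIMESTEP\n%d\nITEM: NUMBER OF ATOMS\n2\n', ...
         'ITEM: BOX BOUNDS pp pp ff\n0 1\n0 1\n0 1\n', ...
         'ITEM: ATOMS id f1 f2 f3 f4 f5 f6\n'] ;
fid = fopen( fname, 'w' ) ;
fprintf( fid, hdr, 0 ) ;
fprintf( fid, '%d %g %g %g %g %g %g\n', [1 2 3 4 5 6 7; 8 9 10 11 12 13 14].' ) ;
fprintf( fid, hdr, 10 ) ;
fprintf( fid, '%d %g %g %g %g %g %g\n', [1 -2 -3 -4 -5 -6 -7; 8 -9 -10 -11 -12 -13 -14].' ) ;
fclose( fid ) ;

blockMark = 'ITEM: TIMESTEP' ;
nCols     = 7 ;
steps     = [] ;
blocks    = {} ;

fid = fopen( fname, 'r' ) ;
while true
    tline = fgetl( fid ) ;
    if ~ischar( tline ), break ; end              % fgetl returns -1 at end of file
    % Ignore anything that is not the start of a block, such as the
    % empty remainder of the last data line left behind by fscanf.
    if ~strncmp( tline, blockMark, numel( blockMark ) ), continue ; end
    steps(end+1) = str2double( fgetl( fid ) ) ;   % header line 2: timestep value
    fgetl( fid ) ;                                % header line 3: 'ITEM: NUMBER OF ATOMS'
    nAtoms = str2double( fgetl( fid ) ) ;         % header line 4: atom count
    for k = 1 : 5, fgetl( fid ) ; end             % header lines 5 to 9
    % Parse exactly one block: nCols values for each of nAtoms rows.
    blocks{end+1} = fscanf( fid, '%f', [nCols, nAtoms] ).' ;
end
fclose( fid ) ;
delete( fname ) ;
```

Because only one block is parsed per iteration, the loop body could process and discard each block instead of collecting them all, which keeps memory use at roughly one block at a time.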


Answer 3

Steven Lord 2018-1-8

https://ww2.mathworks.cn/matlabcentral/answers/375714-how-to-read-large-text-data-into-matlab#answer_299143

If your data is too large to fit in memory all at once, consider using a datastore. Since you have data in the headers that I assume you want to access, using a TabularTextDatastore probably won't suit your needs. You may need to use a general FileDatastore or develop your own custom datastore using your knowledge of the way your data is formatted.

Once you have a datastore you could use it to create a tall array.
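
As a rough illustration of the FileDatastore route, the sketch below passes a custom read function that drops every header before parsing. It is an assumption-laden example, not a tested recipe for the 10 GB case: the regular expression presumes each header runs from 'ITEM: TIMESTEP' to the end of the 'ITEM: ATOMS' line, the helper names stripHeaders and readNumeric are invented here, and the read function still loads one whole file at a time, so a single very large file would first have to be split into smaller files or handled by a custom datastore.

```matlab
% Build a small two-block sample file (made-up values).
fname = [tempname, '.txt'] ;
hdr   = ['ITEM: TIMESTEP\n%d\nITEM: NUMBER OF ATOMS\n2\n', ...
         'ITEM: BOX BOUNDS pp pp ff\n0 1\n0 1\n0 1\n', ...
         'ITEM: ATOMS id f1 f2 f3 f4 f5 f6\n'] ;
fid = fopen( fname, 'w' ) ;
fprintf( fid, hdr, 0 ) ;
fprintf( fid, '%d %g %g %g %g %g %g\n', [1 2 3 4 5 6 7; 8 9 10 11 12 13 14].' ) ;
fprintf( fid, hdr, 10 ) ;
fprintf( fid, '%d %g %g %g %g %g %g\n', [1 -2 -3 -4 -5 -6 -7; 8 -9 -10 -11 -12 -13 -14].' ) ;
fclose( fid ) ;

% Remove each header: lazy match from the block marker up to and
% including the 'ITEM: ATOMS ...' line ('.' also matches newlines).
stripHeaders = @( txt ) regexprep( txt, 'ITEM: TIMESTEP.*?ITEM: ATOMS[^\n]*\n', '' ) ;
% Read function for the datastore: file name in, numeric matrix out.
readNumeric  = @( fn ) reshape( sscanf( stripHeaders( fileread( fn ) ), '%f' ), 7, [] ).' ;

fds = fileDatastore( fname, 'ReadFcn', readNumeric ) ;
while hasdata( fds )
    M = read( fds ) ;   % all data rows of one file, headers removed
end
delete( fname ) ;
```

With this read function the blocks of a file are concatenated into one matrix, so the timestep boundaries are lost; a read function that returns a cell array per file would keep them.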


Fan Li 2018-1-16

Edited: Fan Li 2018-1-16

Hi Steven Lord

I have read the documentation on tall arrays and datastores, and it is useful to me. However, I do not know how to skip the headers, which I do not need. The function I am using now and the other methods provided here do not work with a datastore, and I found little material on skipping headers with a datastore. Could you tell me how to do that? The format of the data is shown above.

Thanks
