How to read large text data into matlab

91 次查看(过去 30 天)
Hi every one I have a text file up to 10 GB which has to be read into matlab. The part of the data is listed below
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935
2242 -25.5898 -24.3081 -29.3729 0.00671099 0.00205397 -0.0108453
2243 9.2867 27.1493 -37.9274 -0.00115821 0.00912371 -0.00178601
2244 3.89714 -48.5019 70.5903 0.0041159 -0.00255481 -0.0029498
2245 49.8803 -40.1819 -5.30361 -0.0106695 0.0224494 0.00918698
2246 0.22115 -19.9758 -2.30173 0.0190817 0.0262146 -0.0153229
2247 -53.6289 50.5517 -23.5032 0.00388499 -0.00559089 0.000787281
.
.
.
.
.
.
.
.
ITEM: TIMESTEP
10
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 -50.0606 -93.6118 -70.4534 0.000504085 -0.00684199 -0.00394166
2242 -14.4928 20.0993 3.55963 0.00244236 0.00203074 -0.0162865
2243 -2.64823 8.26566 23.6457 -0.000503352 0.0140246 -0.00909782
2244 -153.189 40.6383 -12.0141 0.00192712 -0.00177534 -0.00194966
2245 35.0712 -14.4107 6.31868 0.00668828 0.012556 0.00468532
2246 22.0675 -14.7867 61.4774 0.0182799 0.0194239 -0.00942033
2247 -3.80959 -88.6786 1.61222 0.00459477 -0.00577238 0.000324204
2248 -18.4777 -9.35017 -1.12766 0.0146401 0.00924069 -0.00730373
2249 16.2354 -7.34658 -25.1694 -0.0169203 0.0249397 0.0085598
2250 110.508 19.9749 -4.95758 -0.00500049 0.000961677 0.00667405
2251 -7.46059 3.35324 -41.665 0.0175383 -0.00791068 -0.00702065
Basically, it has many parts which start with the "ITEM: TIMESTEP". I have to skip the first 9 lines for each part and then read the other lines.
I tried the textscan function (May be I misused it), but it is very slow. Is there a faster way to do it in Matlab?

采纳的回答

Cedric
Cedric 2018-1-6
编辑:Cedric 2018-1-6
If you have enough RAM for this, the following could run a little faster. It is way less versatile than Per's solution though, and exploits specific characters present in the header. You may have to adapt it a bit if there are e.g. other types of header content:
content = fileread( 'data.txt' ) ;
blockEnds = strfind( content, 'ITEM: T' ) - 1 ;
blockEnds = [blockEnds(2:end), numel( content )] ;
blockStarts = strfind( content, '6]' ) + 3 ;
nBlocks = numel( blockStarts ) ;
data = cell( nBlocks, 1 ) ;
fprintf( '%d blocks found.\n', nBlocks ) ;
for bId = 1 : nBlocks
data{bId} = reshape( sscanf( content(blockStarts(bId):blockEnds(bId)), '%f' ), 7, [] ).' ;
end
PS: this takes < 20s for a 1GB data file on a small laptop (with 32GB RAM though).
  2 个评论
Abdullahi Samantar
Abdullahi Samantar 2018-12-12
Hi Fan Li,
How did you manipulate Cedric code to get your large txt file (lammps) run.
I couldnt figure it out.
Thank you

请先登录,再进行评论。

更多回答(2 个)

per isakson
per isakson 2018-1-5
编辑:per isakson 2018-1-6
Given:
  • All headers consist of 9 lines
  • All data blocks consist of 7 columns of numerical data
  • The blocks of numerical data should be converted to double. (Added later.)
  • The columns of the data are separated by space, char(32)
  • There is RAM enough to store the parsed data. Nearly 10GB is needed to store in double. Single would introduce a rounding error.
Try:
>> cac = cssm( 'cssm.txt' );
>> whos cac
Name Size Bytes Class Attributes
cac 1x6 3696 cell
>> cac
cac =
[7x7 double] [11x7 double] [7x7 double] [11x7 double] [7x7 double] [11x7 double]
>>
>> cac{1}
ans =
1.0e+03 *
2.2410 0.0515 -0.0480 -0.0556 0.0000 -0.0000 -0.0000
2.2420 -0.0256 -0.0243 -0.0294 0.0000 0.0000 -0.0000
2.2430 0.0093 0.0271 -0.0379 -0.0000 0.0000 -0.0000
2.2440 0.0039 -0.0485 0.0706 0.0000 -0.0000 -0.0000
2.2450 0.0499 -0.0402 -0.0053 -0.0000 0.0000 0.0000
2.2460 0.0002 -0.0200 -0.0023 0.0000 0.0000 -0.0000
2.2470 -0.0536 0.0506 -0.0235 0.0000 -0.0000 0.0000
>>
where
function cac = cssm( ffs )
fid = fopen( ffs );
cac = cell(1,0);
while not( feof( fid ) )
cac(1,end+1) = textscan( fid, '%f%f%f%f%f%f%f' ...
, 'Headerlines',9, 'CollectOutput',true );
end
fclose( fid );
end
and cssm.txt contains three copies of the data of the question
"textscan [...] is very slow. Is there a faster way [...] Matlab?" AFAIK: No, not significantly faster. However, I don't agree that it's very slow.
  2 个评论
Fan Li
Fan Li 2018-1-6
Hi per isakson
I am using fgetl function which is faster.
per isakson
per isakson 2018-1-6
编辑:per isakson 2018-1-6
That's comparing apples to oranges. I assumed without stating it that the "numerical blocks" should be parsed.

请先登录,再进行评论。


Steven Lord
Steven Lord 2018-1-8
If your data is too large to fit in memory all at once, consider using a datastore. Since you have data in the headers that I assume you want to access, using a TabularTextDatastore probably won't suit your needs. You may need to use a general FileDatastore or develop your own custom datastore using your knowledge of the way your data is formatted.
Once you have a datastore you could use it to create a tall array.
  1 个评论
Fan Li
Fan Li 2018-1-16
编辑:Fan Li 2018-1-16
Hi Steven Lord
I have read the part about tall array and datastore. It is useful for me. But I do not know how to skip the headers which I do not need . The function I am using now and other method provided here is not for datastore. There is limited source for skipping the headers with datastore. So, can you tell me how to skip the headers with the datastore? The format of the data is provided above.
Thanks

请先登录,再进行评论。

类别

Help CenterFile Exchange 中查找有关 Data Import and Export 的更多信息

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by