How to read large text data into MATLAB
Hi everyone, I have a text file of up to 10 GB that has to be read into MATLAB. Part of the data is listed below:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935
2242 -25.5898 -24.3081 -29.3729 0.00671099 0.00205397 -0.0108453
2243 9.2867 27.1493 -37.9274 -0.00115821 0.00912371 -0.00178601
2244 3.89714 -48.5019 70.5903 0.0041159 -0.00255481 -0.0029498
2245 49.8803 -40.1819 -5.30361 -0.0106695 0.0224494 0.00918698
2246 0.22115 -19.9758 -2.30173 0.0190817 0.0262146 -0.0153229
2247 -53.6289 50.5517 -23.5032 0.00388499 -0.00559089 0.000787281
.
.
.
.
.
.
.
.
ITEM: TIMESTEP
10
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 -50.0606 -93.6118 -70.4534 0.000504085 -0.00684199 -0.00394166
2242 -14.4928 20.0993 3.55963 0.00244236 0.00203074 -0.0162865
2243 -2.64823 8.26566 23.6457 -0.000503352 0.0140246 -0.00909782
2244 -153.189 40.6383 -12.0141 0.00192712 -0.00177534 -0.00194966
2245 35.0712 -14.4107 6.31868 0.00668828 0.012556 0.00468532
2246 22.0675 -14.7867 61.4774 0.0182799 0.0194239 -0.00942033
2247 -3.80959 -88.6786 1.61222 0.00459477 -0.00577238 0.000324204
2248 -18.4777 -9.35017 -1.12766 0.0146401 0.00924069 -0.00730373
2249 16.2354 -7.34658 -25.1694 -0.0169203 0.0249397 0.0085598
2250 110.508 19.9749 -4.95758 -0.00500049 0.000961677 0.00667405
2251 -7.46059 3.35324 -41.665 0.0175383 -0.00791068 -0.00702065
Basically, the file consists of many blocks, each starting with "ITEM: TIMESTEP". For each block I have to skip the first 9 lines and then read the remaining lines.
I tried the textscan function (maybe I misused it), but it is very slow. Is there a faster way to do this in MATLAB?
Accepted Answer
Cedric
2018-1-6
Edited: Cedric
2018-1-6
If you have enough RAM for this, the following could run a little faster. It is way less versatile than Per's solution though, and exploits specific characters present in the header. You may have to adapt it a bit if there are e.g. other types of header content:
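% Read the whole file into memory as one character vector.
content = fileread( 'data.txt' ) ;
% A block ends just before the next 'ITEM: TIMESTEP' header; the last block ends at the end of the file.
blockEnds = strfind( content, 'ITEM: T' ) - 1 ;
blockEnds = [blockEnds(2:end), numel( content )] ;
% The numeric data start right after the '6]' that closes the 'ITEM: ATOMS ...' header line.
blockStarts = strfind( content, '6]' ) + 3 ;
nBlocks = numel( blockStarts ) ;
data = cell( nBlocks, 1 ) ;
fprintf( '%d blocks found.\n', nBlocks ) ;
for bId = 1 : nBlocks
    % Parse all numbers of the block, then reshape them into rows of 7 columns.
    data{bId} = reshape( sscanf( content(blockStarts(bId):blockEnds(bId)), '%f' ), 7, [] ).' ;
end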
PS: this takes < 20s for a 1GB data file on a small laptop (with 32GB RAM though).
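Since the approach above skips the headers entirely, the timestep of each block is not kept. A minimal sketch of one way to recover it from the same `content` string with `regexp` follows. The small in-memory sample stands in for `fileread( 'data.txt' )`, its numeric values are made up, and the variable names `timesteps` and `nAtoms` are illustrative, not part of the original answer.

```matlab
% Tiny stand-in for content = fileread( 'data.txt' ); values are invented.
hdr = ['ITEM: NUMBER OF ATOMS\n2\nITEM: BOX BOUNDS pp pp ff\n' ...
       '0 1\n0 1\n0 1\n' ...
       'ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] ' ...
       'c_water_force[4] c_water_force[5] c_water_force[6]\n'];
content = sprintf(['ITEM: TIMESTEP\n0\n'  hdr '1 1 2 3 4 5 6\n2 7 8 9 10 11 12\n' ...
                   'ITEM: TIMESTEP\n10\n' hdr '1 6 5 4 3 2 1\n2 12 11 10 9 8 7\n']);

% Capture the integer on the line that follows each header keyword.
tok       = regexp( content, 'ITEM: TIMESTEP\s+(\d+)', 'tokens' );
timesteps = cellfun( @(c) str2double( c{1} ), tok );

tok    = regexp( content, 'ITEM: NUMBER OF ATOMS\s+(\d+)', 'tokens' );
nAtoms = cellfun( @(c) str2double( c{1} ), tok );
```

With the real file, `timesteps(bId)` would then label `data{bId}` from the loop above, assuming every block has both header entries.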
2 Comments
Abdullahi Samantar
2018-12-12
Hi Fan Li,
How did you adapt Cedric's code to read your large text file (LAMMPS)?
I couldn't figure it out.
Thank you
More Answers (2)
per isakson
2018-1-5
Edited: per isakson
2018-1-6
Given:
- All headers consist of 9 lines
- All data blocks consist of 7 columns of numerical data
- The blocks of numerical data should be converted to double. (Added later.)
- The columns of the data are separated by space, char(32)
- There is enough RAM to store the parsed data. Nearly 10 GB is needed to store it as double. Single would introduce a rounding error.
Try:
>> cac = cssm( 'cssm.txt' );
>> whos cac
Name Size Bytes Class Attributes
cac 1x6 3696 cell
>> cac
cac =
[7x7 double] [11x7 double] [7x7 double] [11x7 double] [7x7 double] [11x7 double]
>>
>> cac{1}
ans =
1.0e+03 *
2.2410 0.0515 -0.0480 -0.0556 0.0000 -0.0000 -0.0000
2.2420 -0.0256 -0.0243 -0.0294 0.0000 0.0000 -0.0000
2.2430 0.0093 0.0271 -0.0379 -0.0000 0.0000 -0.0000
2.2440 0.0039 -0.0485 0.0706 0.0000 -0.0000 -0.0000
2.2450 0.0499 -0.0402 -0.0053 -0.0000 0.0000 0.0000
2.2460 0.0002 -0.0200 -0.0023 0.0000 0.0000 -0.0000
2.2470 -0.0536 0.0506 -0.0235 0.0000 -0.0000 0.0000
>>
where
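function cac = cssm( ffs )
    % Read every numeric block of the file ffs into one cell of cac.
    fid = fopen( ffs );
    cac = cell(1,0);
    while not( feof( fid ) )
        % Skip the 9 header lines, then read 7 numeric columns up to the next header.
        cac(1,end+1) = textscan( fid, '%f%f%f%f%f%f%f'   ...
            , 'Headerlines',9, 'CollectOutput',true );
    end
    fclose( fid );
end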
and cssm.txt contains three copies of the data in the question.
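The storage and precision remarks in the last assumption listed above can be checked with a quick sketch. It uses the first data row from the question; the variable names are illustrative, and the byte count per text row assumes one newline character per line.

```matlab
% First data row of the question, as it appears in the text file.
rowText = '2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935';

bytesText   = numel( rowText ) + 1;   % characters plus one newline (assumed LF)
bytesDouble = 7 * 8;                  % 7 columns, 8 bytes per double
bytesSingle = 7 * 4;                  % 7 columns, 4 bytes per single
fprintf( 'text: %d B, double: %d B, single: %d B per row\n', ...
         bytesText, bytesDouble, bytesSingle );

% Round-trip one value through single to see the rounding error.
x      = 0.00121546;
xs     = double( single( x ) );
relErr = abs( xs - x ) / abs( x );
fprintf( 'relative rounding error in single: %.3g\n', relErr );
```

A double row is only slightly smaller than its text form here, which is consistent with needing nearly as much RAM as the file size.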
"textscan [...] is very slow. Is there a faster way [...] Matlab?" AFAIK: No, not significantly faster. However, I don't agree that it's very slow.
2 Comments
per isakson
2018-1-6
Edited: per isakson
2018-1-6
That's comparing apples to oranges. I assumed without stating it that the "numerical blocks" should be parsed.
Steven Lord
2018-1-8
If your data is too large to fit in memory all at once, consider using a datastore. Since you have data in the headers that I assume you want to access, using a TabularTextDatastore probably won't suit your needs. You may need to use a general FileDatastore or develop your own custom datastore using your knowledge of the way your data is formatted.
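As a rough illustration of processing the file one block at a time while keeping the header information, here is a minimal sketch. It uses plain low-level file I/O (`fgetl` and `textscan`) rather than the datastore API mentioned above, and the helper name `readNextBlock` is hypothetical. It assumes the 9-line header layout shown in the question, with the atom count on the 4th header line and 7 numeric columns per row; such a function could also serve as the basis of a custom read function.

```matlab
function [timestep, data] = readNextBlock( fid )
% Read the next 'ITEM: TIMESTEP' block from an open file (sketch).
% Returns timestep = [] and data = [] when no further block is found.
%
% Possible usage:
%   fid = fopen( 'data.txt', 'r' );
%   [t, d] = readNextBlock( fid );
%   while ~isempty( t )
%       % ... process block d for timestep t here ...
%       [t, d] = readNextBlock( fid );
%   end
%   fclose( fid );
    timestep = [];
    data     = [];
    % Advance to the next block header (fgetl returns -1 at end of file).
    line = fgetl( fid );
    while ischar( line ) && ~strncmp( line, 'ITEM: TIMESTEP', 14 )
        line = fgetl( fid );
    end
    if ~ischar( line )
        return
    end
    timestep = str2double( fgetl( fid ) );   % header line 2: timestep value
    fgetl( fid );                            % header line 3: 'ITEM: NUMBER OF ATOMS'
    nAtoms   = str2double( fgetl( fid ) );   % header line 4: number of rows
    for k = 1 : 5                            % header lines 5-9: box bounds, column names
        fgetl( fid );
    end
    % Read exactly nAtoms rows of 7 columns into one numeric matrix.
    c    = textscan( fid, '%f%f%f%f%f%f%f', nAtoms, 'CollectOutput', true );
    data = c{1};
end
```

Only one block is held in memory at a time, so the caller decides what to keep (for example a per-timestep summary) instead of storing all parsed data.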