Extract atom trajectory data from an extra-large (~20GB) text file

Updated 21:00, 14 April 2018, to clear up some confusion dpb pointed out. Updated parts are highlighted in bold.
Dear friends,
I am currently running some molecular simulations and I need to analyze the movement of each individual atom.
I got a HISTORY file from the simulation, which is in an irregular-looking format like the following (you can now see a sample HISTORY in the attached file HISTORY Trial.txt):
For those really willing to help, a complete short HISTORY file of around 3.6GB is also available:
ika9108
0 1 18720 10001 374477446
timestep 0 18720 0 1 0.000500 0.000000
57.9307218288 0.0000000000 0.0000000000
0.0000000000 57.9307218288 0.0000000000
0.0000000000 0.0000000000 57.9307218288
C 1 12.000000 1.123285 0.000000
-15.58443309 -26.16046542 -14.42223305
O 2 16.000000 -1.041095 0.000000
-14.69649899 -26.77784011 -13.67121262
O 3 16.000000 -1.041095 0.000000
-16.81540951 -26.13672909 -14.15028633
O 4 16.000000 -1.041095 0.000000
-15.19910302 -25.79374370 -15.56280780
C 5 12.000000 1.123285 0.000000
-16.61260265 -24.19749101 3.305309244
O 6 16.000000 -1.041095 0.000000
-15.74221447 -23.85292172 4.202348062
O 7 16.000000 -1.041095 0.000000
-16.55264265 -23.67515089 2.115672369
O 8 16.000000 -1.041095 0.000000
-17.54419044 -25.03709893 3.579123068
.
.
.
Ca 11521 40.080000 2.000000 0.000000
-18.93093222 17.98682377 -15.79782631
Ca 11522 40.080000 2.000000 0.000000
19.11661464 -20.33590673 -23.44339428
.
.
.
O_spc 14401 15.999400 -0.820000 0.000000
9.099065343 28.15242293 12.96874971
H_spc 14402 1.008000 0.410000 0.000000
10.11816248 28.04847873 12.79296953
H_spc 14403 1.008000 0.410000 0.000000
8.553146110 28.43604932 12.08437008
O_spc 14404 15.999400 -0.820000 0.000000
-20.67489325 -6.313716149 18.72893163
H_spc 14405 1.008000 0.410000 0.000000
-21.01831712 -6.604184870 19.66593064
H_spc 14406 1.008000 0.410000 0.000000
-20.45237732 -5.303498198 18.81546735
.
.
.
timestep 2000 18720 0 1 0.000500 1.000000
57.9125298023 0.0000000000 0.0000000000
0.0000000000 57.9125298023 0.0000000000
0.0000000000 0.0000000000 57.9125298023
C 1 12.000000 1.123285 0.194440
-15.59133022 -26.23975304 -14.24546014
O 2 16.000000 -1.041095 0.364883
-14.92875146 -26.92787946 -13.43967731
O 3 16.000000 -1.041095 0.064554
-16.84330237 -26.13216641 -14.09355464
O 4 16.000000 -1.041095 0.356997
-15.00432807 -25.74003158 -15.26105138
.
.
.
As shown, the first 2 lines give the name of the system (ika9108) and the number of atoms (18720 in total); they can be ignored.
Line 3 indicates the current frame in the HISTORY; in this case (frame 1) it starts from timestep 0, i.e. 0*0.5 [fs], and frame 2, shown next, starts from timestep 2000, i.e. 2000*0.5 [fs]. Each timestep is 0.5 femtoseconds; hence 2000 timesteps are 1000 [fs], equal to 1 [ps]. The simulation advances every timestep, but records one frame only after every 2000 timesteps.
Lines 4 to 6 can be ignored; they describe the volume of the system. The later lines show:
C (a carbon atom), 1 (the index of this atom), 12.000000 and so on (atomic mass and other values I don't care about)
-15.584 (x axis), -26.160 (y axis), -14.422 (z axis)
Likewise, O stands for an oxygen atom, 2 means it is the second atom, and everything else follows the same pattern.
After finishing all 18720 atoms in the current frame, the HISTORY moves to the next frame. It repeats all 18720 atoms again, but with new x, y, z locations. This process repeats thousands of times until it reaches 10 nanoseconds.
Hence, I need to extract the changing locations of one particular atom out of the 18720, throughout the whole HISTORY file (that is, across all frames); for example:
timestep 0 (frame 1), C, 1, x1, y1, z1
timestep 2000 (frame 2), C, 1, x2, y2, z2
timestep 4000 (frame 3), C, 1, x3, y3, z3
and so on (frame x)
But I have no idea how to do this in MATLAB. Searching the internet turned up commands like textscan, but its formatSpec reads only one line at a time, whereas I need to identify the atom on one line and extract its coordinates from the next. How can I manage this?
Later on I need to store these displacements and perform a mean square displacement calculation on them.
Thanks for helping. Any suggestion is welcome.
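To show that the two-line record pattern is straightforward for any line-by-line reader, here is a minimal sketch in Python (purely illustrative; the MATLAB approaches are discussed in the comments below). The function name `extract_atom` and the tiny inline sample are my own, not from a real HISTORY file: scan line by line, remember the current timestep, and when the header line of the wanted atom appears, take the coordinates from the following line.

```python
import io

def extract_atom(lines, name, rank):
    """Yield (timestep, x, y, z) for one atom across all frames.

    `lines` is any iterable of text lines in HISTORY order: a
    'timestep ...' line starts each frame; each atom is a 2-line
    record ('NAME RANK mass charge ...' then 'x y z')."""
    want = None          # set after the matching header line is seen
    step = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == 'timestep':
            step = int(parts[1])
        elif want is not None:
            # this line holds the coordinates of the wanted atom
            x, y, z = map(float, parts[:3])
            yield (step, x, y, z)
            want = None
        elif parts[0] == name and len(parts) > 1 and parts[1] == str(rank):
            want = rank  # the next line holds the coordinates

sample = """ika9108
0 1 4 2 1
timestep 0 4 0 1 0.000500 0.000000
57.9 0.0 0.0
0.0 57.9 0.0
0.0 0.0 57.9
C 1 12.000000 1.123285 0.000000
-15.58 -26.16 -14.42
O 2 16.000000 -1.041095 0.000000
-14.69 -26.77 -13.67
timestep 2000 4 0 1 0.000500 1.000000
57.9 0.0 0.0
0.0 57.9 0.0
0.0 0.0 57.9
C 1 12.000000 1.123285 0.194440
-15.59 -26.23 -14.24
O 2 16.000000 -1.041095 0.364883
-14.92 -26.92 -13.43
"""
traj = list(extract_atom(io.StringIO(sample), 'C', 1))
print(traj)   # [(0, -15.58, -26.16, -14.42), (2000, -15.59, -26.23, -14.24)]
```

For a 20GB file this streams with constant memory; only the speed of scanning every line is in question, which is what the discussion below turns on.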

39 Comments

Well, actually, the format is pretty regular; it's a header section followed by timestep sections. Within each timestep section are a number of subsections, one for each atom in the simulation. What one does is write a format and a textscan for each of those, from the inside out, and then put them into a loop. In your case you're very fortunate in that the count for each subsection loop is also known (it can be read from the header).
The form of the routine to read such a file is thus independent of the number of elements and timesteps: simply read the individual element position data section for each atom, which is the same format repeated N times--18720 in your example.
The question then is whether you want one specific atom, all N, or some particular subset of M-out-of-N; simplest would be to return them all from the file and then discard the unwanted ones. What might become a problem with that approach is the total memory required if there are a very large number of timesteps -- I couldn't decipher that part of the output; is the simulation length shown somewhere in the header?
What would be easiest is if you could run a sample simulation of a very small system (an H2O molecule, maybe?) for a short time, so there was a complete file you could attach. Since the format is repeated, the overall size is immaterial; it's the form itself that's needed.
One could make assumptions and just edit the text you've pasted, but without knowing the model it's probable a mistake would creep in.
I don't quite follow the timestep nomenclature; can you amplify on it a little? You wrote "the next timestep shown later was 1000 (2000*0.5fs) fs," but the example you give shows
timestep 0, C, 1, x1, y1, z1
timestep 2000, C, 1, x2, y2, z2
so is the 2000 the time, or a count with the actual time being 1000? And how does one know when one is done -- by looking at the time/timestep value, or by just running out of data at end-of-file?
Thank you for replying.
My system is CaCO3·nH2O, a.k.a. amorphous calcium carbonate.
The particular example I used in my previous post is CaCO3·0.5H2O; I also have up to CaCO3·1.3H2O, and furthermore some systems with magnesium impurities.
The movement of the system is simulated every 0.5 [fs] and recorded to the HISTORY file every 1 [ps], where 1 [ps] = 0.5 [fs] * 2000. Hence the first timestep is 0, and the next one jumps to 2000.
The only purpose of extracting the timestep is to ensure there are no missing steps in the process.
As my professor only provided me the model of CaCO3·H2O, and I unfortunately have no knowledge of creating simulation molecules, I simply cannot generate a small file here for inspection.
Alternatively, if you have time, I can upload the smallest HISTORY file I have on hand (less than 4 GB) to Google Drive, which you can download and have a look at.
I forgot to mention that most of the HISTORY files are around 20GB, containing 10 [ns] of data, so each atom appears at least 10,000 times in the whole file; in the earlier example that means 18720 atoms, each repeated 10,000 times.
For my study I need to investigate the activity of the H2O molecules, so I mostly focus on the atoms named O_spc. And yes, I need one particular O_spc, for example O_spc 14401; but moreover, I also need to repeat this for every O_spc atom. So ideally, if there are 1000 O_spc atoms, I need each of them separately to calculate its mean square displacement. Then I can identify the water molecules with abnormal activity. Is that clearer?
You managed to snip sections out of that file; do the same to create a smaller sample file for us that looks like a full file. Even though it wouldn't be complete for a real simulation, it would have all the needed characteristics; you'd simply have to "fudge" the number of atoms to match what the record offsets actually would be vis-a-vis a complete file, but that wouldn't affect the logic in general.
Say, take the 16 or so elements you have from the first timestep above, repeat that for four or five timesteps, save that small section as the trial file, and attach it with the paperclip icon.
I believe with that we can fake enough in the header to give you an example to work from. NB: these files will undoubtedly take a long time to process directly via textscan. Has your prof gotten any of this done previously, or is this the first stab at post-processing these files? Perhaps a former grad student has already built some tools? Or, if this simulation is a known tool in the area, might there already be toolsets built for it?
Lacking those (why reinvent the wheel if you don't have to?), my other thought, if you are interested in only a small number of specific records out of the total, is that it might prove faster overall to pull out just those records by counting where they are in the file, since the occurrences of a given atom are N atoms*records/timestep apart; which records are needed is directly computable. Extracting those into a much smaller file to actually read might take much less time overall than parsing the whole file; it would take some trial timings to tell...
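The counting idea above can be written down concretely. With 2 file-header lines, and each frame consisting of one timestep line, three cell lines, and a 2-line record per atom, the line number of any atom's record in any frame is a closed-form expression. A sketch of the arithmetic (the function name `atom_line` is mine, and the layout assumptions are as just stated):

```python
def atom_line(frame, atom, n_atoms):
    """1-based line number of the header line for `atom` (1-based id) in
    `frame` (1-based), assuming: 2 file-header lines, then per frame one
    'timestep' line, three cell lines, and a 2-line record per atom."""
    frame_len = 2 * n_atoms + 4                  # lines per frame
    return 2 + (frame - 1) * frame_len + 4 + 2 * (atom - 1) + 1

N = 18720
print(atom_line(1, 1, N))      # 7  -> 'C 1 ...'; its coordinates are on line 8
print(atom_line(2, 1, N))      # 37451, i.e. 7 + (2*N + 4)
print(atom_line(1, 14401, N))  # 28807, the 'O_spc 14401' header in frame 1
```

Which lines to extract for any atom or frame is therefore pure arithmetic; no searching of the file is needed at all.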
Okay, I managed to create a small sample which contains fragments from 5 frames (each frame is recorded after every 2000 timesteps, so frame 1 = timestep 0, frame 2 = timestep 2000; hopefully it's clear now).
I will upload the txt both in this comment and in the main post.
To further explain some atom nomenclature in the HISTORY file:
C: carbon; O: oxygen; Ca: calcium. These 3 form CaCO3.
O_spc: oxygen in H2O; H_spc: hydrogen in H2O. These 2 form water.
And if you look through the trial HISTORY, I picked a few headers from each atom type to mimic the entire system in one frame. Hopefully it makes more sense now.
About my professor... Well, we do have a PhD who used to work in a related area. But his interest at the time was in another aspect of the simulation, which did not focus on individual water movement.
Basically, my project is built on their legacies, but none of them did the same thing in the past. So I lack a proper processing tool to analyze the data. Furthermore, they use FORTRAN for programming, so they want me to handle this part myself.
Which, more unfortunately, the university's one-semester MATLAB course did not cover.
How large are these text files? Does the largest file fit comfortably in physical memory?
He mentioned earlier most are in the 20GB range, so "comfortably" is probably not the word...but I guess that depends on just what kind of workstation he has access to.
Well, luckily my desktop is powerful enough: 64GB of physical memory and a 16-core Threadripper. As long as that is sufficient for MATLAB.
Would it be useful to assign the data to a structure array?
  • One structure per "frame" of data. "frame" being the data between one string "timestep" and the following string, "timestep".
  • Would it be robust to use the string, "timestep" as delimiter between "frames"?
  • Would C,O,Ca,H_spc,O_spc,... be appropriate to use as field names?
Hey, per isakson, thank you for replying.
Yes, I believe "timestep" would be an appropriate delimiter. You can see from my sample file that each frame announces itself at its start with "timestep X"; but be aware the frame is not re-indicated for every atom within it.
I am not sure what you mean by field names. You could use them to tell which type of atom it is, but I need both its type and its number (rank); for example, "C 1", "O_spc 14401". I hope the script can separate the trajectories of each atom into individual text files (that means 18720 different files), not mix all the O_spc atoms or Ca atoms together.
My current thought is to arrange the data into a structure array like you said, probably in this format:
Timestep | Type of Atom | Rank of Atom | X-axis | Y-axis | Z-axis
an example would be:
timestep 0 | C | 1 | -15.58443309 | -26.16046542 | -14.42223305
Then I can use MATLAB or Excel to calculate the mean square displacement from the locations.
If you can indeed suck the entire file into memory, then my idea of simply computing the index of the desired lines comes into play. For example, say (just for simplicity of explanation) you want the displacement of the first C atom: the first atom position line is always at record 7 (2 header lines plus the first 4-line timestep block, plus 1). Since the actual file has N=18720 atoms and each atom contributes two records, every subsequent C (1) atom displacement line is N*2+4 records past the first. If the data are in a cellstr array, say, then
Catoms=data(7:N*2+4:end);
would select those lines, which can then be directly parsed.
A similar offset for the time record is just as directly computed.
If you want/need a given group of atoms, simply passing a list of their IDNUM would let you either iterate over that list and select each individually or all as a group depending upon which is more convenient for the subsequent analysis.
I assume that this represents all the data you want to retrieve:
timestep 0 (frame 1), C, 1, x1, y1, z1
timestep 2000 (frame 2), C, 1, x2, y2, z2
timestep 4000 (frame 3), C, 1, x3, y3, z3
and so on (frame x)
Admittedly, I'd rather ask than read.
A 20GB text file is not a convenient data store to work with. Depending on how many times you will retrieve data, it might be good to transfer the data to some other type of file. It might, however, be premature to decide on that now.
The position of one atom as a function of time can efficiently be stored in an <Nx4 double> matrix, where N is the number of frames. The columns would be time, X, Y, Z. Matlab has something called a floating point integer, "flint", which helps avoid floating point errors in the time column.
One problem with a matrix is that there is no good place for metadata. A little metadata may be squeezed into the name of the variable. However, I think the structure from my previous comment is less appropriate for storing this limited data.
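The flint point can be illustrated outside MATLAB too: IEEE doubles represent integer values exactly up to 2^53, so storing the time (a multiple of 0.5 fs) in the first column of an N-by-4 array loses nothing. A small Python sketch of the proposed layout, with a list of rows standing in for the <Nx4 double> (the sample numbers are made up):

```python
# Each frame contributes one row [time_fs, x, y, z] for the tracked atom.
steps = [0, 2000, 4000]                     # timestep counters, one per frame
coords = [(-15.58, -26.16, -14.42),
          (-15.59, -26.23, -14.24),
          (-15.60, -26.30, -14.10)]
M = [[s * 0.5] + list(xyz) for s, xyz in zip(steps, coords)]  # 0.5 fs per step

# Integer-valued doubles ("flints") are exact, so the time column can be
# compared with == without floating-point surprises.
assert all(row[0] == int(row[0]) for row in M)
print(M[1])   # [1000.0, -15.59, -26.23, -14.24]
```

That makes the "no missing steps" check trivial: consecutive times must differ by exactly 1000.0 fs.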
Would a function like this be helpful?
function M = cssm( filespec, kind, id )
% filespec e.g. 'h:\m\cssm\HISTORY_Trial.txt'
% kind kind of atom, i.e. C/O/Ca
% id integer
% M <Nx4 double>
end
Big thanks to both of you,
I am trying what dpb said, opening the entire HISTORY file with MATLAB, but it still failed after I increased the maximum memory size to 40GB by the method the MATLAB Help provides: -Xmx40960m
As for what Isakson said, I need to look into how this function works, especially since I have never used cssm before.
Like, how do I get this M (matrix) in the first place?
Yeah, your assumption is right. Just for more clarity:
timestep 0 (frame 1), C, 1, X1, Y1, Z1
timestep 2000 (frame 2), C, 1, X2, Y2, Z2
timestep 4000 (frame 3), C, 1, X3, Y3, Z3
.
.
.
timestep 2000*(n-1) (frame n), C, 1, Xn, Yn, Zn
  • "cssm" stands for computer-science-software-matlab. I use that name as a generic function name when answering questions on the net.
  • My intention is to implement that function once you confirm that I understood your requirements.
Ah, per isakson, I see. xD I think it's good enough, since I can change the parameters to suit every atom. Thank you so much for your effort.
Well, that's a bummer!!! To illustrate, with the test file one can do the following--
>> file=textread('historytrial.txt', '%s', 'delimiter', '\n', 'whitespace', '');
>> N=50; % the sample file N
>> ix=3:N*2+4:length(file); % index to time blocks
>> file(ix)
ans =
5×1 cell array
'timestep 0 18720 0 1 0.000500 0.000000'
'timestep 2000 18720 0 1 0.000500 1.000000'
'timestep 4000 18720 0 1 0.000500 2.000000'
'timestep 6000 18720 0 1 0.000500 3.000000'
'timestep 8000 18720 0 1 0.000500 4.000000'
>> ixO=43; % the equivalent atom number in trial file; would be actual in real one
>> file(ix+2*ixO)
ans =
5×1 cell array
'O_spc 14401 15.999400 -0.820000 0.000000 '
'O_spc 14401 15.999400 -0.820000 0.499808 '
'O_spc 14401 15.999400 -0.820000 0.331586 '
'O_spc 14401 15.999400 -0.820000 0.168505 '
'O_spc 14401 15.999400 -0.820000 0.260819 '
>> cellfun(@(s) textscan(s,fmt),file(ix+2*ixO),'uniformoutput',0)
ans =
5×1 cell array
{1×5 cell}
{1×5 cell}
{1×5 cell}
{1×5 cell}
{1×5 cell}
>> ans{:}
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0]
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0.499808000000000]
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0.331586000000000]
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0.168505000000000]
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0.260819000000000]
>>
NB Per: There's an error in the sample file; the second displacement record is missing for atom 20 in the next-to-last section. I had to insert it; if you actually do a search instead, it probably won't affect your solution.
NB:
I used the venerable textread here solely because it saves the fopen/fclose pair and I knew this trivial file could fit entirely in memory. textscan could be used similarly, reading as large a section as comfortably fits in the memory actually available: simply read the first two header records first, then as many timestep groups as desired; their size can be computed after reading the actual N for the file.
It has no bearing on reading the file, but I find it interesting that there's displacement only in the z direction; is this a limitation of the model, or is it expected for the sample problem for some reason? Just curious...
I failed to download from your Google Drive; a sign-in was required. Did you share it with everyone? I just sent a request via mail.
Dear dpb,
I guess you are misunderstanding a concept? The 3 columns following 14401 (the serial/rank of the atom) are not needed in my calculation. 15.9994 is the atomic mass of oxygen; I am not sure about the other two, but I guess they're not relevant. The line after every "O_spc 14401" header holds the atom's location in 3 dimensions; that is what I really need to record. Per's version gets it right.
I have not checked his work on my large file yet. I will report once I have results.
Yes and no; I mixed metaphors, so to speak. The example was simply showing that one can compute the wanted lines directly; the index shown points at the element line, and the displacement line is that index+1.
I then forgot about that later and wasn't thinking about what those numbers were when you asked the question...an indication that I am getting old and senile... :) (or :( more realistically, I guess).
I was kind of waiting to see if you could indeed load a full file into memory before going any further; there's no conceptual problem in parsing the file beyond what's been shown, either way, if you can suck it all into memory. It then simply becomes a timing question of whether using an indexed subset is sufficiently faster than regexp to decide which way to proceed.
The step if you can't read the full file in one gulp is to revert to a lower level of reading and to process in memory-sized chunks; that's doable. The outline: read the header, compute the size of one timestep in the file, then read an integral number of timesteps that will fit. That size and the number of steps per chunk depend on the problem size and how much memory you can actually use effectively...
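The chunking outline above--read the header, compute the size of one timestep, then read an integral number of timesteps per gulp--can be sketched as follows (a Python illustration under the same layout assumptions as before; the generator name `frames` and the synthetic five-frame sample are mine):

```python
import io
from itertools import islice

def frames(fh, n_atoms, frames_per_chunk=100):
    """Yield lists of lines, each holding `frames_per_chunk` whole frames.

    Skips the 2-line file header, then slices off an integral number of
    frames per chunk so no atom record is ever split across chunks."""
    next(fh); next(fh)                       # skip the 2 header lines
    frame_len = 2 * n_atoms + 4              # timestep + 3 cell + 2*n_atoms lines
    while True:
        chunk = list(islice(fh, frame_len * frames_per_chunk))
        if not chunk:
            return
        yield chunk

# Demo on an in-memory file: a 2-atom system with 5 frames of 8 lines each.
n = 2
body = []
for f in range(5):
    body.append(f"timestep {2000 * f} {n} 0 1 0.0005 {float(f)}")
    body += ["57.9 0 0", "0 57.9 0", "0 0 57.9"]
    for a in range(1, n + 1):
        body += [f"C {a} 12.0 1.1 0.0", "0.0 0.0 0.0"]
text = "ika9108\n0 1 " + str(n) + "\n" + "\n".join(body) + "\n"

sizes = [len(c) for c in frames(io.StringIO(text), n, frames_per_chunk=2)]
print(sizes)   # [16, 16, 8] -- two full chunks plus a final partial one
```

Each chunk can then be parsed in memory (or handed to a worker), with memory use bounded by `frames_per_chunk` rather than the file size.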
Hey there. I see. xD It's okay, I am feeling I am getting old too...
I mentioned in one comment that somehow I was not able to open the large HISTORY file in MATLAB, even though I increased the maximum memory as the MATLAB Help suggests. Maybe I am doing it the wrong way? I simply double-click the file in the MATLAB Current Folder pane to open it. After I type the "memory" command in MATLAB, it shows me:
Maximum possible array: 55360 MB (5.805e+10 bytes) *
Memory available for all arrays: 55360 MB (5.805e+10 bytes) *
Memory used by MATLAB: 1994 MB (2.091e+09 bytes)
Physical Memory (RAM): 65417 MB (6.860e+10 bytes)
* Limited by System Memory (physical + swap file) available.
And my biggest file is only 25GB, which should fit within the memory MATLAB is allowed to take. Maybe my swap file (page file) is not big enough?
Currently, I've managed to use my little Python knowledge to grab the data and record it into Excel, at a very slow speed: about 4 seconds to scan through one frame and grab only one type of atom, so approximately half a day for one HISTORY file. xD And it only utilises 4% of my CPU and 50MB of memory. I need to learn how to code for multiple cores. lol
This is the out of memory error message:
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at javax.swing.JTable.getSelectedRows(JTable.java:2268)
at com.mathworks.mwswing.MJTable.getSelectedRows(MJTable.java:367)
at com.mathworks.widgets.spreadsheet.HeaderRenderer.getTableCellRendererComponent(HeaderRenderer.java:264)
at javax.swing.plaf.basic.BasicTableHeaderUI.getHeaderRenderer(BasicTableHeaderUI.java:702)
at javax.swing.plaf.basic.BasicTableHeaderUI.paintCell(BasicTableHeaderUI.java:709)
at javax.swing.plaf.basic.BasicTableHeaderUI.paint(BasicTableHeaderUI.java:652)
at javax.swing.plaf.ComponentUI.update(ComponentUI.java:161)
at javax.swing.JComponent.paintComponent(JComponent.java:780)
at javax.swing.JComponent.paint(JComponent.java:1056)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JLayeredPane.paint(JLayeredPane.java:586)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JLayeredPane.paint(JLayeredPane.java:586)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JComponent.paintToOffscreen(JComponent.java:5210)
at javax.swing.RepaintManager$PaintManager.paintDoubleBuffered(RepaintManager.java:1579)
at javax.swing.RepaintManager$PaintManager.paint(RepaintManager.java:1502)
at javax.swing.RepaintManager.paint(RepaintManager.java:1272)
at javax.swing.JComponent._paintImmediately(JComponent.java:5158)
at javax.swing.JComponent.paintImmediately(JComponent.java:4969)
at javax.swing.RepaintManager$4.run(RepaintManager.java:831)
at javax.swing.RepaintManager$4.run(RepaintManager.java:814)
at java.security.AccessController.doPrivileged(Native Method)
"...java.lang.OutOfMemoryError: Java heap space"
That looks to me like internals of the Java installation getting in the way--I've no clue how or whether one can do anything about it.
I suspect the builtin "easy use" tools aren't helping; whether it will be able to load the full file directly or not I don't know, but try starting a brand new session of MATLAB, then from the command line:
data=readfile('AHistoryFile.txt');
That gives you the best possible shot at it, avoiding memory fragmentation from prior use and loading the character image of the file as a character stream--simply the on-disk image in memory, without conversion.
It is certainly so that we can speed this up immensely over what you're doing at the moment; just how complicated it needs to get depends on the result of the above test. I have nothing coming even close to 64GB of real memory here, so I can't do anything comparably useful.
I guess it would be of interest to know which MATLAB release and which OS?
After further research on the internet I found the way to increase the Java heap space. Now MATLAB just sits there loading the huge file (cry). I will try the way you suggest.
My OS Version: Win 10 Pro, Insider Build: 17133.73
My MATLAB Version: R2018a
Somehow MATLAB doesn't recognise the readfile function; did you mean for me to create a variable called readfile?
Oh, pooh! I meant fileread instead, sorry...there's a Statistics TB routine readFile, but it was/is specific to the old deprecated dataset object that has been replaced by the native table--not what you want at all; it's slower than molasses...
My bad, sorry.
Yeah, Isakson, that's exactly the way I found. However, the default maximum (1/4 of physical memory, in my case 16GB) is still not enough. xD So I manually increased the .prf setting to 30GB, then tried opening the file in MATLAB again. MATLAB then starts to occupy about 10% CPU and 30GB of memory to open the file, and freezes while doing so.
Try just fileread at the command line and see if that cuts out any overhead that might be in the GUI path...I never use the GUI at all, so I really don't know what all those icons are attached to; maybe it's the same thing, I don't know. I just know that at the command line there's nothing else going on behind my back.
I don't think there's any way to make significant use of the multiple cores to read a single huge file; where you could possibly gain is preprocessing to break one huge file into pieces that could then be operated on in parallel. It wouldn't take much Fortran to write a routine to do that, although it's remarkable how quick fgetl can be in MATLAB sometimes; just as Per has shown with regexp on occasion, it's surprising how good it can be if it gets down into the compiled code and stays there; it's essentially the same.
All of these simulations are already run, but how long does it take to generate one? Might it even be possible to have the code break its output up into manageable pieces instead of creating such monstrosities? Or, instead of text files, create special-purpose files of just the raw data as stream binary files?
Now I can verify that it's not possible to load the entire file into memory. xD
I tried your fileread function; it gradually reached 98% of the 64GB memory and then failed. xD
Memory sticks are so expensive nowadays. (Crybaby) And DDR5 is coming in no time (it won't).
So, what are we going to do now?
Problem is, I have no knowledge of FORTRAN. xD
And I have completely forgotten my C# after not using it for 7 years, since I stopped designing my own 2D game. XD
We do what I outlined before: read N time sections at a time, parse them, then go on to the next.
I'll throw an outline together here in a little bit...I started yesterday but held off going further awaiting the results of this experiment.
Just out of curiosity, how much time did it take to get to the crash point?
I've been watching the download of your Google Drive file; it had approached 3.3GB last I looked. I'll also look at writing a Fortran filter to break the file into chunks.
Do you have a range of file sizes from small to humongous at hand? You proved 64GB isn't enough; can you roughly do a binary split of sizes and see where the breaking point is? Of course, you've also got to have enough free memory for the results besides just the raw data when doing the conversion...
Not much, I think around 30 seconds to 1 minute to crash. The memory just climbs from bottom to top, and then the error message pops up saying not enough memory.
Not sure how to do the binary split. Wouldn't it break the structure of the HISTORY?
No, I don't have a full range of file sizes, but I do have files of about 100MB, 5GB, 10GB, 13GB, 16GB, and 25GB.
Oh, that's fast...I'd figured at least several minutes, if not tens of minutes...probably not to worry, then.
I meant a binary split on file size: 64/2 --> ~32GB; if that worked, see how much free memory remained and how large an array you could still create, then go halfway 'twixt 32 and 64, i.e. 48; not splitting the file itself, for that experiment.
To split the file for processing, one would just have the Fortran code start a new file section after however many timesteps fill a file of the chosen size, beginning the next file with the next timestep; timesteps would again be kept intact. I'd probably add one header line, like the existing first line, carrying the text ID plus a counter of the section number; presuming the simulation runs for a preset time, one could even compute the total number of sections of a given size and write it as "N of M". That's all a nicety that isn't needed until the straightforward way is shown not to work or to be too time-consuming--I really figured the time just to read the file would be the real problem. Just for grins, how long does the 25GB one take, and assuming it does load, won't Per's function work on it?
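The splitting described above only needs cut points that fall on "timestep" boundaries; whether the splitter is written in Fortran, GSplit, or anything else, the logic is the same. A Python sketch of just the cut-point logic (the function name, the in-memory demo, and the size threshold are mine; a real splitter would stream each part to its own sub-file with the "N of M" header dpb describes):

```python
def split_at_timesteps(lines, max_lines):
    """Group a HISTORY body into parts made of whole frames, cutting only
    where a 'timestep' line starts a new frame; each part is closed once
    it has reached at least `max_lines` lines."""
    parts, cur = [], []
    for line in lines:
        if line.startswith("timestep") and cur and len(cur) >= max_lines:
            parts.append(cur)        # close the part at a frame boundary
            cur = []
        cur.append(line)
    if cur:
        parts.append(cur)
    return parts

# Demo: a 2-atom system, five 8-line frames.
def make_frame(f):
    return [f"timestep {2000 * f} 2 0 1 0.0005 {float(f)}",
            "57.9 0 0", "0 57.9 0", "0 0 57.9",
            "C 1 12.0 1.1 0.0", "0.0 0.0 0.0",
            "O 2 16.0 -1.0 0.0", "0.0 0.0 0.0"]

body = [ln for f in range(5) for ln in make_frame(f)]
parts = split_at_timesteps(body, max_lines=20)
print([len(p) for p in parts])   # [24, 16] -- cuts land on frame boundaries
```

Since parts close only at frame boundaries, a part can exceed `max_lines` by at most one frame, and every sub-file starts with a "timestep" line that any frame-oriented reader can consume unchanged.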
I am still not sure what you mean by a binary split of the size of files.
I am currently not running much on the computer so the free memory would be 90% of total memory.
Not to worry; I was just thinking that if you had a goodly number of files across the size range from small to large, you could find where you run out of the ability to load by picking files roughly halfway between what did and didn't work, and hone in on the maximum...the question isn't always only what is shown as free memory: to allocate an array, the free memory has to be contiguous.
I understood from earlier postings that many (most?) of these files would be >32GB; these numbers indicate quite a number aren't so big (by your capacity) after all; on my machine they're all just humongous!!! ;)
Oh, I get what you mean. I will try this out tomorrow, and post the result.
"So, what are we going to do now?"
  • Go back to the drawing board. Write a concise requirement specification for READER, including its usage: will it be used a few times and thrown away, or in the analyses of many simulation results over more than half a year? What response times are needed for different queries during the analyses? Reruns for publications? ...
  • Several years ago I read and parsed a 20GB text file on an 8GB desktop. The first step was to split the file into two dozen sub-files with GSplit. With GSplit you can control exactly where the file is split, a bit like the Matlab function strsplit. Next, loop over the sub-files, read and parse with something similar to the cssm of my answer, and save the results to mat-files, et cetera.
  • Now I might use Large Files and Big Data and store the results in one or more HDF files. This would be an occasion to try Matlab's "Big Data" features.
Hey, thanks a lot for helping. It is great to learn new MATLAB tricks.
As Isakson found in his experiment, even a file well below physical memory can exceed what MATLAB will load; maybe MATLAB just wants users to avoid loading mid-to-large files directly into memory.
I saw what Isakson suggested; I will read about this "Large Files and Big Data" toolset in my free time.
And thanks, dpb, for posting another approach; I will try it now and see how it works.
Maybe this question has come to an end. So far my Python script works fine, and I further managed to make it load the entire file into memory to speed things up. I took another approach that separates all O_spc atoms at the same time, though the processing speed is slow compared to your results here.
Still, big thanks to you both!
"I took another approach that separates all O_spc atoms at the same time, though the processing speed is slow compared to your results here."
Which is more convenient going forward: all O_spc in one big file, or being able to handle them individually? What is to be done with the position data once you have it?
It appears to me the function I wrote is fast enough that you can simply call it in a loop over an input list of atom IDs of interest and, in all likelihood, still be done before a Python script that actually reads the whole file gets more than a good start.
As noted, however, there's very little to add to this function to read a list instead of just one atom; then again, there's also little "glue" code needed to build a combined output file from the one-by-one results as it is.
ADDENDUM
I hadn't really looked at the structure inside the model much; I see there are a bunch more water molecules in there than I had guessed. Still:
Ospc=14401:23038;                          % list of special O atom numbers
res=readHist(filename,Ospc(1));            % do first to find out how long it is
OspcPos=zeros(size(res,1),4*length(Ospc)); % preallocate
OspcPos(:,1:4)=res;                        % save first
for k=2:length(Ospc)                       % iterate over the rest
  OspcPos(:,4*k-3:4*k)=readHist(filename,Ospc(k));
end
ADDENDUM 2
Well, I tried the above to see; it turns out that when one begins traversing the file from front to back so many times, the speed does go way down. It appears one should make the modification to read all the wanted atoms in a single traverse...or use a combination of some of the items Per mentioned, such as removing pieces of the file that are known to be of no interest. For this model length(Ospc)/N = 0.125, so 87.5% of the file isn't of interest; that would reduce 3.6GB to 0.45GB. I don't know for sure just how much time the decimation itself would require, though.
I'm disappointed; I was a little naive in presuming that speed would scale roughly linearly with numbers, I guess.


Accepted Answer

ADDENDUM Attached is an updated version that incorporates reading multiple atoms in a single run; run-time now seems essentially linear in nAtoms.
ORIGINAL BEGINS I approached it with the same general concept as outlined before--compute where in the file the wanted values are--except instead of trying to read the whole file or very large chunks into memory and then picking out the wanted sections, I compute the location in the file and just read the desired records directly. I wasn't sure how this would work from a time standpoint; I thought moving the file pointer might be a problem, but it turns out to be blazingly fast...
>> tic;H23040res=readHist('HISTORY.txt',23040);toc
Elapsed time is 0.110633 seconds.
>>
>> [H23040res(end-3:end,1) H23040res(end-3:end,2:4)] % pasted together for easier viewing
ans =
12240000 3.866557282000000 19.186368170000001 6.278881618000000
12242000 4.177788492000000 19.270542259999999 6.410293005000000
12244000 4.425450305000000 19.523942850000001 6.673638392000000
12246000 4.609073503000000 19.099931819999998 5.917552555000000
>>
The function for one atom, given by atom number in the simulation:
function [res,hdr]=readHist(file,atom)
% return position data from simulation for a given atom
% optionally return the header line for ID
  fid=fopen(file,'r');
  l=fgets(fid);            % read the header line, including \n
  % define some constants from the file format
  L=length(l);             % record length including \n
  nTimeLines=4;            % number of lines in each timestep block
  nAtomLines=2;            % number of lines in each atom output block
  hdr=strtrim(l);          % save the header line for ID purposes
  N=readN(fid);            % get the number of atoms in the simulation
  nLinesTimeStep=N*nAtomLines+nTimeLines;
  nBytesTimeStep=nLinesTimeStep*L;
  d=dir(file);
  nTimeSteps=(d.bytes-2*L)/nBytesTimeStep;
  res=nan(nTimeSteps,4);
  idx=ftell(fid);          % pointer to beginning of first time section
  for i=1:nTimeSteps
    res(i,1)=readTime(fid,L);
    res(i,2:4)=readAtomN(fid,L,atom);
    idx=idx+nBytesTimeStep;
    fseek(fid,idx,-1);
  end
  fid=fclose(fid);
end
...
Functions are in the attached full m-file; the above just shows the outline at the higher level, walking through timestep by timestep. The fseek here positions the file pointer just before the next timestep record after having read the previous timestep's wanted atom position data; since each timestep is a fixed size, it is simple to just move to that location relative to the beginning of the file rather than calculate the difference from the present position (although that's just the difference between the total length and how many bytes were moved plus read). It wouldn't be hard to add the facility to read a list of atoms rather than just the one.
I don't disagree with Per's assessment that perhaps some of the Matlab tools for large datasets could be brought to bear, but they seemed to have a pretty steep learning curve for dealing with the structure of the file at hand, being mostly set up for regular tabular data, it appeared.
Here you're fortunate that the model is written in Fortran and uses fixed-width records, so one can compute the exact byte position in the file very easily as a multiple of the record length. The advantage of this approach is that it really doesn't make any difference how long the files are; it appears fast enough to be no problem. This is a moderately new but very mid/low-range 8GB Win7 machine.
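The offset arithmetic behind this can be sketched as follows. The variable names follow readHist above; the record length and atom count here are taken from the sample header and are assumptions for illustration only:

```matlab
% Byte offset of an atom's coordinate line within timestep t (1-based),
% assuming every line is a fixed-width record of L bytes incl. newline.
L = 73;                 % assumed record length (illustrative)
nTimeLines = 4;         % lines in each timestep header block
nAtomLines = 2;         % lines per atom (label line + coordinate line)
N = 18720;              % atoms per timestep (from the sample header)
nBytesTimeStep = (N*nAtomLines + nTimeLines) * L;
t = 1; atom = 5;        % example timestep and atom number
offset = 2*L ...                    % the two file-header lines
       + (t-1)*nBytesTimeStep ...   % preceding timesteps
       + nTimeLines*L ...           % this timestep's header lines
       + (atom-1)*nAtomLines*L ...  % preceding atom blocks
       + L;                         % skip the atom's label line
% fseek(fid, offset, 'bof') would then position at the coordinate line.
```

With fixed-width records this is pure arithmetic, which is why the per-record fseek approach stays fast regardless of file size.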

4 Comments

"I don't disagree with Per's assessment that perhaps some of the Matlab tools for large datasets could be brought to bear,..."
Actually, what this is is a "roll your own" memmapfile implementation for the specific file structure...we did similar machinations back in the mainframe days on mag tape for nuclear cross section multigroup reduction calculations--27 7-track drives all spinning wildly--just like a scene from "Mission Impossible"... :)
One last comment -- as written, there's no error checking -- things like ensuring that the computed nTimeSteps is an integer, to catch files that aren't complete, and the like.
Also, as files get larger, all the byte-addressing arithmetic ought to be in long integers; theoretically I guess one could overflow the precision of a double float.
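A minimal sketch of such a completeness check, assuming the variable names from readHist above (the byte counts here are illustrative, not from the real file):

```matlab
% Hedged sketch: validate that the file size is consistent with the
% fixed-record structure before trusting nTimeSteps.
L = 73;                           % assumed record length incl. newline
nBytesTimeStep = 2733412;         % (N*nAtomLines + nTimeLines) * L for N = 18720
bytes = 2*L + 5*nBytesTimeStep;   % stand-in for d.bytes from dir(file)
nTimeSteps = (bytes - 2*L) / nBytesTimeStep;
assert( mod(nTimeSteps,1) == 0 ...
      , 'readHist:incompleteFile' ...
      , 'File size inconsistent with record structure; file may be truncated' )
```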
Hey, no worries. I am not disappearing. lol
I was quite busy with studying yesterday; exam ahead.
I have just tried your code. It works perfectly fine; I think the time it took to extract one atom was about a few seconds, which is indeed much faster than my python code. All I need to do now is build a loop to let it extract all atoms.
I really appreciate the help from you and Isakson. Can I accept both answers?
OK, my bad...sorry for the rant if misplaced! :)
Unfortunately, I don't believe the Answers forum is sophisticated enough to let more than one Answer be accepted; you can always try it and see. What I do know you can do, if you choose to reward both, is that an Accept followed by an UnAccept does not take away award points...we don't really do things for the points per se, of course, but for the satisfaction of helping and the fun of the challenge.
On the multi-atom case, use the second of the two functions and pass it a vector of the desired atom numbers--it is MUCHO faster than the one-element version in a loop; the latter option is, indeed, unworkable in traversing the file N times.
The two attached files while of the same name aren't the same code; to reduce the confusion factor I'll remove the single-atom one since the multi-atom case reduces to that if only pass a single atom number.


More Answers (2)

First shot. I acknowledge that this might look cryptic, but it seems to work.
>> filespec ='h:\m\cssm\HISTORY_Trial.txt';
>> M = cssm( filespec, 'C', 1 )
M =
1.0e+03 *
0 -0.0156 -0.0262 -0.0144
2.0000 -0.0156 -0.0262 -0.0142
4.0000 -0.0155 -0.0263 -0.0143
6.0000 -0.0156 -0.0261 -0.0140
8.0000 -0.0157 -0.0262 -0.0142
and
>> M = cssm( filespec, 'O', 14 )
M =
1.0e+03 *
0 0.0054 0.0239 -0.0176
2.0000 0.0054 0.0238 -0.0177
4.0000 0.0052 0.0239 -0.0175
6.0000 0.0054 0.0237 -0.0176
8.0000 0.0055 0.0237 -0.0177
where
function M = cssm( filespec, kind, id )
  str = fileread( filespec );
  str = strrep( str, char([13,10]), char(10) );   % replace '\r\n' by '\n'
  fmt = '(?m)^timestep\\x20+(\\d+).+?^%s\\x20+%d\\x20[^\\n]+\\n([^\\n]+)\\n';
  xpr = sprintf( fmt, kind, id );
  cac = regexp( str, xpr, 'tokens' );
  len = length( cac );
  M = nan( len, 4 );
  for jj = 1 : len
    M( jj, 1 ) = sscanf( cac{jj}{1}, '%f' );
    M( jj, 2:4 ) = sscanf( cac{jj}{2}, '%f%f%f' );
  end
end
I really don't know whether this will work with your large files. However, I'm often surprised by how fast regular expressions are and thus I think it's worth a try.
.
Profiling of cssm
I created a 30MB text file by concatenating copies of HISTORY_Trial.txt. (I failed to download from your google drive.) Then I ran
>> profile on
>> M = cssm( filespec, 'O', 14 );
>> profile viewer
.
I use R2016a/Win7 on an old vanilla desktop.
Total elapsed time is <0.9s, more than half of which is used by fileread. (The file was already in the system cache.)
.
2018-04-19: Release candidate, RC1
Now I have developed a function that works with the 3.6GB test file. I use a similar approach as in First shot. The main difference is that this function reads and parses chunks of frames. I gave up on Matlab's Big Data tools.
In short the steps are
  1. Locate the positions of all frames in the file. A frame starts with the "t" of the string "timestep" and ends with the last character before the following "timestep". The result is stored in the persistent variable ix_frame_start and used in subsequent calls with the same text file.
  2. Calculate the positions of a series of chunks of frames. The result is stored in ix_frame_chunk_start and ix_frame_chunk_end. This calculation is cheap.
  3. Loop over all chunks of frames
  4. Read one chunk with fseek and fread. This is about three times faster than using memmapfile.
  5. Loop over the list of atoms, which was given in the call.
  6. Extract the values of "timestamp" and "X,Y,Z" with regexp.
  7. Return the result in a struct array.
Comments
Compared to the function, readHist by dpb this function
  • is two orders of magnitude slower at reading position data for one atom
  • is more robust to variations in the text file
I don't see any potential to dramatically improve the speed of this function.
Backlog
  1. Only the regular expression is well commented
  2. Cheap test to catch anomalies in the text file
Example
>> clear all
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 44.231721 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','X','O'}, {18,19,20} ); toc
Error using scan_CaCO3nH2O (line 36)
Cannot find atom: X, id: 19, in frame 1:3
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 30.618218 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 31.088037 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {5884,5890,6107} ); toc
Elapsed time is 45.095658 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O'}, {18,19,20,5884,5890,6107} ); toc
Elapsed time is 62.709259 seconds.
>>
>> plot( S(1).position(:,1), S(1).position(:,2:4), '.' )
>> S(1)
ans =
atom: 'O'
id: {[5884]}
position: [1110x4 double]
>>
.
where in one m-file
function S = scan_CaCO3nH2O( filespec, atom_list, id_list ) %
%
%{
S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} );
%}
  narginchk(3,3)
  persistent ix_frame_start previous_filespec
  if isempty( ix_frame_start ) || not( strcmpi( filespec, previous_filespec ) )
    fid = fopen( filespec, 'r' );
    ix_frame_start = frame_start_position( fid );
    previous_filespec = filespec;
  else
    fid = fopen( filespec, 'r' );
  end
  if isscalar( atom_list ) && not( isscalar( id_list ) )
    atom_list = repmat( atom_list, 1, length(id_list) );
  else
    assert( length( atom_list ) == length( id_list ) ...
          , 'scan_CaCO3nH2O:LengthMismatch' ...
          , 'Length of "%s" and "%s" differs' ...
          , inputname(2), inputname(3) )
  end
  ix1 = ix_frame_start(1);
  ix2 = ix_frame_start(4)-1;
  [~] = fseek( fid, ix1-1, 'bof' );
  str = fread( fid, [1,(ix2-ix1+1)], '*char' );
  for jj = 1 : length( atom_list )
    xpr = regular_expression( atom_list{jj}, id_list{jj} );
    cac = regexp( str, xpr, 'tokens' );
    assert( length(cac) == 3 ...
          , 'scan_CaCO3nH2O:CannotFindAtom' ...
          , 'Cannot find atom: %s, id: %d, in frame 1:3' ...
          , atom_list{jj}, id_list{jj} )
  end
  target_chunk_length = 1e8;
  mean_frame_length = mean( diff( ix_frame_start ) );
  number_of_frames_per_chunk = floor( target_chunk_length / mean_frame_length );
  ix_frame_chunk_start = ix_frame_start( 1 : number_of_frames_per_chunk : end-2 );
  ix_frame_chunk_end = [ ix_frame_chunk_start(2:end)-1, eof_position(fid) ];
  S = struct( 'atom'     , atom_list ...
            , 'id'       , num2cell( id_list ) ...
            , 'position' , [] );
  C = cell( length(ix_frame_chunk_start), length(atom_list) );
  for jj = 1 : length( ix_frame_chunk_start )   % loop over all frame chunks
    ix1 = ix_frame_chunk_start(jj);
    ix2 = ix_frame_chunk_end(jj);
    [~] = fseek( fid, ix1-1, 'bof' );
    str = fread( fid, [1,(ix2-ix1+1)], '*char' );
    C(jj,:) = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list );
  end
  for kk = 1 : length(atom_list)
    S(kk).position = cell2mat( C(:,kk) );
  end
  fclose( fid );   % close the file handle
end
% -------------------------------------------------------------------
function ix_start = frame_start_position( fid ) %
  chunk_size = 3e8;   % a larger chunk size increases speed a few per cent
  eof_pos = eof_position( fid );
  ix2 = 0;
  jj = 0;
  cac = cell( 1, 200 );
  while ix2 < eof_pos
    jj = jj + 1;
    ix1 = 1 + (jj-1)*(chunk_size-100);
    ix2 = min( eof_pos, ix1+chunk_size );
    [~] = fseek( fid, ix1-1, 'bof' );
    str = fread( fid, [1,(ix2-ix1+1)], '*char' );
    cac{jj} = strfind( str, 'timestep' ) + ix1-1;
  end
  buffer = cat( 2, cac{:} );
  ix_start = unique( buffer );   % beware of overlap
end
% -------------------------------------------------------------------
function eof_pos = eof_position( fid ) %
  [~] = fseek( fid, 0, 'eof' );
  eof_pos = ftell( fid );
end
% -------------------------------------------------------------------
function C = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list ) %
  C = cell( 1, length( atom_list ) );
  for jj = 1 : length( atom_list )   % loop over all atoms
    C{jj} = parse_one_chunk_of_frames( str, atom_list{jj}, id_list{jj} );
  end
end
% -------------------------------------------------------------------
function M = parse_one_chunk_of_frames( str, atom, id ) %
  xpr = regular_expression( atom, id );
  cac = regexp( str, xpr, 'tokens' );
  len = length( cac );
  M = nan( len, 4 );
  for jj = 1 : len
    M( jj, 1 ) = sscanf( cac{jj}{1}, '%f' );
    M( jj, 2:4 ) = sscanf( cac{jj}{2}, '%f%f%f' );
  end
end
% -------------------------------------------------------------------
function xpr = regular_expression( atom, id ) %
  fmt = [
    '(?m)      ' ... % ^ and $ match at line breaks
    '^         ' ... % beginning of line
    'timestep  ' ... % literal: timestep
    '\\x20+    ' ... % one or more spaces
    '(         ' ... % start group to catch "timestamp"
    ' \\d+     ' ... % one or more digits
    ')         ' ... %
    '\\x20     ' ... % one space
    '.+?       ' ... % one or more characters incl. new line, up till
    '^%s       ' ... % format specifier for "atom" at beginning of line
    '\\x20+    ' ... % one or more spaces
    '%d        ' ... % format specifier for "id"
    '\\x20     ' ... % one space
    '(?-s)     ' ... % from here on dot matches anything except newline
    '.+        ' ... % anything but new line, i.e. rest of line
    '\\n       ' ... % new line
    '(         ' ... % start group to catch "X,Y,Z"
    ' .+       ' ... % anything but new line, i.e. rest of line
    ')         ' ... %
    '$         ' ... % end of line
    ];
  fmt( isspace(fmt) ) = [];
  xpr = sprintf( fmt, atom, id );
end
.
2018-04-20: Release candidate 2.0, RC2
It's important to know when not to use a regular expression.
RC2 is three times as fast as RC1. I achieved this improvement by
  1. replacing consecutive spaces with one space and trimming trailing spaces in the 3.6GB text file. The trimmed file is half the size of the original file.
  2. replacing the regular expression with a search using the function strfind plus a simple regular expression.
.
The local function, scan_for_atoms_in_one_chunk_of_frames, now reads
% --------------------------------------------------------------
function C = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list )
  pos = strfind( str, 'timestep' );
  len = length( pos );
  stp = nan( len, 1 );
  for jj = 1 : len
    stp(jj) = sscanf( str(pos(jj):pos(jj)+24), '%*s%f', 1 );
  end
  len = length( atom_list );
  C = cell( 1, len );
  for jj = 1 : len   % loop over all atoms
    M = parse_one_chunk_of_frames( str, atom_list{jj}, id_list{jj} );
    M(:,1) = stp;    % this will error if the "atom" isn't found in all timesteps
    C{jj} = M;
  end
end
% --------------------------------------------------------------
function M = parse_one_chunk_of_frames( str, atom, id )
  xpr = regular_expression( atom, id );
  pos = regexp( str, xpr, 'start' );
  len = length( pos );
  M = nan( len, 4 );
  for jj = 1 : len
    [ ~, remain ] = strtok( str( pos(jj) : pos(jj)+120 ), char(10) );
    M( jj, 2:4 ) = sscanf( remain, '%f%f%f', 3 );
  end
end
% --------------------------------------------------------------
function xpr = regular_expression( atom, id )
  fmt = '\\<%s %d ';   % preceded by word boundary, \<, and followed by a space
  xpr = sprintf( fmt, atom, id );
end
Profiling of RC2
In both RC1 and RC2, two function calls totally dominate the execution time:
  • fread and
  • scan_for_atoms_in_one_chunk_of_frames, i.e. regexp and strfind, respectively.
With RC1, the execution times of reading and parsing were equal when scanning for one "atom". With RC2, reading is nearly twice as fast because of the smaller size of the test file, and parsing is several times faster than with RC1.
These three clips are from profiling of RC2 when scanning for one "atom". [profiler screenshots]
Example
Restart of Matlab
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 19.662046 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 11.702999 seconds.
>>
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,19,20,5884,5890,6107} ); toc
Elapsed time is 16.044491 seconds.

10 Comments

My python script seems to work, although quite slow. xD
But I will still look into your code; more than half of the syntax you used I have not learned before. I need to suggest that my university improve their MATLAB course for non-computer-science UG students.
Thanks for your help. I will post an update after trying your code.
It looks great from your result on the Trial file. Btw, I have now edited the google drive file to allow everyone to edit.
Similar approach here: since it uses fileread, it runs out of memory. So...
I tried to run cssm with your 3.6GB google drive file on my 16GB ram desktop. That ended with a red Out of memory. message. Nowadays Matlab internally uses two bytes per character; that contributes to our current problems.
Next I tried to squeeze the superfluous spaces out of your file.
function trim_space( source, target )
% trim_space( 'c:\tmp\HISTORY.txt', 'c:\tmp\CaCO3nH2O.txt' )
% trim_space( 'h:\m\cssm\HISTORY_Trial.txt', 'c:\tmp\trimmed_text.txt' )
%
  str = fileread( source );
  str = strrep( str, char([13,10]), char(10) );   % replace '\r\n' by '\n'
  str = regexprep( str, '\x20+', ' ' );           % replace multiple spaces by one space
  str = regexprep( str, '\x20+\n', '\n' );        % trim trailing space
  %
  fid = fopen( target, 'w' );
  fprintf( fid, '%s', str );
  fclose( fid );
end
That worked nicely with my 30MB file: 30% off the size and the time to read. The function cssm operates well on the squeezed file. The 3.6GB file still caused out of memory.
That's the end of this experiment. Conclusion:
  • The function, cssm, would work well on appropriate sized chunks of data.
  • It's for a reason that The MathWorks has implemented special tools to work with Large Files and Big Data
Regular expressions
  • See <https://www.infoq.com/presentations/regex Understanding and Using Regular Expressions> by Damian Conway, and
  • use regex101 for experimenting. I use the pcre flavor; it's close enough to Matlab's own flavor.
"Compared to the function readHist by dpb, this function
  • is two orders of magnitude slower at reading position data for one atom
  • is more robust to variations in the text file..."
I'm surprised it is that much different; did you re-profile to see the bottleneck, Per?
I'm presuming since the data file is being created from a simulation model that at least for the duration of this research project the output format won't change significantly; the script does build in some constants to handle minor things like adding a header line or the like and does check the actual record length...but it is obviously intended for the specific file for the specific purpose! :)
I'm most concerned regarding the possibility of the file size growing immeasurably large and the check to compute the number of time steps going south as currently implemented--although
>> 10^fix(log10(2^(53-1)))/1024/1024/1024
ans =
9.3132e+05
>>
GB, I suppose it's likely DIR() will continue to return the precise number for all file sizes likely to be found. :)
PS: Thanks for the excellent regexp tutorial on the way; I've never taken time to become sufficiently adept, it always ended up being so cryptic before I ever got to the result I'd lose patience and attack another way...
I've uploaded a new version, RC2, which is something like three times faster. RC2 uses one and a half seconds per "atom" when scanning the "3.6GB" file for a long list of "atoms". The main reason for the poor performance compared to your function is that my function reads and processes the entire file - every byte.
I was wondering how much was the overhead of the reading vis-a-vis the lookup; for the list of the O_spc in the sample simulation, only 1/8th of the actual file contained anything of interest, so having to read 7/8ths of 3.6GB to get to the other 1/8th is clearly a high cost to pay.
Could you perhaps combine the two approaches to find the beginning sections first; then use that to load sections more efficiently? (Not suggesting take the time to do so; rhetorical only...)
That might keep the "enhanced robustness" by finding the file structure by search instead of making the assumptions inherent in mine was the thought.

Premature optimization: Without any supporting evidence I assumed that processing larger chunks of the file would be more efficient. Thus, I processed chunks of 50+ frames at a time. That required extra source code.

Now, I have modified the code to process one or part of one frame at a time. That made it possible to

  • remove the strfind( str, 'timestep' ) call; 'timestep' is now on the first line.
  • add the option 'once' to the regexp

Yesterday, RC2 used 12.2 seconds. Now the modified function uses 8.0 seconds when reading entire frames and 5.4 seconds when reading parts of frames. The end of a partial frame is defined by the string 'O_spc 14401', which is the first "_spc".
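A minimal sketch of those two changes (variable names follow the RC2 listing above; this is my reading of the modification, not the attached code):

```matlab
% With one frame per chunk, str begins with the line "timestep ...",
% so the timestamp can be read directly from the front of the chunk
% and regexp can stop at the first hit:
str = sprintf('timestep 2000 18720 0 1 0.000500 1.000000\nO 18 16.0 -1.0 0.0\n1.0 2.0 3.0\n');
xpr = '\<O 18 ';                            % simple RC2-style expression
stp = sscanf( str, '%*s%f', 1 );            % timestamp from the first line
pos = regexp( str, xpr, 'start', 'once' );  % first (and only) match
```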

>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 12.190721 seconds.
>> tic, S = scan_dpb_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 8.020505 seconds.
>> tic, S = scan_dpb_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 5.415384 seconds.

Of the 5.4 seconds fread uses 3.8 seconds, regexp 1.2 and sscanf together with strtok 0.2

From which I infer that being able to reduce the actual transfer of bytes into memory is a major help and regexp is likely not the key culprit (albeit there is additional code in the approach, from which one gets the benefit of finding stuff instead of having to precompute where everything is, which will likely mean fewer changes if something does change in the file structure).
I gather the partial frame vis a vis full frame means computing the beginning location of the first data to be retrieved within the whole frame and only loading from there to the last instead?
I figured the same as you when I started, that the way to go would be to read big chunks and process in memory; I wasn't at all sure that moving the file pointer and then just reading a record wouldn't be much slower, owing to the cache being refreshed behind the scenes and buffering not being effective (as is noticeable in the beginning code I commented out after trying it to see, and discovering it worked quite well).
It's been an interesting exercise; I've put away a few new tidbits on how to handle large files effectively, at least when they do have a fixed-record-length structure.
Thanks for playing...enjoyed your approach and learned a lot more about regexp than knew before besides. :)
Yes, me too, although I am spending my time revising for exams now. After seeing your approach here, I have learned a lot of new things. It will be very exciting to practice it myself later on and even to transfer the idea into Python.
Thank you for your help again.


You can also use Text Analytics Toolbox for reading out-of-memory data. See TabularTextDatastore.

3 Comments

So show us an application to Lande's problem and how it compares to the two solutions to date... :)
Doc says:
Use a TabularTextDatastore object to manage large collections of
text files containing column-oriented or tabular data
I doubt that TabularTextDatastore is suitable in this case since the data file isn't column-oriented.
Yeah, you noticed... :) I'd like to see all the machinations it would take with any of these tools and then how overhead-intensive it would actually be. I'm pretty sure they'd not come close for a real-world case such as this. It's pretty straightforward if everything is regular, and none of the TMW examples ever show a "dirty" case; they're always trivial or nearly so.

