Extract atom trajectory data from an extra-large (~20GB) text file

Updated 21:00, 14 April 2018, to clear up some confusion dpb pointed out. Updated parts are highlighted in bold.
Dear friends,
I am currently running some molecular simulations and I need to analyze the movement of each individual atom.
I got a HISTORY file from the simulation, which is in an irregular-looking format like the following (you can now see a sample HISTORY in the attached file HISTORY Trial.txt):
For those really willing to help, a complete short HISTORY file of around 3.6GB is also available:
ika9108
0 1 18720 10001 374477446
timestep 0 18720 0 1 0.000500 0.000000
57.9307218288 0.0000000000 0.0000000000
0.0000000000 57.9307218288 0.0000000000
0.0000000000 0.0000000000 57.9307218288
C 1 12.000000 1.123285 0.000000
-15.58443309 -26.16046542 -14.42223305
O 2 16.000000 -1.041095 0.000000
-14.69649899 -26.77784011 -13.67121262
O 3 16.000000 -1.041095 0.000000
-16.81540951 -26.13672909 -14.15028633
O 4 16.000000 -1.041095 0.000000
-15.19910302 -25.79374370 -15.56280780
C 5 12.000000 1.123285 0.000000
-16.61260265 -24.19749101 3.305309244
O 6 16.000000 -1.041095 0.000000
-15.74221447 -23.85292172 4.202348062
O 7 16.000000 -1.041095 0.000000
-16.55264265 -23.67515089 2.115672369
O 8 16.000000 -1.041095 0.000000
-17.54419044 -25.03709893 3.579123068
.
.
.
Ca 11521 40.080000 2.000000 0.000000
-18.93093222 17.98682377 -15.79782631
Ca 11522 40.080000 2.000000 0.000000
19.11661464 -20.33590673 -23.44339428
.
.
.
O_spc 14401 15.999400 -0.820000 0.000000
9.099065343 28.15242293 12.96874971
H_spc 14402 1.008000 0.410000 0.000000
10.11816248 28.04847873 12.79296953
H_spc 14403 1.008000 0.410000 0.000000
8.553146110 28.43604932 12.08437008
O_spc 14404 15.999400 -0.820000 0.000000
-20.67489325 -6.313716149 18.72893163
H_spc 14405 1.008000 0.410000 0.000000
-21.01831712 -6.604184870 19.66593064
H_spc 14406 1.008000 0.410000 0.000000
-20.45237732 -5.303498198 18.81546735
.
.
.
timestep 2000 18720 0 1 0.000500 1.000000
57.9125298023 0.0000000000 0.0000000000
0.0000000000 57.9125298023 0.0000000000
0.0000000000 0.0000000000 57.9125298023
C 1 12.000000 1.123285 0.194440
-15.59133022 -26.23975304 -14.24546014
O 2 16.000000 -1.041095 0.364883
-14.92875146 -26.92787946 -13.43967731
O 3 16.000000 -1.041095 0.064554
-16.84330237 -26.13216641 -14.09355464
O 4 16.000000 -1.041095 0.356997
-15.00432807 -25.74003158 -15.26105138
.
.
.
As shown, the first 2 lines give the name of the system (ika9108) and the number of atoms (18720 in total); they can be ignored.
Line 3 indicates the current frame in the HISTORY; in this case (frame 1) it starts from timestep 0, i.e. 0*0.5 [fs], and frame 2, shown next, starts from timestep 2000, i.e. 2000*0.5 [fs]. Each timestep is 0.5 femtoseconds; hence 2000 timesteps are 1000 [fs], equal to 1 [ps]. The simulation advances every timestep, but records one frame only after every 2000 timesteps.
Lines 4 to 6 can be ignored; they describe the volume of the system. The later lines show:
C (a carbon atom), 1 (the index of this atom), 12.000000 and so on (atomic mass and other values I don't care about)
-15.584 (x axis), -26.160 (y axis), -14.422 (z axis)
Likewise, O stands for an oxygen atom, 2 means it is the second atom, and everything else follows the same pattern.
After finishing all 18720 atoms in the current frame, the HISTORY moves to the next frame. It repeats all 18720 atoms again, but with new x, y, z locations. This process repeats thousands of times until it reaches 10 nanoseconds.
Hence, I need to extract the changing locations of one particular atom out of the 18720, throughout the whole HISTORY file (that is, across all frames); for example:
timestep 0 (frame 1), C, 1, x1, y1, z1
timestep 2000 (frame 2), C, 1, x2, y2, z2
timestep 4000 (frame 3), C, 1, x3, y3, z3
and so on (frame x)
But I have no idea how to do this in MATLAB. Searching the internet turned up commands like textscan, but its formatSpec reads only one line at a time, whereas I need to identify the atom on one line and extract its coordinates from the next. How can I manage this?
Later on I need to store these displacements and perform a mean square displacement calculation on them.
Thanks for helping. Any suggestion is welcome.
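To show that the two-line record pattern is straightforward for any line-by-line reader, here is a minimal sketch in Python (purely illustrative; the MATLAB approaches are discussed in the comments below). The function name `extract_atom` and the tiny inline sample are my own, not from a real HISTORY file: scan line by line, remember the current timestep, and when the header line of the wanted atom appears, take the coordinates from the following line.

```python
import io

def extract_atom(lines, name, rank):
    """Yield (timestep, x, y, z) for one atom across all frames.

    `lines` is any iterable of text lines in HISTORY order: a
    'timestep ...' line starts each frame; each atom is a 2-line
    record ('NAME RANK mass charge ...' then 'x y z')."""
    want = None          # set after the matching header line is seen
    step = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == 'timestep':
            step = int(parts[1])
        elif want is not None:
            # this line holds the coordinates of the wanted atom
            x, y, z = map(float, parts[:3])
            yield (step, x, y, z)
            want = None
        elif parts[0] == name and len(parts) > 1 and parts[1] == str(rank):
            want = rank  # the next line holds the coordinates

sample = """ika9108
0 1 4 2 1
timestep 0 4 0 1 0.000500 0.000000
57.9 0.0 0.0
0.0 57.9 0.0
0.0 0.0 57.9
C 1 12.000000 1.123285 0.000000
-15.58 -26.16 -14.42
O 2 16.000000 -1.041095 0.000000
-14.69 -26.77 -13.67
timestep 2000 4 0 1 0.000500 1.000000
57.9 0.0 0.0
0.0 57.9 0.0
0.0 0.0 57.9
C 1 12.000000 1.123285 0.194440
-15.59 -26.23 -14.24
O 2 16.000000 -1.041095 0.364883
-14.92 -26.92 -13.43
"""
traj = list(extract_atom(io.StringIO(sample), 'C', 1))
print(traj)   # [(0, -15.58, -26.16, -14.42), (2000, -15.59, -26.23, -14.24)]
```

For a 20GB file this streams with constant memory; only the speed of scanning every line is in question, which is what the discussion below turns on.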

39 Comments

Well, actually, the format is pretty regular; it's a header section followed by timestep sections. Within each timestep section are a number of subsections, one for each atom in the simulation. What one does is write a format and a textscan for each of those, from the inside out, and then put them into a loop. In your case you're very fortunate in that the count for each subsection loop is also known (it can be read from the header).
The form of the routine to read such a file is thus independent of the number of elements and timesteps: simply read the individual element position data section for each atom, which is the same format repeated N times--18720 in your example.
The question then is whether you want one specific atom, all N, or some particular subset of M-out-of-N; simplest would be to return them all from the file and then discard the unwanted ones. What might become a problem with that approach is the total memory required if there are a very large number of timesteps -- I couldn't decipher that part of the output; is the simulation length shown somewhere in the header?
What would be easiest is if you could run a sample simulation of a very small system (an H2O molecule, maybe?) for a short time, so there was a complete file you could attach. Since the format is repeated, the overall size is immaterial; it's the form itself that's needed.
One could make assumptions and just edit the text you've pasted, but without knowing the model it's probable a mistake would creep in.
I don't quite follow the timestep nomenclature; can you amplify on it a little? You wrote "the next timestep shown later was 1000 (2000*0.5fs) fs," but the example you give shows
timestep 0, C, 1, x1, y1, z1
timestep 2000, C, 1, x2, y2, z2
so is the 2000 the time, or a count with the actual time being 1000? And how does one know when one is done -- by looking at the time/timestep value, or by just running out of data at end-of-file?
Thank you for replying.
My system is CaCO3·nH2O, a.k.a. amorphous calcium carbonate.
The particular example I used in my previous post is CaCO3·0.5H2O; I also have up to CaCO3·1.3H2O, and furthermore some systems with magnesium impurities.
The movement of the system is simulated every 0.5 [fs] and recorded to the HISTORY file every 1 [ps], where 1 [ps] = 0.5 [fs] * 2000. Hence the first timestep is 0, and the next one jumps to 2000.
The only purpose of extracting the timestep is to ensure there are no missing steps in the process.
As my professor only provided me the model of CaCO3·H2O, and I unfortunately have no knowledge of creating simulation molecules, I simply cannot generate a small file here for inspection.
Alternatively, if you have time, I can upload the smallest HISTORY file I have on hand (less than 4 GB) to Google Drive, which you can download and have a look at.
I forgot to mention that most of the HISTORY files are around 20GB, containing 10 [ns] of data, so each atom appears at least 10,000 times in the whole file; in the earlier example that means 18720 atoms, each repeated 10,000 times.
For my study I need to investigate the activity of the H2O molecules, so I mostly focus on the atoms named O_spc. And yes, I need one particular O_spc, for example O_spc 14401; but moreover, I also need to repeat this for every O_spc atom. So ideally, if there are 1000 O_spc atoms, I need each of them separately to calculate its mean square displacement. Then I can identify the water molecules with abnormal activity. Is that clearer?
You managed to snip sections out of that file; do the same to create a smaller sample file for us that looks like a full file. Even though it wouldn't be complete for a real simulation, it would have all the needed characteristics; you'd simply have to "fudge" the number of atoms to match what the record offsets actually would be vis-a-vis a complete file, but that wouldn't affect the logic in general.
Say, take the 16 or so elements you have from the first timestep above, repeat that for four or five timesteps, save that small section as the trial file, and attach it with the paperclip icon.
I believe with that we can fake enough in the header to give you an example to work from. NB: these files will undoubtedly take a long time to process directly via textscan. Has your prof gotten any of this done previously, or is this the first stab at post-processing these files? Perhaps a former grad student has already built some tools? Or, if this simulation is a known tool in the area, might there already be toolsets built for it?
Lacking those (why reinvent the wheel if you don't have to?), my other thought, if you are interested in only a small number of specific records out of the total, is that it might prove faster overall to pull out just those records by counting where they are in the file, since the occurrences of a given atom are N atoms*records/timestep apart; which records are needed is directly computable. Extracting those into a much smaller file to actually read might take much less time overall than parsing the whole file; it would take some trial timings to tell...
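The counting idea above can be written down concretely. With 2 file-header lines, and each frame consisting of one timestep line, three cell lines, and a 2-line record per atom, the line number of any atom's record in any frame is a closed-form expression. A sketch of the arithmetic (the function name `atom_line` is mine, and the layout assumptions are as just stated):

```python
def atom_line(frame, atom, n_atoms):
    """1-based line number of the header line for `atom` (1-based id) in
    `frame` (1-based), assuming: 2 file-header lines, then per frame one
    'timestep' line, three cell lines, and a 2-line record per atom."""
    frame_len = 2 * n_atoms + 4                  # lines per frame
    return 2 + (frame - 1) * frame_len + 4 + 2 * (atom - 1) + 1

N = 18720
print(atom_line(1, 1, N))      # 7  -> 'C 1 ...'; its coordinates are on line 8
print(atom_line(2, 1, N))      # 37451, i.e. 7 + (2*N + 4)
print(atom_line(1, 14401, N))  # 28807, the 'O_spc 14401' header in frame 1
```

Which lines to extract for any atom or frame is therefore pure arithmetic; no searching of the file is needed at all.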
Okay, I managed to create a small sample which contains fragments from 5 frames (each frame is recorded after every 2000 timesteps, so frame 1 = timestep 0, frame 2 = timestep 2000; hopefully it's clear now).
I will upload the txt both in this comment and in the main post.
To further explain some atom nomenclature in the HISTORY file:
C: carbon; O: oxygen; Ca: calcium. These 3 form CaCO3.
O_spc: oxygen in H2O; H_spc: hydrogen in H2O. These 2 form water.
And if you look through the trial HISTORY, I picked a few headers from each atom type to mimic the entire system in one frame. Hopefully it makes more sense now.
About my professor... Well, we do have a PhD who used to work in a related area. But his interest at the time was in another aspect of the simulation, which did not focus on individual water movement.
Basically, my project is built on their legacies, but none of them did the same thing in the past. So I lack a proper processing tool to analyze the data. Furthermore, they use FORTRAN for programming, so they want me to handle this part myself.
Which, more unfortunately, the university's one-semester MATLAB course did not cover.
How large are these text files? Does the largest file fit comfortably in physical memory?
He mentioned earlier most are in the 20GB range, so "comfortably" is probably not the word...but I guess that depends on just what kind of workstation he has access to.
Well, luckily my desktop is powerful enough: 64GB of physical memory and a 16-core Threadripper. As long as that is sufficient for MATLAB.
Would it be useful to assign the data to a structure array?
  • One structure per "frame" of data. "frame" being the data between one string "timestep" and the following string, "timestep".
  • Would it be robust to use the string, "timestep" as delimiter between "frames"?
  • Would C,O,Ca,H_spc,O_spc,... be appropriate to use as field names?
Hey, per isakson, thank you for replying.
Yes, I believe "timestep" would be an appropriate delimiter. You can see from my sample file that each frame announces itself at its start with "timestep X"; but be aware the frame is not re-indicated for every atom within it.
I am not sure what you mean by field names. You could use them to tell which type of atom it is, but I need both its type and its number (rank); for example, "C 1", "O_spc 14401". I hope the script can separate the trajectories of each atom into individual text files (that means 18720 different files), not mix all the O_spc atoms or Ca atoms together.
My current thought is to arrange the data into a structure array like you said, probably in this format:
Timestep | Type of Atom | Rank of Atom | X-axis | Y-axis | Z-axis
an example would be:
timestep 0 | C | 1 | -15.58443309 | -26.16046542 | -14.42223305
Then I can use MATLAB or Excel to calculate the mean square displacement from the locations.
If you can indeed suck the entire file into memory, then my idea of simply computing the index of the desired lines comes into play. For example, say (just for simplicity of explanation) you want the displacement of the first C atom: the first atom position line is always at record 7 (2 header lines plus the first 4-line timestep block, plus 1). Since the actual file has N=18720 atoms and each atom contributes two records, every subsequent C (1) atom displacement line is N*2+4 records past the first. If the data are in a cellstr array, say, then
Catoms=data(7:N*2+4:end);
would select those lines, which can then be directly parsed.
A similar offset for the time record is just as directly computed.
If you want/need a given group of atoms, simply passing a list of their IDNUM would let you either iterate over that list and select each individually or all as a group depending upon which is more convenient for the subsequent analysis.
I assume that this represents all the data you want to retrieve:
timestep 0 (frame 1), C, 1, x1, y1, z1
timestep 2000 (frame 2), C, 1, x2, y2, z2
timestep 4000 (frame 3), C, 1, x3, y3, z3
and so on (frame x)
Admittedly, I'd rather ask than read.
A 20GB text file is not a convenient data store to work with. Depending on how many times you will retrieve data, it might be good to transfer the data to some other type of file. It might, however, be premature to decide on that now.
The position of one atom as a function of time can efficiently be stored in an <Nx4 double> matrix, where N is the number of frames. The columns would be time, X, Y, Z. Matlab has something called a floating point integer, "flint", which helps avoid floating point errors in the time column.
One problem with a matrix is that there is no good place for metadata. A little metadata may be squeezed into the name of the variable. However, I think the structure from my previous comment is less appropriate for storing this limited data.
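The flint point can be illustrated outside MATLAB too: IEEE doubles represent integer values exactly up to 2^53, so storing the time (a multiple of 0.5 fs) in the first column of an N-by-4 array loses nothing. A small Python sketch of the proposed layout, with a list of rows standing in for the <Nx4 double> (the sample numbers are made up):

```python
# Each frame contributes one row [time_fs, x, y, z] for the tracked atom.
steps = [0, 2000, 4000]                     # timestep counters, one per frame
coords = [(-15.58, -26.16, -14.42),
          (-15.59, -26.23, -14.24),
          (-15.60, -26.30, -14.10)]
M = [[s * 0.5] + list(xyz) for s, xyz in zip(steps, coords)]  # 0.5 fs per step

# Integer-valued doubles ("flints") are exact, so the time column can be
# compared with == without floating-point surprises.
assert all(row[0] == int(row[0]) for row in M)
print(M[1])   # [1000.0, -15.59, -26.23, -14.24]
```

That makes the "no missing steps" check trivial: consecutive times must differ by exactly 1000.0 fs.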
Would a function like this be helpful?
function M = cssm( filespec, kind, id )
% filespec e.g. 'h:\m\cssm\HISTORY_Trial.txt'
% kind kind of atom, i.e. C/O/Ca
% id integer
% M <Nx4 double>
end
Big thanks to both of you,
I am trying what dpb said, opening the entire HISTORY file with MATLAB, but it still failed after I increased the maximum memory size to 40GB by the method the MATLAB Help provides: -Xmx40960m
As for what Isakson said, I need to look into how this function works, especially since I have never used cssm before.
Like, how do I get this M (matrix) in the first place?
Yeah, your assumption is right. Just for more clarity:
timestep 0 (frame 1), C, 1, X1, Y1, Z1
timestep 2000 (frame 2), C, 1, X2, Y2, Z2
timestep 4000 (frame 3), C, 1, X3, Y3, Z3
.
.
.
timestep 2000*(n-1) (frame n), C, 1, Xn, Yn, Zn
  • "cssm" stands for computer-science-software-matlab. I use that name as a generic function name when answering questions on the net.
  • My intention is to implement that function once you confirm that I understood your requirements.
Ah, per isakson, I see. xD I think it's good enough, since I can change the parameters to suit every atom. Thank you so much for your effort.
Well, that's a bummer!!! To illustrate, with the test file one can do the following--
>> file=textread('historytrial.txt', '%s', 'delimiter', '\n', 'whitespace', '');
>> N=50; % the sample file N
>> ix=3:N*2+4:length(file); % index to time blocks
>> file(ix)
ans =
5×1 cell array
'timestep 0 18720 0 1 0.000500 0.000000'
'timestep 2000 18720 0 1 0.000500 1.000000'
'timestep 4000 18720 0 1 0.000500 2.000000'
'timestep 6000 18720 0 1 0.000500 3.000000'
'timestep 8000 18720 0 1 0.000500 4.000000'
>> ixO=43; % the equivalent atom number in trial file; would be actual in real one
>> file(ix+2*ixO)
ans =
5×1 cell array
'O_spc 14401 15.999400 -0.820000 0.000000 '
'O_spc 14401 15.999400 -0.820000 0.499808 '
'O_spc 14401 15.999400 -0.820000 0.331586 '
'O_spc 14401 15.999400 -0.820000 0.168505 '
'O_spc 14401 15.999400 -0.820000 0.260819 '
>> cellfun(@(s) textscan(s,fmt),file(ix+2*ixO),'uniformoutput',0)
ans =
5×1 cell array
{1×5 cell}
{1×5 cell}
{1×5 cell}
{1×5 cell}
{1×5 cell}
>> ans{:}
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0]
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0.499808000000000]
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0.331586000000000]
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0.168505000000000]
ans =
1×5 cell array
{1×1 cell} [14401] [15.999400000000000] [-0.820000000000000] [0.260819000000000]
>>
NB Per: There's an error in the sample file; the second displacement record is missing for atom 20 in the next-to-last section. I had to insert it; if you actually do a search instead, it probably won't affect your solution.
NB:
I used the venerable textread here solely because it saves the fopen/fclose pair and I knew this trivial file could fit entirely in memory. textscan could be used similarly, reading as large a section as comfortably fits in the memory actually available: simply read the first two header records first, then as many timestep groups as desired; their size can be computed after reading the actual N for the file.
It has no bearing on reading the file, but I find it interesting that there's displacement only in the z direction; is this a limitation of the model, or is it expected for the sample problem for some reason? Just curious...
I failed to download from your Google Drive; a sign-in was required. Did you share it with everyone? I just sent a request via mail.
Dear dpb,
I guess you are misunderstanding a concept? The 3 columns following 14401 (the serial/rank of the atom) are not needed in my calculation. 15.9994 is the atomic mass of oxygen; I am not sure about the other two, but I guess they're not relevant. The line after every "O_spc 14401" header holds the atom's location in 3 dimensions; that is what I really need to record. Per's version gets it right.
I have not checked his work on my large file yet. I will report once I have results.
Yes and no; I mixed metaphors, so to speak. The example was simply showing that one can compute the wanted lines directly; the index shown points at the element line, and the displacement line is that index+1.
I then forgot about that later and wasn't thinking about what those numbers were when you asked the question...an indication that I am getting old and senile... :) (or :( more realistically, I guess).
I was kind of waiting to see if you could indeed load a full file into memory before going any further; there's no conceptual problem in parsing the file beyond what's been shown, either way, if you can suck it all into memory. It then simply becomes a timing question of whether using an indexed subset is sufficiently faster than regexp to decide which way to proceed.
The step if you can't read the full file in one gulp is to revert to a lower level of reading and to process in memory-sized chunks; that's doable. The outline: read the header, compute the size of one timestep in the file, then read an integral number of timesteps that will fit. That size and the number of steps per chunk depend on the problem size and how much memory you can actually use effectively...
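The chunking outline above--read the header, compute the size of one timestep, then read an integral number of timesteps per gulp--can be sketched as follows (a Python illustration under the same layout assumptions as before; the generator name `frames` and the synthetic five-frame sample are mine):

```python
import io
from itertools import islice

def frames(fh, n_atoms, frames_per_chunk=100):
    """Yield lists of lines, each holding `frames_per_chunk` whole frames.

    Skips the 2-line file header, then slices off an integral number of
    frames per chunk so no atom record is ever split across chunks."""
    next(fh); next(fh)                       # skip the 2 header lines
    frame_len = 2 * n_atoms + 4              # timestep + 3 cell + 2*n_atoms lines
    while True:
        chunk = list(islice(fh, frame_len * frames_per_chunk))
        if not chunk:
            return
        yield chunk

# Demo on an in-memory file: a 2-atom system with 5 frames of 8 lines each.
n = 2
body = []
for f in range(5):
    body.append(f"timestep {2000 * f} {n} 0 1 0.0005 {float(f)}")
    body += ["57.9 0 0", "0 57.9 0", "0 0 57.9"]
    for a in range(1, n + 1):
        body += [f"C {a} 12.0 1.1 0.0", "0.0 0.0 0.0"]
text = "ika9108\n0 1 " + str(n) + "\n" + "\n".join(body) + "\n"

sizes = [len(c) for c in frames(io.StringIO(text), n, frames_per_chunk=2)]
print(sizes)   # [16, 16, 8] -- two full chunks plus a final partial one
```

Each chunk can then be parsed in memory (or handed to a worker), with memory use bounded by `frames_per_chunk` rather than the file size.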
Hey there. I see. xD It's okay, I am feeling I am getting old too...
I mentioned in one comment that somehow I was not able to open the large HISTORY file in MATLAB, even though I increased the maximum memory as the MATLAB Help suggests. Maybe I am doing it the wrong way? I simply double-click the file in the MATLAB Current Folder pane to open it. After I type the "memory" command in MATLAB, it shows me:
Maximum possible array: 55360 MB (5.805e+10 bytes) *
Memory available for all arrays: 55360 MB (5.805e+10 bytes) *
Memory used by MATLAB: 1994 MB (2.091e+09 bytes)
Physical Memory (RAM): 65417 MB (6.860e+10 bytes)
* Limited by System Memory (physical + swap file) available.
And my biggest file is only 25GB, which should fit within the memory MATLAB is allowed to take. Maybe my swap file (page file) is not big enough?
Currently, I've managed to use my little Python knowledge to grab the data and record it into Excel, at a very slow speed: about 4 seconds to scan through one frame and grab only one type of atom, so approximately half a day for one HISTORY file. xD And it only utilises 4% of my CPU and 50MB of memory. I need to learn how to code for multiple cores. lol
This is the out of memory error message:
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at javax.swing.JTable.getSelectedRows(JTable.java:2268)
at com.mathworks.mwswing.MJTable.getSelectedRows(MJTable.java:367)
at com.mathworks.widgets.spreadsheet.HeaderRenderer.getTableCellRendererComponent(HeaderRenderer.java:264)
at javax.swing.plaf.basic.BasicTableHeaderUI.getHeaderRenderer(BasicTableHeaderUI.java:702)
at javax.swing.plaf.basic.BasicTableHeaderUI.paintCell(BasicTableHeaderUI.java:709)
at javax.swing.plaf.basic.BasicTableHeaderUI.paint(BasicTableHeaderUI.java:652)
at javax.swing.plaf.ComponentUI.update(ComponentUI.java:161)
at javax.swing.JComponent.paintComponent(JComponent.java:780)
at javax.swing.JComponent.paint(JComponent.java:1056)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JLayeredPane.paint(JLayeredPane.java:586)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JLayeredPane.paint(JLayeredPane.java:586)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JComponent.paintChildren(JComponent.java:889)
at javax.swing.JComponent.paint(JComponent.java:1065)
at javax.swing.JComponent.paintToOffscreen(JComponent.java:5210)
at javax.swing.RepaintManager$PaintManager.paintDoubleBuffered(RepaintManager.java:1579)
at javax.swing.RepaintManager$PaintManager.paint(RepaintManager.java:1502)
at javax.swing.RepaintManager.paint(RepaintManager.java:1272)
at javax.swing.JComponent._paintImmediately(JComponent.java:5158)
at javax.swing.JComponent.paintImmediately(JComponent.java:4969)
at javax.swing.RepaintManager$4.run(RepaintManager.java:831)
at javax.swing.RepaintManager$4.run(RepaintManager.java:814)
at java.security.AccessController.doPrivileged(Native Method)
"...java.lang.OutOfMemoryError: Java heap space"
That looks to me like internals of the Java installation getting in the way--I've no clue how or whether one can do anything about it.
I suspect the builtin "easy use" tools aren't helping; whether it will be able to load the full file directly or not I don't know, but try starting a brand new session of MATLAB, then from the command line:
data=readfile('AHistoryFile.txt');
That gives you the best possible shot at it, avoiding memory fragmentation from prior use and loading the character image of the file as a character stream--simply the on-disk image in memory, without conversion.
It is certainly so that we can speed this up immensely over what you're doing at the moment; just how complicated it needs to get depends on the result of the above test. I have nothing coming even close to 64GB of real memory here, so I can't do anything comparably useful.
I guess it would be of interest to know which MATLAB release and which OS?
After further research on the internet I found the way to increase the Java heap space. Now MATLAB just sits there loading the huge file (cry). I will try the way you suggest.
My OS Version: Win 10 Pro, Insider Build: 17133.73
My MATLAB Version: R2018a
Somehow MATLAB doesn't recognise the readfile function; did you mean for me to create a variable called readfile?
Oh, pooh! I meant fileread instead, sorry...there's a Statistics TB routine readFile, but it was/is specific to the old deprecated dataset object that has been replaced by the native table--not what you want at all; it's slower than molasses...
My bad, sorry.
Yeah, Isakson, that's exactly the way I found. However, the default maximum (1/4 of physical memory, in my case 16GB) is still not enough. xD So I manually increased the .prf setting to 30GB, then tried opening the file in MATLAB again. MATLAB then starts to occupy about 10% CPU and 30GB of memory to open the file, and freezes while doing so.
Try just fileread at the command line and see if that cuts out any overhead that might be in the GUI path...I never use the GUI at all, so I really don't know what all those icons are attached to; maybe it's the same thing, I don't know. I just know that at the command line there's nothing else going on behind my back.
I don't think there's any way to make significant use of the multiple cores to read a single huge file; where you could possibly gain is preprocessing to break one huge file into pieces that could then be operated on in parallel. It wouldn't take much Fortran to write a routine to do that, although it's remarkable how quick fgetl can be in MATLAB sometimes; just as Per has shown with regexp on occasion, it's surprising how good it can be if it gets down into the compiled code and stays there; it's essentially the same.
All of these simulations are already run, but how long does it take to generate one? Might it even be possible to have the code break its output up into manageable pieces instead of creating such monstrosities? Or, instead of text files, create special-purpose files of just the raw data as stream binary files?
Now I can verify that it's not possible to load the entire file into memory. xD
I tried your fileread function; it gradually reached 98% of the 64GB memory and then failed. xD
Memory sticks are so expensive nowadays. (Crybaby) And DDR5 is coming in no time (it won't).
So, what are we going to do now?
Problem is, I have no knowledge of FORTRAN. xD
And I have completely forgotten my C# after not using it for 7 years, since I stopped designing my own 2D game. XD
We do what I outlined before: read N time sections at a time, parse them, then go on to the next.
I'll throw an outline together here in a little bit...I started yesterday but held off going further awaiting the results of this experiment.
Just out of curiosity, how much time did it take to get to the crash point?
I've been watching the download of your Google Drive file; it had approached 3.3GB last I looked. I'll also look at writing a Fortran filter to break the file into chunks.
Do you have a range of file sizes from small to humongous at hand? You proved 64GB isn't enough; can you roughly do a binary split of sizes and see where the breaking point is? Of course, you've also got to have enough free memory for the results besides just the raw data when doing the conversion...
Not much, I think around 30 seconds to 1 minute to crash. The memory just climbs from bottom to top, and then the error message pops up saying not enough memory.
Not sure how to do the binary split. Wouldn't it break the structure of the HISTORY?
No, I don't have a full range of file sizes, but I do have files of about 100MB, 5GB, 10GB, 13GB, 16GB, and 25GB.
Oh, that's fast...I'd figured at least several minutes, if not tens of minutes...probably not to worry, then.
I meant a binary split on file size: 64/2 --> ~32GB; if that worked, see how much free memory remained and how large an array you could still create, then go halfway 'twixt 32 and 64, i.e. 48; not splitting the file itself, for that experiment.
To split the file for processing, one would just have the Fortran code start a new file section after however many timesteps fill a file of the chosen size, beginning the next file with the next timestep; timesteps would again be kept intact. I'd probably add one header line, like the existing first line, carrying the text ID plus a counter of the section number; presuming the simulation runs for a preset time, one could even compute the total number of sections of a given size and write it as "N of M". That's all a nicety that isn't needed until the straightforward way is shown not to work or to be too time-consuming--I really figured the time just to read the file would be the real problem. Just for grins, how long does the 25GB one take, and assuming it does load, won't Per's function work on it?
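The splitting described above only needs cut points that fall on "timestep" boundaries; whether the splitter is written in Fortran, GSplit, or anything else, the logic is the same. A Python sketch of just the cut-point logic (the function name, the in-memory demo, and the size threshold are mine; a real splitter would stream each part to its own sub-file with the "N of M" header dpb describes):

```python
def split_at_timesteps(lines, max_lines):
    """Group a HISTORY body into parts made of whole frames, cutting only
    where a 'timestep' line starts a new frame; each part is closed once
    it has reached at least `max_lines` lines."""
    parts, cur = [], []
    for line in lines:
        if line.startswith("timestep") and cur and len(cur) >= max_lines:
            parts.append(cur)        # close the part at a frame boundary
            cur = []
        cur.append(line)
    if cur:
        parts.append(cur)
    return parts

# Demo: a 2-atom system, five 8-line frames.
def make_frame(f):
    return [f"timestep {2000 * f} 2 0 1 0.0005 {float(f)}",
            "57.9 0 0", "0 57.9 0", "0 0 57.9",
            "C 1 12.0 1.1 0.0", "0.0 0.0 0.0",
            "O 2 16.0 -1.0 0.0", "0.0 0.0 0.0"]

body = [ln for f in range(5) for ln in make_frame(f)]
parts = split_at_timesteps(body, max_lines=20)
print([len(p) for p in parts])   # [24, 16] -- cuts land on frame boundaries
```

Since parts close only at frame boundaries, a part can exceed `max_lines` by at most one frame, and every sub-file starts with a "timestep" line that any frame-oriented reader can consume unchanged.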
I am still not sure what you mean by a binary split of the size of files.
I am currently not running much on the computer so the free memory would be 90% of total memory.
Not to worry; I was just thinking that if you had a goodly number of files across the size range from small to large, you could find where you run out of the ability to load by picking files roughly halfway between what did and didn't work, and hone in on the maximum...the question isn't always only what is shown as free memory: to allocate an array, the free memory has to be contiguous.
I understood from earlier postings that many (most?) of these files would be >32GB; these numbers indicate quite a number aren't so big (by your capacity) after all; on my machine they're all just humongous!!! ;)
Oh, I get what you mean. I will try this out tomorrow, and post the result.
"So, what are we going to do now?"
  • Go back to the drawing board. Write a concise requirement specification for READER, including its usage: will it be used a few times and thrown away, or in the analyses of many simulation results over more than half a year? What response times are needed for different queries during the analyses? Reruns for publications? ...
  • Several years ago I read and parsed a 20GB text file on an 8GB desktop. The first step was to split the file into two dozen sub-files with GSplit. With GSplit you can control exactly where the file is split, a bit like the Matlab function strsplit. Next, loop over the sub-files, read and parse with something similar to the cssm of my answer, and save the results to mat-files, et cetera.
  • Now I might use Large Files and Big Data and store the results in one or more HDF files. This would be an occasion to try Matlab's "Big Data" features.
Hey, thanks a lot for helping. It is great to learn new MATLAB tricks.
As Isakson found in his experiment, even a file well below physical memory can exceed what MATLAB will load; maybe MATLAB just wants users to avoid loading mid-to-large files directly into memory.
I saw what Isakson suggested; I will read about this "Large Files and Big Data" toolset in my free time.
And thanks, dpb, for posting another approach; I will try it now and see how it works.
Maybe this question has come to an end. So far my Python script works fine, and I further managed to make it load the entire file into memory to speed things up. I took another approach that separates all O_spc atoms at the same time, though the processing speed is slow compared to your results here.
Still, big thanks to you both!
"I took another approach that separates all O_spc atoms at the same time, though the processing speed is slow compared to your results here."
Which is more convenient going forward: all O_spc in one big file, or being able to handle them individually? What is to be done with the position data once you have it?
It appears to me the function I wrote is fast enough that you can simply call it in a loop over an input list of atom IDs of interest and, in all likelihood, still be done before a Python script that actually reads the whole file gets more than a good start.
As noted, however, there's very little to add to this function to read a list instead of just one atom; then again, there's also little "glue" code needed to build a combined output file from the one-by-one results as it is.
ADDENDUM
I hadn't really looked at the structure inside the model much; I see there are a bunch more water molecules in there than I had guessed. Still:
Ospc=14401:23038;                          % list of special O atom numbers
res=readHist(filename,Ospc(1));            % do first to find out how long it is
OspcPos=zeros(size(res,1),4*length(Ospc)); % preallocate
OspcPos(:,1:4)=res;                        % save first
for k=2:length(Ospc)                       % iterate over the rest
  OspcPos(:,4*k-3:4*k)=readHist(filename,Ospc(k));
end
ADDENDUM 2
Well, I tried the above to see; it turns out that when one begins traversing the file from front to back so many times, the speed does go way down. It appears one should make the modification to read all the wanted atoms in a single traverse...or use a combination of some of the items Per mentioned, such as removing pieces of the file that are known to be of no interest. For this model length(Ospc)/N = 0.125, so 87.5% of the file isn't of interest; that would reduce 3.6GB to 0.45GB. I don't know for sure just how much time the decimation itself would require, though.
I'm disappointed; I was a little naive in presuming that speed would scale roughly linearly with numbers, I guess.


Accepted Answer

ADDENDUM Attached is an updated version that incorporates reading multiple atoms in a single run; run-time now seems essentially linear in nAtoms.
ORIGINAL BEGINS I approached it with the same general concept as outlined before--compute where in the file the wanted values are--except instead of trying to read the whole file or very large chunks into memory and then picking out the wanted sections, I compute the location in the file and just read the desired records directly. I wasn't sure how this would work from a time standpoint; I thought moving the file pointer might be a problem, but it turns out to be blazingly fast...
>> tic;H23040res=readHist('HISTORY.txt',23040);toc
Elapsed time is 0.110633 seconds.
>>
>> [H23040res(end-3:end,1) H23040res(end-3:end,2:4)] % pasted together for easier viewing
ans =
12240000 3.866557282000000 19.186368170000001 6.278881618000000
12242000 4.177788492000000 19.270542259999999 6.410293005000000
12244000 4.425450305000000 19.523942850000001 6.673638392000000
12246000 4.609073503000000 19.099931819999998 5.917552555000000
>>
The function for one atom, given by atom number in the simulation:
function [res,hdr]=readHist(file,atom)
% return position data from simulation for a given atom
% optionally return the header line for ID
  fid=fopen(file,'r');
  l=fgets(fid);            % read the header line, including \n
  % define some constants from the file format
  L=length(l);             % record length including \n
  nTimeLines=4;            % number of lines in each timestep block
  nAtomLines=2;            % number of lines in each atom output block
  hdr=strtrim(l);          % save the header line for ID purposes
  N=readN(fid);            % get the number of atoms in the simulation
  nLinesTimeStep=N*nAtomLines+nTimeLines;
  nBytesTimeStep=nLinesTimeStep*L;
  d=dir(file);
  nTimeSteps=(d.bytes-2*L)/nBytesTimeStep;
  res=nan(nTimeSteps,4);
  idx=ftell(fid);          % pointer to beginning of first time section
  for i=1:nTimeSteps
    res(i,1)=readTime(fid,L);
    res(i,2:4)=readAtomN(fid,L,atom);
    idx=idx+nBytesTimeStep;
    fseek(fid,idx,-1);
  end
  fid=fclose(fid);
end
...
Functions are in the attached full m-file; the above just shows the outline at the higher level, walking through timestep by timestep. The fseek here positions the file pointer just before the next timestep record after having read the previous timestep's wanted atom position data; since each timestep is a fixed size, it is simple to just move to that location relative to the beginning of the file rather than calculate the difference from the present position (although that's just the difference between the total length and how many bytes were moved plus read). It wouldn't be hard to add the facility to read a list of atoms rather than just the one.
I don't disagree with Per's assessment that perhaps some of the Matlab tools for large datasets could be brought to bear, but they seemed to have a pretty steep learning curve for dealing with the structure of the file at hand, being mostly set up for regular tabular data, it appeared.
Here you're fortunate that the model is written in Fortran and uses fixed-width records, so one can compute the exact byte position in the file very easily as a multiple of the record length. The advantage of this approach is that it really doesn't make any difference how long the files are; it appears fast enough to be no problem. This is a moderately new but very mid/low-range 8GB Win7 machine.
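The offset arithmetic behind this can be sketched as follows. The variable names follow readHist above; the record length and atom count here are taken from the sample header and are assumptions for illustration only:

```matlab
% Byte offset of an atom's coordinate line within timestep t (1-based),
% assuming every line is a fixed-width record of L bytes incl. newline.
L = 73;                 % assumed record length (illustrative)
nTimeLines = 4;         % lines in each timestep header block
nAtomLines = 2;         % lines per atom (label line + coordinate line)
N = 18720;              % atoms per timestep (from the sample header)
nBytesTimeStep = (N*nAtomLines + nTimeLines) * L;
t = 1; atom = 5;        % example timestep and atom number
offset = 2*L ...                    % the two file-header lines
       + (t-1)*nBytesTimeStep ...   % preceding timesteps
       + nTimeLines*L ...           % this timestep's header lines
       + (atom-1)*nAtomLines*L ...  % preceding atom blocks
       + L;                         % skip the atom's label line
% fseek(fid, offset, 'bof') would then position at the coordinate line.
```

With fixed-width records this is pure arithmetic, which is why the per-record fseek approach stays fast regardless of file size.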

4 Comments

"I don't disagree with Per's assessment that perhaps some of the Matlab tools for large datasets could be brought to bear,..."
Actually, what this is is a "roll your own" memmapfile implementation for the specific file structure...we did similar machinations back in the mainframe days on mag tape for nuclear cross section multigroup reduction calculations--27 7-track drives all spinning wildly--just like a scene from "Mission Impossible"... :)
One last comment -- as written, there's no error checking -- things like ensuring that the computed nTimeSteps is an integer, to catch files that aren't complete, and the like.
Also, as files get larger, all the byte-addressing arithmetic ought to be in long integers; theoretically I guess one could overflow the precision of a double float.
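A minimal sketch of such a completeness check, assuming the variable names from readHist above (the byte counts here are illustrative, not from the real file):

```matlab
% Hedged sketch: validate that the file size is consistent with the
% fixed-record structure before trusting nTimeSteps.
L = 73;                           % assumed record length incl. newline
nBytesTimeStep = 2733412;         % (N*nAtomLines + nTimeLines) * L for N = 18720
bytes = 2*L + 5*nBytesTimeStep;   % stand-in for d.bytes from dir(file)
nTimeSteps = (bytes - 2*L) / nBytesTimeStep;
assert( mod(nTimeSteps,1) == 0 ...
      , 'readHist:incompleteFile' ...
      , 'File size inconsistent with record structure; file may be truncated' )
```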
Hey, no worries. I am not disappearing. lol
I was quite busy with studying yesterday; exam ahead.
I have just tried your code. It works perfectly fine; I think the time it took to extract one atom was about a few seconds, which is indeed much faster than my python code. All I need to do now is build a loop to let it extract all atoms.
I really appreciate the help from you and Isakson. Can I accept both answers?
OK, my bad...sorry for the rant if misplaced! :)
Unfortunately, I don't believe the Answers forum is sophisticated enough to let more than one Answer be accepted; you can always try it and see. What I do know you can do, if you choose to reward both, is that an Accept followed by an UnAccept does not take away award points...we don't really do things for the points per se, of course, but for the satisfaction of helping and the fun of the challenge.
On the multi-atom case, use the second of the two functions and pass it a vector of the desired atom numbers--it is MUCHO faster than the one-element version in a loop; the latter option is, indeed, unworkable in traversing the file N times.
The two attached files while of the same name aren't the same code; to reduce the confusion factor I'll remove the single-atom one since the multi-atom case reduces to that if only pass a single atom number.


More Answers (2)

First shot. I acknowledge that this might look cryptic, but it seems to work.
>> filespec ='h:\m\cssm\HISTORY_Trial.txt';
>> M = cssm( filespec, 'C', 1 )
M =
1.0e+03 *
0 -0.0156 -0.0262 -0.0144
2.0000 -0.0156 -0.0262 -0.0142
4.0000 -0.0155 -0.0263 -0.0143
6.0000 -0.0156 -0.0261 -0.0140
8.0000 -0.0157 -0.0262 -0.0142
and
>> M = cssm( filespec, 'O', 14 )
M =
1.0e+03 *
0 0.0054 0.0239 -0.0176
2.0000 0.0054 0.0238 -0.0177
4.0000 0.0052 0.0239 -0.0175
6.0000 0.0054 0.0237 -0.0176
8.0000 0.0055 0.0237 -0.0177
where
function M = cssm( filespec, kind, id )
  str = fileread( filespec );
  str = strrep( str, char([13,10]), char(10) );   % replace '\r\n' by '\n'
  fmt = '(?m)^timestep\\x20+(\\d+).+?^%s\\x20+%d\\x20[^\\n]+\\n([^\\n]+)\\n';
  xpr = sprintf( fmt, kind, id );
  cac = regexp( str, xpr, 'tokens' );
  len = length( cac );
  M = nan( len, 4 );
  for jj = 1 : len
    M( jj, 1 ) = sscanf( cac{jj}{1}, '%f' );
    M( jj, 2:4 ) = sscanf( cac{jj}{2}, '%f%f%f' );
  end
end
I really don't know whether this will work with your large files. However, I'm often surprised by how fast regular expressions are and thus I think it's worth a try.
.
Profiling of cssm
I created a 30MB text file by concatenating copies of HISTORY_Trial.txt. (I failed to download from your google drive.) Then I ran
>> profile on
>> M = cssm( filespec, 'O', 14 );
>> profile viewer
.
I use R2016a/Win7 on an old vanilla desktop.
Total elapsed time is <0.9s, more than half of which is used by fileread. (The file was already in the system cache.)
.
2018-04-19: Release candidate, RC1
Now I have developed a function that works with the 3.6GB test file. I use a similar approach as in First shot. The main difference is that this function reads and parses chunks of frames. I gave up on Matlab's Big Data tools.
In short the steps are
  1. Locate the positions of all frames in the file. A frame starts with the "t" of the string "timestep" and ends with the last character before the following "timestep". The result is stored in the persistent variable ix_frame_start and used in subsequent calls with the same text file.
  2. Calculate the positions of a series of chunks of frames. The result is stored in ix_frame_chunk_start and ix_frame_chunk_end. This calculation is cheap.
  3. Loop over all chunks of frames
  4. Read one chunk with fseek and fread. This is about three times faster than using memmapfile.
  5. Loop over the list of atoms, which was given in the call.
  6. Extract the values of "timestamp" and "X,Y,Z" with regexp.
  7. Return the result in a struct array.
Comments
Compared to the function, readHist by dpb this function
  • is two orders of magnitude slower at reading position data for one atom
  • is more robust to variations in the text file
I don't see any potential to dramatically improve the speed of this function.
Backlog
  1. Only the regular expression is well commented
  2. Cheap test to catch anomalies in the text file
Example
>> clear all
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 44.231721 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','X','O'}, {18,19,20} ); toc
Error using scan_CaCO3nH2O (line 36)
Cannot find atom: X, id: 19, in frame 1:3
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 30.618218 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 31.088037 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {5884,5890,6107} ); toc
Elapsed time is 45.095658 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O'}, {18,19,20,5884,5890,6107} ); toc
Elapsed time is 62.709259 seconds.
>>
>> plot( S(1).position(:,1), S(1).position(:,2:4), '.' )
>> S(1)
ans =
atom: 'O'
id: {[5884]}
position: [1110x4 double]
>>
.
where in one m-file
function S = scan_CaCO3nH2O( filespec, atom_list, id_list ) %
%
%{
S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} );
%}
  narginchk(3,3)
  persistent ix_frame_start previous_filespec
  if isempty( ix_frame_start ) || not( strcmpi( filespec, previous_filespec ) )
    fid = fopen( filespec, 'r' );
    ix_frame_start = frame_start_position( fid );
    previous_filespec = filespec;
  else
    fid = fopen( filespec, 'r' );
  end
  if isscalar( atom_list ) && not( isscalar( id_list ) )
    atom_list = repmat( atom_list, 1, length(id_list) );
  else
    assert( length( atom_list ) == length( id_list ) ...
          , 'scan_CaCO3nH2O:LengthMismatch' ...
          , 'Length of "%s" and "%s" differs' ...
          , inputname(2), inputname(3) )
  end
  ix1 = ix_frame_start(1);
  ix2 = ix_frame_start(4)-1;
  [~] = fseek( fid, ix1-1, 'bof' );
  str = fread( fid, [1,(ix2-ix1+1)], '*char' );
  for jj = 1 : length( atom_list )
    xpr = regular_expression( atom_list{jj}, id_list{jj} );
    cac = regexp( str, xpr, 'tokens' );
    assert( length(cac) == 3 ...
          , 'scan_CaCO3nH2O:CannotFindAtom' ...
          , 'Cannot find atom: %s, id: %d, in frame 1:3' ...
          , atom_list{jj}, id_list{jj} )
  end
  target_chunk_length = 1e8;
  mean_frame_length = mean( diff( ix_frame_start ) );
  number_of_frames_per_chunk = floor( target_chunk_length / mean_frame_length );
  ix_frame_chunk_start = ix_frame_start( 1 : number_of_frames_per_chunk : end-2 );
  ix_frame_chunk_end = [ ix_frame_chunk_start(2:end)-1, eof_position(fid) ];
  S = struct( 'atom'     , atom_list ...
            , 'id'       , num2cell( id_list ) ...
            , 'position' , [] );
  C = cell( length(ix_frame_chunk_start), length(atom_list) );
  for jj = 1 : length( ix_frame_chunk_start )   % loop over all frame chunks
    ix1 = ix_frame_chunk_start(jj);
    ix2 = ix_frame_chunk_end(jj);
    [~] = fseek( fid, ix1-1, 'bof' );
    str = fread( fid, [1,(ix2-ix1+1)], '*char' );
    C(jj,:) = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list );
  end
  for kk = 1 : length(atom_list)
    S(kk).position = cell2mat( C(:,kk) );
  end
  fclose( fid );   % close the file handle
end
% -------------------------------------------------------------------
function ix_start = frame_start_position( fid ) %
  chunk_size = 3e8;   % a larger chunk size increases speed a few per cent
  eof_pos = eof_position( fid );
  ix2 = 0;
  jj = 0;
  cac = cell( 1, 200 );
  while ix2 < eof_pos
    jj = jj + 1;
    ix1 = 1 + (jj-1)*(chunk_size-100);
    ix2 = min( eof_pos, ix1+chunk_size );
    [~] = fseek( fid, ix1-1, 'bof' );
    str = fread( fid, [1,(ix2-ix1+1)], '*char' );
    cac{jj} = strfind( str, 'timestep' ) + ix1-1;
  end
  buffer = cat( 2, cac{:} );
  ix_start = unique( buffer );   % beware of overlap
end
% -------------------------------------------------------------------
function eof_pos = eof_position( fid ) %
  [~] = fseek( fid, 0, 'eof' );
  eof_pos = ftell( fid );
end
% -------------------------------------------------------------------
function C = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list ) %
  C = cell( 1, length( atom_list ) );
  for jj = 1 : length( atom_list )   % loop over all atoms
    C{jj} = parse_one_chunk_of_frames( str, atom_list{jj}, id_list{jj} );
  end
end
% -------------------------------------------------------------------
function M = parse_one_chunk_of_frames( str, atom, id ) %
  xpr = regular_expression( atom, id );
  cac = regexp( str, xpr, 'tokens' );
  len = length( cac );
  M = nan( len, 4 );
  for jj = 1 : len
    M( jj, 1 ) = sscanf( cac{jj}{1}, '%f' );
    M( jj, 2:4 ) = sscanf( cac{jj}{2}, '%f%f%f' );
  end
end
% -------------------------------------------------------------------
function xpr = regular_expression( atom, id ) %
  fmt = [
    '(?m)      ' ... % ^ and $ match at line breaks
    '^         ' ... % beginning of line
    'timestep  ' ... % literal: timestep
    '\\x20+    ' ... % one or more spaces
    '(         ' ... % start group to catch "timestamp"
    ' \\d+     ' ... % one or more digits
    ')         ' ... %
    '\\x20     ' ... % one space
    '.+?       ' ... % one or more characters incl. new line, up till
    '^%s       ' ... % format specifier for "atom" at beginning of line
    '\\x20+    ' ... % one or more spaces
    '%d        ' ... % format specifier for "id"
    '\\x20     ' ... % one space
    '(?-s)     ' ... % from here on dot matches anything except newline
    '.+        ' ... % anything but new line, i.e. rest of line
    '\\n       ' ... % new line
    '(         ' ... % start group to catch "X,Y,Z"
    ' .+       ' ... % anything but new line, i.e. rest of line
    ')         ' ... %
    '$         ' ... % end of line
    ];
  fmt( isspace(fmt) ) = [];
  xpr = sprintf( fmt, atom, id );
end
.
2018-04-20: Release candidate 2.0, RC2
It's important to know when not to use a regular expression.
RC2 is three times as fast as RC1. I achieved this improvement by
  1. replacing consecutive spaces with one space and trimming trailing spaces in the 3.6GB text file. The trimmed file is half the size of the original file.
  2. replacing the regular expression with a search using the function strfind plus a simple regular expression.
.
The local function, scan_for_atoms_in_one_chunk_of_frames, now reads
% --------------------------------------------------------------
function C = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list )
  pos = strfind( str, 'timestep' );
  len = length( pos );
  stp = nan( len, 1 );
  for jj = 1 : len
    stp(jj) = sscanf( str(pos(jj):pos(jj)+24), '%*s%f', 1 );
  end
  len = length( atom_list );
  C = cell( 1, len );
  for jj = 1 : len   % loop over all atoms
    M = parse_one_chunk_of_frames( str, atom_list{jj}, id_list{jj} );
    M(:,1) = stp;    % this will error if the "atom" isn't found in all timesteps
    C{jj} = M;
  end
end
% --------------------------------------------------------------
function M = parse_one_chunk_of_frames( str, atom, id )
  xpr = regular_expression( atom, id );
  pos = regexp( str, xpr, 'start' );
  len = length( pos );
  M = nan( len, 4 );
  for jj = 1 : len
    [ ~, remain ] = strtok( str( pos(jj) : pos(jj)+120 ), char(10) );
    M( jj, 2:4 ) = sscanf( remain, '%f%f%f', 3 );
  end
end
% --------------------------------------------------------------
function xpr = regular_expression( atom, id )
  fmt = '\\<%s %d ';   % preceded by word boundary, \<, and followed by a space
  xpr = sprintf( fmt, atom, id );
end
Profiling of RC2
In both RC1 and RC2, two function calls totally dominate the execution time:
  • fread and
  • scan_for_atoms_in_one_chunk_of_frames, i.e. regexp and strfind, respectively.
With RC1, the execution times of reading and parsing were equal when scanning for one "atom". With RC2, reading is nearly twice as fast because of the smaller size of the test file, and parsing is several times faster than with RC1.
These three clips are from profiling of RC2 when scanning for one "atom". [profiler screenshots]
Example
Restart of Matlab
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 19.662046 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 11.702999 seconds.
>>
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,19,20,5884,5890,6107} ); toc
Elapsed time is 16.044491 seconds.

10 Comments

My python script seems to work, although quite slow. xD
But I will still look into your code; more than half of the syntax you used I have not learned before. I need to suggest that my university improve their MATLAB course for non-computer-science UG students.
Thanks for your help. I will post an update after trying your code.
It looks great from your result on the Trial file. Btw, I have now edited the google drive file to allow everyone to edit.
Similar approach here: since it uses fileread, it runs out of memory. So...
I tried to run cssm with your 3.6GB google drive file on my 16GB ram desktop. That ended with a red Out of memory. message. Nowadays Matlab internally uses two bytes per character; that contributes to our current problems.
Next I tried to squeeze the superfluous spaces out of your file.
function trim_space( source, target )
% trim_space( 'c:\tmp\HISTORY.txt', 'c:\tmp\CaCO3nH2O.txt' )
% trim_space( 'h:\m\cssm\HISTORY_Trial.txt', 'c:\tmp\trimmed_text.txt' )
%
  str = fileread( source );
  str = strrep( str, char([13,10]), char(10) );   % replace '\r\n' by '\n'
  str = regexprep( str, '\x20+', ' ' );           % replace multiple spaces by one space
  str = regexprep( str, '\x20+\n', '\n' );        % trim trailing space
  %
  fid = fopen( target, 'w' );
  fprintf( fid, '%s', str );
  fclose( fid );
end
That worked nicely with my 30MB file: 30% off the size and the time to read. The function cssm operates well on the squeezed file. The 3.6GB file still caused out of memory.
That's the end of this experiment. Conclusion:
  • The function, cssm, would work well on appropriate sized chunks of data.
  • It's for a reason that The MathWorks has implemented special tools to work with Large Files and Big Data
Regular expressions
  • See <https://www.infoq.com/presentations/regex Understanding and Using Regular Expressions> by Damian Conway, and
  • use regex101 for experimenting. I use the pcre flavor; it's close enough to Matlab's own flavor.
"Compared to the function readHist by dpb, this function
  • is two orders of magnitude slower at reading position data for one atom
  • is more robust to variations in the text file..."
I'm surprised it is that much different; did you re-profile to see the bottleneck, Per?
I'm presuming since the data file is being created from a simulation model that at least for the duration of this research project the output format won't change significantly; the script does build in some constants to handle minor things like adding a header line or the like and does check the actual record length...but it is obviously intended for the specific file for the specific purpose! :)
I'm most concerned regarding the possibility of the file size growing immeasurably large and the check to compute the number of time steps going south as currently implemented--although
>> 10^fix(log10(2^(53-1)))/1024/1024/1024
ans =
9.3132e+05
>>
GB, I suppose it's likely DIR() will continue to return the precise number for all file sizes likely to be found. :)
PS: Thanks for the excellent regexp tutorial on the way; I've never taken time to become sufficiently adept, it always ended up being so cryptic before I ever got to the result I'd lose patience and attack another way...
I've uploaded a new version, RC2, which is something like three times faster. RC2 uses one and a half seconds per "atom" when scanning the "3.6GB" file for a long list of "atoms". The main reason for the poor performance compared to your function is that my function reads and processes the entire file - every byte.
I was wondering how much was the overhead of the reading vis-a-vis the lookup; for the list of the O_spc in the sample simulation, only 1/8th of the actual file contained anything of interest, so having to read 7/8ths of 3.6GB to get to the other 1/8th is clearly a high cost to pay.
Could you perhaps combine the two approaches to find the beginning sections first; then use that to load sections more efficiently? (Not suggesting take the time to do so; rhetorical only...)
That might keep the "enhanced robustness" by finding the file structure by search instead of making the assumptions inherent in mine was the thought.

Premature optimization: Without any supporting evidence I assumed that processing larger chunks of the file would be more efficient. Thus, I processed chunks of 50+ frames at a time. That required extra source code.

Now, I have modified the code to process one or part of one frame at a time. That made it possible to

  • remove the strfind( str, 'timestep' ) call; 'timestep' is now on the first line.
  • add the option 'once' to the regexp

Yesterday, RC2 used 12.2 seconds. Now the modified function uses 8.0 seconds when reading entire frames and 5.4 seconds when reading parts of frames. The end of a partial frame is defined by the string 'O_spc 14401', which is the first "_spc".
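A minimal sketch of those two changes (variable names follow the RC2 listing above; this is my reading of the modification, not the attached code):

```matlab
% With one frame per chunk, str begins with the line "timestep ...",
% so the timestamp can be read directly from the front of the chunk
% and regexp can stop at the first hit:
str = sprintf('timestep 2000 18720 0 1 0.000500 1.000000\nO 18 16.0 -1.0 0.0\n1.0 2.0 3.0\n');
xpr = '\<O 18 ';                            % simple RC2-style expression
stp = sscanf( str, '%*s%f', 1 );            % timestamp from the first line
pos = regexp( str, xpr, 'start', 'once' );  % first (and only) match
```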

>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 12.190721 seconds.
>> tic, S = scan_dpb_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 8.020505 seconds.
>> tic, S = scan_dpb_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 5.415384 seconds.

Of the 5.4 seconds fread uses 3.8 seconds, regexp 1.2 and sscanf together with strtok 0.2

From which I infer that being able to reduce the actual transfer of bytes into memory is a major help and regexp is likely not the key culprit (albeit there is additional code in the approach, from which one gets the benefit of finding stuff instead of having to precompute where everything is, which will likely mean fewer changes if something does change in the file structure).
I gather the partial frame vis a vis full frame means computing the beginning location of the first data to be retrieved within the whole frame and only loading from there to the last instead?
I figured the same as you when I started, that the way to go would be to read big chunks and process in memory; I wasn't at all sure that moving the file pointer and then just reading a record wouldn't be much slower, owing to the cache being refreshed behind the scenes and buffering not being effective (as is noticeable in the beginning code I commented out after trying it to see, and discovering it worked quite well).
It's been an interesting exercise; I've put away a few new tidbits on how to handle large files effectively, at least when they do have a fixed-record-length structure.
Thanks for playing...enjoyed your approach and learned a lot more about regexp than knew before besides. :)
Yes, me too, although I am spending my time revising for exams now. After seeing your approach here, I have learned a lot of new things. It will be very exciting to practice it myself later on and even to transfer the idea into Python.
Thank you for your help again.


You can also use Text Analytics Toolbox for reading out-of-memory data. See TabularTextDatastore.

3 Comments

So show us an application to Lande's problem and how it compares to the two solutions to date... :)
Doc says:
Use a TabularTextDatastore object to manage large collections of
text files containing column-oriented or tabular data
I doubt that TabularTextDatastore is suitable in this case since the data file isn't column-oriented.
Yeah, you noticed... :) I'd like to see all the machinations it would take with any of these tools and then how overhead-intensive it would actually be. I'm pretty sure they'd not come close for a real-world case such as this. It's pretty straightforward if everything is regular, and none of the TMW examples ever show a "dirty" case; they're always trivial or nearly so.

