First shot. I acknowledge that this might look cryptic, but it seems to work.
>> filespec ='h:\m\cssm\HISTORY_Trial.txt';
>> M = cssm( filespec, 'C', 1 )
M =
1.0e+03 *
0 -0.0156 -0.0262 -0.0144
2.0000 -0.0156 -0.0262 -0.0142
4.0000 -0.0155 -0.0263 -0.0143
6.0000 -0.0156 -0.0261 -0.0140
8.0000 -0.0157 -0.0262 -0.0142
and
>> M = cssm( filespec, 'O', 14 )
M =
1.0e+03 *
0 0.0054 0.0239 -0.0176
2.0000 0.0054 0.0238 -0.0177
4.0000 0.0052 0.0239 -0.0175
6.0000 0.0054 0.0237 -0.0176
8.0000 0.0055 0.0237 -0.0177
where
function M = cssm( filespec, kind, id )
str = fileread( filespec );
str = strrep( str, char([13,10]), char(10) );
fmt = '(?m)^timestep\\x20+(\\d+).+?^%s\\x20+%d\\x20[^\\n]+\\n([^\\n]+)\\n';
xpr = sprintf( fmt, kind, id );
cac = regexp( str, xpr, 'tokens' );
len = length( cac );
M = nan( len, 4 );
for jj = 1 : len
M( jj, 1 ) = sscanf( cac{jj}{1}, '%f' );
M( jj, 2:4 ) = sscanf( cac{jj}{2}, '%f%f%f' );
end
end
I really don't know whether this will work with your large files. However, I'm often surprised by how fast regular expressions are and thus I think it's worth a try.
.
Profiling of cssm
I created a 30MB text file by adding copies of HISTORY_Trial.txt. (I failed to download from your google drive.) Then I run
>> profile on
>> M = cssm( filespec, 'O', 14 );
>> profile viewer
.
I use R2016a/Win7 on an old vanilla desktop.
Total elapse time is <0.9s, more than half of which is used by fileread. (The file was already in system cache.)
.
2018-14-19: Release candidate, RC1
Now I have developed a function that works with the 3.6GB test file. I use a similar approach as in First shot. The main difference is that this functions reads and parses chunks of frames. I gave up on Matlabs Big Data tools.
In short the steps are
- Locate the positions of all frames in the file. A frame starts with "t" in the string, "timestep", and ends with the last character before the following string, "timestep". The result is stored in the persistent variable, ix_frame_start, and used in subsequent calls with the same text file.
- Calculate the positions of a series of chunks of frames. The result is stored in ix_frame_chunk_start and ix_frame_chunk_end. This calculation is cheep.
- Loop over all chunks of frames
- Read one chunk with fseek and fread. This is about three times faster than using memmapfile.
- Loop over the list of atoms, which was given in the call.
- Extract the values of "timestamp" and "X,Y,Z" with regexp.
- Return the result in a struct array.
Comments
Compared to the function, readHist by dpb this function
- is two order of magnitude slower to read position data for one atom
- is more robust to variations in the text file
I don't see any potential to dramatically improve the speed of this function.
Backlog
- Only the regular expression is well commented
- Cheep test to catch anomalies in the text file
Example
>> clear all
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 44.231721 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','X','O'}, {18,19,20} ); toc
Error using scan_CaCO3nH2O (line 36)
Cannot find atom: X, id: 19, in frame 1:3
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 30.618218 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 31.088037 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O','O','O'}, {5884,5890,6107} ); toc
Elapsed time is 45.095658 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY.txt', {'O'}, {18,19,20,5884,5890,6107} ); toc
Elapsed time is 62.709259 seconds.
>>
>> plot( S(1).position(:,1), S(1).position(:,2:4), '.' )
>> S(1)
ans =
atom: 'O'
id: {[5884]}
position: [1110x4 double]
>>
.
where in one m-file
function S = scan_CaCO3nH2O( filespec, atom_list, id_list )
narginchk(3,3)
persistent ix_frame_start previous_filespec
if isempty( ix_frame_start ) || not( strcmpi( filespec, previous_filespec ) )
fid = fopen( filespec, 'r' );
ix_frame_start = frame_start_position( fid );
previous_filespec = filespec;
else
fid = fopen( filespec, 'r' );
end
if isscalar( atom_list ) && not( isscalar( id_list ) )
atom_list = repmat( atom_list, 1, length(id_list) );
else
assert( length( atom_list ) == length( id_list )...
, 'scan_CaCO3nH2O:LengthMismatch' ...
, 'Length of "%s" and "%s" differs' ...
, inputname(2), inputname(3) )
end
ix1 = ix_frame_start(1);
ix2 = ix_frame_start(4)-1;
[~] = fseek( fid, ix1-1, 'bof' );
str = fread( fid, [1,(ix2-ix1+1)], '*char' );
for jj = 1 : length( atom_list )
xpr = regular_expression( atom_list{jj}, id_list{jj} );
cac = regexp( str, xpr, 'tokens' );
assert( length(cac) == 3 ...
, 'scan_CaCO3nH2O:CannotFindAtom' ...
, 'Cannot find atom: %s, id: %d, in frame 1:3' ...
, atom_list{jj}, id_list{jj} )
end
target_chunk_length = 1e8;
mean_frame_length = mean( diff( ix_frame_start ) );
number_of_frames_per_chunk = floor( target_chunk_length / mean_frame_length );
ix_frame_chunk_start = ix_frame_start( 1 : number_of_frames_per_chunk : end-2 );
ix_frame_chunk_end = [ ix_frame_chunk_start(2:end)-1, eof_position(fid) ];
S = struct( 'atom' , atom_list ...
, 'id' , num2cell( id_list ) ...
, 'position' , [] );
C = cell( length(ix_frame_chunk_start), length(atom_list) );
for jj = 1 : length( ix_frame_chunk_start )
ix1 = ix_frame_chunk_start(jj);
ix2 = ix_frame_chunk_end(jj);
[~] = fseek( fid, ix1-1, 'bof' );
str = fread( fid, [1,(ix2-ix1+1)], '*char' );
C(jj,:) = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list );
end
for kk = 1 : length(atom_list)
S(kk).position = cell2mat( C(:,kk) );
end
end
function ix_start = frame_start_position( fid )
chunk_size = 3e8;
eof_pos = eof_position( fid );
ix2 = 0;
jj = 0;
cac = cell( 1, 200 );
while ix2 < eof_pos
jj = jj + 1;
ix1 = 1 + (jj-1)*(chunk_size-100);
ix2 = min( eof_pos, ix1+chunk_size );
[~] = fseek( fid, ix1-1, 'bof' );
str = fread( fid, [1,(ix2-ix1+1)], '*char' );
cac{jj} = strfind( str, 'timestep' ) + ix1-1;
end
buffer = cat( 2, cac{:} );
ix_start = unique( buffer );
end
function eof_pos = eof_position( fid )
[~] = fseek( fid, 0, 'eof' );
eof_pos = ftell( fid );
end
function C = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list )
C = cell( 1, length( atom_list ) );
for jj = 1 : length( atom_list )
C{jj} = parse_one_chunk_of_frames( str, atom_list{jj}, id_list{jj} );
end
end
function M = parse_one_chunk_of_frames( str, atom, id )
xpr = regular_expression( atom, id );
cac = regexp( str, xpr, 'tokens' );
len = length( cac );
M = nan( len, 4 );
for jj = 1 : len
M( jj, 1 ) = sscanf( cac{jj}{1}, '%f' );
M( jj, 2:4 ) = sscanf( cac{jj}{2}, '%f%f%f' );
end
end
function xpr = regular_expression( atom, id )
fmt = [
'(?m) ' ...
'^ ' ...
'timestep ' ...
'\\x20+ ' ...
'( ' ...
' \\d+ ' ...
') ' ...
'\\x20 ' ...
'.+? ' ...
'^%s ' ...
'\\x20+ ' ...
'%d ' ...
'\\x20 ' ...
'(?-s) ' ...
'.+ ' ...
'\\n ' ...
'( ' ...
' .+ ' ...
') ' ...
'$ ' ...
];
fmt( isspace(fmt) ) = [];
xpr = sprintf( fmt, atom, id );
end
.
2018-14-20: Release candidate 2.0, RC2
It's important to know when not to use a regular expression.
RC2 is three times as fast as RC1. I achieved this improvement by
- replacing consecutive spaces by one space and trim trailing spaces of the 3.6GB text file. The trimmed file is half the size of the original file.
- replacing the regular expression by a search with the function, strfind and a simple regular expression.
.
The local function, scan_for_atoms_in_one_chunk_of_frames, now reads
function C = scan_for_atoms_in_one_chunk_of_frames( str, atom_list, id_list )
pos = strfind( str, 'timestep' );
len = length( pos );
stp = nan( len, 1 );
for jj = 1 : len
stp(jj) = sscanf( str(pos(jj):pos(jj)+24), '%*s%f', 1 );
end
len = length( atom_list );
C = cell( 1, len );
for jj = 1 : len
M = parse_one_chunk_of_frames( str, atom_list{jj}, id_list{jj} );
M(:,1) = stp;
C{jj} = M;
end
end
function M = parse_one_chunk_of_frames( str, atom, id )
xpr = regular_expression( atom, id );
pos = regexp( str, xpr, 'start' );
len = length( pos );
M = nan( len, 4 );
for jj = 1 : len
[ ~, remain ] = strtok( str( pos(jj) : pos(jj)+120 ), char(10) );
M( jj, 2:4 ) = sscanf( remain, '%f%f%f', 3 );
end
end
function xpr = regular_expression( atom, id )
fmt = '\\<%s %d ';
xpr = sprintf( fmt, atom, id );
end
Profiling of RC2
In both RC1 and RC2 two function calls totally dominate the execution time. They are
- fread and
- scan_for_atoms_in_one_chunk_of_frames, i.e. regexp and strfind, respectively.
With RC1 the execution time of reading and parsing where equal for scanning for one "atom". With RC2 reading is nearly twice as fast because of the smaller size of the test file. With RC2 parsing is several times faster than with RC1.
These three clips are from profiling of RC2 for scanning for one "atom"
.
.
.
Example
Restart of Matlab
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 19.662046 seconds.
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O','O','O'}, {18,19,20} ); toc
Elapsed time is 11.702999 seconds.
>>
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,19,20,5884,5890,6107} ); toc
Elapsed time is 16.044491 seconds.