Read a specific column from the last line of a very large text file

I have a very large text file, and I want to read a specific value from the nth column of the last line. Methods I have looked at tend to import the entire text file and then discard all the data which you don't want afterwards. Is there a way of reading the last line more efficiently?
Thanks
2 Comments
Elvina 2015-9-1
The file contains 15 columns and a varying number of rows (anything from 50,000 up to around 1,000,000). The last line of the text file is blank. I want to extract the last data point in the 12th column (counting the first column as 1).
I will be repeating this task more than 200 times, so ideally I would like to extract the required data point fairly quickly.
dpb 2015-9-1
Edited: dpb 2015-9-1
Well, that's a (minimal) start, but what's the format within the file? Fixed width, or something else? As Walter showed, this way "there be dragons"; with his approach you would need to keep the last two lines so that you can throw away the blank one at the end. The easiest way forward would be to attach a short section of an example file that includes the last line to be parsed, and to make sure it ends exactly as the real one does (number of linefeeds, etc.).
As you can see from his code, the amount of code and debugging effort mounts very quickly; most likely, simply reading the files in toto and saving the desired data will get you to the end result more quickly.
The other alternatives I'd suggest, if there's any possibility at all of affecting how these files are generated, would be either to write them as unformatted (binary) files so you can use memmapfile or similar tools, or to write a second file that overwrites the last line each time and parse this particular value from that file instead of from the full (I presume) log file.
Alternatively, if the files already exist and you can't (easily) recreate them or the auxiliary file suggested above, you could run a preliminary process that extracts just the data value so it is available for the "real" job. That process could be pretty simple and could be dispatched to run in the background, or even overnight and on weekends, so that its actual run time has no real impact on the main task.
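As a rough illustration of why an unformatted file makes this easy, here is a minimal sketch. It assumes the data were written as raw doubles, row by row, with 15 values per row; the demo file name and contents are made up so the snippet runs on its own. It uses plain fseek/fread rather than memmapfile, purely because that is the shortest way to show the idea.

```matlab
% Assumptions (not from the thread): raw doubles, stored row by row, 15 per row.
nCols = 15;          % columns per record
wantCol = 12;        % column of interest, counting from 1
bytesPerValue = 8;   % size of one double

% Build a small demo file (hypothetical) so the sketch is self-contained.
fname = [tempname '.bin'];
data = reshape(1:(4*nCols), nCols, 4).';   % 4 rows x 15 columns
fid = fopen(fname, 'w');
fwrite(fid, data.', 'double');             % transpose so rows are contiguous on disk
fclose(fid);

% Jump directly to column 12 of the last record and read a single value.
fid = fopen(fname, 'r');
fseek(fid, -(nCols - wantCol + 1) * bytesPerValue, 'eof');
lastValue = fread(fid, 1, 'double');
fclose(fid);
delete(fname);
```

With fixed-size records the position of the value is pure arithmetic, so the cost no longer depends on the number of rows in the file.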


Answers (2)

dpb 2015-8-27
Edited: dpb 2015-8-28
In a text file, unless it has very specific other characteristics (such as fixed-width fields), the answer is generally "no".
In a specific case such as this, where the target is known a priori to be the last line/record, you may be able to "cheat": use fseek to position the file some number of bytes before the end (as determined by the OS) and then parse a line from there. More often than not, though, writing and debugging such an approach costs more effort than the brute-force technique. I'd say it's only worth it in one of two cases:
a) this specific file structure is to be read for this specific value a very high number of times, or
b) the file really is so large as to make reading it impractical. So, just how big is "very large"? These days, memory is rarely an issue.
ADDENDUM: If you are really interested in pursuing this, we would need specifics on the file layout and on what needs to be read.
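For comparison, the brute-force technique mentioned above might look like the following sketch. It assumes the file is purely numeric and whitespace-delimited with 15 columns (the actual format was not stated at this point in the thread), and it builds a tiny hypothetical demo file, ending in a blank line, so it can be run as-is.

```matlab
% Build a small demo file (hypothetical): 3 rows x 15 columns, plus a trailing blank line.
fname = [tempname '.txt'];
rows = reshape(1:45, 15, 3).';
fid = fopen(fname, 'w');
fprintf(fid, [repmat('%g ', 1, 14) '%g\n'], rows.');
fprintf(fid, '\n');
fclose(fid);

% Read the whole file, but keep only column 12: %*f parses a field and discards it.
fmt = [repmat('%*f ', 1, 11) '%f ' repmat('%*f ', 1, 3)];
fid = fopen(fname, 'r');
C = textscan(fid, fmt);
fclose(fid);
delete(fname);

col12 = C{1};             % every value of column 12
lastValue = col12(end);   % value from the last data row
```

Skipping the unwanted fields in the format string keeps memory use down to a single column, even though every line is still parsed.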

Walter Roberson 2015-8-28
fid = fopen('YourBigText.txt', 'r'); %NOT 'rt' !
Nprobe = 5; %number of leading lines sampled to estimate the line length
lsizes = zeros(1,Nprobe);
prevline = [];
for K = 1 : Nprobe
    thisline = fgets(fid); %not fgetl()
    if ~ischar(thisline); break; end %already reached end of file
    lsizes(K) = size(thisline, 2);
    prevline = thisline;
end
if feof(fid)
    lastline = prevline; %we already read it in
else
    %seek back from the end by twice the longest probed line
    offset_estimate = max(lsizes) * 2;
    fseek(fid, -offset_estimate, 'eof');
    prevline = [];
    linesread = 0;
    while true
        thisline = fgets(fid);
        if ~ischar(thisline); break; end %we reached end of file
        prevline = thisline;
        linesread = linesread + 1;
    end
    if linesread < 2
        fprintf('last line much longer than expected, code gives up\n');
        fclose(fid);
        error('last line too long');
    end
    lastline = prevline;
end
fclose(fid); %done with the file
lastline = regexprep(lastline, '[\r\n]+$', ''); %remove line terminator
The reason for testing that at least 2 lines were read is that if you read only one line before end of file, then your fseek() probably landed you in the middle of the final line, though by chance it might have landed you exactly at the beginning of the final line. We expect to land in the middle of a line, so the first line read is the remainder of that line; after that we expect to read one or more full lines before encountering end of file.
The code could certainly be altered to change the seek offset and try again until it was certain it had read a full line.
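One way the retry idea might be sketched is shown below. This is not Walter's code: the starting offset of 64 bytes and the doubling strategy are assumptions, and the demo file (a short line followed by a 500-character last line) is hypothetical, chosen so that the first few guesses are too small.

```matlab
% Build a demo file (hypothetical) whose last line is much longer than the first.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'short line\n');
fprintf(fid, '%s\n', repmat('x', 1, 500));
fclose(fid);

fid = fopen(fname, 'r');
fseek(fid, 0, 'eof');
fileSize = ftell(fid);     % total size in bytes
offset = 64;               % initial guess, in bytes (assumed)
while true
    offset = min(offset, fileSize);   % never seek before the start of the file
    fseek(fid, -offset, 'eof');
    prevline = '';
    linesread = 0;
    while true
        thisline = fgets(fid);
        if ~ischar(thisline); break; end   % reached end of file
        prevline = thisline;
        linesread = linesread + 1;
    end
    % The last line is certainly complete if a (possibly partial) line was
    % read before it, or if reading started at the beginning of the file.
    if linesread >= 2 || offset == fileSize
        lastline = prevline;
        break
    end
    offset = offset * 2;   % guess was too small: double it and try again
end
fclose(fid);
delete(fname);
lastline = regexprep(lastline, '[\r\n]+$', '');   % remove line terminator
```

Doubling the offset bounds the number of retries, and capping it at the file size guarantees the loop terminates even when the file consists of a single very long line.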
Note: if your file ends with an empty line, as is not uncommon, then the code will find that empty line. Your "interface contract" is that the line of interest is the last line of the file, not that it is the last non-empty or last non-blank line of the file.
1 Comment
Walter Roberson 2015-9-1
Adjusted to ignore empty lines:
fid = fopen('YourBigText.txt', 'r'); %NOT 'rt' !
Nprobe = 5; %number of leading lines sampled to estimate the line length
lsizes = zeros(1,Nprobe);
prevline = '';
for K = 1 : Nprobe
    thisline = fgets(fid); %not fgetl()
    if ~ischar(thisline); break; end %already reached end of file
    lsizes(K) = size(thisline, 2);
    if ~isempty(deblank(thisline)) %remember only non-empty lines
        prevline = thisline;
    end
end
if feof(fid)
    lastline = prevline; %we already read it in
else
    %seek back from the end by twice the longest probed line
    offset_estimate = max(lsizes) * 2;
    fseek(fid, -offset_estimate, 'eof');
    prevline = '';
    linesread = 0; %counts non-empty lines only
    while true
        thisline = fgets(fid);
        if ~ischar(thisline); break; end %we reached end of file
        if ~isempty(deblank(thisline))
            prevline = thisline;
            linesread = linesread + 1;
        end
    end
    if linesread < 2
        fprintf('last non-empty line much longer than expected, code gives up\n');
        fclose(fid);
        error('last line too long');
    end
    lastline = prevline;
end
fclose(fid); %done with the file
lastline = regexprep(lastline, '[\r\n]+$', ''); %remove line terminator
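Once the last non-empty line is available as a character vector, the 12th column still has to be pulled out of it. A minimal sketch follows, assuming the fields are numeric and separated by whitespace (the thread never states the delimiter); the sample line is made up so that the snippet runs on its own.

```matlab
% Hypothetical last line: 15 whitespace-separated numeric fields.
lastline = sprintf('%g ', 101:115);

wantCol = 12;                        % column of interest, counting from 1
values = sscanf(lastline, '%f');     % parse every numeric field into a column vector
if numel(values) < wantCol
    error('expected at least %d fields, found %d', wantCol, numel(values));
end
value12 = values(wantCol);
```

If the file were comma-delimited instead, something like str2double(strsplit(lastline, ',')) would be a reasonable substitute for the sscanf call.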
