Parsing data from complicated text files

10 views (last 30 days)
I have about 20 years of text files that contain the records of individual tests (about 8GB of plain text files, about 4,000 individual files). Each file has this format:
********************************************************************************
Test Data Report
Station ID: [Test Station ID Number]
Station Part Number: [Test Station Part Number]
Station Serial Number: [Test Station Serial Number]
Test Procedure Number: [Test Procedure Number] [Test Procedure Revision]
Operation: [colloquial test]
Serial Number of test subject: [Serial Number + plus some other info about the test]
Date: [Day, Month Date, year]
Time: [11:00:03 AM]
Operator: [Operator Name]
Number of Results: [NNNN]
Test Result: [Passed/Failed]
********************************************************************************
--------------------------------------------------------------------------------
MEASUREMENT LL READING UL UNITS STATUS
--------------------------------------------------------------------------------
Enter Testing Time: Done
--------------------------------------------------------------------------------
08:00
--------------------------------------------------------------------------------
FOE, CAL: Passed
--------------------------------------------------------------------------------
CALIBRATION IS VALID
--------------------------------------------------------------------------------
Test Start Time: Done
--------------------------------------------------------------------------------
11:00:33 AM
--------------------------------------------------------------------------------
Group Meas Init: Passed
--------------------------------------------------------------------------------
Datapoint_01 LL Measured UL Units Passed
Datapoint_02 LL Measured UL Units Passed
Datapoint_03 LL Measured UL Units Passed
Datapoint_04 LL Measured UL Units Passed
Datapoint_05 LL Measured UL Units Passed
Datapoint_06 LL Measured UL Units Passed
Datapoint_07 LL Measured UL Units Passed
Datapoint_08 LL Measured UL Units Passed
Datapoint_09 LL Measured UL Units Passed
Datapoint_10 LL Measured UL Units Passed
Datapoint_11 LL Measured UL Units Passed
Datapoint_12 LL Measured UL Units Passed
Datapoint_13 LL Measured UL Units Passed
Datapoint_14 LL Measured UL Units Passed
Datapoint_15 LL Measured UL Units Passed
Datapoint_16 LL Measured UL Units Passed
Datapoint_17 LL Measured UL Units Passed
Datapoint_18 LL Measured UL Units Passed
Datapoint_19 LL Measured UL Units Passed
Datapoint_20 LL Measured UL Units Passed
Datapoint_21 LL Measured UL Units Passed
Datapoint_22 LL Measured UL Units Passed
Datapoint_23 LL Measured UL Units Passed
Datapoint_24 LL Measured UL Units Passed
Datapoint_25 LL Measured UL Units Passed
Datapoint_26 LL Measured UL Units Passed
Datapoint_27 LL Measured UL Units Passed
Datapoint_28 Measured UL Units Passed
Datapoint_29 Measured Units Passed
--------------------------------------------------------------------------------
Group Meas Ramp: Passed
--------------------------------------------------------------------------------
Datapoint_01 LL Measured UL Units Passed
Datapoint_02 LL Measured UL Units Passed
Datapoint_03 LL Measured UL Units Passed
Datapoint_04 LL Measured UL Units Passed
Datapoint_05 LL Measured UL Units Passed
Datapoint_06 LL Measured UL Units Passed
Datapoint_07 LL Measured UL Units Passed
Datapoint_08 LL Measured UL Units Passed
Datapoint_09 LL Measured UL Units Passed
Datapoint_10 LL Measured UL Units Passed
Datapoint_11 LL Measured UL Units Passed
Datapoint_12 LL Measured UL Units Passed
Datapoint_13 LL Measured UL Units Passed
Datapoint_14 LL Measured UL Units Passed
Datapoint_15 LL Measured UL Units Passed
Datapoint_16 LL Measured UL Units Passed
Datapoint_17 LL Measured UL Units Passed
Datapoint_18 LL Measured UL Units Passed
Datapoint_19 LL Measured UL Units Passed
Datapoint_20 LL Measured UL Units Passed
Datapoint_21 LL Measured UL Units Passed
Datapoint_22 LL Measured UL Units Passed
Datapoint_23 LL Measured UL Units Passed
Datapoint_24 LL Measured UL Units Passed
Datapoint_25 LL Measured UL Units Passed
Datapoint_26 LL Measured UL Units Passed
Datapoint_27 LL Measured UL Units Passed
Datapoint_28 Measured UL Units Passed
Datapoint_29 Measured Units Passed
--------------------------------------------------------------------------------
Time (after meas): Done
--------------------------------------------------------------------------------
11:01:16 AM
--------------------------------------------------------------------------------
Group Meas Ramp: Passed
--------------------------------------------------------------------------------
Datapoint_01 LL Measured UL Units Passed
Datapoint_02 LL Measured UL Units Passed
Datapoint_03 LL Measured UL Units Passed
Datapoint_04 LL Measured UL Units Passed
Datapoint_05 LL Measured UL Units Passed
Datapoint_06 LL Measured UL Units Passed
Datapoint_07 LL Measured UL Units Passed
Datapoint_08 LL Measured UL Units Passed
Datapoint_09 LL Measured UL Units Passed
Datapoint_10 LL Measured UL Units Passed
Datapoint_11 LL Measured UL Units Passed
Datapoint_12 LL Measured UL Units Passed
Datapoint_13 LL Measured UL Units Passed
Datapoint_14 LL Measured UL Units Passed
Datapoint_15 LL Measured UL Units Passed
Datapoint_16 LL Measured UL Units Passed
Datapoint_17 LL Measured UL Units Passed
Datapoint_18 LL Measured UL Units Passed
Datapoint_19 LL Measured UL Units Passed
Datapoint_20 LL Measured UL Units Passed
Datapoint_21 LL Measured UL Units Passed
Datapoint_22 LL Measured UL Units Passed
Datapoint_23 LL Measured UL Units Passed
Datapoint_24 LL Measured UL Units Passed
Datapoint_25 LL Measured UL Units Passed
Datapoint_26 LL Measured UL Units Passed
Datapoint_27 LL Measured UL Units Passed
Datapoint_28 Measured UL Units Passed
Datapoint_29 Measured Units Passed
--------------------------------------------------------------------------------
Time (after meas): Done
--------------------------------------------------------------------------------
11:01:37 AM
--------------------------------------------------------------------------------
At the moment, the only things I care about are:
  1. Whether a failure occurred or not
  2. When that failure occurred
I will likely want to perform other analyses on the data in the future, but for the moment this will suffice. I want to go through each report, determine whether a failure occurred, record when that failure occurred, and then plot all the failures as a histogram in terms of time, so that I can see whether there are any typical lengths of time it takes for a test to fail.
I have a pretty good amount of experience with working with data once it is in Matlab, but I am much less experienced with importing data, especially this kind of batch importing. Is there a simple way to do this, or am I essentially just using something like textscan() or fscanf() in a loop?
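A minimal sketch of the batch side, assuming all the report files sit under one folder; parse_report here is a hypothetical per-file helper (not existing code) that returns the time-to-failure for a single file, or empty if it passed:
% Batch-loop sketch; the folder path and parse_report are placeholders
files = dir(fullfile('C:\test_reports', '**', '*.txt'));        % recurse through subfolders
fail_times = duration.empty;
for k = 1:numel(files)
    txt = fileread(fullfile(files(k).folder, files(k).name));   % whole report as one char vector
    t = parse_report(txt);            % hypothetical helper: duration, or [] if no failure
    if ~isempty(t)
        fail_times(end+1) = t;        %#ok<AGROW>
    end
end
histogram(fail_times)                 % distribution of time-to-failure across all reports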
  3 Comments
Michael Browne 2021-3-22
I cannot post the full text files because of company policies; this text is just illustrative of the formatting that each file has.
But yes, the text you've highlighted is the header; however, the "Time:" is when the test begins, not when it fails.
The "Test Result" field does record whether a test is an overall pass or fail, but it is a summation of all the data points and all the tests performed on those data points. It is the result of the software looking for a single failure and then recording a "Failed" result in the header. I don't care much about the header result, since what I really care about is when a failure occurs.
So what I need to build is a function that scans through the file, looking for any "Fail" results in a section like this one:
Group Meas Ramp: Passed
--------------------------------------------------------------------------------
and then jump to the time section immediately below it, like this one here:
Time (after meas): Done
--------------------------------------------------------------------------------
11:01:16 AM
--------------------------------------------------------------------------------
It would then take the difference with the time listed in the header of the file
[11:01:16 AM] - [11:00:03 AM]
Then it would store this data as a single point, that will go towards the creation of a histogram.
I can already do this for one single file using the Matlab "Import Data" tool and a lot of manual selection that is specific to each file, but the issue is that I need to do this for 4,000 files, where the failure is located at the end of the file (the test hardware terminates the test in the event of a failure). So it is the automation of this data parsing that is giving me trouble.
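A minimal sketch of that subtraction, using the two example times above; parsing the 'hh:mm:ss AM' text as datetimes makes the difference come back as a duration:
t_start = datetime('11:00:03 AM', 'InputFormat', 'hh:mm:ss a');
t_fail  = datetime('11:01:16 AM', 'InputFormat', 'hh:mm:ss a');
time_to_fail = t_fail - t_start;      % duration of 00:01:13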
dpb 2021-3-23
Well, we still don't have a file to test with, nor is there a case that fails in the text you posted... If you expect somebody to write code, you've got to do your part and give them the help they need from your end; otherwise you get what happened with the other poster: wasted time and effort on code that doesn't work, because what he was given wasn't sufficient and his best guess at what it should be apparently wasn't correct.
In general, however, the idea would be to use readcell to import each file into a cell array, use contains or regexp to find rows with the key words/phrases wanted, and then parse those lines, taking into account where the group headers are to match which are which.
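A rough sketch of that idea, swapping readcell for fileread plus splitlines so that every row stays plain text (the file name is a placeholder):
txt  = fileread('testData.txt');
rows = string(splitlines(txt));                          % one string per line of the file
failRows = find(contains(rows, 'Failed'));               % lines that report a failure
timeRows = find(contains(rows, 'Time (after meas)'));    % lines that head a time block
% ...then pair each failure line with the next time block and read the line two below it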

Accepted Answer

Michael Browne 2021-3-24 (edited 2021-3-24)
Alright, after digging through @Mathieu NOE's code and seeing why it failed, it turns out there are slight variations in the text file formatting that were introduced by ~20 years of test software updates - things like the exact number and types of white space characters changing. However, I did discover another timing flag that I could use, which had stayed consistent. Buried much deeper in the file is an 'elapsed time' flag that is very poorly named, which is why I missed it the first time (I still apologize for not including it in the format posted in the OP). This elapsed-time flag has a format that is both consistent across the years and unique among all the times listed in the data, so I was able to build a pattern for it, and then detect and pull it out. Once I had all those elapsed-time entries, I just selected the last one in the array, since that one will always be the longest, and used it as the time it took each data report to fail.
Also, thank you for your patience @dpb. I actually found myself reading a lot of your replies to other questions about reading strings from text files. This solution of yours made me realize that I was over-thinking my problem.
Here is what I came up with:
filename_in = 'testData.txt';
[output] = extract_data(filename_in);

function [time_to_fail] = extract_data(file)
    txt = fileread(file);       % whole file contents as one character vector
    % Pattern definition: the elapsed-time flag always appears as 'hh : mm : ss'
    elapsed_pattern = digitsPattern(2) + " : " + digitsPattern(2) + " : " + digitsPattern(2);
    time_to_fail = '';          % default output when no failure is found
    % First screen: check whether the report contains any failures
    if contains(txt, 'Failed')
        % 'extract' pulls the 'hh : mm : ss' flags out of the text;
        % 'strrep' then removes the white space, leaving 'hh:mm:ss'
        elapsed_time = strrep(extract(txt, elapsed_pattern), ' ', '');
        % The last elapsed time in the file is always the longest one,
        % i.e. the elapsed time at which the test was terminated
        time_to_fail = elapsed_time(end);
    end
end
Now I just need to wrap my head around handling time in Matlab, but that is off-topic for this issue, and I have not had a chance to do my homework for it yet.
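For reference, a minimal sketch of that last step, assuming the 'hh:mm:ss' text returned by extract_data above (the variable names are illustrative only):
ttf = duration('00:01:13', 'InputFormat', 'hh:mm:ss');   % one elapsed-time value as a duration
% Collect one such duration per failed report, e.g. all_ttf(end+1) = ttf,
% then histogram(all_ttf) gives the distribution of time-to-failure.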

More Answers (1)

Mathieu NOE 2021-3-23
Hello,
This is my 2 cents of code to import the required data. The function will give you the time values (char array) and the number of failures. I tested it with two dummy files: one is your original data, and in the second one I changed the last section to create a Failed condition, plus I added another failed case with a different time value, just to check that my code would correctly detect the 2 failures.
Filename_in = 'data2.txt';
% Filename_out = 'dataABC_reduced.txt';
[Time_init, Time_end, fail_count] = extract_data(Filename_in);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [Time_init, Time_end, fail_count] = extract_data(Filename)
    fid = fopen(Filename);
    tline = fgetl(fid);
    % initialization
    k = 0;                    % line counter
    fail_count = 0;           % failure counter
    Time_init = '';
    Time_end{1} = '';
    line_fail_ind = 0;
    fail_flag = 0;
    while ischar(tline)
        k = k + 1;            % loop over line index
        % store initial Time value (start time) from the header
        if contains(tline, 'Time: [')
            Time_init = deblank(extractBetween(tline, '[', ']'));
        end
        % then search for a 'Failed' case in a "Group Meas Ramp" line
        if contains(tline, 'Group Meas Ramp') && contains(tline, 'Failed')
            fail_flag = 1;
        end
        if fail_flag == 1 && contains(tline, 'Time (after meas)')
            line_fail_ind = k;
        end
        % time of failure: capture when running index k = line_fail_ind + 2
        % (and fail_flag == 1)
        if fail_flag == 1 && k == line_fail_ind + 2
            fail_count = fail_count + 1;
            Time_end{fail_count} = tline;
            fail_flag = 0;    % reset fail_flag
        end
        tline = fgetl(fid);   % read the next line
    end
    fclose(fid);
end
  3 Comments
Mathieu NOE 2021-3-23
Hi,
would you be able to copy-paste the section of data that does not seem to work 100% with my code?
dpb 2021-3-24
Is this one test/file?
Is the Group Meas Init: section of interest? There is no time after it; only after the "Ramp" section is an ending time given. I presume that if the INIT fails, the rest of the test is aborted and consequently there is no file?
Need all the ground rules...
