Reading columns from a poorly-formatted text file

4 次查看(过去 30 天)
I need to read data from an old text file which has fixed columns for the data. I have tried using readtable, which was partly succesfull but seems to have been fooled by the data in places.
A shortened version of the file is attached. It has the following 'features' :
  • All columns are not always there
  • Sometimes there is no space between the columns
  • It uses -998 or -998.0 for NaN;
  • There are blank lines.
Is there an easy way to read this file, or do I need to be old-fashioned and read it line-by-line and then parse for the values?

采纳的回答

Voss
Voss 2024-8-29
编辑:Voss 2024-8-29
This file looks like a fixed-width file, but it's not quite, because lines 68 and 69 differ in format from the other lines.
dbtype COUNT.txt 65:70
65 29 JUN 97 00:00 0 01109 40 66 30 JUN 97 00:00 0 0 49 17 67 04 SEP 97 00:00 4 1 0 8 68 05 SEP 97 00:00 0 16 5 68 69 06 SEP 97 00:00 0 15 1 15 70 07 SEP 97 00:00
For example, if you use fixed field sizes to parse line 65 like this
% 29 JUN 97 00:00 0 01109 40
% ^^^^^^^^^^^^^^^ 29 JUN 97 00:00
% ^^^^ 0
% ^^^^ 0
% ^^^^ 1109
% ^^^^ 40
Then the same parsing for line 68 gives this:
% 05 SEP 97 00:00 0 16 5 68
% ^^^^^^^^^^^^^^^ 05 SEP 97 00:00
% ^^^^ 0
% ^^^^ 1
% ^^^^ 6 5
% ^^^^ 68
Notice the 16 gets split apart: that field becomes 1 and the 6 is included in the next field, which becomes '6 5' (which will be interpreted as NaN).
On the off chance that lines 68 and 69 were inadverdently modified during the shortening process and the original file doesn't have this problem, you can use some code like this to read it:
opts = fixedWidthImportOptions( ...
'VariableNamesLine',3, ...
'DataLines',5, ...
'NumVariables',7, ...
'VariableWidths',[15 4 4 4 4 4 6], ...
'VariableTypes',{'datetime','double','double','double','double','double','double'});
opts = setvaropts(opts,1,'InputFormat','dd MMM yy HH:mm','DatetimeFormat','preserveinput');
T = readtable('COUNT.txt',opts);
T = standardizeMissing(T,-998);
disp(T)
Var1 VT LP HYB RF PLK Var7 _______________ ___ ___ ____ ___ ___ ____ 03 Jan 96 00:00 4 12 0 0 0 NaN 04 Jan 96 00:00 1 9 1 0 0 NaN 05 Jan 96 00:00 1 6 0 0 0 NaN 06 Jan 96 00:00 0 2 0 0 0 NaN 07 Jan 96 00:00 1 0 0 0 0 NaN 10 Jan 96 00:00 1 0 0 0 0 NaN 11 Jan 96 00:00 0 2 0 0 0 NaN 12 Jan 96 00:00 1 3 0 0 0 NaN 15 Mar 96 00:00 2 37 1 10 0 NaN 16 Mar 96 00:00 1 20 0 11 0 NaN 17 Mar 96 00:00 2 5 3 17 0 NaN 18 Mar 96 00:00 5 1 1 14 2 NaN 19 Mar 96 00:00 0 5 10 20 0 NaN 20 Mar 96 00:00 0 6 25 18 0 NaN 21 Mar 96 00:00 2 0 3 10 0 NaN 22 Mar 96 00:00 0 3 9 13 0 NaN 23 Mar 96 00:00 0 15 1 28 0 NaN 24 Mar 96 00:00 2 22 0 30 0 NaN 25 Mar 96 00:00 0 7 27 32 0 NaN 26 Mar 96 00:00 1 13 152 39 0 NaN 27 Mar 96 00:00 3 1 40 33 0 NaN 28 Mar 96 00:00 0 1 21 27 0 NaN 29 Mar 96 00:00 1 2 23 24 0 NaN 30 Mar 96 00:00 0 0 13 46 0 NaN 31 Mar 96 00:00 2 3 0 45 0 NaN 01 Apr 96 00:00 2 0 6 31 0 NaN 02 Apr 96 00:00 3 5 8 38 0 NaN 03 Apr 96 00:00 NaN NaN NaN NaN 0 NaN 04 Apr 96 00:00 NaN NaN NaN NaN 0 NaN 05 Apr 96 00:00 0 1 138 37 0 1 06 Apr 96 00:00 0 4 21 110 0 1 07 Apr 96 00:00 0 2 56 53 0 1 08 Apr 96 00:00 4 4 132 58 0 2 09 Apr 96 00:00 1 0 59 61 0 3 10 Apr 96 00:00 0 0 720 25 0 1 11 Apr 96 00:00 0 0 943 0 0 1 12 Apr 96 00:00 0 0 1123 0 0 1 27 Jul 96 00:00 5 3 9 47 48 NaN 28 Jul 96 00:00 13 4 8 85 68 NaN 29 Jul 96 00:00 23 3 34 115 19 NaN 30 Jul 96 00:00 11 0 104 58 32 NaN 09 Nov 96 00:00 28 2 0 0 0 NaN 10 Nov 96 00:00 117 2 0 0 0 NaN 11 Nov 96 00:00 64 1 0 3 0 NaN 12 Nov 96 00:00 5 0 0 5 0 NaN 13 Nov 96 00:00 0 0 0 5 0 NaN 14 Nov 96 00:00 41 2 0 4 0 NaN 15 Nov 96 00:00 1 4 0 2 0 NaN 16 Nov 96 00:00 1 1 0 1 NaN NaN 17 Nov 96 00:00 1 1 0 1 NaN NaN 19 Mar 97 00:00 29 0 88 24 NaN NaN 20 Mar 97 00:00 0 1 12 15 NaN NaN 21 Mar 97 00:00 8 3 24 19 NaN NaN 22 Mar 97 00:00 37 4 50 22 NaN NaN 23 Jun 97 00:00 9 4 1373 22 NaN NaN 24 Jun 97 00:00 5 0 378 10 NaN NaN 25 Jun 97 00:00 4 0 102 17 NaN NaN 26 Jun 97 00:00 8 0 22 18 NaN NaN 27 Jun 97 00:00 17 0 70 23 NaN NaN 28 Jun 97 00:00 12 0 106 54 NaN NaN 29 Jun 97 00:00 0 0 1109 40 NaN NaN 30 Jun 97 00:00 0 0 49 17 NaN NaN 04 Sep 97 00:00 4 1 0 8 NaN NaN 05 Sep 97 00:00 0 1 NaN 68 NaN NaN 06 Sep 97 00:00 0 1 NaN 15 NaN NaN 07 Sep 97 00:00 NaN NaN NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN
But again, for this file, the data from lines 68 and 69 (05 and 06 SEP 97) is incorrect because of the misalignment of the 16 and the 15 on those respective lines:
tail(T)
Var1 VT LP HYB RF PLK Var7 _______________ ___ ___ ____ ___ ___ ____ 28 Jun 97 00:00 12 0 106 54 NaN NaN 29 Jun 97 00:00 0 0 1109 40 NaN NaN 30 Jun 97 00:00 0 0 49 17 NaN NaN 04 Sep 97 00:00 4 1 0 8 NaN NaN 05 Sep 97 00:00 0 1 NaN 68 NaN NaN 06 Sep 97 00:00 0 1 NaN 15 NaN NaN 07 Sep 97 00:00 NaN NaN NaN NaN NaN NaN NaT NaN NaN NaN NaN NaN NaN

更多回答(1 个)

Steven Lord
Steven Lord 2024-8-29
I would experiment with the settings controllable using the Import Tool to try to import the data. You could try running Import Tool with your smaller data set, import the data, and check that it imported as you expected. Then generate a script and run it on the larger file.
The conditions you identified that could make it difficult to read are:
  • All columns are not always there
  • Sometimes there is no space between the columns
  • It uses -998 or -998.0 for NaN;
  • There are blank lines.
If you specify that the file is fixed-width I think the second condition is okay. Unimportable cell handling may take care of the first condition. I think the blank lines will be imported as a row of unimportable cells, which you could repair after the fact with the missing-handling functions like rmmissing. Once you've imported the data, you could use standardizeMissing to replace -998 with NaN.

类别

Help CenterFile Exchange 中查找有关 Data Type Conversion 的更多信息

产品


版本

R2024a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by