How to ignore special characters and retrieve the data prior to the character

I have 40 years of data. Unfortunately, each text file contains the special characters # or *, which mark the highest or lowest temperature for that specific day and month. My code works (apart from the regexp(minT_tbl,'#*','match') call and its counterpart). However, the special characters are confusing the program and making the data wrong. Any help would be great!
close all;
clear all;
clc;
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll.Year = year(dataAll.Day);
dataAll.Month = month(dataAll.Day);
dataAll.DD = day(dataAll.Day)
%delete leap year
LY = (dataAll.Month(:)==2 & dataAll.DD(:)==29);
dataAll(LY,:) = [];
% Unstack variables
minT_tbl = unstack(dataAll,"MinT","Year","GroupingVariables", ["Month","DD"],"VariableNamingRule","preserve")
maxT_tbl = unstack(dataAll,"MaxT","Year","GroupingVariables", ["Month","DD"],"VariableNamingRule","preserve")
yrs =str2double(minT_tbl.Properties.VariableNames(3:end))';
%ignore special characters
regexp(minT_tbl,'#*','match')
regexp(maxT_tbl,'#*','match')
% find min
[Tmin,idxMn] = min(minT_tbl{:,3:end},[],2,'omitnan');
Tmin_yr = yrs(idxMn);
% find max
[Tmax,idxMx] = max(maxT_tbl{:,3:end},[],2,'omitnan');
Tmax_yr = yrs(idxMx);
% find low high
[lowTMax,idxMx] = min(maxT_tbl{:,3:end},[],2,'omitnan');
LowTMax_yr = yrs(idxMx);
% find high low
[highlowTMn,idxMn] = max(minT_tbl{:,3:end},[],2,'omitnan');
HighLowT_yr = yrs(idxMn);
% find avg high
AvgTMx = round(mean(table2array(maxT_tbl(:,3:end)),2,'omitnan'));
% find avg low
AvgTMn = round(mean(table2array(minT_tbl(:,3:end)),2,'omitnan'));
% Results
tempTbl = [maxT_tbl(:,["Month","DD"]), table(Tmax,Tmax_yr,AvgTMx,lowTMax,LowTMax_yr,Tmin,Tmin_yr,AvgTMn,highlowTMn,HighLowT_yr)]
tempTbl2 = splitvars(tempTbl)
FID = fopen('Meda 05 Temperature Climatology.txt','w');
report_date = datetime('now','Format','yyyy-MM-dd HH:mm'); % mm = minutes; MM would print the month
fprintf(FID,'Meda 05 Temperature Climatology at %s \n', report_date);
fprintf(FID,"Month DD Temp Max (°F) Tmax_yr AvgTMax (°F) lowTMax (°F) LowTMax_yr TempMin (°F) TMin_yr AvgTMin (°F) HighlowTMin (°F) HighlowT_yr \n");
fprintf(FID,'%3d %6d %7d %14d %11d %11d %15d %11d %13d %10d %13d %17d \n', tempTbl2{:,1:end}');
fclose(FID);
winopen('Meda 05 Temperature Climatology.txt')
function Tbl = readMonth(filename)
opts = detectImportOptions(filename)
opts.ConsecutiveDelimitersRule = 'join';
opts.MissingRule = 'omitvar';
opts = setvartype(opts,'double');
opts.VariableNames = ["Day","MaxT","MinT","AvgT"];
Tbl = readtable(filename,opts);
Tbl = standardizeMissing(Tbl,{999,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'})
Tbl = standardizeMissing(Tbl,{-99,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'})
[~,basename] = fileparts(filename);
nameparts = regexp(basename, '\.', 'split');
dateparts = regexp(nameparts{end}, '_','split');
year_str = dateparts{end}
d = str2double(extract(filename,digitsPattern));
Tbl.Day = datetime(d(3),d(2),Tbl.Day)
end
6 comments
Cris LaPierre, 2024-2-7
Test it out. It doesn't eliminate them because the month no longer equals 2 and the day no longer equals 29. They are now 3 and 1.
dataAll = table();
dataAll.Day = datetime(1981,2,29) % Feb 29, 1981, which is a non-leap year
dataAll = table
        Day
    ___________
    01-Mar-1981
dataAll.Month = month(dataAll.Day);
dataAll.DD = day(dataAll.Day)
dataAll = 1×3 table
        Day        Month    DD
    ___________    _____    __
    01-Mar-1981      3       1
% Remove all Feb 29 dates from the table
LY = (dataAll.Month(:)== 2 & dataAll.DD(:) == 29);
dataAll(LY,:) = [ ]
dataAll = 1×3 table
        Day        Month    DD
    ___________    _____    __
    01-Mar-1981      3       1
As you can see, the current LY code did not remove the data.
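One way to avoid the rollover is to filter on the raw day number before the datetime is built. The following is only a minimal sketch, not code from this thread: `yr`, `mo`, and `dayOfMonth` are hypothetical stand-ins for the values parsed from one file, and it assumes a February file may list a row for day 29 in any year.

```matlab
% Hypothetical values parsed from one file (February 1981, a non-leap year)
yr = 1981;
mo = 2;
dayOfMonth = (27:29)';   % assumed: the file still lists a row for day 29

% Keep only day numbers that exist in that month, and drop Feb 29 entirely
isRealDay = dayOfMonth <= eomday(yr, mo);    % eomday gives the last day of the month
isFeb29   = (mo == 2) & (dayOfMonth == 29);
keep      = isRealDay & ~isFeb29;

% Build the dates only from the rows that survived the filter
Day = datetime(yr, mo, dayOfMonth(keep));
```

Filtering before the conversion means no row can silently turn into 01-Mar, so the later Month/DD test never has to catch it.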
Jonathon Klepatzki
Well, I am stuck then, because I tried it several different ways and I continue to get the same result.


Accepted Answer

Voss, 2024-2-6 (edited 2024-2-6)
The following code replaces any * or # characters in a text file with spaces (note that this replaces the existing file with a new file of the same name):
% read the file
fid = fopen(filename,'r');
str = fread(fid,[1 Inf],'*char');
fclose(fid);
% replace any * or # with a space (empty char vector should also work)
str = regexprep(str,'[*#]',' ');
% write the new file
fid = fopen(filename,'w');
fwrite(fid,str);
fclose(fid);
If you don't mind losing the original files that have the * and/or # characters in them, you can run this code for each of your text files before running your code or you can incorporate this code into your readMonth function.
If you want to preserve the original files, make a separate copy of them first, or modify the above code to write to a different file, e.g.:
% write the new file
[fp,fn,ext] = fileparts(filename);
fid = fopen(fullfile(fp,[fn '_modified' ext]),'w');
fwrite(fid,str);
fclose(fid);
and tell fileDatastore to use the modified files only, e.g.:
Datafiles = fileDatastore("temp_summary*_modified.txt","ReadFcn",@readMonth,"UniformRead",true);
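If rewriting files on disk is not wanted at all, the same replacement can be done in memory and the cleaned text parsed directly. This is a hedged sketch rather than the answer's own method: the sample text `txt` is hypothetical (in practice it would come from `fileread(filename)`), and it assumes exactly one header line followed by four numeric columns.

```matlab
% Hypothetical file contents; in readMonth this would be txt = fileread(filename);
txt = sprintf([' Day  Maximum Temp  Minimum Temp  Average Temp\n' ...
               ' 01   66#   28    47.0\n' ...
               ' 02   65    29*   47.0\n']);

% Replace the flag characters with spaces, as in the answer above
txt = regexprep(txt, '[*#]', ' ');

% Parse the cleaned text in memory; assumes one header line and four numeric fields
C = textscan(txt, '%f %f %f %f', 'HeaderLines', 1);
Tbl = table(C{:}, 'VariableNames', {'Day','MaxT','MinT','AvgT'});
```

Because nothing is written back, the original files keep their flags and the datastore can keep pointing at the unmodified names.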
20 comments
Star Strider, 2024-2-13
I am lost. My code seems to work correctly when I run it, without any other modifications to it or to the tables or files it creates.
Mentioning me using ‘@’ flags me and I look to see what I need to attend to, if anything, since sometimes it’s just a reference.
Cris LaPierre, 2024-2-13
@Jonathon Klepatzki, you can specify the NumHeaderLines, VariableNamesLine, VariableUnitsLine, VariableDescriptionsLine, and the DataLines import arguments to correctly import a file that has non-data lines between the variable names and data.
However, because you are using a datastore to import your files, the same import options are used to read in all files. Therefore, all files must be formatted the same way or you will get errors like the one you saw.
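As a rough illustration of those import arguments, the sketch below sets `DataLines` so that a non-data line between the header and the data is skipped. The temporary file and its layout are hypothetical, made up only so the example runs on its own; the real files may need different line numbers and delimiters.

```matlab
% Write a small hypothetical file: header, one non-data line, then data
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Day MaxT MinT AvgT\n');
fprintf(fid, 'Station summary (non-data line)\n');
fprintf(fid, '1 66 28 47.0\n');
fprintf(fid, '2 65 29 47.0\n');
fclose(fid);

% Describe the layout explicitly instead of relying on detection
opts = delimitedTextImportOptions('NumVariables', 4);
opts.Delimiter = ' ';
opts.ConsecutiveDelimitersRule = 'join';
opts.LeadingDelimitersRule = 'ignore';
opts.VariableNames = {'Day','MaxT','MinT','AvgT'};
opts.VariableTypes = repmat({'double'}, 1, 4);
opts.DataLines = [3 Inf];   % data starts on line 3 in this made-up layout

T = readtable(fname, opts);
delete(fname);
```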


More Answers (3)

Sulaymon Eshkabilov
Here is one possible solution, to get the data correctly from the data file:
% Open the data file for reading
FID = fopen('temp_summary.05.03_1998.txt', 'r');
% Initialize a cell array to store the cleaned data
C_Lines = {};
% Read the file line by line
N_line = fgetl(FID);
while ischar(N_line)
    % Remove '*' and '#' characters from the line
    C_Line = strrep(N_line, '*', '');
    C_Line = strrep(C_Line, '#', '');
    % Store the cleaned line if it is not empty
    if ~isempty(C_Line)
        C_Lines{end+1} = C_Line;
    end
    % Read the next line
    N_line = fgetl(FID);
end
% Close the file:
fclose(FID);
% Convert the cell array of cleaned lines to a character array:
C_Data = char(C_Lines)
C_Data = 32×49 char array
    ' Day Maximum Temp Minimum Temp Average Temp'
    ' 01 66 28 47.0 '
    ' 02 65 29 47.0 '
    ' 03 62 36 49.0 '
    ' 04 63 31 47.0 '
    ' 05 52 36 44.0 '
    ' 06 53 28 40.5 '
    ' 07 62 26 44.0 '
    ' 08 65 27 46.0 '
    ' 09 69 27 48.0 '
    ' 10 76 28 52.0 '
    ' 11 74 29 51.5 '
    ' 12 62 44 53.0 '
    ' 13 65 43 54.0 '
    ' 14 75 32 53.5 '
    ' 15 73 35 54.0 '
    ' 16 73 34 53.5 '
    ' 17 64 37 50.5 '
    ' 18 69 27 48.0 '
    ' 19 74 34 54.0 '
    ' 20 77 31 54.0 '
    ' 21 76 36 56.0 '
    ' 22 83 37 60.0 '
    ' 23 82 50 66.0 '
    ' 24 64 49 56.5 '
    ' 25 60 43 51.5 '
    ' 26 54 47 50.5 '
    ' 27 52 34 43.0 '
    ' 28 51 34 42.5 '
    ' 29 60 29 44.5 '
    ' 30 57 31 44.0 '
    ' 31 50 32 41.0 '
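The result above is still text, so one more step is needed before min/max can be computed. A minimal sketch of that step follows, under the assumption that every data line holds exactly four numeric fields; the short `C_Lines` here is a hypothetical stand-in for the full cell array built by the loop.

```matlab
% Hypothetical cleaned lines, in the same layout the loop above produces
C_Lines = {' Day Maximum Temp Minimum Temp Average Temp', ...
           ' 01 66 28 47.0 ', ...
           ' 02 65 29 47.0 '};

% Skip the header line and convert each remaining line to a numeric row
dataLines = C_Lines(2:end);
rows = cellfun(@(s) sscanf(s, '%f')', dataLines, 'UniformOutput', false);
M = vertcat(rows{:});   % errors if a line does not have the same field count

% Label the columns with the names used elsewhere in this thread
Tbl = array2table(M, 'VariableNames', {'Day','MaxT','MinT','AvgT'});
```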

Cris LaPierre, 2024-2-7 (edited 2024-2-8)
I think another rather straightforward approach is to treat * and # as delimiters.
I've simplified the read function for readability.
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll = 93×4 table
    Day    MaxT    MinT    AvgT
    ___    ____    ____    ____
     1      66      28      47
     2      65      29      47
     3      62      36      49
     4      63      31      47
     5      52      36      44
     6      53      28      40.5
     7      62      26      44
     8      65      27      46
     9      69      27      48
    10      76      28      52
    11      74      29      51.5
    12      62      44      53
    13      65      43      54
    14      75      32      53.5
    15      73      35      54
    16      73      34      53.5
function Tbl = readMonth(filename)
    Tbl = readtable(filename,"ConsecutiveDelimitersRule","join","ReadVariableNames",false,...
        "Delimiter",{' ','\t','*','#'},"LeadingDelimitersRule",'ignore',...
        'EmptyLineRule','skip');
    Tbl.Properties.VariableNames = {'Day' 'MaxT' 'MinT' 'AvgT'};
end
5 comments
Cris LaPierre, 2024-2-7 (moved 2024-2-8)
Hmm. Works here. Have you shared the full error message (all the red text)?
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll = 124×4 table
    Day    MaxT    MinT    AvgT
    ___    ____    ____    ____
     1      65      12      38.5
     2      68      28      48
     3      65      17      41
     4      57      22      39.5
     5      46      24      35
     6      61      18      39.5
     7      62      25      43.5
     8      58      12      35
     9      64      11      37.5
    10      65      14      39.5
    11      54      22      38
    12      58      40      49
    13      64      27      45.5
    14      65      19      42
    15      59      19      39
    16      62      23      42.5
function Tbl = readMonth(filename)
    Tbl = readtable(filename,"ConsecutiveDelimitersRule","join","ReadVariableNames",false,...
        "Delimiter",{' ','\t','*','#'},"LeadingDelimitersRule",'ignore',...
        'EmptyLineRule','skip');
    Tbl.Properties.VariableNames = {'Day' 'MaxT' 'MinT' 'AvgT'};
end



Walter Roberson, 2024-2-8
To answer the original question:
An alternative way to read the files is to use FixedWidthImportOptions together with readtable() https://www.mathworks.com/help/matlab/ref/matlab.io.text.fixedwidthimportoptions.html
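A hedged sketch of what that could look like is given below. The column widths, the sample file, and the `MaxFlag`/`MinFlag` variable names are all assumptions chosen for illustration; the real files would need their actual widths. The idea is to give each flag character its own one-character column and then leave those columns out of the import.

```matlab
% Write a small hypothetical fixed-width file: 4+5+1+5+1+7 characters per row
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ' Day MaxT  MinT     AvgT');
fprintf(fid, '%4d%5d%c%5d%c%7.1f\n', 1, 66, '#', 28, ' ', 47.0);
fprintf(fid, '%4d%5d%c%5d%c%7.1f\n', 2, 65, ' ', 29, '*', 47.0);
fclose(fid);

% Describe the assumed layout: numeric fields with one-character flag columns
opts = fixedWidthImportOptions('NumVariables', 6);
opts.VariableNames  = {'Day','MaxT','MaxFlag','MinT','MinFlag','AvgT'};
opts.VariableWidths = [4 5 1 5 1 7];
opts.VariableTypes  = {'double','double','char','double','char','double'};
opts.DataLines      = [2 Inf];   % skip the single header line

% Import only the numeric columns, so the flags never reach the data
opts.SelectedVariableNames = {'Day','MaxT','MinT','AvgT'};
T = readtable(fname, opts);
delete(fname);
```

Keeping the flag columns in the options (rather than deleting the characters) also leaves the door open to importing them later if the record markers are ever needed.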

Release: R2020b
