Removing specific characters from string in nested cells

Question

I have a series of strings contained within a nested cell array (because regexp loves to nest cells), and I would like to remove every character that is neither numeric nor white space, namely asterisks, so that I can convert the strings to doubles.

I'm looking for the least painful way of removing these special characters from all of the strings. I do not have a sample file to attach, sorry, but I have described the shape of a sample array below.

X == 1x1 cell
X{1} == 1x1 cell (because regexp can't help itself apparently)
X{1}{1} = {'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '};

12 comments (10 older comments hidden)

Bob Thompson 2018-6-13

Stephen, it is related to the same file, but not the same part of the file. I believe I figured the other question out, but didn't think it was elegant enough to post as an answer to my own question.

I am unable to upload an actual sample document, but a sample of what I'm extracting from would be the following.

   1  ****TABLE1****
   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,
   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,
   1  ****TABLE2****

I am trying to capture the values of columns 1 and 3 from the table. I am specifically having trouble with column 3, which contains the asterisks; column 1 works fine with str2double.

col3{1} = regexp(input, '\<COLUMN3=\s*(.{1,400})1  \*{4}','tokens'); % asterisks escaped so they match literally
col3{1} = regexp(col3{1}{1}, '\s+','split');

I am initiating the first level of the cell as I will have multiple tables. The use of (.{1,400}) was done because I don't know how many values are in the table, and I cannot simply do (.*) because '1 **' occurs multiple times throughout the file. I don't think I can use \d or \w because of the ',' and '.' mixed in with the values. I used the second regexp to split the single string the first resulted in, as I found this more consistent with use through str2double than simply applying str2double to the entire string.
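A lazy quantifier is one possible way around the `(.*)` problem described above: `.*?` stops at the first following marker instead of the last one, so no length cap is needed. The sketch below is only an illustration; the variable `txt` and its contents are made up to resemble the sample table and are not taken from the real file.

```matlab
% Hypothetical sample text standing in for the real file contents
txt = sprintf(['   1  ****TABLE1****\n' ...
               '   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,\n' ...
               '   1  ****TABLE2****\n' ...
               '   COLUMN3= 9.87, 6.54*,\n' ...
               '   1  ****TABLE3****\n']);

% .*? is lazy: each match ends at the first "1  ****" that follows it
tok = regexp(txt, 'COLUMN3=\s*(.*?)\s*1  \*{4}', 'tokens');
raw = tok{1}{1};   % '1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,'

% Pull out only the numeric parts, which drops the commas and asterisks
vals = str2double(regexp(raw, '\d+\.?\d*', 'match'));
```

Here `tok` holds one entry per table, so `tok{2}{1}` is the COLUMN3 text of the second table.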

Bob Thompson 2018-6-14

Edited: Bob Thompson 2018-6-14

Hmmm. I'm currently using fileread and just importing the entire file as a single string. I've used fgetl in the past for other scripts, but due to the variability of this file I don't know if it's a good fit. Textscan might work, but I don't know that separating by each \n will work either, as it is possible that my various bits of data will be contained on multiple lines.

I've been working with it some more today, and I realized that my previous code works fine for the first column of values, as these never seem to have special characters. I can therefore get the number of values from this array and use that to build a repeated expression for the third column.

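col1 = regexp(input, '\<COLUMN1=\s*(.{1,400})1  \*{4}','tokens'); % asterisks escaped so they match literally
col1 = regexp(col1{1}, '\s+','split');
colvals(:,1) = str2double(col1{1});
nvals = length(colvals);
% Modified from Paolo's comment. The dot is escaped, and \D (non-digit) is
% used for the separator so it cannot swallow the next value's leading digit.
dups = repmat('(\d*\.\d*)\D{1,3}\s*',1,nvals);
expr = ['COLUMN3=\s+',dups]; % renamed from "string" to avoid shadowing the built-in
col3 = regexp(input, expr, 'tokens');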

This seems to work, and removes the need to conduct the split a second time, which is nice.

I'm not really sure what the ':' from Paolo's comment is supposed to do, I don't see it anywhere in the regexp documentation, and it's not in any of my strings.

Also, OCDER and Paolo, I appreciate your help, so if one of you wants to write up an actual answer I would be happy to accept it.

Bob Thompson 2018-6-15

Ah, I see. It doesn't appear in regexp.m comments, which is where I was looking.

Stephen23 2018-6-15

@Bob Nbob: you are right, it does not appear in the Mfile help. I notice that many other useful regular expression features also do not appear in the Mfile help: notably missing are dynamic expressions, lookaround operators, and named capture.

Both the inbuilt help and the page I linked to give a very useful introduction, and explain all features of regular expressions in MATLAB:

doc regexp
doc('Regular Expressions')
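As a small illustration of two of the features mentioned above (named capture and a lookahead), here is a sketch on a made-up string. The variable `s` and the token names `label` and `first` are chosen purely for this example.

```matlab
s = 'COLUMN3= 1.23, 5.78*,';   % hypothetical input line

% Named capture: the 'names' option returns a struct with one field per token
names = regexp(s, '(?<label>COLUMN\d+)=\s*(?<first>\d+\.\d+)', 'names');
% names.label is 'COLUMN3' and names.first is '1.23'

% Lookahead: match only numbers that are immediately followed by an asterisk
starred = regexp(s, '\d+\.\d+(?=\*)', 'match');   % {'5.78'}
```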

Answer 1 (accepted answer)

Paolo 2018-6-15

Edited: Paolo 2018-6-15

Perhaps this can easily be achieved in two steps. For your input:

    1  ****TABLE1****
   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,
   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,
   1  ****TABLE2****

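Step 1. Find and remove the punctuation characters that trail each number (here ",", "." and "*"). The lookbehind requires a preceding decimal number, so the decimal points inside the values and the asterisks in the table headers are left alone.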

   data = fileread('CORR.txt');
   expression_sub = '(?<=\d\.\d*\*?)([\*\.,])';
   data = regexprep(data,expression_sub,'');

After the replacement, data no longer contains those characters:

     '   1  ****TABLE1****
        COLUMN1= 1.12 2.23 3.34 4.45 5.56 6.67
        COLUMN2= 0.00 0.00 0.00 0.00 0.00 0.00
        COLUMN3= 1.23 0.34 3.45 5.78 6.54 8.23
        1  ****TABLE2****
     '

Step 2. Match your data. The expression is greedy and will match as many "digit, full stop, digits" groups as it can, so you don't need to repmat your expression as you showed.

 expression_match = '(?<=COLUMN[13]=\s)(\d.?\d*\s)*';
 [tokens,match] = regexp(data,expression_match,'tokens','match'); % data is the output of Step 1

Convert the matched text to numbers in MATLAB:

 column1 = str2double(strsplit(cell2mat(tokens{1}),' '));
 column3 = str2double(strsplit(cell2mat(tokens{2}),' '));

column1 =

1.1200 2.2300 3.3400 4.4500 5.5600 6.6700

column3 =

1.2300 0.3400 3.4500 5.7800 6.5400 8.2300

2 comments

Bob Thompson 2018-6-18

Ha, using (\d.?\d*\s)* is pretty slick. I'm a little sad I didn't think of that.

Stephen23 2022-12-30

@Bob Thompson: the dot needs to be escaped as well (otherwise it matches any character), e.g.:

(\d+\.?\d*\s)*
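To see what the escape changes, here is a minimal comparison on a made-up string (the variable `s` is hypothetical and chosen so that the two patterns disagree):

```matlab
s = '5x3 1.5';   % hypothetical input containing a non-numeric "x"

% Unescaped dot: ".?" accepts any single character, so "x" is swallowed
loose  = regexp(s, '\d.?\d*',   'match');   % {'5x3', '1.5'}

% Escaped dot: only a literal full stop may sit between the digits
strict = regexp(s, '\d+\.?\d*', 'match');   % {'5', '3', '1.5'}
```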

Answer 2

George Abrahams 2022-12-30

The others are right to fix the root problem causing the tricky nested cell array. Having said that, for future reference, my deepreplace function on File Exchange / GitHub would have done exactly what you requested.

x = {{{'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '}}};
% Remove any character except for digits (0-9) and period (.)
match = regexpPattern('[^\d.]');
x = deepreplace(x,match,'');
% x = 1×1 cell array
%     {1×1 cell}
% x{1} = 1×1 cell array
%     {5×1 cell}
% x{1}{1} = 5×1 cell array
%     {'1234.'}
%     {'12.'  }
%     {'1234.'}
%     {'123.' }
%     {'321.' }
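If the nesting itself does not need to be preserved, a base-MATLAB sketch of a similar clean-up is possible without the File Exchange function. This is not deepreplace: it assumes the outer levels are singleton cells (as regexp tends to produce) and simply unwraps them before cleaning, and the variable names are illustrative only.

```matlab
x = {{{'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '}}};

% Unwrap singleton cell levels until the cell array of char vectors is reached
c = x;
while iscell(c) && isscalar(c) && iscell(c{1})
    c = c{1};
end

% regexprep accepts a cell array of char vectors and cleans each element;
% everything except digits and the period is removed
clean = regexprep(c, '[^\d.]', '');
vals  = str2double(clean);   % 5x1 double: 1234; 12; 1234; 123; 321
```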
