Find indices of multiple strings within another string

I am trying to efficiently find which strings (character vectors) match between two cell arrays.
One cell array contains ~1000 equations written as strings that I'm trying to parse by matching to strings in another array (100,000 items). I need to know the indices from the 100,000 items that are found within the ~1000 equations. There may be multiple of the 100,000 items found within each of the 1000 equations.
I'm currently implementing this as such:
Equations.Equation % this is a list of ~1000 equations, a cell array of character vectors
OutputData.DataName % list of ~100,000 possible strings I'm looking for in the equations (my variable names)
for ii = 1:length(Equations)
matches=cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName);
indices = find(matches);
% do some other stuff with the matches found, then move onto the next iteration of the loop
end
This is fairly slow. Is there a way to more efficiently find within Equations(ii).Equation which items within OutputData.DataName are found and the index of those items?
4 Comments
Paul 2022-4-9
Something's not working with this example data and the code in the question. Is there a typo somewhere?
Equations.Equation = { '(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'}
Equations = struct with fields:
Equation: {3×1 cell}
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
matches=cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName);
indices = find(matches);
% do some other stuff with the matches found, then move onto the next iteration of the loop
end
Error using cellfun
Non-scalar in Uniform output, at index 1, output 1.
Set 'UniformOutput' to false.
Voss 2022-4-9
It seems like Equations is actually a struct array:
Equations = struct('Equation',{ ...
'(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'})
Equations = 3×1 struct array with fields:
Equation
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).'
indices = find(matches)
% do some other stuff with the matches found, then move onto the next iteration of the loop
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12


Accepted Answer

Paul 2022-4-10
It looks like using string variables with an inner loop is much faster than a cell array with cellfun, at least here on Answers with the data provided.
Original code, modified by @_:
Equations = struct('Equation',{ ...
'(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).'
indices = find(matches)
% do some other stuff with the matches found, then move onto the next iteration of the loop
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
Convert the cell arrays to strings, and implement an inner loop to compute matches. Verify that the results are the same.
equations = string({Equations.Equation});
dataname = string(OutputData.DataName);
matches = false(1,numel(dataname)); % preallocate as logical (fixes the "mathces" typo)
for ii = 1:numel(equations)
for jj = 1:numel(dataname)
matches(jj) = contains(equations(ii),dataname(jj));
end
matches
indices = find(matches)
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
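One possible variation on the loops above, offered as a sketch rather than a benchmarked recommendation: since contains accepts a string array as its first input and tests every element against a single pattern, the loop can run over the names while each call covers all equations at once. The matrix name M is introduced here for illustration and is not part of the original code. Note that with ~1000 equations and ~100,000 names, M would hold about 10^8 logical values (roughly 100 MB), so whether this helps is something to measure on the real data.

```matlab
% Sketch only: M is a hypothetical name, not from the original post.
Equations = struct('Equation',{ ...
    '(123X + 123Y).^2'; ...
    '500 + 456X + 123Z'; ...
    '200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
    '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};

equations = string({Equations.Equation});
equations = equations(:);                  % force a column vector
dataname  = string(OutputData.DataName);

% M(ii,jj) is true when equations(ii) contains dataname(jj)
M = false(numel(equations), numel(dataname));
for jj = 1:numel(dataname)
    % one call tests every equation against a single name
    M(:,jj) = contains(equations, dataname(jj));
end

for ii = 1:numel(equations)
    indices = find(M(ii,:));
    % do some other stuff with the matches found
end
```

Each row of M plays the role of the matches vector in the loops above, so find(M(ii,:)) gives the same indices for equation ii.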
Wrap an outer loop around the original code to test timing.
ntrials = 1e5;
tic
for trials = 1:ntrials
for ii = 1:length(Equations)
matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).';
indices = find(matches);
% do some other stuff with the matches found, then move onto the next iteration of the loop
end
end
toc
Elapsed time is 15.236180 seconds.
tic
for trials = 1:ntrials
for ii = 1:numel(equations)
for jj = 1:numel(dataname)
matches(jj) = contains(equations(ii),dataname(jj));
end
matches;
indices = find(matches);
end
end
toc
Elapsed time is 2.448469 seconds.
I was actually surprised that there isn't a string function that can replace that inner loop, but I couldn't find one. Maybe it can be done using a particular pattern, but I couldn't figure that out either.
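A different approach that might avoid the inner loop, sketched here under a strong assumption: if every variable name follows a regular format, the names can be pulled out of each equation with regexp and looked up with ismember. The pattern '\d+[A-Z]' (digits followed by one capital letter) is only a guess based on the example data, and namePattern and allIndices are hypothetical names. This is also not equivalent to contains: it matches whole extracted tokens rather than arbitrary substrings, so a name such as '23X' would not be reported as found inside '123X'.

```matlab
% Sketch only: namePattern is an ASSUMPTION inferred from the example data.
Equations = struct('Equation',{ ...
    '(123X + 123Y).^2'; ...
    '500 + 456X + 123Z'; ...
    '200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
    '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};

namePattern = '\d+[A-Z]';                  % hypothetical name format
allIndices = cell(numel(Equations), 1);    % hypothetical container for results

for ii = 1:numel(Equations)
    % extract candidate variable names from this equation
    tokens = regexp(Equations(ii).Equation, namePattern, 'match');
    % look up all candidates in the name list in one call
    [found, loc] = ismember(tokens, OutputData.DataName);
    % unique also sorts, matching the order that find returns
    allIndices{ii} = unique(loc(found));
end
```

With this layout the per-equation work depends on the number of tokens in the equation rather than on the length of the name list, which is the reason it may scale better, but that would need to be timed on the real data.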

Release

R2016b