Matching combinations of strings

Marcus Glover 2024-6-17
Edited: DGM 2024-6-22
I have a table TT with a string variable TT.name. I want to return true if TT.name matches any entry in another table variable OK.name. However, I have some complications I am having a hard time parsing.
Many of the strings in TT.name are combinations of strings that appear in OK.name. I want to include these as a true match. Sometimes they have a + symbol, sometimes just a space. Further complicating matters, the table OK contains some entries with spaces, and if they do I want to treat them as an entire entry, and not break them up at the spaces.
I believe I will usually have a combination of 2 strings only, though 3 and 4 may be possible.
TT = table(["Green"; "Red"; "Blue"; "Black Blue"; "Black"; "Blue Green"; "Red + Blue"; "Red Orange"; "Red + White"; "Black Blue Red"], 'VariableNames', {'name'})
TT = 10x1 table
          name
    ________________
    "Green"
    "Red"
    "Blue"
    "Black Blue"
    "Black"
    "Blue Green"
    "Red + Blue"
    "Red Orange"
    "Red + White"
    "Black Blue Red"
OK = table(["Red"; "Green"; "Blue"; "Black Blue"], 'VariableNames', {'name'})
OK = 4x1 table
        name
    ____________
    "Red"
    "Green"
    "Blue"
    "Black Blue"
This is the output I want, but without manually changing rows 6, 7, and 10:
TT.match=ismember(TT.name,OK.name);
TT.match([6 7 10])=1
TT = 10x2 table
          name          match
    ________________    _____
    "Green"             true
    "Red"               true
    "Blue"              true
    "Black Blue"        true
    "Black"             false
    "Blue Green"        true
    "Red + Blue"        true
    "Red Orange"        false
    "Red + White"       false
    "Black Blue Red"    true
In the example, "Blue Green" and "Red + Blue" are true matches, because "Blue", "Green" and "Red" all appear as entries in OK.name.
Similarly, "Black Blue Red" is OK because it is a combination of "Black Blue" and "Red".
"Black" is not a match, because the only entry in OK.name containing it is "Black Blue", and I do not want to separate the words of entries in this table.
"Red Orange" and "Red + White" are not matches because only "Red" is in the OK table.
  2 comments
Stephen23 2024-6-18
Edited: Stephen23 2024-6-18
The task is ill-defined, and most likely impossible in a general sense: this is due to the same delimiters being used to separate words in OK as well as to separate combinations from TT. Consider:
TT = "black blue" + "red" -> "black blue red"
OK = ["black", "blue red"]
Also note that a naive approach considering all permutations of OK will quickly become intractable.
Questions:
  • what size is OK ?
  • what size is TT ?
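
To make the ambiguity above concrete, here is a minimal sketch. It only shows that two different pairs of entries collapse to the same combined string once a space serves as both the word separator and the combination separator; the variable names `a` and `b` are illustrative and not from the thread.

```matlab
% Two different pairs of entries, each joined with a single space.
a = join(["black blue", "red"], " ");
b = join(["black", "blue red"], " ");

% The combined strings are identical, so the original pair cannot be
% recovered from the combined string alone.
isequal(a, b)   % true: both are "black blue red"
```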
Marcus Glover 2024-6-18
Edited: Marcus Glover 2024-6-18
I think the size of OK (~250) is indeed going to make this intractable (TT has hundreds of thousands of entries). The solution is to fix the issue with delimiters in the data.
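
As a rough sketch of what fixing the delimiters could buy, assume the data were cleaned so that "+" is the only separator between combined entries and spaces occur only inside an entry. The sample values and the names `names`, `okNames` and `match` below are hypothetical, chosen to mirror the example in the question.

```matlab
% Hypothetical cleaned data: "+" is the only separator between entries.
names   = ["Green"; "Black Blue"; "Black"; "Blue + Green"; ...
           "Red + White"; "Black Blue + Red"];
okNames = ["Red"; "Green"; "Blue"; "Black Blue"];

match = false(numel(names), 1);
for i = 1:numel(names)
    % Split on "+" only, so spaces inside an entry are preserved,
    % then trim the padding around each part.
    parts = strtrim(split(names(i), "+"));
    % The row matches only if every part is an entry in okNames.
    match(i) = all(ismember(parts, okNames));
end
```

With an unambiguous separator, each row needs only one split and one `ismember` call, and multi-word entries such as "Black Blue" stay intact.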


Answers (1)

Umar 2024-6-18
Hi Marcus,

To achieve this, you can combine string-splitting functions with logical comparisons in MATLAB. Here's a step-by-step approach:

1. Iterate through each row of `TT.name`.
2. For each row, split the string into individual words at spaces or the "+" symbol.
3. Check whether each individual word exists as an entry in `OK.name`.
4. If all words in the split string are found in `OK.name`, consider the row a match.
5. Update the `TT.match` column accordingly.

Here's some MATLAB code that implements this logic:

```matlab
TT.match = false(size(TT, 1), 1);
for i = 1:size(TT, 1)
    % Split at spaces and "+"; strsplit collapses repeated delimiters by default
    words = strsplit(TT.name{i}, {' ', '+'});
    % Count how many of the words are entries in OK.name
    match_count = sum(ismember(words, OK.name));
    if match_count == numel(words)
        TT.match(i) = true;
    end
end
```

This marks combinations such as "Blue Green" and "Red + Blue" as matches automatically, without manually changing rows. Note, however, that it splits every row at spaces, so a multi-word entry in `OK.name` such as "Black Blue" is broken into "Black" and "Blue". For the example data, rows 4 and 10 ("Black Blue" and "Black Blue Red") therefore come out false rather than true, which means the code handles only combinations of single-word entries.
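
For the multi-word entries in `OK.name`, one possible variation is sketched below. It is not the method from the answer above: it swaps in a word-break style check that asks whether the words of each row can be covered, left to right, by whole entries of `OK.name`. It assumes entries in `OK.name` use single spaces between words and that matching is case-sensitive; the names `words` and `reachable` are illustrative.

```matlab
TT = table(["Green"; "Red"; "Blue"; "Black Blue"; "Black"; "Blue Green"; ...
            "Red + Blue"; "Red Orange"; "Red + White"; "Black Blue Red"], ...
           'VariableNames', {'name'});
OK = table(["Red"; "Green"; "Blue"; "Black Blue"], 'VariableNames', {'name'});

TT.match = false(height(TT), 1);
for i = 1:height(TT)
    % Treat "+" like a space, then split into words and drop empty pieces.
    words = split(strtrim(replace(TT.name(i), "+", " ")));
    words = words(strlength(words) > 0);
    n = numel(words);
    % reachable(k+1) is true when the first k words can be covered
    % exactly by a sequence of whole OK.name entries.
    reachable = false(1, n + 1);
    reachable(1) = true;
    for stop = 1:n
        for start = 1:stop
            candidate = join(words(start:stop), " ");
            if reachable(start) && ismember(candidate, OK.name)
                reachable(stop + 1) = true;
                break
            end
        end
    end
    TT.match(i) = n > 0 && reachable(n + 1);
end
```

On the example data this reproduces the `match` column requested in the question. Each row costs on the order of n^2 lookups for n words rather than an enumeration of permutations of OK, but it only answers whether some split into OK entries exists; it does not say which split was intended when several are possible.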
  9 comments
DGM 2024-6-22
Edited: DGM 2024-6-22
It's okay. You're still free to think of me as a jerk. I mean, it's fair. Just please try to work on the formatting and stuff.
FWIW, if you don't have MATLAB, I'm pretty sure you can use MATLAB Online for free for something like 20 hours a month. It doesn't have as many toolboxes installed as the forum editor, but it does allow the use of certain things (interactive tools) that the forum editor can't use.

Release

R2023b