How to extract a list of unique words from a set of one-row strings

Question

Harrison 2024-11-14

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/2166149-how-to-extract-a-list-of-unique-words-from-a-set-of-one-row-strings

Commented: Harrison 2024-11-15

Basically I have a set of 11 strings of words, and each string has no repeating words, but I need a list of every unique word in all 11 strings.

I've found that this works for one string at a time, but I can't get a list for all 11 strings this way.

A{1} = updatedDocuments(1,1)

B{1} = strjoin(unique(strtrim(strsplit(A{1}, ',')))', '')

Is it possible to index A{1} as updatedDocuments(1:11,1) or do something similar?

Answer 1 (accepted)

Madheswaran 2024-11-14

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2166149-how-to-extract-a-list-of-unique-words-from-a-set-of-one-row-strings#answer_1545194

Edited: Madheswaran 2024-11-15

Hi @Harrison,

I am assuming the following:

'updatedDocuments' is an array of 'tokenizedDocument' objects
Each document contains comma-separated text that doesn't end with a comma

To get the unique words from the entire set of strings, you can use the following approach:

% Remove commas from the documents if you don't want a comma to be
% included in 'uniqueWords'
updatedDocuments = removeWords(updatedDocuments, ",");
uniqueWords = updatedDocuments.Vocabulary;

If 'updatedDocuments' is a cell array of character vectors, you can use the following approach:

allWords = strjoin(updatedDocuments(1:11,1), ','); % Join all documents into one comma-separated char vector
allWords = strtrim(strsplit(allWords, ',')); % Split with comma as delimiter and trim whitespace
uniqueWords = unique(allWords); % Unique words (1 x n cell, where n is the number of unique words)
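
As a quick check of the cell-array approach, here is a minimal sketch on made-up data. The three sample documents are hypothetical stand-ins for the asker's 11 rows, and the final filtering step is an extra safeguard that is not part of the original answer.

```matlab
% Hypothetical sample data standing in for the asker's 11 documents
updatedDocuments = {'apple, banana, cherry'; 'banana, date'; 'cherry, elderberry, apple'};

% Join every document with a comma (transposed so strjoin receives a 1-by-n cell)
allWords = strjoin(updatedDocuments(:, 1).', ',');

% Split at commas, trim surrounding whitespace, and deduplicate
allWords = strtrim(strsplit(allWords, ','));
uniqueWords = unique(allWords);   % sorted alphabetically by default

% Drop empty entries, which can appear if a document ends with a comma
uniqueWords = uniqueWords(~cellfun(@isempty, uniqueWords));
disp(uniqueWords)
```

Indexing with (:, 1) instead of (1:11, 1) keeps the sketch independent of the number of documents.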

For more information, refer to the documentation for the functions used above.

Hope this helps!

Madheswaran 2024-11-15

That is because I assumed 'updatedDocuments' was a cell array of character vectors. If 'updatedDocuments' is an array of 'tokenizedDocument', resolving this issue is straightforward. I have updated the answer to include a solution for the 'tokenizedDocument' case, in addition to the existing explanation.

Let me know if that helps!

Harrison 2024-11-15

That's exactly right! Thank you!!

Answer 2

Paul 2024-11-14

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2166149-how-to-extract-a-list-of-unique-words-from-a-set-of-one-row-strings#answer_1544974

If UpdatedDocuments is a 1D cell array of chars ...

UpdatedDocuments{1} = 'one,two,three,one';
UpdatedDocuments{2} = 'one,two,three,two';
UpdatedDocuments{3} = 'one,two,three,three';
result = cellfun(@(S) strjoin(unique(strtrim(strsplit(S, ','))), ','), UpdatedDocuments, 'UniformOutput', false)
result = 1x3 cell array
    {'one,three,two'}    {'one,three,two'}    {'one,three,two'}
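
The cellfun call above deduplicates within each document. To get one list across all documents, as the question asks, the per-document splits can be concatenated before calling unique. The following is a minimal sketch under that assumption; the sample strings and the 'stable' option (which keeps first-appearance order instead of sorting) are illustrative choices, not part of the original answer.

```matlab
% Hypothetical sample data: a 1D cell array of comma-separated char vectors
UpdatedDocuments = {'one,two,three', 'two,four', ' five , one'};

% Split each document at commas and trim whitespace; each cell holds a 1-by-k cell row
perDoc = cellfun(@(S) strtrim(strsplit(S, ',')), UpdatedDocuments, 'UniformOutput', false);

% Concatenate all per-document word lists into one 1-by-n cell array
allWords = [perDoc{:}];

% Keep one copy of each word, in order of first appearance
uniqueWords = unique(allWords, 'stable');
disp(uniqueWords)
```

Omitting 'stable' returns the same words sorted alphabetically.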

Paul 2024-11-15

The Vocabulary property of tokenizedDocument returns the unique words in the array

documents = tokenizedDocument([
    "an example of a short sentence  an example of a short sentence " 
    "a second short sentence a second short sentence"]);
documents
documents = 
  2x1 tokenizedDocument:

    12 tokens: an example of a short sentence an example of a short sentence
     8 tokens: a second short sentence a second short sentence
documents.Vocabulary
ans = 1x7 string array
    "an"    "example"    "of"    "a"    "short"    "sentence"    "second"
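
If tokenizedDocument is not available, a similar word list can be sketched with base string functions. This is only an approximation under stated assumptions: it splits on whitespace and does none of the tokenization that tokenizedDocument performs, and the variable names (documents, vocabulary) and sample sentences are hypothetical.

```matlab
% Hypothetical sample data: a string array with one document per row
documents = ["an example of a short sentence"; "a second short sentence"];

% Join all documents into one string scalar, separated by spaces
joined = join(documents, " ");

% Split at whitespace and discard any empty pieces
words = split(joined);
words = words(strlength(words) > 0);

% Unique words in order of first appearance, as a row vector
vocabulary = unique(words, 'stable').';
disp(vocabulary)
```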
