How to extract a list of unique words from a set of one-row strings

Basically I have a set of 11 strings of words, and each string has no repeating words, but I need a list of every unique word in all 11 strings.
I've found that this works for one string at a time, but I can't get a list for all 11 strings this way.
A{1} = updatedDocuments(1,1)
B{1} = strjoin(unique(strtrim(strsplit(A{1}, ',')))', '')
Is it possible to index A{1} as updatedDocuments(1:11,1) or do something similar?

Accepted Answer

Madheswaran 2024-11-14, 9:32
Edited: Madheswaran 2024-11-15, 5:17
I am assuming the following:
  • 'updatedDocuments' is an array of 'tokenizedDocument'
  • Each document contains text that is comma-separated and doesn't end with a comma
To get the unique words from the entire set of strings, you can follow the below approach:
% Remove commas from the documents if you don't want them to be
% included in 'uniqueWords'
updatedDocuments = removeWords(updatedDocuments, ",");
uniqueWords = updatedDocuments.Vocabulary;
If 'updatedDocuments' is a cell array of character vectors, you can follow the below approach:
allWords = strjoin(updatedDocuments(1:11,1), ','); % Join all documents into one comma-separated char vector
allWords = strtrim(strsplit(allWords, ',')); % Split with comma as delimiter and trim whitespace
allWords = allWords(~cellfun('isempty', allWords)); % Drop empty entries (e.g. from a stray trailing comma)
uniqueWords = unique(allWords); % unique words (1 x n cell where n is the number of unique words)
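As a self-contained illustration of the cell-array case, here is a minimal sketch on made-up data. The variable 'docs' and its contents are hypothetical stand-ins for 'updatedDocuments' (three documents instead of 11), and it assumes each cell holds a comma-separated char vector:

```matlab
% Hypothetical sample data standing in for updatedDocuments
docs = {'apple, banana, cherry'; 'banana, date'; 'cherry, apple, elderberry'};

% Split each document on commas and trim whitespace;
% each cell of 'parts' is a 1-by-k cell array of words
parts = cellfun(@(s) strtrim(strsplit(s, ',')), docs, 'UniformOutput', false);

% Concatenate the per-document word lists into one row, then deduplicate
allWords = [parts{:}];
uniqueWords = unique(allWords);   % sorted alphabetically by default
```

If the order of first appearance matters more than alphabetical order, unique(allWords, 'stable') keeps it.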
For more information, refer to the following documentation:
  1. https://mathworks.com/help/textanalytics/ref/tokenizeddocument.html
  2. https://mathworks.com/help/matlab/ref/double.unique.html
Hope this helps!
  3 Comments
Madheswaran 2024-11-15, 5:18
That is because I assumed 'updatedDocuments' to be a cell array of character vectors. If 'updatedDocuments' is instead an array of 'tokenizedDocument', the fix is straightforward. I have updated the answer to include a solution for the 'tokenizedDocument' case, alongside the existing cell-array explanation.
Let me know if that helps!


More Answers (1)

Paul 2024-11-14, 1:09
If UpdatedDocuments is a 1D cell array of chars ...
UpdatedDocuments{1} = 'one,two,three,one';
UpdatedDocuments{2} = 'one,two,three,two';
UpdatedDocuments{3} = 'one,two,three,three';
result = cellfun(@(S) strjoin(unique(strtrim(strsplit(S, ','))),','),UpdatedDocuments,'UniformOutput',false)
result = 1x3 cell array
{'one,three,two'} {'one,three,two'} {'one,three,two'}
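Note that the result above removes duplicates within each string separately. If the goal is a single list of unique words across all strings, one possible sketch (assuming the same 1D cell array of chars, and a MATLAB release that has the string functions join and split) is to pool everything first:

```matlab
% Same sample data as above
UpdatedDocuments = {'one,two,three,one', 'one,two,three,two', 'one,two,three,three'};

% Convert to a string array and join all documents into one comma-separated string
joined = join(string(UpdatedDocuments), ",");

% Split on commas, trim whitespace, and keep one copy of each word
pooledWords = unique(strtrim(split(joined, ",")));
```

Here pooledWords is a column string array; cellstr(pooledWords) converts it back to a cell array of chars if needed.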
  1 Comment
Paul 2024-11-15, 1:06
The Vocabulary property of tokenizedDocument returns the unique words in the array
documents = tokenizedDocument([
"an example of a short sentence an example of a short sentence "
"a second short sentence a second short sentence"]);
documents
documents =
  2x1 tokenizedDocument:
    12 tokens: an example of a short sentence an example of a short sentence
     8 tokens: a second short sentence a second short sentence
documents.Vocabulary
ans = 1x7 string array
"an" "example" "of" "a" "short" "sentence" "second"
