Text Analytics Toolbox: removing Entity details from documents
I have a very large set of documents that I am preprocessing for use in a BERT classification model.
I have tokenized the documents and added the entity details.
Now I want to remove all of the tokens within the documents that have been tagged as "organization".
I have the following variables:
documents: tokenized documents
tdetails: a table of tokens with the document number, sentence number, line number, Type, Language, PartOfSpeech and Entity.
Token                   DocumentNumber  SentenceNumber  LineNumber  Type       Language  PartOfSpeech   Entity
"Astoria"               1               2               3           'letters'  'en'      'proper-noun'  'person'
"Federal Savings Bank"  1               2               3           'other'    'en'      'proper-noun'  'organization'
"settled"               1               2               3           'letters'  'en'      'verb'         'non-entity'
How do I remove all of the tokens in the variable documents for which Entity == "organization"?
E.g., in documents(1,1).Vocabulary(7) I can find "Federal Savings Bank", which is in row 7 of the example above. I could loop through all of the documents and check tdetails for organization entries, but that would take quite a while.
I can't seem to figure out how to do this more simply.
Accepted Answer
Cris LaPierre
2023-11-18
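% Tokenize the raw text (Text holds the document strings).
documents = tokenizedDocument(Text(:));
% Table with one row per token, including the Entity column.
tdetails = tokenDetails(documents);
% Remove every token tagged as an organization (note the spelling "organization").
documents2 = removeWords(documents,tdetails.Token(tdetails.Entity=="organization"));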
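For anyone who wants to try this without the original data, here is a minimal end-to-end sketch of the same idea. The sample sentences in Text are made up for illustration, and the explicit addEntityDetails call is an assumption about how the entity tags were added in the question. Which tokens the model actually labels as "organization" depends on the toolbox's entity recognition, so no particular output is claimed.

```matlab
% Hypothetical sample text standing in for the real documents.
Text = ["Astoria Federal Savings Bank settled the case."; ...
        "The bank announced the settlement on Monday."];

% Tokenize and add entity tags (requires Text Analytics Toolbox).
documents = tokenizedDocument(Text);
documents = addEntityDetails(documents);

% One row per token; Entity is a categorical column.
tdetails = tokenDetails(documents);

% Logical mask over all tokens at once, so no loop over documents is needed.
isOrg = tdetails.Entity == "organization";

% Deduplicate so each word is passed to removeWords only once.
orgWords = unique(tdetails.Token(isOrg));

documents2 = removeWords(documents, orgWords);
```

One thing to keep in mind: removeWords deletes every occurrence of the listed words in all documents, not only the occurrences that were tagged. If a word is tagged as an organization in one place and as something else in another, both occurrences are removed.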