Text Analytics Toolbox: removing entity details from documents

Question

david cowan 2023-11-18

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/2048942-textanalytics-toolbox-removing-entity-details-from-documents

Moved: Cris LaPierre 2023-11-19

Accepted answer: Cris LaPierre

I have a very large set of documents that I am preprocessing for use in a BERT classification model.

I have tokenized the documents and added the entity details.

Now I want to remove all of the tokens within the documents that have been tagged as "organization".

I have the following variables:

documents: tokenized documents

tdetails: a table of tokens with the document number, sentence number, line number, Type, Language, PartOfSpeech and Entity.

Token                   DocumentNumber  SentenceNumber  LineNumber  Type       Language  PartOfSpeech   Entity
"Astoria"               1               2               3           'letters'  'en'      'proper-noun'  'person'
"Federal Savings Bank"  1               2               3           'other'    'en'      'proper-noun'  'organization'
"settled"               1               2               3           'letters'  'en'      'verb'         'non-entity'

How do I remove all of the tokens in the variable documents whose Entity is "organization"?

For example, in documents(1,1).Vocabulary(7) I can find "Federal Savings Bank", which is in row 7 of the example above. I could loop through all of the documents and the rows of tdetails where Entity is "organization", but that would take quite a while.

I can't seem to figure out how to do this more simply.


Answer 1

Cris LaPierre 2023-11-18

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2048942-textanalytics-toolbox-removing-entity-details-from-documents#answer_1355632


I would use removeWords.

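documents = tokenizedDocument(Text(:));
documents = addEntityDetails(documents);   % skip if entity details were already added
tdetails = tokenDetails(documents);
% Entity tags are spelled "organization"; select the matching Token entries with a logical mask
orgTokens = unique(tdetails.Token(tdetails.Entity == "organization"));
documents2 = removeWords(documents, orgTokens);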
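
One thing to keep in mind: removeWords matches on the token text, so a word tagged as an organization in one place is removed everywhere it occurs, even where it carries a different tag. If only the tagged occurrences should go, a possible alternative is to filter the rows of tdetails and rebuild the documents from the remaining tokens. The following is a minimal sketch, not a tested solution for large data. The sample text is made up, and it assumes that tokenDetails returns its rows grouped by document in token order and that tokenizedDocument accepts an empty string array for a document that ends up with no tokens.

```matlab
% Hypothetical sample text, for illustration only.
txt = ["Astoria Federal Savings Bank settled the case."
       "The bank in Astoria reopened on Monday."];
documents = tokenizedDocument(txt);
documents = addEntityDetails(documents);
tdetails = tokenDetails(documents);

% Keep only the rows whose entity tag is not "organization".
keep = tdetails.Entity ~= "organization";
kept = tdetails(keep, :);

% Assumption: rows are ordered by document, so consecutive runs of rows
% belong to the same document. Stop early if that does not hold.
assert(issorted(kept.DocumentNumber), 'Rows are not grouped by document.');

% Count surviving tokens per document, then split the Token column into
% one cell per document (a zero count gives an empty string array).
numDocs = numel(documents);
counts = accumarray(kept.DocumentNumber, 1, [numDocs 1]);
words = mat2cell(kept.Token, counts, 1);
words = cellfun(@(w) w.', words, 'UniformOutput', false);  % row vectors

% Rebuild the documents from the pre-split tokens without re-tokenizing.
documents2 = tokenizedDocument(words, 'TokenizeMethod', 'none');
```

The rebuilt documents no longer carry the part-of-speech or entity details, so those would need to be added again if later steps rely on them. If the classifier expects plain strings, joinWords(documents2) should turn each document back into a single string.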