textAnalytics toolbox: removing Entity details from documents

Question

0 votes

I have a very large set of documents that I am preprocessing for use in a BERT classification model.

I have tokenized the documents and added the entity details.

Now I want to remove all of the tokens within the documents that have been tagged as "organization".

I have the following variables:

documents: tokenized documents

tdetails: a table of tokens with the document number, sentence number, line number, Type, Language, PartOfSpeech and Entity.

Token                   Document  Sentence  Line  Type       Language  PartOfSpeech   Entity
"Astoria"               1         2         3     'letters'  'en'      'proper-noun'  'person'
"Federal Savings Bank"  1         2         3     'other'    'en'      'proper-noun'  'organization'
"settled"               1         2         3     'letters'  'en'      'verb'         'non-entity'

How do I remove all of the tokens in the variable documents whose Entity is organization?

E.g. in documents(1,1).Vocabulary(7) I can find "Federal Savings Bank", which is in row 7 of the example above. I could loop through all of the documents and the rows of tdetails where Entity is organization, but that would take quite a while.

I can't seem to figure out how to do this more simply.

Answer 1

Cris LaPierre 2023-11-18

2 votes

I would use removeWords.

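documents = tokenizedDocument(Text(:));
documents = addEntityDetails(documents);   % needed so tokenDetails returns an Entity column
tdetails = tokenDetails(documents);
% Take the Token column for the rows tagged as organization (spelled as in the table above)
orgTokens = tdetails.Token(tdetails.Entity == "organization");
documents2 = removeWords(documents, orgTokens);   % removes these tokens wherever they occur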

david cowan 2023-11-19

Moved: Cris LaPierre 2023-11-19

Really appreciate that.

removeWords !!

I'll not forget that now - I knew there had to be a simple approach I was just missing
