detokenizedDocument: How to turn tokenized text back into human-readable, non-tokenized text?

Question

CdC 2022-8-22

Direct link to this question: https://ww2.mathworks.cn/matlabcentral/answers/1783625-detokenizeddocument-how-to-turn-tokenized-text-back-into-human-readable-non-tokenized-text

Answered: Udit06 2024-9-23

After tokenizing and manipulating a document, how do you put the results back together into non-tokenized, human-readable form?

Tokenization breaks the text into separate elements, and joining them back inserts a space between every token. I have not found a straightforward way to get back to usable text. Is there a function or method for doing this? Here is an example:

textData = "Jim and Suzie wanted Jimmy’s to have as few 'complex' ingredients as possible (less than the seventeen they had seen some brands use).";

d = tokenizedDocument(textData);

join(string(d))

ans =

"Jim and Suzie wanted Jimmy’s to have as few ' complex ' ingredients as possible ( less than the seventeen they had seen some brands use ) ."

Note the spaces that have been added around the quotation marks, the parentheses, and the final period.

I have tried many different MATLAB functions to get usable, readable text back, without success.

It would be great to have an answer, which I hope is simple, and to have it added to the help for tokenizedDocument.

It would also be great if examples such as the one on correcting spelling:

https://www.mathworks.com/help/textanalytics/ug/correct-spelling-in-documents.html

yielded a complete, corrected document rather than a tokenized one, since it isn't obvious how to get from the tokenized form back to a usable document in its original form.

Thank you

1 Comment

CdC 2022-8-26

Edited: CdC 2022-8-26

Can someone at MathWorks please reply?

This is a fundamental requirement for using the Text Analytics Toolbox. For example, being able to correct spelling (or use many of the other analytics functions) on text, but then not being able to put the result back into a usable text form, does not accomplish anything useful.

I'm an experienced MATLAB user, and I've spent hours and hours trying to figure out a way to do this, thus far with no success. This really needs to be resolved. Thank you for your help.


Answer 1

Udit06 2024-9-23

Direct link to this answer: https://ww2.mathworks.cn/matlabcentral/answers/1783625-detokenizeddocument-how-to-turn-tokenized-text-back-into-human-readable-non-tokenized-text#answer_1521190

Hi,

As far as I know, there is no function that de-tokenizes a tokenized document. However, you can post-process the text obtained by joining the tokens to remove the extra whitespace inside quotes and parentheses, as shown below:

% Original text
textData = "Jim and Suzie wanted Jimmy’s to have as few 'complex' ingredients as possible (less than the seventeen they had seen some brands use).";
% Tokenize the text
d = tokenizedDocument(textData);
% Join tokens into a single string
joinedText = joinWords(d);
% Use regex to clean up spaces
% Remove spaces inside quotes
cleanedText = regexprep(joinedText, '''\s+(\w+)\s+''', '''$1''');
% Remove spaces after opening and before closing parentheses
cleanedText = regexprep(cleanedText, '\(\s+', '(');
cleanedText = regexprep(cleanedText, '\s+\)', ')');
disp(cleanedText);

Jim and Suzie wanted Jimmy’s to have as few 'complex' ingredients as possible (less than the seventeen they had seen some brands use) .

Note that the space before the final period is still there. You may need to look for other edge cases like this one and handle them in a similar way.

I hope this helps.
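
One way to cover the edge cases mentioned above, including the space left before the final period, is to use a few more general patterns. The following is a minimal sketch, not a toolbox function. It starts from the joined string shown in the question so that it runs with base MATLAB only, and the punctuation sets in the patterns are assumptions chosen for this example rather than a complete list.

```matlab
% Joined output copied from the question (tokens separated by spaces)
joined = "Jim and Suzie wanted Jimmy’s to have as few ' complex ' ingredients as possible ( less than the seventeen they had seen some brands use ) .";

cleaned = joined;
% Remove whitespace before closing punctuation: . , ; : ! ? ) ]
cleaned = regexprep(cleaned, '\s+([.,;:!?\)\]])', '$1');
% Remove whitespace after opening brackets: ( [
cleaned = regexprep(cleaned, '([\(\[])\s+', '$1');
% Close up a straight-single-quoted span: ' some words ' -> 'some words'
cleaned = regexprep(cleaned, '''\s+([^'']*?)\s+''', '''$1''');

disp(cleaned)
```

For this input the result should match the original textData from the question, with no space before the final period. The quote pattern assumes that straight single quotes occur in pairs; a lone apostrophe tokenized on its own could be paired with the wrong quote, so it is worth checking the result on your own text.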
