After tokenizing and manipulating a document, how do you put the results back together into non-tokenized, human-readable form?
The tokenization process adds spaces and breaks up text elements, and I have not yet found a straightforward way to get back to usable text. Is there a function or method for doing this? Here is an example:
textData = "Jim and Suzie wanted Jimmy’s to have as few 'complex' ingredients as possible (less than the seventeen they had seen some brands use).";
d = tokenizedDocument(textData);
ans =
"Jim and Suzie wanted Jimmy’s to have as few ' complex ' ingredients as possible ( less than the seventeen they had seen some brands use ) ."
Note that spaces have been added between tokens.
I have tried many different MATLAB functions to get usable, readable text back, without success.
It would be great to have the answer, which I hope is simple, and to have it added to the help for tokenizedDocument.
Also, it would be great if examples such as the one for correcting spelling yielded a complete, corrected document rather than a tokenized document, as it isn't obvious how to get the usable, original-form document back from the tokenized form.
Thank you
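
For anyone landing on the same problem, one possible workaround is to post-process the space-joined text with regular expressions. This is only a minimal sketch, not a built-in detokenizer: it assumes the output above came from something like joinWords(d), it uses only base MATLAB regexprep, the variable names joined and readable are made up for illustration, and the three rules cover just the punctuation in this example (closing punctuation, opening brackets, and paired straight quotes).

```matlab
% Space-joined token text copied from the question
% (assumed to be what joinWords(d) returns for the example document).
joined = "Jim and Suzie wanted Jimmy’s to have as few ' complex ' ingredients as possible ( less than the seventeen they had seen some brands use ) .";

% Rule 1: drop whitespace before closing punctuation such as ) ] . , ; : ! ?
readable = regexprep(joined, "\s+([.,;:!?)\]])", "$1");

% Rule 2: drop whitespace after opening brackets ( and [
readable = regexprep(readable, "([(\[])\s+", "$1");

% Rule 3: tighten paired straight single quotes, e.g. ' complex ' -> 'complex'
% (heuristic: pairs quotes left to right, so unbalanced quotes or
% apostrophes split into separate tokens may be handled incorrectly)
readable = regexprep(readable, "'\s+(.*?)\s+'", "'$1'");

disp(readable)
```

The rules are heuristics, so text with unusual punctuation, contractions split into separate tokens, or unbalanced quotes would likely need additional patterns. Whitespace that tokenization discarded, such as line breaks or double spaces, cannot be recovered this way.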