replace

Replace substrings in documents

Description

newDocuments = replace(documents,old,new) replaces all occurrences of the substring or pattern old in documents with new.

Tip

Use the replace function to replace substrings within the words of documents by specifying substrings or patterns. To replace entire words or n-grams instead, use the replaceWords and replaceNgrams functions, respectively.
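The difference between substring replacement and whole-word replacement can be sketched as follows. This is a minimal illustration, not part of the original page: the sample sentence and the variable names bySubstring and byWord are made up for the example, and the comments describe the behavior one would expect from the descriptions above.

```matlab
% Hypothetical sample document: "cat" appears as a whole word
% and also inside the longer word "concatenates".
documents = tokenizedDocument("the cat concatenates");

% replace works on substrings, so it is expected to change
% every occurrence of "cat", including the one inside "concatenates".
bySubstring = replace(documents,"cat","dog");

% replaceWords works on entire words, so it is expected to change
% only the token that is exactly "cat".
byWord = replaceWords(documents,"cat","dog");

% Join the tokens back into strings to compare the two results.
substringResult = joinWords(bySubstring)
wordResult = joinWords(byWord)
```

If a replacement should never touch the inside of longer words, replaceWords is probably the safer choice.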

Examples

Replace words in a document array. Here, every occurrence of "example" is replaced with "sentence".

documents = tokenizedDocument([
    "an extreme example"
    "another extreme example"])
documents = 
  2x1 tokenizedDocument:

    3 tokens: an extreme example
    3 tokens: another extreme example

newDocuments = replace(documents,"example","sentence")
newDocuments = 
  2x1 tokenizedDocument:

    3 tokens: an extreme sentence
    3 tokens: another extreme sentence

Replace substrings of the words.

newDocuments = replace(documents,"ex","X-")
newDocuments = 
  2x1 tokenizedDocument:

    3 tokens: an X-treme X-ample
    3 tokens: another X-treme X-ample

Replace digits in documents using a digits pattern.

Create an array of tokenized documents.

textData = [
    "Text Analytics Toolbox provides over 50 functions to analyze text data."
    "The bm25Similarity function measures document similarity."];
documents = tokenizedDocument(textData);

Replace instances of consecutive digits with the token "<NUMBER>" using the replace function. Specify a digits pattern using the digitsPattern function.

pat = digitsPattern;
newDocuments = replace(documents,pat,"<NUMBER>")
newDocuments = 
  2x1 tokenizedDocument:

    12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
     7 tokens: The bm<NUMBER>Similarity function measures document similarity .

Notice that the function replaces the digits in the token "bm25Similarity".

To replace tokens consisting entirely of digits, use the replace function and specify a pattern that also includes text boundaries. Specify text boundaries using the textBoundary function.

pat = textBoundary + digitsPattern + textBoundary;
newDocuments = replace(documents,pat,"<NUMBER>")
newDocuments = 
  2x1 tokenizedDocument:

    12 tokens: Text Analytics Toolbox provides over <NUMBER> functions to analyze text data .
     7 tokens: The bm25Similarity function measures document similarity .

In this case, the function does not replace the digits in the token "bm25Similarity".
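The distinction between the two patterns can be illustrated on plain strings. The sketch below is an analogy only and is not how replace is implemented: it treats a few tokens as a string array (the variable names tokens, hasDigits, and allDigits are hypothetical) and uses the base MATLAB functions contains and matches, which accept pattern objects in R2020b and later.

```matlab
% Hypothetical tokens taken from the documents above.
tokens = ["50" "bm25Similarity" "text"];

% contains is true when digits appear anywhere in the token,
% which corresponds to the plain digitsPattern case.
hasDigits = contains(tokens,digitsPattern)

% matches is true only when the whole token is digits, which
% corresponds to wrapping the pattern in text boundaries.
allDigits = matches(tokens,digitsPattern)
```

Only "50" satisfies both checks, which is why the boundary-anchored pattern leaves "bm25Similarity" unchanged.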

Input Arguments

documents: Input documents, specified as a tokenizedDocument array.

old: Substring or pattern to replace, specified as one of the following:

  • String array

  • Character vector

  • Cell array of character vectors

  • pattern array

new: New substring, specified as a string array, character vector, or cell array of character vectors.

Data Types: string | char | cell
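Because old and new can both be string arrays, several substitutions can be requested in one call. The sketch below assumes that replace on tokenizedDocument arrays pairs the elements of old and new by position, as the base MATLAB replace function does for strings; the page above does not state this explicitly, so treat the pairing as an assumption. The variable names oldSubstrings and newSubstrings are hypothetical.

```matlab
documents = tokenizedDocument([
    "an extreme example"
    "another extreme example"]);

% ASSUMPTION: oldSubstrings(k) is replaced by newSubstrings(k),
% mirroring the behavior of base MATLAB replace on strings.
oldSubstrings = ["extreme" "example"];
newSubstrings = ["simple" "sentence"];
newDocuments = replace(documents,oldSubstrings,newSubstrings);

% Join the tokens back into strings to inspect the result.
joined = joinWords(newDocuments)
```

The substrings here do not overlap, so the order in which the pairs are applied should not matter.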

Output Arguments

newDocuments: Output documents, returned as a tokenizedDocument array.

Version History

Introduced in R2017b