replaceNgrams

Replace n-grams in documents

Description

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams) updates the specified documents by replacing each n-gram in oldNgrams with the corresponding n-gram in newNgrams. By default, the function is case sensitive.

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams,'IgnoreCase',true) replaces the n-grams oldNgrams ignoring case.

Examples

Use the replaceNgrams function to replace abbreviations with their corresponding expanded forms.

Create an array of tokenized documents.

str = [ ...
    "Currently in Cambridge, MA."
    "Next stop, NY!"];
documents = tokenizedDocument(str)
documents = 
  2x1 tokenizedDocument:

    6 tokens: Currently in Cambridge , MA .
    5 tokens: Next stop , NY !

Replace the tokens "MA" and "NY" with "Massachusetts" and ["New" "York"], respectively. If the n-grams have different lengths, you must pad the shorter rows with the empty string "". In this case, you must pad "Massachusetts" with a single empty string "".

oldNgrams = [
    "MA"
    "NY"];
newNgrams = [
    "Massachusetts" ""
    "New" "York"];
documents = replaceNgrams(documents,oldNgrams,newNgrams)
documents = 
  2x1 tokenizedDocument:

    6 tokens: Currently in Cambridge , Massachusetts .
    6 tokens: Next stop , New York !
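
The 'IgnoreCase' option from the Description can be illustrated with a short sketch. The input text below is made up for illustration; it assumes the abbreviation appears in lowercase in the document while oldNgrams is given in uppercase.

```matlab
% Hypothetical input: the abbreviation is lowercase in the text.
documents = tokenizedDocument("Next stop, ny!");

% With the default case-sensitive matching, "NY" would not match "ny".
% Setting 'IgnoreCase' to true lets the match succeed regardless of case.
newDocuments = replaceNgrams(documents,"NY",["New" "York"],'IgnoreCase',true);

% Join the tokens with spaces to inspect the result as a string.
joinWords(newDocuments)
```

Here oldNgrams is a single unigram and newNgrams is a single row, so no padding is needed.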

Input Arguments

documents

Input documents, specified as a tokenizedDocument array.

oldNgrams

N-grams to replace, specified as a string array, character vector, or a cell array of character vectors.

If oldNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams, and maxN is the length of the largest n-gram. If oldNgrams is a character vector, then it represents a single word (unigram).

The value of oldNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of oldNgrams must be padded with the empty string "".

For example, to specify both the unigram "Massachusetts", and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".

Data Types: string | char | cell
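
As a rough sketch of the padding rule for oldNgrams, the following replaces a unigram and a bigram with abbreviations. The input text is made up for illustration. Because both replacements are unigrams, newNgrams here is a 2-by-1 array and needs no padding.

```matlab
% Hypothetical input documents containing the expanded forms.
documents = tokenizedDocument([ ...
    "Currently in Cambridge, Massachusetts."
    "Next stop, New York!"]);

% oldNgrams is 2-by-2: the unigram row is padded with "" to match
% the length of the bigram row.
oldNgrams = [
    "Massachusetts" ""
    "New" "York"];

% newNgrams is 2-by-1: one replacement unigram per row of oldNgrams.
newNgrams = [
    "MA"
    "NY"];

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams);

% Join the tokens with spaces to inspect the result as strings.
joinWords(newDocuments)
```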

newNgrams

New n-grams, specified as a string array, character vector, or a cell array of character vectors.

If newNgrams is a string array or cell array, then it has size NumNgrams-by-maxN, where NumNgrams is the number of n-grams, and maxN is the length of the largest n-gram. If newNgrams is a character vector, then it represents a single word (unigram).

The value of newNgrams(i,j) is the jth word of the ith n-gram. If the number of words in the ith n-gram is less than maxN, then the remaining entries of the ith row of newNgrams must be padded with the empty string "".

newNgrams must have one row, or the same number of rows as oldNgrams.

For example, to specify both the unigram "Massachusetts", and the bigram ["New" "York"], specify the 2-by-2 string array ["Massachusetts" ""; "New" "York"], where "Massachusetts" is padded with a single empty string "".

Data Types: string | char | cell
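
The one-row case can be sketched as follows, assuming that a single-row newNgrams is applied to every row of oldNgrams. The input text and abbreviations are made up for illustration.

```matlab
% Hypothetical input documents using two different abbreviations.
documents = tokenizedDocument([ ...
    "Next stop, NY!"
    "Welcome to NYC."]);

% Two n-grams to replace, one per row.
oldNgrams = [
    "NY"
    "NYC"];

% A single row: the same bigram is used as the replacement for
% every n-gram in oldNgrams.
newNgrams = ["New" "York"];

newDocuments = replaceNgrams(documents,oldNgrams,newNgrams);

% Join the tokens with spaces to inspect the result as strings.
joinWords(newDocuments)
```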

Output Arguments

newDocuments

Output documents, returned as a tokenizedDocument array.

Version History

Introduced in R2019a