
docfun

Apply function to words in documents

Description

newDocuments = docfun(func,documents) calls the function specified by the function handle func once for each element of documents, passing the words of that document as a string vector.

  • If func accepts exactly one input argument, then the words of newDocuments(i) are the output of func(string(documents(i))).

  • If func accepts two input arguments, then the words of newDocuments(i) are the output of func(string(documents(i)),details), where details contains the corresponding token details output by tokenDetails.

  • If func changes the number of words in the document, then docfun removes the token details from that document.
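As a rough sketch of the two-input form, the anonymous function below uses the Type column of the details table to keep only the tokens typed as letters. The sample sentence and the variable name keepLetters are illustrative assumptions. Because the function changes the number of words, the last bullet above applies to the resulting document.

```matlab
% Tokenize one short document that mixes letters, digits, and punctuation.
documents = tokenizedDocument("The 3 cats sat.");

% Two-input function: words is a string vector, details is the table of
% token details for the same document. Keep only tokens of type "letters".
keepLetters = @(words,details) words(details.Type == "letters");

newDocuments = docfun(keepLetters,documents)
```

A function that needs several statements, such as one that edits some words in place and returns the rest unchanged, would be written as a named function in its own file or as a local function instead.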

docfun does not perform the calls to function func in a specific order.


newDocuments = docfun(func,documents1,...,documentsN) calls the function specified by the function handle func and passes elements of documents1,…,documentsN as string vectors of words, where N is the number of inputs to the function func. The words of newDocuments(i) are the output of func(string(documents1(i)),...,string(documentsN(i))).

Each of documents1,…,documentsN must be the same size.
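The per-document behavior described above can be checked with a short loop. This is only a sketch: it reuses the sample text from the Examples section, checks the size requirement explicitly, and then compares each output document with the result of calling plus on the words of the corresponding inputs.

```matlab
documents1 = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);
documents2 = tokenizedDocument([ ...
    "_det _noun _prep _det _adj _noun"
    "_det _adj _adj _noun"]);

% The input arrays must be the same size.
assert(isequal(size(documents1),size(documents2)))

newDocuments = docfun(@plus,documents1,documents2);

% Each output document should contain the result of applying plus to the
% words of the corresponding input documents.
for i = 1:numel(documents1)
    expectedWords = plus(string(documents1(i)),string(documents2(i)));
    assert(isequal(string(newDocuments(i)),expectedWords))
end
```

Note that plus also requires the two word vectors passed in each call to have compatible sizes, which is why the documents in documents2 have the same number of words as those in documents1.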


Examples


Apply reverse to each word in a document array.

documents = tokenizedDocument([ ...
    "an example of a short sentence" 
    "a second short sentence"])
documents = 
  2x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

func = @reverse;
newDocuments = docfun(func,documents)
newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: na elpmaxe fo a trohs ecnetnes
    4 tokens: a dnoces trohs ecnetnes
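When the function to apply needs extra arguments, an anonymous function can bind them so that the handle still takes a single input. The following sketch assumes the same sample documents and uses the string function replace to swap one word for another; the word choices are illustrative only.

```matlab
documents = tokenizedDocument([ ...
    "an example of a short sentence"
    "a second short sentence"]);

% Bind the extra arguments of replace so that func accepts only the words.
func = @(words) replace(words,"short","brief");

newDocuments = docfun(func,documents)
```

This function returns one output word for each input word, so the number of words in each document is unchanged.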

Tag words by combining the words from one document array with another, using the string function plus.

Create the first tokenizedDocument array. Erase the punctuation and convert the text to lowercase.

str = [ ...
    "An example of a short sentence."
    "A second short sentence."];
str = erasePunctuation(str);
str = lower(str);
documents1 = tokenizedDocument(str)
documents1 = 
  2x1 tokenizedDocument:

    6 tokens: an example of a short sentence
    4 tokens: a second short sentence

Create the second tokenizedDocument array. The documents have the same number of words as the corresponding documents in documents1. The words of documents2 are POS tags for the corresponding words.

documents2 = tokenizedDocument([ ...
    "_det _noun _prep _det _adj _noun"
    "_det _adj _adj _noun"])
documents2 = 
  2x1 tokenizedDocument:

    6 tokens: _det _noun _prep _det _adj _noun
    4 tokens: _det _adj _adj _noun

func = @plus;
newDocuments = docfun(func,documents1,documents2)
newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: an_det example_noun of_prep a_det short_adj sentence_noun
    4 tokens: a_det second_adj short_adj sentence_noun

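The output is not the same as calling plus on the documents directly, which appends the words of documents2 to the words of documents1 instead of combining them word by word.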

plus(documents1,documents2)
ans = 
  2x1 tokenizedDocument:

    12 tokens: an example of a short sentence _det _noun _prep _det _adj _noun
     8 tokens: a second short sentence _det _adj _adj _noun

Input Arguments


func

Function to apply to the words in documents, specified as a function handle. The function must accept N string arrays as input, where N is the number of document arrays passed to docfun, and return a string array. That is, func must accept string(documents1(i)),...,string(documentsN(i)) as input. The function must have one of the following syntaxes:

  • newWords = func(words), where words is a string array of the words of a single document.

  • newWords = func(words,details), where words is a string array of the words of a single document, and details is the corresponding table of token details given by tokenDetails.

  • newWords = func(words1,...,wordsN), where words1,...,wordsN are string arrays of words.

Example: @reverse

Data Types: function_handle
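One way to check that a candidate function fits these syntaxes is to call it by hand on the words of a single document before passing it to docfun. The sketch below does this for @reverse; the sample text is an assumption chosen for illustration.

```matlab
documents = tokenizedDocument("an example of a short sentence");
func = @reverse;

% Extract the words of one document as a string vector, as docfun does.
words = string(documents(1));

% Call the function directly and confirm that it returns a string array.
newWords = func(words);
assert(isstring(newWords))

% With the single-input syntax, docfun should give the same words.
newDocuments = docfun(func,documents);
assert(isequal(string(newDocuments(1)),newWords))
```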

documents

Input documents, specified as a tokenizedDocument array.

Output Arguments


newDocuments

Output documents, returned as a tokenizedDocument array.

Version History

Introduced in R2017b