topkngrams

Most frequent n-grams

Syntax

tbl = topkngrams(bag)

tbl = topkngrams(bag,k)

tbl = topkngrams(___,Name,Value)

Description

tbl = topkngrams(bag) returns a table listing the five most frequently seen n-grams in the bag-of-n-grams model bag. The function, by default, is case sensitive.

tbl = topkngrams(bag,k) lists the k most frequently seen n-grams in the bag-of-n-grams model bag. The function, by default, is case sensitive.

tbl = topkngrams(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Most Frequent Bigrams of Bag-of-N-Grams Model

Create a table of the most frequent bigrams of a bag-of-n-grams model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model.

bag = bagOfNgrams(documents)

bag = 
  bagOfNgrams with properties:

          Counts: [154×8799 double]
      Vocabulary: [1×3092 string]
          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154

Find the top 5 bigrams.

tbl = topkngrams(bag)

tbl=5×3 table
         Ngram          Count    NgramLength
    ________________    _____    ___________

    "thou"    "art"      34           2
    "mine"    "eye"      15           2
    "thy"     "self"     14           2
    "thou"    "dost"     13           2
    "mine"    "own"      13           2

Find the top 10 bigrams.

tbl = topkngrams(bag,10)

tbl=10×3 table
         Ngram          Count    NgramLength
    ________________    _____    ___________

    "thou"    "art"      34           2
    "mine"    "eye"      15           2
    "thy"     "self"     14           2
    "thou"    "dost"     13           2
    "mine"    "own"      13           2
    "thy"     "sweet"    12           2
    "thy"     "love"     11           2
    "dost"    "thou"     10           2
    "thou"    "wilt"     10           2
    "love"    "thee"      9           2

Count N-Grams of Different Lengths

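Load the example data from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.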

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model. To count n-grams of length 2 and 3 (bigrams and trigrams), specify 'NgramLengths' to be the vector [2 3].

bag = bagOfNgrams(documents,'NgramLengths',[2 3])

bag = 
  bagOfNgrams with properties:

          Counts: [154×18022 double]
      Vocabulary: [1×3092 string]
          Ngrams: [18022×3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

View the 10 most common n-grams of length 2 (bigrams).

topkngrams(bag,10,'NgramLengths',2)

ans=10×3 table
            Ngram             Count    NgramLength
    ______________________    _____    ___________

    "thou"    "art"      ""    34           2
    "mine"    "eye"      ""    15           2
    "thy"     "self"     ""    14           2
    "thou"    "dost"     ""    13           2
    "mine"    "own"      ""    13           2
    "thy"     "sweet"    ""    12           2
    "thy"     "love"     ""    11           2
    "dost"    "thou"     ""    10           2
    "thou"    "wilt"     ""    10           2
    "love"    "thee"     ""     9           2

View the 10 most common n-grams of length 3 (trigrams).

topkngrams(bag,10,'NgramLengths',3)

ans=10×3 table
               Ngram                Count    NgramLength
    ____________________________    _____    ___________

    "thy"     "sweet"    "self"       4           3
    "why"     "dost"     "thou"       4           3
    "thy"     "self"     "thy"        3           3
    "thou"    "thy"      "self"       3           3
    "mine"    "eye"      "heart"      3           3
    "thou"    "shalt"    "find"       3           3
    "fair"    "kind"     "true"       3           3
    "thou"    "art"      "fair"       2           3
    "love"    "thy"      "self"       2           3
    "thy"     "self"     "thou"       2           3

Input Arguments

`bag` — Input bag-of-n-grams model
`bagOfNgrams` object

Input bag-of-n-grams model, specified as a bagOfNgrams object.

`k` — Number of n-grams
positive integer | `Inf`

Number of n-grams to return, specified as a positive integer or Inf.

If k is Inf, then the function returns all n-grams. The function sorts the n-grams in descending order of frequency.

Example: 20
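
As a minimal sketch of the Inf case, the following uses a small hypothetical two-document corpus (not the sonnets data) and compares the number of returned rows with the NumNgrams property of the model. With the default case-sensitive behavior, the two values would be expected to match.

```matlab
% Hypothetical toy corpus for illustration only.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfNgrams(documents);   % the default n-gram length is 2 (bigrams)

% Passing Inf for k requests every n-gram in the model.
tblAll = topkngrams(bag,Inf);

% Compare the number of returned rows with the n-gram count of the model.
numRows = height(tblAll)
numNgrams = bag.NumNgrams
```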

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'NgramLengths',[2 3] specifies to return the top bigrams and trigrams.

`NgramLengths` — N-gram lengths
positive integer | vector of positive integers

N-gram lengths, specified as the comma-separated pair consisting of 'NgramLengths' and a positive integer or a vector of positive integers.

If you specify NgramLengths, then the function returns n-grams of these lengths only. If you do not specify NgramLengths, then the function returns the top n-grams regardless of length.

Example: [1 2 3]

`IgnoreCase` — Option to ignore case
`false` (default) | `true`

Option to ignore case, specified as the comma-separated pair consisting of 'IgnoreCase' and one of the following:

false – treat n-grams differing only by case as separate n-grams.
true – treat n-grams differing only by case as the same n-gram and merge counts.
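
The two settings can be compared side by side. This is a sketch on a hypothetical toy corpus in which the bigrams "The cat" and "the cat" differ only by case; the variable names are illustrative.

```matlab
% Hypothetical toy corpus: "The cat" and "the cat" differ only by case.
documents = tokenizedDocument([
    "The cat sat"
    "the cat ran"]);
bag = bagOfNgrams(documents);

% Default behavior: n-grams that differ only by case are listed separately.
tblSensitive = topkngrams(bag,Inf)

% With IgnoreCase set to true, their counts are merged into one entry.
tblInsensitive = topkngrams(bag,Inf,'IgnoreCase',true)
```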

`ForceCellOutput` — Indicator for forcing output to be returned as cell array
`false` (default) | `true`

Indicator for forcing output to be returned as cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.

Data Types: logical
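
A short sketch of this option, assuming a hypothetical one-document corpus: even though bag is a scalar model, the output is a cell array, and the table is retrieved by indexing into it.

```matlab
% Hypothetical one-document corpus for illustration only.
documents = tokenizedDocument("a short example sentence for a short example");
bag = bagOfNgrams(documents);

% Request a cell array even though bag is a scalar model.
c = topkngrams(bag,3,'ForceCellOutput',true);

% Each cell holds the table for the corresponding element of bag.
tbl = c{1}
```

Forcing cell output can keep downstream code uniform when the same code path handles both scalar and array inputs.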

Output Arguments

`tbl` — Top n-grams
table | cell array of tables

Top n-grams, returned as a table or a cell array of tables. The function sorts the n-grams in descending order of frequency.

The table has the following columns:

`Ngram`	N-gram, specified as a string vector.
`Count`	Number of times the n-gram appears in the bag-of-n-grams model.
`NgramLength`	Length of the n-gram.

If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of tables. Each element in the cell array is a table containing the top n-grams of the corresponding element of bag.
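
One way to work with the returned table is to collapse each n-gram into a single phrase and pair it with its count. The sketch below assumes a hypothetical toy corpus; phrases and summary are illustrative names, not part of the function's output.

```matlab
% Hypothetical toy corpus for illustration only.
documents = tokenizedDocument([
    "a short example sentence"
    "another short example sentence"]);
bag = bagOfNgrams(documents);
tbl = topkngrams(bag,3);

% Ngram holds one n-gram per row and one word per column.
% join concatenates the words in each row (dimension 2); strip trims the
% padding left by empty entries when n-grams of different lengths share
% the table.
phrases = strip(join(tbl.Ngram," ",2));

% Pair each phrase with its count.
summary = table(phrases,tbl.Count,'VariableNames',{'Phrase','Count'})
```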

Version History

Introduced in R2018a
