bpeTokenizer

Byte pair encoding tokenizer

Since R2024a

    Description

    A byte pair encoding (BPE) tokenizer maps text data to sequences of integers.

    Creation

    Description

    tokenizer = bpeTokenizer(vocabulary,mergelist) creates a bpeTokenizer object for the specified vocabulary and merge list.

    tokenizer = bpeTokenizer(vocabulary,mergelist,Name=Value) sets Properties using one or more name-value arguments in addition to the input arguments in previous syntaxes. For example, Pretokenizer="gpt4" specifies to use the GPT-4 pretokenizer.

    Input Arguments

    vocabulary — Tokenizer vocabulary, specified as a string array or cell array of character vectors.

    The vocabulary must contain the values of the PaddingToken, StartToken, UnknownToken, and SeparatorToken properties. The vocabulary must also contain the tokens in the merge list and the tokens that result from merging the tokens in the merge list.

    The vocabulary must contain the byte values that make up the tokens. To store the vocabulary and merge list in string arrays and text files, you must represent them as printable non-whitespace characters. In particular, bpeTokenizer objects require you to represent some characters differently than their byte encoding. The format of the byte values in the vocabulary and merge list must be consistent with the formats used in GPT-2 and other similar transformer neural networks:

    • Bytes that correspond to printable ASCII characters — Represented by those characters in the vocabulary. For example, the vocabulary must represent the character "a" as "a".

    • Byte value 173 — Represented as the byte value 323 in the vocabulary. That is, the vocabulary must represent the byte 173, which can appear in two-byte characters like "ŭ" (composed as [197 173] in UTF-8 format), as "Ń" (char(323)).

    • Bytes with values 127 through 160 — Represented as their own byte value plus 162 in the vocabulary. For example, the vocabulary must represent the byte 140, which can appear in two-byte characters like "Č" (composed as [196 140] in UTF-8 representation), as "Į" (char(140+162)).

    • Bytes with values greater than 160, excluding 173 — Represented as their own byte value in the vocabulary. For example, the vocabulary must represent the byte 195, which can appear in two-byte characters like "é" (composed as [195 169] in UTF-8 representation), as "Ã" (char(195)).

    • Bytes with values 0 through 32 — Represented as their own byte value plus 256. For example, the vocabulary must represent the space character " " (char(32)) as "Ġ" (char(32+256)).

    Some characters require multiple bytes to represent them. For example, you can represent the emoji character "😎" as the sequence of bytes [240 159 152 142]. To include such characters in the vocabulary, also include the representation of the bytes that compose the character. For example, to include the emoji character "😎" in the vocabulary, you must also include char(240), char(159+162), char(152+162), and char(142+162). The merge list must also contain pairs such that the character is the result of a series of these merges.

    To convert Unicode character representations to sequences of numeric bytes, use the unicode2native function with the encoding "UTF-8".
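
    For example, here is a minimal sketch, assuming only the mapping rules listed above, that converts the bytes of the emoji character "😎" into the characters that must appear in the vocabulary:

    bytes = double(unicode2native("😎","UTF-8"));    % [240 159 152 142]
    vocabChars = strings(1,numel(bytes));
    for k = 1:numel(bytes)
        b = bytes(k);
        if (b >= 33 && b <= 126) || (b > 160 && b ~= 173)
            vocabChars(k) = string(char(b));         % printable ASCII and most bytes above 160 map to themselves
        elseif b <= 32
            vocabChars(k) = string(char(b + 256));   % bytes 0 through 32 shift up by 256
        elseif b == 173
            vocabChars(k) = string(char(323));       % byte 173 maps to char(323)
        else
            vocabChars(k) = string(char(b + 162));   % bytes 127 through 160 shift up by 162
        end
    end
    vocabChars                                       % "ð"  "Ł"  "ĺ"  "İ"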

    Data Types: string | cell

    mergelist — Pairs of tokens to merge, specified as a numPairs-by-2 string array or cell array of character vectors, where numPairs is the number of pairs.

    The vocabulary must also contain the tokens that result from merging tokens in the merge list.

    Data Types: string | cell

    Properties

    Pretokenizer — Pretokenizer, specified as one of these values:

    • "gpt2" — Use GPT-2 pretokenizer.

    • "gpt4" — Use GPT-4 pretokenizer.

    • "bert" — Use BERT pretokenizer.

    • "whitespace" — Use a whitespace pretokenizer, that initially splits words at whitespace characters.

    • "mecab" — Use MeCab pretokenizer.

    • "none" — Do not pretokenize.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    IgnoreCase — Flag to ignore case, specified as 1 (true) or 0 (false).

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    StripAccents — Flag to strip accents, specified as 1 (true) or 0 (false).

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    ContextSize — Context size, specified as a positive integer.

    The context size is the number of words or subwords that the tokenizer processes when splitting and merging tokens. A larger context size allows the model to consider more surrounding tokens, which can help capture long-range dependencies, but also increases the computational and memory requirements.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
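
    For example, a sketch that sets a larger context window at creation, assuming the vocabulary and mergelist variables from the Examples section (the value 1024 is illustrative):

    tokenizer = bpeTokenizer(vocabulary,mergelist,ContextSize=1024);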

    PaddingToken — Padding token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: char | string

    PaddingCode — Padding code, specified as a positive integer.

    This property is read-only.

    Data Types: double

    StartToken — Start token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: char | string

    StartCode — Start code, specified as a positive integer.

    This property is read-only.

    Data Types: double

    UnknownToken — Unknown token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: char | string

    UnknownCode — Unknown code, specified as a positive integer.

    This property is read-only.

    Data Types: double

    SeparatorToken — Separator token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bpeTokenizer object. After you create a bpeTokenizer object, this property is read-only.

    Data Types: char | string

    SeparatorCode — Separator code, specified as a positive integer.

    This property is read-only.

    Data Types: double
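
    For example, here is a minimal, illustrative sketch that includes four placeholder special tokens (and the characters they are made of) in the vocabulary and assigns them when creating the tokenizer. The token names "[PAD]", "[CLS]", "[UNK]", and "[SEP]" and the tiny vocabulary are placeholders, not requirements.

    vocabulary = ["[PAD]" "[CLS]" "[UNK]" "[SEP]" ...
        "[" "]" "A" "C" "D" "E" "K" "L" "N" "P" "S" "U" ...
        "a" "b" "ab"];
    mergelist = ["a" "b"];    % the merged token "ab" is also in the vocabulary
    tokenizer = bpeTokenizer(vocabulary,mergelist, ...
        PaddingToken="[PAD]",StartToken="[CLS]", ...
        UnknownToken="[UNK]",SeparatorToken="[SEP]");

    The corresponding PaddingCode, StartCode, UnknownCode, and SeparatorCode properties are then read-only positive integers that you can query but not set.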

    Object Functions

    encode — Tokenize and encode text for transformer neural network
    decode — Convert token codes to tokens
    encodeTokens — Convert tokens to token codes
    wordTokenize — Tokenize text into words using tokenizer

    Examples

    Create a BPE tokenizer.

    Create a vocabulary containing the characters "a" through "z" and the pairs of repeating vowels "aa", "ee", "ii", "oo", and "uu".

    vocabulary = ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
        "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"  ...
        "aa" "ee" "ii" "oo" "uu"];

    Create a merge list that indicates to merge repeating vowels.

    mergelist = [
        "a" "a"
        "e" "e"
        "i" "i"
        "o" "o"
        "u" "u"];

    Create a BPE tokenizer with the vocabulary and merge list and specify the whitespace tokenizer.

    tokenizer = bpeTokenizer(vocabulary,mergelist,Pretokenizer="whitespace")
    tokenizer = 
      bpeTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: ""
           PaddingCode: NaN
            StartToken: ""
             StartCode: NaN
          UnknownToken: ""
           UnknownCode: NaN
        SeparatorToken: ""
         SeparatorCode: NaN
          Pretokenizer: "whitespace"
           ContextSize: 512
    
    

    Encode the phrase "a cool breeze frees trees" as a sequence of integers using the tokenizer. These integers specify the indices of the tokens in the vocabulary.

    str = "a cool breeze frees trees";
    tokenCodes = encode(tokenizer,str)
    tokenCodes = 1x1 cell array
        {[1 3 30 12 2 18 28 26 5 6 18 28 19 20 18 28 19]}
    
    
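    To map the token codes back to text, you can use the decode object function listed above. Here is a minimal sketch, assuming decode accepts the cell array that encode returns:

    str = decode(tokenizer,tokenCodes)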


    Version History

    Introduced in R2024a