bpeTokenizer
Description
A byte pair encoding (BPE) tokenizer maps text data to sequences of integers.
Creation
Syntax
tokenizer = bpeTokenizer(vocabulary,mergelist)
tokenizer = bpeTokenizer(vocabulary,mergelist,Name=Value)
Description
tokenizer = bpeTokenizer(vocabulary,mergelist) creates a bpeTokenizer object for the specified vocabulary and merge list.
tokenizer = bpeTokenizer(vocabulary,mergelist,Name=Value) sets Properties using one or more name-value arguments in addition to the input arguments in previous syntaxes. For example, Pretokenizer="gpt4" specifies to use the GPT-4 pretokenizer.
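For example, this sketch creates a tokenizer from a placeholder three-token vocabulary with a single merge rule and specifies the GPT-4 pretokenizer (the vocabulary and merge list here are illustrative only, not a usable model vocabulary):
vocabulary = ["a" "b" "ab"];   % must contain the merge inputs and the merged token
mergelist = ["a" "b"];         % one pair: merge "a" and "b" into "ab"
tokenizer = bpeTokenizer(vocabulary,mergelist,Pretokenizer="gpt4");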
Input Arguments
vocabulary
— Tokenizer vocabulary
string array | cell array of character vectors
Tokenizer vocabulary, specified as a string array or cell array of character vectors.
The vocabulary must contain the values of the PaddingToken, StartToken, UnknownToken, and SeparatorToken properties. The vocabulary must also contain the tokens in the merge list and the tokens that result from merging the tokens in the merge list.
The vocabulary must contain the byte values that make up the tokens. To store the
vocabulary and merge list in string arrays and text files, you must represent them as
printable non-whitespace characters. In particular, bpeTokenizer
objects require you to represent some characters differently than their byte encoding.
The format of the byte values in the vocabulary and merge list must be consistent with
the formats used in GPT-2 and other similar transformer neural networks:
- Bytes that correspond to printable ASCII characters — Represented by those characters in the vocabulary. For example, the vocabulary must represent the character "a" as "a".
- Byte value 173 — Represented as the byte value 323 in the vocabulary. That is, the vocabulary must represent the byte 173, which can appear in two-byte characters like "ŭ" (composed as [197 173] in UTF-8 format), as "Ń" (char(323)).
- Bytes with values 127 through 160 — Represented as their own byte value plus 162 in the vocabulary. For example, the vocabulary must represent the byte 140, which can appear in two-byte characters like "Č" (composed as [196 140] in UTF-8 format), as "Į" (char(140+162)).
- Bytes with values greater than 160, excluding 173 — Represented as their own byte value in the vocabulary. For example, the vocabulary must represent the byte 195, which can appear in two-byte characters like "é" (composed as [195 169] in UTF-8 format), as "Ã" (char(195)).
- Bytes with values 0 through 32 — Represented as their own byte value plus 256 in the vocabulary. For example, the vocabulary must represent the space character " " (char(32)) as "Ġ" (char(32+256)).
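You can check these byte-to-character mappings at the MATLAB command line. For example:
char(32+256)    % 'Ġ' — representation of the space character (byte 32)
char(140+162)   % 'Į' — representation of byte 140
char(195)       % 'Ã' — representation of byte 195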
Some characters require multiple bytes to represent them. For example, you can represent the emoji character "😎" as the sequence of bytes [240 159 152 142]. To include such characters in the vocabulary, also include the representation of the bytes that compose the character. For example, to include the emoji character "😎" in the vocabulary, you must also include char(240), char(159+162), char(152+162), and char(142+162). The merge list must also contain pairs such that the character is the result of a series of these merges.
To convert Unicode character representations to sequences of numeric bytes, use the unicode2native function with the encoding "UTF-8".
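For example, this sketch computes the byte sequence for "😎" and builds its printable vocabulary representation using the rules above:
bytes = unicode2native("😎","UTF-8")   % returns [240 159 152 142]

% Byte 240 is greater than 160 (and not 173), so it maps to itself;
% bytes 159, 152, and 142 are in the range 127 through 160, so add 162.
representation = [char(240) char(159+162) char(152+162) char(142+162)]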
Data Types: string | cell
mergelist
— Pairs of tokens to merge
string array | cell array of character vectors
Pairs of tokens to merge, specified as a numPairs-by-2 string array or cell array of character vectors, where numPairs is the number of pairs.
The vocabulary must also contain the tokens that result from merging tokens in the merge list.
Data Types: string | cell
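For example, a 2-by-2 merge list that first merges "a" and "a" into "aa", and then "aa" and "a" into "aaa" (the vocabulary must then contain both "aa" and "aaa"):
mergelist = [
    "a"  "a"
    "aa" "a"];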
Properties
Pretokenizer
— Pretokenizer
"gpt2"
(default) | "gpt4"
| "bert"
| "whitespace"
| "mecab"
| "none"
Pretokenizer, specified as one of these values:
"gpt2"
— Use GPT-2 pretokenizer."gpt4"
— Use GPT-4 pretokenizer."bert"
— Use BERT pretokenizer."whitespace"
— Use a whitespace pretokenizer, that initially splits words at whitespace characters."mecab"
— Use MeCab pretokenizer."none"
— Do not pretokenize.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
IgnoreCase
— Flag to ignore case
true or 1 (default) | false or 0
Flag to ignore case, specified as 1 (true) or 0 (false).
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
StripAccents
— Flag to strip accents
true or 1 (default) | false or 0
Flag to strip accents, specified as 1 (true) or 0 (false).
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
ContextSize
— Context size
512 (default) | positive integer
Context size, specified as a positive integer.
The context size is the number of words or subwords that the tokenizer processes when splitting and merging tokens. A larger context size allows the model to consider more surrounding tokens, which can help capture long-range dependencies, but also increases the computational and memory requirements.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
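For example, this sketch creates a tokenizer with a larger context size, assuming vocabulary and mergelist are defined as described above:
tokenizer = bpeTokenizer(vocabulary,mergelist,ContextSize=1024);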
PaddingToken
— Padding token
""
(default) | string scalar
Padding token, specified as a string scalar.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: char | string
PaddingCode
— Padding code
positive integer
This property is read-only.
Padding code, specified as a positive integer.
Data Types: double
StartToken
— Start token
""
(default) | string scalar
Start token, specified as a string scalar.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: char | string
StartCode
— Start code
positive integer
This property is read-only.
Start code, specified as a positive integer.
Data Types: double
UnknownToken
— Unknown token
""
(default) | string scalar
Unknown token, specified as a string scalar.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: char | string
UnknownCode
— Unknown code
positive integer
This property is read-only.
Unknown code, specified as a positive integer.
Data Types: double
SeparatorToken
— Separator token
""
(default) | string scalar
Separator token, specified as a string scalar.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: char | string
SeparatorCode
— Separator code
positive integer
This property is read-only.
Separator code, specified as a positive integer.
Data Types: double
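As a sketch of how the special tokens and their codes relate, this hypothetical example includes the special tokens in the vocabulary, as required, and then inspects the corresponding code (the bracketed token names are placeholders):
vocabulary = ["[PAD]" "[CLS]" "[UNK]" "[SEP]" "a" "b" "ab"];
mergelist = ["a" "b"];
tokenizer = bpeTokenizer(vocabulary,mergelist, ...
    PaddingToken="[PAD]",StartToken="[CLS]", ...
    UnknownToken="[UNK]",SeparatorToken="[SEP]");
tokenizer.PaddingCode   % positive integer code for "[PAD]"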
Object Functions
encode — Tokenize and encode text for transformer neural network
decode — Convert token codes to tokens
encodeTokens — Convert tokens to token codes
wordTokenize — Tokenize text into words using tokenizer
Examples
Create BPE Tokenizer
Create a BPE tokenizer.
Create a vocabulary containing the characters "a" through "z" and the pairs of repeating vowels "aa", "ee", "ii", "oo", and "uu".
vocabulary = ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
    "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" ...
    "aa" "ee" "ii" "oo" "uu"];
Create a merge list that indicates to merge repeating vowels.
mergelist = [
    "a" "a"
    "e" "e"
    "i" "i"
    "o" "o"
    "u" "u"];
Create a BPE tokenizer with the vocabulary and merge list and specify the whitespace tokenizer.
tokenizer = bpeTokenizer(vocabulary,mergelist,Pretokenizer="whitespace")
tokenizer = 
  bpeTokenizer with properties:

        IgnoreCase: 1
      StripAccents: 1
      PaddingToken: ""
       PaddingCode: NaN
        StartToken: ""
         StartCode: NaN
      UnknownToken: ""
       UnknownCode: NaN
    SeparatorToken: ""
     SeparatorCode: NaN
      Pretokenizer: "whitespace"
       ContextSize: 512
Encode the phrase "a cool breeze frees trees"
as a sequence of integers using the tokenizer. These integers specify the indices of the tokens in the vocabulary.
str = "a cool breeze frees trees";
tokenCodes = encode(tokenizer,str)
tokenCodes = 1x1 cell array
{[1 3 30 12 2 18 28 26 5 6 18 28 19 20 18 28 19]}
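To convert the token codes back to text, you can pass them to the decode function listed in Object Functions. For example:
str2 = decode(tokenizer,tokenCodes)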
Algorithms
Byte Pair Encoding
Byte pair encoding (BPE) is a tokenization algorithm that allows transformer networks to handle a wide range of vocabulary without assigning individual tokens for every possible word. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows models to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.
These steps outline the algorithm for training a BPE tokenizer:
1. Start with a corpus of text. For example, a corpus that includes phrases like "use byte pair encoding to tokenize text". Split the text data into words using a specified pretokenization algorithm.
2. Initialize a vocabulary of bytes. For example, start with a vocabulary of ["a" "b" "c" ... "z"]. For non-ASCII characters, like emojis that consist of multiple bytes, start with the byte values that comprise the character.
3. Encode each word in the text data as a sequence of bytes, and represent the words as sequences of integers that specify the indices of the tokens in the vocabulary. For example, represent the word "use" as [21 19 5]. When the encoding of a character is more than one byte, the resulting sequence of bytes can have more elements than the number of characters in the word.
4. Count the frequency of all adjacent pairs of bytes in the corpus. For example, among the words ["use" "byte" "pair" "encoding" "to" "tokenize" "text"], the token pairs ["t" "e"], ["e" "n"], and ["t" "o"] appear twice, and the remaining pairs appear once. For a sketch of this counting step, see the example after these steps.
5. Identify the most frequent pair and add the corresponding merged token to the vocabulary. In the words represented as sequences of vocabulary indices, replace the corresponding pairs with the index of the new merged token in the vocabulary. Then, add this token pair to the merge list. For example, append the token pair ["t" "e"] to the merge list. Then, add the corresponding merged token "te" to the vocabulary so that it has the index 27. Finally, in the text data represented as vocabulary indices, replace the pairs of vocabulary indices [20 5] (which corresponds to ["t" "e"]) with the corresponding new vocabulary index:
   - The representation [2 25 20 5] for the word "byte" becomes [2 25 27].
   - The representation [20 5 24 20] for the word "text" becomes [27 24 20].
6. Repeat the frequency count and merge operations until you reach a specified number of iterations or vocabulary size. For example, repeating these steps several times leads to merging the pair ["b" "y"] to make the token "by", and then subsequently the pair ["by" "te"] to make the token "byte".
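This sketch (not the toolbox implementation) illustrates the pair-counting step in MATLAB, using a toy corpus in which each cell holds the current token sequence of one word:
% Toy corpus of byte-encoded words.
words = {["u" "s" "e"],["b" "y" "t" "e"],["t" "e" "x" "t"]};

% Collect all adjacent token pairs across the corpus.
pairs = strings(0,1);
for i = 1:numel(words)
    w = words{i};
    for j = 1:numel(w)-1
        pairs(end+1,1) = w(j) + " " + w(j+1); %#ok<AGROW>
    end
end

% Count each distinct pair and identify the most frequent one.
[u,~,g] = unique(pairs);
counts = accumarray(g,1);
[~,top] = max(counts);
u(top)   % returns "t e", the pair to merge next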
These steps outline how a BPE tokenizer tokenizes new text:
1. Pretokenization — Split the text into individual words.
2. Byte encoding — Encode each word as a sequence of bytes.
3. Merge — Starting at the top of the merge list and progressing through it, iteratively apply each merge to pairs of tokens when possible, as in the sketch after these steps.
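For example, this is a minimal sketch (not the toolbox implementation) of the merge step for a single pretokenized word. On each pass, the highest-priority applicable rule in the merge list is applied first:
function tokens = applyMerges(tokens,mergelist)
    % tokens    — string array of single-byte tokens for one word
    % mergelist — numPairs-by-2 string array, highest priority first
    merged = true;
    while merged
        merged = false;
        for k = 1:size(mergelist,1)   % earlier rules take priority
            pair = mergelist(k,:);
            idx = find(tokens(1:end-1) == pair(1) & tokens(2:end) == pair(2),1);
            if ~isempty(idx)
                % Replace the adjacent pair with the merged token.
                tokens = [tokens(1:idx-1) strjoin(pair,"") tokens(idx+2:end)];
                merged = true;
                break
            end
        end
    end
end

For example, applyMerges(["c" "o" "o" "l"],["o" "o"]) returns ["c" "oo" "l"].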
Version History
Introduced in R2024a
See Also
bert | bertDocumentClassifier | encode | decode | encodeTokens | subwordTokenize | wordTokenize
Topics
- Train BERT Document Classifier
- Classify Text Data Using Deep Learning
- Create Simple Text Model for Classification
- Analyze Text Data Using Topic Models
- Analyze Text Data Using Multiword Phrases
- Sequence Classification Using Deep Learning (Deep Learning Toolbox)
- Deep Learning in MATLAB (Deep Learning Toolbox)