bpeTokenizer
Description
A byte pair encoding (BPE) tokenizer maps text data to sequences of integers.
Creation
Syntax
tokenizer = bpeTokenizer(vocabulary,mergelist)
tokenizer = bpeTokenizer(vocabulary,mergelist,Name=Value)
Description
tokenizer = bpeTokenizer(vocabulary,mergelist) creates a bpeTokenizer object for the specified vocabulary and merge list.
tokenizer = bpeTokenizer(vocabulary,mergelist,Name=Value) sets additional properties using one or more name-value arguments.
Input Arguments
vocabulary
— Tokenizer vocabulary
string array | cell array of character vectors
Tokenizer vocabulary, specified as a string array or cell array of character vectors.
The vocabulary must contain the values of the PaddingToken, StartToken, UnknownToken, and SeparatorToken properties. The vocabulary must also contain the tokens in the merge list and the tokens that result from merging the tokens in the merge list.
The vocabulary must contain the byte values that make up the tokens. To store the vocabulary and merge list in string arrays and text files, the byte values must be represented as printable non-whitespace characters. In particular, bpeTokenizer objects require that some byte values be represented differently from their raw byte encoding. The format of the byte values in the vocabulary and merge list must be consistent with the formats used in GPT-2 and many other transformer neural networks:
- Bytes that correspond to printable ASCII characters are represented by those characters in the vocabulary. For example, the character "a" is represented as "a" in the vocabulary.
- The byte value 173 is represented as the byte value 323 in the vocabulary. That is, the byte 173, which can appear in two-byte characters like "ŭ" (composed as [197 173] in UTF-8 representation), is represented as "Ń" (char(323)) in the vocabulary.
- Bytes with values 127 through 160 are represented as their own byte value plus 162 in the vocabulary. For example, the byte 140, which can appear in two-byte characters like "Č" (composed as [196 140] in UTF-8 representation), is represented as "Į" (char(140+162)) in the vocabulary.
- Bytes with values greater than 160, excluding 173, are represented as their own byte value in the vocabulary. For example, the byte 195, which can appear in two-byte characters like "é" (composed as [195 169] in UTF-8 representation), is represented as "Ã" (char(195)) in the vocabulary.
- Bytes with values 0 through 32 are represented as their own byte value plus 256 in the vocabulary. For example, the space character " " (char(32)) is represented as "Ġ" (char(32+256)) in the vocabulary.
Some characters require multiple bytes to represent them. For example, the emoji character "😎" is represented by the sequence of bytes [240 159 152 142]. To include such characters in the vocabulary, also include the representation of the bytes that compose the character. For example, to include the emoji character "😎" in the vocabulary, also include char(240), char(159+162), char(152+162), and char(142+162). The merge list must also contain pairs such that the character is the result of a series of these merges.
To convert Unicode character representations to sequences of numeric bytes, use the unicode2native function with the encoding "UTF-8".
Data Types: string | cell
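The byte-to-character mapping described above can be sketched as a small MATLAB helper. This is an illustration only; byteToVocabChar is a hypothetical name, not a toolbox function:

```matlab
function c = byteToVocabChar(b)
% Map a raw byte value (0-255) to the printable character that
% represents it in the vocabulary, following the GPT-2-style
% convention described above.
if b >= 33 && b <= 126
    c = char(b);        % printable ASCII: represented as itself
elseif b <= 32
    c = char(b + 256);  % bytes 0-32: shifted by 256, e.g., space becomes "Ġ"
elseif b >= 127 && b <= 160
    c = char(b + 162);  % bytes 127-160: shifted by 162
elseif b == 173
    c = char(323);      % byte 173 is a special case
else
    c = char(b);        % bytes above 160 (except 173): represented as themselves
end
end
```

For example, byteToVocabChar(32) returns 'Ġ' (char(288)) and byteToVocabChar(195) returns 'Ã' (char(195)).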
mergelist
— Pairs of tokens to merge
string array | cell array of character vectors
Pairs of tokens to merge, specified as a numPairs-by-2 string array or cell array of character vectors, where numPairs is the number of pairs.
The vocabulary must also contain the tokens that result from merging tokens in the merge list.
Data Types: string | cell
Properties
Pretokenizer
— Pretokenizer
"gpt2"
(default) | "gpt4"
| "bert"
| "whitespace"
| "mecab"
| "none"
Pretokenizer, specified as one of these values:
"gpt2"
— Use GPT-2 pretokenizer."gpt4"
— Use GPT-4 pretokenizer."bert"
— Use BERT pretokenizer."whitespace"
— Use a whitespace pretokenizer, that initially splits words at whitespace characters."mecab"
— Use MeCab pretokenizer."none"
— Do not pretokenize.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
IgnoreCase
— Flag to ignore case
1 (true) (default) | 0 (false)
Flag to ignore case, specified as 1 (true) or 0 (false).
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
StripAccents
— Flag to strip accents
1 (true) (default) | 0 (false)
Flag to strip accents, specified as 1 (true) or 0 (false).
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
ContextSize
— Context size
512 (default) | positive integer
Context size, specified as a positive integer.
The context size is the number of words or subwords that the tokenizer processes when splitting and merging tokens. A larger context size allows the model to consider more surrounding tokens, which can help capture long-range dependencies, but also increases the computational and memory requirements.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
PaddingToken
— Padding token
""
(default) | string scalar
Padding token, specified as a string scalar.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: char | string
PaddingCode
— Padding code
positive integer
This property is read-only.
Padding code, specified as a positive integer.
Data Types: double
StartToken
— Start token
""
(default) | string scalar
Start token, specified as a string scalar.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: char | string
StartCode
— Start code
positive integer
This property is read-only.
Start code, specified as a positive integer.
Data Types: double
UnknownToken
— Unknown token
""
(default) | string scalar
Unknown token, specified as a string scalar.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: char | string
UnknownCode
— Unknown code
positive integer
This property is read-only.
Unknown code, specified as a positive integer.
Data Types: double
SeparatorToken
— Separator token
""
(default) | string scalar
Separator token, specified as a string scalar.
To set this property, use the corresponding name-value argument when you create the
bpeTokenizer
object. After you create a bpeTokenizer
object, this property is read-only.
Data Types: char | string
SeparatorCode
— Separator code
positive integer
This property is read-only.
Separator code, specified as a positive integer.
Data Types: double
Object Functions
encode | Tokenize and encode text for transformer neural network
decode | Convert token codes to tokens
encodeTokens | Convert tokens to token codes
wordTokenize | Tokenize text into words using tokenizer
Examples
Create BPE Tokenizer
Create a BPE tokenizer.
Create a vocabulary containing the characters "a" through "z" and the pairs of repeating vowels "aa", "ee", "ii", "oo", and "uu".
vocabulary = ["a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
    "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" ...
    "aa" "ee" "ii" "oo" "uu"];
Create a merge list that indicates to merge repeating vowels.
mergelist = [
    "a" "a"
    "e" "e"
    "i" "i"
    "o" "o"
    "u" "u"];
Create a BPE tokenizer with the vocabulary and merge list and specify the whitespace tokenizer.
tokenizer = bpeTokenizer(vocabulary,mergelist,Pretokenizer="whitespace")
tokenizer = 
  bpeTokenizer with properties:

        IgnoreCase: 1
      StripAccents: 1
      PaddingToken: ""
       PaddingCode: NaN
        StartToken: ""
         StartCode: NaN
      UnknownToken: ""
       UnknownCode: NaN
    SeparatorToken: ""
     SeparatorCode: NaN
      Pretokenizer: "whitespace"
       ContextSize: 512
Encode the phrase "a cool breeze frees trees" as a sequence of integers using the tokenizer. These integers index into the tokenizer vocabulary.
str = "a cool breeze frees trees";
tokenCodes = encode(tokenizer,str)
tokenCodes = 1x1 cell array
{[1 3 30 12 2 18 28 26 5 6 18 28 19 20 18 28 19]}
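To map the token codes back to their tokens, you can use the decode object function (listed under Object Functions). A minimal usage sketch:

```matlab
decodedTokens = decode(tokenizer,tokenCodes)
```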
Algorithms
Byte Pair Encoding
Byte pair encoding (BPE) is a tokenization algorithm that allows transformer networks to handle a wide range of vocabulary without assigning individual tokens for every possible word. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows models to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.
These steps outline the algorithm for training a BPE tokenizer:
1. Start with a corpus of text. For example, a corpus that includes phrases like "use byte pair encoding to tokenize text". Split the text data into words using a specified pretokenization algorithm.
2. Initialize a vocabulary of bytes. For example, start with a vocabulary of ["a" "b" "c" ... "z"]. For non-ASCII characters like emojis that comprise multiple bytes, start with the byte values that make up the character.
3. Encode each word in the text data as a sequence of bytes, and represent the words as sequences of integers that index into the vocabulary. For example, represent the word "use" as [21 19 5]. When the encoding of a character is more than one byte, the resulting sequence of bytes can have more elements than the number of characters in the word.
4. Count the frequency of all adjacent pairs of bytes in the corpus. For example, among the words ["use" "byte" "pair" "encoding" "to" "tokenize" "text"], the token pairs ["t" "e"], ["e" "n"], and ["t" "o"] appear twice, and the remaining pairs appear once.
5. Identify the most frequent pair and add the corresponding merged token to the vocabulary. In the words represented as sequences of vocabulary indices, replace the corresponding pairs with the index of the new merged token in the vocabulary. Then, add this token pair to the merge list. For example, append the token pair ["t" "e"] to the merge list. Then, add the corresponding merged token "te" to the vocabulary so that it has the index 27. Then, in the text data represented as vocabulary indices, replace the pairs of vocabulary indices [20 5] (which corresponds to ["t" "e"]) with the corresponding new vocabulary index:
   - The representation [2 25 20 5] for the word "byte" becomes [2 25 27].
   - The representation [20 5 24 20] for the word "text" becomes [27 24 20].
6. Repeat the frequency count and merge operations until you reach a specified number of iterations or vocabulary size. For example, repeating these steps several times leads to merging the pair ["b" "y"] to make the token "by", and then subsequently the pair ["by" "te"] to make the token "byte".
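The training loop above can be sketched in MATLAB. This is an illustration of the algorithm only, not the toolbox's implementation, and the variable names are hypothetical:

```matlab
% Minimal BPE training sketch. Each word is a cell array of tokens.
% Each iteration counts adjacent token pairs, merges the most frequent
% pair everywhere, and records it in the merge list.
words = {{'u','s','e'},{'b','y','t','e'},{'t','o'},{'t','e','x','t'}};
mergelist = strings(0,2);
numMerges = 3;
for m = 1:numMerges
    % Count all adjacent token pairs across the corpus.
    counts = containers.Map('KeyType','char','ValueType','double');
    for w = 1:numel(words)
        tok = words{w};
        for j = 1:numel(tok)-1
            key = [tok{j} ' ' tok{j+1}];
            if isKey(counts,key)
                counts(key) = counts(key) + 1;
            else
                counts(key) = 1;
            end
        end
    end
    if counts.Count == 0
        break
    end
    % Pick the most frequent pair and record the merge.
    k = keys(counts);
    v = cell2mat(values(counts));
    [~,idx] = max(v);
    pair = strsplit(k{idx});
    mergelist(end+1,:) = string(pair); %#ok<SAGROW>
    % Apply the merge to every word.
    for w = 1:numel(words)
        tok = words{w};
        j = 1;
        while j < numel(tok)
            if strcmp(tok{j},pair{1}) && strcmp(tok{j+1},pair{2})
                tok{j} = [tok{j} tok{j+1}];
                tok(j+1) = [];
            else
                j = j + 1;
            end
        end
        words{w} = tok;
    end
end
```

With this toy corpus, the first recorded merge is ["t" "e"], consistent with the worked example in the steps above.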
These steps outline how a BPE tokenizer tokenizes new text:
1. Pretokenization — Split the text into individual words.
2. Byte-encoding — Encode each word into a sequence of bytes.
3. Merge — Starting at the top of the merge list and progressing through it, iteratively apply each merge to pairs of tokens when possible.
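The merge step for a single pretokenized word can be sketched as follows (an illustration only, not the toolbox's implementation):

```matlab
% Apply a merge list, in priority order, to one byte-encoded word.
tokens = {'t','e','x','t'};
mergelist = ["t" "e"];           % merge pairs, one row per merge
for k = 1:size(mergelist,1)
    j = 1;
    while j < numel(tokens)
        if strcmp(tokens{j},mergelist(k,1)) && strcmp(tokens{j+1},mergelist(k,2))
            tokens{j} = [tokens{j} tokens{j+1}];  % merge the pair in place
            tokens(j+1) = [];
        else
            j = j + 1;
        end
    end
end
% tokens is now {'te','x','t'}
```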
Version History
Introduced in R2024a
See Also
bert | bertDocumentClassifier | encode | decode | encodeTokens | subwordTokenize | wordTokenize
Topics
- Train BERT Document Classifier
- Classify Text Data Using Deep Learning
- Create Simple Text Model for Classification
- Analyze Text Data Using Topic Models
- Analyze Text Data Using Multiword Phrases
- Sequence Classification Using Deep Learning (Deep Learning Toolbox)
- Deep Learning in MATLAB (Deep Learning Toolbox)