bertTokenizer

WordPiece BERT tokenizer

Since R2023b

    Description

    A Bidirectional Encoder Representations from Transformers (BERT) neural network WordPiece tokenizer maps text data to sequences of integers.

    Creation

    Description

    tokenizer = bertTokenizer(vocabulary) creates a bertTokenizer object for the specified vocabulary.

    tokenizer = bertTokenizer(vocabulary,Name=Value) sets additional properties using one or more name-value arguments.
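
    The name-value syntax can be sketched as follows. This is a minimal illustration, assuming the name-value argument names match the property names listed under Properties (IgnoreCase, StripAccents, ContextSize); the small vocabulary and the value 128 are chosen only for illustration.

    ```matlab
    % Illustrative vocabulary that includes the default special tokens.
    vocabulary = ["math" "science" "engineering" "[PAD]" "[CLS]" "[UNK]" "[SEP]"];

    % Keep case and accents, and use a shorter context size than the default.
    % The values here are arbitrary choices for this sketch.
    tokenizer = bertTokenizer(vocabulary, ...
        IgnoreCase=false, ...
        StripAccents=false, ...
        ContextSize=128);

    % The properties reflect the values passed at creation.
    disp(tokenizer.IgnoreCase)
    disp(tokenizer.ContextSize)
    ```

    Because these properties are read-only after creation, changing any of them means creating a new bertTokenizer object.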

    Input Arguments

    vocabulary: Tokenizer vocabulary, specified as a string array or cell array of character vectors.

    The vocabulary must contain the values of the PaddingToken, StartToken, UnknownToken, and SeparatorToken properties.

    Data Types: string | cell
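
    The requirement on special tokens can be illustrated with a vocabulary that uses nondefault tokens. This is a sketch only: the token strings "<pad>", "<s>", "<unk>", and "</s>" are hypothetical choices, and the argument names are assumed to match the corresponding property names.

    ```matlab
    % Vocabulary with hypothetical custom special tokens.
    vocabulary = ["math" "science" "engineering" "<pad>" "<s>" "<unk>" "</s>"];

    % Every special token named here also appears in the vocabulary,
    % which is what the tokenizer requires.
    tokenizer = bertTokenizer(vocabulary, ...
        PaddingToken="<pad>", ...
        StartToken="<s>", ...
        UnknownToken="<unk>", ...
        SeparatorToken="</s>");

    disp(tokenizer.PaddingToken)
    disp(tokenizer.SeparatorToken)
    ```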

    Properties

    IgnoreCase: Flag to ignore case, specified as 1 (true) or 0 (false).

    To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

    StripAccents: Flag to strip accents, specified as 1 (true) or 0 (false).

    To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

    ContextSize: Context size, specified as a positive integer.

    The context size is the number of words or subwords that the tokenizer processes when splitting and merging tokens. A larger context size allows the model to consider more surrounding tokens, which can help capture long-range dependencies, but also increases the computational and memory requirements.

    To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    PaddingToken: Padding token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

    Data Types: char | string

    PaddingCode: Padding code, specified as a positive integer.

    This property is read-only.

    Data Types: double

    StartToken: Start token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

    Data Types: char | string

    StartCode: Start code, specified as a positive integer.

    This property is read-only.

    Data Types: double

    UnknownToken: Unknown token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

    Data Types: char | string

    UnknownCode: Unknown code, specified as a positive integer.

    This property is read-only.

    Data Types: double

    SeparatorToken: Separator token, specified as a string scalar.

    To set this property, use the corresponding name-value argument when you create the bertTokenizer object. After you create a bertTokenizer object, this property is read-only.

    Data Types: char | string

    SeparatorCode: Separator code, specified as a positive integer.

    This property is read-only.

    Data Types: double

    Object Functions

    encode            Tokenize and encode text for transformer neural network
    decode            Convert token codes to tokens
    encodeTokens      Convert tokens to token codes
    subwordTokenize   Tokenize text into subwords using BERT tokenizer
    wordTokenize      Tokenize text into words using tokenizer
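
    A short sketch of how these functions might be called together is shown below. It assumes each function takes the tokenizer as its first argument and the text, tokens, or token codes as its second; the exact output types and shapes are not described on this page, so the results are displayed rather than asserted. The input string is illustrative, and the word "and" is deliberately outside the vocabulary to show how unknown text is handled.

    ```matlab
    vocabulary = ["math" "science" "engineering" "[PAD]" "[CLS]" "[UNK]" "[SEP]"];
    tokenizer = bertTokenizer(vocabulary);

    str = "math and science";

    % Split the text into words, then into subwords from the vocabulary.
    words = wordTokenize(tokenizer, str)
    subwords = subwordTokenize(tokenizer, str)

    % Tokenize and map the text to token codes in one step.
    tokenCodes = encode(tokenizer, str)

    % Map the token codes back to tokens.
    decoded = decode(tokenizer, tokenCodes)

    % Map tokens that are already split to token codes.
    codes = encodeTokens(tokenizer, ["math" "science"])
    ```

    Use encode when starting from raw text, and encodeTokens when the text is already tokenized.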

    Examples

    Create a BERT tokenizer that has a vocabulary of the words "math", "science", and "engineering". Include tokens to use as padding, start, unknown, and separator tokens.

    vocabulary = ["math" "science" "engineering" "[PAD]" "[CLS]" "[UNK]" "[SEP]"];
    tokenizer = bertTokenizer(vocabulary)

    tokenizer = 
      bertTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: "[PAD]"
           PaddingCode: 4
            StartToken: "[CLS]"
             StartCode: 5
          UnknownToken: "[UNK]"
           UnknownCode: 6
        SeparatorToken: "[SEP]"
         SeparatorCode: 7
           ContextSize: 512
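
    In the output above, each token code matches the position of its token in the vocabulary ("[PAD]" is the fourth element and PaddingCode is 4). The following sketch checks that relationship for this example; it is an observation about this output, not a statement about how the codes are computed in general.

    ```matlab
    vocabulary = ["math" "science" "engineering" "[PAD]" "[CLS]" "[UNK]" "[SEP]"];
    tokenizer = bertTokenizer(vocabulary);

    % Index the vocabulary with each special-token code and compare
    % the result with the corresponding token property.
    isPadMatch = vocabulary(tokenizer.PaddingCode) == tokenizer.PaddingToken;
    isStartMatch = vocabulary(tokenizer.StartCode) == tokenizer.StartToken;
    isUnknownMatch = vocabulary(tokenizer.UnknownCode) == tokenizer.UnknownToken;
    isSeparatorMatch = vocabulary(tokenizer.SeparatorCode) == tokenizer.SeparatorToken;

    disp([isPadMatch isStartMatch isUnknownMatch isSeparatorMatch])
    ```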
    
    

    References

    [1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

    [2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

    Version History

    Introduced in R2023b