visionTransformer

Pretrained vision transformer (ViT) neural network

Since R2023b

    Description

    [net,classNames] = visionTransformer returns a base-sized ViT neural network (86.8 million parameters) with a patch size of 16. The network is fine-tuned using the ImageNet 2012 data set at a resolution of 384-by-384.

    This feature requires a Deep Learning Toolbox™ license and the Computer Vision Toolbox™ Model for Vision Transformer Network support package. You can download this support package from the Add-On Explorer. For more information, see Get and Manage Add-Ons.

    [net,classNames] = visionTransformer(modelName) returns the ViT neural network with the specified model name.

    [net,classNames] = visionTransformer(___,Name=Value) specifies additional options using one or more name-value arguments.

    Examples

    Load a pretrained ViT neural network using the visionTransformer function. If the Computer Vision Toolbox Model for Vision Transformer Network support package is not installed, then the function provides a link to the required support package in the Add-On Explorer. To install the support package, click the link, and then click Install.

    Load a pretrained ViT neural network and the class names. If the required support package is installed, then the function returns a dlnetwork object and a string array of class names.

    [net,classNames] = visionTransformer;

    View the neural network.

    net
    net = 
      dlnetwork with properties:
    
             Layers: [143x1 nnet.cnn.layer.Layer]
        Connections: [167x2 table]
         Learnables: [200x3 table]
              State: [0x3 table]
         InputNames: {'imageinput'}
        OutputNames: {'softmax'}
        Initialized: 1
    
      View summary with summary.
    
    

    View the number of classes.

    numClasses = numel(classNames)
    numClasses = 1000
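
    Classify an image with the pretrained network. This is a minimal sketch: it assumes that the network input size is 384-by-384-by-3 and that any input normalization is applied by the network image input layer. The image peppers.png ships with MATLAB; imresize requires Image Processing Toolbox™.

    im = imread("peppers.png");          % sample image included with MATLAB
    im = imresize(im,[384 384]);         % match the network input resolution
    X = dlarray(single(im),"SSCB");      % format as spatial, spatial, channel, batch
    scores = predict(net,X);             % class scores from the softmax output
    [~,idx] = max(extractdata(scores));  % index of the highest-scoring class
    label = classNames(idx)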
    

    Input Arguments

    modelName — Model name, specified as one of these values:

    • "base-16-imagenet-384" — Base-sized model (86.8 million parameters) with a patch size of 16. The network is fine-tuned using the ImageNet 2012 data set at a resolution of 384-by-384.

    • "small-16-imagenet-384" — Small-sized model (22.1 million parameters) with a patch size of 16. The network is fine-tuned using the ImageNet 2012 data set at a resolution of 384-by-384.

    • "tiny-16-imagenet-384" — Tiny-sized model (5.7 million parameters) with a patch size of 16. The network is fine-tuned using the ImageNet 2012 data set at a resolution of 384-by-384.

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: visionTransformer(DropoutProbability=0.2) returns a pretrained vision transformer neural network with the dropout probability set to 0.2.
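
    For example, this sketch combines a model name with a name-value argument (the values here are illustrative):

    [net,classNames] = visionTransformer("small-16-imagenet-384", ...
        DropoutProbability=0.1);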

    DropoutProbability — Probability of dropping out input elements in dropout layers, specified as a scalar in the range [0, 1).

    When you train a neural network with dropout layers, the layer randomly sets input elements to zero using the dropout mask rand(size(X)) < p, where X is the layer input and p is the layer dropout probability. The layer then scales the remaining elements by 1/(1-p).

    This operation helps prevent the network from overfitting [2], [3]. A higher probability results in the network dropping more elements during training. At prediction time, the output of the layer equals its input.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
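
    The following sketch illustrates the masking operation itself (not the internal implementation) for an example input X and dropout probability p:

    p = 0.2;                    % dropout probability
    X = rand(4,4,"single");     % example layer input
    mask = rand(size(X)) >= p;  % keep each element with probability 1-p
    Y = (X .* mask) ./ (1-p);   % zero dropped elements, rescale the rest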

    AttentionDropoutProbability — Probability of dropping out input elements in attention layers, specified as a scalar in the range [0, 1).

    When you train a neural network with attention layers, the layer randomly sets attention scores to zero using the dropout mask rand(size(scores)) < p, where scores contains the attention scores and p is the layer dropout probability. The layer then scales the remaining elements by 1/(1-p).

    This operation helps prevent the network from overfitting [2], [3]. A higher probability results in the network dropping more elements during training. At prediction time, the output of the layer equals its input.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Output Arguments

    net — Pretrained ViT neural network, returned as a dlnetwork (Deep Learning Toolbox) object.

    classNames — Class names, returned as a string array.

    References

    [1] Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale." Preprint, submitted June 3, 2021. https://doi.org/10.48550/arXiv.2010.11929.

    [2] Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." The Journal of Machine Learning Research 15, no. 1 (January 1, 2014): 1929–58.

    [3] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Communications of the ACM 60, no. 6 (May 24, 2017): 84–90. https://doi.org/10.1145/3065386.

    Version History

    Introduced in R2023b
