forward

Run forward pass on CLIP network

Since R2026a

    Description

    Add-On Required: This feature requires the Computer Vision Toolbox Model for OpenAI CLIP Network add-on.

    [imageEmbeddings,textEmbeddings,state] = forward(clip,I,text) computes the image feature embeddings of the images I and the text embeddings of the input text text from the output layers of the CLIP network, and also returns the updated network state, state.

    Note

    This functionality requires Deep Learning Toolbox™.
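The syntax above can be sketched as a small helper. This is a minimal sketch, not a toolbox function: the name clipForwardSketch and the loss are assumptions chosen for illustration, and it assumes that forward on a clipNetwork object can be evaluated through dlfeval in the same way as the dlnetwork function of the same name.

```matlab
function [loss,state] = clipForwardSketch(clip,I,text)
% clipForwardSketch  Hypothetical helper, not part of any toolbox.
% clip is a clipNetwork object, I is the image batch, and text is a
% B-element string array, as described on this page.

% Documented syntax: image embeddings, text embeddings, updated state.
[imageEmbeddings,textEmbeddings,state] = forward(clip,I,text);

% Illustrative loss (an assumption): mean squared distance between the
% embedding of each image and the embedding of its paired text.
d = imageEmbeddings - textEmbeddings;   % D-by-B
loss = mean(sum(d.^2,1),2);             % scalar
end
```

If gradients are needed, a call such as dlfeval(@clipForwardSketch,clip,I,text) would be the usual pattern, assuming the network supports automatic differentiation in this way.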

    Input Arguments

    clip – CLIP network, specified as a clipNetwork object.

    I – Image data, specified in one of these formats:

    • H-by-W-by-3-by-B numeric array representing a batch of B truecolor images.

    • H-by-W-by-1-by-B numeric array representing a batch of B grayscale images.

    • Datastore that reads and returns truecolor images.

    • Formatted dlarray (Deep Learning Toolbox) object with two spatial dimensions of the format "SSCB". You can specify multiple test images by including a batch dimension.
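The array-based formats in the list above can be constructed as follows. This is a sketch only: the 224-by-224 image size and the batch size of 4 are assumptions for illustration, and this page does not state which image sizes the network accepts.

```matlab
% Assumed sizes for illustration only: 224-by-224 images, batch of 4.
H = 224; W = 224; B = 4;

Irgb  = rand(H,W,3,B,"single");   % H-by-W-by-3-by-B truecolor batch
Igray = rand(H,W,1,B,"single");   % H-by-W-by-1-by-B grayscale batch

% Formatted dlarray with two spatial dimensions, a channel dimension,
% and a batch dimension ("SSCB"). dlarray requires Deep Learning Toolbox.
X = dlarray(Irgb,"SSCB");
```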

    text – Input text corresponding to each image, specified as a B-element string array, where B is the number of images in the batch. You must specify the text in English using ASCII characters. The function automatically pads or truncates each text input so that it contains exactly 77 tokens.
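A text input that meets the constraints above might look like this. The captions are placeholders, and the checks are an optional sketch of validation you could run before calling forward; they are not part of the function.

```matlab
% One caption per image; B must match the image batch size.
B = 3;
text = ["a photo of a cat"; "a photo of a dog"; "a diagram of a network"];

% Check the two documented constraints: B elements, ASCII-only characters.
isAscii = arrayfun(@(t) all(double(char(t)) < 128), text);
assert(numel(text) == B, 'Expected one text entry per image.');
assert(all(isAscii), 'Text must use ASCII characters only.');
```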

    Output Arguments

    imageEmbeddings – Image feature embeddings extracted from the CLIP model encoder, returned as a dlarray object. The size of the dlarray is 512-by-B or 768-by-B, depending on the value of the backbone argument of the clipNetwork object.

    Image Encoder Backbone (backbone Value)    Image Embeddings dlarray Size
    "vit-b-16"                                 512-by-B
    "vit-l-14" or "resnet50"                   768-by-B

    textEmbeddings – Text embeddings extracted from the CLIP model encoder, returned as a dlarray object. The size of the dlarray is 512-by-B or 768-by-B, depending on the value of the backbone argument of the clipNetwork object.

    Image Encoder Backbone (backbone Value)    Text Embeddings dlarray Size
    "vit-b-16"                                 512-by-B
    "vit-l-14" or "resnet50"                   768-by-B
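Because the image and text embeddings share the same D-by-B size, they can be compared column by column. The sketch below uses random placeholder arrays in place of real outputs (with D = 512, the size listed for "vit-b-16") and computes a cosine similarity matrix, a common way to compare CLIP embeddings. With actual dlarray outputs, you might first convert them with extractdata; that step is an assumption about your workflow and is not shown here.

```matlab
% Placeholder embeddings standing in for the outputs of forward:
% D-by-B, with D = 512 as listed for the "vit-b-16" backbone.
D = 512; B = 4;
rng(0);
imageEmbeddings = randn(D,B,"single");
textEmbeddings  = randn(D,B,"single");

% Normalize each column to unit Euclidean length.
imgN = imageEmbeddings ./ vecnorm(imageEmbeddings,2,1);
txtN = textEmbeddings  ./ vecnorm(textEmbeddings,2,1);

% similarity(i,j) is the cosine similarity of image i and text j.
similarity = imgN' * txtN;   % B-by-B

% Index of the best-matching text for each image.
[~,bestText] = max(similarity,[],2);
```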

    state – Updated network state, returned as a table with these columns:

    • Layer – Layer name, returned as a string scalar.

    • Parameter – Parameter name, returned as a string scalar.

    • Value – Parameter value, returned as a numeric array.

    The network state contains information remembered by the network between iterations. Use the state to update the network parameters for training.
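To show how a table with these columns can be queried, the sketch below builds a small stand-in by hand. The layer name, parameter names, and values are hypothetical placeholders, not entries from an actual CLIP network, and the use of a cell array for the Value column is an assumption that follows the dlnetwork state convention.

```matlab
% Hand-built stand-in for a state table (hypothetical contents).
Layer     = ["bn1"; "bn1"];
Parameter = ["TrainedMean"; "TrainedVariance"];
Value     = {zeros(1,1,8,"single"); ones(1,1,8,"single")};
state = table(Layer,Parameter,Value);

% Select the row for one layer and parameter, then read its value.
rows = state(state.Layer == "bn1" & state.Parameter == "TrainedMean",:);
trainedMean = rows.Value{1};
```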

    Version History

    Introduced in R2026a