forward

Run forward pass on CLIP network

Since R2026a

Syntax

[imageEmbeddings,textEmbeddings,state] = forward(clip,I,text)

Description

Add-On Required: This feature requires the Computer Vision Toolbox Model for OpenAI CLIP Network add-on.

[imageEmbeddings,textEmbeddings,state] = forward(clip,I,text) computes the image embeddings of the images I and the text embeddings of the input text text from the output layers of the CLIP network clip, and returns the updated network state state.
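The syntax above can be sketched as a small helper. This is a minimal sketch, not part of the add-on: the function name runClipForward is hypothetical, the 224-by-224 image size and the [0,1] value range are assumptions that this page does not state, and clip must be a clipNetwork object that you have already created.

```matlab
function [imageEmbeddings,textEmbeddings,state] = runClipForward(clip)
% runClipForward is a hypothetical helper for illustration only.
% clip must be an existing clipNetwork object (construction not shown).

B = 2;  % batch size

% Assumption: 224-by-224 truecolor input with values in [0,1].
% This page does not state a required image size or value range.
I = rand(224,224,3,B,"single");

% One caption per image, in English and using ASCII characters only.
text = ["a photo of a cat"; "a photo of a dog"];

[imageEmbeddings,textEmbeddings,state] = forward(clip,I,text);

% Both embedding outputs are described as having one column per image.
assert(size(imageEmbeddings,2) == B)
assert(size(textEmbeddings,2) == B)
end
```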

Note

This functionality requires Deep Learning Toolbox™.

Input Arguments

`clip` — CLIP network
`clipNetwork` object

CLIP network, specified as a clipNetwork object.

`I` — Image data
numeric array | datastore | formatted `dlarray` object

Image data, specified in one of these formats:

H-by-W-by-3-by-B numeric array representing a batch of B truecolor images.
H-by-W-by-1-by-B numeric array representing a batch of B grayscale images.
Datastore that reads and returns truecolor images.
Formatted dlarray (Deep Learning Toolbox) object with two spatial dimensions of the format "SSCB". You can specify multiple images by including a batch dimension.
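The array and dlarray formats listed above can be built as follows. This is a minimal sketch that uses random data in place of real images; the 224-by-224 size and the variable names are assumptions made for illustration.

```matlab
% Two synthetic truecolor images stand in for real data.
% Assumption: 224-by-224 is used only as an example size.
img1 = rand(224,224,3,"single");
img2 = rand(224,224,3,"single");

% Stack along the fourth dimension to get an H-by-W-by-3-by-B array.
batch = cat(4,img1,img2);

% Label the dimensions as spatial, spatial, channel, batch.
dlI = dlarray(batch,"SSCB");

size(batch)   % height, width, channels, batch size
dims(dlI)     % dimension labels of the formatted dlarray
```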

`text` — Input text corresponding to each image
B-element string array

Input text corresponding to each image, specified as a B-element string array. B is the number of images in the batch. You must specify the text in English using ASCII characters. The function automatically pads or truncates each text input so that it contains exactly 77 tokens.
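Before calling forward, you might check the two stated constraints on text: one element per image and ASCII characters only. This is a minimal sketch using base MATLAB functions; the variable names and the deliberately invalid second caption are illustrative assumptions.

```matlab
B = 2;  % batch size, equal to size(I,4) for a numeric image array

% The second caption contains a non-ASCII character (code 233) on purpose.
text = ["a photo of a cat"; "caf" + string(char(233))];

% True for each element that contains only ASCII characters (codes 0 to 127).
isAscii = arrayfun(@(s) all(double(char(s)) <= 127), text);

% True when there is exactly one caption per image.
hasOneCaptionPerImage = numel(text) == B;
```

Here isAscii is false for the second element, so that caption would need to be rewritten before it is passed to forward.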

Output Arguments

`imageEmbeddings` — Image feature embeddings
`dlarray` object

Image feature embeddings extracted from the CLIP model encoder, returned as a dlarray object. The dimensions of the dlarray are 512-by-B or 768-by-B, depending on the value of the backbone argument of the clipNetwork object.

Image Encoder Backbone `backbone` Value	Image Embeddings `dlarray` Size
`"vit-b-16"`	512-by-B
`"vit-l-14"` or `"resnet50"`	768-by-B

`textEmbeddings` — Text embeddings
`dlarray` object

Text embeddings extracted from the CLIP model encoder, returned as a dlarray object. The dimensions of the dlarray are 512-by-B or 768-by-B, depending on the value of the backbone argument of the clipNetwork object.

Image Encoder Backbone `backbone` Value	Text Embeddings `dlarray` Size
`"vit-b-16"`	512-by-B
`"vit-l-14"` or `"resnet50"`	768-by-B
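A common use of the two embedding outputs is to compare every image with every caption by cosine similarity. The sketch below is an illustration under stated assumptions: random 512-by-B numeric arrays stand in for the actual outputs, and for real dlarray outputs you would first convert them with extractdata.

```matlab
B = 3;  % batch size

% Stand-ins for the outputs of forward (assumption: 512-by-B, as for "vit-b-16").
% For real outputs, use extractdata(imageEmbeddings) and extractdata(textEmbeddings).
imageEmbeddings = randn(512,B,"single");
textEmbeddings  = randn(512,B,"single");

% Scale each column to unit Euclidean length.
imgN = imageEmbeddings ./ vecnorm(imageEmbeddings,2,1);
txtN = textEmbeddings  ./ vecnorm(textEmbeddings,2,1);

% B-by-B matrix: entry (i,j) is the cosine similarity of image i and caption j.
similarity = imgN' * txtN;

% Index of the best-matching caption for each image.
[~,bestCaption] = max(similarity,[],2);
```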

`state` — Updated network state
table

Updated network state, returned as a table with these columns:

Layer – Layer name, returned as a string scalar.
Parameter – Parameter name, returned as a string scalar.
Value – Parameter value, returned as a numeric array.

The network state contains information remembered by the network between iterations. Use the state to update the network parameters for training.
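To make the table layout concrete, the sketch below builds a stand-in table with the same three columns and reads values from it. The layer name, parameter names, and values are hypothetical; the actual contents of state depend on the network.

```matlab
% Hypothetical entries that mimic the Layer, Parameter, and Value columns.
Layer     = ["bn1"; "bn1"];
Parameter = ["TrainedMean"; "TrainedVariance"];
Value     = {zeros(1,1,64,"single"); ones(1,1,64,"single")};
state = table(Layer,Parameter,Value);

% Select all state entries that belong to one layer.
bn1Rows = state(state.Layer == "bn1",:);

% Extract the numeric array stored for one parameter.
meanValue = state.Value{state.Parameter == "TrainedMean"};
```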

Version History

Introduced in R2026a
