Vision-Language Models

Perform image classification, retrieval, captioning, and object detection tasks using vision-language models

Vision-language models (VLMs) are multimodal models that take image and text inputs and either generate text outputs or return bounding boxes with corresponding annotations, which enables tasks such as object detection and visual grounding. These models analyze visual content in images or videos, process the accompanying text, and identify correlations between the visual and textual data. They interpret visual information in the context of language using predictive algorithms rather than true comprehension. Computer Vision Toolbox™ provides several pretrained VLMs, including CLIP, Grounding DINO, and Moondream, for these applications:

Image captioning — Generate descriptive text for an image.
Image retrieval — Locate images from a predefined set that best match a text description.
Object detection — Detect objects in an image based on a text-based query.
Image classification — Classify images based on textual categories.

Additionally, you can use VLMs to automatically label ground truth using descriptive text prompts in the Image Labeler and Video Labeler apps. To get started, see Get Started with Vision-Language Models.

Vision-language models enable you to rapidly detect objects in images using natural language text and image input, and perform other vision-language tasks such as image captioning, classification, and retrieval.
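
As a rough illustration of how embedding-based classification and retrieval work in CLIP-style models, the following minimal sketch compares one image embedding against several text embeddings using cosine similarity and converts the similarities to scores with a softmax. The embedding values, the class names, and the `temperature` value are hypothetical placeholders written by hand for illustration. They are not outputs of `extractImageEmbeddings` or `extractTextEmbeddings`, and the sketch does not necessarily reflect how `classify` is implemented.

```matlab
% Hypothetical embeddings standing in for encoder outputs (one row per item).
imageEmbedding = [0.9 0.1 0.2];        % 1-by-D embedding of the query image
textEmbeddings = [0.8 0.2 0.1;         % embedding for the prompt "cat"
                  0.1 0.9 0.3;         % embedding for the prompt "dog"
                  0.2 0.1 0.9];        % embedding for the prompt "car"
classNames = ["cat" "dog" "car"];

% L2-normalize each row so that a dot product equals cosine similarity.
imageEmbedding = imageEmbedding ./ vecnorm(imageEmbedding,2,2);
textEmbeddings = textEmbeddings ./ vecnorm(textEmbeddings,2,2);

% Cosine similarity between the image and each text prompt (1-by-K).
similarity = imageEmbedding * textEmbeddings';

% Softmax over scaled similarities. The temperature is an assumed value.
temperature = 0.01;
logits = similarity ./ temperature;
logits = logits - max(logits);         % subtract the maximum for numerical stability
scores = exp(logits) ./ sum(exp(logits));

% The class with the highest score is the predicted label.
[~,idx] = max(scores);
predictedLabel = classNames(idx);
disp(predictedLabel)
```

The same similarity computation supports retrieval: given one text embedding and a matrix of image embeddings, sorting the similarities in descending order ranks the images by how well they match the text.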

Apps

Image Labeler	Label images for computer vision applications
Video Labeler	Label video for computer vision applications

Functions

Classify and Retrieve Images

`clipNetwork`	Create pretrained CLIP deep learning neural network for vision-language tasks (Since R2026a)
`classify`	Classify image using CLIP network (Since R2026a)
`extractImageEmbeddings`	Extract feature embeddings from image using CLIP network image encoder (Since R2026a)
`extractTextEmbeddings`	Extract text embeddings from search text using CLIP network text encoder (Since R2026a)

Caption Images

`moondream`	Create pretrained Moondream vision-language model (VLM) (Since R2026a)
`captionImage`	Caption images using Moondream vision-language model (VLM) (Since R2026a)

Text-Guided Object Detection

`groundingDinoObjectDetector`	Detect and localize objects using Grounding DINO object detector (Since R2026a)
`detect`	Detect objects using Grounding DINO object detector (Since R2026a)

Topics

Get Started

Get Started with Vision-Language Models
Use vision-language models for multimodal tasks such as image captioning, zero-shot classification, and image search.

Featured Examples

Automatically Search and Label Video Frames Using VLMs

Automatically search and detect objects based on natural language text queries using vision-language models (VLMs).

Since R2026a

Automatically Label Ground Truth Using Vision-Language Model

Automatically label ground truth images for object detection using the Grounding DINO vision-language model (VLM).

Since R2026a

Automate Ground Truth Polygon Labeling Using Grounded SAM Model

Combine Grounding DINO and the Segment Anything Model 2 (SAM 2) to automatically produce polygon labels using the Video Labeler app.

Since R2026a

Detect Industrial Defects Using Zero-Shot AnomalyCLIP

Detect and localize industrial production defects in pill images using an AnomalyCLIP anomaly detection network.

Since R2026a