
Vision-Language Models

Perform image classification, retrieval, captioning, and object detection tasks using vision-language models

Vision-Language Models (VLMs) are multimodal models that take image and text inputs and can generate text outputs or return bounding boxes with corresponding annotations, enabling tasks such as object detection and visual grounding. These models analyze visual content in images or videos, process accompanying text, and identify correlations between the visual and textual data. They enable a range of tasks that interpret visual information in the context of language, using learned predictive associations rather than true comprehension. The Computer Vision Toolbox™ provides several pretrained VLMs, including CLIP, Grounding DINO, and Moondream, for these applications:

  • Image captioning — Generate descriptive text for an image.

  • Image retrieval — Locate images from a predefined set that best match a text description.

  • Object detection — Detect objects in an image based on a text-based query.

  • Image classification — Classify images based on textual categories.
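For example, the image classification workflow can be sketched with the CLIP functions in this toolbox. This is a minimal, illustrative sketch: the exact argument names, defaults, and return values are assumptions, so check the clipNetwork and classify reference pages for the real signatures.

```matlab
% Zero-shot image classification with a pretrained CLIP network (sketch).
% Signatures below are assumptions; see the reference pages for details.
net = clipNetwork;                           % pretrained CLIP model
I = imread("peppers.png");                   % example image shipped with MATLAB
classNames = ["bell pepper" "onion" "mushroom"];  % free-form text categories
label = classify(net,I,classNames);          % best-matching category for the image
disp(label)
```

Because the categories are plain text, you can change the label set at the command line without retraining the network.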

Additionally, you can use VLMs to automatically label ground truth using descriptive text prompts in the Image Labeler and Video Labeler apps. To get started, see Get Started with Vision-Language Models.
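A caption generated by a VLM can also serve as a starting point for descriptive labels. As a hedged sketch of the captioning workflow (argument names and outputs are assumptions; see the moondream and captionImage reference pages):

```matlab
% Image captioning with the pretrained Moondream VLM (sketch).
% Signatures below are assumptions; see the reference pages for details.
mdl = moondream;                    % pretrained Moondream model
I = imread("street1.jpg");          % hypothetical image file
caption = captionImage(mdl,I);      % descriptive text for the image
disp(caption)
```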

Vision-language models enable you to rapidly detect objects in images using natural language text and image input, and perform other vision-language tasks such as image captioning, classification, and retrieval.
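The retrieval task mentioned above compares CLIP embeddings of a text query against embeddings of a set of images. A minimal sketch, assuming the embedding functions return one unit-normalized row vector per input (verify against the extractImageEmbeddings and extractTextEmbeddings reference pages):

```matlab
% Text-to-image retrieval with CLIP embeddings (sketch).
% Signatures and output shapes are assumptions; see the reference pages.
net = clipNetwork;
imds = imageDatastore("myImageFolder");          % hypothetical image folder
imageEmb = extractImageEmbeddings(net,imds);     % one row per image
textEmb = extractTextEmbeddings(net,"a dog playing in snow");
similarity = imageEmb * textEmb';                % cosine similarity if rows are unit norm
[~,idx] = max(similarity);
bestMatch = imds.Files{idx};                     % best-matching image file
```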

Apps

Image Labeler — Label images for computer vision applications
Video Labeler — Label video for computer vision applications

Functions


clipNetwork — Create pretrained CLIP deep learning neural network for vision-language tasks (Since R2026a)
classify — Classify image using CLIP network (Since R2026a)
extractImageEmbeddings — Extract feature embeddings from image using CLIP network image encoder (Since R2026a)
extractTextEmbeddings — Extract text embeddings from search text using CLIP network text encoder (Since R2026a)
moondream — Create pretrained Moondream vision-language model (VLM) (Since R2026a)
captionImage — Caption images using Moondream vision-language model (VLM) (Since R2026a)
groundingDinoObjectDetector — Detect and localize objects using Grounding DINO object detector (Since R2026a)
detect — Detect objects using Grounding DINO object detector (Since R2026a)
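The Grounding DINO functions listed above support text-prompted object detection. A minimal sketch, with the caveat that the exact signatures and outputs are assumptions (consult the groundingDinoObjectDetector and detect reference pages):

```matlab
% Open-vocabulary object detection with Grounding DINO (sketch).
% Signatures below are assumptions; see the reference pages for details.
detector = groundingDinoObjectDetector;      % pretrained detector
I = imread("visionteam1.jpg");               % hypothetical image file
[bboxes,scores,labels] = detect(detector,I,"a person wearing glasses");
annotated = insertObjectAnnotation(I,"rectangle",bboxes,labels);
imshow(annotated)
```

Because the query is natural-language text, the same detector can locate object categories it was never explicitly trained to name.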

Topics

Get Started

Featured Examples