
Get Started with Vision-Language Models

Vision-language models (VLMs) are transformer-based deep learning models, trained on large data sets of images and associated text, that connect visual understanding with natural language.

By associating visual features with language, VLMs enable tasks such as:

  • Object detection and labeling — Identify and rapidly label objects in images using text prompts.

  • Image captioning — Generate descriptive sentences for images.

  • Zero-shot image classification — Categorize images into classes not seen during training, using class names as text prompts.

  • Image search — Find images that match a text query.

Computer Vision Toolbox™ provides several pretrained VLMs for image captioning, object detection with natural language queries, image classification, and image-text retrieval.

Select Vision-Language Model

Use these model summaries to choose a VLM based on your application. Each summary describes the model and its use case, and points to sample visualizations and examples.

Grounding-DINO – Detect objects in images using natural language text queries.

Use the groundingDinoObjectDetector object and the detect object function to rapidly detect objects in images using text queries. In particular, you can:

  • Detect objects with specific qualities, such as "red car," or "person with hat," using attribute-based queries.

  • Localize objects that contain text, such as a traffic sign that says "stop," or a specific storefront billboard.

  • Detect abstract or uncommon object types, such as objects in a painting or digitized image.

Sample Grounding-DINO detection results, showing people detected based on attributes specified in descriptive text queries.

For an example, see Perform Zero-Shot Object Detection Using Grounding DINO.
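The following is a minimal sketch of this workflow. The image file name and text queries are illustrative placeholders, and the detect call follows the common MATLAB detector pattern of (detector, image, queries); verify the exact syntax on the groundingDinoObjectDetector reference page.

    % Load the pretrained Grounding DINO detector.
    detector = groundingDinoObjectDetector;
    % Read an image. "visionteam.jpg" is an illustrative placeholder.
    I = imread("visionteam.jpg");
    % Detect objects that match attribute-based text queries. The argument
    % order shown here is an assumption; check your release.
    [bboxes,scores,labels] = detect(detector,I,["person with hat","red car"]);
    % Overlay the detections on the image.
    annotated = insertObjectAnnotation(I,"rectangle",bboxes,string(labels));
    figure
    imshow(annotated)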

Grounding-DINO – Automatically label objects in images and video frames using natural language text queries.

Automatically label specific objects in image and video scenes by specifying descriptive text queries using the Grounding DINO tool in the Image Labeler and Video Labeler apps.

Sample Grounding-DINO labeling results, showing specific objects labeled based on attributes specified in descriptive text queries.

For an example, see Automatically Label Ground Truth Using Vision-Language Model.

Moondream™ – Caption images.

Create the Moondream vision-language model using the moondream object. Use the captionImage object function to caption images.

Schematic of caption generation using the Moondream model.

For more information, see the moondream object reference page.
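The following is a minimal sketch of single-image captioning. The image file name is an illustrative placeholder.

    % Load the pretrained Moondream model.
    model = moondream;
    % Read an image. "street.jpg" is an illustrative placeholder.
    I = imread("street.jpg");
    % Generate a descriptive caption for the image.
    caption = captionImage(model,I);
    disp(caption)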

Contrastive Language–Image Pre-training (CLIP) – Classify and retrieve images using text.

Use the clipNetwork object and its object functions to:

  • Classify images based on textual category names, without retraining the model.

  • Retrieve images based on text queries.

  • Measure image similarity.

Schematic of text-based image retrieval using the CLIP model.

For more information, see the clipNetwork object reference page.
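As a sketch of zero-shot classification, the steps below encode an image and candidate class names into the shared embedding space and pick the class with the highest cosine similarity. The embedding function names encodeImage and encodeText are assumptions; verify the object function names on the clipNetwork reference page.

    % Load the pretrained CLIP network.
    net = clipNetwork;
    % Read an image that ships with MATLAB.
    I = imread("peppers.png");
    % Candidate class names, used directly as text prompts.
    classNames = ["bell pepper","broccoli","banana"];
    % Encode the image and text into the shared embedding space.
    % encodeImage and encodeText are assumed function names.
    imageEmb = encodeImage(net,I);
    textEmb = encodeText(net,classNames);
    % Cosine similarity between the image and each class embedding.
    sims = (imageEmb*textEmb')./(vecnorm(imageEmb,2,2)*vecnorm(textEmb,2,2)');
    [~,idx] = max(sims);
    disp("Predicted class: " + classNames(idx))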

Integrate VLMs into Visual Data Workflows

Use outputs from VLMs to enhance and automate stages of your visual understanding pipeline. The following are examples of applications where you can use VLM outputs as a starting point.

  • Use the Grounding-DINO object detector to accelerate automatic data set labeling by generating bounding box annotations for custom object classes. You can define the custom object classes using input text descriptions.

  • Use the Moondream model to automatically generate captions for images in a datastore, creating paired image-caption data sets. You can then use these data sets as the paired image-text data required to train or evaluate vision-language models, such as CLIP or other multimodal networks. A sketch of this workflow appears after this list.

  • Use CLIP to generate shared image-text embeddings, which you can then use for advanced tasks such as image-text matching, clustering, search, or integration into web applications.

  • Use CLIP to measure image similarity by extracting embeddings for both images and text queries and computing the cosine similarity between any pair of embedding vectors. You can use this numerical measure of similarity for tasks such as image retrieval, clustering, or duplicate detection. A sketch of the pairwise computation appears after this list.
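The following is a sketch of the datastore captioning workflow referenced above. The folder name is an illustrative placeholder, and the loop assumes captionImage returns text for one image at a time.

    % Create a datastore over a folder of images ("imageFolder" is a placeholder).
    imds = imageDatastore("imageFolder");
    model = moondream;
    numImages = numel(imds.Files);
    captions = strings(numImages,1);
    for k = 1:numImages
        I = readimage(imds,k);
        captions(k) = captionImage(model,I);
    end
    % Pair each file with its generated caption.
    pairs = table(imds.Files,captions,VariableNames=["ImageFile","Caption"]);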
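The pairwise cosine similarity computation itself requires only a few lines. This sketch assumes emb is an N-by-D matrix in which each row is an image embedding (for example, from CLIP).

    % Placeholder embeddings: 5 images, 512-dimensional features.
    emb = randn(5,512);
    % L2-normalize each row so that dot products equal cosine similarities.
    embN = emb./vecnorm(emb,2,2);
    % Pairwise cosine similarity matrix; high off-diagonal values
    % indicate near-duplicate images.
    simMatrix = embN*embN';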
