Classify Images and Videos

Classify images and videos and perform activity recognition using AI models

Computer Vision Toolbox™ provides end-to-end workflows for classifying images and videos using deep learning and traditional computer vision techniques. For image category classification, you can use pretrained deep learning models, such as vision transformer (ViT) and CLIP models, or apply the bag-of-visual-words approach to categorize images based on their visual content. These workflows support applications such as scene recognition, content filtering, and automated tagging. Start by labeling scene-level categories using the Image Labeler and Video Labeler apps, and then train or fine-tune models using your labeled data.
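As a rough illustration of the bag-of-visual-words route mentioned above, the following sketch builds a visual vocabulary, trains a category classifier, and evaluates it on held-out images. It is a minimal sketch and not a complete workflow. It assumes that the sample image sets shipped with the toolbox are available under `toolboxdir('vision')`, and that Statistics and Machine Learning Toolbox is installed for the classifier training step. The 30/70 split is an arbitrary choice for illustration.

```matlab
% Minimal sketch of a bag-of-visual-words classification workflow.
% Assumption: the sample image sets that ship with Computer Vision Toolbox
% stand in for your own labeled images (one subfolder per category).
setDir = fullfile(toolboxdir('vision'), 'visiondata', 'imageSets');
imds = imageDatastore(setDir, 'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Hold out part of each category for evaluation (split ratio is arbitrary).
[trainingSet, testSet] = splitEachLabel(imds, 0.3, 'randomize');

% Build the visual vocabulary from features extracted from the training images.
bag = bagOfFeatures(trainingSet);

% Train a category classifier on the visual-word histograms.
categoryClassifier = trainImageCategoryClassifier(trainingSet, bag);

% Evaluate on the held-out images. The returned confusion matrix is
% normalized, so the mean of its diagonal is the average per-category accuracy.
confMatrix = evaluate(categoryClassifier, testSet);
meanAccuracy = mean(diag(confMatrix));

% Classify a single image and look up its category name.
img = readimage(testSet, 1);
[labelIdx, scores] = predict(categoryClassifier, img);
predictedLabel = categoryClassifier.Labels(labelIdx);
```

To use your own data, point `setDir` at a folder that contains one subfolder of images per category.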

For video classification and activity recognition, you can use deep learning models to classify sequences of frames into action categories such as walking, swimming, or sitting. These capabilities support applications such as human-computer interaction and surveillance. The toolbox supports training, evaluating, and deploying models that interpret temporal patterns in video data to recognize complex activities and gestures.

Featured Examples

Train Vision Transformer Network for Image Classification

Fine-tune a pretrained vision transformer (ViT) neural network to perform classification on a new collection of images.

Image Category Classification Using Bag of Features

Use a bag-of-features approach for image category classification. This technique is also often referred to as bag of words. Visual image categorization is the process of assigning a category label to an image under test. Categories can contain images of almost anything, such as dogs, cats, trains, or boats.

Image Category Classification Using Deep Learning

Use a pretrained convolutional neural network (CNN) as a feature extractor for training an image category classifier.

Activity Recognition from Video and Optical Flow Data Using Deep Learning

Train an inflated-3D (I3D) two-stream convolutional neural network for activity recognition using RGB and optical flow data from videos.

Human Activity Recognition Using R(2+1)D Video Classification

Train an R(2+1)D video classifier for activity recognition.

Gesture Recognition using Videos and Deep Learning

Train a SlowFast convolutional neural network for gesture recognition using RGB data from videos.
