Automate Ground Truth Labeling for Object Tracking and Re-Identification
This example shows how to create an automation algorithm to automatically label data for object tracking and for object re-identification.
Overview
The Image Labeler, Video Labeler, and Ground Truth Labeler (Automated Driving Toolbox) apps provide a convenient interface to interactively label data for various computer vision tasks. These apps include built-in automation algorithms to accelerate the labeling and also let you specify your own custom labeling algorithm. For an introduction to creating your own automation algorithm, see the Create Automation Algorithm example.
This example extends the Automate Ground Truth Labeling for Object Detection example by incorporating a multi-object tracker from the Sensor Fusion and Tracking Toolbox. The multi-object tracker automatically assigns identifiers to objects, which you can use to evaluate and verify multi-object tracking systems, as well as to train and evaluate object re-identification (ReID) networks.
To learn how to perform object tracking and re-identification of objects across multiple frames, see the Reidentify People Throughout a Video Sequence Using ReID Network example.
Define Automation Algorithm
Implement a track-by-detection algorithm using a pretrained object detector combined with a Global Nearest Neighbor multi-object tracker. Use a pretrained yolov4ObjectDetector detector to detect pedestrians in a video. You can use other object detectors, such as yoloxObjectDetector or ssdObjectDetector, depending on the type of objects that you need to track. To learn more about multi-object trackers, see the Implement Simple Online and Realtime Tracking (Sensor Fusion and Tracking Toolbox) example.
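As a quick sanity check before building the automation class, you can run the pretrained detector on a single frame of the video. This is a minimal sketch, assuming you have already downloaded PedestrianLabelingVideo.avi as shown later in this example:

vr = VideoReader("PedestrianLabelingVideo.avi");  % video downloaded later in this example
frame = readFrame(vr);                            % read the first frame
detector = yolov4ObjectDetector("csp-darknet53-coco");
[bboxes,scores,labels] = detect(detector,frame);  % raw detections, all COCO classes
annotated = insertObjectAnnotation(frame,"rectangle",bboxes,string(labels));
figure, imshow(annotated)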
Define the automation algorithm using the initialize and run functions in the helperObjectDetectorTracker class. This class inherits from vision.labeler.AutomationAlgorithm and vision.labeler.mixin.Temporal. Define the inherited properties Name, Description, and UserDirections, as well as the following algorithm properties:
classdef helperObjectDetectorTracker < vision.labeler.AutomationAlgorithm & vision.labeler.mixin.Temporal
    properties
        Detector                                      % Object detector
        Tracker                                       % Multi-object tracker
        OverlapThreshold = 0.5                        % Threshold for non-maximum suppression
        ScoreThreshold = 0.5                          % Score threshold before tracking
        FrameRate = 1                                 % Frame rate of the video
        ProcessNoiseIntensity = [1e-4 1e-4 1e-5 1e-5] % Process noise of the tracker velocity model
    end
end
In the initialize function, create the YOLO v4 object detector using the yolov4ObjectDetector object. Create the multi-object tracker using the trackerGNN (Sensor Fusion and Tracking Toolbox) function with the following options:
- FilterInitializationFcn: Use the initvisionbboxkf (Sensor Fusion and Tracking Toolbox) function to initialize a linear Kalman filter with a constant-velocity bounding box state definition. Specify the video frame rate and the frame width and height using name-value arguments. Call the internal function initFcn of the helperObjectDetectorTracker class to use initvisionbboxkf and set the state estimation error covariance.
- ConfirmationThreshold: Use [2 2] to confirm the existence of a true object if it is detected and assigned in 2 consecutive frames.
- DeletionThreshold: Use [5 5] to delete object tracks after 5 consecutive missed frames.
function initialize(this, frame, ~)
    [height,width,~] = size(frame);

    % Initialize YOLO v4 object detector
    this.Detector = yolov4ObjectDetector("csp-darknet53-coco");

    % Initialize tracker
    noiseIntensity = this.ProcessNoiseIntensity;
    this.Tracker = trackerGNN(FilterInitializationFcn=@(x) initFcn(x,this.FrameRate,width,height,noiseIntensity), ...
        ConfirmationThreshold=[2 2], ...
        DeletionThreshold=[5 5], ...
        AssignmentThreshold=35);
end

function filter = initFcn(detection, framerate, width, height, noiseIntensity)
    filter = initvisionbboxkf(detection, FrameRate=framerate, FrameSize=[width height], NoiseIntensity=noiseIntensity);
    filter.StateCovariance = diag([25 100 25 100 25 100 25 100]);
end
In the run function, implement the following procedure:
- Detect people using the YOLO v4 object detector.
- Apply non-maximum suppression (NMS) to reduce the number of candidate regions of interest (ROIs) and select only the strongest bounding boxes.
- Filter bounding boxes based on their score by specifying the ScoreThreshold property to eliminate weaker detections.
- Update the tracker with the filtered bounding boxes of the current frame.
- Create new automated labels for the frame based on the updated tracks.
function automatedLabels = run(this, frame)
    % Detect people using YOLO v4 object detector
    [bboxes,scores,labels] = detect(this.Detector, frame, ...
        SelectStrongest=false, ...
        MaxSize=round([size(frame,1)/2, size(frame,2)/5]));

    % Apply non-maximum suppression to select the strongest bounding boxes
    [selectedBboxes,selectedScores,selectedLabels] = selectStrongestBboxMulticlass(bboxes,scores,labels, ...
        RatioType="Min", ...
        OverlapThreshold=this.OverlapThreshold);

    isSelectedClass = selectedLabels == lower(this.SelectedLabelDefinitions.Name);

    % Consider only detections that meet the specified score threshold
    % and are of the selected class label
    selectedBboxes = selectedBboxes(isSelectedClass & selectedScores > this.ScoreThreshold, :);
    selectedScores = selectedScores(isSelectedClass & selectedScores > this.ScoreThreshold);

    tracks = objectTrack.empty;
    alltracks = objectTrack.empty; % ensure alltracks is defined when the tracker does not step
    if isLocked(this.Tracker) || ~isempty(selectedBboxes)
        % Convert bounding boxes to objectDetection objects
        detections = repmat(objectDetection(this.CurrentTime,[0 0 0 0]),1,size(selectedBboxes,1));
        for i = 1:numel(detections)
            detections(i).Measurement = selectedBboxes(i,:);
            detections(i).MeasurementNoise = (1/selectedScores(i))*25*eye(4);
        end
        [tracks,~,alltracks] = this.Tracker(detections, this.CurrentTime);
    end

    if this.CurrentTime == this.StartTime && ~isempty(alltracks)
        % On the first frame, use tentative tracks
        states = [alltracks.State];
        automatedLabels = struct( ...
            'Type',       labelType.Rectangle, ...
            'Name',       this.SelectedLabelDefinitions.Name, ...
            'Position',   wrapPositionToFrame(states, frame), ...
            'Attributes', struct('ID', num2cell([alltracks.TrackID])));
    elseif ~isempty(tracks)
        states = [tracks.State];
        automatedLabels = struct( ...
            'Type',       labelType.Rectangle, ...
            'Name',       this.SelectedLabelDefinitions.Name, ...
            'Position',   wrapPositionToFrame(states, frame), ...
            'Attributes', struct('ID', num2cell([tracks.TrackID])));
    else
        automatedLabels = [];
    end
end
Specify the MeasurementNoise property of each object detection to capture the uncertainty of each measurement. The tracker models each bounding box using Gaussian probability densities. While you could derive a more accurate measurement noise from the statistics of the object detector, a variance of 25 square pixels is a good default. In addition, use the score of each bounding box to scale the noise variance up or down. A high-score detection is more precise and should have a smaller noise value than a low-score detection.
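For instance, with the scaling used in the run function, a detection with a score of 0.9 gets a per-component variance of about 27.8 square pixels, while a score of 0.5 doubles the default variance to 50. This worked illustration uses assumed score values:

R_high = (1/0.9)*25*eye(4)   % ~27.8 px^2 variance, trusted more by the tracker
R_low  = (1/0.5)*25*eye(4)   % 50 px^2 variance, trusted less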
Open Video in Video Labeler
Download the video and open it using the Video Labeler app.
helperDownloadLabelVideo();
Downloading Pedestrian Tracking Video (90 MB)
videoLabeler("PedestrianLabelingVideo.avi");
Perform these steps to create a rectangular ROI label named Person with a numeric ID attribute.
1. Click Add Label in the Label Definition section of the app toolstrip.
2. Select the Rectangle ROI type.
3. Under Label Name, type Person. Choose a preferred color and click OK.
4. Select the Person ROI in the left ROI Labels panel and click Attribute in the Label Definition section of the app toolstrip.
5. Select Numeric from the drop-down list of attributes.
6. Under Attribute Name, type ID, and click OK.
7. Expand the Person ROI in the left ROI Labels panel to display the fields shown below.
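If you prefer to script this step, you can build an equivalent label definition programmatically with labelDefinitionCreator and import it into the app. This is a minimal sketch; the default ID value of 0 is an assumption:

ldc = labelDefinitionCreator;
addLabel(ldc,"Person",labelType.Rectangle);               % rectangular ROI label
addAttribute(ldc,"Person","ID",attributeType.Numeric,0);  % numeric ID attribute, default 0 (assumed)
labelDefs = create(ldc);                                  % table you can import into the labeler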
Import Automation Algorithm
Next, open the Select Algorithm drop-down menu in the Automate Labeling section of the app toolstrip. The app detects the helperObjectDetectorTracker file located in the current example directory under +vision/+labeler. Click Refresh and select the ObjectDetectorTracker option. This image shows the Automate Labeling section of the toolstrip with the name of the custom algorithm.
Run Automation Algorithm
Click the Automate button in the Automate Labeling section of the toolstrip. Once in automation mode, click Run to run the ObjectDetectorTracker automation algorithm. Visualize the automated labeling run frame by frame. Once the algorithm has processed the entire video, verify the generated bounding box labels, as well as their ID attributes. Use this labeling workflow for object tracking or object re-identification to obtain unique and consistent identities for each person throughout the video.
The first frame does not contain any confirmed tracks because the tracking algorithm configured in this example requires two frames to confirm a track. When a detection is not assigned to an existing track, the algorithm uses the detection to initialize a new tentative track. The tentative track becomes confirmed only if a detection can be assigned to it in the next frame. Because the run function uses the tentative tracks on the first frame, the first frame does not require manual labeling. In subsequent frames, however, a person entering the field of view of the camera is not labeled immediately, because the tracking algorithm still requires two frames to confirm a new track. The sketch below illustrates this confirmation behavior.
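You can reproduce this behavior with a standalone tracker. This minimal sketch uses assumed 3-D position measurements and the tracker's default filter, independent of the automation class:

tracker = trackerGNN(ConfirmationThreshold=[2 2], DeletionThreshold=[5 5]);
det1 = objectDetection(0,[10;20;0]);       % frame 1: new detection
[confirmed,tentative] = tracker(det1,0);   % track is only tentative
disp([numel(confirmed) numel(tentative)])  % 0 1
det2 = objectDetection(1,[11;21;0]);       % frame 2: detection assigned again
confirmed = tracker(det2,1);               % track becomes confirmed
disp(numel(confirmed))                     % 1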
Verify and Refine Automation Results
Once the ObjectDetectorTracker automation algorithm finishes running, review the quality of the ROI labels and confirm that each person has a single unique ID. First, return to the beginning of the video using the navigation pane below the video display. Zoom in on the group of standing individuals. Verify that each label instance has a unique ID. For example, the leftmost person has an ID of 4.
Verify the automation algorithm results for correctness. The algorithm can miss objects or assign incorrect IDs due to one or more of the following:
- Objects are occluded or outside the image frame.
- Objects have too low a resolution and the detector failed to identify them.
- Objects are spaced too closely together.
- Objects exhibit rapid changes in direction.
The detection and tracking algorithms generate bounding boxes with unique IDs across the video. However, bounding boxes can sometimes be missing entirely (false negatives). In rarer cases, bounding boxes with no visible person are maintained (false positives). Therefore, the labeling requires some manual refinement: add bounding boxes where false negatives exist, delete bounding boxes where false positives exist, and repair the unique IDs for identity switches and track fragmentations.
For example, the frame below, at the 13-second mark, shows instances of:
- Two false negatives: the person in the dark brown pants and the person in the white pants.
- Two identity switches: the selected ROI has an ID of 1 (previously assigned to the person in the dark brown pants), and one of the boxes around the leftmost individual has an ID of 6 (previously assigned to the second rightmost person in the frame).
- Two false positives: the two boxes that do not correctly contain people in the frame.
Use the interactive capabilities of the Video Labeler app to add a new bounding box for each of the two false negatives. When adding a missing ROI, consider its ID value. Because in most cases the person is tracked in a previous or future frame, use the existing ID for this person. In the frame above, assign an ID of 6 to the rightmost individual, because that is their ID in the first frame.
To repair multiple identity switches, you can choose from several methods. To minimize the number of steps, consider the entire video. Some individuals, such as the two leftmost people in the frame above, have an ROI that remains mostly consistent throughout the video. In cases like this, keep the original ID value and repair only a few frames for these individuals. However, as shown earlier, an ID switch occurred with the selected ROI in the image above from frame 1 to frame 2. Despite the initial difference on the first frame, keep the person in the blue sweater as ID 5 for efficiency, because this ID remains on the same person for the majority of the remaining frames in the video.
To help accelerate the repair process for pedestrians who remain stationary from frame to frame, or whose size is fairly constant across frames, copy and paste the ROI across frames. This approach is particularly useful for frames where the YOLO v4 object detector did not properly detect the person. In the image frames below, some pedestrians are occluded by others. In some frames where a person should be labeled, they are not detected and, therefore, not labeled.
To address this issue, copy and paste the missing person's ROI into the frame from the last frame where the person was correctly detected with a bounding box. These partial occlusion ground truth images are valuable for training a robust ReID network. To learn more about addressing occlusions during training, see the Generate Training Data with Synthetic Object Occlusions section of the Reidentify People Throughout a Video Sequence Using ReID Network example.
Export Ground Truth
Once the labeling is complete and each person is tracked across the entire video sequence with a unique identifier, export the ground truth to the MATLAB workspace. This section also shows how to map all IDs to a contiguous sequence of integers after the export.
Click Accept in the Close section of the automation toolstrip once all refinements are done. Then export the ground truth by clicking Export > To Workspace in the app toolstrip.
The ground truth MAT-file contains the results of the tracking automation algorithm and the additional post-processing. You may skip the manual correction steps described above and directly load the ground truth to continue this example.
load("groundTruth.mat","gTruth");
Note that you can import the groundTruth.mat file back into the labeler by clicking Import > Labels > From Workspace and then selecting the gTruth variable in the popup window.
The IDs assigned during automation are the track IDs generated by the trackerGNN tracker. These track IDs can contain numerical jumps because some tracks are never confirmed. The loaded groundTruth object has IDs from 1 to 7, then jumps to IDs 9 and 13, followed by additional irregular steps. Because a sequential set of IDs is a convenient way to organize ground truth data, use the helperSequentiallyRenumberIDs function to renumber all of the ground truth IDs sequentially.
gTruth = helperSequentiallyRenumberIDs(gTruth);
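To confirm the renumbering, you can list the unique IDs that remain in the ground truth. This is a quick check, assuming Person is the only label with an ID attribute:

allLabels = struct2table(vertcat(gTruth.LabelData.Person{:}));
unique(allLabels.ID)'   % IDs now run from 1 to N with no gaps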
Next Steps
After sequentially ordering the ground truth, you can convert it for object tracking or for training a re-identification network. To learn more about how to convert ground truth for object tracking or training a ReID network, see the Convert Ground Truth Labeling Data for Object Tracking and Convert Ground Truth Labeling Data for Object Re-Identification examples.
Outlook
Labeling ground truth data for object tracking and re-identification is a challenging task. When the resolution of an object is high enough and occlusions are minimal, you can accelerate labeling using automation algorithms such as ObjectDetectorTracker. However, automation often needs additional refinement, especially when poor resolution or object occlusion is present, so you must verify and correct the results manually.
To improve the results of the automation algorithm, you can use more robust multi-object tracking algorithms such as DeepSORT. To learn more about multi-object tracking using DeepSORT, see the Multi-Object Tracking with DeepSORT (Sensor Fusion and Tracking Toolbox) example.
Supporting Functions
helperDownloadLabelVideo
Download the pedestrian labeling video.
function helperDownloadLabelVideo
    videoURL = "https://ssd.mathworks.com/supportfiles/vision/data/PedestrianLabelingVideo.avi";
    if ~exist("PedestrianLabelingVideo.avi","file")
        disp("Downloading Pedestrian Tracking Video (90 MB)")
        websave("PedestrianLabelingVideo.avi",videoURL);
    end
end
helperSequentiallyRenumberIDs
Renumber each ID in the ground truth data to progress in a contiguous and sequential order.
function gTruth = helperSequentiallyRenumberIDs(gTruth)
    % Collect every Person label across all frames and find the set of IDs in use
    allLabels = struct2table(vertcat(gTruth.LabelData.Person{:}));
    oldIDs = unique(allLabels.ID);
    newIDs = cast(1:numel(oldIDs),'like',oldIDs);

    % Map each old ID to its sequential replacement, frame by frame
    data = gTruth.LabelData.Person;
    for i = 1:numel(data)
        for id = 1:numel(oldIDs)
            oldID = oldIDs(id);
            ind = find([data{i}.ID] == oldID);
            if ~isempty(ind)
                if length(ind) > 1
                    error(['ID ' num2str(oldID) ' in video frame ' num2str(i) ' is not a unique ID.']);
                end
                data{i}(ind).ID = newIDs(id);
            end
        end
    end

    % Rebuild the groundTruth object with the renumbered label data
    labelData = gTruth.LabelData;
    labelData.Person = data;
    gTruth = groundTruth(gTruth.DataSource, gTruth.LabelDefinitions, labelData);
end
See Also
Apps
Image Labeler | Video Labeler | Ground Truth Labeler (Automated Driving Toolbox)
Functions
trackerGNN (Sensor Fusion and Tracking Toolbox) | yolov4ObjectDetector
Related Examples
- Reidentify People Throughout a Video Sequence Using ReID Network
- Multi-Object Tracking with DeepSORT (Sensor Fusion and Tracking Toolbox)
- Convert Ground Truth Labeling Data for Object Re-Identification
- Convert Ground Truth Labeling Data for Object Tracking
- Create Automation Algorithm for Labeling