predictPose

Estimate object pose using Pose Mask R-CNN deep learning network

Since R2024a

Description

poses = predictPose(net,I,depthImage,intrinsics) returns the six-degree-of-freedom (6-DoF) poses of objects within a single image or a batch of images I using a trained Pose Mask R-CNN network.

[poses,labels,scores,bboxes] = predictPose(___) also returns the labels assigned to the detected objects, the detection score for each detected object, and the bounding box location of each detected object, using the input arguments from the previous syntax.

[poses,labels,scores,bboxes,masks] = predictPose(___) performs instance segmentation of the objects, and returns the binary object masks, masks.

[___] = predictPose(___,Name=Value) specifies options using additional name-value arguments. For example, Threshold=0.7 specifies the detection threshold as 0.7.
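
For example, a minimal call might look like the following sketch, which assumes you have already loaded a pretrained posemaskrcnn network net and have a color image I, its registered depth map depthImage, and a cameraIntrinsics object intrinsics (the variable names are illustrative):

    % Estimate poses, labels, scores, boxes, and masks for one image.
    [poses,labels,scores,bboxes,masks] = predictPose(net,I,depthImage, ...
        intrinsics,Threshold=0.7);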

Note

This functionality requires Deep Learning Toolbox™ and the Computer Vision Toolbox™ Model for Pose Mask R-CNN 6-DoF Object Pose Estimation. You can install the Computer Vision Toolbox Model for Pose Mask R-CNN 6-DoF Object Pose Estimation from Add-On Explorer. For more information about installing add-ons, see Get and Manage Add-Ons.

Input Arguments

net

Pose Mask R-CNN pose estimation network, specified as a posemaskrcnn object.

I

Image or batch of images to segment, specified as a numeric matrix or numeric array. The size of this argument depends on the number of images specified and whether they are color or grayscale.

Image Type and Number | Data Format
Single grayscale image | 2-D matrix of size H-by-W
Single color image | 3-D array of size H-by-W-by-3
Batch of B grayscale or color images | 4-D array of size H-by-W-by-C-by-B, where the number of color channels C is 1 for grayscale images and 3 for color images

The height H and width W of each image must be greater than or equal to the input height h and width w of the network.

Tip

For best network performance, use input image data of the same size that the network has been trained on.

depthImage

Depth map for estimating the 3-D pose, specified as a numeric matrix or array. The size of this argument depends on the number of depth maps specified.

Number of Depth Maps | Data Format
Single depth map | 2-D numeric matrix of size H-by-W
Batch of B depth maps | 4-D numeric array of size H-by-W-by-1-by-B

H and W must be equal to the corresponding values of I, and the number of depth maps must match the number of images specified.
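
As an illustrative sketch, you can assemble a batch of two same-size color images and their depth maps by concatenating along the fourth dimension (the variable names are assumed):

    % Stack two H-by-W-by-3 images and two H-by-W depth maps into batches.
    Ibatch     = cat(4,I1,I2);   % H-by-W-by-3-by-2
    depthBatch = cat(4,D1,D2);   % H-by-W-by-1-by-2
    poses = predictPose(net,Ibatch,depthBatch,intrinsics);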

intrinsics

Camera intrinsic parameters, specified as a cameraIntrinsics object or a B-by-1 cell array of cameraIntrinsics objects. If you specify this value as a scalar, the function applies the same camera intrinsic parameters to every input image. If you specify this value as a cell array, the number of cameraIntrinsics objects must match the number of input images B.
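
If you do not have calibration results to load, you can construct the cameraIntrinsics object directly. The numbers below are placeholders only; substitute the calibration values for your own camera:

    focalLength    = [800 800];    % [fx fy] in pixels (placeholder values)
    principalPoint = [320 240];    % [cx cy] in pixels (placeholder values)
    imageSize      = [480 640];    % [height width], matching I and depthImage
    intrinsics = cameraIntrinsics(focalLength,principalPoint,imageSize);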

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: predictPose(net,I,depthImage,intrinsics,Threshold=0.7) specifies the detection threshold as 0.7.

Options for All Image Formats

Threshold

Object detection threshold, specified as a numeric scalar in the range [0, 1]. The Pose Mask R-CNN network does not use object detections with scores less than the threshold value for pose estimation. Increase this value to reduce false positives.

Maximum number of strongest region proposals, specified as a positive integer or Inf. Reduce this value to increase processing speed at the cost of detection accuracy. To use all region proposals, specify this value as Inf.

SelectStrongestMulticlassThreshold

Strongest bounding box threshold per class, specified as a numeric scalar in the range [0, 1]. The strongest bounding boxes per class are returned when their confidence scores are higher than this value. To select these boxes, predictPose uses the selectStrongestBboxMulticlass function, which uses nonmaximal suppression to eliminate overlapping bounding boxes with the same class label based on their confidence scores.

SelectStrongestThreshold

Strongest bounding box threshold per object, specified as a numeric scalar in the range [0, 1]. The strongest bounding boxes per object are returned when their confidence scores are higher than this value. To select these boxes, predictPose uses the selectStrongestBbox function, which uses nonmaximal suppression to eliminate overlapping bounding boxes based on their confidence scores across all classes.

Note

predictPose first performs nonmaximal suppression using the selectStrongestBboxMulticlass function and the threshold value specified by SelectStrongestMulticlassThreshold. In the second step, predictPose performs nonmaximal suppression using the selectStrongestBbox function and the threshold value specified by SelectStrongestThreshold. You can manually determine the optimal threshold values for your application by using a validation data subset when training on a custom data set.

MinSize

Minimum size of an object-containing region, in pixels, specified as a two-element numeric vector of the form [height width]. MinSize is the size of the smallest object that the trained detector can detect. Specify larger values for this argument to reduce computation time.

MaxSize

Maximum size of an object-containing region, in pixels, specified as a two-element numeric vector of the form [height width].

To reduce computation time, set this value to the known maximum region size for the objects being detected in the image. By default, MaxSize uses the height and width of the input image I.
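
For example, to consider only objects between 32-by-32 and 256-by-256 pixels (illustrative values), you might call:

    poses = predictPose(net,I,depthImage,intrinsics, ...
        MinSize=[32 32],MaxSize=[256 256]);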

ExecutionEnvironment

Hardware resource for processing images with the network, specified as "auto", "gpu", or "cpu".

ExecutionEnvironment | Description
"auto" | Use a GPU if one is available; otherwise, use the CPU. GPU use requires Parallel Computing Toolbox™ and a CUDA®-enabled NVIDIA® GPU. For information about the supported compute capabilities, see GPU Computing Requirements (Parallel Computing Toolbox).
"gpu" | Use the GPU. If a suitable GPU is not available, the function returns an error message.
"cpu" | Use the CPU.

Options for Batch Inputs

Number of observations returned in each batch, specified as a positive integer.

Output Arguments

poses

Object poses, returned as an M-by-1 vector of rigidtform3d objects or a B-by-1 cell array. The value of this output depends on the number of input images B.

Image Type | poses Value
Single image | M-by-1 vector of rigidtform3d objects, where M is the number of objects detected in the image
Batch of B images | B-by-1 cell array, where each cell contains an M-by-1 vector of rigidtform3d objects
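
For a single input image, you can read the estimated rotation and translation of a detected object directly from the corresponding rigidtform3d object, for example:

    tform = poses(1);          % pose of the first detected object
    R = tform.R;               % 3-by-3 rotation matrix
    t = tform.Translation;     % 1-by-3 translation vector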

labels

Object labels, returned as an M-by-1 categorical vector or a B-by-1 cell array. When I is a single image, labels is an M-by-1 categorical vector, where M is the number of detected objects in the image.

When I is a batch of B images, labels is a B-by-1 cell array in which each cell contains an M-by-1 categorical vector with the labels of the objects detected in the corresponding image.

scores

Detection confidence scores, returned as an M-by-1 numeric vector or a B-by-1 cell array. When I is a single image, scores is an M-by-1 numeric vector, where M is the number of detected objects in the image.

When I is a batch of B images, then scores is a B-by-1 cell array. Each element is an M-by-1 numeric vector with the scores of the objects detected in the corresponding image.

The score for each object is in the range [0, 1]. A higher score indicates higher confidence in the detection.

bboxes

Locations of detected objects within the input image, returned as an M-by-4 numeric matrix or a B-by-1 cell array. When I is a single image, bboxes is an M-by-4 numeric matrix, where M is the number of detected objects in the image. Each row of the matrix is of the form [x y width height], where x and y specify the upper-left corner of the corresponding bounding box, and width and height specify its size in pixels.

When I is a batch of B images, bboxes is a B-by-1 cell array in which each cell contains an M-by-4 numeric matrix with the bounding boxes of the objects detected in the corresponding image.
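
As an illustrative sketch for a single input image, you can draw the detections on the image with insertObjectAnnotation:

    annotated = insertObjectAnnotation(I,"rectangle",bboxes,string(labels));
    figure
    imshow(annotated)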

masks

Object masks, returned as an H-by-W-by-M logical array or a B-by-1 cell array. When I is a single image, masks is an H-by-W-by-M logical array, where H and W are the height and width of the input image, respectively, and M is the number of detected objects in the image.

When I is a batch of B images, masks is a B-by-1 cell array in which each cell contains an H-by-W-by-M logical array with the masks for the corresponding image.
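
For a single input image, one way to visualize the masks is insertObjectMask, which accepts an H-by-W-by-M logical array of masks:

    overlaid = insertObjectMask(I,masks);
    figure
    imshow(overlaid)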

Version History

Introduced in R2024a