The you-only-look-once (YOLO) v2 object detector uses a single-stage object detection network. YOLO v2 is faster than two-stage deep learning object detectors, such as regions with convolutional neural networks (Faster R-CNNs).
The YOLO v2 model runs a deep learning CNN on an input image to produce network predictions. The object detector decodes the predictions and generates bounding boxes.
YOLO v2 uses anchor boxes to detect classes of objects in an image. For more details, see Anchor Boxes for Object Detection. YOLO v2 predicts these three attributes for each anchor box:
Intersection over union (IoU): Predicts the objectness score of each anchor box.
Anchor box offsets: Refine the anchor box position.
Class probability: Predicts the class label assigned to each anchor box.
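To make the IoU term concrete, here is a minimal Python sketch of the overlap measure between two axis-aligned boxes. It is an illustration only, not toolbox code: the function name `iou` and the corner-based box format `(x_min, y_min, x_max, y_max)` are assumptions chosen for brevity.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    The corner-based box format is an assumption made for this sketch.
    """
    # Corners of the overlap rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so that disjoint boxes contribute no overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: two 2-by-2 squares that share a 1-by-1 corner.
predicted_box = (0.0, 0.0, 2.0, 2.0)
ground_truth_box = (1.0, 1.0, 3.0, 3.0)
print(iou(predicted_box, ground_truth_box))
```

An IoU of 1 means the boxes coincide, and 0 means they do not overlap.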
The figure shows predefined anchor boxes (the dotted lines) at each location in a feature map and the refined location after offsets are applied. Matched boxes with a class are in color.
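The refinement of an anchor box can be sketched in Python using the parameterization described in the YOLO9000 paper listed in the references. This is a hedged illustration rather than the toolbox implementation: the function names, the ordering of the raw values, and the use of grid-cell units are assumptions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_anchor(raw, cell_x, cell_y, anchor_w, anchor_h):
    """Decode one anchor's raw outputs.

    raw is assumed to be ordered as (t_x, t_y, t_w, t_h, t_o, class scores...).
    Returns the box center and size in grid-cell units, the objectness
    score, and the class probabilities.
    """
    t_x, t_y, t_w, t_h, t_o = raw[:5]
    # Offsets keep the box center inside the grid cell at (cell_x, cell_y).
    center_x = cell_x + sigmoid(t_x)
    center_y = cell_y + sigmoid(t_y)
    # Width and height rescale the predefined anchor box.
    width = anchor_w * math.exp(t_w)
    height = anchor_h * math.exp(t_h)
    objectness = sigmoid(t_o)
    class_probs = softmax(raw[5:])
    return (center_x, center_y, width, height), objectness, class_probs


# Hypothetical raw output for one anchor with two classes.
raw_output = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
box, objectness, class_probs = decode_anchor(
    raw_output, cell_x=3, cell_y=4, anchor_w=2.0, anchor_h=1.5
)
print(box, objectness, class_probs)
```

With all-zero offsets, the decoded box sits at the center of its grid cell and keeps the anchor's width and height.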
With transfer learning, you can use a pretrained CNN as the feature extractor in a YOLO v2 detection network. Use the yolov2Layers function to create a YOLO v2 detection network from any pretrained CNN, for example MobileNet v2. For a list of pretrained CNNs, see Pretrained Deep Neural Networks (Deep Learning Toolbox).
You can also design a custom model based on a pretrained image classification CNN. For more details, see Design a YOLO v2 Detection Network.
You can design a custom YOLO v2 model layer by layer. The model starts with a feature extractor network, which can be initialized from a pretrained CNN or trained from scratch. The detection subnetwork contains a series of Conv, Batch norm, and ReLU layers, followed by the transform and output layers, yolov2TransformLayer and yolov2OutputLayer objects, respectively.
yolov2TransformLayer transforms the raw CNN output into a form required to produce object detections.
yolov2OutputLayer defines the anchor box parameters and implements the loss function used to train the detector.
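When sizing the last convolution layer of the detection subnetwork, a common rule from the YOLO v2 formulation is that every anchor box needs four offsets, one objectness score, and one score per class. The Python sketch below only illustrates that arithmetic. The function name is hypothetical, and the count should be checked against the network you actually build.

```python
def num_prediction_filters(num_anchors, num_classes):
    """Channels needed at each feature-map location (assumed YOLO v2 layout)."""
    # Per anchor: 4 box offsets + 1 objectness (IoU) score + one score per class.
    predictions_per_anchor = 5 + num_classes
    return num_anchors * predictions_per_anchor


# Hypothetical configuration: 4 anchor boxes and 3 object classes.
print(num_prediction_filters(num_anchors=4, num_classes=3))
```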
You can also use the Deep Network Designer (Deep Learning Toolbox) app to manually create a network. The designer incorporates Computer Vision Toolbox™ YOLO v2 features.
The reorganization layer (created using the spaceToDepthLayer object) and the depth concatenation layer (created using the depthConcatenationLayer (Deep Learning Toolbox) object) are used to combine low-level and high-level features. These layers add low-level image information to the detection features, which improves detection accuracy for smaller objects. Typically, the reorganization layer is attached to a layer within the feature extraction network whose output feature map is larger than the feature extraction layer output.
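The idea behind reorganization followed by depth concatenation can be sketched in plain Python. This is a conceptual illustration, not the toolbox layers: the function names are hypothetical, the feature maps are nested lists indexed as height, width, channel, and the channel ordering inside each block is an assumption that may differ from what spaceToDepthLayer produces.

```python
def space_to_depth(feature_map, block_size=2):
    """Rearrange an H x W x C nested list into (H/b) x (W/b) x (C*b*b)."""
    height = len(feature_map)
    width = len(feature_map[0])
    if height % block_size or width % block_size:
        raise ValueError("height and width must be divisible by block_size")

    out = []
    for row in range(0, height, block_size):
        out_row = []
        for col in range(0, width, block_size):
            channels = []
            # Stack the channels of every pixel in the block into one location.
            for dy in range(block_size):
                for dx in range(block_size):
                    channels.extend(feature_map[row + dy][col + dx])
            out_row.append(channels)
        out.append(out_row)
    return out


def depth_concatenate(map_a, map_b):
    """Join two maps of equal height and width along the channel axis."""
    return [
        [pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]


# Hypothetical 4x4x1 low-level map and 2x2x2 high-level map.
low_level = [[[r * 4 + c] for c in range(4)] for r in range(4)]
high_level = [[[100, 101], [102, 103]], [[104, 105], [106, 107]]]

reorganized = space_to_depth(low_level)                 # now 2x2x4
combined = depth_concatenate(reorganized, high_level)   # now 2x2x6
print(len(combined), len(combined[0]), len(combined[0][0]))
```

The reorganization step trades spatial resolution for channels, so the larger low-level map matches the spatial size of the high-level map without discarding any values.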
For more details on how to create this kind of network, see Create YOLO v2 Object Detection Network.
To learn how to train an object detector by using the YOLO deep learning technique with a CNN, see the Object Detection Using YOLO v2 Deep Learning example.
To learn how to generate CUDA® code using the YOLO v2 object detector (created using the yolov2ObjectDetector object), see Code Generation for Object Detection by Using YOLO v2.
You can use the Image Labeler, Video Labeler, or Ground Truth Labeler (Automated Driving Toolbox) apps to interactively label pixels and export label data for training. The apps can also be used to label rectangular regions of interest (ROIs) for object detection, scene labels for image classification, and pixels for semantic segmentation. To create training data from a ground truth object exported by any of the labelers, you can use the objectDetectorTrainingData or pixelLabelTrainingData functions. For more details, see Training Data for Object Detection and Semantic Segmentation.
 Redmon, Joseph, and Ali Farhadi. “YOLO9000: Better, Faster, Stronger.” In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–25. Honolulu, HI: IEEE, 2017. https://doi.org/10.1109/CVPR.2017.690.
 Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–88. Las Vegas, NV: IEEE, 2016.