Traffic Sign Detection and Recognition

This example shows how to generate CUDA® MEX code for a traffic sign detection and recognition application that uses deep learning. Traffic sign detection and recognition is an important application for driver assistance systems, aiding and providing information to the driver about road signs.

In this traffic sign detection and recognition example, you perform three steps: detection, non-maximal suppression (NMS), and recognition. First, the example detects the traffic signs in an input image by using an object detection network that is a variant of the You Only Look Once (YOLO) network. Then, the NMS algorithm suppresses overlapping detections. Finally, the recognition network classifies the detected traffic signs.

Prerequisites

  • CUDA-enabled NVIDIA® GPU with compute capability 3.2 or higher.

  • NVIDIA CUDA toolkit and driver.

  • NVIDIA cuDNN library.

  • Environment variables for the compilers and libraries. For information on the supported versions of the compilers and libraries, see Third-party Products. For setting up the environment variables, see Setting Up the Prerequisite Products.

  • GPU Coder Interface for Deep Learning Libraries support package. To install this support package, use the Add-On Explorer.

Verify GPU Environment

Use the coder.checkGpuInstall function to verify that the compilers and libraries necessary for running this example are set up correctly.

envCfg = coder.gpuEnvConfig('host');
envCfg.DeepLibTarget = 'cudnn';
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
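
Because Quiet is set to 1, the check above prints nothing to the Command Window. One way to see the outcome is to capture the return value of coder.checkGpuInstall, as in the sketch below. It assumes GPU Coder and the prerequisite products are installed. The contents of the returned structure can vary by release, so the sketch only displays it and does not rely on particular field names.

```matlab
% Sketch: capture and display the result of the environment check.
% Assumes GPU Coder and the prerequisite products are installed.
envCfg = coder.gpuEnvConfig('host');
envCfg.DeepLibTarget = 'cudnn';
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;

% The returned structure reports the outcome of each check that was run.
results = coder.checkGpuInstall(envCfg);
disp(results);
```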

Detection and Recognition Networks

The detection network is trained in the Darknet framework and imported into MATLAB® for inference. Because each traffic sign is small relative to the image and the training data contains few samples per class, all traffic signs are treated as a single class when training the detection network.

The detection network divides the input image into a 7-by-7 grid. A grid cell detects a traffic sign if the center of the sign falls within that cell. Each cell predicts two bounding boxes and a confidence score for each box. The confidence score indicates whether the box contains an object. Each cell also predicts one probability of finding a traffic sign in the cell. The final score is the product of the box confidence score and this probability. You apply a threshold of 0.2 to the final score to select the detections.
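
The scoring rule can be illustrated for a single grid cell. This is a minimal sketch. The confidence and probability values are hypothetical and chosen only for illustration; the threshold of 0.2 comes from the text.

```matlab
% Sketch: final score for the two boxes predicted by one grid cell.
thresh = 0.2;                  % detection threshold used in this example
boxConfidence = [0.85 0.10];   % hypothetical confidence scores of the two boxes
classProb = 0.60;              % hypothetical traffic sign probability for the cell

% The final score is the product of box confidence and cell probability.
finalScore = boxConfidence * classProb;   % [0.51 0.06]

% Only boxes whose final score exceeds the threshold are kept.
isDetection = finalScore > thresh;        % [true false]
```

With these values, only the first box is kept as a detection.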

The recognition network is trained on the same images by using MATLAB.

The trainRecognitionnet.m helper script shows the recognition network training.

Get the Pretrained SeriesNetwork

Download the detection and recognition networks.

getTsdr();

The detection network contains 58 layers including convolution, leaky ReLU, and fully connected layers.

load('yolo_tsr.mat');
yolo.Layers
ans = 

  58x1 Layer array with layers:

     1   'input'         Image Input             448x448x3 images
     2   'conv1'         Convolution             64 7x7x3 convolutions with stride [2  2] and padding [3  3  3  3]
     3   'relu1'         Leaky ReLU              Leaky ReLU with scale 0.1
     4   'pool1'         Max Pooling             2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     5   'conv2'         Convolution             192 3x3x64 convolutions with stride [1  1] and padding [1  1  1  1]
     6   'relu2'         Leaky ReLU              Leaky ReLU with scale 0.1
     7   'pool2'         Max Pooling             2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     8   'conv3'         Convolution             128 1x1x192 convolutions with stride [1  1] and padding [0  0  0  0]
     9   'relu3'         Leaky ReLU              Leaky ReLU with scale 0.1
    10   'conv4'         Convolution             256 3x3x128 convolutions with stride [1  1] and padding [1  1  1  1]
    11   'relu4'         Leaky ReLU              Leaky ReLU with scale 0.1
    12   'conv5'         Convolution             256 1x1x256 convolutions with stride [1  1] and padding [0  0  0  0]
    13   'relu5'         Leaky ReLU              Leaky ReLU with scale 0.1
    14   'conv6'         Convolution             512 3x3x256 convolutions with stride [1  1] and padding [1  1  1  1]
    15   'relu6'         Leaky ReLU              Leaky ReLU with scale 0.1
    16   'pool6'         Max Pooling             2x2 max pooling with stride [2  2] and padding [0  0  0  0]
    17   'conv7'         Convolution             256 1x1x512 convolutions with stride [1  1] and padding [0  0  0  0]
    18   'relu7'         Leaky ReLU              Leaky ReLU with scale 0.1
    19   'conv8'         Convolution             512 3x3x256 convolutions with stride [1  1] and padding [1  1  1  1]
    20   'relu8'         Leaky ReLU              Leaky ReLU with scale 0.1
    21   'conv9'         Convolution             256 1x1x512 convolutions with stride [1  1] and padding [0  0  0  0]
    22   'relu9'         Leaky ReLU              Leaky ReLU with scale 0.1
    23   'conv10'        Convolution             512 3x3x256 convolutions with stride [1  1] and padding [1  1  1  1]
    24   'relu10'        Leaky ReLU              Leaky ReLU with scale 0.1
    25   'conv11'        Convolution             256 1x1x512 convolutions with stride [1  1] and padding [0  0  0  0]
    26   'relu11'        Leaky ReLU              Leaky ReLU with scale 0.1
    27   'conv12'        Convolution             512 3x3x256 convolutions with stride [1  1] and padding [1  1  1  1]
    28   'relu12'        Leaky ReLU              Leaky ReLU with scale 0.1
    29   'conv13'        Convolution             256 1x1x512 convolutions with stride [1  1] and padding [0  0  0  0]
    30   'relu13'        Leaky ReLU              Leaky ReLU with scale 0.1
    31   'conv14'        Convolution             512 3x3x256 convolutions with stride [1  1] and padding [1  1  1  1]
    32   'relu14'        Leaky ReLU              Leaky ReLU with scale 0.1
    33   'conv15'        Convolution             512 1x1x512 convolutions with stride [1  1] and padding [0  0  0  0]
    34   'relu15'        Leaky ReLU              Leaky ReLU with scale 0.1
    35   'conv16'        Convolution             1024 3x3x512 convolutions with stride [1  1] and padding [1  1  1  1]
    36   'relu16'        Leaky ReLU              Leaky ReLU with scale 0.1
    37   'pool16'        Max Pooling             2x2 max pooling with stride [2  2] and padding [0  0  0  0]
    38   'conv17'        Convolution             512 1x1x1024 convolutions with stride [1  1] and padding [0  0  0  0]
    39   'relu17'        Leaky ReLU              Leaky ReLU with scale 0.1
    40   'conv18'        Convolution             1024 3x3x512 convolutions with stride [1  1] and padding [1  1  1  1]
    41   'relu18'        Leaky ReLU              Leaky ReLU with scale 0.1
    42   'conv19'        Convolution             512 1x1x1024 convolutions with stride [1  1] and padding [0  0  0  0]
    43   'relu19'        Leaky ReLU              Leaky ReLU with scale 0.1
    44   'conv20'        Convolution             1024 3x3x512 convolutions with stride [1  1] and padding [1  1  1  1]
    45   'relu20'        Leaky ReLU              Leaky ReLU with scale 0.1
    46   'conv21'        Convolution             1024 3x3x1024 convolutions with stride [1  1] and padding [1  1  1  1]
    47   'relu21'        Leaky ReLU              Leaky ReLU with scale 0.1
    48   'conv22'        Convolution             1024 3x3x1024 convolutions with stride [2  2] and padding [1  1  1  1]
    49   'relu22'        Leaky ReLU              Leaky ReLU with scale 0.1
    50   'conv23'        Convolution             1024 3x3x1024 convolutions with stride [1  1] and padding [1  1  1  1]
    51   'relu23'        Leaky ReLU              Leaky ReLU with scale 0.1
    52   'conv24'        Convolution             1024 3x3x1024 convolutions with stride [1  1] and padding [1  1  1  1]
    53   'relu24'        Leaky ReLU              Leaky ReLU with scale 0.1
    54   'fc25'          Fully Connected         4096 fully connected layer
    55   'relu25'        Leaky ReLU              Leaky ReLU with scale 0.1
    56   'fc26'          Fully Connected         539 fully connected layer
    57   'softmax'       Softmax                 softmax
    58   'classoutput'   Classification Output   crossentropyex
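
The 539 outputs of the fc26 layer are consistent with the 7-by-7 grid, two boxes per cell, and single class described earlier. The sketch below works through that arithmetic. The ordering of the output vector (cell probabilities, then box confidences, then box coordinates) is inferred from the index computations in tsdr_predict.m and is not stated elsewhere in this example.

```matlab
% Sketch: size and assumed layout of the detection network output.
side = 7;      % grid is side-by-side
num = 2;       % boxes per grid cell
classes = 1;   % all traffic signs form a single class

numClassProbs  = side*side*classes;   % 49 cell probabilities
numConfidences = side*side*num;       % 98 box confidence scores
numBoxCoords   = side*side*num*4;     % 392 box coordinates (x, y, w, h)
outputLength = numClassProbs + numConfidences + numBoxCoords;   % 539

% One-based start indices for zero-based cell i and box n,
% using the same expressions as tsdr_predict.m.
i = 0; n = 0;
classIndex = i*classes + 1;                              % 1
confIndex  = side*side*classes + i*num + n + 1;          % 50
boxIndex   = side*side*(classes + num) + (i*num + n)*4 + 1;   % 148
```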

The recognition network contains 14 layers including convolution, fully connected, and the classification output layers.

load('RecognitionNet.mat');
convnet.Layers
ans = 

  14x1 Layer array with layers:

     1   'imageinput'    Image Input             48x48x3 images with 'zerocenter' normalization and 'randfliplr' augmentations
     2   'conv_1'        Convolution             100 7x7x3 convolutions with stride [1  1] and padding [0  0  0  0]
     3   'relu_1'        ReLU                    ReLU
     4   'maxpool_1'     Max Pooling             2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     5   'conv_2'        Convolution             150 4x4x100 convolutions with stride [1  1] and padding [0  0  0  0]
     6   'relu_2'        ReLU                    ReLU
     7   'maxpool_2'     Max Pooling             2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     8   'conv_3'        Convolution             250 4x4x150 convolutions with stride [1  1] and padding [0  0  0  0]
     9   'maxpool_3'     Max Pooling             2x2 max pooling with stride [2  2] and padding [0  0  0  0]
    10   'fc_1'          Fully Connected         300 fully connected layer
    11   'dropout'       Dropout                 90% dropout
    12   'fc_2'          Fully Connected         35 fully connected layer
    13   'softmax'       Softmax                 softmax
    14   'classoutput'   Classification Output   crossentropyex with '0' and 34 other classes
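
As a check on the layer list, the spatial size of the feature maps can be traced from the 48-by-48 input to the first fully connected layer. This sketch uses only the filter sizes, strides, and padding shown in the list above. It assumes the usual output size rule for unpadded convolution with stride 1 (input size minus filter size plus one) and for 2-by-2 pooling with stride 2 (half the input size).

```matlab
% Sketch: feature map size through the recognition network.
sz = 48;            % input images are 48-by-48-by-3
sz = sz - 7 + 1;    % conv_1, 7x7 filters, no padding    -> 42
sz = sz / 2;        % maxpool_1, 2x2 with stride 2       -> 21
sz = sz - 4 + 1;    % conv_2, 4x4 filters, no padding    -> 18
sz = sz / 2;        % maxpool_2                          -> 9
sz = sz - 4 + 1;    % conv_3, 4x4 filters, no padding    -> 6
sz = sz / 2;        % maxpool_3                          -> 3

% conv_3 has 250 filters, so fc_1 receives 3*3*250 values.
numFeatures = sz*sz*250;   % 2250
```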

The tsdr_predict Entry-Point Function

The tsdr_predict.m entry-point function takes an image input and detects the traffic signs in the image by using the detection network. The function suppresses overlapping detections (NMS) by using selectStrongestBbox and recognizes each traffic sign by using the recognition network. The function loads the network object from yolo_tsr.mat into the persistent variable detectionnet and the network object from RecognitionNet.mat into the persistent variable recognitionnet. The function reuses the persistent objects on subsequent calls.
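
To show the idea behind the NMS step, the following sketch implements a simple greedy suppression in plain MATLAB. It illustrates the general technique only and is not the implementation of selectStrongestBbox. The boxes, the scores, and the overlap threshold of 0.5 are hypothetical values chosen for illustration.

```matlab
% Sketch: greedy non-maximal suppression on [x y width height] boxes.
boxes  = [10 10 50 50; 12 12 50 50; 200 200 40 40];   % hypothetical boxes
scores = [0.9; 0.6; 0.8];                             % hypothetical scores
overlapThreshold = 0.5;                               % assumed for illustration

[~,order] = sort(scores,'descend');   % visit boxes from strongest to weakest
keep = false(size(scores));
suppressed = false(size(scores));

for a = 1:numel(order)
    i = order(a);
    if suppressed(i)
        continue;
    end
    keep(i) = true;   % strongest remaining box is kept
    for b = a+1:numel(order)
        j = order(b);
        if suppressed(j)
            continue;
        end
        % Intersection rectangle of boxes i and j.
        x1 = max(boxes(i,1),boxes(j,1));
        y1 = max(boxes(i,2),boxes(j,2));
        x2 = min(boxes(i,1)+boxes(i,3),boxes(j,1)+boxes(j,3));
        y2 = min(boxes(i,2)+boxes(i,4),boxes(j,2)+boxes(j,4));
        interArea = max(0,x2-x1) * max(0,y2-y1);
        unionArea = boxes(i,3)*boxes(i,4) + boxes(j,3)*boxes(j,4) - interArea;
        % Suppress the weaker box if it overlaps the kept box too much.
        if interArea/unionArea > overlapThreshold
            suppressed(j) = true;
        end
    end
end

selected = boxes(keep,:);   % rows 1 and 3 remain
```

Here the second box overlaps the first by about 85 percent (intersection over union) and has a lower score, so it is suppressed. The third box does not overlap either of the others and is kept.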

type('tsdr_predict.m')
function [selectedBbox,idx] = tsdr_predict(img)
%#codegen

% This function detects the traffic signs in the image using Detection Network
% (modified version of Yolo) and recognizes(classifies) using Recognition Network
%
% Inputs :
%
% img           : Input test image
%
% Outputs :
%
% selectedBbox  : Detected bounding boxes 
% idx           : Corresponding classes

% Copyright 2017-2019 The MathWorks, Inc.

coder.gpu.kernelfun;

% resize the image
img_rz = imresize(img,[448,448]);

% Converting into BGR format
img_rz = img_rz(:,:,3:-1:1);
img_rz = im2single(img_rz);

%% TSD
persistent detectionnet;
if isempty(detectionnet)   
    detectionnet = coder.loadDeepLearningNetwork('yolo_tsr.mat','Detection');
end

predictions = detectionnet.activations(img_rz,56,'OutputAs','channels');


%% Convert predictions to bounding box attributes
classes = 1;
num = 2;
side = 7;
thresh = 0.2;
[h,w,~] = size(img);


boxes = single(zeros(0,4));    
probs = single(zeros(0,1));    
for i = 0:(side*side)-1
    for n = 0:num-1
        p_index = side*side*classes + i*num + n + 1;
        scale = predictions(p_index);       
        prob = zeros(1,classes+1);
        for j = 0:classes
            class_index = i*classes + 1;
            tempProb = scale*predictions(class_index+j);
            if tempProb > thresh
                
                row = floor(i / side);
                col = mod(i,side);
                
                box_index = side*side*(classes + num) + (i*num + n)*4 + 1;
                bxX = (predictions(box_index + 0) + col) / side;
                bxY = (predictions(box_index + 1) + row) / side;
                
                bxW = (predictions(box_index + 2)^2);
                bxH = (predictions(box_index + 3)^2);
                
                prob(j+1) = tempProb;
                probs = [probs;tempProb];
                                
                boxX = (bxX-bxW/2)*w+1;
                boxY = (bxY-bxH/2)*h+1;
                boxW = bxW*w;
                boxH = bxH*h;
                boxes = [boxes; boxX,boxY,boxW,boxH];
            end
        end
    end
end

%% Run Non-Maximal Suppression on the detected bounding boxes
coder.varsize('selectedBbox',[98, 4],[1 0]);
[selectedBbox,~] = selectStrongestBbox(round(boxes),probs);

%% Recognition

persistent recognitionnet;
if isempty(recognitionnet) 
    recognitionnet = coder.loadDeepLearningNetwork('RecognitionNet.mat','Recognition');
end

idx = zeros(size(selectedBbox,1),1);
inpImg = coder.nullcopy(zeros(48,48,3,size(selectedBbox,1)));
for i = 1:size(selectedBbox,1)
    
    ymin = selectedBbox(i,2);
    ymax = ymin+selectedBbox(i,4);
    xmin = selectedBbox(i,1);
    xmax = xmin+selectedBbox(i,3);

    
    % Resize Image
    inpImg(:,:,:,i) = imresize(img(ymin:ymax,xmin:xmax,:),[48,48]);
end

for i = 1:size(selectedBbox,1)
    output = recognitionnet.predict(inpImg(:,:,:,i));
    [~,idx(i)]=max(output);
end
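
The conversion from network outputs to pixel coordinates in tsdr_predict.m can be followed with a worked example for one box. This is a sketch only. The cell index and the four predicted values are hypothetical, and the image size matches the 480-by-704 input used for code generation in this example. As in the function, the x and y outputs are offsets within the grid cell, and the width and height outputs are squared to obtain fractions of the image size.

```matlab
% Sketch: decode one predicted box into [x y width height] in pixels.
side = 7;                    % grid size
w = 704; h = 480;            % image width and height in pixels
i = 17;                      % hypothetical zero-based grid cell index
pred = [0.5 0.5 0.3 0.4];    % hypothetical outputs for the box

row = floor(i / side);       % 2
col = mod(i,side);           % 3

% Box center as a fraction of the image size.
bxX = (pred(1) + col) / side;   % 0.5
bxY = (pred(2) + row) / side;   % about 0.3571

% Box width and height as fractions of the image size.
bxW = pred(3)^2;                % 0.09
bxH = pred(4)^2;                % 0.16

% Top-left corner and size in pixels (one-based coordinates).
boxX = (bxX - bxW/2)*w + 1;
boxY = (bxY - bxH/2)*h + 1;
boxW = bxW*w;
boxH = bxH*h;

bbox = round([boxX boxY boxW boxH]);   % [321 134 63 77]
```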

Generate CUDA MEX for the tsdr_predict Function

Create a GPU configuration object for a MEX target and set the target language to C++. Use the coder.DeepLearningConfig function to create a cuDNN deep learning configuration object and assign it to the DeepLearningConfig property of the GPU code configuration object. To generate CUDA MEX, use the codegen command and specify the input to be of size [480,704,3]. This value corresponds to the input image size of the tsdr_predict function.

cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
codegen -config cfg tsdr_predict -args {ones(480,704,3,'uint8')} -report
Code generation successful: To view the report, open('codegen/mex/tsdr_predict/html/report.mldatx').

To generate code by using TensorRT, pass coder.DeepLearningConfig('tensorrt') as an option to the coder configuration object instead of 'cudnn'.
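
For reference, the TensorRT variant of the configuration might look like the following sketch. It repeats the cuDNN commands above with only the deep learning library name changed, and it assumes that TensorRT is installed and set up as described in the prerequisites documentation.

```matlab
% Sketch: same code generation flow, targeting TensorRT instead of cuDNN.
% Assumes the TensorRT library is installed and configured.
cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');
codegen -config cfg tsdr_predict -args {ones(480,704,3,'uint8')} -report
```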

Run Generated MEX

Load an input image.

im = imread('stop.jpg');
imshow(im);

Call tsdr_predict_mex on the input image.

im = imresize(im, [480,704]);
[bboxes,classes] = tsdr_predict_mex(im);

Map the class numbers to traffic sign names in the class dictionary.

classNames = {'addedLane','slow','dip','speedLimit25','speedLimit35','speedLimit40','speedLimit45',...
    'speedLimit50','speedLimit55','speedLimit65','speedLimitUrdbl','doNotPass','intersection',...
    'keepRight','laneEnds','merge','noLeftTurn','noRightTurn','stop','pedestrianCrossing',...
    'stopAhead','rampSpeedAdvisory20','rampSpeedAdvisory45','truckSpeedLimit55',...
    'rampSpeedAdvisory50','turnLeft','rampSpeedAdvisoryUrdbl','turnRight','rightLaneMustTurn',...
    'yield','yieldAhead','school','schoolSpeedLimit25','zoneAhead45','signalAhead'};

classRec = classNames(classes);
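
The lookup above indexes the cell array of names with the numeric class indices returned by the MEX function. The dictionary has 35 entries, which matches the 35 outputs of the fc_2 layer in the recognition network. The following sketch shows the same indexing with a shortened, hypothetical dictionary and hypothetical class indices.

```matlab
% Sketch: map numeric class indices to names by cell array indexing.
shortNames = {'addedLane','slow','dip','stop'};   % hypothetical short dictionary
predicted = [4; 2];                               % hypothetical class indices

% Parentheses return a cell array that holds the selected names.
recognized = shortNames(predicted);               % {'stop','slow'}

% Braces return the character vector stored in one cell.
firstName = recognized{1};                        % 'stop'
```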

Display the detected traffic signs.

outputImage = insertShape(im,'Rectangle',bboxes,'LineWidth',3);

for i = 1:size(bboxes,1)
    outputImage = insertText(outputImage,[bboxes(i,1)+bboxes(i,3) bboxes(i,2)-20],classRec{i},'FontSize',20,'TextColor','red');
end

imshow(outputImage);

Traffic Sign Detection and Recognition on a Video

The included helper file tsdr_testVideo.m grabs frames from the test video, performs traffic sign detection and recognition, and plots the results on each frame of the test video.

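% Input video
v = VideoReader('stop.avi');
fps = 0;
while hasFrame(v)
    % Take a frame
    picture = readFrame(v);
    % A MEX function accepts only the input size specified at code
    % generation, so tsdr_predict_mex must be generated for
    % 920-by-1632-by-3 uint8 input to match this resize. The codegen
    % command shown earlier uses 480-by-704-by-3.
    picture = imresize(picture,[920,1632]);
    % Call MEX function for Traffic Sign Detection and Recognition
    tic;
    [bboxes,classes] = tsdr_predict_mex(picture);
    newt = toc;
    % Smoothed frames-per-second estimate
    fps = .9*fps + .1*(1/newt);
    % Display the detections on the frame
    displayDetections(picture,bboxes,classes,fps);
end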
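
The fps update in the loop is an exponential moving average: each new value keeps 90 percent of the previous estimate and adds 10 percent of the instantaneous rate for the latest frame. The sketch below applies the same update to a few hypothetical frame times to show how the estimate evolves.

```matlab
% Sketch: exponential moving average of the frame rate.
frameTimes = [0.05 0.04 0.05 0.10 0.05];   % hypothetical per-frame times in seconds
fps = 0;                                   % same initial value as in the loop

for k = 1:numel(frameTimes)
    instantaneous = 1/frameTimes(k);       % 20, 25, 20, 10, 20 frames per second
    fps = 0.9*fps + 0.1*instantaneous;     % smoothed estimate
end
% After five frames, fps is approximately 7.65.
```

Because the estimate starts at 0, it takes a number of frames to approach the actual rate, so the value shown on the first frames of the video understates the throughput.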

Clear the static network objects that were loaded into memory.

clear mex;