Main Content

Improve Performance of Small Matrix Problems on the GPU using pagefun

This example shows how to use pagefun to improve the performance of applying a large number of independent rotations and translations to objects in a 3-D environment. This is typical of a range of problems which involve a large batch of calculations on small arrays.

GPUs are most effective when carrying out calculations on very large matrices. In MATLAB® this is usually achieved by vectorizing code to maximize the work done in each instruction. When you have a large data set but your calculations are divided into many small matrix operations, it can be challenging to maximize performance by running simultaneously on the many hundreds of GPU cores.

The arrayfun function allow scalar operations to be carried out in parallel on the GPU. The pagefun function adds the capability of carrying out matrix operations in batch in a similar way. The pagefun function is available in Parallel Computing Toolbox™ for use with gpuArrays.

In this example, a robot is navigating a known map containing a large number of features that the robot can identify using its sensors. The robot locates itself in the map by measuring the relative position and orientation of those objects and comparing them to the map locations. Assuming the robot is not completely lost, it can use any difference between the two to correct its position, for instance by using a Kalman Filter. We will focus on the first part of the algorithm.

Set up the map

Let's create a map of objects with randomized positions and orientations in a large room.

numObjects = 1000;
roomDimensions = [50 50 5]; % Length * breadth * height in meters

We represent positions and orientations using 3-by-1 vectors T and 3-by-3 rotation matrices R. When we have N of these transforms we pack the translations into a 3-by-N matrix, and the rotations into a 3-by-3-by-N array. The supporting function randomTransforms is provided at the end of this example and initializes N transforms with random values, providing a structure as output.

Use randomTransforms to set up the map of object transforms, and a start location for the robot.

Map = randomTransforms(numObjects,roomDimensions);
Robot = randomTransforms(1,roomDimensions);

Define the equations

To correctly identify map features the robot needs to transform the map to put its sensors at the origin. Then it can find map objects by comparing what it sees with what it expects to see.

For a map object i we can find its position relative to the robot Trel(i) and orientation Rrel(i) by transforming its global map location:

Rrel(i)=RbotRmap(i)Trel(i)=Rbot(Tmap(i)-Tbot)

where Tbot and Rbot are the position and orientation of the robot, and Tmap(i) and Rmap(i) represent the map data. The equivalent MATLAB code looks like this:

Rrel(:,:,i) = Rbot' * Rmap(:,:,i)
Trel(:,i) = Rbot' * (Tmap(:,i) - Tbot)

Perform many matrix transforms on the CPU using a for loop

We need to transform every map object to its location relative to the robot. The supporting function loopingTransform is provided at the end of this example and loops over all the transforms in turn. Note the 'like' syntax for zeros which will allow us to use the same code on the GPU in the next section.

To time the calculation we use the timeit function, which will call loopingTransform multiple times to get an average timing. Since it requires a function with no arguments, we use the @() syntax to create an anonymous function of the right form.

cpuTime = timeit(@()loopingTransform(Robot,Map,numObjects));
fprintf('It takes %3.4f seconds on the CPU to execute %d transforms.\n', ...
        cpuTime, numObjects);
It takes 0.0047 seconds on the CPU to execute 1000 transforms.

Try the same code on the GPU

To run this code on the GPU is merely a matter of copying the data into a gpuArray. When MATLAB encounters data stored on the GPU it will run any code using it on the GPU as long as it is supported.

gMap.R = gpuArray(Map.R);
gMap.T = gpuArray(Map.T);
gRobot.R = gpuArray(Robot.R);
gRobot.T = gpuArray(Robot.T);

Now we call gputimeit, which is the equivalent of timeit for code that includes GPU computation. It makes sure all GPU operations have finished before recording the time.

fprintf('Computing...\n');
Computing...
gpuTime = gputimeit(@()loopingTransform(gRobot,gMap,numObjects));

fprintf('It takes %3.4f seconds on the GPU to execute %d transforms.\n', ...
        gpuTime, numObjects);
It takes 0.2745 seconds on the GPU to execute 1000 transforms.
fprintf(['Unvectorized GPU code is %3.2f times slower ',...
    'than the CPU version.\n'], gpuTime/cpuTime);
Unvectorized GPU code is 58.19 times slower than the CPU version.

Batch process using pagefun

The GPU version above was very slow because, although all calculations were independent, they ran in series. Using pagefun we can run all the computations in parallel. The supporting function pagefunTransform is provided at the end of this example and applies the same transforms as the loopingTransform function using pagefun instead of a for-loop.

gpuPagefunTime = gputimeit(@()pagefunTransform(gRobot, gMap));
fprintf(['It takes %3.4f seconds on the GPU using pagefun ',...
    'to execute %d transforms.\n'], gpuPagefunTime, numObjects);
It takes 0.0003 seconds on the GPU using pagefun to execute 1000 transforms.
fprintf(['Vectorized GPU code is %3.2f times faster ',...
    'than the CPU version.\n'], cpuTime/gpuPagefunTime);
Vectorized GPU code is 17.88 times faster than the CPU version.
fprintf(['Vectorized GPU code is %3.2f times faster ',...
    'than the unvectorized GPU version.\n'], gpuTime/gpuPagefunTime);
Vectorized GPU code is 1040.61 times faster than the unvectorized GPU version.

Explanation

The first computation was the calculation of the rotations. This involved a matrix multiply, which translates to the function mtimes (*). We pass this to pagefun along with the two sets of rotations to be multiplied:

Rel.R = pagefun(@mtimes, Robot.R', Map.R);

Robot.R' is a 3-by-3 matrix, and Map.R is a 3-by-3-by-N array. The pagefun function matches each independent matrix from the map to the same robot rotation, and gives us the required 3-by-3-by-N output.

The translation calculation also involves a matrix multiply, but the normal rules of matrix multiplication allow this to come outside the loop without any changes:

Rel.T = Robot.R' * (Map.T - Robot.T);

More advanced GPU vectorization - Solving a "lost robot" problem

If our robot was in an unknown part of the map, it might use a global search algorithm to locate itself. The algorithm would test a number of possible locations by carrying out the above computation and looking for good correspondence between the objects seen by the robot's sensors and what it would expect to see at that position.

Now we have got multiple robots as well as multiple objects. N objects and M robots should give us N*M transforms out. To distinguish 'robot space' from 'object space' we use the 4th dimension for rotations and the 3rd for translations. That means our robot rotations will be 3-by-3-by-1-by-M, and the translations will be 3-by-1-by-M.

We initialize our search with random robot locations. A good search algorithm would use topological or other clues to seed the search more intelligently.

numRobots = 10;
Robot = randomTransforms(numRobots,roomDimensions);
Robot.R = reshape(Robot.R, 3, 3, 1, []); % Spread along the 4th dimension
Robot.T = reshape(Robot.T, 3, 1, []); % Spread along the 3rd dimension
gRobot.R = gpuArray(Robot.R);
gRobot.T = gpuArray(Robot.T);

A supporting function loopingTransform2 is defined at the end of this example and performs a looping transform using two nested loops, to loop over the robots as well as over the objects.

cpuTime = timeit(@()loopingTransform2(Robot,Map,numObjects,numRobots));
fprintf('It takes %3.4f seconds on the CPU to execute %d transforms.\n', ...
        cpuTime, numObjects*numRobots);
It takes 0.0852 seconds on the CPU to execute 10000 transforms.

For our GPU timings we use tic and toc this time, because otherwise the calculation would take too long. This will be precise enough for our purposes. To ensure any cost associated with creating the output data is included, we are calling loopingTransform2 with a single output variable, just as timeit and gputimeit do by default.

fprintf('Computing...\n');
Computing...
tic;
gRel = loopingTransform2(gRobot,gMap,numObjects,numRobots);
gpuTime = toc;

fprintf('It takes %3.4f seconds on the GPU to execute %d transforms.\n', ...
        gpuTime, numObjects*numRobots);
It takes 3.6775 seconds on the GPU to execute 10000 transforms.
fprintf(['Unvectorized GPU code is %3.2f times slower ',...
    'than the CPU version.\n'], gpuTime/cpuTime);
Unvectorized GPU code is 43.17 times slower than the CPU version.

As before, the looping version runs much slower on the GPU because it is not doing calculations in parallel.

The new pagefun version needs to incorporate the transpose operator as well as mtimes into a call to pagefun. We also need to squeeze the transposed robot orientations to put the spread over robots into the 3rd dimension, to match the translations. Despite this, the resulting code is considerably more compact. A supporting function pagefunTransform2 is provided at the end of this example and applies the same transforms as the loopingTransform2 function using two pagefun calls instead of nested for-loops.

Once again, pagefun expands dimensions appropriately. So where we multiply 3-by-3-by-1-by-M matrix Rt with 3-by-3-by-N-by-1 matrix Map.R, we get a 3-by-3-by-N-by-M matrix out.

gpuPagefunTime = gputimeit(@()pagefunTransform2(gRobot,gMap));
fprintf(['It takes %3.4f seconds on the GPU using pagefun ',...
    'to execute %d transforms.\n'], gpuPagefunTime, numObjects*numRobots);
It takes 0.0012 seconds on the GPU using pagefun to execute 10000 transforms.
fprintf(['Vectorized GPU code is %3.2f times faster ',...
    'than the CPU version.\n'], cpuTime/gpuPagefunTime);
Vectorized GPU code is 72.28 times faster than the CPU version.
fprintf(['Vectorized GPU code is %3.2f times faster ',...
    'than the unvectorized GPU version.\n'], gpuTime/gpuPagefunTime);
Vectorized GPU code is 3120.72 times faster than the unvectorized GPU version.

Conclusion

The pagefun function supports a number of 2-D operations, as well as most of the scalar operations supported by arrayfun. Together, these functions allow you to vectorize a range of computations involving matrix algebra and array manipulation, eliminating the need for loops and making huge performance gains.

Wherever you are doing small calculations on GPU data in a loop, you should consider converting to a batch implementation in this way. This can also be an opportunity to make use of the GPU to improve performance where previously it gave no performance gains.

Supporting Functions

Random Transform Function

The randomTransforms function sets up a map of object transforms and a start location for the robot for a number of objects in a room of specified dimensions.

function Tform = randomTransforms(N,roomDimensions)
Tform.T = zeros(3, N);
Tform.R = zeros(3, 3, N);

for i = 1:N
    Tform.T(:,i) = rand(3, 1) .* roomDimensions';
    % To get a random orientation, we can extract an orthonormal
    % basis for a random 3-by-3 matrix.
    Tform.R(:,:,i) = orth(rand(3, 3));
end

end

Looping Transform Function

The loopingTransform function transforms every object to its location relative to the robot by looping over the transforms in turn.

function Rel = loopingTransform(Robot,Map,numObjects)
Rel.R = zeros(size(Map.R), 'like', Map.R); % Initialize memory
Rel.T = zeros(size(Map.T), 'like', Map.T); % Initialize memory

for i = 1:numObjects
    Rel.R(:,:,i) = Robot.R' * Map.R(:,:,i);
    Rel.T(:,i) = Robot.R' * (Map.T(:,i) - Robot.T);
end

end

pagefun Transform Function

The pagefunTransform function transforms every object to its location relative to the robot by applying the transforms using the pagefun function.

function Rel = pagefunTransform(Robot,Map)
Rel.R = pagefun(@mtimes, Robot.R', Map.R);
Rel.T = Robot.R' * (Map.T - Robot.T);
end

Nested Looping Transform Function

The loopingTransform2 function performs a looping transform using two nested loops, to loop over the robots as well as over the objects. The transforms map every object to its location relative to every robot.

function Rel = loopingTransform2(Robot,Map,numObjects,numRobots)
Rel.R = zeros(3, 3, numObjects, numRobots, 'like', Map.R);
Rel.T = zeros(3, numObjects, numRobots, 'like', Map.T);

for i = 1:numObjects
    for j = 1:numRobots
        Rel.R(:,:,i,j) = Robot.R(:,:,1,j)' * Map.R(:,:,i);
        Rel.T(:,i,j) = ...
            Robot.R(:,:,1,j)' * (Map.T(:,i) - Robot.T(:,1,j));
    end
end

end

Two-call pagefun Transform Function

The pagefunTransform2 function performs transforms to map every object to its location relative to every robot using two calls to the pagefun function.

function Rel = pagefunTransform2(Robot, Map)
Rt = pagefun(@transpose, Robot.R);
Rel.R = pagefun(@mtimes, Rt, Map.R);
Rel.T = pagefun(@mtimes, squeeze(Rt), ...
    (Map.T - Robot.T));
end

See Also

| | |

Related Topics