Video length: 17:08

What Is Extended Object Tracking? | Autonomous Navigation, Part 5

From the series: Autonomous Navigation

Brian Douglas

In a lot of scenarios, there are other objects that we may need to observe and track in order to effectively navigate within an environment. This video is going to look at extended object tracking: objects that return multiple sensor detections.

We’ll cover a basic overview of what extended object tracking is, what makes it challenging, and then briefly provide some intuition around some of the algorithms that have been developed to solve the problem.

Published: 22 Jul 2020

From the last video, we know how to plan a path through a static environment once we have a map. But in a lot of practical scenarios, the environment is not static. There are often other objects that move through the same location that we're trying to navigate within. And therefore, we may need to observe and track these objects in order to effectively plan and make decisions.

This video is going to look at extended object tracking, which is-- well, I'll explain what it is throughout the rest of this video. But I hope to provide a basic overview of what extended object tracking is, what makes it challenging, and then briefly provide some intuition around some of the algorithms that have been developed to solve the problem. I'm Brian. And welcome to a MATLAB Tech Talk.

If you are developing an autonomous car, having a map of the static environment will let you do things like plan the optimal route through a city or determine which lane to be in if you have a turn coming up. However, the environment is decidedly not static.

So if the car is trying to determine whether it's safe to change lanes, a static map isn't good enough. It needs to have an understanding of the dynamic portion of the environment as well. Those are the other cars and pedestrians and bicycles and so on. So we need to give our autonomous vehicle a way to perceive and track those objects if we want to be able to successfully navigate this environment.

Now, we've already discussed how to track a single object and multiple objects in the sensor fusion and tracking series, which I've linked to below. And I'm biased, but I think it's worth watching because we're going to build on that knowledge here.

In that series, we talked about how tracking a remote object requires us to find the correct model that describes the object's motion, how uncertainty in the system makes it difficult to associate the right measurement to the right object, and how to add and subtract tracks as objects enter and leave the local environment. And we did all of that for so-called point objects. These are objects that are small enough that the sensor returns at most a single detection for each object.

Now, it's an important distinction that a point object doesn't need to be physically small. It just needs to be smaller than the resolution of the sensor. For instance, a radar station might be tracking a very large aircraft that is also very far away. The radar would return a single ping or a single reflection for the entire aircraft, so the whole thing is reduced to a single detection.

And if you're tracking multiple aircraft, then we have to associate each point detection to its respective tracked aircraft and then estimate their state-- that is, the aircraft position and kinematic state, like speed and heading. So if the object is smaller than the resolution of the sensor, then we can apply the algorithms that we discussed in the sensor fusion and tracking series.

However, the small object assumption is not always true. Sometimes, like in the car example from earlier, the object or objects that you're tracking are large compared to the sensor resolution. This is what we call an extended object. With extended objects, we can receive more than one measurement of the same object in the same sample time.

For instance, check out how this car sees the environment with its multiple LIDAR sensors. And if I freeze it right here, check out how much data is generated just from the front LIDAR. Look at how many data points make up this red car that's passing in front. And we have multiple data points for each of the other dynamic and static cars in the scene as well. And this is information that we can use to better understand the state of each of these objects.

But with this additional information, we can estimate more than just position and kinematic states like we did with point objects. We can estimate the size, shape, and orientation of the tracked objects as well. And we may call this the object's extent.

For instance, if these are the returned measurements for a car, a bicycle, and a pedestrian, then just looking at this we could imagine one possible scenario of the size, shape, and orientation of each. And over time as we watch these objects move, we could become more confident in our estimate of the extent.

And it would be really nice if our tracking algorithm could do the same, because understanding the extent of other objects in the environment is critical to placing them within the map. It can mean the difference between planning a path that avoids the object versus treating it as a point and crashing into it. Which brings up the question: how do we model the extent?

At first glance, it's easy to think, well, why don't we just model it exactly as it is? If we have a high enough resolution sensor, shouldn't we keep track of all of the curves and features of the car or whatever it is that you're tracking? That way, we have the highest fidelity representation of the object, right?

Well, that would be nice to have, except it might be too computationally complex to solve for and maintain, and it might not add that much value. Therefore, at the moment, probably the most popular approach is to approximate the extent as a simple geometric object, like a rectangle or an ellipsoid. These shapes can provide a basic spatial representation of the object without requiring a large computational effort. For example, a 2D ellipse can be represented by just three state variables: the two semi-axis lengths, L1 and L2, and the ellipse orientation angle, alpha.
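
To make that a little more concrete, here is a minimal sketch, with made-up numbers, of how those three values describe an elliptical extent as a 2-by-2 shape matrix. This is just one common way to write it down, not the exact state representation that any particular tracker uses.

```matlab
% Sketch: build a 2D elliptical extent from its three parameters.
L1 = 2.3;            % semi-axis along the object's major direction (illustrative, meters)
L2 = 1.0;            % semi-axis along the minor direction
alpha = deg2rad(30); % orientation of the major axis

R = [cos(alpha) -sin(alpha);  % rotation from object axes to world axes
     sin(alpha)  cos(alpha)];

% Symmetric positive-definite shape matrix describing the ellipse.
X = R * diag([L1^2, L2^2]) * R';

% Points p on the ellipse boundary (centered at the origin) satisfy p' * inv(X) * p == 1.
```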

So this is the overview of the extended object tracking problem. We need a tracker that can take in a lot of measurements from an unknown number of extended objects, which are potentially close to other objects in the environment, and then estimate the number of objects in the scene and the position, kinematics, and extent of each one of them.

Now, in general, extended object trackers still need to do the same things that point object trackers do-- namely, take in measurements, associate detections to objects, add and delete tracks, and determine motion models. Except with extended objects, we have to do this with a lot more information. And the data association step is particularly difficult since there are potentially multiple detections per object, and trying to associate a group of detections to one object isn't necessarily straightforward.

So that's what we're going to mostly focus on for the rest of this video. I want to try to provide some intuition around a few different ways extended object trackers deal with all of this data.

Let's start with a simple scenario. Assume a sensor has returned these eight detections. With point objects, we would know that the sensor was detecting eight different objects. However, with extended objects, we just don't know. Are these two different objects? Or are there three objects here? Or maybe they're all part of the same object.

Determining how to group these detections is the partitioning problem. And we need to find a way to partition these detections into groups, where each group contains the detections from a single object. Now, partitioning can be a relatively easy job if the objects we're tracking are spaced far apart relative to their size, like we had with the measurements of the car and the bicycle and the pedestrian.

We can see that there are probably three different objects here since they're so far apart from each other. In this case, we could use a clustering algorithm like DBSCAN to group them together. DBSCAN finds groups of points that are closely packed together based on two tuning parameters: how far apart points can be from each other and still be in a group, and the minimum number of points that make up a group.
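
As a quick illustration, here's roughly what that clustering step could look like with the dbscan function from the Statistics and Machine Learning Toolbox. The detection coordinates and the two tuning values are made up just to show the idea.

```matlab
% Sketch: cluster 2D detections with DBSCAN. Coordinates and tuning
% values are illustrative, not from a real sensor.
xy = [0.1 0.2; 0.3 0.1; 0.2 0.4;   % a tight group near the origin
      5.0 5.1; 5.2 4.9;            % a second group
      9.8 0.1];                    % an isolated point

epsilon = 1.0;  % how far apart points can be and still share a cluster
minpts  = 2;    % minimum number of points that make up a cluster

idx = dbscan(xy, epsilon, minpts); % cluster label per detection, -1 = noise
```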

And once the groups have been identified, the extent and position of the objects could be estimated using the distribution of the detections. We could fit an ellipse and claim the position is at the center of mass of the detections. And then a simple way to solve this tracking problem is to use this position and extent information with a Kalman filter to update the tracks.
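
Here's a rough sketch of that summarizing step for a single cluster: the centroid gives a position, and the eigenvectors and eigenvalues of the cluster's covariance give an orientation and a pair of semi-axis lengths. The points and the 3-sigma scaling are just illustrative choices, not what any particular tracker does internally.

```matlab
% Sketch: summarize one cluster of detections as a position plus an
% elliptical extent. Points and the 3-sigma scaling are illustrative.
pts = [0.0 0.0; 1.0 0.3; 2.1 0.5; 3.0 0.9; 4.1 1.1]; % detections in one cluster

center = mean(pts, 1);          % position estimate: centroid of the cluster

S = cov(pts);                   % sample covariance of the cluster
[V, D] = eig(S);                % principal axes and variances
[variances, order] = sort(diag(D), 'descend');
V = V(:, order);

L1 = 3*sqrt(variances(1));      % semi-axis lengths (3-sigma extent)
L2 = 3*sqrt(variances(2));
alpha = atan2(V(2,1), V(1,1));  % orientation of the major axis

% center, L1, L2, and alpha could then be fed to a Kalman filter as a
% pseudo-measurement of the object's position and extent.
```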

The benefit of this method is that we can take a scene with thousands of measurements, like we get with the LIDAR of an autonomous vehicle, and quickly condense all of this information down into as many points as there are objects in the scene, which can be computationally much simpler than trying to run a tracker that uses thousands of individual points.

But the downside of clustering like this is that it becomes really difficult when the tracked objects are near each other or are near static obstacles in the environment. The measurements are no longer in nice, discrete clusters, and there's some ambiguity as to which objects the measurements belong to. And if we make the wrong assumption during the clustering step-- like, assume that this group is just a single object-- then when we feed this wrong information into our tracking algorithm, it's going to produce a wrong result.

So we need to be careful to not commit to a cluster too soon until we're confident that a set of measurements actually belong to the same object. This is where the multi-hypothesis nature of the multi-object trackers is leveraged. For example, we don't need to necessarily commit to a single partitioning of the detections. Remember, before when we had eight different measurements, we walked through several different ways to partition the data. And we didn't know which was correct.

So here is one way to approach this uncertainty. We assume every partition is possible. And then we proceed with data association and tracking for every partition, all in parallel. We maintain the estimated states and probabilities for each of these hypotheses or guesses as to how the detections are partitioned.

And then at some point in the future, we get a new set of measurements, which we can use to update our filter. We proceed to partition them in every possible way. And we apply them to the different hypotheses that exist in the tracker. And at this point, some of the hypotheses are going to be extremely unlikely. These are the ones that produce clusters that deviate greatly in the measured state of the object compared to the estimated state.

We can then start to shed these hypotheses so we don't have to keep maintaining them forever going forward. And in this way, we're continuously adding new possible data partitions and shedding the old ones that are the least likely. And even though the filter keeps all of this information internally, what it would return is the tracks and the states of the objects from just the most likely hypothesis.
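
Conceptually, the bookkeeping looks something like this sketch: each hypothesis carries a weight, the weights are scaled by how well each hypothesis explains the new measurements, and anything that falls below a threshold gets shed. The numbers here are made up; a real tracker computes these likelihoods from its filters rather than having them handed to it.

```matlab
% Sketch: maintain weights for a handful of partition hypotheses, then
% prune the unlikely ones. All numbers are made up for illustration.
weights     = [0.25 0.25 0.25 0.25];  % prior hypothesis probabilities
likelihoods = [0.80 0.10 0.05 0.90];  % how well each hypothesis explains
                                      % the new set of measurements

weights = weights .* likelihoods;     % Bayes-style reweighting
weights = weights / sum(weights);     % normalize so they sum to 1

keep    = weights > 0.01;             % shed hypotheses below a threshold
weights = weights(keep) / sum(weights(keep));

[~, best] = max(weights);             % the tracker reports the tracks from
                                      % the most likely remaining hypothesis
```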

Now, clearly the number of partitions can be huge. For example, there are 4,140 different ways to partition eight measurements. And extended objects can return many more detections than that. So it's infeasible to maintain that many hypotheses in the tracker. Plus, most of these partitions would be junk anyway, because, for example, one of them is to just divide them up like this. And this grouping is going to be very unlikely.
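
That 4,140 figure is the Bell number for a set of eight elements, which counts every possible way to partition the set. If you want to check it yourself, here's a small sketch that computes it with the Bell triangle.

```matlab
% Sketch: count the ways to partition n detections (the Bell number)
% using the Bell triangle. For n = 8 this returns 4140.
n = 8;
row = 1;                      % first row of the Bell triangle
for k = 2:n
    newRow = zeros(1, k);
    newRow(1) = row(end);     % each row starts with the previous row's last entry
    for j = 2:k
        newRow(j) = newRow(j-1) + row(j-1);
    end
    row = newRow;
end
bellN = row(end);             % 4140 ways to partition 8 measurements
```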

Therefore, we need to find a way to reduce the number of partitions that we look at and preferably a way that keeps detections near each other in the same group. And one way to do this is to simply group the detections based on their relative distances to each other. Two detections would belong to the same group if their relative distances are less than some given threshold. And this is really similar to what we did with DBSCAN earlier, and it ensures that we don't end up with this crazy situation.

Now, to create the partitions, we can start with a large distance threshold to generate the first partition and then create additional ones by lowering that distance threshold, creating smaller and smaller groups. And you can see an example of this in the documentation for the partitionDetections function in the Sensor Fusion and Tracking Toolbox.

Here, we start with 10 measurements. And the function has returned the first six partitions. And as the distance threshold is reduced, you can see that more and more detection cells are returned. Again, we don't know which one of these partitions is true. So we're going to keep all of these as active hypotheses in the tracker.
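
If you want a feel for how a threshold sweep like that can be generated, here's a generic sketch using single-linkage hierarchical clustering from the Statistics and Machine Learning Toolbox. It's not how partitionDetections is implemented internally, and the coordinates and thresholds are made up for illustration.

```matlab
% Sketch: generate several partitions of the same detections by cutting a
% single-linkage hierarchy at decreasing distance thresholds.
xy = [0 0; 0.4 0.1; 1.0 0.2; 3.0 3.0; 3.3 3.2; 6.0 0.5]; % illustrative detections

Z = linkage(pdist(xy), 'single');  % merge detections by nearest distance

thresholds = [5 2 0.5];            % large threshold first, then smaller ones
partitions = cell(size(thresholds));
for i = 1:numel(thresholds)
    % Each entry assigns every detection to a cell (group) for that threshold.
    partitions{i} = cluster(Z, 'cutoff', thresholds(i), 'criterion', 'distance');
end
% partitions{1} lumps everything into one coarse group, while partitions{end}
% splits the detections into many small cells.
```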

But even this type of partitioning method can be misleading, because if two objects overlap like we saw with the LIDAR example, this way of partitioning would still categorize this as a single group. Or it might break it up into a bunch of smaller groups, more than the number of objects, as the distance threshold is reduced. Therefore, there are even smarter ways of partitioning where the output of the tracker is used to help determine the partitions.

This is called prediction partitioning. In this way, if the tracker believes that there are two objects in this area, we can tell the partitioning algorithm to create two groups here. Now, however partitioning is accomplished, what we end up with are many different hypotheses of how these detections are grouped. And the next step is to associate all of the detections in each hypothesis with a tracked object.

Earlier in this video, we estimated the extent and position from the measurements prior to feeding them into the tracking algorithm. And the problem with this approach is that the measurements don't tell you the whole story about the state of the object. And so we may be making bad assumptions about the object's extent and position.

For example, sometimes the car is only measured from the rear side and produces a measurement that looks like a line. And if we use the length and width from this measurement set, we would come up with a length of zero and a position that is biased towards the back of the car, which is clearly wrong. Therefore, it's beneficial to use more sophisticated sensor modeling approaches that make use of all of the detections to update extent, position, and the track all at the same time.

Now, conceptually, it works like this. The tracker is maintaining an estimate of an object state. That's its position and how it's moving and its extent. And let's assume it's estimating extent as a rectangle here. The tracker can then predict a future state of the object using its motion model. So now the tracker has an estimate of where the object should be at some point in the future.

A sensor model within the tracker can then predict what the sensor should return, assuming the tracker knows where the sensor is relative to where it thinks the object is. And then, when we get a real measurement and we've partitioned the detections into groups, we can update the entire state of the tracked object based on that mismatch between the predictions and the detections.
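
Here's a stripped-down sketch of that predict-and-correct pattern, using a constant-velocity motion model and the centroid of an associated group of detections as a pseudo-measurement. A real extended object tracker would also predict and update the extent through its sensor model; this only shows the position and velocity part, with made-up tuning values.

```matlab
% Sketch: one predict/update cycle for a tracked object with a
% constant-velocity model. All numbers are illustrative.
dt = 0.1;
F = [1 0 dt 0;                 % constant-velocity state transition
     0 1 0 dt;
     0 0 1  0;
     0 0 0  1];
H = [1 0 0 0;                  % we only measure position (x, y)
     0 1 0 0];
Q = 0.01 * eye(4);             % process noise (tuning value)
R = 0.25 * eye(2);             % measurement noise (tuning value)

x = [0; 0; 5; 0];              % state: [x; y; vx; vy]
P = eye(4);                    % state covariance

% Predict where the object should be at the next sample time.
x = F * x;
P = F * P * F' + Q;

% Update with the centroid of the detections assigned to this object.
detections = [0.6 0.1; 0.4 -0.1; 0.5 0.0];  % illustrative detection cell
z = mean(detections, 1)';                   % pseudo-measurement

S = H * P * H' + R;            % innovation covariance
K = P * H' / S;                % Kalman gain
x = x + K * (z - H * x);       % corrected state estimate
P = (eye(4) - K * H) * P;      % corrected covariance
```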

Of course, there's uncertainty in all of this. There's uncertainty in the estimate of our own location, in the starting estimate of the object state, and in the motion models that predict a future state. There's uncertainty in the estimate of the object's extent and in the sensor models that we use to predict detections. And of course, there's uncertainty in the sensor measurements themselves.

And keeping track of all of this uncertainty for all of the different hypotheses within the filter, and then using that uncertainty to determine how to update the state and which hypothesis is most likely, is the magic of extended object tracking algorithms. Now, of course, it's not magic, but the mathematics of how they accomplish this is beyond the scope of this video. I will link to other great resources that go into detail if you are interested.

The last thing I want to say about this-- at least in this video-- is that what I've described here is the general concept behind clustering-and-assignment type extended object trackers, where clustering the detections is done first and then the clusters are associated with tracked objects.

But there are other approaches to solving this problem that bypass the clustering step and directly associate individual detections with a tracked object. These are the so-called likelihood-based data association algorithms. There are links in the description to documents and videos that cover all of this. And hopefully this video has provided some amount of intuition into how extended object trackers differ from their point object counterparts.

And that's where I'm going to leave this video. In the next video, we're going to put all of the autonomous navigation parts together. That's mapping, localization, tracking, and planning. And then we're going to talk about how you can evaluate if the entire system is going to work.

So if you don't want to miss that or any other future Tech Talk videos, don't forget to subscribe to this channel. Also, if you want to check out my channel, Control System Lectures, I cover more control theory topics there as well. Thanks for watching, and I'll see you next time.