Deep Learning Overview for Control Systems (using Reinforcement Learning) | Deep Learning Webinars 2020, Part 3
From the series: Deep Learning Webinars 2020
Reinforcement learning allows you to solve control problems using deep learning. Instead of using labeled data, learning occurs through multiple simulations of the system of interest. This simulation data is used to train a policy represented by a deep neural network that would then replace a traditional controller or decision-making system.
Learn how to perform reinforcement learning using MathWorks products, including how to set up environment models, define the policy structure, and scale training through parallel computing to improve performance.
Published: 20 Oct 2020
All right. I think we can go ahead and get started. Hello, everybody. My name is Emmanuel, and I am a product manager at MathWorks focusing on reinforcement learning and control systems. Today I will be talking about reinforcement learning, and how you can use this technology to introduce artificial intelligence into your project. A few logistics before we begin. If you have any problems hearing the audio or seeing the presentation, please contact the webinar host by typing in the chat panel.
If you have any questions for me, you can type them into the Q&A panel on the right hand side of your screen, and I will answer those questions at the end of the presentation. Thank you in advance. My goal for the day is to make sure that after this session, you will remember three things.
Number one, what is reinforcement learning, and why you should care about it. Number two, how you can set up and solve a reinforcement learning problem. And number three, what are some benefits and drawbacks of reinforcement learning, and how can MathWorks tools help you with all of these. With that, let's go ahead and get started.
And the first thing I would like to talk about is this question right here. Why should we even care about reinforcement learning? Imagine for a second that you're a control engineer. And you're working on a project trying to design a control system that will allow a robot to walk. Now this robot has all types of sensors that you can use, and it has motors at each joint. There are arguably many different ways of approaching this problem with traditional control methods.
But a generic solution architecture could look like this. With a high level MPC controller generating trajectories for the hip, the knee, and the ankle. And a low level PID control module using these as references to generate the appropriate motor torques. So in this scenario you will have to work on three separate subsystems, or that number could be even larger if you consider the fact that low level control happens at the joint level.
But you would have to tune parameters for each one of those subsystems to get the robot to walk. What if there was a way to replace all these subsystems with a single black box controller that does exactly the same thing? This is where reinforcement learning comes into the picture. It allows you to design potentially end to end decision making systems that are self tuned through a training process.
So what is reinforcement learning? It is a type of machine learning that trains something called an agent through repeated interactions with an environment. How is it different from other machine learning techniques? Well, if you think of unsupervised learning, you're starting with a data set without labels and you could, for example, do things like clustering similar objects into groups. In supervised learning, in addition to what you had before, you now also have labels for your data. So you can do things like classifying objects and images.
With reinforcement learning, on the other hand, you do not have any data points to work with at the beginning. Data is being generated on the fly by having an agent interact with its surroundings, the environment. By doing that, the agent is able to learn dynamic behaviors by itself. So for example, in this particular scenario shown here, the agent could learn how to drive through traffic without colliding with other vehicles.
So where does deep learning fit into this picture? Well, deep learning actually spans all three types of machine learning shown here. One important thing to keep in mind is that deep learning and reinforcement learning, they're not mutually exclusive. In fact, for complex applications you might want to use deep neural networks to solve the problem. Which is something that's been referred to as deep reinforcement learning.
So how does training work? Well, it works through a trial and error process. And a real life equivalent example would be pet training. The dog in this scenario would be the equivalent of an agent in the reinforcement learning terminology. The agent is observing the environment, which here is the trainer, the stick, the voice commands, and things like that for example. Now these observations represent the state of the environment.
Based on these observations, the agent, the dog, takes an action. And this mapping from observations to actions is called the policy. Think of it as the dog's brain, or the dog's strategy. Every time the agent takes an action, it also receives a reward. That tells the agent how good or how bad that action was.
The purpose of the whole training process is to collect as much reward as possible. So during training there is this cycle happening where the dog will take an action, observe the reward and the environment state, update its strategy, and then start all over again. This updating of the strategy, or policy, happens through a training algorithm, a reinforcement learning algorithm. And the cycle is repeated until the agent knows which action will provide the best reward, the highest reward, based on the observations that it sees.
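To make that cycle concrete, here is a minimal sketch of the interaction loop in MATLAB-style pseudocode. The function names (resetEnvironment, step, updatePolicy) are placeholders for illustration only, not toolbox calls.

    % Conceptual agent-environment training loop (pseudocode, placeholder function names)
    for episode = 1:maxEpisodes
        obs = resetEnvironment();                     % initial state of the environment
        isDone = false;
        while ~isDone
            action = policy(obs);                     % agent acts based on its current policy
            [nextObs, reward, isDone] = step(action); % environment returns new state and reward
            policy = updatePolicy(policy, obs, action, reward, nextObs); % RL algorithm adjusts the policy
            obs = nextObs;
        end
    end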
Let's look at these concepts through an engineering example, and specifically a self-driving car. Now, in the self-driving car scenario the agent would be the equivalent of the vehicle's computer. The agent is reading measurements from sensors, and those could be LIDAR measurements, camera frames, and so on. And these sensor readings represent, as I mentioned earlier, the state of the environment.
For example, in this case that could be the vehicle's position, or the positions of other vehicles that are around our vehicle, and so on. Based on these observations, the agent generates an action using its current policy. And actions here could be things like steering the wheel or braking. The agent will then receive a reward that will specify how good or how bad that action was.
And that reward could be, for instance, related to fuel efficiency, or driver comfort. And the purpose of training, again, is to collect as much reward as possible. And over the course of training, the training algorithm will modify the parameters of the policy, and the agent will eventually know which action provides the highest reward.
Now after training is complete there is no notion of an agent, and there's no notion of a training algorithm. The trained policy is a standalone decision making system that, by design, optimizes long term reward collection. So I talked about policies, but what are some different ways to actually represent the policy in your code?
For simpler problems where the state space is small, the most straightforward way to do that, the most straightforward way to represent the policy, is using a lookup table. However, you can imagine that for more complex problems, lookup tables do not scale very well. So these days the most popular choice to represent the policy is with neural networks.
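To make that contrast concrete, a tabular policy is essentially just a value table indexed by state, as in the rough sketch below. The variable names are made up for illustration; this is plain MATLAB, not toolbox syntax.

    % Lookup-table policy: practical only for small, discrete state and action spaces
    Q = zeros(numStates, numActions);            % one value estimate per state-action pair
    [~, bestAction] = max(Q(currentState, :));   % greedy action for the current state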
Neural networks are great function approximators. They allow representation of complex policies, and if you use a neural network as a decision making system, that would mean taking in observations like camera frames or sensor signals, as I mentioned earlier. And the output of the network would be the next best action. OK. So we covered the basic concepts, but how can you set up and solve a reinforcement learning problem?
The good news is that, regardless of the type of reinforcement learning problem we're trying to solve, you can follow the exact same workflow. The first step in the workflow is deciding on how to represent your environment. The environment in reinforcement learning is everything outside the agent. And it could be either a real physical system or a virtual model.
As you can imagine, using real hardware with trial and error methods might not end up well, so virtual models are a safe alternative. One additional advantage of using virtual models is that we can easily simulate extreme conditions that are hard to emulate in real life. So if you think back to the walking robot example for instance, we could simulate the robot walking on ice, which is not something we could do easily in a real setting.
The next step is coming up with a reward. As I explained earlier, the reward is basically just a number that tells the agent how good or how bad an action is. One thing to keep in mind here, is that coming up with a reward signal, a reward function, can be quite challenging. And it may take a few iterations to get it right. Next, we decide on the policy representation.
Are we going to use a deep neural network? Are we going to use a table? Are we going to use a polynomial? And so on. At the same time, we are also creating the agent. Creating the agent is the equivalent of selecting a training algorithm. And with those four steps out of the way, we can start the training process. Keep in mind that reinforcement learning is a very data hungry technique. Most of the time a large number of simulations is necessary to even get a decent policy.
And this is where things like parallel computing and GPUs can help to accelerate the training process. Training could still take anywhere from minutes, to hours, or even days to complete, even in the case where you're using parallel simulations and GPUs, and that's something to keep in mind. After training converges, the last step is to deploy the trained policy, and make sure that it meets the performance expectations.
All right. So now you know what you need to do to set up the problem. But how do you actually solve it? The good news is that in the R2019a release of MATLAB we launched a new toolbox, Reinforcement Learning Toolbox, that allows you to go through all steps of the workflow using MATLAB and Simulink. The toolbox includes some popular training algorithms like DQN, DDPG, which is deep deterministic policy gradient, PPO, which is proximal policy optimization, and so on.
But you can also create your own custom training algorithms if you want. You can build environment models using MATLAB and Simulink, and you can also reuse existing scripts and models if you have them. You can use layers from Deep Learning Toolbox to represent policies as neural networks. You can parallelize training and deploy policies. And finally, the toolbox includes reference examples for getting started.
All right. So I will now go through an example to show you how you can use Reinforcement Learning Toolbox, and I'm going to use the same example I mentioned at the beginning. The walking robot. Remember here, the objective is to create a black box controller with a deep neural network that will get the robot to walk.
And the first thing we need to do here is create the environment. For this problem we already have a Simulink model of the walking robot that's built with Simscape Multibody. And as you can see, we are modeling each leg and the torso individually using blocks from Simscape Multibody.
If we look under the leg subsystem, you will see that we have individual joints for the ankle, for the knee, for the hip. We have blocks that represent coordinate frames, and so on. We are also modeling contact forces and friction between the ground and contact points at the robot's feet.
And lastly, we are using sensors to measure the states that we're interested in, and as you can see, we're going to feed those back into the reinforcement learning system and the agent. To get an idea of what types of observations we're looking into, these are things like the position and velocity of the robot. Those include joint angles, and so on and so forth. And that concludes the first step of the workflow.
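Before moving on, here is roughly what wiring that Simulink model up as a reinforcement learning environment looks like in code. The model name, block path, observation and action dimensions, and the reset variable are illustrative placeholders, not necessarily those of the shipping example.

    % Observation and action specifications (dimensions are illustrative)
    obsInfo = rlNumericSpec([31 1]);                                   % e.g., joint angles, velocities, torso states
    actInfo = rlNumericSpec([6 1], 'LowerLimit', -1, 'UpperLimit', 1); % e.g., normalized joint torques

    % Create the environment from the Simulink model and the RL Agent block
    mdl = 'walkingRobot';                                              % hypothetical model name
    env = rlSimulinkEnv(mdl, [mdl '/RL Agent'], obsInfo, actInfo);

    % Optionally randomize initial conditions at the start of every episode
    env.ResetFcn = @(in) setVariable(in, 'initialTorsoAngle', 0.05*randn); % placeholder variable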
Next, we will look at the reward signal. The question here is, how can we make sure that the robot learns how to take steps on its own? And to do that, we are going to use a reward that consists of four different terms that are added up together. For the first term, we will reward the robot if its forward velocity is positive.
To make sure that it does not deviate much from a straight line, we are going to add a couple of penalty terms on the y and z-axis displacement. We have a penalty term on the actuation effort. And then the last part of the reward is added to prevent a very common local minimum where the robot just dives forward to collect the most reward instead of taking steps. And then we're going to add up those terms to shape the final reward.
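Written out, a per-step reward of that shape could look roughly like the line below. The weights are placeholders you would tune; they are not the values used in the shipping example.

    % vx        : forward velocity            (reward forward progress)
    % y, z      : lateral/vertical deviation  (penalize drifting off a straight line)
    % u         : vector of joint torques     (penalize actuation effort)
    % notFallen : 1 while the robot is upright, 0 otherwise (small bonus per step,
    %             which discourages the "dive forward and fall" shortcut)
    reward = w1*vx - w2*y^2 - w3*z^2 - w4*sum(u.^2) + w5*notFallen;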
In the next step we will set up our network architecture. To do that we will use the Deep Network Designer app to design the neural network interactively. With Deep Network Designer you can drag and drop different layers from the panel on the left, connect them together to get the desired architecture, and also adjust parameters for each layer using the panel on the right.
Once you're finished setting up the network, you can also generate MATLAB code that creates the exact same network programmatically. That saves you the trouble of having to write all this code manually. With the policy architecture out of the way you can now set up the agent. In Simulink you can use the RL Agent block from Reinforcement Learning Toolbox to link to an agent object created in MATLAB.
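For reference, the code that Deep Network Designer generates is an ordinary array (or layerGraph) of Deep Learning Toolbox layers. A heavily simplified sketch of what an actor network for this problem might look like is shown below; the layer sizes and names are illustrative, not the exact architecture from the example.

    % Simplified actor network: maps observations to continuous joint-torque actions
    actorNet = [
        featureInputLayer(31, 'Name', 'observation')
        fullyConnectedLayer(256, 'Name', 'fc1')
        reluLayer('Name', 'relu1')
        fullyConnectedLayer(256, 'Name', 'fc2')
        reluLayer('Name', 'relu2')
        fullyConnectedLayer(6, 'Name', 'fcAction')
        tanhLayer('Name', 'action')];   % keeps actions bounded in [-1, 1]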
You can also specify the values of different hyperparameters for the training algorithm. In this case, as you can see, we're using an algorithm called DDPG, deep deterministic policy gradient. And we are also setting up some of the hyperparameters required by this algorithm. And then we can start the training process.
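In MATLAB, the agent setup just described might look roughly like the sketch below. It assumes the actorNet sketch above, a criticNet built the same way (not shown), and the obsInfo and actInfo specifications from the environment step; all option values are illustrative starting points, not recommendations.

    % Wrap the networks as actor and critic representations
    actor  = rlDeterministicActorRepresentation(actorNet, obsInfo, actInfo, ...
                 'Observation', {'observation'}, 'Action', {'action'});
    critic = rlQValueRepresentation(criticNet, obsInfo, actInfo, ...
                 'Observation', {'observation'}, 'Action', {'actionInput'});  % names must match layers in criticNet

    % DDPG hyperparameters (placeholder values)
    agentOpts = rlDDPGAgentOptions( ...
        'SampleTime', 0.025, ...
        'DiscountFactor', 0.99, ...
        'MiniBatchSize', 128, ...
        'ExperienceBufferLength', 1e6);
    agent = rlDDPGAgent(actor, critic, agentOpts);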
In the video on the left, you will notice that at the beginning of training the robot does not even know what it means to take a step. As the training progresses, it slowly learns how to take a couple of steps. It's still falling though, but if you let the training process run for a few minutes or hours, the robot will eventually learn how to walk.
During training you can monitor the progress of what's happening with the Episode Manager that you can see on the right. The Episode Manager provides useful information, like the episode reward values, number of training steps, and so on. In this case, if you wanted to parallelize the training process, it would be as easy as setting a flag in the training options.
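For reference, the training options and the parallel flag mentioned here could be set up roughly as follows; the stopping criteria and episode counts are illustrative values, not recommendations.

    % Training options; UseParallel is the flag that distributes episode simulations
    trainOpts = rlTrainingOptions( ...
        'MaxEpisodes', 5000, ...
        'MaxStepsPerEpisode', 400, ...
        'StopTrainingCriteria', 'AverageReward', ...
        'StopTrainingValue', 250, ...
        'UseParallel', true);                         % run simulations on parallel workers
    trainOpts.ParallelizationOptions.Mode = 'async';  % workers send experiences asynchronously

    trainingStats = train(agent, env, trainOpts);     % opens the Episode Manager during training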
Again, as I mentioned earlier, even though we are parallelizing training in this example, the total training time was somewhere between five and six hours. Something to keep in mind and plan for. Finally, after the training process is complete, you can deploy the trained policy and verify the results.
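A minimal sketch of that last step, assuming the agent and environment created above:

    % Verify the trained policy in simulation before deploying it
    simOpts = rlSimulationOptions('MaxSteps', 1000);
    experience = sim(env, agent, simOpts);

    % Generate a standalone policy evaluation function (creates evaluatePolicy.m),
    % which can then be used for code generation and deployment
    generatePolicyFunction(agent);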
You may have noticed so far that reinforcement learning looks a lot like a control design method. And in fact, it has many parallels to control design. The policy in the reinforcement learning system can be viewed as the equivalent of a traditional controller. The environment is the equivalent of the plant. Observations are the equivalent of measurements, and actions of manipulated variables. The reward signal in reinforcement learning is similar to a cost function in, let's say, optimal control, or the error from some desired control objective.
And finally, the reinforcement learning training algorithm is similar to an adaptation mechanism that changes the weights of a traditional controller. So far the majority of this talk has focused on control design problems. Reinforcement learning, however, has applications outside of controls as well. You can apply reinforcement learning to any problem that requires some decision making.
And that could be controls, autonomous driving, robotics, calibration problems, scheduling problems, optimization problems, and more. Some of these applications I just mentioned are covered with examples in our documentation for Reinforcement Learning Toolbox. So far we have examples for control design, robotics, autonomous driving, imitation learning, and we are actively working on adding more to this list.
The last thing I want to talk about is some pros and cons of reinforcement learning as a technology. As I explained at the beginning, unlike supervised and unsupervised learning, reinforcement learning does not require data points prior to training. Instead, data is being generated on the fly as the agent is interacting with the environment.
Another thing that makes this technology attractive is that it provides an AI-based alternative for problems that are hard to solve with traditional methods. This often leads to structural simplifications in a design, exploiting the ability of reinforcement learning to generate end to end solutions. But on the other hand, it is challenging to establish performance guarantees for trained policies.
And this is true, actually, for every technique that relies on neural networks. It's not specific to reinforcement learning. Also, as I mentioned earlier, reinforcement learning often requires many trials to even get a decent policy. And then finally, one of the biggest challenges in the reinforcement learning is to tune the large number of hyperparameters when you're actually setting up the problem.
Things like reward shaping, coming up with a network architecture that works, parameters specific to the training algorithms, they're not straightforward to tune. And unfortunately there are no generic guidelines that you can use either. The good news is that there is a natural answer to most of these points. And that's the ability to perform simulations.
MATLAB, Simulink, reinforcement learning toolbox, and other MathWorks tools can help you quickly set up your simulation models, and work around those challenges. For example, you can reuse existing MATLAB scripts and Simulink models to quickly set up the environment. You can use simulations to verify policies, and try to find scenarios where your policy might fail before actually moving to the real world.
You can accelerate training by spinning off many parallel simulations. And finally, you can consult our reference examples, included in the toolbox, to get an idea of what could be some reasonable initial values for the hyperparameters of your problem. And then iteratively tune these in simulation. So this is the end of my presentation. Hopefully, I was able to answer the three questions I raised at the beginning.
For additional information, I have included some resources right here. I talked about the documentation and the reference examples provided in the toolbox. I highly recommend watching the Tech Talk video series on reinforcement learning that we have created. These videos are available on our website, and they go through basic concepts from an engineer's perspective.
And then finally, we also have e-books on reinforcement learning available for free download at mathworks.com, and these also cover some basic concepts with some additional details and examples.