Can I have different sample times for my RL agent?

I'm training an RL agent for autonomous driving decision making, for lane keeping and overtaking. It outputs a command that can be either "lane keep", "right lane change" or "left lane change". A controller then uses this command to plan the maneuver. If the commanded lane change brings my vehicle outside the road boundaries, my simulation crashes (this depends on how the controller has been implemented). To avoid this, I've set the isDone flag to stop the training episode when the car is approaching the road boundaries.
My issue is that if I set a low sample time (0.1 s, for example) the isDone flag behaves correctly and stops the episode, but the output changes too rapidly and the controller can't complete the maneuver, because whatever the command is, it's quickly replaced by another one. The training goes on, but with little to no improvement.
On the other hand, if I use a higher sample time (5 s, for example), the maneuver is completed correctly, but the simulation crashes after a while, since the isDone flag doesn't trigger in time.
Ideally I would like a low sample time for the isDone flag and a higher one for the output. I could hold the output for some steps, but I don't think that would be useful for the training. Do you have any advice on how to solve this?
Thanks for the help.

Answers (1)

Amish 2023-10-9
Hi Calogero,
I understand that you're facing a trade-off between having a fast response for your isDone flag and providing the controller with a stable and continuous output signal. This is a common challenge in reinforcement learning.
Here are some commonly used strategies you can consider to strike a balance:
  • Prioritized Experience Replay (PER): Instead of having a fixed sample time for both isDone and the output, you can use prioritized experience replay. With PER, you assign priorities to each experience (state, action, reward, next state, done) in your replay buffer based on how important they are for learning. Experiences that lead to crashes or near-crashes can be assigned higher priorities. When you sample experiences from the buffer for training, you can give higher priority to the isDone experiences, allowing your agent to learn from them more frequently (a minimal setup sketch follows this list).
  • Action Smoothing: To provide a more stable and continuous output signal, you can implement action smoothing. Instead of changing the action command abruptly at each time step, you can average or smooth the actions over a few time steps. For example, if your agent outputs "right lane change", you can apply this action for the next few time steps before allowing the agent to switch to another action. This can help the controller complete maneuvers more smoothly (the command-hold sketch after this list shows one way to do this).
  • Mix Discrete and Continuous Actions: Instead of having your agent output discrete actions ("lane keep," "right lane change," etc.) at every time step, you can have it output continuous action values that represent the intensity or degree of the desired action. For example, the agent can output a continuous value for "lane change" that ranges from -1 (full left) to 1 (full right), and the controller can use this continuous value to smoothly adjust the vehicle's position within the lane.
  • Dynamic Sample Time: You can use a dynamic sample time approach where the isDone flag is checked at a faster rate than the output is changed. For example, you can set up your simulation loop to check the isDone flag every 0.1 seconds while allowing the agent to change its output every 5 seconds. This way, you get the benefit of a fast isDone response without rapid changes in the output (the same command-hold sketch after this list applies here).
  • Reward Shaping: Modify the reward function to penalize actions that bring the vehicle close to the road boundaries. This can help the agent learn to avoid actions that lead to crashes even if the isDone flag is checked less frequently (see the reward sketch after this list).
  • Exploration Strategies: Adjust the exploration strategy of your RL agent. For example, you can increase exploration noise in the action selection process to encourage the agent to explore different actions for longer durations before settling on a specific one.
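
For the prioritized replay idea, recent Reinforcement Learning Toolbox releases let you swap the agent's default experience buffer for a prioritized one. A minimal sketch, assuming a discrete-action DQN agent and an existing environment env (both assumptions), and that your toolbox release provides rlPrioritizedReplayMemory:

% Sketch: replace the default replay buffer with a prioritized one.
% Assumes "env" already exists and that your Reinforcement Learning
% Toolbox release provides rlPrioritizedReplayMemory.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

agent = rlDQNAgent(obsInfo, actInfo);     % default DQN agent for discrete actions
agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo, actInfo);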
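
For the action-smoothing and dynamic-sample-time ideas, you can keep the agent, the environment and the isDone check at the fast rate (e.g. 0.1 s) and simply latch the command seen by the controller for a fixed number of fast steps. A minimal sketch of a MATLAB Function block (or plain function) that does this; the function name, the reset input and the 0.1 s / 5 s timing are illustrative assumptions:

function heldCmd = holdLaneCommand(agentCmd, episodeReset)
% Latch the agent's lane command for holdSteps fast steps so the
% controller has time to complete the maneuver, while the environment
% (and its isDone check) keeps running at the fast rate.
holdSteps = 50;                        % 5 s held command / 0.1 s fast step

persistent currentCmd stepsLeft
if isempty(currentCmd) || episodeReset
    currentCmd = agentCmd;             % latch the first command of the episode
    stepsLeft  = holdSteps;
end

if stepsLeft <= 0
    currentCmd = agentCmd;             % accept a new command from the agent
    stepsLeft  = holdSteps;
end

stepsLeft = stepsLeft - 1;
heldCmd   = currentCmd;                % the controller always sees the held command
end

Note that holding the command downstream of the agent means the agent still issues (and is trained on) an action every fast step; whether that helps learning depends on how rewards line up with the held command, so it is worth comparing against simply running the agent at the slower rate.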
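
For reward shaping, a minimal sketch of a penalty that grows as the vehicle approaches the road boundary; the signal names, the quadratic shape and the weight of 10 are illustrative assumptions to tune for your setup:

function r = shapedLaneReward(baseReward, lateralOffset, roadHalfWidth)
% Subtract a penalty that grows as the vehicle approaches the road
% boundary, so unsafe lane changes are discouraged even when isDone is
% evaluated less often.
margin = roadHalfWidth - abs(lateralOffset);   % distance remaining to the boundary
margin = max(margin, 0);                       % clamp once the boundary is reached

penalty = 10 * (1 - margin/roadHalfWidth)^2;   % small near the center, largest at the edge
r = baseReward - penalty;
end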
You can experiment with one or more of these strategies and try to find the one that gives the best balance between output stability and a rapid isDone response for your autonomous driving decision-making RL agent.
Hope this helps!
