How to normalize the rewards in RL
I recently learned that normalizing rewards is a key step in RL: rewards can vary over a large range of magnitudes, and the function approximators used in RL are usually not invariant to the scale of their inputs. Normalization usually results in faster learning. I also learned that, to normalize all discounted rewards across all episodes, we compute the mean and standard deviation of all the discounted rewards, subtract the mean from each discounted reward, and divide by the standard deviation.
How can I implement this in MATLAB? Is it implemented internally in MATLAB? If not, how can I transfer the necessary variables between episodes?
2 comments
Umar
2024-8-4
Edited: Walter Roberson
2024-8-4
Hi @Danial Kazemikia ,
Calculate the mean and standard deviation of all discounted rewards across episodes. You can use MATLAB functions such as mean() and std() to compute these statistics. For more information on these functions, refer to their documentation pages.
Then subtract the mean from each discounted reward and divide by the standard deviation to normalize the rewards. To transfer the necessary variables between episodes, you can store them in the MATLAB workspace or in data structures such as arrays or cell arrays. For example, you can store the mean and standard deviation values for each episode in arrays.
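As a minimal sketch of these two steps, assume the per-step rewards of each finished episode are kept in a cell array. The names episodeRewards and gamma and the reward values are illustrative, not from the original post, and "discounted reward" is interpreted here as the discounted return, which is also an assumption:

```matlab
% Hypothetical data: per-step rewards of three finished episodes.
episodeRewards = {[1 0 2], [10 -5], [0 0 3 1]};
gamma = 0.99;   % assumed discount factor

% Discounted return G_t = r_t + gamma*G_{t+1}, computed backwards per episode.
returns = cell(size(episodeRewards));
for k = 1:numel(episodeRewards)
    r = episodeRewards{k};
    G = zeros(size(r));
    runningReturn = 0;
    for t = numel(r):-1:1
        runningReturn = r(t) + gamma*runningReturn;
        G(t) = runningReturn;
    end
    returns{k} = G;
end

% Pool all returns, then standardize with the shared mean and std.
allReturns = [returns{:}];
mu = mean(allReturns);
sigma = std(allReturns);
normalizedReturns = cellfun(@(G) (G - mu)/(sigma + 1e-10), returns, ...
    'UniformOutput', false);
```

The small constant in the denominator guards against division by zero when all returns are identical. Keeping the episodes in a cell array allows them to have different lengths.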
Please let me know if you have any further questions.
Answers (1)
Kaustab Pal
2024-8-6
Edited: Kaustab Pal
2024-8-6
Reward normalization is an important step in reinforcement learning (RL): it stabilizes training by keeping rewards on a consistent scale, and it improves convergence by providing a more uniform gradient signal.
You can implement this in MATLAB using a custom training loop. An example of writing a custom training loop can be found in the documentation here: https://www.mathworks.com/help/releases/R2024a/reinforcement-learning/ug/train-reinforcement-learning-policy-using-custom-training.html
The code below is a slight modification of that custom training loop, showing how to normalize the rewards:
for episodeCt = 1:numEpisodes
    episodeOffset = ...
        mod(episodeCt-1,trajectoriesForLearning)*maxStepsPerEpisode;
    % 1. Reset the environment at the start of the episode
    obs = reset(env);
    % Raw (unnormalized) per-step rewards for this episode.
    episodeReward = zeros(maxStepsPerEpisode,1);
    % 3. Generate experiences
    % for the maximum number of steps per episode
    % or until a terminal condition is reached.
    for stepCt = 1:maxStepsPerEpisode
        % Compute an action using the policy
        % based on the current observation.
        action = getAction(policy,{obs});
        % Apply the action to the environment
        % and obtain the resulting observation and reward.
        [nextObs,reward,isdone] = step(env,action{1});
        % Store the action, observation,
        % and reward experiences in their buffers.
        j = episodeOffset + stepCt;
        observationBuffer(:,:,j) = obs;
        actionBuffer(:,:,j) = action{1};
        episodeReward(stepCt) = reward;
        maskBuffer(:,j) = 1;
        obs = nextObs;
        % Stop if a terminal condition is reached.
        if isdone
            break;
        end
    end
    %%%%%%%%%% REWARD NORMALIZATION %%%%%%%%%%
    % Normalize once per episode, using only the steps actually taken
    % (stepCt holds the index of the last executed step).
    numSteps = stepCt;
    discount = 0.99.^(1:numSteps)';
    discounted_reward = discount.*episodeReward(1:numSteps);
    normalized_reward = (discounted_reward - mean(discounted_reward)) ...
        /(std(discounted_reward)+1e-10);
    % Write the normalized rewards into this episode's buffer slots.
    rewardBuffer(:,episodeOffset+(1:numSteps)) = normalized_reward';
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%%%%% REMAINING PART OF THE CUSTOM TRAINING LOOP GOES HERE %%%%%
end
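The loop above computes its statistics within a single episode. If the mean and standard deviation should instead carry over between episodes, as the question asks, one option is to keep running statistics in a variable defined before the episode loop. The following is only a sketch using Welford's online update. The struct stats, its fields, and the sample data are hypothetical names and values, not part of Reinforcement Learning Toolbox:

```matlab
% Hypothetical discounted rewards from three episodes (illustrative values).
episodes = {[1 0 2], [10 -5], [0 0 3 1]};

% Running statistics that persist across episodes (defined before the loop).
stats = struct('n', 0, 'mean', 0, 'M2', 0);

for k = 1:numel(episodes)
    x = episodes{k};

    % Welford's update: fold each new value into the running mean/variance.
    for i = 1:numel(x)
        stats.n = stats.n + 1;
        delta = x(i) - stats.mean;
        stats.mean = stats.mean + delta/stats.n;
        stats.M2 = stats.M2 + delta*(x(i) - stats.mean);
    end

    % Normalize this episode with the statistics seen so far.
    runningStd = sqrt(stats.M2/max(stats.n - 1, 1));
    normalized = (x - stats.mean)/(runningStd + 1e-10);
end
```

Because stats lives outside the loop, nothing else has to be passed between episodes. After the last episode, its mean and standard deviation match what mean() and std() would return on all values pooled together.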
Hope this helps.
With regards,
Kaustab