I am training a DDPG agent on randomly set straight lines (levels) and later testing it on a benchmark waveform. Shouldn't training stabilize over time and produce a stable model? The agent saved at 960 episodes seems to perform better than the one saved at 2180 episodes. Both agents were saved when the average reward over 50 episodes was > 25 K. The difference between the models saved at 940 and at 960 episodes also seems drastic.
The picture below shows the Episode Manager, where the average reward (over 50 episodes) goes up and down several times. One would expect it to look like the dark green line, stabilizing over time. What can I change to get a stable model?
Action space: 1.0 to 10.0, continuous
Test waveform: 2000 seconds long
Training sample time and simulation length: Ts = 1 and Tf = 250
Hyper-parameters: Learning Rates Critic = 1e-03, Actor = 1e-04 | Gamma (discount) = 0.95, Batch size = 64
Neurons: Observation path: FC1 = 64, FC2 = 24; Actor path: FC1 = 24
DDPG Noise Variance = 0.1, VarianceDecayRate = 1e-5 (I have also tried Noise Variance = 0.45 and decay rates of 1e-3, 1e-4, etc.)
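As a rough sanity check on these settings, the sketch below estimates how much exploration noise is left at the episodes where agents were saved, and how far ahead the discount factor effectively looks. It assumes the noise variance is multiplied by (1 - VarianceDecayRate) at every agent step, and that every episode runs the full Tf / Ts = 250 steps; both are assumptions, and the function names are made up for illustration. It is written in Python only to show the arithmetic.

```python
import math

# Values taken from the settings listed above.
INITIAL_VARIANCE = 0.1
DECAY_RATE = 1e-5
GAMMA = 0.95
STEPS_PER_EPISODE = 250  # assumption: Tf / Ts = 250 / 1, no early termination


def variance_after(initial_variance, decay_rate, steps):
    """Variance left after `steps` updates, assuming the per-step rule
    variance <- variance * (1 - decay_rate)."""
    return initial_variance * (1.0 - decay_rate) ** steps


def half_life_steps(decay_rate):
    """Number of steps for the variance to halve under the same rule."""
    return math.log(0.5) / math.log(1.0 - decay_rate)


def effective_horizon(gamma):
    """Common rule of thumb for how many steps ahead a discount factor 'sees'."""
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    for episodes in (940, 960, 2180):
        steps = episodes * STEPS_PER_EPISODE
        remaining = variance_after(INITIAL_VARIANCE, DECAY_RATE, steps)
        print(f"episode {episodes}: variance ~ {remaining:.5f}")

    half_life = half_life_steps(DECAY_RATE)
    print(f"half-life ~ {half_life:.0f} steps "
          f"(~ {half_life / STEPS_PER_EPISODE:.0f} episodes)")
    print(f"effective horizon ~ {effective_horizon(GAMMA):.0f} steps "
          f"of {STEPS_PER_EPISODE} per episode")
```

Under those assumptions the variance is roughly 0.009 at 960 episodes and roughly 0.0004 at 2180 episodes, so the agent is still exploring noticeably around the earlier save point, and a discount of 0.95 weights mainly the next 20 or so steps of each 250-step episode.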
(For a higher res. image please see attached)