Is my problem a problem for reinforcement learning? I tried 3 agents and none of them seemed to learn anything.

Hello, I have been trying to solve an optimization problem with a reinforcement learning algorithm for some months now, but I can't get the agent to learn the right thing and I am now stuck with this problem.
The Problem:
An observation vector of size [40 1], which is a recorded signal in the frequency domain, needs to be manipulated with an action vector of size [40 1], which is a damping vector, so that the resulting state vector of size [40 1] is equal to a target function, in my case a rect() function.
The actor has just one timestep to collect reward before the observation resets. So if Observation - Action = State != tarFunction, then Observation = InitialObservation and the agent needs to look for another action vector which may make InitialObservation look more like a rect() function. The agent gets a reward which grows the closer all 40 points of the state vector are to the 40 points of tarFunction.
For the first versions of the program the observation does not change after a reset / after getting a new observation. In the real application the observation will vary a bit due to noise and may change drastically if the device that is connected to the system and creates the signal changes.
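To make the target concrete: the rect() target is a 40-point vector, for example something roughly like this (the levels and width here are made-up illustration values, the real ones in my setup differ):
% Hypothetical 40-point rect() target in dB (illustrative values only)
tarFCT = -60*ones(40,1);    % assumed floor level
tarFCT(13:28) = -30;        % assumed flat top of the rectangle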
The Environment: LaserContinuousEnv17.m
The environment has a continuous observation space (UpperLimit -20 and LowerLimit -70) and action space (UpperLimit 35 and LowerLimit 0). The first/initial observation is saved as a property this.iSig, and every state is calculated in the step function with this formula:
Observation = this.iSig - Action;   % apply the damping vector to the initial signal
this.State  = Observation;
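For reference, the relevant parts of the environment look roughly like this (simplified sketch of the attached LaserContinuousEnv17.m; names like getReward are placeholders and the real property and method names may differ):
% Observation and action specs (continuous, 40x1, limits as described above)
obsInfo = rlNumericSpec([40 1], 'LowerLimit', -70, 'UpperLimit', -20);
actInfo = rlNumericSpec([40 1], 'LowerLimit',   0, 'UpperLimit',  35);

% Step function of the custom environment (one action per episode)
function [Observation, Reward, IsDone, LoggedSignals] = step(this, Action)
    Observation = this.iSig - Action;   % damp the initial signal
    this.State  = Observation;
    Reward      = getReward(this);      % placeholder for the reward code shown below
    IsDone      = true;                 % the episode ends after a single step
    LoggedSignals = [];
end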
The environment file is attached. The reward function is based on the RMSE and looks like this (this.tarFCT is the target function rect()):
RMSE   = sqrt(sum((this.State - this.tarFCT).^2) / size(this.State,1));   % root-mean-square error over the 40 points
perc   = abs((this.State - this.tarFCT) ./ this.tarFCT);                  % per-point relative deviation from the target
Reward = sum((1 - perc)*10) + sum(perc*(-1))*RMSE;                        % positive part plus RMSE-weighted penalty
The reward gets bigger the closer the state is to the target function. If all 40 points are equal to the target function, the reward reaches its maximum value. The reward has a positive reward part and a negative penalty part. The agent is meant to maximize part one and minimize part two in order to maximize the reward. A plot of the reward function is attached.
The Agent:
In total I tried DDPG, PG and TD3 agents on the problem, but none of the agents showed any signs of learning, even after 200k episodes and more. I also tried changing the learning rates, using different solvers like sgdm, rmsprop and adam, and using different weight and bias initializations, but with no effect at all. I attached the file for the agent configuration as well: "RFL_Agent_Training_17.m".
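In rough outline, the agent setup in RFL_Agent_Training_17.m looks like this (simplified sketch; the exact option values differ between my runs):
% Sketch of the DDPG agent and training setup (values are examples, not my final ones)
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

agentOpts = rlDDPGAgentOptions( ...
    'SampleTime', 1, ...
    'MiniBatchSize', 64, ...
    'ExperienceBufferLength', 1e6);

agent = rlDDPGAgent(actor, critic, agentOpts);   % actor/critic networks as described below

trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 200000, ...
    'MaxStepsPerEpisode', 1);        % one action per episode

trainStats = train(agent, env, trainOpts);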
The Neural Networks:
In the beginning I used quite simple neural networks for the agent, with just one or two fullyConnected layers and relu layers in between, and I varied the neurons per layer from a minimum of 40 up to about 4000, but the networks did not seem to be able to learn any rectangular shape or anything. The agent was just trying random actions all the time.
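The actor network looked roughly like this (sketch of the kind of layout I used; layer sizes varied between runs):
% Sketch of the actor network: 40 observation values in, 40 damping values out
actorLayers = [
    featureInputLayer(40, 'Name', 'observation')
    fullyConnectedLayer(400, 'Name', 'fc1')
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(400, 'Name', 'fc2')
    reluLayer('Name', 'relu2')
    fullyConnectedLayer(40, 'Name', 'action')   % 40x1 action vector
    ];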
My question is: is there anything fundamentally wrong with this program, so that the agent has no chance to learn anything? Or is it maybe just the wrong type of network that I used for critic 1, critic 2 and the actor? I used essentially the same network for all of these parts; the only difference between actor and critic was the output layer. The critic has one more layer to reduce the output to a scalar. Is it maybe necessary to build much more complex networks for this problem? At first glance this problem did not seem complicated at all.
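The critics were wired roughly like this (simplified sketch; the attached file contains the exact layout):
% Sketch of a critic (Q-value) network: observation and action paths merged,
% followed by one extra fullyConnectedLayer that reduces the output to a scalar
obsPath = [
    featureInputLayer(40, 'Name', 'observation')
    fullyConnectedLayer(400, 'Name', 'obsFC')
    ];
actPath = [
    featureInputLayer(40, 'Name', 'action')
    fullyConnectedLayer(400, 'Name', 'actFC')
    ];
commonPath = [
    additionLayer(2, 'Name', 'add')
    reluLayer('Name', 'relu')
    fullyConnectedLayer(1, 'Name', 'QValue')    % scalar output
    ];

criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, commonPath);
criticNet = connectLayers(criticNet, 'obsFC', 'add/in1');
criticNet = connectLayers(criticNet, 'actFC', 'add/in2');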
I was thinking of trying these steps next:
Different layers (convolutional, LSTM)
Change weight and bias initialization
Change normalization in the hidden layers
Try different learn rates for the actor and critics 1 & 2 (already ongoing work)
Change the optimizer (SGDM, Adam, …) and its parameters (already ongoing work; see the sketch after this list)
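For the learn-rate and optimizer experiments, I am varying the representation options roughly like this (sketch; the exact values change from run to run):
% Sketch of the options I vary for the actor and critic representations
actorOpts  = rlRepresentationOptions('LearnRate', 1e-4, 'Optimizer', 'adam', ...
                                     'GradientThreshold', 1);
criticOpts = rlRepresentationOptions('LearnRate', 1e-3, 'Optimizer', 'rmsprop', ...
                                     'GradientThreshold', 1);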
I'm really wondering why this is not working. Is the action space just too big, so that I need more constraints for the problem? If yes, how can I do that with the Reinforcement Learning Toolbox?
best regards
Kai

Answers (0)
