Hi,
I understand that the agent is converging to a policy that gives minimal rewards because of the way the reward is calculated in the ‘step’ function of the environment. The reward is based on the volume of a tree and is always negative, which means the agent is incentivized to minimize that volume. To see why the agent still ends up at a low-reward policy, you need to examine the reward function together with the environment dynamics:
% Negative, normalized volume: the larger the tree volume, the more negative the reward.
reward = -this.volume_tree/(0.5^2*pi*sum(this.Bin_In_Training(:, 3)));
It seems that the goal of your task is to find the minimal volume. However, rewards that are always negative can make convergence challenging, especially when the agent is trained with gradient-based methods.
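For intuition, here is how that formula behaves across a range of tree volumes: as long as the tree volume does not exceed the normalizing bin volume, the reward is confined to [-1, 0]. This is an illustration only, and totalBinVolume is a hypothetical stand-in for your normalizer, not a value from your code:
totalBinVolume = 10;                       % hypothetical value of 0.5^2*pi*sum(this.Bin_In_Training(:, 3))
volumes = linspace(0, totalBinVolume, 5);  % candidate tree volumes
rewards = -volumes./totalBinVolume         % rewards fall in [-1, 0]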
A few possible workarounds:
- Reward Scaling: Instead of leaving the rewards negative, consider shifting or scaling them into a positive range that still favors the agent's objective (see the first sketch after this list).
- Exploration: Ensure that your agent explores sufficiently during training. Adequate exploration lets the agent try a wider range of actions and states, which can help it find better policies (see the second sketch below).
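A minimal sketch of the reward-scaling idea, written for your step function. The min(...) clamp is a defensive assumption of mine in case the normalized volume can exceed 1; it is not part of your original code:
normalizedVolume = this.volume_tree/(0.5^2*pi*sum(this.Bin_In_Training(:, 3)));
% Shift into [0, 1]: smaller tree volumes now earn rewards closer to 1.
reward = 1 - min(normalizedVolume, 1);
Apart from the clamp, this is just a constant shift of the original reward, so it still rewards smaller volumes while giving the agent positive numbers to work with.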
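For exploration, the right knob depends on your agent type. As one example, assuming a DQN agent from the Reinforcement Learning Toolbox, you can keep the epsilon-greedy exploration rate higher for longer:
agentOpts = rlDQNAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon = 1.0;       % start fully random
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.05;   % never stop exploring entirely
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3; % slower decay = longer exploration
If you are instead using a stochastic-policy agent (PG, AC, PPO), raising EntropyLossWeight in the corresponding agent options serves a similar purpose.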
Reinforcement learning can be sensitive to many such factors, and it often takes experimentation and iterative adjustment to reach the desired results.