Hi Zonghao Zou,
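One signal you could use to decide when to stop training is the Q-values themselves. If the Q-values have saturated, meaning they have stopped changing from one evaluation to the next, the network is most likely no longer learning anything new. You could monitor your Q-values (for example, the average maximum Q-value over a fixed set of states) and choose a threshold on how much they change, then stop training early once the change falls below it. You don't need the end reward or a target point to do early stopping this way.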
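To make the idea concrete, here is a minimal sketch of such a check in plain Python. The names `QValuePlateauStopper` and `mean_max_q`, and the `threshold` and `patience` parameters, are my own placeholders and not part of any library. It assumes you evaluate the network periodically on a fixed set of states, and it treats the Q-values as saturated when the tracked statistic varies by less than `threshold` over the last `patience` evaluations. The right values for both depend on your reward scale, so treat the numbers below as made up.

```python
from collections import deque


class QValuePlateauStopper:
    """Signals a stop once the tracked Q-value statistic stops changing."""

    def __init__(self, threshold=1e-3, patience=5):
        self.threshold = threshold  # spread (max - min) still counted as "flat"
        self.patience = patience    # number of recent evaluations to compare
        self.history = deque(maxlen=patience)

    def update(self, mean_q):
        """Record one evaluation; return True when training should stop."""
        self.history.append(float(mean_q))
        if len(self.history) < self.patience:
            return False  # not enough evaluations collected yet
        return max(self.history) - min(self.history) < self.threshold


def mean_max_q(q_values_per_state):
    """Average of the greedy (max over actions) Q-value across states."""
    return sum(max(q) for q in q_values_per_state) / len(q_values_per_state)


# Usage with made-up evaluation results (one mean Q-value per evaluation).
stopper = QValuePlateauStopper(threshold=0.01, patience=3)
for step, mean_q in enumerate([0.5, 1.2, 1.8, 1.995, 2.0, 2.001]):
    if stopper.update(mean_q):
        print(f"Q-values look saturated at evaluation {step}, stopping")
        break
```

In a real training loop you would compute `mean_max_q` from your network's outputs on the same held-out states each time, so that changes in the statistic reflect changes in the network and not in the sampled states.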
Hope this helps.