I think that the advantage value [here](https://github.com/sweetice/Deep-reinforcement-learning-with-pytorch/blob/7b9fac7e5e40ffdc6f7ccb8b0a81e7841370a996/Char07%20PPO/PPO2.py#L129) should be based on the old (pre-update) critic: `target_v = reward + args.gamma * self.critic_net(next_state)` is currently re-evaluated inside the update loop, so the advantage shifts as `self.critic_net` is trained. It should instead be computed once, before the PPO update epochs, and held fixed.
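
A minimal runnable sketch of what I mean (the network, `gamma`, shapes, and tensor names are toy stand-ins for `self.critic_net`, `args.gamma`, and the batch tensors in PPO2.py, not the actual code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the objects in PPO2.py
critic_net = nn.Linear(4, 1)   # plays the role of self.critic_net
gamma = 0.99                   # plays the role of args.gamma
state = torch.randn(8, 4)
next_state = torch.randn(8, 4)
reward = torch.randn(8, 1)

# Compute the value target and advantage ONCE, with the critic as it is
# BEFORE the update epochs (the "old" critic), so the advantage used by
# the clipped surrogate objective stays fixed across all PPO epochs.
with torch.no_grad():
    target_v = reward + gamma * critic_net(next_state)
    advantage = target_v - critic_net(state)

# The PPO update epochs would then reuse the fixed `advantage` for the
# actor loss and regress critic_net(state) toward the fixed `target_v`,
# rather than recomputing both from the critic being updated.
```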