I'm tired of debugging DQN, so we're going to switch to a far simpler, proven and reliable algorithm: REINFORCE, with or without a baseline.
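To make the "with or without a baseline" distinction concrete, here is a minimal sketch of the per-step weights REINFORCE multiplies the log-probabilities by. The function names (`discounted_returns`, `reinforce_weights`), the default `gamma`, and the idea of passing the baseline as a plain list are assumptions for illustration, not anything fixed by the algorithm.

```python
from typing import List, Optional


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_weights(
    rewards: List[float],
    gamma: float = 0.99,
    baseline: Optional[List[float]] = None,
) -> List[float]:
    """Weights applied to log pi(a_t | s_t) in the REINFORCE gradient estimate.

    Without a baseline the weight is the return G_t.
    With a baseline b_t (hypothetical: e.g. a value estimate) it is G_t - b_t.
    """
    returns = discounted_returns(rewards, gamma)
    if baseline is None:
        return returns
    return [g - b for g, b in zip(returns, baseline)]


if __name__ == "__main__":
    episode_rewards = [1.0, 1.0, 1.0]
    print(reinforce_weights(episode_rewards, gamma=0.5))
    print(reinforce_weights(episode_rewards, gamma=0.5, baseline=[1.0, 1.0, 1.0]))
```

In a full training loop the policy loss would be the negative sum of `log_prob * weight` over the episode. Subtracting a baseline leaves the gradient estimate unbiased (as long as the baseline does not depend on the action) and typically lowers its variance.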