Hi, I found that in _search in po_rollout.pyx, the loop iterates over each action but never passes it to the rollout call:
    for action in legal_actions:
        rewards = []
        for i in range(self._num_sims // len(legal_actions)):
            state = self._agent.belief.random()
            total_discounted_reward = self._rollout(state, 0)
            rewards.append(total_discounted_reward)
Inside _rollout, every action, including the first one, is chosen by the rollout policy rather than being the action under evaluation:
    while depth <= self._max_depth:
        action = self._rollout_policy.rollout(state, history=history)
        next_state, observation, reward, nsteps = sample_generative_model(self._agent, state, action)
        ...
As a result, I think every action is evaluated with rollouts drawn from the same distribution, driven only by the rollout policy. The per-action average rewards are then essentially i.i.d. and do not reflect the action being evaluated.
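To make the expected behavior concrete, here is a minimal, self-contained Python sketch of what I would have expected: the action under evaluation is forced as the first step, and only the remaining steps follow the rollout policy. This is not the library's code. The names evaluate_actions, rollout, toy_step and random_policy, the toy model, and the exact discounting convention are all assumptions made for illustration.

```python
import random


def rollout(state, depth, step, rollout_policy, max_depth, discount, rng):
    """Discounted return from `state`, following only the rollout policy."""
    total, factor = 0.0, 1.0
    while depth <= max_depth:
        action = rollout_policy(state, rng)
        state, reward = step(state, action, rng)
        total += factor * reward
        factor *= discount
        depth += 1
    return total


def evaluate_actions(legal_actions, sample_state, step, rollout_policy,
                     num_sims, max_depth, discount=0.9, rng=None):
    """Average return per action, forcing that action as the first step."""
    rng = rng or random.Random(0)
    sims_per_action = max(1, num_sims // len(legal_actions))
    values = {}
    for action in legal_actions:
        total = 0.0
        for _ in range(sims_per_action):
            state = sample_state(rng)
            # First step: use the action under evaluation, not the policy.
            state, reward = step(state, action, rng)
            # Remaining steps: follow the rollout policy from depth 1.
            total += reward + discount * rollout(
                state, 1, step, rollout_policy, max_depth, discount, rng)
        values[action] = total / sims_per_action
    return values


# Hypothetical toy model: the state is ignored and "good" always pays 1.0.
def toy_step(state, action, rng):
    return state, (1.0 if action == "good" else 0.0)


def random_policy(state, rng):
    return rng.choice(["good", "bad"])


values = evaluate_actions(["good", "bad"], lambda rng: 0, toy_step,
                          random_policy, num_sims=400, max_depth=5)
print(values)
```

With the first step forced, the two estimates differ by the immediate reward of the evaluated action. If that first step is dropped, as in the code quoted above, both actions receive estimates drawn from the same distribution.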