
It seems that the importance sampling code part is wrong. #22

@yhy258

pytorch-trpo/main.py, lines 108 to 119 in e200eb8:

```python
fixed_log_prob = normal_log_density(Variable(actions), action_means, action_log_stds, action_stds).data.clone()

def get_loss(volatile=False):
    if volatile:
        with torch.no_grad():
            action_means, action_log_stds, action_stds = policy_net(Variable(states))
    else:
        action_means, action_log_stds, action_stds = policy_net(Variable(states))

    log_prob = normal_log_density(Variable(actions), action_means, action_log_stds, action_stds)
    action_loss = -Variable(advantages) * torch.exp(log_prob - Variable(fixed_log_prob))
    return action_loss.mean()
```

The `fixed_log_prob` assignment and the body of `get_loss` compute exactly the same quantity. Since the two run back to back, before any parameter update, the two values (`fixed_log_prob` and `log_prob`) are identical, so the importance ratio `exp(log_prob - fixed_log_prob)` evaluates to 1.
Is there a reason you wrote the code like this?
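To make the observation concrete, here is a minimal standalone sketch. It uses toy tensors and a hand-rolled Gaussian log-density standing in for `normal_log_density` (these are hypothetical, not the repo's actual code or data). At the point where `get_loss` is first evaluated, the ratio is exactly 1 for every sample, even though the loss still carries a gradient because `fixed_log_prob` is detached:

```python
import math
import torch

# Toy stand-ins for the repo's tensors (hypothetical values, single action dim).
log_std = torch.zeros(1, requires_grad=True)      # policy parameter (std = 1)
actions = torch.tensor([[0.5], [-0.3]])
advantages = torch.tensor([1.0, 2.0])

def gaussian_log_prob(log_std):
    # Log-density of a zero-mean Gaussian policy; plays the role of normal_log_density.
    var = torch.exp(log_std) ** 2
    return (-(actions ** 2) / (2 * var) - log_std - 0.5 * math.log(2 * math.pi)).sum(1)

# Snapshot of the "old" policy's log-probabilities, detached like fixed_log_prob.
fixed_log_prob = gaussian_log_prob(log_std).detach()

# First evaluation, before any parameter update: the ratio is exactly 1 ...
ratio = torch.exp(gaussian_log_prob(log_std) - fixed_log_prob)
print(ratio)            # tensor([1., 1.], grad_fn=...)

# ... yet the surrogate loss still has a nonzero gradient, because fixed_log_prob
# is a constant while log_prob is not.
loss = (-advantages * ratio).mean()
loss.backward()
print(log_std.grad)     # nonzero
```

Presumably the ratio only departs from 1 if `get_loss` is re-evaluated after the parameters change; whether that ever happens here is exactly what this issue is asking.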
