RLLib Colab stuck on pending #2

@Phoenix-surgere

Hello,

I've been following this great book faithfully, but I have a problem. In chapter 6, the first time RLlib is used, I get the following behavior both in Colab and locally: Tune stays stuck permanently on "PENDING" status with the following error:

(scheduler +3h2m14s) Error: No available node types can fulfill resource request {'CPU': 5.0}. Add suitable node types to this cluster to resolve this issue.
== Status ==
Current time: 2022-02-01 12:16:44 (running for 00:00:05.13)
Memory usage on this node: 2.6/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/0 GPUs, 0.0/6.28 GiB heap, 0.0/3.14 GiB objects
Result logdir: /root/ray_results/APEX_2022-02-01_12-16-39
Number of trials: 1/1 (1 PENDING)
Trial name	status	loc
APEX_CartPole-v0_cf36b_00000	PENDING	

While the code is this:

import pprint

import ray  # needed for the ray.shutdown() call below
from ray import tune
from ray.rllib.agents.dqn.apex import APEX_DEFAULT_CONFIG
from ray.rllib.agents.dqn.apex import ApexTrainer

# Stop any Ray instance left over from a previous notebook cell.
ray.shutdown()

if __name__ == "__main__":
    config = APEX_DEFAULT_CONFIG.copy()
    pp = pprint.PrettyPrinter(indent=4)
    config['env'] = "CartPole-v0"
    config['num_workers'] = 6
    config["num_gpus"] = 0

    #config['evaluation_num_workers'] = 1
    config['evaluation_interval'] = 1
    config['learning_starts'] = 50
    pp.pprint(config)

    tune.run(ApexTrainer, config=config)


I have no idea what to do. I already tried setting num_workers = 1 (instead of the 6 shown above) to accommodate the lower CPU count, but I still get the same error. I also do not understand what {'CPU': 5.0} means, since I don't see anything in the config that mentions a requirement of 5 CPUs. Any thoughts?
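
One possible reading of the {'CPU': 5.0} figure, offered as an unverified sketch rather than a confirmed diagnosis: if APEX reserves a single resource bundle for the driver plus its replay buffer shards, and the default is 1 driver CPU and 4 shards, that bundle alone would ask for 5 CPUs regardless of num_workers. The helper names `apex_cpu_bundles` and `fits_on_node` below are hypothetical, and the defaults (`num_cpus_for_driver=1`, `num_replay_buffer_shards=4`, `num_cpus_per_worker=1`) are assumptions that should be checked against the printed config. The code is plain Python arithmetic and does not call Ray.

```python
def apex_cpu_bundles(num_workers,
                     num_cpus_for_driver=1,        # assumed default
                     num_replay_buffer_shards=4,   # assumed default
                     num_cpus_per_worker=1):       # assumed default
    """Hypothetical model of the CPU bundles an APEX trial might request."""
    # Assumption: driver and replay buffer shards share one bundle.
    head = {"CPU": float(num_cpus_for_driver + num_replay_buffer_shards)}
    # Assumption: each rollout worker gets its own bundle.
    workers = [{"CPU": float(num_cpus_per_worker)} for _ in range(num_workers)]
    return [head] + workers


def fits_on_node(bundles, available_cpus):
    """True if the summed CPU request fits on a single node."""
    return sum(b["CPU"] for b in bundles) <= available_cpus


bundles = apex_cpu_bundles(num_workers=1)
print(bundles[0])                 # head bundle under the assumed defaults
print(fits_on_node(bundles, 2))   # compare against the 2 CPUs in the status output
```

Under these assumptions the head bundle alone exceeds the 2 CPUs reported in the status output ("0/2 CPUs"), which would be consistent with the trial staying PENDING even with num_workers = 1.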
