Skip to content

Training Simple Example issue #33

@Danielyijun

Description

@Danielyijun

Failure Logs

Hi,

Good day. I have tried to run example: Simple in your code. I have an issue when I was running the example.

CUDA Toolkit 9.0
CuDNN SDK v7
openmpi-3.0.0
NCCL 2.1.15(for cuda9.0)

Below is the result from running simple example:

/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:139880522606336:PARALLAX:�[31m
$ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null�[0m
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX:�[31m
$ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null�[0m
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX:�[31m
$ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null�[0m
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX:�[31m
$ ssh -p 22 10.0.0.103 "mkdir -p /tmp/parallax-jyi"�[0m
WARNING:139880522606336:PARALLAX:�[31m
$ echo 'bash -c "export schroot -c jyi -u jyi;export GRPC_POLL_STRATEGY=poll; CUDA_VISIBLE_DEVICES=1; export PARALLAX_LOG_LEVEL=20; export PARALLAX_HOSTNAME=10.0.0.103; export PARALLAX_SEARCH=False; source /home/jyi/parallax_venv/bin/activate; python3 /home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py "' | ssh -p 22 10.0.0.103 'cat > /tmp/parallax-jyi/mpi_run.sh; chmod 777 /tmp/parallax-jyi/mpi_run.sh'�[0m
WARNING:139880522606336:PARALLAX:�[31m
$ schroot -c jyi -u jyi;export GRPC_POLL_STRATEGY=poll; export CUDA_VISIBLE_DEVICES=1; source /home/jyi/parallax_venv/bin/activate; export PATH=/Home/.openmpi/bin:$PATH;export LD_LIBRARY_PATH=~/.openmpi/lib/:$LD_LIBRARY_PATH; mpirun -bind-to none -map-by slot --mca plm_rsh_no_tree_spawn 1 --mca orte_base_help_aggregate 0 -x NCCL_DEBUG=INFO -x PARALLAX_RUN_OPTION=PARALLAX_RUN_MPI -x PARALLAX_RESOURCE_INFO=master_10.0.0.103:40781:^ps_10.0.0.103:44002:^worker_10.0.0.103:46632:1 -np 1 -H 10.0.0.103:1 bash /tmp/parallax-jyi/mpi_run.sh 2>&1�[0m
/bin/sh: 1: schroot: not found
/bin/sh: 1: source: not found
bash: line 0: export: -c': not a valid identifier bash: line 0: export: -u': not a valid identifier
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
INFO:139646709864192:PARALLAX:parallel_run(PARALLAX_RUN_MPI)
INFO:139646709864192:PARALLAX:resource master_10.0.0.103:40781:^ps_10.0.0.103:44002:^worker_10.0.0.103:46632:1

[[43684,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: node03

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

2020-05-14 05:49:50.857407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1412] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.82GiB
2020-05-14 05:49:50.857465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1491] Adding visible gpu devices: 0
2020-05-14 05:49:51.340142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-14 05:49:51.340220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:978] 0
2020-05-14 05:49:51.340235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0: N
2020-05-14 05:49:51.340349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7535 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-05-14 05:49:51.602832: E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"
. Registered: device='CPU'

2020-05-14 05:49:51.602953: E tensorflow/core/common_runtime/executor.cc:630] Executor failed to create kernel. Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"
. Registered: device='CPU'

 [[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Traceback (most recent call last):
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"
. Registered: device='CPU'

 [[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in
tf.app.run()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
parallax.parallel_run(single_gpu_graph, resource_info)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 192, in parallax_run_mpi
config=sess_config)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 921, in init
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 643, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in init
_WrappedSession.init(self, self._create_session())
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
init_fn=self._scaffold.init_fn)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 287, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"
. Registered: device='CPU'

 [[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'w', defined at:
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in
tf.app.run()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
parallax.parallel_run(single_gpu_graph, resource_info)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 158, in parallax_run_mpi
tf.train.import_meta_graph(mpi_meta_graph_def)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1666, in import_meta_graph
meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1688, in _import_meta_graph_with_return_elements
**kwargs))
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
_ProcessNewOps(graph)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in
for c_op in c_api_util.new_tf_operations(self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3297, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in init
self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"
. Registered: device='CPU'

 [[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[43684,1],0]
Exit code: 1

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions