Skip to content

pytest test/integration error #379

@ChaiBapchya

Description

@ChaiBapchya

Test integration

pytest test/integration/sagemaker/test_horovod.py --docker-base-name sm-tf-horovod-integration --tag latest --framework-version 1.15.0 --processor gpu

Error stacktrace:

sagemaker.exceptions.UnexpectedStatusException: Error for Training job test-tf-horovod-1591768266-74da: Failed. Reason: Alg
orithmError: ExecuteUserScriptError:
E           Command "mpirun --host algo-1 -np 1 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tc
p_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_s
tatus 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/pyth
on3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERF
ACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x
 SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_O
UTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_D
IR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.

2020-06-10 05:55:55 Uploading - Uploading generated training model
2020-06-10 05:55:55 Failed - Training job failed
======================================================= short test summary info =======================================================
FAILED test/integration/sagemaker/test_horovod.py::test_distributed_training_horovod[gpu-3] - sagemaker.exceptions.UnexpectedStatusE...

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions