WorkerError in distributed training with cuda11.1

I train on an RTX 3090, which only supports CUDA versions ≥ 11.1, so I used “determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0” as the environment image. With this image, distributed training fails with an error.

The code is the mnist_pytorch example from “Docs > Tutorials > Quick Start Guide”.
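
A side note on one warning in the trial logs below: `global_batch_size` is reported as changed from 512 to 510 so that it divides equally across 10 slots, and the runner info shows `_per_slot_batch_size` of 51. The arithmetic can be sketched as follows. This is only an illustration of the rounding that would produce those numbers, not Determined's actual implementation, and the function name `split_global_batch` is hypothetical.

```python
from typing import Tuple


def split_global_batch(global_batch_size: int, slots: int) -> Tuple[int, int]:
    """Return (per_slot_batch_size, effective_global_batch_size).

    Hypothetical helper: rounds the global batch size down to the nearest
    multiple of the slot count so every slot gets the same number of samples.
    """
    if slots <= 0:
        raise ValueError("slots must be positive")
    per_slot = global_batch_size // slots  # integer division drops the remainder
    return per_slot, per_slot * slots


# Values taken from the logs: global_batch_size=512, slots_per_trial=10
per_slot, effective = split_global_batch(512, 10)
print(per_slot, effective)
```

With the values from the logs this gives a per-slot batch of 51 and an effective global batch of 510, matching the warning. The warning is therefore probably unrelated to the WorkerError itself; choosing a batch size divisible by the slot count (for example 520 with 10 slots) would avoid the adjustment.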

The trial logs:

[2021-05-10T07:34:42Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Pod resources allocated.
[2021-05-10T07:34:42Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Pod resources allocated.
[2021-05-10T07:34:42Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Pod resources allocated.
[2021-05-10T07:34:42Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Pod resources allocated.
[2021-05-10T07:34:42Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Pod resources allocated.
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-init-container
[2021-05-10T07:34:43Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-init-container
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-init-container
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-init-container
[2021-05-10T07:34:43Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-init-container
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-init-container
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-init-container
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-init-container
[2021-05-10T07:34:44Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-init-container
[2021-05-10T07:34:44Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-init-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-fluent-container
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-fluent-container
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-fluent-container
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-fluent-container
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:48Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-container
[2021-05-10T07:34:48Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-container
[2021-05-10T07:34:48Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-container
[2021-05-10T07:34:48Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-container
[2021-05-10T07:34:49Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-container
[2021-05-10T07:34:49Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-container
[2021-05-10T07:34:50Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-container
[2021-05-10T07:34:50Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-container
[2021-05-10T07:34:50Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-container
[2021-05-10T07:34:50Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-container
[2021-05-10T07:34:50Z] 13573602 || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:50Z] c393b38a || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:50Z] c393b38a || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:50Z] 13573602 || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:50Z] 13573602 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] c393b38a || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] 13573602 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] 13573602 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] c393b38a || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] c393b38a || + '[' -z '' ']'
[2021-05-10T07:34:50Z] 13573602 || + '[' -z '' ']'
[2021-05-10T07:34:50Z] c393b38a || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] c393b38a || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] 13573602 || + /bin/which python3
[2021-05-10T07:34:50Z] 13573602 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] 13573602 || + '[' /root = / ']'
[2021-05-10T07:34:50Z] c393b38a || + /bin/which python3
[2021-05-10T07:34:50Z] c393b38a || + '[' /root = / ']'
[2021-05-10T07:34:50Z] 13573602 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:50Z] c393b38a || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] c7b38690 || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] 86c9d22a || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 0797af9d || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 0797af9d || + '[' -z '' ']'
[2021-05-10T07:34:53Z] 0797af9d || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 0797af9d || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 86c9d22a || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] 0797af9d || + '[' /root = / ']'
[2021-05-10T07:34:53Z] 0797af9d || + /bin/which python3
[2021-05-10T07:34:53Z] 0797af9d || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] 86c9d22a || + '[' -z '' ']'
[2021-05-10T07:34:53Z] 86c9d22a || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 86c9d22a || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + /bin/which python3
[2021-05-10T07:34:53Z] 86c9d22a || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] 86c9d22a || + '[' /root = / ']'
[2021-05-10T07:34:53Z] c7b38690 || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] c7b38690 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] c7b38690 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] c7b38690 || + '[' -z '' ']'
[2021-05-10T07:34:53Z] c7b38690 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] c7b38690 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] c7b38690 || + '[' /root = / ']'
[2021-05-10T07:34:53Z] c7b38690 || + /bin/which python3
[2021-05-10T07:34:53Z] c7b38690 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:35:01Z] 13573602 || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] 13573602 || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:01Z] 13573602 || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c393b38a || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] c393b38a || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c393b38a || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:01Z] c7b38690 || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] c7b38690 || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c7b38690 || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:02Z] 13573602 || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] 13573602 || INFO: New trial runner in (container 13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-0de43703-9b09-14b9-86e6-8fdb3ba55cb4', 'GPU-4fb2a63e-162c-b338-a19d-57cf7346d36d'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] 13573602 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d
[2021-05-10T07:35:02Z] 13573602 || INFO: Connected to master
[2021-05-10T07:35:02Z] 13573602 || INFO: Established WebSocket session with master
[2021-05-10T07:35:02Z] c393b38a || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] c393b38a || INFO: New trial runner in (container c393b38a-923e-4d0d-8fcd-bf0967da86aa) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'c393b38a-923e-4d0d-8fcd-bf0967da86aa', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-f85c4ec5-d32d-01bb-c08e-d2b779314a9a', 'GPU-72adb4a9-d511-dbec-bd69-10960a34b452'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] c393b38a || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/c393b38a-923e-4d0d-8fcd-bf0967da86aa
[2021-05-10T07:35:02Z] c393b38a || INFO: Connected to master
[2021-05-10T07:35:02Z] c393b38a || INFO: Established WebSocket session with master
[2021-05-10T07:35:02Z] c7b38690 || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] c7b38690 || INFO: New trial runner in (container c7b38690-2d44-460f-877b-c9c8fdc15157) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'c7b38690-2d44-460f-877b-c9c8fdc15157', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a523e020-b504-535c-2b83-967ab28cdbab', 'GPU-8e73c217-85aa-c8c5-2551-52c59279e009'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] c7b38690 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/c7b38690-2d44-460f-877b-c9c8fdc15157
[2021-05-10T07:35:02Z] c7b38690 || INFO: Connected to master
[2021-05-10T07:35:02Z] c7b38690 || INFO: Established WebSocket session with master
[2021-05-10T07:35:04Z] 0797af9d || + cd /run/determined/workdir
[2021-05-10T07:35:04Z] 0797af9d || + test -f startup-hook.sh
[2021-05-10T07:35:04Z] 0797af9d || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:04Z] 86c9d22a || + cd /run/determined/workdir
[2021-05-10T07:35:04Z] 86c9d22a || + test -f startup-hook.sh
[2021-05-10T07:35:04Z] 86c9d22a || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:05Z] 0797af9d || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:05Z] 0797af9d || INFO: New trial runner in (container 0797af9d-18d4-41ee-9128-4f42e8bc1f39) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '0797af9d-18d4-41ee-9128-4f42e8bc1f39', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-afc0f67e-f5bf-0b0d-9bfd-abca42f2de36', 'GPU-2bfb4f26-19eb-b65f-08b1-1de7d47801e2'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:05Z] 0797af9d || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/0797af9d-18d4-41ee-9128-4f42e8bc1f39
[2021-05-10T07:35:05Z] 0797af9d || INFO: Connected to master
[2021-05-10T07:35:05Z] 0797af9d || INFO: Established WebSocket session with master
[2021-05-10T07:35:05Z] 86c9d22a || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:05Z] 86c9d22a || INFO: New trial runner in (container 86c9d22a-d207-4d66-abbb-5ae6165c16a4) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '86c9d22a-d207-4d66-abbb-5ae6165c16a4', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a11059eb-0891-239b-9231-a055bf282a20', 'GPU-4bafdd74-8f91-f260-1c06-8d48864ec2ee'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/86c9d22a-d207-4d66-abbb-5ae6165c16a4
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Connected to master
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Established WebSocket session with master
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 2, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] 0797af9d || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] 13573602 || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 4, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] c393b38a || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 1, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] c7b38690 || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 3, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] 86c9d22a || 2021-05-10 07:35:05.959396: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] 0797af9d || 2021-05-10 07:35:05.978576: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] c7b38690 || 2021-05-10 07:35:05.978783: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] 13573602 || 2021-05-10 07:35:05.997814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] c393b38a || 2021-05-10 07:35:05.997814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:07Z] c7b38690 || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 0797af9d || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 86c9d22a || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 13573602 || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] c393b38a || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 0797af9d || Traceback (most recent call last):
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/bin/horovodrun", line 5, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from horovod.runner.launch import run_commandline
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 34, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from horovod.runner.driver import driver_service
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/driver/driver_service.py", line 23, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from horovod.runner.common.service import driver_service
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/common/service/driver_service.py", line 18, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from horovod.runner.common.util import network
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/common/util/network.py", line 22, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     import cloudpickle
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/cloudpickle/__init__.py", line 3, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from cloudpickle.cloudpickle import *
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 151, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     _cell_set_template_code = _make_cell_set_template_code()
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 132, in _make_cell_set_template_code
[2021-05-10T07:35:07Z] 0797af9d ||     return types.CodeType(
[2021-05-10T07:35:07Z] 0797af9d || TypeError: an integer is required (got type bytes)
[2021-05-10T07:35:08Z] 0797af9d || INFO: WebSocket closed
[2021-05-10T07:35:08Z] 0797af9d || INFO: Disconnected from master, exiting gracefully
[2021-05-10T07:35:08Z] 0797af9d || Traceback (most recent call last):
[2021-05-10T07:35:08Z] 0797af9d ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2021-05-10T07:35:08Z] 0797af9d ||     return _run_code(code, main_globals, None,
[2021-05-10T07:35:08Z] 0797af9d ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2021-05-10T07:35:08Z] 0797af9d ||     exec(code, run_globals)
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-10T07:35:08Z] 0797af9d ||     main()
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-10T07:35:08Z] 0797af9d ||     build_and_run_training_pipeline(env)
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 135, in build_and_run_training_pipeline
[2021-05-10T07:35:08Z] 0797af9d ||     subproc.run()
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 268, in run
[2021-05-10T07:35:08Z] 0797af9d ||     self._do_startup_message_sequence()
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 246, in _do_startup_message_sequence
[2021-05-10T07:35:08Z] 0797af9d ||     responses, exception_received = self.broadcast_server.gather_with_polling(
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/ipc.py", line 150, in gather_with_polling
[2021-05-10T07:35:08Z] 0797af9d ||     health_check()
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 288, in _health_check
[2021-05-10T07:35:08Z] 0797af9d ||     raise det.errors.WorkerError("Training process died.")
[2021-05-10T07:35:08Z] 0797af9d || determined.errors.WorkerError: Training process died.
[2021-05-10T07:35:09Z] 0797af9d || INFO: container failed with non-zero exit code:  (exit code 1)
[2021-05-10T07:35:25Z] 13573602 || INFO: container failed with non-zero exit code:  (exit code 137)
[2021-05-10T07:35:26Z] c393b38a || INFO: container failed with non-zero exit code:  (exit code 137)
[2021-05-10T07:35:26Z] c7b38690 || INFO: container failed with non-zero exit code:  (exit code 137)
[2021-05-10T07:35:26Z] 86c9d22a || INFO: container failed with non-zero exit code:  (exit code 137)
Trial log stream ended. To reopen log stream, run: det trial logs -f 26

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
vishnu2kmohan commented, May 11, 2021

Awesome, glad to hear it!

Note: Determined does not rely on the version of CUDA installed on the host system. The NVIDIA driver on the machine/VM only needs to support the version of CUDA that we install in the Docker images for our task environments.

0 reactions
riokaa commented, May 11, 2021

Thanks! 🥳 I tried determinedai/environments:cuda-11.0-pytorch-1.7-lightning-1.2-tf-2.4-gpu-0.13.0 and it succeeded in const, distributed, and adaptive training, even though my CUDA version is 11.1.

I had mistakenly assumed that the CUDA versions had to match exactly, which is why I ran into the error.
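
The compatibility rule discussed in this thread can be sketched as a small check. This is only an illustration, not part of Determined: the helper names `cuda_version_from_image_tag` and `driver_supports_image` are hypothetical, the `cuda-<major>.<minor>` pattern is assumed from the image tags quoted above, and the driver's maximum supported CUDA version is something you supply yourself (for example, the "CUDA Version" shown in the `nvidia-smi` header). The comparison is deliberately simplified to "image CUDA version is not newer than what the driver supports".

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def cuda_version_from_image_tag(image: str) -> Optional[Version]:
    """Extract (major, minor) from a tag containing 'cuda-<major>.<minor>'.

    Hypothetical helper: assumes the tag naming pattern seen in this thread.
    """
    match = re.search(r"cuda-(\d+)\.(\d+)", image)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports_image(driver_max_cuda: Version, image: str) -> bool:
    """Return True if the image's CUDA version is not newer than the
    highest CUDA version the host driver supports.

    Simplified rule: the versions do not have to match exactly.
    """
    image_cuda = cuda_version_from_image_tag(image)
    if image_cuda is None:
        raise ValueError(f"no CUDA version found in image tag: {image!r}")
    # Tuples compare element by element, so (11, 0) <= (11, 1) is True.
    return image_cuda <= driver_max_cuda


if __name__ == "__main__":
    image = (
        "determinedai/environments:"
        "cuda-11.0-pytorch-1.7-lightning-1.2-tf-2.4-gpu-0.13.0"
    )
    # The driver in this thread reports CUDA 11.1.
    print(driver_supports_image((11, 1), image))
```

With a driver that reports CUDA 11.1, the CUDA 11.0 image passes the check, which matches the outcome reported above.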
