WorkerError in distributed training with cuda11.1
I am training on an RTX 3090, which only supports CUDA ≥ 11.1, so I used “determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0” as the environment image. With this image, distributed training fails with an error.
The code is the mnist_pytorch example from “Docs > Tutorials > Quick Start Guide”.
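The trial logs in this issue interleave output from five containers, each identified by an 8-character ID prefix before `||`. A small helper like the one below may make them easier to read per container. It is only a sketch and not part of Determined: the function name `group_by_container` is hypothetical, and the line format is inferred from the logs in this issue.

```python
import re
from collections import defaultdict

# Line shape inferred from the logs in this issue (an assumption):
#   [<timestamp>] <8-hex container id> || <message>
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<cid>[0-9a-f]{8}) \|\| (?P<msg>.*)$"
)


def group_by_container(lines):
    """Group log messages by container-id prefix (hypothetical helper)."""
    groups = defaultdict(list)
    last_cid = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = LOG_LINE.match(line)
        if match:
            last_cid = match.group("cid")
            groups[last_cid].append((match.group("ts"), match.group("msg")))
        elif last_cid is not None:
            # A line without the prefix is treated as the wrapped tail of
            # the previous message and appended to it.
            ts, msg = groups[last_cid][-1]
            groups[last_cid][-1] = (ts, msg + line)
    return dict(groups)


# Three lines copied from the logs in this issue.
sample = [
    "[2021-05-10T07:34:50Z] 13573602 || + WORKING_DIR=/run/determined/workdir",
    "[2021-05-10T07:34:50Z] c393b38a || + WORKING_DIR=/run/determined/workdir",
    "[2021-05-10T07:34:50Z] c393b38a || + STARTUP_HOOK=startup-hook.sh",
]
grouped = group_by_container(sample)
for cid, entries in grouped.items():
    print(cid, len(entries))
```

Lines that lack the prefix, such as the wrapped tail of a long `New trial runner` message, are attached to the preceding entry instead of being dropped.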
The trial logs:
[2021-05-10T07:34:42Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Pod resources allocated.
[2021-05-10T07:34:42Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Pod resources allocated.
[2021-05-10T07:34:42Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Pod resources allocated.
[2021-05-10T07:34:42Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Pod resources allocated.
[2021-05-10T07:34:42Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Pod resources allocated.
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-init-container
[2021-05-10T07:34:43Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-init-container
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-init-container
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-init-container
[2021-05-10T07:34:43Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-init-container
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-init-container
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-init-container
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-init-container
[2021-05-10T07:34:44Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-init-container
[2021-05-10T07:34:44Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-init-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-fluent-container
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-fluent-container
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-fluent-container
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-fluent-container
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:48Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-container
[2021-05-10T07:34:48Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-container
[2021-05-10T07:34:48Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-container
[2021-05-10T07:34:48Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-container
[2021-05-10T07:34:49Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-container
[2021-05-10T07:34:49Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-container
[2021-05-10T07:34:50Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-container
[2021-05-10T07:34:50Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-container
[2021-05-10T07:34:50Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-container
[2021-05-10T07:34:50Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-container
[2021-05-10T07:34:50Z] 13573602 || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:50Z] c393b38a || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:50Z] c393b38a || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:50Z] 13573602 || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:50Z] 13573602 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] c393b38a || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] 13573602 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] 13573602 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] c393b38a || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] c393b38a || + '[' -z '' ']'
[2021-05-10T07:34:50Z] 13573602 || + '[' -z '' ']'
[2021-05-10T07:34:50Z] c393b38a || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] c393b38a || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] 13573602 || + /bin/which python3
[2021-05-10T07:34:50Z] 13573602 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] 13573602 || + '[' /root = / ']'
[2021-05-10T07:34:50Z] c393b38a || + /bin/which python3
[2021-05-10T07:34:50Z] c393b38a || + '[' /root = / ']'
[2021-05-10T07:34:50Z] 13573602 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:50Z] c393b38a || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] c7b38690 || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] 86c9d22a || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 0797af9d || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 0797af9d || + '[' -z '' ']'
[2021-05-10T07:34:53Z] 0797af9d || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 0797af9d || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 86c9d22a || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] 0797af9d || + '[' /root = / ']'
[2021-05-10T07:34:53Z] 0797af9d || + /bin/which python3
[2021-05-10T07:34:53Z] 0797af9d || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] 86c9d22a || + '[' -z '' ']'
[2021-05-10T07:34:53Z] 86c9d22a || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 86c9d22a || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + /bin/which python3
[2021-05-10T07:34:53Z] 86c9d22a || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] 86c9d22a || + '[' /root = / ']'
[2021-05-10T07:34:53Z] c7b38690 || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] c7b38690 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] c7b38690 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] c7b38690 || + '[' -z '' ']'
[2021-05-10T07:34:53Z] c7b38690 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] c7b38690 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] c7b38690 || + '[' /root = / ']'
[2021-05-10T07:34:53Z] c7b38690 || + /bin/which python3
[2021-05-10T07:34:53Z] c7b38690 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:35:01Z] 13573602 || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] 13573602 || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:01Z] 13573602 || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c393b38a || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] c393b38a || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c393b38a || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:01Z] c7b38690 || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] c7b38690 || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c7b38690 || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:02Z] 13573602 || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] 13573602 || INFO: New trial runner in (container 13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-0de43703-9b09-14b9-86e6-8fdb3ba55cb4', 'GPU-4fb2a63e-162c-b338-a19d-57cf7346d36d'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] 13573602 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d
[2021-05-10T07:35:02Z] 13573602 || INFO: Connected to master
[2021-05-10T07:35:02Z] 13573602 || INFO: Established WebSocket session with master
[2021-05-10T07:35:02Z] c393b38a || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] c393b38a || INFO: New trial runner in (container c393b38a-923e-4d0d-8fcd-bf0967da86aa) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'c393b38a-923e-4d0d-8fcd-bf0967da86aa', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-f85c4ec5-d32d-01bb-c08e-d2b779314a9a', 'GPU-72adb4a9-d511-dbec-bd69-10960a34b452'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] c393b38a || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/c393b38a-923e-4d0d-8fcd-bf0967da86aa
[2021-05-10T07:35:02Z] c393b38a || INFO: Connected to master
[2021-05-10T07:35:02Z] c393b38a || INFO: Established WebSocket session with master
[2021-05-10T07:35:02Z] c7b38690 || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] c7b38690 || INFO: New trial runner in (container c7b38690-2d44-460f-877b-c9c8fdc15157) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'c7b38690-2d44-460f-877b-c9c8fdc15157', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a523e020-b504-535c-2b83-967ab28cdbab', 'GPU-8e73c217-85aa-c8c5-2551-52c59279e009'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] c7b38690 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/c7b38690-2d44-460f-877b-c9c8fdc15157
[2021-05-10T07:35:02Z] c7b38690 || INFO: Connected to master
[2021-05-10T07:35:02Z] c7b38690 || INFO: Established WebSocket session with master
[2021-05-10T07:35:04Z] 0797af9d || + cd /run/determined/workdir
[2021-05-10T07:35:04Z] 0797af9d || + test -f startup-hook.sh
[2021-05-10T07:35:04Z] 0797af9d || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:04Z] 86c9d22a || + cd /run/determined/workdir
[2021-05-10T07:35:04Z] 86c9d22a || + test -f startup-hook.sh
[2021-05-10T07:35:04Z] 86c9d22a || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:05Z] 0797af9d || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:05Z] 0797af9d || INFO: New trial runner in (container 0797af9d-18d4-41ee-9128-4f42e8bc1f39) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '0797af9d-18d4-41ee-9128-4f42e8bc1f39', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-afc0f67e-f5bf-0b0d-9bfd-abca42f2de36', 'GPU-2bfb4f26-19eb-b65f-08b1-1de7d47801e2'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:05Z] 0797af9d || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/0797af9d-18d4-41ee-9128-4f42e8bc1f39
[2021-05-10T07:35:05Z] 0797af9d || INFO: Connected to master
[2021-05-10T07:35:05Z] 0797af9d || INFO: Established WebSocket session with master
[2021-05-10T07:35:05Z] 86c9d22a || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:05Z] 86c9d22a || INFO: New trial runner in (container 86c9d22a-d207-4d66-abbb-5ae6165c16a4) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '86c9d22a-d207-4d66-abbb-5ae6165c16a4', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 
'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a11059eb-0891-239b-9231-a055bf282a20', 'GPU-4bafdd74-8f91-f260-1c06-8d48864ec2ee'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/86c9d22a-d207-4d66-abbb-5ae6165c16a4
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Connected to master
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Established WebSocket session with master
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 2, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] 0797af9d || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] 13573602 || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 4, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] c393b38a || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 1, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] c7b38690 || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 3, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] 86c9d22a || 2021-05-10 07:35:05.959396: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] 0797af9d || 2021-05-10 07:35:05.978576: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] c7b38690 || 2021-05-10 07:35:05.978783: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] 13573602 || 2021-05-10 07:35:05.997814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] c393b38a || 2021-05-10 07:35:05.997814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:07Z] c7b38690 || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 0797af9d || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 86c9d22a || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 13573602 || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] c393b38a || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 0797af9d || Traceback (most recent call last):
[2021-05-10T07:35:07Z] 0797af9d || File "/opt/conda/bin/horovodrun", line 5, in <module>
[2021-05-10T07:35:07Z] 0797af9d || from horovod.runner.launch import run_commandline
[2021-05-10T07:35:07Z] 0797af9d || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 34, in <module>
[2021-05-10T07:35:07Z] 0797af9d || from horovod.runner.driver import driver_service
[2021-05-10T07:35:07Z] 0797af9d || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/driver/driver_service.py", line 23, in <module>
[2021-05-10T07:35:07Z] 0797af9d || from horovod.runner.common.service import driver_service
[2021-05-10T07:35:07Z] 0797af9d || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/common/service/driver_service.py", line 18, in <module>
[2021-05-10T07:35:07Z] 0797af9d || from horovod.runner.common.util import network
[2021-05-10T07:35:07Z] 0797af9d || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/common/util/network.py", line 22, in <module>
[2021-05-10T07:35:07Z] 0797af9d || import cloudpickle
[2021-05-10T07:35:07Z] 0797af9d || File "/opt/conda/lib/python3.8/site-packages/cloudpickle/__init__.py", line 3, in <module>
[2021-05-10T07:35:07Z] 0797af9d || from cloudpickle.cloudpickle import *
[2021-05-10T07:35:07Z] 0797af9d || File "/opt/conda/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 151, in <module>
[2021-05-10T07:35:07Z] 0797af9d || _cell_set_template_code = _make_cell_set_template_code()
[2021-05-10T07:35:07Z] 0797af9d || File "/opt/conda/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 132, in _make_cell_set_template_code
[2021-05-10T07:35:07Z] 0797af9d || return types.CodeType(
[2021-05-10T07:35:07Z] 0797af9d || TypeError: an integer is required (got type bytes)
[2021-05-10T07:35:08Z] 0797af9d || INFO: WebSocket closed
[2021-05-10T07:35:08Z] 0797af9d || INFO: Disconnected from master, exiting gracefully
[2021-05-10T07:35:08Z] 0797af9d || Traceback (most recent call last):
[2021-05-10T07:35:08Z] 0797af9d || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2021-05-10T07:35:08Z] 0797af9d || return _run_code(code, main_globals, None,
[2021-05-10T07:35:08Z] 0797af9d || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2021-05-10T07:35:08Z] 0797af9d || exec(code, run_globals)
[2021-05-10T07:35:08Z] 0797af9d || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-10T07:35:08Z] 0797af9d || main()
[2021-05-10T07:35:08Z] 0797af9d || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-10T07:35:08Z] 0797af9d || build_and_run_training_pipeline(env)
[2021-05-10T07:35:08Z] 0797af9d || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 135, in build_and_run_training_pipeline
[2021-05-10T07:35:08Z] 0797af9d || subproc.run()
[2021-05-10T07:35:08Z] 0797af9d || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 268, in run
[2021-05-10T07:35:08Z] 0797af9d || self._do_startup_message_sequence()
[2021-05-10T07:35:08Z] 0797af9d || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 246, in _do_startup_message_sequence
[2021-05-10T07:35:08Z] 0797af9d || responses, exception_received = self.broadcast_server.gather_with_polling(
[2021-05-10T07:35:08Z] 0797af9d || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/ipc.py", line 150, in gather_with_polling
[2021-05-10T07:35:08Z] 0797af9d || health_check()
[2021-05-10T07:35:08Z] 0797af9d || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 288, in _health_check
[2021-05-10T07:35:08Z] 0797af9d || raise det.errors.WorkerError("Training process died.")
[2021-05-10T07:35:08Z] 0797af9d || determined.errors.WorkerError: Training process died.
[2021-05-10T07:35:09Z] 0797af9d || INFO: container failed with non-zero exit code: (exit code 1)
[2021-05-10T07:35:25Z] 13573602 || INFO: container failed with non-zero exit code: (exit code 137)
[2021-05-10T07:35:26Z] c393b38a || INFO: container failed with non-zero exit code: (exit code 137)
[2021-05-10T07:35:26Z] c7b38690 || INFO: container failed with non-zero exit code: (exit code 137)
[2021-05-10T07:35:26Z] 86c9d22a || INFO: container failed with non-zero exit code: (exit code 137)
Trial log stream ended. To reopen log stream, run: det trial logs -f 26
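The first traceback above fails inside cloudpickle at `types.CodeType(` under Python 3.8. That pattern is consistent with a cloudpickle release that predates Python 3.8's change to the `types.CodeType` constructor, which added a `posonlyargcount` argument, although the log alone does not confirm which version is installed. A minimal sketch for collecting the relevant versions inside the task container follows; the helper names `installed_version` and `version_report` are hypothetical and not part of Determined or Horovod.

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(packages=("cloudpickle", "horovod")):
    """Collect the interpreter version and the versions of the given packages."""
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for name, version in version_report().items():
        print(name, version)
```

Running this inside the same image as the failing trial shows whether the bundled cloudpickle is older than the interpreter expects; a missing package is reported as `None` rather than raising.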
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (5 by maintainers)
Top GitHub Comments
Awesome, glad to hear it!
Note: Determined only requires that the NVIDIA driver on the machine/VM support the version of CUDA installed in the Docker images for our task environments. It does not rely on the version of CUDA installed on the host system.
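The rule in the note above can be sketched as a small check: read the CUDA version from the image tag and compare it with the highest CUDA version the host driver supports (the "CUDA Version" shown in the `nvidia-smi` header). This is a simplified illustration that ignores CUDA's minor-version compatibility; the function names and the `driver_max_cuda` parameter are hypothetical and not part of Determined.

```python
import re


def cuda_version_from_tag(image):
    """Extract (major, minor) from a tag such as '...:cuda-11.0-pytorch-1.7-...'."""
    match = re.search(r"cuda-(\d+)\.(\d+)", image)
    if match is None:
        raise ValueError("no CUDA version found in image name: %r" % image)
    return int(match.group(1)), int(match.group(2))


def driver_can_run(image, driver_max_cuda):
    """True if the image's CUDA version does not exceed what the driver supports.

    driver_max_cuda is an assumed (major, minor) tuple taken from nvidia-smi.
    """
    return cuda_version_from_tag(image) <= driver_max_cuda


if __name__ == "__main__":
    image = ("determinedai/environments:"
             "cuda-11.0-pytorch-1.7-lightning-1.2-tf-2.4-gpu-0.13.0")
    # A driver that supports up to CUDA 11.1 can run a CUDA 11.0 image.
    print(driver_can_run(image, (11, 1)))
```

The comparison is one-directional on purpose: the driver must be at least as new as the image's CUDA, but the two versions do not have to match.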
Thanks!🥳 I tried
determinedai/environments:cuda-11.0-pytorch-1.7-lightning-1.2-tf-2.4-gpu-0.13.0
and it succeeded in const, distributed, and adaptive training, even though my CUDA version is 11.1. I had mistakenly assumed that the CUDA versions had to match exactly, which is why I encountered the error.