ClassyVision distributed training hangs after scaling down training nodes


🐛 Bug

Component (check all that apply):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

Background

We are also trying to support PET on K8s. The ImageNet example is already supported without any issue; see: https://github.com/microsoft/frameworkcontroller/tree/master/example/framework/scenario/pytorch/elastic.

However, there are some issues with supporting distributed ClassyVision, the main one being this hang. This issue MAY already have been reported in https://github.com/pytorch/elastic/issues/25, but that issue is no longer active.

To Reproduce

Reproduce Key Points

  1. The current ClassyVision K8s example does not seem to be “distributed training”, since --distributed_backend is not set to ddp. So I used ddp and found this issue.
  2. The current ClassyVision default example only has num_epochs set to 2, so it runs very briefly (<1 min), and you can hardly test scaledown during such a short run. So I increased num_epochs to 100.
  3. The current ClassyVision K8s example cannot specify --num_workers=0, as ClassyVision crashes with the error classy_train.py: error: unrecognized arguments: --num_workers=0. I then tried adding num_workers: 0 to the ClassyVision config file, like this UT, but it still crashes with ValueError: multiprocessing_context can only be used with multi-process loading (num_workers > 0), but got num_workers=0. So I do not specify num_workers anywhere, and instead use this K8s approach to expose shared memory to the containers (a sketch of the attempted config fragment follows this list).
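
For reference, the attempted config-file change in point 3 looked roughly like the fragment below. This is a minimal sketch: the only change relative to the template config is the added "num_workers": 0 line, and its exact placement inside the dataset config is an assumption based on the referenced unit test. As described above, this variant still crashes with the multiprocessing_context ValueError, which is why the shared-memory volume approach is used instead.

{
    "dataset": {
        "train": {
            "name": "my_dataset",
            "batchsize_per_replica": 32,
            "use_shuffle": true,
            "num_workers": 0
        }
    }
}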

Steps to reproduce the issue on K8s (you can do a similar thing without K8s; K8s is used here for simplicity):

  1. Assume etcd is already set up and its address is pet-etcd:2379
  2. Create P2P discovery Service on K8s:
apiVersion: v1
kind: Service
metadata:
  name: cv-test
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: cv-test
  3. Create the 3 Pods below on K8s, each with the {{INDEX}} placeholder instantiated to 0, 1, or 2 (see the kubectl sketch after these steps):
apiVersion: v1
kind: Pod
metadata:
  name: cv-test-{{INDEX}}
  labels:
    app: cv-test
spec:
  hostname: cv-test-{{INDEX}}
  subdomain: cv-test
  containers:
  - name: elasticjob-worker
    image: torchelastic/examples:0.2.0
    imagePullPolicy: Always
    command: [
      "bash", "-c",
      "sed -i -e 's/\"num_epochs\": 2/\"num_epochs\": 100/g'
      /workspace/classy_vision/configs/template_config.json &&
      python -m torchelastic.distributed.launch
      --rdzv_backend=etcd
      --rdzv_endpoint=pet-etcd:2379
      --rdzv_id=cv-test
      --nnodes=1:4
      --nproc_per_node=1
      /workspace/classy_vision/classy_train.py
      --config_file /workspace/classy_vision/configs/template_config.json
      --distributed_backend ddp"]
    volumeMounts:
    - name: shm-volume
      mountPath: /dev/shm
  volumes:
  - name: shm-volume
    emptyDir:
      medium: Memory
  4. All Pods will train, with logs like the following:
[INFO] 2020-08-03 07:53:06,129 launch: Running torchelastic.distributed.launch with args: ['/opt/conda/lib/python3.7/site-packages/torchelastic/distributed/launch.py', '--rdzv_backend=etcd', '--rdzv_endpoint=pet-etcd:2379', '--rdzv_id=cv-test', '--nnodes=1:4', '--nproc_per_node=1', '/workspace/classy_vision/classy_train.py', '--config_file', '/workspace/classy_vision/configs/template_config.json', '--distributed_backend', 'ddp']
INFO 2020-08-03 07:53:06,136 Etcd machines: ['http://0.0.0.0:2379']
[INFO] 2020-08-03 07:53:06,144 launch: Using nproc_per_node=1.
[INFO] 2020-08-03 07:53:06,890 api: [default] starting workers for function: wrapper_fn
[INFO] 2020-08-03 07:53:06,890 api: [default] Rendezvous'ing worker group
INFO 2020-08-03 07:53:06,890 Attempting to join next rendezvous
INFO 2020-08-03 07:53:06,894 Observed existing rendezvous state: {'status': 'final', 'version': '15', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0'], 'num_workers_waiting': 0}
INFO 2020-08-03 07:53:06,988 Added self to waiting list. Rendezvous full state: {"status": "final", "version": "15", "participants": [0], "keep_alives": ["/torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0"], "num_workers_waiting": 1}
INFO 2020-08-03 07:53:06,990 Keep-alive key /torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0 is not renewed.
INFO 2020-08-03 07:53:06,990 Rendevous version 15 is incomplete. 
INFO 2020-08-03 07:53:06,990 Attempting to destroy it.
INFO 2020-08-03 07:53:06,991 Destroyed rendezvous version 15 successfully.
INFO 2020-08-03 07:53:06,992 Previously existing rendezvous state changed. Will re-try joining.
INFO 2020-08-03 07:53:06,992 Attempting to join next rendezvous
INFO 2020-08-03 07:53:06,999 New rendezvous state created: {'status': 'joinable', 'version': '16', 'participants': []}
INFO 2020-08-03 07:53:07,012 Joined rendezvous version 16 as rank 0. Full state: {'status': 'joinable', 'version': '16', 'participants': [0]}
INFO 2020-08-03 07:53:07,013 Rank 0 is responsible for join last call.
INFO 2020-08-03 07:53:38,023 Rank 0 finished join last call.
INFO 2020-08-03 07:53:38,024 Waiting for remaining peers.
INFO 2020-08-03 07:53:38,025 All peers arrived. Confirming membership.
INFO 2020-08-03 07:53:38,120 Waiting for confirmations from all peers.
INFO 2020-08-03 07:53:38,122 Rendezvous version 16 is complete. Final state: {'status': 'final', 'version': '16', 'participants': [0, 1, 2], 'keep_alives': ['/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_1', '/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_2', '/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_0'], 'num_workers_waiting': 0}
INFO 2020-08-03 07:53:38,122 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-08-03 07:53:38,128 api: [default] Rendezvous complete for workers.
Result:
	restart_count=0
	group_rank=0
	group_world_size=3
	rank stride=1
	assigned global_ranks=[0]
	master_addr=cv-test-0.cv-test.default.svc.cluster.local
	master_port=37129

[INFO] 2020-08-03 07:53:38,128 api: [default] Starting worker group
INFO:root:Classy Vision's default training script.
INFO:root:AMP disabled
INFO:root:mixup disabled
INFO:root:Synchronized Batch Normalization is disabled
INFO:root:Logging outputs to /workspace/classy_vision/output_2020-08-03T07:53:39.917536
INFO:root:Logging checkpoints to /workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints
WARNING:root:tensorboardX not installed, skipping tensorboard hooks
INFO:root:Starting training on rank 0 worker. World size is 1
INFO:root:Done setting up distributed process_group with rank 0, world_size 3
INFO:root:Using GPU, CUDA device index: 0
INFO:root:Starting training. Task: <classy_vision.tasks.classification_task.ClassificationTask object at 0x7f4a4992b2d0> initialized with config:
{
    "name": "classification_task",
    "num_epochs": 100,
    "loss": {
        "name": "my_loss"
    },
    "dataset": {
        "train": {
            "name": "my_dataset",
            "crop_size": 224,
            "class_ratio": 0.5,
            "num_samples": 320,
            "seed": 0,
            "batchsize_per_replica": 32,
            "use_shuffle": true,
            "transforms": [
                {
                    "name": "generic_image_transform",
                    "transforms": [
                        {
                            "name": "RandomResizedCrop",
                            "size": 224
                        },
                        {
                            "name": "RandomHorizontalFlip"
                        },
                        {
                            "name": "ToTensor"
                        },
                        {
                            "name": "Normalize",
                            "mean": [
                                0.485,
                                0.456,
                                0.406
                            ],
                            "std": [
                                0.229,
                                0.224,
                                0.225
                            ]
                        }
                    ]
                }
            ]
        },
        "test": {
            "name": "my_dataset",
            "crop_size": 224,
            "class_ratio": 0.5,
            "num_samples": 100,
            "seed": 1,
            "batchsize_per_replica": 32,
            "use_shuffle": false,
            "transforms": [
                {
                    "name": "generic_image_transform",
                    "transforms": [
                        {
                            "name": "Resize",
                            "size": 256
                        },
                        {
                            "name": "CenterCrop",
                            "size": 224
                        },
                        {
                            "name": "ToTensor"
                        },
                        {
                            "name": "Normalize",
                            "mean": [
                                0.485,
                                0.456,
                                0.406
                            ],
                            "std": [
                                0.229,
                                0.224,
                                0.225
                            ]
                        }
                    ]
                }
            ]
        }
    },
    "meters": {
        "accuracy": {
            "topk": [
                1
            ]
        }
    },
    "model": {
        "name": "my_model"
    },
    "optimizer": {
        "name": "sgd",
        "param_schedulers": {
            "lr": {
                "name": "step",
                "values": [
                    0.1,
                    0.01
                ]
            }
        },
        "weight_decay": 0.0001,
        "momentum": 0.9,
        "num_epochs": 100,
        "lr": 0.1,
        "nesterov": false,
        "use_larc": false,
        "larc_config": {
            "clip": true,
            "eps": 1e-08,
            "trust_coefficient": 0.02
        }
    }
}
INFO:root:Number of parameters in model: 2402
WARNING:root:Model contains unsupported modules, could not compute FLOPs for model forward pass.
INFO:root:Model does not implement input_shape. Skipping activation calculation.
INFO:root:Synced meters: [0] train phase 0 (100.00% done), loss: 0.1719, meters: [accuracy_meter(top_1=0.850467)]
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...
INFO:root:Synced meters: [0] test phase 0 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Synced meters: [0] train phase 1 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...
  5. Then delete the Pod whose group_rank is 0 (in this example, cv-test-0). The training will hang forever at a log line like the one below (no more logs are produced, but the Pod keeps running forever):
INFO:root:Synced meters: [0] test phase 40 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Synced meters: [0] train phase 41 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...
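
For completeness, here is a minimal sketch of how the manifests in steps 2 and 3 can be instantiated and how the scaledown in step 5 can be triggered. It assumes the Service and Pod YAML above are saved as cv-test-service.yaml and cv-test-pod.yaml (both file names are just placeholders) and that kubectl is pointed at the target cluster:

# Create the headless P2P discovery Service from step 2.
kubectl apply -f cv-test-service.yaml

# Instantiate the {{INDEX}} placeholder and create the 3 Pods from step 3.
for i in 0 1 2; do
  sed "s/{{INDEX}}/${i}/g" cv-test-pod.yaml | kubectl apply -f -
done

# Optionally verify that /dev/shm inside a worker is backed by the in-memory emptyDir volume.
kubectl exec cv-test-0 -- df -h /dev/shm

# Step 5: trigger the scaledown by deleting the Pod whose group_rank is 0 (cv-test-0 in this report).
kubectl delete pod cv-test-0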

Expected behavior

After scaling down, the remaining workers should re-rendezvous and recover from the last epoch's checkpoint, with logs like those shown in step 6 of https://github.com/microsoft/frameworkcontroller/tree/master/example/framework/scenario/pytorch/elastic#imagenet-example.

Environment

  • torchelastic version (e.g. 0.1.0rc1): torchelastic/examples:0.2.0
  • OS (e.g., Linux): torchelastic/examples:0.2.0
  • How you installed torchelastic (conda, pip, source, docker): torchelastic/examples:0.2.0
  • Docker image and tag (if using docker): torchelastic/examples:0.2.0
  • Build command you used (if compiling from source):
  • Git commit (if installed from source):
  • Python version: torchelastic/examples:0.2.0
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Execution environment (on-prem, aws, etc): K8s
  • Any other relevant information:

Additional context

  1. It would be better to also fix this issue in the ClassyVision repo: https://github.com/facebookresearch/ClassyVision/blob/master/examples/elastic/docker-compose.yaml

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
kiukchung commented, Aug 10, 2020

Thanks for reporting! We’ll make sure to update the docs for our next release. I’ve created an issue to ensure this happens: https://github.com/pytorch/elastic/issues/116. Closing for now. Feel free to reopen if you still encounter issues.

0 reactions
yqwang-ms commented, Aug 6, 2020

Big thanks to @kiukchung! I will try it later. BTW, the information you provided here is very valuable; would it be better to also put it in the example wiki? (So that others will not hit the same issue again 😃)

As for https://github.com/facebookresearch/ClassyVision/blob/master/examples/elastic/docker-compose.yaml, it seems to have this issue as well when ddp is used, so it may be necessary to work with the ClassyVision team to find a final solution (not a patch solution as you mentioned)? 😃
