ClassyVision distributed training hangs after scaling down training nodes
🐛 Bug
Component (check all that apply):
- state api
- train_step api
- train_loop
- rendezvous
- checkpoint
- rollback
- metrics
- petctl
- examples
- docker
- other
Background
We are also trying to support PET on K8s, and the ImageNet example is already supported without any issue; see https://github.com/microsoft/frameworkcontroller/tree/master/example/framework/scenario/pytorch/elastic.
However, there are some issues with supporting distributed ClassyVision, and the main one is this hang. This issue MAY already have been reported in https://github.com/pytorch/elastic/issues/25, but that issue is no longer active.
To Reproduce
Reproduce Key Points
- The current ClassyVision K8s example does not seem to be "distributed training", as `--distributed_backend` is not specified to be `ddp`. So I used `ddp` and found the issue.
- The current ClassyVision default example only has 2 `num_epochs`, so it runs very briefly (<1 min) and you can hardly test a scale-down during such a short run. So I increased `num_epochs` to 100.
- The current ClassyVision K8s example cannot specify `--num_workers=0`, as ClassyVision crashes with `classy_train.py: error: unrecognized arguments: --num_workers=0`. I then tried adding `num_workers: 0` to the ClassyVision config file, like this UT, but it still crashes with `ValueError: multiprocessing_context can only be used with multi-process loading (num_workers > 0), but got num_workers=0`. So I did not specify `num_workers` anywhere and instead used this K8s approach to expose shared memory to the containers (see the sanity-check sketch after this list).
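For reference, here is a minimal sketch (not part of the original example) of how the three tweaks above can be sanity-checked inside a running worker pod. The pod name `cv-test-0` and the file paths are the ones used later in this issue, and the exact commands are only illustrative:

```bash
# Illustrative sanity checks; assumes a worker pod named cv-test-0 and the
# file layout of the torchelastic/examples:0.2.0 image used below.

# 1. The container entrypoint should include --distributed_backend ddp.
kubectl exec cv-test-0 -- sh -c "tr '\0' ' ' < /proc/1/cmdline; echo"

# 2. num_epochs should have been bumped from 2 to 100 (by the sed in the Pod command).
kubectl exec cv-test-0 -- grep '"num_epochs"' /workspace/classy_vision/configs/template_config.json

# 3. /dev/shm should be a memory-backed tmpfs (emptyDir with medium: Memory),
#    so the DataLoader worker processes have enough shared memory.
kubectl exec cv-test-0 -- df -h /dev/shm
```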
Steps to reproduce the issue on K8s (you can do a similar thing without K8s; K8s is used here for simplicity):
- Assume etcd is already set up and its address is `pet-etcd:2379`.
- Create the P2P discovery Service on K8s:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cv-test
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: cv-test
```
- Create the 3 Pods below on K8s, each with the `{{INDEX}}` placeholder instantiated to 0, 1, 2 (see the kubectl sketch after these steps):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cv-test-{{INDEX}}
  labels:
    app: cv-test
spec:
  hostname: cv-test-{{INDEX}}
  subdomain: cv-test
  containers:
    - name: elasticjob-worker
      image: torchelastic/examples:0.2.0
      imagePullPolicy: Always
      command: [
        "bash", "-c",
        "sed -i -e 's/\"num_epochs\": 2/\"num_epochs\": 100/g'
        /workspace/classy_vision/configs/template_config.json &&
        python -m torchelastic.distributed.launch
        --rdzv_backend=etcd
        --rdzv_endpoint=pet-etcd:2379
        --rdzv_id=cv-test
        --nnodes=1:4
        --nproc_per_node=1
        /workspace/classy_vision/classy_train.py
        --config_file /workspace/classy_vision/configs/template_config.json
        --distributed_backend ddp"]
      volumeMounts:
        - name: shm-volume
          mountPath: /dev/shm
  volumes:
    - name: shm-volume
      emptyDir:
        medium: Memory
```
- All Pods will train, with logs like below:

```
[INFO] 2020-08-03 07:53:06,129 launch: Running torchelastic.distributed.launch with args: ['/opt/conda/lib/python3.7/site-packages/torchelastic/distributed/launch.py', '--rdzv_backend=etcd', '--rdzv_endpoint=pet-etcd:2379', '--rdzv_id=cv-test', '--nnodes=1:4', '--nproc_per_node=1', '/workspace/classy_vision/classy_train.py', '--config_file', '/workspace/classy_vision/configs/template_config.json', '--distributed_backend', 'ddp']
INFO 2020-08-03 07:53:06,136 Etcd machines: ['http://0.0.0.0:2379']
[INFO] 2020-08-03 07:53:06,144 launch: Using nproc_per_node=1.
[INFO] 2020-08-03 07:53:06,890 api: [default] starting workers for function: wrapper_fn
[INFO] 2020-08-03 07:53:06,890 api: [default] Rendezvous'ing worker group
INFO 2020-08-03 07:53:06,890 Attempting to join next rendezvous
INFO 2020-08-03 07:53:06,894 Observed existing rendezvous state: {'status': 'final', 'version': '15', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0'], 'num_workers_waiting': 0}
INFO 2020-08-03 07:53:06,988 Added self to waiting list. Rendezvous full state: {"status": "final", "version": "15", "participants": [0], "keep_alives": ["/torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0"], "num_workers_waiting": 1}
INFO 2020-08-03 07:53:06,990 Keep-alive key /torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0 is not renewed.
INFO 2020-08-03 07:53:06,990 Rendevous version 15 is incomplete.
INFO 2020-08-03 07:53:06,990 Attempting to destroy it.
INFO 2020-08-03 07:53:06,991 Destroyed rendezvous version 15 successfully.
INFO 2020-08-03 07:53:06,992 Previously existing rendezvous state changed. Will re-try joining.
INFO 2020-08-03 07:53:06,992 Attempting to join next rendezvous
INFO 2020-08-03 07:53:06,999 New rendezvous state created: {'status': 'joinable', 'version': '16', 'participants': []}
INFO 2020-08-03 07:53:07,012 Joined rendezvous version 16 as rank 0. Full state: {'status': 'joinable', 'version': '16', 'participants': [0]}
INFO 2020-08-03 07:53:07,013 Rank 0 is responsible for join last call.
INFO 2020-08-03 07:53:38,023 Rank 0 finished join last call.
INFO 2020-08-03 07:53:38,024 Waiting for remaining peers.
INFO 2020-08-03 07:53:38,025 All peers arrived. Confirming membership.
INFO 2020-08-03 07:53:38,120 Waiting for confirmations from all peers.
INFO 2020-08-03 07:53:38,122 Rendezvous version 16 is complete. Final state: {'status': 'final', 'version': '16', 'participants': [0, 1, 2], 'keep_alives': ['/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_1', '/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_2', '/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_0'], 'num_workers_waiting': 0}
INFO 2020-08-03 07:53:38,122 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-08-03 07:53:38,128 api: [default] Rendezvous complete for workers.
Result:
restart_count=0
group_rank=0
group_world_size=3
rank stride=1
assigned global_ranks=[0]
master_addr=cv-test-0.cv-test.default.svc.cluster.local
master_port=37129
[INFO] 2020-08-03 07:53:38,128 api: [default] Starting worker group
INFO:root:Classy Vision's default training script.
INFO:root:AMP disabled
INFO:root:mixup disabled
INFO:root:Synchronized Batch Normalization is disabled
INFO:root:Logging outputs to /workspace/classy_vision/output_2020-08-03T07:53:39.917536
INFO:root:Logging checkpoints to /workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints
WARNING:root:tensorboardX not installed, skipping tensorboard hooks
INFO:root:Starting training on rank 0 worker. World size is 1
INFO:root:Done setting up distributed process_group with rank 0, world_size 3
INFO:root:Using GPU, CUDA device index: 0
INFO:root:Starting training. Task: <classy_vision.tasks.classification_task.ClassificationTask object at 0x7f4a4992b2d0> initialized with config:
{
"name": "classification_task",
"num_epochs": 100,
"loss": {
"name": "my_loss"
},
"dataset": {
"train": {
"name": "my_dataset",
"crop_size": 224,
"class_ratio": 0.5,
"num_samples": 320,
"seed": 0,
"batchsize_per_replica": 32,
"use_shuffle": true,
"transforms": [
{
"name": "generic_image_transform",
"transforms": [
{
"name": "RandomResizedCrop",
"size": 224
},
{
"name": "RandomHorizontalFlip"
},
{
"name": "ToTensor"
},
{
"name": "Normalize",
"mean": [
0.485,
0.456,
0.406
],
"std": [
0.229,
0.224,
0.225
]
}
]
}
]
},
"test": {
"name": "my_dataset",
"crop_size": 224,
"class_ratio": 0.5,
"num_samples": 100,
"seed": 1,
"batchsize_per_replica": 32,
"use_shuffle": false,
"transforms": [
{
"name": "generic_image_transform",
"transforms": [
{
"name": "Resize",
"size": 256
},
{
"name": "CenterCrop",
"size": 224
},
{
"name": "ToTensor"
},
{
"name": "Normalize",
"mean": [
0.485,
0.456,
0.406
],
"std": [
0.229,
0.224,
0.225
]
}
]
}
]
}
},
"meters": {
"accuracy": {
"topk": [
1
]
}
},
"model": {
"name": "my_model"
},
"optimizer": {
"name": "sgd",
"param_schedulers": {
"lr": {
"name": "step",
"values": [
0.1,
0.01
]
}
},
"weight_decay": 0.0001,
"momentum": 0.9,
"num_epochs": 100,
"lr": 0.1,
"nesterov": false,
"use_larc": false,
"larc_config": {
"clip": true,
"eps": 1e-08,
"trust_coefficient": 0.02
}
}
}
INFO:root:Number of parameters in model: 2402
WARNING:root:Model contains unsupported modules, could not compute FLOPs for model forward pass.
INFO:root:Model does not implement input_shape. Skipping activation calculation.
INFO:root:Synced meters: [0] train phase 0 (100.00% done), loss: 0.1719, meters: [accuracy_meter(top_1=0.850467)]
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...
INFO:root:Synced meters: [0] test phase 0 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Synced meters: [0] train phase 1 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...
```
- Then delete the Pod whose group_rank is 0 (in this example, `cv-test-0`; see the kubectl sketch after these steps). The training then hangs forever at a log line like the one below (no more logs are produced, but the Pod keeps running):

```
INFO:root:Synced meters: [0] test phase 40 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Synced meters: [0] train phase 41 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...
```
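For completeness, below is the kubectl sketch referenced in the steps above. The file names `cv-test-service.yaml` and `cv-test-pod.yaml.template` are hypothetical placeholders for the two manifests shown earlier; this is just one way to instantiate the `{{INDEX}}` placeholder and trigger the scale-down.

```bash
# Hypothetical file names: the manifests are the Service and Pod specs shown above.

# Create the headless P2P discovery Service.
kubectl apply -f cv-test-service.yaml

# Instantiate the {{INDEX}} placeholder and create the 3 worker Pods.
for i in 0 1 2; do
  sed "s/{{INDEX}}/${i}/g" cv-test-pod.yaml.template | kubectl apply -f -
done

# Once training is in progress, scale down by deleting the Pod whose
# group_rank is 0 (cv-test-0 in this run); the remaining Pods then hang.
kubectl delete pod cv-test-0
```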
Expected behavior
After the scale-down, the remaining workers should re-rendezvous and recover from the last epoch checkpoint, with logs like step 6 in https://github.com/microsoft/frameworkcontroller/tree/master/example/framework/scenario/pytorch/elastic#imagenet-example.
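A rough way to watch for this recovery (illustrative; assumes a surviving worker pod named `cv-test-1`) is to follow the launcher log and look for a new rendezvous round being reported:

```bash
# Follow a surviving worker's log and watch for the torchelastic launcher to
# print a new rendezvous result (restart_count / group_world_size lines).
kubectl logs -f cv-test-1 | grep -E "Rendezvous|restart_count|group_world_size"
```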
Environment
- torchelastic version (e.g. 0.1.0rc1): torchelastic/examples:0.2.0
- OS (e.g., Linux): torchelastic/examples:0.2.0
- How you installed torchelastic (`conda`, `pip`, source, `docker`): torchelastic/examples:0.2.0
- Docker image and tag (if using docker): torchelastic/examples:0.2.0
- Build command you used (if compiling from source):
- Git commit (if installed from source):
- Python version: torchelastic/examples:0.2.0
- CUDA/cuDNN version:
- GPU models and configuration:
- Execution environment (on-prem, aws, etc): K8s
- Any other relevant information:
Additional context
- It would be good to also fix the issue in the ClassyVision repo: https://github.com/facebookresearch/ClassyVision/blob/master/examples/elastic/docker-compose.yaml
Comments (6 total, 3 by maintainers)
Thanks for reporting! We’ll make sure to update the docs for our next release. I’ve created an issue to ensure this happens: https://github.com/pytorch/elastic/issues/116. Closing for now. Feel free to reopen if you still encounter issues.
Big thanks to @kiukchung! I will try it later. BTW, your info about this is very valuable; would it be better to also put it in the example wiki? (So that others will not run into the same issue again 😃)
And as for https://github.com/facebookresearch/ClassyVision/blob/master/examples/elastic/docker-compose.yaml: it seems to have this issue as well, given that ddp is used, so we may need to communicate with the ClassyVision team to find a final solution for this (not a patch solution as you mentioned)? 😃