Getting error when starting training with a single GPU [Error: CHILD PROCESS FAILED WITH NO ERROR_FILE]
```
The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases. Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : tools/train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_0/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 14927) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=1 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_1/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 14955) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=2 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_2/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 15007) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=3 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_3/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 15048) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004889965057373047 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "15048", "role": "default", "hostname": "vefak", "state": "FAILED", "total_run_time": 25, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "vefak", "state": "SUCCEEDED", "total_run_time": 25, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\"}", "agent_restarts": 3}}
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:

CHILD PROCESS FAILED WITH NO ERROR_FILE

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 15048 (local_rank 0) FAILED (exitcode -11)
Error msg: Signal 11 (SIGSEGV) received by PID 15048
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train

  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED
==================================================
Root Cause:
[0]:
  time: 2022-02-04_00:19:07
  rank: 0 (local_rank: 0)
  exitcode: -11 (pid: 15048)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 15048"
Other Failures:
  <NO_OTHER_FAILURES>
```
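As the warning itself suggests, a crashing worker's traceback can be captured by decorating the top-level entrypoint of the training script with `record`. A minimal sketch of what that could look like in a custom `tools/train.py` (the `main` function and its body here are hypothetical, not the actual qdtrack training script):

```python
import os

from torch.distributed.elastic.multiprocessing.errors import record


@record  # writes a traceback to the torchelastic error file if this worker raises
def main():
    # --use_env is deprecated, so read the local rank from the environment
    # as the launcher warning recommends (hypothetical usage)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    print(f"worker starting on local rank {local_rank}")
    # ... build the model and dataloaders, then run the training loop ...


if __name__ == "__main__":
    main()
```

Note that the failure above is a SIGSEGV (signal 11), which kills the interpreter outright, so even with `@record` there may be no Python traceback to write; the decorator mainly helps when the worker dies with a Python exception.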
Top GitHub Comments
Hi there, I'm running into this issue when my annotation.json file is huge (7 GB).
I am able to train my model on multiple GPUs if I use a smaller slice of the full dataset (see the sketch below) with the command:
!CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29501 ./tools/dist_train.sh ./configs/motsynth/qdtrack_frcnn_r50_fpn_4e_motsynth.py 4 --work-dir work_dirs/MOTSynth/virgin --cfg-options 'optimizer.lr=0.01' 'data.train.ann_file=data/MOTSynth/annotations/test_cocoformat.json'
I can also train it on the full dataset with a single GPU using the following command:
!python ./tools/train.py ./configs/motsynth/qdtrack_frcnn_r50_fpn_4e_motsynth.py --cfg-options work_dir="work_dirs/MOTSynth/virgin" optimizer.lr=0.0025
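A rough sketch of how such a smaller annotation slice could be produced, assuming the standard COCO JSON layout with `images`, `annotations`, and `categories` keys; the paths and slice size below are placeholders, not the actual files from this issue:

```python
import json

N_IMAGES = 1000  # size of the slice to keep (arbitrary placeholder)

# placeholder input path; substitute the real 7 GB annotation file
with open("data/MOTSynth/annotations/full_cocoformat.json") as f:
    coco = json.load(f)

kept_images = coco["images"][:N_IMAGES]
kept_ids = {img["id"] for img in kept_images}

sliced = {
    "images": kept_images,
    # keep only the annotations that refer to one of the kept images
    "annotations": [a for a in coco["annotations"] if a["image_id"] in kept_ids],
    "categories": coco["categories"],
}

# placeholder output path; this is what gets passed via data.train.ann_file
with open("data/MOTSynth/annotations/sliced_cocoformat.json", "w") as f:
    json.dump(sliced, f)
```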
My environment has the following packages:
OK. I have no idea what happened. Could you try to run the baseline with
and
? If the first one doesn’t work and the second one works, I can try to implement a non-distributed version.