Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Getting an error when starting training with a single GPU. [Error: CHILD PROCESS FAILED WITH NO ERROR_FILE]

See original GitHub issue

The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases. Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : tools/train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_0/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 14927) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=1 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_1/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 14955) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=2 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_2/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 15007) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=3 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_3/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 15048) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004889965057373047 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "15048", "role": "default", "hostname": "vefak", "state": "FAILED", "total_run_time": 25, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "vefak", "state": "SUCCEEDED", "total_run_time": 25, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\"}", "agent_restarts": 3}}
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 15048 (local_rank 0) FAILED (exitcode -11)
Error msg: Signal 11 (SIGSEGV) received by PID 15048
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train
    ...


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


          tools/train.py FAILED

==================================================
Root Cause:
[0]:
  time: 2022-02-04_00:19:07
  rank: 0 (local_rank: 0)
  exitcode: -11 (pid: 15048)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 15048"

Other Failures:
  <NO_OTHER_FAILURES>
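The log above recommends two fixes on the training-script side: read the local rank from the LOCAL_RANK environment variable instead of --use_env/--local_rank, and decorate the top-level entrypoint with @record so that a crash like this SIGSEGV leaves an error file that can be inspected. A minimal sketch of how the two pieces fit together, meant to be launched with python -m torch.distributed.run; the training body is a placeholder, not the project's actual tools/train.py:

import os

import torch
import torch.distributed as dist
from torch.distributed.elastic.multiprocessing.errors import record


@record  # on failure, write a traceback to the error file torchelastic looks for
def main():
    # torch.distributed.run exports LOCAL_RANK for every worker process,
    # so we read it from the environment rather than a --local_rank argument.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # ... build the model and dataloaders and run the training loop here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()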

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 16

Top GitHub Comments

2 reactions
tanjary21 commented, Apr 14, 2022

Hi there, I’m running into this issue when my annotation.json file is huge (7 GB).

I am able to train my model on multiple GPUs if I use a smaller slice of the full dataset, using the command:

!CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29501 ./tools/dist_train.sh ./configs/motsynth/qdtrack_frcnn_r50_fpn_4e_motsynth.py 4 --work-dir work_dirs/MOTSynth/virgin --cfg-options 'optimizer.lr=0.01' 'data.train.ann_file=data/MOTSynth/annotations/test_cocoformat.json'

and I can train it on the full dataset when I use a single GPU, using the following command:

!python ./tools/train.py ./configs/motsynth/qdtrack_frcnn_r50_fpn_4e_motsynth.py --cfg-options work_dir="work_dirs/MOTSynth/virgin" optimizer.lr=0.0025
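A possible workaround for the huge annotation file, not suggested in the thread itself, is to slice the COCO-style JSON down to a subset of images before launching distributed training, so each worker does not have to parse and hold the full 7 GB file. A rough sketch, assuming the standard COCO keys (images, annotations, categories); the file paths and the number of images to keep are hypothetical:

import json

SRC = "data/MOTSynth/annotations/full_cocoformat.json"    # hypothetical full annotation file
DST = "data/MOTSynth/annotations/subset_cocoformat.json"  # hypothetical output path
KEEP = 20000                                              # number of images to keep

with open(SRC) as f:
    coco = json.load(f)

kept_images = coco["images"][:KEEP]
kept_ids = {img["id"] for img in kept_images}

subset = {
    "images": kept_images,
    # keep only the annotations that reference a kept image
    "annotations": [a for a in coco["annotations"] if a["image_id"] in kept_ids],
    "categories": coco["categories"],
}

with open(DST, "w") as f:
    json.dump(subset, f)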

My environment has the following packages:

  • mmcv-full 1.3.10 <pip>
  • mmdet 2.16.0 <pip>
  • cudatoolkit 10.2.89 hfd86e86_1
  • pytorch 1.9.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
2 reactions
MendelXu commented, Feb 8, 2022

OK. I have no idea what happened. Could you try to run the baseline with

python -m torch.distributed.launch --nproc_per_node=1 --master_port=29995  tools/train.py configs/baseline/faster_rcnn_r50_caffe_fpn_coco_full_720k.py --launcher pytorch

and

python  tools/train.py configs/baseline/faster_rcnn_r50_caffe_fpn_coco_full_720k.py --gpus 1

? If the first one doesn’t work and the second one works, I can try to implement a non-distributed version.
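Since the worker dies with exitcode -11 (SIGSEGV) and therefore no Python traceback, it can also help to enable the standard-library faulthandler before trying either command above; this is a general debugging step rather than something suggested in the thread. A minimal sketch to place at the very top of the training entrypoint:

import faulthandler
import sys

# Dump the Python stack of every thread to stderr when the interpreter
# receives a fatal signal such as SIGSEGV, which at least shows where in
# the Python code the native crash happened. Equivalently, run the script
# with the environment variable PYTHONFAULTHANDLER=1 set.
faulthandler.enable(file=sys.stderr, all_threads=True)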

Read more comments on GitHub >

Top Results From Across the Web

Debugging for error from torch.distributed.run - PyTorch Forums
I just get the message ChildFailedError. If I train with a single GPU without using DDP, the specific reason for the error is...

How to run torch.distributed.run such that one can run two ...
I get the error: ====> about to start train loop Starting training! ... CHILD PROCESS FAILED WITH NO ERROR_FILE Child process 158275 ...

torch.distributed.elastic.multiprocessing.errors.childfailederror:
I’ve been trying to train the deraining model on your datasets for the last one week, but every time I run the train.sh...

child process failed, exited with error number 1 - when setting ...
Hi All, I have been struggling for a while with setting up a replica set and after my initial problems with keyfile permissions...

Troubleshoot Dataflow errors | Google Cloud
The pipeline fails completely when a single bundle fails four times. ... This error occurs if the pipeline could not be started due...
