Non-daemonic workers

Related to #2142, but the solution doesn’t apply in my case. I have a use case for workers running in separate processes, but as non-daemons because the worker processes need to use multiprocessing. Here’s an example:

import torch
import torch.distributed as dist
import torchvision
import os
from distributed import Client, LocalCluster


def worker_fn(rank, world_size):
    print('worker', rank)

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '8989'
    dist.init_process_group(
        backend=dist.Backend.NCCL,
        rank=rank,
        world_size=world_size,
    )
    print('initialized distributed', rank)

    if rank == 0:
        dataset = torchvision.datasets.MNIST(
            '../data/',
            train=True,
            download=True,
        )
    dist.barrier()
    if rank != 0:
        dataset = torchvision.datasets.MNIST(
            '../data/',
            train=True,
            download=False,
        )
    # load data, uses multiprocessing
    loader = torch.utils.data.DataLoader(
        dataset,
        sampler=torch.utils.data.distributed.DistributedSampler(
            dataset,
            rank=rank,
            num_replicas=world_size,
        ),
        num_workers=2,
    )
    print('constructed data loader', rank)

    # if cuda is available, initializes it as well
    assert torch.cuda.is_available()
    # do distributed training, but in this case it suffices to iterate
    for x, y in loader:
        pass


def main():
    world_size = 2
    cluster = LocalCluster(
        n_workers=world_size,
        processes=True,
        resources={
            'GPUS': 1,  # don't allow two tasks to run on the same worker
        },
    )
    cl = Client(cluster)
    futs = []
    for rank in range(world_size):
        futs.append(
            cl.submit(
                worker_fn,
                rank,
                world_size,
                resources={'GPUS': 1},
            ))

    for f in futs:
        f.result()


if __name__ == '__main__':
    main()

If processes=True, then we get an error about daemonic processes not being allowed to have children:

worker 0
worker 1
initialized distributed 1
initialized distributed 0
constructed data loader 0
constructed data loader 1
distributed.worker - WARNING -  Compute Failed
Function:  worker_fn
args:      (0, 2)
kwargs:    {}
Exception: AssertionError('daemonic processes are not allowed to have children',)

Traceback (most recent call last):
  File "scratch.py", line 152, in <module>
    main()
  File "scratch.py", line 148, in main
    f.result()
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/site-packages/distributed/client.py", line 227, in result
    six.reraise(*result)
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "scratch.py", line 123, in worker_fn
    for x, y in loader:
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
    return _DataLoaderIter(self)
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
    w.start()
  File "/private/home/calebh/miniconda3/envs/fairtask2/lib/python3.6/multiprocessing/process.py", line 103, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
distributed.worker - WARNING -  Compute Failed
Function:  worker_fn
args:      (1, 2)
kwargs:    {}
Exception: AssertionError('daemonic processes are not allowed to have children',)
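The AssertionError above is raised by Python's own multiprocessing module, not by dask or PyTorch: a process that was started with daemon=True refuses to start children of its own. The following is a minimal sketch that reproduces this with the standard library alone. It assumes a POSIX system where the "fork" start method is available; the helper names try_spawn_child and run_worker are hypothetical and used only for illustration.

```python
import multiprocessing as mp

# Assumption: POSIX system where the "fork" start method is available.
ctx = mp.get_context("fork")


def _noop():
    """Trivial target for the nested child process."""
    pass


def try_spawn_child(conn):
    """Runs inside a worker process and tries to start a child of its own."""
    try:
        child = ctx.Process(target=_noop)
        child.start()  # raises AssertionError if the current process is daemonic
        child.join()
        conn.send("ok")
    except AssertionError as exc:
        conn.send(str(exc))
    finally:
        conn.close()


def run_worker(daemon):
    """Start one worker (hypothetical helper) and report what it observed."""
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    worker = ctx.Process(target=try_spawn_child, args=(child_conn,), daemon=daemon)
    worker.start()
    result = parent_conn.recv()
    worker.join()
    return result
```

Calling run_worker(daemon=False) lets the worker start its own child, while run_worker(daemon=True) reports the same message seen in the traceback. This is the situation the DataLoader hits with num_workers=2 when it runs inside a daemonic worker process.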

If processes=False, we get stuck at distributed initialization.

Issue Analytics

Created 4 years ago
Reactions: 1
Comments: 12 (11 by maintainers)

Top GitHub Comments

mrocklin commented, May 29, 2019

@zhanghang1989 I recommend raising a new issue. I recommend not repeating your comment on multiple issues.

zhanghang1989 commented, May 28, 2019

I am new to dask. Is it possible to set --no-nanny when using dask-ssh?
