Script does not finish with dask distributed
I’m using a pipeline that reads text from files via Apache Tika, performs some pre-processing, and writes the results into MongoDB. The following is a truncated version of my script.
```python
import dask.distributed
from pymongo import MongoClient
from streamz import Stream

# add_filesize, add_text, add_text_lengths, and write_file are defined
# elsewhere in the full script (omitted here for brevity)

if __name__ == "__main__":
    mongo_client = MongoClient("mongodb://localhost:27017/")
    dask_client = dask.distributed.Client()

    file_stream_source = Stream()
    file_stream = (
        file_stream_source.scatter()
        .map(add_filesize)
        .map(add_text)
        .map(add_text_lengths)
        .buffer(16)
        .gather()
    )
    file_stream.sink(write_file)

    # loop over input files, calling file_stream_source.emit(...) on each
```
Everything works well, but the last few documents are missing. It looks as if the dask worker processes are killed before all tasks have finished; the warnings/errors below support this. Is this behavior expected and I’m using the interface incorrectly, or is this a bug?

Update: this does not happen when the same code is run in a Jupyter notebook. Could this be related to the event loop?
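The symptom is consistent with the main thread reaching the end of the script while items are still sitting in the 16-slot buffer: the interpreter exits and the worker processes are reaped before the remaining futures are gathered. As a minimal sketch of that failure mode (plain stdlib only, not the streamz/dask API), a daemon worker draining a bounded queue loses buffered items unless the emitter explicitly blocks until the queue is empty:

```python
import queue
import threading
import time

# Stdlib analogue of the buffered pipeline above: a daemon worker drains a
# bounded queue while the main thread emits items.
buf = queue.Queue(maxsize=16)
written = []

def worker():
    while True:
        item = buf.get()
        time.sleep(0.001)       # simulate slow processing (Tika + MongoDB)
        written.append(item)
        buf.task_done()

threading.Thread(target=worker, daemon=True).start()

for doc in range(100):
    buf.put(doc)                # analogous to file_stream_source.emit(doc)

# Without this join, the interpreter exits while items are still buffered and
# the daemon worker is killed -- the same symptom as the missing documents.
buf.join()
```

In a notebook the process stays alive after the emit loop, which would hide the race and match the update above.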
```
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-2, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-3, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-4, started daemon)>
distributed.nanny - WARNING - Worker process 15143 was killed by signal 15
distributed.nanny - WARNING - Worker process 15141 was killed by signal 15
Traceback (most recent call last):
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
distributed.nanny - WARNING - Worker process 15139 was killed by signal 15
distributed.nanny - WARNING - Worker process 15145 was killed by signal 15
```
Relevant package versions:

```
streamz     0.5.1   py_0    conda-forge
dask        1.2.2   py_0
dask-core   1.2.2   py_0
tornado     6.0.2   py37h7b6447c_0
```
Issue Analytics
- State:
- Created 4 years ago
- Comments: 9 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
What I mean is, your code doesn’t invoke distributed. But I now understand that you were providing a solution, not reporting a new issue 😃
You should be able to achieve something similar with event loops, but your way may be simpler when none of the source nodes needs an event loop anyway (though distributed always has one!). There may be a way to say “run until done” on a source (i.e., stop once all of the events have been processed), which in the case with no timing or backpressure would be immediate.
In the simpler case, can the thread be `join`ed? Does that thread respect backpressure?
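Both questions can be answered affirmatively in a plain-threads design. A hedged sketch (stdlib only, not the streamz API, with a hypothetical sentinel-based shutdown protocol): a bounded queue gives the emitter backpressure, because `put()` blocks while the buffer is full, and a sentinel value gives the consumer thread a clean point at which it can be joined.

```python
import queue
import threading

SENTINEL = object()             # hypothetical "run until done" signal
buf = queue.Queue(maxsize=16)   # bounded: put() blocks when 16 items are in flight
results = []

def consumer():
    while True:
        item = buf.get()
        if item is SENTINEL:    # all events have been processed; exit cleanly
            break
        results.append(item * 2)

t = threading.Thread(target=consumer)
t.start()

for i in range(1000):
    buf.put(i)                  # backpressure: blocks whenever the buffer is full

buf.put(SENTINEL)
t.join()                        # returns only once every event has been consumed
```

With no timing and no backpressure, the `join()` here would indeed return almost immediately, matching the "run until done" behaviour described above.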