Script does not finish with dask distributed
I’m using a pipeline that reads text from files via Apache Tika, performs some pre-processing, and writes the results into MongoDB. The following is a truncated version of my script.
```python
import dask.distributed
from pymongo import MongoClient
from streamz import Stream

# add_filesize, add_text, add_text_lengths, and write_file are defined
# elsewhere in the full script (omitted here for brevity)

if __name__ == "__main__":
    mongo_client = MongoClient("mongodb://localhost:27017/")
    dask_client = dask.distributed.Client()

    file_stream_source = Stream()
    file_stream = (
        file_stream_source.scatter()
        .map(add_filesize)
        .map(add_text)
        .map(add_text_lengths)
        .buffer(16)
        .gather()
    )
    file_stream.sink(write_file)

    # loop over input files, calling file_stream_source.emit(...) on each
```
Everything works well, but the last few documents are missing. It looks as if the dask worker processes are killed before all tasks have finished; the warnings/errors below support this. Is this behavior expected and I’m using the interface incorrectly, or is this a bug?

Update: this does not happen when the same code is run in a Jupyter notebook. Could this be related to the event loop?
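The symptom is consistent with the main thread reaching the end of the script while items are still sitting in the 16-slot buffer: the interpreter exits and the worker processes are reaped before the remaining futures are gathered. As a minimal sketch of that failure mode (plain stdlib only, not the streamz/dask API), a daemon worker draining a bounded queue loses buffered items unless the emitter explicitly blocks until the queue is empty:

```python
import queue
import threading
import time

# Stdlib analogue of the buffered pipeline above: a daemon worker drains a
# bounded queue while the main thread emits items.
buf = queue.Queue(maxsize=16)
written = []

def worker():
    while True:
        item = buf.get()
        time.sleep(0.001)       # simulate slow processing (Tika + MongoDB)
        written.append(item)
        buf.task_done()

threading.Thread(target=worker, daemon=True).start()

for doc in range(100):
    buf.put(doc)                # analogous to file_stream_source.emit(doc)

# Without this join, the interpreter exits while items are still buffered and
# the daemon worker is killed -- the same symptom as the missing documents.
buf.join()
```

In a notebook the process stays alive after the emit loop, which would hide the race and match the update above.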
```
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-2, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-3, started daemon)>
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-4, started daemon)>
distributed.nanny - WARNING - Worker process 15143 was killed by signal 15
distributed.nanny - WARNING - Worker process 15141 was killed by signal 15
Traceback (most recent call last):
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/dario/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
distributed.nanny - WARNING - Worker process 15139 was killed by signal 15
distributed.nanny - WARNING - Worker process 15145 was killed by signal 15
```
Relevant package versions:

```
streamz     0.5.1   py_0    conda-forge
dask        1.2.2   py_0
dask-core   1.2.2   py_0
tornado     6.0.2   py37h7b6447c_0
```
Issue Analytics
- State:
- Created 4 years ago
- Comments: 9 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
What I mean is, your code doesn’t invoke distributed. But I now understand that you were providing a solution, not reporting a new issue 😃
You should be able to achieve something similar with event loops, but your way may be simpler when none of the source nodes needs an event loop anyway (though distributed always has one!). There may be a way to say “run until done” on a source (i.e., stop once all of the events have been processed), which in the case with no timing or backpressure would be immediate.
In the simpler case, can the thread be `join`ed? Does that thread respect backpressure?
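Both questions can be answered affirmatively in a plain-threads design. A hedged sketch (stdlib only, not the streamz API, with a hypothetical sentinel-based shutdown protocol): a bounded queue gives the emitter backpressure, because `put()` blocks while the buffer is full, and a sentinel value gives the consumer thread a clean point at which it can be joined.

```python
import queue
import threading

SENTINEL = object()             # hypothetical "run until done" signal
buf = queue.Queue(maxsize=16)   # bounded: put() blocks when 16 items are in flight
results = []

def consumer():
    while True:
        item = buf.get()
        if item is SENTINEL:    # all events have been processed; exit cleanly
            break
        results.append(item * 2)

t = threading.Thread(target=consumer)
t.start()

for i in range(1000):
    buf.put(i)                  # backpressure: blocks whenever the buffer is full

buf.put(SENTINEL)
t.join()                        # returns only once every event has been consumed
```

With no timing and no backpressure, the `join()` here would indeed return almost immediately, matching the "run until done" behaviour described above.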