Closing dangling stream
I’m creating temporary Clients to run custom task graphs on my remote cluster. I was having problems with lots of socket connections sticking around, and with results stuck on the cluster because exceptions caused the results of some Futures never to be requested.
Explicitly calling Client.close() seemed to fix all of that; I don’t see stuck results on the cluster after an exception any more. But now I’m seeing a TCP connection that isn’t closed cleanly. Here’s some code to reproduce the problem:
    from dask.distributed import Client, as_completed
    import uuid

    def get_results():
        c = Client('HOSTNAME:8786', set_as_default=False)

        def daskfn():
            return 'results'

        futs = []
        for i in range(100):
            key = f'testfn-{uuid.uuid4()}'
            futs.append(c.get({key: (daskfn,)}, key, sync=False))

        results = []
        try:
            for f in as_completed(futs):
                results.append(f.result())
        finally:
            c.close()
        return results

    results = get_results()
After this, the command-line program ss shows a socket in the CLOSE-WAIT state. And if I do something that triggers garbage collection, or call gc.collect(), I see the following warning:
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://IPADDR:41742 remote=tcp://HOSTNAME:8786>
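For reference, the warning can be surfaced on demand by forcing a collection right after the reproducer above runs:

    import gc

    # Any comm object that was dropped with its stream still open gets
    # finalized here, and distributed logs the "Closing dangling stream"
    # warning at that point.
    gc.collect()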
Issue Analytics
- Created 5 years ago
- Comments: 13 (3 by maintainers)
Top GitHub Comments
Does anyone know if this has been resolved? I’m having these warnings even when using LocalCluster, and it makes using Dask much slower than not using it…
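A note on the LocalCluster case mentioned above: closing the client and cluster explicitly, for example via context managers, avoids relying on garbage collection to tear the comms down. A minimal sketch:

    from dask.distributed import Client, LocalCluster

    # Both LocalCluster and Client are context managers; leaving the block
    # closes the client first and then the cluster, so no connections are
    # left around for the garbage collector to clean up.
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            print(client.submit(lambda x: x + 1, 10).result())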
Did a bit of digging into this problem the other day, specifically regarding dangling streams. I made a little progress, but I’m not sure it’s worth my time to investigate further.
I implemented the close_rpc method by calling self.pool.remove(self.addr). This closed the connections when the cluster was closed, but had no effect on the dangling-stream warning. Here’s what I’ve found:
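A rough sketch of the change described above, assuming it lives on the pooled RPC wrapper in distributed/core.py (the class and attribute names here are illustrative and may not match your version):

    # Hypothetical sketch, not the actual distributed implementation.
    class PooledRPCCall:
        def __init__(self, addr, pool):
            self.addr = addr
            self.pool = pool

        def close_rpc(self):
            # Remove (and close) every pooled connection for this address
            # instead of leaving the comm objects to be garbage collected.
            self.pool.remove(self.addr)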
- The connection between the client and the scheduler is getting broken while the scheduler is trying to read from the client. This can be observed from the "Stream is Closed" exception on the scheduler.
- When the connection breaks, the client deletes the TCP connection object before it has been closed and then starts a new one. This is indicated first by the dangling-stream warning, which is logged when the TCP object is deleted, and then by the "Setting TCP keepalive" message, which is logged from the set_tcp_timeout method when a new TCP object is instantiated.
- There's a "Stream is Closed" exception occurring here: https://github.com/dask/distributed/blob/master/distributed/core.py#L338 and here: https://github.com/dask/distributed/blob/master/distributed/comm/tcp.py#L201
Scheduler Logs
Client Logs
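For context on where the warning comes from: it is logged when a TCP comm object is torn down while its underlying stream is still open. A simplified illustration of that pattern (not the actual distributed source; see distributed/comm/tcp.py for the real code):

    import logging

    logger = logging.getLogger("distributed.comm.tcp")

    class TCPComm:
        """Simplified stand-in for distributed's TCP comm object."""

        def __init__(self, stream, local_addr, peer_addr):
            self.stream = stream
            self.local_addr = local_addr
            self.peer_addr = peer_addr

        def abort(self):
            stream, self.stream = self.stream, None
            if stream is not None and not stream.closed():
                # The message seen above: the comm is being torn down while
                # the stream underneath is still open.
                logger.warning("Closing dangling stream in %s", self)
                stream.close()

        def __del__(self):
            # Garbage collection (e.g. an explicit gc.collect()) triggers
            # the warning if the comm was dropped without being closed.
            self.abort()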
Next Steps
If I were to investigate this further, I'd try to figure out why the connection is being broken between the client and the scheduler. It's probably not a huge problem, since the TCP object eventually gets destroyed on the client side, but closing these connections properly might prevent many unnecessary new connections from being created (a potential performance issue).
I suspect the connection pool isn't behaving correctly during long-running operations, like downloading large Parquet files from S3. When a worker is busy with an operation that takes a while, the connection between the client and the scheduler seems to time out and close before the TCP object can be reused as the connection pool intends.
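One thing worth trying along those lines is raising the comm timeouts, so that idle client-scheduler connections are less likely to be dropped while sitting in the pool. A sketch, assuming the standard distributed configuration keys (check the config schema for your version):

    import dask

    # Set the timeouts before creating the Client; values are illustrative.
    dask.config.set({
        "distributed.comm.timeouts.connect": "60s",
        "distributed.comm.timeouts.tcp": "120s",
    })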
I also found this Stack Overflow post, which may or may not be relevant: https://stackoverflow.com/questions/11161626/tornado-server-throws-error-stream-is-closed