
Closing dangling stream


I’m creating temporary Clients to run custom task graphs on my remote cluster, and I was having problems with lots of socket connections sticking around. Results were also left stuck on the cluster, because exceptions meant the results of some Futures were never requested.

Explicitly calling Client.close() seemed to fix all that: I no longer see anything stuck on the cluster after an exception. But now I’m seeing a TCP connection that isn’t closed cleanly. Here’s some code to reproduce the problem:

from dask.distributed import Client, as_completed
import uuid

def get_results():
    # Short-lived client; set_as_default=False keeps it out of the
    # global default-client slot.
    c = Client('HOSTNAME:8786', set_as_default=False)

    def daskfn():
        return 'results'

    # Submit 100 single-task graphs, each under a unique key.
    futs = []
    for i in range(100):
        key = f'testfn-{uuid.uuid4()}'
        futs.append(c.get({key: (daskfn,)}, key, sync=False))

    results = []
    try:
        for f in as_completed(futs):
            results.append(f.result())
    finally:
        # Close explicitly so nothing is left stuck on the cluster
        # if f.result() raises.
        c.close()
    return results

results = get_results()

After this, the command-line program ss shows a socket in the CLOSE-WAIT state. And if I do something that triggers garbage collection, or call gc.collect() explicitly, I see the following warning:

distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://IPADDR:41742 remote=tcp://HOSTNAME:8786>
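
For reference, here’s a hedged variant of the reproducer that uses the client as a context manager, which is equivalent to the try/finally above (close() still runs if a future raises). Client.submit with pure=False stands in for the uuid-keyed graphs, so each call still gets its own key. Based on the behavior described above, close() alone does not prevent the warning, since the warning fires when the leaked comm object is garbage collected:

import gc
from dask.distributed import Client, as_completed

def get_results():
    def daskfn():
        return 'results'

    # The context manager calls c.close() on exit, even on exceptions.
    with Client('HOSTNAME:8786', set_as_default=False) as c:
        # pure=False gives each call a distinct key, mirroring the
        # uuid-based keys in the original reproducer.
        futs = [c.submit(daskfn, pure=False) for _ in range(100)]
        return [f.result() for f in as_completed(futs)]

results = get_results()
gc.collect()  # force a collection to surface any "Closing dangling stream" warning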


Top GitHub Comments

6 reactions
hoangthienan95 commented, Apr 30, 2019

Does anyone know if this has been resolved? I’m seeing these warnings even when using LocalCluster, and it makes using Dask much slower than not using it…

0 reactions
kylejn27 commented, Nov 25, 2019

Did a bit of digging into this problem the other day, specifically regarding dangling streams. I made a little progress, but I’m not sure it’s worth my time to investigate further.

I implemented the close_rpc method by calling self.pool.remove(self.addr). This closed the connections when the cluster was closed, but had no effect on the dangling-streams warning.
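
For context, here’s a sketch of what that change might look like (the class layout and attribute names are assumptions inferred from the comment above, not a verbatim patch against distributed/core.py):

class PooledRPCCall:
    # Hypothetical sketch of the RPC wrapper that hands out pooled
    # connections; the real class lives in distributed/core.py.
    def __init__(self, addr, pool):
        self.addr = addr  # address of the remote server, e.g. the scheduler
        self.pool = pool  # the shared ConnectionPool

    def close_rpc(self):
        # Previously a no-op; removing the address from the pool closes
        # all pooled connections to it when the cluster shuts down.
        self.pool.remove(self.addr)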

Here’s what I’ve found:

The connection between the client and the scheduler is getting broken while the scheduler is trying to read from the client. This can be observed from the Stream is closed exception on the scheduler.

When the connection gets broken, the client deletes the TCP connection object before it has been closed, and then starts a new one. This is indicated first by the dangling-stream warning, which is logged upon deletion of the TCP object, and then by the Setting TCP keepalive message, which is logged from set_tcp_timeout when a new TCP object is instantiated.
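
To make that sequence concrete, here’s a simplified sketch of where the warning comes from, paraphrased from my reading of distributed/comm/tcp.py (the exact mechanism may differ between versions):

import logging
import weakref

logger = logging.getLogger("distributed.comm.tcp")

class TCP:
    # Simplified stand-in for distributed.comm.tcp.TCP, not the real class.
    def __init__(self, stream):
        self.stream = stream
        # weakref.finalize fires when this comm object is garbage collected.
        # If the stream is still open at that point, the comm was dropped
        # without a clean close() -- which is what produces the warning.
        self._finalizer = weakref.finalize(self, self._get_finalizer())

    def _get_finalizer(self):
        def finalize(stream=self.stream, r=repr(self)):
            if not stream.closed():
                logger.warning("Closing dangling stream in %s", r)
                stream.close()
        return finalize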

There’s a Stream is closed exception occurring here: https://github.com/dask/distributed/blob/master/distributed/core.py#L338 and here: https://github.com/dask/distributed/blob/master/distributed/comm/tcp.py#L201

Scheduler Logs

distributed.comm.tcp - DEBUG - Incoming connection from 'tcp://[::1]:55311' to 'tcp://127.0.0.1:8786'
distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2
distributed.core - DEBUG - Connection from 'tcp://[::1]:55311' to Scheduler
distributed.core - DEBUG - Lost connection to 'tcp://[::1]:55311' while reading message: in <closed TCP>: Stream is closed. Last operation: None

Client Logs

distributed.comm.tcp - WARNING - Closing dangling stream in <TCP  local=tcp://[::1]:55311 remote=tcp://localhost:8786>
distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=1, interval=1

Next Steps

If I were to investigate this further, I’d try to figure out why the connection between the client and scheduler is being broken. It’s probably not a huge problem, since the TCP object eventually gets destroyed on the client side, but if these connections were closed properly it might prevent many unnecessary new connections from being created (a potential performance issue).

I’m thinking that the connection pool isn’t behaving correctly on long-running operations, like downloading large Parquet files from S3. My guess is that while a Dask worker is busy with an operation that takes a while, the connection between the client and scheduler times out and closes before the TCP object can be reused as the connection pool intends.
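
One cheap way to test that hypothesis would be to raise the comm timeouts and see whether the warnings go away. The config keys below are documented for dask.distributed, but default values and exact behavior vary by version, so treat this as a sketch:

import dask

# Raise the comm timeouts before creating any Client, so the pooled
# client<->scheduler connection is less likely to be dropped while a
# long-running task (e.g. a large S3 read) is in flight.
dask.config.set({
    "distributed.comm.timeouts.connect": "30s",
    "distributed.comm.timeouts.tcp": "60s",
})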

I also found this Stack Overflow post, which may or may not be relevant: https://stackoverflow.com/questions/11161626/tornado-server-throws-error-stream-is-closed
