
Timed out trying to connect ... : connect() didn't finish in time


While the following is not a bug we can pinpoint, it really bothers us. After weeks of experimentation and investigation we have not found a real solution.

We run something like

(a_bag
  .repartition(100_000)  # outputs small partitions
  .map(very_cpu_heavy_computation)  # outputs ~4MB partitions
  .repartition(1000).to_dataframe().to_parquet()  # writes ~30MB parquet files from ~400MB in-memory partitions
)

on 100 workers with 1500 cpus on Kubernetes pods.
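For reference, a self-contained sketch of the same shape of pipeline (the bag contents, the CPU-heavy function, the scheduler address and the output path are all made up for illustration; the real job runs on the cluster described above):

import dask.bag as db
from dask.distributed import Client

def very_cpu_heavy_computation(x):
    # stand-in for the real CPU-bound work
    return {"key": x, "value": x * 2}

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

bag = db.from_sequence(range(1_000_000), npartitions=1000)
(
    bag
    .repartition(100_000)             # many small partitions
    .map(very_cpu_heavy_computation)  # CPU-heavy step
    .repartition(1000)                # fewer, larger partitions
    .to_dataframe()
    .to_parquet("output/")            # hypothetical output path
)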

When it comes to the last repartition step, Dask starts blocking the event loop and the workers start spamming the following logs:

distributed.utils_comm - INFO - Retrying get_data_from_worker after exception in attempt 0/2: Timed out trying to connect to 'tcp://10.2.3.20:38449' after 20 s: Timed out trying to connect to 'tcp://10.2.3.20:38449' after 20 s: connect() didn't finish in time

=> 5000 events

distributed.core - INFO - Event loop was unresponsive in Worker for 4.95s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

=> 26,094 events

We are working around this by setting

# fewer connections to avoid potential network saturation and connection timeouts
export DASK_DISTRIBUTED__WORKER__CONNECTIONS__OUTGOING=25
export DASK_DISTRIBUTED__WORKER__CONNECTIONS__INCOMING=5
export DASK_DISTRIBUTED__COMM__SOCKET_BACKLOG=16384

# graceful timeout and retry policy
export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=20
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=20
export DASK_DISTRIBUTED__COMM__RETRY__COUNT=2

so that the job finally finishes, despite the communication issues.
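For completeness, the same settings expressed through dask.config (a minimal sketch; the key names assume the usual DASK_DISTRIBUTED__FOO__BAR -> distributed.foo.bar mapping, and the call has to run in every process, i.e. on each pod, just like the exports above):

import dask

dask.config.set({
    "distributed.worker.connections.outgoing": 25,
    "distributed.worker.connections.incoming": 5,
    "distributed.comm.socket-backlog": 16384,
    "distributed.scheduler.allowed-failures": 20,
    "distributed.comm.timeouts.connect": 20,  # seconds
    "distributed.comm.retry.count": 2,
})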

Still, workers sometimes start hanging and have to be restarted manually, and progress slows down significantly towards the end of the job.

I’m sharing this with the community, hoping that somebody can give pointers on what to try, or even ideas for a resolution.

Environment:

  • Dask version: 2.23.0
  • Python version: 3.6
  • Operating System: Ubuntu on Kubernetes
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 36 (26 by maintainers)

Top GitHub Comments

4 reactions
michaelnarodovitch commented, Oct 8, 2020

What has helped quite a bit so far:

export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=20
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=5
export DASK_DISTRIBUTED__COMM__RETRY__COUNT=10

A long application timeout effectively makes the workers saturate the scheduler socket. A short timeout plus backoff protects the scheduler listener by spreading the connection attempts out in time. Failed tasks (they still occur) are rescheduled instead of failing the whole job. https://github.com/dask/distributed/blob/b2f594e69751e0733ac595026e04181c55119ae9/distributed/comm/core.py#L327-L332
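To illustrate the intuition (a toy sketch only, not the actual distributed.comm retry code linked above): a short per-attempt timeout combined with jittered exponential backoff makes simultaneous retries from many workers spread out instead of piling onto the scheduler at once.

import random
import time

def connect_with_backoff(connect_once, attempt_timeout=5, retries=10, base_delay=0.1):
    # connect_once is a placeholder for a single connection attempt
    for attempt in range(retries):
        try:
            return connect_once(timeout=attempt_timeout)
        except OSError:
            if attempt == retries - 1:
                raise
            # full-jitter exponential backoff: retries from many workers
            # land at different times instead of all at once
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))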

2 reactions
fjetter commented, Nov 18, 2020

This traceback looks “ordinary” and this might just be an ordinary timeout. Especially if the cluster is under heavy load, this might simply be expected behaviour. There is an ongoing discussion about increasing the default timeout value, see #4228. You might want to try increasing the value and see if that resolves the problem (try 30s or 60s if your cluster is under extreme stress).
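A minimal sketch of that suggestion, assuming it is applied in every process before the cluster is started (equivalent to exporting DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT):

import dask
from dask.distributed import Client

dask.config.set({"distributed.comm.timeouts.connect": "60s"})  # or "30s"
client = Client("tcp://scheduler:8786")  # hypothetical scheduler address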

