Worker stuck after "Protocol out of sync"

After switching to 1.4.5 from an older 1.3 version, we see some workers getting stuck with this pattern in the logs:

  1. “Protocol out of sync” with no previous errors
  2. Closing connection. KafkaConnectionError: Socket EVENT_READ without in-flight-requests
  3. [kafka.client] Node 2 connection failed -- refreshing metadata
  4. Five minutes later, the worker starts logging “Duplicate close() with error: [Error 7] RequestTimedOutError: Request timed out after 305000 ms” in an endless loop, with no apparent attempt to reconnect
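
To find other workers in this state, the sequence above can be checked for in collected logs. The following is only a rough sketch: it assumes the log lines contain the two message substrings quoted above, and the function name `looks_stuck` and the `min_repeats` threshold are hypothetical, not part of kafka-python.

```python
# Substrings taken from the log pattern described in this issue.
OUT_OF_SYNC = "Protocol out of sync"
DUPLICATE_CLOSE = "Duplicate close() with error"


def looks_stuck(log_lines, min_repeats=3):
    """Return True if an out-of-sync error is later followed by at least
    `min_repeats` "Duplicate close()" lines (hypothetical threshold)."""
    seen_out_of_sync = False
    repeats = 0
    for line in log_lines:
        if OUT_OF_SYNC in line:
            # Start (or restart) counting after each out-of-sync error.
            seen_out_of_sync = True
            repeats = 0
        elif seen_out_of_sync and DUPLICATE_CLOSE in line:
            repeats += 1
            if repeats >= min_repeats:
                return True
    return False
```

A match is only a hint that a worker may be in the stuck state; it does not prove that the worker has stopped consuming.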

More detailed (INFO) logs here: kafka-python-logs.txt

We found a worker that had been stuck like this for 2 days: it processed no messages, yet it neither failed outright nor triggered a group rebalance, so lag built up on its partition. The broker was running, and other workers could connect to it during that period. Note that Node 2 is the leader for the partition assigned to the worker. The group coordinator is Node 1, which is why the heartbeat keeps beating.

This seems to be the same problem as #1728, which wasn’t completely fixed.

#1733 was about fixing one possible cause of that error (that is, avoiding it). I believe that in our case the error is legitimate (temporary connection problems with the broker). The real issue is that the worker cannot recover after this error and becomes stuck instead of either giving up and dying or reconnecting. I’ve tried to find the code responsible for reconnecting (which doesn’t seem to fire), but I don’t understand your codebase that well. I will continue investigating; this is important for us.
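
Until the client recovers on its own, one possible workaround is the "give up and die" option: have the worker exit when it has made no progress for too long, so that a supervisor restarts it. The sketch below is an assumption-laden illustration, not a fix for the underlying bug. `StallWatchdog` and `max_idle_seconds` are hypothetical names, and the approach assumes the topic normally receives messages often enough that a long silence means a stall rather than an idle partition.

```python
import time


class StallWatchdog:
    """Tracks the time since the worker last made progress (hypothetical helper)."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self._max_idle = max_idle_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_progress = clock()

    def record_progress(self):
        # Call this whenever the consumer actually receives records.
        self._last_progress = self._clock()

    def is_stalled(self):
        # True once more than max_idle_seconds have passed without progress.
        return self._clock() - self._last_progress > self._max_idle
```

In a consumer loop, the worker would call `record_progress()` each time a poll returns records and exit with a non-zero status when `is_stalled()` returns True. Because the stuck worker in this report kept heartbeating, the check has to be based on message progress rather than on group membership.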

(We have deployed this on several environments and see this on both 1.0.1 and 2.1 brokers, identified by client as 1.0.0)

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

dpkp commented, Apr 3, 2019 (1 reaction)

Fixes have been merged to master. I’m going to close, but please reopen if the issue persists.

jeffwidman commented, Apr 2, 2019 (0 reactions)

See https://github.com/dpkp/kafka-python/pull/1766#issuecomment-479079703 for a status update on this issue.

@vimal3271 1.4.4 is an older version, and one major source of deadlocks was fixed in 1.4.5, but there have been additional fixes to master since then, so please upgrade to master (or wait for the upcoming 1.4.6 release) and if you are still seeing issues then please file a new ticket. Happy to help, just don’t want to spend time debugging/fixing issues that have already been solved on master.
