[New Scheduler] Brief etcd unavailability can leave a specific action queue stuck if unlucky
I had an action queue get into a stuck state after about two seconds of etcd downtime, while other actions were able to recover gracefully. What appears to happen is that the queue endpoint key expires in etcd and no longer exists, but the controller never hears about this and continues to believe the queue lives on the same scheduler endpoint (I believe WatchEndpointRemoved should be sent to the controllers in this case, but that doesn’t seem to have happened). The QueueManager of the scheduler the activation was sent to then hits the code path below, because the queue doesn’t exist on that host, and tries to remotely resolve it through etcd, but the queue endpoint doesn’t exist in etcd:
```scala
case t =>
  logging.warn(this, s"[${msg.activationId}] activation has been dropped (${t.getMessage})")
  completeErrorActivation(msg, "The activation has not been processed: failed to get the queue endpoint.")
}}
```
All requests for this action are then dropped in the QueueManager until the schedulers are restarted. Is there any way to make this more resilient, so that if something gets stuck in this edge case we can recover without requiring a restart? @style95
Additional logs for the timeline:
The way I know connectivity to etcd failed is that the controller emits the following log for about two seconds, after which all activations for that action begin to fail while all other actions return to normal.
[WARN] [#tid_T8Gk2BdDdf4PIjq8W8Ta12kD0ZAOMgwE] [ActionsApi] No scheduler endpoint available [marker:controller_loadbalancer_error:2:0]
Then the invoker emits this log from the containers that exist for the action, until the schedulers are restarted:
[ActivationClientProxy] The queue of action [REDACTED] does not exist. Check for queues in other schedulers.
So, for what it’s worth, I’ve yet to see this issue again since I raised the etcd lease timeout from 1 second to 10 seconds a couple of weeks ago. My theory is that 1 second was too short and the scheduler was hitting a cyclical race condition: it asks for a new lease and writes the key, but by the time it tries to write again the lease has already expired, so it stays stuck in that loop. That’s only a theory and I don’t have proof in the code, but if you want to look at the code with that in mind, raising the timeout does seem to have fixed it.
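A purely illustrative sketch of that timing argument (not an actual OpenWhisk config class; the names and numbers are hypothetical and taken from this report): the lease TTL has to comfortably outlive both the etcd blip and the keep-alive/re-registration round trip, otherwise the re-granted lease can expire again before the endpoint key is rewritten.

```scala
import scala.concurrent.duration._

// Hypothetical numbers matching the report above; only the comparison matters.
object LeaseTtlExample {
  val etcdOutage     = 2.seconds  // observed "No scheduler endpoint available" window
  val oldLeaseTtl    = 1.second   // shorter than the outage: the key vanishes and stays gone
  val raisedLeaseTtl = 10.seconds // survives the outage with margin for keep-alive retries

  def survivesOutage(ttl: FiniteDuration): Boolean = ttl > etcdOutage
}
```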
One thing I don’t get: when a lease is removed for some reason, the system is supposed to recover (reissue) the lease and all of its data. Since a network rupture can happen at any time, the system should be resilient to it. When I designed the new scheduler, that was the main requirement, and we tested it many times. But it seems the data was not properly recovered in your case, and that’s my question.
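For reference, a minimal sketch of the recovery behaviour being described, using the jetcd client; this is not the actual scheduler code, and `publishWithLease` with its blocking style is illustrative only. The idea: grant a lease, write the endpoint key under it, keep the lease alive, and if the keep-alive stream ever fails (for example after an etcd outage), grant a fresh lease and re-publish the key instead of leaving the endpoint missing.

```scala
import java.nio.charset.StandardCharsets.UTF_8

import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.lease.LeaseKeepAliveResponse
import io.etcd.jetcd.options.PutOption
import io.grpc.stub.StreamObserver

object LeaseWatchdog {

  def publishWithLease(client: Client, key: String, value: String, ttlSeconds: Long): Unit = {
    // Grant a lease with the configured TTL.
    val leaseId = client.getLeaseClient.grant(ttlSeconds).get().getID

    // Write the key bound to the lease; if the lease lapses, the key disappears with it.
    client.getKVClient
      .put(
        ByteSequence.from(key, UTF_8),
        ByteSequence.from(value, UTF_8),
        PutOption.newBuilder().withLeaseId(leaseId).build())
      .get()

    // Keep the lease alive; on any failure or stream completion, re-grant and re-publish.
    // A real implementation would add backoff and retries while etcd is still unreachable.
    client.getLeaseClient.keepAlive(
      leaseId,
      new StreamObserver[LeaseKeepAliveResponse] {
        override def onNext(response: LeaseKeepAliveResponse): Unit = ()
        override def onError(t: Throwable): Unit = publishWithLease(client, key, value, ttlSeconds)
        override def onCompleted(): Unit = publishWithLease(client, key, value, ttlSeconds)
      })
    ()
  }
}
```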
Regarding caching the scheduler endpoint, I think that would be a good improvement. Currently, if no scheduler endpoint is found, the controller just fails activations. But we could make controllers send activations and queue-creation requests to the scheduler side anyway (using cached endpoints, as long as they are reachable) and let the schedulers handle the rest.
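A hedged sketch of that fallback idea; `SchedulerEndpoint`, `EndpointResolver`, and the `fetchFromEtcd` function are illustrative names, not the actual controller code. If the etcd lookup returns nothing, or etcd is unreachable, the controller falls back to the last endpoint it successfully resolved instead of failing the activation outright, and the scheduler side is left to recreate the queue if needed.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

final case class SchedulerEndpoint(host: String, rpcPort: Int)

class EndpointResolver(fetchFromEtcd: String => Future[Option[SchedulerEndpoint]])(
  implicit ec: ExecutionContext) {

  // Last successfully resolved endpoint per fully qualified action name.
  private val lastKnown = TrieMap.empty[String, SchedulerEndpoint]

  def resolve(fqn: String): Future[Option[SchedulerEndpoint]] =
    fetchFromEtcd(fqn)
      .map {
        case Some(endpoint) =>
          lastKnown.update(fqn, endpoint) // remember the good answer
          Some(endpoint)
        case None =>
          lastKnown.get(fqn) // key missing in etcd: fall back to the cached endpoint
      }
      .recover { case _ => lastKnown.get(fqn) } // etcd unreachable: same fallback
}
```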
One thing I can do for now is to update the default configurations.