
Queries are kept waiting even when there are available workers


Issue Summary

I have 4 workers for adhoc/user queries. With 4 workers, concurrency should be 4, but sometimes fewer than 4 queries are in progress while there are still queries waiting in the queue on the /admin/queries/tasks page.

I think this happens because of the Celery worker’s prefetch behavior. A Celery worker fetches 4 tasks (queries) at a time by default. If a worker fetches 4 tasks at once and the 1st query is long-running, the remaining 3 queries are kept waiting for a long time even if other workers are available.

Prefetching is good for many short queries but not efficient with long-running queries, so I think it should be configurable through settings.py.

About celery prefetch: http://docs.celeryproject.org/en/latest/userguide/optimizing.html#reserve-one-task-at-a-time
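As a rough sketch of what the linked recommendation could look like if exposed as a setting (the exact wiring into Redash’s configuration is an assumption here, not something confirmed in this issue), using the Celery 3.x-style option names:

from celery import Celery

celery = Celery('redash')

# Reserve only one task per worker process instead of the default prefetch of 4,
# and acknowledge tasks only after they complete, as the linked docs recommend
# for long-running tasks.
celery.conf.update(
    CELERYD_PREFETCH_MULTIPLIER=1,
    CELERY_ACKS_LATE=True,
)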

Steps to Reproduce

  1. Set up more than 2 workers
  2. Run some queries

Technical details:

  • Redash Version: 1.0.3
  • Browser/OS: Chrome
  • How did you install Redash: Docker

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
ghost commented, Sep 20, 2018

I have discovered the source of my own issues. For the sake of posterity I’ll include a full description here.

Celery enforces a strict 4-second startup limit on worker processes whose initialization runs through the worker_process_init signal.

Redash uses this signal for its workers: when a new worker process is launched, Redash instantiates a fresh instance of the application inside the signal handler.

In my case, the worker was taking longer than 4 seconds to initialize, and because init_celery_flask_app() is decorated with the worker_process_init signal in redash/worker.py, Celery was destroying the new process. supervisord was configured to restart the process indefinitely, so Celery would kill it, supervisord would restart it, and this repeated until a process happened to launch in under 4 seconds.

The result is that, from the front-end, a query appeared to take anywhere from 5 seconds to 10 minutes before the task runner ran it. Queries would eventually run if left in the queue long enough, because Redash would eventually initialize a worker in under 4 seconds (after probably a few hundred tries).

I have not yet determined why startup takes longer than 4 seconds; my first guess is the opening of connections to the database servers.
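For context, this is roughly how redash/worker.py hooks into the signal (a hedged paraphrase rather than a quote from the issue; create_app is the usual Flask application factory):

from celery.signals import worker_process_init

@worker_process_init.connect
def init_celery_flask_app(**kwargs):
    # Rebuild the Flask application inside each freshly forked worker process.
    # Opening connections to the database and Redis here is the work that can
    # push startup past Celery's 4-second PROC_ALIVE_TIMEOUT.
    from redash import create_app
    app = create_app()
    app.app_context().push()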

My very short-term fix is to manually adjust Celery’s PROC_ALIVE_TIMEOUT constant so that Redash has enough time to complete its initialization. There is no configuration option for this.

In redash/worker.py I added the following:

...
### Original ###
from celery import Celery
from celery.schedules import crontab
from celery.signals import worker_process_init

### Added the following: ###
from celery.concurrency import asynpool
asynpool.PROC_ALIVE_TIMEOUT = 10.0  # set this long enough for the worker to finish initializing

From examining the logs I saw that this allowed the worker to finish starting and actually execute the queries that were in the task queue.

I’ll leave another comment when I figure out what is causing the slow startup time for the worker.

1 reaction
ghost commented, Aug 6, 2018

Just wanted to note that this issue is still occurring with version 4.0.1.b4038.

I’ve increased the number of workers and tried applying #1783. The issue is only resolved when I run:

supervisorctl stop redash_celery && redis-cli flushall && supervisorctl start redash_celery

But that’s only a temporary fix: after running a few queries (5-6 small ones) the behavior starts again. As a short-term solution I’m going to set up a cron job that runs the above command, but that seems like the wrong approach, at least to me.
