Queries are kept waiting even when there are available workers
Issue Summary
I have 4 workers for ad hoc/user queries, so the query concurrency should be 4.
But sometimes fewer than 4 queries are in progress while other queries sit waiting in the queue on the /admin/queries/tasks page.
I think this happens because of the Celery worker's prefetch behavior. By default, a Celery worker fetches 4 tasks (queries) at a time. If a worker fetches 4 tasks at once and the 1st query is long-running, the remaining 3 queries are kept waiting for a long time even when other workers are available.
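To see why prefetching starves short queries, here is a toy model of the batching effect (this is only a sketch, not Celery's actual scheduler): with a prefetch batch of 4, one worker hoards the long query plus the short queries queued behind it, while the other workers sit idle.

```python
import heapq

def completion_times_prefetch(num_workers, prefetch, durations):
    """Each worker grabs a batch of `prefetch` tasks up front and runs
    it serially -- a rough model of celery's default early prefetch."""
    batches = [[] for _ in range(num_workers)]
    for chunk_idx, start in enumerate(range(0, len(durations), prefetch)):
        batches[chunk_idx % num_workers].extend(durations[start:start + prefetch])
    finished = []
    for batch in batches:
        t = 0.0
        for d in batch:
            t += d
            finished.append(t)
    return sorted(finished)

def completion_times_fair(num_workers, durations):
    """Prefetch of 1: each idle worker pulls only the next waiting task."""
    free = [0.0] * num_workers  # time at which each worker becomes idle
    heapq.heapify(free)
    finished = []
    for d in durations:
        t = heapq.heappop(free) + d
        finished.append(t)
        heapq.heappush(free, t)
    return sorted(finished)

tasks = [100, 1, 1, 1]  # one long query followed by three short ones
print(completion_times_prefetch(4, 4, tasks))  # [100.0, 101.0, 102.0, 103.0]
print(completion_times_fair(4, tasks))         # [1.0, 1.0, 1.0, 100.0]
```

With prefetching, the three 1-second queries finish only after the 100-second query, even though three workers are idle the whole time; with fair dispatch they finish almost immediately.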
Prefetching is good for many short queries but inefficient with long-running queries, so I think it should be configurable through settings.py.
About celery prefetch: http://docs.celeryproject.org/en/latest/userguide/optimizing.html#reserve-one-task-at-a-time
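One way to make this configurable, as the issue suggests, is to expose Celery's prefetch multiplier through Redash's settings module. This is only a sketch: the `REDASH_CELERY_PREFETCH_MULTIPLIER` environment variable name is hypothetical, while `worker_prefetch_multiplier` is the real Celery setting this value would feed, and a value of 1 makes each worker reserve only one task at a time.

```python
# Hypothetical addition to redash/settings.py: let operators tune
# Celery's prefetch for deployments dominated by long-running queries.
import os

# 1 disables batched prefetching (each worker reserves one task at a
# time); larger values trade fairness for throughput on short tasks.
CELERY_WORKER_PREFETCH_MULTIPLIER = int(
    os.environ.get("REDASH_CELERY_PREFETCH_MULTIPLIER", "1")
)
```

The same effect can also be had operationally by starting the worker with Celery's `-Ofair` scheduling option.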
Steps to Reproduce
- Set up more than 2 workers
- Run some queries
Technical details:
- Redash Version: 1.0.3
- Browser/OS: Chrome
- How did you install Redash: Docker
Issue Analytics
- Created 6 years ago
- Comments: 12 (6 by maintainers)
Top GitHub Comments
I have discovered the source of my own issues. For the sake of posterity I’ll include a full description here.
Celery has a strict limit of 4 seconds for worker processes which use the worker_process_init signal. Redash utilizes this signal for its workers, and also instantiates a new instance of the application upon launching each new worker where this signal is used.
In my case, the worker was taking longer than 4 seconds to initialize, and as init_celery_flask_app() is decorated with the worker_process_init signal in the file redash/worker.py, celery was destroying the new process. supervisord was configured to restart the process indefinitely, so celery would kill it, supervisord would restart it, and this would repeat until the process happened to launch in under 4 seconds.

The result of this is that it would appear (from the front-end) that a query was taking anywhere from 5 seconds to 10 minutes for the task runner to run it. Queries would eventually run if left in the queue for long enough, as Redash would eventually initialize a worker in under 4 seconds (after probably a few hundred tries).
I have not yet determined the cause of it taking longer than 4 seconds to start up, my first guess is opening connections to database servers.
My very short-term fix is to manually adjust celery's PROC_ALIVE_TIMEOUT constant to allow Redash to complete its initialization; there is no configuration option for this. In redash/worker.py I added the following:

From examining the logs I saw that this allowed the worker to finish starting and actually execute the queries that were in the task queue.
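A minimal sketch of this kind of override, assuming the constant lives in celery.concurrency.asynpool (as it does in Celery 3.x/4.x) and that 60 seconds is enough headroom; the exact value the commenter used is not preserved here:

```python
# Hypothetical override for redash/worker.py: raise celery's hard-coded
# process-startup timeout so that a slow-initializing worker is not
# killed and endlessly restarted by supervisord.
from celery.concurrency import asynpool

asynpool.PROC_ALIVE_TIMEOUT = 60.0  # celery's default is 4.0 seconds
```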
I’ll leave another comment when I figure out what is causing the slow startup time for the worker.
Just wanted to note that this issue is still occurring with version 4.0.1.b4038. I've increased the number of workers and tried applying #1783. It's only resolved when I run:

But that's only a temporary fix. After running a few queries (5-6 small ones) the behavior starts over. As a short-term solution, I'm going to implement a cron job which runs the above command, but that seems like the wrong approach, at least to me.