Queries are kept waiting even when there are available workers
Issue Summary
I have 4 workers for ad hoc/user queries, so the query concurrency should be 4.
But sometimes fewer than 4 queries are in progress while other queries sit waiting in the queue on the /admin/queries/tasks page.
I think this happens because of the Celery worker's prefetch behavior. By default, a Celery worker fetches 4 tasks (queries) at a time. If a worker fetches 4 tasks at once and the 1st query is long-running, the remaining 3 queries are kept waiting for a long time even when other workers are available.
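To see why prefetching starves short queries, here is a toy model of the batching effect (this is only a sketch, not Celery's actual scheduler): with a prefetch batch of 4, one worker hoards the long query plus the short queries queued behind it, while the other workers sit idle.

```python
import heapq

def completion_times_prefetch(num_workers, prefetch, durations):
    """Each worker grabs a batch of `prefetch` tasks up front and runs
    it serially -- a rough model of celery's default early prefetch."""
    batches = [[] for _ in range(num_workers)]
    for chunk_idx, start in enumerate(range(0, len(durations), prefetch)):
        batches[chunk_idx % num_workers].extend(durations[start:start + prefetch])
    finished = []
    for batch in batches:
        t = 0.0
        for d in batch:
            t += d
            finished.append(t)
    return sorted(finished)

def completion_times_fair(num_workers, durations):
    """Prefetch of 1: each idle worker pulls only the next waiting task."""
    free = [0.0] * num_workers  # time at which each worker becomes idle
    heapq.heapify(free)
    finished = []
    for d in durations:
        t = heapq.heappop(free) + d
        finished.append(t)
        heapq.heappush(free, t)
    return sorted(finished)

tasks = [100, 1, 1, 1]  # one long query followed by three short ones
print(completion_times_prefetch(4, 4, tasks))  # [100.0, 101.0, 102.0, 103.0]
print(completion_times_fair(4, tasks))         # [1.0, 1.0, 1.0, 100.0]
```

With prefetching, the three 1-second queries finish only after the 100-second query, even though three workers are idle the whole time; with fair dispatch they finish almost immediately.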
Prefetching is good for many short queries but inefficient with long-running queries, so I think it should be configurable through settings.py.
About celery prefetch: http://docs.celeryproject.org/en/latest/userguide/optimizing.html#reserve-one-task-at-a-time
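One way to make this configurable, as the issue suggests, is to expose Celery's prefetch multiplier through Redash's settings module. This is only a sketch: the `REDASH_CELERY_PREFETCH_MULTIPLIER` environment variable name is hypothetical, while `worker_prefetch_multiplier` is the real Celery setting this value would feed, and a value of 1 makes each worker reserve only one task at a time.

```python
# Hypothetical addition to redash/settings.py: let operators tune
# Celery's prefetch for deployments dominated by long-running queries.
import os

# 1 disables batched prefetching (each worker reserves one task at a
# time); larger values trade fairness for throughput on short tasks.
CELERY_WORKER_PREFETCH_MULTIPLIER = int(
    os.environ.get("REDASH_CELERY_PREFETCH_MULTIPLIER", "1")
)
```

The same effect can also be had operationally by starting the worker with Celery's `-Ofair` scheduling option.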
Steps to Reproduce
- Set up more than 2 workers
- Run some queries
Technical details:
- Redash Version: 1.0.3
- Browser/OS: Chrome
- How did you install Redash: Docker
Issue Analytics
- Created 6 years ago
- Comments: 12 (6 by maintainers)
Top GitHub Comments
I have discovered the source of my own issues. For the sake of posterity I’ll include a full description here.
Celery has a strict limit of 4 seconds for worker processes which use the worker_process_init signal. Redash utilizes this signal for its workers, and also instantiates a new instance of the application upon launching each new worker where this signal is used.
In my case, the worker was taking longer than 4 seconds to initialize, and as init_celery_flask_app() is decorated with the worker_process_init signal in the file redash/worker.py, celery was destroying the new process. supervisord was configured to restart the process indefinitely, so celery would kill it, supervisord would restart it, and this would repeat until the process happened to launch in under 4 seconds.

The result of this is that it would appear (from the front-end) that a query was taking anywhere from 5 seconds to 10 minutes for the task runner to run it. Queries would eventually run if left in the queue for long enough, as Redash would eventually initialize a worker in under 4 seconds (after probably a few hundred tries).
I have not yet determined the cause of it taking longer than 4 seconds to start up, my first guess is opening connections to database servers.
My very short-term fix is to manually adjust celery's PROC_ALIVE_TIMEOUT constant to allow Redash to complete its initialization; there is no configuration option for this. In redash/worker.py I added the following:

From examining the logs I saw that this allowed the worker to finish starting and actually execute the queries that were in the task queue.
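A minimal sketch of this kind of override, assuming the constant lives in celery.concurrency.asynpool (as it does in Celery 3.x/4.x) and that 60 seconds is enough headroom; the exact value the commenter used is not preserved here:

```python
# Hypothetical override for redash/worker.py: raise celery's hard-coded
# process-startup timeout so that a slow-initializing worker is not
# killed and endlessly restarted by supervisord.
from celery.concurrency import asynpool

asynpool.PROC_ALIVE_TIMEOUT = 60.0  # celery's default is 4.0 seconds
```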
I’ll leave another comment when I figure out what is causing the slow startup time for the worker.
Just wanted to note that this issue is still occurring with version 4.0.1.b4038. I've increased the number of workers and tried applying #1783. It's only resolved when I run:

But that's only a temporary fix. After running a few queries (5-6 small ones) the behavior starts over. As a short-term solution, I'm going to implement a cron job which runs the above command, but that seems like the wrong approach, at least to me.