
queue: Queue status output displays incorrect number of workers

See original GitHub issue

Bug Report

Description

After starting multiple queue task workers with dvc queue start --jobs 4, the dvc queue status output displays an incorrect number of workers. It first showed 1 active, 0 idle, and then 0 active, 0 idle, even though two tasks were still Running (only two tasks had been queued before starting the workers).
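For illustration only (the column layout here is approximate and the task IDs are made up), the observed output looked roughly like this:

$ dvc queue status
Task     Name    Created     Status
f3b1a2c          10:02 AM    Running
9d8e7f6          10:02 AM    Running

Worker status: 0 active, 0 idle

With four workers started and two tasks still Running, a worker count of 0 active, 0 idle does not add up.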

Reproduce

  1. Start workers with dvc queue start --jobs 4
  2. Check output of dvc queue status

Expected

The sum of active and idle queue task workers should match the number of started workers.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.31.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:

Supports:
        azure (adlfs = None, knack = 0.10.0, azure-identity = 1.11.0),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = None),
        hdfs (fsspec = None, pyarrow = 9.0.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = None, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7),
        webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git

Additional Information (if any):

Note that I am running DVC inside a Docker container, though it seems this should be irrelevant.

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 5

Top GitHub Comments

1 reaction
aschuh-hf commented, Nov 24, 2022

Checking the worker processes using the process IDs from .dvc/tmp/celery/dvc-exp-worker-?.pid with ps aux | grep <pid>, I can see that three of the worker processes (2-4) do not actually exist; only the one belonging to the still-running task does.
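A minimal sketch of that check (assuming the pid files are the ones named above; paths can vary by DVC version), using ps -p instead of grep so the grep process itself is not matched:

for f in .dvc/tmp/celery/dvc-exp-worker-*.pid; do
  pid=$(cat "$f")
  echo "== $f (pid $pid) =="
  # ps -p prints the process if it exists and exits non-zero otherwise
  ps -p "$pid" -o pid,etime,cmd || echo "stale pid file: no such process"
done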

Maybe the dvc queue status output should include a third column showing how many workers are still available before the maximum number of allowed workers is reached?
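As a purely hypothetical mock-up of that suggestion (not existing DVC output), the line could read something like:

Worker status: 1 active, 0 idle, 3 available (max 4)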

0 reactions
karajan1001 commented, Nov 25, 2022

Excuse me, what do the other workers' logs look like?

[2022-11-22 18:32:13,969: ERROR/MainProcess] timed out
[2022-11-22 18:32:18,018: ERROR/MainProcess] timed out
[2022-11-22 18:32:23,136: ERROR/MainProcess] timed out
[2022-11-22 18:32:27,515: ERROR/MainProcess] timed out
[2022-11-22 18:32:32,285: ERROR/MainProcess] timed out
[2022-11-22 18:32:37,234: ERROR/MainProcess] timed out
[2022-11-22 18:32:41,309: ERROR/MainProcess] timed out
[2022-11-22 18:32:46,261: ERROR/MainProcess] timed out
[2022-11-22 18:32:51,107: ERROR/MainProcess] timed out
[2022-11-22 18:32:56,213: ERROR/MainProcess] timed out
[2022-11-22 18:33:00,586: ERROR/MainProcess] timed out
[2022-11-22 18:33:04,893: ERROR/MainProcess] timed out
[2022-11-22 18:33:09,264: ERROR/MainProcess] timed out
[2022-11-22 18:33:13,380: ERROR/MainProcess] timed out
[2022-11-22 18:33:17,512: ERROR/MainProcess] timed out

It looks like something went wrong with the Celery worker.
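One way to gather those logs, assuming the worker log files sit next to the pid files mentioned above (an assumption; the exact location and naming can differ by DVC version):

for f in .dvc/tmp/celery/dvc-exp-worker-*.log; do
  echo "===== $f ====="
  # show the last lines of each worker log, if the file exists
  tail -n 50 "$f"
done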
