
Flow + JobId usage creates "zombie" jobs

See original GitHub issue

The issue happens when doing a backfill, where flow jobs are used to make sure one job runs after the other.

In this situation, I don't care about the output of the child jobs; flows are used only as a synchronization tool.

In the code below, Monday and Tuesday are queued first, and both execute (Monday as the child, then Tuesday as the parent).

Later, when a second flow with Tuesday as the child and Wednesday as the parent is added, Wednesday won't run, because Tuesday has already finished and will never emit the completion event the parent is waiting for.

Even after deleting Tuesday, Wednesday won't run.

And later, if I try to schedule Thursday with Wednesday as its child, neither runs, because Wednesday is stuck in a zombie state and never executes.


import { delay, FlowProducer, Queue, Worker } from "bullmq";

const queueName = "queue" + Math.random();

new Worker(queueName, async (job) => {
  console.log("working...", job.name);
});

const main = async () => {
  const flowProducer = new FlowProducer();
  const queue = new Queue(queueName);

  await flowProducer.add({
    queueName,
    name: "tue",
    opts: {
      jobId: "tue",
    },
    children: [
      {
        name: "mon",
        queueName,
        opts: {
          jobId: "mon",
        },
      },
    ],
  });
  await delay(100);

  // console.log: working... mon
  // console.log: working... tue

  // wed will never run because tue has finished
  await flowProducer.add({
    queueName,
    name: "wed",
    opts: {
      jobId: "wed",
    },
    children: [
      {
        name: "tue",
        queueName,
        opts: {
          jobId: "tue",
        },
      },
    ],
  });

  // after removing tue, wed won't run
  await queue.remove("tue");
  await delay(100);

  // any job that depends on wed won't run either
  await flowProducer.add({
    queueName,
    name: "thu",
    opts: {
      jobId: "thu",
    },
    children: [
      {
        name: "wed",
        queueName,
        opts: {
          jobId: "wed",
        },
      },
    ],
  });
};

main();

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
manast commented, Oct 22, 2021

@lucasavila00 Yes, as @roggervalf already explained, this is currently working as designed: jobs using the same id as existing jobs in the queue (in any status) are ignored.
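
A minimal sketch of this deduplication behavior (not part of the issue; it assumes bullmq with a Redis instance on localhost:6379, and the queue name is made up for the demo):

import { Queue } from "bullmq";

const queue = new Queue("dedupe-demo");

const demo = async () => {
  const first = await queue.add("tue", {}, { jobId: "tue" });

  // Adding again with the same custom jobId does not create a new job:
  // BullMQ returns a handle to the existing "tue" job, whatever its status.
  const second = await queue.add("tue", {}, { jobId: "tue" });

  console.log(first.id === second.id); // logs: true

  await queue.close();
};

demo();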

I am not sure why you want to specify custom jobIds, but my guess is that you want to be able to continually add jobs to an existing flow; that sounds like a legitimate case to me.

So in order to allow for this use case, we need to make some changes. I think it would be enough to update the parent's dependencies status when a child job with a custom id has already completed, and otherwise ignore the job as we do now. That implies that in your case above the “tue” job would not be re-processed, just ignored as it is now, but the parent would be processed.

There may be edge cases that we need to figure out, but I think that in principle it should work. A user-side workaround along these lines is sketched below.
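
Until such a change lands, one possible user-side workaround (an editor's sketch, not a BullMQ feature; addFlowSafely is a made-up helper) is to check whether the child with the custom id has already completed before adding the flow, and enqueue the parent directly if so:

import { FlowProducer, Queue } from "bullmq";

const addFlowSafely = async (flowProducer, queue, queueName, parent, child) => {
  const existing = await queue.getJob(child);
  if (existing && (await existing.isCompleted())) {
    // The child already ran, so a flow would leave the parent waiting
    // forever; enqueue the parent as a standalone job instead.
    await queue.add(parent, {}, { jobId: parent });
    return;
  }
  // Otherwise add the flow as usual.
  await flowProducer.add({
    queueName,
    name: parent,
    opts: { jobId: parent },
    children: [{ name: child, queueName, opts: { jobId: child } }],
  });
};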

1 reaction
roggervalf commented, Oct 22, 2021

Hi @lucasavila00, I dug a little into this case. Since the tue job was already added in the first flow, when you add it again in the second one, the check at https://github.com/taskforcesh/bullmq/blob/master/src/commands/addJob-8.lua#L66-L70 returns the existing job id and no further logic runs: the job is not added again, and no parent id is attached to it the second time you try to add it. So wed stays in the waiting state, and so on; that is the reason for your zombie jobs. Either way, I would like to have @manast's opinion about this.
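
To confirm that a job is stuck this way, a small diagnostic can help (an assumed helper, not from the issue; getState and parentKey belong to BullMQ's Job API):

import { Queue } from "bullmq";

const inspectJob = async (queueName, jobId) => {
  const queue = new Queue(queueName);
  const job = await queue.getJob(jobId);
  if (!job) {
    console.log(jobId, "not found");
  } else {
    // A zombie parent typically reports the waiting-children state.
    console.log(jobId, "state:", await job.getState());
    // parentKey points at the parent that is waiting on this job, if any.
    console.log(jobId, "parentKey:", job.parentKey);
  }
  await queue.close();
};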
