Job is activated but is not received by job worker (intermittent issue)


I have an issue that sounds exactly the same as the one described here: https://github.com/zeebe-io/zeebe/issues/3585

Logs show that jobs are created and activated, but the Node.js batch worker does not receive the jobs in ~1% of cases. All the detected cases happened under high load, when thousands of workflow instances were being created.

I increased the log verbosity to the maximum level and can see that the worker never receives those jobs; they are simply skipped.

The last time I detected this issue, I had started around 2000 instances, and the first activity in the workflow (a service task) did not receive 12 jobs. From what I found, I think all the skipped jobs belong to a single batch: the positions of the exported records (jobs activated) are very close to each other (differing by 4 to 8):

  • 2443938
  • 2443930
  • 2443886
  • 2443882 …

I see two possible reasons:

  • the broker does not send the batch to the client
  • the client ignores the received batch

In the general case, I would also suggest that a network interruption could make the broker think the jobs have been sent while the client never actually receives them, but in my case this is impossible since the broker and client are on the same server.

I tried calling zbc.completeJob() for those jobs, and the broker processed the calls successfully and continued workflow execution. That means the broker believes those jobs had already been taken by a worker.


My application:

  • zeebe-node 0.23.2
  • zeebe 0.24.1
  • Single Zeebe node, 10 cpu + 10 io threads, 10 partitions.

I have very long-running tasks (up to months, or even years), so I cannot wait for the job timeout. I use a batch worker, and all the jobs are forwarded to an external system. Worker config:

    maxJobsToActivate: 200,
    jobBatchMinSize: 32,
    jobBatchMaxTime: 3,
    timeout: Duration.days.of(365), // yep, this is 1 year
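
For readers unfamiliar with these options, here is a minimal sketch of how the two batching settings are commonly understood to interact: the handler fires once `jobBatchMinSize` jobs are buffered, or once `jobBatchMaxTime` seconds have elapsed, whichever comes first. The `shouldFlush` helper and the `BatchConfig` type are hypothetical names used for illustration only; they are not part of zeebe-node, and measuring the wait from the first buffered job is an assumption.

```typescript
// Hypothetical illustration only; not the zeebe-node implementation.
interface BatchConfig {
  jobBatchMinSize: number; // flush once at least this many jobs are buffered
  jobBatchMaxTime: number; // or once this many seconds have passed (assumed: since the first buffered job)
}

// Decides whether a buffered batch should be handed to the handler now.
function shouldFlush(buffered: number, secondsWaiting: number, cfg: BatchConfig): boolean {
  if (buffered === 0) return false; // nothing to deliver
  return buffered >= cfg.jobBatchMinSize || secondsWaiting >= cfg.jobBatchMaxTime;
}

// Values taken from the worker config above.
const cfg: BatchConfig = { jobBatchMinSize: 32, jobBatchMaxTime: 3 };
```

Under this reading, a group of 12 jobs would be handed over after 3 seconds even though it is smaller than the minimum batch size.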

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
jwulf commented, Sep 18, 2020

I think this is due to a race condition in the Batch processing. It passes a copy of the array of jobs for the batch to the handler. It looks like the original array could be updated asynchronously while this is happening. That’s my hypothesis.

I’ve changed the “passing a copy of the array of batched jobs” to passing a slice of the array. This means that any jobs added to the batch while the handler is executing will be added to the next batch.
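
The kind of race hypothesized here can be illustrated with a small sketch. This is not the zeebe-node source: `JobBuffer`, `flushLossy`, `flushSafe`, and `simulate` are hypothetical names, and using `splice()` to detach the batch before the handler runs is one assumed way to get the "late jobs go to the next batch" behaviour described above.

```typescript
// Hypothetical illustration only; not the zeebe-node implementation.
type Job = { key: number };
type BatchHandler = (jobs: Job[]) => Promise<void>;

class JobBuffer {
  private jobs: Job[] = [];

  add(job: Job): void {
    this.jobs.push(job);
  }

  get pending(): number {
    return this.jobs.length;
  }

  // Lossy variant: the handler gets a copy and the buffer is cleared afterwards,
  // so a job added while the handler is awaited is wiped without being delivered.
  async flushLossy(handler: BatchHandler): Promise<void> {
    const batch = this.jobs.slice();
    await handler(batch);
    this.jobs = [];
  }

  // Safer variant: splice() detaches the batch from the buffer before the handler runs,
  // so jobs that arrive during handling stay queued for the next flush.
  async flushSafe(handler: BatchHandler): Promise<void> {
    const batch = this.jobs.splice(0, this.jobs.length);
    await handler(batch);
  }
}

// Runs one flush while a late job (key 3) arrives mid-handler.
async function simulate(variant: "lossy" | "safe"): Promise<{ delivered: number[]; pending: number }> {
  const buffer = new JobBuffer();
  const delivered: number[] = [];
  buffer.add({ key: 1 });
  buffer.add({ key: 2 });
  const handler: BatchHandler = async (jobs) => {
    buffer.add({ key: 3 }); // arrives while the handler is running
    await Promise.resolve();
    jobs.forEach((j) => delivered.push(j.key));
  };
  if (variant === "lossy") {
    await buffer.flushLossy(handler);
  } else {
    await buffer.flushSafe(handler);
  }
  return { delivered, pending: buffer.pending };
}
```

In the lossy variant, job 3 is neither delivered nor left pending, which matches the symptom of an activated job that the worker never sees. In the safe variant, it remains queued for the next batch.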

I will release the 0.24.1 version soon for you to test. It’s challenging to reproduce an edge case at volume like that.

0 reactions
jwulf commented, Oct 23, 2020

Closing this for now. If you still see the issue with 0.25.0 of the client, please reopen.
