AssertionError (assert count > 0) in SLURMCluster._adapt


When I run a Slurm cluster with adapt(), I sometimes get the following crash. It is not deterministic, and I have not found a way to trigger it more often.

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f59359aed90>, <Future finished exception=AssertionError()>)
Traceback (most recent call last):
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
    future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/deploy/adaptive.py", line 334, in _adapt
    workers = yield self._retire_workers(workers=recommendations['workers'])
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/deploy/adaptive.py", line 242, in _retire_workers
    close_workers=True)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2800, in retire_workers
    n=1, delete=False)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2613, in replicate
    assert count > 0
AssertionError
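
Most of the frames in this traceback belong to tornado's coroutine machinery. As a reading aid, here is a minimal sketch that pulls out only the frames from one package. The helper name `frames_from` and the shortened `SAMPLE` text are made up for illustration; only the paths, line numbers, and function names come from the traceback above.

```python
import re

# Matches the 'File "...", line N, in func' header of a traceback frame.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def frames_from(traceback_text, package):
    """Return (relative path, line, function) for frames inside `package`.

    Hypothetical helper: it assumes the package lives under site-packages.
    """
    marker = "/site-packages/" + package + "/"
    frames = []
    for match in FRAME_RE.finditer(traceback_text):
        path = match.group("path")
        if marker in path:
            relative = path.split(marker, 1)[1]
            frames.append((relative, int(match.group("line")), match.group("func")))
    return frames


# Shortened excerpt of the traceback (frame headers only).
PREFIX = "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages"
SAMPLE = f'''
  File "{PREFIX}/tornado/gen.py", line 1141, in run
  File "{PREFIX}/distributed/deploy/adaptive.py", line 334, in _adapt
  File "{PREFIX}/distributed/deploy/adaptive.py", line 242, in _retire_workers
  File "{PREFIX}/distributed/scheduler.py", line 2800, in retire_workers
  File "{PREFIX}/distributed/scheduler.py", line 2613, in replicate
'''

for name, line, func in frames_from(SAMPLE, "distributed"):
    print(f"{name}:{line} in {func}")
```

Read top to bottom, the filtered frames give the call chain that ends at the failing assertion: `_adapt`, then `_retire_workers`, then `retire_workers`, then `replicate`.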

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 33 (18 by maintainers)

Top GitHub Comments

guillaumeeb commented, May 13, 2019 (1 reaction)

@ogrisel do you still encounter this bug?

guillaumeeb commented, May 2, 2021 (0 reactions)

Is it possible the adaptive deployment is somehow force closing the workers before the futures have a chance to migrate?

An adaptive cluster is a tricky setup. There might still be some edge cases where it fails to close things properly. I think you’d be better off asking this question on the distributed issue tracker, but not before having some code that reproduces the problem 😉. I know, this is sometimes almost impossible…
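
The question quoted in this comment (workers being closed before their data has migrated) can be illustrated with a toy model. This is not the scheduler's code: the function `copies_to_request`, its parameters, and the formula for `count` are assumptions chosen to show how an `assert count > 0` could fail once no worker holds a key any more.

```python
def copies_to_request(n, holders, candidates, branching_factor=2):
    """Toy stand-in for a per-key replication step (hypothetical, not distributed's code).

    n          -- number of copies wanted among `candidates`
    holders    -- workers that currently hold the key
    candidates -- workers that should end up holding the key
    """
    n_missing = n - len(holders & candidates)
    if n_missing <= 0:
        return 0  # enough copies already live on the candidate workers
    # Assumed formula: each current holder can feed a bounded number of copies.
    count = min(n_missing, branching_factor * len(holders))
    assert count > 0  # fails when `holders` is empty
    return count


# Normal case: w1 holds the key and is being retired, w2 should receive a copy.
print(copies_to_request(1, {"w1"}, {"w2"}))

# Race: w1 was already closed, so nobody holds the key when replication runs.
try:
    copies_to_request(1, set(), {"w2"})
except AssertionError:
    print("AssertionError: no worker holds the key any more")
```

If something like this is what happens, the crash would depend on the timing between worker shutdown and data migration, which would fit the non-deterministic behaviour reported in the issue. A reproducer along these lines would still need to be confirmed against the real scheduler.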
