AssertionError (assert count > 0) in SLURMCluster._adapt
When I run a SLURM cluster with adapt(), I sometimes get the crash shown in the traceback further down. It is not deterministic, and I have not found a way to trigger it more reliably.
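For context, here is a minimal sketch of the kind of setup that exercises this code path; the queue name, resources, and adaptive bounds are placeholders, not the configuration from the original report:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; the original report does not include the cluster configuration.
cluster = SLURMCluster(
    queue="normal",        # hypothetical SLURM partition
    cores=4,
    memory="16GB",
    walltime="01:00:00",
)

# Let the cluster scale the number of SLURM jobs up and down with the workload.
cluster.adapt(minimum=0, maximum=10)

client = Client(cluster)
```

With a setup along these lines, the crash surfaces intermittently as: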
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f59359aed90>, <Future finished exception=AssertionError()>)
Traceback (most recent call last):
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
    future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/deploy/adaptive.py", line 334, in _adapt
    workers = yield self._retire_workers(workers=recommendations['workers'])
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/deploy/adaptive.py", line 242, in _retire_workers
    close_workers=True)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2800, in retire_workers
    n=1, delete=False)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/scratch/ogrisel/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2613, in replicate
    assert count > 0
AssertionError
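The failure originates in Scheduler.replicate, which retire_workers calls (with n=1, delete=False, as the traceback shows) to keep an extra copy of keys held by the workers being retired. The snippet below is an illustrative, from-memory sketch of the invariant involved, not the actual distributed source; it shows how the number of copies to schedule can reach zero, for instance if a key's only holder has already gone away by the time replication runs:

```python
# Illustrative sketch only; not the distributed source code.
def copies_to_create(n, holders, branching_factor=2):
    """Number of additional copies of a key to schedule in one replication round."""
    n_missing = n - len(holders)  # copies still needed to reach n
    # Each current holder can serve only a limited number of transfers per round,
    # so the round is capped by branching_factor * len(holders).
    count = min(n_missing, branching_factor * len(holders))
    # If `holders` is empty (e.g. the retiring worker is already gone), count is 0
    # and an `assert count > 0` check fails, as in the traceback above.
    assert count > 0
    return count
```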
Issue Analytics
- Created 5 years ago
- Comments: 33 (18 by maintainers)
Top GitHub Comments
@ogrisel do you still encounter this bug?
An adaptive cluster is a tricky set-up. There might still be some edge cases where it fails to close things properly. But I think you’d better ask this question on the distributed issue tracker, though not before having some code that reproduces the problem 😉. I know, this is sometimes almost impossible…
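For reference, a reproducer along the lines the maintainer asks for might look like the sketch below. It is not known to trigger the bug reliably (the report above says the crash is non-deterministic); it simply exercises the adaptive scale-up/scale-down path repeatedly, which is where the traceback points. Queue name, resources, and timings are placeholders:

```python
import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def work(x):
    time.sleep(1)
    return x * x


# Placeholder cluster configuration.
cluster = SLURMCluster(queue="normal", cores=2, memory="8GB")
cluster.adapt(minimum=0, maximum=8)
client = Client(cluster)

# Alternate bursts of work with idle periods so the adaptive logic
# repeatedly scales workers up and then retires them again, which is
# the code path where the AssertionError was observed.
for _ in range(20):
    futures = client.map(work, range(100))
    client.gather(futures)
    time.sleep(120)  # long enough for adapt() to retire idle workers
```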