KeyError and Worker already exists
I'm trying to set up Dask with TPOT.
My code looks like this:
from dask_jobqueue import LSFCluster
cluster = LSFCluster(cores=1, memory='3GB', job_extra=['-R rusage[mem=2048,scratch=8000]'],
                     local_directory='$TMPDIR',
                     walltime='12:00')
from dask.distributed import Client
client = Client(cluster)
cluster.scale(10)
from tpot import TPOTRegressor
reg = TPOTRegressor(max_time_mins=30, generations=20, population_size=96,
                    cv=5,
                    scoring='r2',
                    memory='auto', random_state=42, verbosity=10, use_dask=True)
reg.fit(X, y)
and I keep getting those annoying errors:
distributed.scheduler - ERROR - '74905774'
Traceback (most recent call last):
  File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/scheduler.py", line 1306, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/dask_jobqueue/core.py", line 62, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '74905774'
distributed.utils - ERROR - Worker already exists tcp://10.205.103.50:35780
Traceback (most recent call last):
  File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/utils.py", line 648, in log_errors
    yield
  File "/cluster/home/abrahalo/.local/lib64/python3.6/site-packages/distributed/scheduler.py", line 1261, in add_worker
    raise ValueError("Worker already exists %s" % address)
ValueError: Worker already exists tcp://10.205.103.50:35780
I think there might be a problem with LSFCluster: it moves a lot of workers into cluster.finished_jobs even though the corresponding jobs are still running according to bjobs, and even according to the dask.distributed web interface.
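To see which bucket dask-jobqueue puts each job in, here is a minimal sketch for comparing the cluster's bookkeeping with what LSF reports; it assumes the dask-jobqueue version from the traceback above, where pending_jobs / running_jobs / finished_jobs are dicts keyed by the LSF job id:

import subprocess

# Compare dask-jobqueue's bookkeeping with what LSF itself reports.
# pending_jobs / running_jobs / finished_jobs are assumed to be dicts keyed by
# job id in this dask-jobqueue version (the traceback above pops from
# pending_jobs); newer releases may not expose them.
print("pending: ", sorted(cluster.pending_jobs))
print("running: ", sorted(cluster.running_jobs))
print("finished:", sorted(cluster.finished_jobs))

# Cross-check against the batch scheduler (stdout=PIPE keeps this Python 3.6 compatible).
print(subprocess.run(["bjobs"], stdout=subprocess.PIPE, universal_newlines=True).stdout)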
Top GitHub Comments
A pleasure to help!
You should try asking about this upstream in distributed; I imagine some thought has gone into this behavior.
I think that starting the dask-worker with the --memory-limit option will do the trick (see the sketch below). ulimit doesn't work at all on macOS and doesn't effectively limit memory on Linux.
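A minimal sketch of what that could look like with the cluster from this issue, assuming this dask-jobqueue version still forwards an extra list of arguments to the dask-worker command line (newer releases name it worker_extra_args):

from dask_jobqueue import LSFCluster

# Forward --memory-limit to each dask-worker so the worker process enforces the
# memory cap itself instead of relying on ulimit. The `extra` keyword is an
# assumption about this dask-jobqueue version; newer releases call it
# `worker_extra_args`.
cluster = LSFCluster(cores=1, memory='3GB',
                     job_extra=['-R rusage[mem=2048,scratch=8000]'],
                     local_directory='$TMPDIR',
                     walltime='12:00',
                     extra=['--memory-limit', '2GB'])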