
Resource specification on GridEngine Clusters

See original GitHub issue

Hi Dask team!

My colleague @sntgluca and I have been very enthusiastic about the possibilities enabled by dask-jobqueue at NIBR! It’s been a very productivity-enhancing tool for me. At the same time, we found something that we think might be a bug, but would like to disprove/confirm this before potentially working on a PR to fix it.

We found that when using the memory keyword argument, the SGECluster worker logs report that X GB is allocated per worker node. However, according to the queueing status screen, the amount of RAM actually granted is only the default amount specified by the sysadmins.

Here is some evidence that I collected from the logs on our machines.

First, Dask’s worker logs show that 8 GB is allocated to them:

# Ignore this line, it is here to show the job ID.
/path/to/job_scripts/16524465: line 10: /path/to/activate: Permission denied
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.00 GB

However, for the same job ID, when I used qstat -j JOBID:

==============================================================
job_number:                 16524465
...
hard resource_list:         m_mem_free=4G,h_rt=259200,slot_limitA=1
...
granted_req.          1:    m_mem_free=4.000G, slot_limitA=1

As you can see, the scheduler granted only 4 GB of memory, not the 8 GB requested, even though the Dask worker logs report 8 GB.
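For illustration, the mismatch can be worked around today by pinning the scheduler-side request explicitly; a minimal sketch, using the resource_spec keyword that SGECluster exposes (the m_mem_free resource name mirrors the qstat output above and is site-specific):

from dask_jobqueue import SGECluster

# Minimal sketch: pass the SGE request explicitly so that the scheduler-side
# limit matches what Dask believes each worker has.  The resource name
# (m_mem_free) mirrors the qstat output above and is site-specific.
cluster = SGECluster(
    cores=1,
    memory="8GB",                    # what the Dask worker will report
    resource_spec="m_mem_free=8G",   # what ends up on the qsub -l line
)
cluster.scale(2)                     # ask for two workers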

I have a hunch that this is a bug, but both @sntgluca and I feel that our end users shouldn’t have to worry about GridEngine resource spec strings; they should be able to use the very nice SGECluster API to set these parameters correctly. Looking at the SGECluster source code, parsing the memory (and other kwargs) into the correct resource specification string looks doable with a small-ish PR, if this is something you would be open to.

Please let us know!
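To make the proposal concrete, here is a rough sketch of the sort of translation we mean; the m_mem_free resource name is taken from the qstat output above and is an assumption about the target cluster, so it would likely need to stay configurable:

from dask.utils import parse_bytes

def sge_memory_spec(memory):
    # Rough sketch: turn the `memory` kwarg (e.g. "8GB") into an SGE -l
    # request.  "m_mem_free" is assumed here; other sites use h_vmem or
    # similar, so the resource name should probably be configurable.
    n_bytes = parse_bytes(memory)                 # "8GB" -> 8000000000
    return "m_mem_free=%dM" % (n_bytes // 2**20)  # request in MiB

print(sge_memory_spec("8GB"))                     # -> m_mem_free=7629M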

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

guillaumeeb commented, Nov 15, 2018 (1 reaction)

Indeed, it looks like if the user doesn’t specify resource_spec via kwargs or config files, nothing is set in the SGECluster implementation.

Something similar to what you propose is done in PBSCluster: https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/pbs.py#L84-L89
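Roughly, those lines derive a default resource_spec from the cores and memory kwargs when none is given; a paraphrase of that idea (not a verbatim copy, and the exact code may have moved since):

def default_pbs_resource_spec(cores, memory):
    # Paraphrase of the approach behind the linked PBS code: if the user
    # did not pass resource_spec, build one from the cores/memory kwargs.
    return "select=1:ncpus=%d:mem=%s" % (cores, memory)

print(default_pbs_resource_spec(1, "8GB"))  # -> select=1:ncpus=1:mem=8GB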

Would you be interested in submitting a PR that does the same for SGE? It would be very welcome!

ericmjl commented, May 13, 2019 (0 reactions)

Yes, 100%!
