Resource specification on GridEngine Clusters
Hi Dask team!
My colleague @sntgluca and I have been very enthusiastic about the possibilities enabled by dask-jobqueue at NIBR! It’s been a very productivity-enhancing tool for me. At the same time, we found something that we think might be a bug, but would like to confirm or disprove this before potentially working on a PR to fix it.
Firstly, we found that when using the memory keyword argument, the SGECluster will show that X GB per worker node is allocated. However, the amount of RAM that is actually allocated, according to the queueing status screen, is only the default amount specified by the sysadmins.
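For illustration, here is a sketch of the interim workaround we believe works: passing the scheduler request explicitly via the resource_spec kwarg (or the equivalent jobqueue config key), separately from the memory kwarg. Note that m_mem_free is the complex name on our site's SGE configuration; other sites may use a different key.

```yaml
# ~/.config/dask/jobqueue.yaml -- sketch, not a verified config.
# resource-spec mirrors the SGECluster(resource_spec=...) kwarg;
# m_mem_free is site-specific (assumption based on our cluster).
jobqueue:
  sge:
    cores: 1
    memory: 8GB
    resource-spec: m_mem_free=8G
```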
Here is some evidence that I collected from the logs on our machines.
First, Dask’s worker logs show that 8 GB is allocated to them:
# Ignore this line, it is here to show the job ID.
/path/to/job_scripts/16524465: line 10: /path/to/activate: Permission denied
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.00 GB
However, for the same job ID, when I ran qstat -j JOBID:
==============================================================
job_number: 16524465
...
hard resource_list: m_mem_free=4G,h_rt=259200,slot_limitA=1
...
granted_req. 1: m_mem_free=4.000G, slot_limitA=1
As you can see, the resources granted were only 4GB of memory, not the 8GB requested, but the Dask worker logs show 8GB being allocated.
I have a hunch that this is a bug. Both @sntgluca and I think that our end users shouldn’t have to worry about GridEngine resource spec strings, and should be able to use the very nice SGECluster API to set these parameters correctly. Looking at the SGECluster source code, it looks doable with a small-ish PR to parse the memory kwarg (and others) into the correct resource specification string, if this is something that you would be open to.
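To make the idea concrete, here is a minimal sketch of such a parser. Both the function and the m_mem_free default are assumptions for illustration (the actual complex name is site-specific), not the proposed implementation itself.

```python
def memory_to_sge_spec(memory: str, key: str = "m_mem_free") -> str:
    """Translate a Dask-style memory kwarg (e.g. "8GB") into an SGE
    resource request (e.g. "m_mem_free=8G").

    Hypothetical helper sketching the proposed PR; 'm_mem_free' is the
    complex name on our cluster and will differ between sites.
    """
    units = {"TB": "T", "GB": "G", "MB": "M", "KB": "K"}
    mem = memory.strip().upper().replace(" ", "")
    for dask_unit, sge_unit in units.items():
        if mem.endswith(dask_unit):
            return f"{key}={mem[:-len(dask_unit)]}{sge_unit}"
    raise ValueError(f"Unrecognized memory string: {memory!r}")
```

For example, memory_to_sge_spec("8GB") would yield the spec string "m_mem_free=8G" to hand to the scheduler.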
Please let us know!
Issue Analytics
- State:
- Created 5 years ago
- Comments: 9 (9 by maintainers)
Top GitHub Comments
Indeed, it looks like if users don’t specify resource_spec via kwargs or config files, nothing is set in the SGECluster implementation. Something similar to what you propose is done in PBSCluster: https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/pbs.py#L84-L89
Would you be interested in submitting a PR that does the same for SGE? It would be very welcomed!
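Adapted to SGE, that pattern could look roughly like the sketch below. The function name, the m_mem_free key, and the rounding choice are all assumptions; the linked PBS code builds a "select=...:mem=..." string from the worker's cores and memory, and an SGE version would do the analogous thing with an -l resource list.

```python
import math

def default_sge_resource_spec(worker_memory_bytes: int,
                              key: str = "m_mem_free") -> str:
    """Build a default SGE -l resource string from a worker memory budget
    when the user supplies no resource_spec.

    Hypothetical sketch mirroring the PBSCluster fallback; 'm_mem_free'
    is site-specific (other sites use h_vmem or similar).
    """
    gib = worker_memory_bytes / 2**30
    # Round up so the scheduler request is never smaller than what the
    # Dask worker believes it has been given.
    return f"{key}={math.ceil(gib)}G"
```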
Yes, 100%!