
Resource specification on GridEngine Clusters

See original GitHub issue

Hi Dask team!

My colleague @sntgluca and I have been very enthusiastic about the possibilities enabled by dask-jobqueue at NIBR! It’s been a very productivity-enhancing tool for me. At the same time, we found something that we think might be a bug, but would like to disprove/confirm this before potentially working on a PR to fix it.

We found that when using the memory keyword argument, the SGECluster worker logs report that X GB is allocated per worker node. However, according to the queueing status screen, the amount of RAM actually granted is only the default amount specified by the sysadmins.

Here is some evidence that I collected from the logs on our machines.

First, Dask’s worker logs show that 8 GB is allocated to them:

# Ignore this line, it is here to show the job ID.
/path/to/job_scripts/16524465: line 10: /path/to/activate: Permission denied
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    8.00 GB

However, for the same job ID, when I used qstat -j JOBID:

==============================================================
job_number:                 16524465
...
hard resource_list:         m_mem_free=4G,h_rt=259200,slot_limitA=1
...
granted_req.          1:    m_mem_free=4.000G, slot_limitA=1

As you can see, the scheduler granted only 4 GB of memory, not the 8 GB requested, even though the Dask worker logs report 8 GB.
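For illustration, the mismatch can be worked around today by pinning the scheduler-side request explicitly; a minimal sketch, using the resource_spec keyword that SGECluster exposes (the m_mem_free resource name mirrors the qstat output above and is site-specific):

from dask_jobqueue import SGECluster

# Minimal sketch: pass the SGE request explicitly so that the scheduler-side
# limit matches what Dask believes each worker has.  The resource name
# (m_mem_free) mirrors the qstat output above and is site-specific.
cluster = SGECluster(
    cores=1,
    memory="8GB",                    # what the Dask worker will report
    resource_spec="m_mem_free=8G",   # what ends up on the qsub -l line
)
cluster.scale(2)                     # ask for two workers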

I have a hunch that this is a bug, but both @sntgluca and I feel that our end users shouldn’t have to worry about GridEngine resource spec strings; they should be able to use the very nice SGECluster API to set these parameters correctly. Looking at the SGECluster source code, parsing the memory (and other kwargs) into the correct resource specification string looks doable with a small-ish PR, if this is something you would be open to.

Please let us know!
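To make the proposal concrete, here is a rough sketch of the sort of translation we mean; the m_mem_free resource name is taken from the qstat output above and is an assumption about the target cluster, so it would likely need to stay configurable:

from dask.utils import parse_bytes

def sge_memory_spec(memory):
    # Rough sketch: turn the `memory` kwarg (e.g. "8GB") into an SGE -l
    # request.  "m_mem_free" is assumed here; other sites use h_vmem or
    # similar, so the resource name should probably be configurable.
    n_bytes = parse_bytes(memory)                 # "8GB" -> 8000000000
    return "m_mem_free=%dM" % (n_bytes // 2**20)  # request in MiB

print(sge_memory_spec("8GB"))                     # -> m_mem_free=7629M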

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

guillaumeeb commented, Nov 15, 2018 (1 reaction)

Indeed, it looks like if the user doesn’t specify resource_spec via kwargs or config files, nothing is set in the SGECluster implementation.

Something similar to what you propose is done in PBSCluster: https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/pbs.py#L84-L89
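Roughly, those lines derive a default resource_spec from the cores and memory kwargs when none is given; a paraphrase of that idea (not a verbatim copy, and the exact code may have moved since):

def default_pbs_resource_spec(cores, memory):
    # Paraphrase of the approach behind the linked PBS code: if the user
    # did not pass resource_spec, build one from the cores/memory kwargs.
    return "select=1:ncpus=%d:mem=%s" % (cores, memory)

print(default_pbs_resource_spec(1, "8GB"))  # -> select=1:ncpus=1:mem=8GB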

Would you be interested in submitting a PR that does the same for SGE? It would be very welcome!

ericmjl commented, May 13, 2019 (0 reactions)

Yes, 100%!
