[ray] ray on slurm not respecting memory limits
What is the problem?
When running a Ray script under Slurm (single node), Ray does not appear to respect the memory limits specified in ray.init. As I understand it, the script below should fail with a memory-limit error from Ray, but instead it is the Slurm job that fails. I request 40 GB from the cluster and limit Ray to 1 GB for workers and 1 GB for the object store, yet the job consumes the full 40 GB (surprisingly, within about 30 minutes).
Ray version: 0.8.0
Reproduction (REQUIRED)
import ray

@ray.remote
class Store:
    def __init__(self):
        self.storage = list()

    def add(self, item):
        self.storage.append(item)

if __name__ == '__main__':
    ray.init(memory=int(1 * 1e9), object_store_memory=int(1 * 1e9))
    storage = Store.remote()
    while True:
        storage.add.remote(5)
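As a side note, the unbounded while True loop also submits actor calls without ever awaiting them, so the driver accumulates pending ObjectIDs on top of the actor's growing list. One way to bound that part (a sketch only; it does not answer the limit-enforcement question, and MAX_IN_FLIGHT is an arbitrary illustrative cap, not a Ray parameter) is to apply backpressure with ray.wait:

# Sketch: same reproduction, but capping the number of in-flight actor calls
# so the pending ObjectIDs held by the driver do not grow without bound.
import ray

@ray.remote
class Store:
    def __init__(self):
        self.storage = list()

    def add(self, item):
        self.storage.append(item)

if __name__ == '__main__':
    ray.init(memory=int(1 * 1e9), object_store_memory=int(1 * 1e9))
    storage = Store.remote()
    MAX_IN_FLIGHT = 1000  # illustrative cap, not a Ray setting
    in_flight = []
    while True:
        in_flight.append(storage.add.remote(5))
        if len(in_flight) >= MAX_IN_FLIGHT:
            # Block until half of the submitted calls finish before submitting more.
            ready, in_flight = ray.wait(in_flight, num_returns=len(in_flight) // 2)

Note that the actor's own list still grows; this only keeps the driver side bounded.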
The Slurm job script:
#!/bin/bash
#SBATCH -J jobname
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task 8
#SBATCH --mem=40GB
#SBATCH --mem-per-cpu=1GB
#SBATCH --time=120:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1
module load gcc-7.1.0
module load cuda/10.0
module load cudnn/7.6.3/cuda-10.0
srun $@
Run command:
sbatch slurm_run_ray.sh python ray_slurm_test.py
Running this command produces the following Slurm error output:
2020-01-30 00:59:03,618 INFO resource_spec.py:216 -- Starting Ray with 0.93 GiB memory available for workers and up to 0.93 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
slurmstepd: error: Step 7580752.0 exceeded memory limit (42349772 > 41943040), being killed
slurmstepd: error: Exceeded job memory limit
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 7580752.0 ON falcon1 CANCELLED AT 2020-01-30T01:33:31 ***
srun: error: falcon1: task 0: Killed
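To narrow down which process is actually growing (the driver, the actor worker, the raylet, or the plasma store), one option is to log per-process RSS on the node while the reproduction runs. A minimal sketch, assuming psutil is installed (it is not required by the reproduction script itself):

# Sketch: print the resident set size (RSS) of local Ray-related processes
# every few seconds, to attribute the memory growth to a specific process.
import time
import psutil

def log_ray_rss():
    for proc in psutil.process_iter(['pid', 'name', 'memory_info']):
        name = proc.info['name'] or ''
        mem = proc.info['memory_info']
        if mem is None:
            continue
        if any(key in name for key in ('ray', 'raylet', 'plasma', 'python')):
            print('{:>8} {:<24} {:8.1f} MB'.format(proc.info['pid'], name, mem.rss / 1e6))

if __name__ == '__main__':
    while True:
        log_ray_rss()
        print('-' * 48)
        time.sleep(10)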
Top GitHub Comments
Those changes helped reduce memory consumption, but I still see the same issue: Ray keeps consuming memory until it reaches the cluster limit. The closest example I found to my situation is the following:
Ray does not seem to evict any entries and memory keeps growing; I had to stop at 1.5 GB. It seems the actor's object_store_memory limit is not used for eviction. Does Ray use the overall machine memory to decide when to start evicting, or the object_store_memory limit passed in the ray.init call? What would be the expected behavior here? Is there a way to output the number of evictions/objects in the store?
yes, I think so - see https://ray.readthedocs.io/en/latest/memory-management.html
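For context on the limits themselves: in this release line, the memory and object_store_memory values are used for scheduling/admission rather than enforced as hard per-process caps, so an actor whose Python list keeps growing lives in the worker's heap and can exceed them. A sketch, assuming this Ray version already supports the memory option on @ray.remote (memory-aware scheduling was landing around the 0.8 line):

# Sketch (assumes the `memory` option on @ray.remote is available in this version):
# the value is a logical resource request used for scheduling decisions,
# not an enforced cap, so self.storage can still grow past it.
import ray

@ray.remote(memory=500 * 1024 * 1024)  # request 500 MiB of the memory resource
class Store:
    def __init__(self):
        self.storage = list()

    def add(self, item):
        self.storage.append(item)

if __name__ == '__main__':
    ray.init(memory=int(1 * 1e9), object_store_memory=int(1 * 1e9))
    storage = Store.remote()

In this setup the only hard limit is the --mem requested from Slurm, which would explain why the kill comes from slurmstepd rather than from Ray.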