
[ray] ray on slurm not respecting memory limits


What is the problem?

When running a Ray script on Slurm (single node), Ray does not seem to respect the memory limits specified in ray.init. As I understand it, the script below should fail with a memory-limit error from Ray, but instead it is the cluster job that fails. I reserve 40GB from the cluster and limit Ray to 1GB for workers and 1GB for the object store, yet the job consumes the full 40GB (surprisingly, within about 30 minutes).

Ray version: 0.8.0

Reproduction (REQUIRED)

import ray

@ray.remote
class Store:
    def __init__(self):
        self.storage = list()

    def add(self, item):
        self.storage.append(item)

if __name__ == '__main__':
    # Limit Ray to 1GB of worker memory and 1GB of object store memory
    ray.init(memory=int(1 * 1e9), object_store_memory=int(1 * 1e9))
    storage = Store.remote()
    # Submit add() calls as fast as possible, never consuming the returned refs
    while True:
        storage.add.remote(5)

and the Slurm job script:

#!/bin/bash

#SBATCH -J jobname
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task 8
#SBATCH --mem=40GB
#SBATCH --mem-per-cpu=1GB
#SBATCH --time=120:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1

module load gcc-7.1.0
module load cuda/10.0
module load cudnn/7.6.3/cuda-10.0
srun $@

Run command:

sbatch slurm_run_ray.sh python ray_slurm_test.py

Running the previous script produces the following Slurm error:

2020-01-30 00:59:03,618	INFO resource_spec.py:216 -- Starting Ray with 0.93 GiB memory available for workers and up to 0.93 GiB for objects. You can adjust these settings with  ray.init(memory=<bytes>, object_store_memory=<bytes>).

slurmstepd: error: Step 7580752.0 exceeded memory limit (42349772 > 41943040), being killed
slurmstepd: error: Exceeded job memory limit
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 7580752.0 ON falcon1 CANCELLED AT 2020-01-30T01:33:31 ***
srun: error: falcon1: task 0: Killed
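
One detail worth flagging in the reproduction, as an observation rather than an official explanation: the loop calls storage.add.remote(5) as fast as Python can submit, never consuming the returned object refs, and the Store actor's list grows without bound. The memory argument to ray.init is primarily a scheduling resource and object_store_memory sizes the shared object store; neither acts, by itself, as a hard cap on what a runaway submission loop or an actor's Python heap can allocate. A minimal, hypothetical way to add backpressure (the pending list and the batch size of 1000 are illustrative, not from the issue) looks like this:

import ray

@ray.remote
class Store:
    def __init__(self):
        self.storage = list()

    def add(self, item):
        self.storage.append(item)

if __name__ == '__main__':
    ray.init(memory=int(1 * 1e9), object_store_memory=int(1 * 1e9))
    storage = Store.remote()
    pending = []
    while True:
        pending.append(storage.add.remote(5))
        if len(pending) >= 1000:
            # Block until the submitted calls finish so unconsumed object
            # refs and queued actor calls cannot pile up in the driver.
            ray.get(pending)
            pending = []

This bounds the number of in-flight calls; it does not change the fact that the Store actor itself keeps appending to an ever-growing list.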

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
ocorcoll commented, Feb 13, 2020

Those changes helped improve memory consumption, but I still see the same issue: Ray keeps consuming memory until it reaches the cluster limit. The closest example I found to my situation is the following:

import ray
import time
import numpy as np


@ray.remote
class Store:

    def __init__(self):
        self.current_val = 0

    def set_val(self, val):
        self.current_val = val

    def get_val(self):
        return self.current_val


@ray.remote
class Worker:

    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            val = np.zeros((100, 100))
            self.store.set_val.remote(val)
            time.sleep(.1)


@ray.remote
class Reader:

    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            ray.get(self.store.get_val.remote())
            time.sleep(.1)


if __name__ == '__main__':
    # Global limits: 0.8GB for workers, 1.2GB for the object store
    ray.init(memory=int(.8 * 1e9), num_cpus=2, object_store_memory=int(1.2 * 1e9), driver_object_store_memory=int(.2 * 1e9))
    # Per-actor limits: 0.2GB of worker memory and 0.2GB of object store each
    store = Store.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote()
    worker = Worker.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote(store)
    reader = Reader.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote(store)
    ray.wait([worker.run.remote(), reader.run.remote()])

Ray does not seem to evict any entries and memory keeps growing; I had to stop it at 1.5GB. It seems the actor's object_store_memory limit is not used for eviction. Does Ray use the machine's overall memory to decide when to start evicting, or the object_store_memory limit passed in the ray.init call? What is the expected behavior here? Is there a way to output the number of evictions/objects in the store?
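
A sketch of how one might narrow this down, assuming Ray 0.8-era APIs and reusing the Store actor defined in the snippet above (the ThrottledWorker name and the polling loop are illustrative, not from the issue): blocking on each set_val call rules out a backlog of pending actor calls, and printing ray.available_resources() next to ray.cluster_resources() shows how much of the declared memory and object_store_memory resources remain unclaimed by scheduled work. Note that these numbers reflect Ray's scheduling bookkeeping, not the actual heap usage of each process.

import time

import numpy as np
import ray


@ray.remote
class ThrottledWorker:
    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            val = np.zeros((100, 100))
            # Wait for the store to apply the update before submitting the
            # next one, so set_val calls cannot queue up on the Store actor.
            ray.get(self.store.set_val.remote(val))
            time.sleep(.1)


if __name__ == '__main__':
    ray.init(memory=int(.8 * 1e9), num_cpus=2, object_store_memory=int(1.2 * 1e9))
    store = Store.remote()            # Store as defined in the example above
    worker = ThrottledWorker.remote(store)
    worker.run.remote()
    while True:
        # Scheduling-level view of declared vs. remaining resources.
        print(ray.cluster_resources())
        print(ray.available_resources())
        time.sleep(5)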

1 reaction
richardliaw commented, Feb 10, 2020
