
[ray] ray on slurm not respecting memory limits


What is the problem?

When running a Ray script on Slurm (single node), Ray does not seem to respect the memory limits specified in ray.init. As I understand it, the script below should fail with a memory-limit error from Ray, but instead it is the cluster job that fails. I reserve 40GB from the cluster and limit Ray to 1GB for workers and 1GB for the object store, yet the job consumes the full 40GB (surprisingly, within about 30 minutes).

Ray version: 0.8.0

Reproduction (REQUIRED)

import ray

@ray.remote
class Store:
    def __init__(self):
        self.storage = list()

    def add(self, item):
        self.storage.append(item)

if __name__ == '__main__':
    # Limit Ray to 1GB of worker memory and 1GB of object store memory
    ray.init(memory=int(1 * 1e9), object_store_memory=int(1 * 1e9))
    storage = Store.remote()
    # Submit add() calls as fast as possible, never consuming the returned refs
    while True:
        storage.add.remote(5)

and the Slurm job script:

#!/bin/bash

#SBATCH -J jobname
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task 8
#SBATCH --mem=40GB
#SBATCH --mem-per-cpu=1GB
#SBATCH --time=120:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1

module load gcc-7.1.0
module load cuda/10.0
module load cudnn/7.6.3/cuda-10.0
srun $@

Run command:

sbatch slurm_run_ray.sh python ray_slurm_test.py

Running the previous script produces the following Slurm error:

2020-01-30 00:59:03,618	INFO resource_spec.py:216 -- Starting Ray with 0.93 GiB memory available for workers and up to 0.93 GiB for objects. You can adjust these settings with  ray.init(memory=<bytes>, object_store_memory=<bytes>).

slurmstepd: error: Step 7580752.0 exceeded memory limit (42349772 > 41943040), being killed
slurmstepd: error: Exceeded job memory limit
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 7580752.0 ON falcon1 CANCELLED AT 2020-01-30T01:33:31 ***
srun: error: falcon1: task 0: Killed
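
One detail worth flagging in the reproduction, as an observation rather than an official explanation: the loop calls storage.add.remote(5) as fast as Python can submit, never consuming the returned object refs, and the Store actor's list grows without bound. The memory argument to ray.init is primarily a scheduling resource and object_store_memory sizes the shared object store; neither acts, by itself, as a hard cap on what a runaway submission loop or an actor's Python heap can allocate. A minimal, hypothetical way to add backpressure (the pending list and the batch size of 1000 are illustrative, not from the issue) looks like this:

import ray

@ray.remote
class Store:
    def __init__(self):
        self.storage = list()

    def add(self, item):
        self.storage.append(item)

if __name__ == '__main__':
    ray.init(memory=int(1 * 1e9), object_store_memory=int(1 * 1e9))
    storage = Store.remote()
    pending = []
    while True:
        pending.append(storage.add.remote(5))
        if len(pending) >= 1000:
            # Block until the submitted calls finish so unconsumed object
            # refs and queued actor calls cannot pile up in the driver.
            ray.get(pending)
            pending = []

This bounds the number of in-flight calls; it does not change the fact that the Store actor itself keeps appending to an ever-growing list.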

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
ocorcoll commented, Feb 13, 2020

Those changes helped improve memory consumption, but I still see the same issue: Ray keeps consuming memory until it reaches the cluster limit. The closest example I found to my situation is the following:

import ray
import time
import numpy as np


@ray.remote
class Store:

    def __init__(self):
        self.current_val = 0

    def set_val(self, val):
        self.current_val = val

    def get_val(self):
        return self.current_val


@ray.remote
class Worker:

    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            val = np.zeros((100, 100))
            self.store.set_val.remote(val)
            time.sleep(.1)


@ray.remote
class Reader:

    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            ray.get(self.store.get_val.remote())
            time.sleep(.1)


if __name__ == '__main__':
    # Global limits: 0.8GB for workers, 1.2GB for the object store
    ray.init(memory=int(.8 * 1e9), num_cpus=2, object_store_memory=int(1.2 * 1e9), driver_object_store_memory=int(.2 * 1e9))
    # Per-actor limits: 0.2GB of worker memory and 0.2GB of object store each
    store = Store.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote()
    worker = Worker.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote(store)
    reader = Reader.options(num_cpus=0.5, memory=int(.2 * 1e9), object_store_memory=int(.2 * 1e9)).remote(store)
    ray.wait([worker.run.remote(), reader.run.remote()])

Ray does not seem to evict any entries and memory keeps growing; I had to stop it at 1.5GB. It seems the actor's object_store_memory limit is not used for eviction. Does Ray use the machine's overall memory to decide when to start evicting, or the object_store_memory limit passed in the ray.init call? What is the expected behavior here? Is there a way to output the number of evictions/objects in the store?
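
A sketch of how one might narrow this down, assuming Ray 0.8-era APIs and reusing the Store actor defined in the snippet above (the ThrottledWorker name and the polling loop are illustrative, not from the issue): blocking on each set_val call rules out a backlog of pending actor calls, and printing ray.available_resources() next to ray.cluster_resources() shows how much of the declared memory and object_store_memory resources remain unclaimed by scheduled work. Note that these numbers reflect Ray's scheduling bookkeeping, not the actual heap usage of each process.

import time

import numpy as np
import ray


@ray.remote
class ThrottledWorker:
    def __init__(self, store):
        self.store = store

    def run(self):
        while True:
            val = np.zeros((100, 100))
            # Wait for the store to apply the update before submitting the
            # next one, so set_val calls cannot queue up on the Store actor.
            ray.get(self.store.set_val.remote(val))
            time.sleep(.1)


if __name__ == '__main__':
    ray.init(memory=int(.8 * 1e9), num_cpus=2, object_store_memory=int(1.2 * 1e9))
    store = Store.remote()            # Store as defined in the example above
    worker = ThrottledWorker.remote(store)
    worker.run.remote()
    while True:
        # Scheduling-level view of declared vs. remaining resources.
        print(ray.cluster_resources())
        print(ray.available_resources())
        time.sleep(5)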

1 reaction
richardliaw commented, Feb 10, 2020
