Memory-aware task scheduling to avoid OOMs under memory pressure
Overview
Currently, the Ray scheduler only schedules based on CPUs by default for tasks (e.g., num_cpus=1). The user can also request memory (e.g., memory=1e9); however, in most applications it is quite difficult to predict the heap memory usage of a task. In practice, this means that Ray users often see OOMs due to memory over-subscription, and resort to hacks like increasing the number of CPUs allocated to tasks.
Ideally, Ray would manage this automatically: when tasks consume too much heap memory, the scheduler should push back on scheduling new tasks and preempt eligible tasks to reduce memory pressure.
Proposed design
Allow Ray to preempt and kill tasks that are using too much heap memory. We can do this by scanning the memory usage of tasks (e.g., every 100ms) and preempting tasks if we are nearing a memory limit threshold (e.g., 80%). Furthermore, the scheduler can stop scheduling new tasks when we near the threshold.
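A rough sketch of what one iteration of such a raylet-side check could look like; the task fields (`preemptible`, `rss_bytes`) and the `preempt()` helper are hypothetical stand-ins, not existing Ray internals:

```python
import psutil  # for node-wide memory statistics

MEMORY_THRESHOLD = 0.8  # push back / preempt once usage nears 80%
POLL_INTERVAL_S = 0.1   # scan memory usage every 100ms

def memory_pressure_step(running_tasks):
    """One polling iteration of a hypothetical per-node memory monitor."""
    used_fraction = psutil.virtual_memory().percent / 100.0
    if used_fraction < MEMORY_THRESHOLD:
        return "admit_new_tasks"
    # Near the limit: stop admitting new tasks, and preempt the most
    # memory-hungry task among those that are safe to kill and retry.
    candidates = [t for t in running_tasks if t.preemptible]
    if candidates:
        victim = max(candidates, key=lambda t: t.rss_bytes)
        victim.preempt()  # kill the worker process and requeue the task
    return "pause_scheduling"
```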
Compatibility: Preempting certain kinds of tasks can be unexpected and breaks backwards compatibility. This can be an "opt-in" feature initially for tasks, e.g., `@ray.remote(memory="auto")`, in order to preserve backwards compatibility. Libraries like multiprocessing and Datasets can enable this by default for their map tasks. In the future, we can try to enable it by default for tasks that are safe to preempt (e.g., those that are not launching child tasks and have retries enabled).
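For example, the opt-in could look like the following; `memory="auto"` is the knob proposed above (not an existing option), while `max_retries` is Ray's existing way to make a task safe to re-run after it is killed:

```python
import ray

# Opt this task into memory-aware scheduling/preemption (proposed API).
@ray.remote(memory="auto", max_retries=3)
def transform(batch):
    # ...potentially memory-hungry work...
    return batch
```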
Acceptance criteria: As a user, I can run Ray tasks that use large amounts of memory without needing to tune/tweak Ray resource settings to avoid OOM crashes.
This would be a killer feature, especially for dask-on-ray 👍
Some thoughts:
Preemption Priority Level
One might also want to have a preemption priority level.
Maybe we could have 5 (or 3, or 4) levels?
We can also denote the levels by labels:
"NEVER", "LOW", "MED", "HIGH", "ALWAYS"
Also, I prefer this style of API - it very intuitively expresses the notion of "preemptibility".
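As a hypothetical sketch of that style (the `preemptible` argument does not exist in Ray today):

```python
import ray

# Per-task preemption priority, from "NEVER" (never preempt) to
# "ALWAYS" (first to be preempted under memory pressure).
@ray.remote(memory="auto", preemptible="LOW")
def train_step(batch):
    ...

@ray.remote(memory="auto", preemptible="ALWAYS")
def cheap_to_retry(batch):
    ...
```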
Alternatives considered (not recommended):
Alternatively, one can reverse the levels and call the setting "priority", which stands for "scheduling priority".
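For comparison, the reversed-levels alternative could look like this (again hypothetical):

```python
import ray

# "priority" = scheduling priority: higher-priority tasks are preempted last.
@ray.remote(memory="auto", priority="HIGH")
def important_task(batch):
    ...
```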
Additional Considerations For Short-Lived Bursty Tasks, Polling Intervals, and Memory Bandwidth
I think we need a combination of preemption (for long-running tasks/actors) and scheduling backpressure (for short-lived tasks, which can burst memory/CPU usage, etc.).
Profiling short-lived tasks, as described below, could be difficult though, especially for very fine-grained tasks.
Given the polling interval granularity, the thing we would want to avoid is ping-ponging of resource usage: a raylet schedules too many tasks, then cuts back scheduling new tasks due to preemption/backpressure, which results in low resource usage in the next interval, which in turn ping-pongs back to scheduling too much.
Since Ray's task granularity is sub-millisecond, a 100ms interval, while reasonable-sounding, might not work for certain workloads. How realistic these scenarios are should be investigated.
So the parameters to consider are how much to preempt, and whether to use profiling history to gain an "average view" of node resource usage. That could handle short-lived, bursty tasks, or allow applying backpressure based on time-windowed peak/p95 usage instead of point-in-time resource usage, as sketched below.
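A minimal sketch of backpressure driven by a time-windowed p95 of node memory usage rather than a point-in-time sample; the window length and threshold are illustrative choices:

```python
import collections
import statistics

class WindowedMemoryBackpressure:
    """Track recent memory-usage samples and pause scheduling on windowed p95."""

    def __init__(self, window=50, threshold=0.8):
        # e.g. 50 samples at a 100ms polling interval = a 5s window
        self.samples = collections.deque(maxlen=window)
        self.threshold = threshold

    def record(self, used_fraction):
        self.samples.append(used_fraction)

    def should_pause_scheduling(self):
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        p95 = statistics.quantiles(self.samples, n=20)[-1]  # ~95th percentile
        return p95 >= self.threshold
```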
Making this Consideration more Concrete
For some context, consider the fact that a Threadripper 3990X has roughly 100 GiB/s of memory bandwidth, which can grow/shrink memory usage by roughly 10 GiB per 100ms. Datacenter chipsets may have even higher memory bandwidth. This suggests that something on the order of 100ms is a reasonable interval at which to poll resource usage.
To summarize, the relevant metric to be aware of is:
MEMORY_MAX_DELTA_FRACTION = MEMORY_BANDWIDTH_S * INTERVAL_S / TOTAL_MEMORY
For instance, a machine with 128 GB of main memory, 128 GB/s of memory bandwidth, and a 100ms polling interval has a 10% max delta per interval. In this case, setting an 80% memory-usage scheduling-backpressure threshold naively seems quite appropriate. Note that asking the OS to allocate new memory is typically slower than asking the CPU to touch the same memory (for reference: https://lemire.me/blog/2020/01/17/allocating-large-blocks-of-memory-bare-metal-c-speeds/). However, I don't know if there are pathways which are more rapid, for instance directly mmap-ing a large number of pages into the process memory (although, from my vague understanding, that is actually a costly process, and in addition the kernel touches all of the memory anyway by zero-memsetting it).
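The arithmetic from that example, spelled out (the numbers are the illustrative ones used above):

```python
TOTAL_MEMORY_GB = 128        # main memory
MEMORY_BANDWIDTH_GB_S = 128  # sustained memory bandwidth
INTERVAL_S = 0.1             # 100ms polling interval

memory_max_delta_fraction = MEMORY_BANDWIDTH_GB_S * INTERVAL_S / TOTAL_MEMORY_GB
print(f"max memory delta per interval: {memory_max_delta_fraction:.0%}")  # -> 10%
```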
Session-Based Profiling Instead of Preempting
Raylets can accumulate/aggregate locally-collected statistics over intervals, and periodically sync with GCS to prevent inundating the GCS with network load.
Then the scheduling can be chosen to be either conservative (pack by peak usage) or aggressive (pack by average usage), using those statistics to figure out reasonable thresholds for the combined memory usage per task type (or "task_group" - a label for specific invocations, or groups of invocations, of that task). That way, preemption is only needed in statistical-outlier scenarios, or for coarse-grained tasks which do not obey the law of large numbers.
Conclusion: profiling-based placement is especially useful for actors and long-running tasks with highly-variable memory usage.
Rationale: short-lived tasks are better dealt with via memory backpressure. The same could be true of tasks and actors with stable memory usage, but knowing the peak memory usage (e.g., to account for startup spikes) also helps with scheduling without accidentally causing OOMs or preemption, as sketched below.
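A sketch of the two packing policies under these assumptions; the per-task-group profile record and the free-memory figure are hypothetical inputs that the raylets/GCS would supply:

```python
def fits_on_node(node_free_bytes, task_profile, strategy="peak"):
    """Decide whether a task fits on a node based on its profiled memory usage.

    strategy: "peak" (conservative), "p95", or "avg" (aggressive).
    task_profile: hypothetical per-task-group statistics, e.g.
        {"peak_bytes": ..., "p95_bytes": ..., "avg_bytes": ...}.
    """
    estimate = {
        "peak": task_profile["peak_bytes"],
        "p95": task_profile["p95_bytes"],
        "avg": task_profile["avg_bytes"],
    }[strategy]
    return estimate <= node_free_bytes
```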
APIs for Specifying Profile-Guided Behaviour
Configuration happens at the task/actor level.
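For instance, something along these lines; the parameter names below are purely hypothetical, not an existing Ray API:

```python
import ray

# Hypothetical per-task / per-actor profile-guided settings.
@ray.remote(
    memory="auto",                     # let the scheduler estimate from profiles
    memory_scheduling_strategy="p95",  # pack by "avg", "p95", or "peak"
    preemptible="LOW",
)
class StatefulWorker:
    def process(self, batch):
        ...
```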
I think this is very similar to cardinality estimation in databases: you should tune your plans to the current workload. We could persist `ray.session_profile` to a file so that a user can start Ray with previous profiling data as the "statistical prior" (sketched below). Like cardinality estimation, an aggressive scheduling strategy can also fail on bursty or correlated workloads; in the case where the user expects correlated workloads, they should choose the "peak" or "p95" scheduling strategies.
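A sketch of how persisting the profile as a statistical prior could be exposed; `ray.session_profile` and the `session_profile` argument to `ray.init` are hypothetical, following the idea above:

```python
import ray

# Hypothetical: dump this session's per-task-group memory statistics...
ray.session_profile.save("memory_profile.json")

# ...and seed a later session with them, so the scheduler starts from a prior
# instead of re-learning task memory usage from scratch.
ray.init(session_profile="memory_profile.json")
```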
Profiling as an alternative to Placement Groups
I think this would be extremely useful for profiling-based placement of actor creation tasks (unless actors spin up processes whose memory lives outside of the worker's heap - but one could do some magic by profiling all of a worker's child processes).
Relative Importance of Type of Profiling
In my mind, memory-profiling-based scheduling is more useful than CPU-profiling-based scheduling, since scheduling the latter poorly merely results in contention on a stateless resource (CPU) and at most a small increase in task latency, whereas memory contention can result in OOMs or spilling to swap, both of which one would want to avoid at all costs.
Likewise, fractional-GPU scheduling/placement and profiling depends more on GPU-memory consumption than on compute utilization.
Related: Preemptible Actors, Serializable Actors and Actor Checkpoint Recovery
Relatedly, consider my idea on `SerializableActor`, which can move actors in response to memory/CPU backpressure instead of relying on placement groups for explicit placement: [to add link]. One can also rely on the notion of a preemptible actor, if there is a safe way to "interrupt" the actor's ability to process new tasks (the actor has status "suspended"); tasks scheduled for that actor will not be scheduled until it once again has status "active".
Here is the flow of events for preempting an actor:
1. New tasks for the actor stop being scheduled and the actor is marked as suspended (status: suspended).
2. The actor's state is checkpointed/serialized so it can be moved off the node that is under memory pressure.
3. The actor is restored and resumes processing tasks (status: active).

An objective in pursuing preemptible actors is to make sure that this sequence of events can happen very rapidly. In staleness-tolerant use-cases, this can be aided by leveraging stale actor checkpoints (described more below).
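A user-level approximation of this flow, using only existing Ray primitives (`ray.get`, `ray.kill`) plus a `get_state` method the actor author provides; this is a sketch, not the proposed built-in mechanism:

```python
import ray

@ray.remote
class Counter:
    def __init__(self, state=None):
        self.state = state or {"count": 0}

    def incr(self):
        self.state["count"] += 1
        return self.state["count"]

    def get_state(self):
        return self.state

def preempt_and_restore(actor_cls, actor):
    """Naive preemption: snapshot state, kill the actor, recreate it."""
    state = ray.get(actor.get_state.remote())  # 1. drain queued tasks, snapshot ("suspended")
    ray.kill(actor)                            # 2. free the memory on this node
    return actor_cls.remote(state)             # 3. recreate and resume ("active")
```

For example, `counter = preempt_and_restore(Counter, counter)` would carry the counter's state over into a fresh actor process that the scheduler can place wherever resources are available.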
Serializable actors and fault-tolerance
Serializable actors could also fit into the fault-tolerance picture. Instead of object lineage, Ray can have a native notion of checkpointed serializable actors: the actor occasionally broadcasts its checkpoint to the object store on nodes with available memory resources.
When a node hosting an actor crashes, Ray can recover the actor by choosing one node on which to restore it from the checkpoint. This uses Ray's inherently distributed object store as an alternative to persistent storage for checkpointing, and could result in much faster actor recovery since one does not need to read from disk.
This might also kill two birds with one stone: if we are OK with relying on slightly stale data, a preempted actor might not have to transport its checkpoint directly at the time of preemption, relying instead on a stale checkpoint already present on a remote node.
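A sketch of the checkpoint-in-the-object-store idea with today's API; here a supervisor owns the checkpoint via `ray.put`, while the automatic broadcast/recovery policy described above remains hypothetical:

```python
import ray

ray.init()

@ray.remote
class Stateful:
    def __init__(self, state=None):
        self.state = state or {}

    def update(self, key, value):
        self.state[key] = value

    def snapshot(self):
        return dict(self.state)

# Supervisor-side checkpointing: the supervisor owns the checkpoint object,
# so it can outlive the actor's process.
actor = Stateful.remote()
ray.get(actor.update.remote("epoch", 3))
checkpoint_ref = ray.put(ray.get(actor.snapshot.remote()))

# On failure, recover the actor from the in-memory checkpoint, without
# reading anything from disk.
recovered = Stateful.remote(ray.get(checkpoint_ref))
```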
Additional Ideas: Ownership for Actors (incomplete thoughts)
Just like we have ownership for objects, one can perhaps reduce the GCS burden for spinning up new actors by letting other actors/tasks own a child actor.
Putting It All Together
Here is the API for specifying an actor's preemption, profiling, and checkpointing behaviour:
Extended Ray Actor API
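A hypothetical sketch pulling the ideas above into one set of actor options; none of these parameters exist in Ray today:

```python
import ray

@ray.remote(
    memory="auto",                     # profile-guided memory estimation
    memory_scheduling_strategy="p95",  # pack by "avg", "p95", or "peak"
    preemptible="LOW",                 # preemption priority level
    checkpoint_to_object_store=True,   # broadcast checkpoints for fast recovery
    checkpoint_interval_s=30,
)
class Worker:
    def process(self, batch):
        ...
```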
No, but I think it is under discussion: https://github.com/ray-project/ray/issues/17596