[Feature][runtime env] URI reference refactor
See original GitHub issue
Search before asking
- I had searched in the issues and found no similar feature requirement.
Description
Currently, we generate a URI list in the Python worker when we validate a runtime env input. The related code is here: https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/validation.py#L394.
But if we want to support cross-language runtime envs, this logic can't be reused. For example, if we want to create a Python runtime env from Java, we would have to rewrite this logic in the Java worker.
So, to address this issue, we should move this URI-generating logic from the Python worker to the agent, and the agent will reply with the URIs to the raylet's worker pool for reference counting. That way, this logic becomes a general step that can be used by workers of all languages. The URI resources are also downloaded by the agent anyway, so I think this makes sense.
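As a rough illustration of what agent-side URI generation could look like, here is a hypothetical sketch. The function name, URI scheme, and hashing choice are all assumptions for illustration; the real logic lives in `validation.py` and handles many more fields and formats.

```python
import hashlib
import json


def generate_uris(runtime_env: dict) -> list:
    """Hypothetical sketch of deriving deterministic URIs for a runtime env.

    The key property is determinism: the same runtime env content always
    maps to the same URI, which is what lets the agent do this once for
    workers of every language instead of each worker reimplementing it.
    """
    uris = []
    for field in ("working_dir", "py_modules", "conda", "pip"):
        if field in runtime_env:
            # Hash the canonicalized field content so equal envs share URIs.
            content = json.dumps(runtime_env[field], sort_keys=True)
            digest = hashlib.sha1(content.encode()).hexdigest()
            uris.append(f"{field}://{digest}")
    return uris
```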
But if we generate URIs in the agent, we introduce a race condition. For example:
Worker A is using runtime env A, which contains URI A. At this point, a new task arrives with runtime env B, and URI A is also included in runtime env B. But the raylet doesn't know the URI list of runtime env B until the agent replies to the CreateRuntimeEnv RPC, so it can't increase the reference counter of URI A immediately. If worker A exits in the meantime, the reference counter of URI A drops to 0 and the resource is deleted.
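A minimal simulation of that race (all names and the counter layout are hypothetical, just to make the timing concrete):

```python
# The raylet's counters only cover URIs it already knows about, so a URI
# shared by a not-yet-replied CreateRuntimeEnv request is invisible to it.
ref_counts = {"URI_A": 1}  # worker A holds runtime env A -> URI A

# A task with runtime env B arrives. Runtime env B also contains URI A,
# but the raylet cannot increment ref_counts["URI_A"] here, because the
# agent has not yet replied with runtime env B's URI list.

# Worker A exits before the agent replies:
ref_counts["URI_A"] -= 1
deleted = [uri for uri, count in ref_counts.items() if count == 0]
# URI A is reclaimed even though runtime env B still needs it.
```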
To avoid this race condition, maybe we should refactor the URI reference counting. Here is a proposal:
- Move the URI reference counting from worker pool to agent. We will maintain two maps in agent:
- Map from runtime env to URIs
- Map from URI to reference counter
- In turn, we add two new maps in the worker pool for runtime env reference counting:
- Map from id to runtime env.
- Map from runtime env to reference counter and runtime env context.
- In this refactor, the raylet doesn't know the concept of a URI at all. All URI logic is handled by the agent itself.
- Another benefit is that we can avoid sending unnecessary CreateRuntimeEnv RPCs when the runtime env has already been downloaded by the agent.
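The two levels of maps in the proposal could be sketched like this. Class and method names are hypothetical; the point is that the raylet only counts whole runtime envs, while the agent owns the env-to-URI mapping and the per-URI counters, so a shared URI can never be reclaimed while any env referencing it is alive.

```python
class AgentUriRefCounter:
    """Sketch of the agent-side maps from the proposal:
    runtime env -> URIs, and URI -> reference counter."""

    def __init__(self, generate_uris):
        self._generate_uris = generate_uris  # env -> list of URIs
        self._env_to_uris = {}
        self._uri_refs = {}

    def create_runtime_env(self, env_key: str, runtime_env: dict):
        # Called on CreateRuntimeEnv; generating URIs here keeps the
        # raylet unaware of the URI concept entirely.
        if env_key not in self._env_to_uris:
            self._env_to_uris[env_key] = self._generate_uris(runtime_env)
        for uri in self._env_to_uris[env_key]:
            self._uri_refs[uri] = self._uri_refs.get(uri, 0) + 1

    def delete_runtime_env(self, env_key: str):
        for uri in self._env_to_uris.pop(env_key, []):
            self._uri_refs[uri] -= 1
            if self._uri_refs[uri] == 0:
                del self._uri_refs[uri]  # now safe to delete the resource


class WorkerPoolEnvRefCounter:
    """Sketch of the raylet-side maps: id -> runtime env, and
    runtime env -> (reference counter, runtime env context)."""

    def __init__(self):
        self._id_to_env = {}
        self._env_refs = {}  # env_key -> [count, context]

    def add_reference(self, worker_id: str, env_key: str, context=None):
        self._id_to_env[worker_id] = env_key
        entry = self._env_refs.setdefault(env_key, [0, context])
        entry[0] += 1

    def remove_reference(self, worker_id: str) -> bool:
        env_key = self._id_to_env.pop(worker_id)
        self._env_refs[env_key][0] -= 1
        # True means the raylet should ask the agent to drop this env,
        # which then decrements the underlying URI counters.
        return self._env_refs[env_key][0] == 0
```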
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- Created 2 years ago
- Comments: 39 (39 by maintainers)
Top GitHub Comments
@rkooo567 @edoakes @scv119 @architkulkarni Can we reach an agreement now? The final proposal is:
Can you also add option 4: Remove the dashboard agent restart feature & follow option 3? We can alternatively fate-share raylet and agents in this case. The pro is we don’t need to worry about failover at all. The con is idk how stable agents are.
Do you happen to know when this feature is actually useful? (When do only the agent processes fail while the raylet doesn't?)
Also cc @scv119 can you also take a look at this?
For this problem, I think there are 4 logical components:
The status quo is
Option 1
Option 2
Option 3
The simplest solution is StateStorage & StateManager & Downloaders living in the same physical process.
I think option 1 is the simplest of the three. The problem is that if StateManager and the Downloaders are in different physical processes, it can be slow (since these two components need to communicate). Imo, that is highly unlikely to be a problem (downloading URIs must be a lot more expensive than the local RPC).
Imo, option 2 is more complex because StateStorage & the manager live in different physical processes. This makes the protocols more complex too.
Option 3 is also complex for the same reason. But if we don't need to handle agent failures (restarts), StateStorage can just be an agent, which makes it a lot simpler & the most performant.
Maybe option 4 is to move the Python code in the agents into the raylets. This has the same benefits as option 3 + no restart, but it is very heavyweight to implement.
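For context on the tradeoff being discussed, the three components named in the thread (StateStorage, StateManager, Downloaders) could be sketched as interfaces like this. The interfaces and method names are my assumption; the options above differ only in which physical process each component runs in, and co-locating them (as in the "simplest solution" above) turns the cross-component calls into cheap in-process calls rather than RPCs.

```python
class StateStorage:
    """Holds the URI reference state. Depending on the option, this may
    live in the raylet, in an agent, or with the other components."""

    def __init__(self):
        self._refs = {}

    def incr(self, uri: str):
        self._refs[uri] = self._refs.get(uri, 0) + 1

    def decr(self, uri: str) -> bool:
        self._refs[uri] -= 1
        return self._refs[uri] == 0  # True: resource can be reclaimed


class Downloader:
    """Fetches URI resources; in Ray today this work is done by the agent."""

    def download(self, uri: str) -> str:
        return f"/local/cache/{uri}"  # placeholder for the real fetch


class StateManager:
    """Coordinates storage and downloads. If the three components share
    one physical process, create() involves no inter-process hop."""

    def __init__(self, storage: StateStorage, downloader: Downloader):
        self._storage = storage
        self._downloader = downloader

    def create(self, uri: str) -> str:
        self._storage.incr(uri)
        return self._downloader.download(uri)
```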
So, my preference is
Option 3 + removing agent restart feature > Option 1 > Moving agent logic to raylet using Cython > Option 3 > Option 2