[Feature][runtime env] URI reference refactor
See original GitHub issue
Search before asking
- I had searched in the issues and found no similar feature requirement.
Description
Currently, we generate a URI list in the Python worker when we validate a runtime env input. The related code is here: https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/validation.py#L394.
But if we want to support cross-language runtime envs, this logic can't be reused. For example, if we want to create a Python runtime env from Java, we would have to rewrite this logic in the Java worker.
So, to address this issue, we should move this URI-generating logic from the Python worker to the agent, and the agent will reply with the URIs to the raylet's worker pool for reference counting. That way, this logic becomes a general step that can be used by workers of all languages. The URI resources are also downloaded by the agent anyway, so I think this makes sense.
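As a rough illustration of what agent-side URI generation could look like, here is a hypothetical sketch. The function name, URI scheme, and hashing choice are all assumptions for illustration; the real logic lives in `validation.py` and handles many more fields and formats.

```python
import hashlib
import json


def generate_uris(runtime_env: dict) -> list:
    """Hypothetical sketch of deriving deterministic URIs for a runtime env.

    The key property is determinism: the same runtime env content always
    maps to the same URI, which is what lets the agent do this once for
    workers of every language instead of each worker reimplementing it.
    """
    uris = []
    for field in ("working_dir", "py_modules", "conda", "pip"):
        if field in runtime_env:
            # Hash the canonicalized field content so equal envs share URIs.
            content = json.dumps(runtime_env[field], sort_keys=True)
            digest = hashlib.sha1(content.encode()).hexdigest()
            uris.append(f"{field}://{digest}")
    return uris
```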
But if we generate URIs in the agent, we introduce a race condition. For example:
Worker A is using runtime env A, which contains URI A. At this point, a new task arrives with runtime env B, and URI A is also included in runtime env B. But the raylet doesn't know the URI list of runtime env B until the agent replies to the CreateRuntimeEnv RPC, so it can't increase the reference counter of URI A immediately. If worker A exits in the meantime, the reference counter of URI A drops to 0 and the resource is deleted.
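A minimal simulation of that race (all names and the counter layout are hypothetical, just to make the timing concrete):

```python
# The raylet's counters only cover URIs it already knows about, so a URI
# shared by a not-yet-replied CreateRuntimeEnv request is invisible to it.
ref_counts = {"URI_A": 1}  # worker A holds runtime env A -> URI A

# A task with runtime env B arrives. Runtime env B also contains URI A,
# but the raylet cannot increment ref_counts["URI_A"] here, because the
# agent has not yet replied with runtime env B's URI list.

# Worker A exits before the agent replies:
ref_counts["URI_A"] -= 1
deleted = [uri for uri, count in ref_counts.items() if count == 0]
# URI A is reclaimed even though runtime env B still needs it.
```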
To avoid this race condition, maybe we should refactor the URI reference counting. Here is a proposal:
- Move the URI reference counting from worker pool to agent. We will maintain two maps in agent:
- Map from runtime env to URIs
- Map from URI to reference counter
- In turn, we add two new maps in the worker pool for runtime env reference counting:
- Map from id to runtime env.
- Map from runtime env to reference counter and runtime env context.
- In this refactor, the raylet doesn't know the concept of a URI at all. All URI logic is handled by the agent itself.
- Another benefit is that we can avoid sending unnecessary CreateRuntimeEnv RPCs when the runtime env has already been downloaded by the agent.
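The two levels of maps in the proposal could be sketched like this. Class and method names are hypothetical; the point is that the raylet only counts whole runtime envs, while the agent owns the env-to-URI mapping and the per-URI counters, so a shared URI can never be reclaimed while any env referencing it is alive.

```python
class AgentUriRefCounter:
    """Sketch of the agent-side maps from the proposal:
    runtime env -> URIs, and URI -> reference counter."""

    def __init__(self, generate_uris):
        self._generate_uris = generate_uris  # env -> list of URIs
        self._env_to_uris = {}
        self._uri_refs = {}

    def create_runtime_env(self, env_key: str, runtime_env: dict):
        # Called on CreateRuntimeEnv; generating URIs here keeps the
        # raylet unaware of the URI concept entirely.
        if env_key not in self._env_to_uris:
            self._env_to_uris[env_key] = self._generate_uris(runtime_env)
        for uri in self._env_to_uris[env_key]:
            self._uri_refs[uri] = self._uri_refs.get(uri, 0) + 1

    def delete_runtime_env(self, env_key: str):
        for uri in self._env_to_uris.pop(env_key, []):
            self._uri_refs[uri] -= 1
            if self._uri_refs[uri] == 0:
                del self._uri_refs[uri]  # now safe to delete the resource


class WorkerPoolEnvRefCounter:
    """Sketch of the raylet-side maps: id -> runtime env, and
    runtime env -> (reference counter, runtime env context)."""

    def __init__(self):
        self._id_to_env = {}
        self._env_refs = {}  # env_key -> [count, context]

    def add_reference(self, worker_id: str, env_key: str, context=None):
        self._id_to_env[worker_id] = env_key
        entry = self._env_refs.setdefault(env_key, [0, context])
        entry[0] += 1

    def remove_reference(self, worker_id: str) -> bool:
        env_key = self._id_to_env.pop(worker_id)
        self._env_refs[env_key][0] -= 1
        # True means the raylet should ask the agent to drop this env,
        # which then decrements the underlying URI counters.
        return self._env_refs[env_key][0] == 0
```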
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- Created 2 years ago
- Comments: 39 (39 by maintainers)
Top GitHub Comments
@rkooo567 @edoakes @scv119 @architkulkarni Can we reach an agreement now? The final proposal is:
Can you also add option 4: Remove the dashboard agent restart feature & follow option 3? We can alternatively fate-share raylet and agents in this case. The pro is we don’t need to worry about failover at all. The con is idk how stable agents are.
Do you happen to know when this feature is actually useful? (When do only the agent processes fail while the raylet doesn't?)
Also cc @scv119 can you also take a look at this?
For this problem, I think there are 4 logical components:
The status quo is
Option 1
Option 2
Option 3
The simplest solution is StateStorage & StateManager & Downloaders living in the same physical process.
I think option 1 is the simplest of the three. The problem is that if StateManager and the Downloaders are in different physical processes, it can be slow (since these two components need to communicate). Imo, that is highly unlikely to be a problem (downloading URIs must be a lot more expensive than the local RPC).
Imo, option 2 is more complex because StateStorage & the manager live in different physical processes. This makes the protocols more complex too.
Option 3 is also complex for the same reason. But if we don't need to handle agent failures (restarts), StateStorage can just be an agent, which makes it a lot simpler & the most performant.
Maybe option 4 is to move the Python code in the agents into the raylets. This has the same benefits as option 3 + no restart, but it is very heavyweight to implement.
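For context on the tradeoff being discussed, the three components named in the thread (StateStorage, StateManager, Downloaders) could be sketched as interfaces like this. The interfaces and method names are my assumption; the options above differ only in which physical process each component runs in, and co-locating them (as in the "simplest solution" above) turns the cross-component calls into cheap in-process calls rather than RPCs.

```python
class StateStorage:
    """Holds the URI reference state. Depending on the option, this may
    live in the raylet, in an agent, or with the other components."""

    def __init__(self):
        self._refs = {}

    def incr(self, uri: str):
        self._refs[uri] = self._refs.get(uri, 0) + 1

    def decr(self, uri: str) -> bool:
        self._refs[uri] -= 1
        return self._refs[uri] == 0  # True: resource can be reclaimed


class Downloader:
    """Fetches URI resources; in Ray today this work is done by the agent."""

    def download(self, uri: str) -> str:
        return f"/local/cache/{uri}"  # placeholder for the real fetch


class StateManager:
    """Coordinates storage and downloads. If the three components share
    one physical process, create() involves no inter-process hop."""

    def __init__(self, storage: StateStorage, downloader: Downloader):
        self._storage = storage
        self._downloader = downloader

    def create(self, uri: str) -> str:
        self._storage.incr(uri)
        return self._downloader.download(uri)
```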
So, my preference is
Option 3 + removing agent restart feature > Option 1 > Moving agent logic to raylet using Cython > Option 3 > Option 2