[Feature][runtime env] URI reference refactor

Search before asking

  • I searched the issues and found no similar feature request.

Description

Currently, we generate the URI list in the Python worker when we validate a runtime env input. The related code is here: https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/validation.py#L394.

But this logic can’t be reused if we want to support cross-language runtime envs. For example, if we want to create a Python runtime env from Java, we would have to rewrite this logic in the Java worker.

To address this, we should move the URI-generating logic from the Python worker to the agent. The agent will then reply with the URIs to the raylet’s worker pool for reference counting. That makes this logic a general step that workers in any language can use. The URI resources are already downloaded by the agent, so I think this makes sense.

But if we generate URIs in the agent, we introduce a race condition. For example: worker A is using runtime env A with “URI A”. At this point, a new task with runtime env B arrives, and “URI A” is also included in runtime env B. But the raylet doesn’t know the URI list of runtime env B until the agent replies to the CreateRuntimeEnv RPC, so it can’t increase the reference count of “URI A” immediately. If worker A exits in that window, the reference count of “URI A” drops to 0 and the URI is deleted.
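The ordering problem described above can be replayed in a few lines. This is only a minimal sketch: `WorkerPoolUriCounter` and its methods are hypothetical stand-ins, not Ray's actual worker pool code, and "deleting" a URI is modeled as appending it to a list.

```python
from collections import Counter


class WorkerPoolUriCounter:
    """Hypothetical stand-in for URI reference counting in the worker pool."""

    def __init__(self):
        self.ref_counts = Counter()
        self.deleted = []  # URIs whose local resources would have been removed

    def add_refs(self, uris):
        for uri in uris:
            self.ref_counts[uri] += 1

    def remove_refs(self, uris):
        for uri in uris:
            self.ref_counts[uri] -= 1
            if self.ref_counts[uri] == 0:
                del self.ref_counts[uri]
                self.deleted.append(uri)  # stands in for deleting the files


pool = WorkerPoolUriCounter()

# Worker A starts with runtime env A, which uses "URI A".
pool.add_refs(["URI A"])

# A task with runtime env B arrives. Its URI list is only known once the
# agent replies to CreateRuntimeEnv, so nothing can be counted yet.
pending_env_b_uris = None

# Worker A exits before the reply arrives: the count hits 0 and the URI goes.
pool.remove_refs(["URI A"])

# The reply arrives too late: "URI A" is counted again, but it was deleted.
pending_env_b_uris = ["URI A"]
pool.add_refs(pending_env_b_uris)
```

After this sequence the counter reports one reference to “URI A” even though the resource was already removed, which is the inconsistency the proposal tries to avoid.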

To avoid this race condition, maybe we should refactor URI reference counting. Here is a proposal:

  • Move the URI reference counting from the worker pool to the agent. We will maintain two maps in the agent:
    • Map from runtime env to URIs.
    • Map from URI to reference count.
  • In its place, add two new maps in the worker pool for runtime env reference counting:
    • Map from id to runtime env.
    • Map from runtime env to reference count and runtime env context.
  • With this refactor, the raylet no longer knows about the concept of a URI. All URI logic is handled by the agent itself.
  • Another benefit is that we can avoid sending unnecessary CreateRuntimeEnv RPCs when the runtime env has already been downloaded by the agent.
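The four maps in the proposal could look roughly like the sketch below. Everything here is an assumption for illustration: the `Agent` and `WorkerPool` classes, their method names, and the use of a JSON `"uris"` field as a stand-in for real URI generation are hypothetical, and the RPC is modeled as a plain synchronous call.

```python
import json
from collections import defaultdict


class Agent:
    """Hypothetical agent side: owns everything that mentions URIs."""

    def __init__(self):
        self.env_to_uris = {}             # runtime env -> URIs
        self.uri_refs = defaultdict(int)  # URI -> reference count
        self.deleted_uris = []
        self.create_calls = 0             # counts simulated CreateRuntimeEnv RPCs

    def create_runtime_env(self, serialized_env):
        self.create_calls += 1
        # Stand-in for real URI generation: read a "uris" field from the env.
        uris = json.loads(serialized_env)["uris"]
        self.env_to_uris[serialized_env] = uris
        for uri in uris:
            self.uri_refs[uri] += 1
        return f"context-for-{len(uris)}-uris"

    def delete_runtime_env(self, serialized_env):
        for uri in self.env_to_uris.pop(serialized_env):
            self.uri_refs[uri] -= 1
            if self.uri_refs[uri] == 0:
                del self.uri_refs[uri]
                self.deleted_uris.append(uri)


class WorkerPool:
    """Hypothetical raylet side: counts runtime envs, never sees a URI."""

    def __init__(self, agent):
        self.agent = agent
        self.id_to_env = {}  # id -> runtime env
        self.env_state = {}  # runtime env -> [reference count, context]

    def acquire(self, holder_id, serialized_env):
        if serialized_env not in self.env_state:
            context = self.agent.create_runtime_env(serialized_env)
            self.env_state[serialized_env] = [0, context]
        self.env_state[serialized_env][0] += 1
        self.id_to_env[holder_id] = serialized_env
        return self.env_state[serialized_env][1]

    def release(self, holder_id):
        serialized_env = self.id_to_env.pop(holder_id)
        state = self.env_state[serialized_env]
        state[0] -= 1
        if state[0] == 0:
            del self.env_state[serialized_env]
            self.agent.delete_runtime_env(serialized_env)


agent = Agent()
pool = WorkerPool(agent)
env_a = json.dumps({"uris": ["URI A"]})
env_b = json.dumps({"uris": ["URI A", "URI B"]})

pool.acquire("worker-1", env_a)
pool.acquire("worker-2", env_a)  # env A is already known: no second RPC
pool.acquire("worker-3", env_b)
pool.release("worker-1")
pool.release("worker-2")         # env A is released, env B still holds "URI A"
```

In this sketch the worker pool only counts whole runtime envs, so it can take a reference as soon as a task arrives without waiting to learn the URI list, and a repeated env does not trigger another create call.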

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 39 (39 by maintainers)

Top GitHub Comments

3 reactions
SongGuyang commented, Feb 8, 2022

@rkooo567 @edoakes @scv119 @architkulkarni Can we reach an agreement now? The final proposal is:

2 reactions
rkooo567 commented, Jan 31, 2022

Can you also add option 4: Remove the dashboard agent restart feature & follow option 3? We can alternatively fate-share raylet and agents in this case. The pro is we don’t need to worry about failover at all. The con is idk how stable agents are.

Do you happen to know when this feature is actually useful? (When would only the agent processes fail while the raylet doesn’t?)

Also cc @scv119 can you also take a look at this?

For this problem, I think there are 4 logical components:

  • Client: The client that uses URIs.
  • StateManager: Basically manages states. Fundamentally we only need 1 state: the URI ref count. A runtime env ref count is not necessary because a runtime env is just a set of URIs (lmk if this is wrong!).
  • StateStorage: We need to store states in a component that doesn’t need to handle faults (e.g., raylet or disk).
  • Downloaders: Manage downloading and deleting resources on the machine. They should run in a separate process & threads since this is heavy blocking IO.

The status quo is

  • Client: Workers. But we can say worker pool is a client because worker pool manages the lifetime of workers
  • StateManager: WorkerPool
  • StateStorage: WorkerPool
  • Downloaders: Agents

Option 1

  • Client: Workers. WorkerPool
  • StateManager: WorkerPool
  • StateStorage: WorkerPool
  • Downloaders: Agents

Option 2

  • Client: WorkerPool
  • StateManager: Agents
  • StateStorage: WorkerPool
  • Downloaders: Agents

Option 3

  • Client: Workers. WorkerPool
  • StateManager: Agents
  • StateStorage: Disk?
  • Downloaders: Agents

The simplest solution is StateStorage & StateManager & Downloaders living in the same physical process.

I think option 1 is the simplest solution out of the 3. But the problem is that if StateManager and Downloaders are in different physical processes, it can be slow (since these 2 components need to communicate). Imo, it is highly likely not a problem (downloading URIs must be a lot more expensive than the local RPC).

Imo, option 2 is more complex because StateStorage & StateManager live in different physical processes. This makes the protocols more complex too.

Option 3 is also complex for the same reason. But if we don’t need to handle agent failures (restart), StateStorage can just be the agent itself, so it becomes a lot simpler & the most performant.
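To make the “StateStorage: Disk?” variant of option 3 concrete, here is a minimal sketch of URI ref counts that survive an agent restart. The `DiskBackedUriRefs` class, the JSON file layout, and the file name are all hypothetical and only illustrate the extra moving part that disk storage adds.

```python
import json
import os
import tempfile


class DiskBackedUriRefs:
    """Hypothetical URI ref counts persisted to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.refs = {}
        # A restarted agent reloads whatever the previous process wrote.
        if os.path.exists(path):
            with open(path) as f:
                self.refs = json.load(f)

    def _flush(self):
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written state file behind.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.refs, f)
        os.replace(tmp_path, self.path)

    def incr(self, uri):
        self.refs[uri] = self.refs.get(uri, 0) + 1
        self._flush()

    def decr(self, uri):
        self.refs[uri] -= 1
        if self.refs[uri] == 0:
            del self.refs[uri]
        self._flush()


state_file = os.path.join(tempfile.mkdtemp(), "uri_refs.json")

first_agent = DiskBackedUriRefs(state_file)
first_agent.incr("URI A")
first_agent.incr("URI A")
first_agent.decr("URI A")

# Simulate an agent restart: a new object reads the persisted counts.
restarted_agent = DiskBackedUriRefs(state_file)
```

Even in this small form, every count change costs a file write, and a restarted agent would presumably still need to reconcile counts for workers that exited while it was down. Without agent restarts, a plain in-memory dict would be enough.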

Maybe option 4 is to move the Python code in agents to raylets. This has the same benefit as option 3 + no restart, but it is very heavyweight to implement.

So, my preference is

Option 3 + removing agent restart feature > Option 1 > Moving agent logic to raylet using Cython > Option 3 > Option 2
