[Release 1.11.0] job submission error
On the `releases/1.11.0` branch, there are job submission errors in `rte_ray_client` and `train_small`:
https://buildkite.com/ray-project/periodic-ci/builds/2788#6a73aaf1-80f7-40bf-9b8b-0f21c91e6e57/136-545
https://buildkite.com/ray-project/periodic-ci/builds/2788#1bdebe61-370e-4d93-a979-402732826c34/136-542
These look like mismatched command handling between the product and the job client. Can anyone advise on which commit to cherry-pick to fix this? E.g., would it be #22011, #22209, or something else? cc @edoakes @simon-mo @krfricke. Assigning to @architkulkarni as Platform oncall.
Issue Analytics
- Created 2 years ago
- Reactions:1
- Comments:16 (16 by maintainers)
Top GitHub Comments
Picking https://github.com/ray-project/ray/pull/22011 sounds good. One possibility is that `rte_ray_client` and `train_small` use Ray client (`use_connect: True`), and the codepath for that in `e2e.py` is different. I will send out a PR.

IIUC, the previous job command before the `wait_cluster.py` call installs `awscli` and copies `wait_cluster.py` and other local files to the Anyscale session: https://github.com/ray-project/ray/blob/8b1bbfe8e438a06bf2f9fe2cbf65f163d64227dd/release/e2e.py#L506-L512 Because the job fails, the `wait_cluster.py` file is missing.
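
To illustrate the failure mode described above (an earlier setup step fails, so a later step trips over a file that was never copied), here is a minimal sketch using only the Python standard library. It is not the code in `e2e.py`; the `run_steps` helper and the step names are hypothetical, and the commands are placeholders standing in for the real setup and `wait_cluster.py` calls. The idea is simply to stop at the first failing step and report it, so that the missing file is attributed to the setup failure rather than to the later command.

```python
import subprocess
import sys
from typing import List, Optional, Sequence, Tuple

# A step is a (name, argv) pair. Names here are hypothetical labels.
Step = Tuple[str, Sequence[str]]


def run_steps(steps: List[Step]) -> Optional[str]:
    """Run steps in order; return the name of the first failing step, or None."""
    for name, argv in steps:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            # Stop here: later steps may depend on files this step should have created.
            print(
                f"step {name!r} failed with exit code {result.returncode}; "
                "skipping later steps",
                file=sys.stderr,
            )
            return name
    return None


if __name__ == "__main__":
    steps = [
        # Placeholder for the setup job (install awscli, copy local files); forced to fail.
        ("setup", [sys.executable, "-c", "import sys; sys.exit(1)"]),
        # Placeholder for the follow-up call; never reached because setup failed.
        ("wait_cluster", [sys.executable, "wait_cluster.py"]),
    ]
    failed = run_steps(steps)
    print(f"first failing step: {failed}")
```

With a check like this, the log points at the setup step instead of surfacing only a confusing "file not found" for `wait_cluster.py`.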