`cml-runner` shut down due to idle timeout while still busy.
Hey everyone, I started getting this error a couple of weeks ago: jobs on GHA are failing with exit code 143. I did a little digging, and the cause seems to be the `--idle-timeout` parameter, which defaults to `300`.
The runner seems to think it is idle and shuts itself down, even though a job is actually still running!
I'm not sure why this is happening. Setting `--idle-timeout=-1` resolves the issue, but I don't want to do that for obvious reasons.
I'm using cml version `0.12.0` (which addressed a similar issue, from what I can see: https://github.com/iterative/cml/issues/808).
Here are the logs from the runners:
root@ip-172-31-43-163:~# sudo journalctl -u cml.service -f
-- Logs begin at Tue 2022-02-22 16:02:55 UTC. --
Mar 24 17:10:29 ip-172-31-43-163 cml.sh[1869]: sshd override added, restarting daemon
Mar 24 17:10:30 ip-172-31-43-163 sudo[2799]: pam_unix(sudo:session): session closed for user root
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Preparing workdir /tmp/tmp.NZ4U6HZb1S/.cml/cml-wdw16iotk7..."}
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Launching github runner"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"EC2 id i-0c767456d3e8c3eec"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.327Z","level":"info","message":"runner status","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.328Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.910Z","level":"info","message":"runner status Current runner version: '2.289.1'","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.911Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Mar 24 17:11:18 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:11:18.401Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:15:50 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:16:40 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:40.360Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:16:42 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:42.954Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (OFFSHORE_PIPELINE, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:20:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:21:09 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:09.794Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (OFFSHORE_PIPELINE, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:21:15 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:15.077Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (REAL_PIPELINE_MOATA_430, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:43 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:43.309Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (REAL_PIPELINE_MOATA_430, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:25:46 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:46.246Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (SCOTWIND_2, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status","reason":"timeout:300","status":"terminated"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"waiting 10 seconds before exiting..."}
Mar 24 17:31:01 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Unregistering runner cml-wdw16iotk7..."}
Mar 24 17:31:03 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-wdw16iotk7\" is still running a job\""}
Any idea what might be going on here?
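To make the pattern in the log above easier to spot, here is a minimal Python sketch that tracks `job_started` / `job_ended` events and flags every idle-related message logged while a job was still open. It relies only on the fields visible in the log lines above; the function name `find_busy_idle_events` and the two regular expressions are illustrative assumptions, not part of cml.

```python
import json
import re

# Assumption: each journald line ends with "cml.sh[PID]: " followed by a JSON payload,
# as in the log excerpt above.
PAYLOAD = re.compile(r"cml\.sh\[\d+\]: (\{.*\})\s*$")
TIMESTAMP = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})")


def find_busy_idle_events(lines):
    """Return (timestamp, message, job) tuples for idle-related events
    that were logged while a job was still running."""
    running = None  # job name from the last job_started without a job_ended
    findings = []
    for line in lines:
        match = PAYLOAD.search(line)
        if not match:
            continue  # not a cml JSON line (e.g. the sudo/pam line)
        try:
            event = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        ts = TIMESTAMP.match(line)
        stamp = ts.group(1) if ts else "?"
        status = event.get("status")
        message = event.get("message", "")
        if status == "job_started":
            running = message.split("Running job: ", 1)[-1]
        elif status == "job_ended":
            running = None
        elif running and ("Runner should be idle" in message or status == "terminated"):
            findings.append((stamp, message, running))
    return findings


# Three lines copied from the log above (raw strings keep the JSON intact).
sample = [
    r'Mar 24 17:25:46 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:46.246Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (SCOTWIND_2, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}',
    r'Mar 24 17:25:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}',
    r'Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status","reason":"timeout:300","status":"terminated"}',
]

findings = find_busy_idle_events(sample)
for stamp, message, job in findings:
    print(f"{stamp}: '{message}' while '{job}' was running")
```

On these three lines the sketch flags both the "Runner should be idle" error and the `timeout:300` termination, each logged while the SCOTWIND_2 job had started but not yet ended. The same function can be fed a saved copy of the full `journalctl -u cml.service` output.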
Top GitHub Comments
the Ghost is back! 😢
@thatGreekGuy96 My immediate bypass, which has worked in the past, is to extend the timeout instead: use `--idle-timeout=21600` (6h), i.e. greater than your expected runtime. Previously this issue has been intermittent, and the root cause has been extremely hard to find. You can also take a look here for previous context on something similar.

Sounds like some sort of regression introduced with #689… 🙈 I'll take a look a bit later today and try to find out if that's the case.
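For the `--idle-timeout=21600` bypass suggested above, a rough way to pick a value is to take the longest expected job runtime, apply a safety margin, and round up to a whole hour. The helper below is a hypothetical sketch: its name, the 2x margin, and the hourly rounding are assumptions chosen for illustration, not anything cml provides.

```python
import math


def bypass_idle_timeout(expected_runtimes_s, margin=2.0, round_to_s=3600):
    """Hypothetical helper: return a timeout in seconds that sits comfortably
    above the longest expected job runtime, rounded up to a whole hour."""
    longest = max(expected_runtimes_s)
    return math.ceil(longest * margin / round_to_s) * round_to_s


# Approximate durations (seconds) of the three jobs that completed in the log above.
short_jobs = bypass_idle_timeout([322, 267, 268])

# A workload whose longest job takes about three hours.
long_jobs = bypass_idle_timeout([3 * 3600])

print(f"--idle-timeout={short_jobs}")
print(f"--idle-timeout={long_jobs}")
```

This only sizes the workaround. It does not address why the runner considers itself idle while a job is running.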