`cml-runner` shut down due to idle timeout while still busy.

Hey everyone, I started getting this error a couple of weeks ago: jobs on GHA are failing with exit code 143. I did a little digging, and it seems to be caused by the --idle-timeout parameter, which defaults to 300 seconds. The runner apparently thinks it’s idle and shuts down while a job is in fact still running! I’m not sure why this is happening. Setting --idle-timeout=-1 resolves the issue, but I don’t want to do that for obvious reasons.

I’m using cml version 0.12.0, which addressed a similar issue from what I can see: https://github.com/iterative/cml/issues/808

Here are the logs from the runners:

root@ip-172-31-43-163:~# sudo journalctl -u cml.service -f
-- Logs begin at Tue 2022-02-22 16:02:55 UTC. --
Mar 24 17:10:29 ip-172-31-43-163 cml.sh[1869]: sshd override added, restarting daemon
Mar 24 17:10:30 ip-172-31-43-163 sudo[2799]: pam_unix(sudo:session): session closed for user root
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Preparing workdir /tmp/tmp.NZ4U6HZb1S/.cml/cml-wdw16iotk7..."}
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Launching github runner"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"EC2 id i-0c767456d3e8c3eec"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.327Z","level":"info","message":"runner status","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.328Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.910Z","level":"info","message":"runner status Current runner version: '2.289.1'","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.911Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Mar 24 17:11:18 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:11:18.401Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:15:50 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:16:40 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:40.360Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:16:42 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:42.954Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (OFFSHORE_PIPELINE, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:20:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:21:09 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:09.794Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (OFFSHORE_PIPELINE, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:21:15 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:15.077Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (REAL_PIPELINE_MOATA_430, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:43 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:43.309Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (REAL_PIPELINE_MOATA_430, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:25:46 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:46.246Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (SCOTWIND_2, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status","reason":"timeout:300","status":"terminated"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"waiting 10 seconds before exiting..."}
Mar 24 17:31:01 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Unregistering runner cml-wdw16iotk7..."}
Mar 24 17:31:03 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-wdw16iotk7\" is still running a job\""}
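
One way to read the log above is to track the job_started / job_ended statuses and flag every "Runner should be idle" or "terminated" entry that lands while a job is open. The following is only a rough sketch of that check: the sample lines are abridged copies of the journal entries (job names and the repo, date and job fields are dropped), and the helper names `parse_line` and `idle_events_while_busy` are made up for illustration, not part of cml.

```python
import json
from datetime import datetime

# Abridged from the journal above; only the fields this check needs are kept.
SAMPLE_LOG = """\
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status Listening for Jobs","status":"ready"}
Mar 24 17:11:18 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status Running job","status":"job_started"}
Mar 24 17:15:50 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:16:40 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status Job completed","status":"job_ended"}
Mar 24 17:25:46 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status Running job","status":"job_started"}
Mar 24 17:25:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status","reason":"timeout:300","status":"terminated"}
"""

LOG_YEAR = 2022  # journal prefixes carry no year; taken from the "date" fields in the log


def parse_line(line):
    """Split one journal line into (timestamp, JSON payload); None if it has no JSON."""
    prefix, sep, payload = line.partition(": {")
    if not sep:
        return None
    stamp = datetime.strptime(f"{LOG_YEAR} {prefix[:15]}", "%Y %b %d %H:%M:%S")
    return stamp, json.loads("{" + payload)


def idle_events_while_busy(log_text):
    """Return (timestamp, label) for idle checks or terminations seen mid-job."""
    busy = False
    hits = []
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, event = parsed
        status = event.get("status")
        if status == "job_started":
            busy = True
        elif status == "job_ended":
            busy = False
        elif busy and status == "terminated":
            hits.append((stamp, "terminated"))
        elif busy and event.get("message", "").startswith("Runner should be idle"):
            hits.append((stamp, "idle-check"))
    return hits


if __name__ == "__main__":
    for stamp, label in idle_events_while_busy(SAMPLE_LOG):
        print(stamp.time(), label)
```

In the full log the first "Runner should be idle" entry (17:15:50) comes exactly 300 seconds after "Listening for Jobs" (17:10:50), and the later ones follow at roughly 300-second intervals. That pattern would fit an idle timer that is not reset when a job starts, though the log alone does not prove it.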

Any idea what might be going on here?

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 19 (14 by maintainers)

Top GitHub Comments

2 reactions
dacbd commented, Mar 25, 2022

the Ghost is back! 😢

@0x2b3bfa0 any luck on this? It’s causing a lot of problems with our CI pipeline so would be great to fix asap!

@thatGreekGuy96 My immediate workaround, which has worked in the past, is to extend the timeout instead of disabling it: use --idle-timeout=21600 (6 h), i.e. a value greater than your expected runtime. Previously this issue has been intermittent, and the root cause has been extremely hard to pin down. You can also take a look here for previous context on something similar.
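
As a small illustration of the arithmetic behind that workaround, the sketch below converts an expected longest job duration into the flag value. The helper name `idle_timeout_flag` is hypothetical; the only thing taken from this thread is that --idle-timeout takes a number of seconds (300 by default, 21600 for 6 h).

```python
from datetime import timedelta


def idle_timeout_flag(longest_job: timedelta) -> str:
    """Build an --idle-timeout flag (in seconds) covering the longest expected job."""
    seconds = int(longest_job.total_seconds())
    if seconds <= 0:
        raise ValueError("longest_job must be a positive duration")
    return f"--idle-timeout={seconds}"


if __name__ == "__main__":
    print(idle_timeout_flag(timedelta(hours=6)))
```

The resulting string would be appended to the existing cml-runner invocation in the workflow; the other arguments stay whatever they already are.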

2 reactions
0x2b3bfa0 commented, Mar 24, 2022

Sounds like some sort of regression introduced with #689… 🙈 I’ll take a look a bit later today and try to find if that’s the case.
