Runners are being removed for being idle before a job has had a chance to be assigned to them #4000
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
The way I understand it, when a job becomes available, that's when the listener updates the desired replica count and the runner is created. It's not the other way around, where a runner pod sits idle and waits to pick up a job, unless you're setting the minRunners field to preemptively scale pods, and even in that case I don't see this behaviour.
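(For anyone reading along, here is a minimal sketch of what that preemptive scaling looks like in the scale set's Helm values; the field names are my recollection of the gha-runner-scale-set chart and the URL and secret name are placeholders, so treat this as illustrative rather than authoritative.)

```yaml
# Illustrative values for the gha-runner-scale-set chart (field names assumed, URL/secret are placeholders).
githubConfigUrl: "https://github.com/my-org"   # org or repo the runners register against
githubConfigSecret: gha-runner-secret          # secret holding the GitHub App or PAT credentials
# With minRunners > 0 the listener keeps that many runner pods registered even when no jobs are queued,
# so a newly queued job can land on an already-registered runner instead of a just-created one.
minRunners: 2
maxRunners: 10
```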
This is correct; however, the runner is only active for about 5 seconds before being killed by the listener, before a job has started running on it. What's happening is this:
So yes, briefly the runner pod does sit idle for a few seconds before a job is assigned to it, except the listener process is killing the pod a little too soon.
For example, I can see that the listener brings a pod online at
Which is the point at which the listener sees the runner as ready. However, there is a further delay of 5 seconds before the runner itself connects to GitHub and begins listening for jobs; a few more seconds later, the job begins running on that runner.
Maybe the listener should check that GitHub has successfully registered the self-hosted runner before it classes it as ready.
What happens when you set minRunners in your HelmRelease to a non-zero value (let's say 10) and then run jobs? Are your jobs being picked up by those pods? Edit: Also, I'm not from GitHub; I'm building ephemeral runners too and this is not an issue for me, so I'm just trying to help out.
Yes they are. |
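(As an illustration of the workaround being discussed, a rough sketch of pinning minRunners in a Flux-style HelmRelease follows; the names, namespace, and chart source are assumptions, not taken from this thread.)

```yaml
# Hypothetical HelmRelease fragment; metadata, chart source, URL and secret name are placeholders.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: arc-runner-set
  namespace: arc-runners
spec:
  interval: 10m
  chart:
    spec:
      chart: gha-runner-scale-set
      sourceRef:
        kind: HelmRepository
        name: actions-runner-controller   # assumed repository name
  values:
    githubConfigUrl: "https://github.com/my-org"   # placeholder org URL
    githubConfigSecret: gha-runner-secret          # placeholder secret name
    minRunners: 10   # keep 10 runners registered even when no jobs are queued
```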
@JohnLBergqvist I believe we're experiencing this as well and it started randomly in the last couple of weeks after not touching anything related to this for many months. Did you ever figure out any workaround aside from forcing a pod to always be available?
@andrewbeckman Unfortunately not. I've noticed it seems to happen more if only a single job is queued after a period of relative inactivity. If multiple jobs are queued in a short space of time, there's a higher chance that more of them will schedule correctly - perhaps because the controller's main loop is taking longer to finish, thus giving the runners more breathing room to accept a job?
I believe we are also experiencing this issue. We have not tried the workaround of setting minRunners > 0. In our case, the behavior on the GitHub UI side is that jobs randomly show as "Canceled" despite nobody canceling them and no subsequent pushes to the PR that would cancel a job. I'm curious about these lines in your log snippet (we are seeing the same):
Rather than a race condition between the runner controller and the runner pod, that could point to an issue with the GitHub Actions Service API where it is not reporting status correctly, and therefore both the runner controller and the runner pod itself are behaving "correctly" in the sense that there's nothing on the GitHub Service side for them to act on. That would also jibe with the fact that this has popped up in the last couple of weeks despite no apparent local changes.
In other words, this could be a problem with the GH Actions service, not with the ARC project.
Checks
Controller Version
0.10.1 and upwards
Deployment Method
Helm
Checks
To Reproduce
To reproduce, simply deploy the scale sets as normal (I was using the Quickstart Guide) and begin running jobs. No change was made to our Kubernetes cluster or to the Docker images we were using for the runners before this bug began.
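For illustration only, a quickstart-style values file for the runner scale set might look roughly like the sketch below; the organisation URL, secret name, scale set name, and runner image are assumptions, not the actual configuration from this report.

```yaml
# Illustrative gha-runner-scale-set values (URL, secret, name and image are placeholders).
githubConfigUrl: "https://github.com/my-org"   # org or repo the runners register against
githubConfigSecret: gha-runner-secret          # secret containing the GitHub App or PAT credentials
runnerScaleSetName: ubuntu-runners
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest   # placeholder runner image
        command: ["/home/runner/run.sh"]
```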
Describe the bug
Ephemeral runners are correctly brought up and begin advertising themselves to the repository/organisation as expected; however, if a job hasn't begun running on them within 10 seconds, ARC will kill the runners because it thinks they're idle.
While the data below refers to Windows runners, we also have Ubuntu runners where I've observed the issue happening, just with much lower frequency (around 5-10% of the time).
Describe the expected behavior
The controller should wait a bit longer before killing runners for being idle. The fact that jobs are assigned correctly approximately 50% of the time implies there's a tiny threshold being missed somewhere along the line. Unfortunately I can't control the delay before GitHub recognises that a free runner has come online, but it would be helpful if the controller waited longer than what seems to be as little as 10 seconds after creation before it kills a runner for apparently being idle.
Additional Context
Controller Logs
Runner Pod Logs