Runners are being removed for being idle before its job has had a chance to be assigned to it #4000

Open
4 tasks done
JohnLBergqvist opened this issue Mar 27, 2025 · 9 comments
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers

Comments

@JohnLBergqvist

JohnLBergqvist commented Mar 27, 2025

Controller Version

0.10.1 and upwards

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

To reproduce, simply deploy the scale sets as normal (I was using the Quickstart Guide) and begin running jobs. No changes were made to our K8s cluster or to the Docker images we were using for the runners before this bug began.
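
For reference, this was essentially the standard quickstart install. Roughly the following, with the chart locations and value names reproduced from memory here, so treat it as a sketch of the setup rather than the exact commands we ran:

  # Install the controller into its own namespace
  helm install arc \
    --namespace arc-systems --create-namespace \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

  # Install a runner scale set pointing at the org (the <...> values are placeholders)
  helm install arc-runner-set \
    --namespace arc-runners --create-namespace \
    --set githubConfigUrl="https://github.com/<org>" \
    --set githubConfigSecret.github_token="<token>" \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set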

Describe the bug

Ephemeral runners are correctly brought up and begin advertising themselves to the repository/organisation as expected; however, if a job hasn't begun running on them within 10 seconds, ARC will kill the runners because it thinks they're idle.

While the data below refers to Windows runners, we also have Ubuntu runners where I've observed the issue happening - just much less frequently (around 5-10% of the time).

Describe the expected behavior

The controller should wait a bit longer before killing runners it considers idle. The fact that jobs are assigned correctly approx. 50% of the time implies there's a tiny threshold being missed somewhere along the line. Unfortunately I can't control how long GitHub takes to recognise that a free runner has come online, but it would be helpful if the controller waited longer than what seems to be as little as 10 seconds after creation before it kills a runner for apparently being idle.

Additional Context

githubConfigUrl: https://github.com/redacted
githubConfigSecret: redacted
runnerGroup: redacted
minRunners: 1
template:
  spec:
    containers:
      - name: runner
        image: redacted
        command: ["run.cmd"]
    serviceAccountName: redacted
    nodeSelector: # Ensures the pods can only run on nodes that have this label
      runner-os: windows
      iam.gke.io/gke-metadata-server-enabled: "true"
    tolerations: # Allows the pods to tolerate the taint on these nodes
      - key: runners-fooding
        operator: Equal
        value: "true"
        effect: NoSchedule
      - key: node.kubernetes.io/os
        operator: Equal
        value: "windows"
        effect: NoSchedule

Controller Logs

https://gist.github.com/JohnLBergqvist/46553ba6043449e704af88f1a706228e

Runner Pod Logs

Logs: 

√ Connected to GitHub

Current runner version: '2.323.0'
2025-03-27 20:27:35Z: Listening for Jobs


Describe output

Name:             redacted-m2xmj-runner-2sb5k
Namespace:        arc-runners
Priority:         0
Service Account:  redacted
Node:             gke-49a8bb-scng/10.128.0.10
Start Time:       Thu, 27 Mar 2025 20:23:24 +0000
Labels:           actions-ephemeral-runner=True
                  actions.github.com/organization=redacted
                  actions.github.com/scale-set-name=redacted
                  actions.github.com/scale-set-namespace=arc-runners
                  app.kubernetes.io/component=runner
                  app.kubernetes.io/instance=redacted
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=redacted
                  app.kubernetes.io/part-of=gha-runner-scale-set
                  app.kubernetes.io/version=0.11.0
                  helm.sh/chart=gha-rs-0.11.0
                  pod-template-hash=79798d59cd
Annotations:      actions.github.com/patch-id: 0
                  actions.github.com/runner-group-name: Cover
                  actions.github.com/runner-scale-set-name: redacted
                  actions.github.com/runner-spec-hash: 78d4b6447
Status:           Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:               10.36.2.11
IPs:
  IP:           10.36.2.11
Controlled By:  EphemeralRunner/redacted-m2xmj-runner-2sb5k
Containers:
  runner:
    Container ID:  containerd://redacted
    Image:         redacted
    Image ID:      redacted@sha256:redacted
    Port:          <none>
    Host Port:     <none>
    Command:
      run.cmd
    State:          Running
      Started:      Thu, 27 Mar 2025 20:27:30 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     2
      memory:  10Gi
    Environment:
      ACTIONS_RUNNER_INPUT_JITCONFIG:          <set to the key 'jitToken' in secret 'redacted-m2xmj-runner-2sb5k'>  Optional: false
      GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT:  actions-runner-controller/0.11.0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-clv4p (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-clv4p:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              iam.gke.io/gke-metadata-server-enabled=true
                             runner-os=windows
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/os=windows:NoSchedule
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             runners-fooding=true:NoSchedule
Events:
  Type     Reason            Age                    From                Message
  ----     ------            ----                   ----                -------
  Normal   Scheduled         4m21s                  default-scheduler   Successfully assigned arc-runners/redacted-m2xmj-runner-2sb5k to gke-49a8bb-scng
  Normal   Pulling           4m19s                  kubelet             Pulling image "redacted"
  Normal   Pulled            18s                    kubelet             Successfully pulled image "redacted" in 4m1.518s (4m1.518s including waiting). Image size: 3372778201 bytes.
  Normal   Created           18s                    kubelet             Created container: runner
  Normal   Started           15s                    kubelet             Started container runner
  Normal   Killing           5s                     kubelet             Stopping container runner
@JohnLBergqvist JohnLBergqvist added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Mar 27, 2025
@JohnLBergqvist JohnLBergqvist changed the title Runners are being removed for being idle Runners are being removed for being idle before the job has had a chance to be assinged to it Mar 27, 2025
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@JohnLBergqvist JohnLBergqvist changed the title Runners are being removed for being idle before the job has had a chance to be assinged to it Runners are being removed for being idle before the job has had a chance to be assigned to it Mar 28, 2025
@JohnLBergqvist JohnLBergqvist changed the title Runners are being removed for being idle before the job has had a chance to be assigned to it Runners are being removed for being idle before its job has had a chance to be assigned to it Mar 28, 2025
@pulkitanz

The way I understand it, when a job is available the listener updates the desired replicas and the runner is created.

It's not the other way around, where a runner pod sits idle and waits to pick up a job, unless you're setting the minRunners field to preemptively scale pods - and even in that case I don't see this behaviour.

@JohnLBergqvist
Author

JohnLBergqvist commented Mar 31, 2025

The way I understand it, when a job is available the listener updates the desired replicas and the runner is created.

This is correct; however, the runner is only active for about 5 seconds before being killed by the listener, before a job has started running on it.

What's happening is this:

Listener: 2025-03-27T20:19:43Z Creating new ephemeral runners (scale up)
GitHub job status: waiting for runner to become available
Listener: 2025-03-27T20:27:30Z Updating ephemeral runner status "ready": true
Runner: 2025-03-27 20:27:35Z: Listening for Jobs
Listener: 2025-03-27T20:27:40Z Removing the idle ephemeral runner

So yes, the runner pod does briefly sit idle for a few seconds before a job is assigned to it. Except the listener process is killing the pod a little too soon.

  1. When a job is scheduled and none of the runners of that type are available, the job itself will sit in a queued state with the job page saying "Waiting for a runner matching [runner-group] to become available".
  2. In the meantime, the actions runner controller will scale up to create the number of ephemeral runners needed.
  3. Those runners then come online, and the job will begin because a matching runner is now available.
    However, in my case, the listener is killing the runner before the job has had a chance to begin on it, because it thinks the runner is sitting idle.

For example, I can see that the listener brings a pod online at 09:49:52:

  {
    "lastProbeTime": null,
    "lastTransitionTime": "2025-03-31T09:49:52Z",
    "status": "True",
    "type": "Ready"
  }
]

Which is the point at which the listener sees the runner as ready.

However, there is a further delay of 5 seconds before the runner itself connects to GitHub and begins listening for jobs; a few seconds later, the job begins running on that runner.

√ Connected to GitHub

Current runner version: '2.323.0'
2025-03-31 09:49:57Z: Listening for Jobs
2025-03-31 09:50:08Z: Running job
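
(For anyone wanting to make the same comparison themselves, something along these lines should pull the two timestamps; the pod name is a placeholder, and the container name "runner" is taken from the describe output above:)

  # Ready transition time recorded on the pod - the point the listener treats as "ready"
  kubectl -n arc-runners get pod <runner-pod-name> \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'

  # The runner's own log timestamps ("Listening for Jobs" / "Running job")
  kubectl -n arc-runners logs <runner-pod-name> -c runner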

Maybe the listener should check to see that GitHub has successfully registered the self-hosted runner before it classes it as running?

@pulkitanz

pulkitanz commented Mar 31, 2025

What happens when you set minRunners in your Helm release to a non-zero value (let's say 10) and then run jobs? Are your jobs being picked up by those pods?

edit: Also, I'm not from GitHub; I'm building ephemeral runners too and this isn't an issue for me, so I'm just trying to help out.

@JohnLBergqvist
Author

JohnLBergqvist commented Apr 1, 2025

What happens when you set minRunners in your Helm release to a non-zero value (let's say 10) and then run jobs? Are your jobs being picked up by those pods?

Yes they are.
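
For anyone wanting to try the same thing, bumping minRunners is just a values change. A rough sketch, assuming a quickstart-style release name, namespace and chart location (adjust all of these to whatever your install actually uses):

  # Raise minRunners so a pool of runners is always available
  helm upgrade arc-runner-set \
    --namespace arc-runners \
    --reuse-values \
    --set minRunners=10 \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set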

@andrewbeckman

andrewbeckman commented Apr 3, 2025

@JohnLBergqvist I believe we're experiencing this as well and it started randomly in the last couple weeks after not touching anything related to this for many months. Did you ever figure out any workaround aside from forcing a pod to always be available?

@JohnLBergqvist
Author

@andrewbeckman Unfortunately not. I've noticed it seems to happen more often if only a single job is queued after a period of relative inactivity. If multiple jobs are queued in a short space of time, then there's a higher chance that more of them will schedule correctly - perhaps because the controller's main loop takes longer to finish, giving the runners more breathing room to accept a job?

@patrickvinograd

I believe we are also experiencing this issue. Have not tried the workaround of setting minRunners > 0.

In our case, the behavior on the GitHub UI side is that jobs randomly show as "Canceled", despite nobody canceling them and no subsequent pushes to a PR that would cancel a job.

I'm curious about these lines in your log snippet (we are seeing the same):

EphemeralRunner    Checking if runner exists in GitHub service ...
EphemeralRunner    Runner does not exist in GitHub service

As opposed to a race condition between the runner controller and the runner pod, that could point to an issue with the GitHub Actions Service API not reporting status correctly, in which case both the runner controller and the runner pod itself are behaving "correctly", in the sense that there's nothing on the GitHub Service side for them to act on.

That would also jibe with the fact that this has popped up in the last couple weeks despite no apparent local changes.
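
(If anyone else wants to check whether their controller is logging the same thing, something like the following should surface it; the namespace and deployment name here are guesses based on a default install and will likely differ on your cluster:)

  # Look for the "Checking if runner exists..." / "Runner does not exist..." lines
  kubectl -n arc-systems logs deployment/arc-gha-rs-controller --since=24h \
    | grep "GitHub service"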

@patrickvinograd

In other words, this could be a problem with the GH Actions service, not with the ARC project.
