SIGSEGV after update to 0.11.0 on listener #3993

Open
4 tasks done
gordonswing opened this issue Mar 25, 2025 · 42 comments
Labels: bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)

Comments

@gordonswing

Checks

Controller Version

0.11.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Stay on 0.10.1 on controller and scale set
2. Upgrade both to 0.11.0
3. Check listener logs
4. Revert back to 0.10.1

Describe the bug

During the upgrade, the listener was recreated with image version 0.11.0 and kept restarting with this error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1844383]

goroutine 1 [running]:
github.com/actions/actions-runner-controller/cmd/ghalistener/app.New({{0xc0000465e0, 0x1a}, 0x0, 0x0, {0x0, 0x0}, {0xc000430120, 0x28}, {0xc0003053b0, 0x10}, ...})
    github.com/actions/actions-runner-controller/cmd/ghalistener/app/app.go:73 +0x3e3
main.main()
    github.com/actions/actions-runner-controller/cmd/ghalistener/main.go:27 +0x1f8
Stream closed EOF for gha-runners-controller/k8s-dind-6746d74f-listener (listener)

Reverting back to 0.10.1 fixes the issue.
Config wasn't changed during the upgrade.

PS. The 0.11.0 controller is not backward compatible with a scale set on 0.10.1. The versions must be aligned.

Describe the expected behavior

Smooth upgrade from 0.10.1 to 0.11.0

Additional Context

Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-45-generic x86_64)
k8s version: v1.31.1+rke2r1

Controller Logs

Not provided

Runner Pod Logs

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1844383]

goroutine 1 [running]:
github.com/actions/actions-runner-controller/cmd/ghalistener/app.New({{0xc0000465e0, 0x1a}, 0x0, 0x0, {0x0, 0x0}, {0xc000430120, 0x28}, {0xc0003053b0, 0x10}, ...})
    github.com/actions/actions-runner-controller/cmd/ghalistener/app/app.go:73 +0x3e3
main.main()
    github.com/actions/actions-runner-controller/cmd/ghalistener/main.go:27 +0x1f8
Stream closed EOF for gha-runners-controller/k8s-dind-6746d74f-listener (listener)
gordonswing added the bug, gha-runner-scale-set, and needs triage labels on Mar 25, 2025
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

gordonswing changed the title from "<Please write what didn't work for you here>" to "SIGSEGV after update to 0.11.0 on listener" on Mar 25, 2025
@jeanmorais

The same error here.

@AFMiziara

+1

@jeanmorais

When I revert to 0.10.1 I got the error:

This self-hosted runner is currently using runner version 2.321.0. This version is out of date. Please update to the latest version 2.323.0

It seems we are locked. Thoughts?

@YuriBucci-Solfacil

Same error here; I reverted to 0.10.1.

@emilio507

> When I revert to 0.10.1 I got the error:
>
> This self-hosted runner is currently using runner version 2.321.0. This version is out of date. Please update to the latest version 2.323.0
>
> It seems we are locked. Thoughts?

This was a similar error I was facing. The fix is to upgrade your runner to version 2.323.0 and from there I was able to get the 0.10.1 to work again.

@emilio507

One more note: right around the time the 0.11.0 charts were released, the 2.321.0 version of the runner image broke in all of our clusters. It seems some other change made at the same time affected the 2.321.0 image, making it inoperable.

@nikola-jokic
Collaborator

Hey everyone,

A new field for metrics has been added, and I didn't account for it not being present... When you configure metrics, you should specify listenerMetrics, which is a configurable set of metrics that will be emitted by the listener.

However, I missed checking if this field is nil since it can be. Thinking about it, it might be a good thing that the listener panics, since you are configuring the listener with metrics, while you didn't specify any. Otherwise, it would silently work, and no metrics would be served.

I think I need to update the release note to call out this field explicitly.

Please let me know if the issue persists when you comment out the listenerMetrics.

For future reference, whenever we have a behavioral change or a breaking change, we issue a minor release. Whenever we have bug fixes, we issue a patch release. This means that when doing minor version upgrades, you should probably read the release notes and make sure you understand what needs to change. I admit I didn't do a good job calling out these major changes, but in the future, please pay attention to the release notes on each minor version upgrade.
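
The missing nil check described in this comment can be illustrated with a small Go sketch. This is not the actual ARC code: the `Config` and `MetricsConfig` types, the `validate` function, and the error message are hypothetical names chosen for illustration. It only shows the general pattern of returning a descriptive error when metrics are enabled but no metrics configuration was provided, rather than dereferencing a nil pointer later.

```go
package main

import (
	"errors"
	"fmt"
)

// MetricsConfig is a hypothetical stand-in for the listener's metrics
// configuration. It is NOT the actual ARC type.
type MetricsConfig struct {
	Counters map[string][]string // metric name -> label names
}

// Config is a hypothetical listener configuration.
type Config struct {
	MetricsAddr string         // non-empty when the controller enables metrics
	Metrics     *MetricsConfig // nil when listenerMetrics is not set
}

// ErrMissingMetrics is returned when metrics are enabled but not configured.
var ErrMissingMetrics = errors.New("metrics address is set but listenerMetrics is missing")

// validate returns a descriptive error up front, so that a later
// dereference of cfg.Metrics cannot panic.
func validate(cfg Config) error {
	if cfg.MetricsAddr != "" && cfg.Metrics == nil {
		return ErrMissingMetrics
	}
	return nil
}

func main() {
	err := validate(Config{MetricsAddr: ":9100"})
	fmt.Println("validation result:", err)
}
```

With a check like this, a misconfigured listener would still fail fast, but the log would name the missing field instead of showing a stack trace.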

nikola-jokic removed the needs triage label on Mar 25, 2025
@gordonswing
Author

@nikola-jokic, it would be great to have any breaking changes listed in the Changelog.

I have two runner scale sets, and neither of them has listenerMetrics enabled:

---
githubConfigUrl: "https://github.com/XXX"
runnerGroup: "dind"
runnerScaleSetName: "k8s-dind"
maxRunners: 30
minRunners: 3

controllerServiceAccount:
  namespace: gha-runners-controller
  name: gha-runner-scale-set-controller

template:
  spec:
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gha-runner"
        effect: "NoSchedule"
    nodeSelector:
      gha-runner: "true"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gha-runner
                  operator: In
                  values:
                    - "true"
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker/docker.sock
        securityContext:
          runAsUser: 1001
          runAsGroup: 1001
          privileged: true
          allowPrivilegeEscalation: true
        volumeMounts:
          - name: docker-sock
            mountPath: /var/run/docker
          - name: work
            mountPath: /home/runner/_work
      - name: dind
        image: docker:dind
        args: ["--host=unix:///var/run/docker/docker.sock"]
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          privileged: true
          allowPrivilegeEscalation: true
        lifecycle:
          postStart:
            exec:
              command:
                - sh
                - -c
                - |
                  # Adjust permissions on the socket file
                  while [ ! -S /var/run/docker/docker.sock ]; do
                    sleep 1
                  done
                  chown root:1001 /var/run/docker/docker.sock
                  chmod 660 /var/run/docker/docker.sock
        volumeMounts:
          - name: docker-sock
            mountPath: /var/run/docker
          - name: work
            mountPath: /home/runner/_work
    volumes:
      - name: docker-sock
        emptyDir: {}
      - name: work
        emptyDir: {}
    securityContext:
      fsGroup: 1001

And the second one:

---
githubConfigUrl: "https://github.com/XXX"
maxRunners: 30
minRunners: 3
runnerGroup: "k8s"
runnerScaleSetName: "k8s-native"

containerMode:
  type: "kubernetes"

controllerServiceAccount:
  namespace: gha-runners-controller
  name: gha-runner-scale-set-controller

template:
  spec:
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gha-runner"
        effect: "NoSchedule"
    nodeSelector:
      gha-runner: "true"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gha-runner
                  operator: In
                  values:
                    - "true"
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "openebs-hostpath"
              resources:
                requests:
                  storage: 10Gi

@nikola-jokic
Collaborator

Hey @gordonswing,

I completely agree. I should have been more explicit in the release notes, and I will fix that.

The metrics are configured when you install the controller.

@gordonswing
Author

@nikola-jokic, thanks for the explanation.
This is my values.yaml for the controller. Can you please point out what needs to change on the configuration side to support 0.11.0?
BTW, can this be handled by _helpers, automatically enabling listenerMetrics when metrics are enabled on the controller, to prevent such issues in the future and add backward compatibility?

---
replicaCount: 2
installCRDs: true
serviceAccount:
  create: true
  name: "gha-runner-scale-set-controller"
resources:
  XXX
nodeSelector:
  gha-runner: "true"
priorityClassName: "system-cluster-critical"
metrics:
  controllerManagerAddr: ":9100"
  listenerAddr: ":9100"
  listenerEndpoint: "/metrics"
flags:
  logLevel: "debug"
  updateStrategy: "eventual"
  excludeLabelPropagationPrefixes:
    - "argocd.argoproj.io/instance"
  watchNamespace: "gha-runners-dind,gha-runners-k8s"

@HenrikDK

HenrikDK commented Mar 26, 2025

Would it be possible for the listener to operate on the principle of "sensible defaults"?

Having a hard crash due to a missing settings structure seems like a bad look.

Normally there is a significant difference between "I want runner metrics" and "I want to control my runner metrics setup in detail". For most services, ops people are used to just writing:

metrics: true

@nikola-jokic
Collaborator

Hey @gordonswing,

Sure. When you configure metrics on the controller, all listeners are configured to serve metrics as well. To fix your particular use case, you should uncomment (add) the listenerMetrics field for each of your scale sets.

Example for one of your scale sets:

---
githubConfigUrl: "https://github.com/XXX"
runnerGroup: "dind"
runnerScaleSetName: "k8s-dind"
maxRunners: 30
minRunners: 3

controllerServiceAccount:
  namespace: gha-runners-controller
  name: gha-runner-scale-set-controller

template:
  spec:
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gha-runner"
        effect: "NoSchedule"
    nodeSelector:
      gha-runner: "true"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gha-runner
                  operator: In
                  values:
                    - "true"
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker/docker.sock
        securityContext:
          runAsUser: 1001
          runAsGroup: 1001
          privileged: true
          allowPrivilegeEscalation: true
        volumeMounts:
          - name: docker-sock
            mountPath: /var/run/docker
          - name: work
            mountPath: /home/runner/_work
      - name: dind
        image: docker:dind
        args: ["--host=unix:///var/run/docker/docker.sock"]
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          privileged: true
          allowPrivilegeEscalation: true
        lifecycle:
          postStart:
            exec:
              command:
                - sh
                - -c
                - |
                  # Adjust permissions on the socket file
                  while [ ! -S /var/run/docker/docker.sock ]; do
                    sleep 1
                  done
                  chown root:1001 /var/run/docker/docker.sock
                  chmod 660 /var/run/docker/docker.sock
        volumeMounts:
          - name: docker-sock
            mountPath: /var/run/docker
          - name: work
            mountPath: /home/runner/_work
    volumes:
      - name: docker-sock
        emptyDir: {}
      - name: work
        emptyDir: {}
    securityContext:
      fsGroup: 1001

listenerMetrics:
  counters:
    gha_started_jobs_total:
      labels:
        ["repository", "organization", "enterprise", "job_name", "event_name"]
    gha_completed_jobs_total:
      labels:
        [
          "repository",
          "organization",
          "enterprise",
          "job_name",
          "event_name",
          "job_result",
        ]
  gauges:
    gha_assigned_jobs:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
    gha_running_jobs:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
    gha_registered_runners:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
    gha_busy_runners:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
    gha_min_runners:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
    gha_max_runners:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
    gha_desired_runners:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
    gha_idle_runners:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
  histograms:
    gha_job_startup_duration_seconds:
      labels:
        ["repository", "organization", "enterprise", "job_name", "event_name"]
      buckets:
        [
          0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
          7.0, 8.0, 9.0, 10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 30.0,
          40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 150.0,
          180.0, 210.0, 240.0, 300.0, 360.0, 420.0, 480.0, 540.0, 600.0, 900.0,
          1200.0, 1800.0, 2400.0, 3000.0, 3600.0,
        ]
    gha_job_execution_duration_seconds:
      labels:
        [
          "repository",
          "organization",
          "enterprise",
          "job_name",
          "event_name",
          "job_result",
        ]
      buckets:
        [
          0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
          7.0, 8.0, 9.0, 10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 30.0,
          40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 150.0,
          180.0, 210.0, 240.0, 300.0, 360.0, 420.0, 480.0, 540.0, 600.0, 900.0,
          1200.0, 1800.0, 2400.0, 3000.0, 3600.0,
        ]

@gordonswing
Author

> Hey @gordonswing,
>
> Sure, so when you specified the controller to configure metrics, all listeners are configured to serve metrics as well. To fix your particular use case, you should set uncomment (add) for each of your scale set the listenerMetrics field.
>
> Example for one of your scale sets

Thank you very much!

@nikola-jokic
Collaborator

Hey @HenrikDK,

Yes, but the reason this was left commented out is that Helm merges the two values.yaml files together when you apply -f. So, if you specified only a subset of the metrics, all of them would still be applied.

While answering this issue, I realized that the listener crash is actually the behavior I would personally prefer in this situation, even though I caused it by accident. It pointed out that the metrics were missing, and it was very obvious that something was wrong. However, it should come with a clear error message, not a nil dereference panic.

On the other hand, all metrics are commented out so you can add them simply by uncommenting them. It served as documentation as well as the mechanism to configure metrics. I would personally love to avoid having multiple ways of configuring the same thing. The containerMode is a great example where this approach caused many issues and hours spent reproducing/maintaining it. Users provide part of the spec in the template field, and we expand the containerMode spec for the other fields. Merging these specs is a nightmare to maintain and can cause surprising expansions.
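
The merging concern in this comment can be modeled with a short Go sketch. It is a simplified illustration of layering one values map over another, not Helm's actual implementation: `mergeValues` is a hypothetical helper, and the metric names are taken from the example earlier in this thread. Nested maps merge key by key, while scalars and lists are replaced by the override, so a user who lists only one counter under an uncommented default map would still end up with every default counter.

```go
package main

import "fmt"

// mergeValues is a simplified, hypothetical model of layering one values
// map over another: nested maps merge key by key, anything else
// (scalars, lists) is replaced by the override.
func mergeValues(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		baseMap, baseIsMap := out[k].(map[string]any)
		overrideMap, overrideIsMap := v.(map[string]any)
		if baseIsMap && overrideIsMap {
			out[k] = mergeValues(baseMap, overrideMap)
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	// Chart defaults with two counters enabled (illustrative only).
	defaults := map[string]any{
		"counters": map[string]any{
			"gha_started_jobs_total":   map[string]any{},
			"gha_completed_jobs_total": map[string]any{},
		},
	}
	// The user asks for a single counter.
	user := map[string]any{
		"counters": map[string]any{
			"gha_started_jobs_total": map[string]any{},
		},
	}
	merged := mergeValues(defaults, user)
	counters := merged["counters"].(map[string]any)
	fmt.Println("counters after merge:", len(counters))
}
```

Under this model the only way to drop a default key is to override it explicitly, which is the behavior the commented-out defaults were meant to avoid.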

@gordonswing
Author

I agree with the other commenters here that backward compatibility is essential, and adding dozens of lines (that could be defaults) to the current config to support a new version is not the best approach.

If listenerMetrics must be present when metrics are enabled, let's add them automatically. They could still be commented out in case somebody wants to tune them, but I don't need to tune them, so why can't they be uncommented by default or handled by _helpers instead?

It's generally up to the developers how to design their functions and their work, but the product will be used by end users, whose feedback can help improve it.

PS. A real nightmare is reading the whole values.yaml in case we need to uncomment something to support the latest update, even if it's well defined in the documentation and Release Notes. The best and most user-friendly way is to support all current values on all versions (I understand this is not always possible, but for the current case I counted at least two ways to mitigate this).

@notz

notz commented Mar 26, 2025

Also keep in mind, if you use Kubernetes + Helm, that a helm upgrade does not update the CRDs. The new CRDs are required to get this working; otherwise the controller will remove the listenerMetrics configuration from the AutoscalingRunnerSet.

You can use this command to see the difference:
helm show crds oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller --version 0.11.0 | kubectl diff -f -

Or apply it with:
helm show crds oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller --version 0.11.0 | kubectl apply -f - --server-side

@samtoxie

samtoxie commented Mar 26, 2025

Hi, I experienced this too. Reverted the update for now.

As the chart follows SemVer, and this is a breaking change compared to the previous version, shouldn't this have been a major update? I understand it is in the release notes, but with a minor update in SemVer you should be able to assume it does not contain breaking changes.

@nikola-jokic
Collaborator

Hey @samtoxie,

To address this question: per semantic versioning, 0.x.x is an unstable release. Since it is unstable, minor releases can have breaking changes. Initially, based on kubernetes-sigs/controller-runtime#2327 (comment) from a controller-runtime core contributor, the controller runtime would not support major releases as long as other dependencies are released under an unstable version tag.

Since our system heavily relies on these tools as well, it didn't make sense to release ARC under a stable version. The framework uses an unstable version, and the tool we use to generate manifests is released under an unstable version; therefore, it is hard to promise backward compatibility.

Having said that, patch releases will always stay backward compatible. Minor releases can be backward incompatible, as they basically represent a major version release. We try to issue a minor version upgrade when new features are added or backward-incompatible changes are introduced.
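
The release policy described in this comment can be expressed as a small Go check. This is only a sketch: `parse` and `mayBreak` are hypothetical helper names, the version strings are plain MAJOR.MINOR.PATCH without pre-release suffixes, and the rule for versions 1.0.0 and above (only a major bump may break) is an assumption based on standard semantic versioning rather than something stated in this thread.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits a "MAJOR.MINOR.PATCH" string into its three numbers.
func parse(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("invalid version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// mayBreak reports whether moving from one version to another can contain
// breaking changes: any major change, or any minor change while on 0.x.
func mayBreak(from, to string) (bool, error) {
	a, err := parse(from)
	if err != nil {
		return false, err
	}
	b, err := parse(to)
	if err != nil {
		return false, err
	}
	if a[0] != b[0] {
		return true, nil
	}
	if a[0] == 0 && a[1] != b[1] {
		return true, nil
	}
	return false, nil
}

func main() {
	breaking, err := mayBreak("0.10.1", "0.11.0")
	fmt.Println("read the release notes first:", breaking, err)
}
```

A check like this could gate automated chart upgrades so that 0.x minor bumps require a manual review of the release notes.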

@samtoxie

> Hey @samtoxie,
>
> To address this question, per semantic versioning, 0.x.x is an unstable release. Since it is unstable, minor releases could have breaking changes. Initially, based on kubernetes-sigs/controller-runtime#2327 (comment) from the controller-runtime core contributor, the controller runtime would not support major releases as long as other dependencies are released under unstable version tag.
>
> Since our system heavily relies on these tools as well, it didn't make sense to release ARC under stable version. The framework uses unstable version, the tool we use to generate manifests is released under unstable version, therefore, it is hard to promise backward compatibility.
>
> Having said that, patch releases would always stay backwards compatible. Minor releases can be backwards incompatible, as they basically represent the major version release. We try to issue a minor version upgrades when there are new features added, or backward incompatible changes are introduced.

Fair!

@booleanbetrayal

This just cost us several hours of debugging and we remediated it by temporarily providing the following to each scale set we're operating. We ran into these issues prior to any disclosure existing in the Release Notes about the breaking change. This was not in line with expectations of how a breaking change would be communicated or how fallback should be performed.

Remediation:

gha-runner-scale-set:
  listenerMetrics:
    counters:

Ideally, metrics would be something we could disable entirely at the controller level via a metrics.enabled boolean flag, but it appears that the Helm template always sets defaults for these command-line flags, regardless of whether metrics are desired or explicit null values are provided to metrics.

@HenrikDK

@nikola-jokic

> On the other hand, all metrics are commented out so you can add them simply by un-commenting them out.

Wouldn't it be better for the listener to ship with a default metrics setup so that it never crashes? Then the helm chart would simply contain an optional configuration override?

As @notz mentioned, the CRD changes this release contained are pretty painful for most GitOps solutions today. Flux, for instance, does not handle updates of CRDs, and it feels like the Helm people have thrown in the towel regarding any sort of automation for updating CRDs.

In our case we had to take down the entire setup (controller + 3 listeners), delete all actions.github.com CRDs on the cluster (including cleaning up finalizers), and install the new version.

I spent most of the day debugging and testing the new solution on our test cluster.

An alternative that could have avoided all this would be a config map, or the listener image containing a JSON file with a default setup.
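
The "default setup" idea suggested here could look roughly like the following Go sketch. It is not how the listener is implemented: `MetricsConfig`, `defaultMetrics`, and `resolveMetrics` are hypothetical names, and the single default counter with its labels is borrowed from the example values posted earlier in this thread.

```go
package main

import "fmt"

// MetricsConfig is a hypothetical stand-in for the listener's metrics
// configuration. It is NOT the actual ARC type.
type MetricsConfig struct {
	Counters map[string][]string // metric name -> label names
}

// defaultMetrics returns a built-in configuration used when the user
// supplies none. The contents are illustrative.
func defaultMetrics() *MetricsConfig {
	return &MetricsConfig{
		Counters: map[string][]string{
			"gha_started_jobs_total": {
				"repository", "organization", "enterprise", "job_name", "event_name",
			},
		},
	}
}

// resolveMetrics keeps a user-supplied configuration untouched and falls
// back to the built-in defaults only when nothing was provided.
func resolveMetrics(user *MetricsConfig) *MetricsConfig {
	if user == nil {
		return defaultMetrics()
	}
	return user
}

func main() {
	cfg := resolveMetrics(nil)
	fmt.Println("default counters:", len(cfg.Counters))
}
```

Because the fallback replaces the whole configuration instead of merging into it, a user who supplies any listenerMetrics gets exactly what they wrote.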

@notz
Copy link

notz commented Mar 26, 2025

@HenrikDK Flux can manage CRDs on upgrade, but it's not enabled by default.

@hanikesn

Even knowing we have to add the listenerMetrics object, the upgrade is quite painful, as we have to roll out the changes to the controller and update the gha-runner-scale-set at the same time. It'd be nice if the EphemeralRunnerSet CRD were backward compatible for at least one release to make the transition easier, even if that means missing metrics, which are much less critical than non-working runners.

@dee-kryvenko

dee-kryvenko commented Mar 31, 2025

> listenerMetrics are configurable metrics applied to the listener.
> In order to avoid helm merging these fields, we left the metrics commented out.
> When configuring metrics, please uncomment the listenerMetrics object below.
> You can modify the configuration to remove the label or specify custom buckets for histogram.

Oh my, this is so bad. Besides the fact that it is crashing, do you know any other project that requires users to configure metrics to get any metrics at all? Unfortunately, this is just a basic misunderstanding of dictionaries vs slices in YAML. There is a reason env on pods is a slice and not a dictionary. If you want to avoid merging, you use slices. On top of that, you do not use dictionaries like this at all in statically typed languages... again, just think of other examples, e.g. env, volumes, volumeMounts, etc.

@nikola-jokic
Collaborator

Hey @dee-kryvenko,

This is where we disagree entirely. A slice is a collection of homogeneous fields. This means that each field is a superset of its intended spec. Let's take env from your example. It has fields Value and ValueFrom. Without reading the documentation, it is not apparent which one takes precedence. Furthermore, let's say you apply 2 environment variables with the same name. Which one gets applied? It is not obvious. The API should be obvious.

The same applies here: what is the purpose of having buckets in a counter metric? Just because k8s decided on this API a while back doesn't mean that it is universally the right choice. And just because Helm doesn't properly merge in this use case, it doesn't mean that we should always design an API to conform to it. If down the line we add kustomize, should we then design new features to conform to its way of handling specs? Which one should we prioritize when designing the API?

To answer the first point, the system exposes metrics by applying the spec on the controller level. Each listener would publish metrics if the controller is configured to publish metrics. It would be a fair argument to say that these metrics should be configured per scale set, while the controller metrics are configured on the controller installation. This was probably a mistake, I admit that. But knowing that we made a mistake there, and assuming this part cannot change, I would personally much rather have the listener crash (which would make me read the release notes/read what is included in this release) than silently working and only later figuring out that no metrics are being published.

It is fair to criticize the approaches we took to design the configurable metrics. I'm just trying to explain the other side of the argument and the reasoning that went into it.

@dee-kryvenko

You didn't just disagree with me. You didn't just disagree with the Kubernetes project. You just disagreed with the Golang spec. What you are saying stems from ignorance. It's a typical JS/Python take on the issue, i.e. the take of a person who prefers dynamically typed languages. I'm not here to have this fight; if that's what you are, then it is what it is. But assuming you are not, because a) this project is written in Go and b) Helm is written in Go, I have one single question for you: can you show us the structs to deserialize what you've made in the listenerMetrics? ;)

@nikola-jokic
Collaborator

Not picking a fight, as I mentioned; I'm just explaining my reasoning :)

Yes, I can show you the structs:

And no, this is definitely not dynamically typed, as you can clearly see. And no, this is the take of someone who prefers strong types; thus, each field is a struct of its own and not an object that can be a counter or a histogram depending on the context.

Anyway, we are going to disagree here, and that is fine. But I wanted to take this opportunity to provide more context behind the design for others reading this thread in case they are interested.

@dee-kryvenko

Oh, I see. Yeah, it is not exactly the same concept as the env, volumes and volumeMounts examples. You are not trying to re-invent generics, at least. Good. Usually people run into this problem when they try to re-invent generics, and I hadn't looked closely at your values before jumping to conclusions, my bad.

However, the main principle being discussed still applies. You are using maps with arbitrary string keys. Are you implying that the order doesn't matter? What happens if two or more maps define a metric with the same key? Which one takes precedence? If I am trying to work with your serialized data externally, you do realize that I will not be able to re-serialize it back to what it was, because the order cannot be guaranteed? If I am working on some diffing tool, I will be screwed. Again, there is more than one reason maps are a bad idea in this use case.

You do realize that

listenerMetrics:
  counters:
    <key>:
      <spec>
  gauges:
    <key>:
      <spec>
  histograms:
    <key>:
      <spec>

And

listenerMetrics:
  <key>:
    counter:
      <spec>
    gauge:
      <spec>
    histogram:
      <spec>

Are conceptually the same thing, and both create the same challenge, just in two different places? And that doesn't even have anything to do with merging. Both variants can be represented as either a map or a slice. All you did by going with a map instead of a slice here is that you bought indexing at deserialization (it is not hard to index a slice yourself after deserialization and to validate that the data doesn't contain conflicts), and you paid for it with lack of order and bogus merging logic. What if I chain charts or use kustomize and I need to remove a key from a map that was defined upstream by someone else? This is very well understood and agreed upon in both the Helm and kustomize worlds. You had to think about this when you left the comment "In order to avoid helm merging these fields, we left the metrics commented out". Didn't that ring a bell? Is that a common practice? Does any other project do that? Is it a universally accepted idea that no defaults are better than sane defaults? The fact that you can't define sane defaults in your implementation didn't ring a bell either? Sorry, this last part is not a matter of opinion or discussion; it is just objectively and indisputably bad.
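
The slice-based alternative argued for in this comment, indexing after deserialization and rejecting conflicts, can be sketched in Go as follows. The `MetricSpec` type and the `indexByName` helper are hypothetical and are not part of ARC. The sketch only shows that a list keeps its declared order while a lookup map with duplicate detection can still be built from it.

```go
package main

import "fmt"

// MetricSpec is a hypothetical list-style metric entry. It is NOT an ARC type.
type MetricSpec struct {
	Name   string
	Kind   string // "counter", "gauge" or "histogram"
	Labels []string
}

// indexByName builds a name -> spec lookup from an ordered slice and
// rejects duplicate names, so conflicts surface as errors instead of
// being silently resolved.
func indexByName(specs []MetricSpec) (map[string]MetricSpec, error) {
	idx := make(map[string]MetricSpec, len(specs))
	for _, s := range specs {
		if _, dup := idx[s.Name]; dup {
			return nil, fmt.Errorf("duplicate metric %q", s.Name)
		}
		idx[s.Name] = s
	}
	return idx, nil
}

func main() {
	specs := []MetricSpec{
		{Name: "gha_started_jobs_total", Kind: "counter"},
		{Name: "gha_started_jobs_total", Kind: "gauge"},
	}
	_, err := indexByName(specs)
	fmt.Println("index error:", err)
}
```

The original slice stays available for anything that needs the declared order, such as re-serialization or diffing.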

@dee-kryvenko

Oh wow... you also use floats. Are you aware that you are not supposed to use floats? There is no way to guarantee how serialization handles precision with floats; you were supposed to use resource.Quantity. What if there is an admission controller that's not written in Go? Everything will be out of sync.

@hanikesn

hanikesn commented Apr 1, 2025

OT: @dee-kryvenko It's obvious you're invested in this topic, and while I also don't agree with the changes made, I think it's important to stay civil in the conversation and to keep in mind that behind every avatar there's another human being. :)

@dee-kryvenko

dee-kryvenko commented Apr 1, 2025

> OT: @dee-kryvenko It's obvious you're invested in this topic, and while I also don't agree with the changes made I think it's important to stay civil in the conversation and to keep in mind behind every avatar there's another human being. :)

Oh yeah, absolutely. Sorry about that. I am very passionate about my profession and I am of Eastern European no-filter origins. I keep forgetting that my normal way of talking may be considered rude in some parts of the world. I like this project and am grateful for all the work that went into it, and I have nothing personal against Nikola. I am just critiquing this particular change and the way it was released; it is just objectively bad. As a fellow open source maintainer, I would want to be told when I did something this bad too. We all make mistakes, and this is the way we can help each other become better tomorrow than we were yesterday.

@riccardosalamanna

Hi

I have read the thread extensively, and the release notes, but I am not entirely clear what specifically I have to uncomment to have the same metrics as I did before. Is it this whole chunk or something less? Thanks for the help.

@tiithansen

Sorry for sliding in like this, but this new metrics setup does not solve the underlying problems. I just started a new discussion on how to make these metrics useful. Also, some time ago I made a PR that tried to address some of the problems, but it was completely neglected.

@tw-sematell

I get the crashes even with listenerMetrics set. I tried to play around with various metrics settings, but nothing helped! Really annoying.

@kuhnroyal

> I get the crashes even with listenerMetrics set. I tried to play around with various metrics settings, but nothing helped! Really annoying.

You need to delete all CRDs and install the new version. Had the same problem.

@tw-sematell

> > I get the crashes even with listenerMetrics set. I tried to play around with various metrics settings, but nothing helped! Really annoying.
>
> You need to delete all CRDs and install the new version. Had the same problem.

That was it, thank you very much!

@nikola-jokic
Collaborator

Thank you @kuhnroyal for helping!

And for future reference, every release that contains a CRD change requires deleting the CRDs. Helm doesn't automatically update or delete them, which is not great for developer experience, but we can't do anything about it...

@dee-kryvenko

Yeah, you can. Don't change CRDs without bumping the API version, create conversion webhooks, maintain backward compatibility. But then again, that would require thinking about sane defaults, which apparently is a problem...

Suggesting to delete CRDs is questionable, but doing so without mentioning that deleting CRDs makes the Kubernetes garbage collector delete all resources of those types from the cluster is plain dangerous.

For when CRDs do change, because unfortunately no one RTFM, I typically have to use server-side apply. But I'm not using Helm, I'm using ArgoCD, so I'm not sure if that's the same problem or a different one.

@rr-krupesh-savaliya

The same error is popping up, even with listenerMetrics set.

Error:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1844383]

goroutine 1 [running]:
github.com/actions/actions-runner-controller/cmd/ghalistener/app.New({{0xc00048b1a0, 0x2e}, 0x0, 0x0, {0x0, 0x0}, {0xc00048b1d0, 0x28}, {0xc000158858, 0x11}, ...})
	github.com/actions/actions-runner-controller/cmd/ghalistener/app/app.go:73 +0x3e3
main.main()
	github.com/actions/actions-runner-controller/cmd/ghalistener/main.go:27 +0x1f8

@nbalagopal

> Thank you @kuhnroyal for helping!
>
> And for future reference, every release that contains the CRD change would require deleting CRDs. Helm doesn't automatically delete them, which is not great for developer experience, but we can't do anything about it...

Having the tip about adding listenerMetrics in the release notes was very helpful!
Deleting the CRDs is what finally got me through the upgrade process, though.
Can this be part of the release notes too, please? Release notes are far more discoverable than the discussion in this issue. (This one does not show up in any search results at all.)

@ranyhb

ranyhb commented Apr 14, 2025

Be sure to add this piece of code to the scale-set values.
Without it, the listeners will always fail to start.
