SIGSEGV after update to 0.11.0 on listener #3993
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
The same error here.
+1
When I reverted to 0.10.1, I got this error:
It seems we are blocked. Thoughts?
Same error here; I reverted to 0.10.1.
I was facing a similar error. The fix was to upgrade the runner to version 2.323.0; after that I was able to get 0.10.1 working again.
One more note: right around the time the 0.11.0 charts were posted, the 2.321.0 version of the runner image broke in all of our clusters. It seems some other change was made at the same time that impacted the 2.321.0 image and made it inoperable.
Hey everyone,

The new field for metrics has been added, and I didn't account for it not being present. When you configure metrics, you should specify listenerMetrics, which is a configurable set of metrics that will be emitted by the listener. However, I missed checking whether this field is actually set, so the listener panics with a nil dereference when it is absent. I think I need to update the release notes to call out this field explicitly. Please let me know if the issue persists once you uncomment the listenerMetrics.

For future reference: whenever we have a behavioral change or a breaking change, we issue a minor release; whenever we have bug fixes, we issue a patch release. This means that when doing minor version upgrades, you should read the release notes and make sure you understand what needs to change. I admit I didn't do a good job calling out these major changes, but in the future, please pay attention to the release notes on each minor version upgrade.
@nikola-jokic, it would be great to have any breaking changes in the changelog. I have two runner scale sets, and neither of them has `listenerMetrics` enabled:
And the second one:
Hey @gordonswing, I completely agree. I should have been more explicit in the release notes, and I will fix that. The metrics are configured when you install the controller.
@nikola-jokic, thanks for the explanation.
Would it be possible for the listener to operate on the principle of "sensible defaults"? Having a hard crash due to a missing settings structure seems like a bad look. Normally there is a significant difference between "I want runner metrics" and "I want to control my runner metrics setup in detail"; for most services, ops people are just used to writing `metrics: true`.
Hey @gordonswing,

Sure. When you configure the controller to expose metrics, all listeners are configured to serve metrics as well. To fix your particular use case, you should uncomment (add) the `listenerMetrics` section for each of your scale sets. Example for one of your scale sets:

```yaml
---
githubConfigUrl: "https://github.com/XXX"
runnerGroup: "dind"
runnerScaleSetName: "k8s-dind"
maxRunners: 30
minRunners: 3
controllerServiceAccount:
  # ...
template:
  # ...
listenerMetrics:
  # ...
```
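For reference, the uncommented subtree looks roughly like the sketch below. Treat it as a sketch only: the exact metric names, labels, and bucket boundaries should be copied from the values.yaml shipped with the 0.11.0 gha-runner-scale-set chart, not from this comment.

```yaml
listenerMetrics:
  counters:
    gha_started_jobs_total:
      labels: ["repository", "organization", "enterprise", "job_name", "event_name"]
    gha_completed_jobs_total:
      labels: ["repository", "organization", "enterprise", "job_name", "event_name", "job_result"]
  gauges:
    gha_assigned_jobs:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
    gha_running_jobs:
      labels: ["name", "namespace", "repository", "organization", "enterprise"]
  histograms:
    gha_job_startup_duration_seconds:
      labels: ["repository", "organization", "enterprise", "job_name", "event_name"]
      buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]
```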
Thank you very much!
Hey @HenrikDK,

Yes, but the reason this was left commented out is that if Helm merged these two maps (the chart defaults and your overrides), you could not remove a default metric simply by omitting it.

While answering this issue: the listener crash is actually the behavior I would personally prefer in this situation, even though I caused it by accident. It pointed out that the metrics were missing, and it was very obvious that something was wrong. However, it should come with a nice error message, not a nil dereference panic. On the other hand, all metrics are commented out, so you can add them simply by uncommenting them. That serves as documentation as well as the mechanism for configuring metrics. I would personally love to avoid having multiple ways of configuring the same thing.
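To illustrate the merging problem (my own illustration, not from the original comment): Helm deep-merges user-supplied maps over chart defaults, so if the defaults shipped enabled, omitting a metric would not remove it.

```yaml
# Suppose the chart defaults (hypothetically) shipped enabled with:
#   listenerMetrics:
#     gauges:
#       gha_assigned_jobs: {}
#       gha_running_jobs: {}
# A user values file containing only:
listenerMetrics:
  gauges:
    gha_assigned_jobs:
      labels: ["name", "namespace"]
# would still end up with gha_running_jobs after Helm merges the maps.
# Dropping a default key requires an explicit null:
#   gha_running_jobs: null
# Keeping the defaults commented out sidesteps this entirely.
```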
I agree with the other voters here that backward compatibility is essential, and adding a dozen lines (that could be defaults) to the current config just to support the new version is not the best approach. It's generally up to the developer how to declare their functions and behavior, but the product will be used by end users, whose feedback can help improve it.

P.S. A real nightmare is reading the whole ...
Also keep in mind, if you use Kubernetes + Helm, that `helm upgrade` does not update the CRDs. The new CRDs are required to get this working; otherwise the controller will remove the `listenerMetrics` configuration from the `AutoscalingRunnerSet`. You can diff the new CRDs against the cluster first and then apply them.
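The actual commands were lost from the comment above; one plausible way to do it, assuming the OCI chart location from the ARC docs:

```sh
# Render the chart's CRDs and diff them against what is in the cluster:
helm show crds oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.11.0 | kubectl diff -f -

# Apply the new CRDs (server-side apply avoids the annotation size limit):
helm show crds oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.11.0 | kubectl apply --server-side -f -
```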
Hi, I experienced this too and reverted the update for now. As the chart follows SemVer, and this is a breaking change compared to the previous version, shouldn't this have been a major update? I understand it is in the release notes, but with a minor update in SemVer you should be able to assume it does not contain breaking changes.
Hey @samtoxie,

To address this question: per semantic versioning, 0.x.x is an unstable release, and since it is unstable, minor releases can contain breaking changes. Initially, based on kubernetes-sigs/controller-runtime#2327 (comment) from a controller-runtime core contributor, controller-runtime would not issue major releases as long as its own dependencies are released under unstable version tags. Since our system heavily relies on these tools as well, it didn't make sense to release ARC under a stable version. The framework uses an unstable version, and the tool we use to generate manifests is released under an unstable version; therefore, it is hard to promise backward compatibility.

Having said that, patch releases will always stay backwards compatible. Minor releases can be backwards incompatible, as they essentially represent a major version release. We try to issue a minor version upgrade when new features are added or backward-incompatible changes are introduced.
Fair! |
This just cost us several hours of debugging, and we remediated it by temporarily providing the following to each scale set we operate. We ran into these issues before any disclosure of the breaking change existed in the release notes. This was not in line with expectations of how a breaking change would be communicated or how fallback should be performed. Remediation:
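The block itself was lost in extraction. Plausibly it was along these lines, on the assumption that a present-but-empty listenerMetrics map is enough to satisfy the listener's missing nil check (hypothetical; copying the full defaults from the chart's values.yaml is the safer route):

```yaml
# Per-scale-set values; empty maps keep the field non-nil without emitting metrics.
listenerMetrics:
  counters: {}
  gauges: {}
  histograms: {}
```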
Ideally, metrics would be something we could disable entirely at the controller level via a single flag.
Wouldn't it be better for the listener to ship with a default metrics setup so that it never crashes? Then the Helm chart would simply contain an optional configuration override.

As @notz mentioned, the CRD changes this release contained are pretty painful for most GitOps solutions today. Flux, for instance, does not handle updates of CRDs, and it feels like the Helm folks have thrown in the towel regarding any sort of automation for updating CRDs. In our case we had to take down the entire setup (controller + 3 listeners), delete all actions.github.com CRDs from the cluster (including cleaning up finalizers), and install the new version. I spent most of the day debugging and testing the new setup on our test cluster.

An alternative that could have avoided all this would have been a ConfigMap, or the listener image containing a JSON file with a default setup.
@HenrikDK Flux can manage CRDs on upgrade, but it's not enabled by default.
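For anyone on Flux, the knob referred to here lives on the HelmRelease. A sketch, assuming Flux's helm-controller API and a HelmRepository named actions-runner-controller (the resource names here are placeholders):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: arc-controller
  namespace: arc-systems
spec:
  interval: 10m
  chart:
    spec:
      chart: gha-runner-scale-set-controller
      version: "0.11.0"
      sourceRef:
        kind: HelmRepository
        name: actions-runner-controller
  # CRD policy: the defaults are Create on install and Skip on upgrade,
  # which is why CRDs are not updated out of the box.
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
```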
Even knowing we have to add the `listenerMetrics` section ...
Oh my, this is so bad. Besides the fact that it is crashing, do you know any other project that requires users to configure metrics to get any metrics at all? Unfortunately, this is just a basic misunderstanding of dictionaries vs. slices in YAML. There is a reason Kubernetes APIs use lists of named objects (like `env`) instead of maps ...
Hey @dee-kryvenko,

This is where we disagree entirely. A slice is a collection of homogeneous elements, which means each element's type is a superset of its intended spec. Let's take `env` from your example: it has the fields `name`, `value`, and `valueFrom`, even though `value` and `valueFrom` are mutually exclusive. The same applies here: what purpose does having `buckets` in a counter metric serve? Just because k8s decided on this API a while back doesn't mean it is universally the right choice. And just because Helm doesn't properly merge in this use case doesn't mean we should always design an API to conform to it. If down the line we add `kustomize`, should we then design new features to conform to its way of handling specs? Which one should we prioritize when designing the API?

To answer the first point: the system exposes metrics once you apply the defaults that ship (commented out) in the chart's values.yaml.

It is fair to criticize the approaches we took to design the configurable metrics. I'm just trying to explain the other side of the argument and the reasoning that went into it.
You didn't just disagree with me. You didn't just disagree with the Kubernetes project. You just disagreed with the Go spec. What you are saying stems from ignorance; it's a typical JS/Python take on the issue, i.e., the take of a person who prefers dynamically typed languages. I'm not here to have this fight; if that's what you are, then it is what it is. But assuming you are not, because a) this project is written in Go and b) Helm is written in Go, I have one single question for you: can you show us the structs that deserialize what you've made in listenerMetrics? ;)
Not picking a fight, as I mentioned; I'm just explaining my reasoning :) Yes, I can show you the structs:

actions-runner-controller/apis/actions.github.com/v1alpha1/autoscalingrunnerset_types.go (line 239 in 4ca37fb)
And no, this is definitely not dynamically typed, as you can clearly see. And no, this is the take of someone who prefers strong types; thus, each metric kind is a struct of its own, not an object that can be a counter or a histogram depending on the context. Anyway, we are going to disagree here, and that is fine. But I wanted to take this opportunity to provide more context behind the design for others reading this thread, in case they are interested.
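For readers who don't want to click through, the shape being described is roughly the following. This is a paraphrase for illustration, not a verbatim copy; the authoritative definitions are in the linked autoscalingrunnerset_types.go.

```go
// Sketch of the metrics spec discussed above: metric names are map keys,
// and each metric kind gets its own struct, so a counter cannot carry buckets.
type MetricsConfig struct {
	Counters   map[string]*CounterMetric   `json:"counters,omitempty"`
	Gauges     map[string]*GaugeMetric     `json:"gauges,omitempty"`
	Histograms map[string]*HistogramMetric `json:"histograms,omitempty"`
}

type CounterMetric struct {
	Labels []string `json:"labels"`
}

type GaugeMetric struct {
	Labels []string `json:"labels"`
}

type HistogramMetric struct {
	Labels  []string  `json:"labels"`
	Buckets []float64 `json:"buckets,omitempty"`
}
```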
Oh, I see. Yeah, it is not exactly the same concept as with `env`.

However, the main principle being discussed still applies. You are using maps with arbitrary string keys. Are you implying that the order doesn't matter? What happens if two or more maps define a metric with the same key? Which one takes precedence? If I am trying to work with your serialized data externally, you do realize that I will not be able to re-serialize it back to what it was, because the order cannot be guaranteed? If I am working on some diffing tool, I will be screwed. Again, there is more than one reason maps are a bad idea in this use case. You do realize that the map form and the slice form (see the two snippets after this comment) are conceptually the same thing, and both create the same challenge, just in two different places?

And that doesn't even have anything to do with merging. Both variants can be represented as either a map or a slice. All you bought by going with a map instead of a slice is indexing at deserialization (and it is not hard to index a slice yourself after deserialization and validate that the data doesn't contain conflicts), and you paid for it with lack of order and bogus merging logic. What if I chain charts or use kustomize and need to remove a key from a map that was defined upstream by someone else? This is very well understood and agreed upon in both the Helm and kustomize worlds. You had to think about this when you had to leave this comment ...
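The two snippets being compared were lost from the comment above; the contrast was presumably along these lines:

```yaml
# Map form (what the chart uses): names are keys, order is not preserved.
counters:
  gha_started_jobs_total:
    labels: ["repository"]
---
# Slice form (the Kubernetes convention): names are values, order is preserved,
# and key uniqueness has to be validated after decoding instead.
counters:
  - name: gha_started_jobs_total
    labels: ["repository"]
```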
Oh wow, you also use floats. Are you aware that you are not supposed to use floats? There is no way to guarantee how precision is handled when serializing floats; you were supposed to use integers or strings, which is exactly what the Kubernetes API conventions recommend for this reason.
OT: @dee-kryvenko It's obvious you're invested in this topic, and while I also don't agree with the changes made, I think it's important to stay civil in the conversation and to keep in mind that behind every avatar there's another human being. :)
Oh yeah, absolutely. Sorry about that. I am very passionate about my profession, and I am of Eastern European no-filter origins; I keep forgetting that my normal way of talking may be considered rude in some parts of the world. I like this project and am grateful for all the work that went into it, and I have nothing personal against Nikola. I am just critiquing this particular change and the way it was released; it is just objectively bad. As a fellow open-source maintainer myself, I would want to be informed when I did something this bad too. We all make mistakes, and this is how we help each other become better tomorrow than we were yesterday.
Hi, I have read the thread extensively, as well as the release notes, but I am still not entirely clear on what specifically I have to uncomment to get the same metrics I had before. Is it this whole chunk, or something less? Thanks for the help.
Sorry for sliding in like this, but this new metrics setup does not solve the underlying problems. I just started a new discussion on how to make these metrics useful. Also, some time ago I made a PR that tried to address some of these problems, but it was completely neglected.
I get the crashes even with `listenerMetrics` set. I tried playing around with various metrics settings, but nothing helped. Really annoying.
You need to delete all the CRDs and install the new version. I had the same problem.
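Concretely, that means something like the following, assuming the standard actions.github.com CRD names. Heed the warning later in this thread: deleting a CRD garbage-collects every resource of that kind, so uninstall the controller and scale sets first.

```sh
# List the ARC CRDs:
kubectl get crds | grep actions.github.com

# Delete them (WARNING: this removes all resources of these kinds):
kubectl delete crd \
  autoscalinglisteners.actions.github.com \
  autoscalingrunnersets.actions.github.com \
  ephemeralrunners.actions.github.com \
  ephemeralrunnersets.actions.github.com
```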
That was it, thank you very much! |
Thank you @kuhnroyal for helping! And for future reference: every release that contains a CRD change will require deleting the CRDs. Helm doesn't delete them automatically, which is not great for developer experience, but we can't do anything about it...
Yes, you can. Don't change CRDs without bumping the API version, create conversion webhooks, and maintain backward compatibility. But then again, that would require thinking about sane defaults, which apparently is a problem...

Suggesting to delete CRDs is questionable, but doing so without mentioning that deleting CRDs will make the Kubernetes garbage collector delete all resources of those kinds from the cluster is plainly dangerous.

For when CRDs do change (because unfortunately no one RTFM), I typically have to use server-side apply. But I'm not using Helm, I'm using ArgoCD, so I'm not sure if that's the same problem or a different one.
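A sketch of that server-side apply workflow, assuming the chart's crds/ directory is checked out locally:

```sh
# Server-side apply updates CRDs in place; --force-conflicts takes over
# fields still owned by a previous field manager.
kubectl apply --server-side --force-conflicts -f crds/
```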
The same error is popping up, even with `listenerMetrics` set. Error:
Having the tip about adding the `listenerMetrics` block ...
Be sure to add the `listenerMetrics` block shown earlier in this thread to the scale-set values.
Checks
Controller Version
0.11.0
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
During the upgrade, the listener was recreated with image version 0.11.0 and kept restarting due to this error:
Reverting back to 0.10.1 fixes the issue.
Config wasn't changed during the upgrade.
P.S. The 0.11.0 controller has no backward compatibility with scale sets on 0.10.1; the versions should be aligned.
Describe the expected behavior
Smooth upgrade from 0.10.1 to 0.11.0
Additional Context
Controller Logs
Runner Pod Logs