Fix cluster IP allocator metrics #110027
Conversation
Hi @tksm. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @aojea
```diff
@@ -213,6 +213,7 @@ func (c LegacyRESTStorageProvider) NewLegacyRESTStorage(apiResourceConfigSource
 	if err != nil {
 		return LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, fmt.Errorf("cannot create cluster IP allocator: %v", err)
 	}
+	serviceClusterIPAllocator.EnableMetrics()
 	restStorage.ServiceClusterIPAllocator = serviceClusterIPRegistry
```
I have to remember how this ends up in the repair loops
kubernetes/pkg/controlplane/controller.go, line 174 in 9d85e18:

```go
repairClusterIPs := servicecontroller.NewRepair(c.ServiceClusterIPInterval, c.ServiceClient, c.EventClient, &c.ServiceClusterIPRange, c.ServiceClusterIPRegistry, &c.SecondaryServiceClusterIPRange, c.SecondaryServiceClusterIPRegistry)
```
tracing
kubernetes/pkg/controlplane/instance.go, lines 522 to 548 in 9d85e18:

```go
func (m *Instance) InstallLegacyAPI(c *completedConfig, restOptionsGetter generic.RESTOptionsGetter) error {
	legacyRESTStorageProvider := corerest.LegacyRESTStorageProvider{
		StorageFactory:              c.ExtraConfig.StorageFactory,
		ProxyTransport:              c.ExtraConfig.ProxyTransport,
		KubeletClientConfig:         c.ExtraConfig.KubeletClientConfig,
		EventTTL:                    c.ExtraConfig.EventTTL,
		ServiceIPRange:              c.ExtraConfig.ServiceIPRange,
		SecondaryServiceIPRange:     c.ExtraConfig.SecondaryServiceIPRange,
		ServiceNodePortRange:        c.ExtraConfig.ServiceNodePortRange,
		LoopbackClientConfig:        c.GenericConfig.LoopbackClientConfig,
		ServiceAccountIssuer:        c.ExtraConfig.ServiceAccountIssuer,
		ExtendExpiration:            c.ExtraConfig.ExtendExpiration,
		ServiceAccountMaxExpiration: c.ExtraConfig.ServiceAccountMaxExpiration,
		APIAudiences:                c.GenericConfig.Authentication.APIAudiences,
	}
	legacyRESTStorage, apiGroupInfo, err := legacyRESTStorageProvider.NewLegacyRESTStorage(c.ExtraConfig.APIResourceConfigSource, restOptionsGetter)
	if err != nil {
		return fmt.Errorf("error building core storage: %v", err)
	}
	if len(apiGroupInfo.VersionedResourcesStorageMap) == 0 { // if all core storage is disabled, return.
		return nil
	}
	controllerName := "bootstrap-controller"
	coreClient := corev1client.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig)
	eventsClient := eventsv1client.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig)
	bootstrapController, err := c.NewBootstrapController(legacyRESTStorage, coreClient, coreClient, eventsClient, coreClient.RESTClient())
```
I'm still not clear on what the root cause is. It seems to be an interaction with the repair loop, as confirmed by your changes, but the repair loops receive the registry allocator, no?
xref #109975 - might help clarify the semantics
```go
	base:    base,
	max:     maximum(0, int(max)),
	family:  family,
	metrics: &emptyMetricsRecorder{}, // disabled by default
```
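For context, the no-op ("null object") recorder pattern used in the hunk above can be sketched roughly as follows. The type and method names mirror the PR, but the bodies are simplified stand-ins, not the real allocator code:

```go
package main

import "fmt"

// metricsRecorderInterface abstracts metrics recording so it can be
// swapped for a no-op implementation.
type metricsRecorderInterface interface {
	setAllocated(n int)
}

// metricsRecorder stands in for the recorder that updates the real
// Prometheus gauges.
type metricsRecorder struct{ allocated int }

func (m *metricsRecorder) setAllocated(n int) { m.allocated = n }

// emptyMetricsRecorder drops every update: the default, so allocators
// created outside the API server's storage never touch exported metrics.
type emptyMetricsRecorder struct{}

func (m *emptyMetricsRecorder) setAllocated(n int) {}

// Range is a stripped-down stand-in for the IP allocator range.
type Range struct{ metrics metricsRecorderInterface }

func NewRange() *Range { return &Range{metrics: &emptyMetricsRecorder{}} }

// EnableMetrics swaps in the real recorder; only the instance that
// should report (the API server's allocator) calls it.
func (r *Range) EnableMetrics() { r.metrics = &metricsRecorder{} }

func main() {
	r := NewRange()
	r.metrics.setAllocated(7) // silently dropped by the no-op recorder
	r.EnableMetrics()
	r.metrics.setAllocated(7) // now recorded
	fmt.Println(r.metrics.(*metricsRecorder).allocated) // 7
}
```

The design choice here is that callers always go through the same interface; enabling metrics is a one-time swap rather than an `if enabled` branch at every recording site.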
Then, should we call registerMetrics() in L96, or only when we enable the metrics?
```go
// create a metrics-disabled IPv4 allocator;
// these metrics should be ignored
c, err := NewInMemory(clusterCIDRv4)
```
Can we move this to an individual test instead of overloading the current test?
- create `a` with metrics
- create `b` without metrics
- check `b` doesn't update `a`'s metrics
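The suggested test shape might look roughly like this, using a package-level counter as a stand-in for the shared global gauge (all names here are illustrative, not the real test code):

```go
package main

import "fmt"

// gaugeAllocated stands in for the shared, package-level metric.
var gaugeAllocated int

type recorder interface{ set(n int) }

// realRecorder writes through to the shared gauge.
type realRecorder struct{}

func (realRecorder) set(n int) { gaugeAllocated = n }

// noopRecorder drops every update.
type noopRecorder struct{}

func (noopRecorder) set(n int) {}

// alloc is a minimal allocator stand-in.
type alloc struct {
	count int
	rec   recorder
}

func newAlloc() *alloc          { return &alloc{rec: noopRecorder{}} }
func (a *alloc) enableMetrics() { a.rec = realRecorder{} }
func (a *alloc) allocate()      { a.count++; a.rec.set(a.count) }

func main() {
	a := newAlloc() // create a with metrics
	a.enableMetrics()
	b := newAlloc() // create b without metrics
	a.allocate()
	b.allocate()
	b.allocate()
	// check b doesn't update a's metrics: only a's single allocation
	// should be visible on the shared gauge.
	fmt.Println(gaugeAllocated) // 1
}
```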
```diff
@@ -86,6 +87,8 @@ type Range struct {
 	family api.IPFamily

 	alloc allocator.Interface
+	// metrics is a metrics recorder that can be disabled
+	metrics metricsRecorderInterface
```
I like this solution and it seems aligned with the rest of the code base.
@thockin, what do you think?
@dgrisonnet (sig-instrumentation) I'd like your opinion too. The fact that metrics are global is 😬. What do you think about this solution?
yeah, it is not ideal to have metrics declared globally, but at the same time, most of the codebase/ecosystem is doing that and I don't think we had any issues in the past because of that. The reason is that to "enable" a metric, you have to register it into a registry so you always need an extra step in order for your metric to start being exposed. So I don't really think having them global is a problem as such.
That said, the approach that you've taken is one of the potential ways to make metrics initialization cleaner, but I personally prefer the one taken here https://github.com/prometheus-operator/prometheus-operator/blob/main/pkg/operator/operator.go#L180-L272 that creates structures to group similar metrics together. You are then able to initialize and register metrics via a simple call to "New". We might have some precedent for that or something similar in the codebase, but off the top of my head, I can't think of any that I have seen.
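The struct-grouping alternative described above can be sketched roughly as follows. A toy registry type stands in for a real Prometheus registry, and all names (`allocatorMetrics`, `newAllocatorMetrics`, the metric names) are illustrative rather than taken from client_golang or the PR:

```go
package main

import "fmt"

// registry is a toy stand-in for a Prometheus registry.
type registry struct{ names []string }

func (r *registry) MustRegister(name string) { r.names = append(r.names, name) }

// allocatorMetrics groups the allocator's metrics in one struct, so a
// single constructor both creates and registers them, instead of
// relying on package-level global collectors.
type allocatorMetrics struct {
	allocated, available int
}

func newAllocatorMetrics(r *registry) *allocatorMetrics {
	m := &allocatorMetrics{}
	// Registration happens exactly once, at construction time.
	r.MustRegister("allocated_ips")
	r.MustRegister("available_ips")
	return m
}

func main() {
	reg := &registry{}
	m := newAllocatorMetrics(reg)
	m.allocated = 3
	fmt.Println(len(reg.names)) // 2
}
```

With this shape, only components that are handed an `allocatorMetrics` instance can report, which achieves a similar effect to the PR's no-op recorder without a global on/off switch.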
I think the current proposal is good enough; maybe it's a TODO for sig-instrumentation to work on standardizing this ;)
Yeah, it is good enough, we haven't standardized anything for now, so the implementation is up to the component owners.
That's a good point, I'll bring it to the group.
```diff
@@ -364,6 +368,11 @@ func (r *Range) Destroy() {
 	r.alloc.Destroy()
 }

+// EnableMetrics enables metrics recording.
+func (r *Range) EnableMetrics() {
+	r.metrics = &metricsRecorder{}
```
Should the metrics (registerMetrics()) be initialized here now?
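One way to answer this question is to guard registration with `sync.Once` inside `EnableMetrics`, so that enabling metrics on several allocators still registers the collectors only once. A minimal sketch, with `registerMetrics` standing in for the real registration call (names assumed, not taken from the PR):

```go
package main

import (
	"fmt"
	"sync"
)

var (
	// registerMetricsOnce ensures the global collectors are registered
	// at most once, no matter how many allocators enable metrics.
	registerMetricsOnce sync.Once
	registered          int
)

// registerMetrics stands in for the real registry calls
// (e.g. legacyregistry.MustRegister in Kubernetes).
func registerMetrics() { registered++ }

// Range is a stripped-down allocator stand-in.
type Range struct{ metricsEnabled bool }

// EnableMetrics turns on recording for this instance and lazily
// registers the collectors on first use.
func (r *Range) EnableMetrics() {
	registerMetricsOnce.Do(registerMetrics)
	r.metricsEnabled = true
}

func main() {
	a, b := &Range{}, &Range{}
	a.EnableMetrics()
	b.EnableMetrics()
	fmt.Println(registered) // 1: registration ran once despite two calls
}
```

Registering lazily like this keeps disabled allocators from ever exposing the metric names, at the cost of the metrics not appearing until the first `EnableMetrics` call.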
/ok-to-test
Force-pushed from f2f925a to 187af77.
@aojea Thank you for your comment. Please take another look.
/triage accepted
I'm fine with this. Defer lgtm to sig-instrumentation.
@dashpole do you accept LGTM responsibility? /approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aojea, thockin, tksm The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes a bug where the metrics for the cluster IP allocator are incorrectly reported.
Which issue(s) this PR fixes:
Fixes #109994
Special notes for your reviewer:
It seems the repair loop unintentionally overwrote the metrics. I made metrics recording disabled by default and enabled it only in LegacyRESTStorage.
I verified that the metrics no longer change after the repair interval (3m) on kind.
After fix
After this fix, the metrics no longer changed within 5 minutes.
Before fix
Before this fix, the metrics were incorrectly changed after a while (0 to 3 minutes).
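The overwrite mechanism can be illustrated with a small sketch: when every allocator instance writes to the same global gauge, the repair loop's separate allocator clobbers the API server's numbers. `allocatedIPs` below is a map stand-in for the real Prometheus gauge, and the scenario is simplified (assumed for illustration, not the actual code path):

```go
package main

import "fmt"

// allocatedIPs stands in for the global Prometheus gauge shared by
// every Range instance before the fix.
var allocatedIPs = map[string]int{}

// Range is a minimal allocator stand-in.
type Range struct{ allocated int }

// Allocate records allocations and, before the fix, unconditionally
// reports this instance's count to the shared global metric.
func (r *Range) Allocate(n int) {
	r.allocated += n
	allocatedIPs["cluster-ip"] = r.allocated
}

func main() {
	apiServer := &Range{}
	apiServer.Allocate(5)                   // 5 Services allocated
	fmt.Println(allocatedIPs["cluster-ip"]) // 5

	// The repair loop builds its own fresh Range while replaying state;
	// its partial count overwrites the real one on the shared gauge.
	repair := &Range{}
	repair.Allocate(2)
	fmt.Println(allocatedIPs["cluster-ip"]) // 2: metric is now wrong
}
```

With the PR's change, the repair loop's allocator keeps the default no-op recorder, so its writes never reach the gauge.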
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: