Watch cache: use resource.group for object type in log messages and metrics #111807

ncdc · 2022-08-11T20:00:49Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Update logging and metrics references to "object type" to be by "group resource" instead, so the logs/metrics can disambiguate custom resource types, instead of grouping them all together as *unstructured.Unstructured.

Which issue(s) this PR fixes:

Fixes #111605

Special notes for your reviewer:

I have split this into 1 commit to update logging, and a separate commit to update metrics. Happy to split out and do logging first if desired, as per the discussion starting at #111605 (comment).

This does change the log/metrics keys for built-in types to also be by group resource. If that is not desirable, I can fix it so this change only applies to CRs.

Does this PR introduce a user-facing change?

Log messages and metrics for the watch cache are now keyed by `<resource>.<group>` instead of by Go struct type. For example, `*v1.Pod` becomes `pods`. Additionally, resources that come from CustomResourceDefinitions are now displayed with the correct resource and group, instead of `*unstructured.Unstructured`.
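To illustrate the `<resource>.<group>` key format, here is a minimal sketch. The `GroupResource` type below is a local stand-in written for this example; it is assumed to mirror how `schema.GroupResource` from k8s.io/apimachinery renders as a string, and the `widgets.example.com` resource is hypothetical.

```go
package main

import "fmt"

// GroupResource is a local stand-in for schema.GroupResource, redefined here
// only so the sketch compiles without external dependencies.
type GroupResource struct {
	Group    string
	Resource string
}

// String renders "<resource>.<group>", or just "<resource>" when the group is
// empty, as it is for the core API group.
func (gr GroupResource) String() string {
	if gr.Group == "" {
		return gr.Resource
	}
	return gr.Resource + "." + gr.Group
}

func main() {
	// Core group: no ".<group>" suffix.
	fmt.Println(GroupResource{Resource: "pods"}) // pods
	// Built-in type in a named group.
	fmt.Println(GroupResource{Group: "apps", Resource: "deployments"}) // deployments.apps
	// Hypothetical custom resource: each CRD gets its own distinct key.
	fmt.Println(GroupResource{Group: "example.com", Resource: "widgets"}) // widgets.example.com
}
```

Because the core group is empty, built-in core types such as pods end up with a bare resource name, which matches the `*v1.Pod` to `pods` example in the release note.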

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2022-08-11T20:00:52Z

Please note that we're already in Test Freeze for the release-1.25 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.25.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Thu Aug 11 19:31:32 UTC 2022.

ncdc · 2022-08-11T20:01:30Z

/sig api-machinery

MikeSpreitzer · 2022-08-12T21:19:37Z

/cc

MikeSpreitzer · 2022-08-15T05:41:43Z

Looks good to me, and can be carried even a little further, as noted inline.

MikeSpreitzer · 2022-08-15T05:54:43Z

staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go

+		c.versioner,
+		deadline,
+		pred.AllowWatchBookmarks,
+		c.objectType,


The objectType field no longer needs to be in the cacheWatcher struct, and thus also no longer needs to be passed to newCacheWatcher.

Updated, PTAL

MikeSpreitzer · 2022-08-15T05:55:13Z

staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go

@@ -1177,7 +1195,8 @@ type cacheWatcher struct {
 	deadline            time.Time
 	allowWatchBookmarks bool
 	// Object type of the cache watcher interests
-	objectType reflect.Type
+	objectType    reflect.Type


This field is no longer used for anything and can be removed.

Updated, PTAL

fedebongio · 2022-08-16T20:04:49Z

/triage accepted

ncdc · 2022-08-16T20:20:03Z

As noted in the description:

> This does change the log/metrics keys for built-in types to also be by group resource. If that is not desirable, I can fix it so this change only applies to CRs.

This is a "breaking" change in that it changes the metrics labels (e.g. *v1.Pod becomes pods). Is that ok, or should I only make this change for *unstructured.Unstructured?

MikeSpreitzer · 2022-08-16T21:17:19Z

FYI, kubernetes/website#30691

MikeSpreitzer · 2022-08-16T21:20:43Z

@ncdc: there is a concept of metric maturity, documented at https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#metric-lifecycle. The question is: does this PR change any metric that is declared to be stable?

ncdc · 2022-08-16T22:44:32Z

It looks like they are all alpha

MikeSpreitzer · 2022-08-17T02:08:09Z

Here are the stable metrics I found in a scrape of a running server built from the current master:

mspreitz@ubu22:~/go2/src/k8s.io/kubernetes$ kubectl get --raw /metrics | grep '^# HELP apiserver_.*\[STABLE\]'
# HELP apiserver_admission_controller_admission_duration_seconds [STABLE] Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
# HELP apiserver_admission_step_admission_duration_seconds [STABLE] Admission sub-step latency histogram in seconds, broken out for each operation and API resource and step type (validate or admit).
# HELP apiserver_current_inflight_requests [STABLE] Maximal number of currently used inflight request limit of this apiserver per request kind in last second.
# HELP apiserver_longrunning_requests [STABLE] Gauge of all active long-running apiserver requests broken out by verb, group, version, resource, scope and component. Not all requests are tracked this way.
# HELP apiserver_request_duration_seconds [STABLE] Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
# HELP apiserver_request_total [STABLE] Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
# HELP apiserver_requested_deprecated_apis [STABLE] Gauge of deprecated APIs that have been requested, broken out by API group, version, resource, subresource, and removed_release.
# HELP apiserver_response_sizes [STABLE] Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind.
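The same check can be done programmatically. The following is a rough sketch, not part of this PR: `stableMetricNames` is a hypothetical helper that scans scrape text for `# HELP` lines carrying the `[STABLE]` marker, mirroring the `grep` pattern above. The `example_alpha_metric` line in the sample input is a placeholder, not a real metric.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// stableMetricNames returns the names of metrics whose "# HELP" line carries
// the [STABLE] marker and whose name starts with prefix. It assumes the
// marker is the first word of the help text, as in the scrape shown above.
func stableMetricNames(scrape, prefix string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(scrape))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected shape: "#", "HELP", <name>, "[STABLE]", <description...>
		if len(fields) < 4 || fields[0] != "#" || fields[1] != "HELP" {
			continue
		}
		if fields[3] == "[STABLE]" && strings.HasPrefix(fields[2], prefix) {
			names = append(names, fields[2])
		}
	}
	return names
}

func main() {
	sample := `# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind.
# TYPE apiserver_storage_objects gauge
# HELP example_alpha_metric [ALPHA] Placeholder line, not a real metric.
`
	for _, name := range stableMetricNames(sample, "apiserver_") {
		fmt.Println(name)
	}
}
```

A metric touched by this PR would be a concern only if its name showed up in such a list.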

All CustomResources are treated as *unstructured.Unstructured, leading the watch cache to log anything related to CRs as Unstructured. This change uses the schema.GroupResource instead of object type for all type-related log messages in the watch cache, resulting in distinct output for each CR type.

Signed-off-by: Andy Goldstein <andy.goldstein@redhat.com>

Use the group resource instead of objectType in watch cache metrics, because all CustomResources are grouped together as *unstructured.Unstructured, instead of 1 entry per type.

Signed-off-by: Andy Goldstein <andy.goldstein@redhat.com>

ncdc · 2022-08-17T15:02:42Z

/retest

MikeSpreitzer

/lgtm

MikeSpreitzer · 2022-08-17T20:48:04Z

/assign @deads2k

MikeSpreitzer · 2022-08-24T16:02:30Z

/test pull-kubernetes-dependencies

deads2k · 2022-08-25T13:44:35Z

> This is a "breaking" change in that it changes the metrics labels (e.g. *v1.Pod becomes pods). Is that ok, or should I only make this change for *unstructured.Unstructured?

Best answered by
/sig instrumentation
possibly @logicalhan for apimachinery and instrumentation knowledge.

The change looks fine to me from a code perspective. I think we do make some promises about stable metrics, but I don't know if it extends to labels.

/approve
/hold

holding for instrumentation sign off from @logicalhan

k8s-ci-robot · 2022-08-25T13:45:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, MikeSpreitzer, ncdc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/apiserver/OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

logicalhan

> This is a "breaking" change in that it changes the metrics labels (e.g. *v1.Pod becomes pods). Is that ok, or should I only make this change for *unstructured.Unstructured?

We don't actually have any guarantees around label values (despite the fact that changing them may break alerts); we even permit changes to buckets on stable metrics. I am generally reluctant to change label values anyway, since doing so can break alerts. In this case, using group resource is actually more consistent with other metrics, so this change generally looks good to me.

/lgtm

logicalhan · 2022-08-25T14:05:59Z

/hold cancel

MikeSpreitzer · 2022-08-25T15:17:19Z

Thanks @deads2k and @logicalhan. BTW, #111807 (comment) says that the metrics affected by this PR are not stable.

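Fix the problem that the resource label of the watch_cache_capacity metric is always Unstructured for CustomResources. Upstream community [pr111807](kubernetes#111807. For older versions such as 1.20, apply a simple, blunt fix. Note also that pr111807 is a breaking change: starting from 1.26, the resource names of built-in resources change, e.g. pod changes from *v1.Pod to pods. We suggest that in versions before 1.26 we only fix the problem that CRD resources are all reported as Unstructured, and leave the other built-in objects unchanged.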

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Aug 11, 2022

ncdc mentioned this pull request Aug 11, 2022

Watch cache groups all CRDs as *unstructured.Unstructured in metrics + logging #111605

Closed

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 11, 2022

k8s-ci-robot requested review from derekwaynecarr and enj August 11, 2022 20:04

k8s-ci-robot added the area/apiserver label Aug 11, 2022

k8s-ci-robot requested a review from MikeSpreitzer August 12, 2022 21:19

MikeSpreitzer reviewed Aug 15, 2022

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 16, 2022

ncdc force-pushed the watch-cache-unstructured-details branch from 406d366 to d2c6aa8 Compare August 16, 2022 22:49

ncdc added 2 commits August 17, 2022 09:33

watch cache: metrics: objectType -> group resource

d08b69e

ncdc force-pushed the watch-cache-unstructured-details branch 2 times, most recently from d2c6aa8 to eb03850 Compare August 17, 2022 13:33

MikeSpreitzer approved these changes Aug 17, 2022

k8s-ci-robot assigned MikeSpreitzer Aug 17, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 17, 2022

k8s-ci-robot assigned deads2k Aug 17, 2022

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Aug 25, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2022

logicalhan reviewed Aug 25, 2022

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 25, 2022

k8s-ci-robot assigned logicalhan Aug 25, 2022

k8s-ci-robot merged commit 2b4e850 into kubernetes:master Aug 25, 2022

k8s-ci-robot added this to the v1.26 milestone Aug 25, 2022
