Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watch cache: use resource.group for object type in log messages and metrics #111807

Merged
merged 2 commits into from Aug 25, 2022

Conversation

ncdc
Copy link
Member

@ncdc ncdc commented Aug 11, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

Update logging and metrics references to "object type" to be by "group resource" instead, so the logs/metrics can disambiguate custom resource types, instead of grouping them all together as *unstructured.Unstructured.

Which issue(s) this PR fixes:

Fixes #111605

Special notes for your reviewer:

I have split this into 1 commit to update logging, and a separate commit to update metrics. Happy to split out and do logging first if desired, as per the discussion starting at #111605 (comment).

This does change the log/metrics keys for built-in types to also be by group resource. If that is not desirable, I can fix it so this change only applies to CRs.

Does this PR introduce a user-facing change?

Log messages and metrics for the watch cache are now keyed by `<resource>.<group>` instead of `go` struct type. This means e.g. that `*v1.Pod` becomes `pods`. Additionally, resources that come from CustomResourceDefinitions are now displayed as the correct resource and group, instead of `*unstructured.Unstructured`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Aug 11, 2022
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.25 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.25.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Thu Aug 11 19:31:32 UTC 2022.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 11, 2022
@ncdc
Copy link
Member Author

ncdc commented Aug 11, 2022

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 11, 2022
@MikeSpreitzer
Copy link
Member

/cc

@MikeSpreitzer
Copy link
Member

@MikeSpreitzer

Copy link
Member

@MikeSpreitzer MikeSpreitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, and can be carried even a little further, as noted inline.

c.versioner,
deadline,
pred.AllowWatchBookmarks,
c.objectType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The objectType field no longer needs to be in the cacheWatcher struct, and thus also no longer needs to be passed to newCacheWatcher.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, PTAL

@@ -1177,7 +1195,8 @@ type cacheWatcher struct {
deadline time.Time
allowWatchBookmarks bool
// Object type of the cache watcher interests
objectType reflect.Type
objectType reflect.Type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field is no longer used for anything and can be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, PTAL

@fedebongio
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 16, 2022
@ncdc
Copy link
Member Author

ncdc commented Aug 16, 2022

As noted in the description:

This does change the log/metrics keys for built-in types to also be by group resource. If that is not desirable, I can fix it so this change only applies to CRs.

This is a "breaking" change in that it changes the metrics labels (e.g. *v1.Pod becomes pods). Is that ok, or should I only make this change for *unstructured.Unstructured?

@MikeSpreitzer
Copy link
Member

FYI, kubernetes/website#30691

@MikeSpreitzer
Copy link
Member

@ncdc: there is a concept of metric maturity. Documented at https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#metric-lifecycle . The question is, does this PR change any metric that is declared to be stable?

@ncdc
Copy link
Member Author

ncdc commented Aug 16, 2022

It looks like they are all alpha

@ncdc ncdc force-pushed the watch-cache-unstructured-details branch from 406d366 to d2c6aa8 Compare August 16, 2022 22:49
@MikeSpreitzer
Copy link
Member

Here are the stable metrics I found in a scrape of the a running server built from the current master:

mspreitz@ubu22:~/go2/src/k8s.io/kubernetes$ kubectl get --raw /metrics | grep '^# HELP apiserver_.*\[STABLE\]'
# HELP apiserver_admission_controller_admission_duration_seconds [STABLE] Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
# HELP apiserver_admission_step_admission_duration_seconds [STABLE] Admission sub-step latency histogram in seconds, broken out for each operation and API resource and step type (validate or admit).
# HELP apiserver_current_inflight_requests [STABLE] Maximal number of currently used inflight request limit of this apiserver per request kind in last second.
# HELP apiserver_longrunning_requests [STABLE] Gauge of all active long-running apiserver requests broken out by verb, group, version, resource, scope and component. Not all requests are tracked this way.
# HELP apiserver_request_duration_seconds [STABLE] Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
# HELP apiserver_request_total [STABLE] Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
# HELP apiserver_requested_deprecated_apis [STABLE] Gauge of deprecated APIs that have been requested, broken out by API group, version, resource, subresource, and removed_release.
# HELP apiserver_response_sizes [STABLE] Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind.

All CustomResources are treated as *unstructured.Unstructured, leading
the watch cache to log anything related to CRs as Unstructured. This
change uses the schema.GroupResource instead of object type for all type
related log messages in the watch cache, resulting in distinct output
for each CR type.

Signed-off-by: Andy Goldstein <andy.goldstein@redhat.com>
Use the group resource instead of objectType in watch cache metrics,
because all CustomResources are grouped together as
*unstructured.Unstructured, instead of 1 entry per type.

Signed-off-by: Andy Goldstein <andy.goldstein@redhat.com>
@ncdc ncdc force-pushed the watch-cache-unstructured-details branch 2 times, most recently from d2c6aa8 to eb03850 Compare August 17, 2022 13:33
@ncdc
Copy link
Member Author

ncdc commented Aug 17, 2022

/retest

Copy link
Member

@MikeSpreitzer MikeSpreitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 17, 2022
@MikeSpreitzer
Copy link
Member

/assign @deads2k

@MikeSpreitzer
Copy link
Member

/test pull-kubernetes-dependencies

@deads2k
Copy link
Contributor

deads2k commented Aug 25, 2022

This is a "breaking" change in that it changes the metrics labels (e.g. *v1.Pod becomes pods). Is that ok, or should I only make this change for *unstructured.Unstructured?

Best answered by
/sig instrumentation
possibly @logicalhan for apimachinery and instrumentation knowledge.

The change looks fine to me from a code perspective. I think we do make some promises about stable metrics, but I don't know if it extends to labels.

/approve
/hold

holding for instrumentation sign off from @logicalhan

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Aug 25, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, MikeSpreitzer, ncdc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2022
Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "breaking" change in that it changes the metrics labels (e.g. *v1.Pod becomes pods). Is that ok, or should I only make this change for *unstructured.Unstructured?

We don't actually have any guarantees around label values (despite the fact that it may break alerts), we permit even changes to buckets on stable metrics. I am generally reticent about changes to label values anyway though since it can break alerts though. In this case, using group resource is actually more consistent with other metrics so this change generally looks good to me.

/lgtm

@logicalhan
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 25, 2022
@MikeSpreitzer
Copy link
Member

Thanks @deads2k and @logicalhan . BTW, #111807 (comment) says that the metrics affected by this PR are not stable.

@k8s-ci-robot k8s-ci-robot merged commit 2b4e850 into kubernetes:master Aug 25, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Aug 25, 2022
chenchun pushed a commit to chenchun/kubernetes that referenced this pull request Mar 20, 2024
修复CustomerResource的watch_cache_capacity指标resource都是Unstructured的问题
修复CustomerResource的watch_cache_capacity指标resource都是Unstructured的问题

社区[pr111807](kubernetes#111807 1.20等低版本简单暴力修复下
另外pr111807是breaking change,从1.26起,内置资源的resource名称会发上变化,比如pod从*v1.Pod变为pods,建议我们在1.26之前的版本,只修复crd资源resource都是Unstructured的问题,其他内置对象,保持不变
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Watch cache groups all CRDs as *unstructured.Unstructured in metrics + logging
6 participants