Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timing ratio histograms #110104

Merged

Conversation

MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented May 18, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR adds timing ratio histograms, as an alternative to sample-and-watermark histograms,
and switches API Priority and Fairness metrics to use the timing ratio histograms.

Which issue(s) this PR fixes:

This partially addresses #108272

Special notes for your reviewer:

This is part of the campaign to replace sample-and-watermark histograms.
I am breaking #109742 into pieces for easier reviewing.
This is most of the middle piece; PRs #110515 and #110516, both now merged, are the rest of the middle piece.
There are more PRs to come; #110164 is next in line, and there will also be PRs that replace repeated indexing with element caching (as seen in #109742 and #110154) and do other cleanup.

Does this PR introduce a user-facing change?

Some apiserver metrics were changed, as follows.
- `priority_level_seat_count_samples` is replaced with `priority_level_seat_utilization`, which samples every nanosecond rather than every millisecond; the old metric conveyed utilization despite its name.
- `priority_level_seat_count_watermarks` is removed.
- `priority_level_request_count_samples` is replaced with `priority_level_request_utilization`, which samples every nanosecond rather than every millisecond; the old metric conveyed utilization despite its name.
- `priority_level_request_count_watermarks` is removed.
- `read_vs_write_request_count_samples` is replaced with `read_vs_write_current_requests`, which samples every nanosecond rather than every second; the new metric, like the old one, measures utilization when the max-in-flight filter is used and number of requests when the API Priority and Fairness filter is used.
- `read_vs_write_request_count_watermarks` is removed.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig api-machinery
/cc @wojtek-t
/cc @tkashem
@deads2k
@lavalamp

@k8s-ci-robot k8s-ci-robot requested a review from tkashem May 18, 2022 07:04
@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label May 18, 2022
@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver area/test sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 18, 2022
@MikeSpreitzer
Copy link
Member Author

Here are the current benchmarking results.

(base) mspreitz@mjs12 metrics % go test  -benchmem -run=^$ -bench Histogram .
goos: darwin
goarch: amd64
pkg: k8s.io/apiserver/pkg/util/flowcontrol/metrics
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkSampleAndWaterMarkHistogramsVecEltSimple-16       	 1000000	      1230 ns/op	       0 B/op	       0 allocs/op
BenchmarkSampleAndWaterMarkHistogramsVecEltSafeEarly-16    	  987891	      1230 ns/op	       0 B/op	       0 allocs/op
BenchmarkSampleAndWaterMarkHistogramsVecEltSafeLate-16     	  982856	      1200 ns/op	       0 B/op	       0 allocs/op
BenchmarkTimingRatioHistogram-16                           	13167859	        90.70 ns/op	       0 B/op	       0 allocs/op
BenchmarkTimingRatioHistogramVecEltSimple-16               	13016456	        89.85 ns/op	       0 B/op	       0 allocs/op
BenchmarkTimingRatioHistogramVecEltSafeEarly-16            	 7515996	       160.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkTimingRatioHistogramVecEltSafeLate-16             	13404110	        89.83 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	k8s.io/apiserver/pkg/util/flowcontrol/metrics	9.665s

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 18, 2022
@MikeSpreitzer MikeSpreitzer force-pushed the add-timing-ratio-histograms branch 2 times, most recently from 63cb7f1 to b67ae76 Compare May 18, 2022 20:04
@MikeSpreitzer
Copy link
Member Author

The next rev should

  • fix the speling mistake in timing_ratio_histogram.go in the commit that introduces that file
  • include benchmarking results in the git commit message

@MikeSpreitzer
Copy link
Member Author

/retest

@logicalhan
Copy link
Member

/triage accepted
/assign @dgrisonnet @logicalhan

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 19, 2022
@leilajal
Copy link
Contributor

/triage accepted

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have two small nits - other than that LGTM.

But I would like to see an LGTM from @dgrisonnet on the actual metrics changes.
@dgrisonnet - can you please take a look?

}
underMember.Set(initialNumerator / initialDenominator)
return &timingRatioHistogramInner{
getGaugeOfRatio: func() Gauge { return underMember },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation (there seem to be two tabs instead of one)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github makes it hard (but not impossible) to select the initial tab on each line. Line 180 starts with one tab, line 181 starts with three. This is what gofmt produces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's even inconsistent with line 205-209...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I just made the other change you suggested, and that made gofmt put one less tab on these lines.

numerator: initialNumerator,
denominator: initialDenominator,
},
nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move to the previous line:

...
}, nil

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a force-push that only made this change and re-run gofmt.

@MikeSpreitzer MikeSpreitzer force-pushed the add-timing-ratio-histograms branch 2 times, most recently from 57aaa6d to a595f43 Compare July 13, 2022 13:57
@wojtek-t
Copy link
Member

LGTM, but I want an LGTM from sig-instrumention folks before approving.

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
(from the sig-instrumentation side of things)

I'm okay with the metrics changes here, since they are alpha metrics.

@@ -0,0 +1,223 @@
/*
Copyright 2019 The Kubernetes Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong year.

// That is the registerable aspect of the underlying TimingHistogram.
compbasemetrics.Registerable

// timingRatioHistogramInner implements the RatioedObsserver aspect.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/RatioedObsserver/RatioedObserver/

var errMetricNotFound = errors.New("not found")

func TestTimingRatioHistogramVecElementSimple(t *testing.T) {
samplesHistName := "tro_test"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can basically be a constant, the way it is used in the tests.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2022
@MikeSpreitzer
Copy link
Member Author

The latest force-push makes the changes suggested by @logicalhan .

Following is the result of benchmarking the two kinds of histograms on
my laptop.

% go test  -benchmem -run=^$ -bench Histogram .
goos: darwin
goarch: amd64
pkg: k8s.io/apiserver/pkg/util/flowcontrol/metrics
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkSampleAndWaterMarkHistogramsVecEltSafeEarly-16    	  980143	      1230 ns/op	       0 B/op	       0 allocs/op
BenchmarkSampleAndWaterMarkHistogramsVecEltSafeLate-16     	  932380	      1216 ns/op	       0 B/op	       0 allocs/op
BenchmarkTimingRatioHistogram-16                           	12665247	        94.13 ns/op	       0 B/op	       0 allocs/op
BenchmarkTimingRatioHistogramVecElementSimple-16           	11015806	       100.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkTimingRatioHistogramVecElementSafeEarly-16        	 7142589	       172.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkTimingRatioHistogramVecElementSafeLate-16         	11487517	        96.67 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	k8s.io/apiserver/pkg/util/flowcontrol/metrics	7.718s
@MikeSpreitzer
Copy link
Member Author

And the latest force-push changes the timing ratio histogram test funcs to each use a distinct name for the test histogram, for additional log clarity and possibly correctness in parallel testing.

@wojtek-t
Copy link
Member

Thanks @logicalhan - lgtming with your LGTM from the sig-instrumentation POV.

/lgtm
/approve

/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeSpreitzer, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 13, 2022
@MikeSpreitzer
Copy link
Member Author

/retest

@k8s-ci-robot
Copy link
Contributor

@MikeSpreitzer: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-unit 0c0b7ca link unknown /test pull-kubernetes-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot
Copy link

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants