New histogram: Pod start SLI duration #111930

azylinski
Contributor

What type of PR is this?

/kind feature
/sig node
/area kubelet

What this PR does / why we need it:

A new histogram will cover the gap for Pod Startup SLI/SLO measurement, ref:
https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md

This could replace the slo-monitor component.

Does this PR introduce a user-facing change?

Add the metric pod_start_sli_duration_seconds to kubelet

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. area/kubelet labels Aug 19, 2022
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.25 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.25.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Aug 19 07:42:40 UTC 2022.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework labels Aug 19, 2022
@k8s-ci-robot k8s-ci-robot added area/test sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Aug 19, 2022
@azylinski
Contributor Author

/retest

@azylinski azylinski force-pushed the new-histogram-pod_start_sli_duration_seconds branch from a654b2e to 7bba13f Compare August 19, 2022 12:47
pkg/kubelet/config/config.go (outdated)
@@ -235,6 +240,10 @@ func (s *podStorage) merge(source string, change interface{}) (adds, updates, de
	ref.Annotations = make(map[string]string)
}
ref.Annotations[kubetypes.ConfigSourceAnnotationKey] = source
// ignore static pods
if ref.ResourceVersion != "" {
	s.podStartupLatencyTracker.ObservedPodOnWatch(ref, time.Now())
Member

Regarding the 'when' value: can we set it to a time.Now() captured when the object is received from the watch (as early as possible), to reduce the impact of any processing delay?

Or are you suggesting that this code is always happening directly after we receive the watch event?

Contributor Author

I see your point and I agree: the time could (should?) be captured as soon as we observe the Pod update, most likely here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/config/apiserver.go#L64

On the other hand, kubetypes.PodUpdate is a base struct (used in 54 places across 14 files), so adding an eventTimestamp is not a small change. Why don't we wait for some input from the sig-node people?

I'm happy to make that change if there is a realistic chance of a meaningful time difference between the two options.

pkg/kubelet/util/pod_startup_latency_tracker.go (outdated)
pkg/kubelet/util/pod_startup_latency_tracker.go (outdated)
pkg/kubelet/util/pod_startup_latency_tracker_test.go (outdated)
test/e2e_node/density_test.go (outdated)
@azylinski azylinski force-pushed the new-histogram-pod_start_sli_duration_seconds branch 2 times, most recently from ea008c6 to 3a1a000 Compare August 22, 2022 15:02
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 22, 2022
@azylinski azylinski force-pushed the new-histogram-pod_start_sli_duration_seconds branch from 3a1a000 to c4a93c2 Compare August 23, 2022 08:32
@mborsz
Member

mborsz commented Aug 23, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 23, 2022
@mborsz
Member

mborsz commented Aug 23, 2022

How hard would it be to export a separate graph showing "watch latency" for pods only? It's largely unrelated to this change, but it would be extremely useful.

@azylinski
Contributor Author

/retest

@dashpole
Contributor

/assign @logicalhan
/triage accepted

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 20, 2022
@azylinski azylinski force-pushed the new-histogram-pod_start_sli_duration_seconds branch from 49a0902 to 2b45c9f Compare October 20, 2022 08:27
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 20, 2022
@bobbypage
Member

/retest

@bobbypage
Member

A few small comments, but overall LGTM. It looks like it needs a rebase, though, since Prow is failing due to a merge conflict.

@azylinski azylinski force-pushed the new-histogram-pod_start_sli_duration_seconds branch from 2b45c9f to 89c8822 Compare October 21, 2022 09:02
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2022
@azylinski azylinski force-pushed the new-histogram-pod_start_sli_duration_seconds branch from 89c8822 to 492f5fa Compare October 26, 2022 09:32
@bobbypage
Member

Thanks for the updates!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 3, 2022
@azylinski
Contributor Author

Thanks David! The PR now has an LGTM from everyone involved :)
@smarterclayton, may I ask for a final approval?

@smarterclayton
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: azylinski, dashpole, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2022
@k8s-ci-robot k8s-ci-robot merged commit 1bf4af4 into kubernetes:master Nov 4, 2022
SIG Node CI/Test Board automation moved this from Archive-it to Done Nov 4, 2022
SIG Node PR Triage automation moved this from Needs Approver to Done Nov 4, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 4, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
7 participants