Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubelet: more resilient node allocatable ephemeral-storage data getter #101882

Merged

Conversation

jackfrancis
Copy link
Contributor

@jackfrancis jackfrancis commented May 10, 2021

What type of PR is this?

What this PR does / why we need it:

This PR modifies the GetCapacity containerManager method that is used to derive allocatable ephemeral storage node data. This change adds a "just-in-time" cAdvisor call to gather rootfs storage data in case the containerManager runtime instance hasn't yet stored that data.

This change deals with a race that is documented in code here:

https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/cm/container_manager_linux.go#L245

In summary: (1) the containerManager is instantiated, and then (2) Starts; but, node info may have been gathered between steps 1 and 2 in some instances, which leads to outcomes in which the node data is missing ephemeral storage allocatable data. Additionally, that "incomplete" node representation is subject to being submitted to the apiserver during node registration, leading to further side-effects during cluster bootstrapping where the authoritative node data declares that it has zero allocatable ephemeral storage.

Which issue(s) this PR fixes:

Fixes #99305
Fixes #102715

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubelet: wait for node allocatable ephemeral-storage data

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 10, 2021
@k8s-ci-robot k8s-ci-robot requested review from dims and vishh May 10, 2021 21:47
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 10, 2021
@jackfrancis
Copy link
Contributor Author

/test pull-kubernetes-e2e-capz-ha-control-plane

@jackfrancis jackfrancis marked this pull request as draft May 11, 2021 00:03
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 11, 2021
@jackfrancis
Copy link
Contributor Author

/test pull-kubernetes-e2e-capz-ha-control-plane

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 11, 2021
@jackfrancis
Copy link
Contributor Author

/test pull-kubernetes-e2e-capz-ha-control-plane

1 similar comment
@jackfrancis
Copy link
Contributor Author

/test pull-kubernetes-e2e-capz-ha-control-plane

@jackfrancis jackfrancis changed the title kubelet: ephemeral-storage capacity kubelet: wait for node allocatable ephemeral-storage data May 11, 2021
@jackfrancis jackfrancis force-pushed the kubelet-initialnode-getcapacity branch from 45d8bb5 to 6dff31a Compare May 11, 2021 16:15
@jackfrancis
Copy link
Contributor Author

/test pull-kubernetes-e2e-capz-ha-control-plane

@@ -1060,7 +1060,32 @@ func isKernelPid(pid int) bool {
return err != nil && os.IsNotExist(err)
}

// GetCapacity returns node capacity data for "cpu", "memory", "ephemeral-storage", and "huge-pages*"
// At present this method is only invoked when introspecting ephemeral storage
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this comment to emphasize the practical need to work harder to deliver ephemeral-storage data: This GetCapacity method is only called to get ephemeral storage data. See here where we refer to it when defining the set of MachineInfo setter functions:

https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/kubelet_node_status.go#L617

Those set of funcs are then assigned to the main kubelet instance:

https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/kubelet.go#L848

And then those status funcs are invoked every time kubelet sets the node status for a node:

https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/kubelet_node_status.go#L583

What's interesting here is that when we invoke MachineInfo with GetCapacity() as the capacityFunc, we see that we only use capacityFunc to derive ephemeral storage data, and nothing else:

https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/nodestatus/setters.go#L332

Thus, we should improve the resiliency/reliability of GetCapacity() to actually return real ephemeral storage allocatable data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so where cpu and memory is retrieved from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPU and memory are stored in the containerManager instance's capacity property when containerManager is instantiated, right here:

https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/cm/container_manager_linux.go#L246

However, storage is not stored at that time, which is why the race condition occurs, and why we may need to call it "just in time" during node info retrieval/node registration. (See this comment: https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/cm/container_manager_linux.go#L245)

CPU/memory info is gathered using another function, see here:

https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/nodestatus/setters.go#L296

there are 2 tl;dr points:

  1. storage and CPU/memory are retrieved discretely, using individual code paths
  2. CPU/memory is gathered immediately when the kubelet runtime starts, and has its own error handling (see code); storage is gathered a little later than CPU/memory, and thus node info events may occur before storage has been gathered, leading to node registrations with "0" allocatable ephemeral-storage.

This PR accomplishes the following:

If we detect that allocatable ephemeral-storage has not yet been gathered during a node info/node registration operation, we attempt to eagerly gather that storage data "just in time" in order to prevent registering a node with "0" allocatable ephemeral-storage.

Hope that makes sense!

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 11, 2021
pkg/kubelet/cm/container_manager_linux.go Outdated Show resolved Hide resolved
pkg/kubelet/cm/container_manager_linux.go Outdated Show resolved Hide resolved
@jackfrancis
Copy link
Contributor Author

@Random-Liu @sjenning can you kindly take a look at this PR for final approval/merge? thanks!

@jackfrancis
Copy link
Contributor Author

Another issue related to what this PR does:

#108700

@pacoxu
Copy link
Member

pacoxu commented Mar 28, 2022

/check-cla
ping @mrunalp @tallclair for approval

@pacoxu
Copy link
Member

pacoxu commented Mar 28, 2022

/easycla

@jackfrancis
Copy link
Contributor Author

gentle ping @mrunalp @tallclair

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, jsafrane, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 29, 2022
@jackfrancis
Copy link
Contributor Author

looks like I need to rebase and adapt some UT stuff to deal w/ changes since my last commit, stand by @mrunalp @pacoxu

@jackfrancis jackfrancis force-pushed the kubelet-initialnode-getcapacity branch from 5374e97 to 1dc066f Compare March 30, 2022 00:18
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2022
@jackfrancis jackfrancis force-pushed the kubelet-initialnode-getcapacity branch from 1dc066f to b2d8cd1 Compare March 30, 2022 00:34
@pacoxu
Copy link
Member

pacoxu commented Mar 30, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2022
@jackfrancis
Copy link
Contributor Author

Looks like I encountered #107414

/retest

@jackfrancis jackfrancis force-pushed the kubelet-initialnode-getcapacity branch from b2d8cd1 to ab14cba Compare March 30, 2022 01:14
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2022
@pacoxu
Copy link
Member

pacoxu commented Mar 30, 2022

/lgtm
There seems to be no further change from the last commit.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2022
@jackfrancis
Copy link
Contributor Author

/lgtm
There seems to be no further change from the last commit.

Yes, just had to clean up gofmt stuff after updating UT code to use replacement mock libraries introduced in the last few months. I'll try to retest my way through various flakes here. Thanks for your responsiveness!

@jackfrancis
Copy link
Contributor Author

/retest

@jackfrancis
Copy link
Contributor Author

thx again @mrunalp for the approval merry-go-round 🙏

@pacoxu
Copy link
Member

pacoxu commented Mar 30, 2022

This was approved before the code freeze, but not merged for CI failures. 😂
@ehashman should this be included in 1.24?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects