Add pod disruption conditions for kubelet-initiated failures #112360
Conversation
Skipping CI for Draft Pull Request.
@mimowo: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from c870a7a to 1658d03.
Force-pushed from 1658d03 to e994b27.
/remove-sig api-machinery
Force-pushed from e994b27 to f3310f9.
/ok-to-test
Force-pushed from b8b2ba4 to 1e2b674.
Force-pushed from eb3b989 to 07365e4.
Force-pushed from 2158b29 to e1c146f.
/retest
@mimowo: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
… skip condition update instead.
Force-pushed from e1c146f to 4e732e2.
Update to pkg/controller LGTM
// there might still be running containers to avoid sending an unnecessary PATCH request if the
// actual transition is delayed (see below)
if transitioningToTerminalPhase && !couldHaveRunningContainers {
	// update the LastTransitionTime
nit: consider noting that the reason we need to update LastTransitionTime for this condition again here is that the older transition time, set in updateStatusInternal, is likely stale, because the status update was not sent until all the containers had terminated.
Added a comment in the follow up PR: #113718
for _, c := range newPodStatus.Conditions {
	if kubetypes.PodConditionByKubelet(c.Type) {
		podConditions = append(podConditions, c)
	} else if kubetypes.PodConditionSharedByKubelet(c.Type) {
Thanks for updating the logic like this -- this is much cleaner than the previous approach of "reverting" the condition.
The logic now is: we replace/append all the "shared kubelet" conditions, and guard the AlphaNoCompatGuaranteeDisruptionTarget
condition with transitioningToTerminalPhase && !couldHaveRunningContainers
to ensure it will only be sent once all containers have terminated.
Thanks for pointing this out in detail.
nit: Can we include the logic above as a comment, for future maintenance?
Added a comment in the follow up PR: #113718
Thanks for all the updates and for refactoring the status updates in the status manager; the updated logic is much clearer. LGTM from me for SIG Node related areas.
/lgtm
/lgtm
You can send the follow-up PR for the nits.
/approve for API constant and doc change
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, liggitt, mimowo
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for your reviewer:
We limit this PR to cover a subset of pod failure scenarios described in the KEP. In particular, the following scenarios are excluded:
- the ResourceExhausted pod condition for the OOM killer and exceeding of local storage limits (#113436). This is due to the following complications: adding the ResourceExhausted pod condition based on the OOM killer might be unreliable in some environments, because the reason field for the OOM-killed container is set as Error rather than the expected OOMKilled.
- AdmissionFailures; these can be considered in the future.

The KEP will be updated to reflect the decisions made during the implementation phase for Beta: kubernetes/enhancements#3646
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: