Fix pod stuck in termination state when mount fails or gets skipped after kubelet restart #110670
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gnufied. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @smarterclayton @bobbypage
Force-pushed from 286a06e to f291291
}

if uncertainVolumeCount > 0 {
	// If the volume has device to mount, we mark its device as mounted.
"mark its device as uncertain"
fixed
Force-pushed from c1623ee to 3801ad1
How about during reconstruction, we put all reconstructed volumes into the actual state with uncertain mode, even if the desired state has this information (we might need to sync between the desired state and the actual state, which is already there)?
@jingxu97 we could, and that is what PR #108180 by @jsafrane does. But adding all volumes to ASW in an uncertain state during reconstruction brings new challenges: if we add a volume to ASW before it gets added to DSW, we could accidentally unmount the volume. So that PR has additional checks to safeguard against similar failure modes - https://github.com/kubernetes/kubernetes/pull/108180/files#diff-2acbd2d6c088f99f9c7e9042b3449418b4824675d2692350c332d0f276489268R40 (@jsafrane correct me if I am wrong). But that alone will not fix the issue, because we still need to avoid marking volumes that we added to ASW via reconstruction as "unmounted" (from the previous "uncertain" state, if the mount fails) - f291291#diff-3e8f8a6e5eca57db8e84bcc6c896f0a3b10289fe55d8e999b4bdf1cfdc0de394R787 So we have this middle-ground PR of mine, which avoids those problems and still fixes the issues we set out to fix.
/retest
As Hemant pointed out, I will need all reconstructed volumes in ASW for my SELinux work. But I can have it behind a feature gate, and we don't need to worry about backporting it as a bugfix to older releases. (And yes, I need to update it with the latest Hemant's fixes from this PR.) This PR can be somewhere in the middle and could be backported to older releases.
Yeah, eventually once Jan's PR gets merged and enabled by default, we could delete the separate
Add unit tests for removing reconstructed volumes from ASOW
Also, remove volumes from skippedDuringReconstruction only if the volume was marked as attached.
Force-pushed from ee28639 to 835e8cc
/retest
/lgtm
/hold cancel
Created a cherry pick for 1.24 here. |
…10670-upstream-release-1.24 Automated cherry pick of #110670: Keep track of each pod that uses a volume during
To preserve the fix in kubernetes#110670, add a unit test that checks a volume is *uncertain* even after a final mount error when it was reconstructed. And actually fix a regression introduced in the previous patch.
This PR partially uses changes from #110575, and fixes #96635 and #70044.
There are two important things in this PR:
Now it is clear that we need to add volumes we read from disk as uncertain, so that kubelet can unmount them properly. But there are two ways to fix the second problem (i.e., where kubelet marks a mount point as unmounted if mounting fails for a volume that was previously marked uncertain).