Don't force detach volume from healthy nodes #110721

jsafrane · 2022-06-22T12:34:12Z

~~WIP: PR for discussion. Missing unit tests.~~

What type of PR is this?

/kind bug

What this PR does / why we need it:

The 6-minute force-detach timeout should be used only for nodes that are not healthy.

If a CSI driver is being upgraded or is simply slow, NodeUnstage can take more than 6 minutes. In that case, the Pod is already deleted from the API server, so the A/D controller will force-detach a mounted volume, possibly corrupting the volume and breaking CSI: a CSI driver expects NodeUnstage to succeed before Kubernetes calls ControllerUnpublish.

In the context of this PR, an unhealthy node means node.status.conditions["Ready"] != true.
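
As an illustration only (this is not the code from this PR), the health definition above could be sketched as follows. The `Node` and `NodeCondition` structs are simplified stand-ins for the Kubernetes core/v1 types, the function name `nodeIsHealthy` is hypothetical, and treating a missing Ready condition as unhealthy is an assumption that follows from reading "Ready != true" literally.

```go
package main

import "fmt"

// NodeCondition is a simplified stand-in for the core/v1 node condition,
// defined locally so this sketch compiles on its own.
type NodeCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False", or "Unknown"
}

// Node is a simplified stand-in holding only what the check needs.
type Node struct {
	Name       string
	Conditions []NodeCondition
}

// nodeIsHealthy (hypothetical name) reports whether the node's Ready
// condition is exactly "True". Assumption: a node without any Ready
// condition is treated as unhealthy.
func nodeIsHealthy(n Node) bool {
	for _, c := range n.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	ready := Node{Name: "node-a", Conditions: []NodeCondition{{Type: "Ready", Status: "True"}}}
	unknown := Node{Name: "node-b", Conditions: []NodeCondition{{Type: "Ready", Status: "Unknown"}}}
	fmt.Println(nodeIsHealthy(ready), nodeIsHealthy(unknown)) // prints: true false
}
```

Note that both "False" and "Unknown" count as unhealthy in this sketch, since only an explicit "True" satisfies the check.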

Which issue(s) this PR fixes:

Fixes #106710, #106902

cc @gnufied @jingxu97 @bswartz

Does this PR introduce a user-facing change?

Volumes are no longer detached from healthy nodes after the 6-minute timeout. The 6-minute force-detach timeout is used only for unhealthy nodes (`node.status.conditions["Ready"] != true`).
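
A rough sketch of the decision described in this release note, under stated assumptions: the helper name `shouldForceDetach`, its parameters, and the constant `maxWaitForUnmount` are hypothetical, the exact comparison at the timeout boundary is a guess, and other inputs the real A/D controller considers (such as the out-of-service taint) are deliberately left out.

```go
package main

import (
	"fmt"
	"time"
)

// maxWaitForUnmount mirrors the 6-minute timeout mentioned in the PR
// description; the constant name is an assumption made for this sketch.
const maxWaitForUnmount = 6 * time.Minute

// shouldForceDetach (hypothetical helper) returns true only when the volume
// is still reported as mounted, the wait has reached the timeout, and the
// node is not healthy. On a healthy node it never returns true.
func shouldForceDetach(mounted bool, waited time.Duration, nodeHealthy bool) bool {
	if !mounted {
		// Nothing to force: an unmounted volume goes through a normal detach.
		return false
	}
	return waited >= maxWaitForUnmount && !nodeHealthy
}

func main() {
	fmt.Println(shouldForceDetach(true, 7*time.Minute, true))  // prints: false
	fmt.Println(shouldForceDetach(true, 7*time.Minute, false)) // prints: true
}
```

The key point is that the timeout alone is no longer sufficient; node health acts as a second gate.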

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2022-06-22T12:34:20Z

@jsafrane: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jsafrane · 2022-06-23T13:51:47Z

Added a unit test.

jingxu97 · 2022-06-23T15:08:14Z

pkg/controller/volume/attachdetach/reconciler/reconciler.go

@@ -154,6 +155,15 @@ func (rc *reconciler) hasOutOfServiceTaint(nodeName types.NodeName) (bool, error
 	return false, nil
 }

+// isHealthy returns true if the node looks healthy.
+func (rc *reconciler) isHealthy(nodeName types.NodeName) (bool, error) {


How about naming it isNodeHealthy()?

jingxu97 · 2022-06-23T15:10:19Z

pkg/controller/volume/attachdetach/reconciler/reconciler.go

+// isHealthy returns true if the node looks healthy.
+func (rc *reconciler) isHealthy(nodeName types.NodeName) (bool, error) {
+	node, err := rc.nodeLister.Get(string(nodeName))
+	if err != nil {


Besides the node object not being found, is there any other possible error when getting a node from nodeLister, such as a temporary error?

It is an informer, so it won't error on network hiccups. I've never seen a temporary error returned.

jingxu97 · 2022-06-23T15:15:52Z

pkg/controller/volume/attachdetach/reconciler/reconciler_test.go

+	// Act
+	// Delete the pod and the volume will be detached only after the maxLongWaitForUnmountDuration expires as volume is
+	// not unmounted. Here maxLongWaitForUnmountDuration is used to mimic that node is out of service.
+	// But in this case the node does not have the node.kubernetes.io/out-of-service taint and hence it will wait for


out-of-service is an alpha feature; do we need a feature gate for the test?

That's a copied comment that I forgot to edit :-)

Fixed.

jingxu97 · 2022-06-23T15:23:10Z

This PR introduces a behavior change that might affect user workloads in certain cases, e.g., if a pod is stuck in terminating (we still have some bugs there) and the node is healthy, the volume will not be detached. Is mentioning this change in the release note good enough?

The 6-minute force-detach timeout should be used only for nodes that are not healthy. If a CSI driver is being upgraded or is simply slow, NodeUnstage can take more than 6 minutes. In that case, the Pod is already deleted from the API server, so the A/D controller will force-detach a mounted volume, possibly corrupting the volume and breaking CSI: a CSI driver expects NodeUnstage to succeed before Kubernetes calls ControllerUnpublish.

k8s-ci-robot · 2022-06-24T13:05:27Z

@jsafrane: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-storage-snapshot | `3b94ac2` | link | false | `/test pull-kubernetes-e2e-gce-storage-snapshot` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

bswartz

/lgtm

k8s-ci-robot · 2022-06-27T12:54:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bswartz, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/volume/attachdetach/OWNERS~~ [jsafrane]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ialidzhikov · 2022-06-30T19:42:42Z

@jsafrane , does it make sense to backport?

jsafrane · 2022-07-01T13:17:16Z

@ialidzhikov I am afraid it could break someone. What do you think? Can someone other than me test it thoroughly?

jsafrane · 2022-11-29T14:02:00Z

It has been in 1.25 for quite some time without any issues reported, so I'm approving the 1.24 backport in #114168.

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 22, 2022

k8s-ci-robot requested review from verult and xing-yang June 22, 2022 12:34

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 22, 2022

jsafrane force-pushed the fix-force-detach branch from 1dccbe0 to 7cc4d09 Compare June 23, 2022 13:50

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 23, 2022

jsafrane changed the title ~~WIP: Don't force detach volume from healthy nodes~~ Don't force detach volume from healthy nodes Jun 23, 2022

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 23, 2022

jingxu97 reviewed Jun 23, 2022

jsafrane force-pushed the fix-force-detach branch from 7cc4d09 to a9bdfe8 Compare June 24, 2022 08:38

jsafrane force-pushed the fix-force-detach branch from a9bdfe8 to 3b94ac2 Compare June 24, 2022 10:52

bswartz approved these changes Jun 27, 2022

k8s-ci-robot assigned bswartz Jun 27, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 27, 2022

k8s-ci-robot merged commit aefb71d into kubernetes:master Jun 27, 2022

k8s-ci-robot added this to the v1.25 milestone Jun 27, 2022

pwschuurman mentioned this pull request Jul 21, 2022

Pod Informer Update with updated Pod UID can lead to attachdetach controller desired_state_of_world cache inaccuracy #111327

Open

msau42 mentioned this pull request Nov 11, 2022

Disk failed to relink with udevadm kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#608

Closed

mattcary mentioned this pull request Nov 28, 2022

Automated cherry pick of #110721: Don't force detach volume from healthy nodes #114168

Merged

mattcary mentioned this pull request Dec 2, 2022

Automated cherry pick of #110721: Don't force detach volume from healthy nodes #114258

Closed

k8s-ci-robot added a commit that referenced this pull request Dec 2, 2022

Merge pull request #114168 from mattcary/automated-cherry-pick-of-#11…

025433a

…0721-upstream-release-1.24 Automated cherry pick of #110721: Don't force detach volume from healthy nodes

msau42 mentioned this pull request Jan 20, 2023

Detaching volume forcefully is extremely dangerous #115223

Closed

msau42 mentioned this pull request Jan 27, 2023

Automated cherry pick of #110721: Don't force detach volume from healthy nodes #115353

Closed

msau42 mentioned this pull request Mar 14, 2023

Termination signals and pod/data safety impact #116618

Open
