Add worker to clean up stale DisruptionTarget condition #111475
Conversation
/remove-sig api-machinery
Force-pushed from 53b272e to a403d60
Force-pushed from a403d60 to 5f09ef0
minor nits, controller changes
/lgtm
/approve
		return
	}
	dc.stalePodDisruptionQueue.AddAfter(key, d)
	klog.InfoS("Enqueued pod for stale DisruptionTarget condition cleanup", "pod", klog.KObj(pod))
Nit: most of the debugging info is at level 4.
Oops, left over from my debugging.
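For reference, the leveled form the nit points at would look like this (a minimal fragment; `pod` is the pod object already in scope in the handler):

```go
// Debugging detail is conventionally gated behind verbosity level 4,
// so this line only appears when the component runs with -v=4 or higher.
klog.V(4).InfoS("Enqueued pod for stale DisruptionTarget condition cleanup", "pod", klog.KObj(pod))
```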
		dc.queue.Forget(key)
		return true
	}
	utilruntime.HandleError(fmt.Errorf("syncing Pod %v to clear DisruptionTarget condition, requeueing: %v", key.(string), err))
Suggested change:
- utilruntime.HandleError(fmt.Errorf("syncing Pod %v to clear DisruptionTarget condition, requeueing: %v", key.(string), err))
+ utilruntime.HandleError(fmt.Errorf("error syncing Pod %v to clear DisruptionTarget condition, requeueing: %v", key.(string), err))
I somewhat disagree on this. This is already an error, so it shouldn't need the word error again. And if someday the implementation of HandleError changes to wrap the given error, it wouldn't read well.
But sure, given the current implementation of HandleError, your recommendation makes sense.
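For readers skimming the thread: the snippet under discussion follows the standard controller workqueue processing pattern. A hedged sketch of that shape (the method name and queue field are illustrative, not necessarily the PR's exact identifiers):

```go
func (dc *DisruptionController) processNextStalePodDisruptionWorkItem(ctx context.Context) bool {
	key, quit := dc.stalePodDisruptionQueue.Get()
	if quit {
		return false
	}
	defer dc.stalePodDisruptionQueue.Done(key)

	err := dc.syncStalePodDisruption(ctx, key.(string))
	if err == nil {
		// Success: reset the rate-limiting history for this key.
		dc.stalePodDisruptionQueue.Forget(key)
		return true
	}
	// Failure: surface the error and requeue with backoff.
	utilruntime.HandleError(fmt.Errorf("syncing Pod %v to clear DisruptionTarget condition, requeueing: %v", key.(string), err))
	dc.stalePodDisruptionQueue.AddRateLimited(key)
	return true
}
```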
/hold
@@ -181,9 +187,8 @@ func newFakeDisruptionController() (*disruptionController, *pdbStates) {
	dc.rsListerSynced = alwaysReady
	dc.dListerSynced = alwaysReady
	dc.ssListerSynced = alwaysReady
-	ctx := context.TODO()
Where is ctx.Done() coming from after removing this?
It's coming from the test. See newFakeDisruptionControllerWithTime.
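In other words, the lifetime of the context now belongs to the test. A minimal illustration of the calling pattern (assuming, as the helper's name suggests, that it takes the context and a start time):

```go
// The test owns the context, so ctx.Done() inside the controller fires
// when the test cancels, instead of relying on a context.TODO() created
// inside the fake-controller constructor.
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
dc, ps := newFakeDisruptionControllerWithTime(ctx, time.Now())
```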
To be able to write more precise unit tests in the future

Change-Id: I8f45947dfacca501acd856849bd978fad0f735cd
Force-pushed from 7fe1b17 to bb2353d
Change-Id: I907fbdf01e7ff08d823fb23aa168ff271d8ff1ee
Force-pushed from bb2353d to 4188d9b
/assign @lavalamp
/approve for client
/hold cancel
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/remove-kind api-change
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, janetkuo, lavalamp, soltysh

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a requirement from #110959.
It is possible for a controller to add the DisruptionTarget condition and then fail before it is able to delete the pod. When the controller restarts, it might make a different decision, leaving a stale condition behind. This worker, added to the disruption controller, clears the condition if the pod does not get a DeletionTimestamp within 2 minutes (sketched below).
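A hedged sketch of that cleanup decision, assuming a clock on the controller (introduced by the first commit) and using illustrative helper names (`cacheKey` and `clearStalePodDisruption` are not necessarily the PR's exact identifiers):

```go
const stalePodDisruptionTimeout = 2 * time.Minute

func (dc *DisruptionController) syncStalePodDisruption(ctx context.Context, pod *v1.Pod) error {
	// A pod that is already being deleted carries no stale condition.
	if pod.DeletionTimestamp != nil {
		return nil
	}
	// Find the DisruptionTarget condition, if any.
	var cond *v1.PodCondition
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == v1.PodConditionType("DisruptionTarget") {
			cond = &pod.Status.Conditions[i]
			break
		}
	}
	if cond == nil || cond.Status != v1.ConditionTrue {
		return nil
	}
	// If the 2-minute window has not elapsed yet, re-enqueue for the remainder.
	if remaining := cond.LastTransitionTime.Time.Add(stalePodDisruptionTimeout).Sub(dc.clock.Now()); remaining > 0 {
		dc.stalePodDisruptionQueue.AddAfter(cacheKey(pod), remaining)
		return nil
	}
	// No DeletionTimestamp after the timeout: the condition is stale, clear it.
	return dc.clearStalePodDisruption(ctx, pod)
}
```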
Which issue(s) this PR fixes:
Ref kubernetes/enhancements#3329
Special notes for your reviewer:
The first commit introduces a clock interface to the controller in order to write the unit tests more accurately. I proposed it initially as a separate PR in #111447, but it seems small enough to be in this PR as a separate commit.
This functionality is purposely not feature-gated: even if the PodDisruptionConditions feature gate is disabled, we still want to clear the condition (see the contrast sketch below).
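For contrast, this is the kind of check that was deliberately left out of the cleanup path (illustrative; `utilfeature` is k8s.io/apiserver/pkg/util/feature and `features` is the kubelet/controller feature registry):

```go
// Deliberately ABSENT from the cleanup worker: even when the gate is off,
// a condition written while it was on must still be cleaned up.
if !utilfeature.DefaultFeatureGate.Enabled(features.PodDisruptionConditions) {
	return nil // would leave stale conditions behind after a gate flip or downgrade
}
```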
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: