Enable the "Retriable and non-retriable pod failures for jobs" feature into beta #113360

Merged

@mimowo mimowo commented Oct 26, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

It depends on the following PRs:

Does this PR introduce a user-facing change?

Enable the "Retriable and non-retriable pod failures for jobs" feature into beta

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures 
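For readers new to the feature, here is a minimal sketch of how a pod failure policy could classify a failed pod. The action names `FailJob`, `Ignore`, and `Count` follow the KEP; the `Rule`, `FailedPod`, and `Decide` names are hypothetical and are not the `batch/v1` API. The sketch assumes first-match-wins ordering, a fallback to counting the failure when no rule matches, and a condition type string of `DisruptionTarget`.

```go
package main

import "fmt"

// Action names follow the KEP. Everything else here is a hypothetical model.
type Action string

const (
	FailJob Action = "FailJob" // non-retriable: fail the whole Job
	Ignore  Action = "Ignore"  // retriable: do not count against the backoff limit
	Count   Action = "Count"   // default handling: count the failure
)

// Rule is a simplified stand-in for a policy rule. It matches either on a
// container exit code or on a pod condition type, never on both.
type Rule struct {
	Action        Action
	ExitCodes     []int  // "In" semantics; empty means unused
	ConditionType string // assumed value, e.g. "DisruptionTarget"; empty means unused
}

// FailedPod is a hypothetical summary of a failed pod.
type FailedPod struct {
	ExitCode   int
	Conditions []string
}

func (r Rule) matches(p FailedPod) bool {
	for _, code := range r.ExitCodes {
		if code == p.ExitCode {
			return true
		}
	}
	if r.ConditionType != "" {
		for _, c := range p.Conditions {
			if c == r.ConditionType {
				return true
			}
		}
	}
	return false
}

// Decide returns the action of the first matching rule, or Count if none match.
func Decide(rules []Rule, p FailedPod) Action {
	for _, r := range rules {
		if r.matches(p) {
			return r.Action
		}
	}
	return Count
}

func main() {
	rules := []Rule{
		{Action: FailJob, ExitCodes: []int{42}},
		{Action: Ignore, ConditionType: "DisruptionTarget"},
	}
	fmt.Println(Decide(rules, FailedPod{ExitCode: 42}))
	fmt.Println(Decide(rules, FailedPod{ExitCode: 137, Conditions: []string{"DisruptionTarget"}}))
	fmt.Println(Decide(rules, FailedPod{ExitCode: 1}))
}
```

The second case is the one this PR's condition-patching changes feed into: a pod terminated by a disruption carries the condition, so a policy can treat its failure as retriable.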

@k8s-ci-robot k8s-ci-robot added labels Oct 26, 2022:
- `release-note`: Denotes a PR that will be considered when it comes time to generate release notes.
- `kind/feature`: Categorizes issue or PR as related to a new feature.
- `size/L`: Denotes a PR that changes 100-499 lines, ignoring generated files.
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `do-not-merge/needs-sig`: Indicates an issue or PR lacks a `sig/foo` label and requires one.
- `needs-triage`: Indicates an issue or PR lacks a `triage/foo` label and requires one.
- `needs-priority`: Indicates a PR lacks a `priority/foo` label and requires one.

@k8s-ci-robot k8s-ci-robot added labels Oct 26, 2022:
- `area/test`
- `kind/api-change`: Categorizes issue or PR as related to adding, removing, or otherwise changing an API
- `sig/apps`: Categorizes an issue or PR as relevant to SIG Apps.
- `sig/scheduling`: Categorizes an issue or PR as relevant to SIG Scheduling.
- `sig/testing`: Categorizes an issue or PR as relevant to SIG Testing.

and removed labels:
- `do-not-merge/needs-sig`: Indicates an issue or PR lacks a `sig/foo` label and requires one.
@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@alculquicondor
Member

Can you link all the PRs that this depends on?

@mimowo
Contributor Author

mimowo commented Oct 27, 2022

/retest

@mimowo
Contributor Author

mimowo commented Oct 27, 2022

> Can you link all the PRs that this depends on?

Done

@k8s-ci-robot k8s-ci-robot added the area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework label Oct 28, 2022
@mimowo mimowo force-pushed the handling-pod-failures-beta-enable branch from 36edb2b to d3e7057 Compare October 28, 2022 08:05
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 28, 2022
@mimowo mimowo force-pushed the handling-pod-failures-beta-enable branch 4 times, most recently from bf4309a to 8927d37 Compare October 28, 2022 09:19
@mimowo
Contributor Author

mimowo commented Oct 28, 2022

/retest

@mimowo mimowo force-pushed the handling-pod-failures-beta-enable branch from 8927d37 to 94009c9 Compare October 28, 2022 12:03
@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Oct 28, 2022
@liggitt liggitt self-assigned this Nov 8, 2022
@liggitt liggitt added this to In progress in API Reviews Nov 8, 2022
Member

@liggitt liggitt left a comment

API constant change lgtm

a couple questions about the patch tests and one request on the default RBAC role if possible

@@ -327,6 +327,7 @@ func TestCreateNode(t *testing.T) {
    description string
    pods []v1.Pod
    node *v1.Node
    expectPatch bool
Member

should this verify anything about the content of the patch?

Contributor Author

@mimowo mimowo Nov 9, 2022

Answered below

@@ -766,6 +770,7 @@ func TestEventualConsistency(t *testing.T) {
    newPod *v1.Pod
    oldNode *v1.Node
    newNode *v1.Node
    expectPatch bool
Member

should this verify anything about the content of the patch?

Contributor Author

@mimowo mimowo Nov 9, 2022

This is a unit test, and the logic that generates the patch content lies beyond Taint Manager, which only prepares and sends the request:

    if _, err := c.CoreV1().Pods(pod.Namespace).ApplyStatus(ctx, podApply, metav1.ApplyOptions{FieldManager: fieldManager, Force: true}); err != nil {

On the other hand, preparing the apply request is the responsibility of Taint Manager, so testing the patch would verify that logic. However, not checking the patch content is also the pattern in other Taint Manager tests, and it is already ticketed: #111612. If you think checking it is the right approach, I would work on this in a follow-up PR.

Test coverage for the feature would not change much, as we already have integration tests which verify that the condition is added:

    _, cond := podutil.GetPodCondition(&testPod.Status, v1.AlphaNoCompatGuaranteeDisruptionTarget)

An e2e test for the path is also introduced in this PR.

Member

The patch would be a slice of bytes. I'm fine not checking it, given that we have integration tests.

Contributor Author

OK, closed the related issue with a comment: #111612 (comment)
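The `expectPatch` pattern discussed in this thread can be illustrated with a minimal, stdlib-only sketch. It is not the Taint Manager test and does not use client-go: `fakeClient`, `action`, `sawPatch`, and `evict` are hypothetical names, and the patch body is an assumed placeholder. The point is only that the test asserts whether a patch request was sent, not what its bytes contain.

```go
package main

import "fmt"

// action is a hypothetical record of one request seen by the fake client.
type action struct {
	Verb     string
	Resource string
	Body     []byte
}

// fakeClient stands in for a fake clientset: it only records requests.
type fakeClient struct {
	actions []action
}

func (c *fakeClient) Patch(resource string, body []byte) {
	c.actions = append(c.actions, action{Verb: "patch", Resource: resource, Body: body})
}

func (c *fakeClient) Delete(resource string) {
	c.actions = append(c.actions, action{Verb: "delete", Resource: resource})
}

// sawPatch reports whether any recorded action is a patch on the resource.
// It deliberately ignores the patch body, mirroring the expectPatch check.
func sawPatch(c *fakeClient, resource string) bool {
	for _, a := range c.actions {
		if a.Verb == "patch" && a.Resource == resource {
			return true
		}
	}
	return false
}

// evict is hypothetical code under test: it optionally patches the pod
// status with a condition and then deletes the pod.
func evict(c *fakeClient, addCondition bool) {
	if addCondition {
		c.Patch("pods", []byte(`{"status":{"conditions":[{"type":"DisruptionTarget"}]}}`))
	}
	c.Delete("pods")
}

func main() {
	tests := []struct {
		description  string
		addCondition bool
		expectPatch  bool
	}{
		{"condition enabled", true, true},
		{"condition disabled", false, false},
	}
	for _, tt := range tests {
		c := &fakeClient{}
		evict(c, tt.addCondition)
		got := sawPatch(c, "pods")
		fmt.Printf("%s: patch sent=%v, expected=%v\n", tt.description, got, tt.expectPatch)
	}
}
```

Checking the body as well would mean decoding `Body` in the test, which is the extra coverage the reviewers agreed to leave to the integration tests.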

@@ -337,6 +337,9 @@ func TestPostFilter(t *testing.T) {
    }
    // As we use a bare clientset above, it's needed to add a reactor here
    // to not fail Victims deletion logic.
    cs.PrependReactor("patch", "pods", func(action clienttesting.Action) (bool, runtime.Object, error) {
Member

it's surprising these don't need to handle the patch/delete requests in any way

Contributor Author

@mimowo mimowo Nov 9, 2022

This unit test focuses its assertions on nominating the node for preemption. It does not assert on the modifications made to the pod, so it ignores the DELETE (and now also PATCH) action.

As a proposal, I pushed a new commit implementing another approach, which I find simpler and more powerful: indexing the potential victim pods in the fake client so that the test does not fail when trying to patch non-existing pods.

@liggitt liggitt moved this from In progress to API review completed, 1.26 in API Reviews Nov 8, 2022
@mimowo mimowo force-pushed the handling-pod-failures-beta-enable branch from 5726091 to 818e180 Compare November 9, 2022 08:03
@mimowo
Contributor Author

mimowo commented Nov 9, 2022

/retest

@liggitt
Member

liggitt commented Nov 9, 2022

RBAC and API changes lgtm

will defer to area approvers for ack of patch test at #113360 (comment)

    for _, pod := range tt.pods {
        podItems = append(podItems, *pod)
    }
    cs := clientsetfake.NewSimpleClientset(&v1.PodList{Items: podItems})
Member

great, this seems better.
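The two approaches compared in this thread (swallowing requests with a prepended reactor versus seeding the fake client with the victim pods) can be contrasted in a small stdlib-only model. This is a sketch under stated assumptions, not the client-go fake clientset: `fakeClient`, `reactor`, `prependReactor`, `do`, and `preempt` are hypothetical names, and the "not found on missing object" behaviour is assumed for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// reactor is a hypothetical hook that may intercept a request before it
// reaches the object store.
type reactor func(verb, name string) (handled bool, err error)

// fakeClient models a fake client backed by an in-memory object store.
type fakeClient struct {
	objects  map[string]bool
	reactors []reactor
}

// newFakeClient seeds the store with the given object names.
func newFakeClient(names ...string) *fakeClient {
	c := &fakeClient{objects: map[string]bool{}}
	for _, n := range names {
		c.objects[n] = true
	}
	return c
}

// prependReactor puts r in front of any existing reactors.
func (c *fakeClient) prependReactor(r reactor) {
	c.reactors = append([]reactor{r}, c.reactors...)
}

// do runs the reactors first, then falls back to the object store.
func (c *fakeClient) do(verb, name string) error {
	for _, r := range c.reactors {
		if handled, err := r(verb, name); handled {
			return err
		}
	}
	if !c.objects[name] {
		return errNotFound
	}
	if verb == "delete" {
		delete(c.objects, name)
	}
	return nil
}

// preempt is hypothetical code under test: patch the victim, then delete it.
func preempt(c *fakeClient, victim string) error {
	if err := c.do("patch", victim); err != nil {
		return err
	}
	return c.do("delete", victim)
}

func main() {
	// A bare client fails because the victim does not exist in the store.
	fmt.Println("bare client:", preempt(newFakeClient(), "victim"))

	// Approach 1: a reactor swallows patch and delete requests.
	withReactor := newFakeClient()
	withReactor.prependReactor(func(verb, name string) (bool, error) {
		return verb == "patch" || verb == "delete", nil
	})
	fmt.Println("with reactor:", preempt(withReactor, "victim"))

	// Approach 2: seed the store so the requests succeed on real objects.
	seeded := newFakeClient("victim")
	fmt.Println("seeded client:", preempt(seeded, "victim"))
	fmt.Println("victim still stored:", seeded.objects["victim"])
}
```

With a seeded store the requests act on real state, so a later assertion on the victims becomes possible, whereas a catch-all reactor hides both the patch and the delete from the test.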

@alculquicondor
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2022
@liggitt
Member

liggitt commented Nov 9, 2022

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, liggitt, mimowo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2022
@liggitt
Member

liggitt commented Nov 9, 2022

adding to milestone per code-freeze exception granted at https://groups.google.com/g/kubernetes-sig-release/c/tZPjSWW_g30/m/_FzB7nrUAAAJ?utm_medium=email&utm_source=footer

/milestone v1.26

@leonardpahlke
Member

exception request approved: email

Your updated deadline to make any changes to your PR is 18:00 PST Friday 11th November 2022.

/milestone v1.26

@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 9, 2022
@mimowo
Contributor Author

mimowo commented Nov 9, 2022

/retest

Labels
- `api-review`: Categorizes an issue or PR as actively needing an API review.
- `approved`: Indicates a PR has been approved by an approver from all required OWNERS files.
- `area/e2e-test-framework`: Issues or PRs related to refactoring the kubernetes e2e test framework
- `area/kubelet`
- `area/test`
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `kind/api-change`: Categorizes issue or PR as related to adding, removing, or otherwise changing an API
- `kind/feature`: Categorizes issue or PR as related to a new feature.
- `lgtm`: "Looks good to me", indicates that a PR is ready to be merged.
- `priority/important-longterm`: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
- `release-note`: Denotes a PR that will be considered when it comes time to generate release notes.
- `sig/apps`: Categorizes an issue or PR as relevant to SIG Apps.
- `sig/auth`: Categorizes an issue or PR as relevant to SIG Auth.
- `sig/node`: Categorizes an issue or PR as relevant to SIG Node.
- `sig/scheduling`: Categorizes an issue or PR as relevant to SIG Scheduling.
- `sig/testing`: Categorizes an issue or PR as relevant to SIG Testing.
- `size/XL`: Denotes a PR that changes 500-999 lines, ignoring generated files.
- `triage/accepted`: Indicates an issue or PR is ready to be actively worked on.
Projects
Status: API review completed, 1.26

8 participants