Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deflake a preemption test that may patch Node incorrectly #114350

Merged
merged 1 commit into from Dec 10, 2022

Conversation

Huang-Wei
Copy link
Member

@Huang-Wei Huang-Wei commented Dec 7, 2022

What type of PR is this?

/kind failing-test
/kind flake
/sig scheduling

What this PR does / why we need it:

In the e2e-test grid, [sig-scheduling] SchedulerPreemption [Serial] validates pod disruption condition is added to the preempted pod is failing flakily.

We had a discussion in slack: https://kubernetes.slack.com/archives/C09TP78DV/p1670410754287509.

The preemptor was expected to be pending b/c the only extended source piece was occupied by a low-priority pod, but the log shows the preemptor gets into running directly:

I1206 20:06:08.373180       9 schedule_one.go:81] "Attempting to schedule pod" pod="sched-preemption-3784/victim-pod"
I1206 20:06:08.373416       9 default_binder.go:52] "Attempting to bind pod to node" pod="sched-preemption-3784/victim-pod" node="bootstrap-e2e-minion-group-c44v"
...
I1206 20:06:10.503305       9 schedule_one.go:81] "Attempting to schedule pod" pod="sched-preemption-3784/preemptor-pod"
I1206 20:06:10.503652       9 default_binder.go:52] "Attempting to bind pod to node" pod="sched-preemption-3784/preemptor-pod" node="bootstrap-e2e-minion-group-c44v"
I1206 20:06:10.507929       9 schedule_one.go:252] "Successfully bound pod to node" pod="sched-preemption-3784/preemptor-pod" node="bootstrap-e2e-minion-group-c44v" evaluatedNodes=1 feasibleNodes=1

A suspicious clue is that the e2e test patch Node by only updating the capacity (instead of both capacity and allocatable), it works most of the time as kubelet is working asynchronously to sync the capacity value to allocatable:

image

But we'd better not count on that due to:

  1. it's an asynchronous process
  2. kubelet may not function properly

So this PR enforces setting both capacity and allocatable in the e2e when we want to update a Node's resource capacity.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Deflake a preemption test that may patch Nodes incorrectly.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Dec 7, 2022
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.26 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.26.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Dec 7 17:58:56 UTC 2022.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 7, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/test needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 7, 2022
@k8s-ci-robot
Copy link
Contributor

@Huang-Wei: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 7, 2022
@Huang-Wei
Copy link
Member Author

cc @alculquicondor @mimowo

@aojea
Copy link
Member

aojea commented Dec 7, 2022

/cc

@mimowo
Copy link
Contributor

mimowo commented Dec 8, 2022

/lgtm
I was also able to confirm that kubelet updates Allocatable asynchronously.

Also, I've reproduced the issue on my local kind cluster by patching the quantity to 5 just before the test and looping the test.

It appears that after the proposed changes the test passes consistently.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 8, 2022
@mimowo
Copy link
Contributor

mimowo commented Dec 8, 2022

@aojea
please assign the milestone to 1.26

@mimowo
Copy link
Contributor

mimowo commented Dec 8, 2022

@Huang-Wei thanks a lot for investigating and proposing the PR

@aojea
Copy link
Member

aojea commented Dec 8, 2022

@aojea please assign the milestone to 1.26

this will have to wait for the 1.26.1, we'll merge in 1.27 and backport later

/milestone 1.27

@k8s-ci-robot
Copy link
Contributor

@aojea: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.16, v1.17, v1.18, v1.19, v1.20, v1.21, v1.22, v1.23, v1.24, v1.25, v1.26, v1.27, v1.28]

Use /milestone clear to clear the milestone.

In response to this:

@aojea please assign the milestone to 1.26

this will have to wait for the 1.26.1, we'll merge in 1.27 and backport later

/milestone 1.27

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mimowo
Copy link
Contributor

mimowo commented Dec 8, 2022

@aojea please correct the milestone assignment as the robot suggests 😄

@alculquicondor
Copy link
Member

Now I understand why the existing tests were passing:

Initially the pods are fully unschedulable because allocatable["extended"] = 0 until kubelet has a chance to update it. The next test requests 9 extended resources. At this point, probably allocatable["extended"]=5 before kubelet does the update. So the pods are unschedulable.

/lgtm

@aojea
Copy link
Member

aojea commented Dec 8, 2022

/milestone v1.27

@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Dec 8, 2022
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Dec 8, 2022
@dims
Copy link
Member

dims commented Dec 10, 2022

/retest

@k8s-ci-robot k8s-ci-robot merged commit a40111c into kubernetes:master Dec 10, 2022
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.27, v1.26 Dec 10, 2022
@mimowo
Copy link
Contributor

mimowo commented Dec 12, 2022

@kubernetes/release-managers
Should we cherry-pick this PR for 1.26?
The fix is only in the test code. However, the test will flake if there is a testgrid configuration for the 1.26 release which would run it.

@alculquicondor
Copy link
Member

yes, cherry-pick please

@Huang-Wei Huang-Wei deleted the fix-flaky-test branch December 12, 2022 19:47
@liggitt liggitt modified the milestones: v1.26, v1.27 Dec 13, 2022
k8s-ci-robot added a commit that referenced this pull request Jan 4, 2023
…50-upstream-release-1.26

Automated cherry pick of #114350: Deflake a preemption test that may patch Node incorrectly
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants