Use SSA to add pod failure conditions #113304
Conversation
/assign @lavalamp @alculquicondor
/retest
Force-pushed a323594 to 0e387bd
@apelisse can you take a first pass on this?
/retest
1 similar comment
/retest
staging/src/k8s.io/component-helpers/apps/podfailureconditions/helpers.go (three outdated review threads, resolved)
Force-pushed 18f50e9 to 0dee1c1
WithStatus(v1.ConditionTrue).
WithReason("DeletionByTaintManager").
WithMessage("Taint manager: deleting due to NoExecute taint").
WithLastTransitionTime(metav1.Now()),
Does this mean that there is no alternative today? Or should we be copying the transition time from the condition if we see it in the status?
}
if action.GetVerb() == "delete" && action.GetResource().Resource == "pods" {
	podDeleted = true

err := wait.PollImmediate(10*time.Millisecond, 5*time.Second, func() (bool, error) {
why is the poll necessary now?
The test is time-based (we wait for the controller to make progress with its actions), waiting only 500ms to observe the actions.
As a result the test occasionally fails (it happened to fail on my branch, but was passing consistently locally). Thus, I thought I could loop to wait a little longer and make the test more stable. I think I could decouple it from the code changes I made.
Also note that, before introducing pod disruption conditions, the test only waited for the controller to perform the fake DELETE requests, but now it also needs to perform the fake PATCH / APPLY, so it makes sense to give it a little more time.
I have reverted the change and the tests passed, so the test probably flaked yesterday on the infra. Still, I think my changes make sense to make the test more reliable, but I can decouple them into a separate PR. I will keep them reverted from this PR for now.
Ticketed in an independent PR: #113386
return err
}
}
}
return gcc.kubeClient.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, *metav1.NewDeleteOptions(0))
}

func updatePodCondition(podStatusApply *corev1apply.PodStatusApplyConfiguration, condition *corev1apply.PodConditionApplyConfiguration) {
if conditionIndex, _ := findPodConditionApplyByType(podStatusApply.Conditions, *condition.Type); conditionIndex < 0 {
should we be looking at all the conditions (including the ones that are not owned by this controller)?
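For readers following along, the merge semantics in the `updatePodCondition` snippet above ("append if absent, replace in place if present") can be sketched without the apply-configuration machinery; `PodCondition` and both helper names below are simplified stand-ins for the types in the diff, not the real `corev1apply` API:

```go
package main

import "fmt"

// PodCondition is a simplified stand-in for the apply-configuration type in
// the diff (*corev1apply.PodConditionApplyConfiguration).
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

// findPodConditionByType mirrors the shape of the findPodConditionApplyByType
// helper referenced above: it returns the index of the condition with the
// given type, or -1 when absent.
func findPodConditionByType(conditions []PodCondition, condType string) int {
	for i, c := range conditions {
		if c.Type == condType {
			return i
		}
	}
	return -1
}

// updatePodCondition appends the condition when it is missing and replaces
// the existing entry in place otherwise.
func updatePodCondition(conditions []PodCondition, condition PodCondition) []PodCondition {
	i := findPodConditionByType(conditions, condition.Type)
	if i < 0 {
		return append(conditions, condition)
	}
	conditions[i] = condition
	return conditions
}

func main() {
	conds := []PodCondition{{Type: "Ready", Status: "True"}}
	conds = updatePodCondition(conds, PodCondition{Type: "DisruptionTarget", Status: "True", Reason: "DeletionByTaintManager"})
	conds = updatePodCondition(conds, PodCondition{Type: "Ready", Status: "False"})
	fmt.Println(len(conds), conds[0].Status, conds[1].Reason)
}
```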
WithLastTransitionTime(metav1.Now()),
)

if _, err := cs.CoreV1().Pods(victim.Namespace).ApplyStatus(ctx, victimPodApply, metav1.ApplyOptions{FieldManager: "Scheduler", Force: true}); err != nil {
is there a rule that the field manager is camel case? We should be using the profile name (which matches the scheduler name)
I think there is no rule that it should be camel case or in any other format - the documentation of the field is pretty relaxed:

FieldManager string `json:"fieldManager,omitempty" protobuf:"bytes,3,name=fieldManager"`

The naming is analogous to the `reason` field when naming the actor, for example PreemptionByKubeScheduler.
As for the profile name - could you explain what the value is, is it a constant?
I searched for the profile name within the package and found ProfileName() string, but it is not accessible in preemption.go as it is not passed via PostFilter invocations. I guess it would require extending the set of parameters when calling PostFilter or passing the value in context as a keyed value.
I think staying with a constant value for the actor is good enough, until we have a scenario in which it is problematic.
Force-pushed 9344048 to 791f191
/approve
@alculquicondor please unhold the PR if no further issues
/hold cancel
Force-pushed e6a31ea to fea8836
Applied remarks and squashed commits.
/cc @apelisse
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, lavalamp, mimowo. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
My suggestion for the changelog entry: The `kube-scheduler` and `kube-controller-manager` now use server-side apply to set conditions related to pod disruption.
@@ -181,7 +181,7 @@ func ObjectReaction(tracker ObjectTracker) ReactionFunc {
		if err := json.Unmarshal(modified, obj); err != nil {
			return true, nil, err
		}
-	case types.StrategicMergePatchType:
+	case types.StrategicMergePatchType, types.ApplyPatchType:
This change silently enabled unit tests written with server-side apply to appear to succeed, while actually testing subtly different behavior... I really don't think client-side strategic merge patch application is a good way to treat apply patches in the fake client.
This is more likely to lead someone to release code they think they tested and have it break in weird and subtle ways in real life.
Indeed, it might be problematic in some cases. I've done it to enable adjusting the unit tests. I'm wondering how we should fix it. Some options that come to mind:
- provide a complete implementation in the testing library that would mimic the server (would be preferred, but is it feasible?)
- detect unsupported uses in the testing library and return an error, still allowing simple test cases to work
- revert the change, but then it isn't clear to me how to write / adjust the existing unit tests
I would start by reverting the change to avoid misleading random unit tests that try to use apply
To enable specific unit tests, they could add a reactor that handles this patch type, asserts receiving specific apply patch content, and mocks the application of that patch to the test object
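The reactor approach suggested here can be illustrated without importing client-go; `Action`, `ReactionFunc`, and `react` below are simplified stand-ins for the reactor chain in `k8s.io/client-go/testing` (the first reactor that reports the action as handled wins, others fall through):

```go
package main

import (
	"errors"
	"fmt"
)

// Action is a simplified stand-in for k8s.io/client-go/testing.Action.
type Action struct {
	Verb      string
	Resource  string
	PatchType string // e.g. "application/apply-patch+yaml"
	Patch     []byte
}

// ReactionFunc mirrors the shape of client-go's fake reactors: handled=false
// lets the action fall through to the next reactor in the chain.
type ReactionFunc func(Action) (handled bool, err error)

// react walks the chain, stopping at the first reactor that handles the action.
func react(chain []ReactionFunc, a Action) error {
	for _, r := range chain {
		handled, err := r(a)
		if handled {
			return err
		}
	}
	return errors.New("no reactor handled the action")
}

func main() {
	var appliedPatch []byte

	// A test-specific reactor, as suggested above: handle only apply patches
	// on pods, record the patch body so the test can assert on it, and mock
	// the application of that patch to the test object.
	applyReactor := func(a Action) (bool, error) {
		if a.Verb != "patch" || a.Resource != "pods" || a.PatchType != "application/apply-patch+yaml" {
			return false, nil // fall through to other reactors
		}
		appliedPatch = a.Patch // mock "applying" the patch here
		return true, nil
	}

	err := react([]ReactionFunc{applyReactor}, Action{
		Verb: "patch", Resource: "pods",
		PatchType: "application/apply-patch+yaml",
		Patch:     []byte(`{"status":{"conditions":[{"type":"DisruptionTarget"}]}}`),
	})
	fmt.Println(err == nil, len(appliedPatch) > 0)
}
```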
Ok, I have opened the issue for now: #116851. Hope there are some contributors who can help with that.
@apelisse has some stuff in flight to make unit testing SSA easier / possible...
What type of PR is this?
/kind bug
/kind cleanup
What this PR does / why we need it:
It eliminates the risk of failures due to conflicts when adding pod disruption conditions. Currently, pod conditions are added by patches without resource version validation, which can even lead to dropping a condition when combined with condition removals: kubernetes/enhancements#3463 (comment)
Which issue(s) this PR fixes:
Special notes for your reviewer:
Added a 5s poll when asserting on the observed actions. Without it the test is flaky, probably even more than before, because the controller now also issues patches, so it takes longer for all the expected actions to be observed.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: