EndpointSlice with Pods without an existing Node #110639
Conversation
@aojea: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/priority important-soon
@@ -89,7 +89,13 @@ func (spc *ServicePortCache) totals(maxEndpointsPerSlice int) (int, int, int) {
 	for _, eInfo := range spc.items {
 		endpoints += eInfo.Endpoints
 		actualSlices += eInfo.Slices
-		desiredSlices += numDesiredSlices(eInfo.Endpoints, maxEndpointsPerSlice)
+		if eInfo.Endpoints > 0 {
+			desiredSlices += numDesiredSlices(eInfo.Endpoints, maxEndpointsPerSlice)
@aojea Not suggesting a change necessarily, but would it be worth updating numDesiredSlices() to check for numEndpoints==0 and return 0, rather than what it does right now and return 1? Then you can just special-case desiredSlices=0 at the bottom but not have to special-case 0 in the loop.
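For clarity, a minimal sketch of that alternative, assuming the helper keeps its current name and arguments; the ceiling division below is illustrative and not necessarily the exact upstream implementation:

```go
// Sketch only: numDesiredSlices would report 0 rather than 1 when a port has
// no endpoints, so callers only need to special-case the placeholder once.
func numDesiredSlices(numEndpoints, maxEndpointsPerSlice int) int {
	if numEndpoints == 0 {
		return 0
	}
	// ceiling division: one slice per maxEndpointsPerSlice endpoints
	return (numEndpoints + maxEndpointsPerSlice - 1) / maxEndpointsPerSlice
}
```

The caller would then set desiredSlices = 1 after the loop when it ends up as 0, instead of guarding against 0 inside the loop.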
Also, do we need to change endpointslicemirroring at all?
yeah, I think it's a matter of where we want to special-case the "there is always one slice, even if there are no endpoints at all" behavior ... I took this approach because numDesiredSlices is also used in
hmm, seems you are right
the endpointslicemirroring controller seems to not use a placeholder, and already returns 0 slices for 0 endpoints, but better to have @robscott confirm
There is always a placeholder slice. The ServicePortCache logic was always counting one EndpointSlice per Endpoint, but if there are multiple empty Endpoints we use just one placeholder slice, not multiple placeholder slices.
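For illustration, a sketch of how totals() can count that single placeholder once, after the per-port loop, following the approach in the diff above. The stand-in types and the return order are assumptions, not necessarily the merged code, and numDesiredSlices is the helper shown earlier:

```go
// Minimal stand-ins for the cache types, just enough to make the sketch
// self-contained; the real types carry more fields.
type EfficiencyInfo struct {
	Endpoints int
	Slices    int
}

type ServicePortCache struct {
	items map[string]EfficiencyInfo
}

// Sketch: sum per-port counts, then account for the single placeholder slice
// that exists when a Service has no endpoints at all.
func (spc *ServicePortCache) totals(maxEndpointsPerSlice int) (int, int, int) {
	var actualSlices, desiredSlices, endpoints int
	for _, eInfo := range spc.items {
		endpoints += eInfo.Endpoints
		actualSlices += eInfo.Slices
		if eInfo.Endpoints > 0 {
			desiredSlices += numDesiredSlices(eInfo.Endpoints, maxEndpointsPerSlice)
		}
	}
	if desiredSlices == 0 {
		desiredSlices = 1 // one placeholder slice, no matter how many ports are empty
	}
	return actualSlices, desiredSlices, endpoints
}
```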
When a Pod references a Node that doesn't exist in the local informer cache, the previous behavior was to return an error to retry later and stop processing. However, this can cause scenarios in which a missing node leaves a Slice stuck: it cannot reflect other changes, or even be created. Also, this doesn't respect the publishNotReadyAddresses option on Services, which considers it acceptable to publish pod addresses that are known to not be ready. The new behavior keeps retrying the problematic Service, but keeps processing the updates, reflecting the current state on the EndpointSlice. If publishNotReadyAddresses is set, a missing node on a Pod is not treated as an error.
/lgtm
@aojea the conformance test failure looks odd, because it's about EPS. However, it doesn't look related to EndpointSlice at all; the failure is that the test pods don't get their IPs within a 3m timeout. TLDR the image pulls are taking 2+ minutes for the pod:
and a few seconds after that the e2e test times out waiting for the pods to be up. Well yeah, if the image pull takes 2m51s then I'm not surprised the e2e test's 3m timeout is exceeded.
/retest
…0639-upstream-release-1.24 Automated cherry pick of #110639: fix a bug on endpointslices tests comparing the wrong
…ce_no_node EndpointSlice with Pods without an existing Node
/kind bug
/kind cleanup
/kind documentation
What this PR does / why we need it:
endpointslices: node missing on Pod scenario
When a Pod is referencing a Node that doesn't exist on the local
informer cache, the current behavior was to return an error to
retry later and stop processing.
However, this can cause scenarios in which a missing node leaves a
Slice stuck: it cannot reflect other changes, or even be created.
Also, this doesn't respect the publishNotReadyAddresses option
on Services, which considers it acceptable to publish pod addresses
that are known to not be ready.
The new behavior keeps retrying the problematic Service, but it
keeps processing the updates, reflecting the current state on the
EndpointSlice. If publishNotReadyAddresses is set, a missing
node on a Pod is not treated as an error.
Fixes: #107927
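To make the described behavior concrete, here is a hypothetical sketch of the pattern, not the actual reconciler code: the node lookup failure is recorded so the Service is retried, the sync keeps going, and with publishNotReadyAddresses set the missing node is not treated as an error at all. The function names, signatures, and the simplified podToEndpoint stand-in are assumptions for illustration.

```go
package endpointslice

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	discovery "k8s.io/api/discovery/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// endpointsForPods is a hypothetical helper: it keeps processing every Pod even
// when its Node is missing from the informer cache, and only surfaces the
// lookup failure (to trigger a retry of the Service) when
// publishNotReadyAddresses is not set.
func endpointsForPods(pods []*v1.Pod, service *v1.Service, nodeLister listersv1.NodeLister) ([]discovery.Endpoint, error) {
	var endpoints []discovery.Endpoint
	var errs []error
	for _, pod := range pods {
		node, err := nodeLister.Get(pod.Spec.NodeName)
		if err != nil && !service.Spec.PublishNotReadyAddresses {
			// Remember the failure so the Service is retried later, but keep
			// reflecting the rest of the update on the EndpointSlice.
			errs = append(errs, fmt.Errorf("missing node %q for pod %s: %w", pod.Spec.NodeName, pod.Name, err))
		}
		endpoints = append(endpoints, podToEndpoint(pod, node, service))
	}
	return endpoints, utilerrors.NewAggregate(errs)
}

// podToEndpoint is a simplified stand-in for the real conversion helper; the
// actual one also fills in conditions, topology, and the target reference.
func podToEndpoint(pod *v1.Pod, node *v1.Node, service *v1.Service) discovery.Endpoint {
	ready := service.Spec.PublishNotReadyAddresses // readiness simplified for the sketch
	ep := discovery.Endpoint{
		Addresses:  []string{pod.Status.PodIP},
		Conditions: discovery.EndpointConditions{Ready: &ready},
	}
	if node != nil {
		ep.NodeName = &node.Name
	}
	return ep
}
```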