
Avoid re-syncing LBs for ETP=local services #109706

Merged

Conversation

alexanderConstantinescu
Member

@alexanderConstantinescu alexanderConstantinescu commented Apr 28, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR reduces the number of LB re-configurations for ETP=local services by completely removing the sync for node readiness changes from the CCM's sync loop. The referenced issue explains the problem in greater detail, so readers are directed there for the genesis of the problem. A lot has been discussed in this PR and a plan forward has been drafted. The following summarizes the discussion (a rough sketch of the core idea follows the list):

  • Remove syncs w.r.t. node state changes in the CCM in 1.25 (a follow-up PR covering the remaining node conditions is to be expected once this merges)
  • Maybe / possibly / hopefully backport that to the oldest supported Kube version that also supports ETP=local services
  • Have the reported status of the LB HC configured for ETP=local services reflect the node state in 1.25 (also in the form of a follow-up PR)
  • Write a KEP for dealing with transitioning node state for all service LBs.
  • Kube versions after the KEP: remove the sync logic for node state changes for all services and move it to the HC
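
For readers skimming the change, a rough, hypothetical sketch of the core idea follows (the helper names are made up for illustration and this is not the actual diff): a node update that only flips the `Ready` condition should no longer trigger a re-configuration of the LBs fronting ETP=local services.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// nodeReady returns the value of the node's Ready condition.
func nodeReady(node *v1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == v1.NodeReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}

// shouldTriggerLBSync is a hypothetical helper: it decides whether a node
// update warrants re-configuring the LBs of ETP=local services. A real
// implementation would also compare exclusion labels, taints, deletion
// timestamps, etc.; only the readiness comparison is shown here.
func shouldTriggerLBSync(oldNode, newNode *v1.Node) bool {
	if nodeReady(oldNode) != nodeReady(newNode) {
		// A Ready/NotReady transition by itself no longer forces a re-sync;
		// the LB's healthCheckNodePort probe already reflects node health.
		return false
	}
	return true
}
```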

Which issue(s) this PR fixes:

Partially fixes #111539 for ETP=local services; specifically, it reduces the number of cloud API calls and the service downtime caused by excessive re-configuration of all cluster LBs when node readiness transitions.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Reduce the number of cloud API calls and the service downtime caused by excessive re-configuration of cluster LBs with externalTrafficPolicy=Local when node readiness changes (https://github.com/kubernetes/kubernetes/issues/111539). The service controller (in cloud-controller-manager) will avoid re-syncing nodes which are transitioning between `Ready` / `NotReady` (only for ETP=Local services). The LBs used for these services will rely solely on the health check probe defined by the `healthCheckNodePort` to determine whether a particular node is to be used for traffic load balancing.
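
For context, the probe the release note refers to is kube-proxy's per-service health-check server on the `healthCheckNodePort`. A minimal sketch of what a cloud LB effectively does against it (node IP and port below are placeholder values):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeHealthCheckNodePort mimics the HTTP probe a cloud LB runs against
// kube-proxy's health-check server for an externalTrafficPolicy=Local
// service: HTTP 200 means the node currently hosts local endpoints for the
// service and may receive LB traffic.
func probeHealthCheckNodePort(nodeIP string, healthCheckNodePort int) (bool, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:%d/healthz", nodeIP, healthCheckNodePort))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	// Placeholder node IP and healthCheckNodePort; a real LB probes every
	// node in its pool on a fixed interval.
	healthy, err := probeHealthCheckNodePort("10.0.0.12", 30342)
	fmt.Println(healthy, err)
}
```

A 200 response means the node currently hosts local endpoints for the service; anything else takes the node out of rotation, without any LB re-configuration.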

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig network
/sig cloud-provider

@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.24 branch. This means every merged PR has to be cherry-picked into the release branch to be part of the upcoming v1.24.0 release.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 28, 2022
@k8s-ci-robot
Contributor

Hi @alexanderConstantinescu. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/cloudprovider labels Apr 28, 2022
@k8s-ci-robot k8s-ci-robot added area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 28, 2022
@leilajal
Contributor

/remove-sig api-machinery
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 28, 2022
@bowei
Member

bowei commented Apr 28, 2022

cc: @bowei

@bowei
Member

bowei commented Apr 28, 2022

/assign @bowei -- I'm very interested in poking more at the lifecycle of the LB node

@thockin thockin self-assigned this Apr 28, 2022
@alexanderConstantinescu
Member Author

alexanderConstantinescu commented Apr 28, 2022

Some additional notes below from the sig-net discussion that just ended. You can also find the slides I presented here: https://docs.google.com/presentation/d/1EsK5FLS4s6E3Ok42XhRz643YtB9uZHn3wohWvcMyHag/edit#slide=id.p

My third slide presents some ideas around the possibility of removing everything the CCM does w.r.t. synchronizing nodes transitioning between NotReady <-> Ready.

  1. Is this PR, which eliminates the mentioned sync for ETP=local services from the CCM's node sync loop.
  2. Is removing that sync completely for all services, irrespective of whether they are ETP=local or not.
    a. Is removing that sync completely, but only for a selected set of services, indicated by the user via an annotation on the service object.
  3. Is removing that sync completely for ETP=local services from the CCM, but replacing it by having kube-proxy dial the kubelet's read-only port when answering the probe the LB executes against the HealthCheckNodePort. I.e.: kube-proxy would keep the current behavior (validating that the node in question actually hosts endpoints for the service) and AND that with the kubelet's reported state. The LB would continue executing one probe against the HealthCheckNodePort, but kube-proxy would open a local connection to the kubelet's read-only port on the node for every LB probe. There was some confusion surrounding this during the meeting, hence why I've chosen to detail it here. This would result in the same behavior as today, but be much more dynamic and efficient at handling transitioning node state: instead of re-configuring the LB with a new set of nodes, the LB simply probes the node on the HealthCheckNodePort as it currently does, with the result also reflecting the kubelet/node readiness state (see the sketch after this list).
  4. Is removing that sync completely for all services from the CCM. Instead we define a standardized behavior for health checking nodes listed as backends for LoadBalancer-type services, which all cloud provider implementations must adhere to. All cloud providers today implement this interface: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/cloud.go#L133-L162 , however when the service is ETP != local they are free to specify custom probes on their LBs. What they do today is described at the bottom of this PR's description. If we align the behavior for all cloud implementations, we could have them include a probe against the kubelet's read-only port, or do as in the ETP=local case and have the LB query kube-proxy, which performs multiple checks. In the end, LBs only really care to know whether the node can serve traffic and whether the service's NodePort on the node is reachable. All three clouds have chosen one of these two checks, not both. GCE's LB check is actually superfluous, since the CCM also checks the same condition (i.e. the kubelet state/node readiness).
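
To make option 3 concrete, here is a hypothetical sketch (not part of this PR; the function names are made up): kube-proxy's answer to the LB probe on the `healthCheckNodePort` would AND its existing local-endpoint check with the kubelet's readiness as reported on the read-only port.

```go
package sketch

import (
	"fmt"
	"net/http"
	"time"
)

// kubeletReady queries the kubelet's read-only port on the local node
// (10255 by default, when the read-only port is enabled); a 200 from
// /healthz is taken as "the node is ready".
func kubeletReady() bool {
	client := &http.Client{Timeout: time.Second}
	resp, err := client.Get("http://127.0.0.1:10255/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// answerLBProbe is a hypothetical handler for the LB probe arriving on the
// healthCheckNodePort. localEndpoints is the count kube-proxy already
// tracks for the service on this node.
func answerLBProbe(w http.ResponseWriter, localEndpoints int) {
	if localEndpoints > 0 && kubeletReady() {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintf(w, `{"localEndpoints": %d}`, localEndpoints)
		return
	}
	// No local endpoints, or kubelet not ready: tell the LB to steer
	// traffic away from this node.
	w.WriteHeader(http.StatusServiceUnavailable)
}
```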

Options 3 and 4 will require an enhancement proposal and are beyond the scope of this PR, but we can discuss them here (or I can open a KEP for them and we can discuss them there...?)

This PR only concerns itself with finding a solution for 1 and/or 2, simply because this is currently a bug for clusters at scale and, I believe, we should backport the fix to older releases.

I am actually in favour of doing 2.a, meaning: modifying this PR slightly and removing any relation to ETP=local services and to checking where endpoints are hosted. I would instead prefer we define an annotation (e.g. service.beta.kubernetes.io/lb-ignore-node-not-ready) which a user can set on the service object to tell the CCM to ignore this service for node readiness changes. The implementation effort would be minimal and it would allow the user to solve both problems presented in my first slides. I, as a user of Kubernetes, should be allowed to disable the CCM's sync for node readiness state if I don't want my LBs to be re-configured because of it. I own my nodes and LBs and should thus be allowed to decide what's best.
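
For illustration, the CCM-side check for such an annotation would be minimal. A hypothetical sketch (the annotation key is only the example proposed above, not an implemented API):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Example annotation key as proposed in this comment; it is not an
// implemented Kubernetes API.
const lbIgnoreNodeNotReadyAnnotation = "service.beta.kubernetes.io/lb-ignore-node-not-ready"

// ignoreNodeReadinessFor reports whether readiness-only node changes should
// be ignored when deciding to re-sync this service's LB.
func ignoreNodeReadinessFor(svc *v1.Service) bool {
	return svc.Annotations[lbIgnoreNodeNotReadyAnnotation] == "true"
}
```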

If we can't agree on 2.a, then 1 (the current state of this PR) will at least help reduce the time taken by the CCM to sync all LBs on large clusters with a lot of ETP=local services, which is already an improvement over today.

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 4, 2022
@alexanderConstantinescu
Member Author

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 4, 2022
@danwinship
Contributor

/ok-to-test

@thockin
Member

thockin commented Aug 3, 2022

@kubernetes/sig-scalability
@wojtek-t
@mborsz

FYI re scale - this should be an improvement for real clusters (minor because etp=Local, but eventually for all services)

@alexanderConstantinescu
Member Author

alexanderConstantinescu commented Aug 4, 2022

@thockin : I am concerned the first commit might be wrong, specifically around removing nodes from the LB set based on the schedulability predicate.

It seems that was removed with #90823, which listed a production outage caused by the fact that we were syncing LBs when the schedulability constraint changed on a set of nodes.

This caused a production outage when nodes in an autoscaling group were marked unschedulable so the autoscaler would gracefully terminate them - instead the ingress controllers on those nodes were immediately removed from the LB pool and no new ingress requests could be accepted.

That makes me wonder why we even decided to sync updates to nodes which are experiencing a change to this attribute at all. That was added with PR: #90769 - which doesn't list any good reasons for adding it...

So:

Now it seems we've done the inverse. Sorry for only realizing this now...

k8s-ci-robot pushed a commit that referenced this pull request Mar 9, 2023
k8s-ci-robot added a commit that referenced this pull request Mar 9, 2023
…-taint-unsched

[CCM - service controller] addressing left over comments from #109706
k8s-publishing-bot pushed a commit to kubernetes/cloud-provider that referenced this pull request Mar 9, 2023
Specifically:

- kubernetes/kubernetes#109706 (comment)
- kubernetes/kubernetes#109706 (comment)

Kubernetes-commit: 9ca3d0ec157b57df4a9bf406cba39317a0aecdd2

if err := c.lockedUpdateLoadBalancerHosts(svc, hosts); err != nil {
Member


this may have been locked some time ago, but I can't see where it is locked now

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/cloudprovider
area/ipvs
area/test
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/bug - Categorizes issue or PR as related to a bug.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
sig/cloud-provider - Categorizes an issue or PR as relevant to SIG Cloud Provider.
sig/network - Categorizes an issue or PR as relevant to SIG Network.
sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
triage/accepted - Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[scale] node state changes causes excessive LB re-configurations
10 participants