
Starting in 1.25 clusters, services of type=LB and xTP=Local sometimes do not update node backends on load balancers #112793

Closed
swetharepakula opened this issue Sep 29, 2022 · 15 comments
Assignees: thockin, alexanderConstantinescu
Labels: area/controller-manager, kind/bug, kind/regression, priority/critical-urgent, sig/network, triage/accepted

Comments

@swetharepakula (Member) commented Sep 29, 2022

What happened?

When upgrading nodes from 1.24 to 1.25 on a cluster whose master is already at 1.25, I noticed that my Services of type=LoadBalancer and xTP=Local have an incorrect set of nodes after the nodes have been upgraded. The set contains only the old nodes, which no longer exist, resulting in my service being unavailable through my load balancer.

What did you expect to happen?

I would expect the load balancer to be properly updated with the new set of nodes after the upgrade.

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a 1.25 cluster with 1.24 nodes.
  2. Deploy a Service of type=LoadBalancer and xTP=Local.
  3. Upgrade the 1.24 nodes to 1.25.
  4. After the upgrade is finished, look at the node list for the load balancer.

Anything else we need to know?

The existing logging is not enough to diagnose the issue. I added some more logs and ran the KCM at log level 5 to find the root cause.

A change was introduced to reduce the number of syncs for xTP=Local services: #109706. With this change, there are situations where xTP=Local services never get updated.

The following is the chain of events.

  1. A node is created or deleted, which causes triggerNodeSync():

AddFunc: func(cur interface{}) {
	s.triggerNodeSync()
},
UpdateFunc: func(old, cur interface{}) {
	oldNode, ok := old.(*v1.Node)
	if !ok {
		return
	}
	curNode, ok := cur.(*v1.Node)
	if !ok {
		return
	}
	if !shouldSyncUpdatedNode(oldNode, curNode) {
		return
	}
	s.triggerNodeSync()
},
DeleteFunc: func(old interface{}) {
	s.triggerNodeSync()
},
},

  2. Inside triggerNodeSync(), the listing from the nodeLister (line 264) is filtered for Ready nodes only, so it does not include the new node (or still contains the deleted node), which means that c.needFullSync = false when line 281 is executed. A minimal sketch of this Ready-only filtering follows the code below.

// triggerNodeSync triggers a nodeSync asynchronously
func (c *Controller) triggerNodeSync() {
	c.nodeSyncLock.Lock()
	defer c.nodeSyncLock.Unlock()
	newHosts, err := listWithPredicates(c.nodeLister, allNodePredicates...)
	if err != nil {
		runtime.HandleError(fmt.Errorf("Failed to retrieve current set of nodes from node lister: %v", err))
		// if node list cannot be retrieve, trigger full node sync to be safe.
		c.needFullSync = true
	} else if !nodeSlicesEqualForLB(newHosts, c.knownHosts) {
		// Here the last known state is recorded as knownHosts. For each
		// LB update, the latest node list is retrieved. This is to prevent
		// a stale set of nodes were used to be update loadbalancers when
		// there are many loadbalancers in the clusters. nodeSyncInternal
		// would be triggered until all loadbalancers are updated to the new state.
		klog.V(2).Infof("Node changes detected, triggering a full node sync on all loadbalancer services")
		c.needFullSync = true
		c.knownHosts = newHosts
	}
	select {
	case c.nodeSyncCh <- struct{}{}:
		klog.V(4).Info("Triggering nodeSync")
		return
	default:
		klog.V(4).Info("A pending nodeSync is already in queue")
		return
	}
}
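
To illustrate the comparison above, here is a minimal, self-contained sketch (using simplified structs rather than the real *v1.Node type, and a stand-in for the upstream Ready predicate) of why a just-created, not-yet-Ready node is invisible to this snapshot and leaves needFullSync unset:

package main

import "fmt"

// node is a simplified stand-in for *v1.Node; only the fields needed for
// this illustration.
type node struct {
	name  string
	ready bool
}

// readyOnly mimics the effect of listing with a Ready predicate, the way
// triggerNodeSync builds newHosts.
func readyOnly(nodes []node) map[string]bool {
	out := map[string]bool{}
	for _, n := range nodes {
		if n.ready {
			out[n.name] = true
		}
	}
	return out
}

func equal(a, b map[string]bool) bool {
	if len(a) != len(b) {
		return false
	}
	for k := range a {
		if !b[k] {
			return false
		}
	}
	return true
}

func main() {
	// Last snapshot, taken while only the old node existed and was Ready.
	knownHosts := readyOnly([]node{{name: "old-node", ready: true}})

	// A new node has just been created but has not reported Ready yet.
	newHosts := readyOnly([]node{
		{name: "old-node", ready: true},
		{name: "new-node", ready: false}, // filtered out by the Ready check
	})

	// The sets compare equal, so needFullSync stays false even though the
	// cluster membership actually changed.
	fmt.Println("needFullSync:", !equal(newHosts, knownHosts)) // prints false
}

The node change only becomes visible to this check once the node reports Ready, which is what produces the later full sync described below.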

  3. Following the chain of functions that are called (across goroutines communicating via nodeSyncCh), we end up at nodeSyncInternal. Because c.needFullSync = false, we only sync services that were marked for retry. If the previous state was good, c.servicesToUpdate has 0 services before entering updateLoadBalancerHosts.

	if !c.needFullSyncAndUnmark() {
		// The set of nodes in the cluster hasn't changed, but we can retry
		// updating any services that we failed to update last time around.
		// It is required to call `c.cache.get()` on each Service in case there was
		// an update event that occurred between retries.
		var servicesToUpdate []*v1.Service
		for key := range c.servicesToUpdate {
			cachedService, exist := c.cache.get(key)
			if !exist {
				klog.Errorf("Service %q should be in the cache but not", key)
				continue
			}
			servicesToUpdate = append(servicesToUpdate, cachedService.state)
		}
		c.servicesToUpdate = c.updateLoadBalancerHosts(ctx, servicesToUpdate, workers)
		return

  4. Nodes are queried again from the nodeLister, but this time the new node (or the removal of the deleted node) is reflected.

func (c *Controller) updateLoadBalancerHosts(ctx context.Context, services []*v1.Service, workers int) (servicesToRetry sets.String) {
	klog.V(4).Infof("Running updateLoadBalancerHosts(len(services)==%d, workers==%d)", len(services), workers)
	// Include all nodes and let nodeSyncService filter and figure out if
	// the update is relevant for the service in question.
	nodes, err := listWithPredicates(c.nodeLister)
	if err != nil {
		runtime.HandleError(fmt.Errorf("failed to retrieve node list: %v", err))
		return serviceKeys(services)
	}
	// lock for servicesToRetry
	servicesToRetry = sets.NewString()
	lock := sync.Mutex{}
	doWork := func(piece int) {
		if shouldRetry := c.nodeSyncService(services[piece], c.lastSyncedNodes, nodes); !shouldRetry {
			return
		}
		lock.Lock()
		defer lock.Unlock()
		key := fmt.Sprintf("%s/%s", services[piece].Namespace, services[piece].Name)
		servicesToRetry.Insert(key)
	}
	workqueue.ParallelizeUntil(ctx, workers, len(services), doWork)
	c.lastSyncedNodes = nodes
	klog.V(4).Infof("Finished updateLoadBalancerHosts")
	return servicesToRetry
}

  5. nodeSyncService is then parallelized based on the length of services. In this case there are no services, so no updates are performed, but on line 808 c.lastSyncedNodes is still set to the nodes found in step 4, as shown in the sketch below.
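
A minimal sketch of this no-op pass, assuming only that workqueue.ParallelizeUntil runs its work function zero times when given zero pieces; the local variables below are hypothetical stand-ins for the controller fields:

package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// No services were marked for retry, so the slice handed to
	// updateLoadBalancerHosts is empty.
	services := []string{}

	// Stand-ins for c.lastSyncedNodes and the fresh, unfiltered listing.
	lastSyncedNodes := []string{"old-node"}
	freshNodes := []string{"new-node"}

	workqueue.ParallelizeUntil(context.TODO(), 5, len(services), func(piece int) {
		// Never executed: there are zero pieces of work.
		fmt.Println("syncing", services[piece])
	})

	// The snapshot is advanced anyway, so the node change is "consumed"
	// without any load balancer ever being updated.
	lastSyncedNodes = freshNodes
	fmt.Println("lastSyncedNodes:", lastSyncedNodes)
}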

A subsequent full sync does not fix things. We go through steps 1 through 5 again, the difference being c.needFullSync=true. This means that in steps 4 and 5, c.servicesToUpdate is not empty, resulting in the following:

  1. Inside nodeSyncService we filter based on predicates for xTP=Local and xTP=Cluster. In the xTP=Local case, since the predicates do not pay attention to Ready status, all of the nodes in c.lastSyncedNodes become the oldNodes. And since no node creations or deletions have occurred since step 5 overwrote c.lastSyncedNodes, the newNodes are the same set. This results in no sync (line 767).

func (c *Controller) nodeSyncService(svc *v1.Service, oldNodes, newNodes []*v1.Node) bool {
	retSuccess := false
	retNeedRetry := true
	if svc == nil || !wantsLoadBalancer(svc) {
		return retSuccess
	}
	newNodes = filterWithPredicates(newNodes, getNodePredicatesForService(svc)...)
	oldNodes = filterWithPredicates(oldNodes, getNodePredicatesForService(svc)...)
	if nodeNames(newNodes).Equal(nodeNames(oldNodes)) {
		return retSuccess
	}
	klog.V(4).Infof("nodeSyncService started for service %s/%s", svc.Namespace, svc.Name)
	if err := c.lockedUpdateLoadBalancerHosts(svc, newNodes); err != nil {
		runtime.HandleError(fmt.Errorf("failed to update load balancer hosts for service %s/%s: %v", svc.Namespace, svc.Name, err))
		return retNeedRetry
	}
	klog.V(4).Infof("nodeSyncService finished successfully for service %s/%s", svc.Namespace, svc.Name)
	return retSuccess
}

In the xTP=Cluster case, node Ready status is used to filter the nodes, so c.lastSyncedNodes would already contain the newly created node, but it would not yet be Ready. This means the node will not be part of oldNodes but will exist in newNodes, which allows the sync to continue as expected. A sketch contrasting the two predicate sets follows.
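
To make the contrast concrete, here is a minimal, self-contained sketch (simplified structs and hand-rolled predicates, not the real v1.Node type or the exact upstream getNodePredicatesForService helpers) showing why the filtered old and new node-name sets compare equal for xTP=Local but differ for xTP=Cluster:

package main

import "fmt"

// node is a simplified stand-in for *v1.Node carrying the state captured at
// snapshot time.
type node struct {
	name  string
	ready bool
}

type predicate func(node) bool

func notExcluded(n node) bool { return true } // placeholder for the exclude-from-LB check
func isReady(n node) bool     { return n.ready }

// predicatesFor roughly models the idea behind getNodePredicatesForService:
// xTP=Local services skip the Ready check (health checks cover readiness),
// xTP=Cluster services include it.
func predicatesFor(etpLocal bool) []predicate {
	if etpLocal {
		return []predicate{notExcluded}
	}
	return []predicate{notExcluded, isReady}
}

func names(nodes []node, preds []predicate) map[string]bool {
	out := map[string]bool{}
outer:
	for _, n := range nodes {
		for _, p := range preds {
			if !p(n) {
				continue outer
			}
		}
		out[n.name] = true
	}
	return out
}

func equal(a, b map[string]bool) bool {
	if len(a) != len(b) {
		return false
	}
	for k := range a {
		if !b[k] {
			return false
		}
	}
	return true
}

func main() {
	// oldNodes: the snapshot recorded in step 5, taken while new-node existed
	// but was still NotReady.
	oldNodes := []node{{"old-node", true}, {"new-node", false}}
	// newNodes: the current listing during the full sync; new-node is Ready now.
	newNodes := []node{{"old-node", true}, {"new-node", true}}

	local := predicatesFor(true)
	cluster := predicatesFor(false)

	// xTP=Local ignores readiness, so both filtered sets are {old-node, new-node}
	// and the sync is skipped.
	fmt.Println("Local needs sync:  ", !equal(names(oldNodes, local), names(newNodes, local)))
	// xTP=Cluster filters on readiness, so new-node is absent from oldNodes but
	// present in newNodes, and the sync proceeds.
	fmt.Println("Cluster needs sync:", !equal(names(oldNodes, cluster), names(newNodes, cluster)))
}

This is why xTP=Cluster services recover on the subsequent full sync while xTP=Local services stay stuck with the stale backend list.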

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2", GitCommit:"5835544ca568b757a8ecae5c153f317e5736700e", GitTreeState:"clean", BuildDate:"2022-09-21T14:33:49Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2-gke.300", GitCommit:"6f9a8e57036ff71785ef9c90998437413a3a8ff5", GitTreeState:"clean", BuildDate:"2022-09-26T09:26:16Z", GoVersion:"go1.19.1 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

Tested on GKE; however, this should theoretically be possible on any 1.25+ cluster.

OS version

N/A

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

N/A

@swetharepakula swetharepakula added the kind/bug Categorizes issue or PR as related to a bug. label Sep 29, 2022
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 29, 2022
@k8s-ci-robot (Contributor):

@swetharepakula: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 29, 2022
@swetharepakula (Member, Author):

/sig network

/cc @thockin @alexanderConstantinescu @bowei @panslava

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 29, 2022
@aojea (Member) commented Sep 29, 2022

/cc

@thockin thockin self-assigned this Sep 29, 2022
@thockin thockin added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on. area/controller-manager kind/regression Categorizes issue or PR as related to a regression from a prior release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 29, 2022
@swetharepakula (Member, Author):

I have tested with the code on master and looked through the changes made since 1.25 was cut, and this issue seems to be fixed there. I believe this issue is specific to 1.25.

@alexanderConstantinescu (Member):

/assign

I filed #112798, which provides an in-depth answer as to why that fixes the problem.

@liggitt (Member) commented Oct 6, 2022

with #112807 merged, is this resolved?

@alexanderConstantinescu (Member):

Yes, it should be. I think this was verified by @swetharepakula last Friday?

@swetharepakula (Member, Author):

This is resolved with #112807

/close

@k8s-ci-robot (Contributor):

@swetharepakula: Closing this issue.

In response to this:

This is resolved with #112807

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lobziik added a commit to openshift/cloud-provider-aws that referenced this issue Oct 13, 2022
…sion

Service handling logic in k8s.io/cloud-provider v0.25.(0,1,2)
is broken. This issue was fixed upstream, but the fix has not been
released yet. We expect the fix to be included in v0.25.3, but for
now we need to replace the released `cloud-provider` library version
with the latest version from the `release-1.25` branch.

Xref: kubernetes/kubernetes#112793

This commit is expected to be dropped during next rebase.
@JoelSpeed (Contributor):

@alexanderConstantinescu The revert that you put up in #112807 went directly into the release-1.25 branch. I believe I am seeing the same issue again on release-1.26 and release-1.27. Did a revert or fix go up to the master branch that I've missed?

@alexanderConstantinescu (Member):

@JoelSpeed: There was no revert PR needed on master, because master (at that time) didn't have the bug. This is the PR I filed which attempted to fix the issue on 1.25: #112798 but which later got superseded by the revert PR. It explains the issue in greater depth.

What issue are you seeing?

@JoelSpeed (Contributor):

Observing on Azure, with the in-tree cloud provider in 1.27 (this could be a bug directly in the Azure code to be fair but wanted to rule this bug out too)

  • KCM boots up on new cluster
  • New worker nodes are added to the cluster, CSRs approved etc
  • Workers do not get added to the LB on first pass because they are not ready (Azure specific? We are using stableNodeSetPredicates)
  • Workers eventually go ready, but EnsureLoadBalancer never called again

The symptoms look the same as this bug, but I'm guessing that, now that we are using stableNodeSetPredicates, the Azure provider cannot exclude nodes based on their readiness, since the service controller will no longer tell it to update the nodes in the load balancer for an eTP: Local service?

@JoelSpeed (Contributor):

I think I found the issue, kubernetes-sigs/cloud-provider-azure#4230

@liggitt (Member) commented Sep 9, 2023

@JoelSpeed: There was no revert PR needed on master, because master (at that time) didn't have the bug. This is the PR I filed which attempted to fix the issue on 1.25: #112798 but which later got superseded by the revert PR. It explains the issue in greater depth.

The revert PR (#112807) only landed in the release-1.25 branch, and #112798 never merged... what resolved this issue in master and 1.26+?

Can you link to the later PRs in master that fixed the issue?

@alexanderConstantinescu (Member) commented Sep 11, 2023

what resolved this issue in master and 1.26+?

This bug was fixed on 1.26 during the feature cycle. Here's a timeline of events:

  1. [1.25 feature cycle]: PR Avoid re-syncing LBs for ETP=local services #109706 merges on 1.25 with a bug
  2. [1.26 feature cycle]: PR [CCM - service controller] Clean up node sync and fix re-sync of failed services.  #111663 merges on 1.26 with the fix to said bug (the bug was not known at the time, and this PR was originally created as a cleanup but ended up also fixing the bug)
  3. [1.25 code freeze cycle]: PR Revert 109706 + 111691 #112807 merges on 1.25, reverting 1.

I.e., we could have applied #112798 in step 3, but chose to revert instead.

Let me know if this is not clear?
