minimize iptables-restore input #110268

Merged

Conversation

danwinship
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, kube-proxy rewrites every iptables rule on every resync. In very large clusters, this ends up being a very large number of rules, and iptables-restore is not very efficient.

Most of the time, most of the rules that kube-proxy rewrites have not changed since the last sync; generally speaking, some small number of services will have gained or lost one endpoint since the last sync, and the other 10,000 minus N services have exactly the same iptables rules as they did last time.

iptables-restore doesn't require you to rewrite every chain; it just enforces the rule that if you want to update any rule in a chain, you have to update every rule in that chain. But if a chain doesn't need to change at all, you can just not include any rules for it, and iptables-restore will leave it untouched.
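For illustration, a partial restore input touching a single hypothetical service might look like this (chain names and addresses invented). When run with --noflush, as kube-proxy does, any chain not mentioned at all is left untouched, while a ":CHAIN" declaration line flushes that chain so its rules can be rewritten:

*nat
:KUBE-SERVICES - [0:0]
:KUBE-SVC-EXAMPLE - [0:0]
:KUBE-SEP-EXAMPLE - [0:0]
-A KUBE-SERVICES -d 10.0.0.10/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-EXAMPLE
-A KUBE-SEP-EXAMPLE -p tcp -j DNAT --to-destination 10.244.0.5:8080
COMMIT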

So, following up on #110266, which reorganized the internal syncProxyRules loop logic, this fixes the loop so that if a service hasn't changed since the last sync, then we don't rewrite its rules. (With this PR, we still rewrite all of KUBE-SERVICES / KUBE-EXTERNAL-SERVICES / KUBE-NODEPORTS on every sync, though in theory we could avoid doing that if only endpoints changed.)

If we somehow get out of sync and, eg, fail to create a chain we should have created and then forget that we failed, this will be caught on the next sync. (KUBE-SERVICES will contain a rule to jump to the non-existent chain, which will cause the iptables-restore to fail, which will cause the proxier to force a full resync.)
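As a toy sketch of the approach (invented names and a toy data model, not the actual syncProxyRules code), the restore input simply omits every chain belonging to an unchanged service:

package main

import (
	"bytes"
	"fmt"
)

// buildRestoreInput writes *nat restore input containing only the chains of
// services that changed since the last sync; everything else is omitted, so
// iptables-restore (with --noflush) leaves those chains untouched.
func buildRestoreInput(rulesByService map[string][]string, changed map[string]bool) string {
	var buf bytes.Buffer
	buf.WriteString("*nat\n")
	for svc, rules := range rulesByService {
		if !changed[svc] {
			continue // unchanged service: don't mention its chains at all
		}
		buf.WriteString(fmt.Sprintf(":KUBE-SVC-%s - [0:0]\n", svc))
		for _, rule := range rules {
			buf.WriteString(rule + "\n")
		}
	}
	buf.WriteString("COMMIT\n")
	return buf.String()
}

func main() {
	rules := map[string][]string{
		"AAAA": {"-A KUBE-SVC-AAAA -j KUBE-SEP-WWWW"},
		"BBBB": {"-A KUBE-SVC-BBBB -j KUBE-SEP-XXXX"},
	}
	// Only AAAA changed, so BBBB's chain never appears in the output.
	fmt.Print(buildRestoreInput(rules, map[string]bool{"AAAA": true}))
}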

Which issue(s) this PR fixes:

none

Special notes for your reviewer:

The branch includes both #110266 and #109844, both of which should merge (in either order) before this merges. Only the last two commits are new to this branch.

Does this PR introduce a user-facing change?

The iptables kube-proxy backend should process service/endpoint changes
more efficiently in very large clusters.

/sig network
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. sig/network Categorizes an issue or PR as relevant to SIG Network. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 28, 2022
@k8s-ci-robot k8s-ci-robot requested review from dcbw and freehan May 28, 2022 17:48
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2022
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 10, 2022
@thockin thockin self-assigned this Jun 23, 2022
@aojea
Member

aojea commented Jun 28, 2022

Naive question: is it not possible to just diff the iptables data buffer directly?

@danwinship
Contributor Author

danwinship commented Jun 28, 2022

What we need isn't quite a "diff". E.g., if 1 rule in a chain changes, we have to send the whole chain, not just the single changed rule.

We could compare the old input against the new input and figure out which chains we need to include... but that would require having two copies of the complete iptables output, which would be a bunch of memory, and it wouldn't be especially efficient to do it against the stringified form because the output isn't sorted...

Are there any particular things you don't like about the way the PR does it? I think the PendingChanges functions are kind of ugly, but fixing that seemed like it would go better as part of a larger refactoring of the change tracker stuff... (which, to be clear, I don't actually plan to do, since kpng does the change tracking in a totally different way anyway)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 29, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 29, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2022
Member

@dgrisonnet dgrisonnet left a comment

new metric looks good to me

Comment on lines +129 to +139
	// IptablesPartialRestoreFailuresTotal is the number of iptables *partial* restore
	// failures (resulting in a fall back to a full restore) that the proxy has seen.
	IptablesPartialRestoreFailuresTotal = metrics.NewCounter(
		&metrics.CounterOpts{
			Subsystem:      kubeProxySubsystem,
			Name:           "sync_proxy_rules_iptables_partial_restore_failures_total",
			Help:           "Cumulative proxy iptables partial restore failures",
			StabilityLevel: metrics.ALPHA,
		},
	)

Member

@aojea aojea Sep 11, 2022

maybe it is easier if we label the existing metric

diff --git a/pkg/proxy/metrics/metrics.go b/pkg/proxy/metrics/metrics.go
index f61cf4fc991..6fb2089bdc7 100644
--- a/pkg/proxy/metrics/metrics.go
+++ b/pkg/proxy/metrics/metrics.go
@@ -117,13 +117,14 @@ var (
 
        // IptablesRestoreFailuresTotal is the number of iptables restore failures that the proxy has
        // seen.
-       IptablesRestoreFailuresTotal = metrics.NewCounter(
+       IptablesRestoreFailuresTotal = metrics.NewCounterVec(
                &metrics.CounterOpts{
                        Subsystem:      kubeProxySubsystem,
                        Name:           "sync_proxy_rules_iptables_restore_failures_total",
                        Help:           "Cumulative proxy iptables restore failures",
                        StabilityLevel: metrics.ALPHA,
                },
+               []string{"partial_sync", "full_sync"},
        )

Contributor Author

isn't that an API break of the existing metric? (I mean, ok, StabilityLevel: metrics.ALPHA, but still)

Member

I honestly don't know, I thought it was compatible, but better ask an expert
@dgrisonnet can you please advise? Is adding labels to an existing metric backwards compatible?

Member

@logicalhan can you help us with this question? Is adding a label to a metric without labels compatible?

Member

It is a breaking change and you should probably amend the release note accordingly. But it is allowed since the metric is only Alpha.
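For reference, a minimal sketch of what the labeled version would look like, written against the upstream Prometheus client rather than the component-base wrapper kube-proxy actually uses, with a single invented "sync_type" label:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Before: the unlabeled counter exposes a single series:
	//   sync_proxy_rules_iptables_restore_failures_total 2
	// After adding a label, every series carries it:
	//   sync_proxy_rules_iptables_restore_failures_total{sync_type="full"} 1
	//   sync_proxy_rules_iptables_restore_failures_total{sync_type="partial"} 1
	// The unlabeled series disappears, which is why this breaks existing queries.
	failures := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "sync_proxy_rules_iptables_restore_failures_total",
			Help: "Cumulative proxy iptables restore failures",
		},
		[]string{"sync_type"},
	)
	prometheus.MustRegister(failures)
	failures.WithLabelValues("full").Inc()
	failures.WithLabelValues("partial").Inc()

	// Serve /metrics so the series above can be inspected with curl.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}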

// Called by the iptables.Monitor, and in response to topology changes; this calls
// syncProxyRules() and tells it to resync all services, regardless of whether the
// Service or Endpoints/EndpointSlice objects themselves have changed
func (proxier *Proxier) forceSyncProxyRules() {
Member

if !proxier.largeClusterMode || time.Since(proxier.lastIPTablesCleanup) > proxier.syncPeriod {

I think line 1393 now has to switch to this fullSync condition, right?

Contributor Author

It doesn't; activeNATChains gets filled in correctly even for the chains we don't actually update (that's much of what #110266 did) so we can compare that against the iptables-save data and get the right set of chains to delete.

(Though this is certainly something that we'll have to test carefully.)
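Roughly, as a toy sketch (invented names, not the actual proxier code): chains that iptables-save reports as existing but that are no longer in activeNATChains are the ones to delete:

package main

import "fmt"

// staleChains returns chains that exist in the kernel (per iptables-save)
// but are no longer recorded as active; activeNATChains is filled in even
// for services whose rules were skipped, so the comparison stays correct.
func staleChains(existing, active map[string]bool) []string {
	var stale []string
	for chain := range existing {
		if !active[chain] {
			stale = append(stale, chain) // in the kernel, no longer referenced
		}
	}
	return stale
}

func main() {
	existing := map[string]bool{"KUBE-SVC-AAA": true, "KUBE-SVC-BBB": true}
	active := map[string]bool{"KUBE-SVC-AAA": true} // BBB's service was deleted
	fmt.Println(staleChains(existing, active))      // [KUBE-SVC-BBB]
}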

defer func() {
	if !success {
		klog.InfoS("Sync failed", "retryingTime", proxier.syncPeriod)
		proxier.syncRunner.RetryAfter(proxier.syncPeriod)
		if didPartialSync {
Member

do we need didPartialSync?

shouldn't it be enough to check !proxier.needFullSync?

Contributor Author

Hm... if we did that, it would be the same, except that if there were 0 services, or if every service changed since the last sync (and then the sync failed), it would mistakenly call that a failed partial sync.

Anyway, I think the explicit variable makes it clearer...

Member

But don't we want to force sync no matter what if success is false?

Contributor Author

  • If !success && needFullSync then we don't need to set needFullSync=true because it's already true.
  • If !success && !needFullSync && didPartialSync then existing code sets needFullSync=true.
  • If !success && !needFullSync && !didPartialSync then, as above, that implies that either there are 0 services, or else every service changed, and we don't currently set needFullSync=true:
    • If there were 0 services, then the next time we sync, either there will still be 0 services (in which case it doesn't matter whether we do a "full" or "partial" sync) or else there will be non-0 services (in which case all of the services will be in the serviceChanged set and so we will sync all of them regardless of whether we're doing a "full" or "partial" sync). So it doesn't matter whether we set needFullSync=true or not.
    • If there were non-0 services and they had all changed, then the next time we sync... Oof. The next time we sync serviceChanged and endpointsChanged will both be empty because the trackers get reset on every sync whether or not the sync succeeds. So yes, in that case, we need to force needFullSync=true!

Although, it might make more sense to fix the service/endpoint tracking, because the conntrack-cleanup code suffers from this same bug; if a service or endpoint changes, and we need to do conntrack cleanup as a result, but then the next iptables-restore fails, then we will lose track of the fact that we needed to do that conntrack cleanup...
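As a toy illustration of that tracker-reset problem (invented names, not the real change-tracker code): the tracker hands back its pending changes and resets unconditionally, so if the restore that follows fails, the pending work is forgotten unless something else (like needFullSync) remembers it:

package main

import "fmt"

type tracker struct{ changed map[string]bool }

// pendingChanges returns the changed set and resets it, whether or not the
// caller's subsequent iptables-restore succeeds.
func (t *tracker) pendingChanges() map[string]bool {
	c := t.changed
	t.changed = map[string]bool{} // reset happens unconditionally
	return c
}

func main() {
	t := &tracker{changed: map[string]bool{"svc-a": true}}
	fmt.Println(len(t.pendingChanges())) // 1: svc-a needs syncing
	// ...suppose the restore for svc-a now fails...
	fmt.Println(len(t.pendingChanges())) // 0: that svc-a still needs syncing is lost
}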

Contributor Author

Although, it might make more sense to fix the service/endpoint tracking, because the conntrack-cleanup code suffers from this same bug

Filed #112604. It seems non-trivial to fix. So I'll just fall back to doing a full resync in this case.

iptables-restore requires that if you change any rule in a chain, you
have to rewrite the entire chain. But if you avoid mentioning a chain
at all, it will leave it untouched. Take advantage of this by not
rewriting the SVC, SVL, EXT, FW, and SEP chains for services that have
not changed since the last sync, which should drastically cut down on
the size of each iptables-restore in large clusters.
@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2022
@jpbetz
Contributor

jpbetz commented Oct 18, 2022

I looked this over and it all seemed reasonable to me (as someone not familiar with the existing codebase).

@thockin
Member

thockin commented Oct 31, 2022

Thanks!

/lgtm
/approve
/hold

Antonio - any last words?

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 31, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, dgrisonnet, thockin


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2022
@aojea
Member

aojea commented Nov 1, 2022

Antonio - any last words?

/hold cancel

😄

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 1, 2022
@k8s-ci-robot k8s-ci-robot merged commit 3edbebe into kubernetes:master Nov 2, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 2, 2022
@danwinship danwinship deleted the minimize-iptables-changes branch November 3, 2022 14:54
@wojtek-t
Member

I've seen a huge improvement in network programming latency for our scalability tests:
https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Network&metricname=Load_NetworkProgrammingLatency&Metric=NetworkProgrammingLatency

And I initially thought that it would be this PR, but the timing doesn't match - the first good data point is from Nov 10th, ~a week after this got merged.

@danwinship @aojea @thockin - any thoughts on:
(a) why this PR didn't help, and
(b) what helped eventually?

@wojtek-t
Member

heh, it seems I've answered my own question. The feature gate was originally disabled in our tests, and it got enabled by this PR:
kubernetes/test-infra#27927

The timing matches.

So the summary is - this is a great improvement! We're roughly 2x better on all percentiles.
[TBH, I initially thought it would be even better, but that's still a great improvement.]

@danwinship
Contributor Author

danwinship commented Nov 28, 2022

So the summary is - this is a great improvement! We're roughly 2x better on all percentiles.
[TBH, I initially thought it would be even better, but that's still a great improvement.]

FTR this is the first test of the patch at scale; I did proof-of-concept testing to convince myself that it would actually speed things up, but not with an actual kubernetes cluster with tens of thousands of actual services.

I don't know exactly what the perf tests do, but the improvement should be greatest in the case where there is a steady stream of small changes, which kube-proxy is processing more-or-less immediately. E.g., a steady-state cluster where you aren't regularly creating or deleting very large numbers of pods or services; you're just dealing with the fact that individual pods may come and go, services get scaled up and down, occasionally a whole node may go away, etc.

So in particular, if the perf test is both (a) making lots of changes very close together and (b) setting a "large" iptables-min-sync-period, then it will end up aggregating lots of changes together, and even the "minimal" restores might get large. Notably, iptables-min-sync-period should be much less important now, and (with the minimize feature enabled) hopefully all clusters will be able to use --iptables-min-sync-period=1s. (And iptables-sync-period has been mostly irrelevant to performance since #81517.)
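For concreteness, that would be something like the following (MinimizeIPTablesRestore is, as far as I can tell, the feature gate that the test-infra PR above enables):

kube-proxy --proxy-mode=iptables \
    --feature-gates=MinimizeIPTablesRestore=true \
    --iptables-min-sync-period=1s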

Additionally, kube-proxy currently does a handful of non-iptables-restore iptables calls in each sync loop (to ensure that the jump rules from INPUT, PREROUTING, etc, all exist). Those can't be optimized to use iptables-restore instead, but they still suffer from the fact that every call to any iptables binary has to download the entire iptables ruleset from the kernel, so it's possible that if the iptables-restore calls get optimized enough, the non-iptables-restore iptables time starts to dominate... we'd have to add more metrics/logging to be able to test that.
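As a rough sketch of what one such call amounts to (not kube-proxy's actual code, which goes through its own iptables wrapper; requires root and the iptables binary), a check-then-append to ensure a jump rule exists. With the legacy iptables backend, each of these invocations fetches the whole table from the kernel, however large it is:

package main

import (
	"log"
	"os/exec"
)

func main() {
	// "iptables -C" exits nonzero when the rule is missing; "-w" waits for
	// the xtables lock instead of failing if another caller holds it.
	check := exec.Command("iptables", "-w", "-t", "nat",
		"-C", "PREROUTING", "-j", "KUBE-SERVICES")
	if err := check.Run(); err != nil {
		// Rule is missing: append it.
		add := exec.Command("iptables", "-w", "-t", "nat",
			"-A", "PREROUTING", "-j", "KUBE-SERVICES")
		if err := add.Run(); err != nil {
			log.Fatal(err)
		}
	}
}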

@wojtek-t
Member

I don't know exactly what the perf tests do, but the improvement should be greatest in the case where there is a steady stream of small changes, which kube-proxy is processing more-or-less immediately. E.g., a steady-state cluster where you aren't regularly creating or deleting very large numbers of pods or services; you're just dealing with the fact that individual pods may come and go, services get scaled up and down, occasionally a whole node may go away, etc.

To some extent that is what the test is doing. Or to be more specific - it's not really steady state, because we're pushing a significant churn of pods against the cluster (up to 100 pods/s), but that's a very small fraction of all pods in the cluster (150,000 pods), so I would say it mostly satisfies your description.
This is also why I originally wrote that I was expecting somewhat bigger gains from your change.

Additionally, kube-proxy currently does a handful of non-iptables-restore iptables call in each sync loop (to ensure that the jump rules from INPUT, PREROUTING, etc, all exist).

That's a good point - thanks for looking into it.
Let me take a look at your PR.
