minimize iptables-restore input #110268

Merged

Conversation

danwinship
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, kube-proxy rewrites every iptables rule on every resync. In very large clusters, this ends up being a very large number of rules, and iptables-restore is not very efficient.

Most of the time, most of the rules that kube-proxy rewrites have not changed since the last sync; generally speaking, some small number of services will have gained or lost one endpoint since the last sync, and the other 10,000 minus N services have exactly the same iptables rules as they did last time.

iptables-restore doesn't require you to rewrite every chain; it just enforces the rule that if you want to update any rule in a chain, you have to update every rule in that chain. But if a chain doesn't need to change at all, you can just not include any rules for it, and iptables-restore will leave it untouched.
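For illustration, a partial restore input touching a single hypothetical service might look like this (chain names and addresses invented). When run with --noflush, as kube-proxy does, any chain not mentioned at all is left untouched, while a ":CHAIN" declaration line flushes that chain so its rules can be rewritten:

*nat
:KUBE-SERVICES - [0:0]
:KUBE-SVC-EXAMPLE - [0:0]
:KUBE-SEP-EXAMPLE - [0:0]
-A KUBE-SERVICES -d 10.0.0.10/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-EXAMPLE
-A KUBE-SEP-EXAMPLE -p tcp -j DNAT --to-destination 10.244.0.5:8080
COMMIT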

So, following up on #110266, which reorganized the internal syncProxyRules loop logic, this fixes the loop so that if a service hasn't changed since the last sync, then we don't rewrite its rules. (With this PR, we still rewrite all of KUBE-SERVICES / KUBE-EXTERNAL-SERVICES / KUBE-NODEPORTS on every sync, though in theory we could avoid doing that if only endpoints changed.)

If we somehow get out of sync and, eg, fail to create a chain we should have created and then forget that we failed, this will be caught on the next sync. (KUBE-SERVICES will contain a rule to jump to the non-existent chain, which will cause the iptables-restore to fail, which will cause the proxier to force a full resync.)
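As a toy sketch of the approach (invented names and a toy data model, not the actual syncProxyRules code), the restore input simply omits every chain belonging to an unchanged service:

package main

import (
	"bytes"
	"fmt"
)

// buildRestoreInput writes *nat restore input containing only the chains of
// services that changed since the last sync; everything else is omitted, so
// iptables-restore (with --noflush) leaves those chains untouched.
func buildRestoreInput(rulesByService map[string][]string, changed map[string]bool) string {
	var buf bytes.Buffer
	buf.WriteString("*nat\n")
	for svc, rules := range rulesByService {
		if !changed[svc] {
			continue // unchanged service: don't mention its chains at all
		}
		buf.WriteString(fmt.Sprintf(":KUBE-SVC-%s - [0:0]\n", svc))
		for _, rule := range rules {
			buf.WriteString(rule + "\n")
		}
	}
	buf.WriteString("COMMIT\n")
	return buf.String()
}

func main() {
	rules := map[string][]string{
		"AAAA": {"-A KUBE-SVC-AAAA -j KUBE-SEP-WWWW"},
		"BBBB": {"-A KUBE-SVC-BBBB -j KUBE-SEP-XXXX"},
	}
	// Only AAAA changed, so BBBB's chain never appears in the output.
	fmt.Print(buildRestoreInput(rules, map[string]bool{"AAAA": true}))
}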

Which issue(s) this PR fixes:

none

Special notes for your reviewer:

The branch includes both #110266 and #109844, both of which should merge (in either order) before this merges. Only the last two commits are new to this branch.

Does this PR introduce a user-facing change?

The iptables kube-proxy backend should process service/endpoint changes
more efficiently in very large clusters.

/sig network
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. sig/network Categorizes an issue or PR as relevant to SIG Network. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 28, 2022
@k8s-ci-robot k8s-ci-robot requested review from dcbw and freehan May 28, 2022 17:48
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2022
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 10, 2022
@thockin thockin self-assigned this Jun 23, 2022
@aojea
Member

aojea commented Jun 28, 2022

Naive question: is it not possible to just diff the iptables data buffer directly?

@danwinship
Contributor Author

danwinship commented Jun 28, 2022

What we need isn't quite a "diff". E.g., if 1 rule in a chain changes, we have to send the whole chain, not just the single changed rule.

We could compare the old input against the new input and figure out which chains we need to include... but that would require having two copies of the complete iptables output, which would be a bunch of memory, and it wouldn't be especially efficient to do it against the stringified form because the output isn't sorted...

Are there any particular things you don't like about the way the PR does it? I think the PendingChanges functions are kind of ugly, but fixing that seemed like it would go better as part of a larger refactoring of the change tracker stuff... (which, to be clear, I don't actually plan to do, since kpng does the change tracking in a totally different way anyway)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 29, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 29, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2022
Member

@dgrisonnet dgrisonnet left a comment

new metric looks good to me

Comment on lines +129 to +139
	// IptablesPartialRestoreFailuresTotal is the number of iptables *partial* restore
	// failures (resulting in a fall back to a full restore) that the proxy has seen.
	IptablesPartialRestoreFailuresTotal = metrics.NewCounter(
		&metrics.CounterOpts{
			Subsystem:      kubeProxySubsystem,
			Name:           "sync_proxy_rules_iptables_partial_restore_failures_total",
			Help:           "Cumulative proxy iptables partial restore failures",
			StabilityLevel: metrics.ALPHA,
		},
	)

Member

@aojea aojea Sep 11, 2022

maybe it is easier if we label the existing metric

diff --git a/pkg/proxy/metrics/metrics.go b/pkg/proxy/metrics/metrics.go
index f61cf4fc991..6fb2089bdc7 100644
--- a/pkg/proxy/metrics/metrics.go
+++ b/pkg/proxy/metrics/metrics.go
@@ -117,13 +117,14 @@ var (
 
        // IptablesRestoreFailuresTotal is the number of iptables restore failures that the proxy has
        // seen.
-       IptablesRestoreFailuresTotal = metrics.NewCounter(
+       IptablesRestoreFailuresTotal = metrics.NewCounterVec(
                &metrics.CounterOpts{
                        Subsystem:      kubeProxySubsystem,
                        Name:           "sync_proxy_rules_iptables_restore_failures_total",
                        Help:           "Cumulative proxy iptables restore failures",
                        StabilityLevel: metrics.ALPHA,
                },
+               []string{"partial_sync", "full_sync"},
        )

Contributor Author

isn't that an API break of the existing metric? (I mean, ok, StabilityLevel: metrics.ALPHA, but still)

Member

I honestly don't know, I thought it was compatible, but better ask an expert
@dgrisonnet can you please advise? Is adding labels to an existing metric backwards compatible?

Member

@logicalhan can you help us with this question? Is adding a label to a metric without labels compatible?

Member

It is a breaking change and you should probably amend the release note accordingly. But it is allowed since the metric is only Alpha.
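For reference, a minimal sketch of what the labeled version would look like, written against the upstream Prometheus client rather than the component-base wrapper kube-proxy actually uses, with a single invented "sync_type" label:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Before: the unlabeled counter exposes a single series:
	//   sync_proxy_rules_iptables_restore_failures_total 2
	// After adding a label, every series carries it:
	//   sync_proxy_rules_iptables_restore_failures_total{sync_type="full"} 1
	//   sync_proxy_rules_iptables_restore_failures_total{sync_type="partial"} 1
	// The unlabeled series disappears, which is why this breaks existing queries.
	failures := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "sync_proxy_rules_iptables_restore_failures_total",
			Help: "Cumulative proxy iptables restore failures",
		},
		[]string{"sync_type"},
	)
	prometheus.MustRegister(failures)
	failures.WithLabelValues("full").Inc()
	failures.WithLabelValues("partial").Inc()

	// Serve /metrics so the series above can be inspected with curl.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}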

// Called by the iptables.Monitor, and in response to topology changes; this calls
// syncProxyRules() and tells it to resync all services, regardless of whether the
// Service or Endpoints/EndpointSlice objects themselves have changed
func (proxier *Proxier) forceSyncProxyRules() {
Member

if !proxier.largeClusterMode || time.Since(proxier.lastIPTablesCleanup) > proxier.syncPeriod {

I think line 1393 now has to switch to this fullSync condition, right?

Contributor Author

It doesn't; activeNATChains gets filled in correctly even for the chains we don't actually update (that's much of what #110266 did) so we can compare that against the iptables-save data and get the right set of chains to delete.

(Though this is certainly something that we'll have to test carefully.)
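Roughly, as a toy sketch (invented names, not the actual proxier code): chains that iptables-save reports as existing but that are no longer in activeNATChains are the ones to delete:

package main

import "fmt"

// staleChains returns chains that exist in the kernel (per iptables-save)
// but are no longer recorded as active; activeNATChains is filled in even
// for services whose rules were skipped, so the comparison stays correct.
func staleChains(existing, active map[string]bool) []string {
	var stale []string
	for chain := range existing {
		if !active[chain] {
			stale = append(stale, chain) // in the kernel, no longer referenced
		}
	}
	return stale
}

func main() {
	existing := map[string]bool{"KUBE-SVC-AAA": true, "KUBE-SVC-BBB": true}
	active := map[string]bool{"KUBE-SVC-AAA": true} // BBB's service was deleted
	fmt.Println(staleChains(existing, active))      // [KUBE-SVC-BBB]
}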

defer func() {
	if !success {
		klog.InfoS("Sync failed", "retryingTime", proxier.syncPeriod)
		proxier.syncRunner.RetryAfter(proxier.syncPeriod)
		if didPartialSync {
Member

do we need didPartialSync?

shouldn't it be enough to check !proxier.needFullSync?

Contributor Author

Hm... if we did that, it would be the same, except that if there were 0 services, or if every service changed since the last sync (and then the sync failed), it would mistakenly call that a failed partial sync.

Anyway, I think the explicit variable makes it clearer...

Member

But don't we want to force sync no matter what if success is false?

Contributor Author

  • If !success && needFullSync then we don't need to set needFullSync=true because it's already true.
  • If !success && !needFullSync && didPartialSync then existing code sets needFullSync=true.
  • If !success && !needFullSync && !didPartialSync then, as above, that implies that either there are 0 services, or else every service changed, and we don't currently set needFullSync=true:
    • If there were 0 services, then the next time we sync, either there will still be 0 services (in which case it doesn't matter whether we do a "full" or "partial" sync) or else there will be non-0 services (in which case all of the services will be in the serviceChanged set and so we will sync all of them regardless of whether we're doing a "full" or "partial" sync). So it doesn't matter whether we set needFullSync=true or not.
    • If there were non-0 services and they had all changed, then the next time we sync... Oof. The next time we sync serviceChanged and endpointsChanged will both be empty because the trackers get reset on every sync whether or not the sync succeeds. So yes, in that case, we need to force needFullSync=true!

Although, it might make more sense to fix the service/endpoint tracking, because the conntrack-cleanup code suffers from this same bug; if a service or endpoint changes, and we need to do conntrack cleanup as a result, but then the next iptables-restore fails, then we will lose track of the fact that we needed to do that conntrack cleanup...
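As a toy illustration of that tracker-reset problem (invented names, not the real change-tracker code): the tracker hands back its pending changes and resets unconditionally, so if the restore that follows fails, the pending work is forgotten unless something else (like needFullSync) remembers it:

package main

import "fmt"

type tracker struct{ changed map[string]bool }

// pendingChanges returns the changed set and resets it, whether or not the
// caller's subsequent iptables-restore succeeds.
func (t *tracker) pendingChanges() map[string]bool {
	c := t.changed
	t.changed = map[string]bool{} // reset happens unconditionally
	return c
}

func main() {
	t := &tracker{changed: map[string]bool{"svc-a": true}}
	fmt.Println(len(t.pendingChanges())) // 1: svc-a needs syncing
	// ...suppose the restore for svc-a now fails...
	fmt.Println(len(t.pendingChanges())) // 0: that svc-a still needs syncing is lost
}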

Contributor Author

Although, it might make more sense to fix the service/endpoint tracking, because the conntrack-cleanup code suffers from this same bug

Filed #112604. It seems non-trivial to fix. So I'll just fall back to doing a full resync in this case.

iptables-restore requires that if you change any rule in a chain, you
have to rewrite the entire chain. But if you avoid mentioning a chain
at all, it will leave it untouched. Take advantage of this by not
rewriting the SVC, SVL, EXT, FW, and SEP chains for services that have
not changed since the last sync, which should drastically cut down on
the size of each iptables-restore in large clusters.
@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2022
@jpbetz
Contributor

jpbetz commented Oct 18, 2022

I looked this over and it all seemed reasonable to me (as someone not familiar with the existing codebase).

@thockin
Member

thockin commented Oct 31, 2022

Thanks!

/lgtm
/approve
/hold

Antonio - any last words?

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 31, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, dgrisonnet, thockin


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2022
@aojea
Member

aojea commented Nov 1, 2022

Antonio - any last words?

/hold cancel

😄

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 1, 2022
@k8s-ci-robot k8s-ci-robot merged commit 3edbebe into kubernetes:master Nov 2, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 2, 2022
@danwinship danwinship deleted the minimize-iptables-changes branch November 3, 2022 14:54
@wojtek-t
Member

I've seen a huge improvement in network programming latency for our scalability tests:
https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=Network&metricname=Load_NetworkProgrammingLatency&Metric=NetworkProgrammingLatency

And I initially thought that it would be this PR, but the timing doesn't match - the first good data point is from Nov 10th, ~a week after this got merged.

@danwinship @aojea @thockin - any thoughts on:
(a) why this PR didn't help, and
(b) what helped eventually?

@wojtek-t
Member

heh, it seems I've answered my own question. The feature gate was originally disabled in our tests, and it got enabled by this PR:
kubernetes/test-infra#27927

The timing matches.

So the summary is - this is a great improvement! We're roughly 2x better on all percentiles.
[TBH, I initially thought it would be even better, but that's still a great improvement.]

@danwinship
Contributor Author

danwinship commented Nov 28, 2022

So the summary is - this is a great improvement! We're roughly 2x better on all percentiles.
[TBH, I initially thought it would be even better, but that's still a great improvement.]

FTR this is the first test of the patch at scale; I did proof-of-concept testing to convince myself that it would actually speed things up, but not with an actual kubernetes cluster with tens of thousands of actual services.

I don't know exactly what the perf tests do, but the improvement should be greatest in the case where there is a steady stream of small changes, which kube-proxy is processing more-or-less immediately. E.g., a steady-state cluster where you aren't regularly creating or deleting very large numbers of pods or services; you're just dealing with the fact that individual pods may come and go, services get scaled up and down, occasionally a whole node may go away, etc.

So in particular, if the perf test is both (a) making lots of changes very close together and (b) setting a "large" iptables-min-sync-period, then it will end up aggregating lots of changes together, and even the "minimal" restores might get large. Notably, iptables-min-sync-period should be much less important now, and (with the minimize feature enabled) hopefully all clusters will be able to use --iptables-min-sync-period=1s. (And iptables-sync-period has been mostly irrelevant to performance since #81517.)
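For concreteness, that would be something like the following (MinimizeIPTablesRestore is, as far as I can tell, the feature gate that the test-infra PR above enables):

kube-proxy --proxy-mode=iptables \
    --feature-gates=MinimizeIPTablesRestore=true \
    --iptables-min-sync-period=1s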

Additionally, kube-proxy currently does a handful of non-iptables-restore iptables calls in each sync loop (to ensure that the jump rules from INPUT, PREROUTING, etc, all exist). Those can't be optimized to use iptables-restore instead, but they still suffer from the fact that every call to any iptables binary has to download the entire iptables ruleset from the kernel, so it's possible that if the iptables-restore calls get optimized enough, the non-iptables-restore iptables time starts to dominate... we'd have to add more metrics/logging to be able to test that.
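As a rough sketch of what one such call amounts to (not kube-proxy's actual code, which goes through its own iptables wrapper; requires root and the iptables binary), a check-then-append to ensure a jump rule exists. With the legacy iptables backend, each of these invocations fetches the whole table from the kernel, however large it is:

package main

import (
	"log"
	"os/exec"
)

func main() {
	// "iptables -C" exits nonzero when the rule is missing; "-w" waits for
	// the xtables lock instead of failing if another caller holds it.
	check := exec.Command("iptables", "-w", "-t", "nat",
		"-C", "PREROUTING", "-j", "KUBE-SERVICES")
	if err := check.Run(); err != nil {
		// Rule is missing: append it.
		add := exec.Command("iptables", "-w", "-t", "nat",
			"-A", "PREROUTING", "-j", "KUBE-SERVICES")
		if err := add.Run(); err != nil {
			log.Fatal(err)
		}
	}
}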

@wojtek-t
Member

I don't know exactly what the perf tests do, but the improvement should be greatest in the case where there is a steady stream of small changes, which kube-proxy is processing more-or-less immediately. E.g., a steady-state cluster where you aren't regularly creating or deleting very large numbers of pods or services; you're just dealing with the fact that individual pods may come and go, services get scaled up and down, occasionally a whole node may go away, etc.

To some extent that is what the test is doing. Or to be more specific - it's not really steady state, because we're pushing a significant churn of pods against the cluster (up to 100 pods/s), but that's a very small fraction of all pods in the cluster (150,000 pods), so I would say it mostly satisfies your description.
This is also why I originally wrote that I was expecting somewhat bigger gains from your change.

Additionally, kube-proxy currently does a handful of non-iptables-restore iptables call in each sync loop (to ensure that the jump rules from INPUT, PREROUTING, etc, all exist).

That's a good point - thanks for looking into it.
Let me take a look at your PR.
