rate limit etcd healthcheck request #112046
Conversation
/sig api-machinery
/test pull-kubernetes-integration
the etcd check specifically was attempted to be made non-blocking in #104437, but that didn't get completed
/test pull-kubernetes-integration
let's target the etcd check specifically; I've also found a bug in the current logic
staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/etcd3.go
stop leaking goroutines
reduce etcd test duration
return the last request error, instead of the last error received
The rate limit allows 1 event per healthcheck timeout / 2
I think this is ready to go as long as we don't make N separate checkers...
/triage accepted
kindly ping @lavalamp #112046 (comment)
Sorry I missed your last update. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, lavalamp
If you can figure out an easy way to count how many health checks this does in e.g. our e2e tests before/after this change, and the difference is very large, a backport might be reasonable. (Do we scrape etcd metrics at the test end? That would be the first place I'd look. This might not be super easy to measure.)
Mike Spreitzer added tooling to dump etcd's metrics in the integration tests (all tests share the same etcd instance), and you can run the script he created against the dumped file.
I didn't find any metrics that show the number of health checks. I don't think our e2e tests abuse the /healthz/etcd endpoint; it seems OpenShift puts much more stress on that endpoint, which is why we were carrying David's patch.
OK, thanks for checking!
I missed that originally, but this looks great. I'm really happy that you made that happen. Re @lavalamp's questions: I don't think we would find anything in our tests (which don't overload any of the endpoints FWICT), but it certainly can help protect against some real-world scenarios.
/kind bug
What this PR does / why we need it:
The etcd healthcheck creates one request to etcd per healthcheck received, which can cause excessive traffic from the apiserver to etcd.
This PR implements a rate limiter based on the configured healthcheck timeout, so all connections that are rate limited return the last known value.