rate limit etcd healthcheck request #112046
Conversation
/sig api-machinery
/test pull-kubernetes-integration
the etcd check specifically was attempted to be made non-blocking in #104437, but that didn't get completed
/test pull-kubernetes-integration
let's target the etcd check specifically; I've also found a bug in the current logic
staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/etcd3.go
stop leaking goroutines
reduce etcd test duration
return the last request error, instead of the last error received
The rate limit allows 1 event per healthcheck timeout / 2
I think this is ready to go as long as we don't make N separate checkers...
/triage accepted
kindly ping @lavalamp #112046 (comment)
Sorry I missed your last update. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, lavalamp
If you can figure out an easy way to count how many health checks this does in e.g. our e2e tests before/after this change, and the difference is very large, a backport might be reasonable. (Do we scrape etcd metrics at the test end? That would be the first place I'd look. This might not be super easy to measure.)
Mike Spreitzer added tooling to dump etcd's metrics in the integration tests (all tests share the same etcd instance), and you can run the script he created against the dumped file.
I didn't find any metrics that show the number of health checks. I don't think our e2e tests abuse the /healthz/etcd endpoint; it seems OpenShift puts much more stress on that endpoint, which is why we were carrying David's patch.
OK, thanks for checking!
I missed that originally, but this looks great. I'm really happy that you made that happen. Re @lavalamp's questions: I don't think we would find anything in our tests (which don't overload any of the endpoints FWICT), but it certainly can help protect against some real-world scenarios.
/kind bug
What this PR does / why we need it:
The etcd healthcheck creates one request to etcd per healthcheck received, which can cause excessive traffic from the apiserver to etcd.
This PR implements a rate limiter based on the configured healthcheck timeout, so all connections that are rate limited return the last known value.