Retry Unix domain sockets on Windows nodes for the plugin registration mechanism #110075
3bbdb1f to 3df9131
/ok-to-test
xref: #104584
/sig windows
fe2359a to c666656
/lgtm
/lgtm
/assign @mrunalp
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: luckerby, mrunalp
@luckerby Is there a plan to cherry pick this to older releases (1.24 and 1.23)?
@aramase Yes, I plan to do this for older releases as well in the coming days.
…0075-upstream-release-1.23 Automated cherry pick of #110075: Add retry logic for Unix Domain sockets on Windows
…0075-upstream-release-1.24 Automated cherry pick of #110075: Add retry logic for Unix Domain sockets on Windows
What type of PR is this?
/kind bug
What this PR does / why we need it:
The Kubelet plugin registration mechanism relies on Unix domain sockets. On Linux this works as expected. On Windows, with the current Kubelet code, a socket sometimes fails validation because its backing file already exists but the socket itself is not yet ready for communication when the plugin watcher code runs. One such occurrence can be seen here: #104584 (comment).
Because the Kubelet ignores a plugin whose socket cannot be dialed on Windows, workarounds such as liveness probes are needed to force plugin re-registration. However, this can take anywhere from a couple to several hundred restarts of the containers that create the sockets, as described in the aforementioned issue. As a result, a node can take a long time to become fully operational (e.g. the CSI secrets store driver is essentially broken for the duration of the restarts, which prevents any pod that relies on it for secret injection from starting). This also impacts features such as cluster autoscaler on Windows.
This PR changes the check performed by Kubelet against Unix domain sockets on Windows nodes from a one-time attempt to a retry approach.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Please see the Kubelet logs here when running the original code in AKS 1.22, showing sockets being ignored multiple times. With the code in this PR, the logs change as seen here, with at most one retry before the respective target socket is successfully dialed.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: