Prevent dirty service object leaking between reconciles #109601
Conversation
Please note that we're already in Test Freeze for the
@mdbooth: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I have now tested this manually and verified that it fixes the problem. To describe the problem: cloud-provider-openstack adds an annotation to the Service object during reconciliation. In our environment we were incorrectly deploying cloud-provider-openstack without the required permission to patch the service object, which resulted in an error message like:
However, what we observed was that the next time the Service was reconciled after the failure, the reconcile would succeed even though the annotation had not been added to the Service object. The reason for this was that the Service object passed to cloud-provider-openstack from syncService was dirty from the previous invocation. As the annotation was still present from the previous run, there was no difference for the Patcher to patch, and therefore the reconcile erroneously succeeded. I have verified that with:
cloud-provider-openstack will now fail with the above permission error on every reconcile attempt, not just the first one.
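To make the failure mode concrete, here is a self-contained sketch of the behaviour described above. The `Service` type, annotation key, and `reconcile` function are hypothetical stand-ins, not the real Kubernetes API or cloud-provider-openstack code; `reconcile` mimics a Patcher that snapshots the object's state, mutates it, and only attempts a (forbidden) patch when something changed:

```go
package main

import (
	"errors"
	"fmt"
)

// Service is a hypothetical stand-in for the real API type.
type Service struct {
	Annotations map[string]string
}

// deepCopy stands in for the generated DeepCopy() of real API objects.
func deepCopy(s *Service) *Service {
	out := &Service{Annotations: make(map[string]string, len(s.Annotations))}
	for k, v := range s.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// reconcile mimics the controller loop: snapshot state, add an annotation,
// and only attempt a patch if the object changed since the snapshot.
func reconcile(svc *Service, patchAllowed bool) error {
	before := len(svc.Annotations) // crude stand-in for the Patcher's snapshot
	if svc.Annotations == nil {
		svc.Annotations = map[string]string{}
	}
	svc.Annotations["loadbalancer.openstack.org/id"] = "lb-1"
	if len(svc.Annotations) == before {
		return nil // no diff, nothing to patch: reconcile "succeeds"
	}
	if !patchAllowed {
		return errors.New("services is forbidden: cannot patch")
	}
	return nil
}

func main() {
	// Buggy: the informer hands the controller its cached pointer.
	cached := &Service{}
	fmt.Println(reconcile(cached, false)) // permission error; cached is now dirty
	fmt.Println(reconcile(cached, false)) // <nil>: silently "succeeds"

	// Fixed: the controller gets a fresh copy on every invocation.
	cached = &Service{}
	fmt.Println(reconcile(deepCopy(cached), false)) // permission error
	fmt.Println(reconcile(deepCopy(cached), false)) // permission error again, as expected
}
```

With the shared pointer, the second reconcile sees its own leftover annotation, finds nothing to patch, and reports success; with a copy per invocation, the permission error correctly recurs.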
I would appreciate advice on how to write an automated test for this. I think the ideal test would be:
It might be possible to construct a unit test to do this but it would be quite complicated and probably not very easy to understand or maintain. It feels more like envtest territory? Alternatively, we could argue that this would really be a test of the behaviour of informers, not the service controller, and doesn't require a separate test?
controller-runtime does this here, btw: https://github.com/kubernetes-sigs/controller-runtime/blob/c162794a9b12dea31c076820938a779873e93957/pkg/cache/internal/cache_reader.go#L84-L90
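For context, the linked controller-runtime code returns a deep copy from its cache reader unless the caller explicitly opts out. A paraphrased, self-contained sketch of that copy-on-read pattern (hypothetical types, not the actual controller-runtime source):

```go
package main

import "fmt"

// Object is a stand-in for runtime.Object with generated deep-copy support.
type Object interface {
	DeepCopy() Object
}

// cacheReader returns copies on Get so callers can safely mutate the result,
// with a performance opt-out for callers that promise not to mutate.
type cacheReader struct {
	store           map[string]Object
	disableDeepCopy bool
}

func (c *cacheReader) Get(key string) (Object, error) {
	obj, ok := c.store[key]
	if !ok {
		return nil, fmt.Errorf("not found: %s", key)
	}
	if c.disableDeepCopy {
		return obj, nil // caller must treat the object as read-only
	}
	return obj.DeepCopy(), nil
}

// pod is a toy Object implementation for the demonstration.
type pod struct{ labels map[string]string }

func (p *pod) DeepCopy() Object {
	out := &pod{labels: make(map[string]string, len(p.labels))}
	for k, v := range p.labels {
		out.labels[k] = v
	}
	return out
}

func main() {
	c := &cacheReader{store: map[string]Object{
		"default/p": &pod{labels: map[string]string{}},
	}}

	obj, _ := c.Get("default/p")
	obj.(*pod).labels["mutated"] = "true" // mutate the returned copy

	again, _ := c.Get("default/p")
	fmt.Println(len(again.(*pod).labels)) // 0: the cache was not polluted
}
```

Copying at the cache boundary means every consumer is protected by construction, rather than each controller having to remember to copy.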
lgtm :)
@andrewsykim the impact of this that I've noticed is fairly minor and by fixing the underlying permission issue we're unlikely to see it often, although the problem still exists if there's a transient failure to patch the Service object. I haven't been following the 1.24 release schedule, so I don't know how frozen we are right now. Is it worth trying to merge this, or wait until after 1.24?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale |
@andrewsykim This is still a thing, btw. Any idea who can approve? |
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mdbooth, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
We were passing a Service object returned from the Service informer directly to the Service controller loop. This loop is expected to modify the Service object during execution, but doing so is not safe because the object is shared with the informer's cache. Specifically, any modification to the Service object by the controller will pollute the Service informer's cache and result in a potentially dirty object being passed to subsequent invocations of the controller loop.
We fix this issue by passing a copy of the object from the informer instead.
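The essence of the fix can be seen in a minimal sketch. The `Service` type and cache map below are hypothetical illustrations, not the actual diff; the point is that a map held behind a shared pointer is mutated in place, while a deep copy breaks the aliasing:

```go
package main

import "fmt"

// Service is a hypothetical stand-in for the real API type.
type Service struct {
	Annotations map[string]string
}

// DeepCopy mimics the generated deep-copy of real API objects.
func (s *Service) DeepCopy() *Service {
	out := &Service{Annotations: make(map[string]string, len(s.Annotations))}
	for k, v := range s.Annotations {
		out.Annotations[k] = v
	}
	return out
}

func main() {
	cache := map[string]*Service{
		"default/lb": {Annotations: map[string]string{}},
	}

	// Before the fix: the loop receives the cached pointer and dirties it.
	svc := cache["default/lb"]
	svc.Annotations["dirty"] = "true"
	fmt.Println(len(cache["default/lb"].Annotations)) // 1: cache polluted

	// After the fix: the loop receives a copy; the cache stays clean.
	cache["default/lb"] = &Service{Annotations: map[string]string{}}
	svc = cache["default/lb"].DeepCopy()
	svc.Annotations["dirty"] = "true"
	fmt.Println(len(cache["default/lb"].Annotations)) // 0: cache untouched
}
```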
I have described in the comments how this bug affected cloud-provider-openstack, but I suspect it affects multiple cloud providers.
Which issue(s) this PR fixes:
Fixes kubernetes/cloud-provider#59
Special notes for your reviewer:
Only manually tested. See manual test description in comment below.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: