Update daemonSet status even if syncDaemonSet fails #112127
Conversation
(branch force-pushed from 43c0227 to 7cf7867)
/cc @alculquicondor Could you PTAL? This PR is similar to #109694.
/priority important-soon
(branch force-pushed from c9a3342 to 9273287)
Please add integration and unit tests.
(branch force-pushed from 9273287 to e429045)
(branch force-pushed from c679c18 to 7a1fc25)
Looks and functions well — just a few nits.
(branch force-pushed from 7a1fc25 to 2ee024a)
This is so much better - thank you!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gjkim42, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/triage accepted
```go
var _ admission.ValidationInterface = &fakePodFailAdmission{}

type fakePodFailAdmission struct {
	limitedPodNumber int
	succeedPodsCount int
}

func (f *fakePodFailAdmission) Handles(operation admission.Operation) bool {
	return operation == admission.Create
}

func (f *fakePodFailAdmission) Validate(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) (err error) {
	if attr.GetKind().GroupKind() != api.Kind("Pod") {
		return nil
	}

	if f.succeedPodsCount >= f.limitedPodNumber {
		return fmt.Errorf("fakePodFailAdmission error")
	}
	f.succeedPodsCount++
	return nil
}
```
I guess this function is not thread-safe.
Let me make a fix for this.
As we don't have enough time to confirm this fix, I respect your decision.
Are we sure the increased failures are only a test issue, or should we back out the change until we can bring it in flake-free?
```console
$ go test -c -run=TestUpdateStatusDespitePodCreationFailure ./test/integration/daemonset/
$ stress ./daemonset.test -test.run=TestUpdateStatusDespitePodCreationFailure
...
3m0s: 86 runs so far, 9 failures (10.47%)
3m5s: 88 runs so far, 9 failures (10.23%)
/tmp/go-stress-20221108T225506-3322889609
I1108 22:57:02.386727 134048 etcd.go:107] starting etcd on http://127.0.0.1:45669
I1108 22:57:02.386802 134048 etcd.go:113] storing etcd data in: /tmp/integration_test_etcd_data3293147976
{"level":"warn","ts":"2022-11-08T22:57:02.513+0900","caller":"auth/store.go:1233","msg":"simple token is not cryptographically signed"}
I1108 22:57:03.036300 134048 serving.go:342] Generated self-signed cert (/tmp/test-integration-TestUpdateStatusDespitePodCreationFailure_TestUpdateStatusDespitePodCreationFailure_OnDelete4011098137/apiserver.crt, /tmp/test-integration-TestUpdateStatusDespitePodCreationFailure_TestUpdateStatusDespitePodCreationFailure_OnDelete4011098137/apiserver.key)
I1108 22:57:03.036315 134048 server.go:554] external host was not specified, using 192.168.0.80
I1108 22:57:03.425543 134048 shared_informer.go:273] Waiting for caches to sync for node_authorizer
I1108 22:57:03.426401 134048 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1108 22:57:03.426410 134048 plugins.go:161] Loaded 12 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionPolicy,ValidatingAdmissionWebhook,ResourceQuota.
I1108 22:57:03.426682 134048 instance.go:266] Using reconciler: lease
I1108 22:57:03.494069 134048 instance.go:609] API group "internal.apiserver.k8s.io" is not enabled, skipping.
W1108 22:57:03.643933 134048 genericapiserver.go:652] Skipping API authentication.k8s.io/v1beta1 because it has no resources.
W1108 22:57:03.643955 134048 genericapiserver.go:652] Skipping API authentication.k8s.io/v1alpha1 because it has no resources.
W1108 22:57:03.646580 134048 genericapiser
…
3m10s: 91 runs so far, 10 failures (10.99%)
/tmp/go-stress-20221108T225506-1628729645
…
3m15s: 95 runs so far, 11 failures (11.58%)
```

I can reproduce this flake quite easily. After cherry-picking #113732, it looks like the flake is resolved.
I can confirm this as well, even though the flake is less common for me. With the proposed fix I cannot see it even after a larger number of runs (500+). Also, when I compile the test with `-race`, the race is not detected anymore after the fix.
```diff
$ git diff
diff --git a/test/integration/daemonset/daemonset_test.go b/test/integration/daemonset/daemonset_test.go
index 587beb997a2..98de4061121 100644
--- a/test/integration/daemonset/daemonset_test.go
+++ b/test/integration/daemonset/daemonset_test.go
@@ -996,7 +996,7 @@ func TestUnschedulableNodeDaemonDoesLaunchPod(t *testing.T) {
 func TestUpdateStatusDespitePodCreationFailure(t *testing.T) {
 	forEachStrategy(t, func(t *testing.T, strategy *apps.DaemonSetUpdateStrategy) {
-		limitedPodNumber := 2
+		limitedPodNumber := 4
 		// use framework.StartTestServer to inject admission control
 		// e.g. to emulate the resource quota behavior
```

FYI, I boosted the flake by increasing replicas a bit.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Currently, if syncDaemonSet returns an error, the DaemonSet's status is never updated. This PR ensures that Kubernetes updates the DaemonSet status even if syncDaemonSet fails.
Which issue(s) this PR fixes:
Fixes #
ref: #112010
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: