Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update daemonSet status even if syncDaemonSet fails #112127

Merged

Conversation

gjkim42
Copy link
Member

@gjkim42 gjkim42 commented Aug 30, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

Currently, if syncDaemonSet has an error, we cannot update the current status for the daemonSet. This PR ensures the Kubernetes updates daemonSet status even if syncDaemonSet fails.

Which issue(s) this PR fixes:

Fixes #
ref: #112010

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed DaemonSet to update the status even if it fails to create a pod.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 30, 2022
@gjkim42 gjkim42 force-pushed the update-status-despite-error branch from 43c0227 to 7cf7867 Compare August 31, 2022 03:05
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 31, 2022
@gjkim42 gjkim42 changed the title WIP: Update daemonSet status even if syncDaemonSet fails Update daemonSet status even if syncDaemonSet fails Aug 31, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 31, 2022
@gjkim42
Copy link
Member Author

gjkim42 commented Aug 31, 2022

/cc @alculquicondor

Could you PTAL? This PR is similar to #109694.

@gjkim42
Copy link
Member Author

gjkim42 commented Aug 31, 2022

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 31, 2022
@gjkim42 gjkim42 force-pushed the update-status-despite-error branch 2 times, most recently from c9a3342 to 9273287 Compare August 31, 2022 03:22
@alculquicondor
Copy link
Member

Please add integration and unit tests

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 2, 2022
Copy link
Member

@atiratree atiratree left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks/functions well - just a few nits

pkg/controller/daemon/daemon_controller.go Outdated Show resolved Hide resolved
test/integration/daemonset/daemonset_test.go Outdated Show resolved Hide resolved
pkg/controller/daemon/daemon_controller.go Outdated Show resolved Hide resolved
@gjkim42
Copy link
Member Author

gjkim42 commented Oct 11, 2022

looks/functions well - just a few nits

all done, thanks.

@soltysh
Could you PTAL this and #112737?

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so much better - thank you!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gjkim42, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2022
@soltysh
Copy link
Contributor

soltysh commented Nov 7, 2022

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 7, 2022
@k8s-ci-robot k8s-ci-robot merged commit b4f4286 into kubernetes:master Nov 8, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 8, 2022
@gjkim42 gjkim42 deleted the update-status-despite-error branch November 8, 2022 00:08
Comment on lines +27 to +48
var _ admission.ValidationInterface = &fakePodFailAdmission{}

type fakePodFailAdmission struct {
limitedPodNumber int
succeedPodsCount int
}

func (f *fakePodFailAdmission) Handles(operation admission.Operation) bool {
return operation == admission.Create
}

func (f *fakePodFailAdmission) Validate(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) (err error) {
if attr.GetKind().GroupKind() != api.Kind("Pod") {
return nil
}

if f.succeedPodsCount >= f.limitedPodNumber {
return fmt.Errorf("fakePodFailAdmission error")
}
f.succeedPodsCount++
return nil
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea

I guess this function is not thread-safe.
Let me make a fix for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gjkim42 given we're hours before the freeze date I don't want to block the queue, so I'm reverting this in #113733 when you'll be submitting a fix you'll need to include this PR as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we don't have enough time to confirm this fix, I respect your decision.

@liggitt
Copy link
Member

liggitt commented Nov 8, 2022

are we sure the increased failures are only a test issue, or should we back out the change until we can bring it in flake-free?

@gjkim42
Copy link
Member Author

gjkim42 commented Nov 8, 2022

$ go test -c -run=TestUpdateStatusDespitePodCreationFailure ./test/integration/daemonset/
$ stress ./daemonset.test -test.run=TestUpdateStatusDespitePodCreationFailure
...
3m0s: 86 runs so far, 9 failures (10.47%)
3m5s: 88 runs so far, 9 failures (10.23%)
  
/tmp/go-stress-20221108T225506-3322889609
I1108 22:57:02.386727  134048 etcd.go:107] starting etcd on http://127.0.0.1:45669
I1108 22:57:02.386802  134048 etcd.go:113] storing etcd data in: /tmp/integration_test_etcd_data3293147976
{"level":"warn","ts":"2022-11-08T22:57:02.513+0900","caller":"auth/store.go:1233","msg":"simple token is not cryptographically signed"}
I1108 22:57:03.036300  134048 serving.go:342] Generated self-signed cert (/tmp/test-integration-TestUpdateStatusDespitePodCreationFailure_TestUpdateStatusDespitePodCreationFailure_OnDelete4011098137/apiserver.crt, /tmp/test-integration-TestUpdateStatusDespitePodCreationFailure_TestUpdateStatusDespitePodCreationFailure_OnDelete4011098137/apiserver.key)
I1108 22:57:03.036315  134048 server.go:554] external host was not specified, using 192.168.0.80
I1108 22:57:03.425543  134048 shared_informer.go:273] Waiting for caches to sync for node_authorizer
I1108 22:57:03.426401  134048 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1108 22:57:03.426410  134048 plugins.go:161] Loaded 12 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionPolicy,ValidatingAdmissionWebhook,ResourceQuota.
I1108 22:57:03.426682  134048 instance.go:266] Using reconciler: lease
I1108 22:57:03.494069  134048 instance.go:609] API group "internal.apiserver.k8s.io" is not enabled, skipping.
W1108 22:57:03.643933  134048 genericapiserver.go:652] Skipping API authentication.k8s.io/v1beta1 because it has no resources.
W1108 22:57:03.643955  134048 genericapiserver.go:652] Skipping API authentication.k8s.io/v1alpha1 because it has no resources.
W1108 22:57:03.646580  134048 genericapiser
…
3m10s: 91 runs so far, 10 failures (10.99%)

/tmp/go-stress-20221108T225506-1628729645
I1108 22:57:06.693933  134122 etcd.go:107] starting etcd on http://127.0.0.1:46509
I1108 22:57:06.694009  134122 etcd.go:113] storing etcd data in: /tmp/integration_test_etcd_data344882241
{"level":"warn","ts":"2022-11-08T22:57:06.758+0900","caller":"auth/store.go:1233","msg":"simple token is not cryptographically signed"}
I1108 22:57:07.248480  134122 serving.go:342] Generated self-signed cert (/tmp/test-integration-TestUpdateStatusDespitePodCreationFailure_TestUpdateStatusDespitePodCreationFailure_OnDelete864243084/apiserver.crt, /tmp/test-integration-TestUpdateStatusDespitePodCreationFailure_TestUpdateStatusDespitePodCreationFailure_OnDelete864243084/apiserver.key)
I1108 22:57:07.248495  134122 server.go:554] external host was not specified, using 192.168.0.80
I1108 22:57:07.468446  134122 shared_informer.go:273] Waiting for caches to sync for node_authorizer
I1108 22:57:07.469272  134122 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1108 22:57:07.469284  134122 plugins.go:161] Loaded 12 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionPolicy,ValidatingAdmissionWebhook,ResourceQuota.
I1108 22:57:07.469558  134122 instance.go:266] Using reconciler: lease
I1108 22:57:07.532437  134122 instance.go:609] API group "internal.apiserver.k8s.io" is not enabled, skipping.
W1108 22:57:07.692444  134122 genericapiserver.go:652] Skipping API authentication.k8s.io/v1beta1 because it has no resources.
W1108 22:57:07.692464  134122 genericapiserver.go:652] Skipping API authentication.k8s.io/v1alpha1 because it has no resources.
W1108 22:57:07.694524  134122 genericapiserver
…
3m15s: 95 runs so far, 11 failures (11.58%)

I can reproduce this flake quite easily.

After cherrypicking #113732, it looks that the flake is resolved.

$ git cherry-pick 9ac7112906b24a516f53e86c62658cee8c9a4141 # PR 113732
$ go test -c -run=TestUpdateStatusDespitePodCreationFailure ./test/integration/daemonset/
$ stress ./daemonset.test -test.run=TestUpdateStatusDespitePodCreationFailure
5s: 0 runs so far, 0 failures
10s: 0 runs so far, 0 failures
15s: 0 runs so far, 0 failures
20s: 12 runs so far, 0 failures
25s: 12 runs so far, 0 failures
30s: 12 runs so far, 0 failures
35s: 12 runs so far, 0 failures
40s: 24 runs so far, 0 failures
45s: 24 runs so far, 0 failures
50s: 24 runs so far, 0 failures
55s: 32 runs so far, 0 failures
1m0s: 36 runs so far, 0 failures
1m5s: 36 runs so far, 0 failures
1m10s: 40 runs so far, 0 failures
1m15s: 48 runs so far, 0 failures
1m20s: 48 runs so far, 0 failures
1m25s: 52 runs so far, 0 failures
1m30s: 58 runs so far, 0 failures
1m35s: 60 runs so far, 0 failures
1m40s: 60 runs so far, 0 failures
1m45s: 65 runs so far, 0 failures
1m50s: 72 runs so far, 0 failures
1m55s: 72 runs so far, 0 failures
2m0s: 77 runs so far, 0 failures
2m5s: 84 runs so far, 0 failures
2m10s: 84 runs so far, 0 failures
2m15s: 89 runs so far, 0 failures
2m20s: 92 runs so far, 0 failures
2m25s: 96 runs so far, 0 failures
2m30s: 97 runs so far, 0 failures
2m35s: 104 runs so far, 0 failures

@atiratree
Copy link
Member

I can confirm this as well, even though the flake is less common for me. With the proposed fix I cannot see it even after a larger number of runs (500+).

Also, when I compile the test with -race flag I can clearly see it fail because of that:

--- FAIL: TestUpdateStatusDespitePodCreationFailure (7.91s)
    --- FAIL: TestUpdateStatusDespitePodCreationFailure/TestUpdateStatusDespitePodCreationFailure_OnDelete (7.91s)
        testing.go:1319: race detected during execution of test
    testing.go:1319: race detected during execution of test
FAIL

after the fix, the race is not detected anymore

@gjkim42
Copy link
Member Author

gjkim42 commented Nov 8, 2022

I can confirm this as well, even though the flake is less common for me.

$ git diff
diff --git a/test/integration/daemonset/daemonset_test.go b/test/integration/daemonset/daemonset_test.go
index 587beb997a2..98de4061121 100644
--- a/test/integration/daemonset/daemonset_test.go
+++ b/test/integration/daemonset/daemonset_test.go
@@ -996,7 +996,7 @@ func TestUnschedulableNodeDaemonDoesLaunchPod(t *testing.T) {
 
 func TestUpdateStatusDespitePodCreationFailure(t *testing.T) {
        forEachStrategy(t, func(t *testing.T, strategy *apps.DaemonSetUpdateStrategy) {
-               limitedPodNumber := 2
+               limitedPodNumber := 4
                // use framework.StartTestServer to inject admission control
                // e.g. to emulate the resource quota behavior

FYI, I boosted the flake by increasing replicas a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants