Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed: 22422 use singleflight to alleviate simultaneous calls to #112696

Merged
merged 1 commit into from Oct 25, 2022

Conversation

aimuz
Copy link
Contributor

@aimuz aimuz commented Sep 23, 2022

Signed-off-by: aimuz mr.imuz@gmail.com

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #22422

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed: #22422 Admission controllers can cause unnecessary significant load on apiserver

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 23, 2022
@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 23, 2022
@k8s-ci-robot
Copy link
Contributor

@aimuz: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 23, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @aimuz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 23, 2022
@aimuz aimuz changed the title Fixed: #22422 use singleflight to alleviate simultaneous calls to Fixed: 22422 use singleflight to alleviate simultaneous calls to Sep 23, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Sep 23, 2022
@aimuz
Copy link
Contributor Author

aimuz commented Sep 23, 2022

/sig scalability

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 23, 2022
// throttling - see #22422 for details.
liveList, err := l.client.CoreV1().LimitRanges(a.GetNamespace()).List(context.TODO(), metav1.ListOptions{})
// Fixed: #22422
// use singleflight to alleviate simultaneous calls to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be a regression test verifying the fix

Copy link
Contributor Author

@aimuz aimuz Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a test, please help me to see if this matches @aojea

@aimuz aimuz force-pushed the fix-22422 branch 2 times, most recently from 37a3006 to 4192b79 Compare September 23, 2022 09:03
@aimuz aimuz requested review from aojea and removed request for derekwaynecarr and deads2k September 23, 2022 09:05
Comment on lines 894 to 900
go func(wg *sync.WaitGroup) {
_, err := handler.GetLimitRanges(attributes)
if err != nil {
t.Errorf("unexpected error: %v", err)
}
wg.Done()
}(&wg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to pass the wg var as parameter?

Suggested change
go func(wg *sync.WaitGroup) {
_, err := handler.GetLimitRanges(attributes)
if err != nil {
t.Errorf("unexpected error: %v", err)
}
wg.Done()
}(&wg)
go func() {
defer wg.Done()
_, err := handler.GetLimitRanges(attributes)
if err != nil {
t.Errorf("unexpected error: %v", err)
}
}()

nit, defer wg.Done() use to be more idiomatic

limitRangeList.Items = append(limitRangeList.Items, value)
}
// Set the latency to simulate the real environment
time.Sleep(200 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a low value and can flake, just set 1*time.Second

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think we can do this more reliable if we add a select and block on a channel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that you signal before the wg.Wait(), so you guarantee that all the request were sent

count := 0
mockClient := &fake.Clientset{}
mockClient.AddReactor("list", "limitranges", func(action core.Action) (bool, runtime.Object, error) {
count++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without the singleflight this will race, you can use an atomic counter
var count int64
atomic.AddInt64(&count,1)

if count != 1 {
t.Errorf("Expected 1 limit range, got %d", count)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add another

			_, err := handler.GetLimitRanges(attributes)
			if err != nil {
				t.Errorf("unexpected error: %v", err)
			}

and check that count == 2

@aimuz aimuz requested review from aojea and liggitt and removed request for liggitt and aojea October 20, 2022 02:27
@aimuz
Copy link
Contributor Author

aimuz commented Oct 20, 2022

/retest

liveList, err := l.client.CoreV1().LimitRanges(a.GetNamespace()).List(context.TODO(), metav1.ListOptions{})
// Fixed: #22422
// use singleflight to alleviate simultaneous calls to
lruItemObj, err, _ = l.group.Do(a.GetNamespace(), func() (interface{}, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I switch this to partition on a constant string "x" instead of the namespace, the added unit test still passes, so the test is not effectively verifying this is correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the constant string x will be used, then he will partition all calls. If there are two concurrent calls with different namespaces, he will only call them once until the call returns. This should be the wrong behavior.

Currently using namespaces as partitions, this is partitioning a set of calls with the same namespace. This is the same behavior as using namespaces as keys for LRU caching before
@liggitt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using a constant string "x" here would be a bug, since it would make calls from different namespaces share a single lookup and use the same retrieved result. The unit test should fail under those circumstances, and it does not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use the constant string x, the test will block

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not what I'm seeing:

git diff
diff --git a/plugin/pkg/admission/limitranger/admission.go b/plugin/pkg/admission/limitranger/admission.go
index 4525463d2e6..907a28bda32 100644
--- a/plugin/pkg/admission/limitranger/admission.go
+++ b/plugin/pkg/admission/limitranger/admission.go
@@ -165,7 +165,7 @@ func (l *LimitRanger) GetLimitRanges(a admission.Attributes) ([]*corev1.LimitRan
                if !ok || lruItemObj.(liveLookupEntry).expiry.Before(time.Now()) {
                        // Fixed: #22422
                        // use singleflight to alleviate simultaneous calls to
-                       lruItemObj, err, _ = l.group.Do(a.GetNamespace(), func() (interface{}, error) {
+                       lruItemObj, err, _ = l.group.Do("x", func() (interface{}, error) {
                                liveList, err := l.client.CoreV1().LimitRanges(a.GetNamespace()).List(context.TODO(), metav1.ListOptions{})
                                if err != nil {
                                        return nil, admission.NewForbidden(a, err)
go test ./plugin/pkg/admission/limitranger -count=1 -v -run TestLimitRanger_GetLimitRangesFixed22422
=== RUN   TestLimitRanger_GetLimitRangesFixed22422
--- PASS: TestLimitRanger_GetLimitRangesFixed22422 (0.00s)
PASS
ok  	k8s.io/kubernetes/plugin/pkg/admission/limitranger	0.800s

Copy link
Contributor Author

@aimuz aimuz Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just updated it and the problem should be fixed, Please look at the

Copy link
Contributor Author

@aimuz aimuz Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go test ./plugin/pkg/admission/limitranger -count=1 -v -run TestLimitRanger_GetLimitRangesFixed22422
=== RUN   TestLimitRanger_GetLimitRangesFixed22422
    admission_test.go:910: Expected test namespace, got test1
    admission_test.go:910: Expected test namespace, got test1
    admission_test.go:910: Expected test namespace, got test1
    admission_test.go:910: Expected test namespace, got test1
    admission_test.go:910: Expected test namespace, got test1
    admission_test.go:910: Expected test namespace, got test1
    admission_test.go:910: Expected test namespace, got test1
    admission_test.go:910: Expected test namespace, got test1
    admission_test.go:910: Expected test namespace, got test1
--- FAIL: TestLimitRanger_GetLimitRangesFixed22422 (0.00s)
FAIL
FAIL    k8s.io/kubernetes/plugin/pkg/admission/limitranger      0.792s
FAIL

// since all the calls with the same namespace will be holded, they must be catched on the singleflight group,
// There are two different sets of namespace calls
// hence only 2
if count != 2 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually shows up in the test, in the loop I turned on two different sets of namespace calls and he also correctly increased the count to 2, so I don't think it's a problem. @liggitt

Comment on lines 900 to 927
// simulating concurrent calls after a cache failure
go func() {
defer wg.Done()
_, err := handler.GetLimitRanges(attributes)
if err != nil {
t.Errorf("unexpected error: %v", err)
}
}()

// simulation of different namespaces is not a call
go func() {
defer wg.Done()
_, err := handler.GetLimitRanges(attributesTest1)
if err != nil {
t.Errorf("unexpected error: %v", err)
}
}()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two different sets of namespace calls are initiated here

@aimuz aimuz requested review from liggitt and aojea and removed request for aojea and liggitt October 21, 2022 16:23
@aimuz
Copy link
Contributor Author

aimuz commented Oct 21, 2022

/retest

@aimuz aimuz requested review from liggitt and removed request for aojea October 21, 2022 19:57
@liggitt
Copy link
Member

liggitt commented Oct 24, 2022

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aimuz, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 24, 2022
@aimuz
Copy link
Contributor Author

aimuz commented Oct 25, 2022

/retest

@aimuz
Copy link
Contributor Author

aimuz commented Oct 25, 2022

Thanks, this PR has benefited me a lot @aojea @liggitt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Admission controllers can cause unnecessary significant load on apiserver
5 participants