Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exec auth: support TLS config caching #112017

Merged
merged 1 commit into from Sep 8, 2022

Conversation

enj
Copy link
Member

@enj enj commented Aug 24, 2022

This change updates the transport.Config .Dial and .TLS.GetCert fields
to use a struct wrapper. This indirection via a pointer allows the
functions to be compared and thus makes them valid to use as map keys.
This change is then leveraged by the existing global exec auth and TLS
config caches to return the same authenticator and TLS config even when
distinct but identical rest configs were used to create distinct clientsets.

Signed-off-by: Monis Khan mok@microsoft.com

/kind bug
/sig auth
/milestone v1.26
/priority important-soon
/triage accepted
Fixes #111911

Fix an ephemeral port exhaustion bug caused by improper connection management that occurred when a large number of objects were handled by kubectl while exec auth was in use.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Aug 24, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Aug 24, 2022
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Aug 24, 2022
@enj
Copy link
Member Author

enj commented Aug 24, 2022

/assign @liggitt
cc @atiratree

// - these rest configs are used to create distinct clientsets, then
//
// the underlying TLS config is shared between those clientsets.
func TestExecTLSCache(t *testing.T) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this test against master and confirmed that it failed as expected.

@enj
Copy link
Member Author

enj commented Aug 24, 2022

cc @soltysh

@aojea
Copy link
Member

aojea commented Aug 26, 2022

diff --git a/staging/src/k8s.io/client-go/transport/cache_test.go b/staging/src/k8s.io/client-go/transport/cache_test.go
index e07d77215ff..349ebe34617 100644
--- a/staging/src/k8s.io/client-go/transport/cache_test.go
+++ b/staging/src/k8s.io/client-go/transport/cache_test.go
@@ -19,10 +19,14 @@ package transport
 import (
        "context"
        "crypto/tls"
+       "fmt"
+       "io"
        "net"
        "net/http"
+       "net/http/httptest"
        "net/url"
        "reflect"
+       "sync"
        "testing"
 )
 
@@ -188,3 +192,119 @@ func TestTLSConfigKey(t *testing.T) {
                }
        }
 }
+
+func TestDifferentDialers(t *testing.T) {
+
+       var counter1 int
+       var mu1 sync.Mutex
+       dialFn1 := func(ctx context.Context, network, address string) (net.Conn, error) {
+               mu1.Lock()
+               counter1++
+               mu1.Unlock()
+               return (&net.Dialer{}).DialContext(ctx, network, address)
+       }
+
+       var counter2 int
+       var mu2 sync.Mutex
+       dialFn2 := func(ctx context.Context, network, address string) (net.Conn, error) {
+               mu2.Lock()
+               counter2++
+               mu2.Unlock()
+               return (&net.Dialer{}).DialContext(ctx, network, address)
+       }
+
+       ts := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+               fmt.Fprintf(w, "Hello, %s", r.Proto)
+       }))
+       ts.EnableHTTP2 = true
+       ts.StartTLS()
+       defer ts.Close()
+
+       tr1, err := tlsCache.get(&Config{
+               TLS: TLSConfig{
+                       Insecure: true,
+               },
+               Dial: &DialHolder{F: dialFn1}},
+       )
+       if err != nil {
+               t.Fatalf("failed to create transport: %v", err)
+       }
+
+       req, err := http.NewRequest("GET", ts.URL, nil)
+       if err != nil {
+               t.Fatalf("unexpected error: %v", err)
+       }
+
+       resp1, err := tr1.RoundTrip(req)
+       if err != nil {
+               t.Fatalf("unexpected error: %v", err)
+       }
+
+       data, err := io.ReadAll(resp1.Body)
+       if err != nil {
+               t.Fatalf("unexpected error: %v", err)
+       }
+       resp1.Body.Close()
+
+       if string(data) != "Hello, HTTP/2.0" {
+               t.Fatalf("unexpected response: %s", data)
+       }
+
+       tr2, err := tlsCache.get(&Config{
+               TLS: TLSConfig{
+                       Insecure: true,
+               },
+               Dial: &DialHolder{F: dialFn2}},
+       )
+       if err != nil {
+               t.Fatalf("failed to create transport: %v", err)
+       }
+       resp2, err := tr2.RoundTrip(req)
+       if err != nil {
+               t.Fatalf("unexpected error: %v", err)
+       }
+       defer resp2.Body.Close()
+       data, err = io.ReadAll(resp2.Body)
+       if err != nil {
+               t.Fatalf("unexpected error: %v", err)
+       }
+       resp2.Body.Close()
+       if string(data) != "Hello, HTTP/2.0" {
+               t.Fatalf("unexpected response: %s", data)
+       }
+
+       tr3, err := tlsCache.get(&Config{
+               TLS: TLSConfig{
+                       Insecure: true,
+               },
+               Dial: &DialHolder{F: dialFn2}},
+       )
+       if err != nil {
+               t.Fatalf("failed to create transport: %v", err)
+       }
+       resp3, err := tr3.RoundTrip(req)
+       if err != nil {
+               t.Fatalf("unexpected error: %v", err)
+       }
+
+       data, err = io.ReadAll(resp3.Body)
+       if err != nil {
+               t.Fatalf("unexpected error: %v", err)
+       }
+       resp3.Body.Close()
+       if string(data) != "Hello, HTTP/2.0" {
+               t.Fatalf("unexpected response: %s", data)
+       }
+
+       mu1.Lock()
+       mu2.Lock()
+       if counter1 != 1 || counter2 != 2 {
+               t.Fatalf("unexpacted dials: Dialer1 %d times Dialer2 %d times", counter1, counter2)
+       }
+       mu1.Unlock()
+       mu2.Unlock()
+
+       if len(tlsCache.transports) != 2 {
+               t.Fatalf("Unexpcted number of transports on the cache, expected 2 got %d", len(tlsCache.transports))
+       }
+}

this is a very complex thing, I had to create one test to verify this as is scary how many things can break touching so deep inside client-go, I don't know if I'm doing something wrong, but I have two identical transport with the same dialers but it doesn't cache them

go test -timeout 120s -run ^TestDifferentDialers$ k8s.io/client-go/transport -v

=== RUN   TestDifferentDialers
    /home/aojea/go/src/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/cache_test.go:309: Unexpcted number of transports on the cache, expected 2 got 3
--- FAIL: TestDifferentDialers (0.01s)
FAIL
FAIL	k8s.io/client-go/transport	0.014s

@enj
Copy link
Member Author

enj commented Aug 26, 2022

@aojea your test is incorrect. If you want caching to work, you need to arrange for the DialHolder object to be globally cached (the struct pointer is the cache key, not the wrapped function). The exec authenticator does this. See TestExecTLSCache for proof that this is working as expected.

@enj
Copy link
Member Author

enj commented Aug 26, 2022

@aojea to fix your test, simply change dialFn1, etc to include the DialHolder wrapper.

@@ -68,7 +68,7 @@ type Config struct {
WrapTransport WrapperFunc

// Dial specifies the dial function for creating unencrypted TCP connections.
Dial func(ctx context.Context, network, address string) (net.Conn, error)
Dial *DialHolder
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh @enj , now I understand why I have the test wrong, thanks for clarifying ...

@aojea
Copy link
Member

aojea commented Aug 26, 2022

@aojea to fix your test, simply change dialFn1, etc to include the DialHolder wrapper.

indeed, you are right, I've modified it and also verifies that reuses the http2 transport and only calls the dialer once

func TestDifferentDialers(t *testing.T) {

	var counter1 int
	var mu1 sync.Mutex
	dialFn1 := &DialHolder{F: func(ctx context.Context, network, address string) (net.Conn, error) {
		mu1.Lock()
		counter1++
		mu1.Unlock()
		return (&net.Dialer{}).DialContext(ctx, network, address)
	}}

	var counter2 int
	var mu2 sync.Mutex
	dialFn2 := &DialHolder{F: func(ctx context.Context, network, address string) (net.Conn, error) {
		mu2.Lock()
		counter2++
		mu2.Unlock()
		return (&net.Dialer{}).DialContext(ctx, network, address)
	}}

	ts := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello, %s", r.Proto)
	}))
	ts.EnableHTTP2 = true
	ts.StartTLS()
	defer ts.Close()

	tr1, err := tlsCache.get(&Config{
		TLS: TLSConfig{
			Insecure: true,
		},
		Dial: dialFn1,
	})
	if err != nil {
		t.Fatalf("failed to create transport: %v", err)
	}

	req, err := http.NewRequest("GET", ts.URL, nil)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}

	resp1, err := tr1.RoundTrip(req)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}

	data, err := io.ReadAll(resp1.Body)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	resp1.Body.Close()

	if string(data) != "Hello, HTTP/2.0" {
		t.Fatalf("unexpected response: %s", data)
	}

	tr2, err := tlsCache.get(&Config{
		TLS: TLSConfig{
			Insecure: true,
		},
		Dial: dialFn2,
	})
	if err != nil {
		t.Fatalf("failed to create transport: %v", err)
	}
	resp2, err := tr2.RoundTrip(req)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	defer resp2.Body.Close()
	data, err = io.ReadAll(resp2.Body)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	resp2.Body.Close()
	if string(data) != "Hello, HTTP/2.0" {
		t.Fatalf("unexpected response: %s", data)
	}

	dialFn3 := dialFn2
	tr3, err := tlsCache.get(&Config{
		TLS: TLSConfig{
			Insecure: true,
		},
		Dial: dialFn3,
	})
	if err != nil {
		t.Fatalf("failed to create transport: %v", err)
	}
	resp3, err := tr3.RoundTrip(req)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}

	data, err = io.ReadAll(resp3.Body)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	resp3.Body.Close()
	if string(data) != "Hello, HTTP/2.0" {
		t.Fatalf("unexpected response: %s", data)
	}

	mu1.Lock()
	mu2.Lock()
        // HTTP2 reuses the connection, so only 1 Dial per each Dialer
	if counter1 != 1 || counter2 != 1 {
		t.Errorf("unexpacted dials: Dialer1 got %d expected 1 times Dialer2 got %d  expected 1 times", counter1, counter2)
	}
	mu1.Unlock()
	mu2.Unlock()

	if len(tlsCache.transports) != 2 {
		t.Fatalf("Unexpcted number of transports on the cache, expected 2 got %d", len(tlsCache.transports))
	}
}

@aojea
Copy link
Member

aojea commented Aug 29, 2022

I'm not very familiar with this code, but what if instead of operating on transport.Config

func (a *Authenticator) UpdateTransportConfig(c *transport.Config) error {

you operate on the returned transport of the config directly, and to the cache locally to this plugin?

trans, err := restclient.TransportFor(&clientConfig)
	if err != nil {
		return nil, err
	}

@enj
Copy link
Member Author

enj commented Aug 29, 2022

@aojea what problem are you trying to solve? The reason this code acts on the transport config is because you are allowed to continue changing the transport config (say you want another wrapper to inject some headers) before constructing the final transport.

@aojea
Copy link
Member

aojea commented Aug 29, 2022

@aojea what problem are you trying to solve?

not modifying a public API that will cause consumers to change their code on the next release https://pkg.go.dev/k8s.io/client-go/rest#Config.Transport ... EDIT wrong link 😄 https://pkg.go.dev/k8s.io/client-go/transport#Config
I also think that using a cache for the transport is not the ideal solution, however, we manager to get rid of it for the internal types ... I was looking for not perpetuating the transport cache usage and let it die slowly

@enj
Copy link
Member Author

enj commented Aug 29, 2022

@aojea the set of consumers manually constructing transport configs with custom dialers/cert callbacks is small so the breaking change is acceptable IMO (rest configs are generally used). Other approaches will make the diff larger and harder to backport.

The TLS cache / authenticator cache are not going anywhere - too many consumers rely on them. kubectl is just one example.

@@ -68,7 +68,7 @@ type Config struct {
WrapTransport WrapperFunc

// Dial specifies the dial function for creating unencrypted TCP connections.
Dial func(ctx context.Context, network, address string) (net.Conn, error)
Dial *DialHolder
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clever way to make these comparable. I think we can make use of this mechanism in a way that wouldn't break compile compatibility in a backport by adding peer DialHolder / GetCertHolder fields in transport.Config, rather than changing the type of the existing fields, and requiring the address(Dial) == address(DialHolder.Dial) or address(GetCert) == address(GetCertHolder.GetCert) to be cacheable.

I took a stab at that in https://github.com/liggitt/kubernetes/commits/enj/i/exec_tls_cache

We could revisit simplifying that back down to a single field in 1.26

I also wonder if that could be backported back to 1.22, since it doesn't rely on the single client construction work @aojea did in 1.23

@enj enj force-pushed the enj/i/exec_tls_cache branch 2 times, most recently from 39ad816 to 3747dda Compare August 30, 2022 21:20
// cannot determine equality for functions
return tlsCacheKey{}, false, nil
}
if c.Dial != nil && (c.DialHolder == nil || c.DialHolder.Dial == nil || reflect.ValueOf(c.DialHolder.Dial).Pointer() != reflect.ValueOf(c.Dial).Pointer()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should verify the address matches in isValidHolders and treat a mismatch as an error, instead of silently ignoring/making uncacheable. Same thing for GetCert address comparison below

That simplifies this to:
if c.Dial != nil && c.DialHolder == nil { return ... }

@enj
Copy link
Member Author

enj commented Sep 1, 2022

@liggitt @aojea I believe all comments have been addressed. I added some unit tests to cover the isValidHolders logic as well.

Copy link
Member

@liggitt liggitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code change lgtm, I wasn't sure about the TLS:true bit on one unit test (might be pre-existing, I don't see anything in this change that looks related to it)

@aojea
Copy link
Member

aojea commented Sep 7, 2022

lgtm

just the unit test questions

This change updates the transport.Config .Dial and .TLS.GetCert fields
to use a struct wrapper.  This indirection via a pointer allows the
functions to be compared and thus makes them valid to use as map keys.
This change is then leveraged by the existing global exec auth and TLS
config caches to return the same authenticator and TLS config even when
distinct but identical rest configs were used to create distinct
clientsets.

Signed-off-by: Monis Khan <mok@microsoft.com>
@enj
Copy link
Member Author

enj commented Sep 8, 2022

@liggitt @aojea I fixed the one unit test I could change without having test failures per #112017 (comment).

@liggitt
Copy link
Member

liggitt commented Sep 8, 2022

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 8, 2022
@liggitt
Copy link
Member

liggitt commented Sep 8, 2022

I'm curious how cleanly this picks to 1.25 - 1.22

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enj, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 8, 2022
@aojea
Copy link
Member

aojea commented Sep 8, 2022

/retest

@enj
Copy link
Member Author

enj commented Sep 8, 2022

I'm curious how cleanly this picks to 1.25 - 1.22

There was a trivial import path conflict in k8s.io/client-go/transport/transport.go but otherwise all the picks were clean.

k8s-ci-robot added a commit that referenced this pull request Sep 9, 2022
…upstream-release-1.25

Automated cherry pick of #112017: exec auth: support TLS config caching
k8s-ci-robot added a commit that referenced this pull request Sep 9, 2022
…upstream-release-1.24

Automated cherry pick of #112017: exec auth: support TLS config caching
k8s-ci-robot added a commit that referenced this pull request Sep 9, 2022
…upstream-release-1.23

Automated cherry pick of #112017: exec auth: support TLS config caching
k8s-ci-robot added a commit that referenced this pull request Sep 9, 2022
…upstream-release-1.22

Automated cherry pick of #112017: exec auth: support TLS config caching
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using exec auth in kubectl triggers one connection per applied/described object
4 participants