Sort kubelet pods by their creation time #113041

saschagrunert · 2022-10-13T10:34:18Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

There is a corner case when blocking Pod termination via a lifecycle preStop hook, for example by using this StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: ubi
  serviceName: "ubi"
  replicas: 1
  template:
    metadata:
      labels:
        app: ubi
    spec:
      terminationGracePeriodSeconds: 1000
      containers:
      - name: ubi
        image: ubuntu:22.04
        command: ['sh', '-c', 'echo The app is running! && sleep 360000']
        ports:
        - containerPort: 80
          name: web
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - 'echo aaa; trap : TERM INT; sleep infinity & wait'

After creation, downscaling, forced deletion and upscaling of the replica like this:

> kubectl apply -f sts.yml
> kubectl scale sts web --replicas=0
> kubectl delete pod web-0 --grace-period=0 --force
> kubectl scale sts web --replicas=1

We end up with two pods running in the container runtime, while the API reports only one:

> kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
web-0   1/1     Running   0          92s

> sudo crictl pods
POD ID              CREATED              STATE     NAME     NAMESPACE     ATTEMPT     RUNTIME
e05bb7dbb7e44       12 minutes ago       Ready     web-0    default       0           (default)
d90088614c73b       12 minutes ago       Ready     web-0    default       0           (default)

Running `kubectl exec -it web-0 -- ps -ef` now has a random chance of hitting the wrong container, in which case the output shows the lifecycle command `/bin/sh -c echo aaa; trap : TERM INT; sleep infinity & wait`.

This is caused by the container being looked up by its name only (without a podUID) at:

kubernetes/pkg/kubelet/kubelet_pods.go

Lines 1905 to 1914 in 0210941

    
    func (kl *Kubelet) GetExec(podFullName string, podUID types.UID, containerName string, cmd []string, streamOpts remotecommandserver.Options) (*url.URL, error) {
    	container, err := kl.findContainer(podFullName, podUID, containerName)
    	if err != nil {
    		return nil, err
    	}
    	if container == nil {
    		return nil, fmt.Errorf("container not found (%q)", containerName)
    	}
    	return kl.streamingRuntime.GetExec(container.ID, cmd, streamOpts.Stdin, streamOpts.Stdout, streamOpts.Stderr, streamOpts.TTY)
    }

And more specifically by the conversion of the pod result map to a slice in GetPods:

kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go

Lines 407 to 411 in 0210941

    
    // Convert map to list.
    var result []*kubecontainer.Pod
    for _, pod := range pods {
    	result = append(result, pod)
    }
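
To illustrate why the lookup is unpredictable, here is a minimal standalone sketch. It does not use the kubelet's code: the `pod` type, the `uid-old`/`uid-new` values and the helpers `mapToSlice` and `firstByName` are hypothetical stand-ins. It only shows that a first-match lookup by name on a slice built from a map depends on the map iteration order, which Go does not define.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for the kubelet's runtime pod type.
type pod struct {
	UID      string
	FullName string
}

// mapToSlice copies the map values into a slice. Go does not define the
// iteration order of a map, so the order of the result can differ per call.
func mapToSlice(pods map[string]*pod) []*pod {
	result := make([]*pod, 0, len(pods))
	for _, p := range pods {
		result = append(result, p)
	}
	return result
}

// firstByName returns the first pod with the given full name, or nil if
// there is none. It never looks at the UID.
func firstByName(pods []*pod, fullName string) *pod {
	for _, p := range pods {
		if p.FullName == fullName {
			return p
		}
	}
	return nil
}

func main() {
	// Two pods share a name because the old one has not terminated yet.
	pods := map[string]*pod{
		"uid-old": {UID: "uid-old", FullName: "web-0_default"},
		"uid-new": {UID: "uid-new", FullName: "web-0_default"},
	}

	// Count which pod the name-only lookup returns over many attempts.
	hits := map[string]int{}
	for i := 0; i < 1000; i++ {
		hits[firstByName(mapToSlice(pods), "web-0_default").UID]++
	}
	fmt.Println(hits)
}
```

The printed counts vary between runs, and both UIDs usually show up, which matches the random behavior seen with `kubectl exec`.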

We now solve that unexpected behavior by tracking the creation time of each pod and sorting the result by it. As a result, a name-based lookup always matches the most recently created pod.
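
The idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual diff: the `pod` type, its `CreatedAt` field (assumed here to be a numeric timestamp), and the helpers `sortNewestFirst` and `firstByName` are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, simplified stand-in for the kubelet's runtime pod type.
type pod struct {
	UID       string
	FullName  string
	CreatedAt uint64 // assumed: creation timestamp, larger means newer
}

// sortNewestFirst orders pods by creation time, most recent first.
// A stable sort keeps the relative order of pods with equal timestamps.
func sortNewestFirst(pods []*pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		return pods[i].CreatedAt > pods[j].CreatedAt
	})
}

// firstByName returns the first pod with the given full name, or nil.
func firstByName(pods []*pod, fullName string) *pod {
	for _, p := range pods {
		if p.FullName == fullName {
			return p
		}
	}
	return nil
}

func main() {
	// The stale pod happens to come first, as it might after a map iteration.
	pods := []*pod{
		{UID: "uid-old", FullName: "web-0_default", CreatedAt: 100},
		{UID: "uid-new", FullName: "web-0_default", CreatedAt: 200},
	}

	sortNewestFirst(pods)
	fmt.Println(firstByName(pods, "web-0_default").UID) // prints "uid-new"
}
```

With the slice sorted this way, a first-match lookup by name no longer depends on the order in which the pods were collected.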

Which issue(s) this PR fixes:

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=2090782

Special notes for your reviewer:

cc @harche @rphillips

Does this PR introduce a user-facing change?

Fixed a bug where the kubelet chooses the wrong container by its name when running `kubectl exec`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

matthyx · 2022-10-13T11:35:32Z

do we want a test to prevent future regressions?

rphillips · 2022-10-13T12:34:17Z

cc @harche due to evented PLEG work

haircommander · 2022-10-13T13:38:40Z

/triage accepted


saschagrunert · 2022-10-13T14:33:45Z

> do we want a test to prevent future regressions?

Sure, I added a unit test to cover that scenario. An e2e test would be more of a trial-and-error exercise, since Go does not guarantee the iteration order of a map.

haircommander · 2022-10-13T14:39:09Z

/lgtm

IMO, unit test is enough. @matthyx agree?

saschagrunert · 2022-10-13T15:03:34Z

/retest

saschagrunert · 2022-10-13T15:15:45Z

/test pull-kubernetes-e2e-kind-ipv6

matthyx · 2022-10-13T15:33:57Z

/lgtm

> IMO, unit test is enough. @matthyx agree?

Yes

harche · 2022-10-14T04:57:16Z

/lgtm

saschagrunert · 2022-10-14T07:11:06Z

For approval: PTAL @Random-Liu @dchen1107 @derekwaynecarr @yujuhong @sjenning @mrunalp @klueska

matthyx · 2022-10-15T09:09:38Z

/lgtm
/triage accepted

dims · 2022-10-18T15:08:52Z

I like this change as there is some predictability in the order! This is a small enough change that I am happy to get this going. Thanks for the test case as well, @saschagrunert.

/approve
/lgtm

k8s-ci-robot · 2022-10-18T15:09:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [dims]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

saschagrunert added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed area/kubelet needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 13, 2022

k8s-ci-robot requested review from matthyx and Random-Liu October 13, 2022 10:35

saschagrunert force-pushed the kubelet-pods-creation-time branch from b7fc175 to 1bdd40e Compare October 13, 2022 11:30

k8s-ci-robot added the area/kubelet label Oct 13, 2022

saschagrunert force-pushed the kubelet-pods-creation-time branch from 1bdd40e to b296f82 Compare October 13, 2022 14:32

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 13, 2022

k8s-ci-robot assigned haircommander Oct 13, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2022

k8s-ci-robot assigned harche Oct 14, 2022

SergeyKanzhelev added this to Triage in SIG Node PR Triage Oct 14, 2022

matthyx moved this from Triage to Needs Approver in SIG Node PR Triage Oct 15, 2022

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 15, 2022

k8s-ci-robot assigned matthyx Oct 15, 2022

k8s-ci-robot assigned dims Oct 18, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2022

k8s-ci-robot merged commit 843ad71 into kubernetes:master Oct 18, 2022

SIG Node PR Triage automation moved this from Needs Approver to Done Oct 18, 2022

k8s-ci-robot added this to the v1.26 milestone Oct 18, 2022

saschagrunert deleted the kubelet-pods-creation-time branch October 19, 2022 07:38
