Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort kubelet pods by their creation time #113041

Merged

Conversation

saschagrunert
Copy link
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

There is a corner case when blocking Pod termination via a lifecycle preStop hook, for example by using this StateFulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: ubi
  serviceName: "ubi"
  replicas: 1
  template:
    metadata:
      labels:
        app: ubi
    spec:
      terminationGracePeriodSeconds: 1000
      containers:
      - name: ubi
        image: ubuntu:22.04
        command: ['sh', '-c', 'echo The app is running! && sleep 360000']
        ports:
        - containerPort: 80
          name: web
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - 'echo aaa; trap : TERM INT; sleep infinity & wait'

After creation, downscaling, forced deletion and upscaling of the replica like this:

> kubectl apply -f sts.yml
> kubectl scale sts web --replicas=0
> kubectl delete pod web-0 --grace-period=0 --force
> kubectl scale sts web --replicas=1

We will end up having two pods running by the container runtime, while the API only reports one:

> kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
web-0   1/1     Running   0          92s
> sudo crictl pods
POD ID              CREATED              STATE     NAME     NAMESPACE     ATTEMPT     RUNTIME
e05bb7dbb7e44       12 minutes ago       Ready     web-0    default       0           (default)
d90088614c73b       12 minutes ago       Ready     web-0    default       0           (default)

When now running kubectl exec -it web-0 -- ps -ef, there is a random chance that we hit the wrong container reporting the lifecycle command /bin/sh -c echo aaa; trap : TERM INT; sleep infinity & wait.

This is caused by the container lookup via its name (and no podUID) at:

func (kl *Kubelet) GetExec(podFullName string, podUID types.UID, containerName string, cmd []string, streamOpts remotecommandserver.Options) (*url.URL, error) {
container, err := kl.findContainer(podFullName, podUID, containerName)
if err != nil {
return nil, err
}
if container == nil {
return nil, fmt.Errorf("container not found (%q)", containerName)
}
return kl.streamingRuntime.GetExec(container.ID, cmd, streamOpts.Stdin, streamOpts.Stdout, streamOpts.Stderr, streamOpts.TTY)
}

And more specifiy by the conversion of the pod result map to a slice in GetPods:

// Convert map to list.
var result []*kubecontainer.Pod
for _, pod := range pods {
result = append(result, pod)
}

We now solve that unexpected behavior by tracking the creation time of the pod and sorting the result based on that. This will cause to always match the most recently created pod.

Which issue(s) this PR fixes:

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=2090782

Special notes for your reviewer:

cc @harche @rphillips

Does this PR introduce a user-facing change?

Fixed a bug where the kubelet chooses the wrong container by its name when running `kubectl exec`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 13, 2022
@saschagrunert saschagrunert added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed area/kubelet needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 13, 2022
@matthyx
Copy link
Contributor

matthyx commented Oct 13, 2022

do we want a test to prevent future regressions?

@rphillips
Copy link
Member

cc @harche due to evented PLEG work

@haircommander
Copy link
Contributor

/triage accepted

There is a corner case when blocking Pod termination via a lifecycle
preStop hook, for example by using this StateFulSet:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: ubi
  serviceName: "ubi"
  replicas: 1
  template:
    metadata:
      labels:
        app: ubi
    spec:
      terminationGracePeriodSeconds: 1000
      containers:
      - name: ubi
        image: ubuntu:22.04
        command: ['sh', '-c', 'echo The app is running! && sleep 360000']
        ports:
        - containerPort: 80
          name: web
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - 'echo aaa; trap : TERM INT; sleep infinity & wait'
```

After creation, downscaling, forced deletion and upscaling of the
replica like this:

```
> kubectl apply -f sts.yml
> kubectl scale sts web --replicas=0
> kubectl delete pod web-0 --grace-period=0 --force
> kubectl scale sts web --replicas=1
```

We will end up having two pods running by the container runtime, while
the API only reports one:

```
> kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
web-0   1/1     Running   0          92s
```

```
> sudo crictl pods
POD ID              CREATED              STATE     NAME     NAMESPACE     ATTEMPT     RUNTIME
e05bb7dbb7e44       12 minutes ago       Ready     web-0    default       0           (default)
d90088614c73b       12 minutes ago       Ready     web-0    default       0           (default)
```

When now running `kubectl exec -it web-0 -- ps -ef`, there is a random chance that we hit the wrong
container reporting the lifecycle command `/bin/sh -c echo aaa; trap : TERM INT; sleep infinity & wait`.

This is caused by the container lookup via its name (and no podUID) at:
https://github.com/kubernetes/kubernetes/blob/02109414e8816ceefce775e8b627d67cad6ced85/pkg/kubelet/kubelet_pods.go#L1905-L1914

And more specifiy by the conversion of the pod result map to a slice in `GetPods`:
https://github.com/kubernetes/kubernetes/blob/02109414e8816ceefce775e8b627d67cad6ced85/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L407-L411

We now solve that unexpected behavior by tracking the creation time of
the pod and sorting the result based on that. This will cause to always
match the most recently created pod.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 13, 2022
@saschagrunert
Copy link
Member Author

do we want a test to prevent future regressions?

Sure, I added a unit test to cover that scenario. An e2e test would be more like a trial and error since the order of the map is based on their hashes.

@haircommander
Copy link
Contributor

/lgtm

IMO, unit test is enough. @matthyx agree?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2022
@saschagrunert
Copy link
Member Author

/retest

@saschagrunert
Copy link
Member Author

/test pull-kubernetes-e2e-kind-ipv6

@matthyx
Copy link
Contributor

matthyx commented Oct 13, 2022

/lgtm

IMO, unit test is enough. @matthyx agree?

Yes

@harche
Copy link
Contributor

harche commented Oct 14, 2022

/lgtm

@saschagrunert
Copy link
Member Author

@matthyx matthyx moved this from Triage to Needs Approver in SIG Node PR Triage Oct 15, 2022
@matthyx
Copy link
Contributor

matthyx commented Oct 15, 2022

/lgtm
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 15, 2022
@dims
Copy link
Member

dims commented Oct 18, 2022

I like this change as there is some predictability in the order! This is a small enough change that i am happy to get this going. Thanks for the test case as well @saschagrunert

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2022
@k8s-ci-robot k8s-ci-robot merged commit 843ad71 into kubernetes:master Oct 18, 2022
SIG Node PR Triage automation moved this from Needs Approver to Done Oct 18, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Oct 18, 2022
@saschagrunert saschagrunert deleted the kubelet-pods-creation-time branch October 19, 2022 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

7 participants