
kubeadm: add the preferred pod anti-affinity for CoreDNS Deployment #110593

Merged
1 commit merged into kubernetes:master on Jun 22, 2022

Conversation

SataQiu
Member

@SataQiu SataQiu commented Jun 15, 2022

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:

kubeadm: add the preferred pod anti-affinity for CoreDNS Deployment.
The scheduler's performance has improved, and anti-affinity for CoreDNS was already added back in #92652, so it is reasonable to enable the preferred form by default.
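
For reference, this is roughly the stanza added under the pod template of the coredns Deployment (a sketch; the full rendered Deployment is shown later in this thread):

# added under .spec.template.spec of the coredns Deployment
affinity:
  podAntiAffinity:
    # prefer (but do not require) spreading kube-dns replicas across nodes
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - kube-dns
        topologyKey: kubernetes.io/hostname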

Which issue(s) this PR fixes:

Ref kubernetes/kubeadm#995 #54164 #90248

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubeadm: the preferred pod anti-affinity for CoreDNS is now enabled by default

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 15, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SataQiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 15, 2022
Member

@neolit123 neolit123 left a comment

I think we closed a similar PR for kubeadm not so long ago. I'm on mobile right now, so I don't have the link. The problem there, I think, was that one pod would schedule on one node and the deployment would stay non-ready until another node appears. Is that true?

If so, it breaks readiness checks for users' single-node clusters and for kinder e2e tests.

It has been established that users can instead patch the coredns Deployment themselves, or skip the addon in kubeadm and deploy it manually.

@SataQiu
Member Author

SataQiu commented Jun 15, 2022

I think we closed a similar PR for kubeadm not so long ago. I'm on mobile right now, so I don't have the link. The problem there, I think, was that one pod would schedule on one node and the deployment would stay non-ready until another node appears. Is that true?

Thanks for your reply, @neolit123.
In my opinion, that's not quite true. Because we use preferredDuringSchedulingIgnoredDuringExecution rather than requiredDuringSchedulingIgnoredDuringExecution, when there is only one node all the CoreDNS Pods are scheduled onto that node and no Pod is left Pending. In theory, this doesn't break our existing processing logic.

When more nodes have joined, the user can run kubectl -n kube-system rollout restart deployment coredns to rebalance the CoreDNS Pods (with this patch they should spread out across different nodes, which is what we expect), and we have already noted this in the official documentation:

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/

Note: As the cluster nodes are usually initialized sequentially, the CoreDNS Pods are likely to all run on the first control-plane node. To provide higher availability, please rebalance the CoreDNS Pods with kubectl -n kube-system rollout restart deployment coredns after at least one new node is joined.
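
As an illustration of that note, the rebalance flow after another node joins would look something like this (commands only; node and pod names will differ per cluster):

# after at least one more node has joined the cluster
kubectl get nodes

# restart the Deployment so the scheduler re-places the replicas;
# with the preferred anti-affinity they should land on different nodes
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout status deployment coredns

# verify the spread across nodes
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide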

I also tested the single-node case locally, and cluster initialization completed normally.

root@8c651d9df8e3:/# kubectl get node
NAME           STATUS   ROLES           AGE   VERSION
8c651d9df8e3   Ready    control-plane   18m   v1.24.0-beta.0

root@8c651d9df8e3:/# kubectl get pod -A | grep coredns
kube-system       coredns-d669857b7-kjvwf                    1/1     Running   0          18m
kube-system       coredns-d669857b7-n49n4                    1/1     Running   0          18m

root@8c651d9df8e3:/# kubectl get deployment -n kube-system coredns -oyaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-06-15T11:46:10Z"
  generation: 1
  labels:
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system
  resourceVersion: "1587"
  uid: a8cc53df-1995-42ac-ad28-dbd22e835400
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kube-dns
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - kube-dns
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - args:
        - -conf
        - /etc/coredns/Corefile
        image: registry.k8s.io/coredns/coredns:v1.9.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: coredns
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/coredns
          name: config-volume
          readOnly: true
      dnsPolicy: Default
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: coredns
      serviceAccountName: coredns
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: Corefile
            path: Corefile
          name: coredns
        name: config-volume
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2022-06-15T11:53:57Z"
    lastUpdateTime: "2022-06-15T11:53:57Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-06-15T11:46:23Z"
    lastUpdateTime: "2022-06-15T11:54:03Z"
    message: ReplicaSet "coredns-d669857b7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

@neolit123
Member

When more nodes have joined, the user can run kubectl -n kube-system rollout restart deployment coredns to rebalance the CoreDNS Pods (with this patch they should spread out across different nodes, which is what we expect), and we have already noted this in the official documentation.

From my past experiments, even without this change the rollout resulted in the pods spreading out to more nodes rather than landing on the same node. At least I did not see that happening.

In a way, with this change we are ensuring that the pods will preferably not land on the same node if there are multiple nodes. Is that correct?

Also cc @rajansandeep @chrisohaver in case they want to comment.

@chrisohaver
Contributor

LGTM

@pacoxu
Member

pacoxu commented Jun 16, 2022

LGTM

preferredDuringSchedulingIgnoredDuringExecution seems to be better.

https://github.com/coredns/deployment/blob/7ae45a97b725c605126f66a58d172347fe1a0cea/kubernetes/coredns.yaml.sed#L106-L114
However, the coredns/deployment manifest uses requiredDuringSchedulingIgnoredDuringExecution instead.

https://github.com/coredns/helm/blob/4a9f0dd13f08ff41ce768b11a830e48e54ef8b1d/charts/coredns/values.yaml#L174-L185
And the coredns Helm chart defaults to an empty affinity.

@neolit123
Member

neolit123 commented Jun 22, 2022

This saw no objections, so I'm adding the /lgtm label.

@SataQiu please see my question above:

In a way, with this change we are ensuring that the pods will preferably not land on the same node if there are multiple nodes. Is that correct?

/triage accepted
/priority backlog
/lgtm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 22, 2022
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 22, 2022
@k8s-ci-robot k8s-ci-robot merged commit e9702cf into kubernetes:master Jun 22, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jun 22, 2022
@SataQiu
Member Author

SataQiu commented Jun 23, 2022

In a way, with this change we are ensuring that the pods will preferably not land on the same node if there are multiple nodes. Is that correct?

@neolit123 Yes, this keeps CoreDNS as highly available as possible without breaking any existing behavior.

@pacoxu
Member

pacoxu commented Jun 24, 2022

When more nodes have joined, the user can run kubectl -n kube-system rollout restart deployment coredns to rebalance the CoreDNS Pods (with this patch they should spread out across different nodes, which is what we expect), and we have already noted this in the official documentation.

  • preferredDuringSchedulingIgnoredDuringExecution: if the user never rebalances CoreDNS, the CoreDNS Pods are initially scheduled only onto the first control-plane node. (After any upgrade or node change they get rescheduled, which is fine in most scenarios.) This is better on Day 2.
  • requiredDuringSchedulingIgnoredDuringExecution: the user gets one Pending Pod after init; it is only scheduled once a new node joins. It is better only on Day 1 (the required form is sketched right after this list).
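
For contrast, the required form referenced above looks roughly like this; with it, a single-node kubeadm cluster ends up with one Pending CoreDNS replica until a second node joins:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    # hard rule: never co-locate two kube-dns Pods on the same node,
    # so the second replica stays Pending until another node exists
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      topologyKey: kubernetes.io/hostname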

preferredDuringSchedulingIgnoredDuringExecution also makes more sense if the user is running the descheduler. 😄
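
For example, a minimal descheduler policy sketch (assuming the descheduler's v1alpha1 policy format) with the RemoveDuplicates strategy would periodically evict a duplicate CoreDNS replica from a node, and the preferred anti-affinity would then favor placing the replacement on another node:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    # evict pods when more than one replica of the same owner runs on a
    # single node; when they are re-created, the preferred anti-affinity
    # steers them toward different nodes
    enabled: true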
