
kubeadm: add the preferred pod anti-affinity for CoreDNS Deployment #110593

Merged
1 commit merged into kubernetes:master on Jun 22, 2022

Conversation

SataQiu
Member

@SataQiu SataQiu commented Jun 15, 2022

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:

kubeadm: add the preferred pod anti-affinity for CoreDNS Deployment.
The scheduler's performance has improved, and anti-affinity for CoreDNS was already added back in #92652, so it is reasonable to enable the preferred form by default.
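
For reference, this is roughly the stanza added under the pod template of the coredns Deployment (a sketch; the full rendered Deployment is shown later in this thread):

# added under .spec.template.spec of the coredns Deployment
affinity:
  podAntiAffinity:
    # prefer (but do not require) spreading kube-dns replicas across nodes
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - kube-dns
        topologyKey: kubernetes.io/hostname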

Which issue(s) this PR fixes:

Ref kubernetes/kubeadm#995 #54164 #90248

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubeadm: the preferred pod anti-affinity for CoreDNS is now enabled by default

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 15, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SataQiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 15, 2022
Member

@neolit123 neolit123 left a comment

I think we closed a similar PR for kubeadm not so long ago. I'm on mobile right now, so I don't have the link. The problem there, I think, was that one pod would schedule on one node and the deployment would stay non-ready until another node appears. Is that true?

If so, it breaks readiness checks for users' single-node clusters and for kinder e2e tests.

It has been established that users can instead patch the coredns Deployment themselves, or skip the addon in kubeadm and deploy it manually.

@SataQiu
Member Author

SataQiu commented Jun 15, 2022

I think we closed a similar PR for kubeadm not so long ago. I'm on mobile right now, so I don't have the link. The problem there, I think, was that one pod would schedule on one node and the deployment would stay non-ready until another node appears. Is that true?

Thanks for your reply, @neolit123.
In my opinion, that's not quite true. Because we use preferredDuringSchedulingIgnoredDuringExecution rather than requiredDuringSchedulingIgnoredDuringExecution, when there is only one node all the CoreDNS Pods are scheduled onto that node and no Pod is left Pending. In theory, this doesn't break our existing processing logic.

When more nodes have joined, the user can run kubectl -n kube-system rollout restart deployment coredns to rebalance the CoreDNS Pods (with this patch they should spread out across different nodes, which is what we expect), and we have already noted this in the official documentation:

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/

Note: As the cluster nodes are usually initialized sequentially, the CoreDNS Pods are likely to all run on the first control-plane node. To provide higher availability, please rebalance the CoreDNS Pods with kubectl -n kube-system rollout restart deployment coredns after at least one new node is joined.
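
As an illustration of that note, the rebalance flow after another node joins would look something like this (commands only; node and pod names will differ per cluster):

# after at least one more node has joined the cluster
kubectl get nodes

# restart the Deployment so the scheduler re-places the replicas;
# with the preferred anti-affinity they should land on different nodes
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout status deployment coredns

# verify the spread across nodes
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide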

I also tested the single-node case locally, and cluster initialization completed normally.

root@8c651d9df8e3:/# kubectl get node
NAME           STATUS   ROLES           AGE   VERSION
8c651d9df8e3   Ready    control-plane   18m   v1.24.0-beta.0

root@8c651d9df8e3:/# kubectl get pod -A | grep coredns
kube-system       coredns-d669857b7-kjvwf                    1/1     Running   0          18m
kube-system       coredns-d669857b7-n49n4                    1/1     Running   0          18m

root@8c651d9df8e3:/# kubectl get deployment -n kube-system coredns -oyaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-06-15T11:46:10Z"
  generation: 1
  labels:
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system
  resourceVersion: "1587"
  uid: a8cc53df-1995-42ac-ad28-dbd22e835400
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kube-dns
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - kube-dns
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - args:
        - -conf
        - /etc/coredns/Corefile
        image: registry.k8s.io/coredns/coredns:v1.9.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: coredns
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/coredns
          name: config-volume
          readOnly: true
      dnsPolicy: Default
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: coredns
      serviceAccountName: coredns
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: Corefile
            path: Corefile
          name: coredns
        name: config-volume
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2022-06-15T11:53:57Z"
    lastUpdateTime: "2022-06-15T11:53:57Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-06-15T11:46:23Z"
    lastUpdateTime: "2022-06-15T11:54:03Z"
    message: ReplicaSet "coredns-d669857b7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

@neolit123
Member

When more nodes have joined, the user can run kubectl -n kube-system rollout restart deployment coredns to rebalance the CoreDNS Pods (with this patch they should spread out across different nodes, which is what we expect), and we have already noted this in the official documentation.

From my past experiments, even without this change the rollout resulted in the pods spreading out to more nodes rather than landing on the same node. At least I did not see that happening.

In a way, with this change we are ensuring that the pods will preferably not land on the same node if there are multiple nodes. Is that correct?

Also cc @rajansandeep @chrisohaver in case they want to comment.

@chrisohaver
Contributor

LGTM

@pacoxu
Member

pacoxu commented Jun 16, 2022

LGTM

preferredDuringSchedulingIgnoredDuringExecution seems to be better.

https://github.com/coredns/deployment/blob/7ae45a97b725c605126f66a58d172347fe1a0cea/kubernetes/coredns.yaml.sed#L106-L114
However, the coredns/deployment manifest uses requiredDuringSchedulingIgnoredDuringExecution instead.

https://github.com/coredns/helm/blob/4a9f0dd13f08ff41ce768b11a830e48e54ef8b1d/charts/coredns/values.yaml#L174-L185
And the coredns Helm chart defaults to an empty affinity.

@neolit123
Member

neolit123 commented Jun 22, 2022

This saw no objections, so I'm adding the /lgtm label.

@SataQiu please see my question above:

In a way, with this change we are ensuring that the pods will preferably not land on the same node if there are multiple nodes. Is that correct?

/triage accepted
/priority backlog
/lgtm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 22, 2022
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 22, 2022
@k8s-ci-robot k8s-ci-robot merged commit e9702cf into kubernetes:master Jun 22, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jun 22, 2022
@SataQiu
Member Author

SataQiu commented Jun 23, 2022

In a way, with this change we are ensuring that the pods will preferably not land on the same node if there are multiple nodes. Is that correct?

@neolit123 Yes, this keeps CoreDNS as highly available as possible without breaking any existing behavior.

@pacoxu
Member

pacoxu commented Jun 24, 2022

When more nodes have joined, the user can run kubectl -n kube-system rollout restart deployment coredns to rebalance the CoreDNS Pods (with this patch they should spread out across different nodes, which is what we expect), and we have already noted this in the official documentation.

  • preferredDuringSchedulingIgnoredDuringExecution: if the user never rebalances CoreDNS, the CoreDNS Pods are initially scheduled only onto the first control-plane node. (After any upgrade or node change they get rescheduled, which is fine in most scenarios.) This is better on Day 2.
  • requiredDuringSchedulingIgnoredDuringExecution: the user gets one Pending Pod after init; it is only scheduled once a new node joins. It is better only on Day 1 (the required form is sketched right after this list).
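
For contrast, the required form referenced above looks roughly like this; with it, a single-node kubeadm cluster ends up with one Pending CoreDNS replica until a second node joins:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    # hard rule: never co-locate two kube-dns Pods on the same node,
    # so the second replica stays Pending until another node exists
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      topologyKey: kubernetes.io/hostname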

preferredDuringSchedulingIgnoredDuringExecution also makes more sense if the user is running the descheduler. 😄
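
For example, a minimal descheduler policy sketch (assuming the descheduler's v1alpha1 policy format) with the RemoveDuplicates strategy would periodically evict a duplicate CoreDNS replica from a node, and the preferred anti-affinity would then favor placing the replacement on another node:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    # evict pods when more than one replica of the same owner runs on a
    # single node; when they are re-created, the preferred anti-affinity
    # steers them toward different nodes
    enabled: true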
