Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubeadm: unify the way to cleanup the files for kubeadm reset #110972

Merged
merged 2 commits into from Aug 29, 2022

Conversation

chendave
Copy link
Member

@chendave chendave commented Jul 6, 2022

Split the PR into two commits with each one focus on one specific issue,

  • commit 1
    Move the logic of file cleanup within each phase.
    This address the issue that if end user resets cluster by the calling of the three phases instead of kubeadm reset, then the stale data, such as /var/lib/kubelet/ could be cleanup eventually.

  • commit 2
    Cleanup etcd data dir on best effort basis.
    If end user call reset phase remove-etcd-member before reset cleanup-node then commit1 should be enough, otherwise, there is no way to tell where is the etcd data dir is configured by user, kubeadm will check the default etcd data directory and make sure that directory will be cleanup, this would address most of the problem.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes kubernetes/kubeadm#2721

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Kubeadm will cleanup the stale data on best effort basis. Stale data will be removed when each reset phase are executed, default etcd data directory will be cleanup when the `remove-etcd-member` phase are executed.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 6, 2022
@chendave
Copy link
Member Author

chendave commented Jul 6, 2022

/sig cluster-lifecycle

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 6, 2022
@chendave
Copy link
Member Author

chendave commented Jul 6, 2022

/cc @neolit123

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 6, 2022
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chendave right now, i don't have enough time to dig deeper in the problems here and to think if there are better ways to solve them.

added some comments in the meantime.
would appreciate reviews from others.

@@ -201,3 +202,16 @@ func CleanDir(filePath string) error {
}
return nil
}

func IsEmpty(dir string) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsDirEmpty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Comment on lines 223 to 225
if _, err := os.Stat(manifestPath); os.IsNotExist(err) {
return nil, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is part of the general logic used elsewhere. would it cause other unexpected repercussions?
can we instead just leave it unchanged and handle it on the side of callers of ReadStaticPodFromDisk for the reset command?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}
if err != nil {
klog.Warningf("[reset] Found error when loading etcd pod spec %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"[reset] Found error when loading etcd pod spec: %v"
(added :)

github is not allowing me to use code "suggestions" for some reason

// In case `cleanup-node` phase is executed before the `remove-etcd-member` phase, kubeadm will perform best effort cleanup.
// Only the default etcd data dir will be cleanup which means if the etcd data is set to the value other than
// the default value, you might want to manually cleanup the etcd data.
klog.Warning("[reset] Manually cleanup etcd data might be needed if the cluster is reset by each phase")
Copy link
Member

@neolit123 neolit123 Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking about this more the behavior is confusing and this message is too.
we should at least print the location that the user must cleanup manually.

i think our general goal should be to make reset idempotent, if a phase is called and then another phase or the whole command is called kubeadm should not error and just perform cleanups even if two phases have to do the same cleanup...not sure what is best here.

Copy link
Member Author

@chendave chendave Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think our general goal should be to make reset idempotent

Agree, this is what this pr try to achieve.

we should at least print the location that the user must cleanup manually.

This is hard to tell since the config file has been removed by the cleanup-node, anywhere else we can get it? If we know where is the data persisted, we can just remove them instead of asking for manually removal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

klog.Warning("[reset] Manually cleanup etcd data might be needed if the cluster is reset by each phase")

now that we cannot tell where is the etcd data is configured, I feel we can just remove this line.

Comment on lines -223 to -227
func cleanDirs(data *resetData) {
if data.DryRun() {
fmt.Printf("[reset] Would delete contents of stateful directories: %v\n", data.dirsToClean)
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i'm not mistaken CleanDir has no dryrun mode and we are losing the dry run ability here.

the verbose output [reset] Would ... is useful too.

Copy link
Member Author

@chendave chendave Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i'm not mistaken CleanDir has no dryrun mode and we are losing the dry run ability here.

Good point!

For the cleanup-node, it is already guarded here

For the remove-etcd-member, We can add a check before the CleanDir, it's here

the verbose output [reset] Would ... is useful too.

We have printed the message here, but now it doesn't tell whether it is stateful dir or not.

@chendave
Copy link
Member Author

chendave commented Jul 8, 2022

thanks @neolit123 , no need to rush, we can pile this up until we have our hands free.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 12, 2022
@chendave
Copy link
Member Author

Hope this version is better to look.

@SataQiu @pacoxu would love to hear from you too. :)

@pacoxu
Copy link
Member

pacoxu commented Jul 14, 2022

etcd reset phase

  cleanup-node       Run cleanup node.
  preflight          Run reset pre-flight checks
  remove-etcd-member Remove a local etcd member.

cleanup-node will remove manifests & kubelet & pki.

[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

Should remove-etcd-member remove /var/lib/etcd? Currently not. See kubernetes/kubeadm#2721 (comment)

With the current change, etcd dir cannot be removed if we run

./kubeadm init
./kubeadm -v 4 reset phase remove-etcd-member
./kubeadm reset -f 

The etcd dir will not be removed.
Is this expected?

@neolit123
Copy link
Member

I think it makes more sense to clean all etcd related bits as part of the etcd phase. But we should not skip deleting the etcd bits if external etcd is detected or if this is the only remaining member.

@neolit123
Copy link
Member

The etcd dir will not be removed.
Is this expected?

That's the bug @chendave found.

@chendave
Copy link
Member Author

/hold cancel

@pacoxu @neolit123 should work now.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2022
@sftim
Copy link
Contributor

sftim commented Aug 1, 2022

My suggestion for the release note:

The `kubeadm` tool now cleans its up stale data (on a best effort basis). Stale data will be removed when each reset phase is executed; the default etcd data directory will be cleaned up when the `remove-etcd-member` phase is executed.

@chendave
Copy link
Member Author

chendave commented Aug 2, 2022

/release-note-edit Kubeadm will cleanup the stale data on best effort basis. Stale data will be removed when each reset phase are executed, default etcd data directory will be cleanup when the remove-etcd-member phase are executed.

@chendave
Copy link
Member Author

chendave commented Aug 2, 2022

make sense @sftim thanks for the suggestion!

@pacoxu
Copy link
Member

pacoxu commented Aug 8, 2022

/lgtm
tested

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2022
Comment on lines 84 to 86
klog.Warningf("[reset] Failed to delete contents of %q directory: %v", etcdDataDir, err)
} else {
fmt.Printf("[reset] Deleted contents of etcd data directories: %v\n", etcdDataDir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can make these consistent

Suggested change
klog.Warningf("[reset] Failed to delete contents of %q directory: %v", etcdDataDir, err)
} else {
fmt.Printf("[reset] Deleted contents of etcd data directories: %v\n", etcdDataDir)
klog.Warningf("[reset] Failed to delete contents of the etcd data directory: %v", etcdDataDir, err)
} else {
fmt.Printf("[reset] Deleted contents of the etcd data directory: %v\n", etcdDataDir)

Comment on lines 69 to 71
klog.Warningf("[reset] Failed to delete contents of %q directory: %v", etcdDataDir, err)
} else {
fmt.Printf("[reset] Deleted contents of etcd data directories: %v\n", etcdDataDir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
} else {
fmt.Println("[reset] Would remove the etcd member on this node from the etcd cluster")
fmt.Printf("[reset] Would delete contents of etcd data directories: %v\n", etcdDataDir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fmt.Printf("[reset] Would delete contents of etcd data directories: %v\n", etcdDataDir)
fmt.Printf("[reset] Would delete contents of the etcd data directory: %v\n", etcdDataDir)

Guarantee that stale files are removed if end user resets cluster
by resetting each phase.

Signed-off-by: Dave Chen <dave.chen@arm.com>
Signed-off-by: Dave Chen <dave.chen@arm.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 26, 2022
@chendave
Copy link
Member Author

@neolit123 all addressed, thanks for the review!

@neolit123
Copy link
Member

thanks, i would like someone to do another pass for LGTM
/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2022
@chendave
Copy link
Member Author

@pacoxu Needs your lgtm or some further comments on this. :)

@pacoxu
Copy link
Member

pacoxu commented Aug 29, 2022

Overall LGTM.

I think the current logic has a little weird behavior.

  • cleanup-node will remove manifests & kubelet & pki. (This behavior is not changed.)
  • remove-etcd-member will remove /var/lib/etcd no matter whether etcd manifest(/etc/kubernetes/manifests/etcd.yaml) exists.

If etcd manifest exists, kubeadm reset phase remove-etcd-member will remove /var/lib/etcd and kubelet will later restart etcd and recreate the dir. In this case, the dir remove here is in vain.

As we assume the user will reset the cluster soon, this would not be a problem.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 29, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chendave, neolit123, pacoxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@chendave
Copy link
Member Author

If etcd manifest exists, kubeadm reset phase remove-etcd-member will remove /var/lib/etcd and kubelet will later restart etcd and recreate the dir. In this case, the dir remove here is in vain.

ACK, thanks!

@chendave
Copy link
Member Author

/retest

irrelevant flaky.

@k8s-ci-robot k8s-ci-robot merged commit 891cbed into kubernetes:master Aug 29, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Aug 29, 2022
@chendave chendave deleted the cleanup_data branch August 29, 2022 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

proposal: unify the way to cleanup the files for kubeadm reset
5 participants