kubeadm: unify the way to cleanup the files for `kubeadm reset` #110972

chendave · 2022-07-06T04:24:17Z

The PR is split into two commits, each focused on one specific issue.

commit 1
Move the file cleanup logic into each phase.
This addresses the case where the end user resets the cluster by calling the three phases individually instead of running kubeadm reset: stale data such as /var/lib/kubelet/ is now cleaned up as well.
commit 2
Clean up the etcd data dir on a best-effort basis.
If the end user calls reset phase remove-etcd-member before reset cleanup-node, commit 1 should be enough. Otherwise, there is no way to tell where the user configured the etcd data dir, so kubeadm checks the default etcd data directory and makes sure that directory is cleaned up. This should address most of the problem.

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes kubernetes/kubeadm#2721

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubeadm will clean up stale data on a best-effort basis. Stale data is removed when each reset phase is executed; the default etcd data directory is cleaned up when the `remove-etcd-member` phase is executed.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

chendave · 2022-07-06T04:24:57Z

/sig cluster-lifecycle

chendave · 2022-07-06T04:33:38Z

/cc @neolit123

neolit123

/triage accepted
/priority backlog

neolit123

@chendave right now, i don't have enough time to dig deeper in the problems here and to think if there are better ways to solve them.

added some comments in the meantime.
would appreciate reviews from others.

neolit123 · 2022-07-08T09:43:14Z

cmd/kubeadm/app/cmd/phases/reset/cleanupnode.go

@@ -201,3 +202,16 @@ func CleanDir(filePath string) error {
 	}
 	return nil
 }
+
+func IsEmpty(dir string) (bool, error) {


Good catch!

neolit123 · 2022-07-08T09:45:21Z

cmd/kubeadm/app/util/staticpod/utils.go

+	if _, err := os.Stat(manifestPath); os.IsNotExist(err) {
+		return nil, err
+	}


this is part of the general logic used elsewhere. would it cause other unexpected repercussions?
can we instead just leave it unchanged and handle it on the side of callers of ReadStaticPodFromDisk for the reset command?

neolit123 · 2022-07-08T09:48:06Z

cmd/kubeadm/app/cmd/phases/reset/removeetcdmember.go

 	}
+	if err != nil {
+		klog.Warningf("[reset] Found error when loading etcd pod spec %v", err)


"[reset] Found error when loading etcd pod spec: %v"
(added :)

github is not allowing me to use code "suggestions" for some reason

neolit123 · 2022-07-08T09:50:11Z

cmd/kubeadm/app/cmd/phases/reset/removeetcdmember.go

+		// In case `cleanup-node` phase is executed before the `remove-etcd-member` phase, kubeadm will perform best effort cleanup.
+		// Only the default etcd data dir will be cleanup which means if the etcd data is set to the value other than
+		// the default value, you might want to manually cleanup the etcd data.
+		klog.Warning("[reset] Manually cleanup etcd data might be needed if the cluster is reset by each phase")


thinking about this more the behavior is confusing and this message is too.
we should at least print the location that the user must cleanup manually.

i think our general goal should be to make reset idempotent, if a phase is called and then another phase or the whole command is called kubeadm should not error and just perform cleanups even if two phases have to do the same cleanup...not sure what is best here.

> i think our general goal should be to make reset idempotent

Agreed, this is what this PR tries to achieve.

> we should at least print the location that the user must cleanup manually.

This is hard to tell, since the config file has already been removed by cleanup-node. Is there anywhere else we can get it from? If we knew where the data is persisted, we could just remove it instead of asking for manual removal.

> klog.Warning("[reset] Manually cleanup etcd data might be needed if the cluster is reset by each phase")

Since we cannot tell where the etcd data dir is configured, I feel we can just remove this line.

neolit123 · 2022-07-08T09:54:53Z

cmd/kubeadm/app/cmd/reset.go

-func cleanDirs(data *resetData) {
-	if data.DryRun() {
-		fmt.Printf("[reset] Would delete contents of stateful directories: %v\n", data.dirsToClean)
-		return
-	}


if i'm not mistaken CleanDir has no dryrun mode and we are losing the dry run ability here.

the verbose output [reset] Would ... is useful too.

> if i'm not mistaken CleanDir has no dryrun mode and we are losing the dry run ability here.

Good point!

For cleanup-node, it is already guarded here.

For remove-etcd-member, we can add a check before the CleanDir call; it's here.

> the verbose output [reset] Would ... is useful too.

We print the message here, but it no longer tells whether the directory is a stateful dir or not.
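
For illustration: a dry-run guard placed in front of the cleanup, as discussed above, could be sketched as follows. The `cleanDirs` signature and the `clean` callback are hypothetical, and the sketch uses `fmt` where the PR uses `klog` so that it stays within the standard library.

```go
package main

import "fmt"

// cleanDirs is a hypothetical wrapper. In dry-run mode it only reports what
// would be deleted; otherwise it calls clean for each directory.
func cleanDirs(dirs []string, dryRun bool, clean func(string) error) {
	if dryRun {
		fmt.Printf("[reset] Would delete contents of directories: %v\n", dirs)
		return
	}
	for _, dir := range dirs {
		if err := clean(dir); err != nil {
			// Best effort: report the failure and continue with the next directory.
			fmt.Printf("[reset] Failed to delete contents of %q directory: %v\n", dir, err)
			continue
		}
		fmt.Printf("[reset] Deleted contents of directory: %v\n", dir)
	}
}

func main() {
	// Placeholder cleanup that deletes nothing; a real caller would pass
	// the actual directory-cleaning function here.
	noop := func(string) error { return nil }

	cleanDirs([]string{"/var/lib/etcd"}, true, noop)
}
```

Passing the cleanup as a function keeps the guard in one place, so each phase gets the same dry-run behavior and the same "Would delete" output.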

chendave · 2022-07-08T11:09:45Z

thanks @neolit123, no need to rush, we can park this until we have our hands free.

chendave · 2022-07-12T08:28:34Z

Hope this version looks better.

@SataQiu @pacoxu would love to hear from you too. :)

pacoxu · 2022-07-14T03:52:10Z

etcd reset phase

  cleanup-node       Run cleanup node.
  preflight          Run reset pre-flight checks
  remove-etcd-member Remove a local etcd member.

cleanup-node will remove manifests & kubelet & pki.

[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

Should remove-etcd-member remove /var/lib/etcd? Currently not. See kubernetes/kubeadm#2721 (comment)

With the current change, etcd dir cannot be removed if we run

./kubeadm init
./kubeadm -v 4 reset phase remove-etcd-member
./kubeadm reset -f

The etcd dir will not be removed.
Is this expected?

neolit123 · 2022-07-14T08:15:25Z

I think it makes more sense to clean all etcd related bits as part of the etcd phase. But we should not skip deleting the etcd bits if external etcd is detected or if this is the only remaining member.

neolit123 · 2022-07-14T08:16:20Z

> The etcd dir will not be removed.
> Is this expected?

That's the bug @chendave found.

chendave · 2022-07-18T11:04:46Z

/hold cancel

@pacoxu @neolit123 should work now.

sftim · 2022-08-01T18:11:10Z

My suggestion for the release note:

The `kubeadm` tool now cleans up its stale data (on a best effort basis). Stale data will be removed when each reset phase is executed; the default etcd data directory will be cleaned up when the `remove-etcd-member` phase is executed.

chendave · 2022-08-02T02:32:56Z

/release-note-edit Kubeadm will cleanup the stale data on best effort basis. Stale data will be removed when each reset phase are executed, default etcd data directory will be cleanup when the remove-etcd-member phase are executed.

chendave · 2022-08-02T02:34:07Z

Makes sense @sftim, thanks for the suggestion!

pacoxu · 2022-08-08T08:18:19Z

/lgtm
tested

neolit123 · 2022-08-25T09:55:34Z

cmd/kubeadm/app/cmd/phases/reset/removeetcdmember.go

+				klog.Warningf("[reset] Failed to delete contents of %q directory: %v", etcdDataDir, err)
+			} else {
+				fmt.Printf("[reset] Deleted contents of etcd data directories: %v\n", etcdDataDir)


we can make these consistent

Suggested change

-				klog.Warningf("[reset] Failed to delete contents of %q directory: %v", etcdDataDir, err)
-			} else {
-				fmt.Printf("[reset] Deleted contents of etcd data directories: %v\n", etcdDataDir)
+				klog.Warningf("[reset] Failed to delete contents of the etcd data directory %q: %v", etcdDataDir, err)
+			} else {
+				fmt.Printf("[reset] Deleted contents of the etcd data directory: %v\n", etcdDataDir)

neolit123 · 2022-08-25T09:56:17Z

cmd/kubeadm/app/cmd/phases/reset/removeetcdmember.go

+						klog.Warningf("[reset] Failed to delete contents of %q directory: %v", etcdDataDir, err)
+					} else {
+						fmt.Printf("[reset] Deleted contents of etcd data directories: %v\n", etcdDataDir)


same here as https://github.com/kubernetes/kubernetes/pull/110972/files#r954764366

neolit123 · 2022-08-25T09:56:39Z

cmd/kubeadm/app/cmd/phases/reset/removeetcdmember.go

 				}
 			} else {
 				fmt.Println("[reset] Would remove the etcd member on this node from the etcd cluster")
+				fmt.Printf("[reset] Would delete contents of etcd data directories: %v\n", etcdDataDir)


Suggested change

-				fmt.Printf("[reset] Would delete contents of etcd data directories: %v\n", etcdDataDir)
+				fmt.Printf("[reset] Would delete contents of the etcd data directory: %v\n", etcdDataDir)

chendave · 2022-08-26T04:00:06Z

@neolit123 all addressed, thanks for the review!

neolit123 · 2022-08-26T06:47:19Z

thanks, i would like someone to do another pass for LGTM
/approve

chendave · 2022-08-29T02:57:33Z

@pacoxu Needs your lgtm or some further comments on this. :)

pacoxu · 2022-08-29T06:29:48Z

Overall LGTM.

I think the current logic has one slightly odd behavior.

cleanup-node will remove manifests & kubelet & pki. (This behavior is not changed.)
remove-etcd-member will remove /var/lib/etcd regardless of whether the etcd manifest (/etc/kubernetes/manifests/etcd.yaml) exists.

If the etcd manifest exists, kubeadm reset phase remove-etcd-member will remove /var/lib/etcd, and the kubelet will later restart etcd and recreate the dir. In this case, removing the dir here is in vain.

As we assume the user will reset the cluster soon, this would not be a problem.

/approve
/lgtm

k8s-ci-robot · 2022-08-29T06:30:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chendave, neolit123, pacoxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cmd/kubeadm/OWNERS~~ [neolit123]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

chendave · 2022-08-29T06:40:49Z

> If the etcd manifest exists, kubeadm reset phase remove-etcd-member will remove /var/lib/etcd, and the kubelet will later restart etcd and recreate the dir. In this case, removing the dir here is in vain.

ACK, thanks!

chendave · 2022-08-29T07:14:30Z

/retest

Unrelated flaky test.

k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 6, 2022

k8s-ci-robot requested review from pacoxu and SataQiu July 6, 2022 04:25

k8s-ci-robot added the area/kubeadm label Jul 6, 2022

k8s-ci-robot requested a review from neolit123 July 6, 2022 04:33

neolit123 reviewed Jul 6, 2022

neolit123 reviewed Jul 8, 2022

chendave force-pushed the cleanup_data branch from a31e223 to e391e6d Compare July 12, 2022 08:15

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 12, 2022

chendave force-pushed the cleanup_data branch from e391e6d to ac074de Compare July 12, 2022 08:23

chendave force-pushed the cleanup_data branch from c6eaf09 to 5f1854e Compare July 18, 2022 11:04

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2022

k8s-ci-robot assigned pacoxu Aug 8, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2022

chendave mentioned this pull request Aug 21, 2022

skip image pull for addons that are skipped kubernetes/kubeadm#2603

Closed

neolit123 reviewed Aug 25, 2022

chendave added 2 commits August 26, 2022 11:30

Move the logic of file cleanup within each phase

f180a3f

Guarantee that stale files are removed if end user resets cluster by resetting each phase. Signed-off-by: Dave Chen <dave.chen@arm.com>

Cleanup etcd data dir on best effort basis

71ef1ea

Signed-off-by: Dave Chen <dave.chen@arm.com>

chendave force-pushed the cleanup_data branch from 5f1854e to 71ef1ea Compare August 26, 2022 03:58

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 26, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 29, 2022

k8s-ci-robot merged commit 891cbed into kubernetes:master Aug 29, 2022

k8s-ci-robot added this to the v1.26 milestone Aug 29, 2022

chendave deleted the cleanup_data branch August 29, 2022 07:46

chendave mentioned this pull request Oct 14, 2022

Decouple kubeadm phases and make each phase idempotent kubernetes/kubeadm#2769

Closed

4 tasks
