kubeadm: unify the way to cleanup the files for kubeadm reset
#110972
Conversation
/sig cluster-lifecycle
/cc @neolit123
/triage accepted
/priority backlog
@chendave right now, i don't have enough time to dig deeper in the problems here and to think if there are better ways to solve them.
added some comments in the meantime.
would appreciate reviews from others.
@@ -201,3 +202,16 @@ func CleanDir(filePath string) error {
}
return nil
}

func IsEmpty(dir string) (bool, error) {
IsDirEmpty
Good catch!
if _, err := os.Stat(manifestPath); os.IsNotExist(err) {
	return nil, err
}
this is part of the general logic used elsewhere. would it cause other unexpected repercussions?
can we instead just leave it unchanged and handle it on the side of callers of ReadStaticPodFromDisk for the reset command?
+1
}
if err != nil {
	klog.Warningf("[reset] Found error when loading etcd pod spec %v", err)
"[reset] Found error when loading etcd pod spec: %v"
(added the missing ":")
github is not allowing me to use code "suggestions" for some reason
// In case `cleanup-node` phase is executed before the `remove-etcd-member` phase, kubeadm will perform best effort cleanup.
// Only the default etcd data dir will be cleanup which means if the etcd data is set to the value other than
// the default value, you might want to manually cleanup the etcd data.
klog.Warning("[reset] Manually cleanup etcd data might be needed if the cluster is reset by each phase")
thinking about this more the behavior is confusing and this message is too.
we should at least print the location that the user must cleanup manually.
i think our general goal should be to make reset idempotent, if a phase is called and then another phase or the whole command is called kubeadm should not error and just perform cleanups even if two phases have to do the same cleanup...not sure what is best here.
i think our general goal should be to make reset idempotent

Agree, this is what this PR tries to achieve.

we should at least print the location that the user must cleanup manually.

This is hard to tell since the config file has been removed by cleanup-node; is there anywhere else we can get it from? If we know where the data is persisted, we can just remove it instead of asking for manual removal.
klog.Warning("[reset] Manually cleanup etcd data might be needed if the cluster is reset by each phase")

Since we cannot tell where the etcd data is configured, I feel we can just remove this line.
func cleanDirs(data *resetData) {
	if data.DryRun() {
		fmt.Printf("[reset] Would delete contents of stateful directories: %v\n", data.dirsToClean)
		return
	}
if i'm not mistaken CleanDir has no dryrun mode and we are losing the dry run ability here.
the verbose output [reset] Would ... is useful too.
if i'm not mistaken CleanDir has no dryrun mode and we are losing the dry run ability here.

Good point!
For cleanup-node, it is already guarded here.
For remove-etcd-member, we can add a check before the CleanDir call; it's here.

the verbose output [reset] Would ... is useful too.

We have printed the message here, but now it doesn't tell whether it is a stateful dir or not.
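To make the guard concrete, here is a sketch of the pattern being discussed, with `resetData` reduced to a minimal stand-in and the cleaner injected as a function so the dry-run path is easy to verify. Field and function names are illustrative, not the exact kubeadm ones.

```go
package main

import "fmt"

// resetData is a minimal stand-in for kubeadm's reset data; only the
// fields needed for this sketch are included.
type resetData struct {
	dryRun      bool
	dirsToClean []string
}

func (d *resetData) DryRun() bool { return d.dryRun }

// cleanDirs checks DryRun before touching the filesystem, since the
// cleaner itself has no dry-run mode, and preserves the verbose
// "[reset] Would ..." output.
func cleanDirs(data *resetData, cleanDir func(string) error) {
	if data.DryRun() {
		fmt.Printf("[reset] Would delete contents of directories: %v\n", data.dirsToClean)
		return
	}
	for _, dir := range data.dirsToClean {
		fmt.Printf("[reset] Deleting contents of directory: %s\n", dir)
		if err := cleanDir(dir); err != nil {
			fmt.Printf("[reset] Failed to delete contents of %q: %v\n", dir, err)
		}
	}
}

func main() {
	data := &resetData{
		dryRun:      true,
		dirsToClean: []string{"/var/lib/kubelet", "/etc/kubernetes/manifests"},
	}
	// With dryRun set, the injected cleaner must never be called.
	cleanDirs(data, func(dir string) error {
		panic("cleanDir called in dry-run mode")
	})
}
```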
thanks @neolit123, no need to rush, we can pile this up until we have our hands free.
Should the etcd dir be cleaned in the etcd reset phase? With the current change, the etcd dir cannot be removed if we run that phase on its own; the etcd dir will not be removed.
I think it makes more sense to clean all etcd related bits as part of the etcd phase. But we should not skip deleting the etcd bits if external etcd is detected or if this is the only remaining member.
That's the bug @chendave found.
/hold cancel @pacoxu @neolit123 should work now.
My suggestion for the release note:

/release-note-edit kubeadm will clean up stale data on a best-effort basis. Stale data will be removed when each reset phase is executed; the default etcd data directory will be cleaned up when the
Makes sense @sftim, thanks for the suggestion!
/lgtm
klog.Warningf("[reset] Failed to delete contents of %q directory: %v", etcdDataDir, err)
} else {
	fmt.Printf("[reset] Deleted contents of etcd data directories: %v\n", etcdDataDir)
we can make these consistent

- klog.Warningf("[reset] Failed to delete contents of %q directory: %v", etcdDataDir, err)
- } else {
- 	fmt.Printf("[reset] Deleted contents of etcd data directories: %v\n", etcdDataDir)
+ klog.Warningf("[reset] Failed to delete contents of the etcd data directory %q: %v", etcdDataDir, err)
+ } else {
+ 	fmt.Printf("[reset] Deleted contents of the etcd data directory: %v\n", etcdDataDir)
}
} else {
	fmt.Println("[reset] Would remove the etcd member on this node from the etcd cluster")
	fmt.Printf("[reset] Would delete contents of etcd data directories: %v\n", etcdDataDir)
- fmt.Printf("[reset] Would delete contents of etcd data directories: %v\n", etcdDataDir)
+ fmt.Printf("[reset] Would delete contents of the etcd data directory: %v\n", etcdDataDir)
Guarantee that stale files are removed if the end user resets the cluster by running each reset phase.
Signed-off-by: Dave Chen <dave.chen@arm.com>
Signed-off-by: Dave Chen <dave.chen@arm.com>
@neolit123 all addressed, thanks for the review!
thanks, i would like someone to do another pass for LGTM
@pacoxu Needs your lgtm or some further comments on this. :)
Overall LGTM. I think the current logic has a little weird behavior.
As we assume the user will reset the cluster soon, this would not be a problem.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: chendave, neolit123, pacoxu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
ACK, thanks!
/retest
irrelevant flake.
Split the PR into two commits, each focusing on one specific issue.

commit 1
Move the logic of file cleanup within each phase.
This addresses the issue that if the end user resets the cluster by calling the three phases instead of kubeadm reset, the stale data, such as /var/lib/kubelet/, is still cleaned up eventually.

commit 2
Clean up the etcd data dir on a best-effort basis.
If the end user calls reset phase remove-etcd-member before reset cleanup-node, then commit 1 should be enough; otherwise, there is no way to tell where the etcd data dir is configured by the user, so kubeadm will check the default etcd data directory and make sure that directory is cleaned up. This addresses most of the problem.

What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes kubernetes/kubeadm#2721
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: