New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature(scheduler): add "goroutines" metric and deprecate the "scheduler_goroutines" metric #112003
Conversation
0074e7d
to
564b226
Compare
564b226
to
911b176
Compare
890d261
to
6e2de16
Compare
/retest |
@@ -183,7 +183,7 @@ func (pl *PodTopologySpread) PreScore( | |||
atomic.AddInt64(tpCount, int64(count)) | |||
} | |||
} | |||
pl.parallelizer.Until(ctx, len(allNodes), processAllNode) | |||
pl.parallelizer.Until(ctx, len(allNodes), processAllNode,pl.Name()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please run ./hack/update-gofmt.sh
to execute gofmt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙏
Updated.
@@ -1408,7 +1408,7 @@ func BenchmarkTestDefaultEvenPodsSpreadPriority(b *testing.B) { | |||
score, _ := p.Score(ctx, state, pod, n.Name) | |||
gotList[i] = framework.NodeScore{Name: n.Name, Score: score} | |||
} | |||
p.parallelizer.Until(ctx, len(filteredNodes), scoreNode) | |||
p.parallelizer.Until(ctx, len(filteredNodes), scoreNode,"") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
6e2de16
to
864a7a0
Compare
Add "fixes #112034" to the description. I'll review once there is an lgtm, thanks :) |
864a7a0
to
24fd178
Compare
/retitle feature(scheduler): add "goroutines" metric and deprecate the "scheduler_goroutines" metric |
added the change suggested in #112034. Also, updated the PR description and the release note.
|
/triage accepted |
/retest |
DeprecatedVersion: "1.26.0", | ||
Name: "scheduler_goroutines", | ||
Help: "Number of running goroutines split by the work they do such as binding. This metric is replaced by the \"goroutines\" metric.", | ||
StabilityLevel: metrics.ALPHA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question, this metric is still alpha, can we just remove it instead?
// ALPHA metrics have no stability guarantees, as such, labels may
// be arbitrarily added/removed and the metric may be deleted at any time.
ALPHA StabilityLevel = "ALPHA"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, we can. But in practice, we would probably be breaking some users too suddenly. One release is not too bad for us to take the burden of maintenance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, scheduler_goroutines
was introduced in v1.17 and it's expected that many users are already using it.
#83535
It's kind to announce it's deprecated and wait some time so that users can migrate to the new metric.
doWorkPiesWithMetrics := func(piece int) { | ||
metrics.Goroutines.WithLabelValues(labelValue).Inc() | ||
defer metrics.Goroutines.WithLabelValues(labelValue).Dec() | ||
doWorkPiece(piece) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am actually cannot get the point of this metric, how we can get from this metric? it's inc and then dec, what we expect from this metric finally.
I am not sure if the metric can record the intermediate state, this needs to be confirmed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric helps monitor the health of the controller, to make sure that the are no goroutine leaks.
@alculquicondor @chendave Thanks for reviewing, fixed some issues. |
workqueue.ParallelizeUntil(ctx, p.parallelism, pieces, doWorkPiece, workqueue.WithChunkSize(chunkSizeFor(pieces, p.parallelism))) | ||
func (p Parallelizer) Until(ctx context.Context, pieces int, doWorkPiece workqueue.DoWorkPieceFunc, operations ...string) { | ||
withMetrics := func(piece int) { | ||
metrics.Goroutines.WithLabelValues(operations...).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks dangerous.
What happens if you pass a number of operations different than 1? Does that cause a panic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can check the length of the parameters and log some warning but this seems like violating the rule of optional parameters, or revert this logic back to your early version. I am fine with either of them.
btw, do you want to add some comments on the operations
for what is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to reverting and comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another option is to have a second function Parallelizer.UntilWithOperation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or pull the operation into the context? 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌 Reverted and add the comment about operation
.
pkg/scheduler/metrics/metrics.go
Outdated
@@ -96,14 +96,23 @@ var ( | |||
Help: "Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulablePods.", | |||
StabilityLevel: metrics.STABLE, | |||
}, []string{"queue"}) | |||
// SchedulerGoroutines isn't called in some parts where starts goroutines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// SchedulerGoroutines isn't called in some parts where starts goroutines. | |
// SchedulerGoroutines isn't called in some parts where goroutines start. |
pkg/scheduler/metrics/metrics.go
Outdated
@@ -96,14 +96,23 @@ var ( | |||
Help: "Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulablePods.", | |||
StabilityLevel: metrics.STABLE, | |||
}, []string{"queue"}) | |||
// SchedulerGoroutines isn't called in some parts where starts goroutines. | |||
// Goroutines metric will replace SchedulerGoroutines metric and Goroutine metric contains all goroutines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Goroutines metric will replace SchedulerGoroutines metric and Goroutine metric contains all goroutines. | |
// Goroutines metric replaces SchedulerGoroutines metric. Goroutine metric tracks all goroutines. |
691ae44
to
44ab555
Compare
@alculquicondor @chendave Addressed the given reviews and squashed commits. Please retake a look 🙏 |
pkg/scheduler/metrics/metrics.go
Outdated
&metrics.GaugeOpts{ | ||
Subsystem: SchedulerSubsystem, | ||
Name: "scheduler_goroutines", | ||
Name: "goroutines", | ||
Help: "Number of running goroutines split by the work they do such as binding.", | ||
StabilityLevel: metrics.ALPHA, | ||
}, []string{"work"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌 renamed.
…ler_goroutines" metric
44ab555
to
08bd123
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
/hold for lgtm from sig-instrumentation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, sanposhiho The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
(from instrumentation)
Thanks for deprecating the metric with the scheduler stutter at the beginning of the metric name.
/hold cancel |
What type of PR is this?
/kind feature
What this PR does / why we need it:
See https://github.com/kubernetes/kubernetes/pull/111775/files#r945093057
We often create goroutines with parallelizer. But, we don't inc/dec the SchedulerGoroutines metric in it.
Also see #112034.
This PR adds the new metric
goroutines
that replacesscheduler_goroutines
metric and increment the number of goroutines in more place thanscheduler_goroutines
.Which issue(s) this PR fixes:
Fixes #112034
Special notes for your reviewer:
What labeling should be added is open to debate.
Currently:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: