Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the job finished metric issue due to the final job status update occasionally failing #112948

Merged

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Oct 10, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

It fixes the issue that finished jobs are occasionally double counted when the final job status update fails and
the main loop of the Job controller is repeated. This fix moves the update of the metric after the final job status update.

Note that, it is now possible that the metric can miss counting of a job finish if a controller crashes between the status
update and the record of the metric. However, this is less likely than the job status update fail, which can happen
due to a conflict.

Which issue(s) this PR fixes:

Fixes #112873

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix the occasional double-counting of the job_finished_total metric

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 10, 2022
@mimowo
Copy link
Contributor Author

mimowo commented Oct 10, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 10, 2022
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 10, 2022
@mimowo mimowo force-pushed the 112873-fix-job-finished-metric branch from 2eeba4f to 6bd8b9e Compare October 10, 2022 09:56
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 10, 2022
@mimowo mimowo force-pushed the 112873-fix-job-finished-metric branch from 6bd8b9e to 5385f98 Compare October 10, 2022 09:57
@mimowo mimowo marked this pull request as ready for review October 10, 2022 09:58
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2022
@mimowo mimowo force-pushed the 112873-fix-job-finished-metric branch from 5385f98 to dc6fa31 Compare October 10, 2022 09:59
@mimowo
Copy link
Contributor Author

mimowo commented Oct 10, 2022

I wasn't able to reproduce the issue in integration tests, as a conflict would not be triggered for the update operation
just by creating a job and marking it its pods as completed.

I have also added a draft e2e which demonstrates the issue locally and demonstrates it is fixed after the commit. The e2e test does:

  1. check the job_finished_total metric
  2. create a run and run until completion
  3. check the job_finished_total metric delta

However, the test failed on the infra as it seems other jobs are created concurrently during the e2e tests. See logs of the kube-controller-manger:

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/112948/pull-kubernetes-e2e-kind/1579411297760448512/artifacts/kind-control-plane/pods/kube-system_kube-controller-manager-kind-control-plane_3e939135eb40f3ef8c84939789125ade/kube-controller-manager/0.log.20221010-103532

My job job-4162/increment-job-finished-metric runs concurrently with other jobs outside of the test:

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 12, 2022
test/integration/job/job_test.go Outdated Show resolved Hide resolved
test/integration/job/job_test.go Outdated Show resolved Hide resolved
test/integration/job/job_test.go Outdated Show resolved Hide resolved
test/integration/job/job_test.go Outdated Show resolved Hide resolved
test/integration/job/job_test.go Outdated Show resolved Hide resolved
@mimowo mimowo force-pushed the 112873-fix-job-finished-metric branch from 68aa28c to 2f12c96 Compare October 13, 2022 08:28
test/integration/job/job_test.go Outdated Show resolved Hide resolved
test/integration/job/job_test.go Outdated Show resolved Hide resolved
@mimowo mimowo force-pushed the 112873-fix-job-finished-metric branch from 2f12c96 to 88eed65 Compare October 13, 2022 14:02
Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2022
The reason for the issue is that the metrics were bumped before the
final job status update. In case the update failed the path was
repeated by the next syncJob leading to double-counting of the metrics.

The solution is to delay recording metrics and broadcasting events
after the job status update succeeds.
@mimowo mimowo force-pushed the 112873-fix-job-finished-metric branch from 88eed65 to b64e5b2 Compare October 13, 2022 15:23
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2022
@mimowo
Copy link
Contributor Author

mimowo commented Oct 13, 2022

Squashed changes

@alculquicondor
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2022
@mimowo
Copy link
Contributor Author

mimowo commented Oct 14, 2022

/assign @soltysh

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 14, 2022
@k8s-ci-robot k8s-ci-robot merged commit 9bedff1 into kubernetes:master Oct 14, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Oct 14, 2022
mimowo added a commit to mimowo/kubernetes that referenced this pull request Oct 20, 2022
…gration/job

I think I'm ready to start review and LGTM code changes within this
package, but not necessarily for the entire sig-apps.

My PRs to the packages:
kubernetes#110292
kubernetes#111113
kubernetes#112948

PRs to the packages I contributed reviews to:
kubernetes#113166
kubernetes#110294
ivelichkovich pushed a commit to ivelichkovich/kubernetes that referenced this pull request Dec 20, 2022
…gration/job

I think I'm ready to start review and LGTM code changes within this
package, but not necessarily for the entire sig-apps.

My PRs to the packages:
kubernetes#110292
kubernetes#111113
kubernetes#112948

PRs to the packages I contributed reviews to:
kubernetes#113166
kubernetes#110294
jaehnri pushed a commit to jaehnri/kubernetes that referenced this pull request Jan 3, 2023
…gration/job

I think I'm ready to start review and LGTM code changes within this
package, but not necessarily for the entire sig-apps.

My PRs to the packages:
kubernetes#110292
kubernetes#111113
kubernetes#112948

PRs to the packages I contributed reviews to:
kubernetes#113166
kubernetes#110294
@mimowo mimowo deleted the 112873-fix-job-finished-metric branch March 18, 2023 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Occasional double counting of job completions in the job_controller_job_finished_total metric
4 participants