New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the job finished metric issue due to the final job status update occasionally failing #112948
Fix the job finished metric issue due to the final job status update occasionally failing #112948
Conversation
Skipping CI for Draft Pull Request. |
/ok-to-test |
2eeba4f
to
6bd8b9e
Compare
6bd8b9e
to
5385f98
Compare
5385f98
to
dc6fa31
Compare
I wasn't able to reproduce the issue in integration tests, as a conflict would not be triggered for the update operation I have also added a draft e2e which demonstrates the issue locally and demonstrates it is fixed after the commit. The e2e test does:
However, the test failed on the infra as it seems other jobs are created concurrently during the e2e tests. See logs of the kube-controller-manger: My job job-4162/increment-job-finished-metric runs concurrently with other jobs outside of the test: |
68aa28c
to
2f12c96
Compare
2f12c96
to
88eed65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome
/lgtm
/approve
The reason for the issue is that the metrics were bumped before the final job status update. In case the update failed the path was repeated by the next syncJob leading to double-counting of the metrics. The solution is to delay recording metrics and broadcasting events after the job status update succeeds.
88eed65
to
b64e5b2
Compare
Squashed changes |
/lgtm |
/assign @soltysh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, mimowo, soltysh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…gration/job I think I'm ready to start review and LGTM code changes within this package, but not necessarily for the entire sig-apps. My PRs to the packages: kubernetes#110292 kubernetes#111113 kubernetes#112948 PRs to the packages I contributed reviews to: kubernetes#113166 kubernetes#110294
…gration/job I think I'm ready to start review and LGTM code changes within this package, but not necessarily for the entire sig-apps. My PRs to the packages: kubernetes#110292 kubernetes#111113 kubernetes#112948 PRs to the packages I contributed reviews to: kubernetes#113166 kubernetes#110294
…gration/job I think I'm ready to start review and LGTM code changes within this package, but not necessarily for the entire sig-apps. My PRs to the packages: kubernetes#110292 kubernetes#111113 kubernetes#112948 PRs to the packages I contributed reviews to: kubernetes#113166 kubernetes#110294
What type of PR is this?
/kind bug
What this PR does / why we need it:
It fixes the issue that finished jobs are occasionally double counted when the final job status update fails and
the main loop of the Job controller is repeated. This fix moves the update of the metric after the final job status update.
Note that, it is now possible that the metric can miss counting of a job finish if a controller crashes between the status
update and the record of the metric. However, this is less likely than the job status update fail, which can happen
due to a conflict.
Which issue(s) this PR fixes:
Fixes #112873
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: