Skip to content

efficient polling in waitForStepsToFinish #8901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 23, 2025

Conversation

pritidesai
Copy link
Member

@pritidesai pritidesai commented Jul 21, 2025

Changes

We discovered the sidecar-tekton-log-results is consuming significantly more CPU than the task steps. For example:

$ k top pod large-result-pipeline-runcf9s8-large-task-pod --containers   
POD                                             NAME                         CPU(cores)   MEMORY(bytes)   
large-result-pipeline-runcf9s8-large-task-pod   sidecar-tekton-log-results   1244m        11Mi            
large-result-pipeline-runcf9s8-large-task-pod   step-step1                   6m           7Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step10                  5m           7Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step11                  5m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step2                   4m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step3                   5m           7Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step4                   4m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step5                   4m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step6                   4m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step7                   6m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step8                   4m           7Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step9                   4m           7Mi             

Analysis:

  • sidecar-tekton-log-results: 1244m CPU, 11Mi memory
  • All other step containers: 4–6m CPU, 6–7Mi memory

Profiling of the sidecarlogresults component revealed excessive CPU usage. The current waitForStepsToFinish implementation uses a classic busy-wait strategy—it continuously checks for file existence without any sleep interval, resulting in high CPU consumption.

Profiling using unit test showed that nearly all CPU time was spent in system calls, with a high total sample count. This led to excessive CPU usage by the sidecar, even when it was simply waiting.

To address this, the function now sleeps for 100ms between checks, significantly reducing the polling frequency. As a result, the sidecar now consumes minimal CPU while waiting.

Current profile:

$ go tool pprof -top internal/sidecarlogresults/cpu.prof
File: ___TestWaitForStepsToFinish_Profile_in_github_com_tektoncd_pipeline_internal_sidecarlogresults.test
Type: cpu
Time: 2025-07-21 14:40:00 PDT
Duration: 1.21s, Total samples = 890ms (73.77%)
Showing nodes accounting for 890ms, 100% of 890ms total
      flat  flat%   sum%        cum   cum%
     790ms 88.76% 88.76%      790ms 88.76%  syscall.syscall

Profile after adding some sleep:

$ go tool pprof -top internal/sidecarlogresults/cpu.prof
File: ___TestWaitForStepsToFinish_Profile_in_github_com_tektoncd_pipeline_internal_sidecarlogresults.test
Type: cpu
Time: 2025-07-21 14:42:28 PDT
Duration: 1.43s, Total samples = 70ms ( 4.90%)
Showing nodes accounting for 70ms, 100% of 70ms total
      flat  flat%   sum%        cum   cum%
      50ms 71.43% 71.43%       50ms 71.43%  syscall.syscall 

Total samples reduced down to 70ms compared to 890ms.

Also, the CPU consumption has significantly reduced:

$ k top pod large-result-pipeline-runhgxdc-large-task-pod --containers
POD                                             NAME                         CPU(cores)   MEMORY(bytes)   
large-result-pipeline-runhgxdc-large-task-pod   sidecar-tekton-log-results   6m           7Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step10                  4m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step11                  5m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step2                   4m           7Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step3                   3m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step4                   4m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step5                   4m           7Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step6                   6m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step7                   3m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step8                   4m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step9                   4m           6Mi         

To run the profiling test locally:

go test -v ./internal/sidecarlogresults -run TestWaitForStepsToFinish_Profile

Update:

The new key default-sidecar-log-polling-interval has been introduced to provide configurable control over how frequently the Tekton sidecar log results container polls for step completion files.

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

The log results sidecar has been optimized to significantly reduce CPU utilization.  Operators can tune the system for their environment—using a higher interval to reduce CPU load in production, or a lower interval for faster feedback in development or testing.

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jul 21, 2025
@pritidesai pritidesai requested a review from chitrangpatel July 21, 2025 22:43
@tekton-robot tekton-robot requested review from abayer and dibyom July 21, 2025 22:43
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 21, 2025
@pritidesai pritidesai force-pushed the efficient-polling-in-sidecar branch from 303cbfe to a1809ff Compare July 21, 2025 22:50
@pritidesai
Copy link
Member Author

@afrittoli, @vdemeester, @AlanGreene, please help review this PR. Thank you!

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
internal/sidecarlogresults/sidecarlogresults.go 87.7% 87.8% 0.1

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
internal/sidecarlogresults/sidecarlogresults.go 87.7% 87.8% 0.1

@pritidesai pritidesai force-pushed the efficient-polling-in-sidecar branch from a1809ff to b6b9bca Compare July 22, 2025 00:20
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
internal/sidecarlogresults/sidecarlogresults.go 87.7% 87.8% 0.1

@vdemeester vdemeester self-assigned this Jul 22, 2025
@pritidesai pritidesai added this to the v1.3.0 (LTS) milestone Jul 22, 2025
@pritidesai pritidesai force-pushed the efficient-polling-in-sidecar branch from b6b9bca to c8420e5 Compare July 22, 2025 21:09
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 22, 2025
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
internal/sidecarlogresults/sidecarlogresults.go 87.7% 87.5% -0.2
pkg/apis/config/default.go 86.6% 81.9% -4.6
pkg/pod/pod.go 93.3% 93.4% 0.0

@pritidesai pritidesai force-pushed the efficient-polling-in-sidecar branch from c8420e5 to 2aa3e22 Compare July 22, 2025 22:19
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
internal/sidecarlogresults/sidecarlogresults.go 87.7% 87.5% -0.2
pkg/apis/config/default.go 86.6% 87.5% 0.9
pkg/pod/pod.go 93.3% 93.4% 0.0

The current waitForStepsToFinish implementation is a classic busy-wait.
It checks for file existence without any sleep, resulting in a high
CPU usage. Adding a profile with a unit test to show that almost all
time is spent in system calls with a high total sample count. This led to
execssive CPU usage by the sidecar even when just waiting.

The function now sleeps 100ms between checks, drastically reducing the
frequency. The sidecar now uses minimal CPU while waiting.

Signed-off-by: Priti Desai <pdesai@us.ibm.com>
@pritidesai pritidesai force-pushed the efficient-polling-in-sidecar branch from 2aa3e22 to e1bda58 Compare July 23, 2025 02:02
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
internal/sidecarlogresults/sidecarlogresults.go 87.7% 87.5% -0.2
pkg/apis/config/default.go 86.6% 87.5% 0.9
pkg/pod/pod.go 93.3% 93.4% 0.0

@afrittoli
Copy link
Member

/test check-pr-has-kind-label

@tekton-robot
Copy link
Collaborator

@afrittoli: No presubmit jobs available for tektoncd/pipeline@main

In response to this:

/test check-pr-has-kind-label

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

if intervalStr == "" {
intervalStr = "100ms"
}
interval, err := time.ParseDuration(intervalStr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Isn't the case of an empty string covered here as well?
NIT: Perhaps make a small function to use here and in LookForArtifacts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Isn't the case of an empty string covered here as well?

If we pass an empty string to time.ParseDuration, it is considered an invalid duration format, and the function will return an error.

NIT: Perhaps make a small function to use here and in LookForArtifacts?

Sure, created a PR: #8909

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @pritidesai - I think this makes a lot of sense. I wonder if we should backport it to TLS; polling without a sleep at all is practically a bug. @vdemeester WDYT?
/approve

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 23, 2025
@vdemeester
Copy link
Member

@afrittoli I think I agree on the backporting to LTS indeed.

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/meow

@tekton-robot
Copy link
Collaborator

@vdemeester: cat image

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [afrittoli,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@afrittoli
Copy link
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2025
@afrittoli
Copy link
Member

@pritidesai, could you take care of the backporting?

@tekton-robot tekton-robot merged commit b096f8d into tektoncd:main Jul 23, 2025
29 checks passed
@pritidesai
Copy link
Member Author

@pritidesai, could you take care of the backporting?

Sure, I will backport this to 0.65, 0.68, 1.0, and 1.2. Am I missing any other releases?

@pritidesai
Copy link
Member Author

/cherrypick release-v0.65.x

@tekton-robot
Copy link
Collaborator

@pritidesai: #8901 failed to apply on top of branch "release-v0.65.x":

Applying: adding efficient polling to waitForStepsToFinish
Using index info to reconstruct a base tree...
M	config/config-defaults.yaml
M	docs/additional-configs.md
M	internal/sidecarlogresults/sidecarlogresults_test.go
M	pkg/apis/config/default.go
M	pkg/apis/config/default_test.go
M	pkg/pod/pod.go
M	pkg/pod/pod_test.go
M	test/e2e-tests.sh
Falling back to patching base and 3-way merge...
Auto-merging test/e2e-tests.sh
Auto-merging pkg/pod/pod_test.go
CONFLICT (content): Merge conflict in pkg/pod/pod_test.go
Auto-merging pkg/pod/pod.go
CONFLICT (content): Merge conflict in pkg/pod/pod.go
Auto-merging pkg/apis/config/default_test.go
CONFLICT (content): Merge conflict in pkg/apis/config/default_test.go
Auto-merging pkg/apis/config/default.go
CONFLICT (content): Merge conflict in pkg/apis/config/default.go
Auto-merging internal/sidecarlogresults/sidecarlogresults_test.go
Auto-merging docs/additional-configs.md
Auto-merging config/config-defaults.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 adding efficient polling to waitForStepsToFinish
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-v0.65.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pritidesai
Copy link
Member Author

/cherrypick release-v0.68.x

@tekton-robot
Copy link
Collaborator

@pritidesai: #8901 failed to apply on top of branch "release-v0.68.x":

Applying: adding efficient polling to waitForStepsToFinish
Using index info to reconstruct a base tree...
M	docs/additional-configs.md
M	internal/sidecarlogresults/sidecarlogresults_test.go
M	pkg/pod/pod.go
M	pkg/pod/pod_test.go
M	test/e2e-tests.sh
Falling back to patching base and 3-way merge...
Auto-merging test/e2e-tests.sh
Auto-merging pkg/pod/pod_test.go
CONFLICT (content): Merge conflict in pkg/pod/pod_test.go
Auto-merging pkg/pod/pod.go
CONFLICT (content): Merge conflict in pkg/pod/pod.go
Auto-merging internal/sidecarlogresults/sidecarlogresults_test.go
Auto-merging docs/additional-configs.md
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 adding efficient polling to waitForStepsToFinish
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-v0.68.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pritidesai
Copy link
Member Author

/cherrypick release-v1.0.x

@tekton-robot
Copy link
Collaborator

@pritidesai: new pull request created: #8910

In response to this:

/cherrypick release-v1.0.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants