efficient polling in `waitForStepsToFinish` #8901

pritidesai · 2025-07-21T22:43:05Z

Changes

We discovered that the sidecar-tekton-log-results container consumes significantly more CPU than the task steps. For example:

$ k top pod large-result-pipeline-runcf9s8-large-task-pod --containers   
POD                                             NAME                         CPU(cores)   MEMORY(bytes)   
large-result-pipeline-runcf9s8-large-task-pod   sidecar-tekton-log-results   1244m        11Mi            
large-result-pipeline-runcf9s8-large-task-pod   step-step1                   6m           7Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step10                  5m           7Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step11                  5m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step2                   4m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step3                   5m           7Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step4                   4m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step5                   4m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step6                   4m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step7                   6m           6Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step8                   4m           7Mi             
large-result-pipeline-runcf9s8-large-task-pod   step-step9                   4m           7Mi

Analysis:

sidecar-tekton-log-results: 1244m CPU, 11Mi memory
All other step containers: 4–6m CPU, 6–7Mi memory

Profiling of the sidecarlogresults component revealed excessive CPU usage. The current waitForStepsToFinish implementation uses a classic busy-wait strategy: it continuously checks for file existence without any sleep interval, resulting in high CPU consumption.

Profiling with a unit test showed that nearly all CPU time was spent in system calls, with a high total sample count. The sidecar therefore used excessive CPU even when it was simply waiting.

To address this, the function now sleeps for 100ms between checks, significantly reducing the polling frequency. As a result, the sidecar now consumes minimal CPU while waiting.

Profile before the change:

$ go tool pprof -top internal/sidecarlogresults/cpu.prof
File: ___TestWaitForStepsToFinish_Profile_in_github_com_tektoncd_pipeline_internal_sidecarlogresults.test
Type: cpu
Time: 2025-07-21 14:40:00 PDT
Duration: 1.21s, Total samples = 890ms (73.77%)
Showing nodes accounting for 890ms, 100% of 890ms total
      flat  flat%   sum%        cum   cum%
     790ms 88.76% 88.76%      790ms 88.76%  syscall.syscall

Profile after adding some sleep:

$ go tool pprof -top internal/sidecarlogresults/cpu.prof
File: ___TestWaitForStepsToFinish_Profile_in_github_com_tektoncd_pipeline_internal_sidecarlogresults.test
Type: cpu
Time: 2025-07-21 14:42:28 PDT
Duration: 1.43s, Total samples = 70ms ( 4.90%)
Showing nodes accounting for 70ms, 100% of 70ms total
      flat  flat%   sum%        cum   cum%
      50ms 71.43% 71.43%       50ms 71.43%  syscall.syscall

Total samples dropped from 890ms to 70ms, a reduction of roughly 92%.

The sidecar's CPU consumption in the cluster also dropped significantly, from 1244m to 6m:

$ k top pod large-result-pipeline-runhgxdc-large-task-pod --containers
POD                                             NAME                         CPU(cores)   MEMORY(bytes)   
large-result-pipeline-runhgxdc-large-task-pod   sidecar-tekton-log-results   6m           7Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step10                  4m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step11                  5m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step2                   4m           7Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step3                   3m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step4                   4m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step5                   4m           7Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step6                   6m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step7                   3m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step8                   4m           6Mi             
large-result-pipeline-runhgxdc-large-task-pod   step-step9                   4m           6Mi

To run the profiling test locally:

go test -v ./internal/sidecarlogresults -run TestWaitForStepsToFinish_Profile

Update:

A new key, default-sidecar-log-polling-interval, has been introduced to control how frequently the Tekton sidecar log results container polls for step completion files.

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
Has Tests included if any functionality added or changed
pre-commit Passed
Follows the commit message standard
Meets the Tekton contributor standards (including functionality, content, code)
Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

The log results sidecar has been optimized to significantly reduce CPU utilization. Operators can tune the polling interval for their environment through the new default-sidecar-log-polling-interval key: a higher interval reduces CPU load in production, while a lower interval gives faster feedback in development or testing.

pritidesai · 2025-07-21T22:51:10Z

@afrittoli, @vdemeester, @AlanGreene, please help review this PR. Thank you!

tekton-robot · 2025-07-21T22:56:25Z