Remove readiness probe and health check from otel-agent sidecars #1791
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Summary of Changes
Hello @tiffanny29631, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the robustness of the otel-agent deployment by implementing Kubernetes startupProbe definitions. This change provides a dedicated grace period for the agent to fully initialize, mitigating issues where slow startup times could lead to unnecessary pod restarts, and improving the overall stability of the service.
Highlights
- Kubernetes Probe Configuration: Introduced startupProbe configurations for the otel-agent across multiple Kubernetes manifest files. The new probe targets port 13133 with an httpGet check and is configured to allow up to 5 minutes for the agent to initialize before the standard readiness and liveness checks begin.
- Reliability Improvement: The addition of the startupProbe aims to prevent premature restarts and CrashLoopBackOff errors for the otel-agent, especially in environments with heavy load or resource constraints where the agent might take longer to bind to its health check port.
Code Review
This pull request adds a startupProbe to the otel-agent container for improved startup reliability. To improve maintainability, I've suggested centralizing the startupProbe configuration to avoid duplication across multiple manifest files.
```yaml
startupProbe:
  httpGet:
    path: /
    port: 13133
  failureThreshold: 30 # Allow up to 5 minutes (30 * 10s)
  periodSeconds: 10
  initialDelaySeconds: 0
```
Consider defining the startupProbe configuration in a central location (e.g., a common YAML file or a Go struct) and referencing it across these manifests to avoid duplication and improve maintainability. If that is not feasible, consider adding a comment referencing this file as the source of truth to mitigate inconsistent updates.
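One way to act on this suggestion on the YAML side would be a single shared kustomize patch applied to every Deployment that carries the otel-agent sidecar. The sketch below is illustrative only: the file names and the label selector are assumptions, not taken from this repository, and it presumes the manifests are assembled with kustomize.

```yaml
# otel-agent-startup-probe-patch.yaml (hypothetical): one shared strategic-merge
# patch so the probe values live in exactly one place.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder            # matching is done by the target selector below
spec:
  template:
    spec:
      containers:
      - name: otel-agent
        startupProbe:
          httpGet:
            path: /
            port: 13133
          failureThreshold: 30 # 30 * 10s = up to 5 minutes of startup budget
          periodSeconds: 10
---
# kustomization.yaml (excerpt, hypothetical): apply the same patch to every
# Deployment that matches the selector.
patches:
- path: otel-agent-startup-probe-patch.yaml
  target:
    kind: Deployment
    labelSelector: "configmanagement.gke.io/system=true" # assumed label
```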
```diff
@@ -195,6 +195,13 @@ data:
       volumeMounts:
       - name: otel-agent-config-reconciler-vol
         mountPath: /conf
+      startupProbe:
```
Could you clarify what the intent is with this change? The commit message mentions a liveness probe, however there is no liveness probe defined for these containers (since 2023).
Also, I expect the readinessProbe on these containers may be superfluous, as none of these Deployments are exposed through a Service (except for the otel-collector Deployment).
Updated in http://b/419026380#comment3. The readiness probe for the otel-agent sidecar can also be removed to align with the configuration of the other containers. A readiness failure on the telemetry container serves no purpose and could block the reconciler and resource group controller pods.
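For context on that last point: a Pod is reported Ready only when every container in it passes its readinessProbe, so a failing probe on the otel-agent sidecar keeps the whole reconciler Pod NotReady even when the main container is healthy. A minimal sketch, with hypothetical names and images:

```yaml
# Two-container Pod (hypothetical names/images): if the sidecar's health-check
# endpoint never binds to port 13133 (e.g. under CPU throttling), its probe
# fails and the Pod's Ready condition stays False, even though "reconciler"
# itself runs normally.
apiVersion: v1
kind: Pod
metadata:
  name: example-reconciler
spec:
  containers:
  - name: reconciler                  # main container, no readinessProbe
    image: example.com/reconciler:v1
  - name: otel-agent                  # telemetry sidecar
    image: example.com/otel-agent:v1
    readinessProbe:
      httpGet:
        path: /
        port: 13133
```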
> A readiness failure on the telemetry container serves no purpose and could block the reconciler and resource group controller pods.

Could you explain why the readiness probe blocks the reconciler and resource-group-controller Pods?
The PR has been updated to remove the readiness probe of the otel-agent sidecar in the reconciler, reconciler-manager, and resource group controller.
Force-pushed from 94c55d8 to 1fe01c7.
Remove the readiness probe and health check from otel-agent containers in reconciler, reconciler-manager, and resourcegroup controller to align container behavior. The healthcheck component can fail to bind to the port or respond under CPU throttling, causing unnecessary pod unready states even when the container is running. The otel-collector health check in the config-management-monitoring namespace is retained since it is tied to a Service.
Force-pushed from 1fe01c7 to 650e0f8.
/retest-required
Force-pushed from ab789e4 to 063cf7a.
Pull Request Overview
This PR removes readiness probes and health check configurations from otel-agent sidecar containers across multiple controllers to prevent unnecessary pod unready states caused by health check port binding failures or CPU throttling issues.
- Removes health check extension configuration from OpenTelemetry collector configs
- Removes readiness probes and port 13133 from container specifications
- Updates test expectations to reflect the removal of health check endpoints
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
File | Description
---|---
test/kustomization/expected.yaml | Updates test expectations to remove health check configurations and readiness probes
pkg/reconcilermanager/controllers/reconciler_base_test.go | Removes liveness/readiness probe configurations from test cases and updates expected patch strings
manifests/templates/resourcegroup-manifest.yaml | Removes health check extension and readiness probe from the resourcegroup controller otel-agent
manifests/templates/reconciler-manager.yaml | Removes health check port and readiness probe from the reconciler-manager otel-agent
manifests/templates/reconciler-manager-configmap.yaml | Removes health check port and readiness probe from the reconciler template
manifests/otel-agent-reconciler-cm.yaml | Removes health check extension configuration from the reconciler otel-agent config
manifests/otel-agent-cm.yaml | Removes health check extension configuration from the base otel-agent config
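To make the per-file removals above concrete, here is an illustrative before/after excerpt of the otel-agent collector configuration; the real manifests also contain receivers, exporters, and pipelines that are unchanged and omitted here, so treat this as a sketch rather than the actual file contents.

```yaml
# Before (illustrative excerpt): the health_check extension makes the agent
# bind port 13133, and the container's readinessProbe targets that endpoint.
extensions:
  health_check: {}
service:
  extensions: [health_check]
  # receivers, exporters, and pipelines unchanged and omitted
---
# After (illustrative excerpt): the extension and its service entry are removed;
# in the container spec the containerPort 13133 and the readinessProbe go away
# too, so a slow or CPU-throttled agent no longer marks the whole Pod unready.
# The otel-collector Deployment keeps its health check because it is backed by
# a Service.
service: {}  # receivers, exporters, and pipelines unchanged and omitted
```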
@tiffanny29631: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.