Skip to content

Rework Plan CRD generation to add docs and kubectl explain support #369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 1, 2025

Conversation

brandond
Copy link
Member

@brandond brandond commented Jun 21, 2025

  • Bump component versions in Dockerfiles
  • Fix unwanted kubectl annotation passthrough
  • Convert CRD generation to controller-gen
  • Add CRD docs
  • Add job reason and message to Plan error condition when job fails.

Ref: SURE-10336

With the current deployment defaults, upgrade jobs fail if not complete after 15 minutes, and are then cleaned up (deleted) 15 minutes later. The Plan does not trigger a recreate of the failed job, and the Plan must be modified in order for the Job to be retried. This is the expected behavior, as the default SUC deployment creates Jobs with activeDeadlineSeconds: 900 and ttlSecondsAfterFinished: 900.

Prior releases of the SUC (I tested with 0.13.4) had a bug where failed Jobs would be infinitely deleted and recreated, causing the Job to effectively ignore the configured deadline. There were no specific changes made to SUC that fixed this, but I suspect it was resolved by the bump from Wrangler v1 to v3, which contained many changes.

Apparently some users were relying on the unexpected behavior of Plans continuously recreating failed jobs, and their workflow was broken by this change in behavior (see https://xkcd.com/1172/). The correct fix is NOT for the SUC to continuously recreate failed jobs, but rather for the Plan's jobActiveDeadlineSecs to be set to an appropriately high value, or 0 for no deadline.

This change to align CRD building with rancher improves the docs by not only publishing the annotated CRD spec as markdown, but also makes kubectl explain work properly for Plan fields. Documentation for the jobActiveDeadlineSecs field now covers expected behavior:

Sets ActiveDeadlineSeconds on Jobs generated to apply this Plan.
If the Job does not complete within this time, the Plan will stop processing
until it is updated to trigger a redeploy.
If set to 0, Jobs have no deadline. If not set, the controller default value
is used.

Example of docs available via kubectl explain:

brandond@dev01:~$ kubectl explain plan.spec
GROUP:      upgrade.cattle.io
KIND:       Plan
VERSION:    v1

FIELD: spec <Object>


DESCRIPTION:
    PlanSpec represents the user-configurable details of a Plan.

FIELDS:
  channel	<string>
    A URL that returns HTTP 302 with the last path element of the value returned
    in the Location header assumed to be an image tag (after munging "+" to
    "-").

  concurrency	<integer>
    The maximum number of concurrent nodes to apply this update on.

  cordon	<boolean>
    If Cordon is true, the node is cordoned before the upgrade container is run.
    If drain is specified, the value for cordon is ignored, and the node is
    cordoned.
    If neither drain nor cordon are specified and the node is marked as
    schedulable=false it will not be marked as schedulable=true when the Job
    completes.

  drain	<Object>
    Configuration for draining nodes prior to upgrade. If left unspecified, no
    drain will be performed.

  exclusive	<boolean>
    Jobs for exclusive plans cannot be run alongside any other exclusive plan.

  imagePullSecrets	<[]Object>
    Image Pull Secrets, used to pull images for the Job.

  jobActiveDeadlineSecs	<integer>
    Sets ActiveDeadlineSeconds on Jobs generated to apply this Plan.
    If the Job does not complete within this time, the Plan will stop processing
    until it is updated to trigger a redeploy.
    If set to 0, Jobs have no deadline. If not set, the controller default value
    is used.

  nodeSelector	<Object>
    Select which nodes this plan can be applied to.

  postCompleteDelay	<string>
    Time after a Job for one Node is complete before a new Job will be created
    for the next Node.

  prepare	<Object>
    The prepare init container, if specified, is run before cordon/drain which
    is run before the upgrade container.

  secrets	<[]Object>
    Secrets to be mounted into the Job Pod.

  serviceAccountName	<string>
    The service account for the pod to use. As with normal pods, if not
    specified the default service account from the namespace will be assigned.

  tolerations	<[]Object>
    Specify which node taints should be tolerated by pods applying the upgrade.
    Anything specified here is appended to the default of:
    - `{key: node.kubernetes.io/unschedulable, effect: NoSchedule, operator:
    Exists}`

  upgrade	<Object> -required-
    The upgrade container; must be specified.

  version	<string>
    Providing a value for version will prevent polling/resolution of the channel
    if specified.

  window	<Object>
    A time window in which to execute Jobs for this Plan.
    Jobs will not be generated outside this time window, but may continue
    executing into the window once started.

@brandond brandond force-pushed the fix-annotations-and-crds branch 13 times, most recently from 6ae0eb5 to 14aaee0 Compare June 24, 2025 09:19
@snasovich snasovich requested a review from a team June 24, 2025 17:44
@brandond brandond force-pushed the fix-annotations-and-crds branch 5 times, most recently from dc070ff to f471b95 Compare June 24, 2025 19:36
brandond added 4 commits June 24, 2025 20:13
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Fixes `kubectl.kubernetes.io/last-applied-configuration` annotation being passed through from Plan to Job

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Also clean up some of the api spec for better documentation, and fix
tests.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@brandond brandond force-pushed the fix-annotations-and-crds branch from f471b95 to 6815127 Compare June 24, 2025 20:13
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@brandond brandond force-pushed the fix-annotations-and-crds branch from 6815127 to fa16395 Compare June 24, 2025 21:14
@brandond brandond changed the title Fix annotations and crds Rework Plan CRD generation to add docs and kubectl explain support Jun 24, 2025
@brandond brandond requested a review from a team June 24, 2025 21:26
@jakefhyde jakefhyde self-requested a review June 24, 2025 23:25
Copy link

@manuelbuil manuelbuil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: I have fat fingers

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `concurrency` _integer_ | The maximum number of concurrent nodes to apply this update on. | | |
| `jobActiveDeadlineSecs` _integer_ | Sets ActiveDeadlineSeconds on Jobs generated to apply this Plan.<br />If the Job does not complete within this time, the Plan will stop processing until it is updated to trigger a redeploy.<br />If set to 0, Jobs have no deadline. If not set, the controller default value is used. | | |
Copy link

@manuelbuil manuelbuil Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not set, the controller default value is used

Maybe it is worth adding that is the default value (900)? Or pointing where to find it SYSTEM_UPGRADE_JOB_ACTIVE_DEADLINE_SECONDS?

Copy link
Member Author

@brandond brandond Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could cover that elsewhere, yeah. We don't really have any docs on controller CLI flags at the moment. This text is generated automatically from the field docstring and isn't the best place to cover controller config.

@brandond brandond mentioned this pull request Jun 26, 2025
Copy link

@jakefhyde jakefhyde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits, approving.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@brandond brandond merged commit 99a03a0 into rancher:master Jul 1, 2025
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants