Enable MCM to support CA scale-down during rolling update. #1014

@elankath

How to categorize this issue?

/area robustness
/kind bug
/priority 3

What happened:

The Cluster Autoscaler (CA) has scale-down logic for long-unregistered nodes: if a machine has not joined the cluster within the max-node-provision-time duration, the CA scales down the node group to remove that machine. In our CA fork, scale-down is currently disabled during a rolling update because of issues where the CA scaled down healthy machines. Some of these issues are still open, for example kubernetes/autoscaler#5465.

But this restriction causes problems as worker pools grow and resource exhaustion or other failures prevent a VM from launching, or a newly launched VM from joining the cluster. The CA tries to back off and shut down these long-unregistered nodes, but cannot do so while the rolling update is in progress. As a result, the CA also blocks scale-up attempts for other node groups.

The CA does not force-delete long-unregistered nodes by default; this requires --force-delete-unregistered-nodes to be set to true. Perhaps this should also be made configurable.
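To make the term concrete, the check for a long-unregistered node described above can be sketched roughly as follows. This is only an illustration, not the CA's actual code: the `machine` type, the `longUnregistered` function and the sample data are hypothetical, and the 20-minute value simply stands in for max-node-provision-time.

```go
package main

import (
	"fmt"
	"time"
)

// machine is a hypothetical, simplified view of a provisioned instance.
type machine struct {
	Name       string
	CreatedAt  time.Time
	Registered bool // true once the VM has joined the cluster as a Node
}

// longUnregistered returns the names of machines that have not registered
// within maxNodeProvisionTime of their creation.
func longUnregistered(machines []machine, now time.Time, maxNodeProvisionTime time.Duration) []string {
	var names []string
	for _, m := range machines {
		if !m.Registered && now.Sub(m.CreatedAt) > maxNodeProvisionTime {
			names = append(names, m.Name)
		}
	}
	return names
}

// demoLongUnregistered runs the check on made-up sample data.
func demoLongUnregistered() []string {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	machines := []machine{
		{Name: "machine-a", CreatedAt: now.Add(-30 * time.Minute), Registered: false}, // timed out
		{Name: "machine-b", CreatedAt: now.Add(-30 * time.Minute), Registered: true},  // joined
		{Name: "machine-c", CreatedAt: now.Add(-5 * time.Minute), Registered: false},  // still within the timeout
	}
	return longUnregistered(machines, now, 20*time.Minute)
}

func main() {
	fmt.Println(demoLongUnregistered()) // prints [machine-a]
}
```

Only `machine-a` qualifies here: it is unregistered and older than the timeout, whereas `machine-c` is unregistered but still within its provisioning window.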

What you expected to happen:

During a rolling update, if Nodes cannot be provisioned within the timeout, the CA should be able to convey its intention to delete the long-unregistered nodes. The MCM then deletes the Machine objects for these nodes (and any associated VMs), and the CA can smoothly pivot to considering another NodeGroup for scaling.
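One possible shape of the expected behaviour is sketched below. This is not how the MCM is implemented and does not propose a concrete API: the `machineDeployment` type, the `removeLongUnregistered` function and the machine names are all hypothetical. The sketch assumes the CA hands over a list of candidate names and that only candidates which are still unregistered get deleted, so healthy machines stay untouched even if they are named by mistake.

```go
package main

import "fmt"

// machine is a hypothetical, simplified Machine object.
type machine struct {
	Name       string
	Registered bool // true once the VM has joined the cluster as a Node
}

// machineDeployment is a hypothetical, simplified worker pool.
type machineDeployment struct {
	Replicas int
	Machines []machine
}

// removeLongUnregistered deletes the candidate machines that are still
// unregistered and lowers the replica count by the number deleted.
// Registered (healthy) machines are kept even if they appear in candidates.
func removeLongUnregistered(md *machineDeployment, candidates []string) []string {
	wanted := make(map[string]bool, len(candidates))
	for _, name := range candidates {
		wanted[name] = true
	}
	var kept []machine
	var deleted []string
	for _, m := range md.Machines {
		if wanted[m.Name] && !m.Registered {
			deleted = append(deleted, m.Name)
			continue
		}
		kept = append(kept, m)
	}
	md.Machines = kept
	md.Replicas -= len(deleted)
	return deleted
}

// demoRemoval simulates a pool in the middle of a rolling update.
func demoRemoval() ([]string, int) {
	md := &machineDeployment{
		Replicas: 3,
		Machines: []machine{
			{Name: "old-1", Registered: true},
			{Name: "new-1", Registered: true},
			{Name: "new-2", Registered: false}, // never joined the cluster
		},
	}
	// "old-1" is included on purpose to show that healthy machines are skipped.
	deleted := removeLongUnregistered(md, []string{"new-2", "old-1"})
	return deleted, md.Replicas
}

func main() {
	deleted, replicas := demoRemoval()
	fmt.Println(deleted, replicas) // prints [new-2] 2
}
```

Filtering on registration state is a deliberate choice in this sketch: it targets the concern that led to scale-down being disabled during rolling updates, namely the CA removing healthy machines.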

How to reproduce it (as minimally and precisely as possible):

  1. Increase maxNodeProvisionTime to a value that gives a slow human operator enough time to SSH into the VM (at least 15m).
  2. Enable SSH in the shoot configuration and trigger a rolling update of the cluster.
  3. Log in to the VM of a newly launched Machine and disable the kubelet.
  4. After the timeout expires, the machine is considered a long-unregistered node and the problem described above should be visible.
