How to categorize this issue?
/area robustness
/kind bug
/priority 3
What happened:
The CA has scale-down logic for long-unregistered nodes: if a machine has not joined the cluster within the `max-node-provision-time` duration, the CA scales down the node group to remove that machine. In our CA fork, scale-down is currently disabled during a rolling update because of issues where the CA scaled down healthy machines; some of these are still open, e.g. kubernetes/autoscaler#5465.
This restriction causes problems as worker pools grow larger and resource exhaustion or other issues prevent the VM from launching or the newly launched VM from joining the cluster. The CA tries to back off and shut down these long-unregistered nodes, but cannot do so while the rolling update is in progress. Because of this, the CA also blocks scale-up attempts for other node groups.
The CA does not force-delete long-unregistered nodes by default; it requires `--force-delete-unregistered-nodes` to be set to `true`. Perhaps this should also be made configurable.
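For reference, a minimal sketch of the CA flags involved, assuming the flag names mentioned above; exact names, defaults, and availability should be verified against the CA fork and version in use:

```sh
# Minimal sketch of the cluster-autoscaler flags discussed above (assumed
# upstream names; values are illustrative, verify against the fork in use).
cluster-autoscaler \
  --max-node-provision-time=20m \
  --scale-down-enabled=true \
  --force-delete-unregistered-nodes=true
```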
What you expected to happen:
During a rolling update, if Nodes cannot be provisioned within the timeout, the CA should convey its intention to delete the long-unregistered nodes, the MCM should then delete the corresponding Machine objects (and any VMs associated with them), and the CA should smoothly pivot to considering another NodeGroup for scaling.
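To illustrate what the MCM-side cleanup amounts to, a hedged sketch of how an operator can inspect and remove such a stuck Machine manually today; the namespace and machine name are placeholders:

```sh
# Hypothetical manual workaround: list Machine objects in the shoot's
# control-plane namespace on the seed and delete the long-unregistered one,
# letting the MCM clean up the backing VM. Names below are placeholders.
kubectl -n shoot--my-project--my-cluster get machines
kubectl -n shoot--my-project--my-cluster delete machine <long-unregistered-machine>
```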
How to reproduce it (as minimally and precisely as possible):
- Increase `maxNodeProvisionTime` to a value that gives a (slow) human operator time to SSH into the VM (at least 15m).
- Enable SSH in the Shoot configuration and trigger a rolling update of the cluster.
- Log in to a newly launched Machine's VM and disable the kubelet (see the sketch after this list).
- After the timeout expires, the node is treated as a long-unregistered node and the problem described above should be visible.
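A rough sketch of the kubelet-disabling step and what to watch for afterwards; the SSH user, VM address, and kubelet unit name are assumptions and may differ per OS image:

```sh
# Rough reproduction sketch; assumes SSH access to the newly launched VM and
# a systemd-managed kubelet (unit name may differ per OS image).
ssh gardener@<new-vm-ip>
sudo systemctl stop kubelet   # the node never registers with the API server

# After maxNodeProvisionTime elapses, the CA considers the node long
# unregistered but cannot remove it while the rolling update is in progress.
```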