How to categorize this issue?
/area robustness
/kind bug
/priority 3
What happened:
The CA has scale-down logic for long-unregistered nodes: if a machine has not joined the cluster within the `max-node-provision-time` duration, the CA scales down the node group to remove that machine. In our CA fork, scale-down is currently disabled during a rolling update because of issues where the CA scaled down healthy machines; some of these are still open, e.g. kubernetes/autoscaler#5465.
This restriction causes problems as worker pools grow larger and resource exhaustion or other issues prevent the VM from launching or the newly launched VM from joining the cluster. The CA tries to back off and shut down these long-unregistered nodes, but cannot do so while the rolling update is in progress. Because of this, the CA also blocks scale-up attempts for other node groups.
The CA does not force-delete long-unregistered nodes by default; it requires `--force-delete-unregistered-nodes` to be set to `true`. Perhaps this should also be made configurable.
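For reference, a minimal sketch of the CA flags involved, assuming the flag names mentioned above; exact names, defaults, and availability should be verified against the CA fork and version in use:

```sh
# Minimal sketch of the cluster-autoscaler flags discussed above (assumed
# upstream names; values are illustrative, verify against the fork in use).
cluster-autoscaler \
  --max-node-provision-time=20m \
  --scale-down-enabled=true \
  --force-delete-unregistered-nodes=true
```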
What you expected to happen:
During a rolling update, if Nodes cannot be provisioned within the timeout, the CA should convey its intention to delete the long-unregistered nodes, the MCM should then delete the corresponding Machine objects (and any VMs associated with them), and the CA should smoothly pivot to considering another NodeGroup for scaling.
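To illustrate what the MCM-side cleanup amounts to, a hedged sketch of how an operator can inspect and remove such a stuck Machine manually today; the namespace and machine name are placeholders:

```sh
# Hypothetical manual workaround: list Machine objects in the shoot's
# control-plane namespace on the seed and delete the long-unregistered one,
# letting the MCM clean up the backing VM. Names below are placeholders.
kubectl -n shoot--my-project--my-cluster get machines
kubectl -n shoot--my-project--my-cluster delete machine <long-unregistered-machine>
```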
How to reproduce it (as minimally and precisely as possible):
- Increase `maxNodeProvisionTime` to a value that gives a (slow) human operator time to SSH into the VM (at least 15m).
- Enable SSH in the Shoot configuration and trigger a rolling update of the cluster.
- Log in to a newly launched Machine's VM and disable the kubelet (see the sketch after this list).
- After the timeout expires, the node is treated as a long-unregistered node and the problem described above should be visible.
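A rough sketch of the kubelet-disabling step and what to watch for afterwards; the SSH user, VM address, and kubelet unit name are assumptions and may differ per OS image:

```sh
# Rough reproduction sketch; assumes SSH access to the newly launched VM and
# a systemd-managed kubelet (unit name may differ per OS image).
ssh gardener@<new-vm-ip>
sudo systemctl stop kubelet   # the node never registers with the API server

# After maxNodeProvisionTime elapses, the CA considers the node long
# unregistered but cannot remove it while the rolling update is in progress.
```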