Description
Area
- Scheduler
- Controller
- Helm Chart
- Documents
Other components
No response
What happened?
The Coscheduling plugin has a code path in which a pod group gets permitted even though only one pod in the group can be scheduled, with no other changes to the cluster and no other incoming pods.
That one pod then gets a node assigned, while the second pod gets caught in the post filter of the Coscheduling plugin and is rejected. We end up with a partially scheduled pod group, which creates fragmentation.
This appears to be a race condition, as it does not always happen.
During debugging I saw that the first pod of the pod group entered the `waiting` state via the `permit` function of the Coscheduling plugin, as it should, because we have one free node after all. But when the second pod in the group comes through, it enters the `success` case of the `permit` function, only to find out in the next scheduling loop that there wasn't actually a second node free. At that point the first pod has already been admitted.
What did you expect to happen?
We would expect the incoming podgroup in the reproduction steps below to be pending until the single pod gets removed.
How can we reproduce it (as minimally and precisely as possible)?
1. Have exactly two nodes on which some workload can be scheduled (for instance via taints).
2. Create a single pod on one of these nodes using the `default-scheduler`, requesting all resources on this node.
3. Create a pod group of 2 with the `scheduler-plugins-scheduler` profile, with the same resource requests. We would expect this group to stay pending, because the single initial pod already takes up one node.
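The steps above could be expressed roughly as the following manifests. This is only a sketch, not the exact files used: the object names, the `repro` taint key and node label, the pause image, and the CPU request of `4` are all assumptions and need to be adapted so that one pod fills one node. The `PodGroup` kind and the `scheduling.x-k8s.io/pod-group` label are the ones the Coscheduling plugin uses to associate pods with a group.

```yaml
# Step 2: a single pod scheduled by the default scheduler that fills one node.
apiVersion: v1
kind: Pod
metadata:
  name: blocker                      # hypothetical name
spec:
  schedulerName: default-scheduler
  nodeSelector:
    repro: "true"                    # assumed label on the two test nodes
  tolerations:
    - key: repro                     # assumed taint key on the two test nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "4"                   # assumption: equals the node's allocatable CPU
---
# Step 3: a pod group of 2 with the same requests, using the plugin's profile.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pg-repro                     # hypothetical name
spec:
  minMember: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: pg-repro-0
  labels:
    scheduling.x-k8s.io/pod-group: pg-repro
spec:
  schedulerName: scheduler-plugins-scheduler
  nodeSelector:
    repro: "true"
  tolerations:
    - key: repro
      operator: Exists
      effect: NoSchedule
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "4"
---
apiVersion: v1
kind: Pod
metadata:
  name: pg-repro-1
  labels:
    scheduling.x-k8s.io/pod-group: pg-repro
spec:
  schedulerName: scheduler-plugins-scheduler
  nodeSelector:
    repro: "true"
  tolerations:
    - key: repro
      operator: Exists
      effect: NoSchedule
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "4"
```

With the blocker pod running, both `pg-repro-0` and `pg-repro-1` should stay `Pending`. When the issue triggers, one of them gets bound to the remaining free node instead.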
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.32.7
Scheduler Plugins version
Commit 2fd0b94 (current master)