k8s-reboot-coordinator

There was a race condition while waiting for a node to be drained,
especially when some pods could not be evicted immediately at the start
of the wait.  It was possible for the `wait_drained` function to return
before all of the pods had been deleted, if the wait list temporarily
became empty at some point.  This could happen, for example, if multiple
`WatchEvent` messages were processed from the stream before any messages
were processed from the channel; even though there were pod identifiers
waiting in the channel to be added to the wait list, if the wait list
became empty after processing the watch events, the loop would complete.
This is made much more likely if a PodDisruptionBudget temporarily
prevents a pod from being evicted; it could take 5 or more seconds for
that pod's identifier to be pushed to the channel, and in that time, the
rest of the pods could be deleted.

To resolve this, we need to ensure that the `wait_drained` function
never returns until the sender side of the channel is dropped.  Once the
sender is gone, no more pods can be added to the wait list, so when the
list is emptied, we know the drain is actually complete.
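
The exit condition described above can be sketched with a simplified,
synchronous model.  This is not the project's actual code: it assumes a
`std::sync::mpsc` channel in place of the real async channel, a plain
iterator of pod names in place of the `WatchEvent` stream, and a
hypothetical `demo` helper that simulates a delayed eviction.  The point
is only that an empty wait list is not treated as "done" while the
sender is still alive.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

/// Simplified model of the wait loop (hypothetical signature).
/// Returns the number of waited-for pods whose deletion was observed.
fn wait_drained(rx: Receiver<String>, deletions: impl IntoIterator<Item = String>) -> usize {
    let mut deletions = deletions.into_iter();
    let mut waitlist: HashSet<String> = HashSet::new();
    let mut channel_open = true;
    let mut deleted = 0;

    loop {
        // Move every identifier that is already queued onto the wait list.
        while channel_open {
            match rx.try_recv() {
                Ok(id) => {
                    waitlist.insert(id);
                }
                Err(TryRecvError::Empty) => break,
                Err(TryRecvError::Disconnected) => channel_open = false,
            }
        }

        if waitlist.is_empty() {
            if !channel_open {
                // Sender dropped and nothing left to wait for: really done.
                return deleted;
            }
            // An empty wait list alone is not enough.  Block until another
            // identifier arrives or the sender is dropped.
            match rx.recv() {
                Ok(id) => {
                    waitlist.insert(id);
                }
                Err(_) => channel_open = false,
            }
            continue;
        }

        // Stand-in for handling one deletion event from the watch stream.
        // Events for pods not on the wait list are ignored in this model.
        match deletions.next() {
            Some(id) => {
                if waitlist.remove(&id) {
                    deleted += 1;
                }
            }
            None => return deleted, // event stream ended early
        }
    }
}

/// Hypothetical scenario: the third eviction is delayed, as if a
/// PodDisruptionBudget had blocked it, while the first two pods go away.
fn demo() -> usize {
    let (tx, rx) = mpsc::channel();
    let evictor = thread::spawn(move || {
        tx.send("pod-a".to_string()).unwrap();
        tx.send("pod-b".to_string()).unwrap();
        thread::sleep(Duration::from_millis(50));
        tx.send("pod-c".to_string()).unwrap();
        // `tx` is dropped here, which closes the channel.
    });
    let events = vec!["pod-a".to_string(), "pod-b".to_string(), "pod-c".to_string()];
    let deleted = wait_drained(rx, events);
    evictor.join().unwrap();
    deleted
}
```

Calling `demo()` from a `main` function exercises the delayed-eviction
case.  If the loop instead returned as soon as the wait list was empty,
it could finish after `pod-b` while `pod-c` was still pending.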

2025-09-29 07:08:12 -05:00