# Preservation of Machines
## Objective
Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, after the configured `machineHealthTimeout`, to the `Failed` phase. `Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the Machine object is deleted. This rapid cleanup prevents SREs/operators/support from analysing the VM and makes finding the root cause of a failure more difficult.
Moreover, in cases where a node seems healthy but all the workloads on it are facing issues, operators need to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler (CA) scaling the node down.
This document proposes enhancing MCM such that:
- VMs of machines are retained temporarily for analysis.
- There is a configurable limit to the number of machines that can be preserved automatically on failure (auto-preservation).
- There is a configurable limit to the duration for which machines are preserved.
- Users can specify which healthy machines they would like preserved in case of failure, or for diagnosis in their current state (preventing scale-down by the CA).
- Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either the `Running` or `Terminating` phase, as the case may be.
Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008
## Proposal
To achieve the objectives mentioned above, the following enhancements are proposed:
- Enhance the `machineControllerManager` configuration in the `Shoot` spec to specify the maximum number of machines to be auto-preserved, and the duration for which these machines will be preserved. This configuration will be set per worker pool.

  ```yaml
  machineControllerManager:
    autoPreserveFailedMax: 0
    machinePreserveTimeout: 72h
  ```
- Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `autoPreserveFailedMax` will be distributed across the N MachineDeployments. `autoPreserveFailedMax` must be chosen such that it can be appropriately distributed across the MachineDeployments (see the sketch after this list).
  - Example: if `autoPreserveFailedMax` is set to 2 and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
- MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM.
- Allow a user/operator to request preservation of a specific machine/node using the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
- When the annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - The machine's phase is changed to `Running:Preserved`.
  - After the timeout, the annotations `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, the machine phase is changed to `Running`, and the CA may delete the node.
  - If a machine in `Running:Preserved` fails, it is moved to `Failed:Preserved`.
- When the annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
  - The machine is drained of pods, except for DaemonSet pods.
  - The machine phase is changed to `Failed:Preserved`.
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - After the timeout, the annotations `node.machine.sapcloud.io/preserve=when-failed` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, and the phase is changed to `Terminating`.
- When an un-annotated machine goes to the `Failed` phase and `autoPreserveFailedMax` is not breached:
  - Pods (other than DaemonSet pods) are drained.
  - The machine's phase is changed to `Failed:Preserved`.
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent the CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - After the timeout, the annotation `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, and the phase is changed to `Terminating`.
  - Machines in the `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMax`.
- If a machine currently in `Failed:Preserved` is found to have a Healthy VM/node before the timeout, it will be moved to `Running:Preserved`; after the timeout, it will be moved to `Running`. The rationale for moving the machine to `Running:Preserved` rather than `Running` is to allow pods to be scheduled onto the healthy node again without the autoscaler scaling it down due to under-utilization.
- A user/operator can request MCM to stop preserving a machine/node in the `Running:Preserved` or `Failed:Preserved` phase using the annotation `node.machine.sapcloud.io/preserve=false`.
  - MCM will move a machine thus annotated to either the `Running` or the `Terminating` phase, depending on the phase of the machine before it was preserved.
- Machines of a MachineDeployment in the `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of the maximum number of machines allowed for the MachineDeployment.
- MCM will be modified to perform the drain in the `Failed` phase rather than in the `Terminating` phase.
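The quota distribution and expiry bookkeeping described above can be summarised in a small sketch. This is illustrative Go only, not MCM source: the helper names `distributePreserveQuota` and `preserveExpiryAt` are hypothetical, and even splitting (dropping any remainder) is merely an assumption consistent with the two-zone example above.

```go
// Illustrative sketch only; not the actual MCM implementation.
package preserve

import "time"

// distributePreserveQuota is a hypothetical helper that splits the worker
// pool's autoPreserveFailedMax evenly across its n MachineDeployments (one
// per zone). Assumption: any remainder is dropped, so autoPreserveFailedMax=2
// with 2 zones yields a per-zone quota of 1, matching the example above.
func distributePreserveQuota(autoPreserveFailedMax, n int) int {
	if n <= 0 {
		return 0
	}
	return autoPreserveFailedMax / n
}

// preserveExpiryAt stamps the expiry exactly as the proposal describes:
// machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout.
func preserveExpiryAt(currentTime time.Time, machinePreserveTimeout time.Duration) time.Time {
	return currentTime.Add(machinePreserveTimeout)
}
```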
### State Diagrams:
- State Diagram for when a machine or its node is explicitly annotated for preservation:

  ```mermaid
  stateDiagram-v2
      state "Running" as R
      state "Running + Requested" as RR
      state "Running:Preserved" as RP
      state "Failed (node drained)" as F
      state "Failed:Preserved" as FP
      state "Terminating" as T
      [*] --> R
      R --> RR: annotated with preserve=when-failed
      RP --> R: on timeout or preserve=false
      RR --> F: on failure
      F --> FP
      FP --> T: on timeout or preserve=false
      FP --> RP: if node Healthy before timeout
      T --> [*]
      R --> RP: annotated with preserve=now
      RP --> F: if node/VM not healthy
  ```

- State Diagram for when an un-annotated `Running` machine fails (auto-preservation):

  ```mermaid
  stateDiagram-v2
      state "Running" as R
      state "Running:Preserved" as RP
      state "Failed (node drained)" as F
      state "Failed:Preserved" as FP
      state "Terminating" as T
      [*] --> R
      R --> F: on failure
      F --> FP: if autoPreserveFailedMax not breached
      F --> T: if autoPreserveFailedMax breached
      FP --> T: on timeout or preserve=false
      FP --> RP: if node Healthy before timeout
      RP --> R: on timeout or preserve=false
      T --> [*]
  ```
### Use Cases:
#### Use Case 1: Preservation Request for Analysing a Running Machine
Scenario: Workload on a machine is failing; the operator wishes to diagnose it.
##### Steps:
- Operator annotates the node with `node.machine.sapcloud.io/preserve=now` (see the sketch below).
- MCM preserves the machine and prevents the CA from scaling it down.
- Operator analyzes the VM.
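For illustration, the annotation can be applied with `kubectl annotate node <node-name> node.machine.sapcloud.io/preserve=now`, or programmatically as in the following client-go sketch. The node name `worker-node-1` and the default kubeconfig path are placeholders; the same merge patch with the values `when-failed` or `false` covers Use Cases 2 and 4.

```go
// Sketch: request preservation of a node via client-go (placeholder names).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Merge-patch the preservation annotation onto the node; equivalent to
	// `kubectl annotate node worker-node-1 node.machine.sapcloud.io/preserve=now`.
	patch := []byte(`{"metadata":{"annotations":{"node.machine.sapcloud.io/preserve":"now"}}}`)
	if _, err := clientset.CoreV1().Nodes().Patch(context.TODO(), "worker-node-1",
		types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("preservation requested for worker-node-1")
}
```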
#### Use Case 2: Proactive Preservation Request
Scenario: Operator suspects a machine might fail and wants to ensure preservation for analysis.
##### Steps:
- Operator annotates the node with `node.machine.sapcloud.io/preserve=when-failed`.
- Machine fails later.
- MCM preserves the machine.
- Operator analyzes the VM.
#### Use Case 3: Auto-Preservation
Scenario: Machine fails unexpectedly, no prior annotation.
##### Steps:
- Machine transitions to the `Failed` phase.
- Machine is drained.
- If `autoPreserveFailedMax` is not breached, the machine is moved to the `Failed:Preserved` phase by MCM, as sketched below.
- After `machinePreserveTimeout`, the machine is terminated by MCM.
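A toy version of this gate, assuming the per-MachineDeployment quota from the earlier sketch (`shouldAutoPreserve` is a made-up name):

```go
package preserve

// shouldAutoPreserve is a hypothetical predicate: an un-annotated Failed
// machine is auto-preserved only while the MachineDeployment's share of
// autoPreserveFailedMax is not yet exhausted by machines already in
// Failed:Preserved.
func shouldAutoPreserve(preservedCount, deploymentQuota int) bool {
	return preservedCount < deploymentQuota
}
```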
#### Use Case 4: Early Release
Scenario: The operator has completed their analysis and no longer requires the machine to be preserved.
##### Steps:
- Machine is in the `Running:Preserved` or `Failed:Preserved` phase.
- Operator adds `node.machine.sapcloud.io/preserve=false` to the node (see the sketch below).
- MCM transitions the machine to `Running` or `Terminating` for `Running:Preserved` or `Failed:Preserved` respectively, even though `machinePreserveTimeout` has not expired.
- If the machine was in `Failed:Preserved`, capacity becomes available for auto-preservation.
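A minimal sketch of the early-release rule, assuming string-typed phases and a hypothetical helper name (the real MCM phase types are richer):

```go
package preserve

// phaseAfterRelease is a hypothetical helper illustrating the early-release
// rule: Running:Preserved returns to Running, Failed:Preserved proceeds to
// Terminating, regardless of any remaining machinePreserveTimeout.
func phaseAfterRelease(current string) string {
	switch current {
	case "Running:Preserved":
		return "Running"
	case "Failed:Preserved":
		return "Terminating"
	default:
		return current // not preserved; nothing to release
	}
}
```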
## Points to Note
- During rolling updates, MCM will NOT honour machine preservation: a Machine that moves to the `Failed` phase will be replaced with a healthy one.
- Hibernation policy will override machine preservation.
- Consumers (with access to the shoot cluster) can annotate Nodes they would like to preserve.
- Operators (with access to control plane) can additionally annotate Machines that they would like to preserve. This feature can be used when a Machine does not have a backing Node and the operator wishes to preserve the backing VM.
- If the backing Node object exists but does not have the preservation annotation, preservation annotations added on the Machine will be honoured.
- However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine's annotation value and will be synced to the Machine object.
- If `autoPreserveFailedMax` is reduced in the Shoot spec, older machines are moved to the `Terminating` phase before newer ones.
- In case of a scale-down of a MachineDeployment's replica count, `Preserved` machines will be the last to be scaled down. The replica count will always be honoured.
- If the value of the annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM (a reconciliation sketch appears at the end of this section).
- On an increase/decrease of the timeout, the new value will only apply to machines that enter the `Preserved` sub-phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
- Once the feature is developed, modify the CA FAQ to recommend `node.machine.sapcloud.io/preserve=now` instead of the currently suggested `cluster-autoscaler.kubernetes.io/scale-down-disabled=true`. This would:
  - harmonise the machine flow,
  - shield users from the CA's internals,
  - make the mechanism generic and no longer CA-specific, and
  - allow a timeout to be specified.
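The annotation-pinning behaviour noted above could look roughly like the following; `ensureScaleDownDisabled` is a hypothetical name, and the actual MCM reconciliation will differ:

```go
// Sketch of pinning the CA annotation while a machine is Preserved.
package preserve

import corev1 "k8s.io/api/core/v1"

const scaleDownDisabledKey = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// ensureScaleDownDisabled re-asserts scale-down-disabled="true" on the node,
// undoing user edits, and reports whether the node object was mutated (so the
// caller knows an update call is needed).
func ensureScaleDownDisabled(node *corev1.Node) bool {
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	if node.Annotations[scaleDownDisabledKey] != "true" {
		node.Annotations[scaleDownDisabledKey] = "true"
		return true
	}
	return false
}
```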