GEP-0022: Improved Usage of the ShootState API
Table of Contents
Summary
For each Shoot resource, gardenlet creates a corresponding ShootState resource which has the same lifecycle as the Shoot. Today, such resources are used for two purposes:
- Exposing the client CA certificate and key of the shoot cluster such that the
gardener-apiserveris able to issue short-lived client certificates via theshoots/adminkubeconfigsubresource. - Storing the most crucial data for shoot clusters which is necessary to migrate and restore a cluster from one
Seedto another (certificates, keys, secrets, relevant extensions state for created infrastructure or worker machines).
Especially due to the second purpose, the ShootState API imposes several scalability concerns and limitations for large Gardener installations. Hence, this GEP proposes the elimination of said ShootState API by replacing it with lightweight alternatives.
This GEP focuses on the second use-case only and ignores the first use-case. An alternative for it is worked on separately in Introduce core.gardener.cloud/v1beta1.InternalSecret API #7999.
Motivation
The gardenlet runs two controllers for maintaining the most crucial data of shoot clusters in their ShootState resources:
The
shootstate-secretcontroller watchesSecrets labeled withpersist=truein the shoot namespaces in the seed cluster. It persists all labels and data of suchSecrets into the relatedShootStateresource.In reality, those
Secrets only change when the end-user initiates a credentials rotation operation, so rather rarely. Still, duplicating all labels and data of suchSecrets into the garden cluster increases the size ofShootStateresources.The
shootstate-extensionscontroller watches all resources of theextensions.gardener.cloud/v1alpha1API group in the shoot namespaces in the seed cluster. It persists all extensions state and resources in the respective.status.stateand.status.resourcesfields. See also this document for more background information.In reality, this is the controller which can cause quite a lot of network I/O (hence, traffic and costs). While most of the extension states change rather rarely, the state of the
extensions.gardener.cloud/v1alpha1.Workerresources can update quite frequently whenever new worker machines join or leave the cluster.gardenlethas to watch and replicate all such state changes to theShootStateresource.
In large Gardener installations, this can result in a lot of requests to the gardener-apiserver (transitively, also to ETCD of the garden cluster), and it can significantly contribute to the ETCD size (note that ETCD has a practical space limit of 8 GB). Besides, it increases the memory consumption considerably for all clients working with ShootStates.
ShootStates were introduced years ago when control plane migration was still in its early phase. Back then, we wanted to achieve so-called "bad case control plane migrations" (which practically means restoring a shoot control plane without being able to access its current seed cluster (maybe because it's down or destroyed)). For such scenarios, it is obviously necessary to have to most up-to-date state available in order to perform the restoration in a new seed cluster. As the current seed cluster is unavailable in this scenario and cannot be accessed, we decided to continuously replicate the state to the garden cluster.
With Finalise "Bad Case" Control Plane Migration #6302, however, this use case has been eliminated. It was decided that it won't be implemented, especially after ☂️ [GEP-20] Highly Available Seed and Shoot Clusters #6529 brings the ability to run seed clusters in a highly available manner (which reduces the chances to ever need a "bad case control plane migration" even further). With this, it is no longer necessary to continuously replicate the relevant state to the garden cluster, allowing us to reduce the usage of the ShootState API.
Goals
- Reduce network traffic and resulting costs due to continuous replication of state from seed clusters to garden clusters.
- Reduce size of garden cluster's ETCD, and with it the risk to run into its practical space limit of 8 GB.
- Simplify code that was mainly introduced in the context of the "bad case control plane migration".
- Perform regular backups of the shoot states for shoots running on unmanaged seeds.
Non-Goals
- Adaptation of the "first use-case" of the
ShootStateAPI (this is handled separately in Introducecore.gardener.cloud/v1beta1.InternalSecretAPI #7999). - Provide means to restore a deleted
Shootfrom its persisted state.
Proposal
Now that the "bad case control plane migration" scenario has been eliminated, the sole usage of the ShootState resource is for the "good case control plane migration". However, this case is significantly simpler since the current seed cluster is still available. Hence, when a control plane migration is triggered, gardenlet could easily collect the needed data from the current seed cluster, store it in the garden cluster for later usage during the restoration phase, and eventually delete the data again.
We are proposing the following changes:
- The
shootstate-extensionscontroller ofgardenletwill be dropped, i.e., it will no longer replicate the extensions states to theShootStateresources in the garden cluster. - The
shootstate-secretcontroller will also be dropped as soon as Introducecore.gardener.cloud/v1beta1.InternalSecretAPI #7999 has been implemented.
With this, the usage of the ShootState API can be drastically reduced.
During the
Migratephase of aShoot,gardenletwill fetch all relevant state in the shoot namespaces of the current seed, similar to what the two controllers are doing as of today:- all labels and data for
Secrets labeled withpersist=true. - all
.status.stateand.status.resourcesfor resources in theextensions.gardener.cloud/v1alpha1API group.
The
gardenletputs this state into regularcore.gardener.cloud/v1beta1.ShootStateresource stored in the namespace of the respectiveShootin the garden cluster.- all labels and data for
During the
Restorephase of aShoot,gardenletwill read the createdcore.gardener.cloud/v1beta1.ShootStatefrom the namespace of the respectiveShootin the garden cluster and uses it to populate the state in the new seed cluster just like today.After successful restoration, the created
core.gardener.cloud/v1beta1.ShootStateis deleted again from the namespace of the respectiveShootin the garden cluster.
This approach addresses all the concerns and downsides of today's implementation: Control plane migrations happen rather rarely and usually only take a few minutes. The shoot state is still replicated to the garden cluster during this process, however it will no longer exist for the entire lifetime of the Shoot but only during the short period of the control plane migration (which in reality is actually never executed for most Shoots). It will also not be continuously updated anymore. Only when a migration is performed, the current state is collected, and deleted again after the migration has been completed.
⚠️There will be one exception to the above which applies for
Shoots running on so-called "unmanagedSeeds". SuchSeeds are those not backed by aseedmanagement.gardener.cloud/v1alpha1.ManagedSeedobject.For
Shoots running on such unmanagedSeeds, we cannot make any assumptions about the ETCD backups. While we can easily access the ETCD backup of aManagedSeedin case of a disaster in order to get access to the relevant state of a particularShoot, we can't do so for unmanagedSeeds.Hence, we propose to run a periodic job that backs up the shoot state for
Shoots running on unmanagedSeeds. This periodic job will be implemented as part of a new controller ingardenletwhich runs according to the configured period (default:6h).
In summary:
- the size of the ETCD of the garden cluster will be drastically reduced, especially for large Gardener installations.
- the load on the API server and ETCD of the garden cluster will be drastically reduced, preparing the components for further scale.
- the network traffic and related costs will be drastically reduced.
- only very few
ShootStates will remain (and those get updated less frequently), since productive Gardener landscapes typically don't run a lot ofShoots on unmanagedSeeds.
Alternatives
Instead of persisting the shoot state via the core.gardener.cloud/v1beta1.ShootState resource into ETCD of the garden cluster, the relevant data could be pushed into the backup buckets that already exist as of today and are used for the ETCD backups.
We have implemented a PoC during the Hack the Garden 05/2023 hackathon which extends the extensions.gardener.cloud/v1alpha1 API with two new resources:
BackupUploads, which can be used to upload arbitrary data to an existingBackupEntry.Example YAML manifest
yamlapiVersion: extensions.gardener.cloud/v1alpha1 kind: BackupUpload metadata: name: example namespace: default spec: entryName: example-entry filePath: shootstate.yaml.enc type: local data: <to-be-uploaded-data>BackupDownloads, which can be used to download arbitrary data from an existingBackupEntry.Example YAML manifest
yamlapiVersion: extensions.gardener.cloud/v1alpha1 kind: BackupDownload metadata: name: example namespace: default spec: entryName: example-entry filePath: shootstate.yaml.enc type: local status: data: <downloaded-data>
Similar to all other resources in the extensions.gardener.cloud/v1alpha1 API, we could provide respective controllers in the extensions library such that provider extensions only need to implement their specific business logic (i.e., uploading/downloading to/from their infrastructure-specific object storage).
However, this approach is significantly more expensive in terms of development costs and complexity since the following additional adaptations would be required:
- Introduction of new resources in the
extensions.gardener.cloud/v1alpha1API which comes with validation, deployment dependencies, etc. - Introduction of new extension controllers.
- Adaptation of all existing provider extensions with support for
Backup{Bucket,Entry}s with the implementation for the newBackup{Up,Down}loadresources. - Encryption of the shoot state data (since we do not want to store it as plain text in the backup buckets). With above proposal, encryption is handled by
gardener-apiservervia the standardEncryptionConfiguration. - Management (incl. rotation) of a respective encryption key.
So far, the only justification for following such a path would be the necessity for regular backups of the shoot states (e.g., every hour or so). In this case, the shoot states would again be stored in the garden cluster for the entire lifetime of the Shoots.
This alternative approach would still improve a few things like network traffic, etc. since the data would no longer be continuously replicated (only with a much lower frequency compared to today), but eventually today's technical debts remain limiting the garden cluster's scalability. As no appropriate reasoning for regular backups has been presented yet, they were added as "non-goal" in this GEP. Hence, implementing this alternative approach is considered unnecessary.
Future Improvements
Eliminating the Worker State Reconciler
With Move machine-controller-manager reconciliation responsibility from extensions to gardenlet #7594, gardenlet will become aware of the machine.sapcloud.io/v1alpha1 API used for managing the worker machines of shoot clusters.
This allows to drop the Worker state reconciler which currently watches all MachineDeployments, MachineSets, and Machines, and replicates their specifications into the .status.state of the related extensions.gardener.cloud/v1alpha1.Worker resource. Instead, gardenlet could then collect these resources during the Migrate phase itself, effectively compute the necessary Worker state, and also restore it during the Restore phase.
With this approach, we could also reduce the load on the seed clusters' API servers since it would no longer be necessary to continuously duplicate the machine-related resources into the Worker status. For now, this GEP focuses on the ShootState API only, but as soon as Move machine-controller-manager reconciliation responsibility from extensions to gardenlet #7594 has been implemented, we could start looking into this next-step improvement.
Compressing the ShootState Data
Typically, the size of a ShootState resource is dominated by the MachineDeployments, MachineSets, and Machines. In case the few remaining ShootState resources still grow too large, one could implement compression strategies for the machine state (and maybe other state as well, if necessary) to reduce the overall storage size.
Since we expect only a very small number of ShootStates to remain for productive Gardener installations, this optimization is out of scope for now.