Thanks for the reply! I am currently looking to do this for a Kubernetes cluster running various services to more reliably (and frequently) perform upgrades with automated rollbacks when necessary. At some point in the future, it may include services I am developing, but at the moment that is not the intended use case.
I am not currently familiar enough with the CI/CD pipeline (currently Renovatebot and ArgoCD) to reliably accomplish automated rollbacks, but I believe I can get everything working with the exception of rolling back a data backup (especially for upgrades that contain backwards incompatible database changes). In terms of storage, I am open to using various selfhosted services/platforms even if it means drastically changing the setup (eg - moving from TrueNAS to Longhorn, moving from Ceph to Proxmox, etc.) if it means I can accomplish this without a noticeable performance degradation to any of the services.
I understand that it can be challenging (or maybe impossible) to reliably generate backups while the services are running. I also understand that the best way to do this for databases would be to stop the service and perform a database dump. However, I’m not too concerned with losing <10 seconds of data (or however long the backup jobs take) if the backups can be performed in a way that does not result in corrupted data. Realistically, the most common use cases for the rollbacks would be invalid Kubernetes resources/application configuration as a result of the upgrade or the removal/change of a feature that I depend on.
Yes, I am using PersistentVolumes. I have played around with different tools that have backup/snapshot abilities, but I haven’t seen a way to integrate that functionality with a CD tool. I’m sure if I spent enough time working through things, I may be able to put together something that allows the CD tool to take a snapshot. However, I think that having it handle rollbacks would be a bit too much for me to handle without assistance.