Backup and Restore Implementation
Summary
QHub currently lacks a backup and restore solution. Initially this was a simple problem, since all state was stored on a single NFS filestore, and we talked about having a Kubernetes cron job run restic daily to back up the filesystem to a single S3 bucket. However, databases and other state are now starting to be stored in several other PVCs within QHub. We expect this to grow, so we need a generic solution that allows us to back up and restore all storage within a cluster. We are proposing Kubernetes backups using velero, which looks to be a well-adopted open source solution for backup and restore.
Proposed implementation
We realize this is a large issue, and it will most likely be easiest to approach this problem in steps.
The first step would be to deploy the velero helm chart within QHub. There are other examples of deploying a helm chart within QHub in PRs, the most similar one being https://github.com/Quansight/qhub/pull/733. This will only deploy the velero agent on the kubernetes cluster. It should be configured via a qhub-config.yaml configuration setting; the PR above gives an example of adding such a setting. There will additionally be a key credentials that takes an arbitrary dict of credentials to pass on to the helm chart. See https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands. These credentials will be used to set up file backups and block storage backups. The key schedule will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.
velero:
  enabled: true/false
  schedule: "0 0 * * *"
  credentials:
    ...
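As a minimal sketch of how this section might be read on the Python side, the snippet below parses the proposed keys and applies the disabled-by-default rule. The VeleroConfig class and parse_velero_config function are hypothetical names, not existing QHub code, and the five-field cron check is only an assumed sanity check.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

DEFAULT_SCHEDULE = "0 0 * * *"  # cron expression from the issue: daily at midnight


@dataclass
class VeleroConfig:
    """Hypothetical in-memory form of the proposed `velero` section."""

    enabled: bool = False  # backups are optional and disabled by default
    schedule: str = DEFAULT_SCHEDULE
    credentials: Dict[str, Any] = field(default_factory=dict)


def parse_velero_config(qhub_config: Dict[str, Any]) -> VeleroConfig:
    """Read the `velero` key of an already-loaded qhub-config.yaml dict."""
    section = qhub_config.get("velero") or {}
    config = VeleroConfig(
        enabled=bool(section.get("enabled", False)),
        schedule=str(section.get("schedule", DEFAULT_SCHEDULE)),
        credentials=dict(section.get("credentials") or {}),
    )
    # Assumed sanity check: a standard cron expression has five fields.
    if len(config.schedule.split()) != 5:
        raise ValueError(f"invalid cron schedule: {config.schedule!r}")
    return config


if __name__ == "__main__":
    print(parse_velero_config({"velero": {"enabled": True}}))
```

With no velero key present, the parsed config stays disabled, which matches the requirement that backups are opt-in.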
Next, once velero is deployed on the cluster, there should be the ability to trigger a backup manually. Similar to how we handle terraform (https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23), since velero is a Go binary it should be possible to transparently download the velero binary https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1 and expose it in the CLI behind qhub backup and qhub restore commands. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the QHub storage.
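A rough sketch of what such a provider could look like is shown below. All function names are hypothetical, and the release asset naming (velero-<version>-<os>-<arch>.tar.gz) is an assumption based on the v1.6.1 release page rather than anything QHub defines. The commands rely on the velero CLI subcommands backup create and restore create --from-backup.

```python
import platform
import subprocess
from typing import List

VELERO_VERSION = "v1.6.1"  # release linked in the issue

# Map Python's machine names onto the names assumed in release assets.
_ARCH = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def download_url(version: str = VELERO_VERSION, system: str = "", machine: str = "") -> str:
    """Build the assumed download URL for the velero release tarball."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    arch = _ARCH.get(machine, machine)
    return (
        "https://github.com/vmware-tanzu/velero/releases/download/"
        f"{version}/velero-{version}-{system}-{arch}.tar.gz"
    )


def backup_command(name: str) -> List[str]:
    """Arguments for a manual backup; --wait blocks until it finishes."""
    return ["backup", "create", name, "--wait"]


def restore_command(backup_name: str) -> List[str]:
    """Arguments for restoring from a named backup."""
    return ["restore", "create", "--from-backup", backup_name, "--wait"]


def run(args: List[str], velero_path: str = "velero") -> None:
    """Run the velero binary, raising CalledProcessError on failure."""
    subprocess.run([velero_path, *args], check=True)


if __name__ == "__main__":
    print(download_url())
    print(backup_command("qhub-manual"))
```

Keeping command construction separate from execution makes the provider testable without a cluster; a qhub backup subcommand would then only need to call run(backup_command(...)).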
Initially we would like a simple qhub deploy and qhub restore command. Eventually we could imagine these commands growing into more complicated backups, but we realize this problem is complicated enough as it is scoped.
Additionally, documentation should be added to the admin and dev guides.
Acceptance Criteria
- Upon initial deployment of a QHub cluster with backups enabled in the configuration, the cluster should be backed up every 24h to an S3 bucket.
- qhub backup should trigger a manual backup of the cluster, with files being backed up to an S3 bucket.
- qhub restore should trigger a restore action that will refresh the contents of PVCs within the cluster (this is less well understood at the moment and may not be possible).
- Velero is installed via a helm chart instead of the velero binary.
Tasks to complete
- https://github.com/Quansight/qhub/issues/744 work with @tarundmsharma to complete deployment of helm chart using terraform
- #745
- #746
Related to
- For history, see https://github.com/Quansight/qhub/issues/99
Issue Analytics
- Created 2 years ago
- Comments:8 (8 by maintainers)
I wanted to document a solution I got working on prem via minikube and via DigitalOcean. This seems to be cloud agnostic for backups, which is promising. In addition, I didn’t realize how complete the velero backups are: they include all of the resources as well and give strong controls on the backup.
First, start the minikube cluster. Then we need to create the minio S3 backup target (we could, of course, use a cloud-based backup instead) and an example application to test the backup with.
Then kubectl apply both of these charts. Next we install the velero CLI locally and also install velero on the cluster. We need to create a file holding the credentials for our S3 bucket and describe how to access it.
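As an illustration of the credentials step, the sketch below renders a credentials file in the AWS-style INI layout that velero's AWS object store plugin reads, plus a backup location config string pointing at an S3-compatible endpoint such as minio. The function names, the example keys, and the example URL are placeholders and not values from this setup.

```python
from pathlib import Path


def render_velero_credentials(access_key: str, secret_key: str) -> str:
    """Render an AWS-style credentials file with a single default profile."""
    return (
        "[default]\n"
        f"aws_access_key_id = {access_key}\n"
        f"aws_secret_access_key = {secret_key}\n"
    )


def write_velero_credentials(path: str, access_key: str, secret_key: str) -> Path:
    """Write the credentials file and restrict it to the current user."""
    target = Path(path)
    target.write_text(render_velero_credentials(access_key, secret_key))
    target.chmod(0o600)
    return target


def backup_location_config(s3_url: str, region: str = "minio") -> str:
    """Config string for an S3-compatible endpoint (path-style access)."""
    return f"region={region},s3ForcePathStyle=true,s3Url={s3_url}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_velero_credentials("EXAMPLE_KEY", "EXAMPLE_SECRET"))
    print(backup_location_config("http://minio.example:9000"))
```

The rendered file would be passed to velero install via --secret-file, and the config string via --backup-location-config.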
And then we download velero
Finally, let’s demonstrate a backup
You can check that a backup was performed successfully by visiting the web UI for minio. The minikube IP address is available via minikube ip and the port is 30900. Additionally, you can access the UI via port forwarding https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/. I also briefly tested deleting the pod resource and then restoring the volume. This seemed to work, though I didn’t test it as much. However, the backup is clearly happening on DO and minikube. On prem, velero has issues with hostPaths as PVC volumes; however, outside of testing I would consider this a rare circumstance, since hostPaths cannot work for any true multi-node kubernetes deployment.
This also looks like it will be able to back up EFS and cloud-specific PVCs 😄. So good news @brl0! Still very much a POC, but I believe this tool will work great for our use case and then some.
This is verified working on AWS and GCP:
Backup
In order to specify a volume for restic restoration, we need to annotate a pod with backup.velero.io/backup-volumes: <pods_name_for_persistentvolume>. I decided to do this by creating a pod specifically for this purpose, using a manifest saved as custom_pod.yaml.
I added it to the cluster with kubectl apply -f custom_pod.yaml.
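The original manifest is not reproduced here, so the following is only a sketch of what such a pod could look like. It builds a pod that mounts a claim and carries the backup.velero.io/backup-volumes annotation, whose value is the pod's own name for the volume. The pod name, namespace, image, and mount path are all hypothetical; the claim name nfs-mount-dev-share is the one mentioned in the restore notes. The output is JSON, which kubectl apply accepts and which is also valid YAML.

```python
import json
from typing import Any, Dict


def backup_pod_manifest(
    pod_name: str,
    claim_name: str,
    volume_name: str = "backup-target",
    namespace: str = "dev",
) -> Dict[str, Any]:
    """Pod that mounts one PVC and marks that volume for restic backup."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": pod_name,
            "namespace": namespace,
            # The annotation value is the volume name used inside this pod.
            "annotations": {"backup.velero.io/backup-volumes": volume_name},
        },
        "spec": {
            "containers": [
                {
                    "name": "sleeper",
                    "image": "busybox",  # hypothetical image choice
                    # Keep the pod alive so the volume stays mounted.
                    "command": ["sh", "-c", "while true; do sleep 3600; done"],
                    "volumeMounts": [{"name": volume_name, "mountPath": "/data"}],
                }
            ],
            "volumes": [
                {
                    "name": volume_name,
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = backup_pod_manifest("restic-backup-pod", "nfs-mount-dev-share")
    print(json.dumps(manifest, indent=2))
```

Redirecting the printed output into custom_pod.yaml would give a file that can be applied as described above.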
To avoid errors on mounts that don’t need to be backed up, set the following labels to exclude the persistentvolumeclaims like so:
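The labels used in the original comment are not shown here. One option, given as an assumption, is velero's velero.io/exclude-from-backup=true label, which tells velero to skip the labelled resource. The sketch below builds the corresponding kubectl label commands; the function name and the PVC names in the demo are hypothetical.

```python
import shlex
from typing import Iterable, List

EXCLUDE_LABEL = "velero.io/exclude-from-backup=true"


def exclude_pvc_commands(pvc_names: Iterable[str], namespace: str) -> List[List[str]]:
    """One `kubectl label` command per PVC that should be skipped."""
    return [
        [
            "kubectl", "label", "pvc", name,
            "--namespace", namespace,
            EXCLUDE_LABEL,
            "--overwrite",  # allow re-running without a label conflict error
        ]
        for name in pvc_names
    ]


if __name__ == "__main__":
    # Hypothetical claim names for illustration only.
    for cmd in exclude_pvc_commands(["example-db-pvc", "example-cache-pvc"], "dev"):
        print(shlex.join(cmd))
```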
With this setup, velero can be installed with the default-volumes-to-restic=false option, and the backup can then be created.
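The exact install command is not preserved in this comment, so the following is a sketch under assumptions: it assembles a velero install invocation with restic enabled and default-volumes-to-restic turned off, so that only annotated volumes are backed up. The bucket name, secret file path, and plugin image tag are placeholders; on GCP the provider and plugin would differ.

```python
import shlex
from typing import List


def install_command(
    bucket: str,
    secret_file: str,
    provider: str = "aws",
    plugin: str = "velero/velero-plugin-for-aws:v1.2.0",  # assumed plugin tag
) -> List[str]:
    """Build a `velero install` command that opts volumes in by annotation."""
    return [
        "velero", "install",
        "--provider", provider,
        "--plugins", plugin,
        "--bucket", bucket,
        "--secret-file", secret_file,
        "--use-restic",
        # Only volumes named in backup.velero.io/backup-volumes are backed up.
        "--default-volumes-to-restic=false",
    ]


if __name__ == "__main__":
    # Placeholder bucket and credentials path for illustration only.
    print(shlex.join(install_command("example-backup-bucket", "./credentials-velero")))
```

After installation, a backup can be started with velero backup create followed by a backup name.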
Restore
Note that all user notebooks need to be shut down as well. Existing user sessions will maintain a connection to the persistent volume claim and prevent deletion. We delete the resources that are using nfs-mount-dev-share; with these gone, the restore can be initiated.
Note that the restore will say that it partially failed. This is because there is already a symlink for /home/shared. However, data in the user directories as well as the shared directories gets restored as expected.
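Because a partially failed restore can still be an acceptable outcome here, any automation around qhub restore would need to treat that state separately from a hard failure. A minimal sketch, assuming the restore is inspected as JSON (for example from velero restore get <name> -o json) and that its status.phase field carries values such as Completed, PartiallyFailed, or Failed; the function names are hypothetical.

```python
import json


def restore_phase(restore_json: str) -> str:
    """Extract status.phase from a restore object serialized as JSON."""
    return json.loads(restore_json).get("status", {}).get("phase", "")


def restore_acceptable(phase: str, tolerate_partial: bool = True) -> bool:
    """Completed is always fine; PartiallyFailed only when tolerated."""
    if phase == "Completed":
        return True
    return tolerate_partial and phase == "PartiallyFailed"


if __name__ == "__main__":
    sample = json.dumps({"status": {"phase": "PartiallyFailed"}})
    print(restore_acceptable(restore_phase(sample)))
```

Tolerating the partial failure by default reflects the symlink case described above; a stricter caller can pass tolerate_partial=False.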