Backup and Restore Implementation

Summary

QHub currently lacks a backup and restore solution. Initially this problem was not especially complex, since all state was stored on a single NFS filestore. We talked about having a kubernetes cron job run restic daily to back up the filesystem to a single s3 bucket. However, databases and other state are now starting to be stored in several other pvcs within QHub. We expect this to grow, so we need a generic solution that allows us to backup/restore all storage within a cluster. We are proposing kubernetes backups using velero, which looks to be a well-adopted open source solution for backup and restore.

Proposed implementation

We realize this is a large issue, and it will most likely be easiest to approach this problem in steps.

The first step would be to deploy the velero helm chart within QHub. There are other examples of deploying a helm chart within QHub in PRs; the most similar one is https://github.com/Quansight/qhub/pull/733. This will only deploy the velero agent on the kubernetes cluster. This should be configured via a qhub-config.yaml configuration setting. The PR above gives an example of adding this setting. There will additionally be a credentials key that takes an arbitrary dict of credentials to pass on to the helm chart. See https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands. These credentials will be used to set up file backups and block storage backups. The schedule key will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.

velero:
  enabled: true/false
  schedule: "0 0 * * *"
  credentials:
     ...
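
As a rough illustration of how the settings above might be read, here is a minimal Python sketch. The function name parse_velero_config and its validation rules are assumptions made for illustration, not existing QHub code. It only applies the stated default (backups disabled) and checks that schedule looks like a five-field cron expression.

```python
from typing import Any, Dict


def parse_velero_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the velero section of a parsed qhub-config.yaml with defaults.

    Hypothetical helper: the name and the validation rules are assumptions.
    """
    section = dict(config.get("velero") or {})
    parsed = {
        # Backups are optional and stay disabled unless explicitly enabled.
        "enabled": bool(section.get("enabled", False)),
        # Daily at midnight, matching the example schedule in the issue.
        "schedule": str(section.get("schedule", "0 0 * * *")),
        # Arbitrary dict that would be passed through to the helm chart.
        "credentials": dict(section.get("credentials") or {}),
    }
    if len(parsed["schedule"].split()) != 5:
        raise ValueError(
            f"schedule must have 5 cron fields: {parsed['schedule']!r}"
        )
    return parsed
```

With an empty configuration this returns enabled as False, so a deployment that never mentions velero would not get backups.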

Next, once velero is deployed on the cluster, there should be the ability to trigger a backup manually, similar to how we handle terraform: https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23. Since velero is a go binary, it should be possible to transparently download the velero binary https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1 and expose it in the cli behind qhub backup and qhub restore commands. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the qhub storage.
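
A provider along these lines could be sketched in Python as follows. This is not the actual QHub provider: the function names are hypothetical, and the release asset naming is an assumption inferred from the velero-v1.6.3-linux-amd64.tar.gz asset on the velero releases page. The backup and restore flags are the ones used in the comments on this issue.

```python
import platform
import subprocess
from typing import List, Optional


def velero_download_url(version: str, system: str = "", machine: str = "") -> str:
    """Build the release tarball URL (asset naming is an assumption)."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    # Map common machine names onto the architecture names used by Go builds.
    arch = {"x86_64": "amd64", "aarch64": "arm64"}.get(machine, machine)
    return (
        "https://github.com/vmware-tanzu/velero/releases/download/"
        f"v{version}/velero-v{version}-{system}-{arch}.tar.gz"
    )


def backup_command(
    velero: str, name: str, namespaces: Optional[List[str]] = None
) -> List[str]:
    """Arguments for a manual backup; `velero` is the path to the binary."""
    cmd = [velero, "backup", "create", name, "--wait"]
    if namespaces:
        cmd.append("--include-namespaces=" + ",".join(namespaces))
    return cmd


def restore_command(velero: str, name: str, from_backup: str) -> List[str]:
    """Arguments for restoring from a named backup."""
    return [velero, "restore", "create", name, "--from-backup", from_backup]


def run(cmd: List[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes the provider easy to unit test without a cluster.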

Initially we would like simple qhub backup and qhub restore commands. Eventually we could imagine these commands growing to support more complicated backups, but we realize this problem is complicated enough as currently scoped.

Additionally, documentation should be added to the admin and dev guides.

Acceptance Criteria

Upon initial deployment of a QHub cluster with backups enabled in the configuration, the cluster should be backed up every 24h to an s3 bucket
qhub backup should trigger a manual backup of the cluster, with files being backed up to an s3 bucket
qhub restore should trigger a restore action that will refresh the contents of pvcs within the cluster (this is less well understood at the moment and may not be possible)
Velero is installed via a helm chart instead of the velero binary
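
The first criterion amounts to a velero Schedule resource with a daily cron expression. The Python sketch below builds such a manifest as a plain dict. The function name, the resource name qhub-daily, and the velero namespace are assumptions for illustration; in practice the helm chart would likely create this resource from the schedule setting.

```python
from typing import Any, Dict, List, Optional


def schedule_manifest(
    name: str,
    cron: str = "0 0 * * *",
    namespaces: Optional[List[str]] = None,
    velero_namespace: str = "velero",  # assumed install namespace
) -> Dict[str, Any]:
    """Build a velero.io/v1 Schedule manifest for recurring backups."""
    # The template is a backup spec; leaving it empty backs up everything.
    template: Dict[str, Any] = {}
    if namespaces:
        template["includedNamespaces"] = list(namespaces)
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": velero_namespace},
        "spec": {"schedule": cron, "template": template},
    }


# Hypothetical example: back up the dev namespace every day at midnight.
daily = schedule_manifest("qhub-daily", namespaces=["dev"])
```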

Tasks to complete

https://github.com/Quansight/qhub/issues/744 work with @tarundmsharma to complete deployment of helm chart using terraform
#745
#746

Related to

For history, see https://github.com/Quansight/qhub/issues/99

Issue Analytics

State:
Created 2 years ago
Comments:8 (8 by maintainers)

Top GitHub Comments

2 reactions

costrouc commented, Sep 16, 2021

I wanted to document a solution I got working on-prem via minikube and on DigitalOcean. The approach seems to be cloud agnostic for backups, which is promising. In addition, I didn’t realize how complete the velero backups are: they include all of the cluster resources as well and give strong controls over the backup.

minikube start --driver=docker --kubernetes-version=v1.21.3

This starts the minikube cluster. Next we need to create a minio s3 store to hold the backups; a cloud-based bucket would work as well.

apiVersion: v1
kind: Service
metadata:
  name: minio
spec:
  type: NodePort
  ports:
  - name: "9000"
    nodePort: 30900
    port: 9000
    targetPort: 9000
  - name: "9001"
    nodePort: 30901
    port: 9001
    targetPort: 9001
  selector:
    app: minio
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  labels:
    app: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:RELEASE.2021-08-25T00-41-18Z
          args:
            - "-c"
            - "mkdir -p /data/velero && /usr/bin/minio server /data --console-address 0.0.0.0:9001"
          command:
            - "sh"
          env:
            - name: MINIO_ACCESS_KEY
              value: admin
            - name: MINIO_SECRET_KEY
              value: password
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - mountPath: /data
              name: minio-claim
      restartPolicy: Always
      volumes:
        - name: minio-claim
          persistentVolumeClaim:
            claimName: minio-claim

And then an example application to test the backup with:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: hellopod
spec:
  containers:
    - name: hello
      image: busybox
      imagePullPolicy: IfNotPresent
      command:
      - /bin/sh
      - -c
      - "date >> /data/example.txt; sleep 100000"
      volumeMounts:
        - mountPath: /data
          name: pod-claim
  restartPolicy: OnFailure
  volumes:
    - name: pod-claim
      persistentVolumeClaim:
        claimName: pod-claim

Then kubectl apply both of these manifests. Next we install the velero client locally and also install velero on the cluster. First we need to create a credentials file that tells velero how to access our S3 bucket.

[default]
aws_access_key_id = admin
aws_secret_access_key = password
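
If this file were generated by tooling rather than written by hand, a small Python sketch could look like the following. The helper names are hypothetical; the target path would be whatever is later passed to --secret-file.

```python
import configparser
import io
from pathlib import Path


def render_credentials(access_key: str, secret_key: str) -> str:
    """Render an AWS-style credentials file with a single [default] profile."""
    parser = configparser.ConfigParser()
    # "default" (lowercase) is an ordinary section for configparser.
    parser["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def write_credentials(path: Path, access_key: str, secret_key: str) -> None:
    """Write the credentials file and restrict it to the current user."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_credentials(access_key, secret_key))
    path.chmod(0o600)
```

The admin/password values in the example above are fine for a local minio test, but real keys should not be committed to the repository.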

And then we download velero

wget https://github.com/vmware-tanzu/velero/releases/download/v1.6.3/velero-v1.6.3-linux-amd64.tar.gz
tar -xf *.tar.gz
cd velero-*

./velero install --provider=aws --plugins velero/velero-plugin-for-aws:v1.0.0 --use-restic --use-volume-snapshots=false --bucket=velero --secret-file /tmp/velero/credentials.txt --backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.default.svc:9000

Finally, let's demonstrate a backup

./velero backup create anexample --default-volumes-to-restic=true

You can check that a backup was performed successfully by visiting the web ui for minio. The minikube ip address is available via minikube ip and the port is 30900. You can also access the ui via port forwarding: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/. I also briefly tested deleting the pod resource and then restoring the volume. This seemed to work, though I didn’t test it as much. However, the backup is clearly happening on DO and minikube. On-prem, velero has issues with hostPaths as pvc volumes; however, outside of testing I would consider this a rare circumstance, since hostPaths cannot work for any true multinode kubernetes deployments.

This also looks like it will be able to backup efs and cloud specific pvcs 😄. So good news @brl0! Still very much POC but I believe this tool will work great for our use case and then some.

1 reaction

tylerpotts commented, Dec 3, 2021

This is verified working on AWS and GCP:

Backup

In order to specify a volume for restic backup, we need to annotate a pod with backup.velero.io/backup-volumes: <pods_name_for_persistentvolume>. I decided to do this by creating a pod specifically for this purpose. With the following saved as custom_pod.yaml, I added it to the cluster with kubectl apply -f custom_pod.yaml:

kind: Pod
apiVersion: v1
metadata:
  name: restic-placeholder
  namespace: dev
  annotations:
    backup.velero.io/backup-volumes: home
spec:
  volumes:
    - name: home
      persistentVolumeClaim:
        claimName: "nfs-mount-dev-share"
  containers:
    - name: placeholder
      image: ubuntu
      command: ["sleep", "36000000000000"]
      volumeMounts:
        - mountPath: "/data"
          name: home
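
If the placeholder pod were generated per deployment instead of written by hand, a Python sketch might look like this. The function name is hypothetical; the field values mirror the manifest above. The key detail is that the annotation value must match the name of the pod volume, not the name of the claim.

```python
import json
from typing import Any, Dict


def placeholder_pod(
    namespace: str, claim_name: str, volume_name: str = "home"
) -> Dict[str, Any]:
    """Build a pod that mounts a PVC and opts its volume in to restic backup."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "restic-placeholder",
            "namespace": namespace,
            # Must name the pod volume below, not the claim itself.
            "annotations": {"backup.velero.io/backup-volumes": volume_name},
        },
        "spec": {
            "volumes": [
                {
                    "name": volume_name,
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
            "containers": [
                {
                    "name": "placeholder",
                    "image": "ubuntu",
                    "command": ["sleep", "36000000000000"],
                    "volumeMounts": [{"mountPath": "/data", "name": volume_name}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, e.g. piped into `kubectl apply -f -`.
    print(json.dumps(placeholder_pod("dev", "nfs-mount-dev-share"), indent=2))
```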

To avoid errors on mounts that don’t need to be backed up, set the following labels to exclude the persistentvolumeclaims like so:

kubectl label pvc conda-store-dev-share velero.io/exclude-from-backup=true -n dev
kubectl label pvc hub-db-dir velero.io/exclude-from-backup=true -n dev
kubectl label pvc qhub-conda-store-storage velero.io/exclude-from-backup=true -n dev
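
For a longer list of claims, the same label commands could be generated rather than typed out. A minimal Python sketch, with a hypothetical function name:

```python
from typing import Iterable, List


def exclude_from_backup_commands(
    pvcs: Iterable[str], namespace: str
) -> List[List[str]]:
    """Build one `kubectl label pvc` command per claim to skip during backup."""
    return [
        [
            "kubectl", "label", "pvc", pvc,
            "velero.io/exclude-from-backup=true",
            "-n", namespace,
        ]
        for pvc in pvcs
    ]


# The three claims excluded in the commands above.
commands = exclude_from_backup_commands(
    ["conda-store-dev-share", "hub-db-dir", "qhub-conda-store-storage"], "dev"
)
```

Each entry is an argument list that can be passed to subprocess.run as-is.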

With this setup, velero can be installed with --default-volumes-to-restic=false:

velero install \
--provider=aws \
--plugins=velero/velero-plugin-for-aws:v1.3.0 \
--use-restic \
--default-volumes-to-restic=false \
--bucket=$BUCKET \
--secret-file ./credentials.txt \
--backup-location-config region=$REGION,s3ForcePathStyle=true,s3Url=http://s3.$REGION.amazonaws.com \
--wait \
--snapshot-location-config region=$REGION

The backup is created with:

velero backup create qhub-backup --include-namespaces=dev --wait

Restore

Note that all user notebooks need to be shut down first. Existing user sessions will maintain a connection to the persistent volume claim and prevent deletion. We delete the resources that are using nfs-mount-dev-share with the commands below:

kubectl delete deployments qhub-jupyterhub-sftp -n dev
kubectl delete pod restic-placeholder -n dev
kubectl delete pvc nfs-mount-dev-share -n dev
kubectl patch pv nfs-mount-dev-share -p '{"spec":{"claimRef": null}}'
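
Because the order of these steps matters (the claim cannot be deleted while pods still use it), a qhub restore command might keep them as an ordered list and offer a dry run. A Python sketch under that assumption, with hypothetical function names and the resource names taken from the commands above:

```python
import shlex
from typing import List


def restore_prep_commands(
    namespace: str = "dev", claim: str = "nfs-mount-dev-share"
) -> List[List[str]]:
    """Ordered steps that release the PVC so velero can recreate it."""
    return [
        # 1. Remove workloads that still mount the claim.
        ["kubectl", "delete", "deployments", "qhub-jupyterhub-sftp", "-n", namespace],
        ["kubectl", "delete", "pod", "restic-placeholder", "-n", namespace],
        # 2. Delete the claim itself.
        ["kubectl", "delete", "pvc", claim, "-n", namespace],
        # 3. Clear the claimRef so the persistent volume can be bound again.
        ["kubectl", "patch", "pv", claim, "-p", '{"spec":{"claimRef": null}}'],
    ]


def dry_run(commands: List[List[str]]) -> str:
    """Render the steps as shell-quoted lines without executing anything."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    print(dry_run(restore_prep_commands()))
```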

With these gone, the restore can be initiated with:

velero restore create qhub-restore --from-backup qhub-backup

Note that the restore will say that it partially failed. This is because there is already a symlink for /home/shared. However, data in the user directories as well as the shared directories gets restored as expected.
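
Since a partial failure is expected here, any automation around the restore would need to treat that outcome as acceptable. The Python sketch below assumes the restore object is available as parsed JSON with a status.phase field (for example from kubectl get restore ... -o json) and that the phase values are Completed and PartiallyFailed. The function name and the sample dicts are illustrative, not real output.

```python
from typing import Any, Dict


def restore_succeeded(restore: Dict[str, Any], allow_partial: bool = True) -> bool:
    """Decide whether a velero restore should count as successful.

    `restore` is assumed to be the parsed JSON of a Restore resource.
    """
    phase = restore.get("status", {}).get("phase", "")
    if phase == "Completed":
        return True
    # The /home/shared symlink conflict described above yields a partial failure.
    return allow_partial and phase == "PartiallyFailed"


# Illustrative inputs only, not captured from a real cluster.
partial = {"status": {"phase": "PartiallyFailed"}}
failed = {"status": {"phase": "Failed"}}
```

A stricter caller can pass allow_partial=False once the symlink conflict is resolved.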