gVisor is one option, besides Kata Containers and Firecracker, for sandboxing containers to minimize the risk of running untrusted workloads on Kubernetes.
Currently, the only managed Kubernetes service that supports gVisor in dedicated node pools out of the box is Google Kubernetes Engine. But with a bit of effort this is also doable on Azure Kubernetes Service.
At the time of writing this article, running gVisor on AKS is not officially supported by Microsoft. That said, the setup can break with a Kubernetes version or node image upgrade. The setup described in this article was done on AKS v1.21.2 with the node image version AKSUbuntu-1804gen2containerd-2022.01.08.
Prerequisites
As this configuration is not officially supported, the first thing on our to-do list is a new node pool. The new node pool receives a label and a taint, as we want it to be exclusively available for gVisor.
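The node pool can, for example, be created with the Azure CLI along the following lines. The resource group and cluster name are taken from the cluster used later in this article, and the node count and other sizing options are up to you; the label and taint just have to match the node selector and toleration used by the daemon set and runtime class below.

> az aks nodepool add \
    --resource-group cluster-blue \
    --cluster-name cluster-blue \
    --name gvisor \
    --node-count 1 \
    --labels gvisor=enabled \
    --node-taints gvisor=enabled:NoSchedule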
Before we can start with the installation of gVisor, we need the containerd configuration from one of the nodes in the new node pool. Otherwise, we cannot integrate gVisor with its runtime runsc into AKS.
This is done via the Azure CLI, using the run command capability of the VMSS to execute a shell script on one of the nodes.
> CONTAINERD_CONFIG=$(az vmss run-command invoke -g MC_cluster-blue_cluster-blue_northeurope -n aks-gvisor-42043378-vmss --command-id RunShellScript --instance-id 3 --scripts "cat /etc/containerd/config.toml")
> echo $CONTAINERD_CONFIG | tr -d '\'
We copy the lines between [stdout] and [stderr] into a new file called config.toml. Looking at the gVisor documentation, only two lines need to be added to the config.toml after line 13.
-> https://gvisor.dev/docs/user_guide/containerd/quick_start/
version = 2
subreaper = false
oom_score = 0
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "mcr.microsoft.com/oss/kubernetes/pause:3.6"
  [plugins."io.containerd.grpc.v1.cri".containerd]
    [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
      runtime_type = "io.containerd.runtime.v1.linux"
      runtime_engine = "/usr/bin/runc"
    [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
      runtime_type = "io.containerd.runtime.v1.linux"
      runtime_engine = "/usr/bin/runc"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
      runtime_type = "io.containerd.runsc.v1"
  [plugins."io.containerd.grpc.v1.cri".registry.headers]
    X-Meta-Source-Client = ["azure/aks"]
[metrics]
  address = "0.0.0.0:10257"
The modified containerd configuration is ready to be used.
Installation
Modifying or installing something on AKS nodes, or Kubernetes nodes in general, is done via a daemon set. The daemon set itself needs a hostPath volume mount, preferably under /tmp, as well as hostPID and privileged set to true.
Furthermore, for our use case the correct toleration and node selector configuration is necessary, as we only want the daemon set to run on our dedicated gVisor node pool.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gvisor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gvisor
  template:
    metadata:
      labels:
        app: gvisor
    spec:
      hostPID: true
      restartPolicy: Always
      containers:
        - image: docker.io/neumanndaniel/gvisor:latest
          imagePullPolicy: Always
          name: gvisor
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            privileged: true
            readOnlyRootFilesystem: true
          volumeMounts:
            - name: k8s-node
              mountPath: /k8s-node
      volumes:
        - name: k8s-node
          hostPath:
            path: /tmp/gvisor
      tolerations:
        - key: gvisor
          operator: Equal
          value: "enabled"
          effect: NoSchedule
      nodeSelector:
        gvisor: enabled
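Applying the daemon set is then a standard kubectl operation; the file name below is just an assumption for illustration.

> kubectl apply -f gvisor-daemonset.yaml
> kubectl -n kube-system rollout status daemonset gvisor
> kubectl -n kube-system get pods -l app=gvisor -o wide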
The referenced container image only contains the gVisor installation script and its own run script.
Looking at the gVisor installation script, it is the same as in the documentation. Only the path where the binaries are placed has been adjusted to /usr/bin, where the other containerd binaries reside.
-> https://gvisor.dev/docs/user_guide/install/#install-latest
#!/bin/sh
(
  set -e
  ARCH=$(uname -m)
  URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
  wget ${URL}/runsc ${URL}/runsc.sha512 \
    ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
  sha512sum -c runsc.sha512 \
    -c containerd-shim-runsc-v1.sha512
  rm -f *.sha512
  chmod a+rx runsc containerd-shim-runsc-v1
  mv runsc containerd-shim-runsc-v1 /usr/bin
)
What does the run script do?
#!/bin/sh
URL="https://raw.githubusercontent.com/neumanndaniel/kubernetes/master/gvisor/config.toml"
wget ${URL} -O /k8s-node/config.toml
cp /install-gvisor.sh /k8s-node
/usr/bin/nsenter -m/proc/1/ns/mnt -- chmod u+x /tmp/gvisor/install-gvisor.sh
/usr/bin/nsenter -m/proc/1/ns/mnt /tmp/gvisor/install-gvisor.sh
/usr/bin/nsenter -m/proc/1/ns/mnt -- cp /etc/containerd/config.toml /etc/containerd/config.toml.org
/usr/bin/nsenter -m/proc/1/ns/mnt -- cp /tmp/gvisor/config.toml /etc/containerd/config.toml
/usr/bin/nsenter -m/proc/1/ns/mnt -- systemctl restart containerd
echo "[$(date +"%Y-%m-%d %H:%M:%S")] Successfully installed gvisor and restarted containerd on node ${NODE_NAME}."
sleep infinity
The run script downloads the config.toml from GitHub, as we do not want to rebuild the container image every time this file changes. In the next step the install script is copied over to the AKS node using the hostPath volume mount. Then we execute the install script via nsenter on the node, back up the original containerd configuration file, and replace it. The last step is a restart of containerd itself to apply the new configuration. As containerd is only the container runtime, running containers will not be restarted; they keep running under their shim processes. Afterwards the daemon set pod is kept running with an infinite sleep.
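Because the run script keeps a backup as config.toml.org, a node can also be reverted manually if ever needed. A minimal sketch, not part of the original setup, assuming you exec into the gVisor daemon set pod running on that node (placeholder pod name):

> kubectl exec -it <gvisor-daemon-set-pod> -- /bin/sh
> /usr/bin/nsenter -m/proc/1/ns/mnt -- cp /etc/containerd/config.toml.org /etc/containerd/config.toml
> /usr/bin/nsenter -m/proc/1/ns/mnt -- systemctl restart containerd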
The container image I am using is based on Alpine’s current version 3.15.0.
FROM alpine:3.15.0
COPY install-gvisor.sh /
COPY run.sh /
RUN chmod u+x run.sh
CMD ["./run.sh"]
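Building and pushing the image is straightforward; the image name below matches the one referenced in the daemon set, so adjust it to your own registry if you build it yourself.

> docker build -t docker.io/neumanndaniel/gvisor:latest .
> docker push docker.io/neumanndaniel/gvisor:latest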
Using gVisor
Before we can start using gVisor as a sandboxed runtime, we need to make Kubernetes aware of it. This is achieved via a runtime class.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
scheduling:
  nodeSelector:
    gvisor: "enabled"
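The runtime class is applied like any other cluster-wide resource; the file name is again just an example.

> kubectl apply -f runtimeclass.yaml
> kubectl get runtimeclass gvisor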
In the runtime class itself gVisor is referenced by its handler runsc as defined in the config.toml.
...
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
...
Our example pod template deploys an NGINX proxy onto the gVisor node pool.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  containers:
    - name: nginx
      image: nginx
  runtimeClassName: gvisor
  tolerations:
    - key: gvisor
      operator: Equal
      value: "enabled"
      effect: NoSchedule
  nodeSelector:
    gvisor: enabled
The important part is the definition of the runtime class, as otherwise Kubernetes uses runc, the default runtime. Furthermore, for the sake of completeness, we specify the toleration and the node selector.
Verify gVisor usage
After the deployment of our NGINX pod, we verify that it is really using gVisor as its runtime. There are two options to do so.
For the first option, we need the container ID, which we retrieve by running the following command:
> kubectl get pods nginx-gvisor -o json | jq '.status.containerStatuses[].containerID' -r | cut -d '/' -f3
19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f
Then we need the node name and the gVisor daemon set pod that runs on this node.
> kubectl get pods --all-namespaces -o wide | grep $(kubectl get pods nginx-gvisor -o json | jq '.spec.nodeName' -r)
calico-system   calico-node-xp722                  1/1   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
istio-system    istio-cni-node-g9fzt               2/2   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     azure-ip-masq-agent-h5w7z          1/1   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     azuredefender-publisher-ds-wx5bf   1/1   Running   0   97m   10.240.0.122   aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     csi-azuredisk-node-89vw5           3/3   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     csi-azurefile-node-pnvq6           3/3   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     gvisor-ws7f4                       1/1   Running   0   97m   10.240.0.182   aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     kube-proxy-2jctz                   1/1   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     nginx-gvisor                       1/1   Running   0   12m   10.240.0.150   aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     omsagent-xk5g5                     2/2   Running   0   97m   10.240.0.13    aks-gvisor-42043378-vmss000003   <none>   <none>
Afterwards, we exec into the gVisor daemon set pod and search the containerd status output for the container ID.
> kubectl exec -it gvisor-ws7f4 -- /bin/sh
> /usr/bin/nsenter -m/proc/1/ns/mnt -- systemctl status containerd | grep 19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f
  ├─18404 grep 19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f
  ├─21181 runsc-gofer --root=/run/containerd/runsc/k8s.io --log=/run/containerd/io.containerd.runtime.v2.task/k8s.io/19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f/log.json --log-format=json --log-fd=3 gofer --bundle /run/containerd/io.containerd.runtime.v2.task/k8s.io/19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f --spec-fd=4 --mounts-fd=5 --io-fds=6 --io-fds=7 --io-fds=8 --io-fds=9 --io-fds=10 --io-fds=11 --apply-caps=false --setup-root=false
  └─21228 runsc --root=/run/containerd/runsc/k8s.io --log=/run/containerd/io.containerd.runtime.v2.task/k8s.io/19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f/log.json --log-format=json wait 19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f
Looking at the output we confirm that runsc is used.
Another approach is to exec into the NGINX proxy pod and install ping.
> kubectl exec -it nginx-gvisor -- /bin/sh
> apt update && apt install iputils-ping -y
...
Setting up iputils-ping (3:20210202-1) ...
Failed to set capabilities on file `/bin/ping' (Operation not supported)
The value of the capability argument is not permitted for a file. Or the file is not a regular (non-symlink) file
Setcap failed on /bin/ping, falling back to setuid
...
The installation succeeds, but setting the required capabilities fails, as we run in a sandbox provided by gVisor. Using the default runc runtime, we would not see this error message, as the NGINX proxy pod would not be running in a sandbox.
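To see the contrast for yourself, you can deploy the same pod once more without the runtime class, toleration, and node selector, for example like this (a comparison sketch with a hypothetical pod name, not part of the original setup), and repeat the ping installation there.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-runc
spec:
  containers:
    - name: nginx
      image: nginx
  # no runtimeClassName - the pod runs with the default runc runtime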
Summary
It takes a bit of work and ongoing maintenance to use gVisor on AKS for sandboxing containers. But it works. Even though gVisor itself is not officially supported by Microsoft, we use a supported way of doing the node configuration via a daemon set.
-> https://docs.microsoft.com/en-us/azure/aks/support-policies#shared-responsibility
The impact on a production cluster is further reduced by using a dedicated node pool for gVisor. Hence, if you need a sandbox for untrusted workloads, gVisor is a viable option for this on AKS.
As always, you can find the code examples and Kubernetes templates in my GitHub repository.
-> https://github.com/neumanndaniel/kubernetes/tree/master/gvisor