gVisor is one option, besides Kata Containers and Firecracker, for sandboxing containers to minimize the risk of running untrusted workloads on Kubernetes.
Currently, the only managed Kubernetes service that supports gVisor in dedicated node pools out of the box is Google Kubernetes Engine. But with a bit of effort this is also doable on Azure Kubernetes Service.
At the time of writing this article, running gVisor on AKS is not officially supported by Microsoft. That said, the setup can break with a Kubernetes version or node image upgrade. The setup described in this article was done on AKS v1.21.2 with the node image version AKSUbuntu-1804gen2containerd-2022.01.08.
Prerequisites
As this configuration is not officially supported, the first thing on our to-do list is a new node pool. The new node pool receives a label and a taint, as we want it to be exclusively available for gVisor.
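The node pool can, for example, be created with the Azure CLI along the following lines. The resource group and cluster name are taken from the cluster used later in this article, and the node count and other sizing options are up to you; the label and taint just have to match the node selector and toleration used by the daemon set and runtime class below.

> az aks nodepool add \
    --resource-group cluster-blue \
    --cluster-name cluster-blue \
    --name gvisor \
    --node-count 1 \
    --labels gvisor=enabled \
    --node-taints gvisor=enabled:NoSchedule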
Before we can start with the installation of gVisor, we need the containerd configuration from one of the nodes in the new node pool. Otherwise, we cannot integrate gVisor with its runtime runsc into AKS.
This is done via the Azure CLI, using the run command capability of the VMSS to execute a shell script on one of the nodes.
> CONTAINERD_CONFIG=$(az vmss run-command invoke -g MC_cluster-blue_cluster-blue_northeurope -n aks-gvisor-42043378-vmss --command-id RunShellScript --instance-id 3 --scripts "cat /etc/containerd/config.toml")
> echo $CONTAINERD_CONFIG | tr -d '\'
We copy the lines between [stdout] and [stderr] into a new file called config.toml. Looking at the gVisor documentation, only two lines need to be added to the config.toml after line 13.
-> https://gvisor.dev/docs/user_guide/containerd/quick_start/
version = 2
subreaper = false
oom_score = 0
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "mcr.microsoft.com/oss/kubernetes/pause:3.6"
  [plugins."io.containerd.grpc.v1.cri".containerd]
    [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
      runtime_type = "io.containerd.runtime.v1.linux"
      runtime_engine = "/usr/bin/runc"
    [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
      runtime_type = "io.containerd.runtime.v1.linux"
      runtime_engine = "/usr/bin/runc"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
      runtime_type = "io.containerd.runsc.v1"
  [plugins."io.containerd.grpc.v1.cri".registry.headers]
    X-Meta-Source-Client = ["azure/aks"]
[metrics]
  address = "0.0.0.0:10257"
The modified containerd configuration is ready to be used.
Installation
Modifying or installing something on AKS nodes, or Kubernetes nodes in general, is done via a daemon set. The daemon set itself needs a hostPath volume mount, preferably under /tmp, as well as hostPID and privileged set to true.
Furthermore, for our use case the correct toleration and node selector configuration is necessary, as we only want the daemon set to run on our dedicated gVisor node pool.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gvisor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gvisor
  template:
    metadata:
      labels:
        app: gvisor
    spec:
      hostPID: true
      restartPolicy: Always
      containers:
        - image: docker.io/neumanndaniel/gvisor:latest
          imagePullPolicy: Always
          name: gvisor
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            privileged: true
            readOnlyRootFilesystem: true
          volumeMounts:
            - name: k8s-node
              mountPath: /k8s-node
      volumes:
        - name: k8s-node
          hostPath:
            path: /tmp/gvisor
      tolerations:
        - key: gvisor
          operator: Equal
          value: "enabled"
          effect: NoSchedule
      nodeSelector:
        gvisor: enabled
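Applying the daemon set is then a standard kubectl operation; the file name below is just an assumption for illustration.

> kubectl apply -f gvisor-daemonset.yaml
> kubectl -n kube-system rollout status daemonset gvisor
> kubectl -n kube-system get pods -l app=gvisor -o wide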
The referenced container image only contains the gVisor installation script and its own run script.
Looking at the gVisor installation script, it is the same as in the documentation. Only the path where the binaries are placed has been adjusted to /usr/bin, where the other containerd binaries reside.
-> https://gvisor.dev/docs/user_guide/install/#install-latest
#!/bin/sh
(
  set -e
  ARCH=$(uname -m)
  URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
  wget ${URL}/runsc ${URL}/runsc.sha512 \
    ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
  sha512sum -c runsc.sha512 \
    -c containerd-shim-runsc-v1.sha512
  rm -f *.sha512
  chmod a+rx runsc containerd-shim-runsc-v1
  mv runsc containerd-shim-runsc-v1 /usr/bin
)
What does the run script do?
#!/bin/sh
URL="https://raw.githubusercontent.com/neumanndaniel/kubernetes/master/gvisor/config.toml"
wget ${URL} -O /k8s-node/config.toml
cp /install-gvisor.sh /k8s-node
/usr/bin/nsenter -m/proc/1/ns/mnt -- chmod u+x /tmp/gvisor/install-gvisor.sh
/usr/bin/nsenter -m/proc/1/ns/mnt /tmp/gvisor/install-gvisor.sh
/usr/bin/nsenter -m/proc/1/ns/mnt -- cp /etc/containerd/config.toml /etc/containerd/config.toml.org
/usr/bin/nsenter -m/proc/1/ns/mnt -- cp /tmp/gvisor/config.toml /etc/containerd/config.toml
/usr/bin/nsenter -m/proc/1/ns/mnt -- systemctl restart containerd
echo "[$(date +"%Y-%m-%d %H:%M:%S")] Successfully installed gvisor and restarted containerd on node ${NODE_NAME}."
sleep infinity
The run script downloads the config.toml from GitHub, as we do not want to rebuild the container image every time this file changes. In the next step the install script is copied over to the AKS node using the hostPath volume mount. Then we execute the install script via nsenter on the node, back up the original containerd configuration file, and replace it. The last step is a restart of containerd itself to apply the new configuration. As containerd is only the container runtime, running containers will not be restarted; they keep running under their shim processes. Afterwards the daemon set pod is kept running with an infinite sleep.
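Because the run script keeps a backup as config.toml.org, a node can also be reverted manually if ever needed. A minimal sketch, not part of the original setup, assuming you exec into the gVisor daemon set pod running on that node (placeholder pod name):

> kubectl exec -it <gvisor-daemon-set-pod> -- /bin/sh
> /usr/bin/nsenter -m/proc/1/ns/mnt -- cp /etc/containerd/config.toml.org /etc/containerd/config.toml
> /usr/bin/nsenter -m/proc/1/ns/mnt -- systemctl restart containerd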
The container image I am using is based on Alpine’s current version 3.15.0.
FROM alpine:3.15.0
COPY install-gvisor.sh /
COPY run.sh /
RUN chmod u+x run.sh
CMD ["./run.sh"]
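Building and pushing the image is straightforward; the image name below matches the one referenced in the daemon set, so adjust it to your own registry if you build it yourself.

> docker build -t docker.io/neumanndaniel/gvisor:latest .
> docker push docker.io/neumanndaniel/gvisor:latest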
Using gVisor
Before we can start using gVisor as a sandboxed runtime, we need to make Kubernetes aware of it. This is achieved via a runtime class.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
scheduling:
  nodeSelector:
    gvisor: "enabled"
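The runtime class is applied like any other cluster-wide resource; the file name is again just an example.

> kubectl apply -f runtimeclass.yaml
> kubectl get runtimeclass gvisor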
In the runtime class itself gVisor is referenced by its handler runsc as defined in the config.toml.
...
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
...
Our example pod template deploys an NGINX proxy onto the gVisor node pool.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  containers:
    - name: nginx
      image: nginx
  runtimeClassName: gvisor
  tolerations:
    - key: gvisor
      operator: Equal
      value: "enabled"
      effect: NoSchedule
  nodeSelector:
    gvisor: enabled
The important part is the definition of the runtime class, as otherwise Kubernetes uses runc, the default runtime. Furthermore, for the sake of completeness, we specify the toleration and the node selector.
Verify gVisor usage
After the deployment of our NGINX pod, we verify that it is really using gVisor as its runtime. There are two options to do so.
For the first option, we need the container ID, which we retrieve by running the following command:
> kubectl get pods nginx-gvisor -o json | jq '.status.containerStatuses[].containerID' -r | cut -d '/' -f3
19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f
Then we need the node name and the gVisor daemon set pod that runs on this node.
> kubectl get pods --all-namespaces -o wide | grep $(kubectl get pods nginx-gvisor -o json | jq '.spec.nodeName' -r)
calico-system   calico-node-xp722                  1/1   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
istio-system    istio-cni-node-g9fzt               2/2   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     azure-ip-masq-agent-h5w7z          1/1   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     azuredefender-publisher-ds-wx5bf   1/1   Running   0   97m   10.240.0.122   aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     csi-azuredisk-node-89vw5           3/3   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     csi-azurefile-node-pnvq6           3/3   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     gvisor-ws7f4                       1/1   Running   0   97m   10.240.0.182   aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     kube-proxy-2jctz                   1/1   Running   0   97m   10.240.0.4     aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     nginx-gvisor                       1/1   Running   0   12m   10.240.0.150   aks-gvisor-42043378-vmss000003   <none>   <none>
kube-system     omsagent-xk5g5                     2/2   Running   0   97m   10.240.0.13    aks-gvisor-42043378-vmss000003   <none>   <none>
Afterwards, we exec into the gVisor daemon set pod and search the containerd status output for the container ID.
> kubectl exec -it gvisor-ws7f4 -- /bin/sh
> /usr/bin/nsenter -m/proc/1/ns/mnt -- systemctl status containerd | grep 19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f
  ├─18404 grep 19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f
  ├─21181 runsc-gofer --root=/run/containerd/runsc/k8s.io --log=/run/containerd/io.containerd.runtime.v2.task/k8s.io/19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f/log.json --log-format=json --log-fd=3 gofer --bundle /run/containerd/io.containerd.runtime.v2.task/k8s.io/19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f --spec-fd=4 --mounts-fd=5 --io-fds=6 --io-fds=7 --io-fds=8 --io-fds=9 --io-fds=10 --io-fds=11 --apply-caps=false --setup-root=false
  └─21228 runsc --root=/run/containerd/runsc/k8s.io --log=/run/containerd/io.containerd.runtime.v2.task/k8s.io/19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f/log.json --log-format=json wait 19733ecbcd7287b511a18d94644b02a1f9788259429ea296e8b1f1ea7084a52f
Looking at the output we confirm that runsc is used.
Another approach is to exec into the NGINX proxy pod and install ping.
> kubectl exec -it nginx-gvisor -- /bin/sh
> apt update && apt install iputils-ping -y
...
Setting up iputils-ping (3:20210202-1) ...
Failed to set capabilities on file `/bin/ping' (Operation not supported)
The value of the capability argument is not permitted for a file. Or the file is not a regular (non-symlink) file
Setcap failed on /bin/ping, falling back to setuid
...
The installation succeeds, but setting the required capabilities fails, as we run in a sandbox provided by gVisor. Using the default runc runtime, we would not see this error message, as the NGINX proxy pod would not be running in a sandbox.
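To see the contrast for yourself, you can deploy the same pod once more without the runtime class, toleration, and node selector, for example like this (a comparison sketch with a hypothetical pod name, not part of the original setup), and repeat the ping installation there.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-runc
spec:
  containers:
    - name: nginx
      image: nginx
  # no runtimeClassName - the pod runs with the default runc runtime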
Summary
It takes a bit of work and ongoing maintenance to use gVisor on AKS for sandboxing containers. But it works. Even though gVisor itself is not officially supported by Microsoft, we use a supported way of doing the node configuration via a daemon set.
-> https://docs.microsoft.com/en-us/azure/aks/support-policies#shared-responsibility
The impact on a production cluster is further reduced by using a dedicated node pool for gVisor. Hence, if you need a sandbox for untrusted workloads, gVisor is a viable option for this on AKS.
As always, you can find the code examples and Kubernetes templates in my GitHub repository.
-> https://github.com/neumanndaniel/kubernetes/tree/master/gvisor