Using dysk in Azure Kubernetes Service – AKS cluster upgrade and the learnings from it

In one of my previous blog posts, I explained how to use dysk in Azure Kubernetes Service as a persistent storage option.

-> https://www.danielstechblog.io/using-dysk-in-azure-kubernetes-service-as-persistent-storage-option/

Today we look at how a Kubernetes version upgrade of an AKS cluster affects dysk in operation, and why it may force you to rethink your chosen Azure VM and OS disk SKU.

First, let us look at how dysk operates during a Kubernetes version upgrade of an AKS cluster, before turning to the learnings and conclusions.

Unsurprisingly, dysk keeps operating successfully during an AKS cluster upgrade, as seen in the following screenshots.

Before:

After:

As you may also have noticed, important dysk components such as csi-dysk-attacher-0 take a long time to spin up. That delays our application container, which cannot start until the volume is attached.

The reason is the size of the AKS cluster: only two agent nodes, running the Standard_D2s_v3 VM SKU. In addition, the performance of the P4 disks (120 IOPS, 25 MB/s throughput) used as OS disks is the real pain point here and slows down the spin-up of the containers during the upgrade process.

The learnings are as follows. Use at least three agent nodes in your AKS cluster to get better load distribution during the upgrade process. If you want to run only two agent nodes, change the VM SKU at creation time from a VM size that supports premium storage to one that supports only standard storage. The alternative is to change the OS disk SKU from P4 to P10 for more performance, but then you pay more for the disks.
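As a rough illustration, the three options could be expressed as Azure CLI calls. The sketch below only builds the argument lists for `az aks create` and does not run anything. The resource group and cluster names are placeholders, and the helper function is hypothetical. It also assumes that the OS disk is a premium managed disk, where a size of 128 GB corresponds to the P10 tier.

```python
from typing import List, Optional


def build_aks_create_args(
    resource_group: str,
    cluster_name: str,
    node_count: int = 3,
    node_vm_size: str = "Standard_D2s_v3",
    node_osdisk_size_gb: Optional[int] = None,
) -> List[str]:
    """Return the argument list for an `az aks create` call (hypothetical helper)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    args = [
        "az", "aks", "create",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--node-count", str(node_count),
        "--node-vm-size", node_vm_size,
    ]
    if node_osdisk_size_gb is not None:
        # Assumption: on a premium managed disk the tier follows the size,
        # so 128 GB means P10 instead of P4.
        args += ["--node-osdisk-size", str(node_osdisk_size_gb)]
    return args


# Option 1: at least three agent nodes, VM SKU unchanged.
three_nodes = build_aks_create_args("my-rg", "my-aks", node_count=3)

# Option 2: two agent nodes on a VM size that supports only standard storage.
two_nodes_standard = build_aks_create_args(
    "my-rg", "my-aks", node_count=2, node_vm_size="Standard_D2_v3"
)

# Option 3: two agent nodes, premium storage VM size, larger OS disk.
two_nodes_p10 = build_aks_create_args(
    "my-rg", "my-aks", node_count=2, node_osdisk_size_gb=128
)

for option in (three_nodes, two_nodes_standard, two_nodes_p10):
    print(" ".join(option))
```

To create the cluster, pass one of the lists to `subprocess.run(args, check=True)` on a machine where the Azure CLI is installed and logged in.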

Have a look at the following table to understand the proposed changes above.

VM SKU	OS disk SKU	OS disk size	IOPS	Throughput
Standard_D2s_v3	P4	32 GB	120	25 MB/s
Standard_D2s_v3	P10	128 GB	500	100 MB/s
Standard_D2_v3	S4	32 GB	500	60 MB/s
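To get a feeling for what these limits mean in practice, here is a small back-of-the-envelope sketch. It estimates a lower bound for the time a node needs to read a given amount of data from its OS disk, limited by either IOPS or throughput. The workload (500 MB read in 64 KiB requests) is a made-up example and not a measurement from the cluster. The model ignores caching, bursting, and the network, and it treats MB and MiB as the same unit.

```python
def min_read_seconds(
    size_mb: float,
    iops: int,
    throughput_mb_s: float,
    io_size_kib: int = 64,
) -> float:
    """Lower bound for reading size_mb from a disk with the given limits."""
    requests = size_mb * 1024 / io_size_kib      # number of I/O requests
    iops_bound = requests / iops                 # time if IOPS is the limit
    throughput_bound = size_mb / throughput_mb_s  # time if throughput is the limit
    # The slower of the two limits determines the best possible time.
    return max(iops_bound, throughput_bound)


# Hypothetical workload: 500 MB of container image data.
workload_mb = 500

# Limits taken from the first and the last row of the table above.
d2s_v3_p4 = min_read_seconds(workload_mb, iops=120, throughput_mb_s=25)
d2_v3_standard = min_read_seconds(workload_mb, iops=500, throughput_mb_s=60)

print(f"Standard_D2s_v3 with P4 OS disk: at least {d2s_v3_p4:.1f} s")
print(f"Standard_D2_v3 with standard OS disk: at least {d2_v3_standard:.1f} s")
```

Under these assumptions IOPS is the limiting factor in both cases, which is why the jump from 120 to 500 IOPS matters more than the throughput figure.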

So, dysk works well during upgrade or patch processes in your AKS cluster, provided the cluster is sized and designed properly. Otherwise, expect a serious performance impact.

-> https://docs.microsoft.com/en-us/azure/aks/concepts-security
-> https://docs.microsoft.com/en-us/azure/aks/upgrade-cluster
-> https://docs.microsoft.com/en-us/azure/aks/faq#are-security-updates-applied-to-aks-agent-nodes