Install a highly available Istio control plane on Azure Kubernetes Service

Lately, I have been working intensively with Istio, focusing especially on the high availability of the Istio control plane.

When you install Istio with the default profile, as described in the Istio documentation, you get a control plane that is not highly available.

istioctl manifest apply \
--set values.global.mtls.enabled=true \
--set values.global.controlPlaneSecurityEnabled=true

By default, Istio is installed with a PodDisruptionBudget (PDB) for every control plane component except for third-party services like Prometheus or Grafana.

All PDBs specify a minimum availability of one pod for the control plane components. Besides that, the Istio Ingress Gateway, Pilot, Policy (Mixer), and Telemetry (Mixer) each have a HorizontalPodAutoscaler (HPA) assigned for autoscaling.
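For illustration, such a default PDB roughly takes the following shape. This YAML is a hand-written sketch, not output generated by istioctl; the component name and labels are assumptions for the example.

```yaml
# Hypothetical sketch of a default Istio control-plane PDB
# (component name and label values chosen for illustration).
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: istio-galley
  namespace: istio-system
spec:
  minAvailable: 1        # at most (current pods - 1) voluntary disruptions
  selector:
    matchLabels:
      app: galley
      release: istio
```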

That leaves the Istio components Citadel, Galley, and the Sidecar Injector, with their PDBs, as blockers for specific operations in the AKS cluster. Even the HPA-covered components can be blocking when only one pod is running.

Which operations are blocked by the PDBs?

Cluster upgrades, cluster autoscaler scale-in operations, and automatic node reboots when using kured in the AKS cluster.

So, pretty much every useful operation in AKS regarding the underlying nodes is blocked.
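The reason is simple arithmetic: the number of allowed voluntary disruptions is the current pod count minus the PDB's minimum availability. A minimal shell sketch (the replica count is assumed):

```shell
# Sketch: why a minAvailable=1 PDB with a single replica blocks eviction.
current=1        # pods currently running for the component (assumed)
min_available=1  # minAvailable from the default Istio PDB

allowed=$((current - min_available))
echo "allowed disruptions: $allowed"
# With 0 allowed disruptions, the eviction API refuses to evict the pod,
# so a node drain (upgrade, scale-in, kured reboot) hangs on that node.
```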

The solution can be as easy as deploying Istio without the default PDBs.

istioctl manifest apply \
--set values.global.mtls.enabled=true \
--set values.global.controlPlaneSecurityEnabled=true \
--set values.global.defaultPodDisruptionBudget.enabled=false

But that weakens an already non-highly-available control plane even more.

The best way to solve the blocking operations issue is a highly available Istio control plane.

Besides solving the issue, we add more robustness to the Istio service mesh itself. The minimum setup required for an HA Istio control plane is two pods for each Istio component, except the third-party services.
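With two replicas, the same arithmetic as above leaves room for one voluntary disruption, which is why node drains can proceed again. A quick sketch:

```shell
# Sketch: with two replicas per component, the default minAvailable=1 PDB
# permits one voluntary disruption, so node drains are no longer blocked.
current=2        # pods running per component in the HA setup
min_available=1  # minAvailable from the default Istio PDB

allowed=$((current - min_available))
echo "allowed disruptions: $allowed"
```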

The following command installs an HA Istio control plane into an Azure Kubernetes Service cluster.

istioctl manifest apply \
--set values.global.mtls.enabled=true \
--set values.global.controlPlaneSecurityEnabled=true \
--set gateways.components.ingressGateway.k8s.hpaSpec.minReplicas=2 \
--set trafficManagement.components.pilot.k8s.hpaSpec.minReplicas=2 \
--set policy.components.policy.k8s.hpaSpec.minReplicas=2 \
--set telemetry.components.telemetry.k8s.hpaSpec.minReplicas=2 \
--set configManagement.components.galley.k8s.replicaCount=2 \
--set autoInjection.components.injector.k8s.replicaCount=2 \
--set security.components.citadel.k8s.replicaCount=2 \
--set values.grafana.enabled=true \
--set values.tracing.enabled=true \
--set values.sidecarInjectorWebhook.rewriteAppHTTPProbe=true \
--set values.gateways.istio-ingressgateway.sds.enabled=true

Afterwards, checking the PDBs (for example with kubectl get pdb -n istio-system) shows that a disruption is now allowed for each component.

Thus, cluster upgrade, cluster autoscaler scale-in and automatic node reboot operations via kured are possible again.

Istio Sidecar Injector PDB issue

If you take a closer look at the PDB output, you will notice that the allowed disruptions column for the Sidecar Injector states 0 instead of 1. The reason is a wrong label selector in the PDB, or a wrong label in the Deployment definition for the Sidecar Injector, depending on which definition is your source of truth. My source of truth is the Deployment definition, so I took a deeper look into the PDB.

> kubectl describe poddisruptionbudgets.policy istio-sidecar-injector
Name:           istio-sidecar-injector
Namespace:      istio-system
Min available:  1
Selector:       app=sidecar-injector,istio=sidecar-injector,release=istio
Status:
    Allowed disruptions:  0
    Current:              0
    Desired:              1
    Total:                0
Events:
  Type    Reason  Age                      From               Message
  ----    ------  ----                     ----               -------
  Normal  NoPods  5m46s (x922 over 7h46m)  controllermanager  No matching pods found

As you can see, the selector in the PDB is set to the labels app=sidecar-injector,istio=sidecar-injector,release=istio.

> kubectl describe deployment istio-sidecar-injector
Name:                   istio-sidecar-injector
Namespace:              istio-system
...
Labels:                 app=sidecarInjectorWebhook
                        istio=sidecar-injector
                        operator.istio.io/component=Injector
                        operator.istio.io/managed=Reconcile
                        operator.istio.io/version=1.4.3
                        release=istio
...
Selector:               istio=sidecar-injector
...
Pod Template:
  Labels:           app=sidecarInjectorWebhook
                    chart=sidecarInjectorWebhook
                    heritage=Tiller
                    istio=sidecar-injector
                    release=istio

In the Deployment definition the labels of the pod template are app=sidecarInjectorWebhook,istio=sidecar-injector,release=istio.

Because label selectors are AND-based rather than OR-based, every label in the selector must match a label on the pod for the pod to be counted. Since no pod carries app=sidecar-injector, the PDB matches zero pods.
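To illustrate the AND semantics, here is a small local sketch. The matches function is hypothetical helper logic, not a kubectl feature; the pod labels are taken from the Deployment output above.

```shell
# Pod template labels of the Sidecar Injector Deployment (from above).
pod_labels="app=sidecarInjectorWebhook,chart=sidecarInjectorWebhook,heritage=Tiller,istio=sidecar-injector,release=istio"

# Return 0 only if every term of the selector appears in the pod labels
# (AND semantics, mirroring how Kubernetes label selectors behave).
matches() {
  selector=$1
  for term in $(echo "$selector" | tr ',' ' '); do
    case ",$pod_labels," in
      *",$term,"*) ;;   # this term matches, check the next one
      *) return 1 ;;    # one missing term fails the whole selector
    esac
  done
  return 0
}

# The PDB's default selector: app=sidecar-injector does not match the pod.
matches "app=sidecar-injector,istio=sidecar-injector,release=istio" \
  && echo "PDB selector matches" || echo "PDB selector does not match"

# The corrected selector with app=sidecarInjectorWebhook matches.
matches "app=sidecarInjectorWebhook,istio=sidecar-injector,release=istio" \
  && echo "fixed selector matches" || echo "fixed selector does not match"
```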

So, we need to run istioctl manifest apply again with the additional parameter --set autoInjection.components.injector.k8s.podDisruptionBudget.selector.matchLabels.app=sidecarInjectorWebhook to overwrite the default label selector app=sidecar-injector of the Sidecar Injector PDB.

istioctl manifest apply \
--set values.global.mtls.enabled=true \
--set values.global.controlPlaneSecurityEnabled=true \
--set gateways.components.ingressGateway.k8s.hpaSpec.minReplicas=2 \
--set trafficManagement.components.pilot.k8s.hpaSpec.minReplicas=2 \
--set policy.components.policy.k8s.hpaSpec.minReplicas=2 \
--set telemetry.components.telemetry.k8s.hpaSpec.minReplicas=2 \
--set configManagement.components.galley.k8s.replicaCount=2 \
--set autoInjection.components.injector.k8s.replicaCount=2 \
--set autoInjection.components.injector.k8s.podDisruptionBudget.selector.matchLabels.app=sidecarInjectorWebhook \
--set security.components.citadel.k8s.replicaCount=2 \
--set values.grafana.enabled=true \
--set values.tracing.enabled=true \
--set values.sidecarInjectorWebhook.rewriteAppHTTPProbe=true \
--set values.gateways.istio-ingressgateway.sds.enabled=true

After the apply has succeeded, we can see that allowed disruptions is now set to 1.

> kubectl describe poddisruptionbudgets.policy istio-sidecar-injector
Name:           istio-sidecar-injector
Namespace:      istio-system
Min available:  1
Selector:       app=sidecarInjectorWebhook,istio=sidecar-injector,release=istio
Status:
    Allowed disruptions:  1
    Current:              2
    Desired:              1
    Total:                2
Events:
  Type    Reason  Age                      From               Message
  ----    ------  ----                     ----               -------
  Normal  NoPods  4m51s (x932 over 7h50m)  controllermanager  No matching pods found

In the next couple of days, I will open an issue in the Istio GitHub repository regarding the label selector mismatch described above.

Appendix A – Istio HA

For the sake of completeness, I am referencing the following GitHub issue.

-> https://github.com/istio/istio/issues/18565

Not so long ago, Istio had issues when more than one pod of the components Citadel, Galley, and the Sidecar Injector was running in the same Kubernetes cluster.

As stated in the GitHub issue this has been solved for the mentioned Istio components.

I used Istio in version 1.4.2 and 1.4.3 while doing the HA configuration and deployment of the control plane.

Appendix B – AKS Istio how-to guide

To get started with Istio on AKS, you can check the how-to guide in the Azure docs.

-> https://docs.microsoft.com/en-us/azure/aks/servicemesh-istio-about
