Lately, I have been working intensively with Istio, focusing especially on high availability of the Istio control plane.
When you install Istio with the default profile, as described in the Istio documentation, you get a control plane that is not highly available.
istioctl manifest apply \
  --set values.global.mtls.enabled=true \
  --set values.global.controlPlaneSecurityEnabled=true
By default, Istio gets installed with a PodDisruptionBudget (PDB) for every control plane component except for third-party services like Prometheus or Grafana.
All PDBs specify a minimum availability of one pod for the control plane components. Besides that, the Istio Ingress Gateway, Pilot, Policy (Mixer) and Telemetry (Mixer) have an HPA assigned for autoscaling.
That leaves the Istio components Citadel, Galley and the Sidecar Injector, with their PDBs, as blockers for specific operations in the AKS cluster. Even the HPA-covered components can be blocking when only one pod is running.
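To inspect what the default installation created, you can list the PDBs and HPAs yourself; the following commands assume Istio was installed into the default istio-system namespace.

# List the PodDisruptionBudgets of the Istio control plane components
kubectl get poddisruptionbudgets -n istio-system

# List the HPAs (Ingress Gateway, Pilot, Policy, Telemetry)
kubectl get horizontalpodautoscalers -n istio-system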
Which operations are blocked by the PDBs?
Cluster upgrades, cluster autoscaler scale-in, and automatic node reboot operations when using kured in the AKS cluster.
So, pretty much every useful operation in AKS regarding the underlying nodes is blocked.
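All of these operations cordon and drain nodes under the hood. As a sketch of what happens (the node name here is hypothetical), assume the node hosts the only Citadel pod:

# The drain gets stuck, because evicting the single Citadel pod
# would violate its PDB (minAvailable: 1 with only one pod running);
# kubectl keeps retrying the eviction and reporting the PDB violation
kubectl drain aks-nodepool1-12345678-0 --ignore-daemonsets --delete-local-data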
An easy solution is to deploy Istio without the default PDBs.
istioctl manifest apply \
  --set values.global.mtls.enabled=true \
  --set values.global.controlPlaneSecurityEnabled=true \
  --set values.global.defaultPodDisruptionBudget.enabled=false
But that weakens an already non-highly-available control plane even more.
The best solution to the blocked operations issue is a highly available Istio control plane.
Besides solving the issue, it adds more robustness to the Istio service mesh itself. The minimal required setup for an HA Istio control plane consists of two pods for each Istio component except the third-party services.
The following command installs an HA Istio control plane into an Azure Kubernetes Service cluster.
istioctl manifest apply \
  --set values.global.mtls.enabled=true \
  --set values.global.controlPlaneSecurityEnabled=true \
  --set gateways.components.ingressGateway.k8s.hpaSpec.minReplicas=2 \
  --set trafficManagement.components.pilot.k8s.hpaSpec.minReplicas=2 \
  --set policy.components.policy.k8s.hpaSpec.minReplicas=2 \
  --set telemetry.components.telemetry.k8s.hpaSpec.minReplicas=2 \
  --set configManagement.components.galley.k8s.replicaCount=2 \
  --set autoInjection.components.injector.k8s.replicaCount=2 \
  --set security.components.citadel.k8s.replicaCount=2 \
  --set values.grafana.enabled=true \
  --set values.tracing.enabled=true \
  --set values.sidecarInjectorWebhook.rewriteAppHTTPProbe=true \
  --set values.gateways.istio-ingressgateway.sds.enabled=true
Afterwards, the PDB output looks different and shows that a disruption is now allowed.
Thus, cluster upgrade, cluster autoscaler scale-in and automatic node reboot operations via kured are possible again.
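To verify this, you can check that each control plane component now runs with two replicas and that the PDBs allow a disruption:

# Each Istio control plane component should now show two ready replicas
kubectl get deployments -n istio-system

# The ALLOWED DISRUPTIONS column of the PDBs should now show 1
kubectl get poddisruptionbudgets -n istio-system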
Istio Sidecar Injector PDB issue
If you take a closer look at the PDB output, you will notice that the allowed disruptions column for the Sidecar Injector states 0 instead of 1. The reason for that is either a wrong label selector in the PDB or a wrong label in the Deployment definition for the Sidecar Injector, depending on which definition is your source of truth. My source of truth is the Deployment definition, so I took a deeper look into the PDB.
> kubectl describe poddisruptionbudgets.policy istio-sidecar-injector
Name:           istio-sidecar-injector
Namespace:      istio-system
Min available:  1
Selector:       app=sidecar-injector,istio=sidecar-injector,release=istio
Status:
    Allowed disruptions:  0
    Current:              0
    Desired:              1
    Total:                0
Events:
  Type    Reason  Age                      From               Message
  ----    ------  ----                     ----               -------
  Normal  NoPods  5m46s (x922 over 7h46m)  controllermanager  No matching pods found
As you can see, the following labels are set for the selector in the PDB: app=sidecar-injector,istio=sidecar-injector,release=istio.
> kubectl describe deployment istio-sidecar-injector
Name:       istio-sidecar-injector
Namespace:  istio-system
...
Labels:     app=sidecarInjectorWebhook
            istio=sidecar-injector
            operator.istio.io/component=Injector
            operator.istio.io/managed=Reconcile
            operator.istio.io/version=1.4.3
            release=istio
...
Selector:   istio=sidecar-injector
...
Pod Template:
  Labels:  app=sidecarInjectorWebhook
           chart=sidecarInjectorWebhook
           heritage=Tiller
           istio=sidecar-injector
           release=istio
In the Deployment definition, the labels of the pod template are app=sidecarInjectorWebhook,istio=sidecar-injector,release=istio.
Because label selectors are AND-based and not OR-based, all labels in the selector must match for a pod to be counted. The PDB selects pods with app=sidecar-injector, but the pods carry app=sidecarInjectorWebhook, so the PDB matches no pods at all.
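You can reproduce the mismatch with kubectl, since a comma-separated label selector applies all labels at once (namespace and labels are taken from the output above):

# The PDB's selector matches no pods, because no pod carries app=sidecar-injector
kubectl get pods -n istio-system -l app=sidecar-injector,istio=sidecar-injector,release=istio

# The pod template's labels do match the Sidecar Injector pods
kubectl get pods -n istio-system -l app=sidecarInjectorWebhook,istio=sidecar-injector,release=istio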
So, we need to run istioctl manifest apply again with the additional parameter --set autoInjection.components.injector.k8s.podDisruptionBudget.selector.matchLabels.app=sidecarInjectorWebhook to overwrite the default label selector app=sidecar-injector of the Sidecar Injector PDB.
istioctl manifest apply \
  --set values.global.mtls.enabled=true \
  --set values.global.controlPlaneSecurityEnabled=true \
  --set gateways.components.ingressGateway.k8s.hpaSpec.minReplicas=2 \
  --set trafficManagement.components.pilot.k8s.hpaSpec.minReplicas=2 \
  --set policy.components.policy.k8s.hpaSpec.minReplicas=2 \
  --set telemetry.components.telemetry.k8s.hpaSpec.minReplicas=2 \
  --set configManagement.components.galley.k8s.replicaCount=2 \
  --set autoInjection.components.injector.k8s.replicaCount=2 \
  --set autoInjection.components.injector.k8s.podDisruptionBudget.selector.matchLabels.app=sidecarInjectorWebhook \
  --set security.components.citadel.k8s.replicaCount=2 \
  --set values.grafana.enabled=true \
  --set values.tracing.enabled=true \
  --set values.sidecarInjectorWebhook.rewriteAppHTTPProbe=true \
  --set values.gateways.istio-ingressgateway.sds.enabled=true
After the successful apply, we now see that allowed disruptions is set to 1.
> kubectl describe poddisruptionbudgets.policy istio-sidecar-injector
Name:           istio-sidecar-injector
Namespace:      istio-system
Min available:  1
Selector:       app=sidecarInjectorWebhook,istio=sidecar-injector,release=istio
Status:
    Allowed disruptions:  1
    Current:              2
    Desired:              1
    Total:                2
Events:
  Type    Reason  Age                      From               Message
  ----    ------  ----                     ----               -------
  Normal  NoPods  4m51s (x932 over 7h50m)  controllermanager  No matching pods found
I will open an issue in the Istio GitHub repository in the next couple of days regarding this PDB label selector mismatch.
Appendix A – Istio HA
For the sake of completeness, I am referencing the following GitHub issue.
-> https://github.com/istio/istio/issues/18565
Not so long ago, Istio had issues when more than one pod of the Citadel, Galley, and Sidecar Injector components was running in the same Kubernetes cluster.
As stated in the GitHub issue, this has been solved for the mentioned Istio components.
I used Istio versions 1.4.2 and 1.4.3 while doing the HA configuration and deployment of the control plane.
Appendix B – AKS Istio how-to guide
To get started with Istio on AKS, you can check the Azure docs for the how-to guide.
-> https://docs.microsoft.com/en-us/azure/aks/servicemesh-istio-about