Example view of what happens when the IO throttle on IaaS VMs/Disks kicks on the worker nodes in an AKS cluster. Additional data/logs will be in full writeup.
As you can see - an idle 5 node cluster where I execute `helm install prometheus-operation` executes fine, however when the pods/containers come online the IO latency spikes to above 100ms in some cases. This causes a cascading failure - PLEG runtime failures in the kubelet and IPC errors in the docker runtime. This triggers orphaned containers and additional IO load due to repeat boot of the containers due to the failed runtime.
This also slows kernel calls and other operations at the SDN (CNI) level because the container operations for the CNI containers are blocked in IOWait.
Download
0 formats
No download links available.
Container runtime failure due to IO latency | NatokHD