Cloud Native Chaos Engineering with Chaos Mesh
As systems move to the cloud, distributed architectures have grown more complex, and with that complexity comes uncertainty about how a system might fail.
Chaos Engineering tests system resiliency by injecting faults to expose weaknesses before they cause massive outages. Typical weaknesses include improper fallback settings for a service, cascading failures caused by a single point of failure, and retry storms caused by misconfigured timeouts.
History
Chaos Engineering started at Netflix in 2010, when the company moved from on-prem servers to AWS and needed a way to test the resiliency of its infrastructure.
In 2012, Netflix open-sourced Chaos Monkey under the Apache 2.0 license. It randomly terminates instances to ensure that services are resilient to instance failures.
Cloud Native Chaos Engineering in CNCF Landscape
CNCF focuses on Cloud Native Chaos Engineering, defined as engineering practices focused on (and built on) Kubernetes environments, applications, microservices, and infrastructure.
Cloud Native Chaos Engineering has four core principles:
- Open source
- CRDs for Chaos Management
- Extensible and pluggable
- Broad Community adoption
CNCF has two sandbox projects for Cloud Native Chaos Engineering: Chaos Mesh and Litmus.
Chaos Mesh
Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos in Kubernetes environments. It is based on the Kubernetes Operator pattern and provides a Chaos Operator that injects faults into applications and Kubernetes infrastructure in a manageable way.
The Chaos Operator uses Custom Resource Definitions (CRDs) to define chaos objects. It provides a variety of CRDs for fault injection, such as:
- PodChaos
- NetworkChaos
- DNSChaos
- HTTPChaos
- StressChaos
- IOChaos
- TimeChaos
- KernelChaos
- AWSChaos
- GCPChaos
- JVMChaos
Chaos Mesh Installation
Chaos Mesh can be installed quickly using the installation script. However, it’s recommended to use the Helm 3 chart in production environments.
To install Chaos Mesh using Helm:
- Add the Chaos Mesh repository to Helm:
helm repo add chaos-mesh https://charts.chaos-mesh.org
- It’s recommended to install Chaos Mesh in a separate namespace. You can either create the namespace chaos-testing manually or let Helm create it automatically if it doesn’t exist:
helm upgrade \
  --install \
  chaos-mesh \
  chaos-mesh/chaos-mesh \
  -n chaos-testing \
  --create-namespace \
  --version v2.0.0 \
  --wait
Note: If you’re using GKE or EKS with containerd, then use:
helm upgrade \
  --install \
  chaos-mesh \
  chaos-mesh/chaos-mesh \
  -n chaos-testing \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --version v2.0.0 \
  --wait
- Verify that the pods are running:
kubectl get pods -n chaos-testing
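The containerd note and the verification step above can be scripted. The following is a minimal sketch, not part of Chaos Mesh: the function names runtime_flags and all_pods_running are hypothetical, and it assumes the node reports its runtime in the containerd://<version> form that Kubernetes exposes in .status.nodeInfo.containerRuntimeVersion.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Chaos Mesh.

# Print the extra Helm flags for a node runtime string such as
# "containerd://1.4.4" (the value of .status.nodeInfo.containerRuntimeVersion).
# Prints nothing for any other runtime.
runtime_flags() {
  case "$1" in
    containerd://*)
      printf '%s %s\n' \
        '--set chaosDaemon.runtime=containerd' \
        '--set chaosDaemon.socketPath=/run/containerd/containerd.sock'
      ;;
  esac
}

# Read `kubectl get pods --no-headers` output on stdin and succeed only if
# at least one pod is listed and every pod's STATUS column (field 3) is "Running".
all_pods_running() {
  awk 'NF { n++; if ($3 != "Running") bad = 1 } END { exit (n == 0 || bad) }'
}
```

For example, runtime_flags "$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}')" prints the flags to append to the helm upgrade command, and kubectl get pods -n chaos-testing --no-headers | all_pods_running exits non-zero while any pod is still starting.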
Run First Chaos Mesh Experiment
A Chaos Experiment describes what type of fault is injected and how.
- Set up an Nginx pod and expose it on port 80:
kubectl run nginx --image=nginx --labels="app=nginx" --port=80
- Get the IP of the nginx pod:
kubectl get pods nginx -ojsonpath="{.status.podIP}"
- Open another terminal and set up a test pod to check connectivity to the nginx pod:
kubectl run -it test-connection --image=radial/busyboxplus:curl -- sh
Then, from the shell inside the test pod, ping the nginx pod:
ping <IP of the Nginx Pod> -c 2
This shows the time it takes to ping the IP.
- Create your first Chaos Experiment by running:
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: nginx-network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'nginx'
  delay:
    latency: '1s'
  duration: '60s'
EOF
This creates a resource of type NetworkChaos that introduces a latency of 1 second in the network of the pods with the label app:nginx, i.e. the nginx pod, for the next 60 seconds.
- Ping the nginx pod again to see the delay of 1 second.
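To try other latencies or durations without editing the YAML by hand, the same manifest can be generated from parameters. This is only a sketch: render_network_delay is a hypothetical helper name, and it assumes the target pods live in the default namespace and carry an app label, as in the experiment above.

```shell
#!/bin/sh
# Hypothetical helper: print a NetworkChaos delay manifest on stdout.
# Usage: render_network_delay <app-label> <latency> <duration>
# Assumes the target pods are in the "default" namespace.
render_network_delay() {
  app="$1"; latency="$2"; duration="$3"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: ${app}-network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': '${app}'
  delay:
    latency: '${latency}'
  duration: '${duration}'
EOF
}
```

Running render_network_delay nginx 1s 60s | kubectl apply -f - applies the same experiment as above. Printing to stdout first also makes it easy to review the manifest before it reaches the cluster.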
Run HTTPChaos Experiment
HTTPChaos allows you to inject faults into the requests and responses of an HTTP server. It supports the abort, delay, replace, and patch fault types.
Note: Before proceeding, delete the NetworkChaos experiment created earlier.
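One way to do that cleanup is sketched below. The helper name cleanup_network_delay is hypothetical, and the sketch assumes the experiment was applied in the default namespace under the name nginx-network-delay used earlier.

```shell
#!/bin/sh
# Hypothetical helper: delete the NetworkChaos object created in the
# previous section. Assumes it lives in the "default" namespace.
# --ignore-not-found keeps the command from failing if the object
# no longer exists.
cleanup_network_delay() {
  kubectl delete networkchaos nginx-network-delay -n default --ignore-not-found
}
```

Call cleanup_network_delay once before creating the HTTPChaos experiment so that the two faults do not overlap.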
- Check the response time of the nginx pod from the shell of the test pod:
kubectl exec -it test-connection -- sh
time curl <IP of the Nginx Pod>
- Create the HTTPChaos experiment by running:
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: nginx-http-delay
spec:
  mode: all
  selector:
    labelSelectors:
      app: nginx
  target: Request
  port: 80
  delay: 1s
  method: GET
  path: /
  duration: 5m
EOF
This creates a resource of type HTTPChaos that introduces a latency of 1 second to the requests sent to the pods with the label app:nginx, i.e. the nginx pod, on port 80 for the next 5 minutes.
Note: If you get an error like admission webhook "vauth.kb.io" denied the request, as of version 2.0 there is an open issue 2187, and a temporary fix is to delete the validating webhook:
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validate-auth
- Test the response time of the nginx pod again:
time curl <IP of the Nginx Pod>
You will see the additional 1 second of latency in the response.
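Because time curl also prints the page body, comparing the numbers before and after the experiment can be awkward. As a sketch, the hypothetical helpers below use curl's -w '%{time_total}' write-out variable to capture only the total request time, and awk to check that the difference is at least the injected delay. It assumes curl and awk are both available in the test pod.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative.

# Print only the total request time, in seconds, for the given URL.
# The response body is discarded.
request_seconds() {
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# Succeed if the second timing exceeds the first by at least the
# expected delay. All three arguments are in seconds.
delay_observed() {
  awk -v before="$1" -v after="$2" -v delay="$3" \
    'BEGIN { exit !((after - before) >= delay) }'
}
```

Record request_seconds http://<IP of the Nginx Pod> once before applying the HTTPChaos experiment and once after, then pass both values and the expected delay of 1 to delay_observed.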