Shardul is a seasoned Engineering Leader and Cloud expert focused on Data and Infrastructure

Cloud Native Chaos Engineering with Chaos Mesh

With the move to the cloud, distributed architectures have grown even more complex, and with that complexity comes uncertainty about how a system could fail.

Chaos Engineering aims to test system resiliency by injecting faults, so that weaknesses are identified before they cause massive outages. Typical weaknesses include improper fallback settings for a service, cascading failures due to a single point of failure, and retry storms due to misconfigured timeouts.

History

Chaos Engineering started at Netflix back in 2010, when Netflix moved from on-prem servers to AWS and needed a way to test the resiliency of its new infrastructure.

In 2012, Netflix open-sourced Chaos Monkey under the Apache 2.0 license. Chaos Monkey randomly terminates instances to ensure that services are resilient to instance failures.

Cloud Native Chaos Engineering in CNCF Landscape

CNCF focuses on Cloud Native Chaos Engineering, defined as engineering practices focused on (and built on) Kubernetes environments, applications, microservices, and infrastructure.

Cloud Native Chaos Engineering has 4 core principles:

Open source
CRDs for Chaos Management
Extensible and pluggable
Broad Community adoption

CNCF has two sandbox projects for Cloud Native Chaos Engineering: Chaos Mesh and Litmus.

Chaos Mesh

Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments. It is based on the Kubernetes Operator pattern and provides a Chaos Operator to inject faults into applications and Kubernetes infrastructure in a manageable way.

The Chaos Operator uses Custom Resource Definitions (CRDs) to define chaos objects. It provides a variety of these CRDs for fault injection, such as NetworkChaos and HTTPChaos, both of which are used in the experiments later in this post.

Chaos Mesh Installation

Chaos Mesh can be installed quickly using the installation script. However, it’s recommended to use the Helm 3 chart in production environments.

To install Chaos Mesh using Helm:

Add the Chaos Mesh repository to your Helm repositories.

helm repo add chaos-mesh https://charts.chaos-mesh.org

It’s recommended to install Chaos Mesh in a separate namespace. You can either create the chaos-testing namespace manually or let Helm create it automatically if it doesn’t exist:

helm upgrade \
     --install \
     chaos-mesh \
     chaos-mesh/chaos-mesh \
     -n chaos-testing \
     --create-namespace \
     --version v2.0.0 \
     --wait

Note: If you’re using GKE or EKS with containerd, use the following instead:

helm upgrade \
     --install \
     chaos-mesh \
     chaos-mesh/chaos-mesh \
     -n chaos-testing \
     --create-namespace \
     --set chaosDaemon.runtime=containerd \
     --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
     --version v2.0.0 \
     --wait
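
Which of the two commands applies depends on the container runtime the nodes report. As a rough sketch, the runtime string can be mapped to the extra flags from the note above. The function name `chaos_daemon_flags` is made up for illustration, and only the containerd case covered in this post is handled; any other runtime falls through to no extra flags.

```shell
# Map a node's containerRuntimeVersion string (e.g. "containerd://1.4.6")
# to the extra Helm flags from the note above.
# chaos_daemon_flags is a hypothetical helper, not part of Chaos Mesh or Helm.
chaos_daemon_flags() {
  case "$1" in
    containerd://*)
      printf '%s\n' \
        "--set chaosDaemon.runtime=containerd" \
        "--set chaosDaemon.socketPath=/run/containerd/containerd.sock"
      ;;
    *)
      # Any other runtime: print nothing, i.e. use the plain install command.
      :
      ;;
  esac
}
```

The runtime string itself can be read from the node status, for example with `kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}'`, and passed to `chaos_daemon_flags` as its only argument.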

Verify that the pods are running:
```
kubectl get pods -n chaos-testing
```
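
Since chaos objects are defined through CRDs, it can also be worth confirming that the Chaos Mesh CRDs were registered. A minimal sketch, assuming `kubectl` points at the cluster where the chart was installed; `list_chaos_crds` is a name chosen here for illustration.

```shell
# List the CRDs registered under the chaos-mesh.org API group.
# list_chaos_crds is an illustrative helper name, not a Chaos Mesh command.
list_chaos_crds() {
  # "-o name" prints one resource per line, e.g.
  # customresourcedefinition.apiextensions.k8s.io/networkchaos.chaos-mesh.org
  kubectl get crds -o name | grep 'chaos-mesh\.org$'
}
```

Calling `list_chaos_crds` should print one line per chaos type; an empty result suggests the chart did not install its CRDs.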

Run First Chaos Mesh Experiment

A Chaos Experiment describes what type of fault is injected and how.

Set up an nginx pod and expose it on port 80.

kubectl run nginx --image=nginx --labels="app=nginx" --port=80

Get the IP of the nginx pod:

kubectl get pods nginx -ojsonpath="{.status.podIP}"

Open another terminal and set up a test pod to check connectivity to the nginx pod:
```
kubectl run -it test-connection --image=radial/busyboxplus:curl -- sh
ping <IP of the Nginx Pod> -c 2
```
This should show you the time it takes to ping the IP.

Create your first Chaos Experiment by running:

kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: nginx-network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'nginx'
  delay:
    latency: '1s'
  duration: '60s'
EOF

This creates a NetworkChaos resource (an instance of the NetworkChaos CRD) that adds 1 second of latency to the network of one pod matching the label app: nginx in the default namespace, i.e. the nginx pod, for the next 60 seconds.

Ping the nginx pod again from the test pod; each reply should now take about 1 second longer.
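
To compare the numbers before and during the experiment, the average round-trip time can be pulled out of the ping summary line. This is only a sketch: `avg_rtt_ms` is a hypothetical helper, and it assumes the summary contains a "min/avg/max" line (as printed by BusyBox and iputils ping) and that `awk` is available where it runs.

```shell
# Extract the average round-trip time (in ms) from ping's summary line.
# Reads ping output on stdin and looks for the "min/avg/max" line.
# avg_rtt_ms is a hypothetical helper name.
avg_rtt_ms() {
  awk -F'=' '/min\/avg\/max/ { split($2, parts, "/"); print parts[2] }'
}
```

For example, `ping <IP of the Nginx Pod> -c 2 | avg_rtt_ms` can be run once before applying the experiment and once while it is active.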

Run HTTPChaos Experiment

HTTPChaos allows you to inject faults into the requests and responses of an HTTP server. It supports the abort, delay, replace, and patch fault types.

Note: Before proceeding, delete the NetworkChaos experiment created earlier.
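
One way to do that cleanup is sketched below. `delete_experiment` is an illustrative helper name; the resource kind and name match the NetworkChaos manifest applied earlier, and the namespace is assumed to be default unless given.

```shell
# Delete a chaos experiment by kind and name.
# --ignore-not-found makes the step safe to repeat if it was already deleted.
# delete_experiment is an illustrative helper name.
delete_experiment() {
  kind="$1"
  name="$2"
  namespace="${3:-default}"
  kubectl delete "$kind" "$name" -n "$namespace" --ignore-not-found
}
```

For the experiment created earlier, the call would be `delete_experiment networkchaos nginx-network-delay`.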

Check the response time of the nginx pod:

kubectl exec -it test-connection -- sh
time curl <IP of the Nginx Pod>

Create an HTTPChaos experiment by running:
```
kubectl apply -f - <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: nginx-http-delay
spec:
  mode: all
  selector:
    labelSelectors:
      app: nginx
  target: Request
  port: 80
  delay: 1s
  method: GET
  path: /
  duration: 5m
EOF
```
This creates an HTTPChaos resource (an instance of the HTTPChaos CRD) that adds 1 second of latency to GET requests for / sent on port 80 to the pods with the label app: nginx, i.e. the nginx pod, for the next 5 minutes.

Note: If you get an error like admission webhook "vauth.kb.io" denied the request, this is a known problem as of version 2.0 (open issue 2187). A temporary workaround is to delete the validating webhook:
```
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validate-auth
```
Test the response time of the nginx pod again:
```
time curl <IP of the Nginx Pod>
```
You will see the additional 1 second of latency in the response.
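
The other fault types follow the same shape. As a hedged sketch, the function below prints a variant of the manifest above that uses the abort fault type instead of delay. The function name `http_abort_manifest` and the resource name `nginx-http-abort` are made up for illustration, and the `abort: true` field is an assumption based on the HTTPChaos spec rather than something tested in this post.

```shell
# Print an HTTPChaos manifest that aborts GET / requests to the nginx pod.
# Optional first argument: experiment duration (defaults to 5m).
# http_abort_manifest and nginx-http-abort are illustrative names.
http_abort_manifest() {
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: nginx-http-abort
spec:
  mode: all
  selector:
    labelSelectors:
      app: nginx
  target: Request
  port: 80
  abort: true
  method: GET
  path: /
  duration: ${1:-5m}
EOF
}
```

The output can be reviewed first and then applied with `http_abort_manifest 1m | kubectl apply -f -`; deleting the delay experiment beforehand keeps the two faults from overlapping.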

09 Aug 2021

Shardul's Tech Blog