Shardul
Shardul is a Cloud architect focused on Data and Infrastructure.

Cloud Native Chaos Engineering with Chaos Mesh


With the cloud, distributed architectures have grown even more complex, and with that complexity comes uncertainty about how a system could fail.

Chaos Engineering tests system resiliency by injecting faults to identify weaknesses, such as improper fallback settings for a service, cascading failures due to a single point of failure, or retry storms due to misconfigured timeouts, before they cause massive outages.

History

Chaos Engineering started at Netflix back in 2010, when the company moved from on-prem servers to AWS and needed to test the resiliency of its infrastructure.

In 2012, Netflix open-sourced Chaos Monkey under the Apache 2.0 license. It randomly terminates instances to ensure that services are resilient to instance failures.

Cloud Native Chaos Engineering in CNCF Landscape

CNCF focuses on Cloud Native Chaos Engineering, defined as engineering practices focused on (and built on) Kubernetes environments, applications, microservices, and infrastructure.

Cloud Native Chaos Engineering has four core principles:

  1. Open source
  2. CRDs for Chaos Management
  3. Extensible and pluggable
  4. Broad Community adoption

At the time of writing, CNCF has two sandbox projects for Cloud Native Chaos Engineering:

  1. Chaos Mesh
  2. Litmus Chaos

Chaos Mesh

Chaos Mesh is a cloud-native Chaos Engineering platform that orchestrates chaos on Kubernetes environments. It is based on the Kubernetes Operator pattern and provides a Chaos Operator that injects faults into applications and Kubernetes infrastructure in a manageable way.

The Chaos Operator uses Custom Resource Definitions (CRDs) to define chaos objects. It provides a variety of these CRDs for fault injection, such as:

  1. PodChaos
  2. NetworkChaos
  3. DNSChaos
  4. HTTPChaos
  5. StressChaos
  6. IOChaos
  7. TimeChaos
  8. KernelChaos
  9. AWSChaos
  10. GCPChaos
  11. JVMChaos
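
To illustrate what a chaos object built on one of these CRDs can look like, here is a minimal PodChaos sketch. The resource name `example-pod-kill` and the label `app: my-app` are hypothetical placeholders, not values from this walkthrough, and the manifest assumes the `chaos-mesh.org/v1alpha1` API version used elsewhere in this post.

```yaml
# Minimal PodChaos sketch: kill one randomly selected pod that matches the selector.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: example-pod-kill        # hypothetical name
  namespace: default
spec:
  action: pod-kill              # terminate the selected pod
  mode: one                     # pick a single pod from the matching set
  selector:
    namespaces:
      - default
    labelSelectors:
      app: my-app               # hypothetical label; use your workload's labels
```

If the target pod is managed by a Deployment, Kubernetes should recreate it, which is the recovery behavior this kind of experiment is meant to exercise.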

Chaos Mesh Installation

Chaos Mesh can be installed quickly using the installation script. However, it’s recommended to use the Helm 3 chart in production environments.

To install Chaos Mesh using Helm:

  1. Add the Chaos Mesh chart repository to Helm.

    helm repo add chaos-mesh https://charts.chaos-mesh.org
    
  2. It’s recommended to install Chaos Mesh in a separate namespace. You can either create the chaos-testing namespace manually or let Helm create it automatically if it doesn’t exist:

    helm upgrade \
         --install \
         chaos-mesh \
         chaos-mesh/chaos-mesh \
         -n chaos-testing \
         --create-namespace \
         --version v2.0.0 \
         --wait
    

    Note: If you’re using GKE or EKS with containerd, use:

    helm upgrade \
         --install \
         chaos-mesh \
         chaos-mesh/chaos-mesh \
         -n chaos-testing \
         --create-namespace \
         --set chaosDaemon.runtime=containerd \
         --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
         --version v2.0.0 \
         --wait
    
  3. Verify that the pods are running:

    kubectl get pods -n chaos-testing
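
The containerd settings passed with `--set` in step 2 could also be kept in a Helm values file, which is easier to review and version-control. This is a sketch under the assumption that you store it in a file such as `values.yaml` (a hypothetical filename); the keys and values are the same ones used in the command above.

```yaml
# values.yaml (hypothetical filename): same settings as the --set flags in step 2
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
```

The file would then be passed to the same `helm upgrade --install` command with `-f values.yaml` in place of the two `--set` flags.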
    

Run First Chaos Mesh Experiment

A Chaos Experiment describes what type of fault is injected and how.

  1. Set up an nginx pod serving on port 80.

    kubectl run nginx --image=nginx --labels="app=nginx" --port=80
    
  2. Get the IP of the nginx pod:

    kubectl get pods nginx -ojsonpath="{.status.podIP}"
    
  3. Open another terminal and set up a test pod to check connectivity to the nginx pod:

    kubectl run -it test-connection --image=radial/busyboxplus:curl -- sh
    ping <IP of the Nginx Pod> -c 2
    

    This should show the round-trip time of each ping to the pod IP.

  4. Create your first Chaos Experiment by running:

    kubectl apply -f - <<EOF
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: nginx-network-delay
    spec:
      action: delay
      mode: one
      selector:
        namespaces:
          - default
        labelSelectors:
          'app': 'nginx'
      delay:
        latency: '1s'
      duration: '60s'
    EOF
    

    This creates a custom resource of kind NetworkChaos that adds 1 second of latency to the network of one pod matching the label app: nginx (here, the nginx pod) for the next 60 seconds.

  5. Ping the nginx pod again from the test pod to see the 1-second delay.
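
A fixed delay is the simplest case. As a variation on the experiment in step 4, the sketch below adds jitter and correlation to the delay so the latency varies between packets. The resource name and the specific `jitter` and `correlation` values are illustrative assumptions, not settings from this walkthrough.

```yaml
# Variation on step 4: delay with jitter, so latency is not constant.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: nginx-network-delay-jitter   # hypothetical name
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'nginx'
  delay:
    latency: '1s'        # base delay, as in step 4
    jitter: '100ms'      # assumed value: variation around the base delay
    correlation: '50'    # assumed value: percent correlation with the previous packet's delay
  duration: '60s'
```

Variable latency tends to be closer to what a degraded network looks like in practice than a constant delay.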

Run HTTPChaos Experiment

HTTPChaos allows you to inject faults into the requests and responses of an HTTP server. It supports the abort, delay, replace, and patch fault types.

Note: Before proceeding, delete the NetworkChaos experiment created earlier, for example with kubectl delete networkchaos nginx-network-delay.

  1. Check the response time of the nginx pod:

    kubectl exec -it test-connection -- sh
    time curl <IP of the Nginx Pod>
    

  2. Create an HTTPChaos experiment by running:

    kubectl apply -f - <<EOF
    apiVersion: chaos-mesh.org/v1alpha1
    kind: HTTPChaos
    metadata:
      name: nginx-http-delay
    spec:
      mode: all
      selector:
        labelSelectors:
          app: nginx
      target: Request
      port: 80
      delay: 1s
      method: GET
      path: /
      duration: 5m
    EOF
    

    This creates a custom resource of kind HTTPChaos that adds 1 second of latency to GET requests for / sent to port 80 of the pods with the label app: nginx (here, the nginx pod) for the next 5 minutes.

    Note: If you get an error like admission webhook "vauth.kb.io" denied the request, this is tracked in open issue 2187 as of version 2.0. A temporary workaround is to delete the validating webhook:

    kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io validate-auth
    
  3. Test the response time of the nginx pod again:

    time curl <IP of the Nginx Pod>
    

    You will see the additional 1 second of latency in the response.
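
The other fault types follow the same shape. As a sketch, the manifest below swaps the delay from step 2 for the abort fault type, so matching requests are interrupted rather than slowed down. The resource name `nginx-http-abort` is a hypothetical placeholder; the remaining fields mirror the delay experiment above.

```yaml
# Sketch: abort GET requests for / on port 80 instead of delaying them.
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: nginx-http-abort   # hypothetical name
spec:
  mode: all
  selector:
    labelSelectors:
      app: nginx
  target: Request
  port: 80
  abort: true              # interrupt the connection for matching requests
  method: GET
  path: /
  duration: 5m
```

With this applied, the curl command from step 3 would be expected to fail instead of returning the nginx welcome page, which is a quick way to check how a client handles dropped connections.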