Shardul
Shardul Shardul is a Cloud architect focused on Data and Infrastructure

Running Apache Spark on EKS Fargate


Running Apache Spark on EKS Fargate

Apache Spark is one of the most famous Big Data frameworks that allows you to process data at any scale.

Spark jobs can run on the Kubernetes cluster and have native support for the Kubernetes scheduler in GA from release 3.1.1 onwards.

Spark comes with a spark-submit script that allows submitting spark applications on a cluster using a single interface without the need to customize the script for different cluster managers.

spark-submit on Kubernetes cluster works as follows:

  1. Spark creates a Spark driver running within a Kubernetes pod.
  2. The driver creates executors which are also running within Kubernetes pods and connects to them and executes application code.
  3. When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.

spark-eks

To submit a spark job on a kubernetes cluster using spark-submit :

./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar

While spark-submit provides support for several Kubernetes features such as secrets, persistentVolumes, rbac via configuration parameters, it still lacks a lot of features thus it’s not suitable to use in production effectively.

Spark on K8s Operator

Spark on K8s Operator is a project from Google that allows submitting spark applications on Kubernetes cluster using CustomResource Definition SparkApplication. It uses mutating admission webhook to modify the pod spec and add the features not officially supported by spark-submit.

The Kubernetes Operator for Apache Spark consists of:

  1. A SparkApplication controller that watches events of creation, updates, and deletion of SparkApplication objects and acts on the watch events, a submission runner that runs spark-submit for submissions received from the controller,
  2. A Spark pod monitor that watches for Spark pods and sends pod status updates to the controller,
  3. A Mutating Admission Webhook that handles customizations for Spark driver and executor pods based on the annotations on the pods added by the controller,
  4. A command-line tool named sparkctl for working with the operator.

The following diagram shows how different components interact and work together.

spark-operator

Setup Spark on K8s Operator on EKS Fargate

  1. Setup EKS cluster using eksctl with fargate profile for default, kube-system, and spark namespaces.

     eksctl apply -f - <<EOF
     apiVersion: eksctl.io/v1alpha5
     kind: ClusterConfig
     metadata:
       name: spark-cluster
       region: us-east-1
       version: "1.21"
     availabilityZones: 
       - us-east-1a
       - us-east-1b
       - us-east-1c
     fargateProfiles:
       - name: fp-all
         selectors:
           - namespace: kube-system
           - namespace: default
           - namespace: spark
     EOF
    
  2. Install Spark on K8s Operator using helm3 in the spark namespace.

     helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
     helm upgrade \
         --install \
         spark-operator \
         spark-operator/spark-operator \
         --namespace spark \
         --create-namespace \
         --set webhook.enable=true \
         --set sparkJobNamespace=spark \
         --set serviceAccounts.spark.name=spark \
         --set logLevel=10 \
         --version 1.1.6 \
         --wait
    
  3. Verify Operator installation

     kubectl get pods -n spark
    

Submit SparkPi on EKS Cluster

  1. Submit the SparkPi application to the EKS cluster

     kubectl apply -f - <<EOF
     apiVersion: "sparkoperator.k8s.io/v1beta2"
     kind: SparkApplication
     metadata:
       name: spark-pi
       namespace: spark
     spec:
       type: Scala
       mode: cluster
       image: "gcr.io/spark-operator/spark:v3.1.1"
       imagePullPolicy: Always
       mainClass: org.apache.spark.examples.SparkPi
       mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
       sparkVersion: "3.1.1"
       restartPolicy:
         type: Never
       driver:
         cores: 1
         coreLimit: "1200m"
         memory: "512m"
         labels:
           version: 3.1.1
         serviceAccount: spark
       executor:
         cores: 1
         instances: 1
         memory: "512m"
         labels:
           version: 3.1.1
     EOF
    

    Note: hostPath volume mounts are not supported in Fargate.

  2. Check the status of SparkApplication

     kubectl -n spark describe sparkapplications.sparkoperator.k8s.io spark-pi
    
  3. Access Spark UI by port-forwarding to the spark-pi-ui-svc

     kubectl -n spark port-forward svc/spark-pi-ui-svc 4040:4040
    

    spark-pi

Managing SparkApplication with sparkctl

sparkctl is CLI tool for creating, listing, checking status of, getting logs of, and deleting SparkApplications running on the Kubernetes cluster.

  1. Build sparkctl from source:

    git clone [email protected]:GoogleCloudPlatform/spark-on-k8s-operator.git
    cd spark-on-k8s-operator/sparkctl
    go build -o /usr/local/bin/sparkctl
    
  2. List SparkApplication objects in spark namespace:

    sparkctl list -n spark
    +----------+-----------+----------------+-----------------+
    |   NAME   |   STATE   | SUBMISSION AGE | TERMINATION AGE |
    +----------+-----------+----------------+-----------------+
    | spark-pi | COMPLETED | 1h             | 1h              |
    +----------+-----------+----------------+-----------------+
    
  3. Check the status of SparkApplication spark-pi :

    sparkctl status spark-pi -n spark
    
     application state:
     +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
     |   STATE   | SUBMISSION AGE | COMPLETION AGE |   DRIVER POD    |     DRIVER UI      | SUBMISSIONATTEMPTS | EXECUTIONATTEMPTS |
     +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
     | COMPLETED | 1h             | 1h             | spark-pi-driver | 10.100.97.206:4040 |                  1 |                 1 |
     +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
     executor state:
     +----------------------------------+-----------+
     |           EXECUTOR POD           |   STATE   |
     +----------------------------------+-----------+
     | spark-pi-418ac87b48d177c9-exec-1 | COMPLETED |
     +----------------------------------+-----------+
    
  4. Check SparkApplication spark-pi logs:

    sparkctl log spark-pi -n spark 
    
  5. Port-forward to Spark UI:

    sparkctl forward spark-pi -n spark
    

    you can access the Spark UI at http://localhost:4040