Shardul Follow Shardul is a seasoned Engineering Leader and Cloud expert focused on Data and Infrastructure

Running Apache Spark on EKS Fargate

Apache Spark is one of the most famous Big Data frameworks that allows you to process data at any scale.

Spark jobs can run on the Kubernetes cluster and have native support for the Kubernetes scheduler in GA from release 3.1.1 onwards.

Spark comes with a spark-submit script that allows submitting spark applications on a cluster using a single interface without the need to customize the script for different cluster managers.

spark-submit on Kubernetes cluster works as follows:

Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors which are also running within Kubernetes pods and connects to them and executes application code.
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.

spark-eks

To submit a spark job on a kubernetes cluster using spark-submit :

./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar

While spark-submit provides support for several Kubernetes features such as secrets, persistentVolumes, rbac via configuration parameters, it still lacks a lot of features thus it’s not suitable to use in production effectively.

Spark on K8s Operator

Spark on K8s Operator is a project from Google that allows submitting spark applications on Kubernetes cluster using CustomResource Definition SparkApplication. It uses mutating admission webhook to modify the pod spec and add the features not officially supported by spark-submit.

The Kubernetes Operator for Apache Spark consists of:

A SparkApplication controller that watches events of creation, updates, and deletion of SparkApplication objects and acts on the watch events, a submission runner that runs spark-submit for submissions received from the controller,
A Spark pod monitor that watches for Spark pods and sends pod status updates to the controller,
A Mutating Admission Webhook that handles customizations for Spark driver and executor pods based on the annotations on the pods added by the controller,
A command-line tool named sparkctl for working with the operator.

The following diagram shows how different components interact and work together.

spark-operator

Setup Spark on K8s Operator on EKS Fargate

Setup EKS cluster using eksctl with fargate profile for default, kube-system, and spark namespaces.

 eksctl apply -f - <<EOF
 apiVersion: eksctl.io/v1alpha5
 kind: ClusterConfig
 metadata:
   name: spark-cluster
   region: us-east-1
   version: "1.21"
 availabilityZones: 
   - us-east-1a
   - us-east-1b
   - us-east-1c
 fargateProfiles:
   - name: fp-all
     selectors:
       - namespace: kube-system
       - namespace: default
       - namespace: spark
 EOF

Install Spark on K8s Operator using helm3 in the spark namespace.

 helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
 helm upgrade \
     --install \
     spark-operator \
     spark-operator/spark-operator \
     --namespace spark \
     --create-namespace \
     --set webhook.enable=true \
     --set sparkJobNamespace=spark \
     --set serviceAccounts.spark.name=spark \
     --set logLevel=10 \
     --version 1.1.6 \
     --wait

Verify Operator installation
```
 kubectl get pods -n spark
```

Submit SparkPi on EKS Cluster

Submit the SparkPi application to the EKS cluster

 kubectl apply -f - <<EOF
 apiVersion: "sparkoperator.k8s.io/v1beta2"
 kind: SparkApplication
 metadata:
   name: spark-pi
   namespace: spark
 spec:
   type: Scala
   mode: cluster
   image: "gcr.io/spark-operator/spark:v3.1.1"
   imagePullPolicy: Always
   mainClass: org.apache.spark.examples.SparkPi
   mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
   sparkVersion: "3.1.1"
   restartPolicy:
     type: Never
   driver:
     cores: 1
     coreLimit: "1200m"
     memory: "512m"
     labels:
       version: 3.1.1
     serviceAccount: spark
   executor:
     cores: 1
     instances: 1
     memory: "512m"
     labels:
       version: 3.1.1
 EOF

Note: hostPath volume mounts are not supported in Fargate.

Check the status of SparkApplication

 kubectl -n spark describe sparkapplications.sparkoperator.k8s.io spark-pi

Access Spark UI by port-forwarding to the spark-pi-ui-svc

 kubectl -n spark port-forward svc/spark-pi-ui-svc 4040:4040

spark-pi

Managing SparkApplication with sparkctl

sparkctl is CLI tool for creating, listing, checking status of, getting logs of, and deleting SparkApplications running on the Kubernetes cluster.

Build sparkctl from source:

git clone [email protected]:GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/sparkctl
go build -o /usr/local/bin/sparkctl

List SparkApplication objects in spark namespace:

sparkctl list -n spark
+----------+-----------+----------------+-----------------+
|   NAME   |   STATE   | SUBMISSION AGE | TERMINATION AGE |
+----------+-----------+----------------+-----------------+
| spark-pi | COMPLETED | 1h             | 1h              |
+----------+-----------+----------------+-----------------+

Check the status of SparkApplication spark-pi :

sparkctl status spark-pi -n spark

 application state:
 +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
 |   STATE   | SUBMISSION AGE | COMPLETION AGE |   DRIVER POD    |     DRIVER UI      | SUBMISSIONATTEMPTS | EXECUTIONATTEMPTS |
 +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
 | COMPLETED | 1h             | 1h             | spark-pi-driver | 10.100.97.206:4040 |                  1 |                 1 |
 +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
 executor state:
 +----------------------------------+-----------+
 |           EXECUTOR POD           |   STATE   |
 +----------------------------------+-----------+
 | spark-pi-418ac87b48d177c9-exec-1 | COMPLETED |
 +----------------------------------+-----------+

Check SparkApplication spark-pi logs:
```
sparkctl log spark-pi -n spark 
```
Port-forward to Spark UI:
```
sparkctl forward spark-pi -n spark
```
you can access the Spark UI at http://localhost:4040

01 Nov 2020

« Powerful things you can do with the Markdown editor Cloud Native Chaos Engineering with Chaos Mesh »

Shardul's Tech Blog

Running Apache Spark on EKS Fargate

Spark on K8s Operator

Setup Spark on K8s Operator on EKS Fargate

Submit SparkPi on EKS Cluster

Managing SparkApplication with sparkctl

Explore →