Running Apache Spark on EKS Fargate
Apache Spark is one of the most popular big data frameworks, allowing you to process data at any scale.
Spark jobs can run on a Kubernetes cluster; native support for the Kubernetes scheduler has been generally available (GA) since release 3.1.1.
Spark comes with a spark-submit script that provides a single interface for submitting Spark applications to any supported cluster manager, without the need to customize the script for each one. On a Kubernetes cluster, spark-submit works as follows:
- Spark creates a Spark driver running within a Kubernetes pod.
- The driver creates executors, which also run within Kubernetes pods, connects to them, and executes application code.
- When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
To submit a Spark job to a Kubernetes cluster using spark-submit:
./bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
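Once the job is submitted, a quick way to follow the lifecycle described above is to watch for the spark-role labels that Spark puts on the pods it creates; a minimal sketch, assuming the default namespace:

# Watch the driver and executor pods come and go
kubectl get pods -l spark-role=driver --watch

# Stream the driver logs (the driver pod carries the spark-role=driver label)
kubectl logs -f -l spark-role=driver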
While spark-submit supports several Kubernetes features, such as secrets, persistent volumes, and RBAC, via configuration parameters, it still lacks many features, so it is not well suited for production use.
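As a sketch of how those features are wired in purely through configuration, the flags below mount a Kubernetes secret into the driver and executors and assign a service account for RBAC; the secret name my-secret, the mount path, and the service account spark are illustrative assumptions:

# Illustrative: mount secret "my-secret" and run the driver under the "spark" service account
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.secrets.my-secret=/etc/secrets \
  --conf spark.kubernetes.executor.secrets.my-secret=/etc/secrets \
  local:///path/to/examples.jar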
Spark on K8s Operator
Spark on K8s Operator is a project from Google that allows submitting Spark applications to a Kubernetes cluster using the SparkApplication custom resource definition (CRD).
It uses a mutating admission webhook to modify the pod spec and add features not natively supported by spark-submit.
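For example, here is a sketch of driver-pod customizations that go through the webhook rather than spark-submit; the toleration key and ConfigMap name are illustrative assumptions:

# Fragment of a SparkApplication spec (not a complete manifest)
driver:
  # Allow the driver onto nodes with an illustrative "spark" taint
  tolerations:
    - key: "spark"
      operator: "Exists"
      effect: "NoSchedule"
  # Mount an (assumed) ConfigMap into the driver pod
  configMaps:
    - name: my-spark-config
      path: /etc/spark/config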
The Kubernetes Operator for Apache Spark consists of:
- A SparkApplication controller that watches events of creation, updates, and deletion of SparkApplication objects and acts on the watch events,
- A submission runner that runs spark-submit for submissions received from the controller,
- A Spark pod monitor that watches Spark pods and sends pod status updates to the controller,
- A Mutating Admission Webhook that handles customizations for Spark driver and executor pods based on the annotations added to the pods by the controller,
- A command-line tool named sparkctl for working with the operator.
The following diagram shows how different components interact and work together.
Set up Spark on K8s Operator on EKS Fargate
- Set up an EKS cluster using eksctl with a Fargate profile for the default, kube-system, and spark namespaces.

eksctl create cluster -f - <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-cluster
  region: us-east-1
  version: "1.21"
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
fargateProfiles:
  - name: fp-all
    selectors:
      - namespace: kube-system
      - namespace: default
      - namespace: spark
EOF
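Once the cluster is up, it's worth confirming that the Fargate profile covers all three namespaces; a quick check, assuming the cluster name and region above:

eksctl get fargateprofile --cluster spark-cluster --region us-east-1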
- Install Spark on K8s Operator using Helm 3 in the spark namespace.

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm upgrade \
  --install \
  spark-operator \
  spark-operator/spark-operator \
  --namespace spark \
  --create-namespace \
  --set webhook.enable=true \
  --set sparkJobNamespace=spark \
  --set serviceAccounts.spark.name=spark \
  --set logLevel=10 \
  --version 1.1.6 \
  --wait
- Verify the operator installation:

kubectl get pods -n spark
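Because the chart was installed with webhook.enable=true, you can also verify that the operator's mutating webhook is registered; a minimal check (the exact resource name may vary by chart version):

kubectl get mutatingwebhookconfigurations | grep -i spark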
Submit SparkPi on the EKS Cluster
- Submit the SparkPi application to the EKS cluster.

kubectl apply -f - <<EOF
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.1
EOF
Note: hostPath volume mounts are not supported in Fargate.
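If the job needs scratch space, one option on Fargate is an emptyDir volume instead of hostPath; a sketch of the relevant SparkApplication fields, with the volume name and mount path as illustrative assumptions:

# Fragment of a SparkApplication spec (not a complete manifest)
spec:
  volumes:
    - name: scratch
      emptyDir: {}
  driver:
    volumeMounts:
      - name: scratch
        mountPath: /tmp/scratch
  executor:
    volumeMounts:
      - name: scratch
        mountPath: /tmp/scratch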
- Check the status of the SparkApplication:

kubectl -n spark describe sparkapplications.sparkoperator.k8s.io spark-pi
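To pull out just the application state rather than the full description, a one-liner against the operator's status schema:

kubectl -n spark get sparkapplications spark-pi \
  -o jsonpath='{.status.applicationState.state}'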
- Access the Spark UI by port-forwarding to the spark-pi-ui-svc service:

kubectl -n spark port-forward svc/spark-pi-ui-svc 4040:4040
Managing SparkApplication with sparkctl
sparkctl is a CLI tool for creating, listing, checking the status of, fetching the logs of, and deleting SparkApplications running on a Kubernetes cluster.
- Build sparkctl from source:

git clone [email protected]:GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator/sparkctl
go build -o /usr/local/bin/sparkctl
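Besides kubectl apply, sparkctl can submit applications from a manifest file as well; a sketch assuming the SparkApplication above was saved as spark-pi.yaml:

sparkctl create spark-pi.yaml -n spark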
- List SparkApplication objects in the spark namespace:

sparkctl list -n spark

+----------+-----------+----------------+-----------------+
|   NAME   |   STATE   | SUBMISSION AGE | TERMINATION AGE |
+----------+-----------+----------------+-----------------+
| spark-pi | COMPLETED | 1h             | 1h              |
+----------+-----------+----------------+-----------------+
- Check the status of SparkApplication spark-pi:

sparkctl status spark-pi -n spark

application state:
+-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
|   STATE   | SUBMISSION AGE | COMPLETION AGE |   DRIVER POD    |     DRIVER UI      | SUBMISSIONATTEMPTS | EXECUTIONATTEMPTS |
+-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
| COMPLETED | 1h             | 1h             | spark-pi-driver | 10.100.97.206:4040 |                  1 |                 1 |
+-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
executor state:
+----------------------------------+-----------+
|           EXECUTOR POD           |   STATE   |
+----------------------------------+-----------+
| spark-pi-418ac87b48d177c9-exec-1 | COMPLETED |
+----------------------------------+-----------+
- Check SparkApplication spark-pi logs:

sparkctl log spark-pi -n spark
- Port-forward to the Spark UI:

sparkctl forward spark-pi -n spark

You can then access the Spark UI at http://localhost:4040.
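When you are done, the application and its driver UI service can be cleaned up through sparkctl as well:

sparkctl delete spark-pi -n spark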