Author: Sandor Magyari

Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Apache Spark CI/CD workflow howto
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive
CI/CD flow for Zeppelin notebooks

Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd

In our first blog post about Zeppelin on Kubernetes we explored a few of the problems we’ve encountered so far. Let’s briefly recap them here:

  • communication between Zeppelin Server and RemoteInterpreterServer
  • dependency handling and logger setup
  • proper handling of the lifecycle of pods started by spark-submit

To address these problems we created the following pull request, PR-2637, in Apache Zeppelin. The most important part of that PR extends the functionality of RemoteInterpreterManagedProcess, which is responsible for managing and connecting to remotely running interpreter processes. SparkK8RemoteInterpreterManagedProcess simplifies connecting to the RemoteInterpreterServer and communicates directly with the Kubernetes cluster. Our solution still uses spark-submit and interpreter.sh; however, after spark-submit starts, we watch for events related to the Spark Driver pod it creates. Once the Driver’s state is ‘Running’, a separate Kubernetes Service is created and bound to the RemoteInterpreterServer running inside the Spark Driver. SparkK8RemoteInterpreterManagedProcess then tries to connect to the RemoteInterpreterServer through this service. Should the connection fail, both the Spark Driver and the service are deleted.
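To make that flow more concrete, here is a rough approximation of what SparkK8RemoteInterpreterManagedProcess automates on your behalf, expressed as plain kubectl commands. This is only an illustration: the pod and service names and the interpreter port below are placeholders, not the names the PR actually generates.

# watch the Spark Driver pod created by spark-submit until it reaches 'Running'
kubectl get pod <driver-pod> --watch

# once the Driver is running, expose the RemoteInterpreterServer port inside it as a Service
kubectl expose pod <driver-pod> --name=<interpreter-service> --port=<interpreter-port>

# if Zeppelin cannot connect through that Service, both resources are cleaned up
kubectl delete svc <interpreter-service>
kubectl delete pod <driver-pod>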

As you can see in the sequence diagram below, the flow of actions when starting a Zeppelin notebook is the same as the flow of starting any Spark application on Kubernetes.

Zeppelin Spark K8s flow

Now let’s start a Zeppelin notebook on Minikube, using our pre-built Docker images:

  • First, start the ResourceStagingServer used by spark-submit to distribute resources (in our case the Zeppelin Spark interpreter JAR) to the Spark driver and executors
wget https://raw.githubusercontent.com/apache-spark-on-k8s/spark/branch-2.2-kubernetes/conf/kubernetes-resource-staging-server.yaml
kubectl create -f kubernetes-resource-staging-server.yaml
  • Next, create a Kubernetes service to reach the Zeppelin server from outside the cluster
wget https://raw.githubusercontent.com/banzaicloud/zeppelin/k8_config/scripts/docker/spark-cluster-managers/kubernetes/zeppelin-service.yaml
kubectl create -f zeppelin-service.yaml
  • Get the address of ResourceStagingServer either from the k8s dashboard or by running
kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}'
  • Then download the Zeppelin server’s pod definition
wget https://raw.githubusercontent.com/banzaicloud/zeppelin/k8_config/scripts/docker/spark-cluster-managers/kubernetes/zeppelin-pod.yaml
  • And edit zeppelin-pod.yaml, setting the address of the ResourceStagingServer we just retrieved (see the helper script after this list for a way to automate this and the remaining steps)
  - name: SPARK_SUBMIT_OPTIONS
    value: >-
      --kubernetes-namespace default
      --conf spark.executor.instances=2
      --conf spark.kubernetes.resourceStagingServer.uri=http://10.0.0.121:10000
      --conf spark.kubernetes.resourceStagingServer.internal.uri=http://10.0.0.121:10000
  • Start the Zeppelin server
kubectl create -f zeppelin-pod.yaml
  • Check the list of pods and wait until the zeppelin-server pod’s status is Running
$ kubectl get po
NAME                                             READY     STATUS    RESTARTS   AGE
spark-resource-staging-server-6bcc9fb55f-w24vr   1/1       Running   0          2m
zeppelin-server                                  1/1       Running   0          1m
  • The Zeppelin UI should be reachable at the same IP as the Minikube dashboard (the address of the node), while the port can be retrieved either from the k8s dashboard or by running
kubectl get svc zeppelin-k8s-service -o jsonpath='{.spec.ports[0].nodePort}'
  • Finally, start a notebook and check the list of pods again
$ kubectl get po
NAME                                             READY     STATUS    RESTARTS   AGE
spark-resource-staging-server-6bcc9fb55f-w24vr   1/1       Running   0          33m
zeppelin-server                                  1/1       Running   0          24m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-driver    1/1       Running   0          4m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-exec-1    1/1       Running   0          3m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-exec-2    1/1       Running   0          3m
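
If you’d rather not copy the ClusterIP and NodePort around by hand, the manual steps above can be scripted. The snippet below is just a convenience sketch: it assumes the service names used in this post (spark-resource-staging-service, zeppelin-k8s-service) and that zeppelin-pod.yaml still contains the example address 10.0.0.121 shown above.

# patch zeppelin-pod.yaml with the real ResourceStagingServer address
STAGING_IP=$(kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}')
sed -i "s/10.0.0.121/${STAGING_IP}/g" zeppelin-pod.yaml

# start the Zeppelin server and wait until the pod is running
kubectl create -f zeppelin-pod.yaml
until kubectl get pod zeppelin-server -o jsonpath='{.status.phase}' 2>/dev/null | grep -q Running; do sleep 2; done

# print the Zeppelin UI address (Minikube node IP + NodePort)
ZEPPELIN_PORT=$(kubectl get svc zeppelin-k8s-service -o jsonpath='{.spec.ports[0].nodePort}')
echo "Zeppelin UI: http://$(minikube ip):${ZEPPELIN_PORT}"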

You can also build Zeppelin from scratch and create your own Docker image via our GitHub repository.
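If you do want to roll your own image, the outline below shows the general shape of such a build. Treat it as a sketch only: the Maven profile and the Dockerfile location are assumptions based on Zeppelin’s standard build and the repository paths referenced above, so check the banzaicloud/zeppelin repo for the exact steps.

# rough outline only - exact profiles and paths may differ
git clone -b k8_config https://github.com/banzaicloud/zeppelin.git
cd zeppelin
mvn clean package -DskipTests -Pbuild-distr                       # build a Zeppelin distribution
docker build -t <your-registry>/zeppelin-server:latest \
    -f scripts/docker/spark-cluster-managers/kubernetes/Dockerfile .   # assumed Dockerfile path
docker push <your-registry>/zeppelin-server:latest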

An easier, cloud-agnostic alternative to all these steps is to use Pipeline, where they have been automated and wired into a CI/CD pipeline. In the next post in this series, we’ll introduce our Zeppelin spotguide and CI/CD plugin, and highlight how easy it is to run Zeppelin notebooks on Kubernetes the cloud native way.

We hope you enjoy running Zeppelin notebooks on Kubernetes!

If you’re interested in our technology and open source projects, follow us on GitHub, LinkedIn or Twitter.
