
Sandor Magyari

Mon, Jan 8, 2018


Running Zeppelin Spark notebooks on Kubernetes - deep dive

Apache Spark on Kubernetes series:
Introduction to Spark on Kubernetes
Scaling Spark made simple on Kubernetes
The anatomy of Spark applications on Kubernetes
Monitoring Apache Spark with Prometheus
Apache Spark CI/CD workflow howto
Spark History Server on Kubernetes
Spark scheduling on Kubernetes demystified
Spark Streaming Checkpointing on Kubernetes
Deep dive into monitoring Spark and Zeppelin with Prometheus
Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
Running Zeppelin Spark notebooks on Kubernetes
Running Zeppelin Spark notebooks on Kubernetes - deep dive
CI/CD flow for Zeppelin notebooks

Apache Kafka on Kubernetes series:
Kafka on Kubernetes - using etcd

In our first blog post about Zeppelin on Kubernetes we described a few problems we encountered. Let's briefly recap what they were:

  • communication between Zeppelin Server and RemoteInterpreterServer
  • dependency handling and logger setup
  • properly handling the lifecycle of pods started by spark-submit

To address these problems we created PR-2637 in Apache Zeppelin. The most important part of the PR extends the functionality of RemoteInterpreterManagedProcess, which is responsible for managing and connecting to a remotely running interpreter process. The new SparkK8RemoteInterpreterManagedProcess simplifies the connection to RemoteInterpreterServer and communicates directly with the Kubernetes cluster. Our solution still uses spark-submit and interpreter.sh; however, after launching spark-submit it watches for events related to the Spark driver pod that spark-submit creates. Once the driver's state becomes 'Running', a separate Kubernetes Service is created and bound to the RemoteInterpreterServer running inside the Spark driver. SparkK8RemoteInterpreterManagedProcess then connects to RemoteInterpreterServer through this service. Should the connection fail for some reason, the Spark driver and the service are deleted.
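For illustration only, such a per-notebook service might look roughly like the manifest below. The service name, selector labels and port are placeholders (the real values are derived at runtime from the driver pod), so don't treat this as the exact manifest created by the PR:

apiVersion: v1
kind: Service
metadata:
  name: zri-example-interpreter-svc          # placeholder name
spec:
  selector:
    spark-app-selector: <spark-app-id>        # assumed label set by spark-submit on the driver pod
    spark-role: driver
  ports:
    - name: interpreter
      port: 30000                             # RemoteInterpreterServer's Thrift port (placeholder)
      targetPort: 30000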

As you can see in the sequence diagram below, the flow of actions upon starting a Zeppelin notebook follows the same pattern as starting any other Spark application on K8s.

Zeppelin Spark K8s flow
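Note that the walkthrough below assumes a Minikube cluster that is already running and has enough resources for the Zeppelin server, the Spark driver and two executors; for example (the sizes here are only an illustration):

minikube start --cpus 4 --memory 8192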

Now let’s see how to start a Zeppelin notebook on Minikube using our pre-built Docker images:

  • First, start the ResourceStagingServer, which is used by spark-submit to distribute resources (in our case the Zeppelin Spark interpreter JAR) to the Spark driver and executors
wget https://raw.githubusercontent.com/apache-spark-on-k8s/spark/branch-2.2-kubernetes/conf/kubernetes-resource-staging-server.yaml  
kubectl create -f kubernetes-resource-staging-server.yaml
  • Create a Kubernetes service to reach the Zeppelin server from outside the cluster
wget https://raw.githubusercontent.com/banzaicloud/zeppelin/k8_config/scripts/docker/spark-cluster-managers/kubernetes/zeppelin-service.yaml
kubectl create -f zeppelin-service.yaml
  • Get the address of the ResourceStagingServer either from the K8s dashboard or by running
kubectl get svc spark-resource-staging-service -o jsonpath='{.spec.clusterIP}'
  • Download the pod definition of the Zeppelin server
wget https://raw.githubusercontent.com/banzaicloud/zeppelin/k8_config/scripts/docker/spark-cluster-managers/kubernetes/zeppelin-pod.yaml
  • Edit zeppelin-pod.yaml and set the address of the ResourceStagingServer retrieved in the previous step
  - name: SPARK_SUBMIT_OPTIONS
    value: >-
      --kubernetes-namespace default
      --conf spark.executor.instances=2
      --conf spark.kubernetes.resourceStagingServer.uri=http://10.0.0.121:10000
      --conf spark.kubernetes.resourceStagingServer.internal.uri=http://10.0.0.121:10000
  • Start the Zeppelin server
kubectl create -f zeppelin-pod.yaml
  • Check the list of pods and wait until the zeppelin-server pod's status is Running
$ kubectl get po
NAME                                             READY     STATUS    RESTARTS   AGE
spark-resource-staging-server-6bcc9fb55f-w24vr   1/1       Running   0          2m
zeppelin-server                                  1/1       Running   0          1m
  • The Zeppelin UI should be reachable on the same IP as the Minikube dashboard (the address of the node), while the port can be retrieved either from the K8s dashboard or by running
kubectl get svc zeppelin-k8s-service -o jsonpath='{.spec.ports[0].nodePort}'
  • Start a notebook, then check the list of pods again (a few verification commands follow the listing below)
$ kubectl get po
NAME                                             READY     STATUS    RESTARTS   AGE
spark-resource-staging-server-6bcc9fb55f-w24vr   1/1       Running   0          33m
zeppelin-server                                  1/1       Running   0          24m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-driver    1/1       Running   0          4m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-exec-1    1/1       Running   0          3m
zri-2cxx5tw2h--2a94m5j1z-1512757546809-exec-2    1/1       Running   0          3m
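To double-check that everything is wired up, you can also list the services and tail the driver pod's logs; besides spark-resource-staging-service and zeppelin-k8s-service you should see the service created for the remote interpreter, as described above (the pod and service names will differ in your cluster):

kubectl get svc
kubectl logs -f zri-2cxx5tw2h--2a94m5j1z-1512757546809-driver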

You can also build Zeppelin from source and create your own Docker image from our GitHub repository.
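As a rough outline (the exact Maven options and the Dockerfile location depend on the branch you check out, so the commands below are a sketch rather than an exact recipe):

git clone -b k8_config https://github.com/banzaicloud/zeppelin.git
cd zeppelin
mvn clean package -DskipTests
docker build -t <your-registry>/zeppelin-k8s:latest -f <path-to-Dockerfile> .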

An easier, cloud-agnostic way to automate all of the steps above and wire them into a CI/CD pipeline is to use Pipeline. In upcoming posts we will introduce our Zeppelin spotguide and CI/CD plugin, and highlight how easy it is to run Zeppelin notebooks on Kubernetes the cloud native way, using Pipeline.

Enjoy running Zeppelin notebooks on Kubernetes!

If you are interested in our technology and open source projects, follow us on GitHub, LinkedIn or Twitter.
