Janos Matyas

Thu, Mar 15, 2018


Monitoring Spark with Prometheus, reloaded

Monitoring series:
  • Monitoring Apache Spark with Prometheus
  • Monitoring multiple federated clusters with Prometheus - the secure way
  • Application monitoring with Prometheus and Pipeline
  • Building a cloud cost management system on top of Prometheus
  • Monitoring Spark with Prometheus, reloaded

At Banzai Cloud we deploy large distributed applications to Kubernetes and operate these clusters as well. We don’t like to get PagerDuty notifications during the night, so we try to get ahead of issues by operating these clusters as efficiently as we can. We provide out-of-the-box centralized log collection, end-to-end tracing, monitoring and alerting for all the spotguides we support - including the applications, the Kubernetes cluster and the infrastructure it is deployed on.

For monitoring we chose Prometheus - the de facto monitoring tool for cloud native applications - and we support some quite advanced scenarios.

However, when you do monitoring with Prometheus there is a catch. Prometheus uses a pull model over http(s) to scrape data from applications. For batch jobs it also supports a push model, but enabling this feature requires a special component called pushgateway.

tl;dr:

There is no out-of-the-box solution for monitoring Spark with Prometheus. We made one, but we struggled to push the Prometheus sink upstream into the Spark codebase, so we closed the PR and made the sink’s source available as a standalone, independent package.

Apache Spark and Prometheus

Out of the box, Spark supports only a handful of sinks (Graphite, CSV, Ganglia), and Prometheus is not one of them, so we introduced a new Prometheus sink (PR; see the related Apache ticket SPARK-22343).

Long story short - we blogged about the original proposal here - but the community was not really receptive to adding a new industry-standard monitoring solution, so we closed the PR and are now making the Apache Spark Prometheus Sink available the alternative way. Ideally this would be part of Spark (like the other sinks), but it works just fine from the classpath as well.

Monitor Apache Spark with Prometheus the alternative way

We have externalized the sink into a separate library, so you can use it either by building it yourself or by taking it from our Maven repository.

Prometheus sink

PrometheusSink is a Spark metrics sink that publishes Spark metrics into Prometheus.

Prerequisites

Prometheus uses a pull model over http(s) to scrape data from applications. For batch jobs it also supports a push model, which is the one we need here, since Spark pushes metrics to sinks. To enable this feature for Prometheus, a special component called pushgateway needs to be running.
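For reference, here is a minimal sketch of the prerequisite setup, assuming Docker is available and the default 9091 port is used (the prometheus.yml snippet is an illustrative scrape configuration for your Prometheus server, not part of the sink itself):

# Run a local pushgateway (listens on :9091 by default)
docker run -d --name pushgateway -p 9091:9091 prom/pushgateway

# prometheus.yml - have Prometheus scrape the pushgateway;
# honor_labels keeps the job/instance labels pushed with the metrics
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['127.0.0.1:9091']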

How to enable PrometheusSink in Spark

Spark publishes metrics to the sinks listed in the metrics configuration file. The location of this file can be specified for spark-submit as follows:

--conf spark.metrics.conf=<path_to_the_file>/metrics.properties

Add the following lines to the metrics configuration file:

# Enable Prometheus for all instances by class name
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=<prometheus pushgateway protocol> - defaults to http
*.sink.prometheus.pushgateway-address=<prometheus pushgateway address> - defaults to 127.0.0.1:9091
*.sink.prometheus.period=<period> - defaults to 10
*.sink.prometheus.unit=<unit> - defaults to seconds (TimeUnit.SECONDS)
*.sink.prometheus.pushgateway-enable-timestamp=<enable/disable metrics timestamp> - defaults to false
  • pushgateway-address-protocol - the scheme of the URL where the pushgateway service is available
  • pushgateway-address - the host and port of the URL where the pushgateway service is available
  • period - controls how often metrics are sent to the pushgateway
  • unit - the time unit of the period
  • pushgateway-enable-timestamp - controls whether to send the timestamps of the metrics pushed to the pushgateway. This is disabled by default, as not all versions of pushgateway support timestamps for metrics.
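For example, a filled-in metrics.properties using the defaults listed above might look like this (the pushgateway address is an assumption - substitute the address of your own instance):

*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=127.0.0.1:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds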

spark-submit also needs to know the repository to download the jar containing PrometheusSink from:

--repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases

Note: this is a Maven repo hosted on GitHub

We also have to specify for spark-submit the spark-metrics package that includes PrometheusSink, along with its dependencies:

--packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.prometheus:simpleclient:0.0.23,io.prometheus:simpleclient_dropwizard:0.0.23,io.prometheus:simpleclient_pushgateway:0.0.23,io.dropwizard.metrics:metrics-core:3.1.2
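Putting the pieces together, a complete spark-submit invocation might look like the sketch below - the metrics.properties path, application class and jar are placeholders for your own job, while the metrics-related flags are exactly the ones described above:

spark-submit \
  --conf spark.metrics.conf=/etc/spark/metrics.properties \
  --repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases \
  --packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.prometheus:simpleclient:0.0.23,io.prometheus:simpleclient_dropwizard:0.0.23,io.prometheus:simpleclient_pushgateway:0.0.23,io.dropwizard.metrics:metrics-core:3.1.2 \
  --class com.example.MySparkApp \
  my-spark-app.jar

Once the job is running, the pushed metrics should show up on the pushgateway’s own metrics endpoint, e.g. curl http://127.0.0.1:9091/metrics.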

Package version

The version number of the package is formatted as com.banzaicloud:spark-metrics_<scala version>:<spark version>-<version>. For example, com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0 is version 1.0.0 of the sink, built for Scala 2.11 and Spark 2.2.1.

Conclusion

This is not the ideal scenario, but it does the job perfectly and is independent of the Spark codebase. At Banzai Cloud we still hope to contribute this sink once the community decides it actually needs it. Meanwhile, you can open a new feature request, use the existing PR, or use any other means to ask for native Prometheus support - and let us know through one of our social channels. As usual, we are happy to help. All the software we make is open source, and everything we consume is open source as well - so we are always eager to make open source projects better.

In case you’d like to help make this available as an official Spark package, let us know.

Meanwhile, if you have any questions or are interested in what we are doing, follow us on GitHub, LinkedIn or Twitter.
