Author: Janos Matyas

Monitoring Spark with Prometheus, reloaded

Monitoring series:
Monitoring Apache Spark with Prometheus
Monitoring multiple federated clusters with Prometheus - the secure way
Application monitoring with Prometheus and Pipeline
Building a cloud cost management system on top of Prometheus
Monitoring Spark with Prometheus, reloaded

At Banzai Cloud we deploy large distributed applications to Kubernetes clusters that we also operate. We don’t enjoy waking up to PagerDuty notifications in the middle of the night, so we try to get ahead of problems by operating these clusters as efficiently as possible. We have out-of-the-box centralized log collection, end-to-end tracing, monitoring and alerts for all the spotguides we support, including applications, Kubernetes clusters and all the infrastructure we deploy.

For monitoring, we chose Prometheus - the de-facto monitoring tool for cloud native applications. We’ve already explored some of its more advanced use cases in a series of articles.

However, when monitoring with Prometheus, there’s a catch. Prometheus uses a pull model over http(s) to scrape data from applications. For batch jobs, it also supports a push model, but enabling this feature requires a special component called pushgateway.

tl;dr:

There’s no out-of-the-box solution for monitoring Spark with Prometheus. We created one, but we’ve struggled to push the Prometheus sink upstream, into the Spark codebase. Therefore, we closed the PR and made the sink’s source available as a standalone, independent package.

Apache Spark and Prometheus

Spark supports a few sinks out-of-the-box (Graphite, CSV, Ganglia), but Prometheus is not one of them, so we introduced a new Prometheus sink (here’s the PR and related apache ticket, SPARK-22343).

Long story short, we’ve blogged about the original proposal before. However, the community proved unreceptive to the idea of adding a new industry-standard monitoring solution, so we closed our PR, and now we’re making the Apache Spark Prometheus Sink available the alternative way. Ideally, this would be part of Spark (as with the other sinks), but it also works fine from the classpath.

Monitor Apache Spark with Prometheus the alternative way

We have externalized the sink into a separate library, which you can use by either building it yourself, or by taking it from our Maven repository.

Prometheus sink

PrometheusSink is a Spark metrics sink that publishes Spark metrics to Prometheus.

Prerequisites

As previously mentioned, Prometheus uses a pull model over http(s) to scrape data from applications. For batch jobs it also supports a push model. We need to use this model, since Spark pushes metrics to sinks. In order to enable this feature for Prometheus, a special component called pushgateway must be running.
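For a quick local setup, one way to get a pushgateway running (an assumption here, not the only option) is the official Docker image:

```shell
# Start the Prometheus pushgateway on its default port, 9091
docker run -d --name pushgateway -p 9091:9091 prom/pushgateway
```

Prometheus itself then needs a scrape job pointing at the pushgateway (typically with honor_labels: true) so that the pushed Spark metrics keep their own job and instance labels.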

How to enable PrometheusSink in Spark

Spark publishes metrics to Sinks listed in the metrics configuration file. The location of the metrics configuration file can be specified for spark-submit as follows:

--conf spark.metrics.conf=<path_to_the_file>/metrics.properties

Add the following lines to the metrics configuration file:

# Enable Prometheus for all instances by class name
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# Prometheus pushgateway protocol - defaults to http
*.sink.prometheus.pushgateway-address-protocol=<prometheus pushgateway protocol>
# Prometheus pushgateway address - defaults to 127.0.0.1:9091
*.sink.prometheus.pushgateway-address=<prometheus pushgateway address>
# Metrics push period - defaults to 10
*.sink.prometheus.period=<period>
# Time unit of the push period - defaults to seconds (TimeUnit.SECONDS)
*.sink.prometheus.unit=<unit>
# Enable/disable metrics timestamp - defaults to false
*.sink.prometheus.pushgateway-enable-timestamp=<enable/disable metrics timestamp>
  • pushgateway-address-protocol - the scheme of the URL where the pushgateway service is available
  • pushgateway-address - the host and port URL where the pushgateway service is available
  • period - controls the periodicity of metrics sent to pushgateway
  • unit - the time unit of that periodicity
  • pushgateway-enable-timestamp - enables sending the timestamp of the metrics pushed to the pushgateway service. This is disabled by default, as not all versions of the pushgateway service support timestamps for metrics.
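Putting the settings above together, a minimal concrete metrics.properties (the pushgateway address here is an example for a local setup) could look like this:

```properties
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=127.0.0.1:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
```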

spark-submit needs to know the repository from which it downloads the jar containing PrometheusSink:

--repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases

Note: this is a Maven repo hosted on GitHub

Also, we have to specify the spark-metrics package that includes PrometheusSink and its dependent packages for spark-submit:

--packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.prometheus:simpleclient:0.0.23,io.prometheus:simpleclient_dropwizard:0.0.23,io.prometheus:simpleclient_pushgateway:0.0.23,io.dropwizard.metrics:metrics-core:3.1.2
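For reference, a complete spark-submit invocation combining the flags above might look like the following; the application class and jar names are placeholders, not part of the sink:

```shell
# Submit a Spark job with the Prometheus sink on the classpath.
# com.example.SparkApp and spark-app.jar are placeholders for your application.
spark-submit \
  --conf spark.metrics.conf=metrics.properties \
  --repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases \
  --packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.prometheus:simpleclient:0.0.23,io.prometheus:simpleclient_dropwizard:0.0.23,io.prometheus:simpleclient_pushgateway:0.0.23,io.dropwizard.metrics:metrics-core:3.1.2 \
  --class com.example.SparkApp \
  spark-app.jar
```

Once the job is running, the pushed metrics should be visible at the pushgateway’s /metrics endpoint.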

Package version

The version number of the package is formatted as: com.banzaicloud:spark-metrics_<scala version>:<spark version>-<version>. For example, spark-metrics_2.11:2.2.1-1.0.0 is version 1.0.0 of the sink, built for Scala 2.11 and Spark 2.2.1.

Conclusion

This is not the ideal scenario but it does the job, and it’s independent of the Spark codebase. At Banzai Cloud we’re still hoping to contribute this sink once the community decides it actually needs it. Meanwhile, you can open new feature requests, use this existing PR, or ask for native Prometheus support through one of our social media channels. As usual, we’re happy to help. All the software we use and create is open source, so we’re always eager to help make open source projects better.

If you’d like to contribute to making this available as an official Spark package, let us know.

In the meantime, if you have any questions, or if you’re just interested in what we’re doing, follow us and our projects on GitHub, LinkedIn or Twitter.
