
It’s been some time since we open sourced our Kafka operator, an operator designed from the ground up to take advantage of the full potential of Kafka on Kubernetes. Since then, we have built Supertubes on top of it to manage Kafka for our customers.

One of the most requested enterprise features has been rolling upgrades. Built on Banzai Cloud’s Kafka operator, Supertubes adds support for these upgrades and orchestrates them. In today’s post we take a deep dive into the technical details of how the operator handles an update, and how you can use Supertubes to seamlessly upgrade to the newest 2.4.1 version of Kafka.

Check out Supertubes in action on your own clusters:

Register for an evaluation version and run a simple install command!

curl https://getsupertubes.sh | sh
supertubes install -a --no-demo-cluster --kubeconfig <path-to-k8s-cluster-kubeconfig-file>

or read the documentation for details.

Take a look at some of the Kafka features that we’ve automated and simplified through Supertubes and the Kafka operator, which we’ve already blogged about:

tl;dr 🔗︎

  • Support for Kafka version 2.4.1 is already available.

  • Banzai Cloud Supertubes adds support for rolling upgrades with advanced dynamic configuration. With Supertubes, an upgrade takes a single command:

    supertubes cluster update --kafka-image "banzaicloud/kafka:2.13-2.4.1"
    

Motivation 🔗︎

Here at Banzai Cloud, we think there are several key considerations when it comes to operating a production-ready Kafka cluster on Kubernetes.

  • Broker Configuration
  • Monitoring
  • Reacting to Alerts
  • Graceful Cluster Scaling
  • Kafka Dynamic Configuration
  • Kafka Rolling Upgrades
  • Disaster recovery

The operator already handles some of the items on this list, and Supertubes now adds advanced Dynamic Configuration, Disaster Recovery and Rolling Upgrades. We deliberately chose to add and handle these features together because we think they go hand in hand. If you’re interested in finding out why, keep reading.

Dynamic Configuration 🔗︎

Since Kafka version 1.1, some broker configs can be updated without restarting the broker, because that release introduced a dynamic update mode for broker configs. It differentiates between three config types:

  • Read-only: requires broker restart to update, for example, zookeeper.connect
  • Per-broker: can be updated dynamically for each broker
  • Cluster-wide: can be updated dynamically as a cluster-wide default, and can also be updated as a per-broker value for testing

For further details on which configuration belongs to which type, please take a look at the official documentation.

So how does this work inside Kafka? 🔗︎

Within Kafka, only read-only configs need to be persisted to server.properties; all other types live in ZooKeeper. During startup, brokers subscribe to the relevant ZooKeeper znodes, so they receive an event whenever a broker config changes and can reconfigure themselves dynamically.

You need a Kafka cluster that’s already up and running to change a config dynamically. Kafka validates the config on submission and returns an error if the config type is not eligible for dynamic update. This means that:

  • read-only broker configs can only be set through a file, but
  • per-broker and cluster-wide configs can be configured through a file and modified dynamically.
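The eligibility rule above can be sketched in a few lines of Python. This is only an illustration, not Kafka's validation code: the `UPDATE_MODES` table and the `can_update_dynamically` helper are hypothetical names, the three keys are the examples used in this post, and treating unknown keys as read-only is an assumption made to keep the sketch conservative. The authoritative update mode of each key is listed in the Kafka documentation.

```python
# Hypothetical lookup table: config key -> update mode.
# Only the examples from this post are listed; consult the Kafka docs for the real list.
UPDATE_MODES = {
    "zookeeper.connect": "read-only",
    "sasl.enabled.mechanisms": "per-broker",
    "background.threads": "cluster-wide",
}


def can_update_dynamically(key, scope):
    """Return True if `key` may be changed at runtime for the given scope.

    `scope` is "broker" (a single broker) or "cluster" (cluster-wide default).
    Unknown keys are treated as read-only (an assumption of this sketch).
    """
    mode = UPDATE_MODES.get(key, "read-only")
    if mode == "read-only":
        return False          # needs a file change and a broker restart
    if mode == "per-broker":
        return scope == "broker"
    return True               # cluster-wide keys work as a default or per broker


print(can_update_dynamically("zookeeper.connect", "broker"))
print(can_update_dynamically("background.threads", "cluster"))
```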

Since a config value can be defined at different levels, Kafka uses the following order of precedence:

  1. Dynamic per-broker config stored in ZooKeeper
  2. Dynamic cluster-wide default config stored in ZooKeeper
  3. Static broker config from server.properties
  4. Kafka defaults
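To make the precedence order concrete, here is a minimal Python sketch that layers the four sources with `collections.ChainMap`, which looks keys up in its mappings from left to right. The `effective_config` helper and all the values are made up for illustration; this is not how Kafka stores or resolves configs internally.

```python
from collections import ChainMap


def effective_config(per_broker, cluster_default, static_props, kafka_defaults):
    """Layer the four sources so that earlier arguments win over later ones."""
    return ChainMap(per_broker, cluster_default, static_props, kafka_defaults)


# Illustrative values only.
config = effective_config(
    per_broker={"background.threads": "12"},
    cluster_default={"background.threads": "10", "log.retention.ms": "259200000"},
    static_props={"log.retention.ms": "604800000", "zookeeper.connect": "zk:2181"},
    kafka_defaults={"num.io.threads": "8"},
)

print(config["background.threads"])  # per-broker value wins: 12
print(config["log.retention.ms"])    # cluster-wide default beats the file: 259200000
print(config["num.io.threads"])      # falls through to the default: 8
```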

A typical installation writes all of these config types into the server.properties file and starts the cluster. Dynamic configuration is only used if a config changes during operation.

How do the Kafka operator and Supertubes handle dynamic configuration? 🔗︎

Inside the operator we don’t validate the Kafka config type, because doing so can easily cause backward compatibility issues across different Kafka versions. Instead, we completely redesigned our CRD to support all the Kafka config types; it is now versioned as v1beta1.

  headlessServiceEnabled: true
  zkAddresses:
    - "example-zookeepercluster-client.zookeeper:2181"
  clusterImage: "banzaicloud/kafka:2.13-2.4.0"
  readOnlyConfig: |
    auto.create.topics.enable=false
  clusterWideConfig: |
    background.threads=10
  brokers:
    - id: 0
      readOnlyConfig: |
        allow.everyone.if.no.acl.found=false
      brokerConfig:
        config: |
          sasl.enabled.mechanisms=PLAIN

If you take a closer look at the CRD, you can distinguish three different Kafka config types: readOnlyConfig, clusterWideConfig and the per-broker brokerConfig. You must take these config types into consideration when configuring a cluster.

To keep our reconcile loop idempotent, rather than copying all these configs to the server.properties file, the operator makes use of dynamic configuration from the beginning. The operator takes all config values from readOnlyConfig, creates a configmap and starts the cluster. Once the cluster becomes healthy, all the other config types are dynamically configured through ZooKeeper. This not only saves hundreds of lines of code, but also keeps the operator simple.
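The split described above can be sketched as follows. This is a simplified Python illustration, not the operator's code (the operator itself is not written in Python): `parse_properties` and `plan_broker_config` are hypothetical helpers, the field names come from the CRD fragment shown earlier, and letting a broker-level readOnlyConfig override the cluster-level one is an assumption of this sketch.

```python
def parse_properties(text):
    """Parse simple `key=value` lines, skipping blanks and `#` comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def plan_broker_config(spec, broker):
    """Split one broker's config into static and dynamic parts.

    Returns (static, cluster_wide, per_broker):
      static       -> written to server.properties via a configmap before start
      cluster_wide -> applied dynamically once the cluster is healthy
      per_broker   -> applied dynamically to this broker only
    """
    static = parse_properties(spec.get("readOnlyConfig", ""))
    # Assumption: broker-level read-only values override cluster-level ones.
    static.update(parse_properties(broker.get("readOnlyConfig", "")))
    cluster_wide = parse_properties(spec.get("clusterWideConfig", ""))
    per_broker = parse_properties(broker.get("brokerConfig", {}).get("config", ""))
    return static, cluster_wide, per_broker


# The CRD fragment from this post, as a Python dict.
spec = {
    "readOnlyConfig": "auto.create.topics.enable=false\n",
    "clusterWideConfig": "background.threads=10\n",
    "brokers": [
        {
            "id": 0,
            "readOnlyConfig": "allow.everyone.if.no.acl.found=false\n",
            "brokerConfig": {"config": "sasl.enabled.mechanisms=PLAIN\n"},
        }
    ],
}

static, cluster_wide, per_broker = plan_broker_config(spec, spec["brokers"][0])
print(static)
print(cluster_wide)
print(per_broker)
```

Because the plan is derived purely from the resource spec, computing it again yields the same result, which is the property the reconcile loop relies on.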

[Figure: Kafka dynamic configuration]

How rolling upgrades work 🔗︎

Simply put, rolling upgrade means upgrading a cluster without data loss or service disruption. We should distinguish between two rolling upgrade scenarios:

  1. a Kafka version update. Inside a Kubernetes cluster this means, essentially, using a more recent image for its pods.

  2. a Kafka read-only config change.

So how exactly does Supertubes handle rolling upgrades?

We’ve introduced a new section on rolling upgrades in the cluster CRD:

  rollingUpgradeConfig:
    failureThreshold: 1

It contains a field called failureThreshold, which defaults to 1. The threshold is the number of failures the Kafka cluster can tolerate without disruption or data loss. The operator constantly checks the cluster’s health during an update and, if the number of failures rises above the threshold, it does not proceed and instead waits for a resolution from you (the human operator). During a health check the operator verifies two things:

  • Offline replica count
  • That all replicas are in sync
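As a rough sketch of how such a gate could look, consider the Python snippet below. The `ClusterHealth` type and the `may_proceed` function are hypothetical, and counting failures as the sum of offline and out-of-sync replicas is an assumption made for illustration; the post does not specify how the operator counts failures.

```python
from dataclasses import dataclass


@dataclass
class ClusterHealth:
    offline_replicas: int       # replicas that are currently offline
    out_of_sync_replicas: int   # replicas that have fallen out of sync


def may_proceed(health, failure_threshold=1):
    """Return True if the rolling upgrade may continue with the next broker.

    Assumption: failures = offline replicas + out-of-sync replicas.
    The upgrade pauses once failures rise above the threshold.
    """
    failures = health.offline_replicas + health.out_of_sync_replicas
    return failures <= failure_threshold


print(may_proceed(ClusterHealth(offline_replicas=0, out_of_sync_replicas=0)))
print(may_proceed(ClusterHealth(offline_replicas=1, out_of_sync_replicas=1)))
```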

This check is built into the operator, but in Supertubes you can define your own checks using special alerts. We extended the built-in Prometheus Alertmanager integration, which can already react to custom alerts, to support alerts that are only processed during rolling upgrades. You create such alerts by labeling them with the rollingupgrade keyword. As a result, you can now define your own cluster health checks based on metrics.

Upgrading to a newer version of Kafka with Supertubes 🔗︎

With Supertubes, you can easily upgrade to a newer version by running the following command:

supertubes cluster update --kafka-image "banzaicloud/kafka:2.13-2.4.1"

Supertubes will initiate a rolling upgrade wherein:

  1. It waits until the cluster becomes healthy.
  2. It stops a running broker.
  3. It reconfigures that pod to use the new image.
  4. It restarts the broker with the same PVC.
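The four steps above, repeated for each broker in turn, can be sketched as a small Python loop. This is a toy model rather than Supertubes' implementation: brokers are plain dicts, `is_healthy` is a caller-supplied callback, and `max_checks` is a hypothetical bound added so the sketch cannot loop forever.

```python
def rolling_upgrade(brokers, new_image, is_healthy, max_checks=10):
    """Upgrade brokers one at a time; returns the ordered list of steps taken."""
    steps = []
    for broker in brokers:
        # 1. Wait until the cluster is healthy (bounded in this sketch).
        for _ in range(max_checks):
            if is_healthy():
                break
        else:
            raise RuntimeError("cluster unhealthy; waiting for the human operator")
        # 2. Stop the running broker.
        broker["running"] = False
        steps.append(("stop", broker["id"]))
        # 3. Reconfigure the pod to use the new image.
        broker["image"] = new_image
        steps.append(("set-image", broker["id"]))
        # 4. Restart the broker; its PVC is left untouched, so data is retained.
        broker["running"] = True
        steps.append(("start", broker["id"], broker["pvc"]))
    return steps


cluster = [
    {"id": 0, "image": "banzaicloud/kafka:2.13-2.4.0", "pvc": "kafka-0", "running": True},
    {"id": 1, "image": "banzaicloud/kafka:2.13-2.4.0", "pvc": "kafka-1", "running": True},
]
for step in rolling_upgrade(cluster, "banzaicloud/kafka:2.13-2.4.1", lambda: True):
    print(step)
```

Only one broker is down at any moment, and the next broker is not touched until the health check passes again.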

[Figure: Kafka rolling upgrade]

Conclusion 🔗︎

With Supertubes, you can easily reconfigure and upgrade your Kafka cluster without disruption or loss of data. Hosting your own managed Kafka cluster has never been easier.

About Supertubes 🔗︎

Banzai Cloud Supertubes (Supertubes) is the automation tool for setting up and operating production-ready Kafka clusters on Kubernetes, leveraging a Cloud-Native technology stack. Supertubes includes Zookeeper, the Banzai Cloud Kafka operator, Envoy, Istio and many other components that are installed, configured, and managed to operate a production-ready Kafka cluster on Kubernetes. Some of the key features are fine-grained broker configuration, scaling with rebalancing, graceful rolling upgrades, alert-based graceful scaling, monitoring, out-of-the-box mTLS with automatic certificate renewal, Kubernetes RBAC integration with Kafka ACLs, and multiple options for disaster recovery.