Kafka rolling upgrade and dynamic configuration on Kubernetes

It's been some time since we open sourced our Kafka Operator, an operator designed from square one to take advantage of the full potential of Kafka on Kubernetes. That guiding principle was what led us to use simple pods instead of StatefulSet. This blog will not detail our every design decision, so if you are interested in learning more, feel free to look at an earlier blog post about the operator.

Our most requested feature—after securing and managing Kafka using SSL—has been the implementation of graceful rolling upgrades. We are happy to announce that, from version 0.6.0 onward, Banzai Cloud's Kafka operator will support these upgrades. Accordingly, in today's blog we're going to take a deep dive into the technical details of how the operator handles an update.


As of version 0.6.0, Banzai Cloud's Kafka Operator supports graceful rolling upgrades with advanced Kafka dynamic configuration. Let's take a quick look at how it stacks up against the competition.

| Feature                                | Banzai Cloud                      | Krallistic              | Strimzi                 | Confluent               |
|----------------------------------------|-----------------------------------|-------------------------|-------------------------|-------------------------|
| Open source                            | Apache 2                          | Apache 2                | Apache 2                | No                      |
| Maintained                             | Yes                               | No (discontinued)       | Yes                     | N/A                     |
| Fine-grained broker config support     | Yes                               | Limited via StatefulSet | Limited via StatefulSet | Limited via StatefulSet |
| Fine-grained broker volume support     | Yes                               | Limited via StatefulSet | Limited via StatefulSet | Limited via StatefulSet |
| Monitoring                             | Yes                               | Yes                     | Yes                     | Yes                     |
| Encryption using SSL                   | Yes                               | Yes                     | Yes                     | Yes                     |
| Rolling updates                        | Yes                               | No                      | No                      | Yes                     |
| Cluster external accesses              | Envoy (single LB)                 | Nodeport                | Nodeport or LB/broker   | Yes (N/A)               |
| User management via CRD                | Yes                               | No                      | Yes                     | No                      |
| Topic management via CRD               | Yes                               | No                      | Yes                     | No                      |
| Reacting to alerts                     | Yes (Prometheus + Cruise Control) | No                      | No                      | No                      |
| Graceful cluster scaling (up and down) | Yes (using Cruise Control)        | No                      | No                      | Yes                     |

If you find any of this information to be inaccurate, please let us know, and we'll happily fix it.


Here at Banzai Cloud, we think there are several key requirements to gracefully operating Kafka on Kubernetes.

  • Broker Configuration
  • Monitoring
  • Reacting to Alerts
  • Graceful Cluster Scaling
  • Kafka Dynamic Configuration
  • Kafka Rolling Upgrades

Our operator already handled most of the items on that list, but until now it lacked both Dynamic Configuration and Rolling Upgrades. We added these two features together because we think they go hand in hand. If you're interested in finding out why, keep reading.

Dynamic Configuration

Since Kafka version 1.1, some broker configs can be updated without restarting the broker. That release introduced a dynamic update mode for broker configs, which differentiates between three config types:

Read-only: requires broker restart to update, e.g.: zookeeper.connect

Per-broker: can be updated dynamically for each broker

Cluster-wide: can be updated dynamically as a cluster-wide default, and can also be overridden with a per-broker value for testing

For further details on which config belongs to which type, please take a look at the official documentation.

So how does this work inside Kafka?

Within Kafka, only the read-only config needs to be persisted to the broker's static configuration file. All other types live inside ZooKeeper. Brokers subscribe to various ZooKeeper znodes during cluster startup, receive an event whenever a broker config changes, and dynamically reconfigure themselves in response.

You must have a Kafka cluster that's already up and running to dynamically change a cluster config. Kafka validates the given config during submission, which will cause errors if the config type is not eligible for dynamic update. This means that read-only broker configs can only be set through a file, but per-broker and cluster-wide configs can be configured through a file and modified dynamically.

Since a config value can be defined at different levels, Kafka applies the following order of precedence:

  • Dynamic per-broker config stored in ZooKeeper
  • Dynamic cluster-wide default config stored in ZooKeeper
  • Static broker config from the broker's configuration file
  • Kafka defaults
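
The precedence order above can be sketched as a layered lookup. This is a minimal illustration in Python, not how Kafka implements it internally; the function name and the sample values are assumptions made for the example.

```python
from collections import ChainMap


def effective_config(per_broker, cluster_default, static_file, kafka_defaults):
    """Resolve every key using the precedence order listed above."""
    # ChainMap searches its mappings left to right, so the first hit wins.
    return dict(ChainMap(per_broker, cluster_default, static_file, kafka_defaults))


# Illustrative values only; they are not recommendations.
resolved = effective_config(
    per_broker={"log.cleaner.threads": "4"},
    cluster_default={"log.cleaner.threads": "2", "background.threads": "12"},
    static_file={"background.threads": "10", "num.io.threads": "16"},
    kafka_defaults={"log.cleaner.threads": "1", "num.io.threads": "8"},
)
print(resolved)
```

Here the per-broker value wins for log.cleaner.threads, the cluster-wide default wins for background.threads, and the static file wins over the Kafka default for num.io.threads.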

A typical installation puts all of these config types into the static configuration file and starts the cluster. Dynamic configuration is only used in the event of a config change during operation.

How does the operator handle dynamic configuration?

Inside the operator we don't validate the Kafka config type, because doing so could easily cause backward-compatibility issues across Kafka versions. Instead, we completely redesigned our CRD, which is now versioned as v1beta1 and supports all the Kafka config types.

  headlessServiceEnabled: true
    - "example-zookeepercluster-client.zookeeper:2181"
  clusterImage: "wurstmeister/kafka:2.12-2.1.0"
  readOnlyConfig: |
  clusterWideConfig: |
    - id: 0
      readOnlyConfig: |
        config: |

If we take a closer look at the CRD, it's possible to distinguish between three different Kafka config types. During cluster configuration the admin must take these config types into consideration.

To keep our reconcile loop idempotent, rather than copying all these configs to the static configuration file, the operator makes use of dynamic configuration from the beginning. The operator takes all config values from readOnlyConfig, creates a ConfigMap and starts the cluster. Once the cluster becomes healthy, all the other config types are configured dynamically through ZooKeeper. This not only saves hundreds of lines of code, but also keeps the operator simple.
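
The split described above can be sketched roughly as follows. This is a simplified Python illustration, not the operator's actual code: the spec is flattened into a plain dictionary, the "brokers" key and the helper names are assumptions, and it is also an assumption that broker-level read-only values override cluster-level ones.

```python
def parse_properties(text):
    """Parse 'key=value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def plan_configuration(spec):
    """Split a simplified cluster spec into static and dynamic settings."""
    cluster_read_only = parse_properties(spec.get("readOnlyConfig", ""))
    plan = {
        "static": {},  # per broker: rendered into a ConfigMap before startup
        "cluster_wide": parse_properties(spec.get("clusterWideConfig", "")),
        "per_broker": {},  # per broker: applied dynamically once healthy
    }
    for broker in spec.get("brokers", []):
        broker_id = broker["id"]
        # Assumption: broker-level read-only values override cluster-level ones.
        overrides = parse_properties(broker.get("readOnlyConfig", ""))
        plan["static"][broker_id] = {**cluster_read_only, **overrides}
        plan["per_broker"][broker_id] = parse_properties(broker.get("config", ""))
    return plan


# Hypothetical spec with illustrative values.
example_spec = {
    "readOnlyConfig": "auto.create.topics.enable=false\n",
    "clusterWideConfig": "background.threads=2\n",
    "brokers": [
        {
            "id": 0,
            "readOnlyConfig": "auto.create.topics.enable=true\n",
            "config": "log.cleaner.threads=2\n",
        },
    ],
}
plan = plan_configuration(example_spec)
print(plan)
```

Only the "static" part has to exist before a broker starts; the other two parts can be applied, and re-applied, at any point while the cluster is running.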


Rolling upgrades and how they work

Simply put, a rolling upgrade means upgrading a cluster without data loss or service disruption. We distinguish between two rolling upgrade scenarios:

  1. A Kafka version update. Inside a Kubernetes cluster this means, essentially, using a more recent image for its pods.

  2. A Kafka read-only config change.
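
To make the two scenarios concrete, here is a minimal Python sketch that classifies a change between two versions of a cluster spec. The function name and return values are hypothetical; the field names are the ones used in the CRD excerpt earlier in this post.

```python
def plan_change(old_spec, new_spec):
    """Classify a spec change as 'rolling-upgrade', 'dynamic-update' or 'no-op'."""
    # A new image or a read-only config change requires restarting brokers.
    restart_fields = ("clusterImage", "readOnlyConfig")
    if any(old_spec.get(f) != new_spec.get(f) for f in restart_fields):
        return "rolling-upgrade"
    # Other config types can be changed while the brokers keep running.
    if old_spec.get("clusterWideConfig") != new_spec.get("clusterWideConfig"):
        return "dynamic-update"
    return "no-op"


old = {"clusterImage": "wurstmeister/kafka:2.12-2.1.0", "readOnlyConfig": ""}
print(plan_change(old, {**old, "clusterImage": "example/kafka:newer"}))
```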

So how exactly does our operator handle rolling upgrades?

We've introduced a new section on rolling upgrades in the cluster CRD:

    failureThreshold: 1

It contains a field called failureThreshold, for which the default value is 1. This threshold is the number of failures the Kafka cluster can tolerate without disruption or data loss. The operator constantly checks the cluster's health during an update and, if the number of failures rises above the threshold, it does not proceed and waits for a resolution from the user (the human operator). During a health check the operator verifies two things:

  • Offline replica count
  • That all replicas are in sync
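
A health gate of this kind might look like the following Python sketch. How the operator actually counts failures is not spelled out here, so treating each offline or out-of-sync replica as one failure is an assumption, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class ClusterHealth:
    offline_replicas: int
    out_of_sync_replicas: int


def may_proceed(health, failure_threshold=1):
    """Return True if the upgrade may continue with the next broker."""
    # Assumption: each offline or out-of-sync replica counts as one failure.
    failures = health.offline_replicas + health.out_of_sync_replicas
    # The upgrade stops only when failures rise above the threshold.
    return failures <= failure_threshold


print(may_proceed(ClusterHealth(offline_replicas=0, out_of_sync_replicas=0)))
```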

This check is built into the operator, but we wanted to give users the ability to define their own checks as well. To do that, we introduced special alerts. We extended our Prometheus Alert Manager, which can react to custom alerts, to support alerts that are only processed during rolling upgrades. To create such an alert, label it with the rollingupgrade keyword. As a result, users can now define their own cluster health checks based on metrics.
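
As a rough illustration, the Python sketch below picks the relevant alerts out of a payload shaped like an Alertmanager webhook notification (an "alerts" list whose entries carry "status" and "labels"). It assumes that rollingupgrade appears as a label key; the exact label layout the operator expects is not shown here, and the function name is hypothetical.

```python
def rolling_upgrade_alerts(payload, label="rollingupgrade"):
    """Return the names of firing alerts that carry the rolling-upgrade label."""
    names = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") == "firing" and label in labels:
            names.append(labels.get("alertname", "<unnamed>"))
    return names


# Hypothetical payload with made-up alert names.
payload = {
    "alerts": [
        {"status": "firing", "labels": {"alertname": "HighLag", "rollingupgrade": "true"}},
        {"status": "firing", "labels": {"alertname": "DiskFull"}},
        {"status": "resolved", "labels": {"alertname": "OldLag", "rollingupgrade": "true"}},
    ]
}
print(rolling_upgrade_alerts(payload))
```

Alerts without the label are ignored by this filter, so they would not block an upgrade in this sketch.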

Upgrading to a newer version of Kafka

When upgrading, users change the clusterImage field in the CR. Then the operator will initiate a rolling upgrade wherein:

  • It waits until the cluster becomes healthy
  • It stops a running broker
  • It reconfigures that pod to use the new image
  • It restarts the broker with the same PVC
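
The steps above can be sketched as a simple loop. This Python version is only an illustration under stated assumptions: the callables are hypothetical stand-ins for the Kubernetes operations, and where the real operator waits for the cluster to become healthy, this sketch simply stops.

```python
def rolling_upgrade(broker_ids, new_image, is_healthy, stop, set_image, start):
    """Upgrade brokers one at a time; return the ids that were upgraded."""
    upgraded = []
    for broker_id in broker_ids:
        if not is_healthy():
            # The operator would wait here; this sketch stops instead.
            break
        stop(broker_id)
        set_image(broker_id, new_image)
        start(broker_id)  # the broker comes back with the same PVC
        upgraded.append(broker_id)
    return upgraded


# Usage with an in-memory stand-in for the cluster.
images = {0: "kafka:old", 1: "kafka:old", 2: "kafka:old"}
running = {0: True, 1: True, 2: True}
done = rolling_upgrade(
    broker_ids=sorted(images),
    new_image="kafka:new",
    is_healthy=lambda: all(running.values()),
    stop=lambda b: running.update({b: False}),
    set_image=lambda b, image: images.update({b: image}),
    start=lambda b: running.update({b: True}),
)
print(done, images)
```

Because only one broker is down at any time and the health check runs before each step, a problem with one broker prevents the remaining brokers from being touched.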



With this new version of our operator, users can easily reconfigure and upgrade their Kafka cluster without disruption or loss of data. Hosting your own managed Kafka cluster has never been easier.

We received some complaints about our CRD, chiefly that it required too much copying and pasting when creating a heterogeneous cluster. We listen to our customers and community, so during the CRD redesign we tried to address these complaints. We introduced brokerConfigGroups, which can be used as a template during broker configuration.
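
The template idea can be illustrated with a small Python sketch. The dictionary layout below ("group", "overrides") and the sample values are hypothetical and do not reflect the actual CRD schema; the sketch only shows how a shared template plus per-broker overrides removes repetition.

```python
def expand_brokers(config_groups, brokers):
    """Build each broker's settings from its group template plus overrides."""
    expanded = []
    for broker in brokers:
        template = config_groups.get(broker.get("group"), {})
        # Broker-specific values take priority over the template.
        settings = {**template, **broker.get("overrides", {})}
        expanded.append({"id": broker["id"], **settings})
    return expanded


# Hypothetical groups and brokers.
groups = {"default": {"image": "kafka:a", "storage": "10Gi"}}
brokers = [
    {"id": 0, "group": "default"},
    {"id": 1, "group": "default", "overrides": {"storage": "50Gi"}},
]
print(expand_brokers(groups, brokers))
```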

About Banzai Cloud

Banzai Cloud is changing how private clouds are built: simplifying the development, deployment, and scaling of complex applications, and putting the power of Kubernetes and Cloud Native technologies in the hands of developers and enterprises, everywhere.
