Banzai Cloud Logo Close
Home Products Benefits Blog Company Contact
Sign in

Loki, the tail and grep for Kubernetes

At Banzai Cloud we are passionate about observability, and we expend a great amount of effort to make sure we always know what’s happening inside our Kubernetes clusters.

All clusters provisioned with Pipeline - our multi- and hybrid-cloud container management platform - are provided with, and rely upon, each of the three pillars of observability: federated monitoring, centralized log collection and traces. In order to automate log collection on Kubernetes, we opensourced a logging-operator built on the Fluent ecosystem. If you’d like to brush up on that subject, feel free to browse through our previous posts.

In most cases, especially when running in-production Kubernetes on our CNCF certified Kubernetes distribution, PKE, logs end up in supported third party systems, like Elasticsearch. While this is perfectly fine in a production environment, it might not always be advisable for internal tests, UAT and dev environments; our customers run these a lot, and, when they do, they need to iterate fast and frequently. We’ve added an additional layer of distributed microservices, running in a service mesh that’s orchestrated by our Istio operator and Backyards, the Banzai Cloud automated service mesh, to allow for the inscreasingly nuanced analysis of pre-production logs and traces. This was done in response to users and customers asking for a more developer-friendly solution to the problem of reading and filtering centralized logs.

tl;dr:

Loki Architecutre

“Just give me my log files and grep

Gathering and filtering logs in Kubernetes is not as straightforward as it could be. In a microservice oriented environment there may be hundreds of pods with multiple versions of the same service. There exist simple command line tools, like stern, that fetch multiple logs from multiple containers, however, these only work on running pods, and there is no archive if a log is rotated. Developers have been starving for a simple tool that fits the context we’ve described. There are some heavy weight production ready solutions like Elasticsearch (mentioned above) or cloud-based services for analytics, but, in simpler use cases, these represent a kind of overkill; you don’t want to have to set up indexers, workers, and everything else, just for tailing log files.

Here’s when Loki comes out of the shadows. For starters, let’s take a look at an example of debugging a workflow. As mentioned previously, all Pipeline clusters and deployments come with out-of-the-box Prometheus-based monitoring, centralized log collection via our logging-operator and insights into applications deployed on service meshes orchestrated by our Istio operator or by Backyards.

1. We receive an incoming alert from Prometheus

Alertmanager

2. We check the Prometheus query and narrow down the problem to one or more pods

Alertmanager

3. Once we have a pod’s name, we can fetch the relevant logs i.e:

$ kubectl logs cp-pipeline-c956fb7ff-9tr8q -c pipeline |grep 'cluster=<cluster_name>'

The first two steps were routine operational tasks that have been used in production for a long time. However, the last step, the gathering of the logs of specific pods, can be tricky depending on your environment:

  • As in the example above, you can accomplish this with kubectl: kubectl logs -l key=value | grep XYX
  • If you already have Elasticsearch installed, you can use Kibana or the Elasticsearch API to grab logs
  • And you can use a custom toolchain if your logs are stored in object stores, like S3

Loki

Loki sells itself by promising “Prometheus-inspired logging for cloud natives.” It does that and much more! Loki is a simple yet powerful tool that facilitates the collection of logs from Kubernetes pods. At its most basic level, Loki works by receiving log lines enriched with labels. Users can then query Loki for the logs, which are filtered via their labels and according to time-range. Simple. So let’s dig into the details.

Architecture

Loki has two major components, Promtail, to gather logs and attach labels, and the Loki server, which stores these logs and provides an interface for querying them. This works nicely in a sandbox environment, but not as well in production. Loki solved this problem by adapting a production-ready architecture that’s similar to Cortex, a choice made after the experience of writing that project.

Loki Architecutre

As you can see, the data-ingest and querier flows are separate. Data-ingest begins with log collectors. These read logs and transfer data to a distributor. A distributor behaves like a router. It uses consistent hashing to assign log streams to ingesters. Ingesters create chunks from the different log streams - based on labels - and gzip them. These chunks are regularly flushed to the server, where storage backends for chunks and indexes tend to differ. Chunks usually end up in some kind of object storage (GCS, S3, …), while indexes wind up in key-value stores like Cassandra, Bigtable, and DynamoDB. Loki uses Bolt db for indexes and chunks are stored in files.

In comparison, the querier way of doing things is much simpler. This method is based on time-range and label selectors that query indexes and matching chunks. This method also talks to the ingester for any recent data that might not have been flushed.

Connecting Loki to existing log flows

Promtail is a great tool when using exclusively Loki for your logging pipeline. However, in a production or even developer environment, this is not usually the case. Typically, you’ll want to transform your logs, measure metrics, archive raw data, ingest to other analyzer tools, etc. And, because fluentd and fluent-bit already provide a powerful logging pipeline, it seems unnecessary to try and shoe-horn Loki into such an environment.

A fluent-plugin-grafana-loki plugin exists in the official repository, but this is a general purpose tool, which lacks the necessary Kubernetes support. However, writing (or in this case extending) fluentd plugins is relatively easy, so Banzai Cloud has opensourced one: fluent-plugin-kubernetes-loki

Loki Architecutre

Fluent-plugin-kubernetes-loki

In order to fit Loki into our existing flows, we forked the official fluentd output plugin and enriched it with labeling features. The problem was that fluent-bit already attached metadata to log records, as you can see from the following snippet:

{
   "log":"10.1.0.1 - - [13/Mar/2019:15:42:31 +0000] \"GET / HTTP/1.1\" 200 612 \"-\" \"kube-probe/1.13\" \"-\"\n",
   "stream":"stdout",
   "time":"2019-03-13T15:42:31.308708572Z",
   "kubernetes":{
      "pod_name":"understood-butterfly-nginx-logging-demo-7dcdcfdcd7-h7p9n",
      "namespace_name":"default",
      "pod_id":"1f50d309-45a6-11e9-b795-025000000001",
      "labels":{
         "app":"nginx-logging-demo",
         "pod-template-hash":"7dcdcfdcd7",
         "release":"understood-butterfly"
      },
      "host":"docker-desktop",
      "container_name":"nginx-logging-demo",
      "docker_id":"3a38148aa37aa30e6e2df96af95cbda7a47b0428689bb4152413f4be25532fda"
   }
}

Sadly, this metadata is not Loki compatible. Our plugin remedies this by grabbing information from the kubernetes key and transforming it into Loki labels. However, some special characters are forbidden in Loki’s label fields, such as . and -, so these are transformed to _. Additionally, label names were not initially Prometheus compatible, and it was crucial to our set up that we use the same label keys Prometheus did. Our first version introduced mappings that looked like this:

transform_map = {
    "namespace_name" => "namespace",
    "pod_name" => "pod",
    "docker_id" => "container_id",
    "container" => "container",
    "annotations" => nil,
}

Note: a nil value means that keys with that value are to be deleted from the record map. Annotations usually have too many special characters and too much information in them to fit filtering.

Note that this plugin is a work in progress, and we add new features frequently. Although we have a lot of our own ideas on how to improve this plugin and make it a turnkey solution for Fluentd and Loki, we also want to ask you - our open source community of users and customers - what you would like to see. Please open an issue or submit a pull-request with your ideas on our Github.

If you want to learn more about Loki, visit the official introduction blog post.

LogQL - the Loki query language

Loki’s default frontend is Grafana, and you’ll need to set-up the Explore section to connect with Loki before it’s ready. Now, in order to browse logs, we use any one of a few basic queries.

A log query consists of two parts: a log stream selector, and a filter expression. For performance reasons, begin by choosing a set of log streams, using a Prometheus-style log stream selector.

Loki comes with a simple yet powerful REST API

The log stream selector will reduce the number of log streams to a manageable volume, then the regex search expression will perform a distributed grep over those log streams.

The following examples are from the official documentation

Log Stream Selector

Wrap the label component of the query expression in curly brackets, {}, and then use the key value syntax to select your labels. Label expressions are separated by commas:

{app="mysql",name="mysql-backup"}

Loki currently supports the following label matching operators:

Operator  Action
= exactly equal
!= not equal
=~ regex-match
!~ do not regex-match

Example operator usage:

{name=~"mysql.+"}
{name!~"mysql.+"}

The same rules that apply for Prometheus Label Selectors apply to Loki Log Stream Selectors. After writing a Log Stream Selector, you can further filter results by writing a search expression. Search expressions can either be text or regex expressions.

Example queries:

{job="mysql"} |= "error"
{name="kafka"} |~ "tsdb-ops.*io:2003"
{instance=~"kafka-[23]",name="kafka"} != kafka.server:type=ReplicaManager

Filter operators can be chained and will sequentially filter down the expression - resulting log lines will satisfy every filter. Eg:

{job="mysql"} |= "error" != "timeout"

The following filter types have been implemented:

Filter  Action
|= line contains string
!= line does not contain string
|~ line matches regular expression
!~ line does not match regular expression

Using Loki with the Banzai Cloud logging-operator

Let’s take a look at a practical example. First, install the logging-operator. This only installs the controller.

helm repo add banzaicloud-stable https://kubernetes-charts.banzaicloud.com
helm repo update
helm install banzaicloud-stable/logging-operator

Due to some Helm issues with CRDs, we have separated out fluentd and fluent-bit resources and put them in another chart. That way we can ensure that all the CRDs are ready and that their fluentd and fluent-bit components are safely installed.

helm install banzaicloud-stable/logging-operator-fluent

Our logging pipeline is ready, we may now define inputs and outputs. Before we do that, let’s install Loki.

Add Loki Chart repo

helm repo add loki https://grafana.github.io/loki/charts
helm repo update

Install Loki

helm install --name loki loki/loki

Install Grafana with Loki datasource

helm install --name grafana stable/grafana \
 --set "datasources.datasources\\.yaml.apiVersion=1" \
 --set "datasources.datasources\\.yaml.datasources[0].name=Loki" \
 --set "datasources.datasources\\.yaml.datasources[0].type=loki" \
 --set "datasources.datasources\\.yaml.datasources[0].url=http://loki:3100" \
 --set "datasources.datasources\\.yaml.datasources[0].access=proxy"

Get Grafana login credantials

kubectl secrets grafana -o json | jq '.data | map_values(@base64d)'

Grafana Port-forward

kubectl port-forward svc/grafana 3000:80

If you want to install Prometheus as well, just add the --set prometheus.enabled=true parameter. Now we’re ready to add the logging-operator CR and start log collecting.

apiVersion: "logging.banzaicloud.com/v1alpha1"
kind: "Plugin"
metadata:
  name: "loki-demo"
spec:
  input:
    label:
      app: "nginx-logging-demo"
  output:
    - type: "loki"
      name: "loki-demo"
      parameters:
        - name: url
          value: "http://loki:3100"
        - name: extraLabels
          value: "{\"env\":\"demo\"}"
        - name: flushInterval
          value: "10s"
        - name: bufferChunkLimit
          value: "1m"

That’s it! If you’ve done everything right, you should be ready to browse your logs in Grafana.

Loki Grafana

Install log-cli

Loki provides another tool to fetch your logs called log-cli. This command line tool is really handy if you are not browsing logs but want to run exact queries.

go get github.com/grafana/loki/cmd/logcli

If you’re running Loki without ingress enabled, you can simply port-forward to the pod.

kubectl port-forward svc/loki-loki 3100

You can run the same expressions as with the Grafana interface.

logcli --addr="http://127.0.0.1:3100" query '{env="demo"} |= "error"'

Note: you need to extend your $PATH with the go bin location

Logging operator v2

Development doesn’t stop here; we’re constantly working to improve the logging-operator on the basis of feature requests made by our ops team and from recent customers. Here are some of those features:

  • No limitations on label selectors
  • Namespaced and Global resource scopes
  • Visualised logging flows
  • Secure output credential management
  • Multi output log flows

Stay tuned for our next blog post about logging-operator version 2!

About Pipeline

Banzai Cloud’s Pipeline provides a platform which allows enterprises to develop, deploy and scale container-based applications. It leverages best-of-breed cloud components, such as Kubernetes, to create a highly productive, yet flexible environment for developers and operations teams alike. Strong security measures—multiple authentication backends, fine-grained authorization, dynamic secret management, automated secure communications between components using TLS, vulnerability scans, static code analysis, CI/CD, etc.—are a tier zero feature of the Pipeline platform, which we strive to automate and enable for all enterprises.

If you’re interested in our technology and open source projects, follow us on GitHub, LinkedIn or Twitter:


Comments

comments powered by Disqus