Fun with logging — Part 1: Flows

Bryan
Nov 12, 2020

This is a two-part series on how we created an automated, centralized logging system at Renovo using BanzaiCloud’s Fluentd operator. Our layout for domains and deployments ends up being quite a bit more complex than the average installation.

We currently have 5 Renovo domains (like prod, demo, etc.), 1 developer domain, and a collection of customer domains that get increasingly more complex the closer they get to the 5G / Wavelength deployments.

Here I will lay out our journey to centralized logging in two chapters:

  • Flows: this will include how we set up ES + Fluentd on k8s and how some of the complicated parts work, plus some logging data samples and configuration items that aren’t easy to figure out.
  • Kibana: finally, how we used some Python to create “rubber stamped” searches, visualizations, and dashboards using the Kibana API.

Most of the tooling used here is documented well enough that the parts are there to figure out what you need at a high level. This is all fine and great until you start getting into more complicated orchestrations and hitting the limits of the documentation. My hope is that the code samples and notes here can help you expand your offerings in a similar way and make your world more complicated. But also easier. ;)

Our layout

The DevOps team at Renovo has been busy building isolated environments for use by 3 types of users:

  • Renovo internal: these are environments for release validation (stage), vehicle validation (dev), and sales demos.
  • Developers: environments that give developers a safe, consistent place to test changes.
  • Customers: environments used by customer vehicle fleets.

Each environment consists of a few parts:

  • Secrets stored in AWS::SecretsManager
  • s3 buckets and associated IAM users and roles
  • Route53 hosted zone
  • k8s namespace based on a pattern: ${APP}-${ENV_NAME}-${INT}. For example: app-prod-1
  • ALBs and NLBs for routing traffic to pods

All of this put together allows us to isolate environment data and software releases in a clean, smart way. But more importantly this also makes everything very easy to create. Creating a brand new environment and deploying the full payload of software now takes less than 10 minutes. We’ve gotten quite good at creating new things.

Outputs
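
Quick note before we dive in: none of the Flow or Output resources below do anything until the operator has a Logging resource standing up Fluentd and Fluent Bit. A minimal sketch, assuming the v1beta1 CRDs (the name and controlNamespace here are illustrative):

# Minimal Logging resource for the logging operator; names are illustrative.
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: default-logging
spec:
  controlNamespace: logging   # namespace where the operator runs Fluentd / Fluent Bit
  fluentd: {}                 # default Fluentd StatefulSet (the aggregator)
  fluentbit: {}               # default Fluent Bit DaemonSet (node-level collector)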

Outputs are probably the easiest part of this. Let’s take a look at a sample output for a namespace.
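
Something along these lines, assuming an in-cluster Elasticsearch and the operator’s v1beta1 Output CRD; the host, credentials secret, and index prefix here are all illustrative:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: es-output
  namespace: app-dev-1        # one Output per environment namespace
spec:
  elasticsearch:
    host: elasticsearch-master.logging.svc.cluster.local   # illustrative host
    port: 9200
    scheme: https
    ssl_verify: false
    user: elastic             # illustrative credentials, pulled from a secret
    password:
      valueFrom:
        secretKeyRef:
          name: es-credentials
          key: password
    logstash_format: true
    logstash_prefix: CLUSTER_NAME-app-dev-1   # lines up with the index names used in Timelion later
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true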

This creates the output channel for the flows that we’re about to make.

Now let’s create a flow to start catching the InfluxDB logs. Unfortunately, Influx logs are a mess. You can set the logging output to JSON if you have the enterprise license; otherwise you’re stuck with two types of very gross output.

The first type of output looks like this:

[httpd] 172.20.94.119 - admin [11/Nov/2020:14:05:15 +0000] "POST /write?consistency=one&db=DB_NAME&p=%5BREDACTED%5D&precision=n&rp=autogen&u=admin HTTP/1.1" 204 0 "-" "okhttp/3.10.0" CHECKSUM 2953

This is very unfortunate for us because of the [httpd] in front of the line. Without that, we could probably just use the apache2 logging type and be done with it. Apparently Influx likes to keep us on our toes. (It turns out there is another way of forcing the LOGGING_FORMAT to json, but it’s super ugly and not easily maintainable.) The second type of output looks like this:

ts=2020-11-11T14:09:39.259900Z lvl=info msg="Executing query" log_id=0QD63mG0000 service=query query="SHOW DATABASES"
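
For the [httpd] lines, a flow shaped roughly like this does the job. This is a sketch assuming the v1beta1 Flow CRD; the label selector and the influxdb. key prefix are illustrative, and the regexp is built against the sample line above. (The second, logfmt-style lines would need their own parser.)

apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: influxdb
  namespace: app-dev-1
spec:
  match:
    - select:
        labels:
          app: influxdb                  # illustrative selector
  filters:
    - parser:
        remove_key_name_field: true
        reserve_data: true
        inject_key_prefix: "influxdb."   # fields land in ES as influxdb.* (used by Timelion later)
        parse:
          type: regexp
          expression: '/^\[httpd\] (?<host>[^ ]+) - (?<user>[^ ]+) \[(?<dt>[^\]]+)\] "(?<method>\S+) (?<path>\S+)[^"]*" (?<code>\d+) (?<size>\d+) "(?<referer>[^"]*)" "(?<response>[^"]*)" (?<checksum>\S+) (?<runtime>\d+)$/'
          types: host:string,user:string,dt:string,method:string,path:string,code:integer,size:integer,referer:string,response:string,checksum:string,runtime:integer
  localOutputRefs:
    - es-output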

Take specific note of this line:

types: host:string,user:string,dt:string,method:string,path:string,code:integer,size:integer,referer:string,response:string,checksum:string,runtime:integer

This is how we force the types in the ES index, which we’ll use later on to drive some interesting stats about InfluxDB performance.
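
To make that concrete, the sample [httpd] line above comes out the other side looking roughly like this (using the illustrative influxdb. prefix from the flow sketch):

# Approximate parsed record for the sample [httpd] line
influxdb.host: "172.20.94.119"
influxdb.user: "admin"
influxdb.dt: "11/Nov/2020:14:05:15 +0000"
influxdb.method: "POST"
influxdb.path: "/write?consistency=one&db=DB_NAME&p=%5BREDACTED%5D&precision=n&rp=autogen&u=admin"
influxdb.code: 204      # an integer, not a string, thanks to the types line
influxdb.size: 0
influxdb.referer: "-"
influxdb.response: "okhttp/3.10.0"
influxdb.checksum: "CHECKSUM"
influxdb.runtime: 2953  # an integer, so ES can sum and average it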

Here is another example of a rather complicated bit of logging data:

Received batch 1604927354542000 of size 30 from renovo4/QUEUE for ping_data

or

Received batch 1604927354542000 of size 30 from renovo4/QUEUE

We can use this flow to catch either line:
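
A sketch again, assuming the same v1beta1 CRDs; the selector and the telemetry_postgres. prefix are illustrative (the prefix is what the Timelion fields below key off of). The optional group at the end of the regexp is what lets one expression catch both variants:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: telemetry-postgres
  namespace: app-dev-1
spec:
  match:
    - select:
        labels:
          app: telemetry-postgres        # illustrative selector
  filters:
    - parser:
        remove_key_name_field: true
        reserve_data: true
        inject_key_prefix: "telemetry_postgres."
        parse:
          type: regexp
          # (?: for (?<target>\S+))? is optional, so lines with and without a target both match
          expression: '/^Received batch (?<batch_id>\d+) of size (?<size>\d+) from (?<queue>\S+)(?: for (?<target>\S+))?$/'
          types: batch_id:string,size:integer,queue:string,target:string
  localOutputRefs:
    - es-output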

Forcing types again, which will help us later:

types: batch_id:string,size:integer,queue:string,target:string

Part 2 covers this in more detail, but for now let’s look at a quick and dirty implementation of a Timelion visualization for this data:

.es(index=CLUSTER_NAME-app-dev-1,
timefield='time',
split="telemetry_postgres.queue.keyword:5",
metric='avg:telemetry_postgres.size')
.label("Queue: $1", "^.* > telemetry_postgres.queue.keyword:(\S+) > .*")

And here’s what this looks like:

Timelion view

InfluxDB response time:

.es(index={{ index_name }},
q="influxdb.path:\/write",
timefield='time',
split="influxdb_path_uri.database.keyword:5",
metric='sum:influxdb.runtime')
.label("DB: $1", "^.* > influxdb_path_uri.database.keyword:(\S+) > .*")
.title('InfluxDB Request Time')

InfluxDB response time

So there we go: we were able to take logging data that was in a less-than-optimal format and turn it into something beautiful, or at the very least, useful.
