Optimizing Kubernetes Log Aggregation: Tackling Fluent Bit Buffering and Backpressure Challenges
By Sufiyan Ghori
At Artera, we rely on Kubernetes (EKS) clusters to run applications, with Fluent Bit handling log forwarding to DataDog for monitoring and analysis. Recently, we faced a tricky issue: logs from a specific production Kubernetes job were not appearing in DataDog.
This blog post dives into our technical journey to identify and resolve the root cause, sharing insights along the way.
Mysterious Disappearance of Logs
The Platform team received a ticket from developers reporting that logs from the production ai-deploy-job Kubernetes job were missing from DataDog. These logs are critical to Artera’s operations, and their absence limited the developers’ ability to troubleshoot job failures.
Interestingly, logs for all other jobs and pods were available; only the ai-deploy-job logs were sporadically missing — on some runs of the job they appeared, while on others they did not.
Before diving into troubleshooting, let’s review how our logging pipeline operates. Fluent Bit agents run on each Kubernetes node, collecting and forwarding logs to DataDog, as depicted below.
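Conceptually, the pipeline boils down to a tail input that reads container log files on each node, a Kubernetes filter that enriches the records, and a DataDog output that ships them. A minimal sketch of that flow (illustrative only, not our exact production configuration; the API key is a placeholder):
[SERVICE]
    # Flush buffered records to the outputs every second.
    Flush   1
[INPUT]
    # Tail every container log file on the node.
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*
[FILTER]
    # Enrich each record with pod and namespace metadata from the Kubernetes API.
    Name    kubernetes
    Match   kube.*
[OUTPUT]
    # Ship enriched records to DataDog.
    Name    datadog
    Match   kube.*
    apikey  <your-datadog-api-key>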
Initial Hypothesis
At first glance, it seemed that the logs were not being collected at all. The ai-deploy-job was running as expected, and there were no obvious errors in the Kubernetes or application logs.
To investigate, we exec’d into the Fluent Bit pod to check if logs were being collected for the ai-deploy-job.
$ sudo grep ai-deploy-job
fluent-bit-aws-for-fluent-bit-tr6kk_logging_aws-for-fluent-bit-26bce4209c8589fb7e109c59d5b9e462645f48126c6a26978d880333373dfee2.log:2024-08-23T18:37:11.314851648Z stderr F [2024/08/23 18:37:11] [ info] [input:tail:tail.1] inotify_fs_add(): inode=139553828 watch_fd=79226 name=/var/log/containers/ai-deploy-jobtllj6-xhrq6_ldt_ai-deploy-job-bcca3f789ced1cf57963aa9de2e377d415d5647edfd7ca3c5faafb470e61d62a.log
The log files for the job were there, which confirmed that Fluent Bit was indeed detecting and attempting to process the logs.
Our next hunch was that backpressure might be causing the logs to be dropped. Backpressure occurs when the rate of data ingestion exceeds the rate at which it can be processed, leading to increased memory usage and potential data loss.
Understanding Backpressure and Memory-Based Buffering
To see how backpressure affects Fluent Bit, we need a basic understanding of the Fluent Bit data pipeline. The pipeline begins at the input stage, gathering data from sources like log files or system metrics. One of the critical stages is the buffer, where data flow is managed to ensure consistent log processing and forwarding.
The buffer operates in two modes:
- In-Memory Buffering: This default mode stores data chunks (typically 2MB each) in memory for fast access and low latency. While efficient, in-memory buffering risks high memory usage under heavy load, as data backs up faster than it can be processed or forwarded.
- Filesystem-Based Buffering: Writes data to disk, offering a more stable storage solution that helps prevent data loss in the event of system failures or restarts.
How Backpressure Leads to Data Loss
By default, Fluent Bit uses in-memory buffering. As data accumulates in memory faster than it can be processed and forwarded — often due to high ingestion rates or network latency — memory usage increases.
When memory limits (determined by mem_buf_limit) are hit, Fluent Bit temporarily pauses the input plugin to halt new data intake until memory is freed. This pause, however, results in missing data as any logs generated during this time aren’t captured, creating gaps in log data.
Note that if memory consumption continues to grow, the system may reach an Out-of-Memory (OOM) state, where the kernel terminates Fluent Bit to reclaim resources. This crash not only disrupts log processing but can also result in lost data due to the restart.
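For context, this memory limit is configured per input plugin through Mem_Buf_Limit. A minimal sketch of a tail input with an explicit cap (the 50MB value is illustrative, not what we run in production):
[INPUT]
    Name           tail
    Tag            kube.*
    Path           /var/log/containers/*.log
    # Default in-memory buffering: once roughly 50MB of chunks are held in memory,
    # this input is paused until enough chunks have been flushed.
    Mem_Buf_Limit  50MB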
To confirm backpressure as the root cause, we examined Fluent Bit logs, and observed entries like:
[input] tail.1 paused (mem buf overlimit)
[input] tail.1 resume (mem buf overlimit)
These entries validated our theory that Fluent Bit was indeed pausing log ingestion due to memory constraints.
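A quick way to spot these pauses across the cluster is to grep the Fluent Bit pod logs directly. The namespace and label selector below are assumptions based on our aws-for-fluent-bit DaemonSet; adjust them to your setup:
# Search the last hour of Fluent Bit logs on every node for pause events.
# Add --max-log-requests if the DaemonSet has many pods.
kubectl logs -n logging -l app.kubernetes.io/name=aws-for-fluent-bit \
  --since=1h --prefix | grep -i "mem buf overlimit"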
Implementing Filesystem-Based Buffering to Mitigate Backpressure
To address backpressure, we enabled Fluent Bit’s filesystem-based buffering, which helps prevent data loss by offloading data to disk when memory limits are reached. Here’s how it works:
- Fluent Bit holds active chunks in memory for quick access, while sending overflow chunks to disk, to achieve a balance between performance and stability.
- When memory usage reaches the storage.max_chunks_up threshold (default: 128 chunks), new data is redirected to disk, preventing pauses in the input plugin and reducing the risk of log loss.
- Setting storage.pause_on_chunks_overlimit to off ensures that Fluent Bit continues writing logs to disk even when memory capacity is full.
Below is the configuration snippet we used to enable filesystem-based buffering:
service:
  extraService: |
    # Sets the disk path for filesystem-based buffering.
    storage.path /var/log/flb-storage/
    # Allows up to 128 active chunks (approx. 256MB) in memory.
    storage.max_chunks_up 128
additionalInputs: |
  [INPUT]
      Name tail
      Tag kube.*
      # Directs new data to disk when memory is at capacity, preventing pauses.
      storage.type filesystem
      # Disables pausing the plugin when max_chunks_up is reached.
      storage.pause_on_chunks_overlimit off
  [OUTPUT]
      Name datadog
      Match kube.*
      # Sets a 500MB disk limit; older chunks are removed if the limit is reached.
      storage.total_limit_size 500M
After deploying the updated config, we monitored Fluent Bit’s performance and noticed that memory consumption stabilized and the tail.1 paused (mem buf overlimit) errors we were seeing previously no longer occurred, indicating that backpressure was being effectively managed.
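Fluent Bit can also report its buffering state through the built-in HTTP monitoring endpoint, which makes it easy to confirm that chunks are landing on the filesystem rather than piling up in memory. This assumes HTTP_Server and storage.metrics are enabled in the [SERVICE] section (2020 is the default port):
# From inside a Fluent Bit pod (or via kubectl port-forward):
# returns chunk counts per input, broken down into memory and filesystem chunks.
curl -s http://127.0.0.1:2020/api/v1/storage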
However, despite these improvements, the issue of sporadically missing ai-deploy-job logs remained unresolved.
Reevaluating the Problem
Even with backpressure mitigated, we realized our initial hypothesis was not fully correct, as the ai-deploy-job logs were still sporadically missing.
We conducted further searches in DataDog using specific log content from ai-deploy-job and, surprisingly, found that the logs were there.
However, they lacked crucial Kubernetes metadata — such as pod name, namespace, and cluster name — which led to improper indexing and made them unsearchable with standard queries. This missing metadata is what made the logs appear to be missing in the first place.
Our next step was to figure out why the Kubernetes metadata was missing from the ai-deploy-job logs. To answer that, we first need to understand the importance of Kubernetes metadata.
Understanding the Importance of Kubernetes Metadata
Kubernetes metadata is critical for log aggregation and analysis. The Fluent Bit Kubernetes filter is responsible for enriching logs with this metadata, partly by querying the Kubernetes API server.
The filter performs several key operations:
- Analyze the tag and extract the following metadata:
  – Pod Name
  – Namespace
  – Container Name
  – Container ID
- Query the Kubernetes API to obtain extra metadata for the pod in question:
  – Pod ID
  – Labels
  – Annotations
  – Namespace Labels
  – Namespace Annotations
The metadata is cached locally in memory, then appended to each log record before passing it to the buffer for further processing.
Without this metadata, logs cannot be tied back to the Kubernetes resources that produced them.
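For illustration, here is roughly what an enriched record looks like once the filter has done its job. The field values are made up for this example (the pod and namespace names mirror the job above), and the exact set of fields depends on the filter configuration:
{
  "log": "starting model deployment step 3/7",
  "kubernetes": {
    "pod_name": "ai-deploy-jobtllj6-xhrq6",
    "namespace_name": "ldt",
    "container_name": "ai-deploy-job",
    "pod_id": "0f4c2a1e-0000-0000-0000-000000000000",
    "host": "ip-10-0-1-23.ec2.internal"
  }
}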
Root Cause
With backpressure ruled out as the cause, and knowing that Kubernetes metadata was missing, we turned our focus to the Kubernetes filter.
We first increased the log level in Fluent Bit to capture more details, and immediately noticed warning messages:
[2024/09/12 22:53:57] [ warn] [http_client] cannot increase buffer:
current=32000 requested=64768 max=32000
This indicated that Fluent Bit’s HTTP client was unable to process a response from the Kubernetes API server because it exceeded the maximum buffer size.
This led us to suspect that the Buffer_Size parameter in the Kubernetes filter might be the issue. By default, Buffer_Size is set to 32k, which limits the maximum HTTP response size Fluent Bit can handle when retrieving pod specifications from the Kubernetes API.
Given that some pod specifications may exceed this default buffer size, Fluent Bit might be discarding metadata responses that are too large, causing essential Kubernetes metadata to be missing from the logs.
According to the official Fluent Bit documentation:
Buffer_Size:
Sets the buffer size for the HTTP client when reading responses
from the Kubernetes API server. If a pod specification exceeds this limit,
the API response will be discarded,
resulting in missing metadata in the logs.
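One quick sanity check is to confirm what Buffer_Size the running Fluent Bit is actually using, for example by inspecting the rendered configuration in its ConfigMap. The namespace and ConfigMap name below are assumptions based on our Helm release; if nothing is printed, Buffer_Size has not been overridden and the 32k default applies:
kubectl get configmap -n logging aws-for-fluent-bit -o yaml | grep -i buffer_size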
To validate our theory, we needed to determine the actual size of the ai-deploy-job pod specification that Fluent Bit was attempting to retrieve.
We ran the following command from within one of the Fluent Bit pods to measure the response size from the Kubernetes API server.
curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
-H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
https://kubernetes.default.svc.cluster.local/api/v1/namespaces/<namespace>/pods/<pod-name> \
-o /dev/null -w "Response Size: %{size_download} bytes\n"
The output confirmed our suspicion:
Response Size: 64768 bytes
The response size was approximately 64KB, double the default Buffer_Size of 32KB, confirming that Fluent Bit was discarding oversized API responses.
This explained why logs were reaching DataDog without Kubernetes metadata — Fluent Bit was unable to process these responses and thus couldn’t enrich the logs.
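As a lighter-weight cross-check that doesn’t require exec’ing into a Fluent Bit pod, the size of the pod object can be approximated with kubectl. The byte count differs slightly from the raw API response but is close enough to compare against Buffer_Size:
kubectl get pod <pod-name> -n <namespace> -o json | wc -c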
But why was only this job’s pod specification exceeding the buffer limit? Examining the pod specification revealed that ai-deploy-job had an unusually long argument list, bringing its pod spec size to around 64KB — twice Fluent Bit’s default Buffer_Size.
Other jobs didn’t have this issue because their pod specifications were within the buffer limit.
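To check whether any other workloads were creeping toward the limit, pod object sizes across the cluster can be surveyed with kubectl and jq. This is a rough approximation, since kubectl re-serializes the objects:
# Print the ten largest pod objects (approximate JSON size in bytes).
kubectl get pods -A -o json \
  | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(tostring | length)"' \
  | sort -k2 -n | tail -10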
Final Solution
To resolve this completely, we needed to increase the Buffer_Size in the Kubernetes filter to accommodate larger pod specifications:
[FILTER]
    Name kubernetes
    Match kube.*
    Buffer_Size 256k
After updating the configuration and restarting the Fluent Bit pods, we observed:
- The buffer-related warnings disappeared.
- Logs from the previously missing jobs appeared in DataDog with complete metadata.
Conclusion
This was an interesting challenge that highlighted the complexities of Fluent Bit’s buffering and metadata enrichment in Kubernetes.
Through our investigation, we identified the need to adjust the Buffer_Size parameter in the Kubernetes filter, which allowed Fluent Bit to handle larger pod specifications and ensured that complete metadata made it to DataDog.
To optimize further, we switched to filesystem-based buffering. Moving some data from memory to disk gave us better control during high log volumes, reducing the risk of data loss and keeping Fluent Bit running smoothly under load.
These optimizations not only resolved our immediate metadata issue but also demonstrated how buffer tuning and file-based buffering enhance log aggregation reliability in Kubernetes.