At Netflix scale, diagnosing production performance issues requires tools beyond traditional logging. Netflix engineers were early adopters of eBPF for safe, low-overhead tracing inside the Linux kernel. They pair it with flame graphs, a stack-trace visualization created by Brendan Gregg and used heavily at Netflix, to pinpoint CPU hotspots and scheduling delays. This combination offers deep visibility into system behavior with minimal overhead.
This article explores the full workflow used by Netflix SREs and performance engineers to debug live systems, illustrated with reusable examples and useful tools.
Overview of Netflix’s Production Debugging Stack
Netflix built internal tooling around perf, bcc, bpftrace, and eBPF utilities to trace JVM and kernel metrics. FlameScope, Netflix’s open-source companion tool, visually maps performance metrics over time and highlights hotspots using a combination of heat maps and flame graphs.
Netflix also released bpftop, which displays live metrics for running eBPF programs, such as average runtime, events per second, and estimated CPU usage, so that inefficient eBPF programs can be spotted quickly (The New Stack).
Step-by-Step Debugging Workflow
Step One: Collect CPU Profiling Data Using perf
Use perf to sample stack traces while applications are running in production:
perf record -F 99 -a -g -- sleep 30
Turn the raw profiling data into a flame graph:
perf script | stackcollapse-perf.pl | flamegraph.pl > perf.svg
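This visual format quickly shows which functions consume the most CPU: the wider a frame, the more samples contained it. The x-axis is not time; stacks are sorted alphabetically so that identical frames merge into a single box.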
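The intermediate "folded" format produced by stackcollapse-perf.pl is also useful on its own: each line is one unique stack, with frames separated by semicolons and followed by a sample count. The sketch below, which assumes that format, totals samples by leaf frame to give a quick text ranking of on-CPU functions. The helper name top_leaf_frames and the sample stacks are made up for illustration.

```shell
#!/usr/bin/env bash
# top_leaf_frames (hypothetical helper): read folded stacks on stdin
# ("frame1;frame2;leaf COUNT" per line) and print each leaf frame with its
# share of all samples, largest first.
top_leaf_frames() {
  awk '{
         count = $NF                      # last field is the sample count
         sub(/ [0-9]+$/, "")              # strip the count, leaving the stack
         n = split($0, frames, ";")       # frames[n] is the on-CPU (leaf) frame
         leaf[frames[n]] += count
         total += count
       }
       END {
         for (f in leaf) printf "%.1f%% %s\n", 100 * leaf[f] / total, f
       }' | sort -rn
}

# Illustrative input only; real input would come from:
#   perf script | stackcollapse-perf.pl > out.folded
printf '%s\n' \
  'java;start_thread;Foo::run;memcpy 60' \
  'java;start_thread;Bar::parse 30' \
  'nginx;main;ngx_worker;memcpy 10' | top_leaf_frames
```

With the sample input, memcpy accounts for 70.0% of samples and Bar::parse for 30.0%. The same folded file can still be passed to flamegraph.pl for the visual view.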
Step Two: Trace Kernel Events with bpftrace
To monitor kernel-level delays or system calls dynamically, use bpftrace. For example, count sched_switch events per process name to see which tasks context-switch most often, a first hint of scheduling issues:
bpftrace -e 'tracepoint:sched:sched_switch { @[comm] = count(); }'
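This approach offers real-time observability with low overhead and requires no kernel changes or reboots. High-frequency tracepoints such as sched_switch still cost something on busy hosts, so trace them for short periods.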
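The one-liner above runs until interrupted. On a production host it is often safer to bound the trace so it cannot be left running by accident. The sketch below generates a bpftrace program that counts system calls per process name and exits by itself after a fixed number of seconds, at which point bpftrace prints the map. The helper name syscall_count_program is hypothetical; the probes and functions used (tracepoint:raw_syscalls:sys_enter, interval, count, exit) are standard bpftrace features.

```shell
#!/usr/bin/env bash
# syscall_count_program (hypothetical helper): print a bpftrace program that
# counts system calls per process name and exits on its own after $1 seconds
# (default 10).
syscall_count_program() {
  local seconds="${1:-10}"
  cat <<EOF
tracepoint:raw_syscalls:sys_enter { @syscalls[comm] = count(); }
interval:s:${seconds} { exit(); }
EOF
}

# Show the generated program.
syscall_count_program 10

# To run it on a host with bpftrace installed (requires root):
#   sudo bpftrace -e "$(syscall_count_program 10)"
```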
Step Three: Visualize with FlameScope
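Load the captured profile (the text output of perf script --header, not the rendered SVG) into FlameScope, which combines a subsecond-offset heat map with function-level flame graphs. Selecting a time range in the heat map produces a flame graph for just that window, which helps reveal performance variability, scheduling stalls, or I/O waits that a single whole-profile flame graph would average away.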
Step Four: Use bpftop to Monitor eBPF Trace Performance
Run bpftop to see which eBPF probes are active, how many events per second they generate, and their CPU usage. This helps teams optimize tracing programs without adding overhead to running services.
Real-World Case Examples
Mixed-Mode Profiling of Java Applications
Netflix runs JVMs with the -XX:+PreserveFramePointer option, which lets perf walk Java stacks so that flame graphs capture both Java and native code paths. The resulting flame graphs show execution across JIT-compiled Java methods, JVM internals, and kernel/syscall layers, enabling root cause analysis for latency issues (GitHub).
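As a rough sketch of what a mixed-mode capture can look like for a single JVM, the helper below samples one process, asks the JVM to write a symbol map for JIT-compiled code, and renders the result with FlameGraph's Java color palette. It is not Netflix's internal tooling. It assumes the JVM was started with -XX:+PreserveFramePointer, that the JDK is version 17 or later (for jcmd Compiler.perfmap; older JVMs typically use perf-map-agent instead), and that the FlameGraph scripts are on PATH. The name profile_jvm and the output file names are hypothetical, and file ownership of the map under /tmp may need adjusting when perf runs as a different user.

```shell
#!/usr/bin/env bash
# profile_jvm (hypothetical helper): sample one JVM and render a mixed-mode
# flame graph. Assumes -XX:+PreserveFramePointer, JDK 17+, and the
# FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) on PATH.
profile_jvm() {
  local pid="${1:-}" seconds="${2:-30}"
  if [ -z "$pid" ]; then
    echo "usage: profile_jvm PID [SECONDS]" >&2
    return 2
  fi
  # Sample the target process at 99 Hz with call graphs.
  perf record -F 99 -g -p "$pid" -o "perf-$pid.data" -- sleep "$seconds" &&
    # Ask the JVM to write /tmp/perf-PID.map so perf can name JIT frames.
    jcmd "$pid" Compiler.perfmap &&
    # Fold the stacks and render with the Java palette.
    perf script -i "perf-$pid.data" |
      stackcollapse-perf.pl |
      flamegraph.pl --color=java > "java-$pid.svg"
}

# Example (requires perf privileges on the host):
#   profile_jvm 12345 30
```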
Detecting Noisy Neighbors in Containers and VMs
Using eBPF to track scheduler latency, Netflix identifies when one service steals CPU cycles from others in multi-tenant environments. They can automatically alert or reschedule workloads if scheduler latency exceeds acceptable thresholds.
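One common way to measure scheduler latency with eBPF is to timestamp a task when it becomes runnable and again when it is switched onto a CPU. The sketch below follows that pattern, modeled on the widely used runqlat approach rather than on Netflix's internal implementation. The helper name runqlat_program and the map names are hypothetical, and the program uses the args->field syntax, which newer bpftrace releases also write as args.field.

```shell
#!/usr/bin/env bash
# runqlat_program (hypothetical helper): print a bpftrace program that builds
# a histogram of run queue latency (time from runnable to running) in
# microseconds.
runqlat_program() {
  cat <<'EOF'
tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
  @enqueued[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
{
  // prev_state 0 is TASK_RUNNING: the task was preempted and is runnable again
  if (args->prev_state == 0) {
    @enqueued[args->prev_pid] = nsecs;
  }
  $start = @enqueued[args->next_pid];
  if ($start) {
    @runq_usecs = hist((nsecs - $start) / 1000);
  }
  delete(@enqueued[args->next_pid]);
}

END { clear(@enqueued); }
EOF
}

# Show the generated program.
runqlat_program

# To run it on a host with bpftrace installed (requires root, Ctrl-C to stop):
#   runqlat_program > runqlat.bt && sudo bpftrace runqlat.bt
```

A histogram with a long tail toward milliseconds suggests tasks are waiting for CPU, which is the signature of a noisy neighbor. Alerting thresholds are workload specific and are not shown here.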
Identifying Disk I/O Bottlenecks
By tracing block layer events and I/O latency using eBPF, teams spot slow disk operations, including those caused by container or kernel resource contention. Off-CPU flame graphs reveal threads that are blocked waiting on I/O rather than consuming CPU, which helps distinguish compute problems from storage problems.
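Block I/O latency can be sketched the same way: record a timestamp when a request is issued to the device and compute the difference when it completes. The example below uses the block:block_rq_issue and block:block_rq_complete tracepoints, keyed by device and sector. It is an illustration under those assumptions, not the exact tooling Netflix uses, and the helper name biolat_program and the map names are hypothetical.

```shell
#!/usr/bin/env bash
# biolat_program (hypothetical helper): print a bpftrace program that builds
# a histogram of block device I/O latency (issue to completion) in
# microseconds.
biolat_program() {
  cat <<'EOF'
tracepoint:block:block_rq_issue
{
  @issued[args->dev, args->sector] = nsecs;
}

tracepoint:block:block_rq_complete
/@issued[args->dev, args->sector]/
{
  @io_usecs = hist((nsecs - @issued[args->dev, args->sector]) / 1000);
  delete(@issued[args->dev, args->sector]);
}

END { clear(@issued); }
EOF
}

# Show the generated program.
biolat_program

# To run it on a host with bpftrace installed (requires root, Ctrl-C to stop):
#   biolat_program > biolat.bt && sudo bpftrace biolat.bt
```

This measures time spent at the device. Time a request spends queued before it is issued is not included, so a healthy histogram here combined with blocked threads in an off-CPU flame graph points toward contention above the device.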
Advanced Tips and Best Practices
- Run JVMs with -XX:+PreserveFramePointer to enable accurate stack traces for mixed-mode profiles.
- Generate differential flame graphs before and after deployment changes to detect regressions early in CI pipelines.
- Focus on off-CPU flame graph data to diagnose I/O delays, lock contention, thread stalls, or network waits rather than on-CPU usage.
- Automate flame graph generation and trace capture via self-service dashboards or as part of continuous profiling infrastructure.
- Sample coarse kernel events using bcc or bpftrace scripts integrated into health checks or CI validation.
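For the differential flame graph tip, the FlameGraph repository includes difffolded.pl, which can be used as difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg. A CI pipeline usually also needs a pass/fail signal, so the sketch below compares two folded-stack files and prints any stack whose share of samples grew by more than a threshold. The helper name folded_regressions, the default threshold of 5 percentage points, and the sample data are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# folded_regressions (hypothetical helper): compare two folded-stack files and
# print stacks whose share of total samples grew by more than THRESHOLD
# percentage points between BEFORE and AFTER.
folded_regressions() {
  local before="$1" after="$2" threshold="${3:-5}"
  awk -v threshold="$threshold" '
    {
      count = $NF
      sub(/ [0-9]+$/, "")
      if (FNR == NR) { old[$0] += count; old_total += count }
      else           { new[$0] += count; new_total += count }
    }
    END {
      if (old_total == 0 || new_total == 0) exit 1
      for (s in new) {
        delta = 100 * new[s] / new_total - 100 * old[s] / old_total
        if (delta > threshold) printf "%+.1f %s\n", delta, s
      }
    }' "$before" "$after"
}

# Illustrative data only; real files would come from stackcollapse-perf.pl.
workdir="$(mktemp -d)"
printf '%s\n' 'app;main;parse 50' 'app;main;render 50' > "$workdir/before.folded"
printf '%s\n' 'app;main;parse 80' 'app;main;render 20' > "$workdir/after.folded"
folded_regressions "$workdir/before.folded" "$workdir/after.folded" 5
```

With the sample data, parse goes from 50% to 80% of samples and is reported as +30.0, while render shrinks and is not listed. Comparing shares rather than raw counts keeps the check usable when the two profiles contain different numbers of samples.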
Frequently Asked Questions
Is eBPF safe for production use?
Yes, with care. Every eBPF program is checked by the in-kernel verifier before it is loaded, and programs that could loop forever or access memory unsafely are rejected. Performance monitoring scripts can run with minimal overhead and with no need to recompile the kernel or load custom kernel modules (brendangregg.com, en.wikipedia.org).
Can flame graphs handle mixed Java and native code stacks?
Absolutely. Netflix pioneered mixed-mode flame graphs to visualize Java bytecode and native calls side‑by‑side, helping diagnose complex performance issues spanning both layers.
What open-source tools did Netflix release publicly?
Netflix published FlameScope under the Apache 2.0 license. It also published documentation and use cases for its eBPF workflows and released bpftop to monitor the performance of running eBPF programs.
How much overhead does this tracing add?
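It depends on how often the traced events fire. With careful setup, eBPF tracing and flame graph capture typically add less than 5 percent CPU overhead, and many traces run under 1 percent, especially when limited to specific events or durations. High-frequency events such as scheduler switches cost more, so measure the overhead on a representative host before enabling a trace fleet-wide.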
Netflix’s approach blends state-of-the-art tracing frameworks with production-safe visualization tools. By using eBPF, bpftrace, perf, and FlameScope, engineers gain real-time visibility into kernel behavior, CPU scheduling, and I/O delays. These methods are foundational for any SRE or performance team working at scale today.






