NUMA Bottlenecks in Multi-Core CPUs

Exposing and Fixing Real NUMA Bottlenecks in Multi-Core CPUs (Real-World Guide)

NUMA (Non-Uniform Memory Access) architecture has become standard in modern multi-core and multi-socket systems. It improves performance by giving each group of cores fast access to its own local memory, but it introduces bottlenecks when threads and the memory they use end up on different nodes.

Understanding NUMA Basics

NUMA systems consist of multiple memory nodes, each tied to a specific set of CPU cores. Accessing local memory is fast; accessing memory on another node incurs a latency penalty. When a process running on one NUMA node frequently accesses memory on another, it pays that remote-access penalty again and again, which is the classic NUMA bottleneck.

Step 1: Discover the System’s NUMA Topology

To inspect how CPUs and memory are grouped, use:

numactl --hardware

Sample output:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 64000 MB
node 1 cpus: 4 5 6 7
node 1 size: 64000 MB

This shows each node’s CPUs and memory size. Having this insight is crucial for pinning workloads effectively.
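As a small sketch of putting this output to use in a script, the hypothetical helper below (the name node_cpus is not part of numactl) reads numactl --hardware output on stdin and prints the CPU list of one node. It assumes the "node N cpus: ..." line format shown in the sample above.

```bash
# node_cpus NODE: read `numactl --hardware` output on stdin and print the
# CPU ids of NODE as a comma-separated list.
node_cpus() {
    awk -v node="$1" '
        $1 == "node" && $2 == node && $3 == "cpus:" {
            out = ""; sep = ""
            for (i = 4; i <= NF; i++) { out = out sep $i; sep = "," }
            print out
        }'
}

# Sample input copied from the output shown above.
sample='available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 64000 MB
node 1 cpus: 4 5 6 7
node 1 size: 64000 MB'

printf '%s\n' "$sample" | node_cpus 1    # prints 4,5,6,7
```

On a live system the same function can be fed directly: numactl --hardware | node_cpus 0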

Step 2: Detect Remote Memory Accesses

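To see how a process's memory is spread across nodes:

numastat -p <PID>

This prints the process's memory per node (in MB), broken down into heap, stack, and private pages. A process that runs on one node but holds most of its memory on another is paying for remote accesses.

For the system-wide allocation counters, run numastat without arguments and focus on these rows:

  • numa_miss: Pages allocated on this node although the process preferred a different node
  • numa_foreign: Pages intended for this node but allocated on a different node

Values that keep growing in these fields indicate cross-node allocation, which often leads to remote accesses and degraded performance.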
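The kernel also exposes these allocation counters per node in /sys/devices/system/node/node<N>/numastat, one "name value" pair per line. The sketch below turns them into a single percentage; the helper name miss_pct is made up for illustration and the sample numbers are not real measurements.

```bash
# miss_pct: read a per-node counter file in the format of
# /sys/devices/system/node/node<N>/numastat on stdin and print numa_miss as
# a percentage of all pages allocated on that node (numa_hit + numa_miss).
miss_pct() {
    awk '
        $1 == "numa_hit"  { hit  = $2 }
        $1 == "numa_miss" { miss = $2 }
        END {
            total = hit + miss
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * miss / total
        }'
}

# Illustrative counters, not measurements from a real machine.
printf 'numa_hit 9000\nnuma_miss 1000\nnuma_foreign 250\n' | miss_pct   # prints 10.00
```

On a live system: miss_pct < /sys/devices/system/node/node0/numastat. The counters are cumulative since boot, so comparing two readings taken some time apart says more than a single value.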

Step 3: Bind Processes to Specific NUMA Nodes

To ensure CPU and memory locality for better performance:

numactl --cpunodebind=0 --membind=0 ./application

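This command binds both CPU execution and memory allocation to NUMA node 0. Note that --membind is strict: allocations fail once node 0 runs out of memory. Use --preferred=0 instead if falling back to other nodes is acceptable.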

For example, a latency-sensitive service like a real-time analytics engine can benefit greatly from this pinning approach.
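When the target node is not fixed in advance, one option is to pick the node with the most free memory before binding. The sketch below assumes the "node N free: X MB" lines that numactl --hardware prints alongside the size lines; the helper name freest_node is hypothetical and the sample values are illustrative.

```bash
# freest_node: read `numactl --hardware` output on stdin and print the id of
# the node with the most free memory (first one wins on a tie).
freest_node() {
    awk '
        $1 == "node" && $3 == "free:" {
            if (!found || $4 + 0 > max) { best = $2; max = $4 + 0; found = 1 }
        }
        END { if (found) print best }'
}

# Illustrative "free" lines; real output also lists cpus, sizes and distances.
printf 'node 0 free: 1200 MB\nnode 1 free: 48000 MB\n' | freest_node   # prints 1
```

A possible use, under the same assumptions: node=$(numactl --hardware | freest_node), followed by numactl --cpunodebind="$node" --membind="$node" ./application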

Step 4: Use taskset to Pin Running Processes

To bind an already-running process to specific cores:

taskset -cp 0-3 <PID>

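This restricts execution to cores 0–3, which belong to node 0 in the sample topology above. taskset changes CPU affinity only; memory the process has already allocated stays on whichever node currently holds it.

To see which core each process last ran on (the psr column):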

ps -o pid,psr,comm -eH
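To confirm that pinning took effect, the allowed CPU set can also be read from the Cpus_allowed_list field in /proc/<PID>/status, which uses a compact range notation such as 0-3,8. The following sketch expands that notation so it can be compared with a node's CPU list; both helper names are hypothetical.

```bash
# expand_cpulist LIST: expand a kernel-style CPU list such as "0-3,8" into
# space-separated CPU ids.
expand_cpulist() {
    printf '%s\n' "$1" | awk -F, '{
        out = ""; sep = ""
        for (i = 1; i <= NF; i++) {
            n = split($i, r, "-")          # "0-3" -> r[1]=0, r[2]=3; "8" -> r[1]=8
            lo = r[1] + 0
            hi = (n == 2) ? r[2] + 0 : lo
            for (c = lo; c <= hi; c++) { out = out sep c; sep = " " }
        }
        print out
    }'
}

# allowed_cpus PID: print the CPU list the kernel currently allows for PID
# (Linux only; reads /proc/<PID>/status).
allowed_cpus() {
    awk '$1 == "Cpus_allowed_list:" { print $2 }' "/proc/$1/status"
}

expand_cpulist "0-3,8"    # prints 0 1 2 3 8
```

After running taskset, expand_cpulist "$(allowed_cpus <PID>)" should list only cores that belong to the intended node.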

Step 5: Manage NUMA Balancing Dynamically

Linux has a feature for automatic NUMA balancing. It can help or hurt depending on the workload.

Check its current status:

cat /proc/sys/kernel/numa_balancing

To enable or disable (requires root):

echo 1 > /proc/sys/kernel/numa_balancing   # Enable
echo 0 > /proc/sys/kernel/numa_balancing   # Disable

Guidance:

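  • Enable for large, long-running apps like JVMs or databases that are not pinned manually, so the kernel can migrate pages toward the threads that use them
  • Disable for lightweight or batch processes, and for workloads already bound with numactl or taskset, where the periodic page scanning adds overhead without benefit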
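For scripts that need to branch on this setting, a minimal sketch follows. The function name balancing_status is made up, and the optional file argument exists only so the logic can be exercised without touching /proc; any non-zero value is treated as enabled.

```bash
# balancing_status [FILE]: report the automatic NUMA balancing setting.
# FILE defaults to the procfs path used above.
balancing_status() {
    file="${1:-/proc/sys/kernel/numa_balancing}"
    if [ ! -r "$file" ]; then
        echo "unavailable"      # no such file: feature not built in, or not Linux
        return 0
    fi
    case "$(cat "$file")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;    # any non-zero mode counts as enabled here
    esac
}

balancing_status
```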

Step 6: Monitor NUMA Metrics with perf

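perf offers low-overhead tracking of memory access behavior. On CPUs that expose the node-level cache events (check with perf list), count loads that reach node memory and those that go to a remote node:

perf stat -e node-loads,node-load-misses -p <PID>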

To sample CPU cycles system-wide for five seconds and see where time is spent:

perf record -g -e cycles -a -- sleep 5
perf report
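Where the CPU exposes perf's generic node-loads and node-load-misses events, their ratio gives a rough share of loads served from a remote node. The sketch below parses the CSV form that perf stat -x, produces (counter value in field 1, event name in field 3). The helper name remote_load_pct and the sample numbers are illustrative, and the exact meaning of these events varies by processor, so treat the result as an indicator rather than a precise measurement.

```bash
# remote_load_pct: read `perf stat -x,` CSV lines on stdin and print
# node-load-misses as a percentage of node-loads, or "n/a" if node-loads is
# missing, zero, or reported as not supported.
remote_load_pct() {
    awk -F, '
        $3 == "node-loads"       { loads  = $1 }
        $3 == "node-load-misses" { misses = $1 }
        END {
            if (loads + 0 == 0) { print "n/a"; exit }
            printf "%.1f\n", 100 * misses / loads
        }'
}

# Illustrative CSV lines, not output captured from a real run.
printf '1000,,node-loads,2000123,100.00,,\n250,,node-load-misses,2000123,100.00,,\n' | remote_load_pct   # prints 25.0
```

One way to produce the input is to write the counters to a file with perf stat -x, -o counts.csv -e node-loads,node-load-misses -p <PID> and then run remote_load_pct < counts.csv. Event names may carry a suffix such as :u on some setups, in which case the match patterns need adjusting.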

Step 7: Tune Memory Allocation with HugePages

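Using HugePages reduces TLB misses and page-table overhead for large working sets. The following reserves 1024 huge pages (2 GiB with the default 2 MB huge page size on x86-64) and requires root: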

echo 1024 > /proc/sys/vm/nr_hugepages

To check status:

grep Huge /proc/meminfo
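The size of the reserved pool is the page count multiplied by the huge page size, both of which appear in /proc/meminfo. A minimal sketch of that arithmetic, with a hypothetical helper name and illustrative input:

```bash
# hugepage_pool_mb: read /proc/meminfo-style text on stdin and print the size
# of the reserved huge page pool in MB (HugePages_Total * Hugepagesize in kB).
hugepage_pool_mb() {
    awk '
        $1 == "HugePages_Total:" { total = $2 }
        $1 == "Hugepagesize:"    { size_kb = $2 }
        END { print total * size_kb / 1024 }'
}

# Illustrative values: 1024 pages of 2048 kB each.
printf 'HugePages_Total:    1024\nHugepagesize:       2048 kB\n' | hugepage_pool_mb   # prints 2048
```

On a live system: hugepage_pool_mb < /proc/meminfo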

Optionally disable Transparent HugePages for workloads, such as some databases, that see latency spikes from it (requires root):

echo never > /sys/kernel/mm/transparent_hugepage/enabled

Step 8: Add NUMA-Awareness in Code (C/C++)

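For custom applications, leverage libnuma for memory locality (compile with -lnuma):

#include <numa.h>

int main(void) {
    if (numa_available() == -1)
        return 1;               // libnuma cannot be used on this system
    numa_set_preferred(0);      // Prefer node 0 for subsequent allocations
    /* ... allocate memory and run the workload here ... */
    return 0;
}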

This is especially useful for real-time processing or latency-sensitive services.

Use Cases and Scenarios

Java Application with Garbage Collection Issues

  • Symptoms: Long GC pauses, unpredictable behavior
  • Solution: Add JVM flags for NUMA awareness:

-XX:+UseNUMA -XX:+UseParallelGC

If the heap fits within a single node, binding the JVM to that node's CPUs and memory is an alternative:

numactl --cpunodebind=1 --membind=1 java -jar app.jar

Slow Database Performance on Multi-Core Systems

  • Symptoms: Queries slow down under load
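  • Solution: Bind the database process using systemd. CPUAffinity pins the CPUs only; on systemd 243 or later, NUMAPolicy and NUMAMask bind memory as well:

[Service]
ExecStart=/usr/sbin/mysqld
CPUAffinity=4-7
NUMAPolicy=bind
NUMAMask=1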

Use numastat -p $(pidof mysqld) to monitor behavior after changes.

Extra Tools and Techniques

Task                                        Command / Tool
Monitor CPU usage per core                  mpstat -P ALL 1
Spot kernel memory leaks (slab caches)      slabtop
View system memory statistics               cat /proc/meminfo
Align NIC IRQs with a NUMA node             echo <mask> > /proc/irq/<IRQ>/smp_affinity
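The <mask> written to smp_affinity is a hexadecimal bitmask with one bit per CPU, so cores 4 to 7 correspond to f0. The sketch below does that conversion; the helper name cpu_mask is hypothetical, and it only covers CPU ids 0 to 31 (a single mask word), which is an assumption made to keep the example short.

```bash
# cpu_mask CPU...: print the hexadecimal affinity mask with one bit set per
# listed CPU id. Sketch only: intended for CPU ids 0-31.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 4 5 6 7    # prints f0
```

With the sample topology from Step 1, the output of cpu_mask 4 5 6 7 is the value to write for an IRQ whose NIC is attached to node 1.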