NUMA (Non-Uniform Memory Access) architecture has become standard in modern multi-core and multi-socket systems. It keeps memory access latency low for memory that is local to a core, but it can introduce performance bottlenecks when workloads are placed without regard to the topology.
Understanding NUMA Basics
NUMA systems consist of multiple memory nodes, each tied to specific CPU cores. Accessing local memory is fast; accessing memory from another node incurs a performance penalty. When a process running on one NUMA node frequently accesses memory on another, it pays remote memory access latency, a classic NUMA bottleneck.
Step 1: Discover the System’s NUMA Topology
To inspect how CPUs and memory are grouped, use:
numactl --hardware
Sample output:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 64000 MB
node 1 cpus: 4 5 6 7
node 1 size: 64000 MB
This shows each node’s CPUs and memory size. Having this insight is crucial for pinning workloads effectively.
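When scripting around the topology, it can help to pull the CPU list for one node out of that output. The following is a minimal sketch, assuming the output has the same shape as the sample above; `node_cpus` is a hypothetical helper name, not part of numactl.

```shell
# node_cpus NODE: read `numactl --hardware` output on stdin and print the
# space-separated CPU IDs listed for NODE.
# Assumes lines of the form "node <N> cpus: <id> <id> ...".
node_cpus() {
    awk -v n="$1" '
        $1 == "node" && $2 == n && $3 == "cpus:" {
            sep = ""; out = ""
            for (i = 4; i <= NF; i++) { out = out sep $i; sep = " " }
            print out
        }'
}

# Typical use on a live system (requires numactl):
#   numactl --hardware | node_cpus 1
```

Reading from stdin keeps the helper usable on saved output as well as on a live system.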
Step 2: Detect Remote Memory Accesses
To determine if a process is accessing memory from other nodes:
numastat -p <PID>
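This per-process view reports how much of the process's memory (in MB) resides on each node. For the allocation counters, run numastat without arguments and focus on these rows:
- numa_miss: Pages allocated on this node although the process preferred a different node
- numa_foreign: Pages intended for this node but allocated on a different node
High or steadily growing values in these counters indicate cross-node allocation, which often leads to degraded performance.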
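Raw counter values are hard to judge on their own, so one option is to express numa_miss as a share of all allocations on each node. The sketch below assumes the system-wide `numastat` layout (run without -p), where the first line names the nodes and each counter is a row; `numa_miss_pct` is a hypothetical helper, and what counts as "too high" depends on the workload.

```shell
# numa_miss_pct: read system-wide `numastat` output on stdin and print, for
# each node, numa_miss as a percentage of (numa_hit + numa_miss).
numa_miss_pct() {
    awk '
        $1 ~ /^node[0-9]+$/ { for (i = 1; i <= NF; i++) name[i] = $i; cols = NF }
        $1 == "numa_hit"    { for (i = 2; i <= NF; i++) hit[i - 1] = $i }
        $1 == "numa_miss"   { for (i = 2; i <= NF; i++) miss[i - 1] = $i }
        END {
            for (i = 1; i <= cols; i++) {
                total = hit[i] + miss[i]
                pct = (total > 0) ? 100 * miss[i] / total : 0
                printf "%s %.1f%%\n", name[i], pct
            }
        }'
}

# Typical use on a live system:
#   numastat | numa_miss_pct
```

The counters are cumulative since boot, so comparing two samples taken some time apart gives a better picture than a single reading.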
Step 3: Bind Processes to Specific NUMA Nodes
To ensure CPU and memory locality for better performance:
numactl --cpunodebind=0 --membind=0 ./application
This command binds both CPU execution and memory allocation to NUMA node 0.
For example, a latency-sensitive service like a real-time analytics engine can benefit greatly from this pinning approach.
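One way to check that a binding took effect is to total the per-node page counts that the kernel reports in /proc/<PID>/numa_maps. This is a rough sketch: `pages_per_node` is a hypothetical helper, and it adds up page counts without accounting for differing page sizes, so treat the result as an indication rather than an exact figure.

```shell
# pages_per_node: read /proc/<PID>/numa_maps content on stdin and print the
# total page count per node as "N<node> <pages>" lines.
# Simplification: pages of different sizes are counted alike.
pages_per_node() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^N[0-9]+=[0-9]+$/) {
                    split($i, kv, "=")
                    pages[kv[1]] += kv[2]
                }
        }
        END { for (node in pages) print node, pages[node] }' | sort
}

# Typical use on a live system:
#   pages_per_node < /proc/<PID>/numa_maps
```

After a successful `--membind=0`, nearly all pages would be expected under N0.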
Step 4: Use taskset to Pin Running Processes
To bind an already-running process to specific cores:
taskset -cp 0-3 <PID>
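This restricts execution to cores 0–3, which belong to NUMA node 0 in the sample topology above. Note that taskset changes CPU affinity only; memory the process has already allocated stays where it is.
To see which core each process last ran on (the PSR column):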
ps -o pid,psr,comm -eH
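Rather than hard-coding a core range, the CPU list for a node can be read from sysfs and passed to taskset. The sketch below assumes the standard Linux location /sys/devices/system/node; `NODE_SYSFS`, `node_cpulist`, and `pin_to_node` are hypothetical names introduced for illustration.

```shell
# Directory with one nodeN/ subdirectory per NUMA node. Overridable so the
# helpers can be pointed at a different tree.
: "${NODE_SYSFS:=/sys/devices/system/node}"

# node_cpulist NODE: print the kernel's CPU list for NODE, e.g. "4-7".
node_cpulist() {
    cat "$NODE_SYSFS/node$1/cpulist"
}

# pin_to_node PID NODE: restrict PID to the CPUs that belong to NODE.
pin_to_node() {
    cpus=$(node_cpulist "$2") || return 1
    taskset -cp "$cpus" "$1"
}

# Typical use on a live system:
#   pin_to_node <PID> 0
```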
Step 5: Manage NUMA Balancing Dynamically
Linux has a feature for automatic NUMA balancing. It can help or hurt depending on the workload.
Check its current status:
cat /proc/sys/kernel/numa_balancing
To enable or disable:
echo 1 > /proc/sys/kernel/numa_balancing # Enable
echo 0 > /proc/sys/kernel/numa_balancing # Disable
Guidance:
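- Consider enabling for large, long-running apps like JVMs or databases that are not pinned manually
- Consider disabling for lightweight or batch processes, and for workloads already bound with numactl, where page migration adds overhead with little benefit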
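In scripts, it is convenient to turn the raw value into a readable state before deciding whether to change it. A minimal sketch, assuming any non-zero value means some form of balancing is active; `numa_balancing_state` is a hypothetical helper, and the optional argument exists only so it can be pointed at another file.

```shell
# numa_balancing_state [FILE]: print "enabled", "disabled", or "unsupported"
# based on the content of FILE (default: /proc/sys/kernel/numa_balancing).
numa_balancing_state() {
    file="${1:-/proc/sys/kernel/numa_balancing}"
    [ -r "$file" ] || { echo "unsupported"; return 1; }
    case "$(cat "$file")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
}

# Typical use on a live system:
#   numa_balancing_state
```

The "unsupported" case covers kernels built without automatic NUMA balancing, where the file is absent.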
Step 6: Monitor NUMA Metrics with perf
perf offers low-overhead tracking of memory access behavior. On CPUs that expose node-level cache events (check perf list), count them for a running process:
perf stat -e node-loads,node-load-misses -p <PID>
To analyze CPU cycles and identify bottlenecks:
perf record -g -e cycles -a -- sleep 5
perf report
Step 7: Tune Memory Allocation with HugePages
Using HugePages reduces TLB misses and page-table overhead for workloads with large memory footprints. To reserve 1024 huge pages (as root):
echo 1024 > /proc/sys/vm/nr_hugepages
To check status:
cat /proc/meminfo | grep Huge
Disable Transparent HugePages (optional for certain workloads):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
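To confirm what these settings amount to, the pool size and the active Transparent HugePages mode can be derived from the files above. The helpers below are a sketch with hypothetical names (`hugepage_pool_mb`, `thp_mode`); they assume the usual /proc/meminfo field names and the kernel's convention of marking the active THP mode with square brackets.

```shell
# hugepage_pool_mb: read /proc/meminfo content on stdin and print the size of
# the static HugePages pool in MB (HugePages_Total x Hugepagesize in kB).
hugepage_pool_mb() {
    awk '
        $1 == "HugePages_Total:" { total = $2 }
        $1 == "Hugepagesize:"    { size_kb = $2 }
        END { print total * size_kb / 1024 }'
}

# thp_mode: read the transparent_hugepage/enabled content on stdin and print
# the mode enclosed in square brackets.
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Typical use on a live system:
#   hugepage_pool_mb < /proc/meminfo
#   thp_mode < /sys/kernel/mm/transparent_hugepage/enabled
```

With 1024 pages of 2048 kB each, the pool comes to 2048 MB, which is memory no longer available for regular allocations.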
Step 8: Add NUMA-Awareness in Code (C/C++)
For custom applications, leverage libnuma for memory locality:
#include <numa.h>  /* link with -lnuma */

int main(void) {
    if (numa_available() != -1) {
        numa_set_preferred(0); // Prefer node 0 memory
    }
    return 0;
}
This is especially useful for real-time processing or latency-sensitive services.
Use Cases and Scenarios
Java Application with Garbage Collection Issues
- Symptoms: Long GC pauses, unpredictable behavior
- Solution: Add JVM flags for NUMA awareness
-XX:+UseNUMA -XX:+UseParallelGC
Bind the JVM to specific NUMA resources:
numactl --cpunodebind=1 --membind=1 java -jar app.jar
Slow Database Performance on Multi-Core Systems
- Symptoms: Queries slow down under load
- Solution: Bind the database process using systemd
[Service]
ExecStart=/usr/sbin/mysqld
CPUAffinity=4-7
Use numastat -p $(pidof mysqld) to monitor behavior after changes.
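CPUAffinity= pins CPUs only. On systemd versions that support the NUMAPolicy= and NUMAMask= settings, memory placement can be bound in the same unit. The sketch below writes such settings to a drop-in file; `write_numa_dropin`, the file name numa.conf, and the unit name mysqld.service in the usage comment are assumptions for illustration, so adjust them to the actual service.

```shell
# write_numa_dropin DIR NODE CPUS: create DIR/numa.conf with CPU and memory
# placement settings for a systemd service.
write_numa_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/numa.conf" <<EOF
[Service]
CPUAffinity=$3
NUMAPolicy=bind
NUMAMask=$2
EOF
}

# Typical use (as root), assuming the unit is named mysqld.service:
#   write_numa_dropin /etc/systemd/system/mysqld.service.d 1 4-7
#   systemctl daemon-reload && systemctl restart mysqld
```

A drop-in leaves the packaged unit file untouched, so the change survives package updates and is easy to revert.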
Extra Tools and Techniques
| Task | Command / Tool |
|---|---|
| Monitor CPU usage per core | mpstat -P ALL 1 |
| Inspect kernel slab cache usage | slabtop |
| View system memory statistics | cat /proc/meminfo |
| Align NIC IRQs with NUMA node | echo <mask> > /proc/irq/<IRQ>/smp_affinity |
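
The `<mask>` written to smp_affinity is a hexadecimal bitmask with one bit per CPU. As a sketch of how to derive it from a CPU list, the hypothetical helper `cpus_to_mask` below handles lists such as "4-7" or "0,2,4-5"; it assumes CPU IDs below 32, since larger systems use comma-separated 32-bit groups that this helper does not produce.

```shell
# cpus_to_mask LIST: convert a CPU list such as "4-7" or "0,2,4-5" into a
# hexadecimal bitmask (bit N set means CPU N is allowed).
# Assumes every CPU ID is below 32.
cpus_to_mask() {
    mask=0
    for part in $(echo "$1" | tr ',' ' '); do
        start=${part%-*}   # text before the dash, or the whole item
        end=${part#*-}     # text after the dash, or the whole item
        cpu=$start
        while [ "$cpu" -le "$end" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
}

# Typical use (as root), with the IRQ number left as a placeholder:
#   echo "$(cpus_to_mask 4-7)" > /proc/irq/<IRQ>/smp_affinity
```

For the sample topology, node 1 (CPUs 4 to 7) gives the mask f0. Where available, /proc/irq/<IRQ>/smp_affinity_list accepts the CPU list form directly.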






