Unlocking Peak Performance: PyTorch Adds Native NUMA Support to `torchrun` for Faster Distributed Training

Introduction

In the rapidly evolving landscape of artificial intelligence, performance is paramount. As models grow larger and datasets expand, the gap between raw computational power and real-world application speed is often defined by system-level bottlenecks. The latest PyTorch News brings a significant advancement on this front, directly addressing a critical hardware-software interaction that can silently throttle training performance. Meta AI has introduced native support for Non-Uniform Memory Access (NUMA) affinity directly into torchrun, PyTorch’s de facto tool for launching distributed jobs. This update is a game-changer for anyone training large models on multi-socket, multi-GPU servers, a common setup in both on-premise data centers and cloud environments like AWS SageMaker and Azure Machine Learning.

By providing a simple, declarative way to bind processes to specific CPU cores and their local memory, PyTorch is empowering developers to eliminate cross-socket memory latency, maximize hardware utilization, and ultimately accelerate their path from research to production. This development underscores a key trend seen across the AI ecosystem, from Google DeepMind News to NVIDIA AI News: the increasing importance of deep system-level optimizations to unlock the full potential of modern hardware.

Section 1: Understanding NUMA and Its Impact on Deep Learning

Before diving into the implementation, it’s crucial to understand the underlying hardware architecture that makes this new feature so impactful. Modern high-performance servers, especially those equipped with multiple CPUs and GPUs, often use a NUMA architecture.

What is Non-Uniform Memory Access (NUMA)?

In a traditional Uniform Memory Access (UMA) system, all CPUs have equal access speed to all parts of the main memory. This is common in consumer desktops and laptops. However, in multi-socket servers, this model becomes inefficient. A NUMA architecture divides the system into “NUMA nodes.” Each node consists of a CPU socket and its own dedicated, local memory. While any CPU can access memory from any other node (remote memory), accessing its own local memory is significantly faster.

Think of it as an office with multiple pods (NUMA nodes). Retrieving a file from your own desk’s filing cabinet (local memory) is much quicker than walking over to a colleague’s pod to get a file from their cabinet (remote memory). This difference in access speed—the “non-uniform” part—is the key characteristic of NUMA systems.
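
You can observe this latency difference directly with numactl before touching PyTorch at all. A minimal sketch, assuming a two-node server and that NumPy is installed (any memory-hungry command works): pin execution to node 0's CPUs, source memory first locally and then remotely, and compare the wall times.

# Local access: run on node 0's CPUs with memory allocated on node 0
time numactl --cpunodebind=0 --membind=0 \
    python -c "import numpy as np; a = np.zeros((20000, 20000)); a += 1"

# Remote access: same CPUs, but memory forced onto node 1 (expect a slower wall time)
time numactl --cpunodebind=0 --membind=1 \
    python -c "import numpy as np; a = np.zeros((20000, 20000)); a += 1"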

Why NUMA Matters for AI/ML Workloads

Deep learning training is an incredibly memory-intensive process. Data loading, preprocessing, and model computations constantly move data between CPU memory and GPU memory. In a distributed training setup on a multi-GPU server, multiple processes run concurrently, each managing a GPU. If the operating system scheduler places a training process on a CPU in one NUMA node, but its data resides in the memory of another NUMA node, every memory access incurs a high-latency penalty. This cross-node communication can become a severe bottleneck, causing the CPU to wait for data and leaving the expensive GPU underutilized. For large models like those from the Hugging Face Transformers News, where data pipelines are critical, this latency can add up to significant increases in overall training time. Optimizing for NUMA ensures that each process, its data, and its associated GPU are all physically located within the same NUMA node, minimizing latency and maximizing throughput.
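
DataLoader workers are a common place where this shows up: PyTorch spawns them as separate processes, and the OS is free to schedule them on the remote socket. As a manual, script-level complement to the launcher flags discussed below, a worker_init_fn can pin each worker to cores you know are local to the rank's GPU. This is only a minimal sketch; the core set is a placeholder you would derive from your own topology, and it constrains CPU placement only, not memory policy.

# numa_pinning_sketch.py (illustrative)
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Cores assumed to be local to this rank's GPU -- replace with your own topology
LOCAL_CORES = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}

def pin_worker(worker_id: int) -> None:
    # Restrict this DataLoader worker process to the NUMA-local cores
    os.sched_setaffinity(0, LOCAL_CORES)

dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4, worker_init_fn=pin_worker)

for features, labels in loader:
    pass  # training step would go here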

Checking Your System’s NUMA Configuration

You can easily inspect your server’s NUMA topology using command-line tools. The lscpu command provides a concise overview, while numactl offers more detail. This is the first step in diagnosing potential performance issues and planning your process mapping strategy.

# Use lscpu to get a quick overview of your NUMA architecture
lscpu | grep NUMA

# Example Output:
# NUMA node(s):          2
# NUMA node0 CPU(s):     0-11,24-35
# NUMA node1 CPU(s):     12-23,36-47

# Use numactl for a more detailed hardware inventory
numactl --hardware

# Example Output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
# node 0 size: 257586 MB
# node 0 free: 213456 MB
# node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
# node 1 size: 257644 MB
# node 1 free: 234567 MB
# node distances:
# node   0   1 
#   0:  10  21 
#   1:  21  10
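
Because the goal is to keep each process on the same node as its GPU, it also helps to know which NUMA node each GPU hangs off. On NVIDIA systems, a sketch of how to check this (the PCI address below is illustrative, and the NUMA Affinity column requires a reasonably recent driver):

# Show the GPU topology matrix, including CPU Affinity / NUMA Affinity columns
nvidia-smi topo -m

# Or ask the kernel directly: look up each GPU's PCI bus ID, then read sysfs
nvidia-smi --query-gpu=index,pci.bus_id --format=csv

# sysfs uses the short-form PCI address; prints the NUMA node ID (or -1 if unknown)
cat /sys/bus/pci/devices/0000:17:00.0/numa_node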

Section 2: Implementing NUMA Affinity with `torchrun`

With a clear understanding of the “why,” let’s explore the “how.” PyTorch’s torchrun (part of the torch.distributed.elastic module) is the standard utility for launching single-node or multi-node distributed training jobs. It handles worker management, fault tolerance, and now, NUMA affinity, making it an essential tool in the MLOps pipeline, rivaling orchestration tools discussed in Ray News and MLflow News.


New Command-Line Arguments for NUMA Binding

The core of this new feature is exposed through two simple yet powerful command-line arguments for torchrun:

  • --cpu-map: This argument controls how training processes are pinned to specific CPU cores. It accepts a mapping strategy or an explicit list of cores.
  • --mem-map: This argument controls how the memory for each process is allocated. It ensures that memory is sourced from the NUMA node local to the process’s assigned CPU.

For both arguments, torchrun provides an intelligent auto mode. When you specify --cpu-map auto and --mem-map auto, torchrun will automatically detect the system’s NUMA topology and GPU-to-NUMA node affinity. It then intelligently pins each training process (identified by its local rank) to the CPU cores on the same NUMA node as its assigned GPU. This automatic configuration is a massive simplification over manual scripting with tools like numactl.
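
For context, the manual approach this replaces usually meant wrapping each rank in numactl yourself. A rough sketch of that pattern, assuming torchrun's --no-python flag and a hand-maintained rank-to-node table (here, GPUs 0-3 on node 0 and GPUs 4-7 on node 1):

#!/usr/bin/env bash
# numa_wrapper.sh -- launch with: torchrun --standalone --nproc-per-node=8 --no-python ./numa_wrapper.sh
# torchrun still sets LOCAL_RANK for each worker it spawns.
if [ "${LOCAL_RANK}" -lt 4 ]; then NODE=0; else NODE=1; fi
exec numactl --cpunodebind="${NODE}" --membind="${NODE}" python main.py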

A Practical Example: Basic NUMA Binding

Let’s consider a common scenario: a server with 2 NUMA nodes and 8 GPUs, with 4 GPUs physically connected to each node. We want to launch 8 training processes, one for each GPU, and ensure each process is NUMA-local.

Here is a simple PyTorch training script, main.py, which we will use for our examples. It initializes the distributed process group and prints the device for each rank.

# main.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets these environment variables
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Initialize the process group
    dist.init_process_group(backend="nccl")
    
    # Assign a GPU to each process
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
    
    print(
        f"Hello from Rank {rank}/{world_size} (Local Rank {local_rank}). "
        f"Using device: {device}"
    )
    
    # Your training loop would go here
    # ...
    
    # Clean up
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Now, we can launch this script using torchrun with the new NUMA flags. The auto setting handles the complex mapping for us.

# Launch an 8-process job on the current machine (nnodes=1, nproc_per_node=8)
# The `auto` flag tells torchrun to detect NUMA/GPU topology and pin processes accordingly.

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc-per-node=8 \
    --cpu-map auto \
    --mem-map auto \
    main.py

When this command is executed, torchrun will ensure that the processes for local ranks 0-3 are bound to CPUs and memory on NUMA node 0 (assuming GPUs 0-3 are on that node), and processes for local ranks 4-7 are bound to CPUs and memory on NUMA node 1. This simple change can lead to significant performance improvements by eliminating cross-socket data transfers.
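
If you want each rank to confirm its own placement, a few extra lines in the training script will print the CPU set the process actually landed on. A minimal, Linux-only sketch (mapping the IDs back to a NUMA node is done with the lscpu output from Section 1):

# Drop-in check for main.py: report the cores this process may run on
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))
allowed_cpus = sorted(os.sched_getaffinity(0))
print(f"Local rank {local_rank} is pinned to CPUs: {allowed_cpus}")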

Section 3: Advanced NUMA Mapping and Custom Strategies

While the auto mode is incredibly convenient and covers most use cases, torchrun also provides fine-grained control for advanced scenarios or systems where automatic detection might not be optimal. This flexibility is crucial for performance engineers and researchers who need to eke out every last bit of performance, a topic often covered in DeepSpeed News and other MLOps publications.

Custom Mapping with Explicit Lists

You can provide a comma-separated list to --cpu-map and --mem-map to specify the exact binding for each local process. The list is indexed by the local rank. This is particularly useful for heterogeneous systems or for experimenting with different binding strategies.

For example, on our 2-node server, if we want to manually replicate the behavior of auto, we would first identify the NUMA node for each GPU and then create our map. Let’s assume GPUs 0-3 are on NUMA node 0 and GPUs 4-7 are on NUMA node 1.

torchrun interface - Law Enforcement Torch Run Benefitting Special Olympics GA Thursday ...
torchrun interface – Law Enforcement Torch Run Benefitting Special Olympics GA Thursday …
# Manually specify that the first 4 processes (local ranks 0-3) should be on NUMA node 0,
# and the next 4 processes (local ranks 4-7) should be on NUMA node 1.

# The list corresponds to the local rank of the process.
# --cpu-map 0,0,0,0,1,1,1,1 means:
# local_rank 0 -> NUMA node 0 CPUs
# local_rank 1 -> NUMA node 0 CPUs
# ...
# local_rank 4 -> NUMA node 1 CPUs
# etc.

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc-per-node=8 \
    --cpu-map 0,0,0,0,1,1,1,1 \
    --mem-map 0,0,0,0,1,1,1,1 \
    main.py

This explicit control is powerful. For instance, if you have a data-loading-heavy workload, you might experiment with assigning more CPU cores to your data loader processes or ensuring they are pinned to cores with the best I/O access, all configurable through these flags.

The `numa` Mapper: A Balanced Approach

In addition to auto and explicit lists, torchrun provides a numa mapper. This strategy distributes the processes evenly across the available NUMA nodes, assigning contiguous blocks of ranks to each node. For example, --cpu-map numa for an 8-process job on a 2-node machine would result in the same 0,0,0,0,1,1,1,1 mapping shown above. It’s a convenient shorthand when you want a balanced distribution without relying on the GPU affinity detection of the auto mode, which can be useful in CPU-only training scenarios or when the GPU topology is unusual.
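
Usage mirrors the earlier launch commands, swapping in the mapper name (shown here for both flags, on the assumption that --mem-map accepts the same mapper values):

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc-per-node=8 \
    --cpu-map numa \
    --mem-map numa \
    main.py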

Section 4: Best Practices, Verification, and Optimization

Simply adding flags is the first step; verifying the results and understanding the context is key to successful optimization. Here are some best practices and considerations when using NUMA affinity in your PyTorch workflows.

When to Use NUMA Binding

NUMA optimization is most impactful on multi-socket CPU servers with multiple GPUs. If you are training on a single-socket machine (like most developer workstations) or a laptop, your system is likely UMA, and these flags will have no effect. The benefits become more pronounced as the number of GPUs and CPU-GPU data transfer volume increases. This is especially relevant for users of large instances on cloud platforms like Vertex AI or bare-metal providers, where multi-socket configurations are common.

Verifying Process Affinity

After launching your job, it’s good practice to verify that the processes are pinned correctly. You can use system tools like htop (press ‘t’ for a tree view, and use F2 -> Columns to add the ‘PROCESSOR’ column) or numastat to observe process placement and memory usage per node.

A more direct method is to query the kernel for a specific process’s CPU and memory affinity by its PID:

# Find the PID of one of your training processes
pgrep -f "python main.py"

# Let's say the PID is 12345
# Check the CPU affinity list for that process
taskset -cp 12345

# Example Output (bound to NUMA node 0's cores):
# pid 12345's current affinity list: 0-11,24-35

# Check the allowed CPUs and memory nodes straight from the kernel
grep -E "Cpus_allowed_list|Mems_allowed_list" /proc/12345/status

# Example Output:
# Cpus_allowed_list:   0-11,24-35
# Mems_allowed_list:   0

# Show per-NUMA-node memory usage for the process
numastat -p 12345

Benchmarking for Real-World Gains

The performance improvement from NUMA binding can vary significantly based on your model architecture, data loading pipeline, and specific hardware. The only way to know the true impact is to benchmark. Run your training job for a fixed number of steps or epochs both with and without the --cpu-map auto --mem-map auto flags. Measure key metrics like training throughput (e.g., samples/second or images/second) and GPU utilization. In many I/O-bound or CPU-bound distributed workloads, you can expect to see a non-trivial performance uplift, sometimes in the range of 5-15% or even more, simply by enabling this feature.
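
A minimal way to get those numbers is to time a fixed number of steps per rank and report samples per second. The sketch below assumes a CUDA device and uses a stand-in model and synthetic host-side batches, so the CPU-to-GPU copy path that NUMA placement affects is included in the measurement:

# throughput_sketch.py (illustrative)
import time

import torch

def measure_throughput(model, device, batch_size=64, steps=200):
    # Time forward/backward passes, including host-side batch prep and the H2D copy
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(steps):
        batch = torch.randn(batch_size, 1024).pin_memory().to(device, non_blocking=True)
        target = torch.randint(0, 10, (batch_size,)).to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(batch), target)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    print(f"{steps * batch_size / elapsed:.1f} samples/sec on {device}")

if __name__ == "__main__":
    measure_throughput(torch.nn.Linear(1024, 10), torch.device("cuda:0"))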

Conclusion

The introduction of native NUMA support in torchrun is a testament to the PyTorch team’s commitment to production-grade performance and usability. This latest entry in PyTorch News provides a deceptively simple solution to a complex systems-level problem that has long been a source of performance bottlenecks in large-scale deep learning. By abstracting away the intricacies of process pinning and memory binding, developers can now easily ensure their distributed training jobs are NUMA-aware, leading to lower memory latency, higher GPU utilization, and faster training times.

As models continue to scale, such hardware-aware software optimizations are no longer a niche concern but a fundamental requirement for efficient MLOps. Whether you are working with frameworks from the LangChain News or training massive models from Mistral AI News, the underlying compute engine’s efficiency is critical. We encourage all PyTorch users who train on multi-socket servers to explore this new feature. Start with the auto setting, benchmark your workloads, and unlock the hidden performance potential of your hardware. This is a powerful new tool in the arsenal for building faster, more efficient AI systems.