Process And Memory Affinity: Why Do You Care?


I've written about NUMA effects and process affinity on this blog many times. It's a complex topic with a lot of real-world effects on your MPI and HPC applications. If you're not using processor and memory affinity, you're likely experiencing performance degradation without even realizing it.

In short:

If you're not booting your Linux kernel in NUMA mode, you should be.
If you're not using processor affinity with your MPI/HPC applications, you should be.

Figure: hwloc-drawn schematic of an Intel Sandy Bridge-based server.

Take a look at this hwloc-drawn schematic of the inside of a modern Intel "Sandy Bridge"-based server. Notice a few things about it:

Each processor core has its own L1 and L2 caches. They aren't shared.
The L3 cache is shared among all processor cores on the same processor socket.
Nothing is shared - not even back-end RAM - between processor cores on different processor sockets.
There are multiple network interfaces, and each is local to a specific NUMA node.
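
If you don't have hwloc handy, you can get a rough version of the same picture from Linux sysfs. The sketch below is only an illustration, not what hwloc does internally: it assumes a Linux system that exposes /sys/devices/system/node, and the helper names parse_cpulist and numa_nodes are made up for this example.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8,10-11" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only NUMA node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]} from sysfs, or {} if NUMA info is unavailable."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if not match:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as f:
                nodes[int(match.group(1))] = parse_cpulist(f.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"NUMA node {node}: CPUs {cpus}")
```

An empty result usually means the kernel is not exposing any NUMA information on that machine.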

I mention these things because they all have distinct performance implications in an HPC application. Let's go through each of those points again, in order:

You absolutely do not want any individual HPC process to migrate to a different processor core. Doing so would invalidate all your L1 and L2 cache entries, which could cause all manner of cache misses and pipeline stalls. If this happens frequently, overall performance can plummet.
The L3 on this particular chip (Intel ) is fairly large (20MB) - it somewhat mitigates the cost of process movement within a single processor socket. This is good. But if your process is L1/L2-cache sensitive, it may not be enough.
If the OS decides to move your process to a different processor socket, you've lost it all - your immediate data in the L1 and L2 caches, whatever secondary storage you had cached in L3... all of it. Ouch. L1 and L2 cache misses are bad enough; L3 cache misses and resulting pipeline stalls... Ouch.
When sending and receiving MPI messages, if your application is latency-sensitive, you really only want to use the network interfaces that are on the same NUMA node as the target process. Otherwise, your MPI message has to traverse to the other NUMA node before it gets to the network hardware, thereby incurring additional delay.
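
To see which network interfaces are local to which NUMA node, one option on Linux is the numa_node file that sysfs exposes for PCI devices. This is a minimal sketch under that assumption; nic_numa_node and local_nics are hypothetical helper names, and a well-behaved MPI implementation would normally make this choice for you.

```python
import os


def nic_numa_node(iface, sysfs_root="/sys/class/net"):
    """Return the NUMA node a NIC is attached to, or None if unknown.

    The kernel reports -1 when it has no locality information (for example
    on single-node machines); that case is mapped to None as well. Virtual
    interfaces such as "lo" have no device directory and also yield None.
    """
    path = os.path.join(sysfs_root, iface, "device", "numa_node")
    try:
        with open(path) as f:
            node = int(f.read().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None


def local_nics(node, sysfs_root="/sys/class/net"):
    """List interfaces that sysfs reports as attached to the given NUMA node."""
    try:
        ifaces = sorted(os.listdir(sysfs_root))
    except OSError:
        return []
    return [i for i in ifaces if nic_numa_node(i, sysfs_root) == node]


if __name__ == "__main__":
    print("NICs local to NUMA node 0:", local_nics(0))
```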

There are actually two performance effects from these four items. We already discussed how overall latency can be affected on each operation.

But don't forget that network congestion is another important performance effect.

"That's crazy talk!" you say. "We're talking about the inside of a single server. There's no network here!"

Not so, my friend. Don't forget what connects all the NUMA nodes and processor sockets together: QPI (or HyperTransport in AMD machines). That's a network. A full-blown, real network. With protocols, packets, and checksums. Oh my!

You can't completely eliminate the traffic flowing across the NUMA-node-connecting network, but you do want to minimize it. Networking 101 tells us that, in many cases, reducing congestion and contention on network links leads to better overall performance of the fabric. The same principle holds for networks inside a server as it does for networks outside of a server.
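
Linux gives a rough hint of how expensive that internal network is: each NUMA node directory in sysfs has a distance file with relative costs, where the local node is conventionally 10. The sketch below reads those values. It assumes node IDs are numbered contiguously from 0, and numa_distances and remote_penalty are illustrative names only. The numbers are firmware-reported hints, not measured latencies.

```python
import glob
import os
import re


def numa_distances(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [distance to node 0, node 1, ...]} read from sysfs."""
    table = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "distance")):
        node_dir = os.path.basename(os.path.dirname(path))
        match = re.fullmatch(r"node(\d+)", node_dir)
        if not match:
            continue
        try:
            with open(path) as f:
                table[int(match.group(1))] = [int(v) for v in f.read().split()]
        except (OSError, ValueError):
            continue
    return table


def remote_penalty(table, src, dst):
    """Ratio of the src->dst distance to src's local distance (1.0 means local).

    Assumes contiguous node numbering, so row[n] is the distance to node n.
    """
    row = table[src]
    return row[dst] / row[src]


if __name__ == "__main__":
    for node, row in sorted(numa_distances().items()):
        print(f"node {node}: {row}")
```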

Using processor and memory affinity helps minimize all of the effects described above. Processes start and stay in a single location, and all the data they use in RAM tends to stay on the same NUMA node (thereby making it local). Caches aren't thrashed. Well-behaved MPI implementations use local NICs (when available). Less inter-NUMA-node traffic = more efficient computation.
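
Here's roughly what "start and stay in a single location" looks like at the OS level. This is a minimal sketch that assumes Linux, where Python exposes os.sched_setaffinity; pin_to_cpu is a made-up helper, and picking the lowest allowed CPU is an arbitrary choice for illustration. In practice you would usually let the MPI launcher do the binding (Open MPI's mpirun, for example, accepts --bind-to core) instead of hand-rolling it.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Bind a process (default: the caller) to one logical CPU; return the new mask."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)   # CPUs the scheduler may use right now
    target = min(allowed)               # arbitrary pick: lowest allowed CPU
    print("before:", sorted(allowed))
    print("after: ", sorted(pin_to_cpu(target)))
```

Once the process is pinned, memory it touches for the first time will tend to be allocated on the local NUMA node under the default Linux policy, which is how processor affinity and memory affinity end up working together.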

With all that background, let's go back and address the first two points from this blog entry:

If your Linux kernel (and/or BIOS) is ignoring the NUMA layout of RAM, the memory associated with one process may be located on a remote NUMA node. Result: lots of inter-NUMA-node traffic to get to that memory. Bad. When Linux is NUMA-aware, it will (by default) make an effort to use memory on the NUMA node local to the allocating process.
Process affinity controls in modern MPI implementations and operating systems are getting pretty good. By telling the MPI implementation to bind processes to unique cores or sockets (typically via a command line parameter or environment variable), the OS won't migrate processes around, and you might see a bit of a performance boost in your applications. For free.
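
One way to check whether memory is actually ending up local is the per-node numastat counters in Linux sysfs. The sketch below assumes a NUMA-enabled Linux kernel; parse_numastat, remote_fraction, and node_numastat are illustrative names. It uses the local_node and other_node counters, which count pages allocated on a node by processes running on that node versus on some other node.

```python
import os


def parse_numastat(text):
    """Parse "name value" lines from a numastat file into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def remote_fraction(counters):
    """Share of this node's allocations made by processes running elsewhere."""
    local = counters.get("local_node", 0)
    other = counters.get("other_node", 0)
    total = local + other
    return other / total if total else 0.0


def node_numastat(node, sysfs_root="/sys/devices/system/node"):
    """Read the counters for one NUMA node, or {} if they are unavailable."""
    path = os.path.join(sysfs_root, f"node{node}", "numastat")
    try:
        with open(path) as f:
            return parse_numastat(f.read())
    except OSError:
        return {}


if __name__ == "__main__":
    stats = node_numastat(0)
    print(f"node 0 remote allocation fraction: {remote_fraction(stats):.3f}")
```

The counters are cumulative since boot, so comparing readings taken before and after a run says more than a single snapshot.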

Use affinity. It's a Good Thing. And it's likely free and easy to enable in your MPI implementation / operating system already.

(Final note: the particular server shown in this example only has one processor socket per NUMA node. Things get even more... interesting... if there are multiple sockets per NUMA node.)
