Hardware And Software Queuing

I've talked before about how getting high performance in MPI is all about offloading to dedicated hardware. You want to get software out of the way as soon as possible and let the underlying hardware progress the message passing at max speed.

But the funny thing about networking hardware: it tends to have limited resources. You might have incredibly awesome NICs in your HPC cluster, but they only have a finite (small) amount of resources such as RAM, queues, queue depth, descriptors (for queue entries), etc.

MPI's job is to manage these resources. But with rising core counts - such as the Intel Xeon E5-2690 v2's 10 cores per socket - distributing those network / NIC resources among all the MPI processes on a server can require treading a fine line between aggressive performance and fair sharing.

In general, many MPI implementations divide up networking resources evenly/fairly at the beginning of a job. Each MPI process gets 1/Nth of the networking resources on a server (i.e., shared among N MPI processes on that server). Then the MPI implementation attempts to keep its hardware resources as full as possible so that the main CPU can go off to do other things and let the networking hardware progress itself.
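As a rough illustration of that even split, here is a minimal sketch in C. The function name nic_share, the idea of counting resources in "slots", and the choice to hand any remainder to the lowest-numbered local processes are all assumptions made for the example; real MPI implementations have their own policies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many of `total` NIC queue slots does the
 * process with index `local_rank` get when `nprocs` MPI processes on
 * the same server share them evenly?  Leftover slots (when the
 * division is not exact) go to the lowest-numbered local processes. */
size_t nic_share(size_t total, size_t nprocs, size_t local_rank)
{
    assert(nprocs > 0 && local_rank < nprocs);

    size_t base  = total / nprocs;   /* everyone gets at least this many */
    size_t extra = total % nprocs;   /* slots left over after the split  */

    return base + (local_rank < extra ? 1 : 0);
}
```

For example, with 4096 slots and 20 local processes (say, two 10-core sockets), the first 16 processes would get 205 slots each and the remaining 4 would get 204.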

Much network hardware is implemented with some kind of queue-based interface back to the main CPU. For example, the host enqueues a "send" or a "receive" descriptor on the networking hardware, and then the networking hardware's processor dequeues the descriptor and processes it.

In the case of a send, for example, the network hardware likely does something like this:

1. Dequeue the descriptor
2. Realize that it's a send descriptor
3. Transfer the packet buffer from RAM (e.g., via DMA over the PCI bus) to local memory
4. If necessary, form a network header
5. Send the network header out on the physical layer (e.g., wire or fiber)
6. Send the packet buffer out on the physical layer

These actions are likely overlapped and/or pipelined - they don't need to occur in serial. And since the networking hardware is running on an ASIC that was created specifically for this purpose, it's really, really fast and efficient.

Receive descriptors are usually a way of telling the NIC that when an incoming packet matching a certain pattern arrives, put it in a specific RAM location where the host can process the incoming packet.
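To make the queue-based interface a little more concrete, here is a minimal sketch of a fixed-size descriptor ring in C. Everything in it (desc_t, ring_t, RING_SLOTS, and the two functions) is hypothetical and simulated entirely in host memory; a real NIC defines its own descriptor layout and is typically notified through a doorbell register rather than a function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8   /* tiny for illustration; real queues are larger */

typedef enum { DESC_SEND, DESC_RECV } desc_type_t;

/* Hypothetical descriptor: what to do, and which host buffer to use. */
typedef struct {
    desc_type_t type;
    void       *buf;
    size_t      len;
} desc_t;

typedef struct {
    desc_t slots[RING_SLOTS];
    size_t head;    /* next slot the (simulated) NIC will dequeue */
    size_t tail;    /* next slot the host will fill               */
    size_t count;   /* descriptors currently in the ring          */
} ring_t;

/* Host side: post a descriptor.  Returns false if the ring is full. */
bool ring_enqueue(ring_t *r, desc_t d)
{
    assert(r != NULL);
    if (r->count == RING_SLOTS)
        return false;
    r->slots[r->tail] = d;
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->count++;
    return true;
}

/* NIC side: take the oldest descriptor.  Returns false if empty. */
bool ring_dequeue(ring_t *r, desc_t *out)
{
    assert(r != NULL && out != NULL);
    if (r->count == 0)
        return false;
    *out = r->slots[r->head];
    r->head = (r->head + 1) % RING_SLOTS;
    r->count--;
    return true;
}
```

The important property is the false return from ring_enqueue(): once the ring is full, the host has to wait for the hardware to drain it before posting anything else.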

But remember how I said above that network hardware has its limits? These limits are generally much smaller than those out in main memory. The queue to communicate with the NIC, for example, can only hold so many descriptors - usually a few thousand or so.

Now also remember that the main CPU is typically really flippin' fast. Probably much faster than the NIC hardware. Meaning: it can probably queue up descriptors much faster than the network hardware can dequeue them.

What's an MPI implementation to do, for example, in a case like this:

    char message;
    for (int i = 0; i < 2000000; ++i) {
        MPI_Isend(&message, 1, MPI_CHAR, dest, tag, comm, &req[i]);
    }

The MPI will likely aggressively enqueue send descriptors for the first bunch of those messages.

...but then the network hardware queue will fill up. Yoinks!

Remember that MPI is a reliable and guaranteed message passing system. So if you call MPI_Isend(), the message either has to (eventually) be delivered, or MPI_Isend() must fail (which may not be discovered until you test or wait on the resulting request, but let's ignore that fact for the moment).

In this case, the MPI will likely queue up the messages in software, and wait for some space to become available in the network hardware queue before enqueueing the next bunch. Simple enough, right?
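Here is a minimal sketch of that overflow behavior, using plain counters instead of real descriptors. The names send_path_t, post_send, and on_completions are made up for the example, and the "hardware queue" is just a number; the sketch only shows the bookkeeping, not how any particular MPI implementation does it.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t hw_capacity;  /* slots in the (simulated) hardware queue     */
    size_t hw_used;      /* descriptors currently posted to hardware    */
    size_t sw_pending;   /* sends parked in software, awaiting a slot   */
} send_path_t;

/* Hypothetical hook called from the MPI_Isend path for each message. */
void post_send(send_path_t *p)
{
    assert(p->hw_used <= p->hw_capacity);

    if (p->sw_pending == 0 && p->hw_used < p->hw_capacity)
        p->hw_used++;     /* fast path: straight to hardware */
    else
        p->sw_pending++;  /* queue behind earlier sends to keep ordering */
}

/* Hypothetical hook called when hardware reports `n` finished sends.
 * Freed slots are refilled from the software queue, oldest first. */
void on_completions(send_path_t *p, size_t n)
{
    assert(n <= p->hw_used);
    p->hw_used -= n;

    while (p->sw_pending > 0 && p->hw_used < p->hw_capacity) {
        p->sw_pending--;
        p->hw_used++;
    }
}
```

Note that a new send goes to the software queue whenever anything is already waiting there, even if a hardware slot happens to be free; otherwise a later message could overtake an earlier one.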

Yes... and no.

Depending on how the network hardware works, send and receive descriptors may share the same queue. And if you're slamming that queue with send descriptors, you won't be able to post any receive descriptors. But if you're simultaneously receiving network traffic (and you probably are), you need to keep re-posting new receive buffers. Hence: you'd better keep a little space in that queue for re-posting receive buffers.

So it may not be a good idea to completely monopolize the hardware queue with send requests.
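One simple way to picture this is a shared queue where sends are refused once only a reserved number of slots remain. The sketch below is an assumption-laden illustration: shared_queue_t, recv_reserve, and both functions are invented names, and a real implementation would likely tune the reserve dynamically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t capacity;      /* total slots shared by sends and receives */
    size_t used;          /* slots currently occupied                 */
    size_t recv_reserve;  /* slots that sends are never allowed to take */
} shared_queue_t;

/* Sends may only use slots outside the receive reserve. */
bool try_post_send(shared_queue_t *q)
{
    assert(q->recv_reserve <= q->capacity);
    if (q->used + q->recv_reserve >= q->capacity)
        return false;     /* caller should park the send in software */
    q->used++;
    return true;
}

/* Receives may use any free slot, including the reserved ones. */
bool try_post_recv(shared_queue_t *q)
{
    assert(q->recv_reserve <= q->capacity);
    if (q->used >= q->capacity)
        return false;
    q->used++;
    return true;
}
```

With a capacity of 8 and a reserve of 2, a burst of sends stops being accepted after 6 slots are filled, and the last 2 slots stay available for re-posting receive buffers.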

And what about other messages that need to get sent? For example, the MPI may also need to send some control messages around, such as replying to large message rendezvous requests, etc.

Additionally, if the MPI is really smart, it may be able to coalesce some of these software-enqueued MPI_Isend requests. For example, if it sees the same message going to the same receiver on the same communicator and tag, it can just tick off a counter and say "send N of this same message", and then only actually send that message (and counter) across the network once. The receiver will expand that message into N receives, and all is good.

(There are a few variations possible on the above coalescing scheme, but note that that kind of situation usually only happens in benchmarks!)
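A minimal sketch of the counting trick might look like the following. The pending_send_t record, the integer comm_id standing in for an MPI communicator, and the rule that "same message" means "same buffer address and length" are all assumptions for illustration, not how any specific MPI implementation decides it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one entry in the software send queue. */
typedef struct {
    int         dest;     /* destination rank                         */
    int         tag;      /* MPI tag                                  */
    int         comm_id;  /* stand-in for the communicator (assumed)  */
    const void *buf;      /* send buffer                              */
    size_t      len;      /* bytes to send                            */
    size_t      repeat;   /* how many identical sends this represents */
} pending_send_t;

/* Try to fold a new send into the most recently queued entry `last`.
 * Only the tail entry is considered so that message order is kept.
 * Returns true if the send was absorbed by bumping the counter. */
bool try_coalesce(pending_send_t *last, int dest, int tag, int comm_id,
                  const void *buf, size_t len)
{
    assert(last != NULL && last->repeat >= 1);

    if (last->dest == dest && last->tag == tag &&
        last->comm_id == comm_id && last->buf == buf && last->len == len) {
        last->repeat++;   /* "send N of this same message" */
        return true;
    }
    return false;         /* caller appends a new queue entry instead */
}
```

When the entry eventually reaches the hardware, the message and its repeat count would go across the network once, and the receiver would expand them back into that many receives.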

The point: MPI has to carefully tread between the absolute performance of a sequence of sends and overall fairness/performance. Allocating hardware resources fairly to ensure an overall level of high performance is a difficult balancing act.
