Tree-based launch in Open MPI (part 2)


In my prior blog entry, I described the basics of Open MPI's tree-based launching system over ssh (yes, there are still some valid / good reasons for using ssh over a native job scheduler / resource manager's parallel launch mechanisms...).

That entry got a little long, so I split the rest of the discussion into a separate blog entry.

The prior entry ended after describing that Open MPI uses a binomial tree-based launcher.

One thing I didn't say in the last entry: the tree-based launcher is not only an optimization, it's also necessary for launching larger parallel jobs. There are operating system-imposed limits on the number of open file descriptors in a process, meaning that mpirun simply can't open an ssh session to all remote servers as the number of servers scales up.

There were real-world cases of users hitting those limits, thereby forcing the move to a more scalable, tree-based system.
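To get a feel for that limit, here is a minimal Python sketch (Unix-only, since it uses the `resource` module) that estimates how many direct ssh sessions a single process could hold open. The numbers are assumptions made for illustration, not Open MPI's actual bookkeeping: `fds_per_session` guesses three descriptors per session (one pipe each for stdin, stdout, and stderr), and `reserved_fds` is an arbitrary allowance for everything else the process has open.

```python
import resource


def sessions_for_limit(fd_limit, fds_per_session=3, reserved_fds=16):
    """Estimate how many ssh sessions fit under a file descriptor limit.

    fds_per_session and reserved_fds are illustrative assumptions,
    not values taken from Open MPI.
    """
    return max(0, (fd_limit - reserved_fds) // fds_per_session)


def max_direct_ssh_sessions(fds_per_session=3, reserved_fds=16):
    """Apply the estimate to this process's current soft limit."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None  # no per-process limit is configured
    return sessions_for_limit(soft_limit, fds_per_session, reserved_fds)


if __name__ == "__main__":
    print("estimated direct ssh sessions:", max_direct_ssh_sessions())
```

Whatever the exact per-session cost, the estimate is a fixed ceiling, while the number of servers in a job is not. That is the scaling problem the tree solves.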

Since the initial implementation of the tree-based launcher in 2009, server CPUs and networks have gotten significantly faster: an individual ssh session is significantly faster to establish than it was six years ago. As a direct result, Open MPI added two more improvements to its ssh tree-based launch.

The first was to remove the serialization of ssh sessions on a single server.

Instead, each interior node in the tree now overlaps the initiation of all of its ssh connections. Meaning: fork all the ssh processes at once, and then service each one as its connection starts progressing. Or, more simply: parallelize the initiation of ssh connections.

This overlap of ssh session initiation significantly speeds up the overall process, for two reasons:

Modern servers tend to have a lot of cores; during job launch, it's common to be able to co-opt many cores to effect job startup duties. Hence, the individual ssh processes can be load balanced across all available cores.
The progress of each ssh process blocks on network IO for "long" periods of time (thousands or millions of CPU cycles), allowing the OS to swap it out and progress other work (e.g., other ongoing ssh processes).
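The "fork everything first, then collect" pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Open MPI's launcher code; `launch_overlapped` is a hypothetical helper, and the demo runs local Python one-liners in place of ssh so that it works without any remote hosts.

```python
import subprocess
import sys


def launch_overlapped(commands):
    """Start every command before waiting on any of them.

    Returns a list of (exit_code, stdout_text) in the same order as
    commands. Hypothetical helper, for illustration only.
    """
    # Phase 1: fork all children. Popen returns as soon as the child
    # has been started, so the children run concurrently.
    children = [
        subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
        )
        for cmd in commands
    ]

    # Phase 2: collect results. While we wait on one child, the OS is
    # free to schedule the others on any available core.
    results = []
    for child in children:
        output, _ = child.communicate()
        results.append((child.returncode, output.strip()))
    return results


if __name__ == "__main__":
    # Stand-ins for ssh sessions; a real launcher would run something
    # like ["ssh", host, ...] for each child in the tree.
    demo = [[sys.executable, "-c", f"print('child {i}')"] for i in range(4)]
    for exit_code, text in launch_overlapped(demo):
        print(exit_code, text)
```

A serialized launcher would instead start one child, wait for it to finish, and only then start the next, so the total time would be the sum of the session setup times rather than roughly the longest one.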

The second Open MPI improvement was to switch from a binomial tree to a radix tree.

Specifically, by default:

For jobs up to 1K servers, each server in the interior of the launch tree (including mpirun) launches on 32 servers.
For jobs between 1K-4K servers, each interior server launches on 64 servers.
For jobs larger than 4K servers, each interior server launches on 128 servers.
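Written as code, the default selection above is a simple threshold lookup. This is a sketch, not Open MPI's source: `default_radix` is a hypothetical name, "1K" and "4K" are assumed to mean 1024 and 4096, and the boundaries are assumed to be inclusive on the lower tier.

```python
def default_radix(num_servers):
    """Return the default fan-out for a job of the given size.

    Assumes 1K = 1024 and 4K = 4096, with each boundary value
    belonging to the smaller tier.
    """
    if num_servers <= 1024:
        return 32
    if num_servers <= 4096:
        return 64
    return 128


if __name__ == "__main__":
    for size in (100, 2000, 10000):
        print(size, "servers -> radix", default_radix(size))
```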

While users can certainly override the default radix value at run time, these defaults reflect two observations:

A heuristic: because modern CPUs are so fast, the time to complete N overlapped ssh connections from a single server tends to be less than the time to complete N overlapped ssh connections split between a single parent-child pair in the interior of a launch tree.
Open MPI's "out of band" command-and-control network is routed through the launch tree structure. As such, it is desirable to keep the tree shallow.
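To see how much shallower a radix tree is, the following Python sketch compares tree depth under a simplified model. The assumptions: mpirun is the root and is not counted among the servers; the binomial tree uses the common rank numbering in which a rank's depth equals the number of set bits in its rank; and the radix tree fills each level completely before starting the next. Open MPI's actual routing layout may differ in detail.

```python
def binomial_depth(num_servers):
    """Depth of the deepest server in a binomial tree rooted at mpirun.

    With ranks 0..num_servers (rank 0 is mpirun), a rank's depth is its
    number of set bits, so the maximum is floor(log2(num_servers + 1)).
    """
    return (num_servers + 1).bit_length() - 1


def radix_depth(num_servers, radix):
    """Depth of a radix tree in which every node launches `radix` children."""
    depth = 0
    reachable = 0    # servers reachable within the current depth
    level_width = 1  # number of nodes on the current level (mpirun alone)
    while reachable < num_servers:
        level_width *= radix
        reachable += level_width
        depth += 1
    return depth


if __name__ == "__main__":
    for servers, radix in ((1024, 32), (4096, 64), (16384, 128)):
        print(servers, "servers: binomial depth", binomial_depth(servers),
              "vs radix-%d depth" % radix, radix_depth(servers, radix))
```

In this model, 1024 servers end up 10 levels deep under a binomial tree but only 2 levels deep with a radix of 32, which means far fewer hops for messages routed through the tree.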

These two improvements - pipelining ssh and using a radix tree - together make launching via ssh quite viable, even at large scale.

More improvements are certainly possible (and desirable). For example, there is ongoing work to separate the "out of band" message routing from the job launch topology, thereby allowing smaller radixes, more parallelization, and potentially shorter overall job launch time.

Stay tuned for future blog entries on this topic!
