Discussion:
[OMPI users] Mapping and Ranking in 3.1.3
Ben Menadue
2018-11-07 03:25:30 UTC
Hi,

Consider a hybrid MPI + OpenMP code on a system with 2 x 8-core processors per node, running with OMP_NUM_THREADS=4. A common placement policy we see is rank 0 on the first 4 cores of the first socket, rank 1 on the second 4 cores, rank 2 on the first 4 cores of the second socket, and so on. In 3.1.2 this is easily accomplished with

$ mpirun --map-by ppr:2:socket:PE=4 --report-bindings
[raijin1:07173] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
[raijin1:07173] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
[raijin1:07173] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
[raijin1:07173] MCW rank 3 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]
<and similarly on subsequent nodes>
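
(For anyone reproducing this: a quick way to cross-check the placement from inside the application is a minimal hybrid program along the lines of the sketch below. This is just an illustration, not part of the original runs; it assumes glibc’s sched_getcpu(), and the file name is arbitrary.)

#define _GNU_SOURCE   /* for sched_getcpu() on glibc */
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each OpenMP thread reports the CPU it is currently running on,
     * which can be compared against the --report-bindings output. */
    #pragma omp parallel
    {
        printf("rank %d thread %d on cpu %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}

Built with e.g. "mpicc -fopenmp check_bind.c -o check_bind" and run under the mpirun line above, rank 0’s threads should report CPUs from the first four cores of socket 0, and so on.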

although looking at the man page now, this seems to be an invalid construct (even though it worked).

However, it looks like this (mis)use no longer works in Open MPI 3.1.3:

--------------------------------------------------------------------------
An invalid value was given for the number of processes
per resource (ppr) to be mapped on each node:

PPR: 2:socket:PE=4

The specification must be a comma-separated list containing
combinations of number, followed by a colon, followed
by the resource type. For example, a value of "1:socket" indicates that
one process is to be mapped onto each socket. Values are supported
for hwthread, core, L1-3 caches, socket, numa, and node. Note that
enough characters must be provided to clearly specify the desired
resource (e.g., "nu" for "numa").
--------------------------------------------------------------------------

We’ve come up with an equivalent, but it needs both --map-by and --rank-by:

$ mpirun --map-by node:PE=4 --rank-by core

(Without the --rank-by, it round-robins ranks between the nodes first, as expected, rather than filling the ranks on each node.) Is this the correct approach for getting this distribution?
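
For concreteness, the full invocation would be something like the following, with ./hybrid standing in for the application binary:

$ OMP_NUM_THREADS=4 mpirun --map-by node:PE=4 --rank-by core --report-bindings ./hybrid

With --rank-by core, ranks 0-3 should land on the first node’s cores in order before rank 4 moves to the second node.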

As an aside, I’m not sure if this is the expected behaviour, but using --map-by socket:PE=4 fails because it tries putting rank 4 on the first socket of the first node even though there are no free cores left there (because of the PE=4), instead of moving on to the next node. But we’d still need the --rank-by option in this case anyway.
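
That is, an invocation along these lines fails rather than wrapping to the next node (again, ./hybrid is just a placeholder):

$ mpirun --map-by socket:PE=4 --report-bindings ./hybrid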

Cheers,
Ben
