Discussion:
[OMPI users] Launching hybrid MPI/OpenMP jobs on a cluster: correct OpenMPI flags?
Wirawan Purwanto
2016-10-03 21:22:33 UTC
Hi,

I have been trying to understand how to correctly launch hybrid
MPI/OpenMP jobs (i.e. multi-threaded MPI processes) with mpirun, and I
am quite puzzled about which command-line options to use. The
description in the mpirun man page is very confusing and I could not
get what I wanted.

Some background: the cluster uses SGE, and I am running OpenMPI 1.10.2
compiled with and for gcc 4.9.3. The MPI library was configured with SGE
support. Each compute node has 32 cores: two sockets of Xeon E5-2698 v3
(16-core Haswell).

A colleague told me the following:

$ export OMP_NUM_THREADS=2
$ mpirun -np 16 -map-by node:PE=2 ./EXECUTABLE

I could see the executable using 200% of CPU per process--that's good.
There is one catch in the general case, though: "-map-by node" assigns
the MPI processes in a round-robin fashion (MPI rank 0 goes to node 0,
MPI rank 1 to node 1, and so on until every node has one process, then
it wraps back to node 0, 1, ...).
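
A quick way to see the resulting layout explicitly, I believe, is
--report-bindings with a throwaway command in place of the real
executable:

$ mpirun -np 16 -map-by node:PE=2 --report-bindings hostname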

Instead of the scenario above, I was trying to get the MPI processes
placed side by side (more like the "fill_up" policy in the SGE
scheduler), i.e. fill node 0 first, then fill node 1, and so on. How do
I do this properly?

Here are a few attempts that failed:

$ export OMP_NUM_THREADS=2
$ mpirun -np 16 -map-by core:PE=2 ./EXECUTABLE

or

$ export OMP_NUM_THREADS=2
$ mpirun -np 16 -map-by socket:PE=2 ./EXECUTABLE

Both failed with this error message:

--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.
--------------------------------------------------------------------------

Another attempt was:

$ export OMP_NUM_THREADS=2
$ mpirun -np 16 -map-by socket:PE=2 -bind-to socket ./EXECUTABLE

Here's the error message:

--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a conflicting binding
policy was specified:

#cpus-per-proc: 2
type of cpus: cores as cpus
binding policy given: SOCKET

The correct binding policy for the given type of cpu is:

correct binding policy: bind-to core

This is the binding policy we would apply by default for this
situation, so no binding need be specified. Please correct the
situation and try again.
--------------------------------------------------------------------------
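
If I read that hint correctly, the accepted form would presumably be
the following (I have not verified it, and given that plain "-map-by
socket:PE=2" already failed above I am not sure it would behave any
differently):

$ mpirun -np 16 -map-by socket:PE=2 -bind-to core ./EXECUTABLE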

Clearly I am not understanding how map-by works. Could somebody
help me? There is a partially written wiki article:

https://github.com/open-mpi/ompi/wiki/ProcessPlacement

but unfortunately it is also not clear to me.
--
Wirawan Purwanto
Computational Scientist, HPC Group
Information Technology Services
Old Dominion University
Norfolk, VA 23529
r***@open-mpi.org
2016-10-04 02:15:22 UTC
FWIW: the socket option seems to work fine for me:

$ mpirun -n 12 -map-by socket:pe=2 -host rhc001 --report-bindings hostname
[rhc001:200408] MCW rank 1 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]]: [../../../../../../../../../../../..][BB/BB/../../../../../../../../../..]
[rhc001:200408] MCW rank 2 bound to socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../../../../../../..][../../../../../../../../../../../..]
[rhc001:200408] MCW rank 3 bound to socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../../../../../..][../../BB/BB/../../../../../../../..]
[rhc001:200408] MCW rank 4 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]: [../../../../BB/BB/../../../../../..][../../../../../../../../../../../..]
[rhc001:200408] MCW rank 5 bound to socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]]: [../../../../../../../../../../../..][../../../../BB/BB/../../../../../..]
[rhc001:200408] MCW rank 6 bound to socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../../../BB/BB/../../../..][../../../../../../../../../../../..]
[rhc001:200408] MCW rank 7 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../BB/BB/../../../..]
[rhc001:200408] MCW rank 8 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]: [../../../../../../../../BB/BB/../..][../../../../../../../../../../../..]
[rhc001:200408] MCW rank 9 bound to socket 1[core 20[hwt 0-1]], socket 1[core 21[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../../../BB/BB/../..]
[rhc001:200408] MCW rank 10 bound to socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]]: [../../../../../../../../../../BB/BB][../../../../../../../../../../../..]
[rhc001:200408] MCW rank 11 bound to socket 1[core 22[hwt 0-1]], socket 1[core 23[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../../../../../BB/BB]
[rhc001:200408] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../..][../../../../../../../../../../../..]
rhc001
rhc001
rhc001
rhc001
rhc001
rhc001
rhc001
rhc001
rhc001
rhc001
rhc001
rhc001
$

I know that isn’t the pattern you are seeking - will have to ponder that one a bit. Is it possible that mpirun is not sitting on the same topology as your compute nodes?
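
One way to check that (just a sketch, assuming hwloc's lstopo is
installed and that qrsh drops you onto one of the compute nodes):

$ lstopo-no-graphics          # on the node where mpirun is launched
$ qrsh lstopo-no-graphics     # on an SGE compute node

If the two summaries differ, that would point to a topology mismatch.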
Dave Love
2016-10-11 15:16:05 UTC
Post by Wirawan Purwanto
Instead of the scenario above, I was trying to get the MPI processes
side-by-side (more like "fill_up" policy in SGE scheduler), i.e. fill
node 0 first, then fill node 1, and so on. How do I do this properly?
$ export OMP_NUM_THREADS=2
$ mpirun -np 16 -map-by core:PE=2 ./EXECUTABLE
...
Post by Wirawan Purwanto
Clearly I am not understanding how this map-by works. Could somebody
https://github.com/open-mpi/ompi/wiki/ProcessPlacement
but unfortunately it is also not clear to me.
Me neither; this stuff has traditionally been quite unclear and really
needs documenting/explaining properly.

This sort of thing from my local instructions for OMPI 1.8 probably does
what you want for OMP_NUM_THREADS=2 (where the qrsh options just get me
a couple of small nodes):

$ qrsh -pe mpi 24 -l num_proc=12 \
mpirun -n 12 --map-by slot:PE=2 --bind-to core --report-bindings true |&
sort -k 4 -n
[comp544:03093] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
[comp544:03093] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
[comp544:03093] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
[comp544:03093] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
[comp544:03093] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././.][././B/B/./.]
[comp544:03093] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././././B/B]
[comp527:03056] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
[comp527:03056] MCW rank 7 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
[comp527:03056] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
[comp527:03056] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
[comp527:03056] MCW rank 10 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././.][././B/B/./.]
[comp527:03056] MCW rank 11 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././././B/B]

I don't remember how I found that out.
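
Translated to your 32-core nodes, I would guess the equivalent is
something like this (untested on your setup):

$ mpirun -np 16 --map-by slot:PE=2 --bind-to core ./EXECUTABLE

i.e. 2 cores per rank, filling each node before moving on to the next.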
r***@open-mpi.org
2016-10-28 23:17:28 UTC
FWIW: I’ll be presenting “Mapping, Ranking, and Binding - Oh My!” at the OMPI BoF meeting at SC’16, for those who can attend