Discussion:
[OMPI users] question about "--rank-by slot" behavior
David Shrader
2016-11-30 19:26:44 UTC
Hello All,

The man page for mpirun says that the default ranking procedure is
round-robin by slot. It doesn't seem to be that straightforward to me,
though, so I wanted to ask about the behavior.

To help illustrate my confusion, here are a few examples where the
ranking behavior changed based on the mapping policy, which doesn't
yet make sense to me. First, here is a simple map-by-core run (using 4
nodes with 32 CPU cores each):

$> mpirun -n 128 --map-by core --report-bindings true
[gr0649.localdomain:119614] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119614] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
[./B/./././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119614] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
[././B/././././././././././././././.][./././././././././././././././././.]
...output snipped...

Things look as I would expect: ranking happens round-robin through the
CPU cores. Now, here is a map-by-socket example:

$> mpirun -n 128 --map-by socket --report-bindings true
[gr0649.localdomain:119926] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119926] MCW rank 1 bound to socket 1[core 18[hwt 0]]:
[./././././././././././././././././.][B/././././././././././././././././.]
[gr0649.localdomain:119926] MCW rank 2 bound to socket 0[core 1[hwt 0]]:
[./B/./././././././././././././././.][./././././././././././././././././.]
...output snipped...

Why is rank 1 on a different socket? I know I am mapping by socket in
this example, but, fundamentally, nothing should be different in
terms of ranking, correct? The same number of processes ends up on
each host as in the first example, and in the same locations.
How is "slot" different in this case? If I use "--rank-by core", I
recover the output from the first example.

I thought that maybe "--rank-by slot" might be following something laid
down by "--map-by", but the following example shows that isn't
completely correct, either:

$> mpirun -n 128 --map-by socket:span --report-bindings true
[gr0649.localdomain:119319] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././././././././././././.][./././././././././././././././././.]
[gr0649.localdomain:119319] MCW rank 1 bound to socket 1[core 18[hwt 0]]:
[./././././././././././././././././.][B/././././././././././././././././.]
[gr0649.localdomain:119319] MCW rank 2 bound to socket 0[core 1[hwt 0]]:
[./B/./././././././././././././././.][./././././././././././././././././.]
...output snipped...

If ranking by slot were somehow following something left over by
mapping, I would have expected rank 2 to end up on a different host. So,
now I don't know what to expect from using "--rank-by slot." Does anyone
have any pointers?

Thank you for the help!
David
--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov
r***@open-mpi.org
2016-11-30 20:23:19 UTC
I think you have confused “slot” with a physical “core”. The two have absolutely nothing to do with each other.

A “slot” is nothing more than a scheduling entry in which a process can be placed. So when you use --rank-by slot, the ranks are assigned round-robin by scheduler entry - i.e., we assign all the ranks on the first node, then all the ranks on the next node, etc.

It doesn’t matter where those ranks are placed, or what core or socket they are running on. We just blindly go through and assign numbers.

If you rank by core, we cycle across the procs by looking at the core number they are bound to, assigning all the procs on a node before moving to the next node. If you rank by socket, we cycle across the procs on a node round-robin by socket, again assigning all procs on the node before moving to the next node. If you then add “span” to that directive, we round-robin by socket across all nodes before circling back to the next proc on each node.
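
To make this concrete, here is a toy sketch in plain Python (a made-up 2-socket node with 8 cores, purely for illustration - this is not ORTE's actual code). Given the same per-node mapping order, ranking by slot just numbers the procs in that order, while ranking by core renumbers them by the core they are bound to:

# Toy illustration only - not ORTE code. The list holds the core each proc
# is bound to, in the order the mapper placed the procs on the node
# (here: the alternating-socket order produced by mapping by socket,
# with cores 0-3 on socket 0 and cores 4-7 on socket 1).
mapped_order = [0, 4, 1, 5, 2, 6, 3, 7]

# rank by slot: number the procs in the order they were mapped
ranks_by_slot = {rank: core for rank, core in enumerate(mapped_order)}

# rank by core: number the procs by the core they are bound to
ranks_by_core = {rank: core for rank, core in enumerate(sorted(mapped_order))}

print(ranks_by_slot)   # {0: 0, 1: 4, 2: 1, ...}  rank 1 lands on socket 1
print(ranks_by_core)   # {0: 0, 1: 1, 2: 2, ...}  same ordering as a map-by core run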

HTH
Ralph
David Shrader
2016-11-30 22:16:59 UTC
Hello Ralph,

I do understand that "slot" is an abstract term and isn't tied to
any particular piece of hardware. What I am trying to understand is how
"slot" came to be equivalent to "socket" in my second and third examples,
but "core" in my first example. As far as I can tell, the MPI ranks should
have been assigned the same way in all three examples. Why weren't they?

You mentioned that, when using "--rank-by slot", the ranks are assigned
round-robin by scheduler entry; does this mean that the scheduler
entries change based on the mapping algorithm (the only thing I changed
in my examples) and this results in ranks being assigned differently?

Thanks again,
David
--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov
r***@open-mpi.org
2016-11-30 23:46:14 UTC
“slot” never became equivalent to “socket”, or to “core”. Here is what happened:

* Your first example: the mapper assigns the first process to the first core of the first node, because there is a free core there and you said to map by core. It goes on to assign the second process to the second core, the third process to the third core, etc., until we reach the defined #procs for that node (i.e., the number of assigned “slots” for that node). When it goes to rank the procs, the ranker starts with the first process assigned to the first node - this process occupies the first “slot”, and so it gets rank 0. The ranker then assigns rank 1 to the second process it assigned to the first node, as that process occupies the second “slot”. Etc.

* Your second example: the mapper assigns the first process to the first socket of the first node, the second process to the second socket of the first node, the third process back to the first socket of the first node, and so on until all the “slots” for that node have been filled. The ranker then starts with the first process that was assigned to the first node and gives it rank 0. The ranker then assigns rank 1 to the second process that was assigned to the node - that would be the first proc mapped to the second socket. The ranker then assigns rank 2 to the third proc assigned to the node - that would be the second proc assigned to the first socket.

* Your third example: the mapper assigns the first process to the first socket of the first node, the second process to the second socket of the first node, and the third process to the first socket of the second node, continuing around until all procs have been mapped. The ranker then starts with the first proc assigned to the first node and gives it rank 0. The ranker then assigns rank 1 to the second process assigned to the first node (because we are ranking by slot!), which corresponds to the first proc mapped to the second socket. The ranker then assigns rank 2 to the third process assigned to the first node, which corresponds to the second proc mapped to the first socket of that node.

So you can see that the ranker applies the same rule in all three cases: it numbers the procs on each node in the order the mapper placed them there, even though the mapping itself was done with a different algorithm each time.
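
If it helps to see the first two cases side by side, here is a toy model in plain Python (a made-up machine of 2 nodes x 2 sockets x 4 cores per socket - just an illustration of the ordering, not ORTE's code). Each mapper builds a per-node list of procs; the slot ranker then numbers each node's list in the order it was built:

# Toy model only - not ORTE code. Made-up machine: 2 nodes, 2 sockets/node,
# 4 cores/socket, one slot per core.
NODES, SOCKETS, CORES = 2, 2, 4

def map_by_core(nprocs):
    """Fill each node core by core before moving to the next node."""
    order = {n: [] for n in range(NODES)}
    for p in range(nprocs):
        node, core = divmod(p, SOCKETS * CORES)
        order[node].append((core // CORES, core))          # (socket, core)
    return order

def map_by_socket(nprocs):
    """Round-robin the sockets of a node, filling the node before moving on."""
    order = {n: [] for n in range(NODES)}
    for p in range(nprocs):
        node, i = divmod(p, SOCKETS * CORES)
        socket, k = i % SOCKETS, i // SOCKETS              # k-th proc on that socket
        order[node].append((socket, socket * CORES + k))
    return order

def rank_by_slot(order):
    """Number the procs node by node, in the order the mapper placed them."""
    rank = 0
    for node in sorted(order):
        for socket, core in order[node]:
            print(f"rank {rank}: node {node}, socket {socket}, core {core}")
            rank += 1

rank_by_slot(map_by_core(16))     # rank 1 -> node 0, socket 0, core 1
rank_by_slot(map_by_socket(16))   # rank 1 -> node 0, socket 1, core 4

The second call prints rank 0 on socket 0 core 0, rank 1 on socket 1 core 4, and rank 2 on socket 0 core 1 - the same pattern as your --map-by socket output, with the toy core numbers standing in for 0, 18, and 1.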

HTH
Ralph
David Shrader
2016-12-01 00:15:18 UTC
Thank you for the explanation! I understand what is going on now: there
is a process list for each node whose order depends on the mapping
policy, and the ranker, when using "slot", walks through that list.
Makes sense.
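
For my own future reference, here is that same model applied to the socket:span case in plain Python (a made-up 2-node x 2-socket x 4-core machine, not Open MPI's code), which shows why rank 2 still lands on the first node even though the mapper hops across nodes:

# Toy sketch only - not Open MPI code. Made-up machine: 2 nodes, 2 sockets/node,
# 4 cores/socket. With map-by socket:span the mapper round-robins the sockets of
# *all* nodes, but each node's own list is still ordered socket 0, socket 1, ...
NODES, SOCKETS, CORES = 2, 2, 4
TOTAL_SOCKETS = NODES * SOCKETS
per_node_order = {n: [] for n in range(NODES)}

for p in range(NODES * SOCKETS * CORES):
    g = p % TOTAL_SOCKETS              # which (node, socket) pair, cycling across all nodes
    node, socket = divmod(g, SOCKETS)
    k = p // TOTAL_SOCKETS             # k-th proc landing on that socket
    per_node_order[node].append((socket, socket * CORES + k))

rank = 0
for node in sorted(per_node_order):    # slot ranker: finish node 0 before node 1
    for socket, core in per_node_order[node]:
        print(f"rank {rank}: node {node}, socket {socket}, core {core}")
        rank += 1
# first lines printed:
#   rank 0: node 0, socket 0, core 0
#   rank 1: node 0, socket 1, core 4
#   rank 2: node 0, socket 0, core 1   <- still on the first node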

Thank you again!
David
--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov