Discussion: [OMPI users] No more default core binding since 2.0.2?
Reuti
2017-04-09 10:40:29 UTC
Hi,

While I noticed automatic core binding in Open MPI 1.8 (which in a shared cluster may lead to oversubscription of cores), I can't spot it any longer in the 2.x series. So the question arises:

- Was this a general decision to no longer enable automatic core binding?

First I thought it might be because of:

- We define plm_rsh_agent=foo in $OMPI_ROOT/etc/openmpi-mca-params.conf
- We compiled with --with-sge

But even when started on the command line, with `ssh` to reach the nodes, there seems to be no automatic core binding taking place any longer.
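
For reference, this is how I check whether any binding is applied at all (just a sketch – `./my_app` is a placeholder, and the exact --report-bindings output format differs between versions):

    # ask mpiexec to print the binding it applied to each rank
    mpiexec -np 4 --report-bindings ./my_app

    # or check the CPU mask the kernel reports for one running rank
    grep Cpus_allowed_list /proc/<pid of one rank>/status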

-- Reuti
r***@open-mpi.org
2017-04-09 14:35:09 UTC
There has been no change in the policy. However, if you are oversubscribed, we did fix a bug to ensure that we don’t auto-bind in that situation.

Can you pass along your command line? So far as I can tell, it still seems to be working.
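
Something along these lines is what I mean (just a sketch – `./my_app` and the host names are placeholders):

    # inside an SGE job, relying on the --with-sge support to pick up the allocation
    mpiexec -np 8 --report-bindings ./my_app

    # started by hand, reaching the nodes over ssh
    mpiexec -np 8 -host node01,node02 --report-bindings ./my_app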
Reuti
2017-04-09 20:49:57 UTC
Hi,
Post by r***@open-mpi.org
There has been no change in the policy. However, if you are oversubscribed, we did fix a bug to ensure that we don’t auto-bind in that situation.
Can you pass along your command line? So far as I can tell, it still seems to be working.
I'm not sure whether it was the case with 1.8, but according to the man page it now binds to sockets when the number of processes is > 2. This can lead to the effect that one sometimes notices a drop in performance when other jobs happen to be running on just that socket.

So this part is solved – I wasn't aware of the binding by socket.

But I can't see binding by core for a number of processes <= 2. Does that mean 2 per node, or 2 overall for the `mpiexec`?
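
To make the two cases concrete, this is what I am comparing (a sketch – `./my_app` is a placeholder, run on dual-socket nodes):

    # 2 processes in total: I expected binding to core by default
    mpiexec -np 2 --report-bindings ./my_app

    # more than 2 processes in total: bound to socket by default
    mpiexec -np 4 --report-bindings ./my_app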

-- Reuti
r***@open-mpi.org
2017-04-09 21:09:59 UTC
Post by Reuti
But I can't see binding by core for a number of processes <= 2. Does that mean 2 per node, or 2 overall for the `mpiexec`?
It’s 2 processes overall
Reuti
2017-04-09 22:45:47 UTC
Post by r***@open-mpi.org
Post by Reuti
But I can't see binding by core for a number of processes <= 2. Does that mean 2 per node, or 2 overall for the `mpiexec`?
It’s 2 processes overall
With a round-robin allocation in the cluster, this might not be what was intended (binding only one or two cores per exec host)?

Apparently the default changes (from --bind-to core to --bind-to socket) depending on whether I compiled Open MPI with or without libnuma (I only wanted to get rid of the warning in the output – now it works). But I could also use "--bind-to core" without libnuma and it worked; I just got the additional warning that the memory couldn't be bound.
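
One way to take the default (and its dependence on how Open MPI was built) out of the picture would be to request the policy explicitly – a sketch, assuming hwloc_base_binding_policy is still the matching MCA parameter name in 2.x:

    # per job, on the command line
    mpiexec -np 4 --bind-to core --report-bindings ./my_app

    # or site-wide in $OMPI_ROOT/etc/openmpi-mca-params.conf
    hwloc_base_binding_policy = core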

BTW: I always had to add -ldl when linking with `mpicc`. Now that I compiled in libnuma, this necessity is gone.

-- Reuti
r***@open-mpi.org
2017-04-09 23:58:40 UTC
Let me try to clarify. If you launch a job that has only 1 or 2 processes in it (total), then we bind to core by default. This is done because a job that small is almost always some kind of benchmark.

If there are more than 2 processes in the job (total), then we default to binding to NUMA (if NUMA domains are present – otherwise, to socket) across the entire job.

You can always override these behaviors.
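
For example (a sketch – `./my_app` is a placeholder):

    # force core binding even for a larger job
    mpiexec -np 16 --bind-to core --report-bindings ./my_app

    # or disable binding entirely, e.g. on a node shared with other jobs
    mpiexec -np 16 --bind-to none ./my_app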
Reuti
2017-04-10 08:37:30 UTC
Post by r***@open-mpi.org
Let me try to clarify. If you launch a job that has only 1 or 2 processes in it (total), then we bind to core by default. This is done because a job that small is almost always some kind of benchmark.
Yes, I see. But only if libnuma was compiled in AFAICS.
Post by r***@open-mpi.org
If there are more than 2 processes in the job (total), then we default to binding to NUMA (if NUMA domains are present – otherwise, to socket) across the entire job.
Mmh – can I spot a difference in --report-bindings between these two? To me, both look like being bound to socket.
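
Side note: one way to check whether that is the case on these nodes (a sketch, assuming the hwloc and numactl command-line tools are installed):

    # show the topology; if each socket is exactly one NUMA node,
    # socket binding and NUMA binding cover the same cores
    lstopo-no-graphics

    # alternatively
    numactl --hardware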

-- Reuti
r***@open-mpi.org
2017-04-10 15:27:34 UTC
Post by Reuti
Mmh – can I spot a difference in --report-bindings between these two? To me, both look like being bound to socket.
You won’t see a difference if the NUMA and socket are identical in terms of the cores they cover.
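
One way to see this directly on a machine where they do differ is to request the two policies explicitly and compare the reported masks (a sketch – `./my_app` is a placeholder):

    mpiexec -np 4 --bind-to socket --report-bindings ./my_app
    mpiexec -np 4 --bind-to numa --report-bindings ./my_app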
Reuti
2017-04-10 15:47:41 UTC
Post by r***@open-mpi.org
You won’t see a difference if the NUMA and socket are identical in terms of the cores they cover.
Ok, thx.
Reuti
2017-04-10 12:43:57 UTC
Post by Reuti
[…] BTW: I always had to add -ldl when linking with `mpicc`. Now that I compiled in libnuma, this necessity is gone.
Looks like I compiled too many versions in the last couple of days. The -ldl is necessary in case --disable-shared --enable-static was given, to get a plain static build.
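
For the record, the combination I mean (a sketch – the prefix and source file name are placeholders):

    # plain static build of Open MPI
    ./configure --prefix=$OMPI_ROOT --with-sge --disable-shared --enable-static
    make && make install

    # linking an application then needs -ldl explicitly
    mpicc -o my_app my_app.c -ldl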

-- Reuti