Discussion:
[OMPI users] How to yield CPU more when not computing (was: curious behavior during wait for broadcast: 100% cpu)
MM
2016-10-16 09:24:56 UTC
I would like to see if there are any updates re this thread back from 2010:

https://mail-archive.com/***@lists.open-mpi.org/msg15154.html

I've got 3 boxes at home: a laptop and 2 other quad-core nodes. When the
CPU is at 100% for a long time, the fans make quite some noise :-)

The laptop runs the UI, and the 2 other boxes are the compute nodes.
The user triggers compute tasks at random times... In between those times,
when no parallelized compute is done, the user does analysis, looks at data
and so on.
This does not involve any MPI compute.
At that point, the nodes are blocked in an MPI_Bcast, with each of the 4
processes on each of the nodes polling at 100% and triggering the CPU fan :-)

Homogeneous setup: Open MPI 1.10.3, Linux 4.7.5.

Nowadays, are there any more options than the yield_when_idle mentioned in
that initial thread?

The model I have used so far is really a master/slave model where the
master sends the jobs (which take substantially longer than the MPI
communication itself), so in this model I would want the MPI nodes to be
really idle, and I can sacrifice latency while there's nothing to do.
If there are no other options, is it possible to somehow start all the
processes outside of the MPI world, and only start the MPI framework once
it's needed?

Regards,
Jeff Hammond
2016-10-16 18:03:42 UTC
If you want to keep long-waiting MPI processes from clogging your CPU
pipeline and heating up your machines, you can turn blocking MPI
collectives into nicer ones by implementing them in terms of MPI-3
nonblocking collectives using something like the following.

I typed this code straight into this email, so you should validate it
carefully.

Jeff

#include <mpi.h>

#ifdef HAVE_UNISTD_H
#include <unistd.h>
static const int myshortdelay = 1; /* microseconds */
static const int mylongdelay = 1;  /* seconds */
#else
#define USE_USLEEP 0
#define USE_SLEEP 0
#endif

#ifdef HAVE_SCHED_H
#include <sched.h>
#else
#define USE_YIELD 0
#endif

/* Interpose on MPI_Bcast: start the broadcast with the nonblocking
   PMPI call, then poll for completion without hogging the CPU. */
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root,
              MPI_Comm comm)
{
  MPI_Request request;
  {
    int rc = PMPI_Ibcast(buffer, count, datatype, root, comm, &request);
    if (rc != MPI_SUCCESS) return rc;
  }
  int flag = 0;
  while (!flag)
  {
    int rc = PMPI_Test(&request, &flag, MPI_STATUS_IGNORE);
    if (rc != MPI_SUCCESS) return rc;

    /* pick one of these... */
#if USE_YIELD
    sched_yield();
#elif USE_USLEEP
    usleep(myshortdelay);
#elif USE_SLEEP
    sleep(mylongdelay);
#elif USE_CPU_RELAX
    cpu_relax(); /*
http://linux-kernel.2935.n7.nabble.com/x86-cpu-relax-why-nop-vs-pause-td398656.html
                  */
#else
#warning Hard polling may not be the best idea...
#endif
  }
  return MPI_SUCCESS;
}
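
(To actually use a wrapper like this: one option, assuming a dynamically
linked application, is to build it as its own shared library and either link
it ahead of the MPI library or LD_PRELOAD it, so that calls to MPI_Bcast
resolve to the wrapper while the wrapper reaches the real implementation
through the PMPI profiling interface. HAVE_UNISTD_H, HAVE_SCHED_H and the
USE_* switches above are placeholders you would define at compile time,
e.g. -DHAVE_UNISTD_H -DUSE_USLEEP=1.)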
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Dave Love
2016-11-07 16:54:34 UTC
[Some time ago]
Post by Jeff Hammond
If you want to keep long-waiting MPI processes from clogging your CPU
pipeline and heating up your machines, you can turn blocking MPI
collectives into nicer ones by implementing them in terms of MPI-3
nonblocking collectives using something like the following.
I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
RHEL 6 or 7, without specific tuning, on recent Intel). It doesn't look
like something you want in paths that should be low latency, but maybe
there's something you can do to improve that? (sched_yield takes <1μs.)
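
For reference, the kind of toy measurement I mean is nothing more than
timing the calls in a loop; a minimal sketch (mine, nothing rigorous;
older glibc may need -lrt for clock_gettime):

/* Toy measurement: average cost per call of usleep(0) vs. sched_yield().
   Numbers depend on the kernel, timer configuration and CPU, so treat
   them as indicative only. */
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

int main(void)
{
  const int iters = 100000;
  double t0, t1;
  int i;

  t0 = now_sec();
  for (i = 0; i < iters; i++) usleep(0);
  t1 = now_sec();
  printf("usleep(0):     %.2f us/call\n", 1e6 * (t1 - t0) / iters);

  t0 = now_sec();
  for (i = 0; i < iters; i++) sched_yield();
  t1 = now_sec();
  printf("sched_yield(): %.2f us/call\n", 1e6 * (t1 - t0) / iters);
  return 0;
}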
Post by Jeff Hammond
I typed this code straight into this email, so you should validate it
carefully.
...
Post by Jeff Hammond
#elif USE_CPU_RELAX
cpu_relax(); /*
http://linux-kernel.2935.n7.nabble.com/x86-cpu-relax-why-nop-vs-pause-td398656.html
*/
Is cpu_relax available to userland? (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)
Jeff Hammond
2016-11-07 22:29:37 UTC
Post by Dave Love
[Some time ago]
Post by Jeff Hammond
If you want to keep long-waiting MPI processes from clogging your CPU
pipeline and heating up your machines, you can turn blocking MPI
collectives into nicer ones by implementing them in terms of MPI-3
nonblocking collectives using something like the following.
I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
RHEL 6 or 7, without specific tuning, on recent Intel). It doesn't look
like something you want in paths that should be low latency, but maybe
there's something you can do to improve that? (sched_yield takes <1μs.)
I demonstrated a bunch of different implementations with the instruction to
"pick one of these...", where establishing the relationship between
implementation and performance was left as an exercise for the reader :-)
If latency is of the utmost importance to you, you should use the pause
instruction, but this will of course keep the hardware thread running.

Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
It's not possible to use this from outside of MPI, because the memory that
changes when the ibcast completes locally may not be visible to the user,
but it would allow blocking MPI calls to park hardware threads.
Post by Dave Love
Post by Jeff Hammond
I typed this code straight into this email, so you should validate it
carefully.
...
Post by Jeff Hammond
#elif USE_CPU_RELAX
cpu_relax(); /*
http://linux-kernel.2935.n7.nabble.com/x86-cpu-relax-why-nop-vs-pause-td398656.html
Post by Dave Love
Post by Jeff Hammond
*/
Is cpu_relax available to userland? (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)
The pause instruction is available in ring 3. Just use that if a cpu_relax
wrapper is not implemented.
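
For illustration, a userland stand-in is tiny; a sketch with GCC-style
inline asm (the name is mine), noting that "pause" decodes as "rep; nop"
and is therefore harmless on older x86 parts. GCC also exposes the same
thing as _mm_pause() in its SSE intrinsics headers.

/* Userland substitute for the kernel's cpu_relax(): issue the x86
   "pause" spin-loop hint; elsewhere, fall back to a compiler barrier. */
#if defined(__x86_64__) || defined(__i386__)
static inline void my_cpu_relax(void)
{
  __asm__ __volatile__ ("pause" ::: "memory");
}
#else
static inline void my_cpu_relax(void)
{
  __asm__ __volatile__ ("" ::: "memory");
}
#endif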

Jeff

--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Dave Love
2016-11-09 16:38:47 UTC
Post by Jeff Hammond
Post by Dave Love
I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
RHEL 6 or 7, without specific tuning, on recent Intel). It doesn't look
like something you want in paths that should be low latency, but maybe
there's something you can do to improve that? (sched_yield takes <1μs.)
I demonstrated a bunch of different implementations with the instruction to
"pick one of these...", where establishing the relationship between
implementation and performance was left as an exercise for the reader :-)
The point was that only the one seemed available on RHEL6 to this
exercised reader. No complaints about the useful list of possibilities.
Post by Jeff Hammond
Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?
Post by Jeff Hammond
Post by Dave Love
Is cpu_relax available to userland? (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)
The pause instruction is available in ring3. Just use that if cpu_relax
wrapper is not implemented.
[OK; I meant in a userland library.]

Are there published measurements of the typical effects of spinning and
ameliorations on some sort of "representative" system?
Jeff Hammond
2016-11-28 16:30:16 UTC
Post by Dave Love
Post by Jeff Hammond
Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-
xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?
Why wouldn't it be useful in the context of a parallel runtime system like
MPI? MPI implementations take advantage of all sorts of stuff that needs
to be queried with configuration, during compilation or at runtime.

TSX requires that one check the CPUID bits for it, and plenty of folks are
happily using MSRs (e.g.
http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).
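
(The CPUID side really is trivial; here's a sketch using GCC's <cpuid.h>,
checking leaf 7, subleaf 0: EBX bit 11 is RTM and bit 4 is HLE.)

/* Sketch: detect TSX support via CPUID leaf 7, subleaf 0. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid_max(0, NULL) < 7) {
    printf("CPUID leaf 7 not supported\n");
    return 1;
  }
  __cpuid_count(7, 0, eax, ebx, ecx, edx);
  printf("RTM: %s, HLE: %s\n",
         (ebx & (1u << 11)) ? "yes" : "no",
         (ebx & (1u << 4))  ? "yes" : "no");
  return 0;
}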
Post by Dave Love
Post by Jeff Hammond
Post by Dave Love
Is cpu_relax available to userland? (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)
The pause instruction is available in ring3. Just use that if cpu_relax
wrapper is not implemented.
[OK; I meant in a userland library.]
Are there published measurements of the typical effects of spinning and
ameliorations on some sort of "representative" system?
None that are published, unfortunately.

Best,

Jeff
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Dave Love
2016-12-08 14:25:12 UTC
Post by Jeff Hammond
Post by Dave Love
Post by Jeff Hammond
Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-
xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?
Why wouldn't it be useful in the context of a parallel runtime system like
MPI? MPI implementations take advantage of all sorts of stuff that needs
to be queried with configuration, during compilation or at runtime.
I probably should have said "useful in practice". The difference from
other things I can think of is that access to MSRs is privileged, and
it's not clear to me what the implications are of changing it or to what
extent you can assume people will.
Post by Jeff Hammond
TSX requires that one check the CPUID bits for it, and plenty of folks are
happily using MSRs (e.g.
http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).
Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.
Andreas Schäfer
2016-12-09 23:12:15 UTC
Post by Dave Love
Post by Jeff Hammond
Post by Dave Love
Post by Jeff Hammond
Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-
xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?
Why wouldn't it be useful in the context of a parallel runtime system like
MPI? MPI implementations take advantage of all sorts of stuff that needs
to be queried with configuration, during compilation or at runtime.
I probably should have said "useful in practice". The difference from
other things I can think of is that access to MSRs is privileged, and
it's not clear to me what the implications are of changing it or to what
extent you can assume people will.
Post by Jeff Hammond
TSX requires that one check the CPUID bits for it, and plenty of folks are
happily using MSRs (e.g.
http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).
Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.
LIKWID[1] uses a daemon to provide limited RW access to MSRs for
applications. I wouldn't be surprised if support for this was added to
LIKWID by RRZE.

Cheers
-Andreas

[1] https://github.com/RRZE-HPC/likwid
--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!
Dave Love
2016-12-12 14:24:16 UTC
Post by Andreas Schäfer
Post by Dave Love
Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.
LIKWID[1] uses a daemon to provide limited RW access to MSRs for
applications. I wouldn't be surprised if support for this was added to
LIKWID by RRZE.
Yes, that's one of the N I had in mind; others provide Linux modules.

From a system manager's point of view it's not clear what are the
implications of the unprivileged access, or even how much it really
helps. I've seen enough setups suggested for HPC systems in areas I
understand (and used by vendors) which allow privilege escalation more
or less trivially, maybe without any real operational advantage. If
it's clearly safe and helpful then great, but I couldn't assess that.
Andreas Schäfer
2016-12-14 07:00:28 UTC
Post by Dave Love
Post by Andreas Schäfer
Post by Dave Love
Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.
LIKWID[1] uses a daemon to provide limited RW access to MSRs for
applications. I wouldn't be surprised if support for this was added to
LIKWID by RRZE.
Yes, that's one of the N I had in mind; others provide Linux modules.
From a system manager's point of view it's not clear what are the
implications of the unprivileged access, or even how much it really
helps. I've seen enough setups suggested for HPC systems in areas I
understand (and used by vendors) which allow privilege escalation more
or less trivially, maybe without any real operational advantage. If
it's clearly safe and helpful then great, but I couldn't assess that.
I think LIKWID's access daemon is specifically designed to provide a
safe way of giving limited access to MSRs. I'm cc'ing Thomas Röhl as
he knows more about this.

Cheers
-Andreas
--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!
Thomas Röhl
2016-12-15 11:27:23 UTC
Post by Andreas Schäfer
Post by Dave Love
Post by Andreas Schäfer
Post by Dave Love
Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.
LIKWID[1] uses a daemon to provide limited RW access to MSRs for
applications. I wouldn't be surprised if support for this was added to
LIKWID by RRZE.
Yes, that's one of the N I had in mind; others provide Linux modules.
From a system manager's point of view it's not clear what are the
implications of the unprivileged access, or even how much it really
helps. I've seen enough setups suggested for HPC systems in areas I
understand (and used by vendors) which allow privilege escalation more
or less trivially, maybe without any real operational advantage. If
it's clearly safe and helpful then great, but I couldn't assess that.
I think LIKWID's access daemon is specifically designed to provide a
safe way of giving limited access to MSRs. I'm cc'ing Thomas Röhl as
he knows more about this.
As Andreas stated, the access daemon was written to provide a rather safe
method for users to access the MSRs. It opens a UNIX socket to
communicate with the actual application. The lists of allowed registers
are compiled into the daemon, so no changes can be made from the
outside and users are limited to the allowed registers. The code was
checked by an IT security team and all recommendations were integrated,
but there are possibly other bugs (as in any other code).

If a user wants to dig deep into his/her code or control the behavior of
a machine, providing MSR access to users is really helpful. The
LIKWID suite contains some examples, like controlling CPU frequencies,
(de)activating various hardware prefetchers, or configuring the power
budget.

For system managers, user access to MSRs can be a real pain, because
all MSRs need to be checked before/after a user's job to hand the
system to the next user in a consistent state. Moreover, for both kernel
modules and a privilege-escalating daemon, there is commonly a reduction
in security that must be weighed against the possible advantages. My
experience is that system managers don't want to load third-party
kernel modules on their in-production systems (as long as there is no
big company behind them), but they also don't trust a suid-root daemon
like the one in LIKWID.

For a runtime management system like Open MPI, integrating libcap is
probably the safest way to access the MSRs. You don't need a daemon,
and the application keeps running with ordinary user privileges. The
handling of libcap can be somewhat annoying and was Linux-distribution
dependent at the time I checked (some distributions worked, some did not,
and some showed completely undefined behavior).
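
(For completeness: underneath all of these approaches the raw access is
typically just a pread() on the msr device files. A minimal sketch,
assuming the msr kernel module is loaded and the caller has read
permission on the device node, plus CAP_SYS_RAWIO on newer kernels; the
register 0x611, PKG_ENERGY_STATUS on many Intel parts, is only an example.)

/* Sketch: read one MSR on one CPU through /dev/cpu/<cpu>/msr.
   The pread() offset is the MSR number. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int read_msr(int cpu, uint32_t reg, uint64_t *value)
{
  char path[64];
  snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  ssize_t n = pread(fd, value, sizeof *value, reg); /* offset = MSR number */
  close(fd);
  return (n == (ssize_t)sizeof *value) ? 0 : -1;
}

int main(void)
{
  uint64_t v;
  if (read_msr(0, 0x611, &v) == 0)
    printf("MSR 0x611 on CPU 0: 0x%016" PRIx64 "\n", v);
  else
    perror("read_msr");
  return 0;
}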

Cheers,
Thomas
--
--
M.Sc. Thomas Roehl, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales RechenZentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel. +49 9131 85-20800
mailto:***@rrze.fau.de
http://www.hpc.rrze.uni-erlangen.de/