[OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

If you want to keep long-waiting MPI processes from clogging your CPU
pipeline and heating up your machines, you can turn blocking MPI
collectives into nicer ones by implementing them in terms of MPI-3
nonblocking collectives using something like the following.

I typed this code straight into this email, so you should validate it
carefully.

Jeff

#ifdef HAVE_UNISTD_H
#include #include <unistd.h>
const int myshortdelay = 1; /* microseconds */
const int mylongdelay = 1; /* seconds */
#else
#define USE_USLEEP 0
#define USE_SLEEP 0
#endif

#ifdef HAVE_SCHED_H
#include <sched.h>
#else
#define USE_YIELD 0
#endif

int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root,
MPI_Comm comm )
{
MPI_Request request;
{
int rc = PMPI_Ibcast(buffer, count, datatype, root, comm, &request);
if (rc!=MPI_SUCCESS) return rc;
}
int flag = 0;
while (!flag)
{
int rc = PMPI_Test(&request, &flag, MPI_STATUS_IGNORE)
if (rc!=MPI_SUCCESS) return rc;

/* pick one of these... */
#if USE_YIELD
sched_yield();
#elif USE_USLEEP
usleep(myshortdelay);
#elif USE_SLEEP
sleep(mylongdelay);
#elif USE_CPU_RELAX
cpu_relax(); /*
http://linux-kernel.2935.n7.nabble.com/x86-cpu-relax-why-nop-vs-pause-td398656.html
*/
#else
#warning Hard polling may not be the best idea...
#endif
}
return MPI_SUCCESS;
}

Post by MM
I've got 3 boxes at home, a laptop and 2 other quadcore nodes . When the

CPU is at 100% for a long time, the fans make quite some noise:-)

Post by MM
The laptop runs the UI, and the 2 other boxes are the compute nodes.
The user triggers compute tasks at random times... In between those times

when no parallelized compute is done, the user does analysis, looks at data
and so on.

Post by MM
This does not involve any MPI compute.
At that point, the nodes are blocked in a mpi_broadcast with each of the

4 processes on each of the nodes polling at 100%, triggering the cpu fan:-)

Post by MM
homogeneous openmpi 1.10.3 linux 4.7.5
Nowadays, are there any more options than the yield_when_idle mentioned

in that initial thread?

Post by MM
The model I have used for so far is really a master/slave model where the

master sends the jobs (which take substantially longer than the MPI
communication itself), so in this model I would want the mpi nodes to be
really idle and i can sacrifice the latency while there's nothing to do.

Post by MM
if there are no other options, is it possible to somehow start all the

processes outside of the mpi world, then only start the mpi framework once
it's needed?

Post by MM
Regards,
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/

Dave Love

2016-11-07 16:54:34 UTC

[Some time ago]

Post by Jeff Hammond
If you want to keep long-waiting MPI processes from clogging your CPU
pipeline and heating up your machines, you can turn blocking MPI
collectives into nicer ones by implementing them in terms of MPI-3
nonblocking collectives using something like the following.

I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
RHEL 6 or 7, without specific tuning, on recent Intel). It doesn't look
like something you want in paths that should be low latency, but maybe
there's something you can do to improve that? (sched_yield takes <1μs.)

Post by Jeff Hammond
I typed this code straight into this email, so you should validate it
carefully.

...

Post by Jeff Hammond
#elif USE_CPU_RELAX
cpu_relax(); /*
http://linux-kernel.2935.n7.nabble.com/x86-cpu-relax-why-nop-vs-pause-td398656.html
*/

Is cpu_relax available to userland? (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)

Jeff Hammond

2016-11-07 22:29:37 UTC

Post by Dave Love
[Some time ago]

I see sleeping for â0sâ typically taking â³50ÎŒs on Linux (measured on
RHEL 6 or 7, without specific tuning, on recent Intel). It doesn't look
like something you want in paths that should be low latency, but maybe
there's something you can do to improve that? (sched_yield takes <1ÎŒs.)

I demonstrated a bunch of different implementations with the instruction to
"pick one of these...", where establishing the relationship between
implementation and performance was left as an exercise for the reader :-)
If latency is of the utmost importance to you, you should use the pause
instruction, but this will of course keep the hardware thread running.

Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
It's not possible to use this from outside of MPI because the memory
changed when the ibcast completes locally may not be visible to the user,
but it would allow blocking MPI calls to park hardware threads.

Post by Jeff Hammond
I typed this code straight into this email, so you should validate it
carefully.

...

Post by Jeff Hammond
#elif USE_CPU_RELAX
cpu_relax(); /*

http://linux-kernel.2935.n7.nabble.com/x86-cpu-relax-why-nop-vs-pause-td398656.html

Post by Jeff Hammond
*/

Is cpu_relax available to userland? (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)

The pause instruction is available in ring3. Just use that if cpu_relax
wrapper is not implemented.

Jeff

--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/

Dave Love

2016-11-09 16:38:47 UTC

Post by Dave Love
I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
RHEL 6 or 7, without specific tuning, on recent Intel). It doesn't look
like something you want in paths that should be low latency, but maybe
there's something you can do to improve that? (sched_yield takes <1μs.)

The point was that only the one seemed available on RHEL6 to this
exercised reader. No complaints about the useful list of possibilities.

Post by Jeff Hammond
Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.

Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?

Post by Dave Love
Is cpu_relax available to userland? (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)

The pause instruction is available in ring3. Just use that if cpu_relax
wrapper is not implemented.

[OK; I meant in a userland library.]

Are there published measurements of the typical effects of spinning and
ameliorations on some sort of "representative" system?

Jeff Hammond

2016-11-28 16:30:16 UTC

Post by Jeff Hammond
Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-

xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?

Why wouldn't it be useful in the context of a parallel runtime system like
MPI? MPI implementations take advantage of all sorts of stuff that needs
to be queried with configuration, during compilation or at runtime.

TSX requires that one check the CPUID bits for it, and plenty of folks are
happily using MSRs (e.g.
http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).

Post by Dave Love
Is cpu_relax available to userland? (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)

The pause instruction is available in ring3. Just use that if cpu_relax
wrapper is not implemented.

[OK; I meant in a userland library.]
Are there published measurements of the typical effects of spinning and
ameliorations on some sort of "representative" system?

None that are published, unfortunately.

Best,

Jeff

--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/

Dave Love

2016-12-08 14:25:12 UTC

Post by Jeff Hammond
Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-

xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?

I probably should have said "useful in practice". The difference from
other things I can think of is that access to MSRs is privileged, and
it's not clear to me what the implications are of changing it or to what
extent you can assume people will.

Post by Jeff Hammond
TSX requires that one check the CPUID bits for it, and plenty of folks are
happily using MSRs (e.g.
http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).

Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.

Andreas Schäfer

2016-12-09 23:12:15 UTC

Post by Jeff Hammond
Note that MPI implementations may be interested in taking advantage of
https://software.intel.com/en-us/blogs/2016/10/06/intel-

xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?

Post by Jeff Hammond
TSX requires that one check the CPUID bits for it, and plenty of folks are
happily using MSRs (e.g.
http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).

Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.

LIKWID[1] uses a daemon to provide limited RW access to MSRs for
applications. I wouldn't wonder if support for this was added to
LIKWID by RRZE.

Cheers
-Andreas

[1] https://github.com/RRZE-HPC/likwid
--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!

Dave Love

2016-12-12 14:24:16 UTC

Post by Dave Love
Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.

LIKWID[1] uses a daemon to provide limited RW access to MSRs for
applications. I wouldn't wonder if support for this was added to
LIKWID by RRZE.

Yes, that's one of the N I had in mind; others provide Linux modules.

From a system manager's point of view it's not clear what are the
implications of the unprivileged access, or even how much it really
helps. I've seen enough setups suggested for HPC systems in areas I
understand (and used by vendors) which allow privilege escalation more
or less trivially, maybe without any real operational advantage. If
it's clearly safe and helpful then great, but I couldn't assess that.

Andreas Schäfer

2016-12-14 07:00:28 UTC

Post by Dave Love
Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.

LIKWID[1] uses a daemon to provide limited RW access to MSRs for
applications. I wouldn't wonder if support for this was added to
LIKWID by RRZE.

Yes, that's one of the N I had in mind; others provide Linux modules.
From a system manager's point of view it's not clear what are the
implications of the unprivileged access, or even how much it really
helps. I've seen enough setups suggested for HPC systems in areas I
understand (and used by vendors) which allow privilege escalation more
or less trivially, maybe without any real operational advantage. If
it's clearly safe and helpful then great, but I couldn't assess that.

I think LIKWID's access daemon is specifically designed to provide a
safe way of giving limited access to MSRs. I'm cc'ing Thomas Röhl as
he knows more about this.

Cheers
-Andreas
--
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!

Thomas Röhl

2016-12-15 11:27:23 UTC