Discussion: [OMPI users] MPI_Reduce performance
Gabriele Fatigati
2010-09-08 09:21:49 UTC
Dear OpenMPI users,

I'm using Open MPI 1.3.3 on an Infiniband 4x interconnection network. My
parallel application makes intensive use of MPI_Reduce over a
communicator created with MPI_Comm_split.

I've noticed strange behaviour during execution. My code is instrumented
with Scalasca 1.3 to report subroutine execution times. The first run
measures the elapsed time with 128 processors (job_communicator is created
with MPI_Comm_split; in both runs it is composed of the same ranks as
MPI_COMM_WORLD):

MPI_Reduce(.....,job_communicator)

The elapsed time is 2671 sec.

The second run uses MPI_Barrier before MPI_Reduce:

MPI_Barrier(job_communicator..)
MPI_Reduce(.....,job_communicator)

The elapsed time of Barrier+Reduce is 2167 sec (about 8 minutes less).

So, in my opinion, it is better to put an MPI_Barrier before any MPI_Reduce
to mitigate the "asynchronous" behaviour of MPI_Reduce in Open MPI. I
suspect the same holds for other collective communications. Can someone
explain why MPI_Reduce shows this strange behaviour?

Thanks in advance.
--
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it Tel: +39 051 6171722

g.fatigati [AT] cineca.it
David Zhang
2010-09-08 16:44:29 UTC
Doing a Reduce without a Barrier first allows a process to call Reduce and
exit immediately, without waiting for the other processes to call Reduce.
Therefore, some processes can advance faster than others. I suspect the
2671-second result reflects the difference between the fastest and slowest
processes. Having the Barrier reduces the time difference because it
forces the faster processes to wait.
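A minimal sketch of how to separate the two effects, timing the wait and
the reduction independently (the payload and the printed report are
illustrative, not part of Gabriele's code):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      double val = (double)rank, sum = 0.0;

      double t0 = MPI_Wtime();
      MPI_Barrier(MPI_COMM_WORLD);        /* absorbs the arrival skew */
      double t1 = MPI_Wtime();
      MPI_Reduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      double t2 = MPI_Wtime();

      /* t1 - t0: time this rank waited for the slowest rank;
         t2 - t1: cost of the reduction once everyone has arrived */
      printf("rank %d: wait %.6f s, reduce %.6f s\n",
             rank, t1 - t0, t2 - t1);

      MPI_Finalize();
      return 0;
  }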
--
David Zhang
University of California, San Diego
Ashley Pittman
2010-09-08 17:50:44 UTC
Post by Gabriele Fatigati
So, in my opinion, it is better to put an MPI_Barrier before any MPI_Reduce to mitigate the "asynchronous" behaviour of MPI_Reduce in Open MPI. I suspect the same holds for other collective communications. Can someone explain why MPI_Reduce shows this strange behaviour?
There are many cases where adding an explicit barrier before a call to reduce would be superfluous, so the standard rightly says that it isn't needed and need not be performed. As you've seen, though, there are also cases where it can help. I'd be interested to know the effect if you added a barrier before MPI_Reduce only occasionally, perhaps every one or two hundred iterations; this can also have a beneficial effect, whereas a barrier every iteration adds significant overhead.
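In code, the occasional barrier might look like this sketch
(job_communicator, sendbuf, recvbuf, count and niters stand in for
whatever your application uses, do_local_work() is a hypothetical compute
step, and the interval of 100 is illustrative):

  const int SYNC_INTERVAL = 100;            /* tune per application */
  for (int iter = 0; iter < niters; iter++) {
      do_local_work(iter);                  /* hypothetical compute step */
      if (iter % SYNC_INTERVAL == 0)
          MPI_Barrier(job_communicator);    /* sporadic re-sync only */
      MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE,
                 MPI_SUM, 0, job_communicator);
  }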

This is a textbook example of where the new asynchronous barrier could help; in theory it should have the effect of being able to keep processes in sync without any additional overhead in the case that they are already well synchronised.

Ashley,
--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Terry Frankcombe
2010-09-09 01:52:34 UTC
Gabriele,

Can you clarify... those timings are what is reported for the reduction
call specifically, not the total execution time?

If so, then the difference is, to a first approximation, the time you
spend sitting idly by doing absolutely nothing waiting at the barrier.

Ciao
Terry
--
Dr. Terry Frankcombe
Research School of Chemistry, Australian National University
Ph: (+61) 0417 163 509 Skype: terry.frankcombe
Gabriele Fatigati
2010-09-09 07:11:51 UTC
Hi Terry,

that time is spent in MPI_Reduce; it isn't the total execution time.
Gabriele Fatigati
2010-09-09 07:14:31 UTC
In more detail:

the total execution time without the Barrier is about 10000 sec.

The total execution time with Barrier+Reduce is 9453 sec, with 128 procs.
Ralph Castain
2010-09-09 07:24:20 UTC
As people have said, these time values are to be expected. All they reflect is the time spent in the reduce waiting for the slowest process to catch up to everyone else. The barrier removes that factor from the reduce timing by forcing all processes to start from the same place.

No mystery here - just a reflection of the fact that your processes arrive at the MPI_Reduce calls at different times.
Terry Frankcombe
2010-09-09 07:31:17 UTC
Post by Ralph Castain
As people have said, these time values are to be expected. All they
reflect is the time difference spent in reduce waiting for the slowest
process to catch up to everyone else. The barrier removes that factor
by forcing all processes to start from the same place.
No mystery here - just a reflection of the fact that your processes
arrive at the MPI_Reduce calls at different times.
Yes, however, it seems Gabriele is saying the total execution time
*drops* by ~500 s when the barrier is put *in*. (Is that the right way
around, Gabriele?)

That's harder to explain as a sync issue.
Gabriele Fatigati
2010-09-09 07:36:28 UTC
Yes Terry,

that's right.
Ashley Pittman
2010-09-09 07:46:16 UTC
Post by Terry Frankcombe
Post by Ralph Castain
As people have said, these time values are to be expected. All they
reflect is the time difference spent in reduce waiting for the slowest
process to catch up to everyone else. The barrier removes that factor
by forcing all processes to start from the same place.
No mystery here - just a reflection of the fact that your processes
arrive at the MPI_Reduce calls at different times.
Yes, however, it seems Gabriele is saying the total execution time
*drops* by ~500 s when the barrier is put *in*. (Is that the right way
around, Gabriele?)
That's harder to explain as a sync issue.
Not really: you need some way of keeping processes in sync, or else the slow ones get slower and the fast ones stay fast. If you have an unbalanced algorithm then you can end up swamping certain ranks, and when they get behind they get even slower and performance goes off a cliff edge.

Adding sporadic barriers keeps everything in sync and running nicely: if things are performing well then the barrier only slows things down, but if there is a problem it'll bring all processes back together and destroy the positive feedback cycle. This is why you often only need a synchronisation point every so often. I'm also a huge fan of asynchronous barriers, as a full sync is a blunt and slow operation; using asynchronous barriers you can allow small differences in timing but prevent them from getting too large, with very little overhead in the common case where processes are already in sync. I'm thinking specifically of starting a sync-barrier on iteration N, waiting for it on N+25 and immediately starting another one, again waiting for it 25 steps later.
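Written out, the scheme might look like the sketch below, using the
proposed nonblocking-barrier interface (MPI_Ibarrier, as drafted for the
next MPI standard) purely for illustration; comm, niters and
do_iteration() are placeholders:

  const int LAG = 25;            /* wait for each barrier 25 steps after starting it */
  MPI_Request bar = MPI_REQUEST_NULL;
  for (int n = 0; n < niters; n++) {
      do_iteration(n);           /* hypothetical per-step work and communication */
      if (n % LAG == 0) {
          if (bar != MPI_REQUEST_NULL)
              MPI_Wait(&bar, MPI_STATUS_IGNORE);  /* barrier started LAG steps ago */
          MPI_Ibarrier(comm, &bar);               /* immediately start the next one */
      }
  }
  if (bar != MPI_REQUEST_NULL)
      MPI_Wait(&bar, MPI_STATUS_IGNORE);

In the common case every rank reaches the MPI_Wait long after the barrier
has completed, so nobody blocks; only a rank that has fallen more than LAG
iterations behind holds the others up.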

Ashley.
Ralph Castain
2010-09-09 11:24:22 UTC
Post by Ashley Pittman
Post by Terry Frankcombe
Post by Ralph Castain
As people have said, these time values are to be expected. All they
reflect is the time difference spent in reduce waiting for the slowest
process to catch up to everyone else. The barrier removes that factor
by forcing all processes to start from the same place.
No mystery here - just a reflection of the fact that your processes
arrive at the MPI_Reduce calls at different times.
Yes, however, it seems Gabriele is saying the total execution time
*drops* by ~500 s when the barrier is put *in*. (Is that the right way
around, Gabriele?)
That's harder to explain as a sync issue.
Not really, you need some way of keeping processes in sync or else the slow ones get slower and the fast ones stay fast. If you have an un-balanced algorithm then you can end up swamping certain ranks and when they get behind they get even slower and performance goes off a cliff edge.
Adding sporadic barriers keeps everything in sync and running nicely, if things are performing well then the barrier only slows things down but if there is a problem it'll bring all process back together and destroy the positive feedback cycle. This is why you often only need a synchronisation point every so often,
Precisely. And that is why we added the "sync" collective, which inserts a barrier every so often, since we don't have async barriers at this time. It also helps clean up memory growth due to unanticipated messages arriving during unbalanced algorithms.

See ompi_info --param coll all to see how to enable it.
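For reference, enabling it looks something like the sketch below; the
parameter names here may differ between releases, so trust the ompi_info
output on your system over this:

  # list the sync component's tunables
  ompi_info --param coll sync

  # illustrative values: raise the component's priority and insert a
  # barrier before every 1000th operation
  mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_before 1000 \
         -np 128 ./my_app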
Richard Treumann
2010-09-09 15:16:10 UTC
Ashley's observation may apply to an application that iterates on
many-to-one communication patterns. If the only collective used is
MPI_Reduce, some non-root tasks can get ahead and keep pushing iteration
results at tasks that are nearer the root. This could overload them and
cause some extra slowdown.

In most parallel applications, there is some web of interdependency across
tasks between iterations that keeps them roughly in step. I find it hard
to believe that there are many programs that need semantically redundant
MPI_Barriers.

For example -

In a program that does neighbor communication, no task can get very far
ahead of its neighbors. It is possible for a task at one corner to be a
few steps ahead of one at the opposite corner, but only a few steps. In
this case, though, the distant neighbor is not being affected by the task
that is out ahead anyway. It is only affected by its immediate neighbors.

In a program that does an MPI_Bcast from root and an MPI_Reduce to root in
each iteration, no task gets far ahead, because tasks that finish the
Bcast early just wait longer at the Reduce.

A program that makes a call to a non-rooted collective every iteration
will stay in pretty tight synch.

Think carefully before tossing in either MPI_Barrier or some non-blocking
barrier. Unless MPI_Bcast or MPI_Reduce is the only collective you call,
your problem is likely not progress skew.


Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363




Gus Correa
2010-09-09 16:00:28 UTC
Hello All

Gabriele's question, Ashley's recipe, and Dick Treumann's cautionary
words may be part of a larger context of load balance, or not?

Would Ashley's recipe of sporadic barriers be a silver bullet to
improve load imbalance problems, regardless of which collectives or
even point-to-point calls are in use?

I have in mind for instance our big climate models.
Some of them work in MPMD mode, where several executables
representing atmosphere, ocean, etc, have their own
communicators, but interact with each other indirectly,
coordinated by a flux coupler (within yet another communicator).
The coupler receives, merges, and sends data across the other (physical)
components.
The components don't talk to each other:
the coupler is the broker, the master.
This structure may be used in other fields and areas of application,
I'd guess.

More often than not some components lag behind (regardless of how
much you tune the number of processors assigned to each component),
slowing down the whole scheme.
The coupler must sit and wait for that late component,
the other components must sit and wait for the coupler,
and the (vicious) "positive feedback" cycle that
Ashley mentioned goes on and on.

Would sporadic barriers in the flux coupler "shake up" these delays?

Ashley: How did you get to the magic number of 25 iterations for the
sporadic barriers?
Would it be application/communicator pattern dependent?

Many thanks,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Eugene Loh
2010-09-09 16:38:21 UTC
Post by Gus Correa
More often than not some components lag behind (regardless of how
much you tune the number of processors assigned to each component),
slowing down the whole scheme.
The coupler must sit and wait for that late component,
the other components must sit and wait for the coupler,
and the (vicious) "positive feedback" cycle that
Ashley mentioned goes on and on.
I think "sit and wait" is the "typical" scenario that Dick mentions.
Someone lags, so someone else has to wait.

In contrast, the "feedback" cycle Ashley mentions is where someone lags
and someone else keeps racing ahead, pumping even more data at the
laggard, forcing the laggard ever further behind.
Richard Treumann
2010-09-09 18:10:13 UTC
I was pointing out that most programs have some degree of elastic
synchronization built in. Tasks (or groups or components in a coupled
model) seldom only produce data; they also consume what other tasks
produce, and that limits the potential skew.

If step n for a task (or group or coupled component) depends on data
produced by step n-1 in another task (or group or coupled component), then
no task can be farther ahead of the task it depends on than one step. If
there are 2 tasks that each need the other's step n-1 result to compute
step n, then they can never get farther than one step out of synch. If
there were a rank-ordered loop of 8 tasks, so that each one needs the
output of the prior step on task ((me-1) mod tasks) to compute, then you
can get more skew, because if
task 5 gets stalled in step 3,
task 6 will finish step 3 and send results to 7 but stall on recv for step
4 (lacking the end-of-step-3 send by task 5),
task 7 will finish step 4 and send results to 0 but stall on recv for
step 5,
task 0 will finish step 5 and send results to 1 but stall on recv for
step 6,
etc.
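A sketch of that rank-ordered loop (compute(), nsteps and the initial
boundary value are placeholders) shows why a task can complete its send
yet stall on its receive:

  int left  = (rank - 1 + size) % size;
  int right = (rank + 1) % size;
  double prev = 0.0, result;       /* prev holds the left neighbor's previous result */
  for (int step = 0; step < nsteps; step++) {
      MPI_Request req;
      result = compute(step, prev);         /* hypothetical step work */
      MPI_Isend(&result, 1, MPI_DOUBLE, right, step, comm, &req);
      /* the send can complete while this recv stalls if the left neighbor lags */
      MPI_Recv(&prev, 1, MPI_DOUBLE, left, step, comm, MPI_STATUS_IGNORE);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
  }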

In a 2D or 3D grid, the dependency is tighter, so the possible skew is
less, but it is still significant on a huge grid. In a program with
frequent calls to MPI_Allreduce on COMM_WORLD, the skew is very limited.
The available skew gets harder to predict as the interdependencies grow
more complex.

I call this "elasticity" because the amount of stretch varies but, like a
bungee cord or a waistband, only goes so far. Every parallel program has
some degree of elasticity built into the way its parts interact.

I assume a coupler has some elasticity too. That is, ocean and atmosphere
each model Monday and report in to the coupler, but neither can model
Tuesday until they get some of the Monday results generated by the other.
(I am pretending the granularity is day by day.) Wouldn't the right level
of synchronization among components result automatically from the data
dependencies among them?



Alex A. Granovsky
2010-09-09 18:40:06 UTC
Isn't it evident from the theory of random processes and probability
theory that, in the limit of an infinitely large cluster and parallel
process count, the probability of deadlock with the current implementation
is unfortunately a finite quantity, and in the limit approaches unity,
regardless of any particular details of the program?

Just my two cents.
Alex Granovsky

Eugene Loh
2010-09-09 19:32:20 UTC
Post by Alex A. Granovsky
Isn't it evident from the theory of random processes and probability theory that, in the limit of an infinitely large cluster and parallel process count, the probability of deadlock with the current implementation is unfortunately a finite quantity, and in the limit approaches unity, regardless of any particular details of the program?
No, not at all. Consider simulating a physical volume. Each process is assigned to some small subvolume. It updates conditions locally, but on the surface of its simulation subvolume it needs information from "nearby" processes. It cannot proceed along the surface until it has that neighboring information. Its neighbors, in turn, cannot proceed until their neighbors have reached some point. Two distant processes can be quite out of step with one another, but only by some bounded amount. At some point, a leading process has to wait for information from a laggard to propagate to it. All processes proceed together, in some loose lock-step fashion. Many applications behave in this fashion. Actually, in many applications, the synchronization is tightened in that "physics" is made to propagate faster than neighbor-by-neighbor.

As the number of processes increases, the laggard might seem relatively slower in comparison, but that isn't deadlock.

As the size of the cluster increases, the chances of a system component failure increase, but that also is a different matter.
Alex A. Granovsky
2010-09-09 22:37:44 UTC
Eugene,

you did not take into account the dispersion/dephasing between different
processes. As the cluster size and the number of instances of the parallel
process increase, the dispersion increases as well, making the different
instances drift effectively out of sync - not truly out of sync, but out
of step simply because of the different speeds of execution on different
nodes, delays, etc.
If you account for this, you get the result I mentioned.

Alex

Ashley Pittman
2010-09-09 19:34:38 UTC
Post by Gus Correa
Hello All
Gabriele's question, Ashley's recipe, and Dick Treumann's cautionary words may be part of a larger context of load balance, or not?
Would Ashley's recipe of sporadic barriers be a silver bullet to
improve load imbalance problems, regardless of which collectives or
even point-to-point calls are in use?
No, it only holds where there is no data dependency between some of the ranks. In particular, if there are any non-rooted collectives in an iteration of your code then it cannot make any difference at all; likewise, if you have a reduce followed by a barrier using the same root, for example, then you already have global synchronisation each iteration and it won't help. My feeling is that it applies to a significant minority of problems; certainly the phrase "adding barriers can make codes faster" should be textbook stuff if it isn't already.
Post by Gus Correa
Would sporadic barriers in the flux coupler "shake up" these delays?
I don't fully understand your description, but it sounds like it might set the program back to a clean slate, which would give you per-iteration delays only rather than cumulative or worse delays.
Post by Gus Correa
Ashley: How did you get to the magic number of 25 iterations for the
sporadic barriers?
Experience and a finger in the air. The major factors in picking this number are the likelihood of a positive feedback cycle of delays happening, the delay those delays add, and the cost of a barrier itself. Having too low a value will slightly reduce performance; having too high a value can drastically reduce performance.

As a further item (because I like them), the asynchronous barrier is even better again if used properly: in the good case it doesn't cause any process to block, ever, so the cost is only that of the CPU cycles the call itself takes; in the bad case, where it has to delay a rank, this tends to have a positive impact on performance.
Post by Gus Correa
Would it be application/communicator pattern dependent?
Absolutely.

Ashley,
jody
2010-09-09 20:10:26 UTC
Hi
@Ashley:
What is the exact semantics of an asynchronous barrier,
and is it part of the MPI specs?

Thanks
Jody
Ashley Pittman
2010-09-09 20:31:27 UTC
Post by jody
Hi
What is the exact semantics of an asynchronous barrier,
I'm not sure of the exact semantics, but once you've got your head around the concept it's fairly simple to understand how to use it: you call MPI_IBarrier() and it gives you a handle you can test with MPI_Test() or block on with MPI_Wait(). The tricky part comes in how many times you can call MPI_IBarrier() on a communicator without waiting for the previous barriers to complete, but I haven't been following the discussion on this one to know the specifics.
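In sketch form (again using the draft MPI_Ibarrier syntax for
illustration; comm and do_useful_work() are placeholders):

  MPI_Request req;
  int done = 0;
  MPI_Ibarrier(comm, &req);          /* returns immediately */
  while (!done) {
      do_useful_work();              /* overlap computation with the barrier */
      MPI_Test(&req, &done, MPI_STATUS_IGNORE);
  }
  /* 'done' means every rank in comm has entered the barrier */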
Post by jody
and is it part of the MPI specs?
It will be part of the next revision of the standard, I believe. It's been a long time coming, and there is at least one implementation out there already, but I can't comment on its usability today. To be clear, it's something I've long advocated and have implemented and played around with in the past; however, it's not yet available to users today, but I believe it will be shortly, and as you'll have read, my belief is that it's going to be a very useful addition to the MPI offering.

Ashley,
Jeff Squyres
2010-09-22 12:59:50 UTC
Post by Ashley Pittman
Post by jody
What is the exact semantics of an asynchronous barrier,
I'm not sure of the exact semantics but once you've got your head around the concept it's fairly simple to understand how to use it, you call MPI_IBarrier() and it gives you a handle you can test with MPI_Test() or block for with MPI_Wait(). The tricky part comes in how many times you can call MPI_IBarrier() on a communicator without waiting for the previous barriers to complete but I haven't been following the discussion on this one to know the specifics.
FWIW: here's a brief writeup of MPI_Ibarrier -

http://blogs.cisco.com/ciscotalk/performance/comments/ibarrier/
--
Jeff Squyres
***@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Richard Treumann
2010-09-09 20:40:17 UTC
Ashley

Can you provide an example of a situation in which these semantically
redundant barriers help?

I may be missing something, but my statement for the textbook would be

"If adding a barrier to your MPI program makes it run faster, there is
almost certainly a flaw in it that is better solved another way."

The only exception I can think of is some sort of one-direction data
dependency with messages small enough to go eagerly. A program that calls
MPI_Reduce with a small message and the same root every iteration, and
calls no other collective, would be an example.

In that case, fast tasks at leaf positions would run free, and a slow task
near the root could pile up early arrivals and end up with some additional
slowing. Unless it was driven into paging, I cannot imagine the slowdown
would be significant, though.

Even that should not be a problem for an MPI implementation that backs
off on eager send before it floods early arrival buffers.


Ashley Pittman
2010-09-09 21:34:15 UTC
Post by Richard Treumann
Ashley
Can you provide an example of a situation in which these semantically redundant barriers help?
I'm not making the case for semantically redundant barriers; I'm making a case for implicit synchronisation in every iteration of an application. Many applications have this already by nature of the data-flow required - anything that calls mpi_allgather or mpi_allreduce is the easiest to verify, but there are many other ways of achieving the same thing. My point is about the subset of programs which don't have this attribute and are therefore susceptible to synchronisation problems. It's my experience that for low iteration counts these codes can run fine, but once they hit a problem they go over a cliff edge performance-wise and there is no way back from there until the end of the job. The email from Gabriele would appear to be a case that demonstrates this problem, but I've seen it many times before.

Using your previous email as an example, I would describe adding barriers to a program as a way of artificially reducing the "elasticity" of the program to ensure balanced use of resources.
Post by Richard Treumann
I may be missing something but my statement for the text book would be
"If adding a barrier to your MPI program makes it run faster, there is almost certainly a flaw in it that is better solved another way."
The only exception I can think of is some sort of one direction data dependancy with messages small enough to go eagerly. A program that calls MPI_Reduce with a small message and the same root every iteration and calls no other collective would be an example.
In that case, fast tasks at leaf positions would run free and a slow task near the root could pile up early arrivals and end up with some additional slowing. Unless it was driven into paging I cannot imagine the slowdown would be significant though.
I've diagnosed problems where the cause was a receive queue of tens of thousands of messages; in that case each and every receive performs slowly unless the descriptor is near the front of the queue, so the concern is not purely about memory usage at individual processes, although that can also be a factor.

Ashley,
Richard Treumann
2010-09-10 14:08:26 UTC
Hi Ashley

I understand that the problem with descriptor flooding can be serious in
an application with a unidirectional data dependency. Perhaps we have a
different perception of how common that is.

It seems to me that such programs would be very rare, but if they are more
common than I imagine, then discussion of how to modulate them is
worthwhile. In many cases, I think that adding some flow control to the
application is a better solution than a semantically redundant barrier. (A
barrier that is there only to affect performance, not correctness, is what
I mean by semantically redundant.)

For example, a Master/Worker application could have each worker break
after every 4th send to the master and post an MPI_Recv for an
OK_to_continue token. If the token had been sent, this would delay the
worker a few microseconds. If it had not been sent, the worker would be
kept waiting.

The Master would keep track of how many messages from each worker it had
absorbed and, on message 3 from a particular worker, send an OK_to_continue
token to that worker. The master would keep sending OK_to_continue tokens
every 4th recv from then on (7, 11, 15 ...). The descriptor queues would
all remain short, and only a worker that the master could not keep up with
would ever lose a chance to keep working. By sending the OK_to_continue
token a bit early, the application would ensure that, when there was no
backlog, every worker would find a token when it looked for it, and there
would be no significant loss of compute time.
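A condensed sketch of both sides (the tags, the window of 4, and names
such as compute_result(), buf, len, master, nresults, total_msgs and the
zero-initialized counter array nrecvd are all illustrative):

  #define RESULT_TAG 1
  #define TOKEN_TAG  2
  #define WINDOW     4

  /* Worker: pause for a token after every WINDOW-th result. */
  for (int i = 1; i <= nresults; i++) {
      compute_result(buf);                              /* hypothetical */
      MPI_Send(buf, len, MPI_DOUBLE, master, RESULT_TAG, comm);
      if (i % WINDOW == 0) {
          int token;
          MPI_Recv(&token, 1, MPI_INT, master, TOKEN_TAG, comm,
                   MPI_STATUS_IGNORE);
      }
  }

  /* Master: send each worker's token one message early (3rd, 7th, 11th,
     ...) so a worker the master is keeping up with never actually waits. */
  for (int m = 0; m < total_msgs; m++) {
      MPI_Status st;
      MPI_Recv(buf, len, MPI_DOUBLE, MPI_ANY_SOURCE, RESULT_TAG, comm, &st);
      int w = st.MPI_SOURCE;
      if (++nrecvd[w] % WINDOW == WINDOW - 1) {         /* 3, 7, 11, ... */
          int token = 1;
          MPI_Send(&token, 1, MPI_INT, w, TOKEN_TAG, comm);
      }
  }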

Even with a non-blocking barrier and a 10-step lag between Ibarrier and
Wait, if some worker is slow for 12 steps, the fast workers end up being
kept in a non-productive MPI_Wait.

Dick


Eugene Loh
2010-09-10 14:27:02 UTC
Post by Richard Treumann
I understand that the problem with descriptor flooding can be serious in an application with a unidirectional data dependency. Perhaps we have a different perception of how common that is.
Ashley speculated it was a "significant minority." I don't know what that means, but it seems like it is a minority (most computations have causal relationships among the processes holding unbounded imbalances in check) and yet we end up seeing these exceptions.
Post by Richard Treumann
I think that adding some flow control to the application is a better solution than a semantically redundant barrier.
It seems to me there is no difference. Flow control, at this level, is just semantically redundant synchronization. A barrier is just a special case of that.
Richard Treumann
2010-09-10 15:24:27 UTC
The difference is that MPI_Barrier, or even MPI_Ibarrier, is a crude tool
for the job. It is likely to leave resources idle that could be doing
productive work.

I agree it is part of the tool kit, but if this kind of problem is
significant enough that a textbook should cover it, then I would advocate
that people be taught where to look for the problem and to consider which
tool to use.

The first step is to look for ways to improve load balance. The second
step is to look for ways to impede as few processes as possible, and only
when they must be impeded.

I am also suggesting that there is a middle class of application that
improves with a semantically redundant barrier: applications with a
significant design flaw that it would be better to find and fix than to
mask.

The MPI standard provides for flow control to prevent eager-message
flooding, but it does not provide flow control for descriptor flooding.
Descriptor flooding may even bring down a job, and the standard does not
oblige the MPI implementation to prevent that. This problem was understood
from the beginning, but it cannot be solved within the semantic rules for
non-blocking Send/Recv. The options were to give the problem to the
application writer or to say "A non-blocking operation will return
promptly unless there is a resource shortage".

Dick


Gabriele Fatigati
2010-09-09 07:34:42 UTC
Hmm, I don't understand.
The experiments on my application show that intensive use of
Barrier+Reduce is faster than a single Reduce.