Discussion:
[OMPI users] Performance: MPICH2 vs OpenMPI
Sangamesh B
2008-10-08 13:10:03 UTC
Permalink
Hi All,

I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI
supports both Ethernet and InfiniBand. Before doing that, I tested an
application, GROMACS, to compare the performance of MPICH2 and OpenMPI. Both
have been compiled with the GNU compilers.

After this benchmark, I found that OpenMPI is slower than MPICH2.

This benchmark was run on dual-core, dual-socket AMD Opteron nodes. Both
libraries were compiled with their default configurations.

The job is run on 2 nodes - 8 cores.

OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.

Any comments ..?

Thanks,
Sangamesh
Ray Muno
2008-10-08 13:25:52 UTC
Permalink
I would be interested in what others have to say about this as well.

We have been doing a bit of performance testing since we are deploying a
new cluster and it is our first InfiniBand based set up.

In our experience, so far, OpenMPI is coming out faster than MVAPICH.
Comparisons were made with different compilers, PGI and Pathscale. We do
not have a running implementation of OpenMPI with SunStudio compilers.

Our tests were with actual user codes running on up to 600 processors so
far.
Post by Sangamesh B
Hi All,
I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI
supports both ethernet and infiniband. Before doing that I tested an
application 'GROMACS' to compare the performance of MPICH2 & OpenMPI. Both
have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor. Both have
compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
Any comments ..?
Thanks,
Sangamesh
-Ray Muno
Aerospace Engineering.
Brock Palen
2008-10-08 13:39:58 UTC
Permalink
You're doing this on just one node? That would be using the Open MPI SM
transport. Last I knew it wasn't that optimized, though it should still
be much faster than TCP.

I am surprised at your result, though I do not have MPICH2 on the
cluster right now and I don't have time to compare.

How did you run the job?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
***@umich.edu
(734)936-1985
Post by Sangamesh B
Hi All,
I wanted to switch from mpich2/mvapich2 to OpenMPI, as
OpenMPI supports both ethernet and infiniband. Before doing that I
tested an application 'GROMACS' to compare the performance of
MPICH2 & OpenMPI. Both have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor.
Both have compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
Any comments ..?
Thanks,
Sangamesh
Sangamesh B
2008-10-08 13:57:42 UTC
Permalink
Post by Brock Palen
You're doing this on just one node? That would be using the Open MPI SM
transport. Last I knew it wasn't that optimized, though it should still be much
faster than TCP.
It's on 2 nodes. I'm using TCP only; there is no InfiniBand hardware.
Post by Brock Palen
I am surprised at your result, though I do not have MPICH2 on the cluster
right now and I don't have time to compare.
How did you run the job?
MPICH2:

time /opt/mpich2/gnu/bin/mpirun -machinefile ./mach -np 8
/opt/apps/gromacs333/bin/mdrun_mpi | tee gro_bench_8p

OpenMPI:

$ time /opt/ompi127/bin/mpirun -machinefile ./mach -np 8
/opt/apps/gromacs333_ompi/bin/mdrun_mpi | tee gromacs_openmpi_8process
Post by Brock Palen
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
(734)936-1985
Hi All,
Post by Sangamesh B
I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI
supports both ethernet and infiniband. Before doing that I tested an
application 'GROMACS' to compare the performance of MPICH2 & OpenMPI. Both
have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor. Both
have compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
Any comments ..?
Thanks,
Sangamesh
Jeff Squyres
2008-10-08 13:46:21 UTC
Permalink
Post by Sangamesh B
I wanted to switch from mpich2/mvapich2 to OpenMPI, as
OpenMPI supports both ethernet and infiniband. Before doing that I
tested an application 'GROMACS' to compare the performance of MPICH2
& OpenMPI. Both have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor.
Both have compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
A few things:

- What version of Open MPI are you using? Please send the information
listed here:

http://www.open-mpi.org/community/help/

- Did you specify to use mpi_leave_pinned? Use "--mca
mpi_leave_pinned 1" on your mpirun command line (I don't know if leave
pinned behavior benefits Gromacs or not, but it likely won't hurt)

- Did you enable processor affinity? Use "--mca mpi_paffinity_alone
1" on your mpirun command line.

- Are you sure that Open MPI didn't fall back to ethernet (and not use
IB)? Use "--mca btl openib,self" on your mpirun command line.

- Have you tried compiling Open MPI with something other than GCC?
Just this week, we've gotten some reports from an OMPI member that
they are sometimes seeing *huge* performance differences with OMPI
compiled with GCC vs. any other compiler (Intel, PGI, Pathscale). We
are working to figure out why; no root cause has been identified yet.
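For concreteness, those options can be combined on a single command line.
A sketch, reusing the machinefile and mdrun_mpi path from earlier in this
thread (the openib BTL only applies if InfiniBand support is actually built
and present):

$ mpirun --mca btl openib,self --mca mpi_leave_pinned 1 \
    --mca mpi_paffinity_alone 1 \
    -machinefile ./mach -np 8 /opt/apps/gromacs333_ompi/bin/mdrun_mpi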
--
Jeff Squyres
Cisco Systems
Sangamesh B
2008-10-08 14:26:10 UTC
Permalink
Post by Sangamesh B
I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI
Post by Sangamesh B
supports both ethernet and infiniband. Before doing that I tested an
application 'GROMACS' to compare the performance of MPICH2 & OpenMPI. Both
have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor. Both
have compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
- What version of Open MPI are you using? Please send the information
1.2.7
Post by Sangamesh B
http://www.open-mpi.org/community/help/
- Did you specify to use mpi_leave_pinned?
No
Post by Sangamesh B
Use "--mca mpi_leave_pinned 1" on your mpirun command line (I don't know if
leave pinned behavior benefits Gromacs or not, but it likely won't hurt)
- Did you enable processor affinity?
No
Post by Sangamesh B
Use "--mca mpi_paffinity_alone 1" on your mpirun command line.
Will use these options in the next benchmark
Post by Sangamesh B
- Are you sure that Open MPI didn't fall back to ethernet (and not use IB)?
Use "--mca btl openib,self" on your mpirun command line.
I'm using TCP. There is no InfiniBand support. But can the results still be
compared even so?
Post by Sangamesh B
- Have you tried compiling Open MPI with something other than GCC?
No.
Post by Sangamesh B
Just this week, we've gotten some reports from an OMPI member that they
are sometimes seeing *huge* performance differences with OMPI compiled with
GCC vs. any other compiler (Intel, PGI, Pathscale). We are working to
figure out why; no root cause has been identified yet.
I'll try for other than gcc and comeback to you
Post by Sangamesh B
--
Jeff Squyres
Cisco Systems
Sangamesh B
2008-10-08 14:27:31 UTC
Permalink
FYI, attached here are the OpenMPI install details.
Post by Sangamesh B
Post by Sangamesh B
I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI
Post by Sangamesh B
supports both ethernet and infiniband. Before doing that I tested an
application 'GROMACS' to compare the performance of MPICH2 & OpenMPI. Both
have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor. Both
have compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
- What version of Open MPI are you using? Please send the information
1.2.7
Post by Sangamesh B
http://www.open-mpi.org/community/help/
- Did you specify to use mpi_leave_pinned?
No
Post by Sangamesh B
Use "--mca mpi_leave_pinned 1" on your mpirun command line (I don't know
if leave pinned behavior benefits Gromacs or not, but it likely won't hurt)
- Did you enable processor affinity?
No
Post by Sangamesh B
Use "--mca mpi_paffinity_alone 1" on your mpirun command line.
Will use these options in the next benchmark
Post by Sangamesh B
- Are you sure that Open MPI didn't fall back to ethernet (and not use
IB)? Use "--mca btl openib,self" on your mpirun command line.
I'm using TCP. There is no InfiniBand support. But can the results still be
compared even so?
Post by Sangamesh B
- Have you tried compiling Open MPI with something other than GCC?
No.
Post by Sangamesh B
Just this week, we've gotten some reports from an OMPI member that they
are sometimes seeing *huge* performance differences with OMPI compiled with
GCC vs. any other compiler (Intel, PGI, Pathscale). We are working to
figure out why; no root cause has been identified yet.
I'll try for other than gcc and comeback to you
Post by Sangamesh B
--
Jeff Squyres
Cisco Systems
Jeff Squyres
2008-10-08 14:51:23 UTC
Permalink
Post by Jeff Squyres
- What version of Open MPI are you using? Please send the
1.2.7
http://www.open-mpi.org/community/help/
- Did you specify to use mpi_leave_pinned?
No
Use "--mca mpi_leave_pinned 1" on your mpirun command line (I don't
know if leave pinned behavior benefits Gromacs or not, but it likely
won't hurt)
I see from your other mail that you are not using IB. If you're only
using TCP, then mpi_leave_pinned will have little/no effect.
Post by Jeff Squyres
- Did you enable processor affinity?
No
Use "--mca mpi_paffinity_alone 1" on your mpirun command line.
Will use these options in the next benchmark
- Are you sure that Open MPI didn't fall back to ethernet (and not
use IB)? Use "--mca btl openib,self" on your mpirun command line.
I'm using TCP. There is no infiniband support. But eventhough the
results can be compared?
Yes, they should be comparable. We've always known that our TCP
support is "ok" but not "great" (truthfully: we've not tuned it nearly
as extensively as we've tuned our other transports). But such a huge
performance difference is surprising.

Is this on 1 or more nodes? It might be useful to separate the
TCP and shared memory performance differences. I believe that MPICH2's
shmem performance is likely to be better than OMPI v1.2's, but like
TCP, it shouldn't be *that* huge.
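One way to separate the two is to force a single transport per run. A
sketch, where tcp, sm, and self are the Open MPI 1.2 TCP, shared-memory,
and loopback BTLs, and the paths are the ones used earlier in this thread:

$ # shared memory only: all 4 ranks on one node
$ mpirun --mca btl sm,self -np 4 /opt/apps/gromacs333_ompi/bin/mdrun_mpi
$ # force TCP even within a node: 8 ranks across the 2 nodes
$ mpirun --mca btl tcp,self -machinefile ./mach -np 8 /opt/apps/gromacs333_ompi/bin/mdrun_mpi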
Post by Jeff Squyres
- Have you tried compiling Open MPI with something other than GCC?
No.
Just this week, we've gotten some reports from an OMPI member that
they are sometimes seeing *huge* performance differences with OMPI
compiled with GCC vs. any other compiler (Intel, PGI, Pathscale).
We are working to figure out why; no root cause has been identified
yet.
I'll try for other than gcc and comeback to you
That would be most useful; thanks.
--
Jeff Squyres
Cisco Systems
Ashley Pittman
2008-10-08 14:58:12 UTC
Permalink
Post by Jeff Squyres
- Have you tried compiling Open MPI with something other than GCC?
Just this week, we've gotten some reports from an OMPI member that
they are sometimes seeing *huge* performance differences with OMPI
compiled with GCC vs. any other compiler (Intel, PGI, Pathscale).
We
are working to figure out why; no root cause has been identified yet.
Jeff,

You probably already know this, but the obvious candidate here is the
memcpy() function: icc sticks in its own, which in some cases is much
better than the libc one. It's unusual for compilers to show *huge*
differences from code optimisations alone.

Ashley,
Brock Palen
2008-10-08 15:29:30 UTC
Permalink
Post by Ashley Pittman
Jeff,
You probably already know this but the obvious candidate here is the
memcpy() function, icc sticks in it's own which in some cases is much
better than the libc one. It's unusual for compilers to have *huge*
differences from code optimisations alone.
I know this is off topic, but I was interested in this. I compared
dcopy() from BLAS, memcpy(), and plain C code with the optimizer
turned up in PGI 7.2.

Results are here:

http://www.mlds-networks.com/index.php/component/option,com_mojo/Itemid,29/p,49/

</OT>
Post by Ashley Pittman
Ashley,
Jeff Squyres
2008-10-08 16:58:38 UTC
Permalink
Post by Ashley Pittman
You probably already know this but the obvious candidate here is the
memcpy() function, icc sticks in it's own which in some cases is much
better than the libc one. It's unusual for compilers to have *huge*
differences from code optimisations alone.
Yep -- memcpy is one of the things that we're looking at. Haven't
heard back on the results from the next round of testing yet (one of
the initial suggestions we had was to separate openib vs. sm
performance and see if one of them yielded an obvious difference).
--
Jeff Squyres
Cisco Systems
Samuel Sarholz
2008-10-08 13:47:49 UTC
Permalink
Hi,

my experience is that OpenMPI has slightly lower latency and lower
bandwidth than Intel MPI (which is based on MPICH2) over InfiniBand.
I don't remember the numbers for shared memory.

As you are seeing a huge difference, I would suspect that either
something about your compilation is strange or, more probably, that you
are hitting the ccNUMA effect of the Opteron.
You might want to bind the MPI processes (and even clean the filesystem
caches) to avoid that effect.
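A sketch of that binding with Open MPI 1.2 (mpi_paffinity_alone binds each
rank to a processor, assuming the job has the node to itself; the
cache-dropping step assumes a Linux node with root access and is optional):

$ sync && echo 3 > /proc/sys/vm/drop_caches   # clean filesystem caches (as root)
$ mpirun --mca mpi_paffinity_alone 1 -machinefile ./mach -np 8 mdrun_mpi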

best regards,
Samuel
Post by Sangamesh B
Hi All,
I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI
supports both ethernet and infiniband. Before doing that I tested an
application 'GROMACS' to compare the performance of MPICH2 & OpenMPI.
Both have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor. Both
have compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
Any comments ..?
Thanks,
Sangamesh
Eugene Loh
2008-10-08 21:01:29 UTC
Permalink
Post by Sangamesh B
I wanted to switch from mpich2/mvapich2 to OpenMPI, as OpenMPI
supports both ethernet and infiniband. Before doing that I tested an
application 'GROMACS' to compare the performance of MPICH2 & OpenMPI.
Both have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor. Both
have compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
I agree with Samuel that this difference is strikingly large.

I had a thought that might not apply to your case, but I figured I'd
share it anyhow.

I don't understand MPICH very well, but it seemed as though some of the
flags used in building MPICH are supposed to be added in automatically
to the mpicc/etc compiler wrappers. That is, if you specified CFLAGS=-O
to build MPICH, then if you compiled an application with mpicc you would
automatically get -O. At least that was my impression. Maybe I
misunderstood the documentation. (If you want to use some flags just
for building MPICH but you don't want users to get those flags
automatically when they use mpicc, you're supposed to use flags like
MPICH2LIB_CFLAGS instead of just CFLAGS when you run "configure".)
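A small illustration of that distinction, as a sketch (flag values are
arbitrary; the behavior described is for MPICH2 1.0.x):

% ./configure CFLAGS="-O2"             # -O2 is also folded into the mpicc wrapper
% ./configure MPICH2LIB_CFLAGS="-O2"   # -O2 is used only to build the MPICH2 libraries
% mpicc -show -c foo.c                 # prints the compile line the wrapper would actually run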

Not only may this theory not apply to your case, but I'm not even sure
it holds water. I just tried building MPICH2 with --enable-fast turned
on. The "configure" output indicates I'm getting CFLAGS=-O2, but when I
run "mpicc -show" it seems to invoke gcc without any optimization flags
by default.

So, I guess I'm sending this mail less to help you and more as a request
that someone might improve my understanding.

With regards to your issue, do you have any indication when you get that
25m39s timing if there is a grotesque amount of time being spent in MPI
calls? Or, is the slowdown due to non-MPI portions?
Brian Dobbins
2008-10-08 21:09:48 UTC
Permalink
Hi guys,

[From Eugene Loh:]
Post by Sangamesh B
OpenMPI - 25 m 39 s.
Post by Sangamesh B
MPICH2 - 15 m 53 s.
With regards to your issue, do you have any indication when you get that
25m39s timing if there is a grotesque amount of time being spent in MPI
calls? Or, is the slowdown due to non-MPI portions?
Just to add my two cents: if this job *can* be run on fewer than 8
processors (ideally, even on just 1), then I'd recommend doing so. That is,
run it with OpenMPI and with MPICH2 on 1, 2 and 4 processors as well. If
the single-processor jobs still give vastly different timings, then perhaps
Eugene is on the right track and it comes down to various computational
optimizations and not so much the message passing that's making the
difference. Timings from 2- and 4-process runs might be interesting as well
to see how this difference changes with process counts.
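A sketch of such a sweep, reusing the machinefile and binary paths from
earlier in the thread (repeat the same loop with the MPICH2 mpirun to get
the matching numbers):

$ for np in 1 2 4 8; do
    ( time mpirun -machinefile ./mach -np $np \
        /opt/apps/gromacs333_ompi/bin/mdrun_mpi ) 2>&1 | tee gromacs_ompi_${np}p
  done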

I've seen differences between various MPI libraries before, but nothing
quite this severe either. If I get the time, maybe I'll try to set up
Gromacs tonight -- I've got both MPICH2 and OpenMPI installed here and can
try to duplicate the runs. Sangamesh, is this a standard benchmark case
that anyone can download and run?

Cheers,
- Brian


Brian Dobbins
Yale Engineering HPC
Aurélien Bouteiller
2008-10-08 21:25:34 UTC
Permalink
Make sure you don't use a "debug" build of Open MPI. If you use
trunk, the build system detects it and turns on debug by default. It
really kills performance. --disable-debug will remove all those
nasty printfs from the critical path.
Jeff Squyres
2008-10-09 00:10:48 UTC
Permalink
Post by Aurélien Bouteiller
Make sure you don't use a "debug" build of Open MPI. If you use
trunk, the build system detects it and turns on debug by default. It
really kills performance. --disable-debug will remove all those
nasty printfs from the critical path.
You can easily tell if you have a debug build of OMPI with the
ompi_info command:

shell$ ompi_info | grep debug
Internal debug support: no
Memory debugging support: no
shell$

You want to see "no" for both of those.
--
Jeff Squyres
Cisco Systems
Sangamesh B
2008-10-09 12:06:19 UTC
Permalink
Make sure you don't use a "debug" build of Open MPI. If you use trunk, the
build system detects it and turns on debug by default. It really kills
performance. --disable-debug will remove all those nasty printfs from the
critical path.
You can easily tell if you have a debug build of OMPI with the ompi_info
shell$ ompi_info | grep debug
Internal debug support: no
Memory debugging support: no
shell$
Yes. It is "no"
$ /opt/ompi127/bin/ompi_info -all | grep debug
Internal debug support: no
Memory debugging support: no

I've tested GROMACS for a single process (mpirun -np 1):
Here are the results:

OpenMPI : 120m 6s

MPICH2 : 67m 44s

I'm trying to build the codes with PGI, but I'm facing a problem with the
compilation of GROMACS.
You want to see "no" for both of those.
--
Jeff Squyres
Cisco Systems
Brock Palen
2008-10-09 14:00:06 UTC
Permalink
Which benchmark did you use?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
***@umich.edu
(734)936-1985
Post by Aurélien Bouteiller
Make sure you don't use a "debug" build of Open MPI. If you use
trunk, the build system detects it and turns on debug by default.
It really kills performance. --disable-debug will remove all those
nasty printfs from the critical path.
You can easily tell if you have a debug build of OMPI with the
shell$ ompi_info | grep debug
Internal debug support: no
Memory debugging support: no
shell$
Yes. It is "no"
$ /opt/ompi127/bin/ompi_info -all | grep debug
Internal debug support: no
Memory debugging support: no
OpenMPI : 120m 6s
MPICH2 : 67m 44s
I'm trying to bulid the codes with PGI, but facing problem with
compilation of GROMACS.
You want to see "no" for both of those.
--
Jeff Squyres
Cisco Systems
Sangamesh B
2008-10-10 05:15:37 UTC
Permalink
Post by Brock Palen
Which benchmark did you use?
Out of the 4 benchmarks, I used the d.dppc benchmark.
Post by Brock Palen
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
(734)936-1985
Make sure you don't use a "debug" build of Open MPI. If you use trunk, the
build system detects it and turns on debug by default. It really kills
performance. --disable-debug will remove all those nasty printfs from the
critical path.
You can easily tell if you have a debug build of OMPI with the ompi_info
shell$ ompi_info | grep debug
Internal debug support: no
Memory debugging support: no
shell$
Yes. It is "no"
$ /opt/ompi127/bin/ompi_info -all | grep debug
Internal debug support: no
Memory debugging support: no
OpenMPI : 120m 6s
MPICH2 : 67m 44s
I'm trying to bulid the codes with PGI, but facing problem with
compilation of GROMACS.
You want to see "no" for both of those.
--
Jeff Squyres
Cisco Systems
Brock Palen
2008-10-10 16:57:34 UTC
Permalink
Actually I had much different results.

gromacs-3.3.1 one node dual core dual socket opt2218 openmpi-1.2.7
pgi/7.2
mpich2 gcc

19M OpenMPI
M Mpich2

So for me OpenMPI+PGI was faster; I don't know how you got such a low
MPICH2 number.

On the other hand if you do this preprocess before you run:

grompp -sort -shuffle -np 4
mdrun -v

With -sort and -shuffle the OpenMPI run time went down,

12M OpenMPI + sort shuffle

I think my install of mpich2 may be bad, I have never installed it
before, only mpich1, OpenMPI and LAM. So take my mpich2 numbers with
salt, Lots of salt.

On that point though -sort -shuffle may be useful for you, be sure to
understand what they do before you use them.
Read:
http://cac.engin.umich.edu/resources/software/gromacs.html

Last, make sure that you're using the single-precision version of
GROMACS for both runs; the double-precision build is about half the
speed of the single.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
***@umich.edu
(734)936-1985
Post by Brock Palen
Which benchmark did you use?
Out of 4 benchmarks I used d.dppc benchmark.
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
(734)936-1985
Make sure you don't use a "debug" build of Open MPI. If you use
trunk, the build system detects it and turns on debug by default.
It really kills performance. --disable-debug will remove all those
nasty printfs from the critical path.
You can easily tell if you have a debug build of OMPI with the
shell$ ompi_info | grep debug
Internal debug support: no
Memory debugging support: no
shell$
Yes. It is "no"
$ /opt/ompi127/bin/ompi_info -all | grep debug
Internal debug support: no
Memory debugging support: no
OpenMPI : 120m 6s
MPICH2 : 67m 44s
I'm trying to bulid the codes with PGI, but facing problem with
compilation of GROMACS.
You want to see "no" for both of those.
--
Jeff Squyres
Cisco Systems
Brock Palen
2008-10-10 17:08:13 UTC
Permalink
Whoops, I didn't include the MPICH2 numbers:

20M MPICH2, same node.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
***@umich.edu
(734)936-1985
Post by Brock Palen
Actually I had a much differnt results,
gromacs-3.3.1 one node dual core dual socket opt2218
openmpi-1.2.7 pgi/7.2
mpich2 gcc
19M OpenMPI
M Mpich2
So for me OpenMPI+pgi was faster, I don't know how you got such a
low mpich2 number.
grompp -sort -shuffle -np 4
mdrun -v
With -sort and -shuffle the OpenMPI run time went down,
12M OpenMPI + sort shuffle
I think my install of mpich2 may be bad, I have never installed it
before, only mpich1, OpenMPI and LAM. So take my mpich2 numbers
with salt, Lots of salt.
On that point though -sort -shuffle may be useful for you, be sure
to understand what they do before you use them.
http://cac.engin.umich.edu/resources/software/gromacs.html
Last, make sure that your using the single precision version of
gromacs for both runs. the double is about half the speed of the
single.
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
(734)936-1985
Post by Brock Palen
Which benchmark did you use?
Out of 4 benchmarks I used d.dppc benchmark.
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
(734)936-1985
Make sure you don't use a "debug" build of Open MPI. If you use
trunk, the build system detects it and turns on debug by default.
It really kills performance. --disable-debug will remove all those
nasty printfs from the critical path.
You can easily tell if you have a debug build of OMPI with the
shell$ ompi_info | grep debug
Internal debug support: no
Memory debugging support: no
shell$
Yes. It is "no"
$ /opt/ompi127/bin/ompi_info -all | grep debug
Internal debug support: no
Memory debugging support: no
OpenMPI : 120m 6s
MPICH2 : 67m 44s
I'm trying to bulid the codes with PGI, but facing problem with
compilation of GROMACS.
You want to see "no" for both of those.
--
Jeff Squyres
Cisco Systems
Brian Dobbins
2008-10-10 17:10:51 UTC
Permalink
Hi guys,
Post by Brock Palen
Actually I had a much differnt results,
gromacs-3.3.1 one node dual core dual socket opt2218 openmpi-1.2.7
pgi/7.2
mpich2 gcc
For some reason, the difference in minutes didn't come through, it seems,
but I would guess that if it's a medium-large difference, then it has its
roots in PGI7.2 vs. GCC rather than MPICH2 vs. OpenMPI. Though, to be fair,
I find GCC vs. PGI (for C code) is often a toss-up - one may beat the other
handily on one code, and then lose just as badly on another.

I think my install of mpich2 may be bad, I have never installed it before,
Post by Brock Palen
only mpich1, OpenMPI and LAM. So take my mpich2 numbers with salt, Lots of
salt.
I think the biggest difference in performance with various MPICH2 installs
comes from differences in the 'channel' used. I tend to make sure that I
use the 'nemesis' channel, which may or may not be the default these days.
If not, though, most people would probably want it. I think it has issues
with threading (or did ages ago?), but I seem to recall it being
considerably faster than even the 'ssm' channel.

Sangamesh: My advice to you would be to recompile Gromacs and specify, in
the *Gromacs* compile / configure, the same CFLAGS you used with
MPICH2, e.g. "-O2 -m64", whatever. If you do that, I bet the times between
MPICH2 and OpenMPI will be pretty comparable for your benchmark case -
especially when run on a single processor.
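A sketch of what that might look like for a GROMACS 3.3 source tree (the
compiler path and flags are placeholders; use whatever the MPICH2 build
actually used):

$ export CC=/opt/ompi127/bin/mpicc      # assumes mpicc sits next to the mpirun used earlier
$ export CFLAGS="-O2 -m64"              # placeholder: match the MPICH2-side flags
$ ./configure --enable-mpi --program-suffix=_mpi
$ make && make install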

Cheers,
- Brian
Sangamesh B
2008-10-15 12:51:17 UTC
Permalink
Post by Brian Dobbins
Hi guys,
Post by Brock Palen
Actually I had a much differnt results,
gromacs-3.3.1 one node dual core dual socket opt2218 openmpi-1.2.7
pgi/7.2
mpich2 gcc
For some reason, the difference in minutes didn't come through, it
seems, but I would guess that if it's a medium-large difference, then it has
its roots in PGI7.2 vs. GCC rather than MPICH2 vs. OpenMPI. Though, to be
fair, I find GCC vs. PGI (for C code) is often a toss-up - one may beat the
other handily on one code, and then lose just as badly on another.
I think my install of mpich2 may be bad, I have never installed it before,
Post by Brock Palen
only mpich1, OpenMPI and LAM. So take my mpich2 numbers with salt, Lots of
salt.
I think the biggest difference in performance with various MPICH2 install
comes from differences in the 'channel' used.. I tend to make sure that I
use the 'nemesis' channel, which may or may not be the default these days.
If not, though, most people would probably want it. I think it has issues
with threading (or did ages ago?), but I seem to recall it being
considerably faster than even the 'ssm' channel.
Sangamesh: My advice to you would be to recompile Gromacs and specify,
in the *Gromacs* compile / configure, to use the same CFLAGS you used with
MPICH2. Eg, "-O2 -m64", whatever. If you do that, I bet the times between
MPICH2 and OpenMPI will be pretty comparable for your benchmark case -
especially when run on a single processor.
I reinstalled all the software with -O3 optimization. Following are the
performance numbers for a 4-process job on a single node:

MPICH2: 26 m 54 s
OpenMPI: 24 m 39 s

More details:

$ /home/san/PERF_TEST/mpich2/bin/mpich2version
MPICH2 Version: 1.0.7
MPICH2 Release date: Unknown, built on Mon Oct 13 18:02:13 IST 2008
MPICH2 Device: ch3:sock
MPICH2 configure: --prefix=/home/san/PERF_TEST/mpich2
MPICH2 CC: /usr/bin/gcc -O3 -O2
MPICH2 CXX: /usr/bin/g++ -O2
MPICH2 F77: /usr/bin/gfortran -O3 -O2
MPICH2 F90: /usr/bin/gfortran -O2


$ /home/san/PERF_TEST/openmpi/bin/ompi_info
Open MPI: 1.2.7
Open MPI SVN revision: r19401
Open RTE: 1.2.7
Open RTE SVN revision: r19401
OPAL: 1.2.7
OPAL SVN revision: r19401
Prefix: /home/san/PERF_TEST/openmpi
Configured architecture: x86_64-unknown-linux-gnu
Configured by: san
Configured on: Mon Oct 13 19:10:13 IST 2008
Configure host: locuzcluster.org
Built by: san
Built on: Mon Oct 13 19:18:25 IST 2008
Built host: locuzcluster.org
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: /usr/bin/gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: /usr/bin/g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: /usr/bin/gfortran
Fortran77 compiler abs: /usr/bin/gfortran
Fortran90 compiler: /usr/bin/gfortran
Fortran90 compiler abs: /usr/bin/gfortran
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: yes
mpirun default --prefix: no

Thanks,
Sangamesh
Post by Brian Dobbins
Cheers,
- Brian
Eugene Loh
2008-10-24 17:56:24 UTC
Permalink
Post by Sangamesh B
I reinstalled all the software with -O3 optimization. Following are the
performance numbers for a 4-process job on a single node:
MPICH2: 26 m 54 s
OpenMPI: 24 m 39 s
I'm not sure I'm following. OMPI is faster here, but is that a result
of MPICH2 slowing down? The original post at
http://www.open-mpi.org/community/lists/users/2008/10/6891.php had:

OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.

So, did MPICH2 slow down, or can one not compare these timings?
Sangamesh B
2008-10-25 07:03:07 UTC
Permalink
Post by Sangamesh B
I reinstalled all the software with -O3 optimization. Following are the
performance numbers for a 4-process job on a single node:
MPICH2: 26 m 54 s
OpenMPI: 24 m 39 s
I'm not sure I'm following. OMPI is faster here, but is that a result of
MPICH2 slowing down? The original post at
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
So, did MPICH2 slow down, or can one not compare these timings?
No. The initial benchmark was run on 2 nodes; the later one on a single node.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
This job was run with 8 processes, i.e. on 2 nodes.
MPICH2: 26 m 54 s
OpenMPI: 24 m 39 s
This job was run with 4 processes, i.e. on 1 node.
Sangamesh B
2008-10-25 07:08:57 UTC
Permalink
Post by Sangamesh B
Post by Sangamesh B
I reinstalled all the software with -O3 optimization. Following are the
performance numbers for a 4-process job on a single node:
MPICH2: 26 m 54 s
OpenMPI: 24 m 39 s
I'm not sure I'm following. OMPI is faster here, but is that a result of
MPICH2 slowing down? The original post at
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
So, did MPICH2 slow down, or can one not compare these timings?
No. Initial benchmark result was on 2 nodes. Later the benchmark was
done on only one node.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
Post by Sangamesh B
This job is run with 8 processes i.e. on 2 nodes.
MPICH2: 26 m 54 s
OpenMPI: 24 m 39 s
Post by Sangamesh B
This job is run with 4 processes i.e. on 1 node
Jeff Squyres
2008-10-09 14:13:28 UTC
Permalink
Post by Sangamesh B
OpenMPI : 120m 6s
MPICH2 : 67m 44s
That seems to indicate that something else is going on -- with -np 1,
there should be no MPI communication, right? I wonder if the memory
allocator performance is coming into play here.

Try re-configuring/re-building Open MPI with --without-memory-manager.
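In case it helps, a sketch of that rebuild (the prefix is the one used
earlier in this thread):

$ ./configure --prefix=/opt/ompi127 --without-memory-manager
$ make all install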
--
Jeff Squyres
Cisco Systems
Brian Dobbins
2008-10-09 14:24:01 UTC
Permalink
Post by Sangamesh B
OpenMPI : 120m 6s
MPICH2 : 67m 44s
That seems to indicate that something else is going on -- with -np 1, there
should be no MPI communication, right? I wonder if the memory allocator
performance is coming into play here.
I'd be more inclined to double-check how the Gromacs app is being compiled
in the first place - I wouldn't think the OpenMPI memory allocator would
make anywhere near that much difference. Sangamesh, do you know what
command line was used to compile both of these? Someone correct me if I'm
wrong, but *if* MPICH2 embeds optimization flags in the 'mpicc' command and
OpenMPI does not, then if he's not specifying any optimization flags in the
compilation of Gromacs, MPICH2 will pass its embedded ones on to the Gromacs
compile and be faster. I'm rusty on my GCC, too, though - does it default
to an O2 level, or does it default to no optimizations?

Since the benchmark is readily available, I'll try running it later
today.. didn't get a chance last night.

Cheers,
- Brian
Terry Frankcombe
2008-10-09 16:01:52 UTC
Permalink
I'm rusty on my GCC, too, though - does it default to an O2
level, or does it default to no optimizations?
Default gcc is indeed no optimisation. gcc seems to like making users
type really long complicated command lines even more than OpenMPI does.

(Yes yes, I know! Don't tell me!)
Sangamesh B
2008-10-09 11:48:29 UTC
Permalink
Post by Brian Dobbins
Hi guys,
[From Eugene Loh:]
Post by Sangamesh B
OpenMPI - 25 m 39 s.
Post by Sangamesh B
MPICH2 - 15 m 53 s.
With regards to your issue, do you have any indication when you get that
25m39s timing if there is a grotesque amount of time being spent in MPI
calls? Or, is the slowdown due to non-MPI portions?
Just to add my two cents: if this job *can* be run on less than 8
processors (ideally, even on just 1), then I'd recommend doing so. That is,
run it with OpenMPI and with MPICH2 on 1, 2 and 4 processors as well. If
the single-processor jobs still give vastly different timings, then perhaps
Eugene is on the right track and it comes down to various computational
optimizations and not so much the message-passing that's make a difference.
Timings from 2 and 4 process runs might be interesting as well to see how
this difference changes with process counts.
I've seen differences between various MPI libraries before, but nothing
quite this severe either. If I get the time, maybe I'll try to set up
Gromacs tonight -- I've got both MPICH2 and OpenMPI installed here and can
try to duplicate the runs. Sangamesh, is this a standard benchmark case
that anyone can download and run?
Yes.
ftp://ftp.gromacs.org/pub/benchmarks/gmxbench-3.0.tar.gz
Post by Brian Dobbins
Cheers,
- Brian
Brian Dobbins
Yale Engineering HPC
Eugene Loh
2008-10-08 23:00:39 UTC
Permalink
Post by Eugene Loh
Post by Sangamesh B
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
I don't understand MPICH very well, but it seemed as though some of
the flags used in building MPICH are supposed to be added in
automatically to the mpicc/etc compiler wrappers.
Again, this may not apply to your case, but I found out some more
details on my theory.

If you build MPICH2 like this:

% configure CFLAGS=-O2
% make

then when you use "mpicc" to build your application, you automatically
get that optimization flag built in.

What had confused me was that I tried confirming the theory by building
MPICH2 like this:

% configure --enable-fast
% make

That does *NOT* up the mpicc optimization level (despite their
documentation).
George Bosilca
2008-10-08 21:24:30 UTC
Permalink
One thing to look for is the process distribution. Based on the
application's communication pattern, the process distribution can have a
tremendous impact on the execution time. Imagine that the application
splits the processes into two equal groups based on rank and only
communicates within each group. If such a group ends up on the same node,
it will use sm for communication. Conversely, if the processes end up
spread across the nodes, they will use TCP (which obviously has higher
latency and lower bandwidth) and the overall performance will be greatly
impacted.

By default, Open MPI uses the following strategy to distribute
processes: if a node has several processors, then consecutive ranks
will be started on the same node. As an example, in your case (2 nodes
with 4 processors each), ranks 0-3 will be started on the first
host, and ranks 4-7 on the second one. I don't know what the
default distribution for MPICH2 is ...

Anyway, there is an easy way to check whether the process distribution is
the root of your problem. Please execute your application twice, once
passing mpirun the --bynode argument, and once with --byslot.
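Concretely, something like this (a sketch, reusing the machinefile and
binary from earlier in the thread):

$ mpirun --bynode -machinefile ./mach -np 8 /opt/apps/gromacs333_ompi/bin/mdrun_mpi   # ranks round-robin across nodes
$ mpirun --byslot -machinefile ./mach -np 8 /opt/apps/gromacs333_ompi/bin/mdrun_mpi   # consecutive ranks fill each node first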

george.
Post by Sangamesh B
Hi All,
I wanted to switch from mpich2/mvapich2 to OpenMPI, as
OpenMPI supports both ethernet and infiniband. Before doing that I
tested an application 'GROMACS' to compare the performance of MPICH2
& OpenMPI. Both have been compiled with GNU compilers.
After this benchmark, I came to know that OpenMPI is slower than MPICH2.
This benchmark is run on a AMD dual core, dual opteron processor.
Both have compiled with default configurations.
The job is run on 2 nodes - 8 cores.
OpenMPI - 25 m 39 s.
MPICH2 - 15 m 53 s.
Any comments ..?
Thanks,
Sangamesh
Anthony Chan
2008-10-09 16:28:42 UTC
Permalink
Post by Sangamesh B
OpenMPI : 120m 6s
MPICH2 : 67m 44s
That seems to indicate that something else is going on -- with -np 1,
there should be no MPI communication, right? I wonder if the memory
allocator performance is coming into play here.
If the app sends messages to its own rank, they could still go through the MPI
stack even with -np 1, i.e. it involves at least one memcpy() for point-to-point calls.
Post by Sangamesh B
I'd be more inclined to double-check how the Gromacs app is being
compiled in the first place - I wouldn't think the OpenMPI memory
allocator would make anywhere near that much difference. Sangamesh, do
you know what command line was used to compile both of these? Someone
correct me if I'm wrong, but if MPICH2 embeds optimization flags in
the 'mpicc' command and OpenMPI does not, then if he's not specifying
any optimization flags in the compilation of Gromacs, MPICH2 will pass
its embedded ones on to the Gromacs compile and be faster. I'm rusty
on my GCC, too, though - does it default to an O2 level, or does it
default to no optimizations?
MPICH2 does pass the CFLAGS specified at configure time to mpicc and friends.
If users don't want CFLAGS to be passed to mpicc, they should set
MPICH2LIB_CFLAGS instead. The reason for passing CFLAGS to mpicc
is that CFLAGS may contain a flag like -m64 or -m32, which is needed in
mpicc to make sure object files are compatible with the MPICH2 libraries.

I assume the default installation here means no CFLAGS was specified; in that
case MPICH2's mpicc will not contain any optimization flag (this is true
in 1.0.7 or later; earlier versions of MPICH2 had various inconsistent
ways of handling compiler flags between compiling the libraries and those
used in the compiler wrappers). If Gromacs is compiled with mpicc,
"mpicc -show -c" will show whether any optimization flag is used. Without "-c",
-show alone displays the link command. To check what flags the MPICH2 libraries
were compiled with, use $bindir/mpich2version.
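For example (a sketch, assuming mpicc lives under the same prefix as the
mpich2version binary shown elsewhere in this thread):

$ /home/san/PERF_TEST/mpich2/bin/mpicc -show -c foo.c   # compile line the wrapper would run
$ /home/san/PERF_TEST/mpich2/bin/mpich2version          # flags the MPICH2 libraries were built with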

If I recall correctly, gcc defaults to "-g -O2". Not sure if the newer version
of gcc changes that.

A.Chan
Post by Sangamesh B
Since the benchmark is readily available, I'll try running it later
today.. didn't get a chance last night.
Cheers,
- Brian
Rajeev Thakur
2008-10-15 16:20:20 UTC
Permalink
For MPICH2 1.0.7, configure with --with-device=ch3:nemesis. That will use
shared memory within a node, unlike ch3:sock, which uses TCP. Nemesis is the
default in 1.1a1.
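A sketch of that reconfigure, run from the MPICH2 source directory and
reusing the prefix and -O3 flag from the build quoted below:

$ ./configure --prefix=/home/san/PERF_TEST/mpich2 --with-device=ch3:nemesis CFLAGS=-O3
$ make && make install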

Rajeev
Date: Wed, 15 Oct 2008 18:21:17 +0530
Subject: Re: [OMPI users] Performance: MPICH2 vs OpenMPI
Content-Type: text/plain; charset="iso-8859-1"
On Fri, Oct 10, 2008 at 10:40 PM, Brian Dobbins
Post by Brian Dobbins
Hi guys,
On Fri, Oct 10, 2008 at 12:57 PM, Brock Palen
Post by Brock Palen
Actually I had much different results.
gromacs-3.3.1 one node dual core dual socket opt2218 openmpi-1.2.7
pgi/7.2
mpich2 gcc
For some reason, the difference in minutes didn't come through, it
seems, but I would guess that if it's a medium-large difference, then it has
its roots in PGI 7.2 vs. GCC rather than MPICH2 vs. OpenMPI. Though, to be
fair, I find GCC vs. PGI (for C code) is often a toss-up - one may beat the
other handily on one code, and then lose just as badly on another.
Post by Brock Palen
I think my install of mpich2 may be bad, I have never installed it before,
only mpich1, OpenMPI and LAM. So take my mpich2 numbers with salt, Lots of
salt.
I think the biggest difference in performance with various MPICH2 installs
comes from differences in the 'channel' used. I tend to make sure that I
use the 'nemesis' channel, which may or may not be the default these days.
If not, though, most people would probably want it. I think it has issues
with threading (or did ages ago?), but I seem to recall it being
considerably faster than even the 'ssm' channel.
Sangamesh: My advice to you would be to recompile Gromacs and specify, in
the *Gromacs* compile / configure, the same CFLAGS you used with
MPICH2, e.g. "-O2 -m64", whatever. If you do that, I bet the times between
MPICH2 and OpenMPI will be pretty comparable for your benchmark case -
especially when run on a single processor.
I reinstalled all the software with -O3 optimization. Following are the
performance numbers for a 4-process job on a single node:
MPICH2: 26 m 54 s
OpenMPI: 24 m 39 s
$ /home/san/PERF_TEST/mpich2/bin/mpich2version
MPICH2 Version: 1.0.7
MPICH2 Release date: Unknown, built on Mon Oct 13 18:02:13 IST 2008
MPICH2 Device: ch3:sock
MPICH2 configure: --prefix=/home/san/PERF_TEST/mpich2
MPICH2 CC: /usr/bin/gcc -O3 -O2
MPICH2 CXX: /usr/bin/g++ -O2
MPICH2 F77: /usr/bin/gfortran -O3 -O2
MPICH2 F90: /usr/bin/gfortran -O2
$ /home/san/PERF_TEST/openmpi/bin/ompi_info
Open MPI: 1.2.7
Open MPI SVN revision: r19401
Open RTE: 1.2.7
Open RTE SVN revision: r19401
OPAL: 1.2.7
OPAL SVN revision: r19401
Prefix: /home/san/PERF_TEST/openmpi
Configured architecture: x86_64-unknown-linux-gnu
Configured by: san
Configured on: Mon Oct 13 19:10:13 IST 2008
Configure host: locuzcluster.org
Built by: san
Built on: Mon Oct 13 19:18:25 IST 2008
Built host: locuzcluster.org
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: /usr/bin/gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: /usr/bin/g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: /usr/bin/gfortran
Fortran77 compiler abs: /usr/bin/gfortran
Fortran90 compiler: /usr/bin/gfortran
Fortran90 compiler abs: /usr/bin/gfortran
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: yes
mpirun default --prefix: no
Thanks,
Sangamesh