Discussion:
[OMPI users] Tuning vader for MPI_Wait Halt?
Matt Thompson
2017-06-05 16:51:36 UTC
OMPI Users,

I was wondering if there is a good way to "tune" vader to get around an
intermittent MPI_Wait hang?

I ask because I recently found that, with Open MPI 2.1.x on either my
desktop or the supercomputer I have access to, the model seems to
"deadlock" at an MPI_Wait call whenever vader is enabled. If I run as:

mpirun --mca btl self,sm,tcp

on my desktop it works. When I moved to my cluster, I tried the more
generic:

mpirun --mca btl ^vader

since the cluster uses openib, and with that things work. (I hope that's the
right way to turn off vader in MCA speak.) Note: the deadlock is a bit
sporadic, but I now have a case that seems to trigger it reproducibly.
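
For reference, these are the ways I understand one can set that parameter
(standard Open MPI MCA mechanisms, as far as I know): on the mpirun command
line as above, through an environment variable, or persistently in an MCA
parameter file:

  export OMPI_MCA_btl=^vader
  echo "btl = ^vader" >> $HOME/.openmpi/mca-params.conf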

Now, I know vader is supposed to be the "better" shared-memory communication
tech, so I'd rather use it, and I thought maybe I could twiddle some tuning
knobs. So I looked at:

https://www.open-mpi.org/faq/?category=sm

and there I saw question 6 "How do I know what MCA parameters are available
for tuning MPI performance?". But when I try the commands listed (minus the
HTML/CSS tags):

(1081) $ ompi_info --param btl sm
MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.0)
(1082) $ ompi_info --param mpool sm
(1083) $

Huh. I expected more, but searching around the Open MPI FAQs made me think
I should use:

ompi_info --param btl sm --level 9

which does spit out a lot, though the equivalent for mpool sm does not.

Any ideas on which of the many knobs is best to try turning? Something
whose default, perhaps, is one thing for sm but different for vader? I
also tried "ompi_info --param btl vader --level 9", but it doesn't output
anything.
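
In case it helps anyone suggest something, the experiments I had in mind were
to crank up the BTL verbosity and to try changing vader's single-copy
mechanism. The parameter names below came from newer vader documentation, so
treat them as guesses to be checked against your own ompi_info output:

  ompi_info --all | grep -i vader
  mpirun --mca btl_base_verbose 100 --mca btl self,vader,tcp
  mpirun --mca btl_vader_single_copy_mechanism none --mca btl self,vader,tcp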

I will note that this code runs just fine with Open MPI 2.0.2 as well as
with Intel MPI and SGI MPT, so I'm thinking the code itself is okay and
something changed between Open MPI 2.0.x and 2.1.x. I see two entries in
the Open MPI 2.1.0 announcement about vader, but nothing specific about
how to "revert" them if they are even causing the problem:

- Fix regression that lowered the memory maximum message bandwidth for
large messages on some BTL network transports, such as openib, sm,
and vader.


- The vader BTL is now more efficient in terms of memory usage when
using XPMEM.


Thanks for any help,
Matt
--
Matt Thompson

Man Among Men
Fulcrum of History
Nathan Hjelm
2017-06-05 17:00:52 UTC
Can you provide a reproducer for the hang? What kernel version are you using? Is xpmem installed?

-Nathan

Matt Thompson
2017-06-05 18:20:52 UTC
Nathan,

Sadly, I'm not sure I can provide a reproducer, as it's currently our full
earth system model and is accessing terabytes of background files, etc.
That said, I'll work on it. I have a tiny version of the model, but that
almost always works everywhere (and I can only reproduce the issue at a
rather high resolution).

We do have a couple of code testers that duplicate the functionality around
that MPI_Wait call, but (and this is the fun part) the hang seems to involve
a very specific use of that call (only when doing a daily time-averaged
collection!). Still, I'll try running those testers with Open MPI 2.1.0.
Maybe they'll hang!

As for the kernel, my desktop is on 3.10.0-514.16.1.el7.x86_64 (RHEL 7) and
the cluster compute nodes are on 3.0.101-0.47.90-default (SLES11 SP3). If I
run 'lsmod' I see xpmem on the cluster, but my desktop does not have it. So
perhaps this is not XPMEM related?
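
For completeness, this is how I checked for the kernel-assisted copy modules
on each machine (my understanding is that vader can also use CMA or knem when
xpmem is absent, so I have not ruled those out either):

  uname -r
  lsmod | grep -E 'xpmem|knem'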

Matt
--
Matt Thompson

Man Among Men
Fulcrum of History
Nathan Hjelm
2017-06-07 17:41:13 UTC
What might also be helpful is the output of the Stack Trace Analysis Tool (https://github.com/LLNL/STAT) when used with a debug build of Open MPI (configured with --enable-debug). Once it is built, run STAT against mpirun with the -ci option and send me the .dot file it produces.
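
Something along these lines for the debug build (adjust the prefix for your
installation; the path here is just a placeholder):

  ./configure --prefix=<install-dir> --enable-debug
  make all install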

-Nathan
