Discussion: [OMPI users] openmpi single node jobs using btl openib
Jingchao Zhang
2017-02-06 20:38:13 UTC
Hi,


We recently noticed openmpi is using btl openib instead of self,sm for single-node jobs, which has caused performance degradation for some applications, e.g. 'cp2k'. For openmpi version 2.0.1, our tests show a single-node 'cp2k' job using openib is ~25% slower than one using self,sm. We advise users to add '--mca btl_base_exclude openib' as a temporary fix. I need to point out that not all applications are affected; many of them have the same single-node performance with or without openib. Why doesn't openmpi use self,sm by default for single-node jobs? Is this the intended behavior?
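
For reference, the temporary workaround we give users looks like the following (just a sketch; the rank count and the cp2k input file name are placeholders):

# work around the slowdown by excluding the openib btl for a single-node run
mpirun --mca btl_base_exclude openib -np 16 cp2k.popt -i input.inp
# the same parameter can also be set through the environment
export OMPI_MCA_btl_base_exclude=openib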


Thanks,

Jingchao
Tobias Kloeffel
2017-02-07 08:54:46 UTC
Hello Jingchao,
try to use -mca mpi_leave_pinned 0, also for multinode jobs.

kind regards,
Tobias Klöffel

--
M.Sc. Tobias Klöffel
=======================================================
Interdisciplinary Center for Molecular Materials (ICMM)
and Computer-Chemistry-Center (CCC)
Department Chemie und Pharmazie
Friedrich-Alexander-Universität Erlangen-Nürnberg
Nägelsbachstr. 25
D-91052 Erlangen, Germany

Room: 2.305
Phone: +49 (0) 9131 / 85 - 20423
Fax: +49 (0) 9131 / 85 - 26565

=======================================================

E-mail: ***@fau.de
Jingchao Zhang
2017-02-07 20:07:23 UTC
Hi Tobias,


Thanks for the reply. I tried both "export OMPI_MCA_mpi_leave_pinned=0" and "mpirun -mca mpi_leave_pinned 0" but still got the same behavior. Our OpenMPI version is 2.0.1. Repo version is v2.0.0-257-gee86e07. We have Intel Qlogic and OPA networks on the same cluster.


Below are our configuration flags:

./configure --prefix=$PREFIX \
--with-hwloc=internal \
--enable-mpirun-prefix-by-default \
--with-slurm \
--with-verbs \
--with-psm \
--with-psm2 \
--disable-openib-connectx-xrc \
--with-knem=/opt/knem-1.1.2.90mlnx1 \
--with-cma


So the question remains: why does OpenMPI choose openib over self,sm for single-node jobs? Isn't there a mechanism to differentiate btl networks for single-node vs. multi-node jobs?
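
One thing we could do is pin the transports explicitly for jobs we know are single-node, along these lines (just a sketch; the explicit btl list is our own guess at the right setting, not an ompi default):

# force shared-memory-only transports for a known single-node run
mpirun --mca btl self,vader -np 16 cp2k.popt -i input.inp
# or, if I have the path right, set it per-user in an MCA parameter file
echo "btl = self,vader" >> $HOME/.openmpi/mca-params.conf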


Thanks,

Jingchao

Jeff Squyres (jsquyres)
2017-02-07 20:14:40 UTC
Can you try upgrading to Open MPI v2.0.2? We just released that last week with a bunch of bug fixes.


--
Jeff Squyres
***@cisco.com
Jingchao Zhang
2017-02-07 21:50:55 UTC
Hi Jeff,


I just installed Open MPI: 2.0.2 (repo revision: v2.0.1-348-ge291d0e; release date: Jan 31, 2017) but have the same problem.


Attached please find two gdb backtraces captured on writes to a file descriptor returned from opening /dev/infiniband/uverbs in the cp2k.popt process.


Thanks,

Jingchao

Gilles Gouaillardet
2017-02-08 00:37:21 UTC
Hi,


there are several uncommon things happening here:

- btl/vader has a higher exclusivity than btl/sm, so bottom line, vader should be used instead of sm

- is your interconnect infiniband or qlogic? infiniband uses pml/ob1 and btl/openib for inter-node communication, whereas qlogic uses pml/cm and mtl/psm.

- does your program involve MPI_Comm_spawn? note that neither btl/vader nor btl/sm can be used for inter-job communications (e.g. between the main task and a spawned task), so btl/openib would be used even for intra-node communications.


can you please run again your app with

mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca mtl_base_verbose 10 ...
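
for example (the rank count, binary name, and log file below are placeholders, please adapt them to your job):

mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca mtl_base_verbose 10 -np 16 cp2k.popt -i input.inp 2>&1 | tee ompi_verbose.log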


Cheers,


Gilles


Jingchao Zhang
2017-02-08 21:39:05 UTC
Hi Gilles,


- btl/vader has a higher exclusivity than btl/sm, so bottom line, vader should be used instead of sm

Yes, I believe vader is being used in our case. I should have said self,vader in my previous posts.


- is your interconnect infiniband or qlogic? infiniband uses pml/ob1 and btl/openib for inter-node communication, whereas qlogic uses pml/cm and mtl/psm.

We have two fabrics in the cluster.  The older/original nodes are
interconnected on a Qlogic/Intel Truescale fabric.  We've expanded the cluster
with newer nodes that are interconnected with Intel Omni-path architecture
fabric.  The nodes can present to user-space the faster/preferred PSM/PSM2
and the IB/verbs interfaces.  We build ompi such that it has the freedom
to use the best transport, i.e. our build options: --with-verbs --with-psm
--with-psm2.  Perhaps we should stop building with verbs, but it is our
understanding that ompi should select the best transport at initialization.
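
If we do end up dropping verbs, the build would look roughly like this (a sketch only, keeping our other current options and also dropping --disable-openib-connectx-xrc since it should only matter for the openib btl):

./configure --prefix=$PREFIX \
--with-hwloc=internal \
--enable-mpirun-prefix-by-default \
--with-slurm \
--with-psm \
--with-psm2 \
--with-knem=/opt/knem-1.1.2.90mlnx1 \
--with-cma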


- does your program involve MPI_Comm_spawn? note that neither btl/vader nor btl/sm can be used for inter-job communications (e.g. between the main task and a spawned task), so btl/openib would be used even for intra-node communications.

From what we checked in the cp2k source code, it doesn't have any 'MPI_Comm_spawn'. If we disable the use of openib for the single-node run, it performs much better / as expected, so what is being used for inter-job communication?  Is there a flag that outputs the rank-to-rank ompi transport selection?  That would be very handy to have if it doesn't exist.


can you please run again your app with mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 --mca mtl_base_verbose 10 ...

Please see attached for the verbose results.


Thanks,


Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400
Cabral, Matias A
2017-02-08 22:45:59 UTC
Hi Jingchao,

The log shows the psm mtl is being selected.


[c1725.crane.hcc.unl.edu:187002] select: init returned priority 20
[c1725.crane.hcc.unl.edu:187002] selected cm best priority 30
[c1725.crane.hcc.unl.edu:187002] select: component ob1 not selected / finalized
[c1725.crane.hcc.unl.edu:187002] select: component cm selected


[c1725.crane.hcc.unl.edu:187002] mca:base:select: Auto-selecting mtl components
[c1725.crane.hcc.unl.edu:187002] mca:base:select:( mtl) Querying component [psm]
[c1725.crane.hcc.unl.edu:187002] mca:base:select:( mtl) Query of component [psm] set priority to 30
[c1725.crane.hcc.unl.edu:187002] mca:base:select:( mtl) Selected component [psm]
[c1725.crane.hcc.unl.edu:187002] select: initializing mtl component psm



So, I suspect you may be seeing the same issues that Hristo also reported in “the other email thread” when running CP2K and using the PSM shm device. Therefore, if you are running on a single node, you may ask for the vader btl to be selected: mpirun -mca pml ob1 -mca btl vader
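
Something along these lines should do it (just a sketch; I added the self btl, which is typically needed alongside vader, and the rank count and binary are placeholders):

mpirun -mca pml ob1 -mca btl self,vader -np 16 cp2k.popt -i input.inp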

Thanks,

_MAC

Nathan Hjelm
2017-02-08 01:07:51 UTC
That backtrace shows we are registering MPI_Alloc_mem memory with verbs. This is expected behavior, but it doesn't show the openib btl being used for any communication. I am looking into an issue on an OmniPath system where just initializing the openib btl causes performance problems even if it is never used. It would be best to open an issue on this.

-Nathan
