Discussion:
[OMPI users] OpenMPI with PSM on True Scale with OmniPath drivers
William Hay
2018-01-22 10:36:17 UTC
We have a couple of clusters with QLogic InfiniPath/Intel TrueScale
networking. While testing a kernel upgrade, we found that the TrueScale
drivers will no longer build against recent RHEL kernels. Intel tells
us that the OmniPath drivers will work for TrueScale adapters, so we
installed those. Basic functionality appears fine; however, we are having
trouble getting OpenMPI to work.

Using our existing builds of OpenMPI 1.10, jobs receive lots of signal
11 (SIGSEGV) and crash (output attached).

If we modify LD_LIBRARY_PATH to point to the directory containing the
compatibility library provided as part of the OmniPath drivers, it instead
produces complaints about not finding /dev/hfi1_0, which exists on our
cluster with actual OmniPath but not on the clusters with TrueScale
(output also attached).

We had a similar issue with Intel MPI, but there it was possible to get
it to work by passing a -psm option to mpirun. That, combined with the
mention of PSM2 in the output when complaining about /dev/hfi1_0, makes
us think OpenMPI is trying to run with PSM2 rather than the original
PSM and failing because PSM2 isn't supported by TrueScale.
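
One way to sanity-check that hypothesis on a given node is to look at which device nodes it exposes. The following is only a sketch: the hfi1_N name comes from the error message above, while the ipath* name for TrueScale device nodes is an assumption that should be verified locally, and guess_mtl is a hypothetical helper, not part of Open MPI.

```shell
#!/bin/sh
# Guess which MTL fits the local hardware from the device nodes present.
# ASSUMPTION: Omni-Path (hfi1) nodes appear as hfi1_N, as in the error above;
# the ipath* pattern for TrueScale nodes is an assumption to verify locally.
guess_mtl() {
    devdir="${1:-/dev}"   # directory to inspect; defaults to /dev
    if ls "$devdir"/hfi1_* >/dev/null 2>&1; then
        echo psm2
    elif ls "$devdir"/ipath* >/dev/null 2>&1; then
        echo psm
    else
        echo unknown
    fi
}

guess_mtl
```

If this prints psm on the TrueScale clusters while Open MPI still asks for /dev/hfi1_0, that would be consistent with the PSM2 path being picked up by mistake.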

We hoped that there would be an MCA parameter or combination of parameters
that would resolve this issue, but while Googling has turned up a few
things that look like they would force the use of PSM over PSM2, none of
them seem to make a difference.

Any suggestions?

William
Gilles Gouaillardet
2018-01-22 11:30:45 UTC
William,

To force PSM (aka InfiniPath), you can run

mpirun --mca pml cm --mca mtl psm ...

(Replace psm with psm2 for PSM2, aka Omni-Path.)

You can also

mpirun --mca pml_base_verbose 10 --mca mtl_base_verbose 10 ...

in order to collect some logs.

Bottom line: pml/cm should be selected (instead of pml/ob1), together with the appropriate MTL (psm for True Scale, psm2 for Omni-Path).
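
The same selection can also be made through the environment, which is convenient in batch scripts. This is a minimal sketch rather than a tested recipe for this cluster: Open MPI reads MCA parameters from variables named OMPI_MCA_<parameter>, while ./your_app and the process count in the commented launch line are placeholders.

```shell
#!/bin/sh
# Same selection as the mpirun flags above, expressed as environment variables.
# Open MPI reads any MCA parameter from OMPI_MCA_<name>.
export OMPI_MCA_pml=cm
export OMPI_MCA_mtl=psm              # use psm2 on real Omni-Path hardware
export OMPI_MCA_pml_base_verbose=10
export OMPI_MCA_mtl_base_verbose=10

# List the MTL components this build actually contains, if ompi_info is on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i 'mca mtl' || echo "no MTL components reported"
fi

# Hypothetical launch; ./your_app and the process count are placeholders.
# mpirun -np 2 ./your_app
```

If the psm component does not show up in the ompi_info listing, forcing it with these parameters cannot work and a rebuild is the next step.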


On top of that, you might need to rebuild Open MPI if a user-level library has changed.

Note that Open MPI 1.10 is now legacy, and I strongly encourage you to upgrade to 2.1.x or 3.0.x.


Cheers,

Gilles
Cabral, Matias A
2018-01-22 17:45:04 UTC
Hi William,

A couple of other questions:
- Please share what your OMPI configure line looks like.
- Please clarify which compat library or libraries you are referring to. Some of them are actually for the opposite case: making True Scale apps/libs run on Omni-Path.
- As Gilles mentions, moving to a newer major OMPI version is advisable. If this is not possible, move to 1.10.7, which has many updates over 1.10.1.
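
The information asked for above could be gathered roughly as follows. This is a sketch under assumptions: the MTL_SO path is a placeholder for wherever the installation keeps its MCA components, psm_libs is a hypothetical helper, and the "Configure command line" field only appears if the installed ompi_info prints it.

```shell
#!/bin/sh
# Filter ldd-style output down to the PSM libraries a binary resolves to.
psm_libs() {
    grep -E 'libpsm' || true
}

# Version and configure line, if ompi_info is on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i -E 'open mpi:|configure command line' || true
fi

# MTL_SO is a placeholder path; point it at the PSM MTL component of your build.
MTL_SO="${MTL_SO:-/path/to/openmpi/lib/openmpi/mca_mtl_psm.so}"
if [ -e "$MTL_SO" ]; then
    # Shows which PSM library the component picks up under the current
    # LD_LIBRARY_PATH (for example, a compat library versus the original one).
    ldd "$MTL_SO" | psm_libs
fi
```

Running it once with and once without the modified LD_LIBRARY_PATH should show whether the compat library is the one being loaded.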

Thanks,

_MAC

