William Hay
2018-01-22 10:36:17 UTC
We have a couple of clusters with QLogic InfiniPath/Intel TrueScale
networking. While testing a kernel upgrade we found that the TrueScale
drivers will no longer build against recent RHEL kernels. Intel tells
us that the OmniPath drivers will work for TrueScale adapters, so we
installed those. Basic functionality appears fine; however, we are having
trouble getting OpenMPI to work.
Using our existing builds of OpenMPI 1.10, jobs receive lots of signal
11 and crash (output attached).
If we modify LD_LIBRARY_PATH to point to the directory containing the
compatibility library provided as part of the OmniPath drivers, it instead
produces complaints about not finding /dev/hfi1_0, which exists on our
cluster with actual OmniPath but not on the clusters with TrueScale
(output also attached).
We had a similar issue with Intel MPI, but there it was possible to get
it to work by passing a -psm option to mpirun. That, combined with the
mention of PSM2 in the output when complaining about /dev/hfi1_0, makes
us think OpenMPI is trying to run with PSM2 rather than the original
PSM and failing because PSM2 isn't supported by TrueScale.
We hoped that there would be an MCA parameter or combination of parameters
that would resolve this issue, but while Googling has turned up a few
things that look like they would force the use of PSM over PSM2, none of
them seems to make a difference.
Any suggestions?
William