Discussion:
[OMPI users] OpenMPI + InfiniBand
Sergei Hrushev
2016-12-23 05:16:36 UTC
Permalink
Hi All !

As there have been no positive changes with the "UDSM + IPoIB" problem since
my previous post, we installed IPoIB on the cluster, and the "No OpenFabrics
connection..." error no longer appears.
But now Open MPI reports another problem:

In app ERROR OUTPUT stream:

[node2:14142] [[37935,0],0] ORTE_ERROR_LOG: Data unpack had inadequate
space in file base/plm_base_launch_support.c at line 1035

In app OUTPUT stream:

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------

When I run the task on a single node, everything works properly.
But when I specify "run on 2 nodes", the problem appears.

I tried pinging the IPoIB addresses: all hosts are resolved properly, and
ping requests and replies go over IB without any problems.
So all nodes (including the head node) can see each other via IPoIB.
But the MPI app still fails.

The same test task works perfectly on all nodes when run with the Ethernet
transport instead of InfiniBand.

P.S. We use the Torque resource manager to enqueue MPI tasks.

Best regards,
Sergei.
g***@rist.or.jp
2016-12-23 09:58:21 UTC
Permalink
Serguei,

this looks like a very different issue: orted cannot be remotely started.

that typically occurs when orted cannot find some of its dependencies

(the Open MPI libs and/or the compiler runtime)

for example, from a node, `ssh <other node> orted` should not fail because
of unresolved dependencies.
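
a sketch of that check, run from the head node ("node2" is a placeholder
hostname, and the local `ls` stand-in at the end is only there because
orted's actual path depends on your install):

```shell
# Non-interactive ssh does not source your interactive profile, so orted
# must resolve on the remote default PATH. "node2" is a placeholder host:
#
#   ssh node2 which orted
#   ssh node2 'ldd $(which orted)' | grep 'not found'
#
# Any "not found" line is an unresolved shared-library dependency.
# The same ldd check, demonstrated locally on a binary that certainly
# exists (ls stands in for orted here):
missing=$(ldd "$(which ls)" | grep 'not found' || true)
[ -z "$missing" ] && echo "no missing libraries"
```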

a simple trick is to replace

mpirun ...

with

`which mpirun` ...

a better option (as long as you do not plan to relocate Open MPI install
dir) is to configure with

--enable-mpirun-prefix-by-default
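
the `which mpirun` trick works because the backticks expand, on the head
node, to an absolute path; when mpirun is invoked by absolute path it can
infer its install prefix and set PATH and LD_LIBRARY_PATH accordingly when
starting orted on the remote nodes. a tiny demonstration of the expansion
itself (`ls` stands in for mpirun, since your actual path is unknown here):

```shell
# Backticks run `which ls` locally and substitute its output, so the
# command would be invoked by absolute path rather than by bare name:
resolved=`which ls`
case "$resolved" in
  /*) echo "absolute path: install prefix can be inferred" ;;
  *)  echo "bare name: remote nodes fall back to their own PATH" ;;
esac
```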

Cheers,

Gilles

r***@open-mpi.org
2016-12-23 15:41:08 UTC
Permalink
Also check to ensure you are using the same version of OMPI on all nodes - this message usually means that a different version was used on at least one node.
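
A quick way to audit this across the cluster is something like
`for h in node1 node2; do ssh $h 'mpirun --version'; done` (hostnames are
placeholders) and comparing the first line of each. A minimal local stand-in
for the comparison, with hypothetical version strings:

```shell
# "Data unpack had inadequate space" during launch is a classic symptom of
# mixed Open MPI versions: daemons from different releases cannot decode
# each other's launch data. Hypothetical mpirun --version first lines:
v_head="mpirun (Open MPI) 2.0.1"
v_node2="mpirun (Open MPI) 1.10.4"
if [ "$v_head" = "$v_node2" ]; then
  echo "versions match"
else
  echo "version mismatch: daemons cannot unpack each other's launch data"
fi
```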
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Sergei Hrushev
2016-12-26 10:10:17 UTC
Permalink
Hi Gilles!
Post by g***@rist.or.jp
this looks like a very different issue, orted cannot be remotely started.
...
a better option (as long as you do not plan to relocate Open MPI install
dir) is to configure with
--enable-mpirun-prefix-by-default
Yes, it was a problem with orted.
I checked the PATH and LD_LIBRARY_PATH variables and both were set, but that
was not enough!

So I added --enable-mpirun-prefix-by-default to configure, and now the
recompiled version works properly even when --prefix isn't specified.

When the Ethernet transport is used, everything works both with and without
--enable-mpirun-prefix-by-default.
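
For reference, the configure invocation was along these lines (all paths
here are hypothetical; --with-tm points at the Torque install so the tm
launcher gets built):

```shell
./configure --prefix=/opt/openmpi \
            --enable-mpirun-prefix-by-default \
            --with-tm=/opt/torque
make -j8 && make install
```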

Thank you!

Best regards,
Sergei.
g***@rist.or.jp
2016-12-26 11:01:29 UTC
Permalink
Sergei,

thanks for confirming you are now able to use Open MPI

fwiw, orted is remotely started by the selected plm component.

it can be ssh if you run without a batch manager, the tm interface under
PBS/Torque, srun under SLURM, and so on.

that should explain why exporting PATH and LD_LIBRARY_PATH is not enough
in your environment, not to mention your .bashrc or equivalent might
reset/unset them.
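
the non-interactive point is easy to see: a variable exported in your
interactive shell simply is not there in a scrubbed environment such as
the one a launcher may provide. a minimal illustration (the install prefix
is hypothetical):

```shell
# Export in the current (interactive) shell -- hypothetical install prefix:
export LD_LIBRARY_PATH=/opt/openmpi/lib
# A child started with a scrubbed environment, standing in for the
# non-interactive context a plm component may launch orted in:
child_view=$(env -i sh -c 'echo "LD_LIBRARY_PATH=[$LD_LIBRARY_PATH]"')
echo "$child_view"
```

the child prints empty brackets: the export never reached it.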

Cheers,

Gilles

