Discussion:
[OMPI users] Question on run-time error "ORTE was unable to reliably start"
Blosch, Edwin L
2016-07-28 20:55:28 UTC
Permalink
I am running cases that start just fine and run for a few hours, then they die with a message that looks like a startup-type failure. The message is shown below; it appears in the standard output of the rank 0 process. I'm assuming there is a failing card or port or something.

What diagnostic flags can I add to mpirun to help shed light on the problem?

What kinds of problems could cause this kind of message, which looks start-up related, after the job has already been running many hours?

Ed

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
-------------------------------------------------------------------------
Ralph Castain
2016-07-28 21:06:36 UTC
Permalink
What kind of system was this on? ssh, slurm, ...?
Blosch, Edwin L
2016-07-29 01:24:21 UTC
Permalink
Cray CS400, RedHat 6.5, PBS Pro (but OpenMPI is built --without-tm), OpenMPI 1.8.8, ssh

Ralph Castain
2016-07-30 00:37:35 UTC
Permalink
Really scratching my head over this one. The app won’t start running until after all the daemons have been launched, so this doesn’t seem possible at first glance. I’m wondering if something else is going on that might lead to a similar error? Does the application call comm_spawn, for example? Or is it a script that eventually attempts to launch another job?
Blosch, Edwin L
2016-08-12 00:33:43 UTC
Permalink
I had another observation of the problem, with a little more insight. I can confirm that the job had been running for several hours before dying with the 'ORTE was unable to reliably start' message. Somehow it is possible. I had used the following options to try to get some more diagnostics: --output-filename mpirun-stdio -mca btl ^tcp --mca plm_base_verbose 10 --mca btl_base_verbose 30

In the stack traces, roughly half of the processes reported dying at an MPI_BARRIER() call. The rest had progressed further and were at an MPI_WAITALL call. The exchange is implemented like this: every process posts its non-blocking receives (IRECV), hits an MPI_BARRIER, then everybody posts the non-blocking sends (ISEND), followed by MPI_WAITALL. This entire exchange happens twice in a row, sending different sets of variables. The application is unstructured CFD, so any given process is talking to 10 to 15 other processes, exchanging data across domain boundaries. There is a range of message sizes flying around, some as small as 500 bytes, others as large as 1 MB. I'm using 480 processes.
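In case it helps, here is a bare-bones C sketch of that pattern. It is only an illustration (the real code's neighbor lists, buffers, and message sizes are not shown; this toy version just exchanges with a left and right neighbor in a ring), but it shows the IRECV / BARRIER / ISEND / WAITALL ordering:

#include <mpi.h>
#include <stdio.h>

#define N 512                /* arbitrary message size for the sketch */

int main(int argc, char **argv)
{
    int rank, size, left, right, i;
    double sendl[N], sendr[N], recvl[N], recvr[N];
    MPI_Request reqs[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* stand-in "neighbors": a periodic ring instead of the real
       10-15 domain-boundary neighbors per process */
    left  = (rank - 1 + size) % size;
    right = (rank + 1) % size;
    for (i = 0; i < N; i++)
        sendl[i] = sendr[i] = (double) rank;

    /* 1. every process posts its non-blocking receives first */
    MPI_Irecv(recvl, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvr, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 2. the barrier that was added to work around the MxM errors */
    MPI_Barrier(MPI_COMM_WORLD);

    /* 3. then everybody posts the matching non-blocking sends */
    MPI_Isend(sendr, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(sendl, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[3]);

    /* 4. wait for all sends and receives; in the application this whole
          exchange is then repeated with another set of variables */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("exchange complete on %d ranks\n", size);

    MPI_Finalize();
    return 0;
}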

I'm wondering if I'm kicking off too many of these non-blocking messages and exhausting some network resource; perhaps orted does some kind of 'ping' to make sure everyone is still alive, can't reach some process, and then reports what looks like a startup problem. Wild guesses, no real idea.

For what it's worth, the barrier wasn't in an earlier implementation of this routine. I was seeing some jobs dying suddenly with MxM library errors, and I put this barrier in place, and those problems seemed to go away. So it just got committed and forgotten a couple years ago. I thought (still think) the code is correct without the barrier.

Also, I am running under MVAPICH at the moment and not having the same problems yet.

Finally, using the same exact model and application, I had a failure that left a different message:
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

hostname: k2n01

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

--------------------------------------------------------------------------



Gilles Gouaillardet
2016-08-12 00:53:03 UTC
Permalink
Hi,

this is very puzzling ...


is your application using MPI_Comm_spawn and friends?

If not, is orted on node k2n01 *really* dead? Or does the head node
incorrectly believe orted died?


you might want to add the following configuration to your ~/.ssh/config

TCPKeepAlive=yes

ServerAliveInterval=60


you might also want to use a lower value for the kernel parameter
net.ipv4.tcp_keepalive_time

(the default is 7200 seconds)
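
for example (just an illustration; needs root, and the right value is
site-dependent):

sysctl -w net.ipv4.tcp_keepalive_time=600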


also, which interconnect are you using?

if mxm is available on your system, it will be used.

if you do not want to use mxm, then you can

mpirun --mca pml ob1



also, did you run dmesg on k2n01? a common and hard-to-troubleshoot
issue is the oom-killer having killed orted (!)
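
for example, something like

dmesg | grep -Ei 'oom|killed process'

on that node should show whether it happened (the exact grep pattern is
just a suggestion).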


Cheers,


Gilles