Discussion:
[OMPI users] Unable to spawn MPI processes on multiple nodes with recent version of OpenMPI
Andrew Benson
2018-09-15 20:46:15 UTC
Permalink
I'm running into problems trying to spawn MPI processes across multiple nodes
on a cluster using recent versions of OpenMPI. Specifically, using the attached
Fortran code, compiled using OpenMPI 3.1.2 with:

mpif90 test.F90 -o test.exe

and run via a PBS scheduler using the attached test1.pbs, it fails as can be
seen in the attached testFAIL.err file.
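
The actual test1.pbs is attached to the original message and is not reproduced
here; purely as an illustration for readers of the archive, a job script of
the general shape described in this thread (node counts, process counts and
option spellings are placeholders, not the real script) would look roughly like:

#!/bin/bash
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:10:00
cd "$PBS_O_WORKDIR"
# Launch a few parent processes spread one per node; the test code then calls
# MPI_Comm_spawn to create the child processes. The preconnect option is
# written here as the mpi_preconnect_mpi MCA parameter; the exact spelling
# used in the real script is not shown in the thread.
mpirun -np 2 --map-by node --mca mpi_preconnect_mpi 1 ./test.exe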

If I do the same but using OpenMPI v1.10.3 then it works successfully, giving
me the output in the attached testSUCCESS.err file.

From testing a few different versions of OpenMPI it seems that the behavior
changed between v1.10.7 and v2.0.4.

Is there some change in options needed to make this work with newer OpenMPIs?

Output from ompi_info --all is attached. config.log can be found here:

http://users.obs.carnegiescience.edu/abenson/config.log.bz2

Thanks for any help you can offer!

-Andrew
Ralph H Castain
2018-09-16 14:03:15 UTC
Permalink
I see you are using “preconnect_all” - that is the source of the trouble. I don’t believe we have tested that option in years and the code is almost certainly dead. I’d suggest removing that option and things should work.
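
For anyone following along in the archive, "removing the option" means
deleting the preconnect setting from wherever it is being picked up (command
line, environment, or an MCA parameter file). A quick way to find where it is
set, assuming the modern mpi_preconnect_mpi spelling (older releases also
accepted mpi_preconnect_all):

# Look for the parameter among the build's registered MCA parameters, in the
# environment, and in the per-user parameter file, then delete the setting found.
ompi_info --all | grep -i preconnect
env | grep -i OMPI_MCA
grep -i preconnect "$HOME/.openmpi/mca-params.conf" 2>/dev/null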
Andrew Benson
2018-09-16 19:33:28 UTC
Permalink
Thanks - I'll try removing that option.
--
* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html

* Galacticus: https://bitbucket.org/abensonca/galacticus
Andrew Benson
2018-09-17 00:01:07 UTC
Permalink
Removing the preconnect_all option didn't resolve the problem unfortunately.

I tried changing a few of the other options that I pass to mpirun. What does
seem to make a difference is the "--map-by node" option. If I remove that
option then my test code runs successfully; the output is in the attached
test.err file.
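
For reference, the difference comes down to the mapping option on the mpirun
line; the process count and executable name below are placeholders:

# Fails once MPI_Comm_spawn is reached, with the Open MPI 2.x/3.x builds
# tested in this thread:
mpirun -np 2 --map-by node ./test.exe

# Works: without --map-by node the default mapping packs the initial ranks
# onto the first node's slots instead of spreading them one per node.
mpirun -np 2 ./test.exe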

Ideally I'd like to be able to use "--map-by node" so that the initial
processes are distributed across the available resources. Is there some reason
why the child processes would be unable to communicate when "--map-by node" is
used?

-Andrew
--
* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html

* Galacticus: https://bitbucket.org/abensonca/galacticus
Andrew Benson
2018-09-19 14:59:44 UTC
Permalink
On further investigation, removing the "preconnect_all" option does at least
change the problem. Without "preconnect_all" I no longer see:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

Process 1 ([[32179,2],15]) is on host: node092
Process 2 ([[32179,2],0]) is on host: unknown!
BTLs attempted: self tcp vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------


Instead it hangs for several minutes and finally aborts with:

--------------------------------------------------------------------------
A request has timed out and will therefore fail:

Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------
[node091:19470] *** An error occurred in MPI_Comm_spawn
[node091:19470] *** reported by process [1614086145,0]
[node091:19470] *** on communicator MPI_COMM_WORLD
[node091:19470] *** MPI_ERR_UNKNOWN: unknown error
[node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
now abort,
[node091:19470] *** and potentially your MPI job)

I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout
as suggested in the error message, but the result is unchanged (it just takes
longer to time out).
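
For completeness, the timeouts were raised roughly as follows (assuming they
are passed as -mca parameters on the mpirun line, as the help text suggests;
the values are illustrative only):

# Raising these parameters only delayed the same LOOKUP timeout; the
# process count and executable name are placeholders.
mpirun -np 2 --map-by node \
       --mca pmix_server_max_wait 600 \
       --mca pmix_base_exchange_timeout 600 \
       ./test.exe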

Once again, if I remove "--map-by node" it runs successfully.

-Andrew
--
* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html

* Galacticus: https://bitbucket.org/abensonca/galacticus
Ralph H Castain
2018-10-06 16:02:47 UTC
Permalink
Sorry for delay - this should be fixed by https://github.com/open-mpi/ompi/pull/5854
Ralph H Castain
2018-10-06 17:02:49 UTC
Permalink
Just FYI: on master (and perhaps 4.0), child jobs do not inherit their parent's mapping policy by default. You have to add “-mca rmaps_base_inherit 1” to your mpirun command line.
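
Concretely, for a spawning job mapped by node that means something like the
following (process count and executable name are placeholders):

# Ask child jobs created via MPI_Comm_spawn to inherit the parent job's
# mapping policy (here --map-by node), which is no longer done by default.
mpirun -np 2 --map-by node -mca rmaps_base_inherit 1 ./test.exe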
Post by Andrew Benson
Thanks, I'll try this right away.
Andrew Benson
2018-10-06 17:04:38 UTC
Permalink
Ok, thanks - that's good to know.

-Andrew


--

* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html

* Galacticus: http://sites.google.com/site/galacticusmodel