On further investigation, removing the "preconnect_all" option does at least
change the problem. Without "preconnect_all" I no longer see:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[32179,2],15]) is on host: node092
Process 2 ([[32179,2],0]) is on host: unknown!
BTLs attempted: self tcp vader
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
Instead it hangs for several minutes and finally aborts with:
--------------------------------------------------------------------------
A request has timed out and will therefore fail:
Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345
Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------
[node091:19470] *** An error occurred in MPI_Comm_spawn
[node091:19470] *** reported by process [1614086145,0]
[node091:19470] *** on communicator MPI_COMM_WORLD
[node091:19470] *** MPI_ERR_UNKNOWN: unknown error
[node091:19470] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node091:19470] ***    and potentially your MPI job)
I've tried increasing both pmix_server_max_wait and pmix_base_exchange_timeout
as suggested in the error message, but the result is unchanged (it just takes
longer to time out).
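For concreteness, that means passing something along the lines of

mpirun --mca pmix_server_max_wait 600 --mca pmix_base_exchange_timeout 600 ...

where the values there are just examples of the larger settings I tried, and the
rest of the launch line is unchanged.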
Once again, if I remove "--map-by node" it runs successfully.
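In case the full context helps, the launch is essentially of this form (the
executable name and process count here are placeholders rather than the exact
line from my script):

mpirun -np 1 --map-by node ./test.exe

with the presence or absence of "--map-by node" being the only difference
between the failing and working runs.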
-Andrew
Post by Ralph H Castain
I see you are using “preconnect_all” - that is the source of the trouble. I
don’t believe we have tested that option in years and the code is almost
certainly dead. I’d suggest removing that option and things should work.
Post by Andrew Benson
I'm running into problems trying to spawn MPI processes across multiple
nodes on a cluster using recent versions of OpenMPI. Specifically, compiling with
mpif90 test.F90 -o test.exe
and running via a PBS scheduler using the attached test1.pbs, it fails as can
be seen in the attached testFAIL.err file.
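The failure occurs in an MPI_Comm_spawn call; a stripped-down sketch of the
kind of call involved (not the actual attached test.F90, and with the
executable name and process count as placeholders) looks like:

program spawn_test
  use mpi
  implicit none
  integer :: ierr, parent, intercomm
  integer :: errcodes(4)
  call MPI_Init(ierr)
  call MPI_Comm_get_parent(parent, ierr)
  if (parent == MPI_COMM_NULL) then
    ! Original process: spawn 4 copies of this executable; with
    ! "--map-by node" the children should be spread across the allocated nodes.
    call MPI_Comm_spawn("./spawn_test.exe", MPI_ARGV_NULL, 4, MPI_INFO_NULL, &
                        0, MPI_COMM_SELF, intercomm, errcodes, ierr)
  end if
  call MPI_Finalize(ierr)
end program spawn_test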
If I do the same but with OpenMPI v1.10.3, it works successfully, giving me
the output in the attached testSUCCESS.err file.
From testing a few different versions of OpenMPI it seems that the behavior
changed between v1.10.7 and v2.0.4.
Is there some change in options needed to make this work with newer OpenMPIs?
http://users.obs.carnegiescience.edu/abenson/config.log.bz2
Thanks for any help you can offer!
-Andrew
<ompi_info.log.bz2><test.F90><test1.pbs><testFAIL.err.bz2><testSUCCESS.err.bz2>
--
* Andrew Benson: http://users.obs.carnegiescience.edu/abenson/contact.html
* Galacticus: https://bitbucket.org/abensonca/galacticus