[OMPI users] Intermittent failure when launching an application linked with OpenMPI 3.1.1
David Whitaker
2018-10-04 22:56:37 UTC
Hi,
When launching an application linked with OpenMPI 3.1.1 using the line:
srun --mpi=pmi2 --distribution=arbitrary \
  --cpu_bind=map_cpu:0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126 \
  -n 1024 a.out
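
The map_cpu list above is every even-numbered CPU id from 0 to 126 (64 entries), so it could be generated rather than typed out. A minimal sketch, assuming a POSIX shell with `seq` and `paste` available; the variable name CPU_MAP is only illustrative and is not part of the original command:

```shell
# Build the comma-separated list 0,2,4,...,126 (every even CPU id).
# CPU_MAP is a hypothetical variable name chosen for this sketch.
CPU_MAP=$(seq 0 2 126 | paste -sd, -)

# Show the value that would follow --cpu_bind=map_cpu:
echo "$CPU_MAP"
```

With that variable set, the launch line could be written as `srun --mpi=pmi2 --distribution=arbitrary --cpu_bind=map_cpu:$CPU_MAP -n 1024 a.out`.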

I often (most of the time) get:

[amd-0013][[29472,1],727][connect/btl_openib_connect_udcm.c:1531:udcm_find_endpoint]
could not find endpoint with port: 1, lid: 21, msg_type: 100
[amd-0013][[29472,1],727][connect/btl_openib_connect_udcm.c:2036:udcm_process_messages]
could not find associated endpoint.
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

Process 1 ([[29472,1],727]) is on host: amd-0013
Process 2 ([[29472,1],711]) is on host: unknown!
BTLs attempted: self openib

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[amd-0013:16718] *** An error occurred in MPI_Allreduce
[amd-0013:16718] *** reported by process [1931476993,727]
[amd-0013:16718] *** on communicator MPI_COMM_WORLD
[amd-0013:16718] *** MPI_ERR_INTERN: internal error
[amd-0013:16718] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
now abort,


This failure is intermittent; sometimes the job runs without any problem.
I have tried setting the following environment variables:
export OMPI_MCA_btl_openib_connect_udcm_max_retry=500
export OMPI_MCA_btl_openib_connect_udcm_timeout=5000000

but I am not sure whether these are helping.
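
One thing that can be checked cheaply is whether both settings are actually present in the environment from which srun is launched. A rough sketch, assuming a POSIX shell; UDCM_VARS is an illustrative name, and the check only inspects the environment, it does not confirm that Open MPI honors the values:

```shell
export OMPI_MCA_btl_openib_connect_udcm_max_retry=500
export OMPI_MCA_btl_openib_connect_udcm_timeout=5000000

# Collect every exported variable that carries the udcm prefix used above.
# UDCM_VARS is a hypothetical name chosen for this sketch.
UDCM_VARS=$(env | grep '^OMPI_MCA_btl_openib_connect_udcm_' | sort)
echo "$UDCM_VARS"
```

If either variable is missing from the output, it was not exported in the shell that starts the job.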

Does anyone understand what is happening and how I can prevent it?

Many thanks,
Dave
--
CCCCCCCCCCCCCCCCCCCCCCFFFFFFFFFFFFFFFFFFFFFFFFFDDDDDDDDDDDDDDDDDDDDD
David Whitaker, Ph.D. ***@cray.com
Aerospace CFD Specialist phone: (651)605-9078
ISV Applications/Cray Inc fax: (651)605-9001
CCCCCCCCCCCCCCCCCCCCCCFFFFFFFFFFFFFFFFFFFFFFFFFDDDDDDDDDDDDDDDDDDDDD