[OMPI users] OPAL ERROR with openmpi-master-201705270239-a6f6113 on SuSE Linux
Siegmar Gross
2017-05-29 11:38:40 UTC
Hi,

I have installed openmpi-master-201705270239-a6f6113 on my "SUSE Linux
Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
Unfortunately, two of my small test programs fail with the same errors
for different command lines. Everything works as expected as long as
the job runs only on the machine from which I start it.
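The failing call is MPI_Comm_spawn_multiple. For reference, here is a
minimal sketch of a master program equivalent to my test, reconstructed
from the output below (the exact names and argument strings are
illustrative, not the literal spawn_multiple_master source):

/* Sketch of the master (reconstruction, not the literal source):
 * two command lines, both running spawn_slave, with 1 and 2
 * processes respectively -> 3 slave processes in total.         */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int  namelen, mytid, ntasks_world, ntasks_local, ntasks_remote;
  MPI_Comm COMM_CHILD_PROCESSES;

  /* command lines and per-command argument vectors (argv[0] of the
   * spawned processes is the command itself, so these vectors hold
   * only the extra arguments, NULL-terminated)                     */
  char *commands[2]    = {"spawn_slave", "spawn_slave"};
  char *argv_1[]       = {"program type 1", NULL};
  char *argv_2[]       = {"program type 2", "another parameter", NULL};
  char **spawn_argv[2] = {argv_1, argv_2};
  int  maxprocs[2]     = {1, 2};
  MPI_Info infos[2]    = {MPI_INFO_NULL, MPI_INFO_NULL};

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &mytid);
  MPI_Get_processor_name (processor_name, &namelen);
  printf ("Parent process %d running on %s\n"
          "  I create 3 slave processes.\n\n", mytid, processor_name);

  MPI_Comm_spawn_multiple (2, commands, spawn_argv, maxprocs, infos,
                           0, MPI_COMM_WORLD, &COMM_CHILD_PROCESSES,
                           MPI_ERRCODES_IGNORE);

  /* report the sizes of the intercommunicator's groups */
  MPI_Comm_size (MPI_COMM_WORLD, &ntasks_world);
  MPI_Comm_size (COMM_CHILD_PROCESSES, &ntasks_local);
  MPI_Comm_remote_size (COMM_CHILD_PROCESSES, &ntasks_remote);
  printf ("Parent process 0: tasks in MPI_COMM_WORLD: %d\n"
          "  tasks in COMM_CHILD_PROCESSES local group:  %d\n"
          "  tasks in COMM_CHILD_PROCESSES remote group: %d\n",
          ntasks_world, ntasks_local, ntasks_remote);

  MPI_Comm_free (&COMM_CHILD_PROCESSES);
  MPI_Finalize ();
  return 0;
}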


loki spawn 158 mpiexec -np 1 --host loki:4 spawn_multiple_master

Parent process 0 running on loki
I create 3 slave processes.

[loki:12292] SERVER READY
Slave process 0 of 3 running on loki
Slave process 1 of 3 running on loki
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
Slave process 2 of 3 running on loki
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 2: argv[1]: program type 2
spawn_slave 2: argv[2]: another parameter
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
Parent process 0: tasks in MPI_COMM_WORLD: 1
tasks in COMM_CHILD_PROCESSES local group: 1
tasks in COMM_CHILD_PROCESSES remote group: 3
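The slave side only reports its rank and its command line arguments,
roughly like the following sketch (again a reconstruction consistent
with the argv lines above, not the literal spawn_slave source):

/* Sketch of spawn_slave: print rank, world size, host, and argv. */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int  namelen, mytid, ntasks, i;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &mytid);
  MPI_Comm_size (MPI_COMM_WORLD, &ntasks);
  MPI_Get_processor_name (processor_name, &namelen);
  printf ("Slave process %d of %d running on %s\n",
          mytid, ntasks, processor_name);
  for (i = 0; i < argc; ++i) {
    printf ("spawn_slave %d: argv[%d]: %s\n", mytid, i, argv[i]);
  }
  MPI_Finalize ();
  return 0;
}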




loki spawn 159 mpiexec -np 1 --host nfs1:4 spawn_multiple_master

Parent process 0 running on nfs1
I create 3 slave processes.

[nfs1:06562] SERVER READY
[nfs1:06567] OPAL ERROR: (null) in file
../../../../openmpi-master-201705270239-a6f6113/opal/mca/pmix/base/pmix_base_fns.c
at line 170
[nfs1:6567] *** An error occurred in MPI_Comm_spawn_multiple
[nfs1:6567] *** reported by process [2946629633,0]
[nfs1:6567] *** on communicator MPI_COMM_WORLD
[nfs1:6567] *** Unknown error (this should not happen!)
[nfs1:6567] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nfs1:6567] *** and potentially your MPI job)




loki spawn 160 mpiexec -np 1 --host loki,nfs1,nfs2:2 spawn_multiple_master

Parent process 0 running on loki
I create 3 slave processes.

[loki:12401] SERVER READY
[nfs1:06919] SERVER READY
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_dpm_dyn_init() failed
--> Returned "(null)" (32764) instead of "Success" (0)
--------------------------------------------------------------------------
[nfs1:6924] *** An error occurred in MPI_Init
[nfs1:6924] *** reported by process [2949447682,0]
[nfs1:6924] *** on a NULL communicator
[nfs1:6924] *** Unknown error
[nfs1:6924] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nfs1:6924] *** and potentially your MPI job)
[nfs1:06924] OPAL ERROR: (null) in file
../../../../openmpi-master-201705270239-a6f6113/opal/mca/pmix/base/pmix_base_fns.c
at line 170
loki spawn 162




I would be grateful if somebody could fix the problem. Do you need
anything else from me? Thank you very much in advance for any help.


Kind regards

Siegmar
