Discussion:
[OMPI users] openmpi-v3.x-201705250239-d5200ea
Siegmar Gross
2017-05-29 12:05:20 UTC
Hi,

I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
Unfortunately, my rankfiles no longer work.


loki rankfiles 136 cat rf_loki_nfs1
rank 0=loki slot=0:0-3;1:0-1
rank 1=loki slot=1:2-5
rank 2=nfs1 slot=0:4
rank 3=nfs1 slot=1:5


loki rankfiles 137 mpiexec -report-bindings -np 4 -rf rf_loki_nfs1 hostname
[nfs1:11461] [[41737,0],1] ORTE_ERROR_LOG: Not found in file
../../../../../openmpi-v3.x-201705250239-d5200ea/orte/mca/rmaps/rank_file/rmaps_rank_file.c
at line 408
[nfs1:11461] [[41737,0],1] ORTE_ERROR_LOG: Not found in file
../../../../../openmpi-v3.x-201705250239-d5200ea/orte/mca/rmaps/rank_file/rmaps_rank_file.c
at line 162
[nfs1:11461] [[41737,0],1] ORTE_ERROR_LOG: Not found in file
../../../../openmpi-v3.x-201705250239-d5200ea/orte/mca/rmaps/base/rmaps_base_map_job.c
at line 370
[nfs1:11461] [[41737,0],1] ORTE_ERROR_LOG: Not found in file
../../../../openmpi-v3.x-201705250239-d5200ea/orte/mca/odls/base/odls_base_default_fns.c
at line 425
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[41737,0],0] on node loki
Remote daemon: [[41737,0],1] on node nfs1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
loki rankfiles 138




I would be grateful if somebody could fix the problem. Do you need anything
else? Thank you very much in advance for any help.


Kind regards

Siegmar
