Discussion:
[OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux
Siegmar Gross
2016-12-28 14:06:17 UTC
Hi,

I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
I get an error when I run one of my programs. Everything works as
expected with openmpi-master-201612232109-67a08e8. The program
gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
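(For readers without the test sources: the spawn_master program itself is
not attached to this mail, but judging from the output below it is a small
MPI_Comm_spawn test. A minimal sketch along those lines, with all details
assumed rather than taken from the real source, could look like this:)

#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 4

int main(int argc, char *argv[])
{
  int rank, world_size, local_size, remote_size, len;
  char host[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm comm_child;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  MPI_Get_processor_name(host, &len);
  printf("Parent process %d running on %s\n", rank, host);
  printf("I create %d slave processes\n", NUM_SLAVES);

  /* spawn the slaves; in the resulting intercommunicator the parent is
     in the local group and the slaves are in the remote group */
  MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                 MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                 &comm_child, MPI_ERRCODES_IGNORE);

  MPI_Comm_size(comm_child, &local_size);
  MPI_Comm_remote_size(comm_child, &remote_size);
  printf("Parent process %d: tasks in MPI_COMM_WORLD: %d\n"
         "  tasks in COMM_CHILD_PROCESSES local group:  %d\n"
         "  tasks in COMM_CHILD_PROCESSES remote group: %d\n",
         rank, world_size, local_size, remote_size);

  MPI_Comm_free(&comm_child);
  MPI_Finalize();
  return 0;
}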

loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
Open MPI: 2.0.2rc2
C compiler absolute: /opt/solstudio12.5b/bin/cc


loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
I create 4 slave processes

--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.

Local host: loki
System call: open(2)
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
[loki:17855] *** Process received signal ***
[loki:17855] Signal: Segmentation fault (11)
[loki:17855] Signal code: Address not mapped (1)
[loki:17855] Failing at address: 0x8
[loki:17855] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f053d0e9870]
[loki:17855] [ 1]
/usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(+0x990ae)[0x7f05325060ae]
[loki:17855] [ 2]
/usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x196)[0x7f053250cb16]
[loki:17855] [ 3]
/usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_irecv+0x2f8)[0x7f05324bd3d8]
[loki:17855] [ 4]
/usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x34c)[0x7f053e52300c]
[loki:17855] [ 5]
/usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0x1ed)[0x7f053e523eed]
[loki:17855] [ 6]
/usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x1a3)[0x7f0531ea7c03]
[loki:17855] [ 7]
/usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_connect_accept+0xab8)[0x7f053d484f38]
[loki:17855] [ 8] [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file
../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
/usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_dyn_init+0xcd)[0x7f053d48aeed]
[loki:17855] [ 9]
/usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_mpi_init+0xf93)[0x7f053d53d5f3]
[loki:17855] [10]
/usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(PMPI_Init+0x8d)[0x7f053db209cd]
[loki:17855] [11] spawn_slave[0x4009cf]
[loki:17855] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f053cd53b25]
[loki:17855] [13] spawn_slave[0x400892]
[loki:17855] *** End of error message ***
[loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file
../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

Process 1 ([[55817,2],0]) is on host: loki
Process 2 ([[55817,2],1]) is on host: unknown!
BTLs attempted: self sm tcp vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_dpm_dyn_init() failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
loki spawn 146
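Side note (a hint only, not a verified fix for this problem): the help text
above mentions forgetting the "self" BTL. If needed, the BTL list can be
pinned explicitly on the command line with the standard MCA parameter, e.g.

mpiexec --mca btl self,vader,tcp -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master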







loki spawn 120 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
Open MPI: 2.0.2a1
C compiler absolute: /opt/solstudio12.5b/bin/cc
loki spawn 121 which mpiexec
/usr/local/openmpi-2.1.0_64_cc/bin/mpiexec
loki spawn 122 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
I create 4 slave processes

[loki:21301] OPAL ERROR: Timeout in file
../../../../openmpi-v2.x-201612232156-5ce66b0/opal/mca/pmix/base/pmix_base_fns.c
at line 195
[loki:21301] *** An error occurred in MPI_Comm_spawn
[loki:21301] *** reported by process [3431727105,0]
[loki:21301] *** on communicator MPI_COMM_WORLD
[loki:21301] *** MPI_ERR_UNKNOWN: unknown error
[loki:21301] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
abort,
[loki:21301] *** and potentially your MPI job)
loki spawn 123






loki spawn 111 ompi_info | grep -e "Open MPI:" -e "C compiler"
Open MPI: 3.0.0a1
C compiler: cc
C compiler absolute: /opt/solstudio12.5b/bin/cc
C compiler family name: SUN
C compiler version: 0x5140
loki spawn 111 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master

Parent process 0 running on loki
I create 4 slave processes

Parent process 0: tasks in MPI_COMM_WORLD: 1
tasks in COMM_CHILD_PROCESSES local group: 1
tasks in COMM_CHILD_PROCESSES remote group: 4

Slave process 1 of 4 running on loki
Slave process 3 of 4 running on loki
Slave process 0 of 4 running on loki
Slave process 2 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
loki spawn 112


I would be grateful if somebody could fix these problems. Thank you
very much in advance for any help.


Kind regards

Siegmar
Howard Pritchard
2017-01-02 23:17:43 UTC
Hi Siegmar,

I've attempted to reproduce this using the GNU compilers and
the version of the test program(s) you posted earlier in 2016,
but I am unable to reproduce the problem.

Could you double-check that the slave program can be
run successfully when launched directly by mpirun/mpiexec?
It might also help to use --mca btl_base_verbose 10 when
running the slave program standalone.
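For example (just spelling out the invocation meant here, reusing the
slot list from your earlier command):

mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave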

Thanks,

Howard



Siegmar Gross
2017-01-03 07:32:44 UTC
Hi Howard,

thank you very much for trying to solve my problem. I haven't
changed the programs since 2013, so you are using the correct
version. As you can see from the output at the bottom of my last
mail, the program works as expected with the master trunk. The
slave program works when I launch it directly.
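(For reference, the slave source itself is not attached to this thread, so
the following is only an assumed minimal sketch: a spawn_slave-style program
initializes MPI, reports rank, size, host and argv[0], and finalizes.)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, size, len;
  char host[MPI_MAX_PROCESSOR_NAME];

  /* in the traceback above it is this MPI_Init call that fails
     when the program is spawned by spawn_master */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(host, &len);
  printf("Slave process %d of %d running on %s\n", rank, size, host);
  printf("spawn_slave %d: argv[0]: %s\n", rank, argv[0]);
  MPI_Finalize();
  return 0;
}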

loki spawn 122 mpicc --showme
cc -I/usr/local/openmpi-2.0.2_64_cc/include -m64 -mt -mt -Wl,-rpath -Wl,/usr/local/openmpi-2.0.2_64_cc/lib64 -Wl,--enable-new-dtags
-L/usr/local/openmpi-2.0.2_64_cc/lib64 -lmpi
loki spawn 123 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
Open MPI: 2.0.2rc2
C compiler absolute: /opt/solstudio12.5b/bin/cc
loki spawn 124 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave
[loki:05572] mca: base: components_register: registering framework btl components
[loki:05572] mca: base: components_register: found loaded component self
[loki:05572] mca: base: components_register: component self register function successful
[loki:05572] mca: base: components_register: found loaded component sm
[loki:05572] mca: base: components_register: component sm register function successful
[loki:05572] mca: base: components_register: found loaded component tcp
[loki:05572] mca: base: components_register: component tcp register function successful
[loki:05572] mca: base: components_register: found loaded component vader
[loki:05572] mca: base: components_register: component vader register function successful
[loki:05572] mca: base: components_open: opening btl components
[loki:05572] mca: base: components_open: found loaded component self
[loki:05572] mca: base: components_open: component self open function successful
[loki:05572] mca: base: components_open: found loaded component sm
[loki:05572] mca: base: components_open: component sm open function successful
[loki:05572] mca: base: components_open: found loaded component tcp
[loki:05572] mca: base: components_open: component tcp open function successful
[loki:05572] mca: base: components_open: found loaded component vader
[loki:05572] mca: base: components_open: component vader open function successful
[loki:05572] select: initializing btl component self
[loki:05572] select: init of component self returned success
[loki:05572] select: initializing btl component sm
[loki:05572] select: init of component sm returned failure
[loki:05572] mca: base: close: component sm closed
[loki:05572] mca: base: close: unloading component sm
[loki:05572] select: initializing btl component tcp
[loki:05572] select: init of component tcp returned success
[loki:05572] select: initializing btl component vader
[loki][[35331,1],0][../../../../../openmpi-2.0.2rc2/opal/mca/btl/vader/btl_vader_component.c:454:mca_btl_vader_component_init] No peers to communicate with.
Disabling vader.
[loki:05572] select: init of component vader returned failure
[loki:05572] mca: base: close: component vader closed
[loki:05572] mca: base: close: unloading component vader
[loki:05572] mca: bml: Using self btl for send to [[35331,1],0] on node loki
Slave process 0 of 1 running on loki
spawn_slave 0: argv[0]: spawn_slave
[loki:05572] mca: base: close: component self closed
[loki:05572] mca: base: close: unloading component self
[loki:05572] mca: base: close: component tcp closed
[loki:05572] mca: base: close: unloading component tcp
loki spawn 125


Kind regards and thank you very much once more

Siegmar
Howard Pritchard
2017-01-03 16:22:14 UTC
Hi Siegmar,

Could you please rerun the spawn_slave program with 4 processes?
Your original traceback indicates a failure in the barrier in the slave
program. I'm interested in seeing whether the barrier failure is also
observed when you run the slave program standalone with 4 processes.

Thanks,

Howard


Siegmar Gross
2017-01-04 06:31:06 UTC
Hi Howard,

it still works with 4 processes, and "vader" does not print the
following output about missing communication peers if I start
at least 2 processes.

...
[loki:14965] select: initializing btl component vader
[loki][[42444,1],0][../../../../../openmpi-2.0.2rc2/opal/mca/btl/vader/btl_vader_component.c:454:mca_btl_vader_component_init] No peers to communicate with.
Disabling vader.
[loki:14965] select: init of component vader returned failure
[loki:14965] mca: base: close: component vader closed
[loki:14965] mca: base: close: unloading component vader
...

Here is the output from 4 processes.

loki spawn 112 mpiexec -np 4 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave
[loki:14046] mca: base: components_register: registering framework btl components
[loki:14046] mca: base: components_register: found loaded component self
[loki:14047] mca: base: components_register: registering framework btl components
[loki:14047] mca: base: components_register: found loaded component self
[loki:14048] mca: base: components_register: registering framework btl components
[loki:14048] mca: base: components_register: found loaded component self
[loki:14046] mca: base: components_register: component self register function successful
[loki:14047] mca: base: components_register: component self register function successful
[loki:14047] mca: base: components_register: found loaded component sm
[loki:14048] mca: base: components_register: component self register function successful
[loki:14048] mca: base: components_register: found loaded component sm
[loki:14046] mca: base: components_register: found loaded component sm
[loki:14048] mca: base: components_register: component sm register function successful
[loki:14047] mca: base: components_register: component sm register function successful
[loki:14047] mca: base: components_register: found loaded component tcp
[loki:14046] mca: base: components_register: component sm register function successful
[loki:14046] mca: base: components_register: found loaded component tcp
[loki:14048] mca: base: components_register: found loaded component tcp
[loki:14046] mca: base: components_register: component tcp register function successful
[loki:14046] mca: base: components_register: found loaded component vader
[loki:14047] mca: base: components_register: component tcp register function successful
[loki:14047] mca: base: components_register: found loaded component vader
[loki:14048] mca: base: components_register: component tcp register function successful
[loki:14048] mca: base: components_register: found loaded component vader
[loki:14047] mca: base: components_register: component vader register function successful
[loki:14047] mca: base: components_open: opening btl components
[loki:14047] mca: base: components_open: found loaded component self
[loki:14047] mca: base: components_open: component self open function successful
[loki:14047] mca: base: components_open: found loaded component sm
[loki:14046] mca: base: components_register: component vader register function successful
[loki:14046] mca: base: components_open: opening btl components
[loki:14046] mca: base: components_open: found loaded component self
[loki:14046] mca: base: components_open: component self open function successful
[loki:14046] mca: base: components_open: found loaded component sm
[loki:14048] mca: base: components_register: component vader register function successful
[loki:14048] mca: base: components_open: opening btl components
[loki:14048] mca: base: components_open: found loaded component self
[loki:14048] mca: base: components_open: component self open function successful
[loki:14048] mca: base: components_open: found loaded component sm
[loki:14048] mca: base: components_open: component sm open function successful
[loki:14048] mca: base: components_open: found loaded component tcp
[loki:14046] mca: base: components_open: component sm open function successful
[loki:14046] mca: base: components_open: found loaded component tcp
[loki:14046] mca: base: components_open: component tcp open function successful
[loki:14046] mca: base: components_open: found loaded component vader
[loki:14046] mca: base: components_open: component vader open function successful
[loki:14047] mca: base: components_open: component sm open function successful
[loki:14047] mca: base: components_open: found loaded component tcp
[loki:14047] mca: base: components_open: component tcp open function successful
[loki:14047] mca: base: components_open: found loaded component vader
[loki:14047] mca: base: components_open: component vader open function successful
[loki:14048] mca: base: components_open: component tcp open function successful
[loki:14048] mca: base: components_open: found loaded component vader
[loki:14048] mca: base: components_open: component vader open function successful
[loki:14048] select: initializing btl component self
[loki:14048] select: init of component self returned success
[loki:14048] select: initializing btl component sm
[loki:14048] select: init of component sm returned success
[loki:14048] select: initializing btl component tcp
[loki:14047] select: initializing btl component self
[loki:14047] select: init of component self returned success
[loki:14047] select: initializing btl component sm
[loki:14047] select: init of component sm returned success
[loki:14047] select: initializing btl component tcp
[loki:14046] select: initializing btl component self
[loki:14046] select: init of component self returned success
[loki:14046] select: initializing btl component sm
[loki:14049] mca: base: components_register: registering framework btl components
[loki:14049] mca: base: components_register: found loaded component self
[loki:14049] mca: base: components_register: component self register function successful
[loki:14048] select: init of component tcp returned success
[loki:14048] select: initializing btl component vader
[loki:14047] select: init of component tcp returned success
[loki:14047] select: initializing btl component vader
[loki:14049] mca: base: components_register: found loaded component sm
[loki:14049] mca: base: components_register: component sm register function successful
[loki:14049] mca: base: components_register: found loaded component tcp
[loki:14046] select: init of component sm returned success
[loki:14046] select: initializing btl component tcp
[loki:14049] mca: base: components_register: component tcp register function successful
[loki:14049] mca: base: components_register: found loaded component vader
[loki:14047] select: init of component vader returned success
[loki:14048] select: init of component vader returned success
[loki:14049] mca: base: components_register: component vader register function successful
[loki:14049] mca: base: components_open: opening btl components
[loki:14049] mca: base: components_open: found loaded component self
[loki:14049] mca: base: components_open: component self open function successful
[loki:14049] mca: base: components_open: found loaded component sm
[loki:14049] mca: base: components_open: component sm open function successful
[loki:14049] mca: base: components_open: found loaded component tcp
[loki:14049] mca: base: components_open: component tcp open function successful
[loki:14049] mca: base: components_open: found loaded component vader
[loki:14049] mca: base: components_open: component vader open function successful
[loki:14049] select: initializing btl component self
[loki:14049] select: init of component self returned success
[loki:14049] select: initializing btl component sm
[loki:14046] select: init of component tcp returned success
[loki:14046] select: initializing btl component vader
[loki:14049] select: init of component sm returned success
[loki:14049] select: initializing btl component tcp
[loki:14049] select: init of component tcp returned success
[loki:14049] select: initializing btl component vader
[loki:14046] select: init of component vader returned success
[loki:14049] select: init of component vader returned success
[loki:14048] mca: bml: Using self btl for send to [[43365,1],2] on node loki
[loki:14046] mca: bml: Using self btl for send to [[43365,1],0] on node loki
[loki:14049] mca: bml: Using self btl for send to [[43365,1],3] on node loki
[loki:14047] mca: bml: Using self btl for send to [[43365,1],1] on node loki
[loki:14046] mca: bml: Using vader btl for send to [[43365,1],1] on node loki
[loki:14046] mca: bml: Using vader btl for send to [[43365,1],2] on node loki
[loki:14046] mca: bml: Using vader btl for send to [[43365,1],3] on node loki
[loki:14048] mca: bml: Using vader btl for send to [[43365,1],0] on node loki
[loki:14048] mca: bml: Using vader btl for send to [[43365,1],1] on node loki
[loki:14048] mca: bml: Using vader btl for send to [[43365,1],3] on node loki
[loki:14047] mca: bml: Using vader btl for send to [[43365,1],0] on node loki
[loki:14047] mca: bml: Using vader btl for send to [[43365,1],2] on node loki
[loki:14047] mca: bml: Using vader btl for send to [[43365,1],3] on node loki
[loki:14049] mca: bml: Using vader btl for send to [[43365,1],0] on node loki
[loki:14049] mca: bml: Using vader btl for send to [[43365,1],1] on node loki
[loki:14049] mca: bml: Using vader btl for send to [[43365,1],2] on node loki
Slave process 0 of 4 running on loki
Slave process 2 of 4 running on loki
Slave process 1 of 4 running on loki
Slave process 3 of 4 running on loki
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 1: argv[0]: spawn_slave
[loki:14048] mca: base: close: component self closed
[loki:14048] mca: base: close: unloading component self
[loki:14048] mca: base: close: component sm closed
[loki:14048] mca: base: close: unloading component sm
[loki:14046] mca: base: close: component self closed
[loki:14046] mca: base: close: unloading component self
[loki:14046] mca: base: close: component sm closed
[loki:14046] mca: base: close: unloading component sm
[loki:14048] mca: base: close: component tcp closed
[loki:14048] mca: base: close: unloading component tcp
[loki:14046] mca: base: close: component tcp closed
[loki:14046] mca: base: close: unloading component tcp
[loki:14048] mca: base: close: component vader closed
[loki:14048] mca: base: close: unloading component vader
[loki:14049] mca: base: close: component self closed
[loki:14049] mca: base: close: unloading component self
[loki:14047] mca: base: close: component self closed
[loki:14047] mca: base: close: unloading component self
[loki:14046] mca: base: close: component vader closed
[loki:14046] mca: base: close: unloading component vader
[loki:14049] mca: base: close: component sm closed
[loki:14049] mca: base: close: unloading component sm
[loki:14047] mca: base: close: component sm closed
[loki:14047] mca: base: close: unloading component sm
[loki:14049] mca: base: close: component tcp closed
[loki:14049] mca: base: close: unloading component tcp
[loki:14047] mca: base: close: component tcp closed
[loki:14047] mca: base: close: unloading component tcp
[loki:14049] mca: base: close: component vader closed
[loki:14049] mca: base: close: unloading component vader
[loki:14047] mca: base: close: component vader closed
[loki:14047] mca: base: close: unloading component vader
loki spawn 112

Kind regards

Siegmar