Discussion: [OMPI users] [open-mpi/ompi] pmix use of strnlen is not portable (#1771)
Siegmar Gross
2016-06-09 17:44:24 UTC
Hi Ralph,
Closed #1771 <https://github.com/open-mpi/ompi/issues/1771> via #1772 <https://github.com/open-mpi/ompi/pull/1772>.
Thank you very much for your help. Now I have new problems
with the same program on my SPARC and x86_64 Solaris machines.

tyr hello_1 106 ompi_info | grep -e "OPAL repo revision:" -e "C compiler absolute:"
OPAL repo revision: dev-4251-g1f651d1
C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc

tyr hello_1 107 mpiexec -np 2 hello_1_mpi
[tyr:08647] *** Process received signal ***
[tyr:08647] Signal: Bus Error (10)
[tyr:08647] Signal code: Invalid address alignment (1)
[tyr:08647] Failing at address: 1001c94eb
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x2c
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/libopen-pal.so.0.0.0:0xdefa4
/lib/sparcv9/libc.so.1:0xd8b98
/lib/sparcv9/libc.so.1:0xcc70c
/lib/sparcv9/libc.so.1:0xcc918
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/openmpi/mca_pmix_pmix114.so:0x8c800 [ Signal 10 (BUS)]
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/openmpi/mca_pmix_pmix114.so:0x8cba4
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/openmpi/mca_pmix_pmix114.so:0x8de10
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/libopen-pal.so.0.0.0:0xee62c
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/libopen-pal.so.0.0.0:0xee948
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/libopen-pal.so.0.0.0:opal_libevent2022_event_base_loop+0x310
/export2/prog/SunOS_sparc/openmpi-master_64_gcc/lib64/openmpi/mca_pmix_pmix114.so:0x4b7f4
/lib/sparcv9/libc.so.1:0xd8a6c
[tyr:08647] *** End of error message ***
Bus error

tyr hello_1 108 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
GNU gdb (GDB) 7.6.1
...
Reading symbols from /export2/prog/SunOS_sparc/openmpi-master_64_gcc/bin/orterun...done.
(gdb) set args -np 2 hello_1_mpi
(gdb) r
Starting program: /usr/local/openmpi-master_64_gcc/bin/mpiexec -np 2 hello_1_mpi
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP 2 ]
[New LWP 3 ]
[New LWP 4 ]
[New LWP 5 ]
[New Thread 3 (LWP 3)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 3 (LWP 3)]
0xffffffff7988c800 in parse_connect_ack (msg=0x1001c97bb "", len=13, nspace=0xffffffff797fbac0,
rank=0xffffffff797fbaa8, version=0xffffffff797fbab8, cred=0xffffffff797fbab0)
at ../../../../../../openmpi-dev-4251-g1f651d1/opal/mca/pmix/pmix114/pmix/src/server/pmix_server_listener.c:332
332 *rank = *(int *)msg;
(gdb) bt
#0 0xffffffff7988c800 in parse_connect_ack (msg=0x1001c97bb "", len=13,
nspace=0xffffffff797fbac0, rank=0xffffffff797fbaa8, version=0xffffffff797fbab8,
cred=0xffffffff797fbab0)
at ../../../../../../openmpi-dev-4251-g1f651d1/opal/mca/pmix/pmix114/pmix/src/server/pmix_server_listener.c:332
#1 0xffffffff7988cbac in pmix_server_authenticate (sd=29, out_rank=0xffffffff797fbc0c,
peer=0xffffffff797fbc10)
at ../../../../../../openmpi-dev-4251-g1f651d1/opal/mca/pmix/pmix114/pmix/src/server/pmix_server_listener.c:403
#2 0xffffffff7988de18 in connection_handler (sd=-1, flags=4, cbdata=0x1001cdc30)
at ../../../../../../openmpi-dev-4251-g1f651d1/opal/mca/pmix/pmix114/pmix/src/server/pmix_server_listener.c:564
#3 0xffffffff7ecee634 in event_process_active_single_queue ()
from /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0
#4 0xffffffff7ecee950 in event_process_active ()
from /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0
#5 0xffffffff7ecef22c in opal_libevent2022_event_base_loop ()
from /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0
#6 0xffffffff7984b7fc in progress_engine (obj=0x1001bb0b0)
at ../../../../../../openmpi-dev-4251-g1f651d1/opal/mca/pmix/pmix114/pmix/src/util/progress_threads.c:52
#7 0xffffffff7d9d8a74 in _lwp_start () from /lib/sparcv9/libc.so.1
#8 0xffffffff7d9d8a74 in _lwp_start () from /lib/sparcv9/libc.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) print msg
$1 = 0x1001c97bb ""
(gdb) print (int *)msg
$2 = (int *) 0x1001c97bb
(gdb) print *(int *)msg
$3 = 0
(gdb)
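
The failing address (msg = 0x1001c97bb) is not 4-byte aligned, so the cast
dereference at pmix_server_listener.c:332 performs a misaligned load, which
SPARC reports as a bus error (x86_64 tolerates such loads, which is probably
why only the SPARC machine crashes here). Below is a minimal, alignment-safe
sketch of that read; it only illustrates the usual portable pattern (copy the
bytes instead of casting through a pointer) and is not the actual patch that
went into master:

#include <string.h>

/* Hypothetical stand-in for the read done at line 332.  "msg" may point
 * anywhere inside a received byte buffer, so it has no particular
 * alignment; memcpy() copies byte-wise and therefore never traps. */
static void extract_rank(const char *msg, int *rank)
{
    /* instead of:  *rank = *(int *)msg;   (misaligned load -> SIGBUS on SPARC) */
    memcpy(rank, msg, sizeof(*rank));
}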

sunpc1 fd1026 102 ompi_info | grep -e "OPAL repo revision:" -e "C compiler absolute:"
OPAL repo revision: dev-4251-g1f651d1
C compiler absolute: /usr/local/gcc-5.1.0/bin/gcc

sunpc1 fd1026 103 mpiexec -np 2 hello_1_mpi
[sunpc1:27530] PMIX ERROR: NOT-SUPPORTED in file ../../../../../../openmpi-dev-4251-g1f651d1/opal/mca/pmix/pmix114/pmix/src/server/pmix_server_listener.c at line 540
[sunpc1:27532] PMIX ERROR: UNREACHABLE in file ../../../../../../openmpi-dev-4251-g1f651d1/opal/mca/pmix/pmix114/pmix/src/client/pmix_client.c at line 983
[sunpc1:27532] PMIX ERROR: UNREACHABLE in file ../../../../../../openmpi-dev-4251-g1f651d1/opal/mca/pmix/pmix114/pmix/src/client/pmix_client.c at line 199
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

pmix init failed
--> Returned value Unreachable (-12) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_init failed
--> Returned value Unreachable (-12) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[sunpc1:27532] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all
other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[26724,1],0]
Exit code: 1
--------------------------------------------------------------------------
sunpc1 fd1026 104
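
On the x86_64 machine the job does not crash, but the server side reports
NOT-SUPPORTED at pmix_server_listener.c:540 and the clients then fail with
UNREACHABLE. My assumption (not something visible in the output itself) is
that line 540 is the server's peer-credential check on the listening
Unix-domain socket: Linux provides SO_PEERCRED and the BSDs provide
getpeereid(), but Solaris needs getpeerucred(), so a check written only for
the former two reports "not supported" there. A rough sketch of what a
portable check could look like (illustration only, not the PMIx code):

#define _GNU_SOURCE              /* for struct ucred / SO_PEERCRED on Linux */
#include <sys/types.h>
#include <sys/socket.h>
#if defined(__sun)
#include <ucred.h>
#endif

/* Hypothetical helper: fetch the effective uid of the peer on a connected
 * AF_UNIX socket.  Returns 0 on success, -1 if the platform has no
 * supported mechanism (which would correspond to the NOT-SUPPORTED case). */
static int peer_euid(int sd, uid_t *euid)
{
#if defined(SO_PEERCRED)                          /* Linux */
    struct ucred cred;
    socklen_t len = sizeof(cred);
    if (getsockopt(sd, SOL_SOCKET, SO_PEERCRED, &cred, &len) != 0) {
        return -1;
    }
    *euid = cred.uid;
    return 0;
#elif defined(__sun)                              /* Solaris */
    ucred_t *uc = NULL;
    if (getpeerucred(sd, &uc) != 0) {
        return -1;
    }
    *euid = ucred_geteuid(uc);
    ucred_free(uc);
    return 0;
#else
    (void)sd;
    (void)euid;
    return -1;                                    /* no mechanism available */
#endif
}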



I hope you can solve this problem as well. Thank you very much
in advance for any help.


Kind regards

Siegmar
Ralph Castain
2016-06-09 19:09:12 UTC
Just pushed the fix to master today - not required for 2.x.
_______________________________________________
users mailing list
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29421.php