Discussion:
[OMPI users] bus error with openmpi-1.8.2 and gcc-4.9.0
Siegmar Gross
2014-09-02 10:16:17 UTC
Hi,

Yesterday I installed openmpi-1.8.2 on my machines (Solaris 10 Sparc
(tyr), Solaris 10 x86_64 (sunpc0), and openSUSE Linux 12.1 x86_64
(linpc0)) with gcc-4.9.0. A small program works on some machines,
but breaks with a bus error on Solaris 10 Sparc.


tyr small_prog 118 which mpicc
/usr/local/openmpi-1.8.2_64_gcc/bin/mpicc
tyr small_prog 119 ompi_info | grep MPI:
Open MPI: 1.8.2
tyr small_prog 120 mpiexec -np 1 --host linpc0 init_finalize
Hello!
tyr small_prog 121 mpiexec -np 1 --host sunpc0 init_finalize
Hello!
tyr small_prog 122 mpiexec -np 1 --host tyr init_finalize
[tyr:28081] *** Process received signal ***
[tyr:28081] Signal: Bus Error (10)
[tyr:28081] Signal code: Invalid address alignment (1)
[tyr:28081] Failing at address: ffffffff7fffd304
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd118
/lib/sparcv9/libc.so.1:0xd8b98
/lib/sparcv9/libc.so.1:0xcc70c
/lib/sparcv9/libc.so.1:0xcc918
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8 [ Signal 10 (BUS)]
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
/home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
/home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
[tyr:28081] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 28081 on node tyr exited on signal 10 (Bus Error).
--------------------------------------------------------------------------
tyr small_prog 123



gdb shows the following backtrace.

tyr small_prog 123 /usr/local/gdb-7.6.1_64_gcc/bin/gdb /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
GNU gdb (GDB) 7.6.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.10".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/bin/orterun...done.
(gdb) run -np 1 --host tyr init_finalize
Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 --host tyr init_finalize
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[New LWP 2 ]
[tyr:28099] *** Process received signal ***
[tyr:28099] Signal: Bus Error (10)
[tyr:28099] Signal code: Invalid address alignment (1)
[tyr:28099] Failing at address: ffffffff7fffd244
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd118
/lib/sparcv9/libc.so.1:0xd8b98
/lib/sparcv9/libc.so.1:0xcc70c
/lib/sparcv9/libc.so.1:0xcc918
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8 [ Signal 10 (BUS)]
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
/export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
/home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
/home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
[tyr:28099] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 28099 on node tyr exited on signal 10 (Bus Error).
--------------------------------------------------------------------------
[LWP 2 exited]
[New Thread 2 ]
[Switching to Thread 1 (LWP 1)]
sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
(gdb) bt
#0 0xffffffff7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
#1 0xffffffff7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
#2 0xffffffff7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
#3 0xffffffff7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
#4 0xffffffff7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
#5 0xffffffff7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
#6 0xffffffff7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
#7 0xffffffff7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
#8 0xffffffff7ec77474 in vm_close () from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
#9 0xffffffff7ec74a54 in lt_dlclose ()
from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
#10 0xffffffff7ec99b78 in ri_destructor (obj=0x1001eada0)
at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_component_repository.c:391
#11 0xffffffff7ec98490 in opal_obj_run_destructors (object=0x1001eada0)
at ../../../../openmpi-1.8.2/opal/class/opal_object.h:446
#12 0xffffffff7ec993f4 in mca_base_component_repository_release (
component=0xffffffff7b023ef0 <mca_oob_tcp_component>)
at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_component_repository.c:244
#13 0xffffffff7ec9b73c in mca_base_component_unload (
component=0xffffffff7b023ef0 <mca_oob_tcp_component>, output_id=-1)
at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_components_close.c:47
#14 0xffffffff7ec9b7d0 in mca_base_component_close (
component=0xffffffff7b023ef0 <mca_oob_tcp_component>, output_id=-1)
at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_components_close.c:60
#15 0xffffffff7ec9b8a4 in mca_base_components_close (output_id=-1,
components=0xffffffff7f12b030 <orte_oob_base_framework+80>, skip=0x0)
at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_components_close.c:86
#16 0xffffffff7ec9b80c in mca_base_framework_components_close (
framework=0xffffffff7f12afe0 <orte_oob_base_framework>, skip=0x0)
at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_components_close.c:66
#17 0xffffffff7efae0e8 in orte_oob_base_close ()
at ../../../../openmpi-1.8.2/orte/mca/oob/base/oob_base_frame.c:94
#18 0xffffffff7ecb28b4 in mca_base_framework_close (
framework=0xffffffff7f12afe0 <orte_oob_base_framework>)
at ../../../../openmpi-1.8.2/opal/mca/base/mca_base_framework.c:187
#19 0xffffffff7bf078c0 in rte_finalize ()
at ../../../../../openmpi-1.8.2/orte/mca/ess/hnp/ess_hnp_module.c:858
#20 0xffffffff7ef30924 in orte_finalize () at ../../openmpi-1.8.2/orte/runtime/orte_finalize.c:65
#21 0x00000001000070c4 in orterun (argc=6, argv=0xffffffff7fffe0e8)
at ../../../../openmpi-1.8.2/orte/tools/orterun/orterun.c:1096
#22 0x0000000100003d70 in main (argc=6, argv=0xffffffff7fffe0e8)
at ../../../../openmpi-1.8.2/orte/tools/orterun/main.c:13
(gdb)


I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.


Kind regards

Siegmar
Siegmar Gross
2014-09-02 12:20:01 UTC
Hi Takahiro,

I forgot to follow up on the previous report, sorry.
The patch I suggested is not included in Open MPI 1.8.2.
The backtrace Siegmar reported points to the problem that I fixed
in the patch.
http://www.open-mpi.org/community/lists/users/2014/08/24968.php
Could you try my patch again?

Yes, your patch solves the bus error in openmpi-1.8.2 and
openmpi-1.8.3a1r32641.


Thank you very much for your help once more

Siegmar

Open MPI 1.8 needs the custom patch that I posted. See my previous mail.
Could you review it and commit it to the v1.8 branch?

Regards,
Takahiro