Noam Bernstein
2016-11-17 20:22:19 UTC
Hi - over the last few days we've started seeing crashes and hangs in openmpi, in a code that hasn't been touched in months, with an openmpi installation (v. 1.8.5) that also hasn't been touched in months. The symptoms are either a hang, with a stack trace (from attaching gdb to the one running process that's at 0% CPU usage) that looks like this:
(gdb) where
#0 0x000000358980f00d in nanosleep () from /lib64/libpthread.so.0
#1 0x00002af19a8758de in opal_memory_ptmalloc2_free () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#2 0x0000000002bca106 in for__free_vm ()
#3 0x0000000002b8cf62 in for__exit_handler ()
#4 0x0000000002b89782 in for__issue_diagnostic ()
#5 0x0000000002b90a50 in for__signal_handler ()
#6 <signal handler called>
#7 0x00002af19a8746fc in malloc_consolidate () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#8 0x00002af19a876e69 in opal_memory_ptmalloc2_int_malloc () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#9 0x00002af19a877c4f in opal_memory_ptmalloc2_int_memalign () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#10 0x00002af19a8788a3 in opal_memory_ptmalloc2_memalign () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#11 0x00002af19a29e0f4 in ompi_free_list_grow () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#12 0x00002af1a0718546 in append_frag_to_list () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#13 0x00002af1a0718cbe in match_one () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#14 0x00002af1a07190f3 in mca_pml_ob1_recv_frag_callback_match () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#15 0x00002af19fab4a48 in btl_openib_handle_incoming () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#16 0x00002af19fab5e1f in poll_device () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#17 0x00002af19fab618c in btl_openib_component_progress () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#18 0x00002af19a801f8a in opal_progress () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#19 0x00002af19a2b7a0d in ompi_request_default_wait_all () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#20 0x00002af1a17afef2 in ompi_coll_tuned_sendrecv_nonzero_actual () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_coll_tuned.so
#21 0x00002af1a17b7542 in ompi_coll_tuned_alltoallv_intra_pairwise () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_coll_tuned.so
#22 0x00002af19a2c9419 in PMPI_Alltoallv () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#23 0x00002af19a05f2a2 in pmpi_alltoallv__ () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.2
#24 0x0000000000416213 in m_alltoall_i (comm=..., xsnd=..., psnd=Cannot access memory at address 0x51
) at mpi.F:1906
#25 0x00000000029ca135 in mapset (grid=...) at fftmpi_map.F:267
#26 0x0000000002a15c62 in vamp () at main.F:2002
#27 0x000000000041281e in main ()
#28 0x000000358941ed1d in __libc_start_main () from /lib64/libc.so.6
#29 0x0000000000412729 in _start ()
(gdb) quit
Or a segfault that looks like this:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
vasp.gamma_para.i 0000000002C7B031 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002C7916B Unknown Unknown Unknown
vasp.gamma_para.i 0000000002BECFF4 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002BECE06 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002B89827 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002B90A50 Unknown Unknown Unknown
libpthread-2.12.s 0000003FED60F7E0 Unknown Unknown Unknown
libopen-pal.so.6. 00002AF7775346FC Unknown Unknown Unknown
libopen-pal.so.6. 00002AF777536E69 opal_memory_ptmal Unknown Unknown
libopen-pal.so.6. 00002AF777537C4F opal_memory_ptmal Unknown Unknown
libopen-pal.so.6. 00002AF7775388A3 opal_memory_ptmal Unknown Unknown
libmlx4-rdmav2.so 00002AF77EE87242 Unknown Unknown Unknown
libmlx4-rdmav2.so 00002AF77EE8979F Unknown Unknown Unknown
libmlx4-rdmav2.so 00002AF77EE89AD6 Unknown Unknown Unknown
libibverbs.so.1.0 00002AF77CBFFDD2 ibv_create_qp Unknown Unknown
mca_btl_openib.so 00002AF77C7D15C5 Unknown Unknown Unknown
mca_btl_openib.so 00002AF77C7D4088 Unknown Unknown Unknown
mca_btl_openib.so 00002AF77C7C6CAD mca_btl_openib_en Unknown Unknown
mca_pml_ob1.so 00002AF77D42D7F6 mca_pml_ob1_send_ Unknown Unknown
mca_pml_ob1.so 00002AF77D424279 mca_pml_ob1_isend Unknown Unknown
mca_coll_tuned.so 00002AF77E4BDECB ompi_coll_tuned_s Unknown Unknown
mca_coll_tuned.so 00002AF77E4C5542 ompi_coll_tuned_a Unknown Unknown
libmpi.so.1.6.0 00002AF776F89419 PMPI_Alltoallv Unknown Unknown
libmpi_mpifh.so.2 00002AF776D1F2A2 pmpi_alltoallv_ Unknown Unknown
vasp.gamma_para.i 0000000000416213 m_alltoall_i_ 1906 mpi.F
vasp.gamma_para.i 00000000029CA135 mapset_.R 267 fftmpi_map.F
vasp.gamma_para.i 0000000002A15C62 MAIN__ 2002 main.F
vasp.gamma_para.i 000000000041281E Unknown Unknown Unknown
libc-2.12.so 0000003FED21ED1D __libc_start_main Unknown Unknown
vasp.gamma_para.i 0000000000412729 Unknown Unknown Unknown
This is on a Linux InfiniBand system, using CentOS 6 and the OFED stack bundled with CentOS. It's possible that the crashes only started after a recent kernel update.
I'm in the process of recompiling openmpi 1.8.8 and the mpi-using code (vasp 5.4.1), just to make sure everything's clean, but I was wondering if anyone had any ideas as to what might be causing this kind of behavior, or what other information would be useful for me to gather to figure out what's going on. As I implied at the top, this setup has been working well for years, and I believe it was entirely untouched (the openmpi library and executable, I mean, since we did just have a kernel update) for far longer than these crashes have been occurring.
thanks,
Noam
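
For anyone who wants to collect the same kind of trace from several stuck ranks at once, here is a minimal sketch of how the gdb attach step above could be scripted. It assumes gdb is installed on the compute node and that you have permission to attach to the processes. The function names are hypothetical helpers, not part of openmpi or vasp; the only external interface used is gdb's own `-p`, `-batch`, and `-ex` options.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, prints the stack, and exits.

    Hypothetical helper: -p attaches to a running process, -batch makes gdb
    exit after running its commands, and -ex supplies the command to run
    ("where", as in the interactive session above).
    """
    return ["gdb", "-p", str(int(pid)), "-batch", "-ex", "where"]


def collect_backtrace(pid, timeout=60):
    """Run gdb against `pid` and return whatever it printed on stdout."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling `collect_backtrace(pid)` for each MPI rank on a node (for example, for every PID reported by `pgrep` for the executable) gives one trace per rank, which makes it easier to see whether all ranks are stuck in the same call or only one.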