Discussion:
[OMPI users] malloc related crash inside openmpi
Noam Bernstein
2016-11-17 20:22:19 UTC
Hi - we’ve started seeing over the last few days crashes and hangs in openmpi, in a code that hasn’t been touched in months, and an openmpi installation (v. 1.8.5) that also hasn’t been touched in months. The symptoms are either a hang, with a stack trace (from attaching to the one running process that’s got 0% CPU usage) that looks like this:
(gdb) where
#0 0x000000358980f00d in nanosleep () from /lib64/libpthread.so.0
#1 0x00002af19a8758de in opal_memory_ptmalloc2_free () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#2 0x0000000002bca106 in for__free_vm ()
#3 0x0000000002b8cf62 in for__exit_handler ()
#4 0x0000000002b89782 in for__issue_diagnostic ()
#5 0x0000000002b90a50 in for__signal_handler ()
#6 <signal handler called>
#7 0x00002af19a8746fc in malloc_consolidate () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#8 0x00002af19a876e69 in opal_memory_ptmalloc2_int_malloc () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#9 0x00002af19a877c4f in opal_memory_ptmalloc2_int_memalign () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#10 0x00002af19a8788a3 in opal_memory_ptmalloc2_memalign () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#11 0x00002af19a29e0f4 in ompi_free_list_grow () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#12 0x00002af1a0718546 in append_frag_to_list () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#13 0x00002af1a0718cbe in match_one () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#14 0x00002af1a07190f3 in mca_pml_ob1_recv_frag_callback_match () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#15 0x00002af19fab4a48 in btl_openib_handle_incoming () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#16 0x00002af19fab5e1f in poll_device () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#17 0x00002af19fab618c in btl_openib_component_progress () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#18 0x00002af19a801f8a in opal_progress () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#19 0x00002af19a2b7a0d in ompi_request_default_wait_all () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#20 0x00002af1a17afef2 in ompi_coll_tuned_sendrecv_nonzero_actual () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_coll_tuned.so
#21 0x00002af1a17b7542 in ompi_coll_tuned_alltoallv_intra_pairwise () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_coll_tuned.so
#22 0x00002af19a2c9419 in PMPI_Alltoallv () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#23 0x00002af19a05f2a2 in pmpi_alltoallv__ () from /usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.2
#24 0x0000000000416213 in m_alltoall_i (comm=..., xsnd=..., psnd=Cannot access memory at address 0x51
) at mpi.F:1906
#25 0x00000000029ca135 in mapset (grid=...) at fftmpi_map.F:267
#26 0x0000000002a15c62 in vamp () at main.F:2002
#27 0x000000000041281e in main ()
#28 0x000000358941ed1d in __libc_start_main () from /lib64/libc.so.6
#29 0x0000000000412729 in _start ()
(gdb) quit
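
In case it’s useful, here’s roughly how I grab that kind of trace: find the one rank that’s sitting at 0% CPU and attach gdb to it. The exact commands below are only illustrative (PID included):

top -b -n 1 | grep vasp   # spot the rank stuck at 0% CPU
gdb -p 12345              # attach to that rank
(gdb) where               # backtrace as shown above
(gdb) detach
(gdb) quit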

Or a segfault that looks like this:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
vasp.gamma_para.i 0000000002C7B031 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002C7916B Unknown Unknown Unknown
vasp.gamma_para.i 0000000002BECFF4 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002BECE06 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002B89827 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002B90A50 Unknown Unknown Unknown
libpthread-2.12.s 0000003FED60F7E0 Unknown Unknown Unknown
libopen-pal.so.6. 00002AF7775346FC Unknown Unknown Unknown
libopen-pal.so.6. 00002AF777536E69 opal_memory_ptmal Unknown Unknown
libopen-pal.so.6. 00002AF777537C4F opal_memory_ptmal Unknown Unknown
libopen-pal.so.6. 00002AF7775388A3 opal_memory_ptmal Unknown Unknown
libmlx4-rdmav2.so 00002AF77EE87242 Unknown Unknown Unknown
libmlx4-rdmav2.so 00002AF77EE8979F Unknown Unknown Unknown
libmlx4-rdmav2.so 00002AF77EE89AD6 Unknown Unknown Unknown
libibverbs.so.1.0 00002AF77CBFFDD2 ibv_create_qp Unknown Unknown
mca_btl_openib.so 00002AF77C7D15C5 Unknown Unknown Unknown
mca_btl_openib.so 00002AF77C7D4088 Unknown Unknown Unknown
mca_btl_openib.so 00002AF77C7C6CAD mca_btl_openib_en Unknown Unknown
mca_pml_ob1.so 00002AF77D42D7F6 mca_pml_ob1_send_ Unknown Unknown
mca_pml_ob1.so 00002AF77D424279 mca_pml_ob1_isend Unknown Unknown
mca_coll_tuned.so 00002AF77E4BDECB ompi_coll_tuned_s Unknown Unknown
mca_coll_tuned.so 00002AF77E4C5542 ompi_coll_tuned_a Unknown Unknown
libmpi.so.1.6.0 00002AF776F89419 PMPI_Alltoallv Unknown Unknown
libmpi_mpifh.so.2 00002AF776D1F2A2 pmpi_alltoallv_ Unknown Unknown
vasp.gamma_para.i 0000000000416213 m_alltoall_i_ 1906 mpi.F
vasp.gamma_para.i 00000000029CA135 mapset_.R 267 fftmpi_map.F
vasp.gamma_para.i 0000000002A15C62 MAIN__ 2002 main.F
vasp.gamma_para.i 000000000041281E Unknown Unknown Unknown
libc-2.12.so 0000003FED21ED1D __libc_start_main Unknown Unknown
vasp.gamma_para.i 0000000000412729 Unknown Unknown Unknown

This is on a Linux InfiniBand system, using CentOS 6 and the CentOS built-in OFED. It’s possible that the crashes only started after a recent kernel update.

I’m in the process of recompiling openmpi 1.8.8 and the mpi-using code (vasp 5.4.1), just to make sure everything’s clean, but I was just wondering if anyone had any ideas as to what might even be causing this kind of behavior, or what other information might be useful for me to gather to figure out what’s going on. As I implied at the top, this setup’s been working well for years, and I believe entirely untouched (the openmpi library and executable, I mean, since we did just have a kernel update) for far longer than these crashes.

thanks,
Noam
Noam Bernstein
2016-11-23 18:44:43 UTC
Post by Noam Bernstein
.
.
.
.
I’m in the process of recompiling openmpi 1.8.8 and the mpi-using code (vasp 5.4.1), just to make sure everything’s clean, but I was just wondering if anyone had any ideas as to what might even be causing this kind of behavior, or what other information might be useful for me to gather to figure out what’s going on. As I implied at the top, this setup’s been working well for years, and I believe entirely untouched (the openmpi library and executable, I mean, since we did just have a kernel update) for far longer than these crashes.
No one has any suggestions about this problem? I tried openmpi 1.8.8, and a newer version of Mellanox’s OFED, and the behavior is the same.

Does anyone who knows the guts of MPI have any idea whether this even looks like an Open MPI problem (as opposed to lower level, i.e. InfiniBand drivers, or higher level, i.e. the calling code), from the stack traces I posted earlier?

Noam

Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
George Bosilca
2016-11-23 20:02:18 UTC
Noam,

I do not recall exactly which version of Open MPI was affected, but we had
some issues with the non-reentrancy of our memory allocator. More recent
versions (1.10 and 2.0) will not have this issue. Can you update to a newer
version of Open MPI (1.10 or maybe 2.0) and see if you can reproduce it?

Thanks,
George.



On Wed, Nov 23, 2016 at 11:44 AM, Noam Bernstein <
Post by Noam Bernstein
Hi - we’ve started seeing over the last few days crashes and hangs in
openmpi, in a code that hasn’t been touched in months, and an openmpi
installation (v. 1.8.5) that also hasn’t been touched in months. The
symptoms are either a hang, with a stack trace (from attaching to the one
.
.
.
.
I’m in the process of recompiling openmpi 1.8.8 and the mpi-using code
(vasp 5.4.1), just to make sure everything’s clean, but I was just
wondering if anyone had any ideas as to what might even be causing this
kind of behavior, or what other information might be useful for me to
gather to figure out what’s going on. As I implied at the top, this
setup’s been working well for years, and I believe entirely untouched (the
openmpi library and executable, I mean, since we did just have a kernel
update) for far longer than these crashes.
No one has any suggestions about this problem? I tried openmpi 1.8.8, and
a newer version of Mellanox’s OFED, and the behavior is the same.
Does anyone who knows the guts of MPI have any idea whether this even
looks like an Open MPI problem (as opposed to lower level, i.e. InfiniBand
drivers, or higher level, i.e. the calling code), from the stack traces I
posted earlier?
Noam
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Noam Bernstein
2016-11-23 20:08:35 UTC
Noam,
I do not recall exactly which version of Open MPI was affected, but we had some issues with the non-reentrancy of our memory allocator. More recent versions (1.10 and 2.0) will not have this issue. Can you update to a newer version of Open MPI (1.10 or maybe 2.0) and see if you can reproduce it?
Interesting. I just tried 2.0.1 and it does seem to have fixed the problem, although it’s so far from deterministic that I can’t say this with full confidence yet.

Is there any general advice on the merits of going to 1.10 vs. 2.0 (from 1.8)?

thanks,
Noam


Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
Noam Bernstein
2016-11-23 20:21:01 UTC
Post by Noam Bernstein
Noam,
I do not recall exactly which version of Open MPI was affected, but we had some issues with the non-reentrancy of our memory allocator. More recent versions (1.10 and 2.0) will not have this issue. Can you update to a newer version of Open MPI (1.10 or maybe 2.0) and see if you can reproduce it?
Interesting. I just tried 2.0.1 and it does seem to have fixed the problem, although it’s so far from deterministic that I can’t say this with full confidence yet.
No, I spoke too soon. It fails in the same way with 2.0.1. I guess I’ll try 1.10 just in case.

Noam


Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
George Bosilca
2016-11-23 20:45:41 UTC
Thousands of reasons ;)

https://raw.githubusercontent.com/open-mpi/ompi/v2.x/NEWS

George.
Post by George Bosilca
Noam,
I do not recall exactly which version of Open MPI was affected, but we had
some issues with the non-reentrancy of our memory allocator. More recent
versions (1.10 and 2.0) will not have this issue. Can you update to a newer
version of Open MPI (1.10 or maybe 2.0) and see if you can reproduce it?
Interesting. I just tried 2.0.1 and it does seem to have fixed the
problem, although it’s so far from deterministic that I can’t say this with
full confidence yet.
Is there any general advice on the merits of going to 1.10 vs. 2.0 (from 1.8)?
thanks,
Noam
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Noam Bernstein
2016-11-23 22:21:54 UTC
Post by George Bosilca
Thousands of reasons ;)
Still trying to check if 2.0.1 fixes the problem, and discovered that earlier runs weren’t actually using the version I intended. When I do use 2.0.1, I get the following errors:
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: compute-1-35
Framework: ess
Component: pmi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_base_open failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

I’ve confirmed that, within the job script, mpirun’s PATH and LD_LIBRARY_PATH are pointing to the 2.0.1 version of things. The configure line is the same one I’ve used for 1.8.x, i.e.
export CC=gcc
export CXX=g++
export F77=ifort
export FC=ifort

./configure \
--prefix=${DEST} \
--with-tm=/usr/local/torque \
--enable-mpirun-prefix-by-default \
--with-verbs=/usr \
--with-verbs-libdir=/usr/lib64
Followed by “make install”. Any suggestions for getting 2.0.1 working?

thanks,
Noam

Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
r***@open-mpi.org
2016-11-23 22:26:52 UTC
It looks like the library may not have been fully installed on that node - can you see if the prefix location is present, and that the LD_LIBRARY_PATH on that node is correctly set? The referenced component did not exist prior to the 2.0 series, so I’m betting that your LD_LIBRARY_PATH isn’t correct on that node.
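
Something along these lines, run from the node where mpirun is launched, should check both (the prefix below is only a placeholder for wherever your 2.0.1 install actually lives, and note that a non-interactive ssh may not pick up exactly the same environment as your job script):

ssh compute-1-35 'echo $LD_LIBRARY_PATH'
ssh compute-1-35 'ls /path/to/openmpi-2.0.1/lib/openmpi | head'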
Post by Noam Bernstein
Post by George Bosilca
Thousands of reasons ;)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: compute-1-35
Framework: ess
Component: pmi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
orte_ess_base_open failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
I’ve confirmed that, within the job script, mpirun’s PATH and LD_LIBRARY_PATH are pointing to the 2.0.1 version of things. The configure line is the same one I’ve used for 1.8.x, i.e.
export CC=gcc
export CXX=g++
export F77=ifort
export FC=ifort
./configure \
--prefix=${DEST} \
--with-tm=/usr/local/torque \
--enable-mpirun-prefix-by-default \
--with-verbs=/usr \
--with-verbs-libdir=/usr/lib64
Followed by “make install”. Any suggestions for getting 2.0.1 working?
thanks,
Noam
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Noam Bernstein
2016-11-23 22:31:25 UTC
Post by r***@open-mpi.org
It looks like the library may not have been fully installed on that node - can you see if the prefix location is present, and that the LD_LIBRARY_PATH on that node is correctly set? The referenced component did not exist prior to the 2.0 series, so I’m betting that your LD_LIBRARY_PATH isn’t correct on that node.
The LD_LIBRARY_PATH is definitely correct on the node that’s running mpirun (I checked that), and the openmpi directory is supposedly NFS-mounted everywhere. I suppose the installation may not have fully worked and I didn’t notice. What’s the name of the library it’s looking for?

Noam


Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
r***@open-mpi.org
2016-11-24 15:52:36 UTC
Just to be clear: are you saying that mpirun exits with that message? Or is your application process exiting with it?

There is no reason for mpirun to be looking for that library.

The library in question is in the <prefix>/lib/openmpi directory, and is named mca_ess_pmi.[la,so]
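
For what it’s worth, a quick way to see whether that component made it into a given install tree and is visible from a particular node would be something like the following (again, the prefix is just a placeholder):

ssh compute-1-35 'ls -l /path/to/openmpi-2.0.1/lib/openmpi/mca_ess_pmi*'
ssh compute-1-35 '/path/to/openmpi-2.0.1/bin/ompi_info | grep "MCA ess"'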
Post by Noam Bernstein
Post by r***@open-mpi.org
It looks like the library may not have been fully installed on that node - can you see if the prefix location is present, and that the LD_LIBRARY_PATH on that node is correctly set? The referenced component did not exist prior to the 2.0 series, so I’m betting that your LD_LIBRARY_PATH isn’t correct on that node.
The LD_LIBRARY_PATH is definitely correct on the node that’s running mpirun (I checked that), and the openmpi directory is supposedly NFS-mounted everywhere. I suppose the installation may not have fully worked and I didn’t notice. What’s the name of the library it’s looking for?
Noam
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Noam Bernstein
2016-11-25 16:20:50 UTC
Post by r***@open-mpi.org
Just to be clear: are you saying that mpirun exits with that message? Or is your application process exiting with it?
There is no reason for mpirun to be looking for that library.
The library in question is in the <prefix>/lib/openmpi directory, and is named mca_ess_pmi.[la,so]
Looks like this openmpi 2 crash was a matter of not using the correctly linked executable on all nodes. Now that that’s straightened out, I think it’s all working, and it apparently even fixed my malloc-related crash, so perhaps the allocator fix in 2.0.1 really is addressing the problem.

Thank you all for the help.

Noam
Jeff Squyres (jsquyres)
2016-11-28 11:18:08 UTC
Post by Noam Bernstein
Looks like this openmpi 2 crash was a matter of not using the correctly linked executable on all nodes. Now that that’s straightened out, I think it’s all working, and it apparently even fixed my malloc-related crash, so perhaps the allocator fix in 2.0.1 really is addressing the problem.
Glad you got it working!

One final note: the error message you saw is typical when there's more than one version of Open MPI installed into the same directory tree. Check out this FAQ item for more detail:

https://www.open-mpi.org/faq/?category=building#install-overwrite
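
If that is what happened here, the remedy described there is essentially not to install one version on top of another: either remove the old tree first or give each version its own prefix. Roughly (reusing your ${DEST} from the configure line, and only if that directory holds nothing besides the old Open MPI install):

rm -rf ${DEST}
./configure --prefix=${DEST} ...   # same options as before
make all install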

--
Jeff Squyres
***@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/