Discussion:
[OMPI users] Crash in libopen-pal.so
Justin Luitjens
2017-06-19 22:05:44 UTC
Permalink
I have an application that works on other systems, but on the system I'm currently running on I'm seeing the following crash:

[dt04:22457] *** Process received signal ***
[dt04:22457] Signal: Segmentation fault (11)
[dt04:22457] Signal code: Address not mapped (1)
[dt04:22457] Failing at address: 0x55556a1da250
[dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaab353370]
[dt04:22457] [ 1] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x50)[0x2aaaacbcf810]
[dt04:22457] [ 2] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x9b)[0x2aaaacbcff3b]
[dt04:22457] [ 3] ./hacc_tpm[0x42f068]
[dt04:22457] [ 4] ./hacc_tpm[0x42f231]
[dt04:22457] [ 5] ./hacc_tpm[0x40f64d]
[dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac30db35]
[dt04:22457] [ 7] ./hacc_tpm[0x4115cf]
[dt04:22457] *** End of error message ***


This is a CUDA app, but it doesn't use GPUDirect, so that should be irrelevant.

I'm building with gcc/5.3.0, cuda/8.0.44, and openmpi/1.10.7.

I'm running this on CentOS 7 with a vanilla Open MPI configure line:

./configure --prefix=/home/jluitjens/libs/openmpi/

Currently I'm trying this with just a single MPI process, but multiple MPI processes fail in the same way:

mpirun --oversubscribe -np 1 ./command

What is odd is that the crash occurs around the same spot in the code, but not consistently at the exact same spot. The place where the single thread is at the time of the crash is nowhere near any MPI code; it is simply calling malloc to allocate some memory. This makes me think the crash is caused by a thread outside of the application I'm working on (perhaps in Open MPI itself), or perhaps by Open MPI hijacking malloc/free.
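(For what it's worth, one way to check whether Open MPI's allocator hooks are interposed is to list the dynamic symbols of libopen-pal; I'm assuming here that the library exports the hooked symbols when its memory manager is built in, which I haven't verified on this install:

nm -D /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13 | grep -w -e malloc -e free

If malloc and free show up there, ordinary allocations in the application are routed through the opal_memory_ptmalloc2_* functions seen in the backtrace.)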

Does anyone have any ideas of what I could try to work around this issue?

Thanks,
Justin

Sylvain Jeaugey
2017-06-19 22:10:15 UTC
Permalink
Justin, can you try setting mpi_leave_pinned to 0 to disable ptmalloc2 and confirm this is related to ptmalloc?
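For example, something like this on the mpirun command line should work (MCA parameters can also be exported in the environment, e.g. OMPI_MCA_mpi_leave_pinned=0):

mpirun --mca mpi_leave_pinned 0 --oversubscribe -np 1 ./command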

Thanks,
Sylvain

Dmitry N. Mikushin
2017-06-19 22:19:50 UTC
Permalink
Hi Justin,

If you can build the application in debug mode, try inserting Valgrind into your MPI command. It's usually very good at tracking down the origins of failing memory allocations.
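For example, after building with -g (and ideally -O0), something like:

mpirun --oversubscribe -np 1 valgrind --track-origins=yes --log-file=valgrind.%p.log ./hacc_tpm

should report where the bad free or overwrite originates. Note that Open MPI itself tends to produce a fair number of benign Valgrind warnings, so focus on the errors that point back into your own code.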

Kind regards,
- Dmitry.