Justin Luitjens
2017-06-19 22:05:44 UTC
I have an application that works on other systems, but on the system I'm currently running on I'm seeing the following crash:
[dt04:22457] *** Process received signal ***
[dt04:22457] Signal: Segmentation fault (11)
[dt04:22457] Signal code: Address not mapped (1)
[dt04:22457] Failing at address: 0x55556a1da250
[dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaab353370]
[dt04:22457] [ 1] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x50)[0x2aaaacbcf810]
[dt04:22457] [ 2] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x9b)[0x2aaaacbcff3b]
[dt04:22457] [ 3] ./hacc_tpm[0x42f068]
[dt04:22457] [ 4] ./hacc_tpm[0x42f231]
[dt04:22457] [ 5] ./hacc_tpm[0x40f64d]
[dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac30db35]
[dt04:22457] [ 7] ./hacc_tpm[0x4115cf]
[dt04:22457] *** End of error message ***
This is a CUDA app, but it doesn't use GPUDirect, so that should be irrelevant.
I'm building with gcc/5.3.0, cuda/8.0.44, and openmpi/1.10.7.
I'm running on CentOS 7 and using a vanilla Open MPI configure line: ./configure --prefix=/home/jluitjens/libs/openmpi/
Currently I'm running with just a single MPI process, but runs with multiple MPI processes fail in the same way:
mpirun --oversubscribe -np 1 ./command
What is odd is that the crash occurs around the same spot in the code, but not consistently at the same spot. The spot where the single thread is at the time of the crash is nowhere near any MPI code; the crashing code is just using malloc to allocate some memory. This makes me think the crash is caused by a thread outside the application I'm working on (perhaps in Open MPI itself), or by Open MPI hijacking malloc/free.
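To be fair, I realize a crash inside the allocator's free path doesn't prove the allocator is at fault: heap corruption anywhere in the app can surface much later inside free(). A contrived sketch of the kind of bug that would produce exactly this signature (purely illustrative, not taken from hacc_tpm):

/* Illustrative only: an overrun that trashes the allocator's chunk
 * metadata.  The crash typically happens later, inside free() --
 * which with Open MPI's interposed allocator would show up as
 * opal_memory_ptmalloc2_free, just like my backtrace. */
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *a = malloc(16);
    char *b = malloc(16);
    memset(a, 0, 32);   /* writes 16 bytes past a's block, corrupting b's header */
    free(b);            /* allocator walks corrupted metadata: abort or SIGSEGV */
    free(a);
    return 0;
}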
Does anyone have any ideas about what I could try to work around this issue?
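In the meantime, a couple of things I'm planning to try myself, on the assumption that the ptmalloc2 hooks are involved (the parameter and flag names below are from memory, so treat them as unverified). First, disabling Open MPI's Linux memory hooks at run time; my understanding is this has to be set in the environment before launch rather than via --mca:

export OMPI_MCA_memory_linux_disable=1
mpirun --oversubscribe -np 1 ./command

Second, rebuilding Open MPI without the memory manager entirely:

./configure --prefix=/home/jluitjens/libs/openmpi/ --without-memory-manager

I'll also run a single rank under valgrind to see whether the heap corruption starts earlier than the crash:

mpirun -np 1 valgrind ./command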
Thanks,
Justin