Martin Böhm
2018-05-03 18:48:25 UTC
Dear all,
I have a problem with a segfault on a user-built OpenMPI 3.0.1 running on Ubuntu 16.04
in local mode (all processes on a single computer).
The segfault occurs when I allocate shared memory for (at least) one 128-bit
atomic variable (say std::atomic<__int128>) and then store into this variable.
It only occurs when there are at least two processes, even though the process
of shared rank 0 is the only one doing both the allocation and the writing.
I tried the compile flags -march=core2, -march=native and -latomic; none of
them helped.
An example of code that triggers it on my computers:
https://github.com/bohm/binstretch/blob/parallel-classic/algorithm/bug.cpp
With the custom-built OpenMPI 3.0.1, the code works fine with mpirun -np 1 and
segfaults with mpirun -np 2, 3 and 4; if line 41 (the 128-bit atomic write) is
commented out, everything works fine with -np 2 or more.
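One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: whether the address at which the 128-bit atomic is placed is 16-byte aligned. On x86-64 with GCC, std::atomic<__int128> normally has an alignment requirement of 16 bytes, and a misaligned 16-byte atomic access can fault even when the memory itself is mapped. The sketch below uses hypothetical helper names (is_aligned, align_up, padded_size) that are not part of bug.cpp or of OpenMPI; it only shows how a base pointer could be tested and rounded up before constructing the atomic there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment`
// (alignment must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: round p up to the next multiple of `alignment`
// (alignment must be a power of two). Returns p itself if already aligned.
inline void* align_up(void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    addr = (addr + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
    return reinterpret_cast<void*>(addr);
}

// Hypothetical helper: number of bytes to request so that an aligned region
// of `bytes` bytes always fits, whatever the alignment of the returned base.
inline std::size_t padded_size(std::size_t bytes, std::size_t alignment) {
    return bytes + alignment - 1;
}
```

Under this assumption, one would request padded_size(sizeof(std::atomic<__int128>), 16) bytes for the shared segment, print is_aligned(base, 16) on each rank for the returned base pointer, and construct the atomic at align_up(base, 16) instead of at base. If the store then succeeds with -np 2, alignment is the likely cause; if not, this check rules it out.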
As for Ubuntu's stock package containing OpenMPI 1.10.2, the code segfaults with
"-np 2" and "-np 3" but not "-np 1" or "-np 4".
I suspect my own code is the most likely culprit, since the segfault occurs
with both the stock package and the custom-built OpenMPI.
I attach the config.log.bz2 and ompi_info.log. Below I list some runs of the program
and what errors are produced.
Thank you for any assistance. I have tried googling and searching the mailing list
for this problem; if I missed something, I apologize.
Martin Böhm
----- Ubuntu 16.04 stock mpirun and mpic++ -----
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/bin/mpirun -np 1 ../tests/bug
ex1 success.
ex2 success.
Inserted into ex1.
Inserted into ex2.
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/bin/mpirun -np 4 ../tests/bug
Thread 2: ex1 success.
Thread 3: ex1 success.
ex1 success.
Thread 1: ex1 success.
ex2 success.
Thread 2: ex2 success.
Thread 1: ex2 success.
Thread 3: ex2 success.
Inserted into ex1.
Inserted into ex2.
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/bin/mpirun -np 2 ../tests/bug
Thread 1: ex1 success.
ex1 success.
Thread 1: ex2 success.
ex2 success.
Inserted into ex1.
[kamenice:13662] *** Process received signal ***
[kamenice:13662] Signal: Segmentation fault (11)
[kamenice:13662] Signal code: (128)
[kamenice:13662] Failing at address: (nil)
[kamenice:13662] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f31773844b0]
[kamenice:13662] [ 1] ../tests/bug[0x40d8ac]
[kamenice:13662] [ 2] ../tests/bug[0x408997]
[kamenice:13662] [ 3] ../tests/bug[0x408bf0]
[kamenice:13662] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f317736f830]
[kamenice:13662] [ 5] ../tests/bug[0x4086e9]
[kamenice:13662] *** End of error message ***
----- Ubuntu 16.04 custom-compiled OpenMPI 3.0.1, installed to /usr/local (the stock packages were uninstalled) -----
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/local/bin/mpirun -np 1 ../tests/bug
ex1 success.
ex2 success.
Inserted into ex1.
Inserted into ex2.
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/local/bin/mpirun -np 2 ../tests/bug
ex1 success.
Thread 1: ex1 success.
Inserted into ex1.
[kamenice:22794] *** Process received signal ***
ex2 success.
Thread 1: ex2 success.
[kamenice:22794] Signal: Segmentation fault (11)
[kamenice:22794] Signal code: (128)
[kamenice:22794] Failing at address: (nil)
[kamenice:22794] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7ff8bad084b0]
[kamenice:22794] [ 1] ../tests/bug[0x401010]
[kamenice:22794] [ 2] ../tests/bug[0x400d27]
[kamenice:22794] [ 3] ../tests/bug[0x400f80]
[kamenice:22794] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7ff8bacf3830]
[kamenice:22794] [ 5] ../tests/bug[0x400a79]
[kamenice:22794] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node kamenice exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/local/bin/mpirun -np 4 --oversubscribe ../tests/bug
ex1 success.
Thread 1: ex1 success.
Thread 2: ex1 success.
Thread 3: ex1 success.
ex2 success.
Thread 1: ex2 success.
Thread 2: ex2 success.
Thread 3: ex2 success.
Inserted into ex1.
[kamenice:22728] *** Process received signal ***
[kamenice:22728] Signal: Segmentation fault (11)
[kamenice:22728] Signal code: (128)
[kamenice:22728] Failing at address: (nil)
[kamenice:22728] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f826a6294b0]
[kamenice:22728] [ 1] ../tests/bug[0x401010]
[kamenice:22728] [ 2] ../tests/bug[0x400d27]
[kamenice:22728] [ 3] ../tests/bug[0x400f80]
[kamenice:22728] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f826a614830]
[kamenice:22728] [ 5] ../tests/bug[0x400a79]
[kamenice:22728] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node kamenice exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/local/bin/mpirun -np 2 valgrind ../tests/bug
==22814== Memcheck, a memory error detector
==22814== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==22814== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==22814== Command: ../tests/bug
==22814==
==22815== Memcheck, a memory error detector
==22815== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==22815== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==22815== Command: ../tests/bug
==22815==
ex1 success.
Thread 1: ex1 success.
Thread 1: ex2 success.
ex2 success.
Inserted into ex1.
[kamenice:22814] *** Process received signal ***
[kamenice:22814] Signal: Segmentation fault (11)
[kamenice:22814] Signal code: (128)
[kamenice:22814] Failing at address: (nil)
[kamenice:22814] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x51704b0]
[kamenice:22814] [ 1] ../tests/bug[0x401010]
[kamenice:22814] [ 2] ../tests/bug[0x400d27]
[kamenice:22814] [ 3] ../tests/bug[0x400f80]
[kamenice:22814] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x515b830]
[kamenice:22814] [ 5] ../tests/bug[0x400a79]
[kamenice:22814] *** End of error message ***
==22814==
==22814== Process terminating with default action of signal 11 (SIGSEGV)
==22814== at 0x5170428: raise (raise.c:54)
==22814== by 0x5827E4D: show_stackframe (in /usr/local/lib/libopen-pal.so.40.1.0)
==22814== by 0x51704AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==22814== by 0x40100F: std::atomic<__int128>::store(__int128, std::memory_order) (atomic:225)
==22814== by 0x400D26: shared_memory_init(int, int) (bug.cpp:41)
==22814== by 0x400F7F: main (bug.cpp:80)
==22814==
==22814== HEAP SUMMARY:
==22814== in use at exit: 2,759,989 bytes in 9,014 blocks
==22814== total heap usage: 20,168 allocs, 11,154 frees, 3,820,707 bytes allocated
==22814==
==22814== LEAK SUMMARY:
==22814== definitely lost: 12 bytes in 1 blocks
==22814== indirectly lost: 0 bytes in 0 blocks
==22814== possibly lost: 608 bytes in 2 blocks
==22814== still reachable: 2,759,369 bytes in 9,011 blocks
==22814== suppressed: 0 bytes in 0 blocks
==22814== Rerun with --leak-check=full to see details of leaked memory
==22814==
==22814== For counts of detected and suppressed errors, rerun with: -v
==22814== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
==22815==
==22815== Process terminating with default action of signal 15 (SIGTERM)
==22815== at 0x523674D: ??? (syscall-template.S:84)
==22815== by 0x583B4A7: poll (poll2.h:46)
==22815== by 0x583B4A7: poll_dispatch (poll.c:165)
==22815== by 0x5831BDE: opal_libevent2022_event_base_loop (event.c:1630)
==22815== by 0x57F210D: progress_engine (in /usr/local/lib/libopen-pal.so.40.1.0)
==22815== by 0x5FD76B9: start_thread (pthread_create.c:333)
==22815== by 0x524241C: clone (clone.S:109)
==22815==
==22815== HEAP SUMMARY:
==22815== in use at exit: 2,766,405 bytes in 9,017 blocks
==22815== total heap usage: 20,167 allocs, 11,150 frees, 3,823,751 bytes allocated
==22815==
==22815== LEAK SUMMARY:
==22815== definitely lost: 12 bytes in 1 blocks
==22815== indirectly lost: 0 bytes in 0 blocks
==22815== possibly lost: 608 bytes in 2 blocks
==22815== still reachable: 2,765,785 bytes in 9,014 blocks
==22815== suppressed: 0 bytes in 0 blocks
==22815== Rerun with --leak-check=full to see details of leaked memory
==22815==
==22815== For counts of detected and suppressed errors, rerun with: -v
==22815== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node kamenice exited on signal 9 (Killed).
--------------------------------------------------------------------------