[OMPI users] User-built OpenMPI 3.0.1 segfaults when storing into an atomic 128-bit variable
Martin Böhm
2018-05-03 18:48:25 UTC
Dear all,
I have a problem with a segfault on a user-built OpenMPI 3.0.1 running on Ubuntu 16.04
in local mode (only processes on a single computer).

The problem manifests itself as a segfault when allocating shared memory for
(at least) one 128-bit atomic variable (say std::atomic<__int128>) and then
storing into this variable. This error only occurs when there are at least two
processes, even though only the process of shared rank 0 is the one doing both
the allocation and the writing. I tried the compile flags -march=core2,
-march=native and -latomic; none of them helped.

An example of the code that triggers it on my computers is this:

The code works fine with mpirun -np 1 and segfaults with mpirun -np 2, 3 and 4;
if line 41 is commented out (the 128-bit atomic write), everything works fine
with -np 2 or more.

As for Ubuntu's stock package containing OpenMPI 1.10.2, the code segfaults with
"-np 2" and "-np 3" but not "-np 1" or "-np 4".

Thank you for any assistance concerning this problem. I would suspect my own
code to be the most likely culprit, since it triggers on both the stock package
and custom-built OpenMPI.

I attach the config.log.bz2 and ompi_info.log. Below I list some runs of the program
and what errors are produced.

I have tried googling and searching the mailing list
for this problem; if I missed something, I apologize.

Martin Böhm

----- Ubuntu 16.04 stock mpirun and mpic++ -----
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/bin/mpirun -np 1 ../tests/bug
ex1 success.
ex2 success.
Inserted into ex1.
Inserted into ex2.
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/bin/mpirun -np 4 ../tests/bug
Thread 2: ex1 success.
Thread 3: ex1 success.
ex1 success.
Thread 1: ex1 success.
ex2 success.
Thread 2: ex2 success.
Thread 1: ex2 success.
Thread 3: ex2 success.
Inserted into ex1.
Inserted into ex2.
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/bin/mpirun -np 2 ../tests/bug
Thread 1: ex1 success.
ex1 success.
Thread 1: ex2 success.
ex2 success.
Inserted into ex1.
[kamenice:13662] *** Process received signal ***
[kamenice:13662] Signal: Segmentation fault (11)
[kamenice:13662] Signal code: (128)
[kamenice:13662] Failing at address: (nil)
[kamenice:13662] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f31773844b0]
[kamenice:13662] [ 1] ../tests/bug[0x40d8ac]
[kamenice:13662] [ 2] ../tests/bug[0x408997]
[kamenice:13662] [ 3] ../tests/bug[0x408bf0]
[kamenice:13662] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f317736f830]
[kamenice:13662] [ 5] ../tests/bug[0x4086e9]
[kamenice:13662] *** End of error message ***

----- Ubuntu 16.04 custom-compiled OpenMPI 3.0.1, installed to /usr/local (the stock packages were uninstalled) -----
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/local/bin/mpirun -np 1 ../tests/bug
ex1 success.
ex2 success.
Inserted into ex1.
Inserted into ex2.
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/local/bin/mpirun -np 2 ../tests/bug
ex1 success.
Thread 1: ex1 success.
Inserted into ex1.
[kamenice:22794] *** Process received signal ***
ex2 success.
Thread 1: ex2 success.
[kamenice:22794] Signal: Segmentation fault (11)
[kamenice:22794] Signal code: (128)
[kamenice:22794] Failing at address: (nil)
[kamenice:22794] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7ff8bad084b0]
[kamenice:22794] [ 1] ../tests/bug[0x401010]
[kamenice:22794] [ 2] ../tests/bug[0x400d27]
[kamenice:22794] [ 3] ../tests/bug[0x400f80]
[kamenice:22794] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7ff8bacf3830]
[kamenice:22794] [ 5] ../tests/bug[0x400a79]
[kamenice:22794] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node kamenice exited on signal 11 (Segmentation fault).
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/local/bin/mpirun -np 4 --oversubscribe ../tests/bug
ex1 success.
Thread 1: ex1 success.
Thread 2: ex1 success.
Thread 3: ex1 success.
ex2 success.
Thread 1: ex2 success.
Thread 2: ex2 success.
Thread 3: ex2 success.
Inserted into ex1.
[kamenice:22728] *** Process received signal ***
[kamenice:22728] Signal: Segmentation fault (11)
[kamenice:22728] Signal code: (128)
[kamenice:22728] Failing at address: (nil)
[kamenice:22728] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f826a6294b0]
[kamenice:22728] [ 1] ../tests/bug[0x401010]
[kamenice:22728] [ 2] ../tests/bug[0x400d27]
[kamenice:22728] [ 3] ../tests/bug[0x400f80]
[kamenice:22728] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f826a614830]
[kamenice:22728] [ 5] ../tests/bug[0x400a79]
[kamenice:22728] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node kamenice exited on signal 11 (Segmentation fault).
***@kamenice:~/cl/w/b/classic/algorithm$ /usr/local/bin/mpirun -np 2 valgrind ../tests/bug
==22814== Memcheck, a memory error detector
==22814== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==22814== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==22814== Command: ../tests/bug
==22815== Memcheck, a memory error detector
==22815== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==22815== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==22815== Command: ../tests/bug
ex1 success.
Thread 1: ex1 success.
Thread 1: ex2 success.
ex2 success.
Inserted into ex1.
[kamenice:22814] *** Process received signal ***
[kamenice:22814] Signal: Segmentation fault (11)
[kamenice:22814] Signal code: (128)
[kamenice:22814] Failing at address: (nil)
[kamenice:22814] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x51704b0]
[kamenice:22814] [ 1] ../tests/bug[0x401010]
[kamenice:22814] [ 2] ../tests/bug[0x400d27]
[kamenice:22814] [ 3] ../tests/bug[0x400f80]
[kamenice:22814] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x515b830]
[kamenice:22814] [ 5] ../tests/bug[0x400a79]
[kamenice:22814] *** End of error message ***
==22814== Process terminating with default action of signal 11 (SIGSEGV)
==22814== at 0x5170428: raise (raise.c:54)
==22814== by 0x5827E4D: show_stackframe (in /usr/local/lib/libopen-pal.so.40.1.0)
==22814== by 0x51704AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==22814== by 0x40100F: std::atomic<__int128>::store(__int128, std::memory_order) (atomic:225)
==22814== by 0x400D26: shared_memory_init(int, int) (bug.cpp:41)
==22814== by 0x400F7F: main (bug.cpp:80)
==22814== HEAP SUMMARY:
==22814== in use at exit: 2,759,989 bytes in 9,014 blocks
==22814== total heap usage: 20,168 allocs, 11,154 frees, 3,820,707 bytes allocated
==22814== LEAK SUMMARY:
==22814== definitely lost: 12 bytes in 1 blocks
==22814== indirectly lost: 0 bytes in 0 blocks
==22814== possibly lost: 608 bytes in 2 blocks
==22814== still reachable: 2,759,369 bytes in 9,011 blocks
==22814== suppressed: 0 bytes in 0 blocks
==22814== Rerun with --leak-check=full to see details of leaked memory
==22814== For counts of detected and suppressed errors, rerun with: -v
==22814== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
==22815== Process terminating with default action of signal 15 (SIGTERM)
==22815== at 0x523674D: ??? (syscall-template.S:84)
==22815== by 0x583B4A7: poll (poll2.h:46)
==22815== by 0x583B4A7: poll_dispatch (poll.c:165)
==22815== by 0x5831BDE: opal_libevent2022_event_base_loop (event.c:1630)
==22815== by 0x57F210D: progress_engine (in /usr/local/lib/libopen-pal.so.40.1.0)
==22815== by 0x5FD76B9: start_thread (pthread_create.c:333)
==22815== by 0x524241C: clone (clone.S:109)
==22815== HEAP SUMMARY:
==22815== in use at exit: 2,766,405 bytes in 9,017 blocks
==22815== total heap usage: 20,167 allocs, 11,150 frees, 3,823,751 bytes allocated
==22815== LEAK SUMMARY:
==22815== definitely lost: 12 bytes in 1 blocks
==22815== indirectly lost: 0 bytes in 0 blocks
==22815== possibly lost: 608 bytes in 2 blocks
==22815== still reachable: 2,765,785 bytes in 9,014 blocks
==22815== suppressed: 0 bytes in 0 blocks
==22815== Rerun with --leak-check=full to see details of leaked memory
==22815== For counts of detected and suppressed errors, rerun with: -v
==22815== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
mpirun noticed that process rank 0 with PID 0 on node kamenice exited on signal 9 (Killed).
Joseph Schuchart
2018-05-03 20:16:07 UTC

You say that you allocate shared memory; do you mean shared memory
windows? If so, this could be the reason:
https://github.com/open-mpi/ompi/issues/4952

The alignment of memory allocated for MPI windows is not suitable for
128-bit values in Open MPI (only 8-byte alignment is guaranteed at the
moment). I have seen the alignment change depending on the number of processes.
Could you check the alignment of the memory you are trying to access?
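
One way to run such a check is to look at the base address modulo 16. The sketch below is not MPI-specific: `misalignment` and `is_aligned_for_int128` are hypothetical helper names, and the pointer passed in stands for whatever base address the shared-memory allocation returned (the original reproducer is not shown in the thread, so how that pointer is obtained is an assumption).

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes by which `p` overshoots the previous multiple of
// `alignment`; 0 means `p` is suitably aligned.
inline std::size_t misalignment(const void* p, std::size_t alignment) {
    return static_cast<std::size_t>(
        reinterpret_cast<std::uintptr_t>(p) % alignment);
}

// Hypothetical convenience wrapper: true if `p` sits on a 16-byte
// boundary, which is what a 128-bit atomic access needs.
inline bool is_aligned_for_int128(const void* p) {
    return misalignment(p, 16) == 0;
}
```

Printing `misalignment(base, 16)` on each rank right after the allocation, for several values of -np, would show whether the base address moves between a 16-byte and an 8-byte boundary as the process count changes.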

Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: ***@hlrs.de
Nathan Hjelm
2018-05-04 00:17:09 UTC
That is probably it. When there are 4 ranks there are 4 int64s just before the user data (for PSCW). With 1 rank we don't even bother; it's just malloc (16-byte aligned). With any other odd number of ranks, the user data comes after an odd number of int64s and is therefore only 8-byte aligned. There is no requirement in MPI to provide 16-byte alignment (which is required for _Atomic __int128 because of the alignment requirement of cmpxchg16b), so you have to align it yourself.
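
Aligning it yourself usually means requesting slightly more memory than the payload needs and rounding the returned base address up to the next 16-byte boundary. A minimal sketch of that arithmetic follows; `padded_size` and `align_up` are hypothetical helper names, not part of MPI or Open MPI, and the sketch assumes the alignment is a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes to request so that `bytes` of payload still fit after the base
// address has been rounded up to `alignment` (a power of two).
inline std::size_t padded_size(std::size_t bytes, std::size_t alignment) {
    return bytes + alignment - 1;
}

// Round `p` up to the next multiple of `alignment` (a power of two).
// Returns `p` unchanged if it is already aligned.
inline void* align_up(void* p, std::size_t alignment) {
    const std::uintptr_t mask = alignment - 1;
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<void*>((addr + mask) & ~mask);
}
```

With a shared window, the allocating rank would request `padded_size(sizeof(std::atomic<__int128>), 16)` bytes, and every rank would apply `align_up` to the base pointer it obtains before touching the variable. This assumes the segment is mapped at an address with the same value modulo 16 in every process, which is worth verifying rather than taking for granted.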

Post by Joseph Schuchart
You say that you allocate shared memory, do you mean shared memory windows? If so, this could be the reason: https://github.com/open-mpi/ompi/issues/4952
The alignment of memory allocated for MPI windows is not suitable for 128bit values in Open MPI (only 8-byte alignment is guaranteed atm). I have seen the alignment change depending on the number of processes. Could you check the alignment of the memory you are trying to access?
Jeff Hammond
2018-05-04 03:43:40 UTC
Given that this seems to break user experience on a relatively frequent
basis, I’d like to know the compelling reason why MPI implementers aren’t
willing to do something utterly trivial to fix it.

And don’t tell me that 16B alignment wastes memory versus 8B alignment.
Open-MPI “wastes” 4B relative to MPICH for every handle on I32LP64 systems.
The internal state associated with MPI allocations - particularly windows -
is bigger than 8B. I recall ptmalloc uses something like 32B per heap

[kamenice:22814] [ 1] ../tests/bug[0x401010]
[kamenice:22814] [ 2] ../tests/bug[0x400d27]
[kamenice:22814] [ 3] ../tests/bug[0x400f80]
[kamenice:22814] [ 4]
Post by Joseph Schuchart
Post by Martin Böhm
[kamenice:22814] [ 5] ../tests/bug[0x400a79]
[kamenice:22814] *** End of error message ***
==22814== Process terminating with default action of signal 11 (SIGSEGV)
==22814== at 0x5170428: raise (raise.c:54)
==22814== by 0x5827E4D: show_stackframe (in
Post by Joseph Schuchart
Post by Martin Böhm
==22814== by 0x51704AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==22814== by 0x40100F: std::atomic<__int128>::store(__int128,
std::memory_order) (atomic:225)
Post by Joseph Schuchart
Post by Martin Böhm
==22814== by 0x400D26: shared_memory_init(int, int) (bug.cpp:41)
==22814== by 0x400F7F: main (bug.cpp:80)
==22814== in use at exit: 2,759,989 bytes in 9,014 blocks
==22814== total heap usage: 20,168 allocs, 11,154 frees, 3,820,707
bytes allocated
Post by Joseph Schuchart
Post by Martin Böhm
==22814== definitely lost: 12 bytes in 1 blocks
==22814== indirectly lost: 0 bytes in 0 blocks
==22814== possibly lost: 608 bytes in 2 blocks
==22814== still reachable: 2,759,369 bytes in 9,011 blocks
==22814== suppressed: 0 bytes in 0 blocks
==22814== Rerun with --leak-check=full to see details of leaked memory
==22814== For counts of detected and suppressed errors, rerun with: -v
==22814== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
==22815== Process terminating with default action of signal 15 (SIGTERM)
==22815== at 0x523674D: ??? (syscall-template.S:84)
==22815== by 0x583B4A7: poll (poll2.h:46)
==22815== by 0x583B4A7: poll_dispatch (poll.c:165)
==22815== by 0x5831BDE: opal_libevent2022_event_base_loop
Post by Joseph Schuchart
Post by Martin Böhm
==22815== by 0x57F210D: progress_engine (in
Post by Joseph Schuchart
Post by Martin Böhm
==22815== by 0x5FD76B9: start_thread (pthread_create.c:333)
==22815== by 0x524241C: clone (clone.S:109)
==22815== in use at exit: 2,766,405 bytes in 9,017 blocks
==22815== total heap usage: 20,167 allocs, 11,150 frees, 3,823,751
bytes allocated
Post by Joseph Schuchart
Post by Martin Böhm
==22815== definitely lost: 12 bytes in 1 blocks
==22815== indirectly lost: 0 bytes in 0 blocks
==22815== possibly lost: 608 bytes in 2 blocks
==22815== still reachable: 2,765,785 bytes in 9,014 blocks
==22815== suppressed: 0 bytes in 0 blocks
==22815== Rerun with --leak-check=full to see details of leaked memory
==22815== For counts of detected and suppressed errors, rerun with: -v
==22815== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Post by Joseph Schuchart
Post by Martin Böhm
mpirun noticed that process rank 0 with PID 0 on node kamenice exited
on signal 9 (Killed).
Post by Joseph Schuchart
Post by Martin Böhm
users mailing list
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
users mailing list
users mailing list
Jeff Hammond
Nathan Hjelm
2018-05-04 06:23:11 UTC
Not saying we won't change the behavior. Just saying the user can't expect a particular alignment, as there is no guarantee in the standard. In Open MPI we just don't bother to align the pointer right now, so it naturally ends up 64-bit aligned. It isn't about wasting memory.

Also remember that by default the shared memory regions are contiguous across local ranks, so each rank gets a buffer alignment dictated by the sizes of the allocations specified by the prior ranks, in addition to the alignment of the rank-zero buffer.

Post by Jeff Hammond
Given that this seems to break user experience on a relatively frequent basis, I’d like to know the compelling reason why MPI implementers aren’t willing to do something utterly trivial to fix it.
And don’t tell me that 16B alignment wastes memory versus 8B alignment. Open-MPI “wastes” 4B relative to MPICH for every handle on I32LP64 systems. The internal state associated with MPI allocations - particularly windows - is bigger than 8B. I recall ptmalloc uses something like 32B per heap allocation.
Post by Nathan Hjelm
That is probably it. When there are 4 ranks there are 4 int64’s just before the user data (for PSCW). With 1 rank we don’t even bother, it’s just malloc (16-byte aligned). With any other odd number of ranks the user data is after an odd number of int64’s and is 8-byte aligned. There is no requirement in MPI to provide 16-byte alignment (which is required for _Atomic __int128 because of the alignment requirement of cmpxchg16b) so you have to align it yourself.
Post by Joseph Schuchart
You say that you allocate shared memory, do you mean shared memory windows? If so, this could be the reason: https://github.com/open-mpi/ompi/issues/4952
The alignment of memory allocated for MPI windows is not suitable for 128-bit values in Open MPI (only 8-byte alignment is guaranteed at the moment). I have seen the alignment change depending on the number of processes. Could you check the alignment of the memory you are trying to access?
Jeff Hammond
2018-05-04 18:30:22 UTC
Post by Nathan Hjelm
Not saying we won't change the behavior. Just saying the user can't expect
a particular alignment as there is no guarantee in the standard. In Open
MPI we just don't bother to align the pointer right now, so it naturally
ends up 64-bit aligned. It isn't about wasting memory.
We should add an info key for alignment. It's pretty silly we don't have
one already, given how windows are allocated.

At the very least, MPI-3 implementations should allocate all windows to
128b on x86_64 in order to allow MPI_Accumulate(MPI_COMPLEX_DOUBLE) to use

It's pretty lame for MPI_Alloc_mem to be worse than malloc. We should fix
this in the standard. MPI should not be breaking the ABI behavior when
substituted for the system allocator.
Post by Nathan Hjelm
Also remember that by default the shared memory regions are contiguous
across local ranks so each rank will get a buffer alignment dictated by the
sizes of the allocations specified by the prior ranks in addition to the
zero rank buffer alignment.
If a user is allocating std::atomic<__int128>, every element will be
128b-aligned if the base is. Noncontiguous is actually worse in that the
implementation could allocate the segment for each process with only 64b

Post by Jeff Hammond
Given that this seems to break user experience on a relatively frequent
basis, I’d like to know the compelling reason why MPI implementers aren’t
willing to do something utterly trivial to fix it.
And don’t tell me that 16B alignment wastes memory versus 8B alignment.
Open-MPI “wastes” 4B relative to MPICH for every handle on I32LP64 systems.
The internal state associated with MPI allocations - particularly windows -
is bigger than 8B. I recall ptmalloc uses something like 32B per heap allocation.
Post by Nathan Hjelm
That is probably it. When there are 4 ranks there are 4 int64’s just
before the user data (for PSCW). With 1 rank we don’t even bother, it’s just
malloc (16-byte aligned). With any other odd number of ranks the user data
is after an odd number of int64’s and is 8-byte aligned. There is no
requirement in MPI to provide 16-byte alignment (which is required for
_Atomic __int128 because of the alignment requirement of cmpxchg16b) so you
have to align it yourself.
Post by Joseph Schuchart
You say that you allocate shared memory, do you mean shared memory
windows? If so, this could be the reason: https://github.com/open-mpi/om
The alignment of memory allocated for MPI windows is not suitable for
128bit values in Open MPI (only 8-byte alignment is guaranteed atm). I have
seen the alignment change depending on the number of processes. Could you
check the alignment of the memory you are trying to access?
Jeff Hammond