Discussion:
[OMPI users] Sockets half-broken in Open MPI 2.0.2?
Alexander Supalov
2018-06-06 10:54:55 UTC
Hi everybody,

I noticed that sockets do not seem to work properly in the Open MPI version
mentioned above. Intranode runs are OK. Internode, over 100-Mbit Ethernet,
I can go only as high as 32 KiB in a simple MPI ping-pong benchmark (a rough
sketch of such a loop follows below). Before I start composing a full bug
report: is this another known issue?

Here are the diagnostics under Ubuntu 16.04 LTS:

***@pete:~/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-plain$
mpirun --prefix /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-installed -map-by node -hostfile mpi.hosts -n 2 $PWD/pingpong1
r = 0 bytes = 0 iters = 1 time = 0.0156577 lat = 0.00782887 bw = 0
r = 1 bytes = 0 iters = 1 time = 0.011045 lat = 0.00552249 bw = 0
r = 0 bytes = 1 iters = 1 time = 0.000459942 lat = 0.000229971 bw = 4348.37
r = 1 bytes = 1 iters = 1 time = 0.000268888 lat = 0.000134444 bw = 7438.04
r = 0 bytes = 2 iters = 1 time = 0.000386158 lat = 0.000193079 bw = 10358.5
r = 1 bytes = 2 iters = 1 time = 0.000253175 lat = 0.000126587 bw = 15799.3
r = 0 bytes = 4 iters = 1 time = 0.000388046 lat = 0.000194023 bw = 20616.1
r = 1 bytes = 4 iters = 1 time = 0.000235434 lat = 0.000117717 bw = 33979.8
r = 0 bytes = 8 iters = 1 time = 0.000354141 lat = 0.00017707 bw = 45179.8
r = 1 bytes = 8 iters = 1 time = 0.000240324 lat = 0.000120162 bw = 66576.8
r = 0 bytes = 16 iters = 1 time = 0.000350701 lat = 0.000175351 bw = 91245.8
r = 1 bytes = 16 iters = 1 time = 0.000184242 lat = 9.2121e-05 bw = 173685
r = 0 bytes = 32 iters = 1 time = 0.000351037 lat = 0.000175518 bw = 182317
r = 1 bytes = 32 iters = 1 time = 0.00025953 lat = 0.000129765 bw = 246600
r = 0 bytes = 64 iters = 1 time = 0.000425288 lat = 0.000212644 bw = 300973
r = 1 bytes = 64 iters = 1 time = 0.000241162 lat = 0.000120581 bw = 530764
r = 0 bytes = 128 iters = 1 time = 0.000401526 lat = 0.000200763 bw = 637568
r = 1 bytes = 128 iters = 1 time = 0.000279226 lat = 0.000139613 bw = 916820
r = 0 bytes = 256 iters = 1 time = 0.000436665 lat = 0.000218332 bw = 1.17252e+06
r = 1 bytes = 256 iters = 1 time = 0.000269657 lat = 0.000134829 bw = 1.89871e+06
r = 0 bytes = 512 iters = 1 time = 0.000496634 lat = 0.000248317 bw = 2.06188e+06
r = 1 bytes = 512 iters = 1 time = 0.000291029 lat = 0.000145514 bw = 3.51855e+06
r = 0 bytes = 1024 iters = 1 time = 0.000672843 lat = 0.000336421 bw = 3.0438e+06
r = 1 bytes = 1024 iters = 1 time = 0.000405219 lat = 0.000202609 bw = 5.05406e+06
r = 0 bytes = 2048 iters = 1 time = 0.000874569 lat = 0.000437284 bw = 4.68345e+06
r = 1 bytes = 2048 iters = 1 time = 0.000489308 lat = 0.000244654 bw = 8.37101e+06
r = 0 bytes = 4096 iters = 1 time = 0.00142215 lat = 0.000711077 bw = 5.76027e+06
r = 1 bytes = 4096 iters = 1 time = 0.000853111 lat = 0.000426556 bw = 9.6025e+06
r = 0 bytes = 8192 iters = 1 time = 0.00239346 lat = 0.00119673 bw = 6.84531e+06
r = 1 bytes = 8192 iters = 1 time = 0.00132503 lat = 0.000662515 bw = 1.2365e+07
r = 0 bytes = 16384 iters = 1 time = 0.004443 lat = 0.0022215 bw = 7.37519e+06
r = 1 bytes = 16384 iters = 1 time = 0.00255605 lat = 0.00127803 bw = 1.28198e+07
r = 0 bytes = 32768 iters = 1 time = 0.00812741 lat = 0.0040637 bw = 8.06358e+06
r = 1 bytes = 32768 iters = 1 time = 0.0046272 lat = 0.0023136 bw = 1.41632e+07
[pete:07038] *** Process received signal ***
[pete:07038] Signal: Segmentation fault (11)
[pete:07038] Signal code: (128)
[pete:07038] Failing at address: (nil)
[pete:07038] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f716325a390]
[pete:07038] [ 1] /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-installed/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_put+0x1b)[0x7f715a32192b]
[pete:07038] [ 2] /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-installed/lib/openmpi/mca_btl_tcp.so(+0x7eae)[0x7f715a52deae]
[pete:07038] [ 3] /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-installed/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x7f3)[0x7f7162975ef3]
[pete:07038] [ 4] /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-installed/lib/libopen-pal.so.20(opal_progress+0x101)[0x7f71629361b1]
[pete:07038] [ 5] /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-installed/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x2b5)[0x7f715a312a95]
[pete:07038] [ 6] /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-installed/lib/libmpi.so.20(PMPI_Send+0x14b)[0x7f71634d5ffb]
[pete:07038] [ 7] /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-plain/pingpong1[0x400b25]
[pete:07038] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f7162e9f830]
[pete:07038] [ 9] /home/papa/Documents/Projects/Books/Inside/MPI/Source/openmpi-2.0.2-plain/pingpong1[0x400999]
[pete:07038] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node pete exited on signal
11 (Segmentation fault).
--------------------------------------------------------------------------

Host list:

192.168.178.31
192.168.178.32

Platform:

Intel, Ubuntu 16.04 LTS on one side, Ubuntu 14.04 LTS on the other, Open
MPI 2.0.2 or 2.1.0 on both, 100-Mbit Ethernet in between.

Note that I have to map by node in order to get internode connectivity
tested; otherwise I get an intranode run, which is a bit unexpected given
the host file.
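
A possible workaround, not tested here: capping the slot count per host in
the host file should let -n 2 spill over to the second node even without
-map-by node, e.g.

192.168.178.31 slots=1
192.168.178.32 slots=1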

Best regards.

Alexander
Alexander Supalov
2018-06-06 12:51:30 UTC
Thanks. Fair enough. I will mark 2.0.2 as faulty for myself, and try the
latest version when I have time for this.

On Wed, Jun 6, 2018 at 2:40 PM, Jeff Squyres (jsquyres) via users wrote:
Alexander --
I don't know offhand if 2.0.2 was faulty in this area. We usually ask
users to upgrade to at least the latest release in a given series (e.g.,
2.0.4) because various bug fixes are included in each sub-release. It
wouldn't be much use to go through all the effort of making a proper bug
report for v2.0.2 if the issue was already fixed by v2.0.4.
On Jun 6, 2018, at 7:40 AM, Alexander Supalov wrote:
Thanks. This was not my question. I want to know if 2.0.2 was indeed
faulty in this area.
On Wed, Jun 6, 2018 at 1:22 PM, Gilles Gouaillardet wrote:
Alexander,
Note the v2.0 series is no longer supported; you should upgrade to
v3.1, v3.0, or v2.1.
You might have to force the TCP buffer sizes to 0 for optimal
performance.
IIRC: mpirun --mca btl_tcp_sndbuf_size 0 --mca btl_tcp_rcvbuf_size 0 ...
(I am AFK, so please confirm both parameter names and default values
with ompi_info --all.)
If you upgrade to the latest version, the default values are already
optimal.
Last but not least, the btl/tcp component uses all the available
interfaces by default, so you might want to first restrict it to a single
interface:
mpirun --mca btl_tcp_if_include 192.168.0.0/24 ...
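Putting the two together (with the parameter names above still to be
confirmed, and using the subnet from your host list), the combined
invocation would look something like:
mpirun --mca btl_tcp_if_include 192.168.178.0/24 --mca btl_tcp_sndbuf_size 0 --mca btl_tcp_rcvbuf_size 0 -map-by node -hostfile mpi.hosts -n 2 $PWD/pingpong1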
Hope this helps !
Gilles