Discussion:
[OMPI users] openmpi hang observed with RoCE transport and P_Write_Indv test
Sriharsha Basavapatna via users
2017-04-13 10:48:03 UTC
Hi,

I'm seeing an issue with Open MPI version 2.0.1. The setup uses 2 nodes
with 1 process on each node, and the test case is P_Write_Indv. The problem
occurs when the test runs the 4 MB message size in NON-AGGREGATE mode:
the test just hangs at that point. Here's the exact command and options
being used, followed by the output log and the stack trace (captured with gdb):

# /usr/local/mpi/openmpi/bin/mpirun -np 2 -hostfile hostfile \
    -mca btl self,sm,openib -mca btl_openib_receive_queues P,65536,256,192,128 \
    -mca btl_openib_cpc_include rdmacm -mca orte_base_help_aggregate 0 \
    --allow-run-as-root --bind-to none --map-by node \
    /usr/local/imb/openmpi/IMB-IO -msglog 21:22 -include P_Write_Indv -time 300

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv
# #processes = 1
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
# MODE: AGGREGATE
#
 #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
      0          1000         0.56         0.56         0.56        0.00
2097152            20     31662.31     31662.31     31662.31       63.17
4194304            10     64159.89     64159.89     64159.89       62.34

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv
# #processes = 1
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
# MODE: NON-AGGREGATE
#
 #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
      0           100       570.14       570.14       570.14        0.00
2097152            20     55007.33     55007.33     55007.33       36.36
4194304            10     85838.17     85838.17     85838.17       46.60


#1  0x00007f08bf5af2a3 in poll_device () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
#2  0x00007f08bf5afe15 in btl_openib_component_progress () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
#3  0x00007f08bf55b89c in opal_progress () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
#4  0x00007f08c0294a55 in ompi_request_default_wait_all () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#5  0x00007f08c02da295 in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#6  0x00007f08c034b399 in mca_io_ompio_set_view_internal () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#7  0x00007f08c034b76c in mca_io_ompio_file_set_view () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#8  0x00007f08c02cc00f in PMPI_File_set_view () from /usr/local/mpi/openmpi/lib/libmpi.so.20
#9  0x000000000040e4fd in IMB_init_transfer ()
#10 0x0000000000405f5c in IMB_init_buffers_iter ()
#11 0x0000000000402d25 in main ()
(gdb)


I did some analysis; please see the details below:

The MPI threads on the Root node and the Peer node are hung trying to
get completions on the RQ and SQ respectively. Based on the state of the
RQ/SQ and the corresponding CQs in the device, no new completions are
expected. Here's the sequence, as per the queue status observed in the
device while the threads are stuck waiting for new completions:

Root-node (RQ)                          Peer-node (SQ)

 0 <---------------------------------  0
 .                                      .
 .         17 SEND-Inline WRs           .
 .         (17 RQ-CQEs seen)            .
16 <--------------------------------- 16

17 <--------------------------------- 17
           1 RDMA-WRITE-Inline+Signaled
           (1 SQ-CQE generated)

18 <--------------------------------- 18
 .                                      .
 .         19 RDMA-WRITE-Inlines        .
 .                                      .
36 <--------------------------------- 36


As shown in the diagram above, here is the sequence of events (Work
Requests and Completions) that occurred between the Root node and the
Peer node (see the sketch after the list):

1) The Peer node posts 17 SEND WRs with the Inline flag set
2) The Root node receives all 17 packets in its RQ
3) 17 CQEs are generated on the Root node in its RCQ
4) The Peer node posts an RDMA-WRITE WR with the Inline and Signaled flags set
5) The operation completes on the SQ and a CQE is generated in the SCQ
6) There is no CQE on the Root node, since it is an RDMA-WRITE operation
7) The Peer node posts 19 RDMA-WRITE WRs with the Inline flag but without the Signaled flag
8) No CQEs on the Peer node, since the WRs are not Signaled
9) No CQEs on the Root node, since they are RDMA-WRITEs
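
For reference, here is a minimal libibverbs sketch of how such a work
request is posted (illustrative only, not Open MPI's actual btl/openib
code; the helper name and the QP/memory-registration setup are assumed).
It shows why only the signaled WR in step 4 ever produces a CQE on the
Peer node's SCQ:

/* Illustrative sketch, not Open MPI source: post an inline RDMA WRITE,
 * optionally signaled.  QP, CQs, registered buffer (lkey) and the
 * remote address/rkey are assumed to have been set up already. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_rdma_write(struct ibv_qp *qp, void *buf, uint32_t len,
                           uint32_t lkey, uint64_t remote_addr,
                           uint32_t rkey, int signaled)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_INLINE;        /* steps 4 and 7: inline payload */
    if (signaled)
        wr.send_flags |= IBV_SEND_SIGNALED; /* only this generates an SQ CQE on
                                             * the sender (step 5); the 19 WRs in
                                             * step 7 omit it (step 8), and RDMA
                                             * WRITEs never generate CQEs on the
                                             * target side (steps 6 and 9) */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}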

At this point, the Root node is polling on its RCQ for new completions
and there aren't any, since all SENDs have already been received and
their CQEs consumed.

Similarly, the Peer node is polling on its SCQ for new completions
and there aren't any, since the 19 RDMA-WRITEs are not signaled.

The same condition occurs in the reverse direction too. That is, the
Root node issues a bunch of SENDs to the Peer node, followed by some
RDMA-WRITEs. The Peer node gets CQEs for the SENDs and then looks for
more, but there won't be any, since the subsequent operations are all
RDMA-WRITEs. The Root node itself is polling on its SCQ and won't find
any new completions, since there are no more signaled WRs.

So the two nodes are now in a hung state, polling on both the SCQ and the
RCQ, while no pending operations can generate new CQEs.
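
To make the resulting deadlock concrete, here is a simplified sketch of
the kind of CQ polling loop both ranks end up spinning in (hypothetical
code standing in for the poll_device()/opal_progress() frames in the
backtrace above, not the actual implementation):

/* Illustrative sketch only: both ranks busy-poll their CQs, but with no
 * signaled send WRs outstanding and no further SENDs in flight,
 * ibv_poll_cq() returns 0 forever on both queues, so neither rank ever
 * breaks out of the loop. */
#include <infiniband/verbs.h>

static void progress_until_completion(struct ibv_cq *scq, struct ibv_cq *rcq)
{
    struct ibv_wc wc;

    for (;;) {
        if (ibv_poll_cq(scq, 1, &wc) > 0)   /* SCQ: empty, WRs are unsignaled */
            break;
        if (ibv_poll_cq(rcq, 1, &wc) > 0)   /* RCQ: empty, no SENDs arriving  */
            break;
    }
}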

Thanks,
-Harsha
Jeff Squyres (jsquyres)
2017-04-13 11:11:51 UTC
Can you try the latest version of Open MPI? There have been bug fixes in the MPI one-sided area.

Try Open MPI v2.1.0, or v2.0.2 if you want to stick with the v2.0.x series. I think there have been some post-release one-sided fixes, too -- you may also want to try nightly snapshots on both of those branches.
--
Jeff Squyres
***@cisco.com
Sriharsha Basavapatna via users
2017-04-13 17:19:23 UTC
Hi Jeff,

The same problem is seen with Open MPI v2.1.0 too.

Thanks,
-Harsha


Sriharsha Basavapatna via users
2017-04-17 05:52:09 UTC
The problem is also observed with the nightly snapshot version:

Open MPI: 2.1.1a1
Open MPI repo revision: v2.1.0-49-g9d7e7a8
Open MPI release date: Unreleased developer copy
Open RTE: 2.1.1a1
Open RTE repo revision: v2.1.0-49-g9d7e7a8

Thanks,
-Harsha
