Ben Menadue
2018-09-21 04:19:01 UTC
Hi,
A couple of our users have reported issues using UCX in OpenMPI 3.1.2. It's failing with this message:
[r1071:27563:0:27563] rc_verbs_iface.c:63 FATAL: send completion with error: local protection error
The actual MPI calls provoking this are different between the two applications (one is an MPI_Bcast and the other is an MPI_Waitany), but in both cases it ends up in the request wait code (ompi_request_default_wait_all or ompi_request_default_wait_any) and then into the progress engines:
0 0x00000000000373dc ucs_log_dispatch() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
1 0x00000000000368ff uct_rc_verbs_iface_poll_tx() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
2 0x00000000000368ff uct_rc_verbs_iface_progress() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
3 0x00000000000179d2 ucs_callbackq_dispatch() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
4 0x0000000000018e0a uct_worker_progress() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
5 0x00000000000050a9 mca_pml_ucx_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
6 0x000000000002b554 opal_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
7 0x000000000004a7fa sync_wait_st() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../opal/threads/wait_sync.h:83
8 0x000000000004b073 ompi_request_default_wait_all() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:237
9 0x00000000000ce548 ompi_coll_base_bcast_intra_generic() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:98
10 0x00000000000ced08 ompi_coll_base_bcast_intra_pipeline() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:280
11 0x0000000000004f28 ompi_coll_tuned_bcast_intra_dec_fixed() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:303
12 0x0000000000067b60 PMPI_Bcast() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pbcast.c:111
and
0 0x00000000000373dc ucs_log_dispatch() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
1 0x00000000000368ff uct_rc_verbs_iface_poll_tx() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
2 0x00000000000368ff uct_rc_verbs_iface_progress() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
3 0x00000000000179d2 ucs_callbackq_dispatch() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
4 0x0000000000018e0a uct_worker_progress() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
5 0x0000000000005099 mca_pml_ucx_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
6 0x000000000002b554 opal_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
7 0x00000000000331cc ompi_sync_wait_mt() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/threads/wait_sync.c:85
8 0x000000000004ad0b ompi_request_default_wait_any() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:131
9 0x00000000000b91ab PMPI_Waitany() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pwaitany.c:83
I'm not sure if it's an issue with the ucx PML or with UCX itself, though. In both cases, disabling ucx and using yalla or ob1 works fine. Has anyone else seen this?
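For reference, the PML is normally selected with the "pml" MCA parameter on the mpirun command line. Below is a minimal sketch of the three configurations mentioned above. The application name and rank count are placeholders (assumptions, not the users' actual jobs), and the helper only prints the command lines rather than launching anything:

```shell
#!/bin/sh
# Placeholders (assumptions, not taken from the failing jobs).
APP=${APP:-./my_mpi_app}
NP=${NP:-4}

# Print an mpirun command line that pins the PML via the "pml" MCA parameter.
pml_cmd() {
    printf 'mpirun -np %s --mca pml %s %s\n' "$NP" "$1" "$APP"
}

pml_cmd ucx     # the failing configuration
pml_cmd ob1     # works
pml_cmd yalla   # works (requires an Open MPI build with MXM support)
```

Running one of the printed lines by hand, or with `eval "$(pml_cmd ob1)"`, forces that PML for a single job without changing any site-wide defaults.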
Thanks,
Ben