Discussion:
[OMPI users] OpenMPI 3.1.2: Run-time failure in UCX PML
Ben Menadue
2018-09-21 04:19:01 UTC
Hi,

A couple of our users have reported issues using UCX in OpenMPI 3.1.2. It’s failing with this message:

[r1071:27563:0:27563] rc_verbs_iface.c:63 FATAL: send completion with error: local protection error

The actual MPI calls provoking this are different between the two applications — one is an MPI_Bcast and the other is an MPI_Waitany — but in both cases it ends up in the request wait path (ompi_request_default_wait_all and ompi_request_default_wait_any, respectively) and then in the progress engine:

0 0x00000000000373dc ucs_log_dispatch() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
1 0x00000000000368ff uct_rc_verbs_iface_poll_tx() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
2 0x00000000000368ff uct_rc_verbs_iface_progress() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
3 0x00000000000179d2 ucs_callbackq_dispatch() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
4 0x0000000000018e0a uct_worker_progress() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
5 0x00000000000050a9 mca_pml_ucx_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
6 0x000000000002b554 opal_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
7 0x000000000004a7fa sync_wait_st() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../opal/threads/wait_sync.h:83
8 0x000000000004b073 ompi_request_default_wait_all() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:237
9 0x00000000000ce548 ompi_coll_base_bcast_intra_generic() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:98
10 0x00000000000ced08 ompi_coll_base_bcast_intra_pipeline() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:280
11 0x0000000000004f28 ompi_coll_tuned_bcast_intra_dec_fixed() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:303
12 0x0000000000067b60 PMPI_Bcast() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pbcast.c:111

and

0 0x00000000000373dc ucs_log_dispatch() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
1 0x00000000000368ff uct_rc_verbs_iface_poll_tx() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
2 0x00000000000368ff uct_rc_verbs_iface_progress() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
3 0x00000000000179d2 ucs_callbackq_dispatch() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
4 0x0000000000018e0a uct_worker_progress() /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
5 0x0000000000005099 mca_pml_ucx_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
6 0x000000000002b554 opal_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
7 0x00000000000331cc ompi_sync_wait_mt() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/threads/wait_sync.c:85
8 0x000000000004ad0b ompi_request_default_wait_any() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:131
9 0x00000000000b91ab PMPI_Waitany() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pwaitany.c:83

I’m not sure whether it’s an issue with the UCX PML or with UCX itself, though. In both cases, disabling UCX and using yalla or ob1 instead (as in the commands below) works fine. Has anyone else seen this?
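For reference, the workaround amounts to forcing a different PML on the mpirun command line, roughly like this (illustrative only; ./my_app is a placeholder and the exact invocation depends on our wrapper scripts):

mpirun --mca pml ob1 ./my_app     # fall back to the ob1 PML
mpirun --mca pml ^ucx ./my_app    # or exclude ucx and let Open MPI pick another PML (e.g. yalla)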

Thanks,
Ben
Pavel Shamis
2018-09-21 20:40:20 UTC
I would suggest posting the error in the UCX issue tracker -
https://github.com/openucx/ucx/issues
It is a typical IB error complaining about access to unregistered memory.
It is usually caused by some pointer corruption in OMPI/UCX or in the
application code.
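If you do file it, attaching the UCX version/build details and a debug-level log usually helps. Something along these lines (illustrative; ./my_app is a placeholder and the exact variables may differ between UCX versions):

ucx_info -v                                         # UCX version and build configuration
UCX_LOG_LEVEL=debug mpirun --mca pml ucx ./my_app   # reproduce with verbose UCX logging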

Best,
Pasha
Jorge D'Elia
2018-09-25 12:26:35 UTC
Hi,

As I regularly do, I am trying to download the latest stable
version. But this time with the link:

https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz

the following error message appears:

Hmm. We’re having trouble finding that site.
We can’t connect to the server at download.open-mpi.org.

The same with the other similar links ...

Regards.
--
Jorge D'Elia.
CIMEC (UNL-CONICET), http://www.cimec.org.ar/
Predio CONICET-Santa Fe, Colec. Ruta Nac. 168,
Paraje El Pozo, 3000, Santa Fe, ARGENTINA.
Tel +54-342-4511594/95 ext 7062, fax: +54-342-4511169
Llolsten Kaonga
2018-09-25 12:56:05 UTC
Hello Jorge,

What happens when you go to this link https://www.open-mpi.org/software/ompi/v3.1/ and click on the file openmpi-3.1.2.tar.gz in the table? I ask because I am able to download the tarball without a problem. Maybe the problem you were seeing was temporary.
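If the browser keeps failing, fetching the tarball directly from the command line is another quick check (illustrative; either tool works if it is installed):

wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz
# or
curl -O https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz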

Cheers.
--
Llolsten

Jorge D'Elia
2018-09-25 13:13:29 UTC
----- Original Message -----
Sent: Tuesday, 25 September 2018, 9:56:05
Subject: RE: [OMPI users] Difficulties when trying to download files?
Hello Jorge,
What happens when you go to this link
https://www.open-mpi.org/software/ompi/v3.1/ and click on the file
openmpi-3.1.2.tar.gz in the table? I ask because I am able to download the
tarball without a problem. Maybe the problem you were seeing was temporary.
Hi Llolsten,

I can't really tell you now, because about 10 minutes after sending my
question I was able to download it perfectly and assumed it had been
fixed (I had been noticing this problem since last week). In any case, many
thanks for the quick response!

Cheers.
Jorge.
Jeff Squyres (jsquyres) via users
2018-09-25 13:40:08 UTC
Must have been some kind of temporary DNS glitch. Shrug.

Next time it happens, also be sure to check

https://downforeveryoneorjustme.com/download.open-mpi.org
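A couple of quick local checks can also tell you whether it is a DNS problem or the server itself (illustrative commands):

nslookup download.open-mpi.org    # does the name resolve at all?
curl -I https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz    # does the server answer a HEAD request?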
--
Jeff Squyres
***@cisco.com
