Bert Wesarg via users
2018-11-13 21:56:41 UTC
Hi,
On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce wrote:
> The Open MPI Team, representing a consortium of research, academic, and
> industry partners, is pleased to announce the release of Open MPI version
> 4.0.0.
>
> v4.0.0 is the start of a new release series for Open MPI. Starting with
> this release, the OpenIB BTL supports only iWarp and RoCE by default.
> Starting with this release, UCX is the preferred transport protocol
> for Infiniband interconnects. The embedded PMIx runtime has been updated
> to 3.0.2. The embedded Romio has been updated to 3.2.1. This release is
> ABI compatible with the 3.x release streams. There have been numerous
> other bug fixes and performance improvements.
>
> Note that starting with Open MPI v4.0.0, prototypes for several
> MPI-1 symbols that were deleted in the MPI-3.0 specification
> (which was published in 2012) are no longer available by default in
> mpi.h. See the README for further details.
>
> https://www.open-mpi.org/software/ompi/v4.0/
>
> 4.0.0 -- September, 2018
> ------------------------
>
> - OSHMEM updated to the OpenSHMEM 1.4 API.
> - Do not build OpenSHMEM layer when there are no SPMLs available.
>   Currently, this means the OpenSHMEM layer will only build if
>   a MXM or UCX library is found.
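As an aside on the MPI-1 note above, in case others trip over it first: as I
read the README, code that still uses one of the deleted MPI-1 prototypes,
MPI_Type_struct being just one example, no longer compiles against the
default 4.0.0 mpi.h unless Open MPI was configured with
--enable-mpi1-compatibility. A minimal sketch of what I mean, with the MPI-3
replacement next to it:

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int          lens[1]  = { 1 };
      MPI_Aint     disps[1] = { 0 };
      MPI_Datatype types[1] = { MPI_INT }, newtype;

      /* MPI-1 call, deleted in MPI-3.0: with a default 4.0.0 build the
       * prototype is gone from mpi.h, so this line no longer compiles.
       * MPI_Type_struct(1, lens, disps, types, &newtype); */

      /* MPI-3 replacement, unaffected by the change: */
      MPI_Type_create_struct(1, lens, disps, types, &newtype);
      MPI_Type_commit(&newtype);

      MPI_Type_free(&newtype);
      MPI_Finalize();
      return 0;
  }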
Now, to my actual question: what is the most convenient way to get SHMEM
working on a single shared-memory node (i.e., a notebook)? I just realized
that I have not had a working SHMEM since Open MPI 3.0, and building with
UCX does not help either: I tried with UCX 1.4, but Open MPI SHMEM still
does not work.
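By "building with UCX" I mean configuring Open MPI 4.0.0 against a
separately installed UCX 1.4, roughly along the following lines (the
install prefixes are just placeholders for my local paths):

$ ./configure --prefix=$HOME/opt/openmpi-4.0.0 \
              --with-ucx=$HOME/opt/ucx-1.4.0
$ make -j all install

With that build, the OpenSHMEM hello world from the examples directory
fails like this: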
$ oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c
$ oshrun -np 2 ./shmem_hello_world-4.0.0
[1542109710.217344] [tudtug:27715:0] select.c:406 UCX ERROR
no remote registered memory access transport to tudtug:27716:
self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
mm/posix - Destination is unreachable, cma/cma - no put short
[1542109710.217344] [tudtug:27716:0] select.c:406 UCX ERROR
no remote registered memory access transport to tudtug:27715:
self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
mm/posix - Destination is unreachable, cma/cma - no put short
[tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
[tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
Error: add procs FAILED rc=-2
[tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
[tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
Error: add procs FAILED rc=-2
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):
SPML add procs failed
--> Returned "Out of resource" (-2) instead of "Success" (0)
--------------------------------------------------------------------------
[tudtug:27715] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
initialize - aborting
[tudtug:27716] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 27715, host=tudtug) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.
Local host: tudtug
PID: 27715
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
oshrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[2212,1],1]
Exit code: 255
--------------------------------------------------------------------------
[tudtug:27710] 1 more process has sent help message
help-shmem-runtime.txt / shmem_init:startup:internal-failure
[tudtug:27710] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
[tudtug:27710] 1 more process has sent help message help-shmem-api.txt
/ shmem-abort
[tudtug:27710] 1 more process has sent help message
help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
killed
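Is something along the following lines the intended way to restrict such a
run to node-local transports? I am guessing at the right combination of MCA
parameters and UCX environment variables here, so please correct me:

$ oshrun -np 2 --mca spml ucx -x UCX_TLS=sm,self ./shmem_hello_world-4.0.0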
MPI works as expected:
$ mpicc -o mpi_hello_world-4.0.0 openmpi-4.0.0/examples/hello_c.c
$ mpirun -np 2 ./mpi_hello_world-4.0.0
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI
***@tudtug Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12,
2018, 108)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI
***@tudtug Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12,
2018, 108)
I'm attaching the output from 'ompi_info -a' and also from 'ucx_info
-b -d -c -s'.
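For reference, the attachments are simply the output of those two commands
redirected to files (the file names are placeholders):

$ ompi_info -a > ompi_info-4.0.0.txt
$ ucx_info -b -d -c -s > ucx_info-1.4.txt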
Thanks for the help.
Best,
Bert