Discussion:
[OMPI users] [Open MPI Announce] Open MPI 4.0.0 Released
Bert Wesarg via users
2018-11-13 21:56:41 UTC
Hi,

On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce
The Open MPI Team, representing a consortium of research, academic, and
industry partners, is pleased to announce the release of Open MPI version
4.0.0.
v4.0.0 is the start of a new release series for Open MPI. Starting with
this release, the OpenIB BTL supports only iWarp and RoCE by default.
Starting with this release, UCX is the preferred transport protocol
for Infiniband interconnects. The embedded PMIx runtime has been updated
to 3.0.2. The embedded Romio has been updated to 3.2.1. This
release is ABI compatible with the 3.x release streams. There have been numerous
other bug fixes and performance improvements.
Note that starting with Open MPI v4.0.0, prototypes for several
MPI-1 symbols that were deleted in the MPI-3.0 specification
(which was published in 2012) are no longer available by default in
mpi.h. See the README for further details.
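Code that still uses one of the removed MPI-1 functions (for example,
MPI_Address or MPI_Type_struct) can restore the old prototypes via a
configure option described in the README. A sketch; the install
prefix here is only an example:

$ ./configure --enable-mpi1-compatibility --prefix=$HOME/opt/openmpi-4.0.0
$ make -j4 && make install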
https://www.open-mpi.org/software/ompi/v4.0/
4.0.0 -- September, 2018
------------------------
- OSHMEM updated to the OpenSHMEM 1.4 API.
- Do not build OpenSHMEM layer when there are no SPMLs available.
Currently, this means the OpenSHMEM layer will only build if
a MXM or UCX library is found.
So what is the most convenient way to get SHMEM working on a single
shared-memory node (i.e., a notebook)? I just realized that I haven't
had a working SHMEM since Open MPI 3.0, and building with UCX does not
help either. I tried with UCX 1.4, but Open MPI SHMEM
still does not work:

$ oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c
$ oshrun -np 2 ./shmem_hello_world-4.0.0
[1542109710.217344] [tudtug:27715:0] select.c:406 UCX ERROR
no remote registered memory access transport to tudtug:27716:
self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
mm/posix - Destination is unreachable, cma/cma - no put short
[1542109710.217344] [tudtug:27716:0] select.c:406 UCX ERROR
no remote registered memory access transport to tudtug:27715:
self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
mm/posix - Destination is unreachable, cma/cma - no put short
[tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
[tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
Error: add procs FAILED rc=-2
[tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
[tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
Error: add procs FAILED rc=-2
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

SPML add procs failed
--> Returned "Out of resource" (-2) instead of "Success" (0)
--------------------------------------------------------------------------
[tudtug:27715] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
initialize - aborting
[tudtug:27716] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 27715, host=tudtug) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

Local host: tudtug
PID: 27715
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
oshrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[2212,1],1]
Exit code: 255
--------------------------------------------------------------------------
[tudtug:27710] 1 more process has sent help message
help-shmem-runtime.txt / shmem_init:startup:internal-failure
[tudtug:27710] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
[tudtug:27710] 1 more process has sent help message help-shmem-api.txt
/ shmem-abort
[tudtug:27710] 1 more process has sent help message
help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
killed

MPI works as expected:

$ mpicc -o mpi_hello_world-4.0.0 openmpi-4.0.0/examples/hello_c.c
$ mpirun -np 2 ./mpi_hello_world-4.0.0
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI
***@tudtug Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12,
2018, 108)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI
***@tudtug Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12,
2018, 108)

I'm attaching the output from 'ompi_info -a' and also from 'ucx_info
-b -d -c -s'.

Thanks for the help.

Best,
Bert
Howard Pritchard
2018-11-14 04:25:49 UTC
Hello Bert,

What OS are you running on your notebook?

If you are running Linux, and you have root access to your system, then
you should be able to resolve the Open SHMEM support issue by installing
the XPMEM device driver on your system, and rebuilding UCX so it picks
up XPMEM support.

The source code is on GitHub:

https://github.com/hjelmn/xpmem

Some instructions on how to build the xpmem device driver are at

https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM

You will need to install the kernel source and symbols rpms on your
system before building the xpmem device driver.
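
In outline, the steps might look like the following. This is only a
sketch, assuming /opt as the install prefix and the UCX 1.4 and Open
MPI 4.0.0 source trees sitting next to each other; package names,
paths, and the exact module location will differ by distribution:

$ git clone https://github.com/hjelmn/xpmem && cd xpmem
$ ./autogen.sh && ./configure --prefix=/opt/xpmem
$ make && sudo make install
$ sudo insmod kernel/xpmem.ko    # load the freshly built module (path may vary)
$ sudo chmod 666 /dev/xpmem      # or install a udev rule instead
$ cd ../ucx-1.4.0                # rebuild UCX so it detects XPMEM
$ ./configure --prefix=/opt/ucx --with-xpmem=/opt/xpmem
$ make && sudo make install
$ cd ../openmpi-4.0.0            # then rebuild Open MPI against that UCX
$ ./configure --prefix=/opt/ompi --with-ucx=/opt/ucx
$ make && sudo make install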

Hope this helps,

Howard


Kawashima, Takahiro
2018-11-14 04:35:27 UTC
XPMEM moved to GitLab.

https://gitlab.com/hjelmn/xpmem

Thanks,
Takahiro Kawashima,
Fujitsu
Bert Wesarg via users
2018-11-14 06:06:07 UTC
Dear Takahiro,
On Wed, Nov 14, 2018 at 5:38 AM Kawashima, Takahiro
Post by Kawashima, Takahiro
XPMEM moved to GitLab.
https://gitlab.com/hjelmn/xpmem
The first words of the README aren't very pleasant to read:

This is an experimental version of XPMEM based on a version provided by
Cray and uploaded to https://code.google.com/p/xpmem. This version supports
any kernel 3.12 and newer. *Keep in mind there may be bugs and this version
may cause kernel panics, code crashes, eat your cat, etc.*

On my laptop, where I just want to develop with SHMEM, it would be a
pity to lose work just because of that.

Best,
Bert
Howard Pritchard
2018-11-14 11:09:09 UTC
Hi Bert,

If you'd prefer to return to the land of convenience and don't need
to mix MPI and OpenSHMEM, then you may want to try the path I outlined
in the email archived at the following link:

https://www.mail-archive.com/***@lists.open-mpi.org/msg32274.html

Howard


Bert Wesarg via users
2018-11-19 07:55:35 UTC
Dear Howard,

Just want to report back that using XPMEM with UCX got me a working
SHMEM again.
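
For anyone else hitting this: before re-running the example, a quick
way to check whether the rebuilt UCX actually picked up XPMEM is to
grep its transport list (a sketch; output will vary):

$ ucx_info -d | grep -i xpmem
$ oshrun -np 2 ./shmem_hello_world-4.0.0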

Thanks.

Best,
Bert
Nathan Hjelm via users
2018-11-15 02:41:07 UTC
I really need to update that wording. It has been a while and the code seems to have stabilized. It’s quite safe to use and supports some of the latest kernel versions.

-Nathan
Bert Wesarg via users
2018-11-14 06:01:18 UTC
Howard,
Post by Howard Pritchard
Hello Bert,
What OS are you running on your notebook?
Ubuntu 18.04
Post by Howard Pritchard
If you are running Linux, and you have root access to your system, then
you should be able to resolve the Open SHMEM support issue by installing
the XPMEM device driver on your system, and rebuilding UCX so it picks
up XPMEM support.
https://github.com/hjelmn/xpmem
Some instructions on how to build the xpmem device driver are at
https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM
You will need to install the kernel source and symbols rpms on your
system before building the xpmem device driver.
I will try that. I already tried KNEM, which also did not work.
Still, that is definitely leaving the land of convenience. For a
development machine where performance doesn't matter, it's a huge step
back for Open MPI, I think.

I will report back if that works.

Thanks.

Best,
Bert