[OMPI users] Problem running with UCX/oshmem on single node?
Howard Pritchard
2018-05-10 01:45:34 UTC
Hi Craig,

You are experiencing problems because you don't have a transport installed
that UCX can use for oshmem.

You either need to buy a ConnectX-4/5 HCA from Mellanox (and maybe a
switch) and install it
on your system, or else install xpmem (https://github.com/hjelmn/xpmem).
Note there is a bug right now
in UCX that you may hit if you try to go the xpmem-only route:

https://github.com/open-mpi/ompi/issues/5083
and
https://github.com/openucx/ucx/issues/2588

If you are just running on a single node and want to experiment with the
OpenSHMEM programming model,
and do not have Mellanox mlx5 equipment installed on the node, you are much
better off trying to use SOS
over OFI libfabric:

https://github.com/Sandia-OpenSHMEM/SOS
https://github.com/ofiwg/libfabric/releases

For SOS you will need to install the hydra launcher as well:

http://www.mpich.org/downloads/

I really wish Google would do a better job of surfacing my responses about
this type of problem. I seem to
respond every couple of months to this exact problem on this mailing list.


Howard
I'm trying to play with oshmem on a single node (just to have a way to do
some simple
CentOS 6.9 (gcc 4.4.7)
built and installed ucx 1.3.0
built and installed openmpi-3.1.0
[cfreese]$ cat oshmem.c
#include <mpp/shmem.h>
int
main() {
    shmem_init();
}
[cfreese]$ mpicc oshmem.c -loshmem
[cfreese]$ shmemrun -np 2 ./a.out
[ucs1l:30118] mca: base: components_register: registering framework spml
components
[ucs1l:30118] mca: base: components_register: found loaded component ucx
[ucs1l:30119] mca: base: components_register: registering framework spml
components
[ucs1l:30119] mca: base: components_register: found loaded component ucx
[ucs1l:30119] mca: base: components_register: component ucx register
function successful
[ucs1l:30118] mca: base: components_register: component ucx register
function successful
[ucs1l:30119] mca: base: components_open: opening spml components
[ucs1l:30119] mca: base: components_open: found loaded component ucx
[ucs1l:30118] mca: base: components_open: opening spml components
[ucs1l:30118] mca: base: components_open: found loaded component ucx
[ucs1l:30119] mca: base: components_open: component ucx open function
successful
[ucs1l:30118] mca: base: components_open: component ucx open function
successful
[ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
mca_spml_base_select() select: initializing spml component ucx
[ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
- mca_spml_ucx_component_init() in ucx, my priority is 21
[ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
mca_spml_base_select() select: initializing spml component ucx
[ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
- mca_spml_ucx_component_init() in ucx, my priority is 21
[ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
- mca_spml_ucx_component_init() *** ucx initialized ****
[ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
mca_spml_base_select() select: init returned priority 21
[ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
mca_spml_base_select() selected ucx best priority 21
[ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
mca_spml_base_select() select: component ucx selected
[ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
mca_spml_ucx_enable() *** ucx ENABLED ****
[ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
- mca_spml_ucx_component_init() *** ucx initialized ****
[ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
mca_spml_base_select() select: init returned priority 21
[ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
mca_spml_base_select() selected ucx best priority 21
[ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
mca_spml_base_select() select: component ucx selected
[ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
mca_spml_ucx_enable() *** ucx ENABLED ****
here's where I think the real issue is....
[1525891910.424102] [ucs1l:30119:0] select.c:316 UCX ERROR no
remote registered memory access transport to <no debug data>: mm/posix -
Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
- no put short, self/self - Destination is unreachable
[1525891910.424104] [ucs1l:30118:0] select.c:316 UCX ERROR no
remote registered memory access transport to <no debug data>: mm/posix -
Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
- no put short, self/self - Destination is unreachable
[ucs1l:30119] Error ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is unreachable
[ucs1l:30118] Error ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:293 -
mca_spml_ucx_add_procs() ucp_ep_create failed: Destination is unreachable
0x0000000000bb0f10 ***
0x0000000000f98ef0 ***
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6[0x338d875dee]
/lib64/libc.so.6[0x338d875dee]
/lib64/libc.so.6[0x338d878c80]
/lib64/libc.so.6[0x338d878c80]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[
0x7fea58e4637c]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(mca_spml_ucx_add_procs+0x2dc)[
0x7f1dc261437c]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_
init+0x273)[0x7fea58e07833]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(oshmem_shmem_
init+0x273)[0x7f1dc25d5833]
/opt/openmpi-3.1.0/lib/liboshmem.so.40(pshmem_init+0x28)[0x7f1dc25d8438]
./a.out[0x40061d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x338d81ed1d]
./a.out[0x400559]
======= Memory map: ========
[ucs1l:30118] *** Process received signal ***
[ucs1l:30118] Signal: Aborted (6)
[ucs1l:30118] Signal code: (-6)
.
.
.
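The UCX error above names each candidate transport together with the reason it was rejected (for example, tcp/eth0 is rejected with "no put short"). As a rough illustration of how to read that message, here is a minimal C sketch that splits the comma-separated list into transport/reason pairs. The names ucx_reject and parse_ucx_rejects are hypothetical and are not part of UCX or Open MPI. The sketch assumes the caller passes only the list portion of the message and that no reason text itself contains ", ".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record: one "transport/device - reason" entry. */
struct ucx_reject {
    char transport[32];  /* e.g. "mm/posix" */
    char reason[64];     /* e.g. "Destination is unreachable" */
};

/*
 * Split a list such as
 *   "mm/posix - Destination is unreachable, tcp/eth0 - no put short"
 * into entries. Stores at most max entries and returns how many were stored.
 * Assumption: entries are separated by ", " and fields by " - ".
 */
size_t parse_ucx_rejects(const char *list, struct ucx_reject *out, size_t max)
{
    size_t n = 0;

    assert(list != NULL && out != NULL);

    while (*list != '\0' && n < max) {
        const char *sep = strstr(list, " - ");
        const char *end;

        if (sep == NULL) {
            break;                      /* no further "name - reason" pair */
        }
        end = strstr(sep + 3, ", ");
        if (end == NULL) {
            end = sep + strlen(sep);    /* last entry runs to end of string */
        }

        /* snprintf truncates if needed and always NUL-terminates. */
        snprintf(out[n].transport, sizeof out[n].transport, "%.*s",
                 (int)(sep - list), list);
        snprintf(out[n].reason, sizeof out[n].reason, "%.*s",
                 (int)(end - (sep + 3)), sep + 3);
        n++;

        list = (*end == '\0') ? end : end + 2;  /* step over ", " */
    }
    return n;
}
```

Fed the list from the log above, this yields four entries: mm/posix, mm/sysv and self/self with "Destination is unreachable", and tcp/eth0 with "no put short".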
So it looks like UCX is found, but none of the underlying "transports"
work. Futzing with
ucx_info, I do see that posix, sysv, tcp, and self are known within UCX...
# Memory domain: posix
# component: posix
# allocate: unlimited
# remote key: 37 bytes
#
# Transport: mm
#
# Device: posix
#
# bandwidth: 6911.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
# atomic_add: 32, 64 bit, cpu
# atomic_fadd: 32, 64 bit, cpu
# atomic_swap: 32, 64 bit, cpu
# atomic_cswap: 32, 64 bit, cpu
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
# ...
(with various futzing around with parameters I think I was able to get the
UCX ucx_perftest to
do something, so I'm not convinced it's completely a UCX fault).
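The ucx_info listing above reports one attribute per line in the form "# name: value". As a small, hedged sketch for checking such a listing programmatically (for instance, to see whether a transport advertises put_short), the C function below extracts the name and value from a single line. parse_info_line is a hypothetical helper, not a UCX API, and the line format is assumed only from the excerpt shown in this message.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of the form "#   key: value" into the key and val buffers.
 * Returns 1 on success, 0 if the line carries no "key: value" pair
 * (for example a bare "#" separator or "# ...").
 */
int parse_info_line(const char *line, char *key, size_t klen,
                    char *val, size_t vlen)
{
    const char *colon;
    const char *end;

    assert(line != NULL && key != NULL && val != NULL);

    while (*line == '#' || isspace((unsigned char)*line)) {
        line++;                         /* skip "#" and leading blanks */
    }
    colon = strchr(line, ':');
    if (colon == NULL || colon == line) {
        return 0;                       /* no key before a colon */
    }
    snprintf(key, klen, "%.*s", (int)(colon - line), line);

    colon++;
    while (isspace((unsigned char)*colon)) {
        colon++;                        /* skip blanks after ':' */
    }
    end = colon + strlen(colon);
    while (end > colon && isspace((unsigned char)end[-1])) {
        end--;                          /* trim trailing newline or blanks */
    }
    if (end == colon) {
        return 0;                       /* empty value */
    }
    snprintf(val, vlen, "%.*s", (int)(end - colon), colon);
    return 1;
}
```

Applied to the line "# put_short: <= 4294967295", this fills key with "put_short" and val with "<= 4294967295".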
I'm guessing there's something simple that I'm missing to get oshmem/ucx
configured, but I've been unable to find much
with regard to setting up and using UCX so I'm hoping someone might be
able to point me in the right direction.
Thanks.
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Michael Di Domenico
2018-05-14 14:33:59 UTC
Post by Howard Pritchard
You either need to go and buy a connectx4/5 HCA from mellanox (and maybe a
switch), and install that
on your system, or else install xpmem (https://github.com/hjelmn/xpmem).
Note there is a bug right now
How stringent is the ConnectX-4/5 requirement? I have ConnectX-3
cards; will they work? During the configure step it seems to yell at
me that mlx5 won't compile because I don't have Mellanox OFED v3.1
installed. Is that also a requirement? (I'm using the RHEL 7.4 bundled
version of OFED, not the vendor version.)
