Nathan Hjelm
2018-05-10 01:24:24 UTC
Thanks for confirming that it works for you as well. I have a PR open on v3.1.x that brings osc/rdma up to date with master. I will also be bringing some code that greatly improves the multi-threaded RMA performance on Aries systems (at least with benchmarks: github.com/hpc/rma-mt). That will not make it into v3.1.x but will be in v4.0.0.
-Nathan
Nathan,
Thank you, I can confirm that it works as expected with master on our system. I will stick to this version then until 3.1.1 is out.
Joseph
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
It looks like it doesn't fail with master, so I must have fixed this bug at some point. The current plan is to bring all the master changes into v3.1.1. This includes a number of bug fixes.
-Nathan
Nathan,
Thanks for looking into that. My test program is attached.
Best
Joseph
I will take a look today. Can you send me your test program?
-Nathan
All,
I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 (Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA. Unfortunately, a simple (single-threaded) test case consisting of two processes performing an MPI_Rget+MPI_Wait hangs when running on two nodes. It succeeds if both processes run on a single node.
```
# this seems necessary to avoid a linker error during build
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-cray PrgEnv-intel
module sw craype-haswell craype-sandybridge
module unload craype-hugepages16M
module unload cray-mpich
```
```
mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
[nid03060:36184] mca: base: components_register: registering framework btl components
[nid03060:36184] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: component self register function successful
[nid03060:36184] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: registering framework btl components
[nid03061:36208] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: found loaded component ugni
[nid03061:36208] mca: base: components_register: component self register function successful
[nid03061:36208] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: found loaded component ugni
[nid03060:36184] mca: base: components_register: component ugni register function successful
[nid03060:36184] mca: base: components_register: found loaded component vader
[nid03061:36208] mca: base: components_register: component ugni register function successful
[nid03061:36208] mca: base: components_register: found loaded component vader
[nid03060:36184] mca: base: components_register: component vader register function successful
[nid03060:36184] mca: base: components_open: opening btl components
[nid03060:36184] mca: base: components_open: found loaded component self
[nid03060:36184] mca: base: components_open: component self open function successful
[nid03060:36184] mca: base: components_open: found loaded component ugni
[nid03060:36184] mca: base: components_open: component ugni open function successful
[nid03060:36184] mca: base: components_open: found loaded component vader
[nid03060:36184] mca: base: components_open: component vader open function successful
[nid03060:36184] select: initializing btl component self
[nid03060:36184] select: init of component self returned success
[nid03060:36184] select: initializing btl component ugni
[nid03061:36208] mca: base: components_register: component vader register function successful
[nid03061:36208] mca: base: components_open: opening btl components
[nid03061:36208] mca: base: components_open: found loaded component self
[nid03061:36208] mca: base: components_open: component self open function successful
[nid03061:36208] mca: base: components_open: found loaded component ugni
[nid03061:36208] mca: base: components_open: component ugni open function successful
[nid03061:36208] mca: base: components_open: found loaded component vader
[nid03061:36208] mca: base: components_open: component vader open function successful
[nid03061:36208] select: initializing btl component self
[nid03061:36208] select: init of component self returned success
[nid03061:36208] select: initializing btl component ugni
[nid03061:36208] select: init of component ugni returned success
[nid03061:36208] select: initializing btl component vader
[nid03061:36208] select: init of component vader returned failure
[nid03061:36208] mca: base: close: component vader closed
[nid03061:36208] mca: base: close: unloading component vader
[nid03060:36184] select: init of component ugni returned success
[nid03060:36184] select: initializing btl component vader
[nid03060:36184] select: init of component vader returned failure
[nid03060:36184] mca: base: close: component vader closed
[nid03060:36184] mca: base: close: unloading component vader
[nid03061:36208] mca: bml: Using self btl for send to [[54630,1],1] on node nid03061
[nid03060:36184] mca: bml: Using self btl for send to [[54630,1],0] on node nid03060
[nid03061:36208] mca: bml: Using ugni btl for send to [[54630,1],0] on node (null)
[nid03060:36184] mca: bml: Using ugni btl for send to [[54630,1],1] on node (null)
```
It looks like the UGNI btl is being initialized correctly but then fails to find the node to communicate with? Is there a way to get more information? There doesn't seem to be an MCA parameter to increase verbosity specifically of the UGNI btl.
Any help would be appreciated!
Cheers
Joseph
<config.log.tgz>