Discussion: [OMPI users] ARM HPC Compiler 18.4.0 / OpenMPI 2.1.4 Hang for IMB All Reduce Test on 4 Ranks
Hammond, Simon David via users
2018-08-16 02:36:25 UTC
Hi OpenMPI Users,

I am compiling OpenMPI 2.1.4 with the ARM 18.4.0 HPC Compiler on our ARM ThunderX2 system; the configure line is below. For now, I am using the simplest configuration we can test on our system.

If I take this OpenMPI 2.1.4 build and run the IMB MPI benchmark with 4 ranks on a single node (so communication uses shared memory), the Allreduce test hangs at the 4-process case (see below). All four processes appear to spin at 100% of a single core.

Configure Line: ./configure --prefix=/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0 --with-slurm --enable-mpi-thread-multiple CC=`which armclang` CXX=`which armclang++` FC=`which armflang`

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 4
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            4         1000         2.31         2.31         2.31
            8         1000         2.37         2.37         2.37
           16         1000         2.46         2.46         2.46
           32         1000         2.46         2.46         2.46
<Hang forever>
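For reference, the hanging run was launched roughly as follows (a sketch; the IMB binary name and paths here are illustrative, not the exact invocation):

export PATH=/home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/bin:$PATH
# IMB-MPI1 is the standard Intel MPI Benchmarks driver; "Allreduce"
# restricts the run to that one benchmark.
mpirun -np 4 ./IMB-MPI1 Allreduce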

When I use GDB to halt one of the ranks and take a backtrace, I seem to get the following stacks repeated in a loop.

#0 0x0000ffffbe3e765c in opal_timer_linux_get_cycles_sys_timer ()
from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#1 0x0000ffffbe36d910 in opal_progress ()
from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#2 0x0000ffffbe6f2568 in ompi_request_default_wait ()
from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#3 0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#4 0x0000ffffbe703000 in PMPI_Barrier () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#5 0x0000000000402554 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
(gdb) bt
#0 0x0000ffffbc42084c in mlx5_poll_cq_1 () from /lib64/libmlx5-rdmav2.so
#1 0x0000ffffb793f544 in btl_openib_component_progress ()
from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/openmpi/mca_btl_openib.so
#2 0x0000ffffbe36d980 in opal_progress ()
from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libopen-pal.so.20
#3 0x0000ffffbe6f2568 in ompi_request_default_wait ()
from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#4 0x0000ffffbe73f718 in ompi_coll_base_barrier_intra_recursivedoubling ()
from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#5 0x0000ffffbe703000 in PMPI_Barrier () from /home/projects/arm64-tx2/openmpi/2.1.4/arm/18.4.0/lib/libmpi.so.20
#6 0x0000000000402554 in main ()
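
The traces above were captured roughly like this (a sketch; the PID is illustrative):

ps -ef | grep IMB     # find the PID of one spinning rank
gdb -p <pid>          # attach GDB to it
(gdb) bt              # sample the call stack
(gdb) c               # continue; Ctrl-C and bt again to see the loop repeat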


--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
Kawashima, Takahiro
2018-08-16 03:00:48 UTC
Permalink
Hi,

Open MPI 2.1.3 and 2.1.4 have a bug in shared-memory communication.
The Open MPI community is preparing 2.1.5 to fix it.

https://github.com/open-mpi/ompi/pull/5536

Could you try this patch?

https://github.com/open-mpi/ompi/commit/6086b52719ed02725dfa5e91c0d12c3c66a8e168
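
Something like the following should apply it to your 2.1.4 source tree (a sketch; GitHub serves a commit as a patch file when you append .patch to its URL):

cd openmpi-2.1.4
curl -LO https://github.com/open-mpi/ompi/commit/6086b52719ed02725dfa5e91c0d12c3c66a8e168.patch
patch -p1 < 6086b52719ed02725dfa5e91c0d12c3c66a8e168.patch
make -j && make install   # rebuild with your existing configure settings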

Or, could you try the 2.1.5rc1 release candidate?

https://www.open-mpi.org/software/ompi/v2.1/
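
A sketch of building the release candidate with your original options, after downloading the tarball from the page above (the extracted directory name and the new install prefix are my assumptions):

tar xjf openmpi-2.1.5rc1.tar.bz2 && cd openmpi-2.1.5rc1
./configure --prefix=/home/projects/arm64-tx2/openmpi/2.1.5rc1/arm/18.4.0 \
    --with-slurm --enable-mpi-thread-multiple \
    CC=`which armclang` CXX=`which armclang++` FC=`which armflang`
make -j && make install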

Thanks,
Takahiro Kawashima,
MPI development team,
Fujitsu