[OMPI users] Hang in mca_btl_vader_component_progress ()
Joshua Wall
2017-01-12 18:35:19 UTC
Hello Users

I'm by no means an MPI expert, but I have been successfully using my
own compiled version of OMPI 1.10.2 for some time without issue. Lately,
however, I'm seeing a strange issue: when I try to run on more than 3 or
4 nodes, I get a hang during setup. My code (the Fortran MHD code FLASH
ver 4.2.2) is attempting to call MPI_COMM_SPLIT:


  !! first make a communicator for group of processors
  !! that have the whole computational grid
  !! The grid is duplicated on all communicators
  countInComm=dr_globalNumProcs/dr_meshCopyCount

  if((countInComm*dr_meshCopyCount) /= dr_globalNumProcs)&
       call Driver_abortFlash("when duplicating mesh, numProcs should be a multiple of meshCopyCount")

  color = dr_globalMe/countInComm
  key = mod(dr_globalMe,countInComm)
  call MPI_Comm_split(dr_globalComm,color,key,dr_meshComm,error)
  call MPI_Comm_split(dr_globalComm,key,color,dr_meshAcrossComm,error)

  call MPI_COMM_RANK(dr_meshComm,dr_meshMe, error)
  call MPI_COMM_SIZE(dr_meshComm, dr_meshNumProcs,error)

  call MPI_COMM_RANK(dr_meshAcrossComm,dr_meshAcrossMe, error)
  call MPI_COMM_SIZE(dr_meshAcrossComm, dr_meshAcrossNumProcs,error)


and is hanging in the split call. Attaching GDB to the process on the
local node, I find the following (CentOS is way behind on updating GDB,
so there aren't a lot of symbols, unfortunately):


(gdb) bt full
#0 0x00002aaab150facd in mca_btl_vader_component_progress () from /home/draco/jwall/local_openmpi/lib/openmpi/mca_btl_vader.so
No symbol table info available.
#1 0x00002aaaad348e6a in opal_progress () from /home/draco/jwall/local_openmpi/lib/libopen-pal.so.13
_mm_free_fn = 0
event_debug_map_PRIMES = {53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 49157, 98317, 196613, 393241, 786433, 1572869, 3145739, 6291469, 12582917, 25165843, 50331653, 100663319, 201326611, 402653189, 805306457, 1610612741}
_event_debug_map_lock = 0x3e9cc70
_mm_realloc_fn = 0
event_debug_mode_too_late = 1
global_debug_map = {hth_table = 0x0, hth_table_length = 0, hth_n_entries = 0, hth_load_limit = 0, hth_prime_idx = -1}
warn_once = 0
use_monotonic = 1
eventops = {0x2aaaad5ee860, 0x2aaaad5ee8c0, 0x2aaaad5ee900, 0x0}
_mm_malloc_fn = 0
event_global_current_base_ = 0x0
opal_libevent2021__event_debug_mode_on = 0
#2 0x00002aaaab635305 in ompi_request_default_wait_all () from /home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#3 0x00002aaab1f60417 in ompi_coll_tuned_sendrecv_nonzero_actual () from /home/draco/jwall/local_openmpi/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#4 0x00002aaab1f68074 in ompi_coll_tuned_allgather_intra_bruck () from /home/draco/jwall/local_openmpi/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#5 0x00002aaaab621e4d in ompi_comm_split () from /home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#6 0x00002aaaab64f16d in PMPI_Comm_split () from /home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#7 0x00002aaaab3db70f in pmpi_comm_split__ () from /home/draco/jwall/local_openmpi/lib/libmpi_mpifh.so.12
No symbol table info available.
#8 0x00000000005e43ed in driver_setupparallelenv_ ()
No symbol table info available.
#9 0x00000000005e1187 in driver_initflash_ ()
No symbol table info available.
#10 0x00000000004451ce in __flash_run_MOD_initialize_code ()
No symbol table info available.
#11 0x00000000004124e9 in handle_call.1908 ()
No symbol table info available.
#12 0x0000000000422791 in run_loop_mpi.1914 ()
No symbol table info available.
#13 0x00000000004169db in MAIN__ ()
No symbol table info available.
#14 0x0000000000423c6f in main ()
No symbol table info available.
#15 0x00002aaaac507d1d in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#16 0x0000000000405509 in _start ()
No symbol table info available.
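
For anyone checking the split logic itself, the color/key arithmetic from the Fortran snippet above can be tabulated without an MPI runtime. The following is only a rough sketch in plain Python, not part of FLASH or Open MPI; the function name `split_layout` and the example values (8 ranks, 2 mesh copies) are assumptions chosen for illustration. It mirrors `color = dr_globalMe/countInComm` and `key = mod(dr_globalMe,countInComm)` using integer division.

```python
def split_layout(num_procs, mesh_copy_count):
    """Return (rank, color, key) per rank, mirroring the Fortran snippet.

    Hypothetical helper for illustration only; it performs no MPI calls.
    """
    count_in_comm = num_procs // mesh_copy_count
    # Same consistency check that guards Driver_abortFlash in the snippet.
    if count_in_comm * mesh_copy_count != num_procs:
        raise ValueError("numProcs should be a multiple of meshCopyCount")
    layout = []
    for rank in range(num_procs):
        color = rank // count_in_comm  # integer division, as in Fortran
        key = rank % count_in_comm     # matches mod() for non-negative ranks
        layout.append((rank, color, key))
    return layout


if __name__ == "__main__":
    # Assumed example sizes: 8 ranks split into 2 mesh copies.
    for rank, color, key in split_layout(8, 2):
        print(f"rank {rank}: meshComm color={color} key={key}; "
              f"meshAcrossComm color={key} key={color}")
```

With a mesh copy count of 1, every rank gets color 0, so the first split yields one communicator spanning all ranks and the second yields one single-rank communicator per rank. Note that MPI_Comm_split is collective over the parent communicator, so each split returns only once every rank in dr_globalComm has entered it.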

Anyone have any ideas what the issue might be?

Thanks so much.

Joshua Wall
Ph. D. Candidate
Physics Department
Drexel University
--
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104