Joshua Wall
2017-01-12 18:35:19 UTC
Hello users,
I'm by no means an MPI expert, but I have successfully been using my
own compiled version of OMPI 1.10.2 for some time without issue. Lately,
however, I'm seeing a strange problem: when I try to run on
more than 3 or 4 nodes, I get a hang during setup. My code (the Fortran MHD
code FLASH ver. 4.2.2) is attempting to call MPI_COMM_SPLIT:
!! first make a communicator for group of processors
!! that have the whole computational grid
!! The grid is duplicated on all communicators
countInComm=dr_globalNumProcs/dr_meshCopyCount
if((countInComm*dr_meshCopyCount) /= dr_globalNumProcs) &
     call Driver_abortFlash("when duplicating mesh, numProcs should "//&
     "be a multiple of meshCopyCount")
color = dr_globalMe/countInComm
key = mod(dr_globalMe,countInComm)
call MPI_Comm_split(dr_globalComm,color,key,dr_meshComm,error)
call MPI_Comm_split(dr_globalComm,key,color,dr_meshAcrossComm,error)
call MPI_COMM_RANK(dr_meshComm,dr_meshMe, error)
call MPI_COMM_SIZE(dr_meshComm, dr_meshNumProcs,error)
call MPI_COMM_RANK(dr_meshAcrossComm,dr_meshAcrossMe, error)
call MPI_COMM_SIZE(dr_meshAcrossComm, dr_meshAcrossNumProcs,error)
and is hanging in the split call. Attaching GDB to the process on the
local node, I find the following (CentOS is way behind on updating GDB,
so unfortunately there aren't many symbols):
(gdb) bt full
#0 0x00002aaab150facd in mca_btl_vader_component_progress () from
/home/draco/jwall/local_openmpi/lib/openmpi/mca_btl_vader.so
No symbol table info available.
#1 0x00002aaaad348e6a in opal_progress () from
/home/draco/jwall/local_openmpi/lib/libopen-pal.so.13
_mm_free_fn = 0
event_debug_map_PRIMES = {53, 97, 193, 389, 769, 1543, 3079,
6151, 12289, 24593, 49157, 98317, 196613, 393241, 786433, 1572869,
3145739, 6291469, 12582917, 25165843, 50331653,
100663319, 201326611, 402653189, 805306457, 1610612741}
_event_debug_map_lock = 0x3e9cc70
_mm_realloc_fn = 0
event_debug_mode_too_late = 1
global_debug_map = {hth_table = 0x0, hth_table_length = 0,
hth_n_entries = 0, hth_load_limit = 0, hth_prime_idx = -1}
warn_once = 0
use_monotonic = 1
eventops = {0x2aaaad5ee860, 0x2aaaad5ee8c0, 0x2aaaad5ee900, 0x0}
_mm_malloc_fn = 0
event_global_current_base_ = 0x0
opal_libevent2021__event_debug_mode_on = 0
#2 0x00002aaaab635305 in ompi_request_default_wait_all () from
/home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#3 0x00002aaab1f60417 in ompi_coll_tuned_sendrecv_nonzero_actual ()
from /home/draco/jwall/local_openmpi/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#4 0x00002aaab1f68074 in ompi_coll_tuned_allgather_intra_bruck () from
/home/draco/jwall/local_openmpi/lib/openmpi/mca_coll_tuned.so
No symbol table info available.
#5 0x00002aaaab621e4d in ompi_comm_split () from
/home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#6 0x00002aaaab64f16d in PMPI_Comm_split () from
/home/draco/jwall/local_openmpi/lib/libmpi.so.12
No symbol table info available.
#7 0x00002aaaab3db70f in pmpi_comm_split__ () from
/home/draco/jwall/local_openmpi/lib/libmpi_mpifh.so.12
No symbol table info available.
#8 0x00000000005e43ed in driver_setupparallelenv_ ()
No symbol table info available.
#9 0x00000000005e1187 in driver_initflash_ ()
No symbol table info available.
#10 0x00000000004451ce in __flash_run_MOD_initialize_code ()
No symbol table info available.
#11 0x00000000004124e9 in handle_call.1908 ()
No symbol table info available.
#12 0x0000000000422791 in run_loop_mpi.1914 ()
No symbol table info available.
#13 0x00000000004169db in MAIN__ ()
No symbol table info available.
#14 0x0000000000423c6f in main ()
No symbol table info available.
#15 0x00002aaaac507d1d in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#16 0x0000000000405509 in _start ()
No symbol table info available.
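For what it's worth, here is a small standalone test program that exercises
the same split pattern outside of FLASH. The meshCopyCount of 2 is just a
placeholder I picked for illustration (not necessarily what FLASH uses in my
runs), and the program name is my own invention:

program split_test
  use mpi
  implicit none
  integer :: ierr, globalMe, globalNumProcs
  integer :: meshComm, meshAcrossComm
  integer :: meshMe, meshNumProcs, meshAcrossMe, meshAcrossNumProcs
  integer :: countInComm, color, key
  integer, parameter :: meshCopyCount = 2   ! placeholder value for illustration

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, globalMe, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, globalNumProcs, ierr)

  ! mirror the FLASH check: the rank count must be a multiple of meshCopyCount
  if (mod(globalNumProcs, meshCopyCount) /= 0) then
     if (globalMe == 0) print *, 'run with a multiple of', meshCopyCount, 'ranks'
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! same color/key scheme as above: e.g. with 8 ranks and meshCopyCount = 2,
  ! ranks 0-3 get color 0 and ranks 4-7 get color 1, so each mesh
  ! communicator holds 4 ranks
  countInComm = globalNumProcs / meshCopyCount
  color = globalMe / countInComm
  key   = mod(globalMe, countInComm)

  call MPI_Comm_split(MPI_COMM_WORLD, color, key, meshComm, ierr)
  call MPI_Comm_split(MPI_COMM_WORLD, key, color, meshAcrossComm, ierr)

  call MPI_Comm_rank(meshComm, meshMe, ierr)
  call MPI_Comm_size(meshComm, meshNumProcs, ierr)
  call MPI_Comm_rank(meshAcrossComm, meshAcrossMe, ierr)
  call MPI_Comm_size(meshAcrossComm, meshAcrossNumProcs, ierr)

  ! print only runs if both splits complete, which is what I'm testing for
  print *, 'global', globalMe, 'mesh', meshMe, '/', meshNumProcs, &
           'across', meshAcrossMe, '/', meshAcrossNumProcs

  call MPI_Comm_free(meshComm, ierr)
  call MPI_Comm_free(meshAcrossComm, ierr)
  call MPI_Finalize(ierr)
end program split_test

If something like this also hangs beyond 3 or 4 nodes, that would at least
rule out anything FLASH-specific.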
Anyone have any ideas what the issue might be?
Thanks so much.
Joshua Wall
Ph. D. Candidate
Physics Department
Drexel University
--
Joshua Wall
Doctoral Candidate
Department of Physics
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104