No, there are no others you need to set. Ralph's referring to the fact
that we set OMPI environment variables in the processes that are
started on the remote nodes.
I was asking to ensure you hadn't set any MCA parameters in the
environment that could be creating a problem. Do you have any set in
files, perchance?
And can you run "env | grep OMPI" from the script that you invoked via
mpirun?
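For example, a minimal wrapper script along these lines (names are placeholders) dumps the OMPI environment mpirun set up for each rank before handing off to the real program:

```shell
#!/bin/sh
# Hypothetical wrapper -- launch it in place of the application, e.g.:
#   mpirun --host a,b,c ./wrapper.sh
# Print every OMPI_* variable mpirun injected for this rank:
echo "=== OMPI environment for rank ${OMPI_COMM_WORLD_RANK:-?} ==="
env | grep '^OMPI' | sort
# Then hand control to the real program ("your_executable" is a placeholder):
# exec ./your_executable "$@"
```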
- you mpirun on a single node and all works fine
- you mpirun on multiple nodes and all works fine (e.g., mpirun --host a,b,c your_executable)
- you mpirun on multiple nodes, list a host more than once, and it hangs (e.g., mpirun --host a,a,b,c your_executable)
Is that correct?
If so, can you attach a debugger to one of the hung processes and see
exactly where it's hung? (i.e., get the stack traces)
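A sketch of what that looks like, assuming gdb is available on the node (the demo below attaches to a throwaway sleep process, since a real hung PID would come from ps/pgrep on the affected host):

```shell
# In the real case, find the hung rank's PID and dump all thread stacks:
#   pid=$(pgrep -n your_executable)    # "your_executable" is a placeholder
#   gdb -batch -p "$pid" -ex "thread apply all bt"
# Self-contained demo: attach to a throwaway process instead.
sleep 60 &
pid=$!
gdb -batch -p "$pid" -ex "thread apply all bt" 2>/dev/null || echo "could not attach"
kill "$pid" 2>/dev/null
```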
Per a question from your prior mail: yes, Open MPI does create mmapped
files in /tmp for use with shared memory communication. They *should*
get cleaned up when you exit, however, unless something disastrous
happens.
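If a run does die abnormally, you can check for leftovers by hand; a hedged sketch (the exact session-directory naming under /tmp varies across Open MPI versions):

```shell
# List any Open MPI session directories / shared-memory backing files left
# in /tmp (pattern is version-dependent; adjust to what your version creates).
ls -ld /tmp/openmpi-sessions-* 2>/dev/null || echo "no leftover session files in /tmp"
```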
Thank you very much!
Now I understand what Ralph asked.
Yes, what you described matches the behavior with the sm btl layer. As I double-checked, the problem occurs when I use the sm btl for MPI communication on the same host (i.e., --mca btl openib,sm,self): exactly as you described, everything runs fine on a single node, and everything runs fine on multiple distinct nodes, but the job hangs in MPI_Init() if I run on multiple nodes and list a host more than once. However, if I instead use the tcp or openib btl without the sm layer (i.e., --mca btl openib,self), all three cases run fine.
In all cases, with or without the sm btl layer, I do set the MCA parameters "plm_rsh_agent" to "rsh:ssh" and "btl_openib_warn_default_gid_prefix" to 0. The OMPI environment variables set for each process are quoted below (as output by env | grep OMPI in the script I invoke via mpirun):
------
//process #0:
OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
//process #1:
OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=1
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=1
//process #3:
OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=3
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=3
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1
//process #2:
OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=2
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=2
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=2
OMPI_COMM_WORLD_LOCAL_RANK=0
------
Processes #0 and #1 are on one host, while processes #2 and #3 are on the other.
When I use the sm btl layer, my program hangs at the very first MPI_Init() call.
I hope I have made myself clear.
Thanks,
Yiguang