Discussion:
[OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list
y***@adina.com
2012-02-09 14:31:22 UTC
Hi all,

Good morning!

I am having trouble communicating through the sm btl in Open MPI; please
check the attached file for my system information. I am using Open
MPI 1.4.3 with the Intel compilers v11.1, on Linux RHEL 5.4 with kernel 2.6.

The tests are the following:

(1) If I specify the btls to mpirun with "--mca btl self,sm,openib" and do
not list any of my computing nodes twice or more in the node
list, my job runs fine. However, if I list any of the computing
nodes twice or more in the node list, the job hangs forever.

(2) If I do not specify the sm btl to mpirun (i.e., "--mca btl
self,openib"), my job runs smoothly whether or not any of the
computing nodes appears twice or more in the node list.

From these two tests, something is apparently wrong with the sm btl
on my system. Checking the user archive, I see that sm btl
issues have been reported with comm_spawned
parent/child processes, but that does not seem to be the case here: even if I do
not use my MPI-based solver at all, and only the MPI initialization and
finalization procedures are called, the problem still occurs.

Any comments?

Thanks,
Yiguang
Jeff Squyres
2012-02-10 20:50:34 UTC
Can you provide a specific example?

I'm able to do this just fine, for example (with the upcoming OMPI 1.4.5):

mpirun --host svbu-mpi001,svbu-mpi001,svbu-mpi002,svbu-mpi002 --mca btl sm,openib,self ring
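(For reference, the "ring" program here is the kind of trivial test shipped in the examples/ directory of the Open MPI source tree, e.g. ring_c.c. A rough way to reproduce the same test, with hypothetical host names, would be:

  # build the ring_c example that ships with Open MPI
  mpicc examples/ring_c.c -o ring_c
  # list each host twice so that two ranks share memory on every node,
  # which forces the sm BTL to be used for on-node communication
  mpirun --host nodeA,nodeA,nodeB,nodeB --mca btl sm,openib,self ./ring_c
)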
[Attachment from the quoted original message: ompiinfo-config-uname-output.tgz, 126316 bytes]
y***@adina.com
2012-02-13 19:26:41 UTC
Hi Jeff,

Thank you very much for your help!

I tried to run the same ring_c test from the standard examples in the
Open MPI 1.4.3 distribution. If I run it from the command line as you
described, it works without any problem with the sm btl included
(with --mca btl self,sm,openib). However, if I use the sm btl
(with --mca btl self,sm,openib) and run ring_c from an in-house
script, it shows the same issue I described in my previous
email: it hangs at the MPI_Init(...) call. I think this issue is related to
some environment setting in the script. Do you have any hints, or any
prerequisite system environment configuration for working with the
sm btl layer in Open MPI?

Thanks again,
Yiguang
Jeff Squyres
2012-02-14 11:13:58 UTC
There actually aren't too many tunables in the sm BTL itself.

Can you share the script that you're using to launch Open MPI?

If not, can you share the output of "env | grep OMPI" from your script, perhaps on the line before you launch mpirun?
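(For illustration, a stripped-down csh launch script with that diagnostic added might look like the following; the node list and executable name are placeholders, not taken from the actual script:

  #!/bin/csh
  # show any Open MPI MCA settings inherited from the environment
  env | grep OMPI
  # then launch as usual
  mpirun --mca btl self,sm,openib --host nodeA,nodeA,nodeB,nodeB ./ring_c
)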
y***@adina.com
2012-02-14 14:44:36 UTC
Hi Jeff,

The command "env | grep OMPI" outputs nothing but a blank line
from my script. Is there anything I should set for mpirun?

On a related note, I found that you discussed a similar issue with
Jonathan Dursi. The difference is that when I tried --mca
btl_sm_num_fifos #(np-1), it did not work for me. I did find the files
that sm mmaps in the tmp directory (shared_mem_pool.ibnode001, etc.),
but for some mysterious reason it still hangs at MPI_Init. So are these
files created when we call MPI_Init?

Thanks,
Yiguang
Ralph Castain
2012-02-14 14:55:31 UTC
It looks like your script is stripping away the OMPI envars. That will break the job. Can you look at the script and see why it does that?

Sent from my iPad
y***@adina.com
2012-02-14 15:32:28 UTC
Hi Ralph,

Could you please tell me which OMPI envars are being stripped, or which
OMPI envars should be present for OMPI to work properly?

Although I start my C-shell script from a bash command line (not
sure if this matters), I only add the Open MPI executable and library paths to
$PATH and $LD_LIBRARY_PATH; no other OMPI environment
variables are set on my system (in bash or csh), as far as I can tell.

Thanks,
Yiguang
Ralph Castain
2012-02-14 15:35:55 UTC
Guess I am confused. It was my impression that you mpirun a script that actually starts your process. Yes?

Sent from my iPad
y***@adina.com
2012-02-14 15:47:09 UTC
Yes, in short, I start a C-shell script from a bash command line, in
which I mpirun another C-shell script that starts the computing
process. The only OMPI-related envars are PATH and
LD_LIBRARY_PATH. Are there any other OMPI envars I should set?
Jeff Squyres
2012-02-15 16:01:44 UTC
No, there are no others you need to set. Ralph's referring to the fact that we set OMPI environment variables in the processes that are started on the remote nodes.

I was asking to ensure you hadn't set any MCA parameters in the environment that could be creating a problem. Do you have any set in files, perchance?

And can you run "env | grep OMPI" from the script that you invoked via mpirun?

So just to be clear on the exact problem you're seeing:

- you mpirun on a single node and all works fine
- you mpirun on multiple nodes and all works fine (e.g., mpirun --host a,b,c your_executable)
- you mpirun on multiple nodes and list a host more than once and it hangs (e.g., mpirun --host a,a,b,c your_executable)

Is that correct?

If so, can you attach a debugger to one of the hung processes and see exactly where it's hung? (i.e., get the stack traces)
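(One rough way to do that, with a made-up PID, is to attach gdb on a node where a rank is stuck:

  # attach to one hung rank; debug symbols make the trace more readable
  gdb -p 12345
  # then, inside gdb:
  #   (gdb) thread apply all bt
  # alternatively, "pstack 12345" gives a quick one-shot trace on RHEL
)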

Per a question from your prior mail: yes, Open MPI does create mmapped files in /tmp for use with shared memory communication. They *should* get cleaned up when you exit, however, unless something disastrous happens.
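(To see what is left behind on a node, you can list the temp area; the exact names vary, but in this Open MPI series the session directories typically start with "openmpi-sessions-":

  # look for Open MPI session directories and the sm backing files inside them
  ls -ld /tmp/openmpi-sessions-*
  ls -lR /tmp/openmpi-sessions-*
)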
y***@adina.com
2012-02-15 16:54:26 UTC
Thank you very much!

Now I understand better what Ralph was asking.

Yes, what you described is exactly right for the sm btl layer. As I
double-checked again, the problem appears when I use the sm btl for MPI
communication on the same host (as --mca btl openib,sm,self):
everything runs well on a single node, everything runs well on
multiple but distinct nodes, but it hangs at the MPI_Init() call
if I run on multiple nodes and list a host more than once. However,
if I instead use the tcp or openib btl without the sm layer (as --mca btl
openib,self), all three cases run just fine.

I do set the MCA parameters "plm_rsh_agent" to "rsh:ssh" and
"btl_openib_warn_default_gid_prefix" to 0 in all cases, with or
without the sm btl layer. The OMPI environment variables set for each
process are quoted below (as output by "env | grep OMPI" in my
script invoked by mpirun):

------
//process #0:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0

//process #1:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=1
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #3:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=3
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=3
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #2:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=2
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=2
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=2
OMPI_COMM_WORLD_LOCAL_RANK=0

------
Processes #0 and #1 are on the same host, while processes #2 and #3
are on the other.

When I use the sm btl layer, my program just hangs at MPI_Init() at
the very beginning.

I hope I have made myself clear.

Thanks,
Yiguang
Jeff Squyres
2012-02-15 18:13:40 UTC
Post by y***@adina.com
When I use the sm btl layer, my program just hangs at MPI_Init() at
the very beginning.
Ok, I think I was thrown off by the other things in this conversation.

So the real issue is: the sm BTL is not working for you.

What version of Open MPI are you using?

Can you rm -rf any Open MPI directories that may be left over in /tmp?
y***@adina.com
2012-02-15 18:33:39 UTC
Post by Jeff Squyres
So the real issue is: the sm BTL is not working for you.
Yes.
Post by Jeff Squyres
What version of Open MPI are you using?
I am using 1.4.3.
Post by Jeff Squyres
Can you rm -rf any Open MPI directories that may be left over in /tmp?
Yes, I have tried that. Cleaning up those directories does not make the
sm btl work.
y***@adina.com
2012-02-16 14:09:03 UTC
OK, with Jeff's kind help, I solved this issue in a very simple way.
I would now like to report back the cause of the issue and the
solution.

(1) The scenario under which this issue happened:

In my OMPI environment, the $TMPDIR envar is set to a different
scratch directory for each MPI process, even for MPI
processes running on the same host. This is not a problem if
we use the openib, self, or tcp btl layers for communication. However, if we
use the sm btl layer, then, as Jeff said:

"""
Open MPI creates its shared memory files in $TMPDIR. It implicitly
expects all shared memory files to be found under the same
$TMPDIR for all procs on a single machine.

More specifically, Open MPI creates what we call a "session
directory" under $TMPDIR that is an implicit rendezvous point for all
processes on the same machine. Some meta data is put in there,
to include the shared memory mmap files.

So if the different processes have a different idea of where the
rendezvous session directory exists, they'll end up blocking waiting
for others to show up at their (individual) rendezvous points... but
that will never happen, because each process is waiting at their
own rendezvous point.

"""

So in this case, the MPI processes that should share data through
shared memory block waiting for each other at rendezvous points that
are never reached, hence the hang at the MPI_Init call.
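(As a concrete, hypothetical illustration of the failure mode: suppose the job environment gives each rank on ibnode001 its own scratch area:

  # rank 0's environment (paths made up for illustration)
  setenv TMPDIR /scratch/job1234/rank0
  # rank 1's environment on the same host
  setenv TMPDIR /scratch/job1234/rank1

Each rank then creates its session directory and shared_mem_pool file under its own $TMPDIR and waits in MPI_Init for the other rank to show up there, which never happens.)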

(2) Solution to this issue:

You may set $TMPDIR to the same directory on the same host if
possible; or you can setenv OMPI_PREFIX_ENV to a common
directory for the MPI processes on the same host while keeping your
$TMPDIR setting. Either way is verified and works fine for me!
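(In csh terms, the two workarounds amount to something like the following, with /tmp standing in for whatever common directory you choose:

  # (a) give all ranks on a host the same temp directory
  setenv TMPDIR /tmp
  # (b) or keep the per-rank $TMPDIR and only unify Open MPI's session directory
  setenv OMPI_PREFIX_ENV /tmp
)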

Thanks,
Yiguang
Jeff Squyres
2012-02-16 15:18:13 UTC
Post by y***@adina.com
You may set $TMPDIR to the same directory on the same host if
possible; or you can setenv OMPI_PREFIX_ENV to a common
directory for the MPI processes on the same host while keeping your
$TMPDIR setting. Either way is verified and works fine for me!
A clarification on this...

I found OMPI_PREFIX_ENV through some code diving, and it looks like this is an old name from previous logic. We'll actually be removing it from our SVN trunk shortly.

I think the right answer here is to use the orte_tmpdir_base MCA parameter:

mpirun --mca orte_tmpdir_base /tmp ...

This will tell OMPI where to put the session directory for all processes (even if their $TMPDIRs are different from each other). This should be used instead of setting OMPI_PREFIX_ENV.
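(For example, combined with the BTL selection used earlier in this thread, and with hypothetical host names:

  mpirun --mca btl self,sm,openib --mca orte_tmpdir_base /tmp --host nodeA,nodeA,nodeB,nodeB ./ring_c
)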