Discussion:
[OMPI users] problem with "--host" with openmpi-v3.x-201705250239-d5200ea
Siegmar Gross
2017-05-30 10:42:48 UTC
Hi,

I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
Depending on the machine that I use to start my processes, I have
a problem with "--host" for versions "v3.x" and "master", while
everything works as expected with earlier versions.


loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hello_1_mpi

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
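
This is unexpected to me, because "loki:2,exin" should provide three slots
(two on loki and one on exin). As an aside, the slot check can usually be
bypassed with mpiexec's --oversubscribe option, for example

mpiexec -np 3 --oversubscribe --host loki:2,exin hello_1_mpi

but that should not be necessary here.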



Everything is ok if I use the same command on "exin".

exin fd1026 107 mpiexec -np 3 --host loki:2,exin hello_1_mpi
Process 0 of 3 running on loki
Process 1 of 3 running on loki
Process 2 of 3 running on exin
...



Everything is also ok if I use openmpi-v2.x-201705260340-58c6b3c on "loki".

loki hello_1 114 which mpiexec
/usr/local/openmpi-2.1.2_64_cc/bin/mpiexec
loki hello_1 115 mpiexec -np 3 --host loki:2,exin hello_1_mpi
Process 0 of 3 running on loki
Process 1 of 3 running on loki
Process 2 of 3 running on exin
...


"exin" is a virtual machine on QEMU so that it uses a slightly different
processor architecture, e.g., it has no L3 cache but larger L2 caches.

loki fd1026 117 cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e
"cpu cores" -e "cache size" | sort | uniq
cache size : 15360 KB
cpu cores : 6
model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
physical id : 0
physical id : 1


loki fd1026 118 ssh exin cat /proc/cpuinfo | grep -e "model name" -e "physical
id" -e "cpu cores" -e "cache size" | sort | uniq
cache size : 4096 KB
cpu cores : 6
model name : Intel Core Processor (Haswell, no TSX)
physical id : 0
physical id : 1


Any ideas what's different in the newer versions of Open MPI? Is the new
behavior intended? If "mpiexec -np 3 --host loki:2,exin hello_1_mpi" is
still supposed to print my messages in versions "3.x" and "master" as well,
regardless of the machine on which the programs are started, I would be
grateful if somebody could fix the problem. Do you need anything else?
Thank you very much for any help in advance.


Kind regards

Siegmar
g***@rist.or.jp
2017-05-30 11:36:34 UTC
Hi Siegmar,

what if you try:
mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi

are loki and exin different? (OS, sockets, cores)

Cheers,

Gilles

g***@rist.or.jp
2017-05-30 13:16:43 UTC
Hi Siegmar,

My bad, there was a typo in my reply. I really meant

mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi

but you also tried that and it did not help.

I could not find anything in your logs that suggests mpiexec tries to
start 5 MPI tasks. Did I miss something?

I will try to reproduce the issue myself.

Cheers,

Gilles

----- Original Message -----
Post by Siegmar Gross
Hi Gilles,
Post by g***@rist.or.jp
what if you try:
mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi
I need as many slots as processes, so I use "-np 2".
"mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
breaks if I use at least "-np 3" and distribute the processes across at
least two machines.
loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
Process 0 of 2 running on loki
Process 1 of 2 running on exin
Now 1 slave tasks are sending greetings.
message type: 3
msg length: 131 characters
hostname: exin
operating system: Linux
release: 4.4.49-92.11-default
processor: x86_64
loki hello_1 119
Post by g***@rist.or.jp
are loki and exin different? (OS, sockets, cores)
Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
kernel.
loki fd1026 108 uname -a
Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 109 ssh exin uname -a
Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017 (8f9478a) x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 110
The number of sockets and cores is identical, but the processor types are
different as you can see at the end of my previous email. "loki" uses two
"Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
Processor (Haswell, no TSX)" from QEMU. I can provide a pdf file with both
topologies (89 K) if you are interested in the output from lstopo. I've
added some runs. Most interesting in my opinion are the last two
"mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
"mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
Why does mpiexec create five processes although I've asked for only three
processes? Why do I have to break the program with <Ctrl-c> for the first
of the above commands?
loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
Process 0 of 3 running on exin
Process 1 of 3 running on exin
Process 2 of 3 running on exin
...
loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
Process 1 of 3 running on loki
Process 0 of 3 running on loki
Process 2 of 3 running on loki
...
Process 0 of 3 running on exin
Process 1 of 3 running on exin
[exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect] connect() to 193.xxx.xxx.xxx failed: Connection refused (111)
^Cloki hello_1 116
loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
Process 0 of 3 running on loki
Process 2 of 3 running on loki
Process 1 of 3 running on loki
...
Process 1 of 3 running on exin
Process 0 of 3 running on exin
[exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking] recv(16, 0/8) failed: Connection reset by peer (104)
[exin:31909] ../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191 FATAL
loki hello_1 117
Do you need anything else?
Kind regards and thank you very much for your help
Siegmar
r***@open-mpi.org
2017-05-30 13:20:16 UTC
This behavior is as expected. When you specify "-host foo,bar", you have told us to assign one slot to each of those nodes. Thus, running 3 procs exceeds the number of slots you assigned.

You can tell it to set the #slots to the #cores it discovers on the node by using "-host foo:*,bar:*".

I cannot replicate your behavior of "-np 3 -host foo:2,bar:3" running more than 3 procs.
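
For reference, the same slot counts can also be expressed with a hostfile instead of the --host list; a minimal sketch, where "myhosts" is a hypothetical file containing

loki slots=2
exin slots=1

would be run as

mpiexec -np 3 --hostfile myhosts hello_1_mpi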
Gilles Gouaillardet
2017-05-31 02:43:41 UTC
Ralph,


the issue Siegmar initially reported was

loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi


per what you wrote, this should be equivalent to

loki hello_1 111 mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi

and this is what I initially wanted to double check (but I made a typo in my reply).


anyway, the logs Siegmar posted indicate that the two commands produce the same output:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hello_1_mpi

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------


To me, this is incorrect, since the command line made 3 slots available.
Also, I am unable to reproduce any of these issues :-(



Siegmar,

Can you please post your configure command line, and try these commands from loki:

mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
mpiexec -np 1 --host exin ldd ./hello_1_mpi

If Open MPI is not installed on a shared filesystem (NFS, for example), please also double check that both installs were built from the same source and with the same options.
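
For example, a quick consistency check (assuming both machines pick up mpiexec, orted and ompi_info from the same install prefix in their PATH):

which mpiexec orted
ssh exin which mpiexec orted
ompi_info | grep "Open MPI:"
ssh exin ompi_info | grep "Open MPI:"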


Cheers,

Gilles
r***@open-mpi.org
2017-05-31 03:22:07 UTC
Until the fixes pending in the big ORTE update PR are committed, I suggest not wasting time chasing this down. I tested the “patched” version of the 3.x branch, and it works just fine.
Siegmar Gross
2017-05-31 06:24:08 UTC
Hi Gilles,

I configured Open MPI with the following command.

../openmpi-v3.x-201705250239-d5200ea/configure \
--prefix=/usr/local/openmpi-3.0.0_64_cc \
--libdir=/usr/local/openmpi-3.0.0_64_cc/lib64 \
--with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
--with-jdk-headers=/usr/local/jdk1.8.0_66/include \
JAVA_HOME=/usr/local/jdk1.8.0_66 \
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/lib64" \
CC="cc" CXX="CC" FC="f95" \
CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
FCFLAGS="-m64" \
CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
--enable-mpi-cxx \
--enable-cxx-exceptions \
--enable-mpi-java \
--with-cuda=/usr/local/cuda \
--with-valgrind=/usr/local/valgrind \
--enable-mpi-thread-multiple \
--with-hwloc=internal \
--without-verbs \
--with-wrapper-cflags="-m64 -mt" \
--with-wrapper-cxxflags="-m64" \
--with-wrapper-fcflags="-m64" \
--with-wrapper-ldflags="-mt" \
--enable-debug \
|& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

Do you know when the fixes pending in the big ORTE update PR will be
committed? Perhaps Ralph has a point in suggesting not to spend time on
the problem if it may already be resolved. Nevertheless, I have added the
requested information after the commands below.
Post by Gilles Gouaillardet
Ralph,
the issue Siegmar initially reported was
loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
per what you wrote, this should be equivalent to
loki hello_1 111 mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi
and this is what i initially wanted to double check (but i made a typo in my reply)
anyway, the logs Siegmar posted indicate the two commands produce the same output
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
to me, this is incorrect since the command line made 3 available slots.
also, i am unable to reproduce any of these issues :-(
Siegmar,
can you please post your configure command line, and try these commands from loki
mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
loki hello_1 112 mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
[loki:25620] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[loki:25620] plm:base:set_hnp_name: initial bias 25620 nodename hash 3121685933
[loki:25620] plm:base:set_hnp_name: final jobfam 64424
[loki:25620] [[64424,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[loki:25620] [[64424,0],0] plm:base:receive start comm
[loki:25620] [[64424,0],0] plm:base:setup_job
[loki:25620] [[64424,0],0] plm:base:setup_vm
[loki:25620] [[64424,0],0] plm:base:setup_vm creating map
[loki:25620] [[64424,0],0] setup:vm: working unmanaged allocation
[loki:25620] [[64424,0],0] using dash_host
[loki:25620] [[64424,0],0] checking node loki
[loki:25620] [[64424,0],0] ignoring myself
[loki:25620] [[64424,0],0] checking node exin
[loki:25620] [[64424,0],0] plm:base:setup_vm add new daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:base:setup_vm assigning new daemon [[64424,0],1] to node exin
[loki:25620] [[64424,0],0] plm:rsh: launching vm
[loki:25620] [[64424,0],0] plm:rsh: local shell: 2 (tcsh)
[loki:25620] [[64424,0],0] plm:rsh: assuming same remote shell as local shell
[loki:25620] [[64424,0],0] plm:rsh: remote shell: 2 (tcsh)
[loki:25620] [[64424,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "4222091264" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca
orte_hnp_uri "4222091264.0;tcp://193.174.24.40:38978" -mca orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"
[loki:25620] [[64424,0],0] plm:rsh:launch daemon 0 not a child of mine
[loki:25620] [[64424,0],0] plm:rsh: adding node exin to launch list
[loki:25620] [[64424,0],0] plm:rsh: activating launch event
[loki:25620] [[64424,0],0] plm:rsh: recording launch of daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh exin orted -mca ess "env" -mca ess_base_jobid "4222091264" -mca ess_base_vpid 1
-mca ess_base_num_procs "2" -mca orte_hnp_uri "4222091264.0;tcp://193.174.24.40:38978" -mca orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"]
[exin:19816] [[64424,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[exin:19816] [[64424,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[exin:19816] [[64424,0],1] plm:base:receive start comm
[loki:25620] [[64424,0],0] plm:base:orted_report_launch from daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:base:orted_report_launch from daemon [[64424,0],1] on node exin
[loki:25620] [[64424,0],0] RECEIVED TOPOLOGY SIG 0N:2S:0L3:12L2:24L1:12C:24H:x86_64 FROM NODE exin
[loki:25620] [[64424,0],0] NEW TOPOLOGY - ADDING
[loki:25620] [[64424,0],0] plm:base:orted_report_launch completed for daemon [[64424,0],1] at contact 4222091264.1;tcp://192.168.75.71:49169
[loki:25620] [[64424,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[loki:25620] [[64424,0],0] complete_setup on job [64424,1]
[loki:25620] [[64424,0],0] plm:base:launch_apps for job [64424,1]
[exin:19816] [[64424,0],1] plm:rsh: remote spawn called
[exin:19816] [[64424,0],1] plm:rsh: remote spawn - have no children!
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[loki:25620] [[64424,0],0] plm:base:orted_cmd sending orted_exit commands
[exin:19816] [[64424,0],1] plm:base:receive stop comm
[loki:25620] [[64424,0],0] plm:base:receive stop comm
loki hello_1 112
Post by Gilles Gouaillardet
mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
loki hello_1 113 mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
[loki:25750] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[loki:25750] plm:base:set_hnp_name: initial bias 25750 nodename hash 3121685933
[loki:25750] plm:base:set_hnp_name: final jobfam 64298
[loki:25750] [[64298,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[loki:25750] [[64298,0],0] plm:base:receive start comm
[loki:25750] [[64298,0],0] plm:base:setup_job
[loki:25750] [[64298,0],0] plm:base:setup_vm
[loki:25750] [[64298,0],0] plm:base:setup_vm creating map
[loki:25750] [[64298,0],0] setup:vm: working unmanaged allocation
[loki:25750] [[64298,0],0] using dash_host
[loki:25750] [[64298,0],0] checking node exin
[loki:25750] [[64298,0],0] plm:base:setup_vm add new daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:setup_vm assigning new daemon [[64298,0],1] to node exin
[loki:25750] [[64298,0],0] plm:rsh: launching vm
[loki:25750] [[64298,0],0] plm:rsh: local shell: 2 (tcsh)
[loki:25750] [[64298,0],0] plm:rsh: assuming same remote shell as local shell
[loki:25750] [[64298,0],0] plm:rsh: remote shell: 2 (tcsh)
[loki:25750] [[64298,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "4213833728" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca
orte_hnp_uri "4213833728.0;tcp://193.174.24.40:53840" -mca orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"
[loki:25750] [[64298,0],0] plm:rsh:launch daemon 0 not a child of mine
[loki:25750] [[64298,0],0] plm:rsh: adding node exin to launch list
[loki:25750] [[64298,0],0] plm:rsh: activating launch event
[loki:25750] [[64298,0],0] plm:rsh: recording launch of daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh exin orted -mca ess "env" -mca ess_base_jobid "4213833728" -mca ess_base_vpid 1
-mca ess_base_num_procs "2" -mca orte_hnp_uri "4213833728.0;tcp://193.174.24.40:53840" -mca orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"]
[exin:19978] [[64298,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[exin:19978] [[64298,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[exin:19978] [[64298,0],1] plm:base:receive start comm
[loki:25750] [[64298,0],0] plm:base:orted_report_launch from daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:orted_report_launch from daemon [[64298,0],1] on node exin
[loki:25750] [[64298,0],0] RECEIVED TOPOLOGY SIG 0N:2S:0L3:12L2:24L1:12C:24H:x86_64 FROM NODE exin
[loki:25750] [[64298,0],0] NEW TOPOLOGY - ADDING
[loki:25750] [[64298,0],0] plm:base:orted_report_launch completed for daemon [[64298,0],1] at contact 4213833728.1;tcp://192.168.75.71:56878
[loki:25750] [[64298,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[loki:25750] [[64298,0],0] plm:base:setting slots for node loki by cores
[loki:25750] [[64298,0],0] complete_setup on job [64298,1]
[loki:25750] [[64298,0],0] plm:base:launch_apps for job [64298,1]
[exin:19978] [[64298,0],1] plm:rsh: remote spawn called
[exin:19978] [[64298,0],1] plm:rsh: remote spawn - have no children!
[loki:25750] [[64298,0],0] plm:base:receive processing msg
[loki:25750] [[64298,0],0] plm:base:receive update proc state command from [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
[loki:25750] [[64298,0],0] plm:base:receive done processing commands
[loki:25750] [[64298,0],0] plm:base:launch wiring up iof for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:launch job [64298,1] is not a dynamic spawn
exin
[loki:25750] [[64298,0],0] plm:base:receive processing msg
[loki:25750] [[64298,0],0] plm:base:receive update proc state command from [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for vpid 0 state NORMALLY TERMINATED exit_code 0
[loki:25750] [[64298,0],0] plm:base:receive done processing commands
[loki:25750] [[64298,0],0] plm:base:orted_cmd sending orted_exit commands
[exin:19978] [[64298,0],1] plm:base:receive stop comm
[loki:25750] [[64298,0],0] plm:base:receive stop comm
loki hello_1 113
Post by Gilles Gouaillardet
mpiexec -np 1 --host exin ldd ./hello_1_mpi
I have to adapt the path, because the executables are not in the
local directory (due to my old heterogeneous environment).

loki hello_1 169 mpiexec -np 1 --host exin which -a hello_1_mpi
/home/fd1026/Linux/x86_64/bin/hello_1_mpi


loki hello_1 165 mpiexec -np 1 --host exin ldd $HOME/Linux/x86_64/bin/hello_1_mpi
linux-vdso.so.1 (0x00007ffc81ffb000)
libmpi.so.0 => /usr/local/openmpi-3.0.0_64_cc/lib64/libmpi.so.0 (0x00007f7e242ac000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7e2408f000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7e23cec000)
libopen-rte.so.0 => /usr/local/openmpi-3.0.0_64_cc/lib64/libopen-rte.so.0 (0x00007f7e23569000)
libopen-pal.so.0 => /usr/local/openmpi-3.0.0_64_cc/lib64/libopen-pal.so.0 (0x00007f7e22ddd000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f7e22bd9000)
libnuma.so.1 => /usr/local/lib64/libnuma.so.1 (0x00007f7e229cc000)
libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f7e227ac000)
libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007f7e225a2000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f7e22262000)
librt.so.1 => /lib64/librt.so.1 (0x00007f7e2205a000)
libm.so.6 => /lib64/libm.so.6 (0x00007f7e21d5d000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f7e21b59000)
libz.so.1 => /lib64/libz.so.1 (0x00007f7e21943000)
/lib64/ld-linux-x86-64.so.2 (0x0000555d41aa1000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f7e2171c000)
libcap.so.2 => /lib64/libcap.so.2 (0x00007f7e21517000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f7e21300000)
libpcre.so.1 => /usr/lib64/libpcre.so.1 (0x00007f7e21090000)
loki hello_1 166


Thank you very much for your help

Siegmar
Gilles Gouaillardet
2017-05-31 06:38:17 UTC
Siegmar,


the "big ORTE update" is a bunch of backports from master to v3.x

btw, does the same error occur with master?


I noted that mpirun simply does

ssh exin orted ...

Can you double check that the right orted (e.g.
/usr/local/openmpi-3.0.0_64_cc/bin/orted) is used?

Or you can try:

mpirun --mca orte_launch_agent /usr/local/openmpi-3.0.0_64_cc/bin/orted ...
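
For example, the following should show which orted a non-interactive ssh on exin picks up:

ssh exin which orted
ssh exin 'echo $PATH'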


Cheers,


Gilles
Post by Siegmar Gross
Hi Gilles,
I configured Open MPI with the following command.
../openmpi-v3.x-201705250239-d5200ea/configure \
--prefix=/usr/local/openmpi-3.0.0_64_cc \
--libdir=/usr/local/openmpi-3.0.0_64_cc/lib64 \
--with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
--with-jdk-headers=/usr/local/jdk1.8.0_66/include \
JAVA_HOME=/usr/local/jdk1.8.0_66 \
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64
-L/usr/local/cuda/lib64" \
CC="cc" CXX="CC" FC="f95" \
CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
FCFLAGS="-m64" \
CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
--enable-mpi-cxx \
--enable-cxx-exceptions \
--enable-mpi-java \
--with-cuda=/usr/local/cuda \
--with-valgrind=/usr/local/valgrind \
--enable-mpi-thread-multiple \
--with-hwloc=internal \
--without-verbs \
--with-wrapper-cflags="-m64 -mt" \
--with-wrapper-cxxflags="-m64" \
--with-wrapper-fcflags="-m64" \
--with-wrapper-ldflags="-mt" \
--enable-debug \
|& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
Do you know when the fixes pending in the big ORTE update PR will be
committed? Perhaps Ralph has a point in suggesting not to spend time
on the problem if it may already be resolved. Nevertheless, I added
the requested information after the commands below.
Post by Gilles Gouaillardet
Ralph,
the issue Siegmar initially reported was
loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
per what you wrote, this should be equivalent to
loki hello_1 111 mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi
and this is what I initially wanted to double check (but I made a typo in my reply)
anyway, the logs Siegmar posted indicate that the two commands produce the same output
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
to me, this is incorrect, since the command line made 3 slots available.
also, I am unable to reproduce any of these issues :-(
Siegmar,
can you please post your configure command line, and try these commands from loki
mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
loki hello_1 112 mpiexec -np 3 --host loki:2,exin --mca
plm_base_verbose 5 hostname
[loki:25620] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[loki:25620] plm:base:set_hnp_name: initial bias 25620 nodename hash 3121685933
[loki:25620] plm:base:set_hnp_name: final jobfam 64424
[loki:25620] [[64424,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[loki:25620] [[64424,0],0] plm:base:receive start comm
[loki:25620] [[64424,0],0] plm:base:setup_job
[loki:25620] [[64424,0],0] plm:base:setup_vm
[loki:25620] [[64424,0],0] plm:base:setup_vm creating map
[loki:25620] [[64424,0],0] setup:vm: working unmanaged allocation
[loki:25620] [[64424,0],0] using dash_host
[loki:25620] [[64424,0],0] checking node loki
[loki:25620] [[64424,0],0] ignoring myself
[loki:25620] [[64424,0],0] checking node exin
[loki:25620] [[64424,0],0] plm:base:setup_vm add new daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:base:setup_vm assigning new daemon
[[64424,0],1] to node exin
[loki:25620] [[64424,0],0] plm:rsh: launching vm
[loki:25620] [[64424,0],0] plm:rsh: local shell: 2 (tcsh)
[loki:25620] [[64424,0],0] plm:rsh: assuming same remote shell as local shell
[loki:25620] [[64424,0],0] plm:rsh: remote shell: 2 (tcsh)
/usr/bin/ssh <template> orted -mca ess "env" -mca
ess_base_jobid "4222091264" -mca ess_base_vpid "<template>" -mca
ess_base_num_procs "2" -mca orte_hnp_uri
"4222091264.0;tcp://193.174.24.40:38978" -mca orte_node_regex
"loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"
[loki:25620] [[64424,0],0] plm:rsh:launch daemon 0 not a child of mine
[loki:25620] [[64424,0],0] plm:rsh: adding node exin to launch list
[loki:25620] [[64424,0],0] plm:rsh: activating launch event
[loki:25620] [[64424,0],0] plm:rsh: recording launch of daemon
[[64424,0],1]
[loki:25620] [[64424,0],0] plm:rsh: executing: (/usr/bin/ssh)
[/usr/bin/ssh exin orted -mca ess "env" -mca ess_base_jobid
"4222091264" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca
orte_hnp_uri "4222091264.0;tcp://193.174.24.40:38978" -mca
orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"]
[exin:19816] [[64424,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[exin:19816] [[64424,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[exin:19816] [[64424,0],1] plm:base:receive start comm
[loki:25620] [[64424,0],0] plm:base:orted_report_launch from daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:base:orted_report_launch from daemon
[[64424,0],1] on node exin
[loki:25620] [[64424,0],0] RECEIVED TOPOLOGY SIG
0N:2S:0L3:12L2:24L1:12C:24H:x86_64 FROM NODE exin
[loki:25620] [[64424,0],0] NEW TOPOLOGY - ADDING
[loki:25620] [[64424,0],0] plm:base:orted_report_launch completed for
daemon [[64424,0],1] at contact 4222091264.1;tcp://192.168.75.71:49169
[loki:25620] [[64424,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[loki:25620] [[64424,0],0] complete_setup on job [64424,1]
[loki:25620] [[64424,0],0] plm:base:launch_apps for job [64424,1]
[exin:19816] [[64424,0],1] plm:rsh: remote spawn called
[exin:19816] [[64424,0],1] plm:rsh: remote spawn - have no children!
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hostname
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[loki:25620] [[64424,0],0] plm:base:orted_cmd sending orted_exit commands
[exin:19816] [[64424,0],1] plm:base:receive stop comm
[loki:25620] [[64424,0],0] plm:base:receive stop comm
loki hello_1 112
Post by Gilles Gouaillardet
mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
loki hello_1 113 mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
[loki:25750] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[loki:25750] plm:base:set_hnp_name: initial bias 25750 nodename hash 3121685933
[loki:25750] plm:base:set_hnp_name: final jobfam 64298
[loki:25750] [[64298,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[loki:25750] [[64298,0],0] plm:base:receive start comm
[loki:25750] [[64298,0],0] plm:base:setup_job
[loki:25750] [[64298,0],0] plm:base:setup_vm
[loki:25750] [[64298,0],0] plm:base:setup_vm creating map
[loki:25750] [[64298,0],0] setup:vm: working unmanaged allocation
[loki:25750] [[64298,0],0] using dash_host
[loki:25750] [[64298,0],0] checking node exin
[loki:25750] [[64298,0],0] plm:base:setup_vm add new daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:setup_vm assigning new daemon
[[64298,0],1] to node exin
[loki:25750] [[64298,0],0] plm:rsh: launching vm
[loki:25750] [[64298,0],0] plm:rsh: local shell: 2 (tcsh)
[loki:25750] [[64298,0],0] plm:rsh: assuming same remote shell as local shell
[loki:25750] [[64298,0],0] plm:rsh: remote shell: 2 (tcsh)
/usr/bin/ssh <template> orted -mca ess "env" -mca
ess_base_jobid "4213833728" -mca ess_base_vpid "<template>" -mca
ess_base_num_procs "2" -mca orte_hnp_uri
"4213833728.0;tcp://193.174.24.40:53840" -mca orte_node_regex
"loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"
[loki:25750] [[64298,0],0] plm:rsh:launch daemon 0 not a child of mine
[loki:25750] [[64298,0],0] plm:rsh: adding node exin to launch list
[loki:25750] [[64298,0],0] plm:rsh: activating launch event
[loki:25750] [[64298,0],0] plm:rsh: recording launch of daemon
[[64298,0],1]
[loki:25750] [[64298,0],0] plm:rsh: executing: (/usr/bin/ssh)
[/usr/bin/ssh exin orted -mca ess "env" -mca ess_base_jobid
"4213833728" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca
orte_hnp_uri "4213833728.0;tcp://193.174.24.40:53840" -mca
orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"]
[exin:19978] [[64298,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[exin:19978] [[64298,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[exin:19978] [[64298,0],1] plm:base:receive start comm
[loki:25750] [[64298,0],0] plm:base:orted_report_launch from daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:orted_report_launch from daemon
[[64298,0],1] on node exin
[loki:25750] [[64298,0],0] RECEIVED TOPOLOGY SIG
0N:2S:0L3:12L2:24L1:12C:24H:x86_64 FROM NODE exin
[loki:25750] [[64298,0],0] NEW TOPOLOGY - ADDING
[loki:25750] [[64298,0],0] plm:base:orted_report_launch completed for
daemon [[64298,0],1] at contact 4213833728.1;tcp://192.168.75.71:56878
[loki:25750] [[64298,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[loki:25750] [[64298,0],0] plm:base:setting slots for node loki by cores
[loki:25750] [[64298,0],0] complete_setup on job [64298,1]
[loki:25750] [[64298,0],0] plm:base:launch_apps for job [64298,1]
[exin:19978] [[64298,0],1] plm:rsh: remote spawn called
[exin:19978] [[64298,0],1] plm:rsh: remote spawn - have no children!
[loki:25750] [[64298,0],0] plm:base:receive processing msg
[loki:25750] [[64298,0],0] plm:base:receive update proc state command from [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for
vpid 0 state RUNNING exit_code 0
[loki:25750] [[64298,0],0] plm:base:receive done processing commands
[loki:25750] [[64298,0],0] plm:base:launch wiring up iof for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:launch job [64298,1] is not a dynamic spawn
exin
[loki:25750] [[64298,0],0] plm:base:receive processing msg
[loki:25750] [[64298,0],0] plm:base:receive update proc state command from [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for
vpid 0 state NORMALLY TERMINATED exit_code 0
[loki:25750] [[64298,0],0] plm:base:receive done processing commands
[loki:25750] [[64298,0],0] plm:base:orted_cmd sending orted_exit commands
[exin:19978] [[64298,0],1] plm:base:receive stop comm
[loki:25750] [[64298,0],0] plm:base:receive stop comm
loki hello_1 113
Post by Gilles Gouaillardet
mpiexec -np 1 --host exin ldd ./hello_1_mpi
I have to adapt the path, because the executables are not in the
local directory (due to my old heterogeneous environment).
loki hello_1 169 mpiexec -np 1 --host exin which -a hello_1_mpi
/home/fd1026/Linux/x86_64/bin/hello_1_mpi
loki hello_1 165 mpiexec -np 1 --host exin ldd
$HOME/Linux/x86_64/bin/hello_1_mpi
linux-vdso.so.1 (0x00007ffc81ffb000)
libmpi.so.0 =>
/usr/local/openmpi-3.0.0_64_cc/lib64/libmpi.so.0 (0x00007f7e242ac000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7e2408f000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7e23cec000)
libopen-rte.so.0 =>
/usr/local/openmpi-3.0.0_64_cc/lib64/libopen-rte.so.0
(0x00007f7e23569000)
libopen-pal.so.0 =>
/usr/local/openmpi-3.0.0_64_cc/lib64/libopen-pal.so.0
(0x00007f7e22ddd000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f7e22bd9000)
libnuma.so.1 => /usr/local/lib64/libnuma.so.1
(0x00007f7e229cc000)
libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f7e227ac000)
libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0
(0x00007f7e225a2000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1
(0x00007f7e22262000)
librt.so.1 => /lib64/librt.so.1 (0x00007f7e2205a000)
libm.so.6 => /lib64/libm.so.6 (0x00007f7e21d5d000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f7e21b59000)
libz.so.1 => /lib64/libz.so.1 (0x00007f7e21943000)
/lib64/ld-linux-x86-64.so.2 (0x0000555d41aa1000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f7e2171c000)
libcap.so.2 => /lib64/libcap.so.2 (0x00007f7e21517000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f7e21300000)
libpcre.so.1 => /usr/lib64/libpcre.so.1 (0x00007f7e21090000)
loki hello_1 166
Thank you very much for your help
Siegmar
Post by Gilles Gouaillardet
if Open MPI is not installed on a shared filesystem (NFS for
example), please also double check that
both installs were built from the same source and with the same options
Cheers,
Gilles
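
One way to verify that both installs match, assuming ompi_info from the same
prefix is present on each node (a sketch, not taken from the thread):

ompi_info | head -n 20 > /tmp/loki_ompi.txt
ssh exin /usr/local/openmpi-3.0.0_64_cc/bin/ompi_info | head -n 20 > /tmp/exin_ompi.txt
diff /tmp/loki_ompi.txt /tmp/exin_ompi.txt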
Post by r***@open-mpi.org
This behavior is as-expected. When you specify "-host foo,bar", you
have told us to assign one slot to each of those nodes. Thus,
running 3 procs exceeds the number of slots you assigned.
You can tell it to set the #slots to the #cores it discovers on the
node by using "-host foo:*,bar:*".
I cannot replicate your behavior of "-np 3 -host foo:2,bar:3"
running more than 3 procs.
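
Put differently, a sketch of the slot accounting as Ralph describes it (the
counts are illustrative, and whether this particular snapshot honors them is
exactly what the thread is about):

mpiexec -np 2 --host loki,exin     hello_1_mpi   # one slot per listed host -> 2 slots
mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi   # 2 + 1 explicitly requested slots
mpiexec -np 3 --host loki:*,exin:* hello_1_mpi   # slots set to the number of detected cores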
On May 30, 2017, at 5:24 AM, Siegmar Gross
Hi Gilles,
Post by g***@rist.or.jp
what if you ?
mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi
I need as many slots as processes, so I use "-np 2".
"mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
breaks if I use at least "-np 3" and distribute the processes across at
least two machines.
loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
Process 0 of 2 running on loki
Process 1 of 2 running on exin
Now 1 slave tasks are sending greetings.
message type: 3
msg length: 131 characters
hostname: exin
operating system: Linux
release: 4.4.49-92.11-default
processor: x86_64
loki hello_1 119
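
An equivalent way to make the slot counts from the commands above explicit is
a hostfile (a sketch; the file name is arbitrary and this was not tested in
the thread):

cat > myhosts <<EOF
loki slots=2
exin slots=1
EOF
mpiexec -np 3 --hostfile myhosts hello_1_mpi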
Post by g***@rist.or.jp
are loki and exin different ? (os, sockets, core)
Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
kernel.
loki fd1026 108 uname -a
Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016
(2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 109 ssh exin uname -a
Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017
(8f9478a) x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 110
The number of sockets and cores is identical, but the processor types are
different as you can see at the end of my previous email. "loki" uses two
"Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
Processor (Haswell, no TSX)" from QEMU. I can provide a pdf file with both
topologies (89 K) if you are interested in the output from lstopo. I've
added some runs. Most interesting in my opinion are the last two
"mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
"mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
Why does mpiexec create five processes although I've asked for only three
processes? Why do I have to break the program with <Ctrl-c> for the first
of the above commands?
loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
Process 0 of 3 running on exin
Process 1 of 3 running on exin
Process 2 of 3 running on exin
...
loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
Process 1 of 3 running on loki
Process 0 of 3 running on loki
Process 2 of 3 running on loki
...
Process 0 of 3 running on exin
Process 1 of 3 running on exin
[exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect]
connect() to 193.xxx.xxx.xxx failed: Connection refused (111)
^Cloki hello_1 116
loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
Process 0 of 3 running on loki
Process 2 of 3 running on loki
Process 1 of 3 running on loki
...
Process 1 of 3 running on exin
Process 0 of 3 running on exin
[exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking]
recv(16, 0/8) failed: Connection reset by peer (104)
[exin:31909]
../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191
FATAL
loki hello_1 117
Do you need anything else?
Kind regards and thank you very much for your help
Siegmar
Siegmar Gross
2017-05-31 07:20:47 UTC
Permalink
Hi Gilles,
Post by g***@rist.or.jp
Siegmar,
the "big ORTE update" is a bunch of backports from master to v3.x
btw, does the same error occurs with master ?
Yes, it does, but the error occurs only if I use a real machine with
my virtual machine "exin". I get the expected result if I use two
real machines, and I also get the expected output if I log in on exin
and start the command there.

exin fd1026 108 mpiexec -np 3 --host loki:2,exin hostname
exin
loki
loki
exin fd1026 108



loki hello_1 111 mpiexec -np 1 --host loki which orted
/usr/local/openmpi-master_64_cc/bin/orted
loki hello_1 111 mpiexec -np 3 --host loki:2,exin hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 112 mpiexec -np 3 --host loki:6,exin:6 hostname
loki
loki
loki
loki hello_1 113 mpiexec -np 3 --host loki:2,exin:6 hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 114 mpiexec -np 3 --host loki:2,nfs1 hostname
loki
loki
nfs1
loki hello_1 115
Post by g***@rist.or.jp
i noted mpirun simply does
ssh exin orted ...
can you double check the right orted (e.g. /usr/local/openmpi-3.0.0_64_cc/bin/orted)
loki hello_1 110 mpiexec -np 1 --host loki which orted
/usr/local/openmpi-3.0.0_64_cc/bin/orted
loki hello_1 111 mpiexec -np 1 --host exin which orted
/usr/local/openmpi-3.0.0_64_cc/bin/orted
loki hello_1 112
Post by g***@rist.or.jp
or you can try to
mpirun --mca orte_launch_agent /usr/local/openmpi-3.0.0_64_cc/bin/orted ...
loki hello_1 112 mpirun --mca orte_launch_agent /usr/local/openmpi-3.0.0_64_cc/bin/orted -np 3 --host loki:2,exin hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 113



Kind regards

Siegmar
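
Since the two nodes report different topologies (the lstopo output mentioned
earlier in the thread), capturing what hwloc sees on each node might also
help; a sketch, assuming hwloc's lstopo-no-graphics is installed on both
machines:

lstopo-no-graphics > /tmp/loki_topo.txt
ssh exin lstopo-no-graphics > /tmp/exin_topo.txt
diff /tmp/loki_topo.txt /tmp/exin_topo.txt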
Gilles Gouaillardet
2017-06-01 00:37:44 UTC
Permalink
Thanks Siegmar,


I was finally able to reproduce it.

The error is triggered by the VM topology; I was able to reproduce
it by manually removing the "NUMA" objects from the topology.

As a workaround, you can

mpirun --map-by socket ...

I will follow up on the devel ML with Ralph.



Best regards,


Gilles
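
Applied to the failing command from the start of the thread, the suggested
workaround would look like this (not re-run here; --map-by socket only
changes the mapping policy, the hosts and slot counts stay the same):

mpiexec --map-by socket -np 3 --host loki:2,exin hello_1_mpi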
Post by Siegmar Gross
Hi Gilles,
Post by g***@rist.or.jp
Siegmar,
the "big ORTE update" is a bunch of backports from master to v3.x
btw, does the same error occurs with master ?
Yes, it does, but the error occurs only if I use a real machine with
my virtual machine "exin". I get the expected result if I use two
real machines and I also get the expected output if I login on exin
and start the command on exin.
exin fd1026 108 mpiexec -np 3 --host loki:2,exin hostname
exin
loki
loki
exin fd1026 108
loki hello_1 111 mpiexec -np 1 --host loki which orted
/usr/local/openmpi-master_64_cc/bin/orted
loki hello_1 111 mpiexec -np 3 --host loki:2,exin hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hostname
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 112 mpiexec -np 3 --host loki:6,exin:6 hostname
loki
loki
loki
loki hello_1 113 mpiexec -np 3 --host loki:2,exin:6 hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hostname
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 114 mpiexec -np 3 --host loki:2,nfs1 hostname
loki
loki
nfs1
loki hello_1 115
Post by g***@rist.or.jp
i noted mpirun simply does
ssh exin orted ...
can you double check the right orted (e.g.
/usr/local/openmpi-3.0.0_64_cc/bin/orted)
loki hello_1 110 mpiexec -np 1 --host loki which orted
/usr/local/openmpi-3.0.0_64_cc/bin/orted
loki hello_1 111 mpiexec -np 1 --host exin which orted
/usr/local/openmpi-3.0.0_64_cc/bin/orted
loki hello_1 112
Post by g***@rist.or.jp
or you can try to
mpirun --mca orte_launch_agent
/usr/local/openmpi-3.0.0_64_cc/bin/orted ...
loki hello_1 112 mpirun --mca orte_launch_agent
/usr/local/openmpi-3.0.0_64_cc/bin/orted -np 3 --host loki:2,exin
hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hostname
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 113
Kind regards
Siegmar
Post by g***@rist.or.jp
Cheers,
Gilles
Post by Siegmar Gross
Hi Gilles,
I configured Open MPI with the following command.
../openmpi-v3.x-201705250239-d5200ea/configure \
--prefix=/usr/local/openmpi-3.0.0_64_cc \
--libdir=/usr/local/openmpi-3.0.0_64_cc/lib64 \
--with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
--with-jdk-headers=/usr/local/jdk1.8.0_66/include \
JAVA_HOME=/usr/local/jdk1.8.0_66 \
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64
-L/usr/local/cuda/lib64" \
CC="cc" CXX="CC" FC="f95" \
CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
FCFLAGS="-m64" \
CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
--enable-mpi-cxx \
--enable-cxx-exceptions \
--enable-mpi-java \
--with-cuda=/usr/local/cuda \
--with-valgrind=/usr/local/valgrind \
--enable-mpi-thread-multiple \
--with-hwloc=internal \
--without-verbs \
--with-wrapper-cflags="-m64 -mt" \
--with-wrapper-cxxflags="-m64" \
--with-wrapper-fcflags="-m64" \
--with-wrapper-ldflags="-mt" \
--enable-debug \
|& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
Do you know when the fixes pending in the big ORTE update PR
are committed? Perhaps Ralph has a point suggesting not to spend
time with the problem if it may already be resolved. Nevertheless,
I added the requested information after the commands below.
Post by Gilles Gouaillardet
Ralph,
the issue Siegmar initially reported was
loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
per what you wrote, this should be equivalent to
loki hello_1 111 mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi
and this is what i initially wanted to double check (but i made a typo in my reply)
anyway, the logs Siegmar posted indicate the two commands produce the same output
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
to me, this is incorrect since the command line made 3 available slots.
also, i am unable to reproduce any of these issues :-(
Siegmar,
can you please post your configure command line, and try these commands from loki
mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
loki hello_1 112 mpiexec -np 3 --host loki:2,exin --mca
plm_base_verbose 5 hostname
[loki:25620] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[loki:25620] plm:base:set_hnp_name: initial bias 25620 nodename hash 3121685933
[loki:25620] plm:base:set_hnp_name: final jobfam 64424
[loki:25620] [[64424,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[loki:25620] [[64424,0],0] plm:base:receive start comm
[loki:25620] [[64424,0],0] plm:base:setup_job
[loki:25620] [[64424,0],0] plm:base:setup_vm
[loki:25620] [[64424,0],0] plm:base:setup_vm creating map
[loki:25620] [[64424,0],0] setup:vm: working unmanaged allocation
[loki:25620] [[64424,0],0] using dash_host
[loki:25620] [[64424,0],0] checking node loki
[loki:25620] [[64424,0],0] ignoring myself
[loki:25620] [[64424,0],0] checking node exin
[loki:25620] [[64424,0],0] plm:base:setup_vm add new daemon
[[64424,0],1]
[loki:25620] [[64424,0],0] plm:base:setup_vm assigning new daemon
[[64424,0],1] to node exin
[loki:25620] [[64424,0],0] plm:rsh: launching vm
[loki:25620] [[64424,0],0] plm:rsh: local shell: 2 (tcsh)
[loki:25620] [[64424,0],0] plm:rsh: assuming same remote shell as local shell
[loki:25620] [[64424,0],0] plm:rsh: remote shell: 2 (tcsh)
/usr/bin/ssh <template> orted -mca ess "env" -mca
ess_base_jobid "4222091264" -mca ess_base_vpid "<template>" -mca
ess_base_num_procs "2" -mca orte_hnp_uri
"4222091264.0;tcp://193.174.24.40:38978" -mca orte_node_regex
"loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"
[loki:25620] [[64424,0],0] plm:rsh:launch daemon 0 not a child of mine
[loki:25620] [[64424,0],0] plm:rsh: adding node exin to launch list
[loki:25620] [[64424,0],0] plm:rsh: activating launch event
[loki:25620] [[64424,0],0] plm:rsh: recording launch of daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:rsh: executing: (/usr/bin/ssh)
[/usr/bin/ssh exin orted -mca ess "env" -mca ess_base_jobid
"4222091264" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca
orte_hnp_uri "4222091264.0;tcp://193.174.24.40:38978" -mca
orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"]
[exin:19816] [[64424,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[exin:19816] [[64424,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[exin:19816] [[64424,0],1] plm:base:receive start comm
[loki:25620] [[64424,0],0] plm:base:orted_report_launch from daemon [[64424,0],1]
[loki:25620] [[64424,0],0] plm:base:orted_report_launch from daemon
[[64424,0],1] on node exin
[loki:25620] [[64424,0],0] RECEIVED TOPOLOGY SIG
0N:2S:0L3:12L2:24L1:12C:24H:x86_64 FROM NODE exin
[loki:25620] [[64424,0],0] NEW TOPOLOGY - ADDING
[loki:25620] [[64424,0],0] plm:base:orted_report_launch completed
for daemon [[64424,0],1] at contact
4222091264.1;tcp://192.168.75.71:49169
[loki:25620] [[64424,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[loki:25620] [[64424,0],0] complete_setup on job [64424,1]
[loki:25620] [[64424,0],0] plm:base:launch_apps for job [64424,1]
[exin:19816] [[64424,0],1] plm:rsh: remote spawn called
[exin:19816] [[64424,0],1] plm:rsh: remote spawn - have no children!
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
hostname
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[loki:25620] [[64424,0],0] plm:base:orted_cmd sending orted_exit commands
[exin:19816] [[64424,0],1] plm:base:receive stop comm
[loki:25620] [[64424,0],0] plm:base:receive stop comm
loki hello_1 112
Post by Gilles Gouaillardet
mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
loki hello_1 113 mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
[loki:25750] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[loki:25750] plm:base:set_hnp_name: initial bias 25750 nodename hash 3121685933
[loki:25750] plm:base:set_hnp_name: final jobfam 64298
[loki:25750] [[64298,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[loki:25750] [[64298,0],0] plm:base:receive start comm
[loki:25750] [[64298,0],0] plm:base:setup_job
[loki:25750] [[64298,0],0] plm:base:setup_vm
[loki:25750] [[64298,0],0] plm:base:setup_vm creating map
[loki:25750] [[64298,0],0] setup:vm: working unmanaged allocation
[loki:25750] [[64298,0],0] using dash_host
[loki:25750] [[64298,0],0] checking node exin
[loki:25750] [[64298,0],0] plm:base:setup_vm add new daemon
[[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:setup_vm assigning new daemon
[[64298,0],1] to node exin
[loki:25750] [[64298,0],0] plm:rsh: launching vm
[loki:25750] [[64298,0],0] plm:rsh: local shell: 2 (tcsh)
[loki:25750] [[64298,0],0] plm:rsh: assuming same remote shell as local shell
[loki:25750] [[64298,0],0] plm:rsh: remote shell: 2 (tcsh)
/usr/bin/ssh <template> orted -mca ess "env" -mca
ess_base_jobid "4213833728" -mca ess_base_vpid "<template>" -mca
ess_base_num_procs "2" -mca orte_hnp_uri
"4213833728.0;tcp://193.174.24.40:53840" -mca orte_node_regex
"loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"
[loki:25750] [[64298,0],0] plm:rsh:launch daemon 0 not a child of mine
[loki:25750] [[64298,0],0] plm:rsh: adding node exin to launch list
[loki:25750] [[64298,0],0] plm:rsh: activating launch event
[loki:25750] [[64298,0],0] plm:rsh: recording launch of daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:rsh: executing: (/usr/bin/ssh)
[/usr/bin/ssh exin orted -mca ess "env" -mca ess_base_jobid
"4213833728" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca
orte_hnp_uri "4213833728.0;tcp://193.174.24.40:53840" -mca
orte_node_regex "loki,exin" --mca plm_base_verbose "5" -mca plm "rsh"]
[exin:19978] [[64298,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[exin:19978] [[64298,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[exin:19978] [[64298,0],1] plm:base:receive start comm
[loki:25750] [[64298,0],0] plm:base:orted_report_launch from daemon [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:orted_report_launch from daemon
[[64298,0],1] on node exin
[loki:25750] [[64298,0],0] RECEIVED TOPOLOGY SIG
0N:2S:0L3:12L2:24L1:12C:24H:x86_64 FROM NODE exin
[loki:25750] [[64298,0],0] NEW TOPOLOGY - ADDING
[loki:25750] [[64298,0],0] plm:base:orted_report_launch completed
for daemon [[64298,0],1] at contact
4213833728.1;tcp://192.168.75.71:56878
[loki:25750] [[64298,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[loki:25750] [[64298,0],0] plm:base:setting slots for node loki by cores
[loki:25750] [[64298,0],0] complete_setup on job [64298,1]
[loki:25750] [[64298,0],0] plm:base:launch_apps for job [64298,1]
[exin:19978] [[64298,0],1] plm:rsh: remote spawn called
[exin:19978] [[64298,0],1] plm:rsh: remote spawn - have no children!
[loki:25750] [[64298,0],0] plm:base:receive processing msg
[loki:25750] [[64298,0],0] plm:base:receive update proc state
command from [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state
for vpid 0 state RUNNING exit_code 0
[loki:25750] [[64298,0],0] plm:base:receive done processing commands
[loki:25750] [[64298,0],0] plm:base:launch wiring up iof for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:launch job [64298,1] is not a dynamic spawn
exin
[loki:25750] [[64298,0],0] plm:base:receive processing msg
[loki:25750] [[64298,0],0] plm:base:receive update proc state
command from [[64298,0],1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state for job [64298,1]
[loki:25750] [[64298,0],0] plm:base:receive got update_proc_state
for vpid 0 state NORMALLY TERMINATED exit_code 0
[loki:25750] [[64298,0],0] plm:base:receive done processing commands
[loki:25750] [[64298,0],0] plm:base:orted_cmd sending orted_exit commands
[exin:19978] [[64298,0],1] plm:base:receive stop comm
[loki:25750] [[64298,0],0] plm:base:receive stop comm
loki hello_1 113
Post by Gilles Gouaillardet
mpiexec -np 1 --host exin ldd ./hello_1_mpi
I have to adapt the path, because the executables are not in the
local directory (due to my old heterogeneous environment).
loki hello_1 169 mpiexec -np 1 --host exin which -a hello_1_mpi
/home/fd1026/Linux/x86_64/bin/hello_1_mpi
loki hello_1 165 mpiexec -np 1 --host exin ldd
$HOME/Linux/x86_64/bin/hello_1_mpi
linux-vdso.so.1 (0x00007ffc81ffb000)
libmpi.so.0 =>
/usr/local/openmpi-3.0.0_64_cc/lib64/libmpi.so.0 (0x00007f7e242ac000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7e2408f000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7e23cec000)
libopen-rte.so.0 =>
/usr/local/openmpi-3.0.0_64_cc/lib64/libopen-rte.so.0
(0x00007f7e23569000)
libopen-pal.so.0 =>
/usr/local/openmpi-3.0.0_64_cc/lib64/libopen-pal.so.0
(0x00007f7e22ddd000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f7e22bd9000)
libnuma.so.1 => /usr/local/lib64/libnuma.so.1
(0x00007f7e229cc000)
libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f7e227ac000)
libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0
(0x00007f7e225a2000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1
(0x00007f7e22262000)
librt.so.1 => /lib64/librt.so.1 (0x00007f7e2205a000)
libm.so.6 => /lib64/libm.so.6 (0x00007f7e21d5d000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f7e21b59000)
libz.so.1 => /lib64/libz.so.1 (0x00007f7e21943000)
/lib64/ld-linux-x86-64.so.2 (0x0000555d41aa1000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f7e2171c000)
libcap.so.2 => /lib64/libcap.so.2 (0x00007f7e21517000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f7e21300000)
libpcre.so.1 => /usr/lib64/libpcre.so.1 (0x00007f7e21090000)
loki hello_1 166
Thank you very much for your help
Siegmar
Post by Gilles Gouaillardet
if Open MPI is not installed on a shared filesystem (NFS for
example), please also double check
both install were built from the same source and with the same options
Cheers,
Gilles
Post by r***@open-mpi.org
This behavior is as-expected. When you specify "-host foo,bar”,
you have told us to assign one slot to each of those nodes. Thus,
running 3 procs exceeds the number of slots you assigned.
You can tell it to set the #slots to the #cores it discovers on
the node by using “-host foo:*,bar:*”
I cannot replicate your behavior of "-np 3 -host foo:2,bar:3”
running more than 3 procs
On May 30, 2017, at 5:24 AM, Siegmar Gross
Hi Gilles,
Post by g***@rist.or.jp
what if you ?
mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi
I need as many slots as processes so that I use "-np 2".
"mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
breaks, if I use at least "-np 3" and distribute the processes across at
least two machines.
loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
Process 0 of 2 running on loki
Process 1 of 2 running on exin
Now 1 slave tasks are sending greetings.
message type: 3
msg length: 131 characters
hostname: exin
operating system: Linux
release: 4.4.49-92.11-default
processor: x86_64
loki hello_1 119
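(For readers without Siegmar's test program: a minimal stand-in that prints the
"Process i of n running on host" lines seen above, and nothing else, could look
like the sketch below; the greeting/uname exchange of the real hello_1_mpi is
omitted.)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        /* rank of this process, total number of ranks, and the node name */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);
        printf("Process %d of %d running on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }

(Built with the Open MPI wrapper, e.g. "mpicc -o hello_1_mpi hello_1_mpi.c",
assuming that source file name, and run with the mpiexec lines shown in this
thread.)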
Post by g***@rist.or.jp
are loki and exin different ? (os, sockets, core)
Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
kernel.
loki fd1026 108 uname -a
Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 109 ssh exin uname -a
Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017 (8f9478a) x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 110
The number of sockets and cores is identical, but the processor types are
different as you can see at the end of my previous email. "loki" uses two
"Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
Processor (Haswell, no TSX)" from QEMU. I can provide a pdf file with both
topologies (89 K) if you are interested in the output from lstopo. I've
added some runs. Most interesting in my opinion are the last two
"mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
"mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
Why does mpiexec create five processes although I've asked for only three
processes? Why do I have to break the program with <Ctrl-c> for the first
of the above commands?
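(Regarding the lstopo output mentioned above: plain-text output on both nodes is
usually enough to spot topology differences such as the missing L3 cache; a hedged
sketch, assuming hwloc's lstopo is installed on both machines.)

    lstopo --no-io
    ssh exin lstopo --no-io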
loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
  hello_1_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
Process 0 of 3 running on exin
Process 1 of 3 running on exin
Process 2 of 3 running on exin
...
loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
Process 1 of 3 running on loki
Process 0 of 3 running on loki
Process 2 of 3 running on loki
...
Process 0 of 3 running on exin
Process 1 of 3 running on exin
[exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect]
connect() to 193.xxx.xxx.xxx failed: Connection refused (111)
^Cloki hello_1 116
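(A hedged aside on the connect() failure above: in mixed real/virtual setups the
TCP BTL sometimes picks an interface that is not reachable from the other node.
Restricting it to a known-good interface is a common check; "eth0" is only a
placeholder for the real interface name.)

    mpiexec -np 3 --host exin:2,loki:3 --mca btl_tcp_if_include eth0 hello_1_mpi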
loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
Process 0 of 3 running on loki
Process 2 of 3 running on loki
Process 1 of 3 running on loki
...
Process 1 of 3 running on exin
Process 0 of 3 running on exin
[exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking]
recv(16, 0/8) failed: Connection reset by peer (104)
[exin:31909] ../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191 FATAL
loki hello_1 117
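(Another hedged suggestion for the "five processes" question: mpiexec can print
the allocation and the rank-to-node map before launching, which should show
whether the job is being mapped, or launched, twice.)

    mpiexec -np 3 --host exin:2,loki:3 --display-allocation --display-map hello_1_mpi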
Do you need anything else?
Kind regards and thank you very much for your help
Siegmar