Discussion: [OMPI users] How to launch ompi-server?
Adam Sylvester
2017-03-19 11:37:37 UTC
I am trying to use ompi-server with Open MPI 1.10.6. I'm wondering whether I
should run it with or without the mpirun command. If I run this:

ompi-server --no-daemonize -r +

It prints something like 959315968.0;tcp://172.31.3.57:45743 to stdout,
but I have so far been unable to connect to it. That is, in another
application on another machine on the same network as the
ompi-server machine, I try:

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "ompi_global_scope", "true");

char myport[MPI_MAX_PORT_NAME];
MPI_Open_port(MPI_INFO_NULL, myport);
MPI_Publish_name("adam-server", info, myport);

But the MPI_Publish_name() call hangs forever when I run it like this:

mpirun -np 1 --ompi-server "959315968.0;tcp://172.31.3.57:45743" server
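For completeness, here is a sketch of the full server program I'm testing with (error checking omitted; the MPI_Comm_accept and cleanup portion is the obvious continuation, not something I've actually reached yet, since the publish hangs first):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Ask for the name to be published globally via ompi-server,
     * not just within this mpirun job. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");

    char myport[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, myport);
    printf("Opened port: %s\n", myport);

    /* This is the call that hangs when ompi-server is remote. */
    MPI_Publish_name("adam-server", info, myport);

    /* Wait for one client to connect, then tear everything down. */
    MPI_Comm client;
    MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    MPI_Comm_disconnect(&client);

    MPI_Unpublish_name("adam-server", info, myport);
    MPI_Close_port(myport);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```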

Blog posts are inconsistent as to whether you should run ompi-server under
mpirun, so I tried that as well, but it seg faults:

mpirun -np 1 ompi-server --no-daemonize -r +
[ip-172-31-5-39:14785] *** Process received signal ***
[ip-172-31-5-39:14785] Signal: Segmentation fault (11)
[ip-172-31-5-39:14785] Signal code: Address not mapped (1)
[ip-172-31-5-39:14785] Failing at address: 0x6e0
[ip-172-31-5-39:14785] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7f895d7a5370]
[ip-172-31-5-39:14785] [ 1]
/usr/local/lib/libopen-pal.so.13(opal_hwloc191_hwloc_get_cpubind+0x9)[0x7f895e336839]
[ip-172-31-5-39:14785] [ 2]
/usr/local/lib/libopen-rte.so.12(orte_ess_base_proc_binding+0x17a)[0x7f895e5d8fca]
[ip-172-31-5-39:14785] [ 3]
/usr/local/lib/openmpi/mca_ess_env.so(+0x15dd)[0x7f895cdcd5dd]
[ip-172-31-5-39:14785] [ 4]
/usr/local/lib/libopen-rte.so.12(orte_init+0x168)[0x7f895e5b5368]
[ip-172-31-5-39:14785] [ 5] ompi-server[0x4014d4]
[ip-172-31-5-39:14785] [ 6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f895d3f6b35]
[ip-172-31-5-39:14785] [ 7] ompi-server[0x40176b]
[ip-172-31-5-39:14785] *** End of error message ***

Am I doing something wrong?
r***@open-mpi.org
2017-03-19 18:46:49 UTC
Well, your initial usage looks correct - you don’t launch ompi-server via mpirun. However, it sounds like there is probably a bug somewhere if it hangs as you describe.

Scratching my head, I can only recall fewer than a handful of people ever using these MPI functions to cross-connect jobs, so it does tend to fall into disrepair. I’ll try to repair it, at least for 3.0.
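For the record, the intended usage pattern looks something like this (a sketch; the file: form of --ompi-server just avoids copying the URI string around by hand):

```shell
# On the server host: start ompi-server standalone (no mpirun)
# and have it report its contact URI into a file.
ompi-server --no-daemonize -r /tmp/ompi-server.uri

# On each participating host: point mpirun at that URI.
# Either paste the URI string directly...
mpirun -np 1 --ompi-server "959315968.0;tcp://172.31.3.57:45743" ./server
# ...or, if the URI file is reachable (e.g. a shared filesystem),
# reference it by name:
mpirun -np 1 --ompi-server file:/tmp/ompi-server.uri ./server
```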
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Adam Sylvester
2017-03-19 20:40:21 UTC
I did a little more testing in case it helps: if I run ompi-server on
the same host as the one where I call MPI_Publish_name(), it successfully
connects. But when I run ompi-server on a separate machine (which is on the
same network and reachable via TCP), I get the hang described above.
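
In case it's useful, the client side I'm testing with looks roughly like this (again a sketch, error checking omitted; it has to be launched with the same --ompi-server option so the lookup goes to the global server):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Look up the port string the server published under "adam-server". */
    char port[MPI_MAX_PORT_NAME];
    MPI_Lookup_name("adam-server", MPI_INFO_NULL, port);
    printf("Found server port: %s\n", port);

    /* Connect to the server's open port, then disconnect. */
    MPI_Comm server;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    MPI_Comm_disconnect(&server);

    MPI_Finalize();
    return 0;
}
```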

Thanks for taking a look - if you'd like me to open a bug report for this
one somewhere, just let me know.

-Adam
r***@open-mpi.org
2017-05-27 20:02:35 UTC
This is now fixed in master and will make it into v3.0, which is planned for release in the near future.
Adam Sylvester
2017-05-28 16:10:53 UTC
Thanks! Similar to the MPI_Comm_accept() thread, I've been working around
this, but I'm looking forward to using it in 3.0 to clean up my applications.