Discussion: [OMPI users] Can't connect using MPI Ports
Florian Lindner
2017-11-03 14:48:45 UTC
Hello,

I'm working on a sample program that connects two groups of MPI processes, each launched with its own mpirun, using MPI ports.

First, I use MPI_Open_port to obtain a port name and write it to a file:

if (options.participant == A) { // A publishes the port
    if (options.commType == single and rank == 0)
        openPublishPort(options);

    if (options.commType == many)
        openPublishPort(options);
}
MPI_Barrier(MPI_COMM_WORLD);

participant is a command-line argument; it defines A as the server and B as the client.

void openPublishPort(Options options)
{
    using namespace boost::filesystem;
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char p[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, p);
    std::string portName(p);

    create_directory(options.publishDirectory);
    std::string filename;
    if (options.commType == many)
        filename = "A-" + std::to_string(rank) + ".address";
    if (options.commType == single)
        filename = "intercomm.address";

    auto path = options.publishDirectory / filename;
    DEBUG << "Writing address " << portName << " to " << path;
    std::ofstream ofs(path.string(), std::ofstream::out);
    ofs << portName;
}

This works fine as far as I can see. Next, I try to connect:

MPI_Comm icomm;
std::string portName;
if (options.participant == A) { // receives connections
    if (options.commType == single) {
        if (rank == 0)
            portName = readPort(options);
        INFO << "Accepting connection on " << portName;
        MPI_Comm_accept(portName.c_str(), MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
        INFO << "Received connection";
    }
}

if (options.participant == B) { // connects to the intercomms
    if (options.commType == single) {
        if (rank == 0)
            portName = readPort(options);
        INFO << "Trying to connect to " << portName;
        MPI_Comm_connect(portName.c_str(), MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
        INFO << "Connected";
    }
}


options.commType == single means that I want a single communicator that contains all ranks on both participants, A and B.
readPort reads the port name from the file that was written before.
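readPort is not shown above; a minimal sketch of what it does (assuming the same Options type as in openPublishPort, the single-communicator case, and that <fstream> is included) would be:

std::string readPort(Options options)
{
    using namespace boost::filesystem;
    // Counterpart to openPublishPort: read the port name back from
    // the file the server wrote.
    auto path = options.publishDirectory / "intercomm.address";
    std::ifstream ifs(path.string());
    std::string portName;
    std::getline(ifs, portName);
    DEBUG << "Read address " << portName << " from " << path;
    return portName;
}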

Now, when I first launch A and, in another terminal, B, nothing happens until a timeout occurs.

% mpirun -n 1 ./mpiports --commType="single" --participant="A"
[2017-11-03 15:29:55.469891] [debug] Writing address 3048013825.0:1069313090 to "./publish/intercomm.address"
[2017-11-03 15:29:55.470169] [debug] Read address 3048013825.0:1069313090 from "./publish/intercomm.address"
[2017-11-03 15:29:55.470185] [info] Accepting connection on 3048013825.0:1069313090
[asaru:16199] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 195
[...]

and on the other side:

% mpirun -n 1 ./mpiports --commType="single" --participant="B"
[2017-11-03 15:29:59.698921] [debug] Read address 3048013825.0:1069313090 from "./publish/intercomm.address"
[2017-11-03 15:29:59.698947] [info] Trying to connect to 3048013825.0:1069313090
[asaru:16238] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 195
[...]

The complete code, including a CMake build script, can be downloaded at:

https://www.dropbox.com/s/azo5ti4kjg12zjy/MPI_Ports.tar.gz?dl=0

Why is the connection not working?

Thanks a lot,
Florian
r***@open-mpi.org
2017-11-03 15:18:46 UTC
What version of OMPI are you using?
Florian Lindner
2017-11-03 18:23:03 UTC
Post by r***@open-mpi.org
What version of OMPI are you using?
2.1.1 @ Arch Linux.

Best,
Florian
r***@open-mpi.org
2017-11-03 23:05:07 UTC
Yeah, there isn’t any way that is going to work in the 2.x series. I’m not sure it was ever fixed, but you might try the latest 3.0, the 3.1rc, and even master.

The only methods that are known to work are:

* connecting processes within the same mpirun - e.g., using comm_spawn (see the sketch after this list)

* connecting processes across different mpiruns, with the ompi-server daemon as the rendezvous point
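
For the first method, a minimal sketch (with "./worker" as a hypothetical child binary that calls MPI_Comm_get_parent to obtain the same intercommunicator):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // Spawn the other side from within the same mpirun; no port
    // exchange is needed because mpirun brokers the connection.
    MPI_Comm intercomm;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    int msg = 42; // traffic over the intercomm then works as usual
    MPI_Send(&msg, 1, MPI_INT, 0, 0, intercomm);

    MPI_Finalize();
    return 0;
}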

The old command line method (i.e., what you are trying to use) hasn’t been much on the radar. I don’t know if someone else has picked it up or not...
Ralph
Florian Lindner
2017-11-05 14:48:43 UTC
Post by r***@open-mpi.org
* connecting processes within the same mpirun - e.g., using comm_spawn
That is not an option for our application.
Post by r***@open-mpi.org
* connecting processes across different mpiruns, with the ompi-server daemon as the rendezvous point
The old command line method (i.e., what you are trying to use) hasn’t been much on the radar. I don’t know if someone else has picked it up or not...
What do you mean by "the old command line method"?

Isn't the ompi-server just another means of exchanging port names, i.e. the same as I do using files?

In my understanding, using Publish_name and Lookup_name or exchanging the information using files (or command line or stdin) shouldn't have any impact on the connection (Connect / Accept) itself.

Best,
Florian
r***@open-mpi.org
2017-11-05 19:57:47 UTC
Post by Florian Lindner
What do you mean by "the old command line method"?
Isn't the ompi-server just another means of exchanging port names, i.e. the same as I do using files?
No, it isn’t - there is a handshake that ompi-server facilitates.
Post by Florian Lindner
In my understanding, using Publish_name and Lookup_name or exchanging the information using files (or command line or stdin) shouldn't have any impact on the connection (Connect / Accept) itself.
Depends on the implementation underneath connect/accept.

The initial MPI standard authors had fixed in their minds that the connect/accept handshake would take place over a TCP socket, and so no intermediate rendezvous broker was involved. That isn’t how we’ve chosen to implement it this time around, and so you do need the intermediary. If/when some developer wants to add another method, they are welcome to do so - but the general opinion was that the broker requirement was fine.
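
A minimal sketch of that rendezvous setup, reusing Florian's binary and flags from earlier in the thread (--report-uri is the long form of ompi-server's -r flag):

# start the rendezvous daemon; it writes its contact URI into a file
ompi-server --report-uri ompi.connect

# point both mpiruns at that URI file so connect/accept can rendezvous
mpirun -n 1 --ompi-server file:ompi.connect ./mpiports --commType="single" --participant="A"
mpirun -n 1 --ompi-server file:ompi.connect ./mpiports --commType="single" --participant="B"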
Florian Lindner
2017-11-06 15:46:58 UTC
Post by r***@open-mpi.org
Depends on the implementation underneath connect/accept.
The initial MPI standard authors had fixed in their minds that the connect/accept handshake would take place over a TCP socket, and so no intermediate rendezvous broker was involved. That isn’t how we’ve chosen to implement it this time around, and so you do need the intermediary.
Ok. Just to make sure I understood correctly:

The MPI Ports functionality (chapter 10.4 of MPI 3.1), mainly consisting of MPI_Open_port, MPI_Comm_accept and MPI_Comm_connect, is not usable without running an ompi-server as a third process?

Thanks again,
Florian
r***@open-mpi.org
2017-11-06 16:00:36 UTC
Post by Florian Lindner
The MPI Ports functionality (chapter 10.4 of MPI 3.1), mainly consisting of MPI_Open_port, MPI_Comm_accept and MPI_Comm_connect, is not usable without running an ompi-server as a third process?
Yes, that’s correct. The reason for moving in that direction is that the resource managers, as they continue to integrate PMIx into them, are going to be providing that third party. This will make connect/accept much easier to use, and a great deal more scalable.

See https://github.com/pmix/RFCs/blob/master/RFC0003.md for an explanation.
Florian Lindner
2017-11-09 09:54:28 UTC
Post by r***@open-mpi.org
Yes, that’s correct. The reason for moving in that direction is that the resource managers, as they continue to integrate PMIx into them, are going to be providing that third party. This will make connect/accept much easier to use, and a great deal more scalable.
See https://github.com/pmix/RFCs/blob/master/RFC0003.md for an explanation.
Ok, thanks for that input. I hadn't heard of PMIx so far (only as part of some OMPI error messages).

Using ompi-server -d -r 'ompi.connect' I was able to publish and retrieve the port name; however, still no connection could be established.

% mpirun -n 1 --ompi-server "file:ompi.connect" ./a.out A
Published port 3044605953.0:664448538

% mpirun -n 1 --ompi-server "file:ompi.connect" ./a.out B
Looked up port 3044605953.0:664448538


At this point, both processes hang.

The code is:

#include <iostream>
#include <string>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    std::string a(argv[1]);
    char p[MPI_MAX_PORT_NAME];
    MPI_Comm icomm;

    if (a == "A") { // server: open a port, publish it, wait for the client
        MPI_Open_port(MPI_INFO_NULL, p);
        MPI_Publish_name("foobar", MPI_INFO_NULL, p);
        printf("Published port %s\n", p);
        MPI_Comm_accept(p, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
    }
    if (a == "B") { // client: look up the published port and connect
        MPI_Lookup_name("foobar", MPI_INFO_NULL, p);
        printf("Looked up port %s\n", p);
        MPI_Comm_connect(p, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
    }

    MPI_Finalize();

    return 0;
}



Do you have any idea?

Best,
Florian
r***@open-mpi.org
2017-11-09 17:01:01 UTC
I did a quick check across the v2.1 and v3.0 OMPI releases and both failed, though with different signatures. Looks like a problem in the OMPI dynamics integration (i.e., the PMIx library looked like it was doing the right things).

I’d suggest filing an issue on the OMPI github site so someone can address it (I don’t work much on OMPI any more, I’m afraid).