[OMPI users] MPI_Comm_spawn question
e***@info.sgu.ru
2017-01-31 15:33:42 UTC
Hi,

I am trying to write a trivial master-slave program. The master simply creates
slaves and sends them a string; they print it out and exit. Everything works
fine, but when I add a delay (more than 2 sec) before calling
MPI_Init on the slave, MPI fails with MPI_ERR_SPAWN. I am pretty sure that
MPI_Comm_spawn has some kind of timeout while waiting for the slaves to call
MPI_Init, and if they fail to respond in time, it returns an error.

I believe there is a way to change this behaviour, but I wasn't able to
find any suggestions or ideas on the internet.
I would appreciate it if someone could help with this.

---
--- Terminal command I use to run the program:
mpirun -n 1 hello 2 2 // the first argument to "hello" is the number of
slaves, the second is the delay in seconds

--- Error message I get when delay is >=2 sec:
[host:2231] *** An error occurred in MPI_Comm_spawn
[host:2231] *** reported by process [3453419521,0]
[host:2231] *** on communicator MPI_COMM_SELF
[host:2231] *** MPI_ERR_SPAWN: could not spawn processes
[host:2231] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
now abort,
[host:2231] *** and potentially your MPI job)

--- The program itself:
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include <unistd.h>

MPI_Comm slave_comm; /* intercommunicator between the master and the slaves */
MPI_Comm new_world;  /* merged intracommunicator; the master is rank 0 */
#define MESSAGE_SIZE 40

void slave(void) {
    printf("Slave initialized; ");
    MPI_Comm_get_parent(&slave_comm);
    MPI_Intercomm_merge(slave_comm, 1, &new_world);

    int slave_rank;
    MPI_Comm_rank(new_world, &slave_rank);

    char message[MESSAGE_SIZE];
    MPI_Bcast(message, MESSAGE_SIZE, MPI_CHAR, 0, new_world);

    printf("Slave %d received message from master: %s\n", slave_rank, message);
}

void master(int slave_count, char* executable, char* delay) {
    char* slave_argv[] = { delay, NULL };
    MPI_Comm_spawn(executable,
                   slave_argv,
                   slave_count,
                   MPI_INFO_NULL,
                   0,
                   MPI_COMM_SELF,
                   &slave_comm,
                   MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(slave_comm, 0, &new_world);
    /* Use a full-size buffer: MPI_Bcast sends MESSAGE_SIZE chars,
       which is more than the string literal alone holds. */
    char helloWorld[MESSAGE_SIZE] = "Hello New World!";
    MPI_Bcast(helloWorld, MESSAGE_SIZE, MPI_CHAR, 0, new_world);
    printf("Processes spawned!\n");
}

int main(int argc, char* argv[]) {
    if (argc > 2) {
        /* master: hello <slave_count> <delay> */
        MPI_Init(&argc, &argv);
        master(atoi(argv[1]), argv[0], argv[2]);
    } else {
        /* spawned slave: hello <delay> */
        sleep(atoi(argv[1])); /* delay before MPI_Init */
        MPI_Init(&argc, &argv);
        slave();
    }
    MPI_Comm_free(&new_world);
    MPI_Comm_free(&slave_comm);
    MPI_Finalize();
    return 0;
}
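
The program above passes its command-line arguments straight to atoi, which returns 0 for non-numeric input and cannot report an error. A minimal sketch of stricter parsing for the slave count and the delay might look like the following; parse_nonneg_int is a hypothetical helper name, not part of the original program or of MPI.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse `text` as a non-negative decimal int.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
static int parse_nonneg_int(const char *text, int *out)
{
    assert(out != NULL);
    if (text == NULL || *text == '\0')
        return -1;

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);

    /* Reject overflow, trailing garbage, negatives, and values too big for int. */
    if (errno != 0 || *end != '\0' || value < 0 || value > INT_MAX)
        return -1;

    *out = (int)value;
    return 0;
}
```

In main, the master branch could call parse_nonneg_int(argv[1], &slave_count) and print a usage message before MPI_Init if it returns -1; the slave branch could validate its delay argument the same way.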


Thank you,

Andrew Elistratov
r***@open-mpi.org
2017-01-31 15:56:50 UTC
What version of OMPI are you using?
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
e***@info.sgu.ru
2017-02-01 09:00:46 UTC
I am using Open MPI version 2.0.1.
r***@open-mpi.org
2017-02-04 03:14:02 UTC
We know v2.0.1 has problems with comm_spawn, and so you may be encountering one of those. Regardless, there is indeed a timeout mechanism in there. It was added because people would execute a comm_spawn, and then would hang and eat up their entire allocation time for nothing.

In v2.0.2, I see it is still hardwired at 60 seconds. I believe we eventually realized we needed to make that a variable, but it didn’t get into the 2.0.2 release.
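
To illustrate the general idea of a bounded wait like the one described above, here is a minimal sketch in plain C. It is not Open MPI's implementation: wait_with_timeout, ready_fn, and ready_after_n_polls are hypothetical names, and the 1 ms polling interval is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback type: returns true once the awaited event has happened. */
typedef bool (*ready_fn)(void *ctx);

/* Seconds elapsed on the monotonic clock since `start`. */
static double seconds_since(const struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (double)(now.tv_sec - start->tv_sec)
         + (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/*
 * Poll `is_ready` until it returns true or `timeout_sec` seconds elapse.
 * Returns 0 if the event happened in time and -1 on timeout.
 */
static int wait_with_timeout(ready_fn is_ready, void *ctx, double timeout_sec)
{
    assert(is_ready != NULL);

    struct timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const struct timespec pause = { 0, 1000000 }; /* 1 ms between polls */
    while (!is_ready(ctx)) {
        if (seconds_since(&start) >= timeout_sec)
            return -1; /* give up instead of hanging forever */
        nanosleep(&pause, NULL);
    }
    return 0;
}

/* Example predicate: reports ready after it has been polled *ctx times. */
static bool ready_after_n_polls(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        --*remaining;
        return false;
    }
    return true;
}
```

A caller would pass a predicate that checks whether the awaited event has occurred, for example wait_with_timeout(ready_after_n_polls, &polls, 60.0), and treat a return value of -1 as a failed wait. Making the limit a parameter rather than a constant is what allows it to be tuned per job.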
Gilles Gouaillardet
2017-02-04 10:21:12 UTC
Andrew,

The 2-second timeout is very likely a bug that has since been fixed, so I
strongly suggest you try the latest release, 2.0.2, which came out earlier
this week.

Ralph is referring to another timeout, which is hard-coded to 600 seconds in
master but is still 60 seconds in the v2.0.x branch (FWIW, the MPI standard
says nothing about timeouts, so we hard-coded one to prevent jobs from
hanging forever).
IIRC, the hard-coded timeout is in MPI_Comm_{accept,connect}, and I do not
know whether it is involved in MPI_Comm_spawn.

Cheers,

Gilles