George Reeke
2016-10-04 20:08:23 UTC
Dear colleagues,
I have a parallel MPI application written in C that works normally in
a serial version and in the parallel version in the sense that all
numerical output is correct. When it tries to shut down, it gives the
following console error message:
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:
Process name: [[51524,1],0]
Exit code: 13
The "Process name" shown is not the PID of any Linux process, and the
"Exit code" varies from run to run, roughly in the range 12 to 17.
The core dumps produced do not have usable backtrace information.
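To make sure I am reading the report correctly: as I understand it, the
"Exit code" that mpirun prints is the child's exit status as a parent
would see it via waitpid()/WEXITSTATUS, not a PID or a signal number.
Here is a minimal (non-MPI) sketch of what I mean, with a child that
exits 13 the way one of my processes apparently does:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(13);              /* child: exit with a nonzero status */
        int status;
        waitpid(pid, &status, 0);   /* parent: collect the child's status */
        if (WIFEXITED(status))
            printf("child exited with code %d\n", WEXITSTATUS(status));
        else if (WIFSIGNALED(status))
            printf("child killed by signal %d\n", WTERMSIG(status));
        return 0;
        /* prints: child exited with code 13 */
    }

So, if that reading is right, something in one of my processes must be
calling exit() (or returning from main) with a small nonzero value after
all my own code has finished.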
I cannot determine the cause of the problem. Let me be as explicit
as possible:
OS RHEL 6.8, compiler gcc with -g, no optimization
Version of MPI (RedHat package): openmpi-1.10-1.10.2-2.el6.x86_64
The startup command is like this:
mpirun --output-filename junk -n 1 cnsP0 NOSP : -n 3 cnsPn < v8tin/dan
cnsP0 is a master code that reads a control file (specified after the
'<' on the command line). The other executables (cnsPn) only send and
receive messages (and do math), no file IO. I only tried with 4 nodes
so far.
Early in startup, another process is started via MPI_Comm_spawn.
I suspect this is relevant to the problem, although simple test
programs using the same setup complete normally. This process,
andmsg, receives status or debug information asynchronously via
messages from the other processes and writes them to stderr.
I have tried many versions of the shutdown code, all with the same
result. Here is one version (debug writes deleted, comments modified):
Notes: "is_host(NC.node)" returns 1 if this is the rank 0 node.
NC.dmsgid is the node id of the andmsg process, which uses the
intercommunicator NC.commd. andmsg counts the number of ival
messages received and tries to shut down when it has one from
each of the original 4 (or however many) nodes.
Application code:
    /* Everything works OK up to here. */
    rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid,
                  SHUTDOWN_ANDMSG, NC.commd); /* andmsg counts these */
    /* This message confirms that andmsg got 4 SHUTDOWN messages */
    if (is_host(NC.node)) {
        MPI_Recv(&ival, 1, MPI_INT, NC.dmsgid,
                 CLOSING_ANDMSG, NC.commd, MPI_STATUS_IGNORE);
    }
    /* Results similar with or without this barrier */
    rc = MPI_Barrier(NC.commc); /* NC.commc is original world comm */
    /* Behavior is the same with or without this extra message exchange */
    if (is_host(NC.node)) {
        ival = SHUTDOWN_ANDMSG;
        rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid,
                      SHUTDOWN_ANDMSG, NC.commd);
    }
    /* Behavior is the same with or without this disconnect */
    rc = MPI_Comm_disconnect(&NC.commd);
    rc = MPI_Finalize();
    exit(0);
Spawned process code extract:
    if (num2stop <= 0) { /* Countdown of shutdown messages received */
        int rc;
        rc = MPI_Send(&num2stop, 1, MPI_INT, NC.hostid,
                      CLOSING_ANDMSG, NC.commd);
        /* Receive extra synch message commented above */
        rc = MPI_Recv(&sdmsg, 1, MPI_INT, NC.hostid, MPI_ANY_TAG,
                      NC.commd, MPI_STATUS_IGNORE);
        sleep(1); /* Results same with or without this sleep */
        /* Results same with or without this disconnect [see above] */
        rc = MPI_Comm_disconnect(&NC.commd);
        rc = MPI_Finalize();
        exit(0);
    }
I would much appreciate any suggestions on how to debug this.
From the suggestions at the community help web page, here is more
information:
config.log file, bzipped version, is attached.
ompi_info --all output is attached.
I am not sending information from other nodes or the network
configuration; for test purposes, all processes are running on one node,
my laptop with an i7 processor.
I did not set any MCA environment parameters.
Thanks,
George Reeke