Discussion:
[OMPI users] stdout/stderr question
emre brookes
2018-09-04 21:08:26 UTC
Background:
---
Running on ubuntu 16.04 with apt install openmpi-bin libopenmpi-dev
$ mpirun --version
mpirun (Open MPI) 1.10.2

I did search thru the docs a bit (ok, maybe I missed something obvious,
my apologies if so)
---
Question:

Is there some setting to turn off the extra messages generated by openmpi ?

e.g.
$ mpirun -np 2 my_job > my_job.stdout
adds this message to my_job.stdout
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
which strangely goes to stdout and not stderr.
I would intuitively expect error or notice messages to go to stderr.
Is there a way to redirect these messages to stderr or some specified file?

I need to separate this from the collected stdout of the job processes
themselves.

Somewhat kludgy options that come to mind:

1. I can use --output-filename outfile, which does separate the
"openmpi" messages,
but this creates a file for each process and I'd rather keep them as
produced in one file,
but without any messages from openmpi, which I'd like to keep separately.

2. Or I could write a script to filter the output and separate. A bit
risky as someone could conceivably put something that looks like an
openmpi message pattern in the mpi executable output.

3. hack the source code of openmpi.

Any suggestions as to a more elegant or standard way of dealing with this?

TIA,
Emre.
George Reeke
2018-09-04 21:36:13 UTC
Post by emre brookes
Is there some setting to turn off the extra messages generated by openmpi ?
----------
Post by emre brookes
which strangely goes to stdout and not stderr.
I would intuitively expect error or notice messages to go to stderr.
Is there a way to redirect these messages to stderr or some specified file?
----------
Post by emre brookes
I need to separate this from the collected stdout of the job processes
themselves.
Any suggestions as to a more elegant or standard way of dealing with this?
TIA,
Emre.
I use an environment variable to name a path for my standard output.
My code, when it finds that variable, opens that file and writes
everything to it instead of stdout (I write from the Rank 0 node
only). Then openmpi (or slurm) can write to stdout all it wants.
George Reeke
emre brookes
2018-09-05 01:04:04 UTC
Yes, that is another option, but is a bit difficult in my case,
as I have a framework for job submission and the job executable will come
from other developers and I don't want to put this restriction on them,
so I really need to capture the original standard out.

But perhaps I could write a wrapper to do this.
On further thought, this may be the way to go.

Thanks,
Emre
emre brookes
2018-09-05 01:16:09 UTC
Yep, works great:
cap_stdout.sh:
---
#!/bin/bash
my_job > my_job.stdout
---
$ mpirun -np 2 cap_stdout.sh

(of course cap_stdout.sh can be fancied up with passing arguments for
executable, its params, stdout filename or other things)

Thanks again.
Emre
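As a rough sketch of the "fancied up" wrapper mentioned above, the script below takes the stdout filename and the executable (with its parameters) as arguments. The argument order, the function name cap_stdout and the use of append mode are assumptions made for illustration, not something from the thread. Append mode is used because each rank runs its own copy of the wrapper, and with a plain `>` every copy would truncate the file and start writing at offset zero. Whether appends from several ranks interleave cleanly depends on the filesystem, so treat this as a starting point only.

```shell
#!/bin/bash
# cap_stdout.sh (hypothetical generalization of the wrapper above)
# Usage: cap_stdout.sh OUTFILE EXECUTABLE [ARGS...]

cap_stdout() {
    local outfile=$1
    shift
    # Append rather than truncate: every rank runs this wrapper, and all
    # ranks are assumed to share the same output file.
    # stderr is left untouched so it still reaches the launcher.
    "$@" >> "$outfile"
}

# Only run when arguments are supplied, so the file can also be sourced.
# The script's exit status is that of the wrapped executable.
if [ "$#" -ge 2 ]; then
    cap_stdout "$@"
fi
```

It could be launched as `mpirun -np 2 ./cap_stdout.sh my_job.stdout ./my_job`, after removing any stale my_job.stdout, since the wrapper only appends.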
Gilles Gouaillardet
2018-09-05 01:37:36 UTC
Open MPI should likely write this message on stderr, I will have a look
at that.


That being said, and though I have no intention to dodge the question,
this case should not happen.

A well written (MPI) program should either exit(0) or have main() return
0, so this scenario (e.g. all MPI tasks call MPI_Finalize() and then at
least one MPI task exits with a non-zero error code) should not happen.


If your program might fail, it should call MPI_Abort() with a non zero
error code *before* calling MPI_Finalize().

Note that this error can also occur if your main() subroutine does not
return any value (e.g. it returns an undefined value, which might be
non-zero).


Cheers,


Gilles
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
emre brookes
2018-09-05 11:11:56 UTC
Thanks Gilles,

My goal is to separate openmpi errors from the stdout of the MPI program
itself so that errors can be identified externally (in particular in an
external framework running MPI jobs from various developers).

My not so "well written MPI program" was doing this:
MPI_Finalize();
exit( errorcode );
Which I assume you are telling me was bad practice & will replace with
MPI_Abort( MPI_COMM_WORLD, errorcode );
MPI_Finalize();
exit( errorcode );
I was previously a bit put off of MPI_Abort due to the vagueness of the
_Description_
This routine makes a "best attempt" to abort all tasks in the group of
comm. This function does not require that the invoking environment
take any action with the error code. However, a UNIX or POSIX
environment should handle this as a return errorcode from the main
program or an abort (errorcode).
& I didn't really have an MPI issue to "Abort", but had used this for a
user input or parameter issue.
Nevertheless, I accept your best practice recommendation.

It was not only the originally reported message; other messages also
went to stdout.
I initially used the Ubuntu 16 LTS "$ apt install openmpi-bin
libopenmpi-dev", which got me version 1.10.2,
but this morning I compiled and tested 2.1.5, with the same behavior, e.g.:

$ /src/ompi-2.1.5/bin/mpicxx test_using_mpi_abort.cpp
$ /src/ompi-2.1.5/bin/mpirun -np 2 a.out > stdout
[domain-name-embargoed:26078] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[domain-name-embargoed:26078] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
$ cat stdout
hello from 0
hello from 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$

Tested 3.1.2, where this has been *somewhat* fixed:

$ /src/ompi-3.1.2/bin/mpicxx test_using_mpi_abort.cpp
$ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[domain-name-embargoed:19784] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[domain-name-embargoed:19784] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
$ cat stdout
hello from 1
hello from 0
$

But the originally reported error still goes to stdout:

$ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
$ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[22380,1],0]
Exit code: 255
--------------------------------------------------------------------------
$ cat stdout
hello from 0
hello from 1
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
$

Summary:
1.10.2 and 2.1.5 both send most openmpi-generated messages to stdout.
3.1.2 still sends at least one type of openmpi-generated message to stdout.
I'll continue with my "wrapper" strategy for now, as it seems safest and
most broadly deployable [e.g. on compute resources where I need to use
admin installed versions of MPI],
but it would be nice for openmpi to ensure all generated messages end up
in stderr.

-Emre
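One way a job-submission framework might combine the wrapper strategy with the observations above is to capture the launcher's own stdout and stderr in separate files and record its exit status, so that failures can be detected without parsing message text. This is only a sketch: run_job, the log directory layout and the file names are hypothetical and do not come from the thread.

```shell
#!/bin/bash
# run_job (hypothetical helper): run a launcher command line, keeping its
# stdout, stderr and exit status in separate files under LOGDIR.
# Usage: run_job LOGDIR COMMAND [ARGS...]

run_job() {
    local logdir=$1
    local status=0
    shift
    mkdir -p "$logdir"
    # If the job's own stdout is captured by a per-rank wrapper, these two
    # files should contain only launcher-generated text, whichever stream
    # a given Open MPI version happens to use for it.
    "$@" > "$logdir/launcher.stdout" 2> "$logdir/launcher.stderr" || status=$?
    # Record the exit status so a caller can check it later.
    echo "$status" > "$logdir/exit_status"
    return "$status"
}
```

A call might look like `run_job logs mpirun -np 2 ./cap_stdout.sh`, with the exit status then read back from logs/exit_status.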
Ralph H Castain
2018-09-10 17:27:37 UTC
I’m not sure why this would be happening. These error outputs go through the “show_help” functionality, and we specifically target it at stderr:

/* create an output stream for us */
OBJ_CONSTRUCT(&lds, opal_output_stream_t);
lds.lds_want_stderr = true;
orte_help_output = opal_output_open(&lds);

Jeff: is it possible the opal_output system is ignoring the request and pushing it to stdout??
Ralph
Gilles Gouaillardet
2018-09-11 01:06:46 UTC
I investigated this a bit and found that the (latest?) v3 branches
have the expected behavior
(i.e. the error message is sent to stderr).


Since it is very unlikely Open MPI 2.1 will ever be updated, I can
simply encourage you to upgrade to a newer Open MPI version.

The latest fully supported versions currently include 3.1.2 and 3.0.2.



Cheers,

Gilles
emre brookes
2018-09-11 02:14:02 UTC
Post by Gilles Gouaillardet
I investigated this a bit and found that the (latest?) v3 branches
have the expected behavior
(i.e. the error message is sent to stderr).
Since it is very unlikely Open MPI 2.1 will ever be updated, I can
simply encourage you to upgrade to a newer Open MPI version.
The latest fully supported versions currently include 3.1.2 and 3.0.2.
Cheers,
Gilles
So you tested 3.1.2 or something newer with this error?
Post by emre brookes
$ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
$ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
Process name: [[22380,1],0]
Exit code: 255
--------------------------------------------------------------------------
$ cat stdout
hello from 0
hello from 1
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
$
-Emre
Post by Gilles Gouaillardet
Post by Ralph H Castain
I’m not sure why this would be happening. These error outputs go
through the “show_help” functionality, and we specifically target it
/* create an output stream for us */
OBJ_CONSTRUCT(&lds, opal_output_stream_t);
lds.lds_want_stderr = true;
orte_help_output = opal_output_open(&lds);
Jeff: is it possible the opal_output system is ignoring the request
and pushing it to stdout??
Ralph
Post by emre brookes
Thanks Gilles,
My goal is to separate openmpi errors from the stdout of the MPI
program itself so that errors can be identified externally (in
particular in an external framework running MPI jobs from various
developers).
MPI_Finalize();
exit( errorcode );
Which I assume you are telling me was bad practice & will replace with
MPI_Abort( MPI_COMM_WORLD, errorcode );
MPI_Finalize();
exit( errorcode );
_Description_
This routine makes a "best attempt" to abort all tasks in the group
of comm. This function does not require that the invoking
environment take any action with the error code. However, a UNIX or
POSIX environment should handle this as a return errorcode from the
main program or an abort (errorcode).
& I didn't really have an MPI issue to "Abort", but had used this
for a user input or parameter issue.
Nevertheless, I accept your best practice recommendation.
It was not only the originally reported message, other messages went to stdout.
Initially used the Ubuntu 16 LTS "$ apt install openmpi-bin
libopenmpi-dev" which got me version (1.10.2),
$ /src/ompi-2.1.5/bin/mpicxx test_using_mpi_abort.cpp
$ /src/ompi-2.1.5/bin/mpirun -np 2 a.out > stdout
[domain-name-embargoed:26078] 1 more process has sent help message
help-mpi-api.txt / mpi-abort
[domain-name-embargoed:26078] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
$ cat stdout
hello from 0
hello from 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$
$ /src/ompi-3.1.2/bin/mpicxx test_using_mpi_abort.cpp
$ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[domain-name-embargoed:19784] 1 more process has sent help message
help-mpi-api.txt / mpi-abort
[domain-name-embargoed:19784] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
$ cat stdout
hello from 1
hello from 0
$
$ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
$ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
Process name: [[22380,1],0]
Exit code: 255
--------------------------------------------------------------------------
$ cat stdout
hello from 0
hello from 1
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
$
1.10.2, 2.1.5 both send most openmpi generated messages to stdout.
3.1.2 sends at least one type of openmpi generated messages to stdout.
I'll continue with my "wrapper" strategy for now, as it seems safest and
most broadly deployable [e.g. on compute resources where I need to
use admin installed versions of MPI],
but it would be nice for openmpi to ensure all generated messages end up in stderr.
-Emre
Open MPI should likely write this message to stderr; I will have a look at that.
That being said, and though I have no intention to dodge the
question, this case should not happen.
A well-written (MPI) program should either exit(0) or have main()
return 0, so this scenario
(i.e. all MPI tasks call MPI_Finalize() and then at least one MPI
task exits with a non-zero error code)
should not happen.
If your program might fail, it should call MPI_Abort() with a
non-zero error code *before* calling MPI_Finalize().
Note that this error can occur if your main() function does not return
any value (i.e. it returns an undefined value, which might be non-zero).
Cheers,
Gilles
Post by emre brookes
---
Running on ubuntu 16.04 with apt install openmpi-bin libopenmpi-dev
$ mpirun --version
mpirun (Open MPI) 1.10.2
I did search thru the docs a bit (ok, maybe I missed something
obvious, my apologies if so)
---
Is there some setting to turn off the extra messages generated by openmpi ?
e.g.
$ mpirun -np 2 my_job > my_job.stdout
adds this message to my_job.stdout
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
which strangely goes to stdout and not stderr.
I would intuitively expect error or notice messages to go to stderr.
Is there a way to redirect these messages to stderr or some specified file?
I need to separate this from the collected stdout of the job processes themselves.
1. I can use --output-filename outfile, which does separate the "openmpi" messages,
but this creates a file for each process and I'd rather keep them
as produced in one file,
but without any messages from openmpi, which I'd like to keep separately.
2. Or I could write a script to filter the output and separate it. A
bit risky, as someone could conceivably put something that looks
like an openmpi message pattern in the MPI executable's output.
3. hack the source code of openmpi.
Any suggestions as to a more elegant or standard way of dealing with this?
TIA,
Emre.
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Gilles Gouaillardet
2018-09-11 03:58:15 UTC
Permalink
It seems I got it wrong :-(


Can you please give the attached patch (default_hnp_abort.diff) a try?


FWIW, another option would be to opal_output(orte_help_output, ...), but
we would have to make orte_help_output "public" first.


Cheers,


Gilles
Post by emre brookes
Post by Gilles Gouaillardet
I investigated this a bit and found that the (latest?) v3 branches
have the expected behavior
(i.e. the error message is sent to stderr).
Since it is very unlikely Open MPI 2.1 will ever be updated, I can
simply encourage you to upgrade to a newer Open MPI version.
The latest fully supported versions are currently 3.1.2 and 3.0.2.
Cheers,
Gilles
So you tested 3.1.2 or something newer with this error?
Post by Gilles Gouaillardet
$ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
$ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[22380,1],0]
  Exit code:    255
--------------------------------------------------------------------------
$ cat stdout
hello from 0
hello from 1
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
$
-Emre
Post by Gilles Gouaillardet
I’m not sure why this would be happening. These error outputs go
through the “show_help” functionality, and we specifically target it
     /* create an output stream for us */
     OBJ_CONSTRUCT(&lds, opal_output_stream_t);
     lds.lds_want_stderr = true;
     orte_help_output = opal_output_open(&lds);
Jeff: is it possible the opal_output system is ignoring the request
and pushing it to stdout??
Ralph
Post by emre brookes
Thanks Gilles,
My goal is to separate openmpi errors from the stdout of the MPI
program itself so that errors can be identified externally (in
particular in an external framework running MPI jobs from various
developers).
   MPI_Finalize();
   exit( errorcode );
Which I assume you are telling me was bad practice & will replace with
   MPI_Abort( MPI_COMM_WORLD, errorcode );
   MPI_Finalize();
   exit( errorcode );
_Description_
This routine makes a "best attempt" to abort all tasks in the
group of comm. This function does not require that the invoking
environment take any action with the error code. However, a UNIX
or POSIX environment should handle this as a return errorcode from
the main program or an abort (errorcode).
& I didn't really have an MPI issue to "Abort", but had used this
for a user input or parameter issue.
Nevertheless, I accept your best practice recommendation.
It was not only the originally reported message; other messages went to stdout too.
I initially used the Ubuntu 16 LTS "$ apt install openmpi-bin
libopenmpi-dev", which got me version 1.10.2,
but this morning compiled and tested 2.1.5, with the same
$ /src/ompi-2.1.5/bin/mpicxx test_using_mpi_abort.cpp
$ /src/ompi-2.1.5/bin/mpirun -np 2 a.out > stdout
[domain-name-embargoed:26078] 1 more process has sent help message
help-mpi-api.txt / mpi-abort
[domain-name-embargoed:26078] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
$ cat stdout
hello from 0
hello from 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
$
Ralph H Castain
2018-09-11 04:07:13 UTC
Permalink
Looks like there is a place in orte/mca/state/state_base_fns.c:850 that also outputs to orte_clean_output instead of using show_help. Outside of those two places, everything else seems to go to show_help.
emre brookes
2018-09-11 13:10:42 UTC
Permalink
Post by Gilles Gouaillardet
It seems I got it wrong :-(
Ah, you've joined the rest of us :)
Post by Gilles Gouaillardet
Can you please give the attached patch a try?
Working with a git clone of 3.1.x, with the patch applied:

$ /src/ompi-3.1.x/bin/mpicxx test.cpp
$ /src/ompi-3.1.x/bin/mpirun a.out > stdout
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

Process name: [[2667,1],2]
Exit code: 255
--------------------------------------------------------------------------
$ cat stdout
hello from 1
hello from 2
hello from 3
hello from 5
hello from 0
hello from 4
$

Works correctly for this error message.
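
A minimal sketch of how this kind of check could be scripted, assuming a POSIX shell. The fake_job function is a hypothetical stand-in for the mpirun command line (it is not part of Open MPI), and the names job.stdout and job.stderr are arbitrary. The idea is only that sending the two streams to separate files makes it easy to see whether any launcher message leaked into the program's stdout.

```sh
#!/bin/sh
# fake_job is a hypothetical stand-in for "mpirun a.out": it prints program
# output on stdout, a launcher-style notice on stderr, and exits non-zero.
fake_job() {
    echo "hello from 0"
    echo "hello from 1"
    echo "mpirun detected that one or more processes exited with non-zero status" >&2
    return 255
}

workdir=$(mktemp -d)

# Send each stream to its own file and record the exit status.
status=0
fake_job > "$workdir/job.stdout" 2> "$workdir/job.stderr" || status=$?

# Count lines in the captured stdout that are not program output.
# grep exits non-zero when it selects no lines, hence the "|| true".
stray=$(grep -v -c '^hello from ' "$workdir/job.stdout" || true)

echo "exit status: $status"
echo "unexpected lines in stdout: $stray"
```

With a real job, the stand-in would be replaced by the mpirun command line shown above, keeping the same two redirections.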

Thanks,
-Emre
Jeff Squyres (jsquyres) via users
2018-09-11 14:23:02 UTC
Permalink
Gilles: Can you submit a PR to fix these 2 places?

Thanks!
-------------------------------------------------------
$
1.10.2 and 2.1.5 both send most openmpi-generated messages to stdout.
3.1.2 sends at least one type of openmpi-generated message to stdout.
I'll continue with my "wrapper" strategy for now, as it seems safest and
most broadly deployable [e.g. on compute resources where I need to use admin installed versions of MPI],
but it would be nice for openmpi to ensure all generated messages end up in stderr.
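One way such a wrapper could avoid guessing at message patterns is to run mpirun with --tag-output and treat every untagged line as coming from openmpi itself. The sketch below is only an illustration under stated assumptions: it assumes tagged lines look like "[1,0]<stdout>:hello from 0" and that messages printed by mpirun itself carry no tag, both of which should be checked against the installed version. The names tag_length, split_tagged_output, split_tagged_text and SplitResult are hypothetical helpers, not part of Open MPI.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Returns the length of a leading "[<digits>,<digits>]<stdout>:" tag,
// or 0 if the line does not start with one.
// ASSUMPTION: this is the layout produced by `mpirun --tag-output`.
std::size_t tag_length(const std::string& line) {
    std::size_t i = 0;
    // Consume one or more decimal digits; false if none were present.
    auto digits = [&]() {
        std::size_t start = i;
        while (i < line.size() &&
               std::isdigit(static_cast<unsigned char>(line[i]))) {
            ++i;
        }
        return i > start;
    };
    // Consume an exact literal; leaves i unchanged on mismatch.
    auto literal = [&](const std::string& s) {
        if (line.compare(i, s.size(), s) != 0) return false;
        i += s.size();
        return true;
    };
    if (literal("[") && digits() && literal(",") && digits() &&
        literal("]") && literal("<stdout>:")) {
        return i;
    }
    return 0;
}

// Tagged lines go to program_out with the tag stripped;
// everything else is treated as a launcher message.
void split_tagged_output(std::istream& in, std::ostream& program_out,
                         std::ostream& launcher_out) {
    std::string line;
    while (std::getline(in, line)) {
        std::size_t n = tag_length(line);
        if (n > 0) {
            program_out << line.substr(n) << '\n';
        } else {
            launcher_out << line << '\n';
        }
    }
}

// Convenience form for in-memory text (hypothetical helper).
struct SplitResult {
    std::string program;   // lines that carried a tag, tag removed
    std::string launcher;  // lines that carried no tag
};

SplitResult split_tagged_text(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream program, launcher;
    split_tagged_output(in, program, launcher);
    return {program.str(), launcher.str()};
}
```

A main() that calls split_tagged_output(std::cin, std::cout, std::cerr) would turn this into a filter for the stdout of "mpirun --tag-output -np 2 a.out". Because every line the program prints gets a tag, a program line that merely looks like an openmpi message is still classified as program output. Output that is not line-buffered may interleave and defeat the tagging, so this is not a guarantee.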
-Emre
Open MPI should likely write this message on stderr, I will have a look at that.
That being said, and though I have no intention to dodge the question, this case should not happen.
A well-written (MPI) program should either exit(0) or have main() return 0, so this scenario
(i.e. all MPI tasks call MPI_Finalize() and then at least one MPI task exits with a non-zero error code)
should not happen.
If your program might fail, it should call MPI_Abort() with a non-zero error code *before* calling MPI_Finalize().
Note that this error can occur if your main() subroutine does not return any value (i.e. it returns an undefined value, which might be non-zero).
Cheers,
Gilles
Post by emre brookes
---
Running on ubuntu 16.04 with apt install openmpi-bin libopenmpi-dev
$ mpirun --version
mpirun (Open MPI) 1.10.2
I did search thru the docs a bit (ok, maybe I missed something obvious, my apologies if so)
---
Is there some setting to turn off the extra messages generated by openmpi ?
e.g.
$ mpirun -np 2 my_job > my_job.stdout
adds this message to my_job.stdout
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
which strangely goes to stdout and not stderr.
I would intuitively expect error or notice messages to go to stderr.
Is there a way to redirect these messages to stderr or some specified file?
I need to separate this from the collected stdout of the job processes themselves.
1. I can use --output-filename outfile, which does separate the "openmpi" messages,
but this creates a file for each process and I'd rather keep them as produced in one file,
but without any messages from openmpi, which I'd like to keep separately.
2. Or I could write a script to filter the output and separate. A bit risky as someone could conceivably put something that looks like a openmpi message pattern in the mpi executable output.
3. hack the source code of openmpi.
Any suggestions as to a more elegant or standard way of dealing with this?
TIA,
Emre.
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
--
Jeff Squyres
***@cisco.com
