Joseph Schuchart
2018-02-09 14:52:31 UTC
All,
I am trying to debug my MPI application using good ol' printf and I am
running into an issue with Open MPI's output redirection (using
--output-filename).
The system I'm running on is an IB cluster with the home directory
mounted through NFS.
1) Sometimes I get the following error message and the application hangs:
```
$ mpirun -n 2 -N 1 --output-filename output.log ls
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c
at line 314
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c
at line 184
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c
at line 237
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c
at line 1147
```
So far I have only seen this error when running straight out of my home
directory, not when running from a subdirectory.
In case this error does not appear all log files are written correctly.
2) If I call mpirun from within a subdirectory I am only seeing output
files from processes running on the same node as rank 0. I have not seen
above error messages in this case.
Example:
```
# two procs, one per node
~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0
# two procs, single node
~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0 rank.1
```
Using Open MPI 2.1.1, I can observe a similar effect:
```
# two procs, one per node
~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
~/test $ ls
output.log.1.0
# two procs, single node
~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
~/test $ ls
output.log.1.0 output.log.1.1
```
Any idea why this happens and/or how to debug this?
In case this helps, the NFS mount flags are:
(rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=<addr>,mountvers=3,mountport=<port>,mountproto=udp,local_lock=none,addr=<addr>)
I also tested above commands with MPICH, which gives me the expected
output for all processes on all nodes.
Any help would be much appreciated!
Cheers,
Joseph
I am trying to debug my MPI application using good ol' printf and I am
running into an issue with Open MPI's output redirection (using
--output-filename).
The system I'm running on is an IB cluster with the home directory
mounted through NFS.
1) Sometimes I get the following error message and the application hangs:
```
$ mpirun -n 2 -N 1 --output-filename output.log ls
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c
at line 314
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c
at line 184
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c
at line 237
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c
at line 1147
```
So far I have only seen this error when running straight out of my home
directory, not when running from a subdirectory.
In case this error does not appear all log files are written correctly.
2) If I call mpirun from within a subdirectory I am only seeing output
files from processes running on the same node as rank 0. I have not seen
above error messages in this case.
Example:
```
# two procs, one per node
~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0
# two procs, single node
~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0 rank.1
```
Using Open MPI 2.1.1, I can observe a similar effect:
```
# two procs, one per node
~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
~/test $ ls
output.log.1.0
# two procs, single node
~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
~/test $ ls
output.log.1.0 output.log.1.1
```
Any idea why this happens and/or how to debug this?
In case this helps, the NFS mount flags are:
(rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=<addr>,mountvers=3,mountport=<port>,mountproto=udp,local_lock=none,addr=<addr>)
I also tested above commands with MPICH, which gives me the expected
output for all processes on all nodes.
Any help would be much appreciated!
Cheers,
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: ***@hlrs.de
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: ***@hlrs.de