Discussion:
[OMPI users] Installation of openmpi-1.10.7 fails
Gilles Gouaillardet
2018-01-23 13:33:02 UTC
Permalink
Vahid,

There used to be a bug in the IOF part, but I am pretty sure this has already been fixed.

Does the issue also occur with GNU compilers ?
There used to be an issue with the Intel Fortran runtime (short reads/writes were silently ignored), and that was also fixed some time ago.

Cheers,

Gilles
This would work for Quantum Espresso input. I am waiting to see what happens to EPW. I don’t think EPW accepts the -i argument. I will report back once the EPW job is done.
Cheers,
Vahid
 
Well, my final comment on this topic: as somebody suggested earlier in this email chain, if you provide the input with the -i argument instead of piping it from standard input, things seem to work as far as I can see (disclaimer: I do not know what the final outcome should be; I just see that the application does not complain about the 'end of file while reading crystal k points'). So maybe that is the simplest solution.
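(For concreteness, and assuming pw.x accepts the -i flag for its input file as suggested above, the invocation would look roughly like
mpirun -np 64 ../bin/pw.x -npool 64 -i nscf.in > nscf.out
i.e. the same command as before, with the redirection from nscf.in replaced by the -i argument.)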
Thanks
Edgar
After some further investigation, I am fairly confident that this is not an MPI I/O problem.
The input file input_tmp.in is generated by this sequence of instructions (in Modules/open_close_input_file.f90):
---
  IF ( TRIM(input_file_) /= ' ' ) THEN
     !
     ! copy file to be opened into input_file
     !
     input_file = input_file_
     !
  ELSE
     !
     ! if no file specified then copy from standard input
     !
     input_file="input_tmp.in"
     OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
          STATUS='unknown', IOSTAT = ierr )
     IF ( ierr > 0 ) GO TO 30
     !
     dummy=' '
     WRITE(stdout, '(5x,a)') "Waiting for input..."
     DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
        READ (stdin,fmt='(A512)',END=20) dummy
        WRITE (stdtmp,'(A)') trim(dummy)
     END DO
     !
20   CLOSE ( UNIT=stdtmp, STATUS='keep' )
----
Basically, if no input file has been provided, the input file is generated by reading from standard input. Since the application is being launched e.g. with
mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out
the data comes from nscf.in. I simply do not know enough about I/O forwarding to be able to tell why we do not see the entire file, but one interesting detail is that if I run it in the debugger, input_tmp.in is created correctly. However, if I run it using mpirun as shown above, the file is cropped, which leads to the error message mentioned in this email chain.
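As a side note, the stdin-forwarding path can be exercised without Quantum Espresso at all. Below is a minimal sketch of a standalone reproducer (the program name, unit number and file name are hypothetical, not taken from QE): rank 0 simply copies whatever mpirun forwards on standard input into a file, which can then be diffed against nscf.in to check for truncation.
---
PROGRAM stdin_echo
  ! Minimal sketch: copy everything arriving on rank 0's forwarded
  ! standard input into a plain file, so it can be compared to nscf.in.
  USE mpi
  IMPLICIT NONE
  CHARACTER(LEN=512) :: dummy
  INTEGER :: ierr, rank, nlines

  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  IF ( rank == 0 ) THEN
     nlines = 0
     OPEN(UNIT=10, FILE='stdin_copy.txt', FORM='formatted', &
          STATUS='unknown', IOSTAT=ierr)
     DO
        READ (*, FMT='(A512)', END=20) dummy   ! one line from forwarded stdin
        WRITE (10, '(A)') TRIM(dummy)          ! echo it to the copy file
        nlines = nlines + 1
     END DO
20   CLOSE (UNIT=10, STATUS='keep')
     PRINT *, 'rank 0 copied', nlines, 'lines from standard input'
  END IF

  CALL MPI_Finalize(ierr)
END PROGRAM stdin_echo
---
Built with mpifort and run as, e.g., mpirun -np 64 ./stdin_echo < nscf.in, a subsequent diff of stdin_copy.txt against nscf.in would show whether the forwarded input arrives complete.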
Anyway, I would probably need some help here from somebody who knows the runtime better than me on what could go wrong at this point.
Thanks
Edgar
Concerning the following error
     from pw_readschemafile : error #         1
     xml data file not found
The nscf run uses files generated by the scf.in run. So I first run scf.in and when it finishes, I run nscf.in. If you have done this and still get the above error, then this could be another bug. It does not happen for me with intel14/openmpi-1.8.8.
Thanks for the update,
Vahid
 1. I can in fact reproduce your bug on my systems.
 2. I can confirm that the problem occurs both with romio314 and ompio. I *think* the issue is that the input_tmp.in file is incomplete. In both cases (ompio and romio), the end of the file looks like this:
  0.66666667  0.50000000  0.83333333  5.787037e-04
  0.66666667  0.50000000  0.91666667  5.787037e-04
  0.66666667  0.58333333  0.00000000  5.787037e-04
  0.66666667  0.58333333  0.08333333  5.787037e-04
  0.66666667  0.58333333  0.16666667  5.787037e-04
  0.66666667  0.58333333  0.25000000  5.787037e-04
  0.66666667  0.58333333  0.33333333  5.787037e-04
  0.66666667  0.58333333  0.41666667  5.787037e-04
  0.66666667  0.58333333  0.50000000  5.787037e-04
  0.66666667  0.58333333  0.58333333  5
which is what I *think* causes the problem.
 3. I tried to find where input_tmp.in is generated, but haven't completely identified the location. However, I could not find MPI_File_write(_all) operations anywhere in the code, although there are some MPI_File_read(_all) operations.
 4. I can confirm that the behavior with Open MPI 1.8.x is different: input_tmp.in looks more complete (at least it doesn't end in the middle of a line). The simulation still does not finish for me, but the error reported is slightly different; I might just be missing a file or something:
     from pw_readschemafile : error #         1
     xml data file not found
Since I think input_tmp.in is generated from data that is provided in nscf.in, it might very well be something in the MPI_File_read(_all) operation that causes the issue, but since both ompio and romio are affected, there is a good chance that something outside the control of the io components is causing the trouble (maybe a datatype issue that has changed from the 1.8.x series to 3.0.x).
 5. Last but not least, I also wanted to mention that I ran all parallel tests that I found in the test suite (run-tests-cp-parallel, run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel), and they all passed with ompio (and with romio314, although I only ran a subset of the tests with romio314).
Thanks
Edgar
Hi Edgar,
Just to let you know that the nscf run with --mca io ompio crashed like the other two runs.
Thank you,
Vahid
OK, thank you for the information. Two short questions and requests. I have qe-6.2.1 compiled and running on my system (although with gcc-6.4 instead of the Intel compiler), and I am currently running the parallel test suite. So far, all the tests have passed, although it is still running.
My first question: would it be possible for you to give me access to exactly the same data set that you are using? You could upload it to a webpage or similar and just send me the link.
The second question/request: could you rerun your tests one more time, this time forcing the use of ompio, e.g. --mca io ompio?
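(For reference, and reusing the exact paths from the commands quoted further down in this thread, the ompio-forced rerun would look like
~/bin/openmpi-v3.0/bin/mpiexec --mca io ompio -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out
i.e. identical to the romio314 run except for the io component that is selected.)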
Thanks
Edgar
To run EPW, the command for running the preliminary nscf run was
~/bin/openmpi-v3.0/bin/mpiexec -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out
Then, forcing the romio314 component as suggested:
~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out
And it crashed like the first time.
It is interesting that the preliminary scf run works fine. The scf run requires Quantum Espresso to generate the k points using
K_POINTS (automatic)
12 12 12 0 0 0
The nscf run which crashes includes an explicit list of 1728 k points:
K_POINTS (crystal)
1728
  0.00000000  0.00000000  0.00000000  5.787037e-04 
  0.00000000  0.00000000  0.08333333  5.787037e-04 
  0.00000000  0.00000000  0.16666667  5.787037e-04 
  0.00000000  0.00000000  0.25000000  5.787037e-04 
  0.00000000  0.00000000  0.33333333  5.787037e-04 
  0.00000000  0.00000000  0.41666667  5.787037e-04 
  0.00000000  0.00000000  0.50000000  5.787037e-04 
  0.00000000  0.00000000  0.58333333  5.787037e-04 
  0.00000000  0.00000000  0.66666667  5.787037e-04 
  0.00000000  0.00000000  0.75000000  5.787037e-04 


  ...
To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran compiler module, configured with only the "--prefix=" option, and then ran "make all install". I did not enable or disable any other options.
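(A rough sketch of that build sequence, where the module name and the install prefix are guesses based on paths mentioned elsewhere in this thread:
module load intel
./configure --prefix=$HOME/bin/openmpi-v3.0
make all install
With no other configure options, Open MPI falls back to its defaults, including the default choice of MPI-IO component discussed in this thread.)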
Cheers,
Vahid
Thanks, that is interesting. Since /scratch is a Lustre file system, Open MPI should actually utilize romio314 for that anyway, not ompio. What I have seen happen on at least one occasion, however, is that ompio was still used because (I suspect) romio314 didn't correctly pick up the configuration options. It is a little bit of a mess from that perspective that we have to pass the romio arguments with different flags/options than for ompio, e.g.
--with-lustre=/path/to/lustre/ --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
ompio should pick up the Lustre options correctly if the Lustre headers/libraries are found at the default location, even if the user did not pass the --with-lustre option. I am not entirely sure what happens in romio if the user did not pass --with-file-system=ufs+nfs+lustre but the Lustre headers/libraries are found at the default location, i.e. whether the Lustre ADIO component is still compiled or not.
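Putting those pieces together, a configure line that passes the Lustre hints to both ompio and romio314 might look like the following (the /path/to/lustre placeholder is from the example above and the prefix is only illustrative):
./configure --prefix=$HOME/bin/openmpi-v3.0 \
    --with-lustre=/path/to/lustre \
    --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"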
Anyway, let's wait for the outcome of your run enforcing the use of the romio314 component, and I will still try to reproduce your problem on my system.
Thanks
Edgar
My openmpi3.0.x run (called nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (called scf run) was done with openmpi3.0.x on a similar input file also edited by hand.
Gotcha. Well, that's a little disappointing. It would be good to understand why it is crashing -- is the app doing something that is accidentally not standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...?
Vahid Askarpour
2018-01-23 13:40:54 UTC
Permalink
Gilles,
I have not tried compiling the latest openmpi with GCC. I am waiting to see how the intel version turns out before attempting GCC.
Cheers,
Vahid
Edgar Gabriel
2018-01-23 14:12:18 UTC
Permalink
I ran all my tests with gcc 6.4

Thanks

Edgar
Post by Vahid Askarpour
Gilles,
I have not tried compiling the latest openmpi with GCC. I am waiting
to see how the intel version turns out before attempting GCC.
Cheers,
Vahid
Post by Vahid Askarpour
On Jan 23, 2018, at 9:33 AM, Gilles Gouaillardet
Vahid,
There used to be a bug in the IOF part, but I am pretty sure this has already been fixed.
Does the issue also occur with GNU compilers ?
There used to be an issue with the Intel Fortran runtime (short
reads/writes were silently ignored), and that was also fixed some time ago.
Cheers,
Gilles
This would work for Quantum Espresso input. I am waiting to see what
happens to EPW. I don’t think EPW accepts the -i argument. I will
report back once the EPW job is done.
Cheers,
Vahid
Well, my final comment on this topic: as somebody suggested earlier
in this email chain, if you provide the input with the -i argument
instead of piping it from standard input, things seem to work as far
as I can see (disclaimer: I do not know what the final outcome should
be; I just see that the application does not complain about the 'end
of file while reading crystal k points'). So maybe that is the
simplest solution.
Thanks
Edgar
After some further investigation, I am fairly confident that this
is not an MPI I/O problem.
The input file input_tmp.in is generated in this sequence of
instructions (which is in Modules/open_close_input_file.f90)
---
  IF ( TRIM(input_file_) /= ' ' ) THEN
     !
     ! copy file to be opened into input_file
     !
     input_file = input_file_
     !
  ELSE
     !
     ! if no file specified then copy from standard input
     !
     input_file="input_tmp.in"
     OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
          STATUS='unknown', IOSTAT = ierr )
     IF ( ierr > 0 ) GO TO 30
     !
     dummy=' '
     WRITE(stdout, '(5x,a)') "Waiting for input..."
     DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
        READ (stdin,fmt='(A512)',END=20) dummy
        WRITE (stdtmp,'(A)') trim(dummy)
     END DO
     !
20   CLOSE ( UNIT=stdtmp, STATUS='keep' )
----
Basically, if no input file has been provided, the input file is
generated by reading from standard input. Since the application is
being launched e.g. with
mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out
the data comes from nscf.in. I simply do not know enough about I/O
forwarding to be able to tell why we do not see the entire file,
but one interesting detail is that if I run it in the debugger, the
input_tmp.in is created correctly. However, if I run it using
mpirun as shown above, the file is cropped incorrectly, which leads
to the error message mentioned in this email chain.
Anyway, I would probably need some help here from somebody who
knows the runtime better than me on what could go wrong at this point.
Thanks
Edgar
Concerning the following error
     from pw_readschemafile : error #         1
     xml data file not found
The nscf run uses files generated by the scf.in run. So I first
run scf.in and when it finishes, I run nscf.in. If you have done
this and still get the above error, then this could be another
bug. It does not happen for me with intel14/openmpi-1.8.8.
Thanks for the update,
Vahid
Post by Vahid Askarpour
On Jan 19, 2018, at 3:08 PM, Edgar Gabriel
 1. I can in fact reproduce your bug on my systems.
 2. I can confirm that the problem occurs both with romio314 and
ompio. I *think* the issue is that the input_tmp.in file is
incomplete. In both cases (ompio and romio), the end of the file looks like this:
  0.66666667  0.50000000 0.83333333  5.787037e-04
  0.66666667  0.50000000 0.91666667  5.787037e-04
  0.66666667  0.58333333 0.00000000  5.787037e-04
  0.66666667  0.58333333 0.08333333  5.787037e-04
  0.66666667  0.58333333 0.16666667  5.787037e-04
  0.66666667  0.58333333 0.25000000  5.787037e-04
  0.66666667  0.58333333 0.33333333  5.787037e-04
  0.66666667  0.58333333 0.41666667  5.787037e-04
  0.66666667  0.58333333 0.50000000  5.787037e-04
  0.66666667  0.58333333 0.58333333  5
which is what I *think* causes the problem.
 3. I tried to find where input_tmp.in is generated, but haven't
completely identified the location. However, I could not find MPI
file_write(_all) operations anywhere in the code, although there
are some MPI_file_read(_all) operations.
 4. I can confirm that the behavior with Open MPI 1.8.x is
different. input_tmp.in looks more complete (at least it doesn't
end in the middle of a line). The simulation still does not finish
for me, but the error reported is slightly different; I might just
be missing a file or something:
     from pw_readschemafile : error #         1
     xml data file not found
Since I think input_tmp.in is generated from data that is
provided in nscf.in, it might very well be something in the
MPI_File_read(_all) operation that causes the issue, but since
both ompio and romio are affected, there is a good chance that
something outside of the control of io components is causing the
trouble (maybe a datatype issue that has changed from 1.8.x
series to 3.0.x).
 5. Last but not least, I also wanted to mention that I ran all
parallel tests that I found in the testsuite
(run-tests-cp-parallel, run-tests-pw-parallel,
run-tests-ph-parallel, run-tests-epw-parallel ), and they all
passed with ompio (and romio314 although I only ran a subset of
the tests with romio314).
Thanks
Edgar
Hi Edgar,
Just to let you know that the nscf run with --mca io ompio
crashed like the other two runs.
Thank you,
Vahid
Post by Vahid Askarpour
On Jan 19, 2018, at 12:46 PM, Edgar Gabriel
ok, thank you for the information. Two short questions and
requests. I have qe-6.2.1 compiled and running on my system
(although it is with gcc-6.4 instead of the intel compiler),
and I am currently running the parallel test suite. So far, all
the tests passed, although it is still running.
My question is now, would it be possible for you to give me
access to exactly the same data set that you are using? You
could upload to a webpage or similar and just send me the link.
The second question/request, could you rerun your tests one
more time, this time forcing the use of ompio, e.g. --mca io ompio?
Thanks
Edgar
Post by Vahid Askarpour
To run EPW, the command for running the preliminary nscf run was
~/bin/openmpi-v3.0/bin/mpiexec -np 64
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 <
nscf.in > nscf.out
Then, forcing the romio314 component as suggested:
~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 <
nscf.in > nscf.out
And it crashed like the first time.
It is interesting that the preliminary scf run works fine. The
scf run requires Quantum Espresso to generate the k points using
K_POINTS (automatic)
12 12 12 0 0 0
The nscf run which crashes includes an explicit list of 1728 k points:
K_POINTS (crystal)
1728
  0.00000000  0.00000000  0.00000000  5.787037e-04
  0.00000000  0.00000000  0.08333333  5.787037e-04
  0.00000000  0.00000000  0.16666667  5.787037e-04
  0.00000000  0.00000000  0.25000000  5.787037e-04
  0.00000000  0.00000000  0.33333333  5.787037e-04
  0.00000000  0.00000000  0.41666667  5.787037e-04
  0.00000000  0.00000000  0.50000000  5.787037e-04
  0.00000000  0.00000000  0.58333333  5.787037e-04
  0.00000000  0.00000000  0.66666667  5.787037e-04
  0.00000000  0.00000000  0.75000000  5.787037e-04


  ...
To build openmpi (either 1.10.7 or 3.0.x), I loaded the
Fortran compiler module, configured with only the "--prefix=" option,
and then ran "make all install". I did not enable or disable any
other options.
Cheers,
Vahid
Post by Vahid Askarpour
On Jan 19, 2018, at 10:23 AM, Edgar Gabriel
thanks, that is interesting. Since /scratch is a lustre file
system, Open MPI should actually utilize romio314 for that
anyway, not ompio. What I have seen happen on at least one
occasion, however, is that ompio was still used because (I suspect)
romio314 didn't correctly pick up the configuration options. It is
a little bit of a mess from that perspective that we have to pass
the romio arguments with different flags/options than for ompio, e.g.
--with-lustre=/path/to/lustre/
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre
--with-lustre=/path/to/lustre"
ompio should pick up the lustre options correctly if lustre
headers/libraries are found at the default location, even if
the user did not pass the --with-lustre option. I am not
entirely sure what happens in romio if the user did not pass
the --with-file-system=ufs+nfs+lustre but the lustre
headers/libraries are found at the default location, i.e.
whether the lustre adio component is still compiled or not.
Anyway, let's wait for the outcome of your run enforcing the use of
the romio314 component, and I will still try to reproduce
your problem on my system.
Thanks
Edgar
Post by Vahid Askarpour
Gilles,
I have submitted that job with --mca io romio314. If it finishes, I will let you know. It is sitting in Conte’s queue at Purdue.
Filesystem                                       Type    Size  Used Avail Use% Mounted on
/dev/sda1                                        ext4    435G   16G  398G   4% /
tmpfs                                            tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home  nfs      80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps  nfs     8.0T  4.0T  4.1T  49% /apps
                                                 lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot              nfs     4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin                nfs     200G  130G   71G  65% /usr/rmt_share/fsadmin
The code is compiled in my $HOME and is run on the scratch.
Cheers,
Vahid
Post by Gilles Gouaillardet
Vahid,
In the v1.10 series, the default MPI-IO component was ROMIO based, and
in the v3 series, it is now ompio.
You can force the latest Open MPI to use the ROMIO based component with
mpirun --mca io romio314 ...
That being said, your description (e.g. a hand-edited file) suggests
that I/O is not performed with MPI-IO,
which makes me very puzzled as to why the latest Open MPI is crashing.
Cheers,
Gilles
Post by Vahid Askarpour
I will try to reproduce this problem with 3.0.x, but it might take me a
couple of days to get to it.
Since it seemed to have worked with 2.0.x (except for the running out of file
handles problem), there is the suspicion that one of the fixes that we
introduced since then is the problem.
What file system did you run it on? NFS?
Thanks
Edgar
My openmpi3.0.x run (called nscf run) was reading data from a routine
Quantum Espresso input file edited by hand. The preliminary run (called scf
run) was done with openmpi3.0.x on a similar input file also edited by hand.
Gotcha.
Well, that's a little disappointing.
It would be good to understand why it is crashing -- is the app doing
something that is accidentally not standard? Is there a bug in (soon to be
released) Open MPI 3.0.1? ...?
--
Edgar Gabriel
Associate Professor
Department of Computer Science

Associate Director
Center for Advanced Computing and Data Science (CACDS)

University of Houston
Philip G. Hoffman Hall, Room 228, Houston, TX 77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
--
Jeff Squyres (jsquyres)
2018-01-23 16:04:26 UTC
Permalink
Post by Gilles Gouaillardet
There used to be a bug in the IOF part, but I am pretty sure this has already been fixed.
Gilles: can you cite what you're talking about?

Edgar was testing on master, so if there was some kind of IOF fix, I would assume that it would already be on master.
--
Jeff Squyres
***@cisco.com
Gilles Gouaillardet
2018-01-24 00:13:10 UTC
Permalink
Jeff,


i guess i was referring to
https://mail-archive.com/***@lists.open-mpi.org/msg29818.html

that was reported and fixed in August 2016, so the current issue is
likely unrelated.


Since you opened a GitHub issue for that, let's all follow up at
https://github.com/open-mpi/ompi/issues/4744


Cheers,


Gilles
Post by Jeff Squyres (jsquyres)
Post by Gilles Gouaillardet
There used to be a bug in the IOF part, but I am pretty sure this has already been fixed.
Gilles: can you cite what you're talking about?
Edgar was testing on master, so if there was some kind of IOF fix, I would assume that it would already be on master.