Discussion: [OMPI users] Installation of openmpi-1.10.7 fails
Vahid Askarpour
2018-01-05 18:34:22 UTC
I am attempting to install openmpi-1.10.7 on CentOS Linux (7.4.1708) using GCC-6.4.0.

When compiling, I get the following error:

make[2]: Leaving directory '/home/vaskarpo/bin/openmpi-1.10.7/ompi/mca/pml/ob1'
Making all in mca/pml/ucx
make[2]: Entering directory '/home/vaskarpo/bin/openmpi-1.10.7/ompi/mca/pml/ucx'
CC pml_ucx.lo
CC pml_ucx_request.lo
CC pml_ucx_datatype.lo
CC pml_ucx_component.lo
CCLD mca_pml_ucx.la
libtool: error: require no space between '-L' and '-lrt'
make[2]: *** [Makefile:1725: mca_pml_ucx.la] Error 1
make[2]: Leaving directory '/home/vaskarpo/bin/openmpi-1.10.7/ompi/mca/pml/ucx'
make[1]: *** [Makefile:3261: all-recursive] Error 1
make[1]: Leaving directory '/home/vaskarpo/bin/openmpi-1.10.7/ompi'
make: *** [Makefile:1777: all-recursive] Error 1

Thank you,

Vahid
Vahid Askarpour
2018-01-05 19:29:07 UTC
I am also attaching the config.log file.

Thanks,

Vahid
Jeff Squyres (jsquyres)
2018-01-05 21:06:06 UTC
I forget what the underlying issue was, but this issue just came up and was recently fixed:

https://github.com/open-mpi/ompi/issues/4345

However, the v1.10 series is fairly ancient -- the fix was not applied to that series. The fix was applied to the v2.1.x series, and a snapshot tarball containing the fix is available here (generally just take the latest tarball):

https://www.open-mpi.org/nightly/v2.x/

The fix is still pending for the v3.0.x and v3.1.x series (i.e., there are pending pull requests that haven't been merged yet, so the nightly snapshots for the v3.0.x and v3.1.x branches do not yet contain this fix).
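For example, the build steps are the usual ones (the filename below is a placeholder -- substitute whatever the latest tarball listed on that page is actually named):

wget https://www.open-mpi.org/nightly/v2.x/openmpi-v2.x-YYYYMMDDHHMM-hhhhhhh.tar.bz2
tar xjf openmpi-v2.x-YYYYMMDDHHMM-hhhhhhh.tar.bz2
cd openmpi-v2.x-YYYYMMDDHHMM-hhhhhhh
./configure --prefix=$HOME/openmpi-v2.x
make -j 4 all install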
--
Jeff Squyres
***@cisco.com
Vahid Askarpour
2018-01-05 22:22:14 UTC
Thank you Jeff for your suggestion to use the v2.1 series.

I am attempting to use openmpi with EPW. On the EPW website (http://epw.org.uk/Main/DownloadAndInstall), it is stated that:


Compatibility of EPW

EPW is tested and should work on the following compilers and libraries:

* gcc640 serial
* gcc640 + openmpi-1.10.7
* intel 12 + openmpi-1.10.7
* intel 17 + impi
* PGI 17 + mvapich2.3

EPW is known to have the following incompatibilities:

* openmpi 2.0.2 (but likely all 2.x.x versions): works, but has a memory leak. If you open and close a file many times with openmpi 2.0.2, memory increases linearly with the number of times the file is opened.
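For what it's worth, the access pattern being described boils down to something like this minimal C sketch (not EPW's actual Fortran code; on an affected 2.x build, resident memory reportedly grows with every pass through the loop):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int i;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 100000; i++) {
        /* Open and immediately close the same file. Each cycle should
           release all resources, but on an affected build it does not. */
        MPI_File_open(MPI_COMM_WORLD, "scratch.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR,
                      MPI_INFO_NULL, &fh);
        MPI_File_close(&fh);
    }
    MPI_Finalize();
    return 0;
}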

So I am hoping to avoid the 2.x.x series and use the 1.10.7 version suggested by the EPW developers. However, it appears that this is not possible.

Vahid

Jeff Squyres (jsquyres)
2018-01-05 23:35:06 UTC
You can still give Open MPI 2.1.1 a try. It should be source compatible with EPW. Hopefully the behavior is close enough that it should work.

If not, please encourage the EPW developers to upgrade. v3.0.x is the current stable series; v1.10.x is ancient.
--
Jeff Squyres
***@cisco.com
Gilles Gouaillardet
2018-01-06 00:40:02 UTC
Vahid,

This looks like the description of the issue reported at
https://github.com/open-mpi/ompi/issues/4336
The fix is currently available in 3.0.1rc1, and I will backport the fix to
the v2.x branch.
A workaround is to use ROMIO instead of ompio; you can achieve this with
mpirun --mca io ^ompio ...
(FWIW, the 1.10 series uses ROMIO by default, so there is no leak out of the box)

IIRC, a possible (and ugly) workaround for the compilation issue is to
configure --with-ucx=/usr ...
That being said, you should really upgrade to a supported version of Open
MPI, as previously suggested.
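If you want to check which io components your build actually contains before trying that, ompi_info lists them; for example
ompi_info | grep "MCA io"
should show the ompio and/or ROMIO-based entries.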

Cheers,

Gilles
Vahid Askarpour
2018-01-06 00:58:09 UTC
Gilles,

I will try the 3.0.1rc1 version to see how it goes.

Thanks,

Vahid

Jeff Squyres (jsquyres)
2018-01-11 16:04:43 UTC
Vahid --

Were you able to give it a whirl?

Thanks.
--
Jeff Squyres
***@cisco.com
Vahid Askarpour
2018-01-11 16:34:06 UTC
Hi Jeff,

I looked for the 3.0.1 version but I only found the 3.0.0 version available for download. So I thought it may take a while for the 3.0.1 to become available. Or did I miss something?

Thanks,

Vahid
Jeff Squyres (jsquyres)
2018-01-11 16:50:57 UTC
You are correct: 3.0.1 has not been released yet.

However, our nightly snapshots of the 3.0.x branch are available for download. These are not official releases, but they are great for getting users to test what will eventually become an official release (i.e., 3.0.1) to see if particular bugs have been fixed. This is one of the benefits of open source. :-)

Here's where the 3.0.1 nightly snapshots are available for download:

https://www.open-mpi.org/nightly/v3.0.x/

They are organized by date.
--
Jeff Squyres
***@cisco.com
Vahid Askarpour
2018-01-11 16:59:34 UTC
Great. I will try the 3.0.x version to see how it goes.

On a side note, I did manage to run EPW without getting memory leaks using openmpi-1.8.8 and gcc-4.8.5. These are the tools that apparently worked when the code was developed, as seen on their Test Farm (http://epw.org.uk/Main/TestFarm).

Thanks,

Vahid

Vahid Askarpour
2018-01-18 22:34:03 UTC
Hi Jeff,

I compiled Quantum Espresso/EPW with openmpi-3.0.x. Open MPI itself was built with intel14.

A preliminary run for EPW using Quantum Espresso crashed with the following message:

end of file while reading crystal k points

There are 1728 k points in the input file and Quantum Espresso, by default, can read up to 40000 k points.

This error did not occur with openmpi-1.8.1.

So I will just continue to use openmpi-1.8.1 as it does not crash.

Thanks,

Vahid
Jeff Squyres (jsquyres)
2018-01-18 22:39:20 UTC
FWIW: If your Open MPI 3.0.x runs are reading data that was written by MPI IO via Open MPI 1.10.x or 1.8.x runs, the data formats may not be compatible (and could lead to errors like you're seeing -- premature end of file, etc.).
--
Jeff Squyres
***@cisco.com
Vahid Askarpour
2018-01-18 22:53:43 UTC
My openmpi-3.0.x run (an nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (an scf run) was done with openmpi-3.0.x on a similar input file, also edited by hand.

Vahid
Jeff Squyres (jsquyres)
2018-01-18 23:17:09 UTC
Post by Vahid Askarpour
My openmpi-3.0.x run (an nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (an scf run) was done with openmpi-3.0.x on a similar input file, also edited by hand.
Gotcha.

Well, that's a little disappointing.

It would be good to understand why it is crashing -- is the app doing something that is accidentally not standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...?
--
Jeff Squyres
***@cisco.com
Edgar Gabriel
2018-01-19 01:55:18 UTC
I will try to reproduce this problem with 3.0.x, but it might take me a couple of days to get to it.

Since it seemed to have worked with 2.0.x (except for the problem of running out of file handles), there is the suspicion that one of the fixes we introduced since then is the culprit.

What file system did you run it on? NFS?

Thanks

Edgar
Gilles Gouaillardet
2018-01-19 02:14:29 UTC
Vahid,

In the v1.10 series, the default MPI-IO component was ROMIO based, and
in the v3 series, it is now ompio.
You can force the latest Open MPI to use the ROMIO based component with
mpirun --mca io romio314 ...
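(Equivalently, since every MCA parameter can be set through the environment, you can put
export OMPI_MCA_io=romio314
in your job script instead of passing --mca on the command line.)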

That being said, your description (e.g. a hand-edited file) suggests
that I/O is not performed with MPI-IO,
which makes me very puzzled as to why the latest Open MPI is crashing.

Cheers,

Gilles
Vahid Askarpour
2018-01-19 13:15:29 UTC
Gilles,

I have submitted that job with --mca io romio314. If it finishes, I will let you know. It is sitting in Conte’s queue at Purdue.

As to Edgar’s question about the file system, here is the output of df -Th:

***@conte-fe00:~ $ df -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda1 ext4 435G 16G 398G 4% /
tmpfs tmpfs 16G 1.4M 16G 1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
nfs 80T 64T 17T 80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
nfs 8.0T 4.0T 4.1T 49% /apps
mds-d01-***@o2ib1:mds-d02-***@o2ib1:/lustreD
lustre 1.4P 994T 347T 75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
nfs 4.5P 3.0P 1.6P 66% /depot
172.18.84.186:/persistent/fsadmin
nfs 200G 130G 71G 65% /usr/rmt_share/fsadmin

The code is compiled in my $HOME and is run on the scratch.

Cheers,

Vahid
Vinson, John (Fed)
2018-01-19 13:56:57 UTC
Hi Vahid,

This may be a red herring, but are you using a redirect or -i for the QE input? If you are running "pw.x < input", try running with "pw.x -i input".
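For illustration, a sketch using the generic launch form seen in this thread (np and npool values are placeholders):

mpiexec -np 64 pw.x -npool 64 -i nscf.in > nscf.out

With -i, pw.x opens the input file itself instead of receiving it on stdin through the launcher's IO forwarding.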

John

Edgar Gabriel
2018-01-19 14:23:37 UTC
Permalink
thanks, that is interesting. Since /scratch is a lustre file system,
Open MPI should actually utilize romio314 for it anyway, not ompio.
What I have, however, seen happen on at least one occasion is that
ompio was still used because (I suspect) romio314 did not pick up the
configuration options correctly. It is a bit of a mess from that
perspective that we have to pass the romio arguments with different
flags/options than for ompio, e.g.

--with-lustre=/path/to/lustre/
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre
--with-lustre=/path/to/lustre"

ompio should pick up the lustre options correctly if lustre
headers/libraries are found at the default location, even if the user
did not pass the --with-lustre option. I am not entirely sure what
happens in romio if the user did not pass the
--with-file-system=ufs+nfs+lustre but the lustre headers/libraries are
found at the default location, i.e. whether the lustre adio component is
still compiled or not.

Anyway, let's wait for the outcome of your run enforcing the
romio314 component, and I will still try to reproduce your problem on my
system.

Thanks
Edgar
Vahid Askarpour
2018-01-19 16:32:46 UTC
Permalink
To run EPW, the command for the preliminary nscf run is as follows (http://epw.org.uk/Documentation/B-dopedDiamond):

~/bin/openmpi-v3.0/bin/mpiexec -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

So I submitted it with the following command:

~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64 /home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 < nscf.in > nscf.out

And it crashed like the first time.

It is interesting that the preliminary scf run works fine. The scf run requires Quantum Espresso to generate the k points automatically as shown below:

K_POINTS (automatic)
12 12 12 0 0 0

The nscf run which crashes includes a list of k points (1728 in this case) as seen below:

K_POINTS (crystal)
1728
0.00000000 0.00000000 0.00000000 5.787037e-04
0.00000000 0.00000000 0.08333333 5.787037e-04
0.00000000 0.00000000 0.16666667 5.787037e-04
0.00000000 0.00000000 0.25000000 5.787037e-04
0.00000000 0.00000000 0.33333333 5.787037e-04
0.00000000 0.00000000 0.41666667 5.787037e-04
0.00000000 0.00000000 0.50000000 5.787037e-04
0.00000000 0.00000000 0.58333333 5.787037e-04
0.00000000 0.00000000 0.66666667 5.787037e-04
0.00000000 0.00000000 0.75000000 5.787037e-04


...

To build openmpi (either 1.10.7 or 3.0.x), I loaded the fortran compiler module, configured with only "--prefix=", and then ran "make all install". I did not enable or disable any other options (roughly the sequence sketched below).
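In other words (a sketch; the module name is hypothetical and site-specific):

module load intel                            # hypothetical module name; use whatever the site provides
./configure --prefix=$HOME/bin/openmpi-v3.0
make all install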

Cheers,

Vahid


Edgar Gabriel
2018-01-19 16:46:16 UTC
Permalink
ok, thank you for the information. Two short questions and requests. I
have qe-6.2.1 compiled and running on my system (although it is with
gcc-6.4 instead of the intel compiler), and I am currently running the
parallel test suite. So far, all the tests passed, although it is still
running.

My question now is: would it be possible for you to give me access to
exactly the same data set that you are using? You could upload it to a
webpage or similar and just send me the link.

The second question/request: could you rerun your tests one more time,
this time forcing ompio, e.g. --mca io ompio?

Thanks

Edgar
--
Edgar Gabriel
Associate Professor
Department of Computer Science

Associate Director
Center for Advanced Computing and Data Science (CACDS)

University of Houston
Philip G. Hoffman Hall, Room 228 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
Vahid Askarpour
2018-01-19 17:44:44 UTC
Permalink
Hi Edgar,

Just to let you know that the nscf run with --mca io ompio crashed like the other two runs.

Thank you,

Vahid

Edgar Gabriel
2018-01-19 19:08:16 UTC
Permalink
ok, here is what I found out so far; I will have to stop here for
today, however:

 1. I can in fact reproduce your bug on my systems.

 2. I can confirm that the problem occurs both with romio314 and ompio.
I *think* the issue is that the input_tmp.in file is incomplete. In both
cases (ompio and romio) the end of the file looks as follows (and it is
exactly the same for both libraries):

***@crill-002:/tmp/gabriel/qe-6.2.1/QE_input_files> tail -10 input_tmp.in
  0.66666667  0.50000000  0.83333333  5.787037e-04
  0.66666667  0.50000000  0.91666667  5.787037e-04
  0.66666667  0.58333333  0.00000000  5.787037e-04
  0.66666667  0.58333333  0.08333333  5.787037e-04
  0.66666667  0.58333333  0.16666667  5.787037e-04
  0.66666667  0.58333333  0.25000000  5.787037e-04
  0.66666667  0.58333333  0.33333333  5.787037e-04
  0.66666667  0.58333333  0.41666667  5.787037e-04
  0.66666667  0.58333333  0.50000000  5.787037e-04
  0.66666667  0.58333333  0.58333333  5

which is what I *think* causes the problem.
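A quick way to check how much of the input survives the forwarding (a
sketch; run in the job's working directory) is to compare the copy
against the original:

wc -l nscf.in input_tmp.in     # line counts should match if forwarding is complete
tail -3 nscf.in
tail -3 input_tmp.in           # here the last line is visibly cut off mid-number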

 3. I tried to find where input_tmp.in is generated, but haven't
completely identified the location. However, I could not find MPI
file_write(_all) operations anywhere in the code, although there are
some MPI_file_read(_all) operations.

 4. I can confirm that the behavior with Open MPI 1.8.x is different.
input_tmp.in looks more complete (at least it doesn't end in the middle
of a line). The simulation still does not finish for me, but the
reported error is slightly different; I might just be missing a file or
something:


     from pw_readschemafile : error #         1
     xml data file not found

Since I think input_tmp.in is generated from data that is provided in
nscf.in, it might very well be something in the MPI_File_read(_all)
operation that causes the issue; but since both ompio and romio are
affected, there is a good chance that something outside the control of
the io components is causing the trouble (maybe a datatype issue that
changed between the 1.8.x series and 3.0.x).

 5. Last but not least, I also wanted to mention that I ran all the
parallel tests that I found in the testsuite (run-tests-cp-parallel,
run-tests-pw-parallel, run-tests-ph-parallel, run-tests-epw-parallel),
and they all passed with ompio (and with romio314, although I only ran
a subset of the tests there).

Thanks

Edgar

Vahid Askarpour
2018-01-19 19:22:29 UTC
Permalink
Concerning the following error:

from pw_readschemafile : error # 1
xml data file not found

The nscf run uses files generated by the scf.in run, so I first run scf.in and, when it finishes, run nscf.in (see the sketch below). If you have done this and still get the above error, then this could be another bug. It does not happen for me with intel14/openmpi-1.8.8.
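Schematically, the sequence is (a sketch, using the -i form suggested earlier in the thread):

mpiexec -np 64 pw.x -npool 64 -i scf.in  > scf.out    # writes the xml/data files
mpiexec -np 64 pw.x -npool 64 -i nscf.in > nscf.out   # reads them back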

Thanks for the update,

Vahid

Edgar Gabriel
2018-01-22 19:17:38 UTC
Permalink
after some further investigation, I am fairly confident that this is not
an MPI I/O problem.

The input file input_tmp.in is generated by this sequence of
instructions (which is in Modules/open_close_input_file.f90):

---

  IF ( TRIM(input_file_) /= ' ' ) THEN
     !
     ! copy file to be opened into input_file
     !
     input_file = input_file_
     !
  ELSE
     !
     ! if no file specified then copy from standard input
     !
     input_file="input_tmp.in"
     OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
          STATUS='unknown', IOSTAT = ierr )
     IF ( ierr > 0 ) GO TO 30
     !
     dummy=' '
     WRITE(stdout, '(5x,a)') "Waiting for input..."
     DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
        READ (stdin,fmt='(A512)',END=20) dummy
        WRITE (stdtmp,'(A)') trim(dummy)
     END DO
     !
20   CLOSE ( UNIT=stdtmp, STATUS='keep' )

----

Basically, if no input file has been provided, the input file is
generated by reading from standard input. Since the application is being
launched e.g. with

mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out

the data comes from nscf.in. I simply do not know enough about IO
forwarding to be able to tell why we do not see the entire file, but one
interesting detail is that if I run it in the debugger, input_tmp.in
is created correctly. However, if I run it using mpirun as shown above,
the file is cropped, which leads to the error message mentioned in this
email chain.

Anyway, I would probably need some help here from somebody who knows the
runtime better than me on what could go wrong at this point.

Thanks

Edgar
Edgar Gabriel
2018-01-22 22:05:59 UTC
Permalink
well, my final comment on this topic: as somebody suggested earlier in
this email chain, if you provide the input with the -i argument instead
of piping it from standard input, things seem to work as far as I can
see (disclaimer: I do not know what the final outcome should be; I just
see that the application does not complain about 'end of file while
reading crystal k points'). So maybe that is the simplest solution.

Thanks

Edgar
Post by Edgar Gabriel
after some further investigation, I am fairly confident that this is
not an MPI I/O problem.
The input file input_tmp.in is generated in this sequence of
instructions (which is in Modules/open_close_input_file.f90)
---
  IF ( TRIM(input_file_) /= ' ' ) THEn
     !
     ! copy file to be opened into input_file
     !
     input_file = input_file_
     !
  ELSE
     !
     ! if no file specified then copy from standard input
     !
     input_file="input_tmp.in"
     OPEN(UNIT = stdtmp, FILE=trim(input_file), FORM='formatted', &
          STATUS='unknown', IOSTAT = ierr )
     IF ( ierr > 0 ) GO TO 30
     !
     dummy=' '
     WRITE(stdout, '(5x,a)') "Waiting for input..."
     DO WHILE ( TRIM(dummy) .NE. "MAGICALME" )
        READ (stdin,fmt='(A512)',END=20) dummy
        WRITE (stdtmp,'(A)') trim(dummy)
     END DO
     !
20   CLOSE ( UNIT=stdtmp, STATUS='keep' )
----
Basically, if no input file has been provided, the input file is
generated by reading from standard input. Since the application is
being launched e.g. with
mpirun -np 64 ../bin/pw.x -npool 64 <nscf.in >nscf.out
the data comes from nscf.in. I simply do not know enough about IO
forwarding do be able to tell why we do not see the entire file, but
one interesting detail is that if I run it in the debugger, the
input_tmp.in is created correctly. However, if I run it using mpirun
as shown above, the file is cropped incorrectly, which leads to the
error message mentioned in this email chain.
Anyway, I would probably need some help here from somebody who knows
the runtime better than me on what could go wrong at this point.
Thanks
Edgar
Post by Vahid Askarpour
Concerning the following error
     from pw_readschemafile : error #         1
     xml data file not found
The nscf run uses files generated by the scf.in run. So I first run
scf.in and when it finishes, I run nscf.in. If you have done this and
still get the above error, then this could be another bug. It does
not happen for me with intel14/openmpi-1.8.8.
Thanks for the update,
Vahid
Post by Edgar Gabriel
 1. I can in fact reproduce your bug on my systems.
 2. I can confirm that the problem occurs both with romio314 and
ompio. I *think* the issue is that the input_tmp.in file is
incomplete. In both cases (ompio and romio) the end of the file
input_tmp.in
  0.66666667  0.50000000  0.83333333  5.787037e-04
  0.66666667  0.50000000  0.91666667  5.787037e-04
  0.66666667  0.58333333  0.00000000  5.787037e-04
  0.66666667  0.58333333  0.08333333  5.787037e-04
  0.66666667  0.58333333  0.16666667  5.787037e-04
  0.66666667  0.58333333  0.25000000  5.787037e-04
  0.66666667  0.58333333  0.33333333  5.787037e-04
  0.66666667  0.58333333  0.41666667  5.787037e-04
  0.66666667  0.58333333  0.50000000  5.787037e-04
  0.66666667  0.58333333  0.58333333  5
which is what I *think* causes the problem.
 3. I tried to find where input_tmp.in is generated, but haven't
completely identified the location. However, I could not find any
MPI_File_write(_all) operations anywhere in the code, although there
are some MPI_File_read(_all) operations.
 4. I can confirm that the behavior with Open MPI 1.8.x is
different. input_tmp.in looks more complete (at least it doesn't end
in the middle of a line). The simulation still does not finish for
me, but the bug reported is slightly different; I might just be
missing a file or something:
     from pw_readschemafile : error #         1
     xml data file not found
Since I think input_tmp.in is generated from data that is provided
in nscf.in, it might very well be something in the
MPI_File_read(_all) operation that causes the issue, but since both
ompio and romio are affected, there is a good chance that something
outside of the control of the io components is causing the trouble
(maybe a datatype issue that changed from the 1.8.x series to 3.0.x).
 5. Last but not least, I also wanted to mention that I ran all
parallel tests that I found in the test suite
(run-tests-cp-parallel, run-tests-pw-parallel,
run-tests-ph-parallel, run-tests-epw-parallel), and they all passed
with ompio (and with romio314, although I only ran a subset of the
tests there).
Thanks
Edgar
-
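(Side note: since input_tmp.in is meant to be a verbatim copy of the
piped input, a quick check after a failed run is simply

  diff nscf.in input_tmp.in

an empty diff would rule out the copy loop, while a truncated tail like
the one shown above would show up immediately.)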
Post by Vahid Askarpour
Hi Edgar,
Just to let you know that the nscf run with --mca io ompio crashed
like the other two runs.
Thank you,
Vahid
Post by Vahid Askarpour
On Jan 19, 2018, at 12:46 PM, Edgar Gabriel
ok, thank you for the information. Two short questions and
requests. I have qe-6.2.1 compiled and running on my system
(although it is with gcc-6.4 instead of the intel compiler), and I
am currently running the parallel test suite. So far, all the
tests passed, although it is still running.
My question is now, would it be possible for you to give me access
to exactly the same data set that you are using?  You could upload
to a webpage or similar and just send me the link.
The second question/request: could you rerun your tests one more
time, this time forcing ompio, e.g. --mca io ompio?
Thanks
Edgar
Post by Vahid Askarpour
To run EPW, the command for running the preliminary nscf run is
(http://epw.org.uk/Documentation/B-dopedDiamond):
~/bin/openmpi-v3.0/bin/mpiexec -np 64
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 <
nscf.in > nscf.out
So I submitted it with the following command:
~/bin/openmpi-v3.0/bin/mpiexec --mca io romio314 -np 64
/home/vaskarpo/bin/qe-6.0_intel14_soc/bin/pw.x -npool 64 <
nscf.in > nscf.out
And it crashed like the first time.
It is interesting that the preliminary scf run works fine. The
scf run requires Quantum Espresso to generate the k points
automatically, as shown below:
K_POINTS (automatic)
12 12 12 0 0 0
The nscf run which crashes includes a list of k points (1728 in
this case), as seen below:
K_POINTS (crystal)
1728
  0.00000000  0.00000000  0.00000000  5.787037e-04
  0.00000000  0.00000000  0.08333333  5.787037e-04
  0.00000000  0.00000000  0.16666667  5.787037e-04
  0.00000000  0.00000000  0.25000000  5.787037e-04
  0.00000000  0.00000000  0.33333333  5.787037e-04
  0.00000000  0.00000000  0.41666667  5.787037e-04
  0.00000000  0.00000000  0.50000000  5.787037e-04
  0.00000000  0.00000000  0.58333333  5.787037e-04
  0.00000000  0.00000000  0.66666667  5.787037e-04
  0.00000000  0.00000000  0.75000000  5.787037e-04
  ...
To build openmpi (either 1.10.7 or 3.0.x), I loaded the Fortran
compiler module, configured with only "--prefix=", and then ran
"make all install". I did not enable or disable any other options.
Cheers,
Vahid
Post by Vahid Askarpour
On Jan 19, 2018, at 10:23 AM, Edgar Gabriel
thanks, that is interesting. Since /scratch is a lustre file
system, Open MPI should actually utilize romio314 for that
anyway, not ompio. What I have seen happen on at least one
occasion, however, is that ompio was still used because (I
suspect) romio314 didn't pick up the configuration options
correctly. It is a bit of a mess from that perspective that we
have to pass the romio arguments with different flags/options
than for ompio, e.g.
--with-lustre=/path/to/lustre/
--with-io-romio-flags="--with-file-system=ufs+nfs+lustre
--with-lustre=/path/to/lustre"
ompio should pick up the lustre options correctly if lustre
headers/libraries are found at the default location, even if the
user did not pass the --with-lustre option. I am not entirely
sure what happens in romio if the user did not pass the
--with-file-system=ufs+nfs+lustre but the lustre
headers/libraries are found at the default location, i.e.
whether the lustre adio component is still compiled or not.
Anyway, let's wait for the outcome of your run enforcing the
romio314 component, and I will still try to reproduce your
problem on my system.
Thanks
Edgar
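(To make that concrete, a sketch only, with placeholder paths: a build
aimed at Lustre would combine both sets of flags mentioned above:

  ./configure --prefix=$HOME/bin/openmpi-v3.0 \
      --with-lustre=/path/to/lustre \
      --with-io-romio-flags="--with-file-system=ufs+nfs+lustre --with-lustre=/path/to/lustre"
  make all install
)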
Post by Vahid Askarpour
Gilles,
I have submitted that job with --mca io romio314. If it finishes, I will let you know. It is sitting in Conte’s queue at Purdue.
As to Edgar's question about the file system, here is the output of df -Th:
Filesystem            Type    Size  Used Avail Use% Mounted on
/dev/sda1             ext4    435G   16G  398G   4% /
tmpfs                 tmpfs    16G  1.4M   16G   1% /dev/shm
persistent-nfs.rcac.purdue.edu:/persistent/home
                      nfs      80T   64T   17T  80% /home
persistent-nfs.rcac.purdue.edu:/persistent/apps
                      nfs     8.0T  4.0T  4.1T  49% /apps
mds-d01-***@o2ib1:mds-d02-***@o2ib1:/lustreD
                      lustre  1.4P  994T  347T  75% /scratch/conte
depotint-nfs.rcac.purdue.edu:/depot
                      nfs     4.5P  3.0P  1.6P  66% /depot
172.18.84.186:/persistent/fsadmin
                      nfs     200G  130G   71G  65% /usr/rmt_share/fsadmin
The code is compiled in my $HOME and is run on the scratch.
Cheers,
Vahid
Post by Gilles Gouaillardet
Vahid,
i the v1.10 series, the default MPI-IO component was ROMIO based, and
in the v3 series, it is now ompio.
You can force the latest Open MPI to use the ROMIO based component with
mpirun --mca io romio314 ...
That being said, your description (e.g. a hand-edited file) suggests
that I/O is not performed with MPI-IO,
which makes me very puzzled as to why the latest Open MPI is crashing.
Cheers,
Gilles
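(A quick way to see which io components a given installation actually
contains is standard ompi_info output, e.g.

  ompi_info | grep 'MCA io'

which should list romio314 and/or ompio; the exact version strings will
differ per build.)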
Post by Edgar Gabriel
I will try to reproduce this problem with 3.0.x, but it might take me a
couple of days to get to it.
Since it seemed to have worked with 2.0.x (except for the problem of
running out of file handles), there is the suspicion that one of the
fixes that we introduced since then is the problem.
What file system did you run it on? NFS?
Thanks
Edgar
Post by Jeff Squyres (jsquyres)
Post by Vahid Askarpour
My openmpi3.0.x run (called nscf run) was reading data from a routine
Quantum Espresso input file edited by hand. The preliminary run (called scf
run) was done with openmpi3.0.x on a similar input file also edited by hand.
Gotcha.
Well, that's a little disappointing.
It would be good to understand why it is crashing -- is the app doing
something that is accidentally not standard? Is there a bug in (soon to be
released) Open MPI 3.0.1? ...?
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Vahid Askarpour
2018-01-23 13:06:38 UTC
Permalink
This would work for Quantum Espresso input. I am waiting to see what happens with EPW; I don't think EPW accepts the -i argument. I will report back once the EPW job is done.

Cheers,

Vahid

Stephen Guzik
2018-01-19 20:42:54 UTC
Permalink
Not sure if this is related, and I have not had time to investigate it
much or reduce it, but I am also having issues with 3.0.x. There are
a couple of layers of CGNS and HDF5 in between, but I am seeing:

mpirun --mca io romio314 --mca btl self,vader,openib...
-- works perfectly

mpirun --mca btl self,vader,openib...
cgio_open_file:H5Dwrite:write to node data failed

The file system is NFS, and this is an openmpi-v3.0.x-201711220306-2399e85 build.

Stephen

Stephen Guzik, Ph.D.
Assistant Professor, Department of Mechanical Engineering
Colorado State University
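
(One hedged suggestion for narrowing this down: Open MPI's usual
framework verbosity parameter for the io framework,

  mpirun --mca io_base_verbose 100 --mca btl self,vader,openib ...

should log to stderr which io component is selected for the file in
question.)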
Post by Jeff Squyres (jsquyres)
Post by Vahid Askarpour
My openmpi3.0.x run (called nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (called scf run) was done with openmpi3.0.x on a similar input file also edited by hand.
Gotcha.
Well, that's a little disappointing.
It would be good to understand why it is crashing -- is the app doing something that is accidentally not standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...?
Edgar Gabriel
2018-01-19 21:09:24 UTC
Permalink
this is most likely a different issue. The bug in the original case
also appears on a local file system/disk; it doesn't have to be NFS.

That being said, I would urge you to submit a new issue (or start a new
email thread); I would be more than happy to look into your problem as
well, since we submitted a number of patches into the 3.0.x branch
specifically for NFS.

Thanks
Edgar
Post by Stephen Guzik
Not sure if this is related and I have not had time to investigate it
much or reduce but I am also having issues with 3.0.x.  There's a couple
mpirun --mca io romio314 --mca btl self,vader,openib...
-- works perfectly
mpirun --mca btl self,vader,openib...
cgio_open_file:H5Dwrite:write to node data failed
The files system in NFS and an openmpi-v3.0.x-201711220306-2399e85 build.
Stephen
Stephen Guzik, Ph.D.
Assistant Professor, Department of Mechanical Engineering
Colorado State University
Post by Jeff Squyres (jsquyres)
Post by Vahid Askarpour
My openmpi3.0.x run (called nscf run) was reading data from a routine Quantum Espresso input file edited by hand. The preliminary run (called scf run) was done with openmpi3.0.x on a similar input file also edited by hand.
Gotcha.
Well, that's a little disappointing.
It would be good to understand why it is crashing -- is the app doing something that is accidentally not standard? Is there a bug in (soon to be released) Open MPI 3.0.1? ...?
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users