Discussion:
[OMPI users] [version 2.1.5] invalid memory reference
Patrick Begou
2018-09-18 07:49:26 UTC
Hi

I'm moving a large CFD code from Gcc 4.8.5/OpenMPI 1.7.3 to Gcc 7.3.0/OpenMPI
2.1.5 and with this latest config I have random segfaults.
Same binary, same server, same number of processes (16), same parameters for the
run. Sometimes it runs until the end, sometimes I get an 'invalid memory reference'.

Building the application and OpenMPI in debug mode, I saw that this random
segfault always occurs in collective communications inside OpenMPI. I've no idea
how to track this down. These are two call stack traces (just the OpenMPI part):

*Calling MPI_ALLREDUCE(...)*
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f01937022ef in ???
#1  0x7f0192dd0331 in mca_btl_vader_check_fboxes
    at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
#2  0x7f0192dd0331 in mca_btl_vader_component_progress
    at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
#3  0x7f0192d6b92b in opal_progress
    at ../../opal/runtime/opal_progress.c:226
#4  0x7f0194a8a9a4 in sync_wait_st
    at ../../opal/threads/wait_sync.h:80
#5  0x7f0194a8a9a4 in ompi_request_default_wait_all
    at ../../ompi/request/req_wait.c:221
#6  0x7f0194af1936 in ompi_coll_base_allreduce_intra_recursivedoubling
    at ../../../../ompi/mca/coll/base/coll_base_allreduce.c:225
#7  0x7f0194aa0a0a in PMPI_Allreduce
    at
/kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pallreduce.c:107
#8  0x7f0194f2e2ba in ompi_allreduce_f
    at
/kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pallreduce_f.c:87
#9  0x8e21fd in __linear_solver_deflation_m_MOD_solve_el_grp_pcg
    at linear_solver_deflation_m.f90:341


*Calling MPI_WAITALL()*

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7fda5a8d72ef in ???
#1  0x7fda59fa5331 in mca_btl_vader_check_fboxes
    at ../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
#2  0x7fda59fa5331 in mca_btl_vader_component_progress
    at ../../../../../opal/mca/btl/vader/btl_vader_component.c:689
#3  0x7fda59f4092b in opal_progress
    at ../../opal/runtime/opal_progress.c:226
#4  0x7fda5bc5f9a4 in sync_wait_st
    at ../../opal/threads/wait_sync.h:80
#5  0x7fda5bc5f9a4 in ompi_request_default_wait_all
    at ../../ompi/request/req_wait.c:221
#6  0x7fda5bca329e in PMPI_Waitall
    at
/kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/c/profile/pwaitall.c:76
#7  0x7fda5c10bc00 in ompi_waitall_f
    at
/kareline/data/begou/GCC7/openmpi-2.1.5/build/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
#8  0x6dcbf7 in __data_comm_m_MOD_update_ghost_ext_comm_r1
    at data_comm_m.f90:5849


The segfault is always located in opal/mca/btl/vader/btl_vader_fbox.h at
207                /* call the registered callback function */
208                reg->cbfunc(&mca_btl_vader.super, hdr.data.tag, &desc, reg->cbdata);
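One check I can still run, since both traces point at the vader shared-memory BTL, is to exclude that component at run time so another transport is used (standard Open MPI MCA syntax; the binary name below is just a placeholder):

```shell
# Run the same 16-process job with the vader BTL excluded ("^" negates the
# list); if the random segfault disappears, the vader fast-box path above
# is implicated.
mpirun --mca btl ^vader -np 16 ./my_cfd_app
```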


OpenMPI 2.1.5 is built with:
CFLAGS="-O3 -march=native -mtune=native" CXXFLAGS="-O3 -march=native
-mtune=native" FCFLAGS="-O3 -march=native -mtune=native" \
../configure --prefix=$DESTMPI --enable-mpirun-prefix-by-default --disable-dlopen \
--enable-mca-no-build=openib --without-verbs --enable-mpi-cxx --without-slurm
--enable-mpi-thread-multiple --enable-debug --enable-mem-debug

Any help appreciated

Patrick
--
===================================================================
| Equipe M.O.S.T. | |
| Patrick BEGOU | mailto:***@grenoble-inp.fr |
| LEGI | |
| BP 53 X | Tel 04 76 82 51 35 |
| 38041 GRENOBLE CEDEX | Fax 04 76 82 52 71 |
===================================================================
George Bosilca
2018-09-18 13:54:59 UTC
A few days ago we pushed a fix to master for a strikingly similar issue.
The patch will eventually make it into the 4.0 and 3.1 releases, but not the 2.x
series. The best path forward will be to migrate to a more recent OMPI
version.

George.


Patrick Begou
2018-09-19 07:41:15 UTC
Hi George

thanks for your answer. I was previously using OpenMPI 3.1.2 and had this
problem as well. However, using --enable-debug --enable-mem-debug at
configuration time, I was unable to reproduce the failure and it was quite
difficult for me to trace the problem. Maybe I have not run enough tests to
reach the failure point.

I fell back to OpenMPI 2.1.5, thinking the problem was in the 3.x version. The
problem was still there, but with the debug config I was able to trace the call
stack.

Which OpenMPI 3.x version do you suggest? A nightly snapshot? Cloning the git
repo?

Thanks

Patrick
George Bosilca
2018-09-19 13:53:21 UTC
I can't speculate on why you did not notice the memory issue before, simply
because for months we (the developers) didn't notice it, and our testing
infrastructure didn't catch this bug despite running millions of tests. The
root cause of the bug was a memory ordering issue, and these are really
tricky to identify.

According to https://github.com/open-mpi/ompi/issues/5638 the patch was
backported to all stable releases starting from 2.1. Until those are officially
released, however, you would need to either get a nightly snapshot or try your
luck with master.

George.


Jeff Squyres (jsquyres) via users
2018-09-19 15:11:46 UTC
Yeah, it's a bit terrible, but we didn't reliably reproduce this problem for many months, either. :-\

As George noted, the fix has been ported to all the release branches but is not yet in an official release. Until an official release (4.0.0 just had an RC; it will be released soon, and 3.0.3 will have an RC in the immediate future), your best bet will be to get a nightly tarball from any of the v2.1.x, v3.0.x, v3.1.x, or v4.0.x branches.

My $0.02: if you're just upgrading from Open MPI v1.7, you might as well jump up to v4.0.x (i.e., don't bother jumping to an older release).
--
Jeff Squyres
***@cisco.com
Patrick Begou
2018-10-11 14:58:30 UTC
Hi Jeff and George

thanks for your answer. I found some time to work on this problem again and I
have downloaded OpenMPI 4.0.0rc4. It compiles without any problem, but building
the first dependency of my code (hdf5 1.8.12) with this version 4 fails:

../../src/H5Smpio.c:355:28: error: 'MPI_LB' undeclared (first use in this
function); did you mean 'MPI_IO'?
             old_types[0] = MPI_LB;
                            ^~~~~~
                            MPI_IO
../../src/H5Smpio.c:355:28: note: each undeclared identifier is reported only
once for each function it appears in
../../src/H5Smpio.c:357:28: error: 'MPI_UB' undeclared (first use in this
function); did you mean 'MPI_LB'?
             old_types[2] = MPI_UB;
                            ^~~~~~
                            MPI_LB
../../src/H5Smpio.c:365:24: warning: implicit declaration of function
'MPI_Type_struct'; did you mean 'MPI_Type_size_x'? [-Wimplicit-function-declaration]
             mpi_code = MPI_Type_struct(3,               /* count */
                        ^~~~~~~~~~~~~~~
                        MPI_Type_size_x

It is not possible for me to use a more recent hdf5 version, as the API has
changed and will not work with the code, even in compatibility mode.

At this time, I'll try version 3 from the git repo if I have the required tools
available on my server. All prerequisites compile successfully with 3.1.2.

Patrick
Jeff Squyres (jsquyres) via users
2018-10-11 15:19:19 UTC
Patrick --

You might want to update your HDF5 code to not use MPI_LB and MPI_UB -- these constants were deprecated in MPI-2.1 in 2009 (an equivalent function, MPI_TYPE_CREATE_RESIZED, was added in MPI-2.0 in 1997), and were removed from the MPI-3.0 standard in 2012.

Meaning: the death of these constants has been written on the wall since 2009.

That being said, Open MPI v4.0 did not remove these constants -- we just *disabled them by default*, specifically for cases like this. I.e., we want to make the greater MPI community aware that:

1) there are MPI-1 constructs that were initially deprecated and finally removed from the standard (this happened years ago)
2) MPI applications should start moving away from these removed MPI-1 constructs
3) Open MPI is disabling these removed MPI-1 constructs by default in Open MPI v4.0. The current plan is to actually fully remove these MPI-1 constructs in Open MPI v5.0 (perhaps in 2019?).

For the v4.0.x series, you can configure/build Open MPI with --enable-mpi1-compatibility to re-activate MPI_LB and MPI_UB.
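For example, starting from the configure line quoted earlier in this thread (only the relevant options shown; $DESTMPI is the install prefix used there, and the remaining options can be appended as before):

```shell
# Configure Open MPI v4.0.x with the removed MPI-1 symbols (MPI_LB, MPI_UB,
# MPI_Type_struct, ...) re-enabled so unmodified HDF5 1.8.x still builds.
../configure --prefix=$DESTMPI \
    --enable-mpirun-prefix-by-default \
    --enable-mpi1-compatibility
```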
Patrick Bégou
2018-10-12 09:59:54 UTC
I have downloaded the nightly snapshot tarball of October 10th, 2018 for
the 3.1 version and it solves the memory problem.
I ran my test case on 1, 2, 4, 10, 16, 20, 32, 40, and 64 cores
successfully.
This version also allows me to compile my prerequisite libraries, so we
can use it out of the box and stay in production.
This gives me time to update hdf5, and also the petsc/slepc libs, to work
with the more recent MPI standard before moving to OpenMPI 4.x versions.

Thanks for all your precious advices.

Patrick
Jeff Hammond
2018-10-11 16:52:07 UTC
MPI_LB, MPI_UB and MPI_Type_struct have been deprecated since MPI-2 and
were removed in MPI-3 (
https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node34.htm). It is
trivial to replace MPI_Type_struct with MPI_Type_create_struct. Replacing
MPI_UB and MPI_LB with MPI_Type_create_resized takes a bit more work, but it
is the right thing to do.

If you are stuck with an older version of HDF5, you may need to maintain a
fork with patches for the aforementioned issues in order to use recent
releases of MPI.

Jeff

--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Nathan Hjelm via users
2018-10-11 15:07:34 UTC
Those features (MPI_LB/MPI_UB/MPI_Type_struct) were removed in MPI-3.0. It is fairly straightforward to update the code to be MPI-3.0 compliant.

MPI_Type_struct -> MPI_Type_create_struct

MPI_LB/MPI_UB -> MPI_Type_create_resized

Example:

types[0] = MPI_LB;
disp[0] = my_lb;
lens[0] = 1;
types[1] = MPI_INT;
disp[1] = disp1;
lens[1] = count;
types[2] = MPI_UB;
disp[2] = my_ub;
lens[2] = 1;

MPI_Type_struct (3, lens, disp, types, &new_type);


becomes:

types[0] = MPI_INT;
disp[0] = disp1;
lens[0] = count;

MPI_Type_create_struct (1, lens, disp, types, &tmp_type);
/* the resized type spans my_lb to my_ub, so the lower bound is my_lb and
   the extent is the difference */
MPI_Type_create_resized (tmp_type, my_lb, my_ub - my_lb, &new_type);
MPI_Type_free (&tmp_type);


-Nathan
