Discussion:
[OMPI users] Seg fault in opal_progress
Noam Bernstein
2018-07-10 21:15:11 UTC
Permalink
Hi OpenMPI users - I'm trying to debug a non-deterministic crash, apparently in opal_progress, with OpenMPI 3.1.0. All of the crashes seem to involve mpi_allreduce, although it's triggered by different particular calls from this code (VASP), and they seem more frequent at larger core/MPI task counts (128 tasks crash within a few minutes, 5-200 iterations of the code, while 16 cores run thousands of iterations without the problem happening). The tail end of the stack trace looks like
libopen-pal.so.40 00002AC204D2B890 opal_progress Unknown Unknown
libmpi.so.40.10.0 00002AC2047E6CEA ompi_coll_base_se Unknown Unknown
libmpi.so.40.10.0 00002AC2047E81B3 ompi_coll_base_al Unknown Unknown
libmpi.so.40.10.0 00002AC20479EF7F PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00002AC2045301D7 mpi_allreduce_ Unknown Unknown
or
libopen-pal.so.40 00002AD2F1B94890 opal_progress Unknown Unknown
libmpi.so.40.10.0 00002AD2F15F678D ompi_request_defa Unknown Unknown
libmpi.so.40.10.0 00002AD2F164FD00 ompi_coll_base_se Unknown Unknown
libmpi.so.40.10.0 00002AD2F16511B3 ompi_coll_base_al Unknown Unknown
libmpi.so.40.10.0 00002AD2F1607F7F PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00002AD2F13991D7 mpi_allreduce_ Unknown Unknown

What are useful steps I can take to debug? Recompile with --enable-debug? Are there any other versions worth trying? I don't recall this error happening before we switched to 3.1.0.

thanks,
Noam
Noam Bernstein
2018-07-11 13:58:11 UTC
Permalink
Post by Noam Bernstein
What are useful steps I can take to debug? Recompile with --enable-debug? Are there any other versions worth trying? I don't recall this error happening before we switched to 3.1.0.
thanks,
Noam
It appears that the problem is there with OpenMPI 3.1.1, but not 2.1.3. Of course I can't be 100% sure, since it's non-deterministic, but 3 runs died after 0-3 iterations with 3.1.1, while 3 runs with 2.1.3 completed 10 iterations each.

Noam

____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
Noam Bernstein
2018-07-11 15:03:21 UTC
Permalink
Post by Noam Bernstein
Post by Noam Bernstein
What are useful steps I can take to debug? Recompile with --enable-debug? Are there any other versions worth trying? I don't recall this error happening before we switched to 3.1.0.
thanks,
Noam
It appears that the problem is there with OpenMPI 3.1.1, but not 2.1.3. Of course I can’t be 100% sure, since it’s non deterministic, but 3 runs died after 0-3 iterations with 3.1.1, and did 3 runs with 10 iterations each with 2.1.3.
After more extensive testing it’s clear that it still happens with 2.1.3, but much less frequently. I’m going to try to get more detailed info with version 3.1.1, where it’s easier to reproduce.

Noam



Jeff Squyres (jsquyres) via users
2018-07-11 15:29:07 UTC
Permalink
Ok, that would be great -- thanks.

Recompiling Open MPI with --enable-debug will turn on several debugging/sanity checks inside Open MPI, and it will also enable debugging symbols. Hence, if you can get a failure with a debug Open MPI build, it might give you a core file that can be used to get a more detailed stack trace, poke around and see if there's a NULL pointer somewhere, etc.
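A minimal sketch of that workflow (the prefix, application path, and core-file name are illustrative, not taken from this thread):

  # rebuild Open MPI with debug checks and symbols
  ./configure --prefix=$HOME/openmpi-3.1.x-debug --enable-debug --enable-mem-debug
  make -j8 install

  # after a crash, open the matching core file with the same executable
  gdb /path/to/your_mpi_app core.<pid>
  (gdb) bt full       # full backtrace, including local variables
  (gdb) frame 7       # select a frame of interest
  (gdb) info locals   # look for NULL or obviously bogus pointers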
Post by Noam Bernstein
What are useful steps I can take to debug? Recompile with --enable-debug? Are there any other versions worth trying? I don't recall this error happening before we switched to 3.1.0.
thanks,
Noam
It appears that the problem is there with OpenMPI 3.1.1, but not 2.1.3. Of course I can’t be 100% sure, since it’s non deterministic, but 3 runs died after 0-3 iterations with 3.1.1, and did 3 runs with 10 iterations each with 2.1.3.
After more extensive testing it’s clear that it still happens with 2.1.3, but much less frequently. I’m going to try to get more detailed info with version 3.1.1, where it’s easier to reproduce.
Noam
--
Jeff Squyres
***@cisco.com
Noam Bernstein
2018-07-11 21:13:46 UTC
Permalink
Post by Jeff Squyres (jsquyres) via users
Ok, that would be great -- thanks.
Recompiling Open MPI with --enable-debug will turn on several debugging/sanity checks inside Open MPI, and it will also enable debugging symbols. Hence, if you can get a failure with a debug Open MPI build, it might give you a core file that can be used to get a more detailed stack trace, poke around and see if there's a NULL pointer somewhere, etc.
I haven't tried to get a core file yet, but it's not producing any more info from the runtime stack trace, despite configuring with --enable-debug:

Image PC Routine Line Source
vasp.gamma_para.i 0000000002DCE8C1 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002DCC9FB Unknown Unknown Unknown
vasp.gamma_para.i 0000000002D409E4 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002D407F6 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002CDCED9 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002CE3DB6 Unknown Unknown Unknown
libpthread-2.12.s 0000003F8E60F7E0 Unknown Unknown Unknown
mca_btl_vader.so 00002B1AFA5FAC30 Unknown Unknown Unknown
mca_btl_vader.so 00002B1AFA5FD00D Unknown Unknown Unknown
libopen-pal.so.40 00002B1AE884327C opal_progress Unknown Unknown
mca_pml_ob1.so 00002B1AFB855DCE Unknown Unknown Unknown
mca_pml_ob1.so 00002B1AFB858305 mca_pml_ob1_send Unknown Unknown
libmpi.so.40.10.1 00002B1AE823A5DA ompi_coll_base_al Unknown Unknown
mca_coll_tuned.so 00002B1AFC6F0842 ompi_coll_tuned_a Unknown Unknown
libmpi.so.40.10.1 00002B1AE81B66F5 PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00002B1AE7F2259B mpi_allreduce_ Unknown Unknown
vasp.gamma_para.i 000000000042D1ED m_sum_d_ 1300 mpi.F
vasp.gamma_para.i 000000000089947D nonl_mp_vnlacc_.R 1754 nonl.F
vasp.gamma_para.i 0000000000972C51 hamil_mp_hamiltmu 825 hamil.F
vasp.gamma_para.i 0000000001BD2608 david_mp_eddav_.R 419 davidson.F
vasp.gamma_para.i 0000000001D2179E elmin_.R 424 electron.F
vasp.gamma_para.i 0000000002B92452 vamp_IP_electroni 4783 main.F
vasp.gamma_para.i 0000000002B6E173 MAIN__ 2800 main.F
vasp.gamma_para.i 000000000041325E Unknown Unknown Unknown
libc-2.12.so 0000003F8E21ED1D __libc_start_main Unknown Unknown
vasp.gamma_para.i 0000000000413169 Unknown Unknown Unknown

This is the configure line that was supposedly used to create the library:
./configure --prefix=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080 --with-tm=/usr/local/torque --enable-mpirun-prefix-by-default --with-verbs=/usr --with-verbs-libdir=/usr/lib64 --enable-debug

Is there any way I can confirm that the version of the openmpi library I think I’m using really was compiled with debugging?

Noam

Åke Sandgren
2018-07-12 07:39:27 UTC
Permalink
Are you running with ulimit -s unlimited?
If not, that looks like an out-of-stack crash, which VASP frequently causes.

If you are running with unlimited stack, I could perhaps run that input case on our VASP build (which has a bunch of fixes for bad stack usage, among other things).
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ***@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Noam Bernstein
2018-07-12 12:37:03 UTC
Permalink
I've recompiled 3.1.1 with --enable-debug --enable-mem-debug, and I still get no detailed information from the MPI libraries, only from VASP (as before):

ldd (at runtime, so I’m fairly sure it’s referring to the right executable and LD_LIBRARY_PATH) info:
vexec /usr/local/vasp/bin/5.4.4/0test/vasp.gamma_para.intel
linux-vdso.so.1 => (0x00007ffd869f6000)
libmkl_intel_lp64.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b0b70015000)
libmkl_sequential.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_sequential.so (0x00002b0b70a56000)
libmkl_core.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_core.so (0x00002b0b717ef000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000366a000000)
libmpi_usempif08.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_usempif08.so.40 (0x00002b0b732f3000)
libmpi_usempi_ignore_tkr.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_usempi_ignore_tkr.so.40 (0x00002b0b73535000)
libmpi_mpifh.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40 (0x00002b0b73737000)
libmpi.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40 (0x00002b0b73991000)
libm.so.6 => /lib64/libm.so.6 (0x0000003f5b400000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f5ac00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003f5a800000)
libc.so.6 => /lib64/libc.so.6 (0x0000003f5a400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003669800000)
/lib64/ld-linux-x86-64.so.2 (0x0000003f5a000000)
libopen-rte.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libopen-rte.so.40 (0x00002b0b73d48000)
libopen-pal.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40 (0x00002b0b74066000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003f5bc00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003f5b000000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003f6c000000)
libz.so.1 => /lib64/libz.so.1 (0x0000003f5b800000)
libifport.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifport.so.5 (0x00002b0b743b8000)
libifcore.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifcore.so.5 (0x00002b0b745e7000)
libimf.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libimf.so (0x00002b0b74948000)
libsvml.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libsvml.so (0x00002b0b74e35000)
libintlc.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libintlc.so.5 (0x00002b0b75d40000)
libifcoremt.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifcoremt.so.5 (0x00002b0b75faa000)
ompi_info (using the same path as indicated by the ldd output):
tin 1125 : /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/bin/ompi_info | grep debug
Prefix: /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080
Configure command line: '--prefix=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080' '--with-tm=/usr/local/torque' '--enable-mpirun-prefix-by-default' '--with-verbs=/usr' '--with-verbs-libdir=/usr/lib64' '--enable-debug' '--enable-mem-debug'
Internal debug support: yes
Memory debugging support: yes
resulting stack trace:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
vasp.gamma_para.i 0000000002DCE8C1 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002DCC9FB Unknown Unknown Unknown
vasp.gamma_para.i 0000000002D409E4 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002D407F6 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002CDCED9 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002CE3DB6 Unknown Unknown Unknown
libpthread-2.12.s 0000003F5AC0F7E0 Unknown Unknown Unknown
mca_btl_vader.so 00002AD17AC74CB8 Unknown Unknown Unknown
mca_btl_vader.so 00002AD17AC770F5 Unknown Unknown Unknown
libopen-pal.so.40 00002AD168B816A4 opal_progress Unknown Unknown
libmpi.so.40.10.1 00002AD1684D0D75 Unknown Unknown Unknown
libmpi.so.40.10.1 00002AD1684D0DB8 ompi_request_defa Unknown Unknown
libmpi.so.40.10.1 00002AD168571EBE ompi_coll_base_se Unknown Unknown
libmpi.so.40.10.1 00002AD1685724B8 Unknown Unknown Unknown
libmpi.so.40.10.1 00002AD168573514 ompi_coll_base_al Unknown Unknown
mca_coll_tuned.so 00002AD17CD6C852 ompi_coll_tuned_a Unknown Unknown
libmpi.so.40.10.1 00002AD1684EE969 PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00002AD1682595B7 mpi_allreduce_ Unknown Unknown
vasp.gamma_para.i 000000000042D1ED m_sum_d_ 1300 mpi.F
vasp.gamma_para.i 0000000001BD5293 david_mp_eddav_.R 778 davidson.F
vasp.gamma_para.i 0000000001D2179E elmin_.R 424 electron.F
vasp.gamma_para.i 0000000002B92452 vamp_IP_electroni 4783 main.F
vasp.gamma_para.i 0000000002B6E173 MAIN__ 2800 main.F
vasp.gamma_para.i 000000000041325E Unknown Unknown Unknown
libc-2.12.so 0000003F5A41ED1D __libc_start_main Unknown Unknown
vasp.gamma_para.i 0000000000413169 Unknown Unknown Unknown


I’ve checked ulimit -s (at runtime), and it is unlimited.

I’m going to try the 3.1.x 20180710 nightly snapshot next.

Let me ask the source of the VASP inputs about sharing them. Note that the crash really only happens at an appreciable rate when running on 128 tasks (8x16-core nodes), and even then, if I do a 10-geometry-step run, only in about 1/3 of all runs, so it's not a completely trivial amount of resources to reproduce.

Noam
Noam Bernstein
2018-07-12 13:35:46 UTC
Permalink
Post by Noam Bernstein
I’m going to try the 3.1.x 20180710 nightly snapshot next.
Same behavior, exactly: segfault, no debugging info beyond the VASP routine that calls mpi_allreduce.

Noam


Jeff Squyres (jsquyres) via users
2018-07-12 14:51:08 UTC
Permalink
Do you get core files?

Loading up the core file in a debugger might give us more information.
Post by Noam Bernstein
I’m going to try the 3.1.x 20180710 nightly snapshot next.
Same behavior, exactly - segfault, no debugging info beyond the vasp routine that calls mpi_allreduce.
Noam
--
Jeff Squyres
***@cisco.com
Noam Bernstein
2018-07-12 14:59:47 UTC
Permalink
Post by Jeff Squyres (jsquyres) via users
Do you get core files?
Loading up the core file in a debugger might give us more information.
No, I don't, despite setting "ulimit -c unlimited". I'm not sure what's going on with that (or the lack of line info in the stack trace). Could it be an Intel compiler issue?

Noam
Jeff Squyres (jsquyres) via users
2018-07-12 15:02:53 UTC
Permalink
Post by Jeff Squyres (jsquyres) via users
Do you get core files?
Loading up the core file in a debugger might give us more information.
No, I don’t, despite setting "ulimit -c unlimited”. I’m not sure what’s going on with that (or the lack of line info in the stack trace). Could be an intel compiler issue?
If you're running your job through a job scheduler (such as SLURM or Torque), you may need to set something in the config and/or environment of the job scheduler daemons to allow launched jobs to get core files.

E.g., if you "ulimit -c" in your interactive shell and see "unlimited", but if you "ulimit -c" in a launched job and see "0", then the job scheduler is doing that to your environment somewhere.
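A quick way to check what the launched processes actually inherit (a sketch; it launches only a shell, so it can be run inside an existing job script):

  # interactively, on a login or compute node
  ulimit -c

  # inside a scheduled job, via mpirun itself
  mpirun -np 4 sh -c 'echo "$(hostname): core limit = $(ulimit -c)"'

If the two disagree, the limit is being reset somewhere along the launch path.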

--
Jeff Squyres
***@cisco.com
Noam Bernstein
2018-07-12 15:45:25 UTC
Permalink
Post by Jeff Squyres (jsquyres) via users
Post by Jeff Squyres (jsquyres) via users
Do you get core files?
Loading up the core file in a debugger might give us more information.
No, I don’t, despite setting "ulimit -c unlimited”. I’m not sure what’s going on with that (or the lack of line info in the stack trace). Could be an intel compiler issue?
If you're running your job through a job scheduler (such as SLURM or Torque), you may need to set something in the config and/or environment of the job scheduler daemons to allow launched jobs to get core files.
E.g., if you "ulimit -c" in your interactive shell and see "unlimited", but if you "ulimit -c" in a launched job and see "0", then the job scheduler is doing that to your environment somewhere.
I am using a scheduler (Torque), but as I also told Åke off-list in our side discussion about VASP, I'm already doing that: I mpirun a script which does a few things like ulimit -c and ulimit -s, and then runs the actual executable with $* arguments.

I’m trying to recompile with gcc/gfortran to see if that changes anything.

Noam
Jeff Squyres (jsquyres) via users
2018-07-12 15:58:48 UTC
Permalink
Post by Noam Bernstein
Post by Jeff Squyres (jsquyres) via users
E.g., if you "ulimit -c" in your interactive shell and see "unlimited", but if you "ulimit -c" in a launched job and see "0", then the job scheduler is doing that to your environment somewhere.
I am using a scheduler (torque), but as I also told Åke off list in our side-discussion about VASP, I’m already doing that. I mpirun a script which does a few things like ulimit -c and ulimit -s, and then runs the actual executable with $* arguments.
That may not be sufficient.

Remember that your job script only runs on the Mother Superior node (i.e., the first node in the job). Hence, while your job script may affect the corefile size settings in that shell (and its children), the remote MPI processes are (effectively) launched via tm_spawn() -- not ssh. I.e., Open MPI will end up calling tm_spawn() to launch orted processes on all the nodes in your job. The TM daemons on those nodes will then fork/exec the orteds, meaning that the orteds inherit the environment (including corefile size restrictions) of the TM daemons. The orteds eventually fork/exec your MPI processes.

This is a long way of saying: your shell startup files may not be executed, and the "ulimit -c" you did in your job script may not be propagated out to the other nodes. Instead, your MPI processes may be inheriting the corefile size limitations from the Torque daemons.

In my SLURM cluster here at Cisco (which is a pretty ancient version at this point; I have no idea if things have changed), I had to put a "ulimit -c unlimited" in a relevant /etc/sysconfig/slurmd file so that it is executed before the slurmd (SLURM daemon) starts. Then my MPI processes start with an unlimited corefile size.
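For a Torque setup, the analogous change would be raising the limit for the pbs_mom daemon itself; the exact file and service name depend on how Torque was packaged, so the sketch below is an assumption, not a known-good recipe:

  # add to the script/sysconfig file sourced before pbs_mom starts (path is site-specific)
  ulimit -c unlimited

  # then restart the daemon on each compute node (service name may differ) and re-check from a job
  service pbs_mom restart
  mpirun -np 4 sh -c 'ulimit -c'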

(You may have already done this; I just want to make sure we're on the same sheet of music here...)

--
Jeff Squyres
***@cisco.com
Noam Bernstein
2018-07-12 16:41:03 UTC
Permalink
Post by Jeff Squyres (jsquyres) via users
(You may have already done this; I just want to make sure we're on the same sheet of music here…)
I'm not talking about the job script or shell startup files. The actual "executable" passed to mpirun on the command line is the script which runs ulimit and then the actual MPI binary. I think that should be safe, right?
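For reference, a minimal sketch of such a wrapper (the binary path is illustrative; the real script presumably differs):

  #!/bin/bash
  # wrapper passed to mpirun instead of the real binary
  ulimit -c unlimited                          # allow core dumps
  ulimit -s unlimited                          # unlimited stack, which VASP wants
  exec /path/to/vasp.gamma_para.intel "$@"     # "$@" preserves argument quoting better than $*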

Noam
Jeff Squyres (jsquyres) via users
2018-07-12 17:47:52 UTC
Permalink
Noam and I actually talked on the phone (whaaaatt!?) and worked through this a bit more.

Oddly, he can generate core files if he runs in /tmp, but not if he runs in an NFS-mounted directory (!). I haven't seen that before -- if someone knows why that would happen, I'd love to hear the explanation.
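One thing that might be worth checking in that situation (speculative on my part, not something established in this thread) is the kernel's core pattern, since a core-dump handler or a relative pattern interacts with the working directory:

  # where does the kernel think core files should go?
  cat /proc/sys/kernel/core_pattern

  # as root, and only as an experiment: force cores to node-local storage
  sysctl -w kernel.core_pattern=/tmp/core.%e.%p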

Regardless, here's what we're going to do:

1. Noam is going to run with a master nightly snapshot and see if that makes things better.

2. If not, he'll be able to generate core files and we can poke around in a debugger to get a bit more information.
Post by Noam Bernstein
(You may have already done this; I just want to make sure we're on the same sheet of music here…)
I’m not talking about the job script or shell startup files. The actual “executable” passed to mpirun on the command line is the script which runs ulimit and then the actually mpi binary. I think that should be safe, right?
Noam
--
Jeff Squyres
***@cisco.com
Noam Bernstein
2018-07-14 00:41:36 UTC
Permalink
Just to summarize for the list: with Jeff's prodding I got it generating core files with the debug (and mem-debug) version of openmpi, and below is the kind of stack trace I'm getting from gdb. It looks slightly different when I use an alternative implementation that doesn't use MPI_IN_PLACE, but nearly the same. The array that's being summed is not large, 3776 doubles.


#0 0x0000003160a32495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003160a33bfd in abort () at abort.c:121
#2 0x0000000002a3903e in for__issue_diagnostic ()
#3 0x0000000002a3ff66 in for__signal_handler ()
#4 <signal handler called>
#5 0x00002b67a4217029 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:208
#6 0x00002b67a421962e in mca_btl_vader_component_progress () at btl_vader_component.c:724
#7 0x00002b67934fd311 in opal_progress () at runtime/opal_progress.c:229
#8 0x00002b6792e2f0df in ompi_request_wait_completion (req=0xe863600) at ../ompi/request/request.h:415
#9 0x00002b6792e2f122 in ompi_request_default_wait (req_ptr=0x7ffebdbb8c20, status=0x0) at request/req_wait.c:42
#10 0x00002b6792ed7d5a in ompi_coll_base_allreduce_intra_ring (sbuf=0x1, rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0, module=0xe14f8b0)
at base/coll_base_allreduce.c:460
#11 0x00002b67a6ccb3e2 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1, rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0, module=0xe14f8b0)
at coll_tuned_decision_fixed.c:74
#12 0x00002b6792e4d9b0 in PMPI_Allreduce (sendbuf=0x1, recvbuf=0xeb79ca0, count=3776, datatype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0) at pallreduce.c:113
#13 0x00002b6792bb6287 in ompi_allreduce_f (sendbuf=0x1 <Address 0x1 out of bounds>,
recvbuf=0xeb79ca0 "\310,&AYI\257\276\031\372\214\223\270-y>\207\066\226\003W\f\240\276\334'}\225\376\336\277>\227§\231", count=0x7ffebdbbc4d4, datatype=0x2b48f5c, op=0x2b48f60,
comm=0x5a0ae60, ierr=0x7ffebdbb8f60) at pallreduce_f.c:87
#14 0x000000000042991b in m_sumb_d (comm=..., vec=..., n=Cannot access memory at address 0x928
) at mpi.F:870
#15 m_sum_d (comm=..., vec=..., n=Cannot access memory at address 0x928
) at mpi.F:3184
#16 0x0000000001b22b83 in david::eddav (hamiltonian=..., p=Cannot access memory at address 0x1
) at davidson.F:779
#17 0x0000000001c6ef0e in elmin (hamiltonian=..., kineden=Cannot access memory at address 0x19
) at electron.F:424
#18 0x0000000002a108b2 in electronic_optimization () at main.F:4783
#19 0x00000000029ec5d3 in vamp () at main.F:2800
#20 0x00000000004100de in main ()
#21 0x0000003160a1ed1d in __libc_start_main (main=0x4100b0 <main>, argc=1, ubp_av=0x7ffebdbc5e38, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>,
stack_end=0x7ffebdbc5e28) at libc-start.c:226
#22 0x000000000040ffe9 in _start ()
Nathan Hjelm via users
2018-07-14 05:31:33 UTC
Permalink
Please give master a try. This looks like another signature of running out of space for shared memory buffers.
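Not a fix, but one way to probe whether the vader shared-memory path is implicated (parameter names and defaults vary between Open MPI versions, so treat this as a sketch; the wrapper name is illustrative):

  # list the vader BTL's tunables for this build
  ompi_info --param btl vader --level 9

  # as an experiment, exclude vader for one run and see if the crash signature changes
  mpirun --mca btl ^vader -np 128 ./run_wrapper.sh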

-Nathan
Noam Bernstein
2018-07-16 12:34:14 UTC
Permalink
Post by Nathan Hjelm via users
Please give master a try. This looks like another signature of running out of space for shared memory buffers.
Sorry, I wasn’t explicit on this point - I’m already using master, specifically
openmpi-master-201807120327-34bc777.tar.gz

Noam
Noam Bernstein
2018-07-16 13:31:57 UTC
Permalink
Post by Noam Bernstein
Post by Nathan Hjelm via users
Please give master a try. This looks like another signature of running out of space for shared memory buffers.
Sorry, I wasn’t explicit on this point - I’m already using master, specifically
openmpi-master-201807120327-34bc777.tar.gz
And a bit more data on the stack traces, since the problem is non-deterministic. I’ve run 30 sets of 10 iterations of the code, and 8 crashed. In every case the final part of the stack trace was
Program terminated with signal 6, Aborted.
#0 0x0000003f5a432495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
#0 0x0000003f5a432495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003f5a433bfd in abort () at abort.c:121
#2 0x0000000002a3985e in for__issue_diagnostic ()
#3 0x0000000002a40786 in for__signal_handler ()
#4 <signal handler called>
#5 0x00002ae37088f029 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:208
#6 0x00002ae37089162e in mca_btl_vader_component_progress () at btl_vader_component.c:724
#7 0x00002ae35fa41311 in opal_progress () at runtime/opal_progress.c:229
#8 0x00002ae3724a11b7 in ompi_request_wait_completion (req=0xd2a4700) at ../../../../ompi/request/request.h:415
with some variation in the routines that lead to this point. In all cases the MPI call was some all-to-all style routine: all but one were "ompi_allreduce_f", and one was "ompi_alltoallv_z".

I can of course post all 8 stack traces if that’s useful.

Noam

Noam Bernstein
2018-07-11 21:25:25 UTC
Permalink
Post by Noam Bernstein
After more extensive testing it’s clear that it still happens with 2.1.3, but much less frequently. I’m going to try to get more detailed info with version 3.1.1, where it’s easier to reproduce.
objdump --debugging produces output consistent with no debugging symbols in the library .so files:
tin 1061 : objdump --debugging /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40

/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40: file format elf64-x86-64
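A couple of generic ELF-level checks that complement objdump (same library path as above):

  # does the library carry DWARF debug sections at all?
  readelf -S /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40 | grep -i debug

  # was the symbol table stripped?
  file /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40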


Noam
Jeff Squyres (jsquyres) via users
2018-07-11 22:12:16 UTC
Permalink
$ ompi_info | grep -i debug
Configure command line: '--prefix=/home/jsquyres/bogus' '--with-usnic' '--with-libfabric=/home/jsquyres/libfabric-current/install' '--enable-mpirun-prefix-by-default' '--enable-debug' '--enable-mem-debug' '--enable-mem-profile' '--disable-mpi-fortran' '--enable-debug' '--enable-mem-debug' '--enable-picky'
Internal debug support: yes
Memory debugging support: yes
C/R Enabled Debugging: no

That should tell you whether you have debug support or not.
Post by Noam Bernstein
After more extensive testing it’s clear that it still happens with 2.1.3, but much less frequently. I’m going to try to get more detailed info with version 3.1.1, where it’s easier to reproduce.
tin 1061 : objdump --debugging /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40
/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40: file format elf64-x86-64
Noam
--
Jeff Squyres
***@cisco.com
Nathan Hjelm via users
2018-07-11 22:16:45 UTC
Permalink
It might also be worth testing a master snapshot to see if that fixes the issue. There are a couple of fixes being backported from master to v3.0.x and v3.1.x now.
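A sketch of building such a snapshot into its own prefix (the prefix and parallelism are illustrative):

  # download a nightly master tarball from the Open MPI site
  # (names have the form openmpi-master-<date>-<hash>.tar.gz), then:
  tar xzf openmpi-master-<date>-<hash>.tar.gz
  cd openmpi-master-*/
  ./configure --prefix=$HOME/openmpi-master-debug --enable-debug
  make -j8 install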

-Nathan

Ben Menadue
2018-07-12 03:36:46 UTC
Permalink
Hi,

Perhaps related: we're seeing this one with 3.1.1. I'll see if I can get the application to run against our --enable-debug build.

Cheers,
Ben

[raijin7:1943 :0:1943] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x45)

/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c: [ append_frag_to_ordered_list() ]
...
118 * account for this rollover or the matching will fail.
119 * Extract the items from the list to order them safely */
120 if( hdr->hdr_seq < prior->hdr.hdr_match.hdr_seq ) {
==> 121 uint16_t d1, d2 = prior->hdr.hdr_match.hdr_seq - hdr->hdr_seq;
122 do {
123 d1 = d2;
124 prior = (mca_pml_ob1_recv_frag_t*)(prior->super.super.opal_list_prev);

==== backtrace ====
0 0x0000000000012d5f append_frag_to_ordered_list() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:121
1 0x0000000000013a06 mca_pml_ob1_recv_frag_callback_match() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:390
2 0x00000000000044ef mca_btl_vader_check_fboxes() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/opal/mca/btl/vader/../../../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
3 0x000000000000602f mca_btl_vader_component_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/opal/mca/btl/vader/../../../../../../../opal/mca/btl/vader/btl_vader_component.c:689
4 0x000000000002b554 opal_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
5 0x00000000000331cc ompi_sync_wait_mt() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/opal/../../../../opal/threads/wait_sync.c:85
6 0x000000000004a989 ompi_request_wait_completion() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/../../../../ompi/request/request.h:403
7 0x000000000004aa1d ompi_request_default_wait() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:42
8 0x00000000000d3486 ompi_coll_base_sendrecv_actual() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_util.c:59
9 0x00000000000d0d2b ompi_coll_base_sendrecv() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_util.h:67
10 0x00000000000d14c7 ompi_coll_base_allgather_intra_recursivedoubling() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_allgather.c:329
11 0x00000000000056dc ompi_coll_tuned_allgather_intra_dec_fixed() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:551
12 0x000000000006185d PMPI_Allgather() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-0/ompi/mpi/c/profile/pallgather.c:122
13 0x000000000004362c ompi_allgather_f() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/intel/debug-0/ompi/mpi/fortran/mpif-h/profile/pallgather_f.c:86
14 0x00000000005ed3cb comms_allgather_integer_0() /short/z00/aab900/onetep/src/comms_mod.F90:14795
15 0x0000000001309fe1 multigrid_bc_for_dlmg() /short/z00/aab900/onetep/src/multigrid_methods_mod.F90:270
16 0x0000000001309fe1 multigrid_initialise() /short/z00/aab900/onetep/src/multigrid_methods_mod.F90:174
17 0x0000000000f0c885 hartree_via_multigrid() /short/z00/aab900/onetep/src/hartree_mod.F90:181
18 0x0000000000a0c62a electronic_init_pot() /short/z00/aab900/onetep/src/electronic_init_mod.F90:1123
19 0x0000000000a14d62 electronic_init_denskern() /short/z00/aab900/onetep/src/electronic_init_mod.F90:334
20 0x0000000000a50136 energy_and_force_calculate() /short/z00/aab900/onetep/src/energy_and_force_mod.F90:1702
21 0x00000000014f46e7 onetep() /short/z00/aab900/onetep/src/onetep.F90:277
22 0x000000000041465e main() ???:0
23 0x000000000001ed1d __libc_start_main() ???:0
24 0x0000000000414569 _start() ???:0
===================
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
onetep.nci 0000000001DCC6DE Unknown Unknown Unknown
libpthread-2.12.s 00002B6D46ED07E0 Unknown Unknown Unknown
libmlx4-rdmav2.so 00002B6D570E3B18 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node raijin7 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Ben Menadue
2018-07-12 05:52:49 UTC
Permalink
Here’s what happens using a debug build:

[raijin7:22225] ompi_comm_peer_lookup: invalid peer index (2)
[raijin7:22225:0:22225] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)

/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_comm.h: [ mca_pml_ob1_peer_lookup() ]
...
75 mca_pml_ob1_comm_proc_t* proc = OBJ_NEW(mca_pml_ob1_comm_proc_t);
76 proc->ompi_proc = ompi_comm_peer_lookup (comm, rank);
77 OBJ_RETAIN(proc->ompi_proc);
==> 78 opal_atomic_wmb ();
79 pml_comm->procs[rank] = proc;
80 }
81 OPAL_THREAD_UNLOCK(&pml_comm->proc_lock);

==== backtrace ====
0 0x0000000000017505 mca_pml_ob1_peer_lookup() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_comm.h:78
1 0x0000000000019119 mca_pml_ob1_recv_frag_callback_match() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:361
2 0x00000000000052d7 mca_btl_vader_check_fboxes() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/opal/mca/btl/vader/../../../../../../../opal/mca/btl/vader/btl_vader_fbox.h:208
3 0x00000000000077fd mca_btl_vader_component_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/opal/mca/btl/vader/../../../../../../../opal/mca/btl/vader/btl_vader_component.c:689
4 0x000000000002ff90 opal_progress() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/opal/../../../../opal/runtime/opal_progress.c:228
5 0x000000000003b168 ompi_sync_wait_mt() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/opal/../../../../opal/threads/wait_sync.c:85
6 0x000000000005cd64 ompi_request_wait_completion() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/../../../../ompi/request/request.h:403
7 0x000000000005ce28 ompi_request_default_wait() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/../../../../ompi/request/req_wait.c:42
8 0x00000000001142d9 ompi_coll_base_sendrecv_zero() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_barrier.c:64
9 0x0000000000114763 ompi_coll_base_barrier_intra_recursivedoubling() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_barrier.c:215
10 0x0000000000004cad ompi_coll_tuned_barrier_intra_dec_fixed() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:212
11 0x00000000000831ac PMPI_Barrier() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/gcc/debug-1/ompi/mpi/c/profile/pbarrier.c:63
12 0x0000000000044041 ompi_barrier_f() /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.1/build/intel/debug-1/ompi/mpi/fortran/mpif-h/profile/pbarrier_f.c:76
13 0x00000000005c79de comms_barrier() /short/z00/aab900/onetep/src/comms_mod.F90:1543
14 0x00000000005c79de comms_bcast_logical_0() /short/z00/aab900/onetep/src/comms_mod.F90:10756
15 0x0000000001c21509 utils_devel_code_logical() /short/z00/aab900/onetep/src/utils_mod.F90:2646
16 0x0000000001309ddb multigrid_bc_for_dlmg() /short/z00/aab900/onetep/src/multigrid_methods_mod.F90:260
17 0x0000000001309ddb multigrid_initialise() /short/z00/aab900/onetep/src/multigrid_methods_mod.F90:174
18 0x0000000000f0c885 hartree_via_multigrid() /short/z00/aab900/onetep/src/hartree_mod.F90:181
19 0x0000000000a0c62a electronic_init_pot() /short/z00/aab900/onetep/src/electronic_init_mod.F90:1123
20 0x0000000000a14d62 electronic_init_denskern() /short/z00/aab900/onetep/src/electronic_init_mod.F90:334
21 0x0000000000a50136 energy_and_force_calculate() /short/z00/aab900/onetep/src/energy_and_force_mod.F90:1702
22 0x00000000014f46e7 onetep() /short/z00/aab900/onetep/src/onetep.F90:277
23 0x000000000041465e main() ???:0
24 0x000000000001ed1d __libc_start_main() ???:0
25 0x0000000000414569 _start() ???:0
===================