I've recompiled 3.1.1 with --enable-debug --enable-mem-debug, and I still get no detailed information from the MPI libraries, only from VASP (as before):
ldd (at runtime, so I'm fairly sure it's referring to the right executable and LD_LIBRARY_PATH) info:
vexec /usr/local/vasp/bin/5.4.4/0test/vasp.gamma_para.intel
linux-vdso.so.1 => (0x00007ffd869f6000)
libmkl_intel_lp64.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b0b70015000)
libmkl_sequential.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_sequential.so (0x00002b0b70a56000)
libmkl_core.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/mkl/lib/intel64/libmkl_core.so (0x00002b0b717ef000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000366a000000)
libmpi_usempif08.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_usempif08.so.40 (0x00002b0b732f3000)
libmpi_usempi_ignore_tkr.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_usempi_ignore_tkr.so.40 (0x00002b0b73535000)
libmpi_mpifh.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40 (0x00002b0b73737000)
libmpi.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libmpi.so.40 (0x00002b0b73991000)
libm.so.6 => /lib64/libm.so.6 (0x0000003f5b400000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f5ac00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003f5a800000)
libc.so.6 => /lib64/libc.so.6 (0x0000003f5a400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003669800000)
/lib64/ld-linux-x86-64.so.2 (0x0000003f5a000000)
libopen-rte.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libopen-rte.so.40 (0x00002b0b73d48000)
libopen-pal.so.40 => /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40 (0x00002b0b74066000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003f5bc00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003f5b000000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003f6c000000)
libz.so.1 => /lib64/libz.so.1 (0x0000003f5b800000)
libifport.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifport.so.5 (0x00002b0b743b8000)
libifcore.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifcore.so.5 (0x00002b0b745e7000)
libimf.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libimf.so (0x00002b0b74948000)
libsvml.so => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libsvml.so (0x00002b0b74e35000)
libintlc.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libintlc.so.5 (0x00002b0b75d40000)
libifcoremt.so.5 => /usr/local/intel/compilers_and_libraries_2017.2.174/linux/compiler/lib/intel64/libifcoremt.so.5 (0x00002b0b75faa000)
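One way to double-check by script that every Open MPI library in the ldd output above resolves under the debug prefix is sketched below. This is only an illustration: the function name `mpi_libs_outside_prefix`, the list of library name stems, and the stray `/usr/lib64/libmpi.so.40` path in the demo are assumptions made for the example, not anything taken from the actual system.

```python
import re

# Matches ldd lines of the form "name => /path/to/lib (0xADDRESS)".
# Lines without a resolved path (e.g. linux-vdso) or without "=>" are skipped.
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def mpi_libs_outside_prefix(ldd_text, prefix,
                            stems=("libmpi", "libopen-rte", "libopen-pal")):
    """Return (name, path) pairs for MPI-related libraries not under prefix.

    The default stems are an assumption about which library names matter.
    """
    wanted = prefix.rstrip("/") + "/"
    stray = []
    for line in ldd_text.splitlines():
        match = LDD_LINE.match(line)
        if match is None:
            continue
        name, path = match.groups()
        if name.startswith(tuple(stems)) and not path.startswith(wanted):
            stray.append((name, path))
    return stray


if __name__ == "__main__":
    prefix = "/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080"
    sample = "\n".join([
        "linux-vdso.so.1 => (0x00007ffd869f6000)",
        "libmpi.so.40 => " + prefix + "/lib/libmpi.so.40 (0x00002b0b73991000)",
        "libopen-pal.so.40 => " + prefix + "/lib/libopen-pal.so.40 (0x00002b0b74066000)",
        "libc.so.6 => /lib64/libc.so.6 (0x0000003f5a400000)",
    ])
    # An empty list means every matched MPI library lives under the prefix.
    print(mpi_libs_outside_prefix(sample, prefix))
    # Hypothetical stray entry, only to show what a mismatch would look like.
    bad = sample + "\nlibmpi.so.40 => /usr/lib64/libmpi.so.40 (0x00002b0b00000000)"
    print(mpi_libs_outside_prefix(bad, prefix))
```

In practice the text would come from running ldd inside the job itself, so that the runtime LD_LIBRARY_PATH is the one being checked.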
ompi_info (using the same path as indicated by the ldd output):
tin 1125 : /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/bin/ompi_info | grep debug
Prefix: /usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080
Configure command line: '--prefix=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080' '--with-tm=/usr/local/torque' '--enable-mpirun-prefix-by-default' '--with-verbs=/usr' '--with-verbs-libdir=/usr/lib64' '--enable-debug' '--enable-mem-debug'
Internal debug support: yes
Memory debugging support: yes
resulting stack trace:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
vasp.gamma_para.i 0000000002DCE8C1 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002DCC9FB Unknown Unknown Unknown
vasp.gamma_para.i 0000000002D409E4 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002D407F6 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002CDCED9 Unknown Unknown Unknown
vasp.gamma_para.i 0000000002CE3DB6 Unknown Unknown Unknown
libpthread-2.12.s 0000003F5AC0F7E0 Unknown Unknown Unknown
mca_btl_vader.so 00002AD17AC74CB8 Unknown Unknown Unknown
mca_btl_vader.so 00002AD17AC770F5 Unknown Unknown Unknown
libopen-pal.so.40 00002AD168B816A4 opal_progress Unknown Unknown
libmpi.so.40.10.1 00002AD1684D0D75 Unknown Unknown Unknown
libmpi.so.40.10.1 00002AD1684D0DB8 ompi_request_defa Unknown Unknown
libmpi.so.40.10.1 00002AD168571EBE ompi_coll_base_se Unknown Unknown
libmpi.so.40.10.1 00002AD1685724B8 Unknown Unknown Unknown
libmpi.so.40.10.1 00002AD168573514 ompi_coll_base_al Unknown Unknown
mca_coll_tuned.so 00002AD17CD6C852 ompi_coll_tuned_a Unknown Unknown
libmpi.so.40.10.1 00002AD1684EE969 PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00002AD1682595B7 mpi_allreduce_ Unknown Unknown
vasp.gamma_para.i 000000000042D1ED m_sum_d_ 1300 mpi.F
vasp.gamma_para.i 0000000001BD5293 david_mp_eddav_.R 778 davidson.F
vasp.gamma_para.i 0000000001D2179E elmin_.R 424 electron.F
vasp.gamma_para.i 0000000002B92452 vamp_IP_electroni 4783 main.F
vasp.gamma_para.i 0000000002B6E173 MAIN__ 2800 main.F
vasp.gamma_para.i 000000000041325E Unknown Unknown Unknown
libc-2.12.so 0000003F5A41ED1D __libc_start_main Unknown Unknown
vasp.gamma_para.i 0000000000413169 Unknown Unknown Unknown
I've checked ulimit -s (at runtime), and it is unlimited.
I'm going to try the 3.1.x 20180710 nightly snapshot next.
Let me ask the source of the VASP inputs about sharing them. Note that the crash really only happens at an appreciable rate when running on 128 tasks (8x16 core nodes), and even then, if I do a 10 geometry step run, only in about 1/3 of all runs, so it's not a completely trivial amount of resources to reproduce.
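To put a rough number on that, here is a small sketch of how many 10-step runs it would take to see the crash at least once. It assumes each run fails independently with probability about 1/3, which is only my estimate from the runs so far. The function name `runs_needed` is made up for this example.

```python
import math


def runs_needed(p_crash, confidence):
    """Smallest n such that 1 - (1 - p_crash)**n >= confidence.

    Assumes runs are independent and each crashes with probability p_crash.
    """
    if not (0.0 < p_crash < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_crash and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_crash))


if __name__ == "__main__":
    p = 1.0 / 3.0  # observed crash rate per 10 geometry step run (approximate)
    for conf in (0.90, 0.95, 0.99):
        n = runs_needed(p, conf)
        print(f"{conf:.0%} chance of at least one crash: {n} runs on 128 tasks")
```

Under those assumptions this gives 6, 8 and 12 runs for 90%, 95% and 99% confidence, each run occupying all 128 tasks.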
Noam