Discussion:
[OMPI users] Double free or corruption with OpenMPI 2.0
ashwin .D
2017-06-13 11:54:09 UTC
Hello,
I am using OpenMPI 2.0.0 with a computational fluid dynamics
code and I am encountering a series of errors when running it with
mpirun. This is my lscpu output:

CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1

I am running OpenMPI's mpirun in the following way:

mpirun -np 4 cfd_software

and I get double free or corruption every single time.

I have two questions -

1) I am unable to capture the standard error that mpirun throws in a file.
How can I go about capturing the standard error of mpirun? (See the
redirection sketch below.)

2) Has this error, i.e. double free or corruption, been reported by
others? Is there a bug fix available?
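Regarding question 1, I assume plain shell redirection along these lines
should capture the output (the file names are only placeholders), but so far
I have not managed to capture it:

# placeholder file names; stdout goes to cfd.out, stderr to cfd.err
mpirun -np 4 cfd_software > cfd.out 2> cfd.err

# or merge both streams into one log while still printing them
mpirun -np 4 cfd_software 2>&1 | tee cfd.log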

Regards,

Ashwin.
ashwin .D
2017-06-13 12:22:02 UTC
Also, when I build and then run make check I get the errors below. Am I
clear to proceed, or is my installation broken? This is on Ubuntu 16.04
LTS.

==================================================
Open MPI 2.1.1: test/datatype/test-suite.log
==================================================

# TOTAL: 9
# PASS: 8
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: external32
================

/home/t/openmpi-2.1.1/test/datatype/.libs/lt-external32: symbol lookup
error: /home/openmpi-2.1.1/test/datatype/.libs/lt-external32: undefined
symbol: ompi_datatype_pack_external_size
Jeff Hammond
2017-06-13 14:00:36 UTC
If you are not using external32 in your datatype code, this failure doesn't
matter. I don't think most implementations support external32 anyway...

A double free indicates an application error. Such errors are possible but
extremely rare inside MPI libraries. In my experience, the incidence of
applications corrupting memory is about a million times higher than that of
MPI libraries doing so.

Jeff
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Jeff Squyres (jsquyres)
2017-06-13 15:50:20 UTC
I'm a little confused -- you said you're running Open MPI 2.0.0 but you're running the 2.1.1 tests.

Can you send all the information listed here:

https://www.open-mpi.org/community/help/
--
Jeff Squyres
***@cisco.com
ashwin .D
2017-06-14 10:31:11 UTC
Hello,
I found a thread about Intel MPI (although I am using gfortran
4.8.5 and OpenMPI 2.1.1) -
https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/564266
- but the error the OP gets is the same as mine:

*** glibc detected *** ./a.out: double free or corruption (!prev):
0x00007fc6d0000c80 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3411e75e66]
/lib64/libc.so.6[0x3411e789b3]

So the explanation given in that post is this -
"From their examination our Development team concluded the underlying
problem with openmpi 1.8.6 resulted from mixing out-of-date/incompatible
Fortran RTLs. In short, there were older static Fortran RTL bodies
incorporated in the openmpi library that when mixed with newer Fortran RTL
led to the failure. They found the issue is resolved in the newer
openmpi-1.10.1rc2 and recommend resolving requires using a newer openmpi
release with our 15.0 (or newer) release." Could this be possible with my
version as well?


I am willing to debug this, provided I am given some clue on how to approach
the problem. At the moment I am unable to proceed further; the only thing I
can add is that I ran tests with the sequential form of my application, and
it is much slower, even though I am using shared memory and all the cores
are on the same machine.

Best regards,
Ashwin.
g***@rist.or.jp
2017-06-14 13:54:07 UTC
Hi,

First, I suggest you decide which Open MPI version you want to use.

The most up-to-date versions are 2.0.3 and 2.1.1.

Then please provide all the info Jeff previously requested.

Ideally, you would write a simple, standalone program that exhibits
the issue, so we can reproduce and investigate it (a minimal skeleton is
sketched below).
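For example, a skeleton of such a reproducer could look like the sketch
below; the allreduce is only a placeholder, and you would replace it with
the smallest piece of your CFD code's communication pattern that still
crashes:

program repro
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  real(8) :: local_sum, global_sum

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! placeholder work: replace with the failing communication pattern
  local_sum = real(rank, 8)
  call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, 'global sum = ', global_sum
  call MPI_Finalize(ierr)
end program repro

built and run with something like

mpifort repro.f90 -o repro
mpirun -np 4 ./repro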

If not, I suggest you use another MPI library (MVAPICH, Intel MPI or
any MPICH-based MPI) and see if the issue is still there.

If the double free error still occurs, it is very likely that the issue
comes from your application and not the MPI library.

If you have a parallel debugger such as Allinea DDT, you can run
your program under the debugger with thorough memory debugging. The
program will halt when the memory corruption occurs, and this will be a
hint (app issue vs MPI issue).

If you did not configure Open MPI with --enable-debug, then please do so
and try again; you will increase the likelihood of trapping such a memory
corruption error earlier, and you will get a clean Open MPI stack trace if
a crash occurs (an example rebuild is sketched below).
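For reference, such a rebuild could look like the following; the install
prefix and job count are only examples, adjust them to your setup:

# example only: adjust the prefix and -j value to your machine
cd openmpi-2.1.1
./configure --prefix=$HOME/openmpi-2.1.1-debug --enable-debug
make -j 4 && make install
export PATH=$HOME/openmpi-2.1.1-debug/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-2.1.1-debug/lib:$LD_LIBRARY_PATH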

You might also want to try

mpirun --mca btl tcp,self ...

and see if you get a different behavior. This will only use TCP for
inter-process communication, which is much easier to debug than shared
memory or RDMA.
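Combined with the command from your first message, that would be, for
example:

mpirun --mca btl tcp,self -np 4 cfd_software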

Cheers,

Gilles

Jeff Hammond
2017-06-14 20:05:48 UTC
The "error *** glibc detected *** $(PROGRAM): double free or corruption" is
ubiquitous and rarely has anything to do with MPI.


As Gilles said, use a debugger to figure out why your application is
corrupting the heap.
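One common tool for this (not mentioned elsewhere in this thread, so treat
it as a suggestion rather than something the Open MPI developers asked for)
is valgrind. A rough sketch, where the log-file name is only a placeholder:

# each rank writes its own vg.<pid>.log (valgrind expands %p to the PID)
mpirun -np 4 valgrind --track-origins=yes --log-file=vg.%p.log ./cfd_software

An invalid free or an out-of-bounds write should then show up in those logs
with a stack trace pointing into the application.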


Jeff
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
ashwin .D
2017-06-15 07:36:15 UTC
Hello Jeff and Gilles,
I just logged in to see the archives, and this message from Gilles -
https://www.mail-archive.com/***@lists.open-mpi.org//msg31219.html - and
this message from Jeff -
https://www.mail-archive.com/***@lists.open-mpi.org//msg31217.html - are
very useful. Please give me a couple of days to implement some of the ideas
that you both have suggested, and allow me to get back to you.

Best regards,
Ashwin