Discussion:
[OMPI users] Double free or corruption problem updated result
ashwin .D
2017-06-17 13:50:45 UTC
Hello Gilles,
I am enclosing all the information you requested.

1) I enclose the log file as an attachment.
2) I rebuilt OpenMPI 2.1.1 with the --enable-debug option and
reinstalled it in /usr/lib/local.
I ran all the examples in the examples directory. All passed except
oshmem_strided_puts, where I got this message:

[[48654,1],0][pshmem_iput.c:70:pshmem_short_iput] Target PE #1 is not in
valid range
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 13409, host=a-Vostro-3800) with
errorcode -1.
--------------------------------------------------------------------------


3) I deleted all old OpenMPI versions under /usr/local/lib.
4) I am using the COSMO weather model (http://www.cosmo-model.org/) to run
simulations.
The support staff claim they have seen no errors with a similar setup. They
use

1) gfortran 4.8.5
2) OpenMPI 1.10.1

The only difference is I use OpenMPI 2.1.1.

5) I also tried running with mpirun --mca btl tcp,self -np 4 cosmo and
got the same error as in the mpi_logs file.

6) Regarding compiler and linking options on Ubuntu 16.04

mpif90 --showme:compile and --showme:link give me the options for compiling
and linking.

Here are the linking options from my makefile:

-pthread -lmpi_usempi -lmpi_mpifh -lmpi

7) I have a 64 bit OS.

I think I have responded to all of your questions. If I have not, please
let me know and I will respond ASAP. The only thing I have not done is
look into /usr/local/include. I saw some old OpenMPI files there. If those
need to be deleted, I will do so after I hear from you.

Best regards,
Ashwin.
ashwin .D
2017-06-18 02:41:39 UTC
There is a sequential version of the same program COSMO (no reference to
MPI) that I can run without any problems. Of course it takes a lot longer
to complete. Now I also ran valgrind (not sure whether that is useful or
not) and I have enclosed the logs.
Gilles Gouaillardet
2017-06-18 03:20:53 UTC
Ashwin,

did you try to run your app with a MPICH-based library (mvapich,
IntelMPI or even stock mpich) ?
or did you try with Open MPI v1.10 ?
the stacktrace does not indicate the double free occurs in MPI...

it seems you ran valgrind on a shell and not on your binary.
assuming your mpirun command is
mpirun lmparbin_all
i suggest you try again with
mpirun --tag-output valgrind lmparbin_all
that will generate one valgrind log per task, but these are prefixed
so it should be easier to figure out what is going wrong
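
A minimal sketch of how the tagged output could be separated per task afterwards, assuming every line carries a prefix of the form [job,rank]<stream>: as in the valgrind excerpts quoted in this thread. The file names (tagged.log, rank_N.log) and the sample lines for rank 1 are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample of tagged output. A real run would capture it with
# something like:
#   mpirun --tag-output valgrind lmparbin_all > tagged.log 2>&1
workdir=$(mktemp -d)
cat > "$workdir/tagged.log" <<'EOF'
[1,0]<stderr>:==4683== Invalid read of size 4
[1,1]<stderr>:==4684== Invalid read of size 4
[1,0]<stderr>:==4683==    at 0x795DC2: __src_input_MOD_organize_input
[1,1]<stdout>:some program output
EOF

# Write the lines of each rank to rank_<N>.log with the tag stripped.
awk -v dir="$workdir" '
  match($0, /^\[[0-9]+,[0-9]+\]<std(out|err)>:/) {
    rank = substr($0, 1, RLENGTH)      # e.g. "[1,0]<stderr>:"
    sub(/^\[[0-9]+,/, "", rank)        # drop the job id part
    sub(/\].*$/, "", rank)             # keep only the rank number
    print substr($0, RLENGTH + 1) > (dir "/rank_" rank ".log")
  }
' "$workdir/tagged.log"

ls "$workdir"
```

Each rank_N.log then holds only the lines of one task, which makes it easier to read a single valgrind report from start to end.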

Cheers,

Gilles
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
ashwin .D
2017-06-18 05:41:37 UTC
Hello Gilles,

First of all, I am extremely grateful for this communication from you on a
weekend, and only a few hours after I posted my email. I am not sure I can
go on posting log files, as you rightly point out that MPI is not the
source of the problem. Still, I have enclosed the valgrind log files as you
requested. I have downloaded the MPICH packages as you suggested and I am
going to install them shortly. But before I do that, I think I have a clue
to the source of my problem (double free or corruption), and I would really
appreciate your advice.

As I mentioned before, COSMO has been compiled with mpif90 for shared
memory usage and with gfortran for sequential access. But it depends on a
lot of external third-party software such as zlib, libcurl, hdf5, netcdf
and netcdf-fortran. When I looked at the config.log of those packages, all
of them had been compiled with gfortran and gcc, and in some cases g++,
with the enable-shared option. So my question is: could that be a source of
the "mismatch"?

In other words, I would have to recompile all those packages with mpif90
and mpicc and then try another test. At the very least there should be no
mixing of gcc/gfortran compiled code with mpif90 compiled code. Comments?
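
One way to check the premise of this question, as a hedged sketch: Open MPI's mpif90 and mpicc are wrapper compilers, and their --showme:command option prints the underlying compiler they invoke. If that turns out to be the same gfortran/gcc that built the third-party libraries, the code comes from the same compiler either way. The helper name show_underlying is hypothetical, and the sketch assumes Open MPI wrappers (other MPI implementations use different flags).

```shell
#!/bin/sh
# Hypothetical helper: print the compiler that an Open MPI wrapper invokes.
show_underlying() {
  wrapper=$1
  if command -v "$wrapper" >/dev/null 2>&1; then
    # --showme:command is an Open MPI wrapper option; fall back to a
    # message if the installed wrapper does not understand it.
    "$wrapper" --showme:command 2>/dev/null \
      || echo "$wrapper: --showme:command not supported"
  else
    echo "$wrapper: not found"
  fi
}

show_underlying mpif90
show_underlying mpicc
```

The printed names can then be compared with the compilers recorded in the config.log of each library.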


Best regards,
Ashwin.
Gilles Gouaillardet
2017-06-19 07:13:07 UTC
Ashwin,


the valgrind logs clearly indicate you are trying to access some memory
that was already free'd


for example

[1,0]<stderr>:==4683== Invalid read of size 4
[1,0]<stderr>:==4683== at 0x795DC2: __src_input_MOD_organize_input (src_input.f90:2318)
[1,0]<stderr>:==4683== Address 0xb4001d0 is 0 bytes inside a block of size 24 free'd
[1,0]<stderr>:==4683== by 0x63F3690: free_NC_var (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]<stderr>:==4683== by 0x63BB431: nc_close (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]<stderr>:==4683== by 0x435A9F: __io_utilities_MOD_close_file (io_utilities.f90:995)
[1,0]<stderr>:==4683== Block was alloc'd at
[1,0]<stderr>:==4683== by 0x63F378C: new_x_NC_var (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]<stderr>:==4683== by 0x63BAF85: nc_open (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,0]<stderr>:==4683== by 0x547E6F6: nf_open_ (nf_control.F90:189)

so the double-free error could be a side effect of this.
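
Read together, the frames above say that organize_input (src_input.f90:2318) reads a block that netcdf allocated in nc_open and already released in nc_close, called from close_file (io_utilities.f90:995). A rough sketch of how such reports could be tallied by source location across all tasks, assuming each valgrind frame sits on one line; the second report (rank 1, pid 4684) is hypothetical sample data.

```shell
#!/bin/sh
# Sample tagged valgrind output; the rank 1 lines are made up for illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
[1,0]<stderr>:==4683== Invalid read of size 4
[1,0]<stderr>:==4683==    at 0x795DC2: __src_input_MOD_organize_input (src_input.f90:2318)
[1,0]<stderr>:==4683==  Address 0xb4001d0 is 0 bytes inside a block of size 24 free'd
[1,0]<stderr>:==4683==    by 0x63F3690: free_NC_var (in /usr/local/lib/libnetcdf.so.11.0.3)
[1,1]<stderr>:==4684== Invalid read of size 4
[1,1]<stderr>:==4684==    at 0x795DC2: __src_input_MOD_organize_input (src_input.f90:2318)
EOF

# Count invalid reads/writes by the location of the first "at" frame.
summary=$(awk '
  /Invalid (read|write) of size/ { want = 1; next }
  want && / at 0x/ {
    loc = $NF                 # last field looks like "(file:line)"
    gsub(/[()]/, "", loc)     # strip the parentheses
    count[loc]++
    want = 0
  }
  END { for (loc in count) print count[loc], loc }
' "$log" | sort -rn)

echo "$summary"
```

For the sample above this prints "2 src_input.f90:2318", pointing at the line in the application to inspect first.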

at this stage, i suggest you fix your application, and see if it
resolves your issue.
(i.e. there is no need to try another MPI library and/or version for now)

Cheers,

Gilles