Discussion:
[OMPI users] MPI_ABORT was invoked on rank 0 in communicator compute with errorcode 59
Gilles Gouaillardet
2016-11-16 17:01:17 UTC
Hi,

With DDT, you can use offline debugging just to find where the program crashes:
ddt -n 8 --offline a.out ...
You might also want to try the reverse connect feature.
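
In a job script like the one in this thread, that could look roughly like the sketch below. It uses only the options from the command above (`-n 8 --offline`). `run_offline_debug` is a hypothetical wrapper name, and the exact ddt syntax and where the report ends up depend on the installed version, so check the DDT documentation on the cluster.

```shell
# Hypothetical wrapper: run a program under DDT in offline (non-interactive) mode
# if ddt is on the PATH; otherwise report the problem and return 127.
run_offline_debug() {
    if ! command -v ddt >/dev/null 2>&1; then
        echo "ddt not found on PATH; load the debugger module first" >&2
        return 127
    fi
    # -n 8 and --offline come from the command above;
    # "$@" is the program to debug followed by its own arguments.
    ddt -n 8 --offline "$@"
}

# Example, mirroring the mpirun line of the job script (not run here):
# run_offline_debug carp.debug.petsc.pt +F parameters_ECG_adjust.par
```

The wrapper fails early with a clear message when the debugger module is not loaded, rather than letting the batch job die with a bare "command not found".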

Cheers,

Gilles
Hi Gus,
#!/bin/bash
#PBS -N myjob
#PBS -l nodes=1:ppn=8
#PBS -l walltime=120:00:00
#PBS -l pvmem=2000MB
module load openmpi/2.0.0
cd /cluster/home/t48263uhn/Carp/PlosOneData/
mpirun -np 8 carp.debug.petsc.pt +F /cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par
mpirun -tv -np 8 carp.debug.petsc.pt +F /cluster/home/t48263uhn/Carp/PlosOneData/parameters_ECG_adjust.par
"This version of Open MPI is known to have a problem using the "--debug"
option to mpirun, and has therefore disabled it. This functionality will
be restored in a future version of Open MPI.
Please see https://github.com/open-mpi/ompi/issues/1225 for details."
I believe there is an older version of Open MPI on the system, but the system admin asked me not to use it.
I may try that and report the results. I have also attached the missing files in gzip format.
Thanks,
Ali
________________________________________
Sent: Tuesday, November 15, 2016 5:42 PM
To: Open MPI Users
Subject: Re: [OMPI users] MPI_ABORT was invoked on rank 0 in communicator compute with errorcode 59
Hi Mohammadali
"Signal number 11 SEGV" is the Unix/Linux signal for a memory
violation (a.k.a. segmentation violation or segmentation fault).
This normally happens when the program tries to read
or write a memory area that it did not allocate, has already
freed, or that belongs to another process.
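
To make the signal number concrete: when a process is killed by a signal, bash reports an exit status of 128 plus the signal number, so a SIGSEGV (11 on Linux) shows up as 139. A minimal sketch follows; `signal_from_status` is a hypothetical helper, not part of Open MPI or PETSc. Note that in this thread PETSc catches the signal itself and calls MPI_Abort, so the status reported by mpirun may differ.

```shell
# Hypothetical helper: map a shell exit status to the signal that killed the process.
# bash reports "128 + signal number" when a child dies from a signal.
signal_from_status() {
    local status="$1"
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0   # ordinary exit, no signal involved
    fi
}

# Example: a child shell sends itself SIGSEGV; capture its exit status.
status=0
bash -c 'kill -SEGV $$' || status=$?
sig=$(signal_from_status "$status")
echo "exit status $status -> signal $sig"
```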
That is most likely a programming error in the FEM code,
probably not an MPI error, and probably not a PETSc error either.
The "errorcode 59" seems to be the PETSc error code
issued when it receives a signal (in this case a
segmentation fault signal, I guess) from the operating
system (Linux, probably).
Apparently it simply throws the error message and
calls MPI_Abort, and the program stops.
#define PETSC_ERR_SIG 59 /* signal received */
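
One way to confirm what a code like this means is to search the PETSc error header for the matching `#define`. A rough sketch, assuming the codes live in `include/petscerror.h` under `$PETSC_DIR` (the location may differ between PETSc versions and installs); `petsc_errcode_name` is a hypothetical helper.

```shell
# Hypothetical helper: print the PETSC_ERR_* macro(s) defined with a given numeric code.
# Usage: petsc_errcode_name <header file> <code>
petsc_errcode_name() {
    local header="$1"
    local code="$2"
    # Match lines such as: #define PETSC_ERR_SIG 59 /* signal received */
    awk -v code="$code" \
        '$1 == "#define" && $2 ~ /^PETSC_ERR_/ && $3 == code { print $2 }' \
        "$header"
}

# Example (the path is an assumption; adjust to the local PETSc install):
# petsc_errcode_name "$PETSC_DIR/include/petscerror.h" 59
```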
**
One suggestion is to compile the code with debugging flags (-g),
and attach a debugger to it. This is not an easy task if your program
has many processes/ranks and your debugger is the default
Linux gdb, but it is not impossible either.
Depending on the computer you have, you may have a parallel debugger,
such as TotalView or DDT, which are more user friendly.
You could also compile it with the flag -traceback
(or -fbacktrace, the syntax depends on the compiler, check the compiler
man page).
This at least will tell you the location in the program where the
segmentation fault happened (in the STDERR file of your job).
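
To find that location after the job ends, the stderr file can be searched for the signal message and any traceback lines. A small sketch, assuming the PBS default naming `<jobname>.e<jobid>` for the stderr file (so `myjob.e*` for the script in this thread); `show_crash_lines` is a hypothetical helper, and the search patterns are guesses at typical output rather than an exhaustive list.

```shell
# Hypothetical helper: print lines that look like crash reports from job stderr files.
# Usage: show_crash_lines <file>...
show_crash_lines() {
    # -i: ignore case, -n: show line numbers, -E: extended regular expressions
    grep -inE 'caught signal|SEGV|segmentation|backtrace|traceback' "$@"
}

# Example (file name pattern is an assumption based on "#PBS -N myjob"):
# show_crash_lines myjob.e*
```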
I hope this helps.
Gus Correa
PS - The zip attachment with your "myjob.sh" script
was removed from the email.
Many email server programs remove zip for safety.
Files with ".sh" suffix are also removed in general.
You could compress it with gzip or bzip2 instead.
Hi,
I am running simulations in software that uses Open MPI to solve an FEM
problem. From time to time I receive the error
“MPI_ABORT was invoked on rank 0 in communicator compute with errorcode
59” in the output file, while for the larger simulations (with a larger FEM
mesh) I almost always get this error. I have no idea what causes
this error. The error file contains a PETSC error: “caught
signal number 11 SEGV”. I am running my jobs on an HPC system which has
Open MPI version 2.0.0. I am also using a bash file (myjob.sh), which is
attached. The outputs of the ompi_info --all and ifconfig commands
are also attached. I would appreciate any help in this regard.
Thanks
Ali
**************************
Mohammadali Beheshti
Post-Doctoral Fellow
Department of Medicine (Cardiology)
Toronto General Research Institute
University Health Network
Tel: 416-340-4800 ext. 6837
**************************
This e-mail may contain confidential and/or privileged information for
the sole use of the intended recipient.
Any review or distribution by anyone other than the person for whom it
was originally intended is strictly prohibited.
If you have received this e-mail in error, please contact the sender and
delete all copies.
Opinions, conclusions or other information contained in this e-mail may
not be that of the organization.
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users