[OMPI users] coredump about MPI
g***@buaa.edu.cn
2017-03-02 14:19:51 UTC
Hi developers and users,
I have a question about core dumps of MPI programs. I have two nodes. When the program is run on either node by itself,
it produces the core file correctly (to trigger a core dump, the program contains a divide-by-zero operation).
But when I run the program across both nodes, if the illegal operation happens on a node other than the one used to execute
the "mpirun" command, no core dump file is produced.
I have checked "ulimit -c" and so on, but still cannot figure it out.
Thanks a lot for your help, and best regards!

-------------------------------------
Eric
Jeff Squyres (jsquyres)
2017-03-02 15:34:56 UTC
A few suggestions:

1. Look for the core files in directories where you might not expect:
- your $HOME (particularly if your $HOME is not a networked filesystem)
- in /cores
- in the pwd where the executable was launched on that machine

2. If multiple processes will be writing core files in the same directory, make sure that they don't write to the same filename (you'll likely end up with a single corrupt corefile). For example, on Linux, you can (as root) run 'echo "core.%e-%t-%p" >/proc/sys/kernel/core_pattern' to get a unique corefile for each process and host (this is what I use on my development cluster).

3. If you are launching via a resource scheduler (e.g., SLURM, Torque, etc.), the scheduler may be resetting the corefile limit back down to zero before launching your job. If this is what is happening, it may be a little tricky to override this because the scheduler will likely do it *on each node*, and therefore you likely need to override it *in each MPI process* (via setrlimit(2)).
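
As a rough illustration of suggestion 3, here is a minimal sketch of raising the corefile limit from inside each process with getrlimit(2)/setrlimit(2). The helper name raise_core_limit is hypothetical (it is not part of Open MPI), and the sketch assumes the hard limit is still non-zero; an unprivileged process cannot raise its soft limit above the hard limit, so if the scheduler lowered the hard limit too, this will not help.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit of the
 * calling process to its hard limit. Returns 0 on success, -1 on error. */
int raise_core_limit(void)
{
    struct rlimit rl;

    /* Read the current soft and hard limits for core files. */
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }

    /* The soft limit may be raised up to, but not beyond, the hard limit. */
    rl.rlim_cur = rl.rlim_max;

    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return -1;
    }
    return 0;
}
```

Each MPI process would call this early in main(), before the code that is expected to crash, so that the limit is changed on every node rather than only where mpirun was started.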
--
Jeff Squyres
***@cisco.com