Discussion:
[OMPI users] Java-OpenMPI returns with SIGSEGV
Gundram Leifert
2016-07-05 14:58:49 UTC
Hello,

I am trying to send many byte arrays via broadcast. After a certain number of
repetitions the process either hangs or crashes with a SIGSEGV. Can anyone
help me solve this problem?

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

    public static void log(String msg) {
        try {
            System.err.println(String.format("%2d/%2d:%s",
                    MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
        } catch (MPIException ex) {
            System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
        }
    }

    private static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (int i = 0; i < bytearray.length; i++) {
            byte b = bytearray[i];
            hash = hash * 7 + (int) b;
        }
        return hash;
    }

    public static void main(String args[]) throws MPIException {
        log("start main");
        MPI.Init(args);
        try {
            log("initialized done");
            byte[] saveMem = new byte[100000000];
            MPI.COMM_WORLD.barrier();
            Random r = new Random();
            r.nextBytes(saveMem);
            if (MPI.COMM_WORLD.getRank() == 0) {
                for (int i = 0; i < 1000; i++) {
                    saveMem[r.nextInt(saveMem.length)]++;
                    log("i = " + i);
                    int[] lengthData = new int[]{saveMem.length};
                    log("object hash = " + hashcode(saveMem));
                    log("length = " + lengthData[0]);
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0] + ")");
                    MPI.COMM_WORLD.barrier();
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                }
                MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
            } else {
                while (true) {
                    int[] lengthData = new int[1];
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0] + ")");
                    if (lengthData[0] == 0) {
                        break;
                    }
                    MPI.COMM_WORLD.barrier();
                    saveMem = new byte[lengthData[0]];
                    MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                    log("object hash = " + hashcode(saveMem));
                }
            }
            MPI.COMM_WORLD.barrier();
        } catch (MPIException ex) {
            System.out.println("caught error." + ex);
            log(ex.getMessage());
        } catch (RuntimeException ex) {
            System.out.println("caught error." + ex);
            log(ex.getMessage());
        } finally {
            MPI.Finalize();
        }
    }
}
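
To build and run the test I use the Java wrapper compiler and mpirun, roughly
like this (the classpath and the number of processes below are only
illustrative):

mpijavac TestSendBigFiles.java
mpirun -np 3 java -cp .:$HOME/ompi/lib/mpi.jar TestSendBigFiles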


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# #
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable
core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable
core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code: (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code: (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0]
/usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on
signal 6 (Aborted).


########CONFIGURATION:
I used the Open MPI master sources from GitHub:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp>
Date: Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
--disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram
Jason Maldonis
2016-07-05 16:05:51 UTC
After reading your thread, it looks like it may be related to an issue I had a
few weeks ago (I'm a novice though). Maybe my thread will be of help:
https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either
hangs up or returns with a SIGSEGV," do you mean that a single call hangs,
or that at some point during the for loop a call hangs? If you mean the
latter, then it might relate to my issue. Otherwise my thread probably
won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu
608-295-5532

Gundram Leifert
2016-07-06 12:12:11 UTC
Hello Jason,

thanks for your response! I think it is another problem. I try to send
100 MB of bytes, so there are not many repetitions (between 10 and 30). I
realized that running this code can result in 3 different errors:

1. Most often the posted error message occurs.

2. In <10% of the cases I have a livelock. I can see 3 Java processes, one
with 200% and two with 100% CPU utilization. After ~15 minutes without any
new output, this error occurs:


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable
core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code: (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9]
/home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------


3. In <10% of the cases I have a deadlock during MPI.Init. It stays there for
more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / Open MPI side?
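For example, I could imagine something like the following, but I do not know
whether these are the right knobs here:

mpirun --mca coll_base_verbose 100 --mca btl_base_verbose 100 -np 3 java TestSendBigFiles
mpirun -np 3 java -Xcheck:jni TestSendBigFiles

The first line should raise the verbosity of the collective and byte-transfer
layers, the second enables the JVM's additional JNI checks.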

Thanks in advance for your help!
Gundram Leifert


Gilles Gouaillardet
2016-07-06 13:10:52 UTC
Note a race condition in MPI_Init was fixed yesterday in master.
Can you please update your Open MPI and try again?

Hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic)
version of your program? The crash occurs in hashcode, and this makes little
sense to me. Can you also update your JDK?
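
Something along these lines would already be useful, just a sketch (the class
name, the fixed seed and the iteration count are arbitrary):

import java.util.Random;
import mpi.*;

public class BcastLoopTest {

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();
        // same 100 MB buffer as in your test, but with a fixed length and a fixed seed
        byte[] buf = new byte[100000000];
        if (rank == 0) {
            new Random(42).nextBytes(buf);
        }
        for (int i = 0; i < 100; i++) {
            // broadcast the same buffer over and over: no varying length, no hashcode
            MPI.COMM_WORLD.bcast(buf, buf.length, MPI.BYTE, 0);
            if (rank == 0) {
                System.err.println("iteration " + i + " done");
            }
        }
        MPI.Finalize();
    }
}

With the length constant and the data seeded, the Random calls and the
hashcode are out of the picture, so a crash would point more clearly at the
bcast itself.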

Cheers,

Gilles

Gundram Leifert
2016-07-06 16:03:14 UTC
Hello Gilles,

thank you for your hints! I made 3 changes, but unfortunately the same error
occurs:

updated ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org>
Date: Tue Jul 5 20:07:16 2016 -0700

updated java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

deleted the hashcode lines.

Now I get this error message 100% of the time, after a varying number of
iterations (15-300):

0/ 3:length = 100000000
0/ 3:bcast length done (length = 100000000)
1/ 3:bcast length done (length = 100000000)
2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build
1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*,
int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable
core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code: (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29]
/home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know whether it is a problem of Java or Open MPI, but in the last
years Java has worked without problems on my machine...

Thank you for your tips in advance!
Gundram

Gilles Gouaillardet
2016-07-06 22:41:41 UTC
I will have a look at it today.

How did you configure Open MPI?

Cheers,

Gilles

> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 2 with PID 0 on node titan01 exited on
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> I don't know if it is a problem of java or ompi - but the last years,
> java worked with no problems on my machine...
>
> Thank you for your tips in advance!
> Gundram
>
> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>
> Note a race condition in MPI_Init has been fixed yesterday in the master.
> can you please update your OpenMPI and try again ?
>
> hopefully the hang will disappear.
>
> Can you reproduce the crash with a simpler (and ideally deterministic)
> version of your program.
> the crash occurs in hashcode, and this makes little sense to me. can you
> also update your jdk ?
>
> Cheers,
>
> Gilles
>
> On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
>
>> Hello Jason,
>>
>> thanks for your response! I thing it is another problem. I try to send
>> 100MB bytes. So there are not many tries (between 10 and 30). I realized
>> that the execution of this code can result 3 different errors:
>>
>> 1. most often the posted error message occures.
>>
>> 2. in <10% the cases i have a live lock. I can see 3 java-processes, one
>> with 200% and two with 100% processor utilization. After ~15 minutes
>> without new system outputs this error occurs.
>>
>>
>> [thread 47499823949568 also had an error]
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
>> # guarantee(PageArmed == 0) failed: invariant
>> #
>> # JRE version: 7.0_25-b15
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>> linux-amd64 compressed oops)
>> # Failed to write core dump. Core dumps have been disabled. To enable
>> core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.sun.com/bugreport/crash.jsp
>> #
>> [titan01:24256] *** Process received signal ***
>> [titan01:24256] Signal: Aborted (6)
>> [titan01:24256] Signal code: (-6)
>> [titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>> [titan01:24256] [ 3]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>> [titan01:24256] [ 4]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>> [titan01:24256] [ 5]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>> [titan01:24256] [ 6]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>> [titan01:24256] [ 7]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>> [titan01:24256] [ 8]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>> [titan01:24256] [ 9]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>> [titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>> [titan01:24256] *** End of error message ***
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
>> signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>>
>> 3. in <10% the cases i have a dead lock while MPI.init. This stays for
>> more than 15 minutes without returning with an error message...
>>
>> Can I enable some debug-flags to see what happens on C / OpenMPI side?
>>
>> Thanks in advance for your help!
>> Gundram Leifert
>>
>>
>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>
>> After reading your thread looks like it may be related to an issue I had
>> a few weeks ago (I'm a novice though). Maybe my thread will be of help:
>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>
>> When you say "After a specific number of repetitions the process either
>> hangs up or returns with a SIGSEGV." does you mean that a single call
>> hangs, or that at some point during the for loop a call hangs? If you mean
>> the latter, then it might relate to my issue. Otherwise my thread probably
>> won't be helpful.
>>
>> Jason Maldonis
>> Research Assistant of Professor Paul Voyles
>> Materials Science Grad Student
>> University of Wisconsin, Madison
>> 1509 University Ave, Rm M142
>> Madison, WI 53706
>> ***@wisc.edu
>> 608-295-5532
>>
>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de> wrote:
>>
>>> Hello,
>>>
>>> I try to send many byte-arrays via broadcast. After a specific number of
>>> repetitions the process either hangs up or returns with a SIGSEGV. Does any
>>> one can help me solving the problem:
>>>
>>> ########## The code:
>>>
>>> import java.util.Random;
>>> import mpi.*;
>>>
>>> public class TestSendBigFiles {
>>>
>>> public static void log(String msg) {
>>> try {
>>> System.err.println(String.format("%2d/%2d:%s",
>>> MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>> } catch (MPIException ex) {
>>> System.err.println(String.format("%2s/%2s:%s", "?", "?",
>>> msg));
>>> }
>>> }
>>>
>>> private static int hashcode(byte[] bytearray) {
>>> if (bytearray == null) {
>>> return 0;
>>> }
>>> int hash = 39;
>>> for (int i = 0; i < bytearray.length; i++) {
>>> byte b = bytearray[i];
>>> hash = hash * 7 + (int) b;
>>> }
>>> return hash;
>>> }
>>>
>>> public static void main(String args[]) throws MPIException {
>>> log("start main");
>>> MPI.Init(args);
>>> try {
>>> log("initialized done");
>>> byte[] saveMem = new byte[100000000];
>>> MPI.COMM_WORLD.barrier();
>>> Random r = new Random();
>>> r.nextBytes(saveMem);
>>> if (MPI.COMM_WORLD.getRank() == 0) {
>>> for (int i = 0; i < 1000; i++) {
>>> saveMem[r.nextInt(saveMem.length)]++;
>>> log("i = " + i);
>>> int[] lengthData = new int[]{saveMem.length};
>>> log("object hash = " + hashcode(saveMem));
>>> log("length = " + lengthData[0]);
>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>> log("bcast length done (length = " + lengthData[0] +
>>> ")");
>>> MPI.COMM_WORLD.barrier();
>>> MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
>>> MPI.BYTE, 0);
>>> log("bcast data done");
>>> MPI.COMM_WORLD.barrier();
>>> }
>>> MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>> } else {
>>> while (true) {
>>> int[] lengthData = new int[1];
>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>> log("bcast length done (length = " + lengthData[0] +
>>> ")");
>>> if (lengthData[0] == 0) {
>>> break;
>>> }
>>> MPI.COMM_WORLD.barrier();
>>> saveMem = new byte[lengthData[0]];
>>> MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
>>> MPI.BYTE, 0);
>>> log("bcast data done");
>>> MPI.COMM_WORLD.barrier();
>>> log("object hash = " + hashcode(saveMem));
>>> }
>>> }
>>> MPI.COMM_WORLD.barrier();
>>> } catch (MPIException ex) {
>>> System.out.println("caugth error." + ex);
>>> log(ex.getMessage());
>>> } catch (RuntimeException ex) {
>>> System.out.println("caugth error." + ex);
>>> log(ex.getMessage());
>>> } finally {
>>> MPI.Finalize();
>>> }
>>>
>>> }
>>>
>>> }
>>>
>>>
>>> ############ The Error (if it does not just hang up):
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
>>> #
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>> linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # #
>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
>>> #
>>> # JRE version: 7.0_25-b15
>>> J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>> #
>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>> linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>> #
>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> [titan01:01172] *** Process received signal ***
>>> [titan01:01172] Signal: Aborted (6)
>>> [titan01:01172] Signal code: (-6)
>>> [titan01:01173] *** Process received signal ***
>>> [titan01:01173] Signal: Aborted (6)
>>> [titan01:01173] Signal code: (-6)
>>> [titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>> [titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>> [titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>> [titan01:01172] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>> [titan01:01172] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>> [titan01:01172] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>> [titan01:01172] *** End of error message ***
>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>> [titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>> [titan01:01173] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>> [titan01:01173] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>> [titan01:01173] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>> [titan01:01173] *** End of error message ***
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited on
>>> signal 6 (Aborted).
>>>
>>>
>>> ########CONFIGURATION:
>>> I used the ompi master sources from github:
>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>> Author: Gilles Gouaillardet <***@rist.or.jp>
>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>
>>> ./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>> --disable-dlopen --disable-mca-dso
>>>
>>> Thanks a lot for your help!
>>> Gundram
>>>
>>> _______________________________________________
>>> users mailing list
>>> ***@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>
>>
>>
>>
>> _______________________________________________
>> users mailing list ***@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>
>>
>>
>
> _______________________________________________
> users mailing list ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>
>
>
Gilles Gouaillardet
2016-07-07 01:00:23 UTC
Permalink
Gundram,


FWIW, I cannot reproduce the issue on my box:

- centos 7

- java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


I noticed that on non-zero ranks saveMem is re-allocated at each iteration.
Ideally, the garbage collector can take care of that, so it should not
be an issue.

Would you mind giving the attached file a try?

Cheers,

Gilles
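
(The file attached to the message above is not preserved in this archive. As a rough, hypothetical sketch of the kind of change described -- allocating saveMem once per rank and reusing it, instead of re-allocating it in every receive iteration -- the test could look like the code below. The class name is made up, and the logging and hashcode checks of the original test are left out for brevity; this is not the actual attachment.)

import java.util.Random;
import mpi.*;

public class TestSendBigFilesReuse {

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        try {
            // one allocation per rank, reused for every broadcast
            byte[] saveMem = new byte[100000000];
            MPI.COMM_WORLD.barrier();
            if (MPI.COMM_WORLD.getRank() == 0) {
                Random r = new Random();
                r.nextBytes(saveMem);
                for (int i = 0; i < 1000; i++) {
                    int[] lengthData = new int[]{saveMem.length};
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    MPI.COMM_WORLD.barrier();
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
                    MPI.COMM_WORLD.barrier();
                }
                MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
            } else {
                while (true) {
                    int[] lengthData = new int[1];
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    if (lengthData[0] == 0) {
                        break;
                    }
                    MPI.COMM_WORLD.barrier();
                    // reuse the existing buffer instead of "new byte[lengthData[0]]"
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
                    MPI.COMM_WORLD.barrier();
                }
            }
            MPI.COMM_WORLD.barrier();
        } finally {
            MPI.Finalize();
        }
    }
}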

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
> I will have a look at it today
>
> how did you configure OpenMPI ?
>
> Cheers,
>
> Gilles
>
> On Thursday, July 7, 2016, Gundram Leifert
> <***@uni-rostock.de> wrote:
>
> Hello Giles,
>
> thank you for your hints! I did 3 changes, unfortunately the same
> error occures:
>
> update ompi:
> commit ae8444682f0a7aa158caea08800542ce9874455e
> Author: Ralph Castain <***@open-mpi.org>
> Date: Tue Jul 5 20:07:16 2016 -0700
>
> update java:
> java version "1.8.0_92"
> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>
> delete hashcode-lines.
>
> Now I get this error message - to 100%, after different number of
> iterations (15-300):
>
> 0/ 3:length = 100000000
> 0/ 3:bcast length done (length = 100000000)
> 1/ 3:bcast length done (length = 100000000)
> 2/ 3:bcast length done (length = 100000000)
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
> tid=0x00002b3d29716700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build
> 1.8.0_92-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # V [libjvm.so+0x414d24]
> ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
> #
> # Failed to write core dump. Core dumps have been disabled. To
> enable core dumping, try "ulimit -c unlimited" before starting
> Java again
> #
> # An error report file with more information is saved as:
> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
> #
> # Compiler replay data is saved as:
> # /home/gl069/ompi/bin/executor/replay_pid16578.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> #
> [titan01:16578] *** Process received signal ***
> [titan01:16578] Signal: Aborted (6)
> [titan01:16578] Signal code: (-6)
> [titan01:16578] [ 0]
> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
> [titan01:16578] [ 1]
> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
> [titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
> [titan01:16578] [ 3]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
> [titan01:16578] [ 4]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
> [titan01:16578] [ 5]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
> [titan01:16578] [ 6]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
> [titan01:16578] [ 8]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
> [titan01:16578] [ 9]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
> [titan01:16578] [10]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
> [titan01:16578] [11]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
> [titan01:16578] [12]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
> [titan01:16578] [13]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
> [titan01:16578] [14]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
> [titan01:16578] [15]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
> [titan01:16578] [16]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
> [titan01:16578] [17]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
> [titan01:16578] [18]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
> [titan01:16578] [19]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
> [titan01:16578] [20]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
> [titan01:16578] [21]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
> [titan01:16578] [22]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
> [titan01:16578] [23]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
> [titan01:16578] [24]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
> [titan01:16578] [25]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
> [titan01:16578] [26]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
> [titan01:16578] [27]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
> [titan01:16578] [28]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
> [titan01:16578] [29]
> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
> [titan01:16578] *** End of error message ***
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 2 with PID 0 on node titan01
> exited on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> I don't know if it is a problem of java or ompi - but the last
> years, java worked with no problems on my machine...
>
> Thank you for your tips in advance!
> Gundram
>
> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>> Note a race condition in MPI_Init has been fixed yesterday in the
>> master.
>> can you please update your OpenMPI and try again ?
>>
>> hopefully the hang will disappear.
>>
>> Can you reproduce the crash with a simpler (and ideally
>> deterministic) version of your program.
>> the crash occurs in hashcode, and this makes little sense to me.
>> can you also update your jdk ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Wednesday, July 6, 2016, Gundram Leifert
>> <***@uni-rostock.de> wrote:
>>
>> Hello Jason,
>>
>> thanks for your response! I thing it is another problem. I
>> try to send 100MB bytes. So there are not many tries (between
>> 10 and 30). I realized that the execution of this code can
>> result 3 different errors:
>>
>> 1. most often the posted error message occures.
>>
>> 2. in <10% the cases i have a live lock. I can see 3
>> java-processes, one with 200% and two with 100% processor
>> utilization. After ~15 minutes without new system outputs
>> this error occurs.
>>
>>
>> [thread 47499823949568 also had an error]
>> # A fatal error has been detected by the Java Runtime
>> Environment:
>> #
>> # Internal Error (safepoint.cpp:317), pid=24256,
>> tid=47500347131648
>> # guarantee(PageArmed == 0) failed: invariant
>> #
>> # JRE version: 7.0_25-b15
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed
>> mode linux-amd64 compressed oops)
>> # Failed to write core dump. Core dumps have been disabled.
>> To enable core dumping, try "ulimit -c unlimited" before
>> starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.sun.com/bugreport/crash.jsp
>> #
>> [titan01:24256] *** Process received signal ***
>> [titan01:24256] Signal: Aborted (6)
>> [titan01:24256] Signal code: (-6)
>> [titan01:24256] [ 0]
>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>> [titan01:24256] [ 1]
>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>> [titan01:24256] [ 2]
>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>> [titan01:24256] [ 3]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>> [titan01:24256] [ 4]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>> [titan01:24256] [ 5]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>> [titan01:24256] [ 6]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>> [titan01:24256] [ 7]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>> [titan01:24256] [ 8]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>> [titan01:24256] [ 9]
>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>> [titan01:24256] [10]
>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>> [titan01:24256] [11]
>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>> [titan01:24256] *** End of error message ***
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been
>> aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 0 on node titan01
>> exited on signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>>
>> 3. in <10% the cases i have a dead lock while MPI.init. This
>> stays for more than 15 minutes without returning with an
>> error message...
>>
>> Can I enable some debug-flags to see what happens on C /
>> OpenMPI side?
>>
>> Thanks in advance for your help!
>> Gundram Leifert
>>
>>
>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>> After reading your thread looks like it may be related to an
>>> issue I had a few weeks ago (I'm a novice though). Maybe my
>>> thread will be of help:
>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>
>>>
>>> When you say "After a specific number of repetitions the
>>> process either hangs up or returns with a SIGSEGV." does
>>> you mean that a single call hangs, or that at some point
>>> during the for loop a call hangs? If you mean the latter,
>>> then it might relate to my issue. Otherwise my thread
>>> probably won't be helpful.
>>>
>>> Jason Maldonis
>>> Research Assistant of Professor Paul Voyles
>>> Materials Science Grad Student
>>> University of Wisconsin, Madison
>>> 1509 University Ave, Rm M142
>>> Madison, WI 53706
>>> ***@wisc.edu
>>> 608-295-5532
>>>
>>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert
>>> <***@uni-rostock.de> wrote:
>>>
>>> Hello,
>>>
>>> I try to send many byte-arrays via broadcast. After a
>>> specific number of repetitions the process either hangs
>>> up or returns with a SIGSEGV. Does any one can help me
>>> solving the problem:
>>>
>>> ########## The code:
>>>
>>> import java.util.Random;
>>> import mpi.*;
>>>
>>> public class TestSendBigFiles {
>>>
>>> public static void log(String msg) {
>>> try {
>>> System.err.println(String.format("%2d/%2d:%s",
>>> MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>> } catch (MPIException ex) {
>>> System.err.println(String.format("%2s/%2s:%s", "?", "?",
>>> msg));
>>> }
>>> }
>>>
>>> private static int hashcode(byte[] bytearray) {
>>> if (bytearray == null) {
>>> return 0;
>>> }
>>> int hash = 39;
>>> for (int i = 0; i < bytearray.length; i++) {
>>> byte b = bytearray[i];
>>> hash = hash * 7 + (int) b;
>>> }
>>> return hash;
>>> }
>>>
>>> public static void main(String args[]) throws
>>> MPIException {
>>> log("start main");
>>> MPI.Init(args);
>>> try {
>>> log("initialized done");
>>> byte[] saveMem = new byte[100000000];
>>> MPI.COMM_WORLD.barrier();
>>> Random r = new Random();
>>> r.nextBytes(saveMem);
>>> if (MPI.COMM_WORLD.getRank() == 0) {
>>> for (int i = 0; i < 1000; i++) {
>>> saveMem[r.nextInt(saveMem.length)]++;
>>> log("i = " + i);
>>> int[] lengthData = new
>>> int[]{saveMem.length};
>>> log("object hash = " +
>>> hashcode(saveMem));
>>> log("length = " + lengthData[0]);
>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>> log("bcast length done (length = " +
>>> lengthData[0] + ")");
>>> MPI.COMM_WORLD.barrier();
>>> MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
>>> log("bcast data done");
>>> MPI.COMM_WORLD.barrier();
>>> }
>>> MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>> } else {
>>> while (true) {
>>> int[] lengthData = new int[1];
>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>> log("bcast length done (length = " +
>>> lengthData[0] + ")");
>>> if (lengthData[0] == 0) {
>>> break;
>>> }
>>> MPI.COMM_WORLD.barrier();
>>> saveMem = new byte[lengthData[0]];
>>> MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
>>> log("bcast data done");
>>> MPI.COMM_WORLD.barrier();
>>> log("object hash = " +
>>> hashcode(saveMem));
>>> }
>>> }
>>> MPI.COMM_WORLD.barrier();
>>> } catch (MPIException ex) {
>>> System.out.println("caugth error." + ex);
>>> log(ex.getMessage());
>>> } catch (RuntimeException ex) {
>>> System.out.println("caugth error." + ex);
>>> log(ex.getMessage());
>>> } finally {
>>> MPI.Finalize();
>>> }
>>>
>>> }
>>>
>>> }
>>>
>>>
>>> ############ The Error (if it does not just hang up):
>>>
>>> #
>>> # A fatal error has been detected by the Java Runtime
>>> Environment:
>>> #
>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172,
>>> tid=47822674495232
>>> #
>>> #
>>> # A fatal error has been detected by the Java Runtime
>>> Environment:
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01
>>> mixed mode linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # #
>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173,
>>> tid=47238546896640
>>> #
>>> # JRE version: 7.0_25-b15
>>> J
>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>> #
>>> # Failed to write core dump. Core dumps have been
>>> disabled. To enable core dumping, try "ulimit -c
>>> unlimited" before starting Java again
>>> #
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01
>>> mixed mode linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # J
>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>> #
>>> # Failed to write core dump. Core dumps have been
>>> disabled. To enable core dumping, try "ulimit -c
>>> unlimited" before starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> [titan01:01172] *** Process received signal ***
>>> [titan01:01172] Signal: Aborted (6)
>>> [titan01:01172] Signal code: (-6)
>>> [titan01:01173] *** Process received signal ***
>>> [titan01:01173] Signal: Aborted (6)
>>> [titan01:01173] Signal code: (-6)
>>> [titan01:01172] [ 0]
>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>> [titan01:01172] [ 1]
>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>> [titan01:01172] [ 2]
>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>> [titan01:01172] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>> [titan01:01172] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>> [titan01:01172] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>> [titan01:01173] [ 1]
>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>> [titan01:01172] *** End of error message ***
>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>> [titan01:01173] [ 2]
>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>> [titan01:01173] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>> [titan01:01173] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>> [titan01:01173] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>> [titan01:01173] [ 6]
>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>> [titan01:01173] *** End of error message ***
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has
>>> been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 0 on node
>>> titan01 exited on signal 6 (Aborted).
>>>
>>>
>>> ########CONFIGURATION:
>>> I used the ompi master sources from github:
>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>> Author: Gilles Gouaillardet <***@rist.or.jp>
>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>
>>> ./configure --enable-mpi-java
>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>> --disable-dlopen --disable-mca-dso
>>>
>>> Thanks a lot for your help!
>>> Gundram
>>>
>>> _______________________________________________
>>> users mailing list
>>> ***@open-mpi.org
>>> Subscription:
>>> https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> ***@open-mpi.org
>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@open-mpi.org
>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>
>
>
> _______________________________________________
> users mailing list
> ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29589.php
Saliya Ekanayake
2016-07-07 05:31:23 UTC
Permalink
I've received SIGSEGV a few times for different reasons with OpenMPI Java,
and one of the most common causes was the ulimit settings. You might want
to look at -l (max locked memory), -u (max user processes), and -n (open files).

Here's a snapshot of what we use in our clusters running OpenMPI and Java

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515696
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 196608
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited



On Wed, Jul 6, 2016 at 9:00 PM, Gilles Gouaillardet <***@rist.or.jp>
wrote:

> Gundram,
>
>
> fwiw, i cannot reproduce the issue on my box
>
> - centos 7
>
> - java version "1.8.0_71"
> Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)
>
>
> i noticed on non zero rank saveMem is allocated at each iteration.
> ideally, the garbage collector can take care of that and this should not
> be an issue.
>
> would you mind giving the attached file a try ?
>
> Cheers,
>
> Gilles
>
>
> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>
> I will have a look at it today
>
> how did you configure OpenMPI ?
>
> Cheers,
>
> Gilles
>
> On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
>
>> Hello Giles,
>>
>> thank you for your hints! I did 3 changes, unfortunately the same error
>> occures:
>>
>> update ompi:
>> commit ae8444682f0a7aa158caea08800542ce9874455e
>> Author: Ralph Castain <***@open-mpi.org>
>> Date: Tue Jul 5 20:07:16 2016 -0700
>>
>> update java:
>> java version "1.8.0_92"
>> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
>> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>>
>> delete hashcode-lines.
>>
>> Now I get this error message - to 100%, after different number of
>> iterations (15-300):
>>
>> 0/ 3:length = 100000000
>> 0/ 3:bcast length done (length = 100000000)
>> 1/ 3:bcast length done (length = 100000000)
>> 2/ 3:bcast length done (length = 100000000)
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
>> tid=0x00002b3d29716700
>> #
>> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build
>> 1.8.0_92-b14)
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode
>> linux-amd64 compressed oops)
>> # Problematic frame:
>> # V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*,
>> int)+0x94
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable
>> core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>> #
>> # Compiler replay data is saved as:
>> # /home/gl069/ompi/bin/executor/replay_pid16578.log
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.java.com/bugreport/crash.jsp
>> #
>> [titan01:16578] *** Process received signal ***
>> [titan01:16578] Signal: Aborted (6)
>> [titan01:16578] Signal code: (-6)
>> [titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>> [titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>> [titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>> [titan01:16578] [ 3]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>> [titan01:16578] [ 4]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>> [titan01:16578] [ 5]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>> [titan01:16578] [ 6]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>> [titan01:16578] [ 8]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>> [titan01:16578] [ 9]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>> [titan01:16578] [10]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>> [titan01:16578] [11]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>> [titan01:16578] [12]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>> [titan01:16578] [13]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>> [titan01:16578] [14]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>> [titan01:16578] [15]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>> [titan01:16578] [16]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>> [titan01:16578] [17]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>> [titan01:16578] [18]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>> [titan01:16578] [19]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>> [titan01:16578] [20]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>> [titan01:16578] [21]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>> [titan01:16578] [22]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>> [titan01:16578] [23]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>> [titan01:16578] [24]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>> [titan01:16578] [25]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>> [titan01:16578] [26]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>> [titan01:16578] [27]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>> [titan01:16578] [28]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>> [titan01:16578] [29]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>> [titan01:16578] *** End of error message ***
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 2 with PID 0 on node titan01 exited on
>> signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>> I don't know if it is a problem of java or ompi - but the last years,
>> java worked with no problems on my machine...
>>
>> Thank you for your tips in advance!
>> Gundram
>>
>> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>>
>> Note a race condition in MPI_Init has been fixed yesterday in the master.
>> can you please update your OpenMPI and try again ?
>>
>> hopefully the hang will disappear.
>>
>> Can you reproduce the crash with a simpler (and ideally deterministic)
>> version of your program.
>> the crash occurs in hashcode, and this makes little sense to me. can you
>> also update your jdk ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Wednesday, July 6, 2016, Gundram Leifert <
>> ***@uni-rostock.de> wrote:
>>
>>> Hello Jason,
>>>
>>> thanks for your response! I thing it is another problem. I try to send
>>> 100MB bytes. So there are not many tries (between 10 and 30). I realized
>>> that the execution of this code can result 3 different errors:
>>>
>>> 1. most often the posted error message occures.
>>>
>>> 2. in <10% the cases i have a live lock. I can see 3 java-processes, one
>>> with 200% and two with 100% processor utilization. After ~15 minutes
>>> without new system outputs this error occurs.
>>>
>>>
>>> [thread 47499823949568 also had an error]
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> # Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
>>> # guarantee(PageArmed == 0) failed: invariant
>>> #
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>> linux-amd64 compressed oops)
>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> [titan01:24256] *** Process received signal ***
>>> [titan01:24256] Signal: Aborted (6)
>>> [titan01:24256] Signal code: (-6)
>>> [titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>> [titan01:24256] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>> [titan01:24256] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>> [titan01:24256] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>> [titan01:24256] [ 6]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>> [titan01:24256] [ 7]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>> [titan01:24256] [ 8]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>> [titan01:24256] [ 9]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>> [titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>> [titan01:24256] *** End of error message ***
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
>>> signal 6 (Aborted).
>>>
>>> --------------------------------------------------------------------------
>>>
>>>
>>> 3. in <10% the cases i have a dead lock while MPI.init. This stays for
>>> more than 15 minutes without returning with an error message...
>>>
>>> Can I enable some debug-flags to see what happens on C / OpenMPI side?
>>>
>>> Thanks in advance for your help!
>>> Gundram Leifert
>>>
>>>
>>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>>
>>> After reading your thread looks like it may be related to an issue I had
>>> a few weeks ago (I'm a novice though). Maybe my thread will be of help:
>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>
>>> When you say "After a specific number of repetitions the process either
>>> hangs up or returns with a SIGSEGV." does you mean that a single call
>>> hangs, or that at some point during the for loop a call hangs? If you mean
>>> the latter, then it might relate to my issue. Otherwise my thread probably
>>> won't be helpful.
>>>
>>> Jason Maldonis
>>> Research Assistant of Professor Paul Voyles
>>> Materials Science Grad Student
>>> University of Wisconsin, Madison
>>> 1509 University Ave, Rm M142
>>> Madison, WI 53706
>>> ***@wisc.edu
>>> 608-295-5532
>>>
>>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <
>>> ***@uni-rostock.de> wrote:
>>>
>>>> Hello,
>>>>
>>>> I try to send many byte-arrays via broadcast. After a specific number
>>>> of repetitions the process either hangs up or returns with a SIGSEGV. Does
>>>> any one can help me solving the problem:
>>>>
>>>> ########## The code:
>>>>
>>>> import java.util.Random;
>>>> import mpi.*;
>>>>
>>>> public class TestSendBigFiles {
>>>>
>>>> public static void log(String msg) {
>>>> try {
>>>> System.err.println(String.format("%2d/%2d:%s",
>>>> MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>>> } catch (MPIException ex) {
>>>> System.err.println(String.format("%2s/%2s:%s", "?", "?",
>>>> msg));
>>>> }
>>>> }
>>>>
>>>> private static int hashcode(byte[] bytearray) {
>>>> if (bytearray == null) {
>>>> return 0;
>>>> }
>>>> int hash = 39;
>>>> for (int i = 0; i < bytearray.length; i++) {
>>>> byte b = bytearray[i];
>>>> hash = hash * 7 + (int) b;
>>>> }
>>>> return hash;
>>>> }
>>>>
>>>> public static void main(String args[]) throws MPIException {
>>>> log("start main");
>>>> MPI.Init(args);
>>>> try {
>>>> log("initialized done");
>>>> byte[] saveMem = new byte[100000000];
>>>> MPI.COMM_WORLD.barrier();
>>>> Random r = new Random();
>>>> r.nextBytes(saveMem);
>>>> if (MPI.COMM_WORLD.getRank() == 0) {
>>>> for (int i = 0; i < 1000; i++) {
>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>> log("i = " + i);
>>>> int[] lengthData = new int[]{saveMem.length};
>>>> log("object hash = " + hashcode(saveMem));
>>>> log("length = " + lengthData[0]);
>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>> log("bcast length done (length = " + lengthData[0]
>>>> + ")");
>>>> MPI.COMM_WORLD.barrier();
>>>> MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
>>>> MPI.BYTE, 0);
>>>> log("bcast data done");
>>>> MPI.COMM_WORLD.barrier();
>>>> }
>>>> MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>> } else {
>>>> while (true) {
>>>> int[] lengthData = new int[1];
>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>> log("bcast length done (length = " + lengthData[0]
>>>> + ")");
>>>> if (lengthData[0] == 0) {
>>>> break;
>>>> }
>>>> MPI.COMM_WORLD.barrier();
>>>> saveMem = new byte[lengthData[0]];
>>>> MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
>>>> MPI.BYTE, 0);
>>>> log("bcast data done");
>>>> MPI.COMM_WORLD.barrier();
>>>> log("object hash = " + hashcode(saveMem));
>>>> }
>>>> }
>>>> MPI.COMM_WORLD.barrier();
>>>> } catch (MPIException ex) {
>>>> System.out.println("caugth error." + ex);
>>>> log(ex.getMessage());
>>>> } catch (RuntimeException ex) {
>>>> System.out.println("caugth error." + ex);
>>>> log(ex.getMessage());
>>>> } finally {
>>>> MPI.Finalize();
>>>> }
>>>>
>>>> }
>>>>
>>>> }
>>>>
>>>>
>>>> ############ The Error (if it does not just hang up):
>>>>
>>>> #
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> #
>>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
>>>> #
>>>> #
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> # JRE version: 7.0_25-b15
>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>> linux-amd64 compressed oops)
>>>> # Problematic frame:
>>>> # #
>>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
>>>> #
>>>> # JRE version: 7.0_25-b15
>>>> J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>> #
>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>> #
>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>> linux-amd64 compressed oops)
>>>> # Problematic frame:
>>>> # J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>> #
>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>> #
>>>> # An error report file with more information is saved as:
>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>> # An error report file with more information is saved as:
>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>> #
>>>> # If you would like to submit a bug report, please visit:
>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>> #
>>>> #
>>>> # If you would like to submit a bug report, please visit:
>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>> #
>>>> [titan01:01172] *** Process received signal ***
>>>> [titan01:01172] Signal: Aborted (6)
>>>> [titan01:01172] Signal code: (-6)
>>>> [titan01:01173] *** Process received signal ***
>>>> [titan01:01173] Signal: Aborted (6)
>>>> [titan01:01173] Signal code: (-6)
>>>> [titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>> [titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>> [titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>> [titan01:01172] [ 3]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>> [titan01:01172] [ 4]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>> [titan01:01172] [ 5]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>>> [titan01:01172] *** End of error message ***
>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>> [titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>> [titan01:01173] [ 3]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>> [titan01:01173] [ 4]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>> [titan01:01173] [ 5]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>>> [titan01:01173] *** End of error message ***
>>>> -------------------------------------------------------
>>>> Primary job terminated normally, but 1 process returned
>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited on
>>>> signal 6 (Aborted).
>>>>
>>>>
>>>> ########CONFIGURATION:
>>>> I used the ompi master sources from github:
>>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>>> Author: Gilles Gouaillardet <***@rist.or.jp>
>>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>>
>>>> ./configure --enable-mpi-java
>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen
>>>> --disable-mca-dso
>>>>
>>>> Thanks a lot for your help!
>>>> Gundram
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> ***@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list ***@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>
>>>
>>>
>>
>> _______________________________________________
>> users mailing list ***@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>
>>
>>
>
> _______________________________________________
> users mailing list ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>
>
>
> _______________________________________________
> users mailing list
> ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>



--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Gundram Leifert
2016-07-07 07:17:29 UTC
Permalink
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is
always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory, because when I delete
the hash call, the program sometimes runs for more than 9k iterations.
When it does crash, the problematic frame differs (see (2) and (3)). The
crashes also occur on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
(42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle,
Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V [libjvm.so+0x4183bf]
ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more ideas?
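
One more thing I still want to try, in case the garbage collector really
moves the byte[] while the native bcast is still working on it: the same
loop with an off-heap buffer from MPI.newByteBuffer instead of a heap
array. This is only a rough, untested sketch (assuming the buffer variant
of bcast behaves like the array variant):

import java.nio.ByteBuffer;
import java.util.Random;
import mpi.*;

public class TestSendBigFilesBuffer {

    // same hash as before, but reading from the direct buffer
    private static int hashcode(ByteBuffer buf, int length) {
        int hash = 39;
        for (int i = 0; i < length; i++) {
            hash = hash * 7 + buf.get(i);
        }
        return hash;
    }

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        try {
            final int size = 100000000;
            // off-heap buffer: the GC cannot move it during the native bcast
            ByteBuffer saveMem = MPI.newByteBuffer(size);
            if (MPI.COMM_WORLD.getRank() == 0) {
                byte[] tmp = new byte[size];
                new Random().nextBytes(tmp);
                saveMem.put(tmp);
                saveMem.clear();
            }
            for (int i = 0; i < 1000; i++) {
                MPI.COMM_WORLD.bcast(saveMem, size, MPI.BYTE, 0);
                MPI.COMM_WORLD.barrier();
                System.err.println(MPI.COMM_WORLD.getRank() + "/"
                        + MPI.COMM_WORLD.getSize() + ": i = " + i
                        + ", hash = " + hashcode(saveMem, size));
            }
        } finally {
            MPI.Finalize();
        }
    }
}

If this version also crashes, the garbage collector is probably not the
problem.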

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>
> Gundram,
>
>
> fwiw, i cannot reproduce the issue on my box
>
> - centos 7
>
> - java version "1.8.0_71"
> Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)
>
>
> i noticed on non zero rank saveMem is allocated at each iteration.
> ideally, the garbage collector can take care of that and this should
> not be an issue.
>
> would you mind giving the attached file a try ?
>
> Cheers,
>
> Gilles
>
> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>> I will have a look at it today
>>
>> how did you configure OpenMPI ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thursday, July 7, 2016, Gundram Leifert
>> <***@uni-rostock.de> wrote:
>>
>> Hello Giles,
>>
>> thank you for your hints! I did 3 changes, unfortunately the same
>> error occures:
>>
>> update ompi:
>> commit ae8444682f0a7aa158caea08800542ce9874455e
>> Author: Ralph Castain <***@open-mpi.org>
>> Date: Tue Jul 5 20:07:16 2016 -0700
>>
>> update java:
>> java version "1.8.0_92"
>> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
>> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>>
>> delete hashcode-lines.
>>
>> Now I get this error message - to 100%, after different number of
>> iterations (15-300):
>>
>> 0/ 3:length = 100000000
>> 0/ 3:bcast length done (length = 100000000)
>> 1/ 3:bcast length done (length = 100000000)
>> 2/ 3:bcast length done (length = 100000000)
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
>> tid=0x00002b3d29716700
>> #
>> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14)
>> (build 1.8.0_92-b14)
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed
>> mode linux-amd64 compressed oops)
>> # Problematic frame:
>> # V [libjvm.so+0x414d24]
>> ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
>> #
>> # Failed to write core dump. Core dumps have been disabled. To
>> enable core dumping, try "ulimit -c unlimited" before starting
>> Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>> #
>> # Compiler replay data is saved as:
>> # /home/gl069/ompi/bin/executor/replay_pid16578.log
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.java.com/bugreport/crash.jsp
>> #
>> [titan01:16578] *** Process received signal ***
>> [titan01:16578] Signal: Aborted (6)
>> [titan01:16578] Signal code: (-6)
>> [titan01:16578] [ 0]
>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>> [titan01:16578] [ 1]
>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>> [titan01:16578] [ 2]
>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>> [titan01:16578] [ 3]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>> [titan01:16578] [ 4]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>> [titan01:16578] [ 5]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>> [titan01:16578] [ 6]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>> [titan01:16578] [ 8]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>> [titan01:16578] [ 9]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>> [titan01:16578] [10]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>> [titan01:16578] [11]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>> [titan01:16578] [12]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>> [titan01:16578] [13]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>> [titan01:16578] [14]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>> [titan01:16578] [15]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>> [titan01:16578] [16]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>> [titan01:16578] [17]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>> [titan01:16578] [18]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>> [titan01:16578] [19]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>> [titan01:16578] [20]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>> [titan01:16578] [21]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>> [titan01:16578] [22]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>> [titan01:16578] [23]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>> [titan01:16578] [24]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>> [titan01:16578] [25]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>> [titan01:16578] [26]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>> [titan01:16578] [27]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>> [titan01:16578] [28]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>> [titan01:16578] [29]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>> [titan01:16578] *** End of error message ***
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 2 with PID 0 on node titan01
>> exited on signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>> I don't know if it is a problem of java or ompi - but the last
>> years, java worked with no problems on my machine...
>>
>> Thank you for your tips in advance!
>> Gundram
>>
>> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>>> Note a race condition in MPI_Init has been fixed yesterday in
>>> the master.
>>> can you please update your OpenMPI and try again ?
>>>
>>> hopefully the hang will disappear.
>>>
>>> Can you reproduce the crash with a simpler (and ideally
>>> deterministic) version of your program.
>>> the crash occurs in hashcode, and this makes little sense to me.
>>> can you also update your jdk ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, July 6, 2016, Gundram Leifert
>>> <***@uni-rostock.de> wrote:
>>>
>>> Hello Jason,
>>>
>>> thanks for your response! I thing it is another problem. I
>>> try to send 100MB bytes. So there are not many tries
>>> (between 10 and 30). I realized that the execution of this
>>> code can result 3 different errors:
>>>
>>> 1. most often the posted error message occures.
>>>
>>> 2. in <10% the cases i have a live lock. I can see 3
>>> java-processes, one with 200% and two with 100% processor
>>> utilization. After ~15 minutes without new system outputs
>>> this error occurs.
>>>
>>>
>>> [thread 47499823949568 also had an error]
>>> # A fatal error has been detected by the Java Runtime
>>> Environment:
>>> #
>>> # Internal Error (safepoint.cpp:317), pid=24256,
>>> tid=47500347131648
>>> # guarantee(PageArmed == 0) failed: invariant
>>> #
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01
>>> mixed mode linux-amd64 compressed oops)
>>> # Failed to write core dump. Core dumps have been disabled.
>>> To enable core dumping, try "ulimit -c unlimited" before
>>> starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> [titan01:24256] *** Process received signal ***
>>> [titan01:24256] Signal: Aborted (6)
>>> [titan01:24256] Signal code: (-6)
>>> [titan01:24256] [ 0]
>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>> [titan01:24256] [ 1]
>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>> [titan01:24256] [ 2]
>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>> [titan01:24256] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>> [titan01:24256] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>> [titan01:24256] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>> [titan01:24256] [ 6]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>> [titan01:24256] [ 7]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>> [titan01:24256] [ 8]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>> [titan01:24256] [ 9]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>> [titan01:24256] [10]
>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>> [titan01:24256] [11]
>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>> [titan01:24256] *** End of error message ***
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been
>>> aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 0 on node
>>> titan01 exited on signal 6 (Aborted).
>>> --------------------------------------------------------------------------
>>>
>>>
>>> 3. in <10% the cases i have a dead lock while MPI.init. This
>>> stays for more than 15 minutes without returning with an
>>> error message...
>>>
>>> Can I enable some debug-flags to see what happens on C /
>>> OpenMPI side?
>>>
>>> Thanks in advance for your help!
>>> Gundram Leifert
>>>
>>>
>>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>>> After reading your thread looks like it may be related to
>>>> an issue I had a few weeks ago (I'm a novice though). Maybe
>>>> my thread will be of help:
>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>
>>>>
>>>> When you say "After a specific number of repetitions the
>>>> process either hangs up or returns with a SIGSEGV." does
>>>> you mean that a single call hangs, or that at some point
>>>> during the for loop a call hangs? If you mean the latter,
>>>> then it might relate to my issue. Otherwise my thread
>>>> probably won't be helpful.
>>>>
>>>> Jason Maldonis
>>>> Research Assistant of Professor Paul Voyles
>>>> Materials Science Grad Student
>>>> University of Wisconsin, Madison
>>>> 1509 University Ave, Rm M142
>>>> Madison, WI 53706
>>>> ***@wisc.edu
>>>> 608-295-5532
>>>>
Gilles Gouaillardet
2016-07-07 08:05:57 UTC
Permalink
Gundram,


can you please provide more information on your environment (a small
sketch for printing some of it from inside the job follows the list):

- configure command line

- OS

- memory available

- ulimit -a

- number of nodes

- number of tasks used

- interconnect used (if any)

- batch manager (if any)
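
(Some of these numbers can also be printed from inside the job itself;
here is a small sketch for that. The ulimit, interconnect and batch
manager details still have to come from the shell.)

import mpi.*;

public class PrintEnvInfo {

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        if (MPI.COMM_WORLD.getRank() == 0) {
            Runtime rt = Runtime.getRuntime();
            // basic platform information
            System.out.println("os         = " + System.getProperty("os.name") + " "
                    + System.getProperty("os.version") + " ("
                    + System.getProperty("os.arch") + ")");
            System.out.println("java       = " + System.getProperty("java.version"));
            // memory the JVM may use, not the physical memory of the node
            System.out.println("max heap   = " + (rt.maxMemory() >> 20) + " MB");
            System.out.println("processors = " + rt.availableProcessors());
            // number of MPI tasks in this run
            System.out.println("mpi tasks  = " + MPI.COMM_WORLD.getSize());
        }
        MPI.Finalize();
    }
}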


Cheers,


Gilles

Nathaniel Graham
2016-07-07 20:48:51 UTC
Permalink
Hello Gundram,

I was also not able to reproduce the issue on my computer (OS X El Capitan).
I ran both your code and the one provided by Gilles with no issues.

I can try it on my Ubuntu machine when I get home.

-Nathan

On Thu, Jul 7, 2016 at 2:05 AM, Gilles Gouaillardet <***@rist.or.jp>
wrote:

> Gundram,
>
>
> can you please provide more information on your environment :
>
> - configure command line
>
> - OS
>
> - memory available
>
> - ulimit -a
>
> - number of nodes
>
> - number of tasks used
>
> - interconnect used (if any)
>
> - batch manager (if any)
>
>
> Cheers,
>
>
> Gilles
> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>
> Hello Gilles,
>
> I tried you code and it crashes after 3-15 iterations (see (1)). It is
> always the same error (only the "94" varies).
>
> Meanwhile I think Java and MPI use the same memory because when I delete
> the hash-call, the program runs sometimes more than 9k iterations.
> When it crashes, there are different lines (see (2) and (3)). The crashes
> also occurs on rank 0.
>
> ##### (1)#####
> # Problematic frame:
> # J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
> (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]
>
> #####(2)#####
> # Problematic frame:
> # V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle,
> Handle, JavaValue*, Thread*)+0xb6
>
> #####(3)#####
> # Problematic frame:
> # V [libjvm.so+0x4183bf]
> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>
> Any more idea?
>
> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>
> Gundram,
>
>
> fwiw, i cannot reproduce the issue on my box
>
> - centos 7
>
> - java version "1.8.0_71"
> Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)
>
>
> i noticed on non zero rank saveMem is allocated at each iteration.
> ideally, the garbage collector can take care of that and this should not
> be an issue.
>
> would you mind giving the attached file a try ?
>
> Cheers,
>
> Gilles
>
> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>
> I will have a look at it today
>
> how did you configure OpenMPI ?
>
> Cheers,
>
> Gilles
>
> On Thursday, July 7, 2016, Gundram Leifert <
> <***@uni-rostock.de>***@uni-rostock.de> wrote:
>
>> Hello Giles,
>>
>> thank you for your hints! I did 3 changes, unfortunately the same error
>> occures:
>>
>> update ompi:
>> commit ae8444682f0a7aa158caea08800542ce9874455e
>> Author: Ralph Castain <***@open-mpi.org>
>> Date: Tue Jul 5 20:07:16 2016 -0700
>>
>> update java:
>> java version "1.8.0_92"
>> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
>> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>>
>> delete hashcode-lines.
>>
>> Now I get this error message - to 100%, after different number of
>> iterations (15-300):
>>
>> 0/ 3:length = 100000000
>> 0/ 3:bcast length done (length = 100000000)
>> 1/ 3:bcast length done (length = 100000000)
>> 2/ 3:bcast length done (length = 100000000)
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
>> tid=0x00002b3d29716700
>> #
>> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build
>> 1.8.0_92-b14)
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode
>> linux-amd64 compressed oops)
>> # Problematic frame:
>> # V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*,
>> int)+0x94
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable
>> core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>> #
>> # Compiler replay data is saved as:
>> # /home/gl069/ompi/bin/executor/replay_pid16578.log
>> #
>> # If you would like to submit a bug report, please visit:
>> # http://bugreport.java.com/bugreport/crash.jsp
>> #
>> [titan01:16578] *** Process received signal ***
>> [titan01:16578] Signal: Aborted (6)
>> [titan01:16578] Signal code: (-6)
>> [titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>> [titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>> [titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>> [titan01:16578] [ 3]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>> [titan01:16578] [ 4]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>> [titan01:16578] [ 5]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>> [titan01:16578] [ 6]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>> [titan01:16578] [ 8]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>> [titan01:16578] [ 9]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>> [titan01:16578] [10]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>> [titan01:16578] [11]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>> [titan01:16578] [12]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>> [titan01:16578] [13]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>> [titan01:16578] [14]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>> [titan01:16578] [15]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>> [titan01:16578] [16]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>> [titan01:16578] [17]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>> [titan01:16578] [18]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>> [titan01:16578] [19]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>> [titan01:16578] [20]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>> [titan01:16578] [21]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>> [titan01:16578] [22]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>> [titan01:16578] [23]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>> [titan01:16578] [24]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>> [titan01:16578] [25]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>> [titan01:16578] [26]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>> [titan01:16578] [27]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>> [titan01:16578] [28]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>> [titan01:16578] [29]
>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>> [titan01:16578] *** End of error message ***
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 2 with PID 0 on node titan01 exited on
>> signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>> I don't know if it is a problem of java or ompi - but the last years,
>> java worked with no problems on my machine...
>>
>> Thank you for your tips in advance!
>> Gundram
>>
>> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>>
>> Note a race condition in MPI_Init has been fixed yesterday in the master.
>> can you please update your OpenMPI and try again ?
>>
>> hopefully the hang will disappear.
>>
>> Can you reproduce the crash with a simpler (and ideally deterministic)
>> version of your program.
>> the crash occurs in hashcode, and this makes little sense to me. can you
>> also update your jdk ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Wednesday, July 6, 2016, Gundram Leifert <
>> <***@uni-rostock.de>***@uni-rostock.de> wrote:
>>
>>> Hello Jason,
>>>
>>> thanks for your response! I thing it is another problem. I try to send
>>> 100MB bytes. So there are not many tries (between 10 and 30). I realized
>>> that the execution of this code can result 3 different errors:
>>>
>>> 1. most often the posted error message occures.
>>>
>>> 2. in <10% the cases i have a live lock. I can see 3 java-processes, one
>>> with 200% and two with 100% processor utilization. After ~15 minutes
>>> without new system outputs this error occurs.
>>>
>>>
>>> [thread 47499823949568 also had an error]
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> # Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
>>> # guarantee(PageArmed == 0) failed: invariant
>>> #
>>> # JRE version: 7.0_25-b15
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>> linux-amd64 compressed oops)
>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>> #
>>> [titan01:24256] *** Process received signal ***
>>> [titan01:24256] Signal: Aborted (6)
>>> [titan01:24256] Signal code: (-6)
>>> [titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>> [titan01:24256] [ 3]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>> [titan01:24256] [ 4]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>> [titan01:24256] [ 5]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>> [titan01:24256] [ 6]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>> [titan01:24256] [ 7]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>> [titan01:24256] [ 8]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>> [titan01:24256] [ 9]
>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>> [titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>> [titan01:24256] *** End of error message ***
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
>>> signal 6 (Aborted).
>>>
>>> --------------------------------------------------------------------------
>>>
>>>
>>> 3. in <10% the cases i have a dead lock while MPI.init. This stays for
>>> more than 15 minutes without returning with an error message...
>>>
>>> Can I enable some debug-flags to see what happens on C / OpenMPI side?
>>>
>>> Thanks in advance for your help!
>>> Gundram Leifert
>>>
>>>
Gundram Leifert
2016-07-08 09:15:49 UTC
Permalink
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem,
but it has no effect.
salloc -n 3


ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256564
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>
> Gundram,
>
>
> can you please provide more information on your environment :
>
> - configure command line
>
> - OS
>
> - memory available
>
> - ulimit -a
>
> - number of nodes
>
> - number of tasks used
>
> - interconnect used (if any)
>
> - batch manager (if any)
>
>
> Cheers,
>
>
> Gilles
Gilles Gouaillardet
2016-07-08 10:32:14 UTC
Permalink
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

another test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?
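
one more thing you could try, in case the JVM heap is involved: allocate the payload as a direct buffer with MPI.newByteBuffer, so the 100MB live outside the garbage collected heap. this is only a rough sketch of the idea (class name is mine, and i did not run it here):

import java.nio.ByteBuffer;
import mpi.*;

public class TestSendBigFilesDirect {

    public static void main(String args[]) throws MPIException {
        MPI.Init(args);
        try {
            int length = 100000000;
            int rank = MPI.COMM_WORLD.getRank();
            // direct (off-heap) buffer, not moved by the garbage collector
            ByteBuffer saveMem = MPI.newByteBuffer(length);
            for (int i = 0; i < 1000; i++) {
                if (rank == 0) {
                    saveMem.put(0, (byte) i);
                }
                MPI.COMM_WORLD.bcast(saveMem, length, MPI.BYTE, 0);
                MPI.COMM_WORLD.barrier();
            }
        } finally {
            MPI.Finalize();
        }
    }
}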

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de>
wrote:

> Hello,
>
> configure:
> ./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
> --disable-dlopen --disable-mca-dso
>
>
> 1 node with 3 cores. I use SLURM to allocate one node. I changed --mem,
> but it has no effect.
> salloc -n 3
>
>
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size (blocks, -f) unlimited
> pending signals (-i) 256564
> max locked memory (kbytes, -l) unlimited
> max memory size (kbytes, -m) unlimited
> open files (-n) 100000
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority (-r) 0
> stack size (kbytes, -s) unlimited
> cpu time (seconds, -t) unlimited
> max user processes (-u) 4096
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
> uname -a
> Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
> 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> cat /etc/system-release
> CentOS Linux release 7.2.1511 (Core)
>
> what else do you need?
>
> Cheers, Gundram
Gundram Leifert
2016-07-08 10:55:17 UTC
Permalink
In all cases I get the same error.
This is what I run:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


The process also crashes with only one or two cores.
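
To get a usable core file out of the JVM crash, I can repeat the run with a
single rank and core dumps enabled (just a variation of the commands above;
the hs_err output itself suggests "ulimit -c unlimited"):

salloc -n 1
export IPATH_NO_BACKTRACE
ulimit -s 10240
ulimit -c unlimited
mpirun -np 1 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2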


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
> you can try
> export IPATH_NO_BACKTRACE
> before invoking mpirun (that should not be needed though)
>
> an other test is to
> ulimit -s 10240
> before invoking mpirun.
>
> btw, do you use mpirun or srun ?
>
> can you reproduce the crash with 1 or 2 tasks ?
>
> Cheers,
>
> Gilles
>
> On Friday, July 8, 2016, Gundram Leifert
> <***@uni-rostock.de> wrote:
>
> Hello,
>
> configure:
> ./configure --enable-mpi-java
> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen
> --disable-mca-dso
>
>
> 1 node with 3 cores. I use SLURM to allocate one node. I changed
> --mem, but it has no effect.
> salloc -n 3
>
>
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size (blocks, -f) unlimited
> pending signals (-i) 256564
> max locked memory (kbytes, -l) unlimited
> max memory size (kbytes, -m) unlimited
> open files (-n) 100000
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority (-r) 0
> stack size (kbytes, -s) unlimited
> cpu time (seconds, -t) unlimited
> max user processes (-u) 4096
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
> uname -a
> Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
> 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> cat /etc/system-release
> CentOS Linux release 7.2.1511 (Core)
>
> what else do you need?
>
> Cheers, Gundram
>
> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>
>> Gundram,
>>
>>
>> can you please provide more information on your environment :
>>
>> - configure command line
>>
>> - OS
>>
>> - memory available
>>
>> - ulimit -a
>>
>> - number of nodes
>>
>> - number of tasks used
>>
>> - interconnect used (if any)
>>
>> - batch manager (if any)
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>> Hello Gilles,
>>>
>>> I tried you code and it crashes after 3-15 iterations (see (1)).
>>> It is always the same error (only the "94" varies).
>>>
>>> Meanwhile I think Java and MPI use the same memory because when
>>> I delete the hash-call, the program runs sometimes more than 9k
>>> iterations.
>>> When it crashes, there are different lines (see (2) and (3)).
>>> The crashes also occurs on rank 0.
>>>
>>> ##### (1)#####
>>> # Problematic frame:
>>> # J 94 C2
>>> de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>> (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]
>>>
>>> #####(2)#####
>>> # Problematic frame:
>>> # V [libjvm.so+0x68d0f6]
>>> JavaCallWrapper::JavaCallWrapper(methodHandle, Handle,
>>> JavaValue*, Thread*)+0xb6
>>>
>>> #####(3)#####
>>> # Problematic frame:
>>> # V [libjvm.so+0x4183bf]
>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>
>>> Any more idea?
>>>
>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>>>
>>>> Gundram,
>>>>
>>>>
>>>> fwiw, i cannot reproduce the issue on my box
>>>>
>>>> - centos 7
>>>>
>>>> - java version "1.8.0_71"
>>>> Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)
>>>>
>>>>
>>>> i noticed on non zero rank saveMem is allocated at each iteration.
>>>> ideally, the garbage collector can take care of that and this
>>>> should not be an issue.
>>>>
>>>> would you mind giving the attached file a try ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>> I will have a look at it today
>>>>>
>>>>> how did you configure OpenMPI ?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Thursday, July 7, 2016, Gundram Leifert
>>>>> <***@uni-rostock.de> wrote:
>>>>>
>>>>> Hello Giles,
>>>>>
>>>>> thank you for your hints! I did 3 changes, unfortunately
>>>>> the same error occures:
>>>>>
>>>>> update ompi:
>>>>> commit ae8444682f0a7aa158caea08800542ce9874455e
>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>
>>>>> update java:
>>>>> java version "1.8.0_92"
>>>>> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
>>>>> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>>>>>
>>>>> delete hashcode-lines.
>>>>>
>>>>> Now I get this error message - to 100%, after different
>>>>> number of iterations (15-300):
>>>>>
>>>>> 0/ 3:length = 100000000
>>>>> 0/ 3:bcast length done (length = 100000000)
>>>>> 1/ 3:bcast length done (length = 100000000)
>>>>> 2/ 3:bcast length done (length = 100000000)
>>>>> #
>>>>> # A fatal error has been detected by the Java Runtime
>>>>> Environment:
>>>>> #
>>>>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
>>>>> tid=0x00002b3d29716700
>>>>> #
>>>>> # JRE version: Java(TM) SE Runtime Environment
>>>>> (8.0_92-b14) (build 1.8.0_92-b14)
>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14
>>>>> mixed mode linux-amd64 compressed oops)
>>>>> # Problematic frame:
>>>>> # V [libjvm.so+0x414d24]
>>>>> ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
>>>>> #
>>>>> # Failed to write core dump. Core dumps have been
>>>>> disabled. To enable core dumping, try "ulimit -c
>>>>> unlimited" before starting Java again
>>>>> #
>>>>> # An error report file with more information is saved as:
>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>> #
>>>>> # Compiler replay data is saved as:
>>>>> # /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>> #
>>>>> # If you would like to submit a bug report, please visit:
>>>>> # http://bugreport.java.com/bugreport/crash.jsp
>>>>> #
>>>>> [titan01:16578] *** Process received signal ***
>>>>> [titan01:16578] Signal: Aborted (6)
>>>>> [titan01:16578] Signal code: (-6)
>>>>> [titan01:16578] [ 0]
>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>> [titan01:16578] [ 1]
>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>> [titan01:16578] [ 2]
>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>> [titan01:16578] [ 3]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>> [titan01:16578] [ 4]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>> [titan01:16578] [ 5]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>> [titan01:16578] [ 6]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>> [titan01:16578] [ 7]
>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>> [titan01:16578] [ 8]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>> [titan01:16578] [ 9]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>> [titan01:16578] [10]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>> [titan01:16578] [11]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>> [titan01:16578] [12]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>> [titan01:16578] [13]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>> [titan01:16578] [14]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>> [titan01:16578] [15]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>> [titan01:16578] [16]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>> [titan01:16578] [17]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>> [titan01:16578] [18]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>> [titan01:16578] [19]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>> [titan01:16578] [20]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>> [titan01:16578] [21]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>> [titan01:16578] [22]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>> [titan01:16578] [23]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>> [titan01:16578] [24]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>> [titan01:16578] [25]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>> [titan01:16578] [26]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>> [titan01:16578] [27]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>> [titan01:16578] [28]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>> [titan01:16578] [29]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>> [titan01:16578] *** End of error message ***
>>>>> -------------------------------------------------------
>>>>> Primary job terminated normally, but 1 process returned
>>>>> a non-zero exit code. Per user-direction, the job has been
>>>>> aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 2 with PID 0 on node
>>>>> titan01 exited on signal 6 (Aborted).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> I don't know if it is a problem of java or ompi - but the
>>>>> last years, java worked with no problems on my machine...
>>>>>
>>>>> Thank you for your tips in advance!
>>>>> Gundram
>>>>>
>>>>> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>>>>>> Note a race condition in MPI_Init has been fixed
>>>>>> yesterday in the master.
>>>>>> can you please update your OpenMPI and try again ?
>>>>>>
>>>>>> hopefully the hang will disappear.
>>>>>>
>>>>>> Can you reproduce the crash with a simpler (and ideally
>>>>>> deterministic) version of your program.
>>>>>> the crash occurs in hashcode, and this makes little sense
>>>>>> to me. can you also update your jdk ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On Wednesday, July 6, 2016, Gundram Leifert
>>>>>> <***@uni-rostock.de> wrote:
>>>>>>
>>>>>> Hello Jason,
>>>>>>
>>>>>> thanks for your response! I thing it is another
>>>>>> problem. I try to send 100MB bytes. So there are not
>>>>>> many tries (between 10 and 30). I realized that the
>>>>>> execution of this code can result 3 different errors:
>>>>>>
>>>>>> 1. most often the posted error message occures.
>>>>>>
>>>>>> 2. in <10% the cases i have a live lock. I can see 3
>>>>>> java-processes, one with 200% and two with 100%
>>>>>> processor utilization. After ~15 minutes without new
>>>>>> system outputs this error occurs.
>>>>>>
>>>>>>
>>>>>> [thread 47499823949568 also had an error]
>>>>>> # A fatal error has been detected by the Java Runtime
>>>>>> Environment:
>>>>>> #
>>>>>> # Internal Error (safepoint.cpp:317), pid=24256,
>>>>>> tid=47500347131648
>>>>>> # guarantee(PageArmed == 0) failed: invariant
>>>>>> #
>>>>>> # JRE version: 7.0_25-b15
>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM
>>>>>> (23.25-b01 mixed mode linux-amd64 compressed oops)
>>>>>> # Failed to write core dump. Core dumps have been
>>>>>> disabled. To enable core dumping, try "ulimit -c
>>>>>> unlimited" before starting Java again
>>>>>> #
>>>>>> # An error report file with more information is saved as:
>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>> #
>>>>>> # If you would like to submit a bug report, please visit:
>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>> #
>>>>>> [titan01:24256] *** Process received signal ***
>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>> [titan01:24256] Signal code: (-6)
>>>>>> [titan01:24256] [ 0]
>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>> [titan01:24256] [ 1]
>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>> [titan01:24256] [ 2]
>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>> [titan01:24256] [ 3]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>> [titan01:24256] [ 4]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>> [titan01:24256] [ 5]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>> [titan01:24256] [ 6]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>> [titan01:24256] [ 7]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>> [titan01:24256] [ 8]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>> [titan01:24256] [ 9]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>> [titan01:24256] [10]
>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>> [titan01:24256] [11]
>>>>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>> [titan01:24256] *** End of error message ***
>>>>>> -------------------------------------------------------
>>>>>> Primary job terminated normally, but 1 process returned
>>>>>> a non-zero exit code. Per user-direction, the job has
>>>>>> been aborted.
>>>>>> -------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that process rank 0 with PID 0 on node
>>>>>> titan01 exited on signal 6 (Aborted).
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> 3. in <10% the cases i have a dead lock while
>>>>>> MPI.init. This stays for more than 15 minutes without
>>>>>> returning with an error message...
>>>>>>
>>>>>> Can I enable some debug-flags to see what happens on
>>>>>> C / OpenMPI side?
>>>>>>
>>>>>> Thanks in advance for your help!
>>>>>> Gundram Leifert
>>>>>>
>>>>>>
>>>>>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>>>>>> After reading your thread looks like it may be
>>>>>>> related to an issue I had a few weeks ago (I'm a
>>>>>>> novice though). Maybe my thread will be of help:
>>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>>
>>>>>>>
>>>>>>> When you say "After a specific number of repetitions
>>>>>>> the process either hangs up or returns with a
>>>>>>> SIGSEGV." does you mean that a single call hangs,
>>>>>>> or that at some point during the for loop a call
>>>>>>> hangs? If you mean the latter, then it might relate
>>>>>>> to my issue. Otherwise my thread probably won't be
>>>>>>> helpful.
>>>>>>>
>>>>>>> Jason Maldonis
>>>>>>> Research Assistant of Professor Paul Voyles
>>>>>>> Materials Science Grad Student
>>>>>>> University of Wisconsin, Madison
>>>>>>> 1509 University Ave, Rm M142
>>>>>>> Madison, WI 53706
>>>>>>> ***@wisc.edu
>>>>>>> 608-295-5532
>>>>>>>
>>>>>>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert
>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I try to send many byte-arrays via broadcast.
>>>>>>> After a specific number of repetitions the
>>>>>>> process either hangs up or returns with a
>>>>>>> SIGSEGV. Does any one can help me solving the
>>>>>>> problem:
>>>>>>>
>>>>>>> ########## The code:
>>>>>>>
>>>>>>> import java.util.Random;
>>>>>>> import mpi.*;
>>>>>>>
>>>>>>> public class TestSendBigFiles {
>>>>>>>
>>>>>>> public static void log(String msg) {
>>>>>>> try {
>>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>>> MPI.COMM_WORLD.getRank(),
>>>>>>> MPI.COMM_WORLD.getSize(), msg));
>>>>>>> } catch (MPIException ex) {
>>>>>>> System.err.println(String.format("%2s/%2s:%s",
>>>>>>> "?", "?", msg));
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> private static int hashcode(byte[] bytearray) {
>>>>>>> if (bytearray == null) {
>>>>>>> return 0;
>>>>>>> }
>>>>>>> int hash = 39;
>>>>>>> for (int i = 0; i < bytearray.length; i++) {
>>>>>>> byte b = bytearray[i];
>>>>>>> hash = hash * 7 + (int) b;
>>>>>>> }
>>>>>>> return hash;
>>>>>>> }
>>>>>>>
>>>>>>> public static void main(String args[])
>>>>>>> throws MPIException {
>>>>>>> log("start main");
>>>>>>> MPI.Init(args);
>>>>>>> try {
>>>>>>> log("initialized done");
>>>>>>> byte[] saveMem = new byte[100000000];
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> Random r = new Random();
>>>>>>> r.nextBytes(saveMem);
>>>>>>> if (MPI.COMM_WORLD.getRank() == 0) {
>>>>>>> for (int i = 0; i < 1000; i++) {
>>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>>> log("i = " + i);
>>>>>>> int[] lengthData = new
>>>>>>> int[]{saveMem.length};
>>>>>>> log("object hash = " + hashcode(saveMem));
>>>>>>> log("length = " + lengthData[0]);
>>>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>>> log("bcast length done (length = " +
>>>>>>> lengthData[0] + ")");
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
>>>>>>> MPI.BYTE, 0);
>>>>>>> log("bcast data done");
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> }
>>>>>>> MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>>>>> } else {
>>>>>>> while (true) {
>>>>>>> int[] lengthData = new int[1];
>>>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>>> log("bcast length done (length = " +
>>>>>>> lengthData[0] + ")");
>>>>>>> if (lengthData[0] == 0) {
>>>>>>> break;
>>>>>>> }
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> saveMem = new byte[lengthData[0]];
>>>>>>> MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
>>>>>>> MPI.BYTE, 0);
>>>>>>> log("bcast data done");
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> log("object hash = " + hashcode(saveMem));
>>>>>>> }
>>>>>>> }
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> } catch (MPIException ex) {
>>>>>>> System.out.println("caugth error." + ex);
>>>>>>> log(ex.getMessage());
>>>>>>> } catch (RuntimeException ex) {
>>>>>>> System.out.println("caugth error." + ex);
>>>>>>> log(ex.getMessage());
>>>>>>> } finally {
>>>>>>> MPI.Finalize();
>>>>>>> }
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> ############ The Error (if it does not just hang
>>>>>>> up):
>>>>>>>
>>>>>>> #
>>>>>>> # A fatal error has been detected by the Java
>>>>>>> Runtime Environment:
>>>>>>> #
>>>>>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1,
>>>>>>> pid=1172, tid=47822674495232
>>>>>>> #
>>>>>>> #
>>>>>>> # A fatal error has been detected by the Java
>>>>>>> Runtime Environment:
>>>>>>> # JRE version: 7.0_25-b15
>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM
>>>>>>> (23.25-b01 mixed mode linux-amd64 compressed oops)
>>>>>>> # Problematic frame:
>>>>>>> # #
>>>>>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1,
>>>>>>> pid=1173, tid=47238546896640
>>>>>>> #
>>>>>>> # JRE version: 7.0_25-b15
>>>>>>> J
>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>> #
>>>>>>> # Failed to write core dump. Core dumps have
>>>>>>> been disabled. To enable core dumping, try
>>>>>>> "ulimit -c unlimited" before starting Java again
>>>>>>> #
>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM
>>>>>>> (23.25-b01 mixed mode linux-amd64 compressed oops)
>>>>>>> # Problematic frame:
>>>>>>> # J
>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>> #
>>>>>>> # Failed to write core dump. Core dumps have
>>>>>>> been disabled. To enable core dumping, try
>>>>>>> "ulimit -c unlimited" before starting Java again
>>>>>>> #
>>>>>>> # An error report file with more information is
>>>>>>> saved as:
>>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>> # An error report file with more information is
>>>>>>> saved as:
>>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>> #
>>>>>>> # If you would like to submit a bug report,
>>>>>>> please visit:
>>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>> #
>>>>>>> #
>>>>>>> # If you would like to submit a bug report,
>>>>>>> please visit:
>>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>> #
>>>>>>> [titan01:01172] *** Process received signal ***
>>>>>>> [titan01:01172] Signal: Aborted (6)
>>>>>>> [titan01:01172] Signal code: (-6)
>>>>>>> [titan01:01173] *** Process received signal ***
>>>>>>> [titan01:01173] Signal: Aborted (6)
>>>>>>> [titan01:01173] Signal code: (-6)
>>>>>>> [titan01:01172] [ 0]
>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>> [titan01:01172] [ 1]
>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>> [titan01:01172] [ 2]
>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>> [titan01:01172] [ 3]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>> [titan01:01172] [ 4]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>> [titan01:01172] [ 5]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>> [titan01:01173] [ 1]
>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>>>>>> [titan01:01172] *** End of error message ***
>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>> [titan01:01173] [ 2]
>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>> [titan01:01173] [ 3]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>> [titan01:01173] [ 4]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>> [titan01:01173] [ 5]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>> [titan01:01173] [ 6]
>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>>>>>> [titan01:01173] *** End of error message ***
>>>>>>> -------------------------------------------------------
>>>>>>> Primary job terminated normally, but 1 process
>>>>>>> returned
>>>>>>> a non-zero exit code. Per user-direction, the
>>>>>>> job has been aborted.
>>>>>>> -------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 1 with PID 0 on
>>>>>>> node titan01 exited on signal 6 (Aborted).
>>>>>>>
>>>>>>>
>>>>>>> ########CONFIGURATION:
>>>>>>> I used the ompi master sources from github:
>>>>>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>> Author: Gilles Gouaillardet <***@rist.or.jp>
>>>>>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>>>>>
>>>>>>> ./configure --enable-mpi-java
>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>>
>>>>>>> Thanks a lot for your help!
>>>>>>> Gundram
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> ***@open-mpi.org
>>>>>>> Subscription:
>>>>>>> https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> ***@open-mpi.org
>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> ***@open-mpi.org
>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> ***@open-mpi.org
>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> ***@open-mpi.org
>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> ***@open-mpi.org
>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29592.php
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@open-mpi.org
>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29593.php
>
>
>
> _______________________________________________
> users mailing list
> ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29601.php
Gilles Gouaillardet
2016-07-08 11:40:52 UTC
Permalink
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar' ?
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you run
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird
that your program is still crashing. Another test is to comment out MPI_Bcast
and MPI_Barrier and try again with -np 1.
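
something along these lines, for example (an untested sketch, the class name
is made up; it keeps only MPI.Init/MPI.Finalize and the big allocation):

import mpi.*;

public class TestInitOnly {

    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        try {
            // same big allocation as in TestSendBigFiles2, but no collectives:
            // with -np 1 bcast/barrier are no-ops anyway, so if this still
            // crashes, the collectives are not the culprit.
            byte[] saveMem = new byte[100000000];
            for (int i = 0; i < saveMem.length; i++) {
                saveMem[i]++;
            }
            System.err.println("rank " + MPI.COMM_WORLD.getRank() + " done");
        } finally {
            MPI.Finalize();
        }
    }
}

and run it with e.g.: mpirun -np 1 java -cp . TestInitOnly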

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de>
wrote:

> In any cases the same error.
> this is my code:
>
> salloc -n 3
> export IPATH_NO_BACKTRACE
> ulimit -s 10240
> mpirun -np 3 java -cp executor.jar
> de.uros.citlab.executor.test.TestSendBigFiles2
>
>
> also for 1 or two cores, the process crashes.
>
>
> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>
> you can try
> export IPATH_NO_BACKTRACE
> before invoking mpirun (that should not be needed though)
>
> an other test is to
> ulimit -s 10240
> before invoking mpirun.
>
> btw, do you use mpirun or srun ?
>
> can you reproduce the crash with 1 or 2 tasks ?
>
> Cheers,
>
> Gilles
>
> On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
>
>> Hello,
>>
>> configure:
>> ./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>> --disable-dlopen --disable-mca-dso
>>
>>
>> 1 node with 3 cores. I use SLURM to allocate one node. I changed --mem,
>> but it has no effect.
>> salloc -n 3
>>
>>
>> core file size (blocks, -c) 0
>> data seg size (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size (blocks, -f) unlimited
>> pending signals (-i) 256564
>> max locked memory (kbytes, -l) unlimited
>> max memory size (kbytes, -m) unlimited
>> open files (-n) 100000
>> pipe size (512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority (-r) 0
>> stack size (kbytes, -s) unlimited
>> cpu time (seconds, -t) unlimited
>> max user processes (-u) 4096
>> virtual memory (kbytes, -v) unlimited
>> file locks (-x) unlimited
>>
>> uname -a
>> Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
>> 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>
>> cat /etc/system-release
>> CentOS Linux release 7.2.1511 (Core)
>>
>> what else do you need?
>>
>> Cheers, Gundram
>>
>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>
>> Gundram,
>>
>>
>> can you please provide more information on your environment :
>>
>> - configure command line
>>
>> - OS
>>
>> - memory available
>>
>> - ulimit -a
>>
>> - number of nodes
>>
>> - number of tasks used
>>
>> - interconnect used (if any)
>>
>> - batch manager (if any)
>>
>>
>> Cheers,
>>
>>
>> Gilles
>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>
>> Hello Gilles,
>>
>> I tried you code and it crashes after 3-15 iterations (see (1)). It is
>> always the same error (only the "94" varies).
>>
>> Meanwhile I think Java and MPI use the same memory because when I delete
>> the hash-call, the program runs sometimes more than 9k iterations.
>> When it crashes, there are different lines (see (2) and (3)). The crashes
>> also occurs on rank 0.
>>
>> ##### (1)#####
>> # Problematic frame:
>> # J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>> (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]
>>
>> #####(2)#####
>> # Problematic frame:
>> # V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle,
>> Handle, JavaValue*, Thread*)+0xb6
>>
>> #####(3)#####
>> # Problematic frame:
>> # V [libjvm.so+0x4183bf]
>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>
>> Any more idea?
>>
>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>
>> Gundram,
>>
>>
>> fwiw, i cannot reproduce the issue on my box
>>
>> - centos 7
>>
>> - java version "1.8.0_71"
>> Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)
>>
>>
>> i noticed on non zero rank saveMem is allocated at each iteration.
>> ideally, the garbage collector can take care of that and this should not
>> be an issue.
>>
>> would you mind giving the attached file a try ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>
>> I will have a look at it today
>>
>> how did you configure OpenMPI ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
>>
>>> Hello Giles,
>>>
>>> thank you for your hints! I did 3 changes, unfortunately the same error
>>> occures:
>>>
>>> update ompi:
>>> commit ae8444682f0a7aa158caea08800542ce9874455e
>>> Author: Ralph Castain <***@open-mpi.org>
>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>
>>> update java:
>>> java version "1.8.0_92"
>>> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
>>> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>>>
>>> delete hashcode-lines.
>>>
>>> Now I get this error message - to 100%, after different number of
>>> iterations (15-300):
>>>
>>> 0/ 3:length = 100000000
>>> 0/ 3:bcast length done (length = 100000000)
>>> 1/ 3:bcast length done (length = 100000000)
>>> 2/ 3:bcast length done (length = 100000000)
>>> #
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #
>>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
>>> tid=0x00002b3d29716700
>>> #
>>> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build
>>> 1.8.0_92-b14)
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode
>>> linux-amd64 compressed oops)
>>> # Problematic frame:
>>> # V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*,
>>> int)+0x94
>>> #
>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>> #
>>> # An error report file with more information is saved as:
>>> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>> #
>>> # Compiler replay data is saved as:
>>> # /home/gl069/ompi/bin/executor/replay_pid16578.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> # http://bugreport.java.com/bugreport/crash.jsp
>>> #
>>> [titan01:16578] *** Process received signal ***
>>> [titan01:16578] Signal: Aborted (6)
>>> [titan01:16578] Signal code: (-6)
>>> [titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>> [titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>> [titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>> [titan01:16578] [ 3]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>> [titan01:16578] [ 4]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>> [titan01:16578] [ 5]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>> [titan01:16578] [ 6]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>> [titan01:16578] [ 8]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>> [titan01:16578] [ 9]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>> [titan01:16578] [10]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>> [titan01:16578] [11]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>> [titan01:16578] [12]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>> [titan01:16578] [13]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>> [titan01:16578] [14]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>> [titan01:16578] [15]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>> [titan01:16578] [16]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>> [titan01:16578] [17]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>> [titan01:16578] [18]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>> [titan01:16578] [19]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>> [titan01:16578] [20]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>> [titan01:16578] [21]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>> [titan01:16578] [22]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>> [titan01:16578] [23]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>> [titan01:16578] [24]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>> [titan01:16578] [25]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>> [titan01:16578] [26]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>> [titan01:16578] [27]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>> [titan01:16578] [28]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>> [titan01:16578] [29]
>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>> [titan01:16578] *** End of error message ***
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 2 with PID 0 on node titan01 exited on
>>> signal 6 (Aborted).
>>>
>>> --------------------------------------------------------------------------
>>>
>>> I don't know if it is a problem of java or ompi - but the last years,
>>> java worked with no problems on my machine...
>>>
>>> Thank you for your tips in advance!
>>> Gundram
>>>
>>> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>>>
>>> Note a race condition in MPI_Init has been fixed yesterday in the
>>> master.
>>> can you please update your OpenMPI and try again ?
>>>
>>> hopefully the hang will disappear.
>>>
>>> Can you reproduce the crash with a simpler (and ideally deterministic)
>>> version of your program.
>>> the crash occurs in hashcode, and this makes little sense to me. can you
>>> also update your jdk ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Wednesday, July 6, 2016, Gundram Leifert <
>>> ***@uni-rostock.de> wrote:
>>>
>>>> Hello Jason,
>>>>
>>>> thanks for your response! I thing it is another problem. I try to send
>>>> 100MB bytes. So there are not many tries (between 10 and 30). I realized
>>>> that the execution of this code can result 3 different errors:
>>>>
>>>> 1. most often the posted error message occures.
>>>>
>>>> 2. in <10% the cases i have a live lock. I can see 3 java-processes,
>>>> one with 200% and two with 100% processor utilization. After ~15 minutes
>>>> without new system outputs this error occurs.
>>>>
>>>>
>>>> [thread 47499823949568 also had an error]
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> #
>>>> # Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
>>>> # guarantee(PageArmed == 0) failed: invariant
>>>> #
>>>> # JRE version: 7.0_25-b15
>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>> linux-amd64 compressed oops)
>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>> #
>>>> # An error report file with more information is saved as:
>>>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>> #
>>>> # If you would like to submit a bug report, please visit:
>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>> #
>>>> [titan01:24256] *** Process received signal ***
>>>> [titan01:24256] Signal: Aborted (6)
>>>> [titan01:24256] Signal code: (-6)
>>>> [titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>> [titan01:24256] [ 3]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>> [titan01:24256] [ 4]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>> [titan01:24256] [ 5]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>> [titan01:24256] [ 6]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>> [titan01:24256] [ 7]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>> [titan01:24256] [ 8]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>> [titan01:24256] [ 9]
>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>> [titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>> [titan01:24256] *** End of error message ***
>>>> -------------------------------------------------------
>>>> Primary job terminated normally, but 1 process returned
>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 0 with PID 0 on node titan01 exited on
>>>> signal 6 (Aborted).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>> 3. in <10% the cases i have a dead lock while MPI.init. This stays for
>>>> more than 15 minutes without returning with an error message...
>>>>
>>>> Can I enable some debug-flags to see what happens on C / OpenMPI side?
>>>>
>>>> Thanks in advance for your help!
>>>> Gundram Leifert
>>>>
>>>>
>>>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>>>
>>>> After reading your thread looks like it may be related to an issue I
>>>> had a few weeks ago (I'm a novice though). Maybe my thread will be of help:
>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>
>>>> When you say "After a specific number of repetitions the process
>>>> either hangs up or returns with a SIGSEGV." does you mean that a single
>>>> call hangs, or that at some point during the for loop a call hangs? If you
>>>> mean the latter, then it might relate to my issue. Otherwise my thread
>>>> probably won't be helpful.
>>>>
>>>> Jason Maldonis
>>>> Research Assistant of Professor Paul Voyles
>>>> Materials Science Grad Student
>>>> University of Wisconsin, Madison
>>>> 1509 University Ave, Rm M142
>>>> Madison, WI 53706
>>>> ***@wisc.edu
>>>> 608-295-5532
>>>>
>>>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I try to send many byte-arrays via broadcast. After a specific number
>>>>> of repetitions the process either hangs up or returns with a SIGSEGV. Does
>>>>> any one can help me solving the problem:
>>>>>
>>>>> ########## The code:
>>>>>
>>>>> import java.util.Random;
>>>>> import mpi.*;
>>>>>
>>>>> public class TestSendBigFiles {
>>>>>
>>>>> public static void log(String msg) {
>>>>> try {
>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>> MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>>>> } catch (MPIException ex) {
>>>>> System.err.println(String.format("%2s/%2s:%s", "?", "?",
>>>>> msg));
>>>>> }
>>>>> }
>>>>>
>>>>> private static int hashcode(byte[] bytearray) {
>>>>> if (bytearray == null) {
>>>>> return 0;
>>>>> }
>>>>> int hash = 39;
>>>>> for (int i = 0; i < bytearray.length; i++) {
>>>>> byte b = bytearray[i];
>>>>> hash = hash * 7 + (int) b;
>>>>> }
>>>>> return hash;
>>>>> }
>>>>>
>>>>> public static void main(String args[]) throws MPIException {
>>>>> log("start main");
>>>>> MPI.Init(args);
>>>>> try {
>>>>> log("initialized done");
>>>>> byte[] saveMem = new byte[100000000];
>>>>> MPI.COMM_WORLD.barrier();
>>>>> Random r = new Random();
>>>>> r.nextBytes(saveMem);
>>>>> if (MPI.COMM_WORLD.getRank() == 0) {
>>>>> for (int i = 0; i < 1000; i++) {
>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>> log("i = " + i);
>>>>> int[] lengthData = new int[]{saveMem.length};
>>>>> log("object hash = " + hashcode(saveMem));
>>>>> log("length = " + lengthData[0]);
>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>> log("bcast length done (length = " + lengthData[0]
>>>>> + ")");
>>>>> MPI.COMM_WORLD.barrier();
>>>>> MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
>>>>> MPI.BYTE, 0);
>>>>> log("bcast data done");
>>>>> MPI.COMM_WORLD.barrier();
>>>>> }
>>>>> MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>>> } else {
>>>>> while (true) {
>>>>> int[] lengthData = new int[1];
>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>> log("bcast length done (length = " + lengthData[0]
>>>>> + ")");
>>>>> if (lengthData[0] == 0) {
>>>>> break;
>>>>> }
>>>>> MPI.COMM_WORLD.barrier();
>>>>> saveMem = new byte[lengthData[0]];
>>>>> MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
>>>>> MPI.BYTE, 0);
>>>>> log("bcast data done");
>>>>> MPI.COMM_WORLD.barrier();
>>>>> log("object hash = " + hashcode(saveMem));
>>>>> }
>>>>> }
>>>>> MPI.COMM_WORLD.barrier();
>>>>> } catch (MPIException ex) {
>>>>> System.out.println("caugth error." + ex);
>>>>> log(ex.getMessage());
>>>>> } catch (RuntimeException ex) {
>>>>> System.out.println("caugth error." + ex);
>>>>> log(ex.getMessage());
>>>>> } finally {
>>>>> MPI.Finalize();
>>>>> }
>>>>>
>>>>> }
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>> ############ The Error (if it does not just hang up):
>>>>>
>>>>> #
>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>> #
>>>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
>>>>> #
>>>>> #
>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>> # JRE version: 7.0_25-b15
>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>>> linux-amd64 compressed oops)
>>>>> # Problematic frame:
>>>>> # #
>>>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
>>>>> #
>>>>> # JRE version: 7.0_25-b15
>>>>> J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>> #
>>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>>> #
>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>>> linux-amd64 compressed oops)
>>>>> # Problematic frame:
>>>>> # J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>> #
>>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>>> #
>>>>> # An error report file with more information is saved as:
>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>> # An error report file with more information is saved as:
>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>> #
>>>>> # If you would like to submit a bug report, please visit:
>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>> #
>>>>> #
>>>>> # If you would like to submit a bug report, please visit:
>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>> #
>>>>> [titan01:01172] *** Process received signal ***
>>>>> [titan01:01172] Signal: Aborted (6)
>>>>> [titan01:01172] Signal code: (-6)
>>>>> [titan01:01173] *** Process received signal ***
>>>>> [titan01:01173] Signal: Aborted (6)
>>>>> [titan01:01173] Signal code: (-6)
>>>>> [titan01:01172] [ 0]
>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>> [titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>> [titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>> [titan01:01172] [ 3]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>> [titan01:01172] [ 4]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>> [titan01:01172] [ 5]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>>>> [titan01:01172] *** End of error message ***
>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>> [titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>> [titan01:01173] [ 3]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>> [titan01:01173] [ 4]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>> [titan01:01173] [ 5]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>>>> [titan01:01173] *** End of error message ***
>>>>> -------------------------------------------------------
>>>>> Primary job terminated normally, but 1 process returned
>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited
>>>>> on signal 6 (Aborted).
>>>>>
>>>>>
>>>>> ########CONFIGURATION:
>>>>> I used the ompi master sources from github:
>>>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>> Author: Gilles Gouaillardet <***@rist.or.jp>
>>>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>>>
>>>>> ./configure --enable-mpi-java
>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen
>>>>> --disable-mca-dso
>>>>>
>>>>> Thanks a lot for your help!
>>>>> Gundram
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> ***@open-mpi.org
>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing ***@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing ***@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>>
>>>
>>>
>>
>> _______________________________________________
>> users mailing ***@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>>
>>
>>
>>
>> _______________________________________________
>> users mailing ***@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>>
>>
>>
>>
>> _______________________________________________
>> users mailing ***@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29592.php
>>
>>
>>
>>
>> _______________________________________________
>> users mailing ***@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29593.php
>>
>>
>>
>
> _______________________________________________
> users mailing list ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29601.php
>
>
>
Gundram Leifert
2016-07-08 12:15:03 UTC
Permalink
You made the best of it... thanks a lot!

Without MPI it runs.
Just adding MPI.Init() causes the crash!

Maybe I installed something wrong...

install the newest automake, autoconf, m4 and libtool, in the right order and
with the same prefix
check out ompi
run autogen
configure with the same prefix, pointing to the same JDK I use later
make
make install

I will test some different configurations of ./configure...
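
As a rough sketch, the whole sequence is (the $PREFIX is a placeholder for my
install location; the configure line is the one from my first mail, now
pointing at the newer JDK that shows up in the backtraces):

PREFIX=$HOME/ompi        # assumed install prefix
git clone https://github.com/open-mpi/ompi.git && cd ompi
./autogen.pl
./configure --prefix=$PREFIX --enable-mpi-java \
    --with-jdk-dir=/home/gl069/bin/jdk1.8.0_92 \
    --disable-dlopen --disable-mca-dso
make
make install

One variation to try, as Gilles suggested, is the same configure line without
--disable-dlopen --disable-mca-dso.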


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
> I am running out of ideas ...
>
> what if you do not run within slurm ?
> what if you do not use '-cp executor.jar'
> or what if you configure without --disable-dlopen --disable-mca-dso ?
>
> if you
> mpirun -np 1 ...
> then MPI_Bcast and MPI_Barrier are basically no-op, so it is really
> weird your program is still crashing. an other test is to comment out
> MPI_Bcast and MPI_Barrier and try again with -np 1
>
> Cheers,
>
> Gilles
>
> On Friday, July 8, 2016, Gundram Leifert
> <***@uni-rostock.de> wrote:
>
> In any cases the same error.
> this is my code:
>
> salloc -n 3
> export IPATH_NO_BACKTRACE
> ulimit -s 10240
> mpirun -np 3 java -cp executor.jar
> de.uros.citlab.executor.test.TestSendBigFiles2
>
>
> also for 1 or two cores, the process crashes.
>
>
> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>> you can try
>> export IPATH_NO_BACKTRACE
>> before invoking mpirun (that should not be needed though)
>>
>> an other test is to
>> ulimit -s 10240
>> before invoking mpirun.
>>
>> btw, do you use mpirun or srun ?
>>
>> can you reproduce the crash with 1 or 2 tasks ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, July 8, 2016, Gundram Leifert
>> <***@uni-rostock.de> wrote:
>>
>> Hello,
>>
>> configure:
>> ./configure --enable-mpi-java
>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen
>> --disable-mca-dso
>>
>>
>> 1 node with 3 cores. I use SLURM to allocate one node. I
>> changed --mem, but it has no effect.
>> salloc -n 3
>>
>>
>> core file size (blocks, -c) 0
>> data seg size (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size (blocks, -f) unlimited
>> pending signals (-i) 256564
>> max locked memory (kbytes, -l) unlimited
>> max memory size (kbytes, -m) unlimited
>> open files (-n) 100000
>> pipe size (512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority (-r) 0
>> stack size (kbytes, -s) unlimited
>> cpu time (seconds, -t) unlimited
>> max user processes (-u) 4096
>> virtual memory (kbytes, -v) unlimited
>> file locks (-x) unlimited
>>
>> uname -a
>> Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu
>> Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>
>> cat /etc/system-release
>> CentOS Linux release 7.2.1511 (Core)
>>
>> what else do you need?
>>
>> Cheers, Gundram
>>
>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>
>>> Gundram,
>>>
>>>
>>> can you please provide more information on your environment :
>>>
>>> - configure command line
>>>
>>> - OS
>>>
>>> - memory available
>>>
>>> - ulimit -a
>>>
>>> - number of nodes
>>>
>>> - number of tasks used
>>>
>>> - interconnect used (if any)
>>>
>>> - batch manager (if any)
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>> Hello Gilles,
>>>>
>>>> I tried you code and it crashes after 3-15 iterations (see
>>>> (1)). It is always the same error (only the "94" varies).
>>>>
>>>> Meanwhile I think Java and MPI use the same memory because
>>>> when I delete the hash-call, the program runs sometimes
>>>> more than 9k iterations.
>>>> When it crashes, there are different lines (see (2) and
>>>> (3)). The crashes also occurs on rank 0.
>>>>
>>>> ##### (1)#####
>>>> # Problematic frame:
>>>> # J 94 C2
>>>> de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>>> (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]
>>>>
>>>> #####(2)#####
>>>> # Problematic frame:
>>>> # V [libjvm.so+0x68d0f6]
>>>> JavaCallWrapper::JavaCallWrapper(methodHandle, Handle,
>>>> JavaValue*, Thread*)+0xb6
>>>>
>>>> #####(3)#####
>>>> # Problematic frame:
>>>> # V [libjvm.so+0x4183bf]
>>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>>
>>>> Any more idea?
>>>>
>>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>>>>
>>>>> Gundram,
>>>>>
>>>>>
>>>>> fwiw, i cannot reproduce the issue on my box
>>>>>
>>>>> - centos 7
>>>>>
>>>>> - java version "1.8.0_71"
>>>>> Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
>>>>> Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15,
>>>>> mixed mode)
>>>>>
>>>>>
>>>>> i noticed on non zero rank saveMem is allocated at each
>>>>> iteration.
>>>>> ideally, the garbage collector can take care of that and
>>>>> this should not be an issue.
>>>>>
>>>>> would you mind giving the attached file a try ?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>>> I will have a look at it today
>>>>>>
>>>>>> how did you configure OpenMPI ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On Thursday, July 7, 2016, Gundram Leifert
>>>>>> <***@uni-rostock.de> wrote:
>>>>>>
>>>>>> Hello Giles,
>>>>>>
>>>>>> thank you for your hints! I did 3 changes,
>>>>>> unfortunately the same error occures:
>>>>>>
>>>>>> update ompi:
>>>>>> commit ae8444682f0a7aa158caea08800542ce9874455e
>>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>>
>>>>>> update java:
>>>>>> java version "1.8.0_92"
>>>>>> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
>>>>>> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>>>>>>
>>>>>> delete hashcode-lines.
>>>>>>
>>>>>> Now I get this error message - to 100%, after
>>>>>> different number of iterations (15-300):
>>>>>>
>>>>>> 0/ 3:length = 100000000
>>>>>> 0/ 3:bcast length done (length = 100000000)
>>>>>> 1/ 3:bcast length done (length = 100000000)
>>>>>> 2/ 3:bcast length done (length = 100000000)
>>>>>> #
>>>>>> # A fatal error has been detected by the Java Runtime
>>>>>> Environment:
>>>>>> #
>>>>>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
>>>>>> tid=0x00002b3d29716700
>>>>>> #
>>>>>> # JRE version: Java(TM) SE Runtime Environment
>>>>>> (8.0_92-b14) (build 1.8.0_92-b14)
>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM
>>>>>> (25.92-b14 mixed mode linux-amd64 compressed oops)
>>>>>> # Problematic frame:
>>>>>> # V [libjvm.so+0x414d24]
>>>>>> ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
>>>>>> #
>>>>>> # Failed to write core dump. Core dumps have been
>>>>>> disabled. To enable core dumping, try "ulimit -c
>>>>>> unlimited" before starting Java again
>>>>>> #
>>>>>> # An error report file with more information is saved as:
>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>>> #
>>>>>> # Compiler replay data is saved as:
>>>>>> # /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>>> #
>>>>>> # If you would like to submit a bug report, please visit:
>>>>>> # http://bugreport.java.com/bugreport/crash.jsp
>>>>>> #
>>>>>> [titan01:16578] *** Process received signal ***
>>>>>> [titan01:16578] Signal: Aborted (6)
>>>>>> [titan01:16578] Signal code: (-6)
>>>>>> [titan01:16578] [ 0]
>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>>> [titan01:16578] [ 1]
>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>>> [titan01:16578] [ 2]
>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>>> [titan01:16578] [ 3]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>>> [titan01:16578] [ 4]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>>> [titan01:16578] [ 5]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>>> [titan01:16578] [ 6]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>>> [titan01:16578] [ 7]
>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>>> [titan01:16578] [ 8]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>>> [titan01:16578] [ 9]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>>> [titan01:16578] [10]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>>> [titan01:16578] [11]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>>> [titan01:16578] [12]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>> [titan01:16578] [13]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>> [titan01:16578] [14]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>> [titan01:16578] [15]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>> [titan01:16578] [16]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>> [titan01:16578] [17]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>> [titan01:16578] [18]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>> [titan01:16578] [19]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>> [titan01:16578] [20]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>> [titan01:16578] [21]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>> [titan01:16578] [22]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>>> [titan01:16578] [23]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>>> [titan01:16578] [24]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>>> [titan01:16578] [25]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>>> [titan01:16578] [26]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>>> [titan01:16578] [27]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>>> [titan01:16578] [28]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>>> [titan01:16578] [29]
>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>>> [titan01:16578] *** End of error message ***
>>>>>> -------------------------------------------------------
>>>>>> Primary job terminated normally, but 1 process returned
>>>>>> a non-zero exit code. Per user-direction, the job has
>>>>>> been aborted.
>>>>>> -------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that process rank 2 with PID 0 on node
>>>>>> titan01 exited on signal 6 (Aborted).
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> I don't know if it is a problem of java or ompi -
>>>>>> but the last years, java worked with no problems on
>>>>>> my machine...
>>>>>>
>>>>>> Thank you for your tips in advance!
>>>>>> Gundram
>>>>>>
>>>>>> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>>>>>>> Note a race condition in MPI_Init has been fixed
>>>>>>> yesterday in the master.
>>>>>>> can you please update your OpenMPI and try again ?
>>>>>>>
>>>>>>> hopefully the hang will disappear.
>>>>>>>
>>>>>>> Can you reproduce the crash with a simpler (and
>>>>>>> ideally deterministic) version of your program.
>>>>>>> the crash occurs in hashcode, and this makes little
>>>>>>> sense to me. can you also update your jdk ?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On Wednesday, July 6, 2016, Gundram Leifert
>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>
>>>>>>> Hello Jason,
>>>>>>>
>>>>>>> thanks for your response! I think it is another
>>>>>>> problem. I try to send 100 MB of data. So there are
>>>>>>> not many tries (between 10 and 30). I realized
>>>>>>> that the execution of this code can result in 3
>>>>>>> different errors:
>>>>>>>
>>>>>>> 1. most often the posted error message occurs.
>>>>>>>
>>>>>>> 2. in <10% of the cases I have a livelock. I can
>>>>>>> see 3 java processes, one with 200% and two with
>>>>>>> 100% processor utilization. After ~15 minutes
>>>>>>> without new system output this error occurs.
>>>>>>>
>>>>>>>
>>>>>>> [thread 47499823949568 also had an error]
>>>>>>> # A fatal error has been detected by the Java
>>>>>>> Runtime Environment:
>>>>>>> #
>>>>>>> # Internal Error (safepoint.cpp:317),
>>>>>>> pid=24256, tid=47500347131648
>>>>>>> # guarantee(PageArmed == 0) failed: invariant
>>>>>>> #
>>>>>>> # JRE version: 7.0_25-b15
>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM
>>>>>>> (23.25-b01 mixed mode linux-amd64 compressed oops)
>>>>>>> # Failed to write core dump. Core dumps have
>>>>>>> been disabled. To enable core dumping, try
>>>>>>> "ulimit -c unlimited" before starting Java again
>>>>>>> #
>>>>>>> # An error report file with more information is
>>>>>>> saved as:
>>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>>> #
>>>>>>> # If you would like to submit a bug report,
>>>>>>> please visit:
>>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>> #
>>>>>>> [titan01:24256] *** Process received signal ***
>>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>>> [titan01:24256] Signal code: (-6)
>>>>>>> [titan01:24256] [ 0]
>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>>> [titan01:24256] [ 1]
>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>>> [titan01:24256] [ 2]
>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>>> [titan01:24256] [ 3]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>>> [titan01:24256] [ 4]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>>> [titan01:24256] [ 5]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>>> [titan01:24256] [ 6]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>>> [titan01:24256] [ 7]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>>> [titan01:24256] [ 8]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>>> [titan01:24256] [ 9]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>>> [titan01:24256] [10]
>>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>>> [titan01:24256] [11]
>>>>>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>>> [titan01:24256] *** End of error message ***
>>>>>>> -------------------------------------------------------
>>>>>>> Primary job terminated normally, but 1 process
>>>>>>> returned
>>>>>>> a non-zero exit code. Per user-direction, the
>>>>>>> job has been aborted.
>>>>>>> -------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 0 with PID 0 on
>>>>>>> node titan01 exited on signal 6 (Aborted).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> 3. in <10% of the cases I have a deadlock during
>>>>>>> MPI.Init. This stays for more than 15 minutes
>>>>>>> without returning an error message...
>>>>>>>
>>>>>>> Can I enable some debug-flags to see what
>>>>>>> happens on C / OpenMPI side?
>>>>>>>
>>>>>>> Thanks in advance for your help!
>>>>>>> Gundram Leifert
>>>>>>>
>>>>>>>
>>>>>>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>>>>>>> After reading your thread looks like it may be
>>>>>>>> related to an issue I had a few weeks ago (I'm
>>>>>>>> a novice though). Maybe my thread will be of
>>>>>>>> help:
>>>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>>>
>>>>>>>>
>>>>>>>> When you say "After a specific number of
>>>>>>>> repetitions the process either hangs up or
>>>>>>>> returns with a SIGSEGV." do you mean that a
>>>>>>>> single call hangs, or that at some point during
>>>>>>>> the for loop a call hangs? If you mean the
>>>>>>>> latter, then it might relate to my issue.
>>>>>>>> Otherwise my thread probably won't be helpful.
>>>>>>>>
>>>>>>>> Jason Maldonis
>>>>>>>> Research Assistant of Professor Paul Voyles
>>>>>>>> Materials Science Grad Student
>>>>>>>> University of Wisconsin, Madison
>>>>>>>> 1509 University Ave, Rm M142
>>>>>>>> Madison, WI 53706
>>>>>>>> ***@wisc.edu
>>>>>>>> 608-295-5532
>>>>>>>>
>>>>>>>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert
>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I try to send many byte-arrays via
>>>>>>>> broadcast. After a specific number of
>>>>>>>> repetitions the process either hangs up or
>>>>>>>> returns with a SIGSEGV. Does any one can
>>>>>>>> help me solving the problem:
>>>>>>>>
>>>>>>>> ########## The code:
>>>>>>>>
>>>>>>>> import java.util.Random;
>>>>>>>> import mpi.*;
>>>>>>>>
>>>>>>>> public class TestSendBigFiles {
>>>>>>>>
>>>>>>>> public static void log(String msg) {
>>>>>>>> try {
>>>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>>>> MPI.COMM_WORLD.getRank(),
>>>>>>>> MPI.COMM_WORLD.getSize(), msg));
>>>>>>>> } catch (MPIException ex) {
>>>>>>>> System.err.println(String.format("%2s/%2s:%s",
>>>>>>>> "?", "?", msg));
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> private static int hashcode(byte[]
>>>>>>>> bytearray) {
>>>>>>>> if (bytearray == null) {
>>>>>>>> return 0;
>>>>>>>> }
>>>>>>>> int hash = 39;
>>>>>>>> for (int i = 0; i <
>>>>>>>> bytearray.length; i++) {
>>>>>>>> byte b = bytearray[i];
>>>>>>>> hash = hash * 7 + (int) b;
>>>>>>>> }
>>>>>>>> return hash;
>>>>>>>> }
>>>>>>>>
>>>>>>>> public static void main(String args[])
>>>>>>>> throws MPIException {
>>>>>>>> log("start main");
>>>>>>>> MPI.Init(args);
>>>>>>>> try {
>>>>>>>> log("initialized done");
>>>>>>>> byte[] saveMem = new
>>>>>>>> byte[100000000];
>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>> Random r = new Random();
>>>>>>>> r.nextBytes(saveMem);
>>>>>>>> if (MPI.COMM_WORLD.getRank() ==
>>>>>>>> 0) {
>>>>>>>> for (int i = 0; i < 1000; i++) {
>>>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>>>> log("i = " + i);
>>>>>>>> int[] lengthData = new int[]{saveMem.length};
>>>>>>>> log("object hash = " + hashcode(saveMem));
>>>>>>>> log("length = " + lengthData[0]);
>>>>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>>>> log("bcast length done (length = " +
>>>>>>>> lengthData[0] + ")");
>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>> lengthData[0], MPI.BYTE, 0);
>>>>>>>> log("bcast data done");
>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>> }
>>>>>>>> MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>>>>>> } else {
>>>>>>>> while (true) {
>>>>>>>> int[] lengthData = new int[1];
>>>>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>>>> log("bcast length done (length = " +
>>>>>>>> lengthData[0] + ")");
>>>>>>>> if (lengthData[0] == 0) {
>>>>>>>> break;
>>>>>>>> }
>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>> saveMem = new byte[lengthData[0]];
>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>> saveMem.length, MPI.BYTE, 0);
>>>>>>>> log("bcast data done");
>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>> log("object hash = " + hashcode(saveMem));
>>>>>>>> }
>>>>>>>> }
>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>> } catch (MPIException ex) {
>>>>>>>> System.out.println("caugth error." + ex);
>>>>>>>> log(ex.getMessage());
>>>>>>>> } catch (RuntimeException ex) {
>>>>>>>> System.out.println("caugth error." + ex);
>>>>>>>> log(ex.getMessage());
>>>>>>>> } finally {
>>>>>>>> MPI.Finalize();
>>>>>>>> }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> ############ The Error (if it does not just
>>>>>>>> hang up):
>>>>>>>>
>>>>>>>> #
>>>>>>>> # A fatal error has been detected by the
>>>>>>>> Java Runtime Environment:
>>>>>>>> #
>>>>>>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1,
>>>>>>>> pid=1172, tid=47822674495232
>>>>>>>> #
>>>>>>>> #
>>>>>>>> # A fatal error has been detected by the
>>>>>>>> Java Runtime Environment:
>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server
>>>>>>>> VM (23.25-b01 mixed mode linux-amd64
>>>>>>>> compressed oops)
>>>>>>>> # Problematic frame:
>>>>>>>> # #
>>>>>>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1,
>>>>>>>> pid=1173, tid=47238546896640
>>>>>>>> #
>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>> J
>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>> #
>>>>>>>> # Failed to write core dump. Core dumps
>>>>>>>> have been disabled. To enable core dumping,
>>>>>>>> try "ulimit -c unlimited" before starting
>>>>>>>> Java again
>>>>>>>> #
>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server
>>>>>>>> VM (23.25-b01 mixed mode linux-amd64
>>>>>>>> compressed oops)
>>>>>>>> # Problematic frame:
>>>>>>>> # J
>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>> #
>>>>>>>> # Failed to write core dump. Core dumps
>>>>>>>> have been disabled. To enable core dumping,
>>>>>>>> try "ulimit -c unlimited" before starting
>>>>>>>> Java again
>>>>>>>> #
>>>>>>>> # An error report file with more
>>>>>>>> information is saved as:
>>>>>>>> #
>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>>> # An error report file with more
>>>>>>>> information is saved as:
>>>>>>>> #
>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>>> #
>>>>>>>> # If you would like to submit a bug report,
>>>>>>>> please visit:
>>>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>> #
>>>>>>>> #
>>>>>>>> # If you would like to submit a bug report,
>>>>>>>> please visit:
>>>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>> #
>>>>>>>> [titan01:01172] *** Process received signal ***
>>>>>>>> [titan01:01172] Signal: Aborted (6)
>>>>>>>> [titan01:01172] Signal code: (-6)
>>>>>>>> [titan01:01173] *** Process received signal ***
>>>>>>>> [titan01:01173] Signal: Aborted (6)
>>>>>>>> [titan01:01173] Signal code: (-6)
>>>>>>>> [titan01:01172] [ 0]
>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>>> [titan01:01172] [ 1]
>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>>> [titan01:01172] [ 2]
>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>>> [titan01:01172] [ 3]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>>> [titan01:01172] [ 4]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>>> [titan01:01172] [ 5]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>>> [titan01:01173] [ 1]
>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>>>>>>> [titan01:01172] *** End of error message ***
>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>>> [titan01:01173] [ 2]
>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>>> [titan01:01173] [ 3]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>>> [titan01:01173] [ 4]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>>> [titan01:01173] [ 5]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>>> [titan01:01173] [ 6]
>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>>>>>>> [titan01:01173] *** End of error message ***
>>>>>>>> -------------------------------------------------------
>>>>>>>> Primary job terminated normally, but 1
>>>>>>>> process returned
>>>>>>>> a non-zero exit code. Per user-direction,
>>>>>>>> the job has been aborted.
>>>>>>>> -------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that process rank 1 with PID
>>>>>>>> 0 on node titan01 exited on signal 6 (Aborted).
>>>>>>>>
>>>>>>>>
>>>>>>>> ########CONFIGURATION:
>>>>>>>> I used the ompi master sources from github:
>>>>>>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>>> Author: Gilles Gouaillardet
>>>>>>>> <***@rist.or.jp>
>>>>>>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>>>>>>
>>>>>>>> ./configure --enable-mpi-java
>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>>>
>>>>>>>> Thanks a lot for your help!
>>>>>>>> Gundram
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> ***@open-mpi.org
>>>>>>>> Subscription:
>>>>>>>> https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> Link to this post:
>>>>>>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> ***@open-mpi.org
>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> ***@open-mpi.org
>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> ***@open-mpi.org
>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> ***@open-mpi.org
>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> ***@open-mpi.org
>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29592.php
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> ***@open-mpi.org
>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29593.php
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@open-mpi.org
>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29601.php
>
>
>
> _______________________________________________
> users mailing list
> ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29603.php
Gilles Gouaillardet
2016-07-08 12:36:43 UTC
Permalink
the JVM sets its own signal handlers, and it is important that Open MPI does not
override them.
this is what previously happened with PSM (InfiniPath), but that has been
solved since.
you might be linking with a third-party library that hijacks the signal
handlers and causes the crash
(which would explain why I cannot reproduce the issue)
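
one quick way to check that (just a sketch, not a definitive fix): preload the JVM's
signal-chaining library so native handlers get chained instead of clobbered, and/or keep
Open MPI away from SIGSEGV entirely. the opal_signal value below is an assumption on my
side, please verify the parameter name and default with "ompi_info --all | grep opal_signal"
on your build:

export LD_PRELOAD=$JAVA_HOME/jre/lib/amd64/libjsig.so
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2

# assumed default is "6,7,8,11"; dropping 11 keeps Open MPI from trapping SIGSEGV
mpirun --mca opal_signal "6,7,8" -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2

the hs_err_pid*.log files you already have also list the loaded dynamic libraries, which
should show whether some third-party library is involved.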

the master branch has a revamped memory patcher (compared to v2.x or
v1.10), and that could have some bad interactions with the JVM, so you
might also give v2.x a try
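
a rough sketch of both experiments, reusing the configure line and jdk path from earlier in
this thread (the --without-memory-manager option is quoted from memory, please double-check
it against ./configure --help before relying on it):

# try the v2.x branch instead of master
git checkout v2.x
./autogen.pl
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.8.0_92
make && make install

# or rebuild master without the memory hooks (assumed option name, see above)
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.8.0_92 --without-memory-manager
make && make install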

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de>
wrote:

> You made the best of it... thanks a lot!
>
> Without MPI it runs.
> Just adding MPI.init() causes the crash!
>
> maybe I installed something wrong...
>
> install newest automake, autoconf, m4, libtoolize in right order and same
> prefix
> check out ompi,
> autogen
> configure with the same prefix, pointing to the same jdk I later use
> make
> make install
>
> I will test some different configurations of ./configure...
>
>
> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>
> I am running out of ideas ...
>
> what if you do not run within slurm ?
> what if you do not use '-cp executor.jar'
> or what if you configure without --disable-dlopen --disable-mca-dso ?
>
> if you
> mpirun -np 1 ...
> then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird
> that your program is still crashing. another test is to comment out MPI_Bcast
> and MPI_Barrier and try again with -np 1
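> for completeness, a minimal sketch of that stripped-down test (the class name
> TestInitOnly is just an example, not from the original program; add the mpi.jar
> from this build to the classpath as needed):
>
> import mpi.*;
>
> public class TestInitOnly {
>     public static void main(String[] args) throws MPIException {
>         MPI.Init(args);
>         // same 100 MB allocation as the original test, but no bcast/barrier
>         byte[] saveMem = new byte[100000000];
>         saveMem[0]++;
>         System.err.println("rank " + MPI.COMM_WORLD.getRank() + " ok, length = " + saveMem.length);
>         MPI.Finalize();
>     }
> }
>
> mpirun -np 1 java -cp . TestInitOnly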
>
> Cheers,
>
> Gilles
>
> On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
>
>> In any cases the same error.
>> this is my code:
>>
>> salloc -n 3
>> export IPATH_NO_BACKTRACE
>> ulimit -s 10240
>> mpirun -np 3 java -cp executor.jar
>> de.uros.citlab.executor.test.TestSendBigFiles2
>>
>>
>> also for 1 or two cores, the process crashes.
>>
>>
>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>
>> you can try
>> export IPATH_NO_BACKTRACE
>> before invoking mpirun (that should not be needed though)
>>
>> another test is to
>> ulimit -s 10240
>> before invoking mpirun.
>>
>> btw, do you use mpirun or srun ?
>>
>> can you reproduce the crash with 1 or 2 tasks ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
>>
>>> Hello,
>>>
>>> configure:
>>> ./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>> --disable-dlopen --disable-mca-dso
>>>
>>>
>>> 1 node with 3 cores. I use SLURM to allocate one node. I changed --mem,
>>> but it has no effect.
>>> salloc -n 3
>>>
>>>
>>> core file size (blocks, -c) 0
>>> data seg size (kbytes, -d) unlimited
>>> scheduling priority (-e) 0
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 256564
>>> max locked memory (kbytes, -l) unlimited
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 100000
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority (-r) 0
>>> stack size (kbytes, -s) unlimited
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) 4096
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>>
>>> uname -a
>>> Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
>>> 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> cat /etc/system-release
>>> CentOS Linux release 7.2.1511 (Core)
>>>
>>> what else do you need?
>>>
>>> Cheers, Gundram
>>>
>>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>
>>> Gundram,
>>>
>>>
>>> can you please provide more information on your environment :
>>>
>>> - configure command line
>>>
>>> - OS
>>>
>>> - memory available
>>>
>>> - ulimit -a
>>>
>>> - number of nodes
>>>
>>> - number of tasks used
>>>
>>> - interconnect used (if any)
>>>
>>> - batch manager (if any)
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>
>>> Hello Gilles,
>>>
>>> I tried your code and it crashes after 3-15 iterations (see (1)). It is
>>> always the same error (only the "94" varies).
>>>
>>> Meanwhile I think Java and MPI use the same memory because when I delete
>>> the hash-call, the program runs sometimes more than 9k iterations.
>>> When it crashes, there are different lines (see (2) and (3)). The
>>> crashes also occur on rank 0.
>>>
>>> ##### (1)#####
>>> # Problematic frame:
>>> # J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>> (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]
>>>
>>> #####(2)#####
>>> # Problematic frame:
>>> # V [libjvm.so+0x68d0f6]
>>> JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*,
>>> Thread*)+0xb6
>>>
>>> #####(3)#####
>>> # Problematic frame:
>>> # V [libjvm.so+0x4183bf]
>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>
>>> Any more idea?
>>>
>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>>
>>> Gundram,
>>>
>>>
>>> fwiw, i cannot reproduce the issue on my box
>>>
>>> - centos 7
>>>
>>> - java version "1.8.0_71"
>>> Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)
>>>
>>>
>>> i noticed on non zero rank saveMem is allocated at each iteration.
>>> ideally, the garbage collector can take care of that and this should not
>>> be an issue.
>>>
>>> would you mind giving the attached file a try ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>
>>> I will have a look at it today
>>>
>>> how did you configure OpenMPI ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thursday, July 7, 2016, Gundram Leifert <
>>> ***@uni-rostock.de> wrote:
>>>
>>>> Hello Gilles,
>>>>
>>>> thank you for your hints! I did 3 changes, unfortunately the same error
>>>> occurs:
>>>>
>>>> update ompi:
>>>> commit ae8444682f0a7aa158caea08800542ce9874455e
>>>> Author: Ralph Castain <***@open-mpi.org>
>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>
>>>> update java:
>>>> java version "1.8.0_92"
>>>> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
>>>> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>>>>
>>>> delete hashcode-lines.
>>>>
>>>> Now I get this error message - to 100%, after different number of
>>>> iterations (15-300):
>>>>
>>>> 0/ 3:length = 100000000
>>>> 0/ 3:bcast length done (length = 100000000)
>>>> 1/ 3:bcast length done (length = 100000000)
>>>> 2/ 3:bcast length done (length = 100000000)
>>>> #
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> #
>>>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
>>>> tid=0x00002b3d29716700
>>>> #
>>>> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build
>>>> 1.8.0_92-b14)
>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode
>>>> linux-amd64 compressed oops)
>>>> # Problematic frame:
>>>> # V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*,
>>>> int)+0x94
>>>> #
>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>> #
>>>> # An error report file with more information is saved as:
>>>> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>> #
>>>> # Compiler replay data is saved as:
>>>> # /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>> #
>>>> # If you would like to submit a bug report, please visit:
>>>> # http://bugreport.java.com/bugreport/crash.jsp
>>>> #
>>>> [titan01:16578] *** Process received signal ***
>>>> [titan01:16578] Signal: Aborted (6)
>>>> [titan01:16578] Signal code: (-6)
>>>> [titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>> [titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>> [titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>> [titan01:16578] [ 3]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>> [titan01:16578] [ 4]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>> [titan01:16578] [ 5]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>> [titan01:16578] [ 6]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>> [titan01:16578] [ 8]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>> [titan01:16578] [ 9]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>> [titan01:16578] [10]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>> [titan01:16578] [11]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>> [titan01:16578] [12]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>> [titan01:16578] [13]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>> [titan01:16578] [14]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>> [titan01:16578] [15]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>> [titan01:16578] [16]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>> [titan01:16578] [17]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>> [titan01:16578] [18]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>> [titan01:16578] [19]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>> [titan01:16578] [20]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>> [titan01:16578] [21]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>> [titan01:16578] [22]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>> [titan01:16578] [23]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>> [titan01:16578] [24]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>> [titan01:16578] [25]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>> [titan01:16578] [26]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>> [titan01:16578] [27]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>> [titan01:16578] [28]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>> [titan01:16578] [29]
>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>> [titan01:16578] *** End of error message ***
>>>> -------------------------------------------------------
>>>> Primary job terminated normally, but 1 process returned
>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 2 with PID 0 on node titan01 exited on
>>>> signal 6 (Aborted).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> I don't know if it is a problem of java or ompi - but the last years,
>>>> java worked with no problems on my machine...
>>>>
>>>> Thank you for your tips in advance!
>>>> Gundram
>>>>
>>>> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>>>>
>>>> Note a race condition in MPI_Init has been fixed yesterday in the
>>>> master.
>>>> can you please update your OpenMPI and try again ?
>>>>
>>>> hopefully the hang will disappear.
>>>>
>>>> Can you reproduce the crash with a simpler (and ideally deterministic)
>>>> version of your program.
>>>> the crash occurs in hashcode, and this makes little sense to me. can
>>>> you also update your jdk ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
>>>>
>>>>> Hello Jason,
>>>>>
>>>>> thanks for your response! I think it is another problem. I try to send
>>>>> 100 MB of data. So there are not many tries (between 10 and 30). I realized
>>>>> that the execution of this code can result in 3 different errors:
>>>>>
>>>>> 1. most often the posted error message occurs.
>>>>>
>>>>> 2. in <10% of the cases I have a livelock. I can see 3 java processes,
>>>>> one with 200% and two with 100% processor utilization. After ~15 minutes
>>>>> without new system output this error occurs.
>>>>>
>>>>>
>>>>> [thread 47499823949568 also had an error]
>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>> #
>>>>> # Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
>>>>> # guarantee(PageArmed == 0) failed: invariant
>>>>> #
>>>>> # JRE version: 7.0_25-b15
>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>>> linux-amd64 compressed oops)
>>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>>> #
>>>>> # An error report file with more information is saved as:
>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>> #
>>>>> # If you would like to submit a bug report, please visit:
>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>> #
>>>>> [titan01:24256] *** Process received signal ***
>>>>> [titan01:24256] Signal: Aborted (6)
>>>>> [titan01:24256] Signal code: (-6)
>>>>> [titan01:24256] [ 0]
>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>> [titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>> [titan01:24256] [ 3]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>> [titan01:24256] [ 4]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>> [titan01:24256] [ 5]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>> [titan01:24256] [ 6]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>> [titan01:24256] [ 7]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>> [titan01:24256] [ 8]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>> [titan01:24256] [ 9]
>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>> [titan01:24256] [10]
>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>> [titan01:24256] *** End of error message ***
>>>>> -------------------------------------------------------
>>>>> Primary job terminated normally, but 1 process returned
>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 0 on node titan01 exited
>>>>> on signal 6 (Aborted).
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> 3. in <10% of the cases I have a deadlock during MPI.Init. This stays for
>>>>> more than 15 minutes without returning an error message...
>>>>>
>>>>> Can I enable some debug-flags to see what happens on C / OpenMPI side?
>>>>>
>>>>> Thanks in advance for your help!
>>>>> Gundram Leifert
>>>>>
>>>>>
>>>>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>>>>
>>>>> After reading your thread looks like it may be related to an issue I
>>>>> had a few weeks ago (I'm a novice though). Maybe my thread will be of help:
>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>
>>>>> When you say "After a specific number of repetitions the process
>>>>> either hangs up or returns with a SIGSEGV." do you mean that a single
>>>>> call hangs, or that at some point during the for loop a call hangs? If you
>>>>> mean the latter, then it might relate to my issue. Otherwise my thread
>>>>> probably won't be helpful.
>>>>>
>>>>> Jason Maldonis
>>>>> Research Assistant of Professor Paul Voyles
>>>>> Materials Science Grad Student
>>>>> University of Wisconsin, Madison
>>>>> 1509 University Ave, Rm M142
>>>>> Madison, WI 53706
>>>>> ***@wisc.edu
>>>>> 608-295-5532
>>>>>
>>>>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I try to send many byte-arrays via broadcast. After a specific number
>>>>>> of repetitions the process either hangs up or returns with a SIGSEGV. Does
>>>>>> any one can help me solving the problem:
>>>>>>
>>>>>> ########## The code:
>>>>>>
>>>>>> import java.util.Random;
>>>>>> import mpi.*;
>>>>>>
>>>>>> public class TestSendBigFiles {
>>>>>>
>>>>>> public static void log(String msg) {
>>>>>> try {
>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>> MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>>>>> } catch (MPIException ex) {
>>>>>> System.err.println(String.format("%2s/%2s:%s", "?", "?",
>>>>>> msg));
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> private static int hashcode(byte[] bytearray) {
>>>>>> if (bytearray == null) {
>>>>>> return 0;
>>>>>> }
>>>>>> int hash = 39;
>>>>>> for (int i = 0; i < bytearray.length; i++) {
>>>>>> byte b = bytearray[i];
>>>>>> hash = hash * 7 + (int) b;
>>>>>> }
>>>>>> return hash;
>>>>>> }
>>>>>>
>>>>>> public static void main(String args[]) throws MPIException {
>>>>>> log("start main");
>>>>>> MPI.Init(args);
>>>>>> try {
>>>>>> log("initialized done");
>>>>>> byte[] saveMem = new byte[100000000];
>>>>>> MPI.COMM_WORLD.barrier();
>>>>>> Random r = new Random();
>>>>>> r.nextBytes(saveMem);
>>>>>> if (MPI.COMM_WORLD.getRank() == 0) {
>>>>>> for (int i = 0; i < 1000; i++) {
>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>> log("i = " + i);
>>>>>> int[] lengthData = new int[]{saveMem.length};
>>>>>> log("object hash = " + hashcode(saveMem));
>>>>>> log("length = " + lengthData[0]);
>>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>> log("bcast length done (length = " +
>>>>>> lengthData[0] + ")");
>>>>>> MPI.COMM_WORLD.barrier();
>>>>>> MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
>>>>>> MPI.BYTE, 0);
>>>>>> log("bcast data done");
>>>>>> MPI.COMM_WORLD.barrier();
>>>>>> }
>>>>>> MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>>>> } else {
>>>>>> while (true) {
>>>>>> int[] lengthData = new int[1];
>>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>> log("bcast length done (length = " +
>>>>>> lengthData[0] + ")");
>>>>>> if (lengthData[0] == 0) {
>>>>>> break;
>>>>>> }
>>>>>> MPI.COMM_WORLD.barrier();
>>>>>> saveMem = new byte[lengthData[0]];
>>>>>> MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
>>>>>> MPI.BYTE, 0);
>>>>>> log("bcast data done");
>>>>>> MPI.COMM_WORLD.barrier();
>>>>>> log("object hash = " + hashcode(saveMem));
>>>>>> }
>>>>>> }
>>>>>> MPI.COMM_WORLD.barrier();
>>>>>> } catch (MPIException ex) {
>>>>>> System.out.println("caugth error." + ex);
>>>>>> log(ex.getMessage());
>>>>>> } catch (RuntimeException ex) {
>>>>>> System.out.println("caugth error." + ex);
>>>>>> log(ex.getMessage());
>>>>>> } finally {
>>>>>> MPI.Finalize();
>>>>>> }
>>>>>>
>>>>>> }
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> ############ The Error (if it does not just hang up):
>>>>>>
>>>>>> #
>>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>>> #
>>>>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172,
>>>>>> tid=47822674495232
>>>>>> #
>>>>>> #
>>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>>> # JRE version: 7.0_25-b15
>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>>>> linux-amd64 compressed oops)
>>>>>> # Problematic frame:
>>>>>> # #
>>>>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173,
>>>>>> tid=47238546896640
>>>>>> #
>>>>>> # JRE version: 7.0_25-b15
>>>>>> J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>> #
>>>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>>>> #
>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>>>> linux-amd64 compressed oops)
>>>>>> # Problematic frame:
>>>>>> # J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>> #
>>>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>>>> #
>>>>>> # An error report file with more information is saved as:
>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>> # An error report file with more information is saved as:
>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>> #
>>>>>> # If you would like to submit a bug report, please visit:
>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>> #
>>>>>> #
>>>>>> # If you would like to submit a bug report, please visit:
>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>> #
>>>>>> [titan01:01172] *** Process received signal ***
>>>>>> [titan01:01172] Signal: Aborted (6)
>>>>>> [titan01:01172] Signal code: (-6)
>>>>>> [titan01:01173] *** Process received signal ***
>>>>>> [titan01:01173] Signal: Aborted (6)
>>>>>> [titan01:01173] Signal code: (-6)
>>>>>> [titan01:01172] [ 0]
>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>> [titan01:01172] [ 1]
>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>> [titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>> [titan01:01172] [ 3]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>> [titan01:01172] [ 4]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>> [titan01:01172] [ 5]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>>>>> [titan01:01172] *** End of error message ***
>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>> [titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>> [titan01:01173] [ 3]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>> [titan01:01173] [ 4]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>> [titan01:01173] [ 5]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>>>>> [titan01:01173] *** End of error message ***
>>>>>> -------------------------------------------------------
>>>>>> Primary job terminated normally, but 1 process returned
>>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited
>>>>>> on signal 6 (Aborted).
>>>>>>
>>>>>>
>>>>>> ########CONFIGURATION:
>>>>>> I used the ompi master sources from github:
>>>>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>> Author: Gilles Gouaillardet <***@rist.or.jp>
>>>>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>>>>
>>>>>> ./configure --enable-mpi-java
>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen
>>>>>> --disable-mca-dso
>>>>>>
>>>>>> Thanks a lot for your help!
>>>>>> Gundram
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> ***@open-mpi.org
>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing ***@open-mpi.org
>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>>>
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing ***@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing ***@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing ***@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing ***@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29592.php
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing ***@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29593.php
>>>
>>>
>>>
>>
>> _______________________________________________
>> users mailing ***@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29601.php
>>
>>
>>
>
> _______________________________________________
> users mailing ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29603.php
>
>
>
Howard Pritchard
2016-07-08 14:19:51 UTC
Permalink
Hi Gundram

Could you configure without the --disable-dlopen option and retry?
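
For reference, that would be the configure line posted earlier in this thread with the
dlopen flag dropped (and, per Gilles' earlier suggestion, --disable-mca-dso could be dropped
as well); treat this as a sketch and adjust the jdk path to the one you actually use now:

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
make && make install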

Howard

On Friday, July 8, 2016, Gilles Gouaillardet wrote:

> the JVM sets its own signal handlers, and it is important that Open MPI does
> not override them.
> this is what previously happened with PSM (InfiniPath), but that has been
> solved since.
> you might be linking with a third-party library that hijacks the signal
> handlers and causes the crash
> (which would explain why I cannot reproduce the issue)
>
> the master branch has a revamped memory patcher (compared to v2.x or
> v1.10), and that could have some bad interactions with the JVM, so you
> might also give v2.x a try
>
> Cheers,
>
> Gilles
>
> On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
>
>> You made the best of it... thanks a lot!
>>
>> Without MPI it runs.
>> Just adding MPI.init() causes the crash!
>>
>> maybe I installed something wrong...
>>
>> install newest automake, autoconf, m4, libtoolize in right order and same
>> prefix
>> check out ompi,
>> autogen
>> configure with the same prefix, pointing to the same jdk I later use
>> make
>> make install
>>
>> I will test some different configurations of ./configure...
>>
>>
>> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>>
>> I am running out of ideas ...
>>
>> what if you do not run within slurm ?
>> what if you do not use '-cp executor.jar'
>> or what if you configure without --disable-dlopen --disable-mca-dso ?
>>
>> if you
>> mpirun -np 1 ...
>> then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird
>> that your program is still crashing. another test is to comment out MPI_Bcast
>> and MPI_Barrier and try again with -np 1
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de>
>> wrote:
>>
>>> In any cases the same error.
>>> this is my code:
>>>
>>> salloc -n 3
>>> export IPATH_NO_BACKTRACE
>>> ulimit -s 10240
>>> mpirun -np 3 java -cp executor.jar
>>> de.uros.citlab.executor.test.TestSendBigFiles2
>>>
>>>
>>> also for 1 or two cores, the process crashes.
>>>
>>>
>>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>>
>>> you can try
>>> export IPATH_NO_BACKTRACE
>>> before invoking mpirun (that should not be needed though)
>>>
>>> another test is to
>>> ulimit -s 10240
>>> before invoking mpirun.
>>>
>>> btw, do you use mpirun or srun ?
>>>
>>> can you reproduce the crash with 1 or 2 tasks ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> configure:
>>>> ./configure --enable-mpi-java
>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen
>>>> --disable-mca-dso
>>>>
>>>>
>>>> 1 node with 3 cores. I use SLURM to allocate one node. I changed --mem,
>>>> but it has no effect.
>>>> salloc -n 3
>>>>
>>>>
>>>> core file size (blocks, -c) 0
>>>> data seg size (kbytes, -d) unlimited
>>>> scheduling priority (-e) 0
>>>> file size (blocks, -f) unlimited
>>>> pending signals (-i) 256564
>>>> max locked memory (kbytes, -l) unlimited
>>>> max memory size (kbytes, -m) unlimited
>>>> open files (-n) 100000
>>>> pipe size (512 bytes, -p) 8
>>>> POSIX message queues (bytes, -q) 819200
>>>> real-time priority (-r) 0
>>>> stack size (kbytes, -s) unlimited
>>>> cpu time (seconds, -t) unlimited
>>>> max user processes (-u) 4096
>>>> virtual memory (kbytes, -v) unlimited
>>>> file locks (-x) unlimited
>>>>
>>>> uname -a
>>>> Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
>>>> 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>> cat /etc/system-release
>>>> CentOS Linux release 7.2.1511 (Core)
>>>>
>>>> what else do you need?
>>>>
>>>> Cheers, Gundram
>>>>
>>>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>>
>>>> Gundram,
>>>>
>>>>
>>>> can you please provide more information on your environment :
>>>>
>>>> - configure command line
>>>>
>>>> - OS
>>>>
>>>> - memory available
>>>>
>>>> - ulimit -a
>>>>
>>>> - number of nodes
>>>>
>>>> - number of tasks used
>>>>
>>>> - interconnect used (if any)
>>>>
>>>> - batch manager (if any)
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Gilles
>>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>>
>>>> Hello Gilles,
>>>>
>>>> I tried your code and it crashes after 3-15 iterations (see (1)). It is
>>>> always the same error (only the "94" varies).
>>>>
>>>> Meanwhile I think Java and MPI use the same memory because when I
>>>> delete the hash-call, the program runs sometimes more than 9k iterations.
>>>> When it crashes, there are different lines (see (2) and (3)). The
>>>> crashes also occur on rank 0.
>>>>
>>>> ##### (1)#####
>>>> # Problematic frame:
>>>> # J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>>> (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]
>>>>
>>>> #####(2)#####
>>>> # Problematic frame:
>>>> # V [libjvm.so+0x68d0f6]
>>>> JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*,
>>>> Thread*)+0xb6
>>>>
>>>> #####(3)#####
>>>> # Problematic frame:
>>>> # V [libjvm.so+0x4183bf]
>>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>>
>>>> Any more idea?
>>>>
>>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>>>
>>>> Gundram,
>>>>
>>>>
>>>> fwiw, i cannot reproduce the issue on my box
>>>>
>>>> - centos 7
>>>>
>>>> - java version "1.8.0_71"
>>>> Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)
>>>>
>>>>
>>>> i noticed on non zero rank saveMem is allocated at each iteration.
>>>> ideally, the garbage collector can take care of that and this should
>>>> not be an issue.
>>>>
>>>> would you mind giving the attached file a try ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>
>>>> I will have a look at it today
>>>>
>>>> how did you configure OpenMPI ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Thursday, July 7, 2016, Gundram Leifert <
>>>> ***@uni-rostock.de> wrote:
>>>>
>>>>> Hello Gilles,
>>>>>
>>>>> thank you for your hints! I did 3 changes, unfortunately the same
>>>>> error occurs:
>>>>>
>>>>> update ompi:
>>>>> commit ae8444682f0a7aa158caea08800542ce9874455e
>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>
>>>>> update java:
>>>>> java version "1.8.0_92"
>>>>> Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
>>>>> Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)
>>>>>
>>>>> delete hashcode-lines.
>>>>>
>>>>> Now I get this error message - 100% of the time, after a varying number of
>>>>> iterations (15-300):
>>>>>
>>>>> 0/ 3:length = 100000000
>>>>> 0/ 3:bcast length done (length = 100000000)
>>>>> 1/ 3:bcast length done (length = 100000000)
>>>>> 2/ 3:bcast length done (length = 100000000)
>>>>> #
>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>> #
>>>>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578,
>>>>> tid=0x00002b3d29716700
>>>>> #
>>>>> # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build
>>>>> 1.8.0_92-b14)
>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode
>>>>> linux-amd64 compressed oops)
>>>>> # Problematic frame:
>>>>> # V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*,
>>>>> int)+0x94
>>>>> #
>>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>>> #
>>>>> # An error report file with more information is saved as:
>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>> #
>>>>> # Compiler replay data is saved as:
>>>>> # /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>> #
>>>>> # If you would like to submit a bug report, please visit:
>>>>> # http://bugreport.java.com/bugreport/crash.jsp
>>>>> #
>>>>> [titan01:16578] *** Process received signal ***
>>>>> [titan01:16578] Signal: Aborted (6)
>>>>> [titan01:16578] Signal code: (-6)
>>>>> [titan01:16578] [ 0]
>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>> [titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>> [titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>> [titan01:16578] [ 3]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>> [titan01:16578] [ 4]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>> [titan01:16578] [ 5]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>> [titan01:16578] [ 6]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>> [titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>> [titan01:16578] [ 8]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>> [titan01:16578] [ 9]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>> [titan01:16578] [10]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>> [titan01:16578] [11]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>> [titan01:16578] [12]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>> [titan01:16578] [13]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>> [titan01:16578] [14]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>> [titan01:16578] [15]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>> [titan01:16578] [16]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>> [titan01:16578] [17]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>> [titan01:16578] [18]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>> [titan01:16578] [19]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>> [titan01:16578] [20]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>> [titan01:16578] [21]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>> [titan01:16578] [22]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>> [titan01:16578] [23]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>> [titan01:16578] [24]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>> [titan01:16578] [25]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>> [titan01:16578] [26]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>> [titan01:16578] [27]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>> [titan01:16578] [28]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>> [titan01:16578] [29]
>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>> [titan01:16578] *** End of error message ***
>>>>> -------------------------------------------------------
>>>>> Primary job terminated normally, but 1 process returned
>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 2 with PID 0 on node titan01 exited
>>>>> on signal 6 (Aborted).
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> I don't know whether it is a problem of Java or OMPI - but in the last
>>>>> years, Java has worked without problems on my machine...
>>>>>
>>>>> Thank you for your tips in advance!
>>>>> Gundram
>>>>>
>>>>> On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
>>>>>
>>>>> Note that a race condition in MPI_Init was fixed yesterday in the
>>>>> master.
>>>>> Can you please update your OpenMPI and try again?
>>>>>
>>>>> Hopefully the hang will disappear.
>>>>>
>>>>> Can you reproduce the crash with a simpler (and ideally deterministic)
>>>>> version of your program?
>>>>> The crash occurs in hashcode, and this makes little sense to me. Can
>>>>> you also update your JDK?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
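
As an illustrative sketch of such a deterministic variant (not code from this
thread): replace the random fill with a pattern that depends only on the
iteration number, so the root and the receivers can recompute and compare the
same hashcode() value independently of the broadcast.

// Deterministic fill: the content depends only on the iteration number,
// so every rank can verify the expected hash on its own.
static void fillDeterministic(byte[] buf, int iteration) {
    for (int i = 0; i < buf.length; i++) {
        buf[i] = (byte) ((i + iteration) % 127);
    }
}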
>>>>>
>>>>> On Wednesday, July 6, 2016, Gundram Leifert <
>>>>> ***@uni-rostock.de> wrote:
>>>>>
>>>>>> Hello Jason,
>>>>>>
>>>>>> thanks for your response! I think it is another problem. I try to
>>>>>> send 100 MB of bytes, so there are not many iterations (between 10 and 30).
>>>>>> I realized that running this code can result in 3 different errors:
>>>>>>
>>>>>> 1. Most often the posted error message occurs.
>>>>>>
>>>>>> 2. In <10% of the cases I have a livelock. I can see 3 Java processes,
>>>>>> one with 200% and two with 100% processor utilization. After ~15 minutes
>>>>>> without new output this error occurs.
>>>>>>
>>>>>>
>>>>>> [thread 47499823949568 also had an error]
>>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>>> #
>>>>>> # Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
>>>>>> # guarantee(PageArmed == 0) failed: invariant
>>>>>> #
>>>>>> # JRE version: 7.0_25-b15
>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>>>> linux-amd64 compressed oops)
>>>>>> # Failed to write core dump. Core dumps have been disabled. To enable
>>>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>>>> #
>>>>>> # An error report file with more information is saved as:
>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>> #
>>>>>> # If you would like to submit a bug report, please visit:
>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>> #
>>>>>> [titan01:24256] *** Process received signal ***
>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>> [titan01:24256] Signal code: (-6)
>>>>>> [titan01:24256] [ 0]
>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>> [titan01:24256] [ 1]
>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>> [titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>> [titan01:24256] [ 3]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>> [titan01:24256] [ 4]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>> [titan01:24256] [ 5]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>> [titan01:24256] [ 6]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>> [titan01:24256] [ 7]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>> [titan01:24256] [ 8]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>> [titan01:24256] [ 9]
>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>> [titan01:24256] [10]
>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>> [titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>> [titan01:24256] *** End of error message ***
>>>>>> -------------------------------------------------------
>>>>>> Primary job terminated normally, but 1 process returned
>>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that process rank 0 with PID 0 on node titan01 exited
>>>>>> on signal 6 (Aborted).
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> 3. In <10% of the cases I have a deadlock in MPI.Init. It hangs
>>>>>> for more than 15 minutes without returning an error message...
>>>>>>
>>>>>> Can I enable some debug flags to see what happens on the C / OpenMPI side?
>>>>>>
>>>>>> Thanks in advance for your help!
>>>>>> Gundram Leifert
>>>>>>
>>>>>>
>>>>>> On 07/05/2016 06:05 PM, Jason Maldonis wrote:
>>>>>>
>>>>>> After reading your thread looks like it may be related to an issue I
>>>>>> had a few weeks ago (I'm a novice though). Maybe my thread will be of help:
>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>
>>>>>> When you say "After a specific number of repetitions the process
>>>>>> either hangs up or returns with a SIGSEGV." do you mean that a single
>>>>>> call hangs, or that at some point during the for loop a call hangs? If you
>>>>>> mean the latter, then it might relate to my issue. Otherwise my thread
>>>>>> probably won't be helpful.
>>>>>>
>>>>>> Jason Maldonis
>>>>>> Research Assistant of Professor Paul Voyles
>>>>>> Materials Science Grad Student
>>>>>> University of Wisconsin, Madison
>>>>>> 1509 University Ave, Rm M142
>>>>>> Madison, WI 53706
>>>>>> ***@wisc.edu
>>>>>> 608-295-5532
>>>>>>
>>>>>> On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <
>>>>>> ***@uni-rostock.de> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I try to send many byte-arrays via broadcast. After a specific
>>>>>>> number of repetitions the process either hangs up or returns with a
>>>>>>> SIGSEGV. Does any one can help me solving the problem:
>>>>>>>
>>>>>>> ########## The code:
>>>>>>>
>>>>>>> import java.util.Random;
>>>>>>> import mpi.*;
>>>>>>>
>>>>>>> public class TestSendBigFiles {
>>>>>>>
>>>>>>> public static void log(String msg) {
>>>>>>> try {
>>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>>> MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
>>>>>>> } catch (MPIException ex) {
>>>>>>> System.err.println(String.format("%2s/%2s:%s", "?", "?",
>>>>>>> msg));
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> private static int hashcode(byte[] bytearray) {
>>>>>>> if (bytearray == null) {
>>>>>>> return 0;
>>>>>>> }
>>>>>>> int hash = 39;
>>>>>>> for (int i = 0; i < bytearray.length; i++) {
>>>>>>> byte b = bytearray[i];
>>>>>>> hash = hash * 7 + (int) b;
>>>>>>> }
>>>>>>> return hash;
>>>>>>> }
>>>>>>>
>>>>>>> public static void main(String args[]) throws MPIException {
>>>>>>> log("start main");
>>>>>>> MPI.Init(args);
>>>>>>> try {
>>>>>>> log("initialized done");
>>>>>>> byte[] saveMem = new byte[100000000];
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> Random r = new Random();
>>>>>>> r.nextBytes(saveMem);
>>>>>>> if (MPI.COMM_WORLD.getRank() == 0) {
>>>>>>> for (int i = 0; i < 1000; i++) {
>>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>>> log("i = " + i);
>>>>>>> int[] lengthData = new int[]{saveMem.length};
>>>>>>> log("object hash = " + hashcode(saveMem));
>>>>>>> log("length = " + lengthData[0]);
>>>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>>> log("bcast length done (length = " +
>>>>>>> lengthData[0] + ")");
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> MPI.COMM_WORLD.bcast(saveMem, lengthData[0],
>>>>>>> MPI.BYTE, 0);
>>>>>>> log("bcast data done");
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> }
>>>>>>> MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>>>>> } else {
>>>>>>> while (true) {
>>>>>>> int[] lengthData = new int[1];
>>>>>>> MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>>> log("bcast length done (length = " +
>>>>>>> lengthData[0] + ")");
>>>>>>> if (lengthData[0] == 0) {
>>>>>>> break;
>>>>>>> }
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> saveMem = new byte[lengthData[0]];
>>>>>>> MPI.COMM_WORLD.bcast(saveMem, saveMem.length,
>>>>>>> MPI.BYTE, 0);
>>>>>>> log("bcast data done");
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> log("object hash = " + hashcode(saveMem));
>>>>>>> }
>>>>>>> }
>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>> } catch (MPIException ex) {
>>>>>>> System.out.println("caugth error." + ex);
>>>>>>> log(ex.getMessage());
>>>>>>> } catch (RuntimeException ex) {
>>>>>>> System.out.println("caugth error." + ex);
>>>>>>> log(ex.getMessage());
>>>>>>> } finally {
>>>>>>> MPI.Finalize();
>>>>>>> }
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> ############ The Error (if it does not just hang up):
>>>>>>>
>>>>>>> #
>>>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>>>> #
>>>>>>> # SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172,
>>>>>>> tid=47822674495232
>>>>>>> #
>>>>>>> #
>>>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>>>> # JRE version: 7.0_25-b15
>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>>>>> linux-amd64 compressed oops)
>>>>>>> # Problematic frame:
>>>>>>> # #
>>>>>>> # SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173,
>>>>>>> tid=47238546896640
>>>>>>> #
>>>>>>> # JRE version: 7.0_25-b15
>>>>>>> J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>> #
>>>>>>> # Failed to write core dump. Core dumps have been disabled. To
>>>>>>> enable core dumping, try "ulimit -c unlimited" before starting Java again
>>>>>>> #
>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode
>>>>>>> linux-amd64 compressed oops)
>>>>>>> # Problematic frame:
>>>>>>> # J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>> #
>>>>>>> # Failed to write core dump. Core dumps have been disabled. To
>>>>>>> enable core dumping, try "ulimit -c unlimited" before starting Java again
>>>>>>> #
>>>>>>> # An error report file with more information is saved as:
>>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>> # An error report file with more information is saved as:
>>>>>>> # /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>> #
>>>>>>> # If you would like to submit a bug report, please visit:
>>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>> #
>>>>>>> #
>>>>>>> # If you would like to submit a bug report, please visit:
>>>>>>> # http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>> #
>>>>>>> [titan01:01172] *** Process received signal ***
>>>>>>> [titan01:01172] Signal: Aborted (6)
>>>>>>> [titan01:01172] Signal code: (-6)
>>>>>>> [titan01:01173] *** Process received signal ***
>>>>>>> [titan01:01173] Signal: Aborted (6)
>>>>>>> [titan01:01173] Signal code: (-6)
>>>>>>> [titan01:01172] [ 0]
>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>> [titan01:01172] [ 1]
>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>> [titan01:01172] [ 2]
>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>> [titan01:01172] [ 3]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>> [titan01:01172] [ 4]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>> [titan01:01172] [ 5]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>> [titan01:01172] [ 6] [titan01:01173] [ 0]
>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>> [titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>>>>>> [titan01:01172] *** End of error message ***
>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>> [titan01:01173] [ 2]
>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>> [titan01:01173] [ 3]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>> [titan01:01173] [ 4]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>> [titan01:01173] [ 5]
>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>> [titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>>>>>> [titan01:01173] *** End of error message ***
>>>>>>> -------------------------------------------------------
>>>>>>> Primary job terminated normally, but 1 process returned
>>>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>>> -------------------------------------------------------
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 1 with PID 0 on node titan01 exited
>>>>>>> on signal 6 (Aborted).
>>>>>>>
>>>>>>>
>>>>>>> ########CONFIGURATION:
>>>>>>> I used the ompi master sources from github:
>>>>>>> commit 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>> Author: Gilles Gouaillardet <***@rist.or.jp>
>>>>>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>>>>>
>>>>>>> ./configure --enable-mpi-java
>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen
>>>>>>> --disable-mca-dso
>>>>>>>
>>>>>>> Thanks a lot for your help!
>>>>>>> Gundram
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> ***@open-mpi.org
>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing ***@open-mpi.org
>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing ***@open-mpi.org
>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>>>>
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing ***@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing ***@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing ***@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29592.php
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing ***@open-mpi.org
>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29593.php
>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing ***@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29601.php
>>>
>>>
>>>
>>
>> _______________________________________________
>> users mailing ***@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29603.php
>>
>>
>>
Gundram Leifert
2016-07-12 09:08:06 UTC
Permalink
Hello Gilles, Howard,

I configured without --disable-dlopen - same error.

I tested these classes on another cluster and: IT WORKS!

So it is a problem of the cluster configuration. Thank you all very much
for all your help! When the admin has solved the problem, I will let you
know what he changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
> Hi Gundram
>
> Could you configure without the disable dlopen option and retry?
>
> Howard
>
> Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
>
> the JVM sets its own signal handlers, and it is important that openmpi
> does not override them.
> this is what previously happened with PSM (infinipath), but this
> has been solved since.
> you might be linking with a third-party library that hijacks
> signal handlers and causes the crash
> (which would explain why I cannot reproduce the issue)
>
> the master branch has a revamped memory patcher (compared to v2.x
> or v1.10), and that could have some bad interactions with the JVM,
> so you might also give v2.x a try
>
> Cheers,
>
> Gilles
>
> On Friday, July 8, 2016, Gundram Leifert
> <***@uni-rostock.de>
> wrote:
>
> You made the best of it... thanks a lot!
>
> Without MPI it runs.
> Just adding MPI.Init() causes the crash!
>
> maybe I installed something wrong...
>
> install the newest automake, autoconf, m4, libtoolize in the right
> order and with the same prefix
> check out ompi
> autogen
> configure with the same prefix, pointing to the same jdk I use later
> make
> make install
>
> I will test some different configurations of ./configure...
>
>
> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>> I am running out of ideas ...
>>
>> what if you do not run within slurm ?
>> what if you do not use '-cp executor.jar'
>> or what if you configure without --disable-dlopen
>> --disable-mca-dso ?
>>
>> if you
>> mpirun -np 1 ...
>> then MPI_Bcast and MPI_Barrier are basically no-ops, so it is
>> really weird that your program is still crashing. Another test is
>> to comment out MPI_Bcast and MPI_Barrier and try again with -np 1
>>
>> Cheers,
>>
>> Gilles
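
A stripped-down variant along these lines (an illustrative sketch only; the
class name is made up here) keeps MPI.Init/MPI.Finalize and the large
allocation but drops all bcast/barrier calls, so a crash with -np 1 would
point at the JVM/Open MPI interaction rather than at the collectives:

import java.util.Random;
import mpi.*;

public class TestInitOnly {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        try {
            byte[] saveMem = new byte[100000000];
            Random r = new Random();
            for (int i = 0; i < 1000; i++) {
                r.nextBytes(saveMem);                  // touch the whole array
                saveMem[r.nextInt(saveMem.length)]++;
                System.err.println("i = " + i);
            }
        } finally {
            MPI.Finalize();
        }
    }
}

It would be run with, for example: mpirun -np 1 java TestInitOnly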
>>
>> On Friday, July 8, 2016, Gundram Leifert
>> <***@uni-rostock.de> wrote:
>>
>> In any case the same error.
>> This is my code:
>>
>> salloc -n 3
>> export IPATH_NO_BACKTRACE
>> ulimit -s 10240
>> mpirun -np 3 java -cp executor.jar
>> de.uros.citlab.executor.test.TestSendBigFiles2
>>
>>
>> Also for 1 or 2 cores, the process crashes.
>>
>>
>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>> you can try
>>> export IPATH_NO_BACKTRACE
>>> before invoking mpirun (that should not be needed though)
>>>
>>> another test is to
>>> ulimit -s 10240
>>> before invoking mpirun.
>>>
>>> btw, do you use mpirun or srun ?
>>>
>>> can you reproduce the crash with 1 or 2 tasks ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Friday, July 8, 2016, Gundram Leifert
>>> <***@uni-rostock.de> wrote:
>>>
>>> Hello,
>>>
>>> configure:
>>> ./configure --enable-mpi-java
>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>> --disable-dlopen --disable-mca-dso
>>>
>>>
>>> 1 node with 3 cores. I use SLURM to allocate one
>>> node. I changed --mem, but it has no effect.
>>> salloc -n 3
>>>
>>>
>>> core file size (blocks, -c) 0
>>> data seg size (kbytes, -d) unlimited
>>> scheduling priority (-e) 0
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 256564
>>> max locked memory (kbytes, -l) unlimited
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 100000
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority (-r) 0
>>> stack size (kbytes, -s) unlimited
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) 4096
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>>
>>> uname -a
>>> Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1
>>> SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64
>>> x86_64 GNU/Linux
>>>
>>> cat /etc/system-release
>>> CentOS Linux release 7.2.1511 (Core)
>>>
>>> what else do you need?
>>>
>>> Cheers, Gundram
>>>
>>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>>
>>>> Gundram,
>>>>
>>>>
>>>> can you please provide more information on your
>>>> environment :
>>>>
>>>> - configure command line
>>>>
>>>> - OS
>>>>
>>>> - memory available
>>>>
>>>> - ulimit -a
>>>>
>>>> - number of nodes
>>>>
>>>> - number of tasks used
>>>>
>>>> - interconnect used (if any)
>>>>
>>>> - batch manager (if any)
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Gilles
>>>>
>>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>>> Hello Gilles,
>>>>>
>>>>> I tried you code and it crashes after 3-15
>>>>> iterations (see (1)). It is always the same error
>>>>> (only the "94" varies).
>>>>>
>>>>> Meanwhile I think Java and MPI use the same memory
>>>>> because when I delete the hash-call, the program
>>>>> runs sometimes more than 9k iterations.
>>>>> When it crashes, there are different lines (see
>>>>> (2) and (3)). The crashes also occurs on rank 0.
>>>>>
>>>>> ##### (1)#####
>>>>> # Problematic frame:
>>>>> # J 94 C2
>>>>> de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>>>> (42 bytes) @ 0x00002b03242dc9c4
>>>>> [0x00002b03242dc860+0x164]
>>>>>
>>>>> #####(2)#####
>>>>> # Problematic frame:
>>>>> # V [libjvm.so+0x68d0f6]
>>>>> JavaCallWrapper::JavaCallWrapper(methodHandle,
>>>>> Handle, JavaValue*, Thread*)+0xb6
>>>>>
>>>>> #####(3)#####
>>>>> # Problematic frame:
>>>>> # V [libjvm.so+0x4183bf]
>>>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>>>
>>>>> Any more idea?
>>>>>
>>>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>>>>>
>>>>>> Gundram,
>>>>>>
>>>>>>
>>>>>> fwiw, i cannot reproduce the issue on my box
>>>>>>
>>>>>> - centos 7
>>>>>>
>>>>>> - java version "1.8.0_71"
>>>>>> Java(TM) SE Runtime Environment (build
>>>>>> 1.8.0_71-b15)
>>>>>> Java HotSpot(TM) 64-Bit Server VM (build
>>>>>> 25.71-b15, mixed mode)
>>>>>>
>>>>>>
>>>>>> i noticed on non zero rank saveMem is allocated
>>>>>> at each iteration.
>>>>>> ideally, the garbage collector can take care of
>>>>>> that and this should not be an issue.
>>>>>>
>>>>>> would you mind giving the attached file a try ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>>>> I will have a look at it today
>>>>>>>
>>>>>>> how did you configure OpenMPI ?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On Thursday, July 7, 2016, Gundram Leifert
>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>
>>>>>>> Hello Giles,
>>>>>>>
>>>>>>> thank you for your hints! I did 3 changes,
>>>>>>> unfortunately the same error occures:
>>>>>>>
>>>>>>> update ompi:
>>>>>>> commit ae8444682f0a7aa158caea08800542ce9874455e
>>>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>>>
>>>>>>> update java:
>>>>>>> java version "1.8.0_92"
>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>> 1.8.0_92-b14)
>>>>>>> Java HotSpot(TM) Server VM (build 25.92-b14,
>>>>>>> mixed mode)
>>>>>>>
>>>>>>> delete hashcode-lines.
>>>>>>>
>>>>>>> Now I get this error message - to 100%,
>>>>>>> after different number of iterations (15-300):
>>>>>>>
>>>>>>> 0/ 3:length = 100000000
>>>>>>> 0/ 3:bcast length done (length = 100000000)
>>>>>>> 1/ 3:bcast length done (length = 100000000)
>>>>>>> 2/ 3:bcast length done (length = 100000000)
>>>>>>> #
>>>>>>> # A fatal error has been detected by the
>>>>>>> Java Runtime Environment:
>>>>>>> #
>>>>>>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24,
>>>>>>> pid=16578, tid=0x00002b3d29716700
>>>>>>> #
>>>>>>> # JRE version: Java(TM) SE Runtime
>>>>>>> Environment (8.0_92-b14) (build 1.8.0_92-b14)
>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM
>>>>>>> (25.92-b14 mixed mode linux-amd64 compressed
>>>>>>> oops)
>>>>>>> # Problematic frame:
>>>>>>> # V [libjvm.so+0x414d24]
>>>>>>> ciEnv::get_field_by_index(ciInstanceKlass*,
>>>>>>> int)+0x94
>>>>>>> #
>>>>>>> # Failed to write core dump. Core dumps have
>>>>>>> been disabled. To enable core dumping, try
>>>>>>> "ulimit -c unlimited" before starting Java again
>>>>>>> #
>>>>>>> # An error report file with more information
>>>>>>> is saved as:
>>>>>>> #
>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>>>> #
>>>>>>> # Compiler replay data is saved as:
>>>>>>> #
>>>>>>> /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>>>> #
>>>>>>> # If you would like to submit a bug report,
>>>>>>> please visit:
>>>>>>> # http://bugreport.java.com/bugreport/crash.jsp
>>>>>>> #
>>>>>>> [titan01:16578] *** Process received signal ***
>>>>>>> [titan01:16578] Signal: Aborted (6)
>>>>>>> [titan01:16578] Signal code: (-6)
>>>>>>> [titan01:16578] [ 0]
>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>>>> [titan01:16578] [ 1]
>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>>>> [titan01:16578] [ 2]
>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>>>> [titan01:16578] [ 3]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>>>> [titan01:16578] [ 4]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>>>> [titan01:16578] [ 5]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>>>> [titan01:16578] [ 6]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>>>> [titan01:16578] [ 7]
>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>>>> [titan01:16578] [ 8]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>>>> [titan01:16578] [ 9]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>>>> [titan01:16578] [10]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>>>> [titan01:16578] [11]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>>>> [titan01:16578] [12]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>> [titan01:16578] [13]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>> [titan01:16578] [14]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>> [titan01:16578] [15]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>> [titan01:16578] [16]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>> [titan01:16578] [17]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>> [titan01:16578] [18]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>> [titan01:16578] [19]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>> [titan01:16578] [20]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>> [titan01:16578] [21]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>> [titan01:16578] [22]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>>>> [titan01:16578] [23]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>>>> [titan01:16578] [24]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>>>> [titan01:16578] [25]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>>>> [titan01:16578] [26]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>>>> [titan01:16578] [27]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>>>> [titan01:16578] [28]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>>>> [titan01:16578] [29]
>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>>>> [titan01:16578] *** End of error message ***
>>>>>>> -------------------------------------------------------
>>>>>>> Primary job terminated normally, but 1
>>>>>>> process returned
>>>>>>> a non-zero exit code. Per user-direction,
>>>>>>> the job has been aborted.
>>>>>>> -------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 2 with PID
>>>>>>> 0 on node titan01 exited on signal 6 (Aborted).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> I don't know if it is a problem of java or
>>>>>>> ompi - but the last years, java worked with
>>>>>>> no problems on my machine...
>>>>>>>
>>>>>>> Thank you for your tips in advance!
>>>>>>> Gundram
>>>>>>>
>>>>>>> On 07/06/2016 03:10 PM, Gilles Gouaillardet
>>>>>>> wrote:
>>>>>>>> Note a race condition in MPI_Init has been
>>>>>>>> fixed yesterday in the master.
>>>>>>>> can you please update your OpenMPI and try
>>>>>>>> again ?
>>>>>>>>
>>>>>>>> hopefully the hang will disappear.
>>>>>>>>
>>>>>>>> Can you reproduce the crash with a simpler
>>>>>>>> (and ideally deterministic) version of your
>>>>>>>> program.
>>>>>>>> the crash occurs in hashcode, and this
>>>>>>>> makes little sense to me. can you also
>>>>>>>> update your jdk ?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On Wednesday, July 6, 2016, Gundram Leifert
>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>
>>>>>>>> Hello Jason,
>>>>>>>>
>>>>>>>> thanks for your response! I thing it is
>>>>>>>> another problem. I try to send 100MB
>>>>>>>> bytes. So there are not many tries
>>>>>>>> (between 10 and 30). I realized that
>>>>>>>> the execution of this code can result 3
>>>>>>>> different errors:
>>>>>>>>
>>>>>>>> 1. most often the posted error message
>>>>>>>> occures.
>>>>>>>>
>>>>>>>> 2. in <10% the cases i have a live
>>>>>>>> lock. I can see 3 java-processes, one
>>>>>>>> with 200% and two with 100% processor
>>>>>>>> utilization. After ~15 minutes without
>>>>>>>> new system outputs this error occurs.
>>>>>>>>
>>>>>>>>
>>>>>>>> [thread 47499823949568 also had an error]
>>>>>>>> # A fatal error has been detected by
>>>>>>>> the Java Runtime Environment:
>>>>>>>> #
>>>>>>>> # Internal Error (safepoint.cpp:317),
>>>>>>>> pid=24256, tid=47500347131648
>>>>>>>> # guarantee(PageArmed == 0) failed:
>>>>>>>> invariant
>>>>>>>> #
>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>> Server VM (23.25-b01 mixed mode
>>>>>>>> linux-amd64 compressed oops)
>>>>>>>> # Failed to write core dump. Core dumps
>>>>>>>> have been disabled. To enable core
>>>>>>>> dumping, try "ulimit -c unlimited"
>>>>>>>> before starting Java again
>>>>>>>> #
>>>>>>>> # An error report file with more
>>>>>>>> information is saved as:
>>>>>>>> #
>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>>>> #
>>>>>>>> # If you would like to submit a bug
>>>>>>>> report, please visit:
>>>>>>>> #
>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>> #
>>>>>>>> [titan01:24256] *** Process received
>>>>>>>> signal ***
>>>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>>>> [titan01:24256] Signal code: (-6)
>>>>>>>> [titan01:24256] [ 0]
>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>>>> [titan01:24256] [ 1]
>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>>>> [titan01:24256] [ 2]
>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>>>> [titan01:24256] [ 3]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>>>> [titan01:24256] [ 4]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>>>> [titan01:24256] [ 5]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>>>> [titan01:24256] [ 6]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>>>> [titan01:24256] [ 7]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>>>> [titan01:24256] [ 8]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>>>> [titan01:24256] [ 9]
>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>>>> [titan01:24256] [10]
>>>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>>>> [titan01:24256] [11]
>>>>>>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>>>> [titan01:24256] *** End of error
>>>>>>>> message ***
>>>>>>>> -------------------------------------------------------
>>>>>>>> Primary job terminated normally, but 1
>>>>>>>> process returned
>>>>>>>> a non-zero exit code. Per
>>>>>>>> user-direction, the job has been aborted.
>>>>>>>> -------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that process rank 0 with
>>>>>>>> PID 0 on node titan01 exited on signal
>>>>>>>> 6 (Aborted).
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> 3. in <10% the cases i have a dead lock
>>>>>>>> while MPI.init. This stays for more
>>>>>>>> than 15 minutes without returning with
>>>>>>>> an error message...
>>>>>>>>
>>>>>>>> Can I enable some debug-flags to see
>>>>>>>> what happens on C / OpenMPI side?
>>>>>>>>
>>>>>>>> Thanks in advance for your help!
>>>>>>>> Gundram Leifert
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07/05/2016 06:05 PM, Jason Maldonis
>>>>>>>> wrote:
>>>>>>>>> After reading your thread looks like
>>>>>>>>> it may be related to an issue I had a
>>>>>>>>> few weeks ago (I'm a novice though).
>>>>>>>>> Maybe my thread will be of help:
>>>>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> When you say "After a specific number
>>>>>>>>> of repetitions the process either
>>>>>>>>> hangs up or returns with a SIGSEGV."
>>>>>>>>> does you mean that a single call
>>>>>>>>> hangs, or that at some point during
>>>>>>>>> the for loop a call hangs? If you mean
>>>>>>>>> the latter, then it might relate to my
>>>>>>>>> issue. Otherwise my thread probably
>>>>>>>>> won't be helpful.
>>>>>>>>>
>>>>>>>>> Jason Maldonis
>>>>>>>>> Research Assistant of Professor Paul
>>>>>>>>> Voyles
>>>>>>>>> Materials Science Grad Student
>>>>>>>>> University of Wisconsin, Madison
>>>>>>>>> 1509 University Ave, Rm M142
>>>>>>>>> Madison, WI 53706
>>>>>>>>> ***@wisc.edu
>>>>>>>>> 608-295-5532
>>>>>>>>>
>>>>>>>>> On Tue, Jul 5, 2016 at 9:58 AM,
>>>>>>>>> Gundram Leifert
>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I try to send many byte-arrays via
>>>>>>>>> broadcast. After a specific number
>>>>>>>>> of repetitions the process either
>>>>>>>>> hangs up or returns with a
>>>>>>>>> SIGSEGV. Does any one can help me
>>>>>>>>> solving the problem:
>>>>>>>>>
>>>>>>>>> ########## The code:
>>>>>>>>>
>>>>>>>>> import java.util.Random;
>>>>>>>>> import mpi.*;
>>>>>>>>>
>>>>>>>>> public class TestSendBigFiles {
>>>>>>>>>
>>>>>>>>> public static void log(String
>>>>>>>>> msg) {
>>>>>>>>> try {
>>>>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>>>>> MPI.COMM_WORLD.getRank(),
>>>>>>>>> MPI.COMM_WORLD.getSize(), msg));
>>>>>>>>> } catch (MPIException ex) {
>>>>>>>>> System.err.println(String.format("%2s/%2s:%s",
>>>>>>>>> "?", "?", msg));
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> private static int
>>>>>>>>> hashcode(byte[] bytearray) {
>>>>>>>>> if (bytearray == null) {
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>> int hash = 39;
>>>>>>>>> for (int i = 0; i <
>>>>>>>>> bytearray.length; i++) {
>>>>>>>>> byte b = bytearray[i];
>>>>>>>>> hash = hash * 7 + (int) b;
>>>>>>>>> }
>>>>>>>>> return hash;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> public static void main(String
>>>>>>>>> args[]) throws MPIException {
>>>>>>>>> log("start main");
>>>>>>>>> MPI.Init(args);
>>>>>>>>> try {
>>>>>>>>> log("initialized done");
>>>>>>>>> byte[] saveMem = new byte[100000000];
>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>> Random r = new Random();
>>>>>>>>> r.nextBytes(saveMem);
>>>>>>>>> if
>>>>>>>>> (MPI.COMM_WORLD.getRank() == 0) {
>>>>>>>>> for (int i = 0; i < 1000; i++) {
>>>>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>>>>> log("i = " + i);
>>>>>>>>> int[] lengthData = new
>>>>>>>>> int[]{saveMem.length};
>>>>>>>>> log("object hash = " +
>>>>>>>>> hashcode(saveMem));
>>>>>>>>> log("length = " + lengthData[0]);
>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>> log("bcast length done (length = "
>>>>>>>>> + lengthData[0] + ")");
>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>> lengthData[0], MPI.BYTE, 0);
>>>>>>>>> log("bcast data done");
>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>> }
>>>>>>>>> MPI.COMM_WORLD.bcast(new int[]{0},
>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>> } else {
>>>>>>>>> while (true) {
>>>>>>>>> int[] lengthData = new int[1];
>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>> log("bcast length done (length = "
>>>>>>>>> + lengthData[0] + ")");
>>>>>>>>> if (lengthData[0] == 0) {
>>>>>>>>> break;
>>>>>>>>> }
>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>> saveMem = new
>>>>>>>>> byte[lengthData[0]];
>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>> saveMem.length, MPI.BYTE, 0);
>>>>>>>>> log("bcast data done");
>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>> log("object hash = " +
>>>>>>>>> hashcode(saveMem));
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>> } catch (MPIException ex) {
>>>>>>>>> System.out.println("caugth error."
>>>>>>>>> + ex);
>>>>>>>>> log(ex.getMessage());
>>>>>>>>> } catch (RuntimeException
>>>>>>>>> ex) {
>>>>>>>>> System.out.println("caugth error."
>>>>>>>>> + ex);
>>>>>>>>> log(ex.getMessage());
>>>>>>>>> } finally {
>>>>>>>>> MPI.Finalize();
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ############ The Error (if it does
>>>>>>>>> not just hang up):
>>>>>>>>>
>>>>>>>>> #
>>>>>>>>> # A fatal error has been detected
>>>>>>>>> by the Java Runtime Environment:
>>>>>>>>> #
>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>> pc=0x00002b7e9c86e3a1, pid=1172,
>>>>>>>>> tid=47822674495232
>>>>>>>>> #
>>>>>>>>> #
>>>>>>>>> # A fatal error has been detected
>>>>>>>>> by the Java Runtime Environment:
>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>> Server VM (23.25-b01 mixed mode
>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>> # Problematic frame:
>>>>>>>>> # #
>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>> pc=0x00002af69c0693a1, pid=1173,
>>>>>>>>> tid=47238546896640
>>>>>>>>> #
>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>> J
>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>> #
>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>> dumps have been disabled. To
>>>>>>>>> enable core dumping, try "ulimit
>>>>>>>>> -c unlimited" before starting Java
>>>>>>>>> again
>>>>>>>>> #
>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>> Server VM (23.25-b01 mixed mode
>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>> # Problematic frame:
>>>>>>>>> # J
>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>> #
>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>> dumps have been disabled. To
>>>>>>>>> enable core dumping, try "ulimit
>>>>>>>>> -c unlimited" before starting Java
>>>>>>>>> again
>>>>>>>>> #
>>>>>>>>> # An error report file with more
>>>>>>>>> information is saved as:
>>>>>>>>> #
>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>>>> # An error report file with more
>>>>>>>>> information is saved as:
>>>>>>>>> #
>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>>>> #
>>>>>>>>> # If you would like to submit a
>>>>>>>>> bug report, please visit:
>>>>>>>>> #
>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>> #
>>>>>>>>> #
>>>>>>>>> # If you would like to submit a
>>>>>>>>> bug report, please visit:
>>>>>>>>> #
>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>> #
>>>>>>>>> [titan01:01172] *** Process
>>>>>>>>> received signal ***
>>>>>>>>> [titan01:01172] Signal: Aborted (6)
>>>>>>>>> [titan01:01172] Signal code: (-6)
>>>>>>>>> [titan01:01173] *** Process
>>>>>>>>> received signal ***
>>>>>>>>> [titan01:01173] Signal: Aborted (6)
>>>>>>>>> [titan01:01173] Signal code: (-6)
>>>>>>>>> [titan01:01172] [ 0]
>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>>>> [titan01:01172] [ 1]
>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>>>> [titan01:01172] [ 2]
>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>>>> [titan01:01172] [ 3]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>>>> [titan01:01172] [ 4]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>>>> [titan01:01172] [ 5]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>>>> [titan01:01172] [ 6]
>>>>>>>>> [titan01:01173] [ 0]
>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>>>> [titan01:01173] [ 1]
>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>>>>>>>> [titan01:01172] *** End of error
>>>>>>>>> message ***
>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>>>> [titan01:01173] [ 2]
>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>>>> [titan01:01173] [ 3]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>>>> [titan01:01173] [ 4]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>>>> [titan01:01173] [ 5]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>>>> [titan01:01173] [ 6]
>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>>>>>>>> [titan01:01173] *** End of error
>>>>>>>>> message ***
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> Primary job terminated normally,
>>>>>>>>> but 1 process returned
>>>>>>>>> a non-zero exit code. Per
>>>>>>>>> user-direction, the job has been
>>>>>>>>> aborted.
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that process rank 1
>>>>>>>>> with PID 0 on node titan01 exited
>>>>>>>>> on signal 6 (Aborted).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ########CONFIGURATION:
>>>>>>>>> I used the ompi master sources
>>>>>>>>> from github:
>>>>>>>>> commit
>>>>>>>>> 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>>>> Author: Gilles Gouaillardet
>>>>>>>>> <***@rist.or.jp>
>>>>>>>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>>>>>>>
>>>>>>>>> ./configure --enable-mpi-java
>>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>>>>
>>>>>>>>> Thanks a lot for your help!
>>>>>>>>> Gundram
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> ***@open-mpi.org
>>>>>>>>> Subscription:
>>>>>>>>> https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> ***@open-mpi.org
>>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> ***@open-mpi.org
>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> ***@open-mpi.org
>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> ***@open-mpi.org
>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> ***@open-mpi.org
>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29592.php
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> ***@open-mpi.org
>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29593.php
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> ***@open-mpi.org
>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29601.php
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@open-mpi.org
>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29603.php
>
>
>
> _______________________________________________
> users mailing list
> ***@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29610.php
Gundram Leifert
2016-09-07 15:23:00 UTC
Permalink
Hello,

I still have the same errors on our cluster - and even one more. Maybe the
new one helps us to find a solution.

I get this error when I run "make_onesided" from the ompi-java-test repo.

CReqops and TestMpiRmaCompareAndSwap report this error (pretty
deterministically - in all of my 30 runs):

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now
abort,
[titan01:5134] *** and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages

Sometimes I also have the SIGSEGV error.

System:

compiler: gcc/5.2.0
java: jdk1.8.0_102
kernel modules: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64
(***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat
4.8.3-9) (GCC) ) #1 SMP

Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo
rev: v2.0.0-257-gee86e07, Sep 02, 2016

InfiniBand

openib: OpenSM 3.3.19


limits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
> Hello Gilley, Howard,
>
> I configured without disable dlopen - same error.
>
> I test these classes on another cluster and: IT WORKS!
>
> So it is a problem of the cluster configuration. Thank you all very
> much for all your help! When the admin can solve the problem, i will
> let you know, what he had changed.
>
> Cheers Gundram
>
> On 07/08/2016 04:19 PM, Howard Pritchard wrote:
>> Hi Gundram
>>
>> Could you configure without the disable dlopen option and retry?
>>
>> Howard
>>
>> Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
>>
>> the JVM sets its own signal handlers, and it is important openmpi
>> dones not override them.
>> this is what previously happened with PSM (infinipath) but this
>> has been solved since.
>> you might be linking with a third party library that hijacks
>> signal handlers and cause the crash
>> (which would explain why I cannot reproduce the issue)
>>
>> the master branch has a revamped memory patcher (compared to v2.x
>> or v1.10), and that could have some bad interactions with the
>> JVM, so you might also give v2.x a try
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, July 8, 2016, Gundram Leifert
>> <***@uni-rostock.de> wrote:
>>
>> You made the best of it... thanks a lot!
>>
>> Whithout MPI it runs.
>> Just adding MPI.init() causes the crash!
>>
>> maybe I installed something wrong...
>>
>> install newest automake, autoconf, m4, libtoolize in right
>> order and same prefix
>> check out ompi,
>> autogen
>> configure with same prefix, pointing to the same jdk, I later use
>> make
>> make install
>>
>> I will test some different configurations of ./configure...
>>
>>
>> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>>> I am running out of ideas ...
>>>
>>> what if you do not run within slurm ?
>>> what if you do not use '-cp executor.jar'
>>> or what if you configure without --disable-dlopen
>>> --disable-mca-dso ?
>>>
>>> if you
>>> mpirun -np 1 ...
>>> then MPI_Bcast and MPI_Barrier are basically no-op, so it is
>>> really weird your program is still crashing. an other test
>>> is to comment out MPI_Bcast and MPI_Barrier and try again
>>> with -np 1
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Friday, July 8, 2016, Gundram Leifert
>>> <***@uni-rostock.de> wrote:
>>>
>>> In any cases the same error.
>>> this is my code:
>>>
>>> salloc -n 3
>>> export IPATH_NO_BACKTRACE
>>> ulimit -s 10240
>>> mpirun -np 3 java -cp executor.jar
>>> de.uros.citlab.executor.test.TestSendBigFiles2
>>>
>>>
>>> also for 1 or two cores, the process crashes.
>>>
>>>
>>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>>> you can try
>>>> export IPATH_NO_BACKTRACE
>>>> before invoking mpirun (that should not be needed though)
>>>>
>>>> an other test is to
>>>> ulimit -s 10240
>>>> before invoking mpirun.
>>>>
>>>> btw, do you use mpirun or srun ?
>>>>
>>>> can you reproduce the crash with 1 or 2 tasks ?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Friday, July 8, 2016, Gundram Leifert
>>>> <***@uni-rostock.de> wrote:
>>>>
>>>> Hello,
>>>>
>>>> configure:
>>>> ./configure --enable-mpi-java
>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>> --disable-dlopen --disable-mca-dso
>>>>
>>>>
>>>> 1 node with 3 cores. I use SLURM to allocate one
>>>> node. I changed --mem, but it has no effect.
>>>> salloc -n 3
>>>>
>>>>
>>>> core file size (blocks, -c) 0
>>>> data seg size (kbytes, -d) unlimited
>>>> scheduling priority (-e) 0
>>>> file size (blocks, -f) unlimited
>>>> pending signals (-i) 256564
>>>> max locked memory (kbytes, -l) unlimited
>>>> max memory size (kbytes, -m) unlimited
>>>> open files (-n) 100000
>>>> pipe size (512 bytes, -p) 8
>>>> POSIX message queues (bytes, -q) 819200
>>>> real-time priority (-r) 0
>>>> stack size (kbytes, -s) unlimited
>>>> cpu time (seconds, -t) unlimited
>>>> max user processes (-u) 4096
>>>> virtual memory (kbytes, -v) unlimited
>>>> file locks (-x) unlimited
>>>>
>>>> uname -a
>>>> Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1
>>>> SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64
>>>> x86_64 GNU/Linux
>>>>
>>>> cat /etc/system-release
>>>> CentOS Linux release 7.2.1511 (Core)
>>>>
>>>> what else do you need?
>>>>
>>>> Cheers, Gundram
>>>>
>>>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>>>
>>>>> Gundram,
>>>>>
>>>>>
>>>>> can you please provide more information on your
>>>>> environment :
>>>>>
>>>>> - configure command line
>>>>>
>>>>> - OS
>>>>>
>>>>> - memory available
>>>>>
>>>>> - ulimit -a
>>>>>
>>>>> - number of nodes
>>>>>
>>>>> - number of tasks used
>>>>>
>>>>> - interconnect used (if any)
>>>>>
>>>>> - batch manager (if any)
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>>>> Hello Gilles,
>>>>>>
>>>>>> I tried you code and it crashes after 3-15
>>>>>> iterations (see (1)). It is always the same error
>>>>>> (only the "94" varies).
>>>>>>
>>>>>> Meanwhile I think Java and MPI use the same
>>>>>> memory because when I delete the hash-call, the
>>>>>> program runs sometimes more than 9k iterations.
>>>>>> When it crashes, there are different lines (see
>>>>>> (2) and (3)). The crashes also occurs on rank 0.
>>>>>>
>>>>>> ##### (1)#####
>>>>>> # Problematic frame:
>>>>>> # J 94 C2
>>>>>> de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>>>>> (42 bytes) @ 0x00002b03242dc9c4
>>>>>> [0x00002b03242dc860+0x164]
>>>>>>
>>>>>> #####(2)#####
>>>>>> # Problematic frame:
>>>>>> # V [libjvm.so+0x68d0f6]
>>>>>> JavaCallWrapper::JavaCallWrapper(methodHandle,
>>>>>> Handle, JavaValue*, Thread*)+0xb6
>>>>>>
>>>>>> #####(3)#####
>>>>>> # Problematic frame:
>>>>>> # V [libjvm.so+0x4183bf]
>>>>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>>>>
>>>>>> Any more idea?
>>>>>>
>>>>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>>>>>>
>>>>>>> Gundram,
>>>>>>>
>>>>>>>
>>>>>>> fwiw, i cannot reproduce the issue on my box
>>>>>>>
>>>>>>> - centos 7
>>>>>>>
>>>>>>> - java version "1.8.0_71"
>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>> 1.8.0_71-b15)
>>>>>>> Java HotSpot(TM) 64-Bit Server VM (build
>>>>>>> 25.71-b15, mixed mode)
>>>>>>>
>>>>>>>
>>>>>>> i noticed on non zero rank saveMem is allocated
>>>>>>> at each iteration.
>>>>>>> ideally, the garbage collector can take care of
>>>>>>> that and this should not be an issue.
>>>>>>>
>>>>>>> would you mind giving the attached file a try ?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>>>>> I will have a look at it today
>>>>>>>>
>>>>>>>> how did you configure OpenMPI ?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On Thursday, July 7, 2016, Gundram Leifert
>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>
>>>>>>>> Hello Giles,
>>>>>>>>
>>>>>>>> thank you for your hints! I did 3 changes,
>>>>>>>> unfortunately the same error occures:
>>>>>>>>
>>>>>>>> update ompi:
>>>>>>>> commit ae8444682f0a7aa158caea08800542ce9874455e
>>>>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>>>>
>>>>>>>> update java:
>>>>>>>> java version "1.8.0_92"
>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>> 1.8.0_92-b14)
>>>>>>>> Java HotSpot(TM) Server VM (build
>>>>>>>> 25.92-b14, mixed mode)
>>>>>>>>
>>>>>>>> delete hashcode-lines.
>>>>>>>>
>>>>>>>> Now I get this error message - to 100%,
>>>>>>>> after different number of iterations (15-300):
>>>>>>>>
>>>>>>>> 0/ 3:length = 100000000
>>>>>>>> 0/ 3:bcast length done (length = 100000000)
>>>>>>>> 1/ 3:bcast length done (length = 100000000)
>>>>>>>> 2/ 3:bcast length done (length = 100000000)
>>>>>>>> #
>>>>>>>> # A fatal error has been detected by the
>>>>>>>> Java Runtime Environment:
>>>>>>>> #
>>>>>>>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24,
>>>>>>>> pid=16578, tid=0x00002b3d29716700
>>>>>>>> #
>>>>>>>> # JRE version: Java(TM) SE Runtime
>>>>>>>> Environment (8.0_92-b14) (build 1.8.0_92-b14)
>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server
>>>>>>>> VM (25.92-b14 mixed mode linux-amd64
>>>>>>>> compressed oops)
>>>>>>>> # Problematic frame:
>>>>>>>> # V [libjvm.so+0x414d24]
>>>>>>>> ciEnv::get_field_by_index(ciInstanceKlass*,
>>>>>>>> int)+0x94
>>>>>>>> #
>>>>>>>> # Failed to write core dump. Core dumps
>>>>>>>> have been disabled. To enable core dumping,
>>>>>>>> try "ulimit -c unlimited" before starting
>>>>>>>> Java again
>>>>>>>> #
>>>>>>>> # An error report file with more
>>>>>>>> information is saved as:
>>>>>>>> #
>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>>>>> #
>>>>>>>> # Compiler replay data is saved as:
>>>>>>>> #
>>>>>>>> /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>>>>> #
>>>>>>>> # If you would like to submit a bug report,
>>>>>>>> please visit:
>>>>>>>> # http://bugreport.java.com/bugreport/crash.jsp
>>>>>>>> #
>>>>>>>> [titan01:16578] *** Process received signal ***
>>>>>>>> [titan01:16578] Signal: Aborted (6)
>>>>>>>> [titan01:16578] Signal code: (-6)
>>>>>>>> [titan01:16578] [ 0]
>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>>>>> [titan01:16578] [ 1]
>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>>>>> [titan01:16578] [ 2]
>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>>>>> [titan01:16578] [ 3]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>>>>> [titan01:16578] [ 4]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>>>>> [titan01:16578] [ 5]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>>>>> [titan01:16578] [ 6]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>>>>> [titan01:16578] [ 7]
>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>>>>> [titan01:16578] [ 8]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>>>>> [titan01:16578] [ 9]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>>>>> [titan01:16578] [10]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>>>>> [titan01:16578] [11]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>>>>> [titan01:16578] [12]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>> [titan01:16578] [13]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>> [titan01:16578] [14]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>> [titan01:16578] [15]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>> [titan01:16578] [16]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>> [titan01:16578] [17]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>> [titan01:16578] [18]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>> [titan01:16578] [19]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>> [titan01:16578] [20]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>> [titan01:16578] [21]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>> [titan01:16578] [22]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>>>>> [titan01:16578] [23]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>>>>> [titan01:16578] [24]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>>>>> [titan01:16578] [25]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>>>>> [titan01:16578] [26]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>>>>> [titan01:16578] [27]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>>>>> [titan01:16578] [28]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>>>>> [titan01:16578] [29]
>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>>>>> [titan01:16578] *** End of error message ***
>>>>>>>> -------------------------------------------------------
>>>>>>>> Primary job terminated normally, but 1
>>>>>>>> process returned
>>>>>>>> a non-zero exit code. Per user-direction,
>>>>>>>> the job has been aborted.
>>>>>>>> -------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun noticed that process rank 2 with PID
>>>>>>>> 0 on node titan01 exited on signal 6 (Aborted).
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> I don't know if it is a problem of java or
>>>>>>>> ompi - but the last years, java worked with
>>>>>>>> no problems on my machine...
>>>>>>>>
>>>>>>>> Thank you for your tips in advance!
>>>>>>>> Gundram
>>>>>>>>
>>>>>>>> On 07/06/2016 03:10 PM, Gilles Gouaillardet
>>>>>>>> wrote:
>>>>>>>>> Note a race condition in MPI_Init has been
>>>>>>>>> fixed yesterday in the master.
>>>>>>>>> can you please update your OpenMPI and try
>>>>>>>>> again ?
>>>>>>>>>
>>>>>>>>> hopefully the hang will disappear.
>>>>>>>>>
>>>>>>>>> Can you reproduce the crash with a simpler
>>>>>>>>> (and ideally deterministic) version of
>>>>>>>>> your program.
>>>>>>>>> the crash occurs in hashcode, and this
>>>>>>>>> makes little sense to me. can you also
>>>>>>>>> update your jdk ?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On Wednesday, July 6, 2016, Gundram
>>>>>>>>> Leifert <***@uni-rostock.de>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hello Jason,
>>>>>>>>>
>>>>>>>>> thanks for your response! I thing it
>>>>>>>>> is another problem. I try to send
>>>>>>>>> 100MB bytes. So there are not many
>>>>>>>>> tries (between 10 and 30). I realized
>>>>>>>>> that the execution of this code can
>>>>>>>>> result 3 different errors:
>>>>>>>>>
>>>>>>>>> 1. most often the posted error message
>>>>>>>>> occures.
>>>>>>>>>
>>>>>>>>> 2. in <10% the cases i have a live
>>>>>>>>> lock. I can see 3 java-processes, one
>>>>>>>>> with 200% and two with 100% processor
>>>>>>>>> utilization. After ~15 minutes without
>>>>>>>>> new system outputs this error occurs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [thread 47499823949568 also had an error]
>>>>>>>>> # A fatal error has been detected by
>>>>>>>>> the Java Runtime Environment:
>>>>>>>>> #
>>>>>>>>> # Internal Error (safepoint.cpp:317),
>>>>>>>>> pid=24256, tid=47500347131648
>>>>>>>>> # guarantee(PageArmed == 0) failed:
>>>>>>>>> invariant
>>>>>>>>> #
>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>> Server VM (23.25-b01 mixed mode
>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>> dumps have been disabled. To enable
>>>>>>>>> core dumping, try "ulimit -c
>>>>>>>>> unlimited" before starting Java again
>>>>>>>>> #
>>>>>>>>> # An error report file with more
>>>>>>>>> information is saved as:
>>>>>>>>> #
>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>>>>> #
>>>>>>>>> # If you would like to submit a bug
>>>>>>>>> report, please visit:
>>>>>>>>> #
>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>> #
>>>>>>>>> [titan01:24256] *** Process received
>>>>>>>>> signal ***
>>>>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>>>>> [titan01:24256] Signal code: (-6)
>>>>>>>>> [titan01:24256] [ 0]
>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>>>>> [titan01:24256] [ 1]
>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>>>>> [titan01:24256] [ 2]
>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>>>>> [titan01:24256] [ 3]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>>>>> [titan01:24256] [ 4]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>>>>> [titan01:24256] [ 5]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>>>>> [titan01:24256] [ 6]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>>>>> [titan01:24256] [ 7]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>>>>> [titan01:24256] [ 8]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>>>>> [titan01:24256] [ 9]
>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>>>>> [titan01:24256] [10]
>>>>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>>>>> [titan01:24256] [11]
>>>>>>>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>>>>> [titan01:24256] *** End of error
>>>>>>>>> message ***
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> Primary job terminated normally, but 1
>>>>>>>>> process returned
>>>>>>>>> a non-zero exit code. Per
>>>>>>>>> user-direction, the job has been aborted.
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that process rank 0
>>>>>>>>> with PID 0 on node titan01 exited on
>>>>>>>>> signal 6 (Aborted).
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 3. in <10% the cases i have a dead
>>>>>>>>> lock while MPI.init. This stays for
>>>>>>>>> more than 15 minutes without returning
>>>>>>>>> with an error message...
>>>>>>>>>
>>>>>>>>> Can I enable some debug-flags to see
>>>>>>>>> what happens on C / OpenMPI side?
>>>>>>>>>
>>>>>>>>> Thanks in advance for your help!
>>>>>>>>> Gundram Leifert
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 07/05/2016 06:05 PM, Jason Maldonis
>>>>>>>>> wrote:
>>>>>>>>>> After reading your thread looks like
>>>>>>>>>> it may be related to an issue I had a
>>>>>>>>>> few weeks ago (I'm a novice though).
>>>>>>>>>> Maybe my thread will be of help:
>>>>>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> When you say "After a specific number
>>>>>>>>>> of repetitions the process either
>>>>>>>>>> hangs up or returns with a SIGSEGV."
>>>>>>>>>> does you mean that a single call
>>>>>>>>>> hangs, or that at some point during
>>>>>>>>>> the for loop a call hangs? If you
>>>>>>>>>> mean the latter, then it might relate
>>>>>>>>>> to my issue. Otherwise my thread
>>>>>>>>>> probably won't be helpful.
>>>>>>>>>>
>>>>>>>>>> Jason Maldonis
>>>>>>>>>> Research Assistant of Professor Paul
>>>>>>>>>> Voyles
>>>>>>>>>> Materials Science Grad Student
>>>>>>>>>> University of Wisconsin, Madison
>>>>>>>>>> 1509 University Ave, Rm M142
>>>>>>>>>> Madison, WI 53706
>>>>>>>>>> ***@wisc.edu
>>>>>>>>>> 608-295-5532
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 5, 2016 at 9:58 AM,
>>>>>>>>>> Gundram Leifert
>>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I try to send many byte-arrays
>>>>>>>>>> via broadcast. After a specific
>>>>>>>>>> number of repetitions the process
>>>>>>>>>> either hangs up or returns with a
>>>>>>>>>> SIGSEGV. Does any one can help me
>>>>>>>>>> solving the problem:
>>>>>>>>>>
>>>>>>>>>> ########## The code:
>>>>>>>>>>
>>>>>>>>>> import java.util.Random;
>>>>>>>>>> import mpi.*;
>>>>>>>>>>
>>>>>>>>>> public class TestSendBigFiles {
>>>>>>>>>>
>>>>>>>>>> public static void log(String
>>>>>>>>>> msg) {
>>>>>>>>>> try {
>>>>>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>>>>>> MPI.COMM_WORLD.getRank(),
>>>>>>>>>> MPI.COMM_WORLD.getSize(), msg));
>>>>>>>>>> } catch (MPIException ex) {
>>>>>>>>>> System.err.println(String.format("%2s/%2s:%s",
>>>>>>>>>> "?", "?", msg));
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> private static int
>>>>>>>>>> hashcode(byte[] bytearray) {
>>>>>>>>>> if (bytearray == null) {
>>>>>>>>>> return 0;
>>>>>>>>>> }
>>>>>>>>>> int hash = 39;
>>>>>>>>>> for (int i = 0; i <
>>>>>>>>>> bytearray.length; i++) {
>>>>>>>>>> byte b = bytearray[i];
>>>>>>>>>> hash = hash * 7 + (int) b;
>>>>>>>>>> }
>>>>>>>>>> return hash;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> public static void
>>>>>>>>>> main(String args[]) throws
>>>>>>>>>> MPIException {
>>>>>>>>>> log("start main");
>>>>>>>>>> MPI.Init(args);
>>>>>>>>>> try {
>>>>>>>>>> log("initialized done");
>>>>>>>>>> byte[] saveMem = new byte[100000000];
>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>> Random r = new Random();
>>>>>>>>>> r.nextBytes(saveMem);
>>>>>>>>>> if
>>>>>>>>>> (MPI.COMM_WORLD.getRank() == 0) {
>>>>>>>>>> for (int i = 0; i < 1000; i++) {
>>>>>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>>>>>> log("i = " + i);
>>>>>>>>>> int[] lengthData = new
>>>>>>>>>> int[]{saveMem.length};
>>>>>>>>>> log("object hash = " +
>>>>>>>>>> hashcode(saveMem));
>>>>>>>>>> log("length = " + lengthData[0]);
>>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>>> log("bcast length done (length =
>>>>>>>>>> " + lengthData[0] + ")");
>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>>> lengthData[0], MPI.BYTE, 0);
>>>>>>>>>> log("bcast data done");
>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>> }
>>>>>>>>>> MPI.COMM_WORLD.bcast(new
>>>>>>>>>> int[]{0}, 1, MPI.INT
>>>>>>>>>> <http://MPI.INT>, 0);
>>>>>>>>>> } else {
>>>>>>>>>> while (true) {
>>>>>>>>>> int[] lengthData = new int[1];
>>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>>> log("bcast length done (length =
>>>>>>>>>> " + lengthData[0] + ")");
>>>>>>>>>> if (lengthData[0] == 0) {
>>>>>>>>>> break;
>>>>>>>>>> }
>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>> saveMem = new
>>>>>>>>>> byte[lengthData[0]];
>>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>>> saveMem.length, MPI.BYTE, 0);
>>>>>>>>>> log("bcast data done");
>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>> log("object hash = " +
>>>>>>>>>> hashcode(saveMem));
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>> } catch (MPIException ex) {
>>>>>>>>>> System.out.println("caugth
>>>>>>>>>> error." + ex);
>>>>>>>>>> log(ex.getMessage());
>>>>>>>>>> } catch (RuntimeException
>>>>>>>>>> ex) {
>>>>>>>>>> System.out.println("caugth
>>>>>>>>>> error." + ex);
>>>>>>>>>> log(ex.getMessage());
>>>>>>>>>> } finally {
>>>>>>>>>> MPI.Finalize();
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ############ The Error (if it
>>>>>>>>>> does not just hang up):
>>>>>>>>>>
>>>>>>>>>> #
>>>>>>>>>> # A fatal error has been detected
>>>>>>>>>> by the Java Runtime Environment:
>>>>>>>>>> #
>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>> pc=0x00002b7e9c86e3a1, pid=1172,
>>>>>>>>>> tid=47822674495232
>>>>>>>>>> #
>>>>>>>>>> #
>>>>>>>>>> # A fatal error has been detected
>>>>>>>>>> by the Java Runtime Environment:
>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>> 64-Bit Server VM (23.25-b01 mixed
>>>>>>>>>> mode linux-amd64 compressed oops)
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # #
>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>> pc=0x00002af69c0693a1, pid=1173,
>>>>>>>>>> tid=47238546896640
>>>>>>>>>> #
>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>> J
>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>> #
>>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>>> dumps have been disabled. To
>>>>>>>>>> enable core dumping, try "ulimit
>>>>>>>>>> -c unlimited" before starting
>>>>>>>>>> Java again
>>>>>>>>>> #
>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>> 64-Bit Server VM (23.25-b01 mixed
>>>>>>>>>> mode linux-amd64 compressed oops)
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # J
>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>> #
>>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>>> dumps have been disabled. To
>>>>>>>>>> enable core dumping, try "ulimit
>>>>>>>>>> -c unlimited" before starting
>>>>>>>>>> Java again
>>>>>>>>>> #
>>>>>>>>>> # An error report file with more
>>>>>>>>>> information is saved as:
>>>>>>>>>> #
>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>>>>> # An error report file with more
>>>>>>>>>> information is saved as:
>>>>>>>>>> #
>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>>>>> #
>>>>>>>>>> # If you would like to submit a
>>>>>>>>>> bug report, please visit:
>>>>>>>>>> #
>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>> #
>>>>>>>>>> #
>>>>>>>>>> # If you would like to submit a
>>>>>>>>>> bug report, please visit:
>>>>>>>>>> #
>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>> #
>>>>>>>>>> [titan01:01172] *** Process
>>>>>>>>>> received signal ***
>>>>>>>>>> [titan01:01172] Signal: Aborted (6)
>>>>>>>>>> [titan01:01172] Signal code: (-6)
>>>>>>>>>> [titan01:01173] *** Process
>>>>>>>>>> received signal ***
>>>>>>>>>> [titan01:01173] Signal: Aborted (6)
>>>>>>>>>> [titan01:01173] Signal code: (-6)
>>>>>>>>>> [titan01:01172] [ 0]
>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>>>>> [titan01:01172] [ 1]
>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>>>>> [titan01:01172] [ 2]
>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>>>>> [titan01:01172] [ 3]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>>>>> [titan01:01172] [ 4]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>>>>> [titan01:01172] [ 5]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>>>>> [titan01:01172] [ 6]
>>>>>>>>>> [titan01:01173] [ 0]
>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>>>>> [titan01:01173] [ 1]
>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>>>>> [titan01:01172] [ 7] [0x2b7e9c86e3a1]
>>>>>>>>>> [titan01:01172] *** End of error
>>>>>>>>>> message ***
>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>>>>> [titan01:01173] [ 2]
>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>>>>> [titan01:01173] [ 3]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>>>>> [titan01:01173] [ 4]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>>>>> [titan01:01173] [ 5]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>>>>> [titan01:01173] [ 6]
>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>>>>> [titan01:01173] [ 7] [0x2af69c0693a1]
>>>>>>>>>> [titan01:01173] *** End of error
>>>>>>>>>> message ***
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> Primary job terminated normally,
>>>>>>>>>> but 1 process returned
>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>> user-direction, the job has been
>>>>>>>>>> aborted.
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun noticed that process rank
>>>>>>>>>> 1 with PID 0 on node titan01
>>>>>>>>>> exited on signal 6 (Aborted).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ########CONFIGURATION:
>>>>>>>>>> I used the ompi master sources
>>>>>>>>>> from github:
>>>>>>>>>> commit
>>>>>>>>>> 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>>>>> Author: Gilles Gouaillardet
>>>>>>>>>> <***@rist.or.jp>
>>>>>>>>>> Date: Tue Jul 5 13:47:50 2016 +0900
>>>>>>>>>>
>>>>>>>>>> ./configure --enable-mpi-java
>>>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for your help!
>>>>>>>>>> Gundram
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> ***@open-mpi.org
>>>>>>>>>> Subscription:
>>>>>>>>>> https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>> Link to this post:
>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> ***@open-mpi.org
>>>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> ***@open-mpi.org
>>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> ***@open-mpi.org
>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> ***@open-mpi.org
>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> ***@open-mpi.org
>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29592.php
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> ***@open-mpi.org
>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29593.php
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> ***@open-mpi.org
>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29601.php
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> ***@open-mpi.org
>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29603.php
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@open-mpi.org
>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29610.php
>
Graham, Nathaniel Richard
2016-09-07 16:49:04 UTC
Permalink
Hello Gundram,


It looks like the test that is failing is TestMpiRmaCompareAndSwap.java. Is that the one that is crashing? If so, could you try to run the C test from:


http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1


There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).
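
If getting those headers is a hassle, a self-contained sketch along the same lines may be enough for a first check (this is only my own minimal version, not the MPICH test itself; it assumes a plain MPI_Win_allocate window with lock/unlock synchronization is acceptable here):

#include <mpi.h>
#include <stdio.h>

/* Minimal compare-and-swap check: every rank exposes one int (initially 0)
   and tries to CAS its own rank number into the window of the next rank. */
int main(int argc, char **argv)
{
    int rank, nproc, target;
    int origin, compare = 0, result = -1;
    int *winbuf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &winbuf, &win);
    *winbuf = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    target = (rank + 1) % nproc;
    origin = rank;

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Compare_and_swap(&origin, &compare, &result, MPI_INT,
                         target, 0, win);
    MPI_Win_unlock(target, win);

    MPI_Barrier(MPI_COMM_WORLD);
    /* result should be 0 and the local window should now hold the left
       neighbour's rank (a quick check only; the MPICH test is stricter). */
    printf("rank %d: fetched %d from rank %d, local window is now %d\n",
           rank, result, target, *winbuf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Building it with mpicc and running it under mpirun with 2 or 3 ranks exercises the same MPI_Compare_and_swap path without any Java involved.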


This should let us know whether it's an issue related to Java or not.


If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations of the C tests).


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hello,

I still have the same errors on our cluster - even one more. Maybe the new one helps us to find a solution.

I have this error if I run "make_onesided" of the ompi-java-test repo.

CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] *** and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Sometimes I also have the SIGSEGV error.

System:

compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org<mailto:***@kbuilder.dev.centos.org>) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP

Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016

inifiband

openib: OpenSM 3.3.19


limits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Thanks, Gundram
Gundram Leifert
2016-09-13 06:46:40 UTC
Permalink
Hey,


it seems to be a problem of OMPI 2.x. The C version built against 2.0.1 also
produces this output:

(the same whether built from source or with the 2.0.1 release)


[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now
abort,
[node108:2949] *** and potentially your MPI job)

But the test works with 1.8.x! In fact our cluster does not have
shared memory, so it has to fall back to the default methods.
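
A possible next check (just an assumption on my side that the one-sided component can be switched like this; the "win rdma window" in the error message suggests the rdma OSC component is in use) would be

mpirun --mca osc pt2pt -np 2 ./a.out

to see whether only that component runs into MPI_ERR_RMA_RANGE here.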

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:
>
> Hello Gundram,
>
>
> It looks like the test that is failing is
> TestMpiRmaCompareAndSwap.java​. Is that the one that is crashing? If
> so, could you try to run the C test from:
>
>
> http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1
>
>
> There are a couple of header files you will need for that test, but
> they are in the same repo as the test (up a few folders and in an
> include folder).
>
>
> This should let us know whether its an issue related to Java or not.
>
>
> If it is another test, let me know and Ill see if I can get you the C
> version (most or all of the Java tests are translations from the C test).
>
>
> -Nathan
>
>
>
> --
> Nathaniel Graham
> HPC-DES
> Los Alamos National Laboratory
> ------------------------------------------------------------------------
> *From:* users <users-***@lists.open-mpi.org> on behalf of Gundram
> Leifert <***@uni-rostock.de>
> *Sent:* Wednesday, September 7, 2016 9:23 AM
> *To:* ***@lists.open-mpi.org
> *Subject:* Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
>
> Hello,
>
> I still have the same errors on our cluster - even one more. Maybe the
> new one helps us to find a solution.
>
> I have this error if I run "make_onesided" of the ompi-java-test repo.
>
> CReqops and TestMpiRmaCompareAndSwap report (pretty
> deterministically - in all my 30 runs) this error:
>
> [titan01:5134] *** An error occurred in MPI_Compare_and_swap
> [titan01:5134] *** reported by process [2392850433,1]
> [titan01:5134] *** on win rdma window 3
> [titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
> [titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will
> now abort,
> [titan01:5134] *** and potentially your MPI job)
> [titan01.service:05128] 1 more process has sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal
> [titan01.service:05128] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
> Sometimes I also have the SIGSEGV error.
>
> System:
>
> compiler: gcc/5.2.0
> java: jdk1.8.0_102
> kernelmodule: mlx4_core mlx4_en mlx4_ib
> Linux version 3.10.0-327.13.1.el7.x86_64
> (***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat
> 4.8.3-9) (GCC) ) #1 SMP
>
> Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo
> rev: v2.0.0-257-gee86e07, Sep 02, 2016
>
> inifiband
>
> openib: OpenSM 3.3.19
>
>
> limits:
>
> ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size (blocks, -f) unlimited
> pending signals (-i) 256554
> max locked memory (kbytes, -l) unlimited
> max memory size (kbytes, -m) unlimited
> open files (-n) 100000
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority (-r) 0
> stack size (kbytes, -s) unlimited
> cpu time (seconds, -t) unlimited
> max user processes (-u) 4096
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
>
> Thanks, Gundram
> On 07/12/2016 11:08 AM, Gundram Leifert wrote:
>> Hello Gilley, Howard,
>>
>> I configured without disable dlopen - same error.
>>
>> I test these classes on another cluster and: IT WORKS!
>>
>> So it is a problem of the cluster configuration. Thank you all very
>> much for all your help! When the admin can solve the problem, i will
>> let you know, what he had changed.
>>
>> Cheers Gundram
>>
>> On 07/08/2016 04:19 PM, Howard Pritchard wrote:
>>> Hi Gundram
>>>
>>> Could you configure without the disable dlopen option and retry?
>>>
>>> Howard
>>>
>>> Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
>>>
>>> the JVM sets its own signal handlers, and it is important
>>> openmpi dones not override them.
>>> this is what previously happened with PSM (infinipath) but this
>>> has been solved since.
>>> you might be linking with a third party library that hijacks
>>> signal handlers and cause the crash
>>> (which would explain why I cannot reproduce the issue)
>>>
>>> the master branch has a revamped memory patcher (compared to
>>> v2.x or v1.10), and that could have some bad interactions with
>>> the JVM, so you might also give v2.x a try
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Friday, July 8, 2016, Gundram Leifert
>>> <***@uni-rostock.de> wrote:
>>>
>>> You made the best of it... thanks a lot!
>>>
>>> Whithout MPI it runs.
>>> Just adding MPI.init() causes the crash!
>>>
>>> maybe I installed something wrong...
>>>
>>> install newest automake, autoconf, m4, libtoolize in right
>>> order and same prefix
>>> check out ompi,
>>> autogen
>>> configure with same prefix, pointing to the same jdk, I
>>> later use
>>> make
>>> make install
>>>
>>> I will test some different configurations of ./configure...
>>>
>>>
>>> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>>>> I am running out of ideas ...
>>>>
>>>> what if you do not run within slurm ?
>>>> what if you do not use '-cp executor.jar'
>>>> or what if you configure without --disable-dlopen
>>>> --disable-mca-dso ?
>>>>
>>>> if you
>>>> mpirun -np 1 ...
>>>> then MPI_Bcast and MPI_Barrier are basically no-op, so it
>>>> is really weird your program is still crashing. an other
>>>> test is to comment out MPI_Bcast and MPI_Barrier and try
>>>> again with -np 1
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Friday, July 8, 2016, Gundram Leifert
>>>> <***@uni-rostock.de> wrote:
>>>>
>>>> In any cases the same error.
>>>> this is my code:
>>>>
>>>> salloc -n 3
>>>> export IPATH_NO_BACKTRACE
>>>> ulimit -s 10240
>>>> mpirun -np 3 java -cp executor.jar
>>>> de.uros.citlab.executor.test.TestSendBigFiles2
>>>>
>>>>
>>>> also for 1 or two cores, the process crashes.
>>>>
>>>>
>>>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>>>> you can try
>>>>> export IPATH_NO_BACKTRACE
>>>>> before invoking mpirun (that should not be needed though)
>>>>>
>>>>> an other test is to
>>>>> ulimit -s 10240
>>>>> before invoking mpirun.
>>>>>
>>>>> btw, do you use mpirun or srun ?
>>>>>
>>>>> can you reproduce the crash with 1 or 2 tasks ?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>> <***@uni-rostock.de> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> configure:
>>>>> ./configure --enable-mpi-java
>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>> --disable-dlopen --disable-mca-dso
>>>>>
>>>>>
>>>>> 1 node with 3 cores. I use SLURM to allocate one
>>>>> node. I changed --mem, but it has no effect.
>>>>> salloc -n 3
>>>>>
>>>>>
>>>>> core file size (blocks, -c) 0
>>>>> data seg size (kbytes, -d) unlimited
>>>>> scheduling priority (-e) 0
>>>>> file size (blocks, -f) unlimited
>>>>> pending signals (-i) 256564
>>>>> max locked memory (kbytes, -l) unlimited
>>>>> max memory size (kbytes, -m) unlimited
>>>>> open files (-n) 100000
>>>>> pipe size (512 bytes, -p) 8
>>>>> POSIX message queues (bytes, -q) 819200
>>>>> real-time priority (-r) 0
>>>>> stack size (kbytes, -s) unlimited
>>>>> cpu time (seconds, -t) unlimited
>>>>> max user processes (-u) 4096
>>>>> virtual memory (kbytes, -v) unlimited
>>>>> file locks (-x) unlimited
>>>>>
>>>>> uname -a
>>>>> Linux titan01.service 3.10.0-327.13.1.el7.x86_64
>>>>> #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64
>>>>> x86_64 GNU/Linux
>>>>>
>>>>> cat /etc/system-release
>>>>> CentOS Linux release 7.2.1511 (Core)
>>>>>
>>>>> what else do you need?
>>>>>
>>>>> Cheers, Gundram
>>>>>
>>>>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>>>>
>>>>>> Gundram,
>>>>>>
>>>>>>
>>>>>> can you please provide more information on your
>>>>>> environment :
>>>>>>
>>>>>> - configure command line
>>>>>>
>>>>>> - OS
>>>>>>
>>>>>> - memory available
>>>>>>
>>>>>> - ulimit -a
>>>>>>
>>>>>> - number of nodes
>>>>>>
>>>>>> - number of tasks used
>>>>>>
>>>>>> - interconnect used (if any)
>>>>>>
>>>>>> - batch manager (if any)
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>>>>> Hello Gilles,
>>>>>>>
>>>>>>> I tried you code and it crashes after 3-15
>>>>>>> iterations (see (1)). It is always the same
>>>>>>> error (only the "94" varies).
>>>>>>>
>>>>>>> Meanwhile I think Java and MPI use the same
>>>>>>> memory because when I delete the hash-call, the
>>>>>>> program runs sometimes more than 9k iterations.
>>>>>>> When it crashes, there are different lines (see
>>>>>>> (2) and (3)). The crashes also occurs on rank 0.
>>>>>>>
>>>>>>> ##### (1)#####
>>>>>>> # Problematic frame:
>>>>>>> # J 94 C2
>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>>>>>> (42 bytes) @ 0x00002b03242dc9c4
>>>>>>> [0x00002b03242dc860+0x164]
>>>>>>>
>>>>>>> #####(2)#####
>>>>>>> # Problematic frame:
>>>>>>> # V [libjvm.so+0x68d0f6]
>>>>>>> JavaCallWrapper::JavaCallWrapper(methodHandle,
>>>>>>> Handle, JavaValue*, Thread*)+0xb6
>>>>>>>
>>>>>>> #####(3)#####
>>>>>>> # Problematic frame:
>>>>>>> # V [libjvm.so+0x4183bf]
>>>>>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>>>>>
>>>>>>> Any more idea?
>>>>>>>
>>>>>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>>>>>>>
>>>>>>>> Gundram,
>>>>>>>>
>>>>>>>>
>>>>>>>> fwiw, i cannot reproduce the issue on my box
>>>>>>>>
>>>>>>>> - centos 7
>>>>>>>>
>>>>>>>> - java version "1.8.0_71"
>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>> 1.8.0_71-b15)
>>>>>>>> Java HotSpot(TM) 64-Bit Server VM (build
>>>>>>>> 25.71-b15, mixed mode)
>>>>>>>>
>>>>>>>>
>>>>>>>> i noticed on non zero rank saveMem is allocated
>>>>>>>> at each iteration.
>>>>>>>> ideally, the garbage collector can take care of
>>>>>>>> that and this should not be an issue.
>>>>>>>>
>>>>>>>> would you mind giving the attached file a try ?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>>>>>> I will have a look at it today
>>>>>>>>>
>>>>>>>>> how did you configure OpenMPI ?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On Thursday, July 7, 2016, Gundram Leifert
>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>
>>>>>>>>> Hello Giles,
>>>>>>>>>
>>>>>>>>> thank you for your hints! I did 3 changes,
>>>>>>>>> unfortunately the same error occures:
>>>>>>>>>
>>>>>>>>> update ompi:
>>>>>>>>> commit
>>>>>>>>> ae8444682f0a7aa158caea08800542ce9874455e
>>>>>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>>>>>
>>>>>>>>> update java:
>>>>>>>>> java version "1.8.0_92"
>>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>>> 1.8.0_92-b14)
>>>>>>>>> Java HotSpot(TM) Server VM (build
>>>>>>>>> 25.92-b14, mixed mode)
>>>>>>>>>
>>>>>>>>> delete hashcode-lines.
>>>>>>>>>
>>>>>>>>> Now I get this error message - to 100%,
>>>>>>>>> after different number of iterations (15-300):
>>>>>>>>>
>>>>>>>>> 0/ 3:length = 100000000
>>>>>>>>> 0/ 3:bcast length done (length = 100000000)
>>>>>>>>> 1/ 3:bcast length done (length = 100000000)
>>>>>>>>> 2/ 3:bcast length done (length = 100000000)
>>>>>>>>> #
>>>>>>>>> # A fatal error has been detected by the
>>>>>>>>> Java Runtime Environment:
>>>>>>>>> #
>>>>>>>>> # SIGSEGV (0xb) at pc=0x00002b3d022fcd24,
>>>>>>>>> pid=16578, tid=0x00002b3d29716700
>>>>>>>>> #
>>>>>>>>> # JRE version: Java(TM) SE Runtime
>>>>>>>>> Environment (8.0_92-b14) (build 1.8.0_92-b14)
>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server
>>>>>>>>> VM (25.92-b14 mixed mode linux-amd64
>>>>>>>>> compressed oops)
>>>>>>>>> # Problematic frame:
>>>>>>>>> # V [libjvm.so+0x414d24]
>>>>>>>>> ciEnv::get_field_by_index(ciInstanceKlass*,
>>>>>>>>> int)+0x94
>>>>>>>>> #
>>>>>>>>> # Failed to write core dump. Core dumps
>>>>>>>>> have been disabled. To enable core
>>>>>>>>> dumping, try "ulimit -c unlimited" before
>>>>>>>>> starting Java again
>>>>>>>>> #
>>>>>>>>> # An error report file with more
>>>>>>>>> information is saved as:
>>>>>>>>> #
>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>>>>>> #
>>>>>>>>> # Compiler replay data is saved as:
>>>>>>>>> #
>>>>>>>>> /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>>>>>> #
>>>>>>>>> # If you would like to submit a bug
>>>>>>>>> report, please visit:
>>>>>>>>> #
>>>>>>>>> http://bugreport.java.com/bugreport/crash.jsp
>>>>>>>>> #
>>>>>>>>> [titan01:16578] *** Process received
>>>>>>>>> signal ***
>>>>>>>>> [titan01:16578] Signal: Aborted (6)
>>>>>>>>> [titan01:16578] Signal code: (-6)
>>>>>>>>> [titan01:16578] [ 0]
>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>>>>>> [titan01:16578] [ 1]
>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>>>>>> [titan01:16578] [ 2]
>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>>>>>> [titan01:16578] [ 3]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>>>>>> [titan01:16578] [ 4]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>>>>>> [titan01:16578] [ 5]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>>>>>> [titan01:16578] [ 6]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>>>>>> [titan01:16578] [ 7]
>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>>>>>> [titan01:16578] [ 8]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>>>>>> [titan01:16578] [ 9]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>>>>>> [titan01:16578] [10]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>>>>>> [titan01:16578] [11]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>>>>>> [titan01:16578] [12]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>>> [titan01:16578] [13]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>>> [titan01:16578] [14]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>>> [titan01:16578] [15]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>>> [titan01:16578] [16]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>>> [titan01:16578] [17]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>>> [titan01:16578] [18]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>>> [titan01:16578] [19]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>>> [titan01:16578] [20]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>>> [titan01:16578] [21]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>>> [titan01:16578] [22]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>>>>>> [titan01:16578] [23]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>>>>>> [titan01:16578] [24]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>>>>>> [titan01:16578] [25]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>>>>>> [titan01:16578] [26]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>>>>>> [titan01:16578] [27]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>>>>>> [titan01:16578] [28]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>>>>>> [titan01:16578] [29]
>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>>>>>> [titan01:16578] *** End of error message ***
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> Primary job terminated normally, but 1
>>>>>>>>> process returned
>>>>>>>>> a non-zero exit code. Per user-direction,
>>>>>>>>> the job has been aborted.
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun noticed that process rank 2 with
>>>>>>>>> PID 0 on node titan01 exited on signal 6
>>>>>>>>> (Aborted).
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> I don't know if it is a problem of java
>>>>>>>>> or ompi - but the last years, java worked
>>>>>>>>> with no problems on my machine...
>>>>>>>>>
>>>>>>>>> Thank you for your tips in advance!
>>>>>>>>> Gundram
>>>>>>>>>
>>>>>>>>> On 07/06/2016 03:10 PM, Gilles
>>>>>>>>> Gouaillardet wrote:
>>>>>>>>>> Note a race condition in MPI_Init has
>>>>>>>>>> been fixed yesterday in the master.
>>>>>>>>>> can you please update your OpenMPI and
>>>>>>>>>> try again ?
>>>>>>>>>>
>>>>>>>>>> hopefully the hang will disappear.
>>>>>>>>>>
>>>>>>>>>> Can you reproduce the crash with a
>>>>>>>>>> simpler (and ideally deterministic)
>>>>>>>>>> version of your program.
>>>>>>>>>> the crash occurs in hashcode, and this
>>>>>>>>>> makes little sense to me. can you also
>>>>>>>>>> update your jdk ?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Gilles
>>>>>>>>>>
>>>>>>>>>> On Wednesday, July 6, 2016, Gundram
>>>>>>>>>> Leifert <***@uni-rostock.de>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello Jason,
>>>>>>>>>>
>>>>>>>>>> thanks for your response! I thing it
>>>>>>>>>> is another problem. I try to send
>>>>>>>>>> 100MB bytes. So there are not many
>>>>>>>>>> tries (between 10 and 30). I realized
>>>>>>>>>> that the execution of this code can
>>>>>>>>>> result 3 different errors:
>>>>>>>>>>
>>>>>>>>>> 1. most often the posted error
>>>>>>>>>> message occures.
>>>>>>>>>>
>>>>>>>>>> 2. in <10% the cases i have a live
>>>>>>>>>> lock. I can see 3 java-processes, one
>>>>>>>>>> with 200% and two with 100% processor
>>>>>>>>>> utilization. After ~15 minutes
>>>>>>>>>> without new system outputs this error
>>>>>>>>>> occurs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [thread 47499823949568 also had an error]
>>>>>>>>>> # A fatal error has been detected by
>>>>>>>>>> the Java Runtime Environment:
>>>>>>>>>> #
>>>>>>>>>> # Internal Error
>>>>>>>>>> (safepoint.cpp:317), pid=24256,
>>>>>>>>>> tid=47500347131648
>>>>>>>>>> # guarantee(PageArmed == 0) failed:
>>>>>>>>>> invariant
>>>>>>>>>> #
>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>>> Server VM (23.25-b01 mixed mode
>>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>>> dumps have been disabled. To enable
>>>>>>>>>> core dumping, try "ulimit -c
>>>>>>>>>> unlimited" before starting Java again
>>>>>>>>>> #
>>>>>>>>>> # An error report file with more
>>>>>>>>>> information is saved as:
>>>>>>>>>> #
>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>>>>>> #
>>>>>>>>>> # If you would like to submit a bug
>>>>>>>>>> report, please visit:
>>>>>>>>>> #
>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>> #
>>>>>>>>>> [titan01:24256] *** Process received
>>>>>>>>>> signal ***
>>>>>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>>>>>> [titan01:24256] Signal code: (-6)
>>>>>>>>>> [titan01:24256] [ 0]
>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>>>>>> [titan01:24256] [ 1]
>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>>>>>> [titan01:24256] [ 2]
>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>>>>>> [titan01:24256] [ 3]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>>>>>> [titan01:24256] [ 4]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>>>>>> [titan01:24256] [ 5]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>>>>>> [titan01:24256] [ 6]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>>>>>> [titan01:24256] [ 7]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>>>>>> [titan01:24256] [ 8]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>>>>>> [titan01:24256] [ 9]
>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>>>>>> [titan01:24256] [10]
>>>>>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>>>>>> [titan01:24256] [11]
>>>>>>>>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>>>>>> [titan01:24256] *** End of error
>>>>>>>>>> message ***
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> Primary job terminated normally, but
>>>>>>>>>> 1 process returned
>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>> user-direction, the job has been aborted.
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun noticed that process rank 0
>>>>>>>>>> with PID 0 on node titan01 exited on
>>>>>>>>>> signal 6 (Aborted).
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 3. in <10% the cases i have a dead
>>>>>>>>>> lock while MPI.init. This stays for
>>>>>>>>>> more than 15 minutes without
>>>>>>>>>> returning with an error message...
>>>>>>>>>>
>>>>>>>>>> Can I enable some debug-flags to see
>>>>>>>>>> what happens on C / OpenMPI side?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance for your help!
>>>>>>>>>> Gundram Leifert
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 07/05/2016 06:05 PM, Jason
>>>>>>>>>> Maldonis wrote:
>>>>>>>>>>> After reading your thread looks like
>>>>>>>>>>> it may be related to an issue I had
>>>>>>>>>>> a few weeks ago (I'm a novice
>>>>>>>>>>> though). Maybe my thread will be of
>>>>>>>>>>> help:
>>>>>>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> When you say "After a specific
>>>>>>>>>>> number of repetitions the process
>>>>>>>>>>> either hangs up or returns with a
>>>>>>>>>>> SIGSEGV." does you mean that a
>>>>>>>>>>> single call hangs, or that at some
>>>>>>>>>>> point during the for loop a call
>>>>>>>>>>> hangs? If you mean the latter, then
>>>>>>>>>>> it might relate to my issue.
>>>>>>>>>>> Otherwise my thread probably won't
>>>>>>>>>>> be helpful.
>>>>>>>>>>>
>>>>>>>>>>> Jason Maldonis
>>>>>>>>>>> Research Assistant of Professor Paul
>>>>>>>>>>> Voyles
>>>>>>>>>>> Materials Science Grad Student
>>>>>>>>>>> University of Wisconsin, Madison
>>>>>>>>>>> 1509 University Ave, Rm M142
>>>>>>>>>>> Madison, WI 53706
>>>>>>>>>>> ***@wisc.edu
>>>>>>>>>>> 608-295-5532
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 5, 2016 at 9:58 AM,
>>>>>>>>>>> Gundram Leifert
>>>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I try to send many byte-arrays
>>>>>>>>>>> via broadcast. After a specific
>>>>>>>>>>> number of repetitions the
>>>>>>>>>>> process either hangs up or
>>>>>>>>>>> returns with a SIGSEGV. Does any
>>>>>>>>>>> one can help me solving the problem:
>>>>>>>>>>>
>>>>>>>>>>> ########## The code:
>>>>>>>>>>>
>>>>>>>>>>> import java.util.Random;
>>>>>>>>>>> import mpi.*;
>>>>>>>>>>>
>>>>>>>>>>> public class TestSendBigFiles {
>>>>>>>>>>>
>>>>>>>>>>> public static void
>>>>>>>>>>> log(String msg) {
>>>>>>>>>>> try {
>>>>>>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>>>>>>> MPI.COMM_WORLD.getRank(),
>>>>>>>>>>> MPI.COMM_WORLD.getSize(), msg));
>>>>>>>>>>> } catch (MPIException ex) {
>>>>>>>>>>> System.err.println(String.format("%2s/%2s:%s",
>>>>>>>>>>> "?", "?", msg));
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> private static int
>>>>>>>>>>> hashcode(byte[] bytearray) {
>>>>>>>>>>> if (bytearray == null) {
>>>>>>>>>>> return 0;
>>>>>>>>>>> }
>>>>>>>>>>> int hash = 39;
>>>>>>>>>>> for (int i = 0; i <
>>>>>>>>>>> bytearray.length; i++) {
>>>>>>>>>>> byte b = bytearray[i];
>>>>>>>>>>> hash = hash * 7 + (int) b;
>>>>>>>>>>> }
>>>>>>>>>>> return hash;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> public static void
>>>>>>>>>>> main(String args[]) throws
>>>>>>>>>>> MPIException {
>>>>>>>>>>> log("start main");
>>>>>>>>>>> MPI.Init(args);
>>>>>>>>>>> try {
>>>>>>>>>>> log("initialized done");
>>>>>>>>>>> byte[] saveMem = new
>>>>>>>>>>> byte[100000000];
>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>> Random r = new Random();
>>>>>>>>>>> r.nextBytes(saveMem);
>>>>>>>>>>> if
>>>>>>>>>>> (MPI.COMM_WORLD.getRank() == 0) {
>>>>>>>>>>> for (int i = 0; i < 1000; i++) {
>>>>>>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>>>>>>> log("i = " + i);
>>>>>>>>>>> int[] lengthData = new
>>>>>>>>>>> int[]{saveMem.length};
>>>>>>>>>>> log("object hash = " +
>>>>>>>>>>> hashcode(saveMem));
>>>>>>>>>>> log("length = " + lengthData[0]);
>>>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>>>> log("bcast length done (length =
>>>>>>>>>>> " + lengthData[0] + ")");
>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>>>> lengthData[0], MPI.BYTE, 0);
>>>>>>>>>>> log("bcast data done");
>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>> }
>>>>>>>>>>> MPI.COMM_WORLD.bcast(new
>>>>>>>>>>> int[]{0}, 1, MPI.INT
>>>>>>>>>>> <http://MPI.INT>, 0);
>>>>>>>>>>> } else {
>>>>>>>>>>> while (true) {
>>>>>>>>>>> int[] lengthData = new int[1];
>>>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>>>> log("bcast length done (length =
>>>>>>>>>>> " + lengthData[0] + ")");
>>>>>>>>>>> if (lengthData[0] == 0) {
>>>>>>>>>>> break;
>>>>>>>>>>> }
>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>> saveMem = new
>>>>>>>>>>> byte[lengthData[0]];
>>>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>>>> saveMem.length, MPI.BYTE, 0);
>>>>>>>>>>> log("bcast data done");
>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>> log("object hash = " +
>>>>>>>>>>> hashcode(saveMem));
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>> } catch (MPIException ex) {
>>>>>>>>>>> System.out.println("caugth
>>>>>>>>>>> error." + ex);
>>>>>>>>>>> log(ex.getMessage());
>>>>>>>>>>> } catch
>>>>>>>>>>> (RuntimeException ex) {
>>>>>>>>>>> System.out.println("caugth
>>>>>>>>>>> error." + ex);
>>>>>>>>>>> log(ex.getMessage());
>>>>>>>>>>> } finally {
>>>>>>>>>>> MPI.Finalize();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ############ The Error (if it
>>>>>>>>>>> does not just hang up):
>>>>>>>>>>>
>>>>>>>>>>> #
>>>>>>>>>>> # A fatal error has been
>>>>>>>>>>> detected by the Java Runtime
>>>>>>>>>>> Environment:
>>>>>>>>>>> #
>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>> pc=0x00002b7e9c86e3a1, pid=1172,
>>>>>>>>>>> tid=47822674495232
>>>>>>>>>>> #
>>>>>>>>>>> #
>>>>>>>>>>> # A fatal error has been
>>>>>>>>>>> detected by the Java Runtime
>>>>>>>>>>> Environment:
>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>>> 64-Bit Server VM (23.25-b01
>>>>>>>>>>> mixed mode linux-amd64
>>>>>>>>>>> compressed oops)
>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>> # #
>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>> pc=0x00002af69c0693a1, pid=1173,
>>>>>>>>>>> tid=47238546896640
>>>>>>>>>>> #
>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>> J
>>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>>> #
>>>>>>>>>>> # Failed to write core dump.
>>>>>>>>>>> Core dumps have been disabled.
>>>>>>>>>>> To enable core dumping, try
>>>>>>>>>>> "ulimit -c unlimited" before
>>>>>>>>>>> starting Java again
>>>>>>>>>>> #
>>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>>> 64-Bit Server VM (23.25-b01
>>>>>>>>>>> mixed mode linux-amd64
>>>>>>>>>>> compressed oops)
>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>> # J
>>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>>> #
>>>>>>>>>>> # Failed to write core dump.
>>>>>>>>>>> Core dumps have been disabled.
>>>>>>>>>>> To enable core dumping, try
>>>>>>>>>>> "ulimit -c unlimited" before
>>>>>>>>>>> starting Java again
>>>>>>>>>>> #
>>>>>>>>>>> # An error report file with more
>>>>>>>>>>> information is saved as:
>>>>>>>>>>> #
>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>>>>>> # An error report file with more
>>>>>>>>>>> information is saved as:
>>>>>>>>>>> #
>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>>>>>> #
>>>>>>>>>>> # If you would like to submit a
>>>>>>>>>>> bug report, please visit:
>>>>>>>>>>> #
>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>> #
>>>>>>>>>>> #
>>>>>>>>>>> # If you would like to submit a
>>>>>>>>>>> bug report, please visit:
>>>>>>>>>>> #
>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>> #
>>>>>>>>>>> [titan01:01172] *** Process
>>>>>>>>>>> received signal ***
>>>>>>>>>>> [titan01:01172] Signal: Aborted (6)
>>>>>>>>>>> [titan01:01172] Signal code: (-6)
>>>>>>>>>>> [titan01:01173] *** Process
>>>>>>>>>>> received signal ***
>>>>>>>>>>> [titan01:01173] Signal: Aborted (6)
>>>>>>>>>>> [titan01:01173] Signal code: (-6)
>>>>>>>>>>> [titan01:01172] [ 0]
>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>>>>>> [titan01:01172] [ 1]
>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>>>>>> [titan01:01172] [ 2]
>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>>>>>> [titan01:01172] [ 3]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>>>>>> [titan01:01172] [ 4]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>>>>>> [titan01:01172] [ 5]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>>>>>> [titan01:01172] [ 6]
>>>>>>>>>>> [titan01:01173] [ 0]
>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>>>>>> [titan01:01173] [ 1]
>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>>>>>> [titan01:01172] [ 7]
>>>>>>>>>>> [0x2b7e9c86e3a1]
>>>>>>>>>>> [titan01:01172] *** End of error
>>>>>>>>>>> message ***
>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>>>>>> [titan01:01173] [ 2]
>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>>>>>> [titan01:01173] [ 3]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>>>>>> [titan01:01173] [ 4]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>>>>>> [titan01:01173] [ 5]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>>>>>> [titan01:01173] [ 6]
>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>>>>>> [titan01:01173] [ 7]
>>>>>>>>>>> [0x2af69c0693a1]
>>>>>>>>>>> [titan01:01173] *** End of error
>>>>>>>>>>> message ***
>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>> Primary job terminated normally,
>>>>>>>>>>> but 1 process returned
>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>> user-direction, the job has been
>>>>>>>>>>> aborted.
>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun noticed that process rank
>>>>>>>>>>> 1 with PID 0 on node titan01
>>>>>>>>>>> exited on signal 6 (Aborted).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ########CONFIGURATION:
>>>>>>>>>>> I used the ompi master sources
>>>>>>>>>>> from github:
>>>>>>>>>>> commit
>>>>>>>>>>> 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>>>>>> Author: Gilles Gouaillardet
>>>>>>>>>>> <***@rist.or.jp>
>>>>>>>>>>> Date: Tue Jul 5 13:47:50 2016
>>>>>>>>>>> +0900
>>>>>>>>>>>
>>>>>>>>>>> ./configure --enable-mpi-java
>>>>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot for your help!
>>>>>>>>>>> Gundram
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> ***@open-mpi.org
>>>>>>>>>>> Subscription:
>>>>>>>>>>> https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>> Link to this post:
>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2016/07/29584.php
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> ***@open-mpi.org
>>>>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29585.php
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> ***@open-mpi.org
>>>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29587.php
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> ***@open-mpi.org
>>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29589.php
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> ***@open-mpi.org
>>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29590.php
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> ***@open-mpi.org
>>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29592.php
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> ***@open-mpi.org
>>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29593.php
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> ***@open-mpi.org
>>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29601.php
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> ***@open-mpi.org
>>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29603.php
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> ***@open-mpi.org
>>> Subscription:https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:http://www.open-mpi.org/community/lists/users/2016/07/29610.php
>>
>
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Graham, Nathaniel Richard
2016-09-13 18:06:39 UTC
Permalink
Since you are getting the same errors with C as with Java, this is an issue in the C layer, not the Java bindings. However, in the most recent output you are running the test with ./a.out directly. Did you use mpirun to run the tests in Java and C?


The command should be something along the lines of:


mpirun -np 2 java TestMpiRmaCompareAndSwap


mpirun -np 2 ./a.out


Also, are you compiling with the OMPI wrapper compilers? It should be:


mpijavac TestMpiRmaCompareAndSwap.java


mpicc compare_and_swap.c


In the meantime, I will try to reproduce this on a similar system.
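
Putting those together, a complete round trip would look roughly like this (the
output name for the C binary is my choice; adjust paths as needed):

mpijavac TestMpiRmaCompareAndSwap.java
mpirun -np 2 java TestMpiRmaCompareAndSwap

mpicc compare_and_swap.c -o compare_and_swap
mpirun -np 2 ./compare_and_swap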


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hey,


it seams to be a problem of ompi 2.x. Also the c-version 2.0.1 returns produces this output:

(the same bulid by sources or the release 2.0.1)


[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] *** and potentially your MPI job)


But the test works for 1.8.x! In fact our cluster does not have shared-memory - so it has to use the wrapper to default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:

Hello Gundram,


It looks like the test that is failing is TestMpiRmaCompareAndSwap.java​. Is that the one that is crashing? If so, could you try to run the C test from:


http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1


There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).


This should let us know whether its an issue related to Java or not.


If it is another test, let me know and Ill see if I can get you the C version (most or all of the Java tests are translations from the C test).


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org><mailto:users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hello,

I still have the same errors on our cluster - even one more. Maybe the new one helps us to find a solution.

I have this error if I run "make_onesided" of the ompi-java-test repo.

CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] *** and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Sometimes I also have the SIGSEGV error.

System:

compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org<mailto:***@kbuilder.dev.centos.org>) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP

Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016

inifiband

openib: OpenSM 3.3.19


limits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilley, Howard,

I configured without disable dlopen - same error.

I test these classes on another cluster and: IT WORKS!

So it is a problem of the cluster configuration. Thank you all very much for all your help! When the admin can solve the problem, i will let you know, what he had changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
the JVM sets its own signal handlers, and it is important openmpi dones not override them.
this is what previously happened with PSM (infinipath) but this has been solved since.
you might be linking with a third party library that hijacks signal handlers and cause the crash
(which would explain why I cannot reproduce the issue)

the master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
You made the best of it... thanks a lot!

Whithout MPI it runs.
Just adding MPI.init() causes the crash!

maybe I installed something wrong...

install newest automake, autoconf, m4, libtoolize in right order and same prefix
check out ompi,
autogen
configure with same prefix, pointing to the same jdk, I later use
make
make install

I will test some different configurations of ./configure...


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar'
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-op, so it is really weird your program is still crashing. an other test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
In any cases the same error.
this is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


also for 1 or two cores, the process crashes.


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

an other test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3


core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256564
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:

Gundram,


can you please provide more information on your environment :

- configure command line

- OS

- memory available

- ulimit -a

- number of nodes

- number of tasks used

- interconnect used (if any)

- batch manager (if any)


Cheers,


Gilles

On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried you code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory because when I delete the hash-call, the program runs sometimes more than 9k iterations.
When it crashes, there are different lines (see (2) and (3)). The crashes also occurs on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V [libjvm.so+0x4183bf] ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more idea?

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:

Gundram,


fwiw, i cannot reproduce the issue on my box

- centos 7

- java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


i noticed on non zero rank saveMem is allocated at each iteration.
ideally, the garbage collector can take care of that and this should not be an issue.

would you mind giving the attached file a try ?

Cheers,

Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today

how did you configure OpenMPI ?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Giles,

thank you for your hints! I did 3 changes, unfortunately the same error occures:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org><mailto:***@open-mpi.org>
Date: Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete hashcode-lines.

Now I get this error message - to 100%, after different number of iterations (15-300):

0/ 3:length = 100000000
0/ 3:bcast length done (length = 100000000)
1/ 3:bcast length done (length = 100000000)
2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code: (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know if it is a problem of java or ompi - but the last years, java worked with no problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note a race condition in MPI_Init has been fixed yesterday in the master.
can you please update your OpenMPI and try again ?

hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program.
the crash occurs in hashcode, and this makes little sense to me. can you also update your jdk ?

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Jason,

thanks for your response! I thing it is another problem. I try to send 100MB bytes. So there are not many tries (between 10 and 30). I realized that the execution of this code can result 3 different errors:

1. most often the posted error message occures.

2. in <10% the cases i have a live lock. I can see 3 java-processes, one with 200% and two with 100% processor utilization. After ~15 minutes without new system outputs this error occurs.


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code: (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


3. in <10% the cases i have a dead lock while MPI.init. This stays for more than 15 minutes without returning with an error message...

Can I enable some debug-flags to see what happens on C / OpenMPI side?

Thanks in advance for your help!
Gundram Leifert


On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help: https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV." does you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, then it might relate to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu<mailto:***@wisc.edu>
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello,

I try to send many byte-arrays via broadcast. After a specific number of repetitions the process either hangs up or returns with a SIGSEGV. Does any one can help me solving the problem:

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

public static void log(String msg) {
try {
System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
} catch (MPIException ex) {
System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
}
}

private static int hashcode(byte[] bytearray) {
if (bytearray == null) {
return 0;
}
int hash = 39;
for (int i = 0; i < bytearray.length; i++) {
byte b = bytearray[i];
hash = hash * 7 + (int) b;
}
return hash;
}

public static void main(String args[]) throws MPIException {
log("start main");
MPI.Init(args);
try {
log("initialized done");
byte[] saveMem = new byte[100000000];
MPI.COMM_WORLD.barrier();
Random r = new Random();
r.nextBytes(saveMem);
if (MPI.COMM_WORLD.getRank() == 0) {
for (int i = 0; i < 1000; i++) {
saveMem[r.nextInt(saveMem.length)]++;
log("i = " + i);
int[] lengthData = new int[]{saveMem.length};
log("object hash = " + hashcode(saveMem));
log("length = " + lengthData[0]);
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT<http://MPI.INT>, 0);
log("bcast length done (length = " + lengthData[0] + ")");
MPI.COMM_WORLD.barrier();
MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
}
MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT<http://MPI.INT>, 0);
} else {
while (true) {
int[] lengthData = new int[1];
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT<http://MPI.INT>, 0);
log("bcast length done (length = " + lengthData[0] + ")");
if (lengthData[0] == 0) {
break;
}
MPI.COMM_WORLD.barrier();
saveMem = new byte[lengthData[0]];
MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
log("object hash = " + hashcode(saveMem));
}
}
MPI.COMM_WORLD.barrier();
} catch (MPIException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} catch (RuntimeException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} finally {
MPI.Finalize();
}

}

}


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# #
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code: (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code: (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########CONFIGURATION:
I used the ompi master sources from github:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp<mailto:***@rist.or.jp>>
Date: Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram

_______________________________________________
users mailing list
***@open-mpi.org<mailto:***@open-mpi.org>
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29584.php




_______________________________________________
users mailing list
***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Gundram Leifert
2016-09-14 10:02:26 UTC
Permalink
In short: yes, we compiled with mpijavac and mpicc and ran with mpirun -np 2.


In detail, we tested the following setups:


a) without Java, with ompi 2.0.1: the C test

[***@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1

[***@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now
abort,
[titan01:18460] *** and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages

b) without Java, with ompi 1.8.8: the C test

[***@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/1.8.8

[***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
No Errors
[***@titan01 mpi_test2]$

c) with Java: ompi 1.8.8 with the JDK and the Java test suite

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
win.compareAndSwap(next, iBuffer, result,
MPI.INT, rank, 0);
^
symbol: method
compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these java methods are not supported in 1.8.8

d) with Java: ompi 2.0.1 with the JDK and the Java test suite

[***@titan01 ~]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1 2) java/jdk1.8.0_102

[***@titan01 ~]$ cd ompi-java-test/
[***@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[***@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile

[***@titan01 ompi-java-test]$ cd onesided/
[***@titan01 onesided]$ ./make_onesided &> result
cat result:
<crop.....>

=========================== CReqops ===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now
abort,
[titan01:32155] *** and potentially your MPI job)

<...crop....>

=========================== TestMpiRmaCompareAndSwap
===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now
abort,
[titan01:32703] *** and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages


< ... end crop>


It also fails if we start it this way:

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
OmpitestError.java OmpitestProgress.java OmpitestConfig.java

[***@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap

[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now
abort,
[titan01:22877] *** and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages



On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:
>
> Since you are getting the same errors with C as you are with Java,
> this is an issue with C, not the Java bindings. However, in the most
> recent output, you are using ./a.out to run the test. Did you use
> mpirun to run the test in Java or C?
>
>
> The command should be something along the lines of:
>
>
> mpirun -np 2 ​java TestMpiRmaCompareAndSwap
>
>
> mpirun -np 2 ./a.out
>
>
> Also, are you compiling with the ompi wrappers? Should be:
>
>
> mpijavac TestMpiRmaCompareAndSwap.java
>
>
> ​mpicc compare_and_swap.c
>
>
> In the mean time, I will try to reproduce this on a similar system.
>
>
> -Nathan
>
>
>
> --
> Nathaniel Graham
> HPC-DES
> Los Alamos National Laboratory
> ------------------------------------------------------------------------
> *From:* users <users-***@lists.open-mpi.org> on behalf of Gundram
> Leifert <***@uni-rostock.de>
> *Sent:* Tuesday, September 13, 2016 12:46 AM
> *To:* ***@lists.open-mpi.org
> *Subject:* Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
>
> Hey,
>
>
> it seems to be a problem of ompi 2.x. The C version with 2.0.1 also
> produces this output:
>
> (the same bulid by sources or the release 2.0.1)
>
>
> [***@node108 mpi_test]$ ./a.out
> [node108:2949] *** An error occurred in MPI_Compare_and_swap
> [node108:2949] *** reported by process [1649420396,0]
> [node108:2949] *** on win rdma window 3
> [node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
> [node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will
> now abort,
> [node108:2949] *** and potentially your MPI job)
>
> But the test works for 1.8.x! In fact our cluster does not have
> shared-memory - so it has to use the wrapper to default methods.
>
> Gundram
>
> On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:
>>
>> Hello Gundram,
>>
>>
>> It looks like the test that is failing is
>> TestMpiRmaCompareAndSwap.java​. Is that the one that is crashing?
>> If so, could you try to run the C test from:
>>
>>
>> http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1
>>
>>
>> There are a couple of header files you will need for that test, but
>> they are in the same repo as the test (up a few folders and in an
>> include folder).
>>
>>
>> This should let us know whether its an issue related to Java or not.
>>
>>
>> If it is another test, let me know and Ill see if I can get you the C
>> version (most or all of the Java tests are translations from the C test).
>>
>>
>> -Nathan
>>
>>
>>
>> --
>> Nathaniel Graham
>> HPC-DES
>> Los Alamos National Laboratory
>> ------------------------------------------------------------------------
>> *From:* users <users-***@lists.open-mpi.org> on behalf of Gundram
>> Leifert <***@uni-rostock.de>
>> *Sent:* Wednesday, September 7, 2016 9:23 AM
>> *To:* ***@lists.open-mpi.org
>> *Subject:* Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
>>
>> Hello,
>>
>> I still have the same errors on our cluster - even one more. Maybe
>> the new one helps us to find a solution.
>>
>> I have this error if I run "make_onesided" of the ompi-java-test repo.
>>
>> CReqops and TestMpiRmaCompareAndSwap report (pretty
>> deterministically - in all my 30 runs) this error:
>>
>> [titan01:5134] *** An error occurred in MPI_Compare_and_swap
>> [titan01:5134] *** reported by process [2392850433,1]
>> [titan01:5134] *** on win rdma window 3
>> [titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:5134] *** and potentially your MPI job)
>> [titan01.service:05128] 1 more process has sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal
>> [titan01.service:05128] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>>
>> Sometimes I also have the SIGSEGV error.
>>
>> System:
>>
>> compiler: gcc/5.2.0
>> java: jdk1.8.0_102
>> kernelmodule: mlx4_core mlx4_en mlx4_ib
>> Linux version 3.10.0-327.13.1.el7.x86_64
>> (***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red
>> Hat 4.8.3-9) (GCC) ) #1 SMP
>>
>> Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo
>> rev: v2.0.0-257-gee86e07, Sep 02, 2016
>>
>> inifiband
>>
>> openib: OpenSM 3.3.19
>>
>>
>> limits:
>>
>> ulimit -a
>> core file size (blocks, -c) 0
>> data seg size (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size (blocks, -f) unlimited
>> pending signals (-i) 256554
>> max locked memory (kbytes, -l) unlimited
>> max memory size (kbytes, -m) unlimited
>> open files (-n) 100000
>> pipe size (512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority (-r) 0
>> stack size (kbytes, -s) unlimited
>> cpu time (seconds, -t) unlimited
>> max user processes (-u) 4096
>> virtual memory (kbytes, -v) unlimited
>> file locks (-x) unlimited
>>
>>
>> Thanks, Gundram
>> On 07/12/2016 11:08 AM, Gundram Leifert wrote:
>>> Hello Gilley, Howard,
>>>
>>> I configured without disable dlopen - same error.
>>>
>>> I test these classes on another cluster and: IT WORKS!
>>>
>>> So it is a problem of the cluster configuration. Thank you all very
>>> much for all your help! When the admin can solve the problem, i will
>>> let you know, what he had changed.
>>>
>>> Cheers Gundram
>>>
>>> On 07/08/2016 04:19 PM, Howard Pritchard wrote:
>>>> Hi Gundram
>>>>
>>>> Could you configure without the disable dlopen option and retry?
>>>>
>>>> Howard
>>>>
>>>> Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
>>>>
>>>> the JVM sets its own signal handlers, and it is important
>>>> openmpi dones not override them.
>>>> this is what previously happened with PSM (infinipath) but this
>>>> has been solved since.
>>>> you might be linking with a third party library that hijacks
>>>> signal handlers and cause the crash
>>>> (which would explain why I cannot reproduce the issue)
>>>>
>>>> the master branch has a revamped memory patcher (compared to
>>>> v2.x or v1.10), and that could have some bad interactions with
>>>> the JVM, so you might also give v2.x a try
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Friday, July 8, 2016, Gundram Leifert
>>>> <***@uni-rostock.de> wrote:
>>>>
>>>> You made the best of it... thanks a lot!
>>>>
>>>> Whithout MPI it runs.
>>>> Just adding MPI.init() causes the crash!
>>>>
>>>> maybe I installed something wrong...
>>>>
>>>> install newest automake, autoconf, m4, libtoolize in right
>>>> order and same prefix
>>>> check out ompi,
>>>> autogen
>>>> configure with same prefix, pointing to the same jdk, I
>>>> later use
>>>> make
>>>> make install
>>>>
>>>> I will test some different configurations of ./configure...
>>>>
>>>>
>>>> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>>>>> I am running out of ideas ...
>>>>>
>>>>> what if you do not run within slurm ?
>>>>> what if you do not use '-cp executor.jar'
>>>>> or what if you configure without --disable-dlopen
>>>>> --disable-mca-dso ?
>>>>>
>>>>> if you
>>>>> mpirun -np 1 ...
>>>>> then MPI_Bcast and MPI_Barrier are basically no-op, so it
>>>>> is really weird your program is still crashing. an other
>>>>> test is to comment out MPI_Bcast and MPI_Barrier and try
>>>>> again with -np 1
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>> <***@uni-rostock.de> wrote:
>>>>>
>>>>> In any cases the same error.
>>>>> this is my code:
>>>>>
>>>>> salloc -n 3
>>>>> export IPATH_NO_BACKTRACE
>>>>> ulimit -s 10240
>>>>> mpirun -np 3 java -cp executor.jar
>>>>> de.uros.citlab.executor.test.TestSendBigFiles2
>>>>>
>>>>>
>>>>> also for 1 or two cores, the process crashes.
>>>>>
>>>>>
>>>>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>>>>> you can try
>>>>>> export IPATH_NO_BACKTRACE
>>>>>> before invoking mpirun (that should not be needed though)
>>>>>>
>>>>>> an other test is to
>>>>>> ulimit -s 10240
>>>>>> before invoking mpirun.
>>>>>>
>>>>>> btw, do you use mpirun or srun ?
>>>>>>
>>>>>> can you reproduce the crash with 1 or 2 tasks ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>>> <***@uni-rostock.de> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> configure:
>>>>>> ./configure --enable-mpi-java
>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>
>>>>>>
>>>>>> 1 node with 3 cores. I use SLURM to allocate one
>>>>>> node. I changed --mem, but it has no effect.
>>>>>> salloc -n 3
>>>>>>
>>>>>>
>>>>>> core file size (blocks, -c) 0
>>>>>> data seg size (kbytes, -d) unlimited
>>>>>> scheduling priority (-e) 0
>>>>>> file size (blocks, -f) unlimited
>>>>>> pending signals (-i) 256564
>>>>>> max locked memory (kbytes, -l) unlimited
>>>>>> max memory size (kbytes, -m) unlimited
>>>>>> open files (-n) 100000
>>>>>> pipe size (512 bytes, -p) 8
>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>> real-time priority (-r) 0
>>>>>> stack size (kbytes, -s) unlimited
>>>>>> cpu time (seconds, -t) unlimited
>>>>>> max user processes (-u) 4096
>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>> file locks (-x) unlimited
>>>>>>
>>>>>> uname -a
>>>>>> Linux titan01.service 3.10.0-327.13.1.el7.x86_64
>>>>>> #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64
>>>>>> x86_64 GNU/Linux
>>>>>>
>>>>>> cat /etc/system-release
>>>>>> CentOS Linux release 7.2.1511 (Core)
>>>>>>
>>>>>> what else do you need?
>>>>>>
>>>>>> Cheers, Gundram
>>>>>>
>>>>>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>>>>>
>>>>>>> Gundram,
>>>>>>>
>>>>>>>
>>>>>>> can you please provide more information on your
>>>>>>> environment :
>>>>>>>
>>>>>>> - configure command line
>>>>>>>
>>>>>>> - OS
>>>>>>>
>>>>>>> - memory available
>>>>>>>
>>>>>>> - ulimit -a
>>>>>>>
>>>>>>> - number of nodes
>>>>>>>
>>>>>>> - number of tasks used
>>>>>>>
>>>>>>> - interconnect used (if any)
>>>>>>>
>>>>>>> - batch manager (if any)
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>>>>>> Hello Gilles,
>>>>>>>>
>>>>>>>> I tried you code and it crashes after 3-15
>>>>>>>> iterations (see (1)). It is always the same
>>>>>>>> error (only the "94" varies).
>>>>>>>>
>>>>>>>> Meanwhile I think Java and MPI use the same
>>>>>>>> memory because when I delete the hash-call, the
>>>>>>>> program runs sometimes more than 9k iterations.
>>>>>>>> When it crashes, there are different lines (see
>>>>>>>> (2) and (3)). The crashes also occurs on rank 0.
>>>>>>>>
>>>>>>>> ##### (1)#####
>>>>>>>> # Problematic frame:
>>>>>>>> # J 94 C2
>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>>>>>>> (42 bytes) @ 0x00002b03242dc9c4
>>>>>>>> [0x00002b03242dc860+0x164]
>>>>>>>>
>>>>>>>> #####(2)#####
>>>>>>>> # Problematic frame:
>>>>>>>> # V [libjvm.so+0x68d0f6]
>>>>>>>> JavaCallWrapper::JavaCallWrapper(methodHandle,
>>>>>>>> Handle, JavaValue*, Thread*)+0xb6
>>>>>>>>
>>>>>>>> #####(3)#####
>>>>>>>> # Problematic frame:
>>>>>>>> # V [libjvm.so+0x4183bf]
>>>>>>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>>>>>>
>>>>>>>> Any more idea?
>>>>>>>>
>>>>>>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
>>>>>>>>>
>>>>>>>>> Gundram,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> fwiw, i cannot reproduce the issue on my box
>>>>>>>>>
>>>>>>>>> - centos 7
>>>>>>>>>
>>>>>>>>> - java version "1.8.0_71"
>>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>>> 1.8.0_71-b15)
>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM (build
>>>>>>>>> 25.71-b15, mixed mode)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> i noticed on non zero rank saveMem is
>>>>>>>>> allocated at each iteration.
>>>>>>>>> ideally, the garbage collector can take care
>>>>>>>>> of that and this should not be an issue.
>>>>>>>>>
>>>>>>>>> would you mind giving the attached file a try ?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>>>>>>> I will have a look at it today
>>>>>>>>>>
>>>>>>>>>> how did you configure OpenMPI ?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Gilles
>>>>>>>>>>
>>>>>>>>>> On Thursday, July 7, 2016, Gundram Leifert
>>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello Giles,
>>>>>>>>>>
>>>>>>>>>> thank you for your hints! I did 3
>>>>>>>>>> changes, unfortunately the same error
>>>>>>>>>> occures:
>>>>>>>>>>
>>>>>>>>>> update ompi:
>>>>>>>>>> commit
>>>>>>>>>> ae8444682f0a7aa158caea08800542ce9874455e
>>>>>>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>>>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>>>>>>
>>>>>>>>>> update java:
>>>>>>>>>> java version "1.8.0_92"
>>>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>>>> 1.8.0_92-b14)
>>>>>>>>>> Java HotSpot(TM) Server VM (build
>>>>>>>>>> 25.92-b14, mixed mode)
>>>>>>>>>>
>>>>>>>>>> delete hashcode-lines.
>>>>>>>>>>
>>>>>>>>>> Now I get this error message - to 100%,
>>>>>>>>>> after different number of iterations
>>>>>>>>>> (15-300):
>>>>>>>>>>
>>>>>>>>>> 0/ 3:length = 100000000
>>>>>>>>>> 0/ 3:bcast length done (length = 100000000)
>>>>>>>>>> 1/ 3:bcast length done (length = 100000000)
>>>>>>>>>> 2/ 3:bcast length done (length = 100000000)
>>>>>>>>>> #
>>>>>>>>>> # A fatal error has been detected by the
>>>>>>>>>> Java Runtime Environment:
>>>>>>>>>> #
>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>> pc=0x00002b3d022fcd24, pid=16578,
>>>>>>>>>> tid=0x00002b3d29716700
>>>>>>>>>> #
>>>>>>>>>> # JRE version: Java(TM) SE Runtime
>>>>>>>>>> Environment (8.0_92-b14) (build 1.8.0_92-b14)
>>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server
>>>>>>>>>> VM (25.92-b14 mixed mode linux-amd64
>>>>>>>>>> compressed oops)
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # V [libjvm.so+0x414d24]
>>>>>>>>>> ciEnv::get_field_by_index(ciInstanceKlass*,
>>>>>>>>>> int)+0x94
>>>>>>>>>> #
>>>>>>>>>> # Failed to write core dump. Core dumps
>>>>>>>>>> have been disabled. To enable core
>>>>>>>>>> dumping, try "ulimit -c unlimited" before
>>>>>>>>>> starting Java again
>>>>>>>>>> #
>>>>>>>>>> # An error report file with more
>>>>>>>>>> information is saved as:
>>>>>>>>>> #
>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>>>>>>> #
>>>>>>>>>> # Compiler replay data is saved as:
>>>>>>>>>> #
>>>>>>>>>> /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>>>>>>> #
>>>>>>>>>> # If you would like to submit a bug
>>>>>>>>>> report, please visit:
>>>>>>>>>> #
>>>>>>>>>> http://bugreport.java.com/bugreport/crash.jsp
>>>>>>>>>> #
>>>>>>>>>> [titan01:16578] *** Process received
>>>>>>>>>> signal ***
>>>>>>>>>> [titan01:16578] Signal: Aborted (6)
>>>>>>>>>> [titan01:16578] Signal code: (-6)
>>>>>>>>>> [titan01:16578] [ 0]
>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>>>>>>> [titan01:16578] [ 1]
>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>>>>>>> [titan01:16578] [ 2]
>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>>>>>>> [titan01:16578] [ 3]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>>>>>>> [titan01:16578] [ 4]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>>>>>>> [titan01:16578] [ 5]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>>>>>>> [titan01:16578] [ 6]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>>>>>>> [titan01:16578] [ 7]
>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>>>>>>> [titan01:16578] [ 8]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>>>>>>> [titan01:16578] [ 9]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>>>>>>> [titan01:16578] [10]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>>>>>>> [titan01:16578] [11]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>>>>>>> [titan01:16578] [12]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>>>> [titan01:16578] [13]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>>>> [titan01:16578] [14]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>>>> [titan01:16578] [15]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>>>> [titan01:16578] [16]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>>>> [titan01:16578] [17]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>>>> [titan01:16578] [18]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>>>> [titan01:16578] [19]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>>>> [titan01:16578] [20]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>>>> [titan01:16578] [21]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>>>> [titan01:16578] [22]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>>>>>>> [titan01:16578] [23]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>>>>>>> [titan01:16578] [24]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>>>>>>> [titan01:16578] [25]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>>>>>>> [titan01:16578] [26]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>>>>>>> [titan01:16578] [27]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>>>>>>> [titan01:16578] [28]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>>>>>>> [titan01:16578] [29]
>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>>>>>>> [titan01:16578] *** End of error message ***
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> Primary job terminated normally, but 1
>>>>>>>>>> process returned
>>>>>>>>>> a non-zero exit code. Per user-direction,
>>>>>>>>>> the job has been aborted.
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun noticed that process rank 2 with
>>>>>>>>>> PID 0 on node titan01 exited on signal 6
>>>>>>>>>> (Aborted).
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> I don't know if it is a problem of java
>>>>>>>>>> or ompi - but the last years, java worked
>>>>>>>>>> with no problems on my machine...
>>>>>>>>>>
>>>>>>>>>> Thank you for your tips in advance!
>>>>>>>>>> Gundram
>>>>>>>>>>
>>>>>>>>>> On 07/06/2016 03:10 PM, Gilles
>>>>>>>>>> Gouaillardet wrote:
>>>>>>>>>>> Note a race condition in MPI_Init has
>>>>>>>>>>> been fixed yesterday in the master.
>>>>>>>>>>> can you please update your OpenMPI and
>>>>>>>>>>> try again ?
>>>>>>>>>>>
>>>>>>>>>>> hopefully the hang will disappear.
>>>>>>>>>>>
>>>>>>>>>>> Can you reproduce the crash with a
>>>>>>>>>>> simpler (and ideally deterministic)
>>>>>>>>>>> version of your program.
>>>>>>>>>>> the crash occurs in hashcode, and this
>>>>>>>>>>> makes little sense to me. can you also
>>>>>>>>>>> update your jdk ?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Gilles
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, July 6, 2016, Gundram
>>>>>>>>>>> Leifert <***@uni-rostock.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>
>>>>>>>>>>> thanks for your response! I thing it
>>>>>>>>>>> is another problem. I try to send
>>>>>>>>>>> 100MB bytes. So there are not many
>>>>>>>>>>> tries (between 10 and 30). I
>>>>>>>>>>> realized that the execution of this
>>>>>>>>>>> code can result 3 different errors:
>>>>>>>>>>>
>>>>>>>>>>> 1. most often the posted error
>>>>>>>>>>> message occures.
>>>>>>>>>>>
>>>>>>>>>>> 2. in <10% the cases i have a live
>>>>>>>>>>> lock. I can see 3 java-processes,
>>>>>>>>>>> one with 200% and two with 100%
>>>>>>>>>>> processor utilization. After ~15
>>>>>>>>>>> minutes without new system outputs
>>>>>>>>>>> this error occurs.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [thread 47499823949568 also had an
>>>>>>>>>>> error]
>>>>>>>>>>> # A fatal error has been detected by
>>>>>>>>>>> the Java Runtime Environment:
>>>>>>>>>>> #
>>>>>>>>>>> # Internal Error
>>>>>>>>>>> (safepoint.cpp:317), pid=24256,
>>>>>>>>>>> tid=47500347131648
>>>>>>>>>>> # guarantee(PageArmed == 0) failed:
>>>>>>>>>>> invariant
>>>>>>>>>>> #
>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>>>> Server VM (23.25-b01 mixed mode
>>>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>>>> dumps have been disabled. To enable
>>>>>>>>>>> core dumping, try "ulimit -c
>>>>>>>>>>> unlimited" before starting Java again
>>>>>>>>>>> #
>>>>>>>>>>> # An error report file with more
>>>>>>>>>>> information is saved as:
>>>>>>>>>>> #
>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>>>>>>> #
>>>>>>>>>>> # If you would like to submit a bug
>>>>>>>>>>> report, please visit:
>>>>>>>>>>> #
>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>> #
>>>>>>>>>>> [titan01:24256] *** Process received
>>>>>>>>>>> signal ***
>>>>>>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>>>>>>> [titan01:24256] Signal code: (-6)
>>>>>>>>>>> [titan01:24256] [ 0]
>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>>>>>>> [titan01:24256] [ 1]
>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>>>>>>> [titan01:24256] [ 2]
>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>>>>>>> [titan01:24256] [ 3]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>>>>>>> [titan01:24256] [ 4]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>>>>>>> [titan01:24256] [ 5]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>>>>>>> [titan01:24256] [ 6]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>>>>>>> [titan01:24256] [ 7]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>>>>>>> [titan01:24256] [ 8]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>>>>>>> [titan01:24256] [ 9]
>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>>>>>>> [titan01:24256] [10]
>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>>>>>>> [titan01:24256] [11]
>>>>>>>>>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>>>>>>> [titan01:24256] *** End of error
>>>>>>>>>>> message ***
>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>> Primary job terminated normally, but
>>>>>>>>>>> 1 process returned
>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>> user-direction, the job has been
>>>>>>>>>>> aborted.
>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpirun noticed that process rank 0
>>>>>>>>>>> with PID 0 on node titan01 exited on
>>>>>>>>>>> signal 6 (Aborted).
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 3. in <10% the cases i have a dead
>>>>>>>>>>> lock while MPI.init. This stays for
>>>>>>>>>>> more than 15 minutes without
>>>>>>>>>>> returning with an error message...
>>>>>>>>>>>
>>>>>>>>>>> Can I enable some debug-flags to see
>>>>>>>>>>> what happens on C / OpenMPI side?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance for your help!
>>>>>>>>>>> Gundram Leifert
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 07/05/2016 06:05 PM, Jason
>>>>>>>>>>> Maldonis wrote:
>>>>>>>>>>>> After reading your thread looks
>>>>>>>>>>>> like it may be related to an issue
>>>>>>>>>>>> I had a few weeks ago (I'm a novice
>>>>>>>>>>>> though). Maybe my thread will be of
>>>>>>>>>>>> help:
>>>>>>>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> When you say "After a specific
>>>>>>>>>>>> number of repetitions the process
>>>>>>>>>>>> either hangs up or returns with a
>>>>>>>>>>>> SIGSEGV." does you mean that a
>>>>>>>>>>>> single call hangs, or that at some
>>>>>>>>>>>> point during the for loop a call
>>>>>>>>>>>> hangs? If you mean the latter, then
>>>>>>>>>>>> it might relate to my issue.
>>>>>>>>>>>> Otherwise my thread probably won't
>>>>>>>>>>>> be helpful.
>>>>>>>>>>>>
>>>>>>>>>>>> Jason Maldonis
>>>>>>>>>>>> Research Assistant of Professor
>>>>>>>>>>>> Paul Voyles
>>>>>>>>>>>> Materials Science Grad Student
>>>>>>>>>>>> University of Wisconsin, Madison
>>>>>>>>>>>> 1509 University Ave, Rm M142
>>>>>>>>>>>> Madison, WI 53706
>>>>>>>>>>>> ***@wisc.edu
>>>>>>>>>>>> 608-295-5532
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 5, 2016 at 9:58 AM,
>>>>>>>>>>>> Gundram Leifert
>>>>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> I try to send many byte-arrays
>>>>>>>>>>>> via broadcast. After a specific
>>>>>>>>>>>> number of repetitions the
>>>>>>>>>>>> process either hangs up or
>>>>>>>>>>>> returns with a SIGSEGV. Does
>>>>>>>>>>>> any one can help me solving the
>>>>>>>>>>>> problem:
>>>>>>>>>>>>
>>>>>>>>>>>> ########## The code:
>>>>>>>>>>>>
>>>>>>>>>>>> import java.util.Random;
>>>>>>>>>>>> import mpi.*;
>>>>>>>>>>>>
>>>>>>>>>>>> public class TestSendBigFiles {
>>>>>>>>>>>>
>>>>>>>>>>>> public static void
>>>>>>>>>>>> log(String msg) {
>>>>>>>>>>>> try {
>>>>>>>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>>>>>>>> MPI.COMM_WORLD.getRank(),
>>>>>>>>>>>> MPI.COMM_WORLD.getSize(), msg));
>>>>>>>>>>>> } catch (MPIException ex) {
>>>>>>>>>>>> System.err.println(String.format("%2s/%2s:%s",
>>>>>>>>>>>> "?", "?", msg));
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> private static int
>>>>>>>>>>>> hashcode(byte[] bytearray) {
>>>>>>>>>>>> if (bytearray == null) {
>>>>>>>>>>>> return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>> int hash = 39;
>>>>>>>>>>>> for (int i = 0; i <
>>>>>>>>>>>> bytearray.length; i++) {
>>>>>>>>>>>> byte b = bytearray[i];
>>>>>>>>>>>> hash = hash * 7 + (int) b;
>>>>>>>>>>>> }
>>>>>>>>>>>> return hash;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> public static void
>>>>>>>>>>>> main(String args[]) throws
>>>>>>>>>>>> MPIException {
>>>>>>>>>>>> log("start main");
>>>>>>>>>>>> MPI.Init(args);
>>>>>>>>>>>> try {
>>>>>>>>>>>> log("initialized done");
>>>>>>>>>>>> byte[] saveMem = new
>>>>>>>>>>>> byte[100000000];
>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>> Random r = new Random();
>>>>>>>>>>>> r.nextBytes(saveMem);
>>>>>>>>>>>> if
>>>>>>>>>>>> (MPI.COMM_WORLD.getRank() == 0) {
>>>>>>>>>>>> for (int i = 0; i < 1000; i++) {
>>>>>>>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>>>>>>>> log("i = " + i);
>>>>>>>>>>>> int[] lengthData = new
>>>>>>>>>>>> int[]{saveMem.length};
>>>>>>>>>>>> log("object hash = " +
>>>>>>>>>>>> hashcode(saveMem));
>>>>>>>>>>>> log("length = " + lengthData[0]);
>>>>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>>>>> log("bcast length done (length
>>>>>>>>>>>> = " + lengthData[0] + ")");
>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>>>>> lengthData[0], MPI.BYTE, 0);
>>>>>>>>>>>> log("bcast data done");
>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>> }
>>>>>>>>>>>> MPI.COMM_WORLD.bcast(new
>>>>>>>>>>>> int[]{0}, 1, MPI.INT
>>>>>>>>>>>> <http://MPI.INT>, 0);
>>>>>>>>>>>> } else {
>>>>>>>>>>>> while (true) {
>>>>>>>>>>>> int[] lengthData = new
>>>>>>>>>>>> int[1];
>>>>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>>>>> log("bcast length done (length
>>>>>>>>>>>> = " + lengthData[0] + ")");
>>>>>>>>>>>> if (lengthData[0] == 0) {
>>>>>>>>>>>> break;
>>>>>>>>>>>> }
>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>> saveMem = new
>>>>>>>>>>>> byte[lengthData[0]];
>>>>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>>>>> saveMem.length, MPI.BYTE, 0);
>>>>>>>>>>>> log("bcast data done");
>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>> log("object hash = " +
>>>>>>>>>>>> hashcode(saveMem));
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>> } catch (MPIException ex) {
>>>>>>>>>>>> System.out.println("caugth
>>>>>>>>>>>> error." + ex);
>>>>>>>>>>>> log(ex.getMessage());
>>>>>>>>>>>> } catch
>>>>>>>>>>>> (RuntimeException ex) {
>>>>>>>>>>>> System.out.println("caugth
>>>>>>>>>>>> error." + ex);
>>>>>>>>>>>> log(ex.getMessage());
>>>>>>>>>>>> } finally {
>>>>>>>>>>>> MPI.Finalize();
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ############ The Error (if it
>>>>>>>>>>>> does not just hang up):
>>>>>>>>>>>>
>>>>>>>>>>>> #
>>>>>>>>>>>> # A fatal error has been
>>>>>>>>>>>> detected by the Java Runtime
>>>>>>>>>>>> Environment:
>>>>>>>>>>>> #
>>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>>> pc=0x00002b7e9c86e3a1,
>>>>>>>>>>>> pid=1172, tid=47822674495232
>>>>>>>>>>>> #
>>>>>>>>>>>> #
>>>>>>>>>>>> # A fatal error has been
>>>>>>>>>>>> detected by the Java Runtime
>>>>>>>>>>>> Environment:
>>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>>>> 64-Bit Server VM (23.25-b01
>>>>>>>>>>>> mixed mode linux-amd64
>>>>>>>>>>>> compressed oops)
>>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>>> # #
>>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>>> pc=0x00002af69c0693a1,
>>>>>>>>>>>> pid=1173, tid=47238546896640
>>>>>>>>>>>> #
>>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>>> J
>>>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>>>> #
>>>>>>>>>>>> # Failed to write core dump.
>>>>>>>>>>>> Core dumps have been disabled.
>>>>>>>>>>>> To enable core dumping, try
>>>>>>>>>>>> "ulimit -c unlimited" before
>>>>>>>>>>>> starting Java again
>>>>>>>>>>>> #
>>>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>>>> 64-Bit Server VM (23.25-b01
>>>>>>>>>>>> mixed mode linux-amd64
>>>>>>>>>>>> compressed oops)
>>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>>> # J
>>>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>>>> #
>>>>>>>>>>>> # Failed to write core dump.
>>>>>>>>>>>> Core dumps have been disabled.
>>>>>>>>>>>> To enable core dumping, try
>>>>>>>>>>>> "ulimit -c unlimited" before
>>>>>>>>>>>> starting Java again
>>>>>>>>>>>> #
>>>>>>>>>>>> # An error report file with
>>>>>>>>>>>> more information is saved as:
>>>>>>>>>>>> #
>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>>>>>>> # An error report file with
>>>>>>>>>>>> more information is saved as:
>>>>>>>>>>>> #
>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>>>>>>> #
>>>>>>>>>>>> # If you would like to submit a
>>>>>>>>>>>> bug report, please visit:
>>>>>>>>>>>> #
>>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>>> #
>>>>>>>>>>>> #
>>>>>>>>>>>> # If you would like to submit a
>>>>>>>>>>>> bug report, please visit:
>>>>>>>>>>>> #
>>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>>> #
>>>>>>>>>>>> [titan01:01172] *** Process
>>>>>>>>>>>> received signal ***
>>>>>>>>>>>> [titan01:01172] Signal: Aborted (6)
>>>>>>>>>>>> [titan01:01172] Signal code: (-6)
>>>>>>>>>>>> [titan01:01173] *** Process
>>>>>>>>>>>> received signal ***
>>>>>>>>>>>> [titan01:01173] Signal: Aborted (6)
>>>>>>>>>>>> [titan01:01173] Signal code: (-6)
>>>>>>>>>>>> [titan01:01172] [ 0]
>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>>>>>>> [titan01:01172] [ 1]
>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>>>>>>> [titan01:01172] [ 2]
>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>>>>>>> [titan01:01172] [ 3]
>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>>>>>>> [titan01:01172] [ 4]
>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>>>>>>> [titan01:01172] [ 5]
>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>>>>>>> [titan01:01172] [ 6]
>>>>>>>>>>>> [titan01:01173] [ 0]
>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>>>>>>> [titan01:01173] [ 1]
>>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>>>>>>> [titan01:01172] [ 7]
>>>>>>>>>>>> [0x2b7e9c86e3a1]
>>>>>>>>>>>> [titan01:01172] *** End of
>>>>>>>>>>>> error message ***
>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>>>>>>> [titan01:01173] [ 2]
>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>>>>>>> [titan01:01173] [ 3]
>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>>>>>>> [titan01:01173] [ 4]
>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>>>>>>> [titan01:01173] [ 5]
>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>>>>>>> [titan01:01173] [ 6]
>>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>>>>>>> [titan01:01173] [ 7]
>>>>>>>>>>>> [0x2af69c0693a1]
>>>>>>>>>>>> [titan01:01173] *** End of
>>>>>>>>>>>> error message ***
>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>> Primary job terminated
>>>>>>>>>>>> normally, but 1 process returned
>>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>>> user-direction, the job has
>>>>>>>>>>>> been aborted.
>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun noticed that process
>>>>>>>>>>>> rank 1 with PID 0 on node
>>>>>>>>>>>> titan01 exited on signal 6
>>>>>>>>>>>> (Aborted).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ########CONFIGURATION:
>>>>>>>>>>>> I used the ompi master sources
>>>>>>>>>>>> from github:
>>>>>>>>>>>> commit
>>>>>>>>>>>> 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>>>>>>> Author: Gilles Gouaillardet
>>>>>>>>>>>> <***@rist.or.jp>
>>>>>>>>>>>> Date: Tue Jul 5 13:47:50 2016
>>>>>>>>>>>> +0900
>>>>>>>>>>>>
>>>>>>>>>>>> ./configure --enable-mpi-java
>>>>>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for your help!
>>>>>>>>>>>> Gundram
>>>>>>>>>>>>
Graham, Nathaniel Richard
2016-09-14 18:55:41 UTC
Permalink
​Thanks for reporting this! There are a number of things going on here.


It seems there may be a problem with the Java bindings checked by CReqops.Java, because the C test passes. I'll take a look at that. The issue can be found at: https://github.com/open-mpi/ompi/issues/2081


MPI_Compare_and_swap is failing on master, and therefore on the release branches. You can get around the issue for now by doing: export OMPI_MCA_osc=pt2pt
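For reference, a quick sketch of applying that workaround - the parameter can be set either in the environment or per run on the mpirun command line (the java invocation is just the one used earlier in this thread):

export OMPI_MCA_osc=pt2pt
mpirun -np 2 java TestMpiRmaCompareAndSwap

or, without touching the environment:

mpirun --mca osc pt2pt -np 2 java TestMpiRmaCompareAndSwap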

I submitted an issue to track it at: https://github.com/open-mpi/ompi/issues/2080
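For anyone following the thread, below is a minimal, self-contained C sketch of the MPI_Compare_and_swap call pattern that compare_and_swap.c and TestMpiRmaCompareAndSwap exercise. It is not the MPICH test itself (that test has its own setup), just an illustration assuming an MPI-3 capable build; whether it reproduces the MPI_ERR_RMA_RANGE error will depend on which osc component gets selected.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int *val;                          /* window memory: one int per rank */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* expose one int per process as an RMA window */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &val, &win);
    *val = rank;
    MPI_Barrier(MPI_COMM_WORLD);       /* every window is initialized before any RMA */

    int target  = (rank + 1) % size;   /* neighbour we operate on */
    int compare = target;              /* value we expect to find there */
    int origin  = rank;                /* value to store if the compare matches */
    int result  = -1;                  /* old value at the target is returned here */

    MPI_Win_lock_all(0, win);
    /* atomically: if (window[target] == compare) window[target] = origin;
       the previous value at the target is always written to result */
    MPI_Compare_and_swap(&origin, &compare, &result, MPI_INT, target, 0, win);
    MPI_Win_flush(target, win);
    MPI_Win_unlock_all(win);

    printf("%d: old value on rank %d was %d\n", rank, target, result);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with mpirun -np 2 ./a.out as above, each rank should simply report its neighbour's initial value.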


These tests exercise code I added last summer, which did not make it into 1.8. I know it's all in the 2.0 series though.


-Nathan



--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Wednesday, September 14, 2016 4:02 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


In short words: yes, we compiled with mpijavac and mpicc and run with mpirun -np 2.


In long words: we tested the following setups

a) without Java, with mpi 2.0.1 the C-test

[***@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1

[***@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:18460] *** and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

b) without Java with mpi 1.8.8 the C-test

[***@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/1.8.8

[***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
No Errors
[***@titan01 mpi_test2]$

c) with java 1.8.8 with jdk and Java-Testsuite

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
win.compareAndSwap(next, iBuffer, result, MPI.INT, rank, 0);
^
symbol: method compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these java methods are not supported in 1.8.8

d) ompi 2.0.1 and jdk and Testsuite

[***@titan01 ~]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1 2) java/jdk1.8.0_102

[***@titan01 ~]$ cd ompi-java-test/
[***@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[***@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile

[***@titan01 ompi-java-test]$ cd onesided/
[***@titan01 onesided]$ ./make_onesided &> result
cat result:
<crop.....>

=========================== CReqops ===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32155] *** and potentially your MPI job)

<...crop....>

=========================== TestMpiRmaCompareAndSwap ===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32703] *** and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


< ... end crop>


Also if we start the thing in this way, it fails:

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java OmpitestError.java OmpitestProgress.java OmpitestConfig.java

[***@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap

[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:22877] *** and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:

Since you are getting the same errors with C as you are with Java, this is an issue with C, not the Java bindings. However, in the most recent output, you are using ./a.out to run the test. Did you use mpirun to run the test in Java or C?


The command should be something along the lines of:


mpirun -np 2 ​java TestMpiRmaCompareAndSwap


mpirun -np 2 ./a.out


Also, are you compiling with the ompi wrappers? Should be:


mpijavac TestMpiRmaCompareAndSwap.java


mpicc compare_and_swap.c


In the mean time, I will try to reproduce this on a similar system.


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org><mailto:users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: ***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hey,


it seems to be a problem of ompi 2.x. The C version 2.0.1 also produces this output:

(the same build from sources or the 2.0.1 release)


[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] *** and potentially your MPI job)


But the test works for 1.8.x! In fact our cluster does not have shared memory, so it has to fall back to the default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:

Hello Gundram,


It looks like the test that is failing is TestMpiRmaCompareAndSwap.java. Is that the one that is crashing? If so, could you try to run the C test from:


http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1


There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).


This should let us know whether it's an issue related to Java or not.
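
The core of what that test exercises is a single compare-and-swap on an RMA window. Just as an illustration (this is not the actual test; the names are made up, and it assumes the bindings' Win(Buffer, size, dispUnit, Info, Comm) constructor and MPI.newIntBuffer helpers), the pattern looks roughly like:

import java.nio.IntBuffer;
import mpi.*;

public class CasSketch {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();

        // one int of window memory per process
        // (assumption: size and dispUnit are given in bytes, as in MPI_Win_create)
        IntBuffer winBuf = MPI.newIntBuffer(1);
        winBuf.put(0, 0);
        Win win = new Win(winBuf, 4, 4, MPI.INFO_NULL, MPI.COMM_WORLD);

        IntBuffer origin  = MPI.newIntBuffer(1);
        IntBuffer compare = MPI.newIntBuffer(1);
        IntBuffer result  = MPI.newIntBuffer(1);
        origin.put(0, rank + 1);   // value we try to install
        compare.put(0, 0);         // value we expect to find

        win.fence(0);
        // atomic: if the target int equals compare, replace it with origin;
        // the previous value is returned in result
        win.compareAndSwap(origin, compare, result, MPI.INT, rank, 0);
        win.fence(0);

        System.out.println(rank + ": old value = " + result.get(0));
        win.free();
        MPI.Finalize();
    }
}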


If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations from the C tests).


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org><mailto:users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hello,

I still have the same errors on our cluster - even one more. Maybe the new one helps us to find a solution.

I have this error if I run "make_onesided" of the ompi-java-test repo.

CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] *** and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Sometimes I also have the SIGSEGV error.

System:

compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org<mailto:***@kbuilder.dev.centos.org>) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP

Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016

InfiniBand

openib: OpenSM 3.3.19


limits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilley, Howard,

I configured without disable dlopen - same error.

I test these classes on another cluster and: IT WORKS!

So it is a problem of the cluster configuration. Thank you all very much for all your help! When the admin solves the problem, I will let you know what he changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
the JVM sets its own signal handlers, and it is important openmpi does not override them.
this is what previously happened with PSM (InfiniPath), but this has been solved since.
you might be linking with a third-party library that hijacks signal handlers and causes the crash
(which would explain why I cannot reproduce the issue)

the master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
You made the best of it... thanks a lot!

Without MPI it runs.
Just adding MPI.init() causes the crash!

maybe I installed something wrong...

install newest automake, autoconf, m4, libtoolize in right order and same prefix
check out ompi,
autogen
configure with same prefix, pointing to the same jdk, I later use
make
make install

I will test some different configurations of ./configure...


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar'
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird that your program is still crashing. Another test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1
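
in other words, a stripped-down test along these lines (just a sketch, with every collective removed; the class name is made up), launched via mpirun -np 1 java TestInitOnly:

import java.util.Random;
import mpi.*;

public class TestInitOnly {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        // same 100 MB buffer as the real test, but no MPI calls inside the loop
        byte[] saveMem = new byte[100000000];
        Random r = new Random();
        r.nextBytes(saveMem);
        for (int i = 0; i < 1000; i++) {
            saveMem[r.nextInt(saveMem.length)]++;
            if (i % 100 == 0) {
                System.err.println("i = " + i);
            }
        }
        MPI.Finalize();
    }
}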

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
In all cases the same error.
this is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


Also for one or two cores, the process crashes.


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

another test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3


core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256564
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:

Gundram,


can you please provide more information on your environment :

- configure command line

- OS

- memory available

- ulimit -a

- number of nodes

- number of tasks used

- interconnect used (if any)

- batch manager (if any)


Cheers,


Gilles

On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory, because when I delete the hash call, the program sometimes runs more than 9k iterations.
When it crashes, there are different lines (see (2) and (3)). The crashes also occur on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V [libjvm.so+0x4183bf] ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more idea?

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:

Gundram,


fwiw, i cannot reproduce the issue on my box

- centos 7

- java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


I noticed that on non-zero ranks saveMem is allocated at each iteration.
Ideally, the garbage collector can take care of that and this should not be an issue.

would you mind giving the attached file a try ?
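
the idea of the change is simply to allocate the receive buffer once and reuse it in every iteration, instead of creating a new byte[] each time. A rough sketch of that variant (not the attached file itself; names are illustrative and the per-iteration barriers are dropped for brevity):

import java.util.Random;
import mpi.*;

public class TestSendBigFilesReuse {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        try {
            int maxLen = 100000000;
            byte[] saveMem = new byte[maxLen];   // allocated once on every rank
            MPI.COMM_WORLD.barrier();
            if (MPI.COMM_WORLD.getRank() == 0) {
                new Random().nextBytes(saveMem);
                for (int i = 0; i < 1000; i++) {
                    MPI.COMM_WORLD.bcast(new int[]{maxLen}, 1, MPI.INT, 0);
                    MPI.COMM_WORLD.bcast(saveMem, maxLen, MPI.BYTE, 0);
                }
                MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);  // tell receivers to stop
            } else {
                while (true) {
                    int[] lengthData = new int[1];
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    if (lengthData[0] == 0) {
                        break;
                    }
                    // reuse saveMem instead of "saveMem = new byte[lengthData[0]]"
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
                }
            }
            MPI.COMM_WORLD.barrier();
        } finally {
            MPI.Finalize();
        }
    }
}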

Cheers,

Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today

how did you configure OpenMPI ?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Giles,

thank you for your hints! I made 3 changes; unfortunately the same error occurs:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org><mailto:***@open-mpi.org>
Date: Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete hashcode-lines.

Now I get this error message 100% of the time, after a varying number of iterations (15-300):

0/ 3:length = 100000000
0/ 3:bcast length done (length = 100000000)
1/ 3:bcast length done (length = 100000000)
2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code: (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know if it is a problem of Java or ompi - but in recent years Java has worked with no problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note a race condition in MPI_Init has been fixed yesterday in the master.
can you please update your OpenMPI and try again ?

hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program.
the crash occurs in hashcode, and this makes little sense to me. can you also update your jdk ?

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Jason,

thanks for your response! I think it is another problem. I try to send 100 MB of bytes, so there are not many iterations (between 10 and 30). I realized that the execution of this code can result in 3 different errors:

1. Most often the posted error message occurs.

2. In <10% of the cases I have a livelock. I can see 3 Java processes, one with 200% and two with 100% processor utilization. After ~15 minutes without new output, this error occurs.


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code: (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


3. In <10% of the cases I have a deadlock during MPI.Init. It stays stuck for more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / OpenMPI side?

Thanks in advance for your help!
Gundram Leifert


On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread, it looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help: https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV." does you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, then it might relate to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu<mailto:***@wisc.edu>
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello,

I try to send many byte-arrays via broadcast. After a specific number of repetitions the process either hangs up or returns with a SIGSEGV. Does any one can help me solving the problem:

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

public static void log(String msg) {
try {
System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
} catch (MPIException ex) {
System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
}
}

private static int hashcode(byte[] bytearray) {
if (bytearray == null) {
return 0;
}
int hash = 39;
for (int i = 0; i < bytearray.length; i++) {
byte b = bytearray[i];
hash = hash * 7 + (int) b;
}
return hash;
}

public static void main(String args[]) throws MPIException {
log("start main");
MPI.Init(args);
try {
log("initialized done");
byte[] saveMem = new byte[100000000];
MPI.COMM_WORLD.barrier();
Random r = new Random();
r.nextBytes(saveMem);
if (MPI.COMM_WORLD.getRank() == 0) {
for (int i = 0; i < 1000; i++) {
saveMem[r.nextInt(saveMem.length)]++;
log("i = " + i);
int[] lengthData = new int[]{saveMem.length};
log("object hash = " + hashcode(saveMem));
log("length = " + lengthData[0]);
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT<http://MPI.INT>, 0);
log("bcast length done (length = " + lengthData[0] + ")");
MPI.COMM_WORLD.barrier();
MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
}
MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT<http://MPI.INT>, 0);
} else {
while (true) {
int[] lengthData = new int[1];
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT<http://MPI.INT>, 0);
log("bcast length done (length = " + lengthData[0] + ")");
if (lengthData[0] == 0) {
break;
}
MPI.COMM_WORLD.barrier();
saveMem = new byte[lengthData[0]];
MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
log("object hash = " + hashcode(saveMem));
}
}
MPI.COMM_WORLD.barrier();
} catch (MPIException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} catch (RuntimeException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} finally {
MPI.Finalize();
}

}

}


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# #
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code: (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code: (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########CONFIGURATION:
I used the ompi master sources from github:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp<mailto:***@rist.or.jp>>
Date: Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram

_______________________________________________
users mailing list
***@open-mpi.org<mailto:***@open-mpi.org>
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29584.php




_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29585.php




_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29587.php




_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29589.php




_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29590.php




_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29592.php




_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29593.php




_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29601.php




_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29603.php




_______________________________________________
users mailing list
***@open-mpi.org<mailto:***@open-mpi.org>
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29610.php





_______________________________________________
users mailing list
***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
https://rfd.newmexicoconsortium.org/mailman/listinfo/users




_______________________________________________
users mailing list
***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Graham, Nathaniel Richard
2016-09-15 20:07:55 UTC
Permalink
Both issues have been fixed. The trouble with CReqops.java was a problem with the test. A fixed version has been pushed to the ompi-java-tests repo. The fix for compare_and_swap is merged on master, and should be in the 2.0.2 release.


Let me know if you have any other issues.


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Graham, Nathaniel Richard <***@lanl.gov>
Sent: Wednesday, September 14, 2016 12:55 PM
To: Open MPI Users
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Thanks for reporting this! There are a number of things going on here.


It seems there may be a problem with the Java bindings checked by CReqops.java, because the C test passes. I'll take a look at that. The issue can be found at: https://github.com/open-mpi/ompi/issues/2081


MPI_Compare_and_swap is failing on master, and therefore on the release branches. You can get around the issue for now by doing: export OMPI_MCA_osc=pt2pt
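
(The same workaround can also be passed straight on the command line, e.g. "mpirun --mca osc pt2pt -np 2 java TestMpiRmaCompareAndSwap"; the --mca osc pt2pt option is equivalent to exporting the OMPI_MCA_osc environment variable.)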

I submitted an issue to track it at: https://github.com/open-mpi/ompi/issues/2080


These tests exercise code I added last summer that did not make it into 1.8. I know it's all in the 2.0 series though.


-Nathan



--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Wednesday, September 14, 2016 4:02 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


In short: yes, we compiled with mpijavac and mpicc and ran with mpirun -np 2.


In more detail, we tested the following setups:

a) without Java, with mpi 2.0.1 the C-test

[***@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1

[***@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:18460] *** and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

b) without Java with mpi 1.8.8 the C-test

[***@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/1.8.8

[***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
No Errors
[***@titan01 mpi_test2]$

c) with java 1.8.8 with jdk and Java-Testsuite

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
win.compareAndSwap(next, iBuffer, result, MPI.INT, rank, 0);
^
symbol: method compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these java methods are not supported in 1.8.8

d) ompi 2.0.1 and jdk and Testsuite

[***@titan01 ~]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1 2) java/jdk1.8.0_102

[***@titan01 ~]$ cd ompi-java-test/
[***@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[***@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile

[***@titan01 ompi-java-test]$ cd onesided/
[***@titan01 onesided]$ ./make_onesided &> result
cat result:
<crop.....>

=========================== CReqops ===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32155] *** and potentially your MPI job)

<...crop....>

=========================== TestMpiRmaCompareAndSwap ===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32703] *** and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


< ... end crop>


It also fails if we start it this way:

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java OmpitestError.java OmpitestProgress.java OmpitestConfig.java

[***@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap

[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:22877] *** and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:

Since you are getting the same errors with C as you are with Java, this is an issue with C, not the Java bindings. However, in the most recent output, you are using ./a.out to run the test. Did you use mpirun to run the test in Java or C?


The command should be something along the lines of:


mpirun -np 2 java TestMpiRmaCompareAndSwap


mpirun -np 2 ./a.out


Also, are you compiling with the ompi wrappers? Should be:


mpijavac TestMpiRmaCompareAndSwap.java


mpicc compare_and_swap.c


In the mean time, I will try to reproduce this on a similar system.


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org><mailto:users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: ***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hey,


it seems to be a problem of ompi 2.x. The C version 2.0.1 also produces this output:

(the same build from sources or the 2.0.1 release)


[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] *** and potentially your MPI job)


But the test works for 1.8.x! In fact our cluster does not have shared memory, so it has to fall back to the default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:

Hello Gundram,


It looks like the test that is failing is TestMpiRmaCompareAndSwap.java. Is that the one that is crashing? If so, could you try to run the C test from:


http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1


There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).


This should let us know whether it's an issue related to Java or not.


If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations from the C tests).


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org><mailto:users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hello,

I still have the same errors on our cluster - even one more. Maybe the new one helps us to find a solution.

I have this error if I run "make_onesided" of the ompi-java-test repo.

CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] *** and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Sometimes I also have the SIGSEGV error.

System:

compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org<mailto:***@kbuilder.dev.centos.org>) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP

Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016

InfiniBand

openib: OpenSM 3.3.19


limits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilley, Howard,

I configured without disable dlopen - same error.

I test these classes on another cluster and: IT WORKS!

So it is a problem of the cluster configuration. Thank you all very much for all your help! When the admin solves the problem, I will let you know what he changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
the JVM sets its own signal handlers, and it is important openmpi does not override them.
this is what previously happened with PSM (InfiniPath), but this has been solved since.
you might be linking with a third-party library that hijacks signal handlers and causes the crash
(which would explain why I cannot reproduce the issue)

the master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
You made the best of it... thanks a lot!

Without MPI it runs.
Just adding MPI.init() causes the crash!

maybe I installed something wrong...

install newest automake, autoconf, m4, libtoolize in right order and same prefix
check out ompi,
autogen
configure with same prefix, pointing to the same jdk, I later use
make
make install

I will test some different configurations of ./configure...


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar'
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird that your program is still crashing. Another test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
In all cases the same error.
this is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


Also for one or two cores, the process crashes.


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

another test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3


core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256564
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:

Gundram,


can you please provide more information on your environment :

- configure command line

- OS

- memory available

- ulimit -a

- number of nodes

- number of tasks used

- interconnect used (if any)

- batch manager (if any)


Cheers,


Gilles

On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory, because when I delete the hash call, the program sometimes runs more than 9k iterations.
When it crashes, there are different lines (see (2) and (3)). The crashes also occur on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V [libjvm.so+0x4183bf] ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more idea?

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:

Gundram,


fwiw, i cannot reproduce the issue on my box

- centos 7

- java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


I noticed that on non-zero ranks saveMem is allocated at each iteration.
Ideally, the garbage collector can take care of that and this should not be an issue.

would you mind giving the attached file a try ?

Cheers,

Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today

how did you configure OpenMPI ?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Giles,

thank you for your hints! I made 3 changes; unfortunately the same error occurs:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org><mailto:***@open-mpi.org>
Date: Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete hashcode-lines.

Now I get this error message 100% of the time, after a varying number of iterations (15-300):

0/ 3:length = 100000000
0/ 3:bcast length done (length = 100000000)
1/ 3:bcast length done (length = 100000000)
2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code: (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know if it is a problem of Java or ompi - but in recent years Java has worked with no problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note a race condition in MPI_Init has been fixed yesterday in the master.
can you please update your OpenMPI and try again ?

hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program.
the crash occurs in hashcode, and this makes little sense to me. can you also update your jdk ?

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Jason,

thanks for your response! I think it is another problem. I try to send 100 MB of bytes, so there are not many iterations (between 10 and 30). I realized that the execution of this code can result in 3 different errors:

1. Most often the posted error message occurs.

2. In <10% of the cases I have a livelock. I can see 3 Java processes, one with 200% and two with 100% processor utilization. After ~15 minutes without new output, this error occurs.


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code: (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


3. In <10% of the cases I have a deadlock during MPI.Init. It stays stuck for more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / OpenMPI side?

Thanks in advance for your help!
Gundram Leifert


On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread, it looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help: https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV." does you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, then it might relate to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu<mailto:***@wisc.edu>
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello,

I try to send many byte-arrays via broadcast. After a specific number of repetitions the process either hangs up or returns with a SIGSEGV. Does any one can help me solving the problem:

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

public static void log(String msg) {
try {
System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
} catch (MPIException ex) {
System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
}
}

private static int hashcode(byte[] bytearray) {
if (bytearray == null) {
return 0;
}
int hash = 39;
for (int i = 0; i < bytearray.length; i++) {
byte b = bytearray[i];
hash = hash * 7 + (int) b;
}
return hash;
}

public static void main(String args[]) throws MPIException {
log("start main");
MPI.Init(args);
try {
log("initialized done");
byte[] saveMem = new byte[100000000];
MPI.COMM_WORLD.barrier();
Random r = new Random();
r.nextBytes(saveMem);
if (MPI.COMM_WORLD.getRank() == 0) {
for (int i = 0; i < 1000; i++) {
saveMem[r.nextInt(saveMem.length)]++;
log("i = " + i);
int[] lengthData = new int[]{saveMem.length};
log("object hash = " + hashcode(saveMem));
log("length = " + lengthData[0]);
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT<http://MPI.INT>, 0);
log("bcast length done (length = " + lengthData[0] + ")");
MPI.COMM_WORLD.barrier();
MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
}
MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
} else {
while (true) {
int[] lengthData = new int[1];
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
if (lengthData[0] == 0) {
break;
}
MPI.COMM_WORLD.barrier();
saveMem = new byte[lengthData[0]];
MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
log("object hash = " + hashcode(saveMem));
}
}
MPI.COMM_WORLD.barrier();
} catch (MPIException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} catch (RuntimeException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} finally {
MPI.Finalize();
}

}

}


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# #
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code: (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code: (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########CONFIGURATION:
I used the ompi master sources from github:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp>
Date: Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram

_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29584.php




Nathan Hjelm
2016-09-14 18:18:57 UTC
Permalink
We have a new high-speed component for RMA in 2.0.x called osc/rdma. Since the component is doing direct rdma on the target we are much more strict about the ranges. osc/pt2pt doesn't bother checking at the moment.

Can you build Open MPI with --enable-debug and add -mca osc_base_verbose 100 to the mpirun command-line? Please upload the output as a gist (https://gist.github.com/) and send a link so we can take a look.
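
For reference, a minimal sketch of that workflow, stitched together from commands that appear elsewhere in this thread (the JDK path is a placeholder):

    ./configure --enable-debug --enable-mpi-java --with-jdk-dir=/path/to/jdk
    make -j 8 all && make install
    mpirun -np 2 -mca osc_base_verbose 100 java TestMpiRmaCompareAndSwap 2>&1 | tee osc_verbose.log
    # possible workaround while debugging: fall back to the pt2pt one-sided component
    export OMPI_MCA_osc=pt2pt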

-Nathan

On Sep 14, 2016, at 04:26 AM, Gundram Leifert <***@uni-rostock.de> wrote:

In short: yes, we compiled with mpijavac and mpicc and ran with mpirun -np 2.

In detail: we tested the following setups:

a) without Java, with mpi 2.0.1 the C-test

[***@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/2.0.1

[***@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:18460] ***    and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

b) without Java with mpi 1.8.8 the C-test

[***@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/1.8.8

[***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
 No Errors
[***@titan01 mpi_test2]$

c) with Java: mpi 1.8.8 with the JDK and the Java test suite

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
                        win.compareAndSwap(next, iBuffer, result, MPI.INT, rank, 0);
                           ^
  symbol:   method compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
  location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these java methods are not supported in 1.8.8

d) with Java: ompi 2.0.1 with the JDK and the test suite

[***@titan01 ~]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/2.0.1   2) java/jdk1.8.0_102

[***@titan01 ~]$ cd ompi-java-test/
[***@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[***@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile

[***@titan01 ompi-java-test]$ cd onesided/
[***@titan01 onesided]$ ./make_onesided &> result
cat result:
<crop.....>

=========================== CReqops ===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32155] ***    and potentially your MPI job)

<...crop....>

=========================== TestMpiRmaCompareAndSwap ===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32703] ***    and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


< ... end crop>


It also fails if we start it this way:

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java OmpitestError.java OmpitestProgress.java OmpitestConfig.java

[***@titan01 onesided]$  mpiexec -np 2 java TestMpiRmaCompareAndSwap

[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:22877] ***    and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:
Since you are getting the same errors with C as you are with Java, this is an issue with C, not the Java bindings.  However, in the most recent output, you are using ./a.out to run the test.  Did you use mpirun to run the test in Java or C?

The command should be something along the lines of: 

    mpirun -np 2 java TestMpiRmaCompareAndSwap

    mpirun -np 2 ./a.out

Also, are you compiling with the ompi wrappers?  Should be:

    mpijavac TestMpiRmaCompareAndSwap.java

    mpicc compare_and_swap.c

In the meantime, I will try to reproduce this on a similar system.

-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
 
Hey,

it seems to be a problem with ompi 2.x. The C version under 2.0.1 also produces this output
(the same build from sources or from the 2.0.1 release):

[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] ***    and potentially your MPI job)

But the test works with 1.8.x! In fact, our cluster does not have shared memory, so it has to use the wrapper to the default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:
Hello Gundram,

It looks like the test that is failing is TestMpiRmaCompareAndSwap.java.  Is that the one that is crashing?  If so, could you try to run the C test from:

    http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1

There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).

This should let us know whether it's an issue related to Java or not.

If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations of the C tests).

-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
 
Hello,
I still have the same errors on our cluster - and even one more. Maybe the new one helps us find a solution.
I get this error if I run "make_onesided" from the ompi-java-test repo.

  CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] ***    and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Sometimes I also have the SIGSEGV error.
System:
compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP
Open MPI v2.0.1, package: Open MPI  Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016
InfiniBand

openib:  OpenSM 3.3.19


limits:

 ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256554
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilles, Howard,

I configured without the disable-dlopen option - same error.

I tested these classes on another cluster and: IT WORKS!

So it is a problem with the cluster configuration. Thank you all very much for your help! When the admin solves the problem, I will let you know what he changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
the JVM sets its own signal handlers, and it is important that openmpi does not override them.
this is what previously happened with PSM (InfiniPath), but this has been solved since.
you might be linking with a third-party library that hijacks signal handlers and causes the crash
(which would explain why I cannot reproduce the issue)

the master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
You made the best of it... thanks a lot!

Without MPI it runs.
Just adding MPI.init() causes the crash!

maybe I installed something wrong...

install the newest automake, autoconf, m4, libtool in the right order and with the same prefix
check out ompi,
autogen
configure with the same prefix, pointing to the same JDK I use later
make
make install

I will test some different configurations of ./configure...


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar'
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird that your program is still crashing. Another test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
In all cases the same error.
this is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


Also with 1 or 2 cores, the process crashes.


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

another test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3


core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256564
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
Gundram,

can you please provide more information on your environment :
- configure command line
- OS
- memory available
- ulimit -a
- number of nodes
- number of tasks used
- interconnect used (if any)
- batch manager (if any)

Cheers,

Gilles
On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory, because when I delete the hash call, the program sometimes runs for more than 9k iterations.
When it crashes, it is at different lines (see (2) and (3)). The crashes also occur on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V  [libjvm.so+0x68d0f6]  JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V  [libjvm.so+0x4183bf]  ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more ideas?
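
One thing that might be worth trying (an editorial sketch based on the shared-memory suspicion above, not something tested in this thread): keep the payload in a direct, off-heap buffer allocated with the Java bindings' MPI.newByteBuffer, so the garbage collector cannot move the data while the native bcast is in flight. The class name TestSendBigFilesDirect and the fill loop are made up for illustration; the buffer size is the one from the test above.

import java.nio.ByteBuffer;
import mpi.*;

public class TestSendBigFilesDirect {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        try {
            int len = 100000000;
            // direct buffer: allocated outside the GC-managed Java heap
            ByteBuffer saveMem = MPI.newByteBuffer(len);
            if (MPI.COMM_WORLD.getRank() == 0) {
                for (int i = 0; i < len; i++) {
                    saveMem.put(i, (byte) i); // deterministic fill on the root only
                }
            }
            for (int i = 0; i < 1000; i++) {
                // the same buffer is reused on every rank and every iteration
                MPI.COMM_WORLD.bcast(saveMem, len, MPI.BYTE, 0);
                MPI.COMM_WORLD.barrier();
            }
        } finally {
            MPI.Finalize();
        }
    }
}

If the crashes disappear with the direct buffer, that would point at an interaction between the garbage collector and the native transfer rather than at the broadcast itself.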

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
Gundram,

fwiw, i cannot reproduce the issue on my box
- centos 7
- java version "1.8.0_71"
  Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
  Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


I noticed that on non-zero ranks, saveMem is allocated at each iteration.
Ideally, the garbage collector can take care of that and this should not be an issue.
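
A minimal illustration of that point (a hypothetical helper, not the file attached to this mail): only reallocate the receive buffer when the announced length actually changes.

    private static byte[] ensureCapacity(byte[] buf, int length) {
        // reuse the existing array whenever it already has the right size
        return (buf != null && buf.length == length) ? buf : new byte[length];
    }

    // on the non-zero ranks, instead of "saveMem = new byte[lengthData[0]];":
    saveMem = ensureCapacity(saveMem, lengthData[0]);
    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);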

would you mind giving the attached file a try ?

Cheers,

Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today

how did you configure OpenMPI ?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello Giles,

thank you for your hints! I made 3 changes; unfortunately, the same error occurs:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org>
Date:   Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete hashcode-lines.

Now I get this error message 100% of the time, after a varying number of iterations (15-300):

 0/ 3:length = 100000000
 0/ 3:bcast length done (length = 100000000)
 1/ 3:bcast length done (length = 100000000)
 2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x414d24]  ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code:  (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know if it is a problem with Java or OMPI - but in the last years, Java worked without problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note that a race condition in MPI_Init was fixed yesterday in master.
Can you please update your Open MPI and try again?

hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program?
The crash occurs in hashcode, which makes little sense to me. Can you also update your JDK?

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello Jason,

thanks for your response! I think it is a different problem. I try to send 100 MB of bytes, so there are not many iterations (between 10 and 30). I realized that executing this code can result in 3 different errors:

1. Most often the posted error message occurs.

2. In <10% of the cases I have a livelock. I can see 3 Java processes, one with 200% and two with 100% CPU utilization. After ~15 minutes without new output, this error occurs.


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
#  guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code:  (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


3. In <10% of the cases I get a deadlock during MPI.Init. It stays there for more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / Open MPI side?

Thanks in advance for your help!
Gundram Leifert


On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help:  https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV."  does you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, then it might relate to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

I am trying to send many byte-arrays via broadcast. After a specific number of repetitions, the process either hangs or returns with a SIGSEGV. Can anyone help me solve this problem?

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

    public static void log(String msg) {
        try {
            System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
        } catch (MPIException ex) {
            System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
        }
    }

    private static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (int i = 0; i < bytearray.length; i++) {
            byte b = bytearray[i];
            hash = hash * 7 + (int) b;
        }
        return hash;
    }

    public static void main(String args[]) throws MPIException {
        log("start main");
        MPI.Init(args);
        try {
            log("initialized done");
            byte[] saveMem = new byte[100000000];
            MPI.COMM_WORLD.barrier();
            Random r = new Random();
            r.nextBytes(saveMem);
            if (MPI.COMM_WORLD.getRank() == 0) {
                for (int i = 0; i < 1000; i++) {
                    saveMem[r.nextInt(saveMem.length)]++;
                    log("i = " + i);
                    int[] lengthData = new int[]{saveMem.length};
                    log("object hash = " + hashcode(saveMem));
                    log("length = " + lengthData[0]);
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0] + ")");
                    MPI.COMM_WORLD.barrier();
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                }
                MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
            } else {
                while (true) {
                    int[] lengthData = new int[1];
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0] + ")");
                    if (lengthData[0] == 0) {
                        break;
                    }
                    MPI.COMM_WORLD.barrier();
                    saveMem = new byte[lengthData[0]];
                    MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                    log("object hash = " + hashcode(saveMem));
                }
            }
            MPI.COMM_WORLD.barrier();
        } catch (MPIException ex) {
            System.out.println("caugth error." + ex);
            log(ex.getMessage());
        } catch (RuntimeException ex) {
            System.out.println("caugth error." + ex);
            log(ex.getMessage());
        } finally {
            MPI.Finalize();
        }

    }

}


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# #
#  SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code:  (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code:  (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########CONFIGURATION:
I used the ompi master sources from github:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp>
Date:   Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram

_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29584.php



Gundram Leifert
2016-09-20 08:13:12 UTC
Permalink
Sorry for the delay...

we applied

./configure --enable-debug --with-psm --enable-mpi-java
--with-jdk-dir=/cluster/libraries/java/jdk1.8.0_102/
--prefix=/cluster/mpi/gcc/openmpi/2.0.x_nightly
make -j 8 all
make install

Java-test-suite

export OMPI_MCA_osc=pt2pt

./make_onesided &> make_onesided.out
Output: https://gist.github.com/anonymous/f8c6837b6a6d40c806cec9458dfcc1ab

we still sometimes get the SIGSEGV:

WinAllocate with -np = 2:
Exception in thread "main" Exception in thread "main" mpi.MPIException:
MPI_ERR_INTERN: internal errormpi.MPIException: MPI_ERR_INTERN: internal
error

at mpi.Win.allocateSharedWin(Native Method) at
mpi.Win.allocateSharedWin(Native Method)

at mpi.Win.<init>(Win.java:110) at mpi.Win.<init>(Win.java:110)

at WinAllocate.main(WinAllocate.java:42) at
WinAllocate.main(WinAllocate.java:42)


WinName with -np = 2:
mpiexec has exited due to process rank 1 with PID 0 on
node node160 exiting improperly. There are three reasons this could occur:
<CROP>


CCreateInfo and Cput with -np 8:
sometimes end with a SIGSEGV (see https://gist.github.com/anonymous/605c19422fd00bdfc4d1ea0151a1f34c for a detailed view).

I hope this information is helpful...

Best Regards,
Gundram


On 09/14/2016 08:18 PM, Nathan Hjelm wrote:
> We have a new high-speed component for RMA in 2.0.x called osc/rdma.
> Since the component is doing direct rdma on the target we are much
> more strict about the ranges. osc/pt2pt doesn't bother checking at the
> moment.
>
> Can you build Open MPI with --enable-debug and add -mca
> osc_base_verbose 100 to the mpirun command-line? Please upload the
> output as a gist (https://gist.github.com/) and send a link so we can
> take a look.
>
> -Nathan
>
> On Sep 14, 2016, at 04:26 AM, Gundram Leifert
> <***@uni-rostock.de> wrote:
>
>> In short words: yes, we compiled with mpijavac and mpicc and run with
>> mpirun -np 2.
>>
>>
>> In long words: we tested the following setups
>>
>>
>> a) without Java, with mpi 2.0.1 the C-test
>>
>> [***@titan01 mpi_test]$ module list
>> Currently Loaded Modulefiles:
>> 1) openmpi/gcc/2.0.1
>>
>> [***@titan01 mpi_test]$ mpirun -np 2 ./a.out
>> [titan01:18460] *** An error occurred in MPI_Compare_and_swap
>> [titan01:18460] *** reported by process [3535667201,1]
>> [titan01:18460] *** on win rdma window 3
>> [titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:18460] *** and potentially your MPI job)
>> [titan01.service:18454] 1 more process has sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal
>> [titan01.service:18454] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>>
>> b) without Java with mpi 1.8.8 the C-test
>>
>> [***@titan01 mpi_test2]$ module list
>> Currently Loaded Modulefiles:
>> 1) openmpi/gcc/1.8.8
>>
>> [***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
>> No Errors
>> [***@titan01 mpi_test2]$
>>
>> c) with java 1.8.8 with jdk and Java-Testsuite
>>
>> [***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
>> TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
>> win.compareAndSwap(next, iBuffer, result,
>> MPI.INT, rank, 0);
>> ^
>> symbol: method
>> compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
>> location: variable win of type Win
>> TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol
>>
>> >> these java methods are not supported in 1.8.8
>>
>> d) ompi 2.0.1 and jdk and Testsuite
>>
>> [***@titan01 ~]$ module list
>> Currently Loaded Modulefiles:
>> 1) openmpi/gcc/2.0.1 2) java/jdk1.8.0_102
>>
>> [***@titan01 ~]$ cd ompi-java-test/
>> [***@titan01 ompi-java-test]$ ./autogen.sh
>> autoreconf: Entering directory `.'
>> autoreconf: configure.ac: not using Gettext
>> autoreconf: running: aclocal --force
>> autoreconf: configure.ac: tracing
>> autoreconf: configure.ac: not using Libtool
>> autoreconf: running: /usr/bin/autoconf --force
>> autoreconf: configure.ac: not using Autoheader
>> autoreconf: running: automake --add-missing --copy --force-missing
>> autoreconf: Leaving directory `.'
>> [***@titan01 ompi-java-test]$ ./configure
>> Configuring Open Java test suite
>> checking for a BSD-compatible install... /bin/install -c
>> checking whether build environment is sane... yes
>> checking for a thread-safe mkdir -p... /bin/mkdir -p
>> checking for gawk... gawk
>> checking whether make sets $(MAKE)... yes
>> checking whether make supports nested variables... yes
>> checking whether make supports nested variables... (cached) yes
>> checking for mpijavac... yes
>> checking if checking MPI API params... yes
>> checking that generated files are newer than configure... done
>> configure: creating ./config.status
>> config.status: creating reporting/OmpitestConfig.java
>> config.status: creating Makefile
>>
>> [***@titan01 ompi-java-test]$ cd onesided/
>> [***@titan01 onesided]$ ./make_onesided &> result
>> cat result:
>> <crop.....>
>>
>> =========================== CReqops ===========================
>> [titan01:32155] *** An error occurred in MPI_Rput
>> [titan01:32155] *** reported by process [3879534593,1]
>> [titan01:32155] *** on win rdma window 3
>> [titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:32155] *** and potentially your MPI job)
>>
>> <...crop....>
>>
>> =========================== TestMpiRmaCompareAndSwap
>> ===========================
>> [titan01:32703] *** An error occurred in MPI_Compare_and_swap
>> [titan01:32703] *** reported by process [3843162113,0]
>> [titan01:32703] *** on win rdma window 3
>> [titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:32703] *** and potentially your MPI job)
>> [titan01.service:32698] 1 more process has sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal
>> [titan01.service:32698] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>>
>>
>> < ... end crop>
>>
>>
>> Also if we start the thing in this way, it fails:
>>
>> [***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
>> OmpitestError.java OmpitestProgress.java OmpitestConfig.java
>>
>> [***@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap
>>
>> [titan01:22877] *** An error occurred in MPI_Compare_and_swap
>> [titan01:22877] *** reported by process [3287285761,0]
>> [titan01:22877] *** on win rdma window 3
>> [titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:22877] *** and potentially your MPI job)
>> [titan01.service:22872] 1 more process has sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal
>> [titan01.service:22872] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>>
>>
>>
>> On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:
>>>
>>> Since you are getting the same errors with C as you are with Java,
>>> this is an issue with C, not the Java bindings. However, in the
>>> most recent output, you are using ./a.out to run the test. Did you
>>> use mpirun to run the test in Java or C?
>>>
>>>
>>> The command should be something along the lines of:
>>>
>>>
>>> mpirun -np 2 java TestMpiRmaCompareAndSwap
>>>
>>>
>>> mpirun -np 2 ./a.out
>>>
>>>
>>> Also, are you compiling with the ompi wrappers? Should be:
>>>
>>>
>>> mpijavac TestMpiRmaCompareAndSwap.java
>>>
>>>
>>> mpicc compare_and_swap.c
>>>
>>>
>>> In the mean time, I will try to reproduce this on a similar system.
>>>
>>>
>>> -Nathan
>>>
>>>
>>>
>>> --
>>> Nathaniel Graham
>>> HPC-DES
>>> Los Alamos National Laboratory
>>> ------------------------------------------------------------------------
>>> *From:* users <users-***@lists.open-mpi.org> on behalf of
>>> Gundram Leifert <***@uni-rostock.de>
>>> *Sent:* Tuesday, September 13, 2016 12:46 AM
>>> *To:* ***@lists.open-mpi.org
>>> *Subject:* Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
>>>
>>> Hey,
>>>
>>>
>>> it seams to be a problem of ompi 2.x. Also the c-version 2.0.1
>>> returns produces this output:
>>>
>>> (the same bulid by sources or the release 2.0.1)
>>>
>>>
>>> [***@node108 mpi_test]$ ./a.out
>>> [node108:2949] *** An error occurred in MPI_Compare_and_swap
>>> [node108:2949] *** reported by process [1649420396,0]
>>> [node108:2949] *** on win rdma window 3
>>> [node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>>> [node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>>> now abort,
>>> [node108:2949] *** and potentially your MPI job)
>>>
>>> But the test works for 1.8.x! In fact our cluster does not have
>>> shared-memory - so it has to use the wrapper to default methods.
>>>
>>> Gundram
>>>
>>> On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:
>>>>
>>>> Hello Gundram,
>>>>
>>>>
>>>> It looks like the test that is failing is
>>>> TestMpiRmaCompareAndSwap.java. Is that the one that is crashing?
>>>> If so, could you try to run the C test from:
>>>>
>>>>
>>>> http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1
>>>>
>>>>
>>>> There are a couple of header files you will need for that test, but
>>>> they are in the same repo as the test (up a few folders and in an
>>>> include folder).
>>>>
>>>>
>>>> This should let us know whether its an issue related to Java or not.
>>>>
>>>>
>>>> If it is another test, let me know and Ill see if I can get you the
>>>> C version (most or all of the Java tests are translations from the
>>>> C test).
>>>>
>>>>
>>>> -Nathan
>>>>
>>>>
>>>>
>>>> --
>>>> Nathaniel Graham
>>>> HPC-DES
>>>> Los Alamos National Laboratory
>>>> ------------------------------------------------------------------------
>>>> *From:* users <users-***@lists.open-mpi.org> on behalf of
>>>> Gundram Leifert <***@uni-rostock.de>
>>>> *Sent:* Wednesday, September 7, 2016 9:23 AM
>>>> *To:* ***@lists.open-mpi.org
>>>> *Subject:* Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
>>>>
>>>> Hello,
>>>>
>>>> I still have the same errors on our cluster - even one more. Maybe
>>>> the new one helps us to find a solution.
>>>>
>>>> I have this error if I run "make_onesided" of the ompi-java-test repo.
>>>>
>>>> CReqops and TestMpiRmaCompareAndSwap report (pretty
>>>> deterministically - in all my 30 runs) this error:
>>>>
>>>> [titan01:5134] *** An error occurred in MPI_Compare_and_swap
>>>> [titan01:5134] *** reported by process [2392850433,1]
>>>> [titan01:5134] *** on win rdma window 3
>>>> [titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>>>> [titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>>>> now abort,
>>>> [titan01:5134] *** and potentially your MPI job)
>>>> [titan01.service:05128] 1 more process has sent help message
>>>> help-mpi-errors.txt / mpi_errors_are_fatal
>>>> [titan01.service:05128] Set MCA parameter
>>>> "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>
>>>> Sometimes I also have the SIGSEGV error.
>>>>
>>>> System:
>>>>
>>>> compiler: gcc/5.2.0
>>>> java: jdk1.8.0_102
>>>> kernelmodule: mlx4_core mlx4_en mlx4_ib
>>>> Linux version 3.10.0-327.13.1.el7.x86_64
>>>> (***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red
>>>> Hat 4.8.3-9) (GCC) ) #1 SMP
>>>>
>>>> Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo
>>>> rev: v2.0.0-257-gee86e07, Sep 02, 2016
>>>>
>>>> inifiband
>>>>
>>>> openib: OpenSM 3.3.19
>>>>
>>>>
>>>> limits:
>>>>
>>>> ulimit -a
>>>> core file size (blocks, -c) 0
>>>> data seg size (kbytes, -d) unlimited
>>>> scheduling priority (-e) 0
>>>> file size (blocks, -f) unlimited
>>>> pending signals (-i) 256554
>>>> max locked memory (kbytes, -l) unlimited
>>>> max memory size (kbytes, -m) unlimited
>>>> open files (-n) 100000
>>>> pipe size (512 bytes, -p) 8
>>>> POSIX message queues (bytes, -q) 819200
>>>> real-time priority (-r) 0
>>>> stack size (kbytes, -s) unlimited
>>>> cpu time (seconds, -t) unlimited
>>>> max user processes (-u) 4096
>>>> virtual memory (kbytes, -v) unlimited
>>>> file locks (-x) unlimited
>>>>
>>>>
>>>> Thanks, Gundram
>>>> On 07/12/2016 11:08 AM, Gundram Leifert wrote:
>>>>> Hello Gilley, Howard,
>>>>>
>>>>> I configured without disable dlopen - same error.
>>>>>
>>>>> I test these classes on another cluster and: IT WORKS!
>>>>>
>>>>> So it is a problem of the cluster configuration. Thank you all
>>>>> very much for all your help! When the admin can solve the problem,
>>>>> i will let you know, what he had changed.
>>>>>
>>>>> Cheers Gundram
>>>>>
>>>>> On 07/08/2016 04:19 PM, Howard Pritchard wrote:
>>>>>> Hi Gundram
>>>>>>
>>>>>> Could you configure without the disable dlopen option and retry?
>>>>>>
>>>>>> Howard
>>>>>>
>>>>>> Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
>>>>>>
>>>>>> the JVM sets its own signal handlers, and it is important
>>>>>> openmpi dones not override them.
>>>>>> this is what previously happened with PSM (infinipath) but
>>>>>> this has been solved since.
>>>>>> you might be linking with a third party library that hijacks
>>>>>> signal handlers and cause the crash
>>>>>> (which would explain why I cannot reproduce the issue)
>>>>>>
>>>>>> the master branch has a revamped memory patcher (compared to
>>>>>> v2.x or v1.10), and that could have some bad interactions
>>>>>> with the JVM, so you might also give v2.x a try
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>>> <***@uni-rostock.de> wrote:
>>>>>>
>>>>>> You made the best of it... thanks a lot!
>>>>>>
>>>>>> Whithout MPI it runs.
>>>>>> Just adding MPI.init() causes the crash!
>>>>>>
>>>>>> maybe I installed something wrong...
>>>>>>
>>>>>> install newest automake, autoconf, m4, libtoolize in
>>>>>> right order and same prefix
>>>>>> check out ompi,
>>>>>> autogen
>>>>>> configure with same prefix, pointing to the same jdk, I
>>>>>> later use
>>>>>> make
>>>>>> make install
>>>>>>
>>>>>> I will test some different configurations of ./configure...
>>>>>>
>>>>>>
>>>>>> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>>>>>>> I am running out of ideas ...
>>>>>>>
>>>>>>> what if you do not run within slurm ?
>>>>>>> what if you do not use '-cp executor.jar'
>>>>>>> or what if you configure without --disable-dlopen
>>>>>>> --disable-mca-dso ?
>>>>>>>
>>>>>>> if you
>>>>>>> mpirun -np 1 ...
>>>>>>> then MPI_Bcast and MPI_Barrier are basically no-ops, so
>>>>>>> it is really weird your program is still crashing. another
>>>>>>> test is to comment out MPI_Bcast and MPI_Barrier
>>>>>>> and try again with -np 1
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>
>>>>>>> In any cases the same error.
>>>>>>> this is my code:
>>>>>>>
>>>>>>> salloc -n 3
>>>>>>> export IPATH_NO_BACKTRACE
>>>>>>> ulimit -s 10240
>>>>>>> mpirun -np 3 java -cp executor.jar
>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles2
>>>>>>>
>>>>>>>
>>>>>>> also for 1 or two cores, the process crashes.
>>>>>>>
>>>>>>>
>>>>>>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>>>>>>> you can try
>>>>>>>> export IPATH_NO_BACKTRACE
>>>>>>>> before invoking mpirun (that should not be needed
>>>>>>>> though)
>>>>>>>>
>>>>>>>> another test is to
>>>>>>>> ulimit -s 10240
>>>>>>>> before invoking mpirun.
>>>>>>>>
>>>>>>>> btw, do you use mpirun or srun ?
>>>>>>>>
>>>>>>>> can you reproduce the crash with 1 or 2 tasks ?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> configure:
>>>>>>>> ./configure --enable-mpi-java
>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>>>
>>>>>>>>
>>>>>>>> 1 node with 3 cores. I use SLURM to allocate
>>>>>>>> one node. I changed --mem, but it has no effect.
>>>>>>>> salloc -n 3
>>>>>>>>
>>>>>>>>
>>>>>>>> core file size (blocks, -c) 0
>>>>>>>> data seg size (kbytes, -d) unlimited
>>>>>>>> scheduling priority (-e) 0
>>>>>>>> file size (blocks, -f) unlimited
>>>>>>>> pending signals (-i) 256564
>>>>>>>> max locked memory (kbytes, -l) unlimited
>>>>>>>> max memory size (kbytes, -m) unlimited
>>>>>>>> open files (-n) 100000
>>>>>>>> pipe size (512 bytes, -p) 8
>>>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>>>> real-time priority (-r) 0
>>>>>>>> stack size (kbytes, -s) unlimited
>>>>>>>> cpu time (seconds, -t) unlimited
>>>>>>>> max user processes (-u) 4096
>>>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>>>> file locks (-x) unlimited
>>>>>>>>
>>>>>>>> uname -a
>>>>>>>> Linux titan01.service
>>>>>>>> 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
>>>>>>>> 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>
>>>>>>>> cat /etc/system-release
>>>>>>>> CentOS Linux release 7.2.1511 (Core)
>>>>>>>>
>>>>>>>> what else do you need?
>>>>>>>>
>>>>>>>> Cheers, Gundram
>>>>>>>>
>>>>>>>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>>>>>>>
>>>>>>>>> Gundram,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> can you please provide more information on
>>>>>>>>> your environment :
>>>>>>>>>
>>>>>>>>> - configure command line
>>>>>>>>>
>>>>>>>>> - OS
>>>>>>>>>
>>>>>>>>> - memory available
>>>>>>>>>
>>>>>>>>> - ulimit -a
>>>>>>>>>
>>>>>>>>> - number of nodes
>>>>>>>>>
>>>>>>>>> - number of tasks used
>>>>>>>>>
>>>>>>>>> - interconnect used (if any)
>>>>>>>>>
>>>>>>>>> - batch manager (if any)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>>>>>>>> Hello Gilles,
>>>>>>>>>>
>>>>>>>>>> I tried your code and it crashes after 3-15
>>>>>>>>>> iterations (see (1)). It is always the same
>>>>>>>>>> error (only the "94" varies).
>>>>>>>>>>
>>>>>>>>>> Meanwhile I think Java and MPI use the same
>>>>>>>>>> memory because when I delete the hash-call,
>>>>>>>>>> the program runs sometimes more than 9k
>>>>>>>>>> iterations.
>>>>>>>>>> When it crashes, there are different lines
>>>>>>>>>> (see (2) and (3)). The crashes also occur on
>>>>>>>>>> rank 0.
>>>>>>>>>>
>>>>>>>>>> ##### (1)#####
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # J 94 C2
>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>>>>>>>>> (42 bytes) @ 0x00002b03242dc9c4
>>>>>>>>>> [0x00002b03242dc860+0x164]
>>>>>>>>>>
>>>>>>>>>> #####(2)#####
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # V [libjvm.so+0x68d0f6]
>>>>>>>>>> JavaCallWrapper::JavaCallWrapper(methodHandle,
>>>>>>>>>> Handle, JavaValue*, Thread*)+0xb6
>>>>>>>>>>
>>>>>>>>>> #####(3)#####
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # V [libjvm.so+0x4183bf]
>>>>>>>>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>>>>>>>>
>>>>>>>>>> Any more idea?
>>>>>>>>>>
>>>>>>>>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Gundram,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> fwiw, i cannot reproduce the issue on my box
>>>>>>>>>>>
>>>>>>>>>>> - centos 7
>>>>>>>>>>>
>>>>>>>>>>> - java version "1.8.0_71"
>>>>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>>>>> 1.8.0_71-b15)
>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM (build
>>>>>>>>>>> 25.71-b15, mixed mode)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> i noticed on non zero rank saveMem is
>>>>>>>>>>> allocated at each iteration.
>>>>>>>>>>> ideally, the garbage collector can take care
>>>>>>>>>>> of that and this should not be an issue.
>>>>>>>>>>>
>>>>>>>>>>> would you mind giving the attached file a try ?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Gilles
>>>>>>>>>>>
>>>>>>>>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>>>>>>>>> I will have a look at it today
>>>>>>>>>>>>
>>>>>>>>>>>> how did you configure OpenMPI ?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> Gilles
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday, July 7, 2016, Gundram Leifert
>>>>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Gilles,
>>>>>>>>>>>>
>>>>>>>>>>>> thank you for your hints! I did 3
>>>>>>>>>>>> changes, unfortunately the same error
>>>>>>>>>>>> occurs:
>>>>>>>>>>>>
>>>>>>>>>>>> update ompi:
>>>>>>>>>>>> commit
>>>>>>>>>>>> ae8444682f0a7aa158caea08800542ce9874455e
>>>>>>>>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>>>>>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>>>>>>>>
>>>>>>>>>>>> update java:
>>>>>>>>>>>> java version "1.8.0_92"
>>>>>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>>>>>> 1.8.0_92-b14)
>>>>>>>>>>>> Java HotSpot(TM) Server VM (build
>>>>>>>>>>>> 25.92-b14, mixed mode)
>>>>>>>>>>>>
>>>>>>>>>>>> delete hashcode-lines.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I get this error message - to 100%,
>>>>>>>>>>>> after different number of iterations
>>>>>>>>>>>> (15-300):
>>>>>>>>>>>>
>>>>>>>>>>>> 0/ 3:length = 100000000
>>>>>>>>>>>> 0/ 3:bcast length done (length =
>>>>>>>>>>>> 100000000)
>>>>>>>>>>>> 1/ 3:bcast length done (length =
>>>>>>>>>>>> 100000000)
>>>>>>>>>>>> 2/ 3:bcast length done (length =
>>>>>>>>>>>> 100000000)
>>>>>>>>>>>> #
>>>>>>>>>>>> # A fatal error has been detected by
>>>>>>>>>>>> the Java Runtime Environment:
>>>>>>>>>>>> #
>>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>>> pc=0x00002b3d022fcd24, pid=16578,
>>>>>>>>>>>> tid=0x00002b3d29716700
>>>>>>>>>>>> #
>>>>>>>>>>>> # JRE version: Java(TM) SE Runtime
>>>>>>>>>>>> Environment (8.0_92-b14) (build
>>>>>>>>>>>> 1.8.0_92-b14)
>>>>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>>>>> Server VM (25.92-b14 mixed mode
>>>>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>>> # V [libjvm.so+0x414d24]
>>>>>>>>>>>> ciEnv::get_field_by_index(ciInstanceKlass*,
>>>>>>>>>>>> int)+0x94
>>>>>>>>>>>> #
>>>>>>>>>>>> # Failed to write core dump. Core dumps
>>>>>>>>>>>> have been disabled. To enable core
>>>>>>>>>>>> dumping, try "ulimit -c unlimited"
>>>>>>>>>>>> before starting Java again
>>>>>>>>>>>> #
>>>>>>>>>>>> # An error report file with more
>>>>>>>>>>>> information is saved as:
>>>>>>>>>>>> #
>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>>>>>>>>> #
>>>>>>>>>>>> # Compiler replay data is saved as:
>>>>>>>>>>>> #
>>>>>>>>>>>> /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>>>>>>>>> #
>>>>>>>>>>>> # If you would like to submit a bug
>>>>>>>>>>>> report, please visit:
>>>>>>>>>>>> #
>>>>>>>>>>>> http://bugreport.java.com/bugreport/crash.jsp
>>>>>>>>>>>> #
>>>>>>>>>>>> [titan01:16578] *** Process received
>>>>>>>>>>>> signal ***
>>>>>>>>>>>> [titan01:16578] Signal: Aborted (6)
>>>>>>>>>>>> [titan01:16578] Signal code: (-6)
>>>>>>>>>>>> [titan01:16578] [ 0]
>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>>>>>>>>> [titan01:16578] [ 1]
>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>>>>>>>>> [titan01:16578] [ 2]
>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>>>>>>>>> [titan01:16578] [ 3]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>>>>>>>>> [titan01:16578] [ 4]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>>>>>>>>> [titan01:16578] [ 5]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>>>>>>>>> [titan01:16578] [ 6]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>>>>>>>>> [titan01:16578] [ 7]
>>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>>>>>>>>> [titan01:16578] [ 8]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>>>>>>>>> [titan01:16578] [ 9]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>>>>>>>>> [titan01:16578] [10]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>>>>>>>>> [titan01:16578] [11]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>>>>>>>>> [titan01:16578] [12]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>>>>>> [titan01:16578] [13]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>>>>>> [titan01:16578] [14]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>>>>>> [titan01:16578] [15]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>>>>>> [titan01:16578] [16]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>>>>>> [titan01:16578] [17]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>>>>>> [titan01:16578] [18]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>>>>>> [titan01:16578] [19]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>>>>>> [titan01:16578] [20]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>>>>>> [titan01:16578] [21]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>>>>>> [titan01:16578] [22]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>>>>>>>>> [titan01:16578] [23]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>>>>>>>>> [titan01:16578] [24]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>>>>>>>>> [titan01:16578] [25]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>>>>>>>>> [titan01:16578] [26]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>>>>>>>>> [titan01:16578] [27]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>>>>>>>>> [titan01:16578] [28]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>>>>>>>>> [titan01:16578] [29]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>>>>>>>>> [titan01:16578] *** End of error
>>>>>>>>>>>> message ***
>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>> Primary job terminated normally, but 1
>>>>>>>>>>>> process returned
>>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>>> user-direction, the job has been aborted.
>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun noticed that process rank 2 with
>>>>>>>>>>>> PID 0 on node titan01 exited on signal
>>>>>>>>>>>> 6 (Aborted).
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> I don't know if it is a problem of java
>>>>>>>>>>>> or ompi - but in recent years, java has
>>>>>>>>>>>> worked with no problems on my machine...
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your tips in advance!
>>>>>>>>>>>> Gundram
>>>>>>>>>>>>
>>>>>>>>>>>> On 07/06/2016 03:10 PM, Gilles
>>>>>>>>>>>> Gouaillardet wrote:
>>>>>>>>>>>>> Note a race condition in MPI_Init has
>>>>>>>>>>>>> been fixed yesterday in the master.
>>>>>>>>>>>>> can you please update your OpenMPI and
>>>>>>>>>>>>> try again ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> hopefully the hang will disappear.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you reproduce the crash with a
>>>>>>>>>>>>> simpler (and ideally deterministic)
>>>>>>>>>>>>> version of your program.
>>>>>>>>>>>>> the crash occurs in hashcode, and this
>>>>>>>>>>>>> makes little sense to me. can you also
>>>>>>>>>>>>> update your jdk ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Gilles
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, July 6, 2016, Gundram
>>>>>>>>>>>>> Leifert
>>>>>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks for your response! I think
>>>>>>>>>>>>> it is another problem. I try to
>>>>>>>>>>>>> send 100MB bytes. So there are not
>>>>>>>>>>>>> many tries (between 10 and 30). I
>>>>>>>>>>>>> realized that the execution of
>>>>>>>>>>>>> this code can result 3 different
>>>>>>>>>>>>> errors:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. most often the posted error
>>>>>>>>>>>>> message occures.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. in <10% the cases i have a live
>>>>>>>>>>>>> lock. I can see 3 java-processes,
>>>>>>>>>>>>> one with 200% and two with 100%
>>>>>>>>>>>>> processor utilization. After ~15
>>>>>>>>>>>>> minutes without new system outputs
>>>>>>>>>>>>> this error occurs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [thread 47499823949568 also had an
>>>>>>>>>>>>> error]
>>>>>>>>>>>>> # A fatal error has been detected
>>>>>>>>>>>>> by the Java Runtime Environment:
>>>>>>>>>>>>> #
>>>>>>>>>>>>> # Internal Error
>>>>>>>>>>>>> (safepoint.cpp:317), pid=24256,
>>>>>>>>>>>>> tid=47500347131648
>>>>>>>>>>>>> # guarantee(PageArmed == 0)
>>>>>>>>>>>>> failed: invariant
>>>>>>>>>>>>> #
>>>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>>>>>> Server VM (23.25-b01 mixed mode
>>>>>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>>>>>> dumps have been disabled. To
>>>>>>>>>>>>> enable core dumping, try "ulimit
>>>>>>>>>>>>> -c unlimited" before starting Java
>>>>>>>>>>>>> again
>>>>>>>>>>>>> #
>>>>>>>>>>>>> # An error report file with more
>>>>>>>>>>>>> information is saved as:
>>>>>>>>>>>>> #
>>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>>>>>>>>> #
>>>>>>>>>>>>> # If you would like to submit a
>>>>>>>>>>>>> bug report, please visit:
>>>>>>>>>>>>> #
>>>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>>>> #
>>>>>>>>>>>>> [titan01:24256] *** Process
>>>>>>>>>>>>> received signal ***
>>>>>>>>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>>>>>>>>> [titan01:24256] Signal code: (-6)
>>>>>>>>>>>>> [titan01:24256] [ 0]
>>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>>>>>>>>> [titan01:24256] [ 1]
>>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>>>>>>>>> [titan01:24256] [ 2]
>>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>>>>>>>>> [titan01:24256] [ 3]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>>>>>>>>> [titan01:24256] [ 4]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>>>>>>>>> [titan01:24256] [ 5]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>>>>>>>>> [titan01:24256] [ 6]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>>>>>>>>> [titan01:24256] [ 7]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>>>>>>>>> [titan01:24256] [ 8]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>>>>>>>>> [titan01:24256] [ 9]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>>>>>>>>> [titan01:24256] [10]
>>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>>>>>>>>> [titan01:24256] [11]
>>>>>>>>>>>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>>>>>>>>> [titan01:24256] *** End of error
>>>>>>>>>>>>> message ***
>>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>>> Primary job terminated normally,
>>>>>>>>>>>>> but 1 process returned
>>>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>>>> user-direction, the job has been
>>>>>>>>>>>>> aborted.
>>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun noticed that process rank 0
>>>>>>>>>>>>> with PID 0 on node titan01 exited
>>>>>>>>>>>>> on signal 6 (Aborted).
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. in <10% the cases i have a dead
>>>>>>>>>>>>> lock while MPI.init. This stays
>>>>>>>>>>>>> for more than 15 minutes without
>>>>>>>>>>>>> returning with an error message...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can I enable some debug-flags to
>>>>>>>>>>>>> see what happens on C / OpenMPI side?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance for your help!
>>>>>>>>>>>>> Gundram Leifert
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 07/05/2016 06:05 PM, Jason
>>>>>>>>>>>>> Maldonis wrote:
>>>>>>>>>>>>>> After reading your thread, it looks
>>>>>>>>>>>>>> like it may be related to an
>>>>>>>>>>>>>> issue I had a few weeks ago (I'm
>>>>>>>>>>>>>> a novice though). Maybe my thread
>>>>>>>>>>>>>> will be of help:
>>>>>>>>>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When you say "After a specific
>>>>>>>>>>>>>> number of repetitions the process
>>>>>>>>>>>>>> either hangs up or returns with a
>>>>>>>>>>>>>> SIGSEGV." does you mean that a
>>>>>>>>>>>>>> single call hangs, or that at
>>>>>>>>>>>>>> some point during the for loop a
>>>>>>>>>>>>>> call hangs? If you mean the
>>>>>>>>>>>>>> latter, then it might relate to
>>>>>>>>>>>>>> my issue. Otherwise my thread
>>>>>>>>>>>>>> probably won't be helpful.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jason Maldonis
>>>>>>>>>>>>>> Research Assistant of Professor
>>>>>>>>>>>>>> Paul Voyles
>>>>>>>>>>>>>> Materials Science Grad Student
>>>>>>>>>>>>>> University of Wisconsin, Madison
>>>>>>>>>>>>>> 1509 University Ave, Rm M142
>>>>>>>>>>>>>> Madison, WI 53706
>>>>>>>>>>>>>> ***@wisc.edu
>>>>>>>>>>>>>> 608-295-5532
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 5, 2016 at 9:58 AM,
>>>>>>>>>>>>>> Gundram Leifert
>>>>>>>>>>>>>> <***@uni-rostock.de>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I try to send many
>>>>>>>>>>>>>> byte-arrays via broadcast.
>>>>>>>>>>>>>> After a specific number of
>>>>>>>>>>>>>> repetitions the process
>>>>>>>>>>>>>> either hangs up or returns
>>>>>>>>>>>>>> with a SIGSEGV. Does any one
>>>>>>>>>>>>>> can help me solving the problem:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ########## The code:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> import java.util.Random;
>>>>>>>>>>>>>> import mpi.*;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> public class TestSendBigFiles {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> public static void
>>>>>>>>>>>>>> log(String msg) {
>>>>>>>>>>>>>> try {
>>>>>>>>>>>>>> System.err.println(String.format("%2d/%2d:%s",
>>>>>>>>>>>>>> MPI.COMM_WORLD.getRank(),
>>>>>>>>>>>>>> MPI.COMM_WORLD.getSize(), msg));
>>>>>>>>>>>>>> } catch (MPIException
>>>>>>>>>>>>>> ex) {
>>>>>>>>>>>>>> System.err.println(String.format("%2s/%2s:%s",
>>>>>>>>>>>>>> "?", "?", msg));
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> private static int
>>>>>>>>>>>>>> hashcode(byte[] bytearray) {
>>>>>>>>>>>>>> if (bytearray == null) {
>>>>>>>>>>>>>> return 0;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> int hash = 39;
>>>>>>>>>>>>>> for (int i = 0; i <
>>>>>>>>>>>>>> bytearray.length; i++) {
>>>>>>>>>>>>>> byte b = bytearray[i];
>>>>>>>>>>>>>> hash = hash * 7 + (int) b;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> return hash;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> public static void
>>>>>>>>>>>>>> main(String args[]) throws
>>>>>>>>>>>>>> MPIException {
>>>>>>>>>>>>>> log("start main");
>>>>>>>>>>>>>> MPI.Init(args);
>>>>>>>>>>>>>> try {
>>>>>>>>>>>>>> log("initialized done");
>>>>>>>>>>>>>> byte[] saveMem = new
>>>>>>>>>>>>>> byte[100000000];
>>>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>> Random r = new Random();
>>>>>>>>>>>>>> r.nextBytes(saveMem);
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>> (MPI.COMM_WORLD.getRank() == 0) {
>>>>>>>>>>>>>> for (int i = 0; i < 1000;
>>>>>>>>>>>>>> i++) {
>>>>>>>>>>>>>> saveMem[r.nextInt(saveMem.length)]++;
>>>>>>>>>>>>>> log("i = " + i);
>>>>>>>>>>>>>> int[] lengthData = new
>>>>>>>>>>>>>> int[]{saveMem.length};
>>>>>>>>>>>>>> log("object hash = " +
>>>>>>>>>>>>>> hashcode(saveMem));
>>>>>>>>>>>>>> log("length = " + lengthData[0]);
>>>>>>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>>>>>>> log("bcast length done
>>>>>>>>>>>>>> (length = " + lengthData[0] +
>>>>>>>>>>>>>> ")");
>>>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>>>>>>> lengthData[0], MPI.BYTE, 0);
>>>>>>>>>>>>>> log("bcast data done");
>>>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> MPI.COMM_WORLD.bcast(new
>>>>>>>>>>>>>> int[]{0}, 1, MPI.INT
>>>>>>>>>>>>>> <http://MPI.INT>, 0);
>>>>>>>>>>>>>> } else {
>>>>>>>>>>>>>> while (true) {
>>>>>>>>>>>>>> int[] lengthData = new
>>>>>>>>>>>>>> int[1];
>>>>>>>>>>>>>> MPI.COMM_WORLD.bcast(lengthData,
>>>>>>>>>>>>>> 1, MPI.INT <http://MPI.INT>, 0);
>>>>>>>>>>>>>> log("bcast length done
>>>>>>>>>>>>>> (length = " + lengthData[0] +
>>>>>>>>>>>>>> ")");
>>>>>>>>>>>>>> if (lengthData[0] == 0) {
>>>>>>>>>>>>>> break;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>> saveMem = new
>>>>>>>>>>>>>> byte[lengthData[0]];
>>>>>>>>>>>>>> MPI.COMM_WORLD.bcast(saveMem,
>>>>>>>>>>>>>> saveMem.length, MPI.BYTE, 0);
>>>>>>>>>>>>>> log("bcast data done");
>>>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>> log("object hash = " +
>>>>>>>>>>>>>> hashcode(saveMem));
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>> } catch (MPIException
>>>>>>>>>>>>>> ex) {
>>>>>>>>>>>>>> System.out.println("caugth
>>>>>>>>>>>>>> error." + ex);
>>>>>>>>>>>>>> log(ex.getMessage());
>>>>>>>>>>>>>> } catch
>>>>>>>>>>>>>> (RuntimeException ex) {
>>>>>>>>>>>>>> System.out.println("caugth
>>>>>>>>>>>>>> error." + ex);
>>>>>>>>>>>>>> log(ex.getMessage());
>>>>>>>>>>>>>> } finally {
>>>>>>>>>>>>>> MPI.Finalize();
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ############ The Error (if it
>>>>>>>>>>>>>> does not just hang up):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # A fatal error has been
>>>>>>>>>>>>>> detected by the Java Runtime
>>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>>>>> pc=0x00002b7e9c86e3a1,
>>>>>>>>>>>>>> pid=1172, tid=47822674495232
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # A fatal error has been
>>>>>>>>>>>>>> detected by the Java Runtime
>>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>>>>>> 64-Bit Server VM (23.25-b01
>>>>>>>>>>>>>> mixed mode linux-amd64
>>>>>>>>>>>>>> compressed oops)
>>>>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>>>>> # #
>>>>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>>>>> pc=0x00002af69c0693a1,
>>>>>>>>>>>>>> pid=1173, tid=47238546896640
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>>>>> J
>>>>>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # Failed to write core dump.
>>>>>>>>>>>>>> Core dumps have been
>>>>>>>>>>>>>> disabled. To enable core
>>>>>>>>>>>>>> dumping, try "ulimit -c
>>>>>>>>>>>>>> unlimited" before starting
>>>>>>>>>>>>>> Java again
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>>>>>> 64-Bit Server VM (23.25-b01
>>>>>>>>>>>>>> mixed mode linux-amd64
>>>>>>>>>>>>>> compressed oops)
>>>>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>>>>> # J
>>>>>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # Failed to write core dump.
>>>>>>>>>>>>>> Core dumps have been
>>>>>>>>>>>>>> disabled. To enable core
>>>>>>>>>>>>>> dumping, try "ulimit -c
>>>>>>>>>>>>>> unlimited" before starting
>>>>>>>>>>>>>> Java again
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # An error report file with
>>>>>>>>>>>>>> more information is saved as:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>>>>>>>>> # An error report file with
>>>>>>>>>>>>>> more information is saved as:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # If you would like to submit
>>>>>>>>>>>>>> a bug report, please visit:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # If you would like to submit
>>>>>>>>>>>>>> a bug report, please visit:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> [titan01:01172] *** Process
>>>>>>>>>>>>>> received signal ***
>>>>>>>>>>>>>> [titan01:01172] Signal:
>>>>>>>>>>>>>> Aborted (6)
>>>>>>>>>>>>>> [titan01:01172] Signal code:
>>>>>>>>>>>>>> (-6)
>>>>>>>>>>>>>> [titan01:01173] *** Process
>>>>>>>>>>>>>> received signal ***
>>>>>>>>>>>>>> [titan01:01173] Signal:
>>>>>>>>>>>>>> Aborted (6)
>>>>>>>>>>>>>> [titan01:01173] Signal code:
>>>>>>>>>>>>>> (-6)
>>>>>>>>>>>>>> [titan01:01172] [ 0]
>>>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>>>>>>>>> [titan01:01172] [ 1]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>>>>>>>>> [titan01:01172] [ 2]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>>>>>>>>> [titan01:01172] [ 3]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>>>>>>>>> [titan01:01172] [ 4]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>>>>>>>>> [titan01:01172] [ 5]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>>>>>>>>> [titan01:01172] [ 6]
>>>>>>>>>>>>>> [titan01:01173] [ 0]
>>>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>>>>>>>>> [titan01:01173] [ 1]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>>>>>>>>> [titan01:01172] [ 7]
>>>>>>>>>>>>>> [0x2b7e9c86e3a1]
>>>>>>>>>>>>>> [titan01:01172] *** End of
>>>>>>>>>>>>>> error message ***
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>>>>>>>>> [titan01:01173] [ 2]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>>>>>>>>> [titan01:01173] [ 3]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>>>>>>>>> [titan01:01173] [ 4]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>>>>>>>>> [titan01:01173] [ 5]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>>>>>>>>> [titan01:01173] [ 6]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>>>>>>>>> [titan01:01173] [ 7]
>>>>>>>>>>>>>> [0x2af69c0693a1]
>>>>>>>>>>>>>> [titan01:01173] *** End of
>>>>>>>>>>>>>> error message ***
>>>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>>>> Primary job terminated
>>>>>>>>>>>>>> normally, but 1 process returned
>>>>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>>>>> user-direction, the job has
>>>>>>>>>>>>>> been aborted.
>>>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>> mpirun noticed that process
>>>>>>>>>>>>>> rank 1 with PID 0 on node
>>>>>>>>>>>>>> titan01 exited on signal 6
>>>>>>>>>>>>>> (Aborted).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ########CONFIGURATION:
>>>>>>>>>>>>>> I used the ompi master
>>>>>>>>>>>>>> sources from github:
>>>>>>>>>>>>>> commit
>>>>>>>>>>>>>> 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>>>>>>>>> Author: Gilles Gouaillardet
>>>>>>>>>>>>>> <***@rist.or.jp>
>>>>>>>>>>>>>> Date: Tue Jul 5 13:47:50
>>>>>>>>>>>>>> 2016 +0900
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ./configure --enable-mpi-java
>>>>>>>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>>>>>>>> --disable-dlopen
>>>>>>>>>>>>>> --disable-mca-dso
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot for your help!
>>>>>>>>>>>>>> Gundram
>>>>>>>>>>>>>>
Graham, Nathaniel Richard
2016-09-23 22:01:54 UTC
Permalink
​The WinName test was failing because MPI was never finalized. The window was also not being freed. I have fixed that test and pushed the changes to the ompi-java-test repo.


I was not seeing failures with 2 processes for any of the tests except for WinName, but I did have quite a few fail occasionally when running with 8 processes. I am not sure why that is, but I will open an issue and look into it.
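For reference, here is a minimal sketch of the WinName cleanup described above - free the window and finalize MPI even if a check fails. It assumes the Java bindings expose Win.free() analogously to MPI_Win_free; the helper name is made up and the real test is organized differently.

import mpi.*;

public class WinCleanupSketch {
    static void runWithCleanup(Win win, Runnable checks) throws MPIException {
        try {
            checks.run();       // the actual WinName assertions would run here
        } finally {
            win.free();         // the window was not being freed before the fix
            MPI.Finalize();     // MPI was never finalized before the fix
        }
    }
}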


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Tuesday, September 20, 2016 2:13 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Sorry for the delay...

we applied

./configure --enable-debug --with-psm --enable-mpi-java --with-jdk-dir=/cluster/libraries/java/jdk1.8.0_102/ --prefix=/cluster/mpi/gcc/openmpi/2.0.x_nightly
make -j 8 all
make install

Java-test-suite

export OMPI_MCA_osc=pt2pt

./make_onesided &> make_onesided.out
Output: https://gist.github.com/anonymous/f8c6837b6a6d40c806cec9458dfcc1ab


we still sometimes get the SIGSEGV:

WinAllocate with -np = 2:
Exception in thread "main" Exception in thread "main" mpi.MPIException: MPI_ERR_INTERN: internal errormpi.MPIException: MPI_ERR_INTERN: internal error

at mpi.Win.allocateSharedWin(Native Method) at mpi.Win.allocateSharedWin(Native Method)

at mpi.Win.<init>(Win.java:110) at mpi.Win.<init>(Win.java:110)

at WinAllocate.main(WinAllocate.java:42) at WinAllocate.main(WinAllocate.java:42)


WinName with -np = 2:
mpiexec has exited due to process rank 1 with PID 0 on
node node160 exiting improperly. There are three reasons this could occur:
<CROP>


CCreateInfo and Cput with -np 8:
sometimes end with a SIGSEGV (see https://gist.github.com/anonymous/605c19422fd00bdfc4d1ea0151a1f34c for a detailed view).

I hope this information is helpful...

Best Regards,
Gundram


On 09/14/2016 08:18 PM, Nathan Hjelm wrote:
We have a new high-speed component for RMA in 2.0.x called osc/rdma. Since the component is doing direct rdma on the target we are much more strict about the ranges. osc/pt2pt doesn't bother checking at the moment.

Can you build Open MPI with --enable-debug and add -mca osc_base_verbose 100 to the mpirun command-line? Please upload the output as a gist (https://gist.github.com/) and send a link so we can take a look.

-Nathan
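To make the range check concrete, here is a rough sketch of the kind of compare-and-swap the failing tests perform. The compareAndSwap signature matches the mpijavac output further down; the Win constructor, the fence calls and the buffer helpers are written from the corresponding C API (MPI_Win_create, MPI_Win_fence) and may not match the Java binding exactly, so treat this as an illustration of the range rule rather than working test code.

import java.nio.IntBuffer;
import mpi.*;

public class CasRangeSketch {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();

        IntBuffer window = MPI.newIntBuffer(1);   // room for exactly one int per rank
        IntBuffer next   = MPI.newIntBuffer(1);   next.put(0, rank);
        IntBuffer expect = MPI.newIntBuffer(1);   expect.put(0, 0);
        IntBuffer result = MPI.newIntBuffer(1);

        // 4 bytes of window, displacement unit 4 (one int), as in MPI_Win_create
        Win win = new Win(window, 4, 4, MPI.INFO_NULL, MPI.COMM_WORLD);

        win.fence(0);
        // displacement 0 lies inside the window: accepted by every osc component
        win.compareAndSwap(next, expect, result, MPI.INT, rank, 0);
        // displacement 1 would point one int past the end of the window:
        // osc/rdma rejects it with MPI_ERR_RMA_RANGE, osc/pt2pt does not check
        // win.compareAndSwap(next, expect, result, MPI.INT, rank, 1);
        win.fence(0);

        win.free();
        MPI.Finalize();
    }
}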

On Sep 14, 2016, at 04:26 AM, Gundram Leifert <***@uni-rostock.de> wrote:


In short words: yes, we compiled with mpijavac and mpicc and run with mpirun -np 2.


In long words: we tested the following setups

a) without Java, with mpi 2.0.1 the C-test

[***@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1

[***@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:18460] *** and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

b) without Java with mpi 1.8.8 the C-test

[***@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/1.8.8

[***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
No Errors
[***@titan01 mpi_test2]$

c) with java 1.8.8 with jdk and Java-Testsuite

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
win.compareAndSwap(next, iBuffer, result, MPI.INT, rank, 0);
^
symbol: method compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these java methods are not supported in 1.8.8

d) ompi 2.0.1 and jdk and Testsuite

[***@titan01 ~]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1 2) java/jdk1.8.0_102

[***@titan01 ~]$ cd ompi-java-test/
[***@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[***@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile

[***@titan01 ompi-java-test]$ cd onesided/
[***@titan01 onesided]$ ./make_onesided &> result
cat result:
<crop.....>

=========================== CReqops ===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32155] *** and potentially your MPI job)

<...crop....>

=========================== TestMpiRmaCompareAndSwap ===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32703] *** and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


< ... end crop>


Also if we start the thing in this way, it fails:

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java OmpitestError.java OmpitestProgress.java OmpitestConfig.java

[***@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap

[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:22877] *** and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:

Since you are getting the same errors with C as you are with Java, this is an issue with C, not the Java bindings. However, in the most recent output, you are using ./a.out to run the test. Did you use mpirun to run the test in Java or C?


The command should be something along the lines of:


mpirun -np 2 java TestMpiRmaCompareAndSwap


mpirun -np 2 ./a.out


Also, are you compiling with the ompi wrappers? Should be:


mpijavac TestMpiRmaCompareAndSwap.java


mpicc compare_and_swap.c


In the mean time, I will try to reproduce this on a similar system.


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hey,


it seems to be a problem of ompi 2.x. The C version with 2.0.1 also produces this output:

(the same build from sources or the 2.0.1 release)


[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] *** and potentially your MPI job)


But the test works for 1.8.x! In fact our cluster does not have shared memory - so it has to fall back to the default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:

Hello Gundram,


It looks like the test that is failing is TestMpiRmaCompareAndSwap.java. Is that the one that is crashing? If so, could you try to run the C test from:


http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1


There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).


This should let us know whether its an issue related to Java or not.


If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations from the C test).


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hello,

I still have the same errors on our cluster - even one more. Maybe the new one helps us to find a solution.

I have this error if I run "make_onesided" of the ompi-java-test repo.

CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] *** and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Sometimes I also have the SIGSEGV error.

System:

compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP

Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016

infiniband

openib: OpenSM 3.3.19


limits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilles, Howard,

I configured without disable dlopen - same error.

I tested these classes on another cluster and: IT WORKS!

So it is a problem of the cluster configuration. Thank you all very much for all your help! When the admin solves the problem, I will let you know what he has changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
the JVM sets its own signal handlers, and it is important openmpi does not override them.
this is what previously happened with PSM (infinipath) but this has been solved since.
you might be linking with a third party library that hijacks signal handlers and cause the crash
(which would explain why I cannot reproduce the issue)

the master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
You made the best of it... thanks a lot!

Without MPI it runs.
Just adding MPI.init() causes the crash!

maybe I installed something wrong...

install newest automake, autoconf, m4, libtoolize in right order and same prefix
check out ompi,
autogen
configure with same prefix, pointing to the same jdk, I later use
make
make install

I will test some different configurations of ./configure...


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar'
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird your program is still crashing. another test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1

Cheers,

Gilles
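A minimal sketch of that isolation test, to be run with "mpirun -np 1 java TestInitOnly": no MPI_Bcast, no MPI_Barrier, only Init/Finalize plus a JIT-heavy loop similar to hashcode(). The class name and the dummy loop are made up for illustration.

import java.util.Random;
import mpi.*;

public class TestInitOnly {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        byte[] saveMem = new byte[100000000];
        new Random().nextBytes(saveMem);
        long sum = 0;
        for (int i = 0; i < 1000; i++) {
            for (byte b : saveMem) {        // hot loop standing in for hashcode()
                sum += b;
            }
        }
        System.out.println(MPI.COMM_WORLD.getRank() + ": checksum = " + sum);
        MPI.Finalize();
    }
}

If this still crashes, the collectives are not the trigger and the problem sits between MPI_Init and the JVM.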

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
In any cases the same error.
this is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


also for 1 or two cores, the process crashes.


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

another test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3


core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256564
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:

Gundram,


can you please provide more information on your environment :

- configure command line

- OS

- memory available

- ulimit -a

- number of nodes

- number of tasks used

- interconnect used (if any)

- batch manager (if any)


Cheers,


Gilles

On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory because when I delete the hash-call, the program runs sometimes more than 9k iterations.
When it crashes, there are different lines (see (2) and (3)). The crashes also occur on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V [libjvm.so+0x4183bf] ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more ideas?
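One way to test the suspicion that Java and MPI step on the same memory would be to keep the payload outside the garbage-collected heap in a direct buffer. Below is a rough sketch of the root rank's side, assuming MPI.newByteBuffer is available as in the Open MPI Java binding examples; this is only a diagnostic idea, not something verified in this thread.

import java.nio.ByteBuffer;
import java.util.Random;
import mpi.*;

public class DirectBufferBcastSketch {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int length = 100000000;
        ByteBuffer saveMem = MPI.newByteBuffer(length);   // direct buffer, off the Java heap
        if (MPI.COMM_WORLD.getRank() == 0) {
            byte[] tmp = new byte[length];
            new Random().nextBytes(tmp);
            saveMem.put(tmp);
            saveMem.clear();                              // reset position, keep contents
        }
        for (int i = 0; i < 1000; i++) {
            MPI.COMM_WORLD.bcast(saveMem, length, MPI.BYTE, 0);
            MPI.COMM_WORLD.barrier();
        }
        MPI.Finalize();
    }
}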

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:

Gundram,


fwiw, i cannot reproduce the issue on my box

- centos 7

- java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


i noticed on non zero rank saveMem is allocated at each iteration.
ideally, the garbage collector can take care of that and this should not be an issue.

would you mind giving the attached file a try ?
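The attached file itself is not preserved in the archive, but a hedged guess at the kind of change it makes on the non-zero ranks - keep one receive buffer and reallocate only when the announced length changes - looks roughly like this:

import mpi.*;

public class ReceiveLoopSketch {
    static void receiveLoop() throws MPIException {
        byte[] saveMem = new byte[0];
        while (true) {
            int[] lengthData = new int[1];
            MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
            if (lengthData[0] == 0) {
                break;                                 // rank 0 signals the end
            }
            MPI.COMM_WORLD.barrier();
            if (saveMem.length != lengthData[0]) {
                saveMem = new byte[lengthData[0]];     // allocate only when the size changes
            }
            MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
            MPI.COMM_WORLD.barrier();
        }
    }
}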

Cheers,

Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today

how did you configure OpenMPI ?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello Gilles,

thank you for your hints! I did 3 changes, unfortunately the same error occurs:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org>
Date: Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete hashcode-lines.

Now I get this error message - to 100%, after different number of iterations (15-300):

0/ 3:length = 100000000
0/ 3:bcast length done (length = 100000000)
1/ 3:bcast length done (length = 100000000)
2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code: (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know whether it is a problem of Java or Open MPI - but for the last few years Java has worked without problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note that a race condition in MPI_Init was fixed yesterday in the master.
Can you please update your Open MPI and try again?

Hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program?
The crash occurs in hashcode, which makes little sense to me. Can you also update your JDK?
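A sketch of the update steps (reusing the configure options that appear elsewhere in this thread; the JDK path is a placeholder for whichever JDK you end up using):

git clone https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
./configure --enable-mpi-java --with-jdk-dir=<path-to-jdk> --disable-dlopen --disable-mca-dso
make -j 8 all
make install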

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Jason,

thanks for your response! I think it is another problem. I try to send 100 MB byte arrays, so there are not many repetitions (between 10 and 30). I realized that executing this code can result in 3 different errors:

1. Most often the posted error message occurs.

2. In <10% of the cases I get a livelock. I can see 3 Java processes, one at 200% and two at 100% CPU utilization. After ~15 minutes without new system output, this error occurs.


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code: (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


3. In <10% of the cases I get a deadlock during MPI.Init. It stays there for more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / Open MPI side?

Thanks in advance for your help!
Gundram Leifert


On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help: https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV." does you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, then it might relate to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu<mailto:***@wisc.edu>
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello,

I try to send many byte-arrays via broadcast. After a specific number of repetitions the process either hangs up or returns with a SIGSEGV. Does any one can help me solving the problem:

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

public static void log(String msg) {
try {
System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
} catch (MPIException ex) {
System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
}
}

private static int hashcode(byte[] bytearray) {
if (bytearray == null) {
return 0;
}
int hash = 39;
for (int i = 0; i < bytearray.length; i++) {
byte b = bytearray[i];
hash = hash * 7 + (int) b;
}
return hash;
}

public static void main(String args[]) throws MPIException {
log("start main");
MPI.Init(args);
try {
log("initialized done");
byte[] saveMem = new byte[100000000];
MPI.COMM_WORLD.barrier();
Random r = new Random();
r.nextBytes(saveMem);
if (MPI.COMM_WORLD.getRank() == 0) {
for (int i = 0; i < 1000; i++) {
saveMem[r.nextInt(saveMem.length)]++;
log("i = " + i);
int[] lengthData = new int[]{saveMem.length};
log("object hash = " + hashcode(saveMem));
log("length = " + lengthData[0]);
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
MPI.COMM_WORLD.barrier();
MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
}
MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
} else {
while (true) {
int[] lengthData = new int[1];
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
if (lengthData[0] == 0) {
break;
}
MPI.COMM_WORLD.barrier();
saveMem = new byte[lengthData[0]];
MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
log("object hash = " + hashcode(saveMem));
}
}
MPI.COMM_WORLD.barrier();
} catch (MPIException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} catch (RuntimeException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} finally {
MPI.Finalize();
}

}

}


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# #
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code: (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code: (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########CONFIGURATION:
I used the ompi master sources from github:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp<mailto:***@rist.or.jp>>
Date: Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram

Graham, Nathaniel Richard
2016-09-23 22:06:20 UTC
Permalink
For reference, the issue can be tracked at: https://github.com/open-mpi/ompi/issues/2116


-Nathan

--

Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Tuesday, September 20, 2016 2:13 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Sorry for the delay...

we applied

./configure --enable-debug --with-psm --enable-mpi-java --with-jdk-dir=/cluster/libraries/java/jdk1.8.0_102/ --prefix=/cluster/mpi/gcc/openmpi/2.0.x_nightly
make -j 8 all
make install

Java-test-suite

export OMPI_MCA_osc=pt2pt

./make_onesided &> make_onesided.out
Output: https://gist.github.com/anonymous/f8c6837b6a6d40c806cec9458dfcc1ab


we still sometimes get the SIGSEGV:

WinAllocate with -np = 2:
Exception in thread "main" Exception in thread "main" mpi.MPIException: MPI_ERR_INTERN: internal errormpi.MPIException: MPI_ERR_INTERN: internal error

at mpi.Win.allocateSharedWin(Native Method) at mpi.Win.allocateSharedWin(Native Method)

at mpi.Win.<init>(Win.java:110) at mpi.Win.<init>(Win.java:110)

at WinAllocate.main(WinAllocate.java:42) at WinAllocate.main(WinAllocate.java:42)


WinName with -np = 2:
mpiexec has exited due to process rank 1 with PID 0 on
node node160 exiting improperly. There are three reasons this could occur:
<CROP>


CCreateInfo and Cput with -np 8:
sometimes end with a SIGSEGV (see https://gist.github.com/anonymous/605c19422fd00bdfc4d1ea0151a1f34c for a detailed view).

I hope this information is helpful...

Best Regards,
Gundram


On 09/14/2016 08:18 PM, Nathan Hjelm wrote:
We have a new high-speed component for RMA in 2.0.x called osc/rdma. Since the component is doing direct rdma on the target we are much more strict about the ranges. osc/pt2pt doesn't bother checking at the moment.
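For reference, a quick way to fall back to the older osc/pt2pt component while debugging is to select it explicitly - a sketch, assuming the standard MCA parameter syntax (the same thing the OMPI_MCA_osc=pt2pt export above does):

mpirun -np 2 --mca osc pt2pt java TestMpiRmaCompareAndSwap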

Can you build Open MPI with --enable-debug and add -mca osc_base_verbose 100 to the mpirun command-line? Please upload the output as a gist (https://gist.github.com/) and send a link so we can take a look.
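For example, something along these lines (a sketch that combines the configure flags already used in this thread with the requested verbosity; the output file name is just an illustration):

./configure --enable-debug --with-psm --enable-mpi-java --with-jdk-dir=/cluster/libraries/java/jdk1.8.0_102/
make -j 8 all && make install
mpirun -np 2 --mca osc_base_verbose 100 java TestMpiRmaCompareAndSwap 2>&1 | tee osc_verbose.out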

-Nathan

On Sep 14, 2016, at 04:26 AM, Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de> wrote:


In short: yes, we compiled with mpijavac and mpicc and ran with mpirun -np 2.


In more detail: we tested the following setups

a) without Java, with mpi 2.0.1 the C-test

[***@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1

[***@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:18460] *** and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

b) without Java with mpi 1.8.8 the C-test

[***@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/1.8.8

[***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
No Errors
[***@titan01 mpi_test2]$

c) with ompi 1.8.8, the jdk and the Java test suite

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
win.compareAndSwap(next, iBuffer, result, MPI.INT, rank, 0);
^
symbol: method compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these java methods are not supported in 1.8.8

d) ompi 2.0.1 and jdk and Testsuite

[***@titan01 ~]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1 2) java/jdk1.8.0_102

[***@titan01 ~]$ cd ompi-java-test/
[***@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[***@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile

[***@titan01 ompi-java-test]$ cd onesided/
[***@titan01 onesided]$ ./make_onesided &> result
cat result:
<crop.....>

=========================== CReqops ===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32155] *** and potentially your MPI job)

<...crop....>

=========================== TestMpiRmaCompareAndSwap ===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32703] *** and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


< ... end crop>


It also fails if we start it this way:

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java OmpitestError.java OmpitestProgress.java OmpitestConfig.java

[***@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap

[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:22877] *** and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:

Since you are getting the same errors with C as you are with Java, this is an issue with C, not the Java bindings. However, in the most recent output, you are using ./a.out to run the test. Did you use mpirun to run the test in Java or C?


The command should be something along the lines of:


mpirun -np 2 java TestMpiRmaCompareAndSwap


mpirun -np 2 ./a.out


Also, are you compiling with the ompi wrappers? Should be:


mpijavac TestMpiRmaCompareAndSwap.java


mpicc compare_and_swap.c


In the meantime, I will try to reproduce this on a similar system.


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org><mailto:users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: ***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hey,


It seems to be a problem of ompi 2.x. The C version with 2.0.1 also produces this output:

(the same build from sources or from the 2.0.1 release)


[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] *** and potentially your MPI job)


But the test works with 1.8.x! In fact our cluster does not have shared memory, so it has to use the wrapper to the default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:

Hello Gundram,


It looks like the test that is failing is TestMpiRmaCompareAndSwap.java. Is that the one that is crashing? If so, could you try to run the C test from:


http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1


There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).
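A sketch of the compile-and-run steps with the Open MPI wrappers, assuming the test and its headers have been copied into one directory:

mpicc compare_and_swap.c -o compare_and_swap
mpirun -np 2 ./compare_and_swap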


This should let us know whether it's an issue related to Java or not.


If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations of the C tests).


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org><mailto:users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org<mailto:***@lists.open-mpi.org>
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hello,

I still have the same errors on our cluster - plus one new one. Maybe the new one helps us find a solution.

I get this error when I run "make_onesided" from the ompi-java-test repo.

CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] *** and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Sometimes I also have the SIGSEGV error.

System:

compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org<mailto:***@kbuilder.dev.centos.org>) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP

Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016

InfiniBand

openib: OpenSM 3.3.19


limits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilley, Howard,

I configured without the --disable-dlopen option - same error.

I tested these classes on another cluster and: IT WORKS!

So it is a problem of the cluster configuration. Thank you all very much for all your help! When the admin has solved the problem, I will let you know what he changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :
the JVM sets its own signal handlers, and it is important that Open MPI does not override them.
this is what previously happened with PSM (InfiniPath), but this has been solved since.
you might be linking with a third-party library that hijacks signal handlers and causes the crash
(which would explain why I cannot reproduce the issue)

the master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
You made the best of it... thanks a lot!

Without MPI it runs.
Just adding MPI.Init() causes the crash!

maybe I installed something wrong...

install the newest automake, autoconf, m4, libtoolize in the right order and with the same prefix
check out ompi,
autogen
configure with the same prefix, pointing to the same JDK I use later
make
make install

I will test some different configurations of ./configure...


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar'
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird that your program is still crashing. Another test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1
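Something along these lines would do as a stripped-down test (a minimal sketch, not a file from this thread):

import java.util.Random;
import mpi.*;

public class TestInitOnly {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        // allocate and touch the big buffer, but skip all collectives
        byte[] saveMem = new byte[100000000];
        new Random().nextBytes(saveMem);
        MPI.Finalize();
    }
}

If this still crashes with -np 1, the collectives are not the cause.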

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
In all cases the same error.
this is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


The process also crashes with 1 or 2 cores.


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

another test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3


core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256564
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:

Gundram,


can you please provide more information on your environment :

- configure command line

- OS

- memory available

- ulimit -a

- number of nodes

- number of tasks used

- interconnect used (if any)

- batch manager (if any)


Cheers,


Gilles

On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory, because when I delete the hash call, the program sometimes runs for more than 9k iterations.
When it crashes, the problematic frame differs (see (2) and (3)). The crashes also occur on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V [libjvm.so+0x4183bf] ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more ideas?

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:

Gundram,


fwiw, i cannot reproduce the issue on my box

- centos 7

- java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


i noticed that on non-zero ranks saveMem is allocated at each iteration.
ideally, the garbage collector can take care of that and this should not be an issue.

would you mind giving the attached file a try ?
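The idea would look roughly like this on the non-zero ranks (a sketch of the change, not the attachment itself - reuse one buffer instead of allocating a new one every iteration):

byte[] saveMem = new byte[100000000];
while (true) {
    int[] lengthData = new int[1];
    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
    if (lengthData[0] == 0) {
        break;
    }
    MPI.COMM_WORLD.barrier();
    // reuse saveMem rather than "saveMem = new byte[lengthData[0]]" on every pass
    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
    MPI.COMM_WORLD.barrier();
}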

Cheers,

Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today

how did you configure OpenMPI ?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Giles,

thank you for your hints! I made 3 changes; unfortunately the same error occurs:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org><mailto:***@open-mpi.org>
Date: Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete hashcode-lines.

Now I get this error message 100% of the time, after a varying number of iterations (15-300):

0/ 3:length = 100000000
0/ 3:bcast length done (length = 100000000)
1/ 3:bcast length done (length = 100000000)
2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code: (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know whether it is a problem of Java or Open MPI - but for the last few years Java has worked without problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note that a race condition in MPI_Init was fixed yesterday in the master.
Can you please update your Open MPI and try again?

Hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program?
The crash occurs in hashcode, which makes little sense to me. Can you also update your JDK?

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello Jason,

thanks for your response! I think it is another problem. I try to send 100 MB byte arrays, so there are not many repetitions (between 10 and 30). I realized that executing this code can result in 3 different errors:

1. Most often the posted error message occurs.

2. In <10% of the cases I get a livelock. I can see 3 Java processes, one at 200% and two at 100% CPU utilization. After ~15 minutes without new system output, this error occurs.


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code: (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


3. In <10% of the cases I get a deadlock during MPI.Init. It stays there for more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / Open MPI side?

Thanks in advance for your help!
Gundram Leifert


On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help: https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV." does you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, then it might relate to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu<mailto:***@wisc.edu>
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de<mailto:***@uni-rostock.de>> wrote:
Hello,

I try to send many byte-arrays via broadcast. After a specific number of repetitions the process either hangs up or returns with a SIGSEGV. Does any one can help me solving the problem:

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

public static void log(String msg) {
try {
System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
} catch (MPIException ex) {
System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
}
}

private static int hashcode(byte[] bytearray) {
if (bytearray == null) {
return 0;
}
int hash = 39;
for (int i = 0; i < bytearray.length; i++) {
byte b = bytearray[i];
hash = hash * 7 + (int) b;
}
return hash;
}

public static void main(String args[]) throws MPIException {
log("start main");
MPI.Init(args);
try {
log("initialized done");
byte[] saveMem = new byte[100000000];
MPI.COMM_WORLD.barrier();
Random r = new Random();
r.nextBytes(saveMem);
if (MPI.COMM_WORLD.getRank() == 0) {
for (int i = 0; i < 1000; i++) {
saveMem[r.nextInt(saveMem.length)]++;
log("i = " + i);
int[] lengthData = new int[]{saveMem.length};
log("object hash = " + hashcode(saveMem));
log("length = " + lengthData[0]);
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
MPI.COMM_WORLD.barrier();
MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
}
MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
} else {
while (true) {
int[] lengthData = new int[1];
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
if (lengthData[0] == 0) {
break;
}
MPI.COMM_WORLD.barrier();
saveMem = new byte[lengthData[0]];
MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
log("object hash = " + hashcode(saveMem));
}
}
MPI.COMM_WORLD.barrier();
} catch (MPIException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} catch (RuntimeException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} finally {
MPI.Finalize();
}

}

}


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# #
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code: (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code: (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########CONFIGURATION:
I used the ompi master sources from github:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp<mailto:***@rist.or.jp>>
Date: Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram

Graham, Nathaniel Richard
2016-11-15 23:26:22 UTC
Permalink
Hello Gundram,


This seems to be an issue with psm. I have been communicating with Matias at Intel, who has forwarded the problem to the correct team so they can investigate it. There is a new version of psm available at the link below. It will fix the issue you are currently seeing; however, there are other issues you will most likely run into (this is what the team at Intel will be looking into).


https://github.com/01org/psm


I'd recommend updating your psm and seeing whether you get the issues I was seeing (hangs and crashes). If so you will probably need to disable psm until a fix is implemented.
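A sketch of one way to disable it, assuming PSM is picked up through the psm MTL component on your cluster:

export OMPI_MCA_mtl=^psm
# or per run:
mpirun -np 2 --mca mtl ^psm java TestMpiRmaCompareAndSwap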


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Tuesday, September 20, 2016 2:13 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Sorry for the delay...

we applied

./configure --enable-debug --with-psm --enable-mpi-java --with-jdk-dir=/cluster/libraries/java/jdk1.8.0_102/ --prefix=/cluster/mpi/gcc/openmpi/2.0.x_nightly
make -j 8 all
make install

Java-test-suite

export OMPI_MCA_osc=pt2pt

./make_onesided &> make_onesided.out
Output: https://gist.github.com/anonymous/f8c6837b6a6d40c806cec9458dfcc1ab


we still sometimes get the SIGSEGV:

WinAllocate with -np = 2:
Exception in thread "main" Exception in thread "main" mpi.MPIException: MPI_ERR_INTERN: internal errormpi.MPIException: MPI_ERR_INTERN: internal error

at mpi.Win.allocateSharedWin(Native Method) at mpi.Win.allocateSharedWin(Native Method)

at mpi.Win.<init>(Win.java:110) at mpi.Win.<init>(Win.java:110)

at WinAllocate.main(WinAllocate.java:42) at WinAllocate.main(WinAllocate.java:42)


WinName with -np = 2:
mpiexec has exited due to process rank 1 with PID 0 on
node node160 exiting improperly. There are three reasons this could occur:
<CROP>


CCreateInfo and Cput with -np 8:
sometimes end with a SIGSEGV (see https://gist.github.com/anonymous/605c19422fd00bdfc4d1ea0151a1f34c for a detailed view).

I hope this information is helpful...

Best Regards,
Gundram


On 09/14/2016 08:18 PM, Nathan Hjelm wrote:
We have a new high-speed component for RMA in 2.0.x called osc/rdma. Since the component is doing direct rdma on the target we are much more strict about the ranges. osc/pt2pt doesn't bother checking at the moment.

Can you build Open MPI with --enable-debug and add -mca osc_base_verbose 100 to the mpirun command-line? Please upload the output as a gist (https://gist.github.com/) and send a link so we can take a look.

-Nathan

On Sep 14, 2016, at 04:26 AM, Gundram Leifert <***@uni-rostock.de><mailto:***@uni-rostock.de> wrote:


In short words: yes, we compiled with mpijavac and mpicc and run with mpirun -np 2.


In long words: we tested the following setups

a) without Java, with mpi 2.0.1 the C-test

[***@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1

[***@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:18460] *** and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

b) without Java with mpi 1.8.8 the C-test

[***@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/1.8.8

[***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
No Errors
[***@titan01 mpi_test2]$

c) with java 1.8.8 with jdk and Java-Testsuite

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
win.compareAndSwap(next, iBuffer, result, MPI.INT, rank, 0);
^
symbol: method compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these java methods are not supported in 1.8.8

d) ompi 2.0.1 and jdk and Testsuite

[***@titan01 ~]$ module list
Currently Loaded Modulefiles:
1) openmpi/gcc/2.0.1 2) java/jdk1.8.0_102

[***@titan01 ~]$ cd ompi-java-test/
[***@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[***@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile

[***@titan01 ompi-java-test]$ cd onesided/
[***@titan01 onesided]$ ./make_onesided &> result
cat result:
<crop.....>

=========================== CReqops ===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32155] *** and potentially your MPI job)

<...crop....>

=========================== TestMpiRmaCompareAndSwap ===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32703] *** and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


< ... end crop>


It also fails if we start it this way:

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java OmpitestError.java OmpitestProgress.java OmpitestConfig.java

[***@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap

[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:22877] *** and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:

Since you are getting the same errors with C as you are with Java, this is an issue with C, not the Java bindings. However, in the most recent output, you are using ./a.out to run the test. Did you use mpirun to run the test in Java or C?


The command should be something along the lines of:


mpirun -np 2 java TestMpiRmaCompareAndSwap


mpirun -np 2 ./a.out


Also, are you compiling with the ompi wrappers? Should be:


mpijavac TestMpiRmaCompareAndSwap.java


mpicc compare_and_swap.c


In the meantime, I will try to reproduce this on a similar system.


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hey,


it seems to be a problem of ompi 2.x. The C version with 2.0.1 also produces this output:

(the same whether built from sources or from the 2.0.1 release)


[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] *** and potentially your MPI job)


But the test works with 1.8.x! In fact our cluster does not have shared memory, so it has to use the wrapper to the default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:

Hello Gundram,


It looks like the test that is failing is TestMpiRmaCompareAndSwap.java. Is that the one that is crashing? If so, could you try to run the C test from:


http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1


There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).


This should let us know whether it's an issue related to Java or not.


If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations from the C tests).


-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
________________________________
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV


Hello,

I still have the same errors on our cluster - and one new one. Maybe the new one helps us find a solution.

I have this error if I run "make_onesided" of the ompi-java-test repo.

CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] *** and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Sometimes I also have the SIGSEGV error.

System:

compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP

Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016

InfiniBand

openib: OpenSM 3.3.19


limits:

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256554
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilles, Howard,

I configured without --disable-dlopen - same error.

I tested these classes on another cluster and: IT WORKS!

So it is a problem of the cluster configuration. Thank you all very much for all your help! When the admin has solved the problem, I will let you know what he changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

On Friday, 8 July 2016, Gilles Gouaillardet wrote:
the JVM sets its own signal handlers, and it is important openmpi does not override them.
this is what previously happened with PSM (infinipath) but this has been solved since.
you might be linking with a third party library that hijacks signal handlers and causes the crash
(which would explain why I cannot reproduce the issue)
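One way to test that theory - an assumption here, not something verified in this thread - is to preload HotSpot's signal-chaining library libjsig.so into every rank; the path and mpirun's -x export of the variable are assumptions, adjust to your JDK:

mpirun -np 3 -x LD_PRELOAD=$JAVA_HOME/jre/lib/amd64/libjsig.so java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2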

the master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
You made the best of it... thanks a lot!

Without MPI it runs.
Just adding MPI.Init() causes the crash!

maybe I installed something wrong...

install newest automake, autoconf, m4, libtoolize in the right order and with the same prefix
check out ompi,
autogen
configure with the same prefix, pointing to the same jdk I later use
make
make install

I will test some different configurations of ./configure...


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar'
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird your program is still crashing. Another test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
In all cases the same error.
this is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


Also for one or two cores, the process crashes.


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

another test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3


core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256564
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:

Gundram,


can you please provide more information on your environment :

- configure command line

- OS

- memory available

- ulimit -a

- number of nodes

- number of tasks used

- interconnect used (if any)

- batch manager (if any)


Cheers,


Gilles

On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory, because when I delete the hash call, the program sometimes runs more than 9k iterations.
When it crashes, there are different lines (see (2) and (3)). The crashes also occur on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V [libjvm.so+0x68d0f6] JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V [libjvm.so+0x4183bf] ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more ideas?

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:

Gundram,


FWIW, I cannot reproduce the issue on my box

- centos 7

- java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


I noticed that on non-zero ranks, saveMem is allocated at each iteration.
Ideally, the garbage collector can take care of that and this should not be an issue.
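For illustration, a hedged sketch of a receiver loop that reuses one buffer instead of allocating a new one every iteration (a sketch only, not the attached file; it assumes the broadcast length stays constant, as in the test):

byte[] saveMem = null;
while (true) {
    int[] lengthData = new int[1];
    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
    if (lengthData[0] == 0) {
        break;
    }
    // allocate only when the announced size actually changes
    if (saveMem == null || saveMem.length != lengthData[0]) {
        saveMem = new byte[lengthData[0]];
    }
    MPI.COMM_WORLD.barrier();
    MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
    MPI.COMM_WORLD.barrier();
}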

would you mind giving the attached file a try ?

Cheers,

Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today

how did you configure OpenMPI ?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello Gilles,

thank you for your hints! I made 3 changes; unfortunately the same error occurs:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org>
Date: Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete hashcode-lines.

Now I get this error message 100% of the time, after a varying number of iterations (15-300):

0/ 3:length = 100000000
0/ 3:bcast length done (length = 100000000)
1/ 3:bcast length done (length = 100000000)
2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x414d24] ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code: (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know if it is a problem of Java or ompi - but in past years, Java worked with no problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note: a race condition in MPI_Init was fixed yesterday in master.
can you please update your OpenMPI and try again ?

hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program?
The crash occurs in hashcode, which makes little sense to me. Can you also update your JDK?
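For illustration, a minimal deterministic variant could look like this (a sketch only: fixed seed, fixed size, no hashcode call; the class name is made up):

import java.util.Random;
import mpi.*;

public class TestBcastMinimal {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        byte[] buf = new byte[100000000];
        new Random(42).nextBytes(buf);          // same content on every run
        for (int i = 0; i < 100; i++) {
            MPI.COMM_WORLD.bcast(buf, buf.length, MPI.BYTE, 0);
            MPI.COMM_WORLD.barrier();
        }
        MPI.Finalize();
    }
}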

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello Jason,

thanks for your response! I think it is another problem. I try to send 100 MB of bytes, so there are not many repetitions (between 10 and 30). I realized that the execution of this code can result in 3 different errors:

1. Most often the posted error message occurs.

2. In <10% of the cases I have a livelock. I can see 3 Java processes, one with 200% and two with 100% processor utilization. After ~15 minutes without new output, this error occurs.


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code: (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


3. In <10% of the cases I have a deadlock during MPI.Init. It stays for more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / OpenMPI side?

Thanks in advance for your help!
Gundram Leifert


On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread, it looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help: https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV." does you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, then it might relate to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

I try to send many byte-arrays via broadcast. After a specific number of repetitions the process either hangs up or returns with a SIGSEGV. Does any one can help me solving the problem:

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

public static void log(String msg) {
try {
System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
} catch (MPIException ex) {
System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
}
}

private static int hashcode(byte[] bytearray) {
if (bytearray == null) {
return 0;
}
int hash = 39;
for (int i = 0; i < bytearray.length; i++) {
byte b = bytearray[i];
hash = hash * 7 + (int) b;
}
return hash;
}

public static void main(String args[]) throws MPIException {
log("start main");
MPI.Init(args);
try {
log("initialized done");
byte[] saveMem = new byte[100000000];
MPI.COMM_WORLD.barrier();
Random r = new Random();
r.nextBytes(saveMem);
if (MPI.COMM_WORLD.getRank() == 0) {
for (int i = 0; i < 1000; i++) {
saveMem[r.nextInt(saveMem.length)]++;
log("i = " + i);
int[] lengthData = new int[]{saveMem.length};
log("object hash = " + hashcode(saveMem));
log("length = " + lengthData[0]);
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
MPI.COMM_WORLD.barrier();
MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
}
MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
} else {
while (true) {
int[] lengthData = new int[1];
MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
log("bcast length done (length = " + lengthData[0] + ")");
if (lengthData[0] == 0) {
break;
}
MPI.COMM_WORLD.barrier();
saveMem = new byte[lengthData[0]];
MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
log("bcast data done");
MPI.COMM_WORLD.barrier();
log("object hash = " + hashcode(saveMem));
}
}
MPI.COMM_WORLD.barrier();
} catch (MPIException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} catch (RuntimeException ex) {
System.out.println("caugth error." + ex);
log(ex.getMessage());
} finally {
MPI.Finalize();
}

}

}


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# #
# SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code: (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code: (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########CONFIGURATION:
I used the ompi master sources from github:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp>
Date: Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram

Nathan Hjelm
2016-09-14 19:20:16 UTC
Permalink
This error was the result of a typo which caused an incorrect range check when the compare-and-swap was on a memory region less than 8 bytes away from the end of the window. We never caught this because in general no apps create a window as small as that MPICH test (4 bytes). We are adding the test to our nightly suite now.

-Nathan

On Sep 14, 2016, at 01:04 PM, "Graham, Nathaniel Richard" <***@lanl.gov> wrote:

Thanks for reporting this! There are a number of things going on here.

It seems there may be a problem with the Java bindings checked by CReqops.java because the C test passes. I'll take a look at that. The issue can be found at: https://github.com/open-mpi/ompi/issues/2081

MPI_Compare_and_swap is failing on master, and therefore on the release branches. You can get around the issue for now by doing: export OMPI_MCA_osc=pt2pt
I submitted an issue to track it at: https://github.com/open-mpi/ompi/issues/2080
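As a usage example, either form below should select the pt2pt one-sided component for a single run (the --mca spelling on the mpirun line is assumed to be equivalent to setting the environment variable):

export OMPI_MCA_osc=pt2pt
mpirun -np 2 java TestMpiRmaCompareAndSwap

or

mpirun -np 2 --mca osc pt2pt java TestMpiRmaCompareAndSwap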

These tests exercise code I added last summer that did not make it into 1.8. I know it's all in the 2.0 series though.

-Nathan



--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Wednesday, September 14, 2016 4:02 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
 
In short words: yes, we compiled with mpijavac and mpicc and run with mpirun -np 2.

In long words: we tested the following setups

a) without Java, with mpi 2.0.1 the C-test

[***@titan01 mpi_test]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/2.0.1

[***@titan01 mpi_test]$ mpirun -np 2 ./a.out
[titan01:18460] *** An error occurred in MPI_Compare_and_swap
[titan01:18460] *** reported by process [3535667201,1]
[titan01:18460] *** on win rdma window 3
[titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:18460] ***    and potentially your MPI job)
[titan01.service:18454] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:18454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

b) without Java with mpi 1.8.8 the C-test

[***@titan01 mpi_test2]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/1.8.8

[***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
 No Errors
[***@titan01 mpi_test2]$

c) with java 1.8.8 with jdk and Java-Testsuite

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
                        win.compareAndSwap(next, iBuffer, result, MPI.INT, rank, 0);
                           ^
  symbol:   method compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
  location: variable win of type Win
TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol

>> these java methods are not supported in 1.8.8

d) ompi 2.0.1 and jdk and Testsuite

[***@titan01 ~]$ module list
Currently Loaded Modulefiles:
  1) openmpi/gcc/2.0.1   2) java/jdk1.8.0_102

[***@titan01 ~]$ cd ompi-java-test/
[***@titan01 ompi-java-test]$ ./autogen.sh
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force
autoreconf: configure.ac: tracing
autoreconf: configure.ac: not using Libtool
autoreconf: running: /usr/bin/autoconf --force
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --force-missing
autoreconf: Leaving directory `.'
[***@titan01 ompi-java-test]$ ./configure
Configuring Open Java test suite
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for mpijavac... yes
checking if checking MPI API params... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating reporting/OmpitestConfig.java
config.status: creating Makefile

[***@titan01 ompi-java-test]$ cd onesided/
[***@titan01 onesided]$ ./make_onesided &> result
cat result:
<crop.....>

=========================== CReqops ===========================
[titan01:32155] *** An error occurred in MPI_Rput
[titan01:32155] *** reported by process [3879534593,1]
[titan01:32155] *** on win rdma window 3
[titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32155] ***    and potentially your MPI job)

<...crop....>

=========================== TestMpiRmaCompareAndSwap ===========================
[titan01:32703] *** An error occurred in MPI_Compare_and_swap
[titan01:32703] *** reported by process [3843162113,0]
[titan01:32703] *** on win rdma window 3
[titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:32703] ***    and potentially your MPI job)
[titan01.service:32698] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:32698] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


< ... end crop>


Also if we start the thing in this way, it fails:

[***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java OmpitestError.java OmpitestProgress.java OmpitestConfig.java

[***@titan01 onesided]$  mpiexec -np 2 java TestMpiRmaCompareAndSwap

[titan01:22877] *** An error occurred in MPI_Compare_and_swap
[titan01:22877] *** reported by process [3287285761,0]
[titan01:22877] *** on win rdma window 3
[titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:22877] ***    and potentially your MPI job)
[titan01.service:22872] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:22872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:
Since you are getting the same errors with C as you are with Java, this is an issue with C, not the Java bindings.  However, in the most recent output, you are using ./a.out to run the test.  Did you use mpirun to run the test in Java or C?

The command should be something along the lines of: 

    mpirun -np 2 ​java TestMpiRmaCompareAndSwap

    mpirun -np 2 ./a.out

Also, are you compiling with the ompi wrappers?  Should be:

    mpijavac TestMpiRmaCompareAndSwap.java

    ​mpicc compare_and_swap.c

In the mean time, I will try to reproduce this on a similar system.

-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Tuesday, September 13, 2016 12:46 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
 
Hey,

it seems to be a problem of ompi 2.x. The C version with 2.0.1 also produces this output:
(the same whether built from sources or from the 2.0.1 release)

[***@node108 mpi_test]$ ./a.out
[node108:2949] *** An error occurred in MPI_Compare_and_swap
[node108:2949] *** reported by process [1649420396,0]
[node108:2949] *** on win rdma window 3
[node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[node108:2949] ***    and potentially your MPI job)

But the test works with 1.8.x! In fact our cluster does not have shared memory, so it has to use the wrapper to the default methods.

Gundram

On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:
Hello Gundram,

It looks like the test that is failing is TestMpiRmaCompareAndSwap.java​.  Is that the one that is crashing?  If so, could you try to run the C test from:

    http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1

There are a couple of header files you will need for that test, but they are in the same repo as the test (up a few folders and in an include folder).

This should let us know whether it's an issue related to Java or not.

If it is another test, let me know and I'll see if I can get you the C version (most or all of the Java tests are translations from the C tests).

-Nathan


--
Nathaniel Graham
HPC-DES
Los Alamos National Laboratory
From: users <users-***@lists.open-mpi.org> on behalf of Gundram Leifert <***@uni-rostock.de>
Sent: Wednesday, September 7, 2016 9:23 AM
To: ***@lists.open-mpi.org
Subject: Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
 
Hello,
I still have the same errors on our cluster - even one more. Maybe the new one helps us to find a solution.
I have this error if I run "make_onesided" of the ompi-java-test repo.

  CReqops and TestMpiRmaCompareAndSwap report (pretty deterministically - in all my 30 runs) this error:

[titan01:5134] *** An error occurred in MPI_Compare_and_swap
[titan01:5134] *** reported by process [2392850433,1]
[titan01:5134] *** on win rdma window 3
[titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[titan01:5134] ***    and potentially your MPI job)
[titan01.service:05128] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[titan01.service:05128] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Sometimes I also have the SIGSEGV error.
System:
compiler: gcc/5.2.0
java: jdk1.8.0_102
kernelmodule: mlx4_core mlx4_en mlx4_ib
Linux version 3.10.0-327.13.1.el7.x86_64 (***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP
Open MPI v2.0.1, package: Open MPI  Distribution, ident: 2.0.1, repo rev: v2.0.0-257-gee86e07, Sep 02, 2016
InfiniBand

openib:  OpenSM 3.3.19


limits:

 ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256554
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Thanks, Gundram
On 07/12/2016 11:08 AM, Gundram Leifert wrote:
Hello Gilley, Howard,

I configured without disable dlopen - same error.

I test these classes on another cluster and: IT WORKS!

So it is a problem of the cluster configuration. Thank you all very much for all your help! When the admin can solve the problem, i will let you know, what he had changed.

Cheers Gundram

On 07/08/2016 04:19 PM, Howard Pritchard wrote:
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

On Friday, 8 July 2016, Gilles Gouaillardet wrote:
the JVM sets its own signal handlers, and it is important openmpi does not override them.
this is what previously happened with PSM (infinipath) but this has been solved since.
you might be linking with a third party library that hijacks signal handlers and causes the crash
(which would explain why I cannot reproduce the issue)

the master branch has a revamped memory patcher (compared to v2.x or v1.10), and that could have some bad interactions with the JVM, so you might also give v2.x a try

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
You made the best of it... thanks a lot!

Without MPI it runs.
Just adding MPI.Init() causes the crash!

maybe I installed something wrong...

install newest automake, autoconf, m4, libtoolize in right order and same prefix
check out ompi,
autogen
configure with same prefix, pointing to the same jdk, I later use
make
make install

I will test some different configurations of ./configure...


On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
I am running out of ideas ...

what if you do not run within slurm ?
what if you do not use '-cp executor.jar'
or what if you configure without --disable-dlopen --disable-mca-dso ?

if you
mpirun -np 1 ...
then MPI_Bcast and MPI_Barrier are basically no-op, so it is really weird your program is still crashing. an other test is to comment out MPI_Bcast and MPI_Barrier and try again with -np 1

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
In any cases the same error.
this is my code:

salloc -n 3
export IPATH_NO_BACKTRACE
ulimit -s 10240
mpirun -np 3 java -cp executor.jar de.uros.citlab.executor.test.TestSendBigFiles2


also for 1 or two cores, the process crashes.


On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
you can try
export IPATH_NO_BACKTRACE
before invoking mpirun (that should not be needed though)

an other test is to
ulimit -s 10240
before invoking mpirun.

btw, do you use mpirun or srun ?

can you reproduce the crash with 1 or 2 tasks ?

Cheers,

Gilles

On Friday, July 8, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

configure:
./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso


1 node with 3 cores. I use SLURM to allocate one node. I changed --mem, but it has no effect.
salloc -n 3


core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256564
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

uname -a
Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/system-release
CentOS Linux release 7.2.1511 (Core)

what else do you need?

Cheers, Gundram

On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
Gundram,

can you please provide more information on your environment :
- configure command line
- OS
- memory available
- ulimit -a
- number of nodes
- number of tasks used
- interconnect used (if any)
- batch manager (if any)

Cheers,

Gilles
On 7/7/2016 4:17 PM, Gundram Leifert wrote:
Hello Gilles,

I tried your code and it crashes after 3-15 iterations (see (1)). It is always the same error (only the "94" varies).

Meanwhile I think Java and MPI use the same memory because when I delete the hash-call, the program runs sometimes more than 9k iterations.
When it crashes, there are different lines (see (2) and (3)). The crashes also occurs on rank 0.

##### (1)#####
# Problematic frame:
# J 94 C2 de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I (42 bytes) @ 0x00002b03242dc9c4 [0x00002b03242dc860+0x164]

#####(2)#####
# Problematic frame:
# V  [libjvm.so+0x68d0f6]  JavaCallWrapper::JavaCallWrapper(methodHandle, Handle, JavaValue*, Thread*)+0xb6

#####(3)#####
# Problematic frame:
# V  [libjvm.so+0x4183bf]  ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f

Any more idea?

On 07/07/2016 03:00 AM, Gilles Gouaillardet wrote:
Gundram,

fwiw, i cannot reproduce the issue on my box
- centos 7
- java version "1.8.0_71"
  Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
  Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)


i noticed on non zero rank saveMem is allocated at each iteration.
ideally, the garbage collector can take care of that and this should not be an issue.

would you mind giving the attached file a try ?

Cheers,

Gilles

On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
I will have a look at it today

how did you configure OpenMPI ?

Cheers,

Gilles

On Thursday, July 7, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello Gilles,

thank you for your hints! I made 3 changes; unfortunately the same error occurs:

update ompi:
commit ae8444682f0a7aa158caea08800542ce9874455e
Author: Ralph Castain <***@open-mpi.org>
Date:   Tue Jul 5 20:07:16 2016 -0700

update java:
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) Server VM (build 25.92-b14, mixed mode)

delete hashcode-lines.

Now I get this error message 100% of the time, after a varying number of iterations (15-300):

 0/ 3:length = 100000000
 0/ 3:bcast length done (length = 100000000)
 1/ 3:bcast length done (length = 100000000)
 2/ 3:bcast length done (length = 100000000)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b3d022fcd24, pid=16578, tid=0x00002b3d29716700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x414d24]  ciEnv::get_field_by_index(ciInstanceKlass*, int)+0x94
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid16578.log
#
# Compiler replay data is saved as:
# /home/gl069/ompi/bin/executor/replay_pid16578.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
[titan01:16578] *** Process received signal ***
[titan01:16578] Signal: Aborted (6)
[titan01:16578] Signal code:  (-6)
[titan01:16578] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
[titan01:16578] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
[titan01:16578] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
[titan01:16578] [ 3] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
[titan01:16578] [ 4] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
[titan01:16578] [ 5] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
[titan01:16578] [ 6] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
[titan01:16578] [ 7] /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
[titan01:16578] [ 8] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
[titan01:16578] [ 9] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
[titan01:16578] [10] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
[titan01:16578] [11] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
[titan01:16578] [12] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [13] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [14] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [15] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [16] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [17] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
[titan01:16578] [18] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
[titan01:16578] [19] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
[titan01:16578] [20] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
[titan01:16578] [21] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
[titan01:16578] [22] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
[titan01:16578] [23] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
[titan01:16578] [24] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
[titan01:16578] [25] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
[titan01:16578] [26] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
[titan01:16578] [27] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
[titan01:16578] [28] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
[titan01:16578] [29] /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
[titan01:16578] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I don't know if it is a problem of Java or ompi - but in past years, Java worked with no problems on my machine...

Thank you for your tips in advance!
Gundram

On 07/06/2016 03:10 PM, Gilles Gouaillardet wrote:
Note a race condition in MPI_Init has been fixed yesterday in the master.
can you please update your OpenMPI and try again ?

hopefully the hang will disappear.

Can you reproduce the crash with a simpler (and ideally deterministic) version of your program.
the crash occurs in hashcode, and this makes little sense to me. can you also update your jdk ?

Cheers,

Gilles

On Wednesday, July 6, 2016, Gundram Leifert <***@uni-rostock.de> wrote:
Hello Jason,

thanks for your response! I think it is another problem. I try to send 100 MB of bytes, so there are not many repetitions (between 10 and 30). I realized that the execution of this code can result in 3 different errors:

1. Most often the posted error message occurs.

2. In <10% of the cases I have a livelock. I can see 3 Java processes, one with 200% and two with 100% processor utilization. After ~15 minutes without new output, this error occurs.


[thread 47499823949568 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (safepoint.cpp:317), pid=24256, tid=47500347131648
#  guarantee(PageArmed == 0) failed: invariant
#
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid24256.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:24256] *** Process received signal ***
[titan01:24256] Signal: Aborted (6)
[titan01:24256] Signal code:  (-6)
[titan01:24256] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
[titan01:24256] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
[titan01:24256] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
[titan01:24256] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
[titan01:24256] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
[titan01:24256] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
[titan01:24256] [ 6] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
[titan01:24256] [ 7] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
[titan01:24256] [ 8] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
[titan01:24256] [ 9] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
[titan01:24256] [10] /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
[titan01:24256] [11] /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
[titan01:24256] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titan01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


3. In <10% of the cases I get a deadlock during MPI.Init. It hangs for more than 15 minutes without returning an error message...

Can I enable some debug flags to see what happens on the C / Open MPI side?
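For example, would something like

mpirun -np 3 --mca coll_base_verbose 100 --mca btl_base_verbose 100 java -Xcheck:jni TestSendBigFiles

point in the right direction? (I am only guessing at the relevant verbosity parameters and the JVM option here.)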

Thanks in advance for your help!
Gundram Leifert


On 07/05/2016 06:05 PM, Jason Maldonis wrote:
After reading your thread, it looks like it may be related to an issue I had a few weeks ago (I'm a novice though). Maybe my thread will be of help: https://www.open-mpi.org/community/lists/users/2016/06/29425.php

When you say "After a specific number of repetitions the process either hangs up or returns with a SIGSEGV," do you mean that a single call hangs, or that at some point during the for loop a call hangs? If you mean the latter, it might be related to my issue. Otherwise my thread probably won't be helpful.

Jason Maldonis
Research Assistant of Professor Paul Voyles
Materials Science Grad Student
University of Wisconsin, Madison
1509 University Ave, Rm M142
Madison, WI 53706
***@wisc.edu
608-295-5532

On Tue, Jul 5, 2016 at 9:58 AM, Gundram Leifert <***@uni-rostock.de> wrote:
Hello,

I try to send many byte-arrays via broadcast. After a specific number of repetitions the process either hangs up or returns with a SIGSEGV. Does any one can help me solving the problem:

########## The code:

import java.util.Random;
import mpi.*;

public class TestSendBigFiles {

    public static void log(String msg) {
        try {
            System.err.println(String.format("%2d/%2d:%s", MPI.COMM_WORLD.getRank(), MPI.COMM_WORLD.getSize(), msg));
        } catch (MPIException ex) {
            System.err.println(String.format("%2s/%2s:%s", "?", "?", msg));
        }
    }

    private static int hashcode(byte[] bytearray) {
        if (bytearray == null) {
            return 0;
        }
        int hash = 39;
        for (int i = 0; i < bytearray.length; i++) {
            byte b = bytearray[i];
            hash = hash * 7 + (int) b;
        }
        return hash;
    }

    public static void main(String args[]) throws MPIException {
        log("start main");
        MPI.Init(args);
        try {
            log("initialized done");
            byte[] saveMem = new byte[100000000];
            MPI.COMM_WORLD.barrier();
            Random r = new Random();
            r.nextBytes(saveMem);
            if (MPI.COMM_WORLD.getRank() == 0) {
                for (int i = 0; i < 1000; i++) {
                    saveMem[r.nextInt(saveMem.length)]++;
                    log("i = " + i);
                    int[] lengthData = new int[]{saveMem.length};
                    log("object hash = " + hashcode(saveMem));
                    log("length = " + lengthData[0]);
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0] + ")");
                    MPI.COMM_WORLD.barrier();
                    MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                }
                MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
            } else {
                while (true) {
                    int[] lengthData = new int[1];
                    MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
                    log("bcast length done (length = " + lengthData[0] + ")");
                    if (lengthData[0] == 0) {
                        break;
                    }
                    MPI.COMM_WORLD.barrier();
                    saveMem = new byte[lengthData[0]];
                    MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
                    log("bcast data done");
                    MPI.COMM_WORLD.barrier();
                    log("object hash = " + hashcode(saveMem));
                }
            }
            MPI.COMM_WORLD.barrier();
        } catch (MPIException ex) {
            System.out.println("caugth error." + ex);
            log(ex.getMessage());
        } catch (RuntimeException ex) {
            System.out.println("caugth error." + ex);
            log(ex.getMessage());
        } finally {
            MPI.Finalize();
        }

    }

}


############ The Error (if it does not just hang up):

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002b7e9c86e3a1, pid=1172, tid=47822674495232
#
#
# A fatal error has been detected by the Java Runtime Environment:
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# #
#  SIGSEGV (0xb) at pc=0x00002af69c0693a1, pid=1173, tid=47238546896640
#
# JRE version: 7.0_25-b15
J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J  de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1172.log
# An error report file with more information is saved as:
# /home/gl069/ompi/bin/executor/hs_err_pid1173.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
[titan01:01172] *** Process received signal ***
[titan01:01172] Signal: Aborted (6)
[titan01:01172] Signal code:  (-6)
[titan01:01173] *** Process received signal ***
[titan01:01173] Signal: Aborted (6)
[titan01:01173] Signal code:  (-6)
[titan01:01172] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
[titan01:01172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
[titan01:01172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
[titan01:01172] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
[titan01:01172] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
[titan01:01172] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
[titan01:01172] [ 6] [titan01:01173] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
[titan01:01173] [ 1] /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
[titan01:01172] [ 7] [0x2b7e9c86e3a1]
[titan01:01172] *** End of error message ***
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
[titan01:01173] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
[titan01:01173] [ 3] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
[titan01:01173] [ 4] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
[titan01:01173] [ 5] /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
[titan01:01173] [ 6] /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
[titan01:01173] [ 7] [0x2af69c0693a1]
[titan01:01173] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node titan01 exited on signal 6 (Aborted).


########CONFIGURATION:
I used the ompi master sources from github:
commit 267821f0dd405b5f4370017a287d9a49f92e734a
Author: Gilles Gouaillardet <***@rist.or.jp>
Date:   Tue Jul 5 13:47:50 2016 +0900

./configure --enable-mpi-java --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen --disable-mca-dso

Thanks a lot for your help!
Gundram

_______________________________________________
users mailing list
***@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29584.php


