Sorry for the delay...
We applied:
./configure --enable-debug --with-psm --enable-mpi-java
--with-jdk-dir=/cluster/libraries/java/jdk1.8.0_102/
--prefix=/cluster/mpi/gcc/openmpi/2.0.x_nightly
make -j 8 all
make install
Then we ran the Java test suite:
export OMPI_MCA_osc=pt2pt
./make_onesided &> make_onesided.out
Output: https://gist.github.com/anonymous/f8c6837b6a6d40c806cec9458dfcc1ab
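To capture the osc verbosity Nathan asks for below, the failing tests can
also be rerun one at a time; something along these lines should work
(WinAllocate as an example - the class name varies per test):

export OMPI_MCA_osc=pt2pt
mpirun -np 2 -mca osc_base_verbose 100 java WinAllocate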
We still sometimes get the SIGSEGV:
WinAllocate with -np 2 (the same trace is printed twice, once per rank,
interleaved in the raw output):

Exception in thread "main" mpi.MPIException: MPI_ERR_INTERN: internal error
        at mpi.Win.allocateSharedWin(Native Method)
        at mpi.Win.<init>(Win.java:110)
        at WinAllocate.main(WinAllocate.java:42)
WinName with -np 2:
mpiexec has exited due to process rank 1 with PID 0 on
node node160 exiting improperly. There are three reasons this could occur:
<CROP>
CCreateInfo and Cput with -np 8 sometimes end with a SIGSEGV (see
https://gist.github.com/anonymous/605c19422fd00bdfc4d1ea0151a1f34c for a
detailed view).
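For reference, the call that keeps failing further down this thread
(MPI_Compare_and_swap) boils down to a compare-and-swap on a small window.
A minimal sketch of that pattern - not the literal suite code; it assumes
the Win constructor that wraps MPI_Win_create and the compareAndSwap
signature shown by the mpijavac error quoted below, so treat the exact
arguments as assumptions:

import java.nio.IntBuffer;
import mpi.*;

public class CasSketch {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();
        // expose one int of window memory per rank (sizes illustrative)
        IntBuffer winMem = MPI.newIntBuffer(1);
        Win win = new Win(winMem, 1, 4, MPI.INFO_NULL, MPI.COMM_WORLD);
        IntBuffer next = MPI.newIntBuffer(1); // value to install
        IntBuffer cmp  = MPI.newIntBuffer(1); // expected current value
        IntBuffer res  = MPI.newIntBuffer(1); // value actually found
        win.fence(0);
        // target = own rank, displacement 0; this is the call that
        // osc/rdma rejects with MPI_ERR_RMA_RANGE in the logs below
        win.compareAndSwap(next, cmp, res, MPI.INT, rank, 0);
        win.fence(0);
        win.free();
        MPI.Finalize();
    }
}

Compiled with mpijavac and started with "mpirun -np 2 java CasSketch",
this should show whether the bare call already trips the range check.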
I hope this information is helpful...
Best Regards,
Gundram
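
P.S. To double-check which one-sided components a build actually
contains, something like

ompi_info | grep osc

should do (ompi_info ships with the Open MPI install; the exact output
format may differ).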
On 09/14/2016 08:18 PM, Nathan Hjelm wrote:
> We have a new high-speed component for RMA in 2.0.x called osc/rdma.
> Since the component is doing direct rdma on the target we are much
> more strict about the ranges. osc/pt2pt doesn't bother checking at the
> moment.
>
> Can you build Open MPI with --enable-debug and add -mca
> osc_base_verbose 100 to the mpirun command-line? Please upload the
> output as a gist (https://gist.github.com/) and send a link so we can
> take a look.
>
> -Nathan
>
> On Sep 14, 2016, at 04:26 AM, Gundram Leifert
> <***@uni-rostock.de> wrote:
>
>> In short: yes, we compiled with mpijavac and mpicc and ran with
>> mpirun -np 2.
>>
>>
>> In more detail: we tested the following setups:
>>
>>
>> a) without Java: Open MPI 2.0.1, the C test
>>
>> [***@titan01 mpi_test]$ module list
>> Currently Loaded Modulefiles:
>> 1) openmpi/gcc/2.0.1
>>
>> [***@titan01 mpi_test]$ mpirun -np 2 ./a.out
>> [titan01:18460] *** An error occurred in MPI_Compare_and_swap
>> [titan01:18460] *** reported by process [3535667201,1]
>> [titan01:18460] *** on win rdma window 3
>> [titan01:18460] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:18460] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:18460] *** and potentially your MPI job)
>> [titan01.service:18454] 1 more process has sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal
>> [titan01.service:18454] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>>
>> b) without Java: Open MPI 1.8.8, the C test
>>
>> [***@titan01 mpi_test2]$ module list
>> Currently Loaded Modulefiles:
>> 1) openmpi/gcc/1.8.8
>>
>> [***@titan01 mpi_test2]$ mpirun -np 2 ./a.out
>> No Errors
>> [***@titan01 mpi_test2]$
>>
>> c) with Java: Open MPI 1.8.8, the JDK, and the Java test suite
>>
>> [***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
>> TestMpiRmaCompareAndSwap.java:49: error: cannot find symbol
>> win.compareAndSwap(next, iBuffer, result,
>> MPI.INT, rank, 0);
>> ^
>> symbol: method
>> compareAndSwap(IntBuffer,IntBuffer,IntBuffer,Datatype,int,int)
>> location: variable win of type Win
>> TestMpiRmaCompareAndSwap.java:53: error: cannot find symbol
>>
>> => these Java methods are not supported in Open MPI 1.8.8
>>
>> d) Open MPI 2.0.1, the JDK, and the Java test suite
>>
>> [***@titan01 ~]$ module list
>> Currently Loaded Modulefiles:
>> 1) openmpi/gcc/2.0.1 2) java/jdk1.8.0_102
>>
>> [***@titan01 ~]$ cd ompi-java-test/
>> [***@titan01 ompi-java-test]$ ./autogen.sh
>> autoreconf: Entering directory `.'
>> autoreconf: configure.ac: not using Gettext
>> autoreconf: running: aclocal --force
>> autoreconf: configure.ac: tracing
>> autoreconf: configure.ac: not using Libtool
>> autoreconf: running: /usr/bin/autoconf --force
>> autoreconf: configure.ac: not using Autoheader
>> autoreconf: running: automake --add-missing --copy --force-missing
>> autoreconf: Leaving directory `.'
>> [***@titan01 ompi-java-test]$ ./configure
>> Configuring Open Java test suite
>> checking for a BSD-compatible install... /bin/install -c
>> checking whether build environment is sane... yes
>> checking for a thread-safe mkdir -p... /bin/mkdir -p
>> checking for gawk... gawk
>> checking whether make sets $(MAKE)... yes
>> checking whether make supports nested variables... yes
>> checking whether make supports nested variables... (cached) yes
>> checking for mpijavac... yes
>> checking if checking MPI API params... yes
>> checking that generated files are newer than configure... done
>> configure: creating ./config.status
>> config.status: creating reporting/OmpitestConfig.java
>> config.status: creating Makefile
>>
>> [***@titan01 ompi-java-test]$ cd onesided/
>> [***@titan01 onesided]$ ./make_onesided &> result
>> cat result:
>> <crop.....>
>>
>> =========================== CReqops ===========================
>> [titan01:32155] *** An error occurred in MPI_Rput
>> [titan01:32155] *** reported by process [3879534593,1]
>> [titan01:32155] *** on win rdma window 3
>> [titan01:32155] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:32155] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:32155] *** and potentially your MPI job)
>>
>> <...crop....>
>>
>> =========================== TestMpiRmaCompareAndSwap
>> ===========================
>> [titan01:32703] *** An error occurred in MPI_Compare_and_swap
>> [titan01:32703] *** reported by process [3843162113,0]
>> [titan01:32703] *** on win rdma window 3
>> [titan01:32703] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:32703] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:32703] *** and potentially your MPI job)
>> [titan01.service:32698] 1 more process has sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal
>> [titan01.service:32698] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>>
>>
>> < ... end crop>
>>
>>
>> It also fails if we start it this way:
>>
>> [***@titan01 onesided]$ mpijavac TestMpiRmaCompareAndSwap.java
>> OmpitestError.java OmpitestProgress.java OmpitestConfig.java
>>
>> [***@titan01 onesided]$ mpiexec -np 2 java TestMpiRmaCompareAndSwap
>>
>> [titan01:22877] *** An error occurred in MPI_Compare_and_swap
>> [titan01:22877] *** reported by process [3287285761,0]
>> [titan01:22877] *** on win rdma window 3
>> [titan01:22877] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>> [titan01:22877] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>> now abort,
>> [titan01:22877] *** and potentially your MPI job)
>> [titan01.service:22872] 1 more process has sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal
>> [titan01.service:22872] Set MCA parameter "orte_base_help_aggregate"
>> to 0 to see all help / error messages
>>
>>
>>
>> On 09/13/2016 08:06 PM, Graham, Nathaniel Richard wrote:
>>>
>>> Since you are getting the same errors with C as you are with Java,
>>> this is an issue on the C side, not in the Java bindings. However, in
>>> the most recent output, you are using ./a.out to run the test. Did you
>>> use mpirun to run the test in Java or C?
>>>
>>>
>>> The command should be something along the lines of:
>>>
>>>
>>> mpirun -np 2 java TestMpiRmaCompareAndSwap
>>>
>>>
>>> mpirun -np 2 ./a.out
>>>
>>>
>>> Also, are you compiling with the ompi wrappers? Should be:
>>>
>>>
>>> mpijavac TestMpiRmaCompareAndSwap.java
>>>
>>>
>>> mpicc compare_and_swap.c
>>>
>>>
>>> In the meantime, I will try to reproduce this on a similar system.
>>>
>>>
>>> -Nathan
>>>
>>>
>>>
>>> --
>>> Nathaniel Graham
>>> HPC-DES
>>> Los Alamos National Laboratory
>>> ------------------------------------------------------------------------
>>> *From:* users <users-***@lists.open-mpi.org> on behalf of
>>> Gundram Leifert <***@uni-rostock.de>
>>> *Sent:* Tuesday, September 13, 2016 12:46 AM
>>> *To:* ***@lists.open-mpi.org
>>> *Subject:* Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
>>>
>>> Hey,
>>>
>>>
>>> it seems to be a problem of ompi 2.x. The C version 2.0.1 also
>>> produces this output:
>>>
>>> (the same for a build from source or the 2.0.1 release)
>>>
>>>
>>> [***@node108 mpi_test]$ ./a.out
>>> [node108:2949] *** An error occurred in MPI_Compare_and_swap
>>> [node108:2949] *** reported by process [1649420396,0]
>>> [node108:2949] *** on win rdma window 3
>>> [node108:2949] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>>> [node108:2949] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>>> now abort,
>>> [node108:2949] *** and potentially your MPI job)
>>>
>>> But the test works with 1.8.x! In fact, our cluster does not have
>>> shared memory - so the wrapper has to fall back to the default methods.
>>>
>>> Gundram
>>>
>>> On 09/07/2016 06:49 PM, Graham, Nathaniel Richard wrote:
>>>>
>>>> Hello Gundram,
>>>>
>>>>
>>>> It looks like the test that is failing is
>>>> TestMpiRmaCompareAndSwap.java. Is that the one that is crashing?
>>>> If so, could you try to run the C test from:
>>>>
>>>>
>>>> http://git.mpich.org/mpich.git/blob/c77631474f072e86c9fe761c1328c3d4cb8cc4a5:/test/mpi/rma/compare_and_swap.c#l1
>>>>
>>>>
>>>> There are a couple of header files you will need for that test, but
>>>> they are in the same repo as the test (up a few folders and in an
>>>> include folder).
>>>>
>>>>
>>>> This should let us know whether it's an issue related to Java or not.
>>>>
>>>>
>>>> If it is another test, let me know and I'll see if I can get you the
>>>> C version (most or all of the Java tests are translations of the
>>>> C tests).
>>>>
>>>>
>>>> -Nathan
>>>>
>>>>
>>>>
>>>> --
>>>> Nathaniel Graham
>>>> HPC-DES
>>>> Los Alamos National Laboratory
>>>> ------------------------------------------------------------------------
>>>> *From:* users <users-***@lists.open-mpi.org> on behalf of
>>>> Gundram Leifert <***@uni-rostock.de>
>>>> *Sent:* Wednesday, September 7, 2016 9:23 AM
>>>> *To:* ***@lists.open-mpi.org
>>>> *Subject:* Re: [OMPI users] Java-OpenMPI returns with SIGSEGV
>>>>
>>>> Hello,
>>>>
>>>> I still have the same errors on our cluster - and even a new one. Maybe
>>>> the new one helps us find a solution.
>>>>
>>>> I get this error if I run "make_onesided" from the ompi-java-test repo.
>>>>
>>>> CReqops and TestMpiRmaCompareAndSwap report (pretty
>>>> deterministically - in all my 30 runs) this error:
>>>>
>>>> [titan01:5134] *** An error occurred in MPI_Compare_and_swap
>>>> [titan01:5134] *** reported by process [2392850433,1]
>>>> [titan01:5134] *** on win rdma window 3
>>>> [titan01:5134] *** MPI_ERR_RMA_RANGE: invalid RMA address range
>>>> [titan01:5134] *** MPI_ERRORS_ARE_FATAL (processes in this win will
>>>> now abort,
>>>> [titan01:5134] *** and potentially your MPI job)
>>>> [titan01.service:05128] 1 more process has sent help message
>>>> help-mpi-errors.txt / mpi_errors_are_fatal
>>>> [titan01.service:05128] Set MCA parameter
>>>> "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>
>>>> Sometimes I also have the SIGSEGV error.
>>>>
>>>> System:
>>>>
>>>> compiler: gcc/5.2.0
>>>> java: jdk1.8.0_102
>>>> kernel modules: mlx4_core mlx4_en mlx4_ib
>>>> Linux version 3.10.0-327.13.1.el7.x86_64
>>>> (***@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red
>>>> Hat 4.8.3-9) (GCC) ) #1 SMP
>>>>
>>>> Open MPI v2.0.1, package: Open MPI Distribution, ident: 2.0.1, repo
>>>> rev: v2.0.0-257-gee86e07, Sep 02, 2016
>>>>
>>>> InfiniBand:
>>>>
>>>> openib: OpenSM 3.3.19
>>>>
>>>>
>>>> limits:
>>>>
>>>> ulimit -a
>>>> core file size (blocks, -c) 0
>>>> data seg size (kbytes, -d) unlimited
>>>> scheduling priority (-e) 0
>>>> file size (blocks, -f) unlimited
>>>> pending signals (-i) 256554
>>>> max locked memory (kbytes, -l) unlimited
>>>> max memory size (kbytes, -m) unlimited
>>>> open files (-n) 100000
>>>> pipe size (512 bytes, -p) 8
>>>> POSIX message queues (bytes, -q) 819200
>>>> real-time priority (-r) 0
>>>> stack size (kbytes, -s) unlimited
>>>> cpu time (seconds, -t) unlimited
>>>> max user processes (-u) 4096
>>>> virtual memory (kbytes, -v) unlimited
>>>> file locks (-x) unlimited
>>>>
>>>>
>>>> Thanks, Gundram
>>>> On 07/12/2016 11:08 AM, Gundram Leifert wrote:
>>>>> Hello Gilles, Howard,
>>>>>
>>>>> I configured without the disable-dlopen option - same error.
>>>>>
>>>>> I tested these classes on another cluster and: IT WORKS!
>>>>>
>>>>> So it is a problem with the cluster configuration. Thank you all
>>>>> very much for all your help! When the admin solves the problem,
>>>>> I will let you know what he changed.
>>>>>
>>>>> Cheers Gundram
>>>>>
>>>>> On 07/08/2016 04:19 PM, Howard Pritchard wrote:
>>>>>> Hi Gundram
>>>>>>
>>>>>> Could you configure without the disable dlopen option and retry?
>>>>>>
>>>>>> Howard
>>>>>>
>>>>>> On Friday, July 8, 2016, Gilles Gouaillardet wrote:
>>>>>>
>>>>>> the JVM sets its own signal handlers, and it is important
>>>>>> openmpi does not override them.
>>>>>> this is what previously happened with PSM (infinipath) but
>>>>>> that has since been solved.
>>>>>> you might be linking with a third-party library that hijacks
>>>>>> signal handlers and causes the crash
>>>>>> (which would explain why I cannot reproduce the issue)
>>>>>>
>>>>>> the master branch has a revamped memory patcher (compared to
>>>>>> v2.x or v1.10), and that could have some bad interactions
>>>>>> with the JVM, so you might also give v2.x a try
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>>> <***@uni-rostock.de> wrote:
>>>>>>
>>>>>> You made the best of it... thanks a lot!
>>>>>>
>>>>>> Without MPI it runs.
>>>>>> Just adding MPI.Init() causes the crash!
>>>>>>
>>>>>> Maybe I installed something wrong...
>>>>>>
>>>>>> install the newest automake, autoconf, m4, and libtool in the
>>>>>> right order and with the same prefix
>>>>>> check out ompi
>>>>>> autogen
>>>>>> configure with the same prefix, pointing to the same jdk I
>>>>>> later use
>>>>>> make
>>>>>> make install
>>>>>>
>>>>>> I will test some different configurations of ./configure...
>>>>>>
>>>>>>
>>>>>> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>>>>>>> I am running out of ideas ...
>>>>>>>
>>>>>>> what if you do not run within slurm ?
>>>>>>> what if you do not use '-cp executor.jar'
>>>>>>> or what if you configure without --disable-dlopen
>>>>>>> --disable-mca-dso ?
>>>>>>>
>>>>>>> if you
>>>>>>> mpirun -np 1 ...
>>>>>>> then MPI_Bcast and MPI_Barrier are basically no-ops, so
>>>>>>> it is really weird that your program is still crashing.
>>>>>>> Another test is to comment out MPI_Bcast and MPI_Barrier
>>>>>>> and try again with -np 1
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>
>>>>>>> In all cases the same error.
>>>>>>> This is my code:
>>>>>>>
>>>>>>> salloc -n 3
>>>>>>> export IPATH_NO_BACKTRACE
>>>>>>> ulimit -s 10240
>>>>>>> mpirun -np 3 java -cp executor.jar
>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles2
>>>>>>>
>>>>>>>
>>>>>>> The process also crashes with one or two cores.
>>>>>>>
>>>>>>>
>>>>>>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>>>>>>> you can try
>>>>>>>> export IPATH_NO_BACKTRACE
>>>>>>>> before invoking mpirun (that should not be needed
>>>>>>>> though)
>>>>>>>>
>>>>>>>> another test is to
>>>>>>>> ulimit -s 10240
>>>>>>>> before invoking mpirun.
>>>>>>>>
>>>>>>>> btw, do you use mpirun or srun ?
>>>>>>>>
>>>>>>>> can you reproduce the crash with 1 or 2 tasks ?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> On Friday, July 8, 2016, Gundram Leifert
>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> configure:
>>>>>>>> ./configure --enable-mpi-java
>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>> --disable-dlopen --disable-mca-dso
>>>>>>>>
>>>>>>>>
>>>>>>>> 1 node with 3 cores. I use SLURM to allocate
>>>>>>>> one node. I changed --mem, but it has no effect.
>>>>>>>> salloc -n 3
>>>>>>>>
>>>>>>>>
>>>>>>>> core file size (blocks, -c) 0
>>>>>>>> data seg size (kbytes, -d) unlimited
>>>>>>>> scheduling priority (-e) 0
>>>>>>>> file size (blocks, -f) unlimited
>>>>>>>> pending signals (-i) 256564
>>>>>>>> max locked memory (kbytes, -l) unlimited
>>>>>>>> max memory size (kbytes, -m) unlimited
>>>>>>>> open files (-n) 100000
>>>>>>>> pipe size (512 bytes, -p) 8
>>>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>>>> real-time priority (-r) 0
>>>>>>>> stack size (kbytes, -s) unlimited
>>>>>>>> cpu time (seconds, -t) unlimited
>>>>>>>> max user processes (-u) 4096
>>>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>>>> file locks (-x) unlimited
>>>>>>>>
>>>>>>>> uname -a
>>>>>>>> Linux titan01.service
>>>>>>>> 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
>>>>>>>> 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>
>>>>>>>> cat /etc/system-release
>>>>>>>> CentOS Linux release 7.2.1511 (Core)
>>>>>>>>
>>>>>>>> what else do you need?
>>>>>>>>
>>>>>>>> Cheers, Gundram
>>>>>>>>
>>>>>>>> On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:
>>>>>>>>>
>>>>>>>>> Gundram,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> can you please provide more information on
>>>>>>>>> your environment :
>>>>>>>>>
>>>>>>>>> - configure command line
>>>>>>>>>
>>>>>>>>> - OS
>>>>>>>>>
>>>>>>>>> - memory available
>>>>>>>>>
>>>>>>>>> - ulimit -a
>>>>>>>>>
>>>>>>>>> - number of nodes
>>>>>>>>>
>>>>>>>>> - number of tasks used
>>>>>>>>>
>>>>>>>>> - interconnect used (if any)
>>>>>>>>>
>>>>>>>>> - batch manager (if any)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 7/7/2016 4:17 PM, Gundram Leifert wrote:
>>>>>>>>>> Hello Gilles,
>>>>>>>>>>
>>>>>>>>>> I tried your code and it crashes after 3-15
>>>>>>>>>> iterations (see (1)). It is always the same
>>>>>>>>>> error (only the "94" varies).
>>>>>>>>>>
>>>>>>>>>> Meanwhile I think Java and MPI use the same
>>>>>>>>>> memory, because when I delete the hash call,
>>>>>>>>>> the program sometimes runs more than 9k
>>>>>>>>>> iterations.
>>>>>>>>>> When it crashes, it is at different lines
>>>>>>>>>> (see (2) and (3)). The crashes also occur on
>>>>>>>>>> rank 0.
>>>>>>>>>>
>>>>>>>>>> ##### (1)#####
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # J 94 C2
>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles2.hashcode([BI)I
>>>>>>>>>> (42 bytes) @ 0x00002b03242dc9c4
>>>>>>>>>> [0x00002b03242dc860+0x164]
>>>>>>>>>>
>>>>>>>>>> #####(2)#####
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # V [libjvm.so+0x68d0f6]
>>>>>>>>>> JavaCallWrapper::JavaCallWrapper(methodHandle,
>>>>>>>>>> Handle, JavaValue*, Thread*)+0xb6
>>>>>>>>>>
>>>>>>>>>> #####(3)#####
>>>>>>>>>> # Problematic frame:
>>>>>>>>>> # V [libjvm.so+0x4183bf]
>>>>>>>>>> ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x4f
>>>>>>>>>>
>>>>>>>>>> Any more idea?
>>>>>>>>>>
>>>>>>>>>> On 07/07/2016 03:00 AM, Gilles Gouaillardet
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Gundram,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> fwiw, i cannot reproduce the issue on my box
>>>>>>>>>>>
>>>>>>>>>>> - centos 7
>>>>>>>>>>>
>>>>>>>>>>> - java version "1.8.0_71"
>>>>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>>>>> 1.8.0_71-b15)
>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM (build
>>>>>>>>>>> 25.71-b15, mixed mode)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I noticed that on non-zero ranks, saveMem is
>>>>>>>>>>> allocated at each iteration.
>>>>>>>>>>> Ideally, the garbage collector can take care
>>>>>>>>>>> of that and this should not be an issue.
>>>>>>>>>>>
>>>>>>>>>>> would you mind giving the attached file a try ?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Gilles
>>>>>>>>>>>
>>>>>>>>>>> On 7/7/2016 7:41 AM, Gilles Gouaillardet wrote:
>>>>>>>>>>>> I will have a look at it today
>>>>>>>>>>>>
>>>>>>>>>>>> how did you configure OpenMPI ?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> Gilles
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday, July 7, 2016, Gundram Leifert
>>>>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Gilles,
>>>>>>>>>>>>
>>>>>>>>>>>> thank you for your hints! I made 3
>>>>>>>>>>>> changes; unfortunately the same error
>>>>>>>>>>>> occurs:
>>>>>>>>>>>>
>>>>>>>>>>>> update ompi:
>>>>>>>>>>>> commit
>>>>>>>>>>>> ae8444682f0a7aa158caea08800542ce9874455e
>>>>>>>>>>>> Author: Ralph Castain <***@open-mpi.org>
>>>>>>>>>>>> Date: Tue Jul 5 20:07:16 2016 -0700
>>>>>>>>>>>>
>>>>>>>>>>>> update java:
>>>>>>>>>>>> java version "1.8.0_92"
>>>>>>>>>>>> Java(TM) SE Runtime Environment (build
>>>>>>>>>>>> 1.8.0_92-b14)
>>>>>>>>>>>> Java HotSpot(TM) Server VM (build
>>>>>>>>>>>> 25.92-b14, mixed mode)
>>>>>>>>>>>>
>>>>>>>>>>>> delete the hashcode lines.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I get this error message 100% of the
>>>>>>>>>>>> time, after a varying number of iterations
>>>>>>>>>>>> (15-300):
>>>>>>>>>>>>
>>>>>>>>>>>> 0/ 3:length = 100000000
>>>>>>>>>>>> 0/ 3:bcast length done (length =
>>>>>>>>>>>> 100000000)
>>>>>>>>>>>> 1/ 3:bcast length done (length =
>>>>>>>>>>>> 100000000)
>>>>>>>>>>>> 2/ 3:bcast length done (length =
>>>>>>>>>>>> 100000000)
>>>>>>>>>>>> #
>>>>>>>>>>>> # A fatal error has been detected by
>>>>>>>>>>>> the Java Runtime Environment:
>>>>>>>>>>>> #
>>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>>> pc=0x00002b3d022fcd24, pid=16578,
>>>>>>>>>>>> tid=0x00002b3d29716700
>>>>>>>>>>>> #
>>>>>>>>>>>> # JRE version: Java(TM) SE Runtime
>>>>>>>>>>>> Environment (8.0_92-b14) (build
>>>>>>>>>>>> 1.8.0_92-b14)
>>>>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>>>>> Server VM (25.92-b14 mixed mode
>>>>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>>> # V [libjvm.so+0x414d24]
>>>>>>>>>>>> ciEnv::get_field_by_index(ciInstanceKlass*,
>>>>>>>>>>>> int)+0x94
>>>>>>>>>>>> #
>>>>>>>>>>>> # Failed to write core dump. Core dumps
>>>>>>>>>>>> have been disabled. To enable core
>>>>>>>>>>>> dumping, try "ulimit -c unlimited"
>>>>>>>>>>>> before starting Java again
>>>>>>>>>>>> #
>>>>>>>>>>>> # An error report file with more
>>>>>>>>>>>> information is saved as:
>>>>>>>>>>>> #
>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid16578.log
>>>>>>>>>>>> #
>>>>>>>>>>>> # Compiler replay data is saved as:
>>>>>>>>>>>> #
>>>>>>>>>>>> /home/gl069/ompi/bin/executor/replay_pid16578.log
>>>>>>>>>>>> #
>>>>>>>>>>>> # If you would like to submit a bug
>>>>>>>>>>>> report, please visit:
>>>>>>>>>>>> #
>>>>>>>>>>>> http://bugreport.java.com/bugreport/crash.jsp
>>>>>>>>>>>> #
>>>>>>>>>>>> [titan01:16578] *** Process received
>>>>>>>>>>>> signal ***
>>>>>>>>>>>> [titan01:16578] Signal: Aborted (6)
>>>>>>>>>>>> [titan01:16578] Signal code: (-6)
>>>>>>>>>>>> [titan01:16578] [ 0]
>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b3d01500100]
>>>>>>>>>>>> [titan01:16578] [ 1]
>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b3d01b5c5f7]
>>>>>>>>>>>> [titan01:16578] [ 2]
>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b3d01b5dce8]
>>>>>>>>>>>> [titan01:16578] [ 3]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91e605)[0x2b3d02806605]
>>>>>>>>>>>> [titan01:16578] [ 4]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0xabda63)[0x2b3d029a5a63]
>>>>>>>>>>>> [titan01:16578] [ 5]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x14f)[0x2b3d0280be2f]
>>>>>>>>>>>> [titan01:16578] [ 6]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x91a5c3)[0x2b3d028025c3]
>>>>>>>>>>>> [titan01:16578] [ 7]
>>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b3d01b5c670]
>>>>>>>>>>>> [titan01:16578] [ 8]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x414d24)[0x2b3d022fcd24]
>>>>>>>>>>>> [titan01:16578] [ 9]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x43c5ae)[0x2b3d023245ae]
>>>>>>>>>>>> [titan01:16578] [10]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x369ade)[0x2b3d02251ade]
>>>>>>>>>>>> [titan01:16578] [11]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36eda0)[0x2b3d02256da0]
>>>>>>>>>>>> [titan01:16578] [12]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>>>>>> [titan01:16578] [13]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>>>>>> [titan01:16578] [14]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>>>>>> [titan01:16578] [15]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>>>>>> [titan01:16578] [16]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>>>>>> [titan01:16578] [17]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37091b)[0x2b3d0225891b]
>>>>>>>>>>>> [titan01:16578] [18]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3712b6)[0x2b3d022592b6]
>>>>>>>>>>>> [titan01:16578] [19]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36d2cf)[0x2b3d022552cf]
>>>>>>>>>>>> [titan01:16578] [20]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36e412)[0x2b3d02256412]
>>>>>>>>>>>> [titan01:16578] [21]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x36ed8d)[0x2b3d02256d8d]
>>>>>>>>>>>> [titan01:16578] [22]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3708c2)[0x2b3d022588c2]
>>>>>>>>>>>> [titan01:16578] [23]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3724e7)[0x2b3d0225a4e7]
>>>>>>>>>>>> [titan01:16578] [24]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a817)[0x2b3d02262817]
>>>>>>>>>>>> [titan01:16578] [25]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x37a92f)[0x2b3d0226292f]
>>>>>>>>>>>> [titan01:16578] [26]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x358edb)[0x2b3d02240edb]
>>>>>>>>>>>> [titan01:16578] [27]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35929e)[0x2b3d0224129e]
>>>>>>>>>>>> [titan01:16578] [28]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x3593ce)[0x2b3d022413ce]
>>>>>>>>>>>> [titan01:16578] [29]
>>>>>>>>>>>> /home/gl069/bin/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so(+0x35973e)[0x2b3d0224173e]
>>>>>>>>>>>> [titan01:16578] *** End of error
>>>>>>>>>>>> message ***
>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>> Primary job terminated normally, but 1
>>>>>>>>>>>> process returned
>>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>>> user-direction, the job has been aborted.
>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun noticed that process rank 2 with
>>>>>>>>>>>> PID 0 on node titan01 exited on signal
>>>>>>>>>>>> 6 (Aborted).
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> I don't know if it is a problem of Java
>>>>>>>>>>>> or ompi - but for the last few years, Java
>>>>>>>>>>>> has worked with no problems on my machine...
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your tips in advance!
>>>>>>>>>>>> Gundram
>>>>>>>>>>>>
>>>>>>>>>>>> On 07/06/2016 03:10 PM, Gilles
>>>>>>>>>>>> Gouaillardet wrote:
>>>>>>>>>>>>> Note: a race condition in MPI_Init was
>>>>>>>>>>>>> fixed yesterday in master.
>>>>>>>>>>>>> Can you please update your Open MPI and
>>>>>>>>>>>>> try again?
>>>>>>>>>>>>>
>>>>>>>>>>>>> hopefully the hang will disappear.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you reproduce the crash with a
>>>>>>>>>>>>> simpler (and ideally deterministic)
>>>>>>>>>>>>> version of your program?
>>>>>>>>>>>>> The crash occurs in hashcode, and this
>>>>>>>>>>>>> makes little sense to me. Can you also
>>>>>>>>>>>>> update your JDK?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Gilles
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, July 6, 2016, Gundram
>>>>>>>>>>>>> Leifert
>>>>>>>>>>>>> <***@uni-rostock.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Jason,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks for your response! I think
>>>>>>>>>>>>> it is another problem. I try to
>>>>>>>>>>>>> send 100 MB of bytes, so there are
>>>>>>>>>>>>> not many tries (between 10 and 30).
>>>>>>>>>>>>> I realized that the execution of
>>>>>>>>>>>>> this code can result in 3 different
>>>>>>>>>>>>> errors:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Most often, the posted error
>>>>>>>>>>>>> message occurs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. In <10% of the cases I have a
>>>>>>>>>>>>> livelock. I can see 3 Java processes,
>>>>>>>>>>>>> one with 200% and two with 100%
>>>>>>>>>>>>> processor utilization. After ~15
>>>>>>>>>>>>> minutes without new system output,
>>>>>>>>>>>>> this error occurs.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [thread 47499823949568 also had an
>>>>>>>>>>>>> error]
>>>>>>>>>>>>> # A fatal error has been detected
>>>>>>>>>>>>> by the Java Runtime Environment:
>>>>>>>>>>>>> #
>>>>>>>>>>>>> # Internal Error
>>>>>>>>>>>>> (safepoint.cpp:317), pid=24256,
>>>>>>>>>>>>> tid=47500347131648
>>>>>>>>>>>>> # guarantee(PageArmed == 0)
>>>>>>>>>>>>> failed: invariant
>>>>>>>>>>>>> #
>>>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit
>>>>>>>>>>>>> Server VM (23.25-b01 mixed mode
>>>>>>>>>>>>> linux-amd64 compressed oops)
>>>>>>>>>>>>> # Failed to write core dump. Core
>>>>>>>>>>>>> dumps have been disabled. To
>>>>>>>>>>>>> enable core dumping, try "ulimit
>>>>>>>>>>>>> -c unlimited" before starting Java
>>>>>>>>>>>>> again
>>>>>>>>>>>>> #
>>>>>>>>>>>>> # An error report file with more
>>>>>>>>>>>>> information is saved as:
>>>>>>>>>>>>> #
>>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid24256.log
>>>>>>>>>>>>> #
>>>>>>>>>>>>> # If you would like to submit a
>>>>>>>>>>>>> bug report, please visit:
>>>>>>>>>>>>> #
>>>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>>>> #
>>>>>>>>>>>>> [titan01:24256] *** Process
>>>>>>>>>>>>> received signal ***
>>>>>>>>>>>>> [titan01:24256] Signal: Aborted (6)
>>>>>>>>>>>>> [titan01:24256] Signal code: (-6)
>>>>>>>>>>>>> [titan01:24256] [ 0]
>>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b336a324100]
>>>>>>>>>>>>> [titan01:24256] [ 1]
>>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b336a9815f7]
>>>>>>>>>>>>> [titan01:24256] [ 2]
>>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b336a982ce8]
>>>>>>>>>>>>> [titan01:24256] [ 3]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b336b44fac5]
>>>>>>>>>>>>> [titan01:24256] [ 4]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b336b5af137]
>>>>>>>>>>>>> [titan01:24256] [ 5]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x407262)[0x2b336b114262]
>>>>>>>>>>>>> [titan01:24256] [ 6]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x7c6c34)[0x2b336b4d3c34]
>>>>>>>>>>>>> [titan01:24256] [ 7]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a9c17)[0x2b336b5b6c17]
>>>>>>>>>>>>> [titan01:24256] [ 8]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8aa2c0)[0x2b336b5b72c0]
>>>>>>>>>>>>> [titan01:24256] [ 9]
>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x744270)[0x2b336b451270]
>>>>>>>>>>>>> [titan01:24256] [10]
>>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0x7dc5)[0x2b336a31cdc5]
>>>>>>>>>>>>> [titan01:24256] [11]
>>>>>>>>>>>>> /usr/lib64/libc.so.6(clone+0x6d)[0x2b336aa4228d]
>>>>>>>>>>>>> [titan01:24256] *** End of error
>>>>>>>>>>>>> message ***
>>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>>> Primary job terminated normally,
>>>>>>>>>>>>> but 1 process returned
>>>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>>>> user-direction, the job has been
>>>>>>>>>>>>> aborted.
>>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> mpirun noticed that process rank 0
>>>>>>>>>>>>> with PID 0 on node titan01 exited
>>>>>>>>>>>>> on signal 6 (Aborted).
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. In <10% of the cases I have a
>>>>>>>>>>>>> deadlock during MPI.Init. It hangs
>>>>>>>>>>>>> for more than 15 minutes without
>>>>>>>>>>>>> returning an error message...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can I enable some debug flags to see
>>>>>>>>>>>>> what happens on the C / Open MPI side?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance for your help!
>>>>>>>>>>>>> Gundram Leifert
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 07/05/2016 06:05 PM, Jason
>>>>>>>>>>>>> Maldonis wrote:
>>>>>>>>>>>>>> After reading your thread, it looks
>>>>>>>>>>>>>> like it may be related to an
>>>>>>>>>>>>>> issue I had a few weeks ago (I'm
>>>>>>>>>>>>>> a novice though). Maybe my thread
>>>>>>>>>>>>>> will be of help:
>>>>>>>>>>>>>> https://www.open-mpi.org/community/lists/users/2016/06/29425.php
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When you say "After a specific
>>>>>>>>>>>>>> number of repetitions the process
>>>>>>>>>>>>>> either hangs up or returns with a
>>>>>>>>>>>>>> SIGSEGV." does you mean that a
>>>>>>>>>>>>>> single call hangs, or that at
>>>>>>>>>>>>>> some point during the for loop a
>>>>>>>>>>>>>> call hangs? If you mean the
>>>>>>>>>>>>>> latter, then it might relate to
>>>>>>>>>>>>>> my issue. Otherwise my thread
>>>>>>>>>>>>>> probably won't be helpful.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jason Maldonis
>>>>>>>>>>>>>> Research Assistant of Professor
>>>>>>>>>>>>>> Paul Voyles
>>>>>>>>>>>>>> Materials Science Grad Student
>>>>>>>>>>>>>> University of Wisconsin, Madison
>>>>>>>>>>>>>> 1509 University Ave, Rm M142
>>>>>>>>>>>>>> Madison, WI 53706
>>>>>>>>>>>>>> ***@wisc.edu
>>>>>>>>>>>>>> 608-295-5532
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 5, 2016 at 9:58 AM,
>>>>>>>>>>>>>> Gundram Leifert
>>>>>>>>>>>>>> <***@uni-rostock.de>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I try to send many
>>>>>>>>>>>>>> byte-arrays via broadcast.
>>>>>>>>>>>>>> After a specific number of
>>>>>>>>>>>>>> repetitions the process
>>>>>>>>>>>>>> either hangs up or returns
>>>>>>>>>>>>>> with a SIGSEGV. Can anyone
>>>>>>>>>>>>>> help me solve the problem?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ########## The code:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> import java.util.Random;
>>>>>>>>>>>>>> import mpi.*;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> public class TestSendBigFiles {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     public static void log(String msg) {
>>>>>>>>>>>>>>         try {
>>>>>>>>>>>>>>             System.err.println(String.format("%2d/%2d:%s",
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.getRank(),
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.getSize(), msg));
>>>>>>>>>>>>>>         } catch (MPIException ex) {
>>>>>>>>>>>>>>             System.err.println(String.format("%2s/%2s:%s",
>>>>>>>>>>>>>>                     "?", "?", msg));
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     private static int hashcode(byte[] bytearray) {
>>>>>>>>>>>>>>         if (bytearray == null) {
>>>>>>>>>>>>>>             return 0;
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>         int hash = 39;
>>>>>>>>>>>>>>         for (int i = 0; i < bytearray.length; i++) {
>>>>>>>>>>>>>>             byte b = bytearray[i];
>>>>>>>>>>>>>>             hash = hash * 7 + (int) b;
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>         return hash;
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     public static void main(String args[]) throws MPIException {
>>>>>>>>>>>>>>         log("start main");
>>>>>>>>>>>>>>         MPI.Init(args);
>>>>>>>>>>>>>>         try {
>>>>>>>>>>>>>>             log("initialized done");
>>>>>>>>>>>>>>             byte[] saveMem = new byte[100000000];
>>>>>>>>>>>>>>             MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>>             Random r = new Random();
>>>>>>>>>>>>>>             r.nextBytes(saveMem);
>>>>>>>>>>>>>>             if (MPI.COMM_WORLD.getRank() == 0) {
>>>>>>>>>>>>>>                 for (int i = 0; i < 1000; i++) {
>>>>>>>>>>>>>>                     saveMem[r.nextInt(saveMem.length)]++;
>>>>>>>>>>>>>>                     log("i = " + i);
>>>>>>>>>>>>>>                     int[] lengthData = new int[]{saveMem.length};
>>>>>>>>>>>>>>                     log("object hash = " + hashcode(saveMem));
>>>>>>>>>>>>>>                     log("length = " + lengthData[0]);
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>>>>>>>>>>                     log("bcast length done (length = " + lengthData[0] + ")");
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.bcast(saveMem, lengthData[0], MPI.BYTE, 0);
>>>>>>>>>>>>>>                     log("bcast data done");
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>                 MPI.COMM_WORLD.bcast(new int[]{0}, 1, MPI.INT, 0);
>>>>>>>>>>>>>>             } else {
>>>>>>>>>>>>>>                 while (true) {
>>>>>>>>>>>>>>                     int[] lengthData = new int[1];
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.bcast(lengthData, 1, MPI.INT, 0);
>>>>>>>>>>>>>>                     log("bcast length done (length = " + lengthData[0] + ")");
>>>>>>>>>>>>>>                     if (lengthData[0] == 0) {
>>>>>>>>>>>>>>                         break;
>>>>>>>>>>>>>>                     }
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>>                     saveMem = new byte[lengthData[0]];
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.bcast(saveMem, saveMem.length, MPI.BYTE, 0);
>>>>>>>>>>>>>>                     log("bcast data done");
>>>>>>>>>>>>>>                     MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>>                     log("object hash = " + hashcode(saveMem));
>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>             MPI.COMM_WORLD.barrier();
>>>>>>>>>>>>>>         } catch (MPIException ex) {
>>>>>>>>>>>>>>             System.out.println("caught error." + ex);
>>>>>>>>>>>>>>             log(ex.getMessage());
>>>>>>>>>>>>>>         } catch (RuntimeException ex) {
>>>>>>>>>>>>>>             System.out.println("caught error." + ex);
>>>>>>>>>>>>>>             log(ex.getMessage());
>>>>>>>>>>>>>>         } finally {
>>>>>>>>>>>>>>             MPI.Finalize();
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ############ The Error (if it
>>>>>>>>>>>>>> does not just hang up):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # A fatal error has been
>>>>>>>>>>>>>> detected by the Java Runtime
>>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>>>>> pc=0x00002b7e9c86e3a1,
>>>>>>>>>>>>>> pid=1172, tid=47822674495232
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # A fatal error has been
>>>>>>>>>>>>>> detected by the Java Runtime
>>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>>>>>> 64-Bit Server VM (23.25-b01
>>>>>>>>>>>>>> mixed mode linux-amd64
>>>>>>>>>>>>>> compressed oops)
>>>>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>>>>> # #
>>>>>>>>>>>>>> # SIGSEGV (0xb) at
>>>>>>>>>>>>>> pc=0x00002af69c0693a1,
>>>>>>>>>>>>>> pid=1173, tid=47238546896640
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # JRE version: 7.0_25-b15
>>>>>>>>>>>>>> J
>>>>>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # Failed to write core dump.
>>>>>>>>>>>>>> Core dumps have been
>>>>>>>>>>>>>> disabled. To enable core
>>>>>>>>>>>>>> dumping, try "ulimit -c
>>>>>>>>>>>>>> unlimited" before starting
>>>>>>>>>>>>>> Java again
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # Java VM: Java HotSpot(TM)
>>>>>>>>>>>>>> 64-Bit Server VM (23.25-b01
>>>>>>>>>>>>>> mixed mode linux-amd64
>>>>>>>>>>>>>> compressed oops)
>>>>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>>>>> # J
>>>>>>>>>>>>>> de.uros.citlab.executor.test.TestSendBigFiles.hashcode([B)I
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # Failed to write core dump.
>>>>>>>>>>>>>> Core dumps have been
>>>>>>>>>>>>>> disabled. To enable core
>>>>>>>>>>>>>> dumping, try "ulimit -c
>>>>>>>>>>>>>> unlimited" before starting
>>>>>>>>>>>>>> Java again
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # An error report file with
>>>>>>>>>>>>>> more information is saved as:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1172.log
>>>>>>>>>>>>>> # An error report file with
>>>>>>>>>>>>>> more information is saved as:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> /home/gl069/ompi/bin/executor/hs_err_pid1173.log
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # If you would like to submit
>>>>>>>>>>>>>> a bug report, please visit:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> # If you would like to submit
>>>>>>>>>>>>>> a bug report, please visit:
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> http://bugreport.sun.com/bugreport/crash.jsp
>>>>>>>>>>>>>> #
>>>>>>>>>>>>>> [titan01:01172] *** Process
>>>>>>>>>>>>>> received signal ***
>>>>>>>>>>>>>> [titan01:01172] Signal:
>>>>>>>>>>>>>> Aborted (6)
>>>>>>>>>>>>>> [titan01:01172] Signal code:
>>>>>>>>>>>>>> (-6)
>>>>>>>>>>>>>> [titan01:01173] *** Process
>>>>>>>>>>>>>> received signal ***
>>>>>>>>>>>>>> [titan01:01173] Signal:
>>>>>>>>>>>>>> Aborted (6)
>>>>>>>>>>>>>> [titan01:01173] Signal code:
>>>>>>>>>>>>>> (-6)
>>>>>>>>>>>>>> [titan01:01172] [ 0]
>>>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2b7e9596a100]
>>>>>>>>>>>>>> [titan01:01172] [ 1]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2b7e95fc75f7]
>>>>>>>>>>>>>> [titan01:01172] [ 2]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2b7e95fc8ce8]
>>>>>>>>>>>>>> [titan01:01172] [ 3]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2b7e96a95ac5]
>>>>>>>>>>>>>> [titan01:01172] [ 4]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2b7e96bf5137]
>>>>>>>>>>>>>> [titan01:01172] [ 5]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2b7e96a995e0]
>>>>>>>>>>>>>> [titan01:01172] [ 6]
>>>>>>>>>>>>>> [titan01:01173] [ 0]
>>>>>>>>>>>>>> /usr/lib64/libpthread.so.0(+0xf100)[0x2af694ded100]
>>>>>>>>>>>>>> [titan01:01173] [ 1]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2b7e95fc7670]
>>>>>>>>>>>>>> [titan01:01172] [ 7]
>>>>>>>>>>>>>> [0x2b7e9c86e3a1]
>>>>>>>>>>>>>> [titan01:01172] *** End of
>>>>>>>>>>>>>> error message ***
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(gsignal+0x37)[0x2af69544a5f7]
>>>>>>>>>>>>>> [titan01:01173] [ 2]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(abort+0x148)[0x2af69544bce8]
>>>>>>>>>>>>>> [titan01:01173] [ 3]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x742ac5)[0x2af695f18ac5]
>>>>>>>>>>>>>> [titan01:01173] [ 4]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(+0x8a2137)[0x2af696078137]
>>>>>>>>>>>>>> [titan01:01173] [ 5]
>>>>>>>>>>>>>> /home/gl069/bin/jdk1.7.0_25/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x140)[0x2af695f1c5e0]
>>>>>>>>>>>>>> [titan01:01173] [ 6]
>>>>>>>>>>>>>> /usr/lib64/libc.so.6(+0x35670)[0x2af69544a670]
>>>>>>>>>>>>>> [titan01:01173] [ 7]
>>>>>>>>>>>>>> [0x2af69c0693a1]
>>>>>>>>>>>>>> [titan01:01173] *** End of
>>>>>>>>>>>>>> error message ***
>>>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>>>> Primary job terminated
>>>>>>>>>>>>>> normally, but 1 process returned
>>>>>>>>>>>>>> a non-zero exit code. Per
>>>>>>>>>>>>>> user-direction, the job has
>>>>>>>>>>>>>> been aborted.
>>>>>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>> mpirun noticed that process
>>>>>>>>>>>>>> rank 1 with PID 0 on node
>>>>>>>>>>>>>> titan01 exited on signal 6
>>>>>>>>>>>>>> (Aborted).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ########CONFIGURATION:
>>>>>>>>>>>>>> I used the ompi master
>>>>>>>>>>>>>> sources from github:
>>>>>>>>>>>>>> commit
>>>>>>>>>>>>>> 267821f0dd405b5f4370017a287d9a49f92e734a
>>>>>>>>>>>>>> Author: Gilles Gouaillardet
>>>>>>>>>>>>>> <***@rist.or.jp>
>>>>>>>>>>>>>> Date: Tue Jul 5 13:47:50
>>>>>>>>>>>>>> 2016 +0900
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ./configure --enable-mpi-java
>>>>>>>>>>>>>> --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25
>>>>>>>>>>>>>> --disable-dlopen
>>>>>>>>>>>>>> --disable-mca-dso
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot for your help!
>>>>>>>>>>>>>> Gundram
>>>>>>>>>>>>>>