Discussion:
[OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path
Christof Köhler
2016-11-18 16:32:19 UTC
Hello everybody,

I am observing failures in the xdsyevr (and xssyevr) ScaLapack self
tests when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4
no failures are observed. Also, with mvapich2 2.2 no failures are
observed.
The other testers appear to be working with all MPIs mentioned (have
to triple check again). I somehow overlooked the failures below at
first.

The system is an Intel Omni-Path system (newest Intel driver release
10.2), i.e. we are using the PSM2 MTL, I believe.
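(If it helps, a way to double check which PML/MTL is actually selected
-- just a sketch, I have not captured that output here -- would be
something like

ompi_info | grep -i psm2
mpirun -n 2 --mca pml_base_verbose 10 --mca mtl_base_verbose 10 ./xdsyevr

which should print the component selection during startup.)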

I built the OpenMPIs with gcc 6.2 and the following identical options:
./configure FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
--with-psm2 --with-tm --with-hwloc=internal --enable-static
--enable-orterun-prefix-by-default

The ScaLapack build also uses gcc 6.2 and openblas 0.2.19, with
"-O1 -g" as FCFLAGS and CCFLAGS, identical for all tests; only the
wrapper compiler changes.

With OpenMPI 1.10.4 I see on a single node

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
./xdsyevr
136 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
0 tests completed and failed.

With OpenMPI 1.10.4 I see on two nodes

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
./xdsyevr
136 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
0 tests completed and failed.

With OpenMPI 2.0.1 I see on a single node

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
./xdsyevr
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.

With OpenMPI 2.0.1 I see on two nodes

mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
./xdsyevr
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.

A typical failure looks like this in the output

IL, IU, VL or VU altered by PDSYEVR
500 1 1 1 8 Y 0.26 -1.00 0.19E-02 15. FAILED
500 1 2 1 8 Y 0.29 -1.00 0.79E-03 3.9 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 1 1 2 8 Y 0.52 -1.00 0.82E-03 2.5 FAILED
500 1 2 2 8 Y 0.41 -1.00 0.79E-03 2.3 PASSED EVR
500 2 2 2 8 Y 0.18 -1.00 0.78E-03 3.0 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 4 1 4 8 Y 0.09 -1.00 0.95E-03 4.1 FAILED
500 4 4 1 8 Y 0.11 -1.00 0.91E-03 2.8 PASSED EVR


The variable OMP_NUM_THREADS is set to 1 to stop openblas from threading.
We see similar problems with the Intel 2016 compilers, but I believe gcc
is a good baseline.

Any ideas? For us this is a real problem: we do not know whether it
indicates a network (transport) issue in the Intel software stack
(libpsm2, hfi1 kernel module), which might affect our production codes,
or whether it is an OpenMPI issue. We have some other problems I might
ask about later on this list, but nothing that yields such a nice
reproducer, and those other problems might well be application related.

Best Regards

Christof
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Howard Pritchard
2016-11-18 18:25:06 UTC
Hi Christof,

Thanks for trying out 2.0.1. Sorry that you're hitting problems.
Could you try to run the tests using the 'ob1' PML in order to
bypass PSM2?

mpirun --mca pml ob1 (all the rest of the args)

and see if you still observe the failures?

Howard


Christof Köhler
2016-11-19 13:10:55 UTC
Hello,

I tried

mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x
OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host
node009,node009,node009,node009 ./xdsyevr
mpirun -n 4 --mca pml ob1 -x PATH -x LD_LIBRARY_PATH -x
OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host
node009,node010,node009,node010 ./xdsyevr

This does not change anything.


I made an attempt to narrow down what happens. Sorry, but this is a
bit longer. A stack trace is also below.

Looking at the actual numbers (see at the very bottom), I notice that
the CHK and QTQ columns (9th and 10th column, maximum over all
eigentests) are similar between the two OpenMPIs. What changes is the
"IL, IU, VL or VU altered by PDSYEVR" line, which is not present in the
1.10.4 output, only in the 2.0.1 output. Looking at
pdseprsubtst.f, comment line 751, I see that this is (as far as I
understand it) a sanity check.

Inserting my own print statement in pdseprsubtst.f (and changing
optimization to "-O0 -g"), i.e.

      IF( IL.NE.OLDIL .OR. IU.NE.OLDIU .OR. VL.NE.OLDVL .OR. VU.NE.
     $    OLDVU ) THEN
         IF( IAM.EQ.0 ) THEN
            WRITE( NOUT, FMT = 9982 )
            WRITE( NOUT, '(F8.3,F8.3,F8.3,F8.3)' ) VL, VU, OLDVL, OLDVU
            WRITE( NOUT, '(I10,I10,I10,I10)' ) IL, IU, OLDIL, OLDIU
         END IF
         RESULT = 1
      END IF

The result with 2.0.1 is

500 2 2 2 8 Y 0.08 -1.00 0.81E-03 3.3 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
NaN 0.000 NaN 0.000
-1 132733856 -1 132733856
500 4 1 4 8 Y 0.18 -1.00 0.84E-03 3.5 FAILED
500 4 4 1 8 Y 0.17 -1.00 0.78E-03 2.9 PASSED EVR

The values OLDVL and OLDVU are the saved values of VL and VU on entry
to pdseprsubtst (lines 253 and 254), _before_ the actual eigensolver
pdsyevr is called.

Working upwards in the call tree and additionally inserting
      IF (IAM.EQ.0) THEN
         WRITE(NOUT,'(F8.3,F8.3)') VL, VU
      ENDIF
right before each call to PDSEPRSUBTST in pdseprtst.f gives with 2.0.1


500 2 2 2 8 Y 0.07 -1.00 0.81E-03 3.3 PASSED EVR
NaN 0.000
IL, IU, VL or VU altered by PDSYEVR
NaN 0.000 NaN 0.000
-1 128725600 -1 128725600
500 4 1 4 8 Y 0.16 -1.00 0.84E-03 3.5 FAILED
0.000 0.000
0.000 0.000
0.000 0.000
0.000 0.000
0.343 0.377
-0.697 0.104
500 4 4 1 8 Y 0.17 -1.00 0.76E-03 3.1 PASSED EVR

With 1.10.4

500 2 2 2 8 Y 0.07 -1.00 0.80E-03 4.4 PASSED EVR
0.000 0.000
0.000 0.000
0.000 0.000
0.000 0.000
0.435 0.884
-0.804 0.699
500 4 1 4 8 Y 0.08 -1.00 0.91E-03 3.3 PASSED EVR
0.000 0.000
0.000 0.000
0.000 0.000
0.000 0.000
-0.437 0.253
-0.603 0.220
500 4 4 1 8 Y 0.17 -1.00 0.83E-03 3.7 PASSED EVR


So something goes wrong early and it is probably not related to numerics.

I then set -ffpe-trap=invalid,zero,overflow in FCFLAGS (and NOOPT); this
of course does nothing to the BLACS and C routines, although the stack
trace below ends in a C routine (which might be spurious).
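Roughly, the rebuild went along these lines (sketch only; FCFLAGS and
NOOPT are the variable names from the stock SLmake.inc, here simply
overridden on the make command line):

cd ~/src/scalapack
make clean
make FCFLAGS="-O0 -g -ffpe-trap=invalid,zero,overflow" NOOPT="-g -ffpe-trap=invalid,zero,overflow"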

login 14:04 ~/src/scalapack/TESTING % mpirun -n 4 --mca pml ob1 -x
PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
./xdsyevr
Check if overflow is handled in ieee default manner.
If this is the last output you see, you should assume
that overflow caused a floating point exception.

Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.

Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.

Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.

Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.

Backtrace for this error:
#0 0x2b921971266f in ???
#0 0x2ade83c4966f in ???
#1 0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2 0x40457b in pdseprdriver
#1 0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3 0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#2 0x40457b in pdseprdriver
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3 0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#0 0x2b414549566f in ???
#1 0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2 0x40457b in pdseprdriver
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3 0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257
#0 0x2b3701f4766f in ???
#1 0x4316fd in pdlachkieee_
at /home1/ckoe/src/scalapack/SRC/pdlaiect.c:260
#2 0x40457b in pdseprdriver
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:120
#3 0x405828 in main
at /home1/ckoe/src/scalapack/TESTING/EIG/pdseprdriver.f:257

Not sure why pdlachkieee_ appears twice!
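(The backtraces of the four ranks are written to stdout at the same
time, so the frames get interleaved; rerunning with

mpirun --tag-output (all the rest of the args)

should prefix every line with its rank and make the traces easier to
untangle -- I have not done that yet.)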

Thank you for your help!

Best Regards

Christof



Original output without my inserted WRITE statements:

On a single node (node009) with 2.0.1
IL, IU, VL or VU altered by PDSYEVR
500 1 1 1 8 Y 0.23 -1.00 0.18E-02 26. FAILED
500 1 2 1 8 Y 0.09 -1.00 0.74E-03 3.2 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 1 1 2 8 Y 0.16 -1.00 0.83E-03 2.3 FAILED
500 1 2 2 8 Y 0.07 -1.00 0.77E-03 2.2 PASSED EVR
500 2 2 2 8 Y 0.04 -1.00 0.81E-03 3.3 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 4 1 4 8 Y 0.05 -1.00 0.84E-03 3.5 FAILED
500 4 4 1 8 Y 0.06 -1.00 0.74E-03 3.5 PASSED EVR
'End of tests'
Finished 136 tests, with the following results:
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.

On node009 and node010 with 2.0.1
IL, IU, VL or VU altered by PDSYEVR
500 1 1 1 8 Y 0.23 -1.00 0.18E-02 26. FAILED
500 1 2 1 8 Y 0.10 -1.00 0.74E-03 3.2 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 1 1 2 8 Y 0.16 -1.00 0.83E-03 2.3 FAILED
500 1 2 2 8 Y 0.09 -1.00 0.77E-03 2.2 PASSED EVR
500 2 2 2 8 Y 0.07 -1.00 0.81E-03 3.3 PASSED EVR
IL, IU, VL or VU altered by PDSYEVR
500 4 1 4 8 Y 0.17 -1.00 0.84E-03 3.5 FAILED
500 4 4 1 8 Y 0.15 -1.00 0.77E-03 3.6 PASSED EVR
'End of tests'
Finished 136 tests, with the following results:
32 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
104 tests completed and failed.

On node009 and node010 with 1.10.4
'TEST 10 - test one large matrix'
500 1 1 1 8 Y 0.15 -1.00 0.18E-02 26. PASSED EVR
500 1 2 1 8 Y 0.10 -1.00 0.81E-03 2.7 PASSED EVR
500 1 1 2 8 Y 0.09 -1.00 0.71E-03 3.5 PASSED EVR
500 1 2 2 8 Y 0.09 -1.00 0.82E-03 2.6 PASSED EVR
500 2 2 2 8 Y 0.06 -1.00 0.80E-03 4.4 PASSED EVR
500 4 1 4 8 Y 0.07 -1.00 0.91E-03 3.3 PASSED EVR
500 4 4 1 8 Y 0.16 -1.00 0.83E-03 3.7 PASSED EVR
'End of tests'
Finished 136 tests, with the following results:
136 tests completed and passed residual checks.
0 tests completed without checking.
0 tests skipped for lack of memory.
0 tests completed and failed.








--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Christof Köhler
2016-11-19 13:16:58 UTC
Hello again,

please ignore the stack trace contained in my previous mail. It fails
with 1.10.4 at the same point; apparently the check for IEEE
arithmetic is a red herring!

Best Regards

Christof

--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Christof Koehler
2016-11-22 10:21:47 UTC
Hello again,

I tried to replicate the situation on the workstation at my desk,
running Ubuntu 14.04 (gcc 4.8.4) and with the OS-supplied lapack and
blas libraries.

With openmpi 2.0.1 (mpirun -np 4 xdsyevr) I get "136 tests completed and
failed.", with the "IL, IU, VL or VU altered by PDSYEVR" message but
reasonable-looking numbers, as described before.

With 1.10 I get "136 tests completed and passed residual checks."
instead, as observed before.

So this is likely not an Omni-Path problem but something else in 2.0.1.

I should probably clarify that I am using the current revision 206 from
the scalapack trunk (svn co https://icl.cs.utk.edu/svn/scalapack-dev/scalapack/trunk),
but if I remember correctly I had very similar problems with the 2.0.2
release tarball.

Both MPIs were built with
./configure --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default


Best Regards

Christof
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Gilles Gouaillardet
2016-11-22 13:35:57 UTC
Christoph,

out of curiosity, could you try to
mpirun --mca coll ^tuned ...
and see if it helps ?
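For example, reusing your exact command line from before:

mpirun -n 4 --mca coll ^tuned -x PATH -x LD_LIBRARY_PATH -x
OMP_NUM_THREADS -mca oob_tcp_if_include eth0,team0 -host
node009,node010,node009,node010 ./xdsyevr

The leading "^" excludes the tuned collective component, so the run
falls back to the other coll components.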

Cheers,

Gilles


Christof Koehler
2016-11-22 14:59:44 UTC
Hello,
Post by Gilles Gouaillardet
Christoph,
out of curiosity, could you try to
mpirun --mca coll ^tuned ...
and see if it helps ?
No, at least not for the workstation example. I will test with my laptop
(debian stable) tomorrow.

Thank you all for your help! This is really strange.

Cheers


Christof
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Christof Koehler
2016-11-23 16:41:31 UTC
Hello everybody,

as promised I started to test on my laptop (which has only two physical
cores, in case that matters).

As I discovered, the story is not as simple as I assumed. I was focusing
on xdsyevr when testing on the workstation and overlooked the others.

On the cluster the only test which throws errors is xdsyevr with 2.0.1.
With 1.10.4 everything is fine. I have double-checked this by now.

On the workstation I get "136 tests completed and failed." in xcheevr
with 1.10.4, which I overlooked. With 2.0.1 I get "136 tests completed
and failed" in xdsyevr and xssyevr.

On the laptop I am not sure yet; I ran out of battery power. But it
looked similar to the workstation: failures with both versions.

So, there is certainly a factor unrelated to OpenMPI. It might even be
that these failures are complete noise. I will try to investigate this
further. If some list member has a good idea how to test and what to
look for, I would appreciate a hint. Also, perhaps someone could try to
replicate this.

Thank you for your help so far.

Best Regards

Christof
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
George Bosilca
2016-11-23 22:30:38 UTC
Christof,

Don't use "-ffpe-trap=invalid,zero,overflow" on the pdlaiect.f file. This
file implements checks for special corner cases (division by NaN and by 0)
and will always trigger if you set the FPR trap.

I talked with some of the ScaLAPACK developers, and their assumption is
that this looks like a non-initialized local variable somewhere. Because
different MPI versions do different memory allocations and stack
manipulations, the local variables might inherit different values, and in
some cases these values might be erroneous (in your example the printed
values should not be NaN).
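One way to chase that -- just a sketch, I have not tried it against your
build -- is to rebuild the Fortran parts with gfortran's
-finit-real=snan -finit-integer=-1, so that any uninitialized local
immediately shows up as a NaN or an absurd index, and/or to run the
tester under valgrind:

mpirun -np 4 valgrind --track-origins=yes ./xdsyevr

With --track-origins the "use of uninitialised value" reports usually
point back at the stack frame or allocation the garbage comes from.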

Moreover, the eigenvalue tester is a pretty sensitive piece of code. I
would strongly suggest you send an email with your findings to the
ScaLAPACK mailing list.

George.



On Wed, Nov 23, 2016 at 9:41 AM, Christof Koehler <
Post by Christof Köhler
Hello everybody,
as promised I started to test on my laptop (which has only two physical
cores, in case that matters).
As I discovered the story is not as simple as I assumed. I was focusing
on xdsyevr when testing on the workstation and overlooked the others.
On the cluster the only test which throws errors is xdsyevr with 2.0.1.
With 1.10.4 everything is fine. I double checked this by now.
On the workstation I get "136 tests completed and failed." in xcheevr
with 1.10.4 which I overlooked. With 2.0.1 I get "136 tests completed
and failed" in xdsyevr and xssyevr.
On the laptop I am not sure yet, I ran out of battery power. But it
looked similar to the workstation. Failures with both versions.
So, there is certainly a factor unrelated to OpenMPI. It might even be
that this failures are complete noise. I will try to investigate this
further. If some list member has a good idea how to test and what to
look for I would appreciate a hint. Also, perhaps someone could try to
replicate this.
Thank you for your help so far.
Best Regards
Christof
Christof Koehler
2016-11-24 10:11:56 UTC
Permalink
Hello,

thank you very much for your input. I will then certainly contact the
scalapack list after making sure this is not a problem with the
lapack/blas libraries and making tests with mvapich on the workstation
and laptop.
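
For the lapack/blas check, what I have in mind is roughly the following
(the paths are placeholders and the BLASLIB/LAPACKLIB/LIBS names are the
ones from the stock SLmake.inc example, so this is only a sketch):

# point SLmake.inc at the reference Netlib BLAS/LAPACK instead of openblas
BLASLIB   = /path/to/netlib/libblas.a
LAPACKLIB = /path/to/netlib/liblapack.a
LIBS      = $(LAPACKLIB) $(BLASLIB)
# rebuild the testers and rerun the same case under both OpenMPI versions
mpirun -n 4 ./xdsyevr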

I should have made the workstation/laptop test earlier, but at
that time it looked like a clear-cut problem. In addition, we have this
relatively new network stack and no experience with it yet (it is
tempting to just assume that bugs remain in a stack as new as
Omni-Path/psm2, or in whatever interfaces with it). In fact, we have a quite
repeatable deadlock in vasp 5.3.5 with mvapich in reduce/allreduce,
but that is a discussion for another list.

Thank you very much again.

Cheers

Christof Köhler
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
Christof Koehler
2016-11-28 12:12:37 UTC
Permalink
Hello everybody,

just to bring this to some conclusion for other people who might
eventually find this thread.

I tried several times to submit a message to the scalapack/lapack forum.
However, regardless of what I do, my message gets flagged as spam and is
rejected. A strongly abbreviated message got through, but it apparently
needs admin approval, which has not been given yet.

So, since I am unable to reach the right people, this issue is
closed from my side as far as reporting to the ScaLAPACK developers is
concerned.

Cheers

Christof
--
Dr. rer. nat. Christof Köhler email: ***@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770
28359 Bremen

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/