Discussion:
[OMPI users] Python code inconsistency on complex multiplication in MPI (MPI4py)
Konstantinos Konstantinidis
2018-05-22 18:51:03 UTC
Assume a Python MPI program where a master node sends a pair of complex
matrices to each worker node and the worker node is supposed to compute
their product (conventional matrix product). The input matrices are
constructed at the master node according to some algorithm that there is
no need to explain here. Now imagine, for simplicity, that we have only 2 MPI
processes, one master and one worker. I have created two versions of this
program for this case. The first one constructs two complex numbers (1-by-1
matrices for simplicity) and sends them to the worker to compute the
product. This program is like a skeleton for what I am trying to do with
multiple workers. In the second program, I have omitted the algorithm and
have just hard-coded these two complex numbers into the code. The programs
are supposed to give the same product, shown here:

a = 28534314.10478439+28534314.10478436j

b = -1.39818115e+09+1.39818115e+09j

a*b = -7.97922802e+16+48j

This has been checked in Matlab. However, the first program does not work:
the worker gives a*b = -7.97922801e+16+28534416.j, while the second
program works correctly. Please note that the data is transmitted correctly
from the master to the worker and the data structures are the same in both
cases (see the print() functions).

The first (wrong) program is program1.py and the second (correct) is
program2.py.
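
Since the attachments are not included in this archive, the following is only
a rough sketch of the kind of exchange described (the hard-coded variant),
assuming 1-by-1 np.complex128 arrays sent with mpi4py's buffer-based
Send/Recv; the actual programs apparently pass an explicit MPI datatype (see
the first reply below):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Master: build the 1-by-1 complex matrices and send them to the worker.
    a = np.array([[28534314.10478439 + 28534314.10478436j]], dtype=np.complex128)
    b = np.array([[-1.39818115e+09 + 1.39818115e+09j]], dtype=np.complex128)
    comm.Send(a, dest=1, tag=11)
    comm.Send(b, dest=1, tag=22)
elif rank == 1:
    # Worker: receive both matrices and compute the conventional matrix product.
    a = np.empty((1, 1), dtype=np.complex128)
    b = np.empty((1, 1), dtype=np.complex128)
    comm.Recv(a, source=0, tag=11)
    comm.Recv(b, source=0, tag=22)
    print(np.dot(a, b))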

I am using MPI4py 3.0.0 along with Python 2.7.14 and Open MPI 2.1.2. I have
been struggling with this problem for a whole day and still cannot figure
out what's going on. I have tried numerous initializations like np.zeros(),
np.zeros_like(), and np.empty_like(), as well as both np.array and
np.matrix, and the functions np.dot(), np.matmul(), and the operator *.

Finally, I think that the problem is always with the imaginary part of the
product, based on other examples I tried. Any suggestions?
Jeff Squyres (jsquyres)
2018-05-22 22:05:21 UTC
There are two issues:

1. You should be using MPI.C_COMPLEX, not MPI.COMPLEX. MPI.COMPLEX is a Fortran datatype; MPI.C_COMPLEX is the C datatype (which is what NumPy is using behind the scenes).

2. Somehow the received B values are different between the two.

I derived this program from your two programs to show the difference:

https://gist.github.com/jsquyres/2ed86736e475e9e9ccd08b66378ef968

I don't know offhand how mpi4py sends floating point values -- but I'm guessing that either mpi4py or numpy is pickling the floating point values (vs. sending the exact bit pattern of the floating point value), and some precision is being lost either in the pickling or the de-pickling. That's a guess, though.
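
(As a point of reference, mpi4py pickles general Python objects only with the
lowercase send()/recv() methods; the uppercase Send()/Recv() calls used with
NumPy arrays transfer the raw buffer. A minimal sketch of the two paths, with
made-up variable names:)

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

b = np.array([[-1.39818115e+09 + 1.39818115e+09j]], dtype=np.complex128)

if rank == 0:
    comm.Send(b, dest=1, tag=0)   # buffer path: the raw bytes of the array are sent
    comm.send(b, dest=1, tag=1)   # pickle path: the object is serialized and rebuilt
elif rank == 1:
    buf = np.empty_like(b)
    comm.Recv(buf, source=0, tag=0)
    obj = comm.recv(source=0, tag=1)
    np.set_printoptions(precision=20)
    print(buf, obj)               # compare what each path delivered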
--
Jeff Squyres
***@cisco.com
Konstantinos Konstantinidis
2018-05-22 22:38:21 UTC
Thanks Jeff.

I ran your code and saw your point. Based on that, it seems that my
comparison by just printing the values was misleading.

I have two questions for you:

1. Can you please describe your setup, i.e. Python version, NumPy version,
MPI4py version, and Open MPI version? I'm asking since I am thinking of
doing a fresh build and trying Python 3. What do you think?

2. When I try the following code (which manually computes the imaginary
part of that same complex number) at any receiver:

import numpy as np

# Accumulate the imaginary part of a*b by hand: Im(a)*Re(b) + Re(a)*Im(b).
C_imag = (np.dot(-28534314.10478436, 1.39818115e+09)
          + np.dot(28534314.10478439, 1.39818115e+09))
print(C_imag)

I see that the answer is 48, which is correct. Do you think that this fact
points to MPI4py as the source of the precision loss problem, instead of
NumPy?

Honestly, I don't understand how such serious bugs could remain unresolved.
Ben Menadue
2018-05-22 23:50:16 UTC
Hi Jeff, Konstantinos,

I think you might want MPI.C_DOUBLE_COMPLEX for your datatype, since np.complex128 is a double-precision type. But I think either it's ignoring this and using the datatype of the object you're sending, or MPI4py is handling the conversion in the backend somewhere. You could actually just drop the datatype specification and let MPI4py select the datatype for you, as you do on the receiver side.

Modifying Jeff’s script to print out the product on the sender side as well, I see this:

Sender computed (first):
[[-7.97922801e+16+28534416.j]]
Receiver computed (first):
[[-7.97922801e+16+28534416.j]]
Sender computed (second):
[[-7.97922802e+16+48.j]]
Receiver computed (second):
[[-7.97922802e+16+48.j]]

Even the real part of the result is slightly different between the two approaches (as is the case for your results). So the values are probably being sent correctly; it's just that the values being sent are different. Adding np.set_printoptions(precision=20) to the program shows this:

Sender sent (first):
[[28534314.10478439+28534314.10478436j]]
[[-1.3981811475968072e+09+1.3981811485968091e+09j]]
Sender sent (second):
[[28534314.10478439+28534314.10478436j]]
[[-1.39818115e+09+1.39818115e+09j]]

If the second value is what you expect from your construction algorithm, then I suspect you're just seeing natural floating-point precision loss inside one of the functions you're calling there. Otherwise, if you made the second input by copying the output from the first, you just didn't copy enough decimal places :-) .
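
As a standalone check (a sketch, not Jeff's gist), multiplying by the two b
values printed above reproduces both results without any MPI involved:

import numpy as np

a = np.array([[28534314.10478439 + 28534314.10478436j]], dtype=np.complex128)
# The two versions of b printed above: the one produced by the construction
# algorithm, and the hard-coded one.
b_first = np.array([[-1.3981811475968072e+09 + 1.3981811485968091e+09j]], dtype=np.complex128)
b_second = np.array([[-1.39818115e+09 + 1.39818115e+09j]], dtype=np.complex128)

print(b_first, b_second)      # look identical at the default print precision
np.set_printoptions(precision=20)
print(b_first, b_second)      # differ once enough digits are shown
print(np.dot(a, b_first))     # imaginary part around 2.85e7, like the "first" result
print(np.dot(a, b_second))    # imaginary part around 48, like the "second" result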

Cheers,
Ben
Lisandro Dalcin
2018-05-23 22:23:20 UTC
Konstantinos Konstantinidis
2018-05-24 01:02:47 UTC
Thanks Ben and Lisandro.

You are right, my comparison based on print() was misleading in terms of
precision, so I probably didn't copy enough decimal places :)

I will also try skipping the datatype specification and letting MPI4py
select the datatype, and see what's going on.
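
For the record, roughly what that looks like (a sketch, assuming the same
1-by-1 np.complex128 buffer as in the earlier examples):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.Get_rank() == 0:
    a = np.array([[28534314.10478439 + 28534314.10478436j]], dtype=np.complex128)
    # Explicit datatype: comm.Send([a, MPI.C_DOUBLE_COMPLEX], dest=1, tag=0)
    # Passing the array alone lets mpi4py infer the MPI datatype from complex128.
    comm.Send(a, dest=1, tag=0)
else:
    a = np.empty((1, 1), dtype=np.complex128)
    comm.Recv(a, source=0, tag=0)
    np.set_printoptions(precision=20)
    print(a)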