Discussion:
[OMPI users] Bad file descriptor segmentation fault on an MPI4py program
Konstantinos Konstantinidis
2018-05-31 20:41:05 UTC
Consider matrices A (s x r) and B (s x t). In the attached file, I do matrix
multiplication in a distributed manner, with one master node and N workers,
in order to compute C = A^T * B based on a particular algorithm.
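The attached script is not reproduced here, but a minimal mpi4py sketch of the
kind of master/worker computation described above could look like the following
(an illustrative column-block split of B, not the actual algorithm from the
attachment; the dimensions and worker count mirror the report):

# Rank 0 is the master, ranks 1..N are workers; run with e.g. 6 processes.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N = comm.Get_size() - 1          # number of workers

s = r = t = 1000                 # square, even dimensions as in the report
assert t % N == 0                # each worker gets an equal column block of B

if rank == 0:
    A = np.random.rand(s, r)
    B = np.random.rand(s, t)
    blocks = np.hsplit(B, N)     # split B into N column blocks
    for w in range(1, N + 1):
        # pickle-based send of the NumPy arrays to each worker
        comm.send((A, blocks[w - 1]), dest=w)
    # gather the partial products and assemble C column-wise
    C = np.hstack([comm.recv(source=w) for w in range(1, N + 1)])
    print(np.allclose(C, A.T.dot(B)))
else:
    A, B_block = comm.recv(source=0)
    comm.send(A.T.dot(B_block), dest=0)   # partial product, shape r x (t/N)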

For small matrices, e.g. when A and B are 10-by-10, I get the correct results
without any error. However, if I make A and B 1000-by-1000, the result is still
correct but I get the following error at the end of the execution:

[kostas-VirtualBox:02688] Read -1, expected 4000000, errno = 14
[kostas-VirtualBox:02688] *** Process received signal ***
[kostas-VirtualBox:02688] Signal: Segmentation fault (11)
[kostas-VirtualBox:02688] Signal code: Address not mapped (1)
[kostas-VirtualBox:02688] Failing at address: 0x5096ea0
[kostas-VirtualBox:02688] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f232f8de390]
[kostas-VirtualBox:02688] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14e156)[0x7f232f651156]
[kostas-VirtualBox:02688] [ 2] /usr/local/lib/libopen-pal.so.20(opal_convertor_unpack+0x188)[0x7f232d4fa7d6]
[kostas-VirtualBox:02688] [ 3] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_frag+0x230)[0x7f23237de373]
[kostas-VirtualBox:02688] [ 4] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_frag+0x6f)[0x7f23237da235]
[kostas-VirtualBox:02688] [ 5] /usr/local/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x1d7)[0x7f2328327dfb]
[kostas-VirtualBox:02688] [ 6] /usr/local/lib/openmpi/mca_btl_vader.so(+0x6f17)[0x7f2328327f17]
[kostas-VirtualBox:02688] [ 7] /usr/local/lib/openmpi/mca_btl_vader.so(+0x70ea)[0x7f23283280ea]
[kostas-VirtualBox:02688] [ 8] /usr/local/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7f232d4e197d]
[kostas-VirtualBox:02688] [ 9] /usr/local/lib/libmpi.so.20(ompi_mpi_finalize+0x359)[0x7f232db1d31e]
[kostas-VirtualBox:02688] [10] /usr/local/lib/libmpi.so.20(PMPI_Finalize+0x59)[0x7f232db49cdf]
[kostas-VirtualBox:02688] [11] /home/kostas/.local/lib/python2.7/site-packages/mpi4py/MPI.so(+0x2ed6c)[0x7f232de7bd6c]
[kostas-VirtualBox:02688] [12] python2[0x4354d8]
[kostas-VirtualBox:02688] [13] python2(Py_Main+0x43c)[0x497acc]
[kostas-VirtualBox:02688] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f232f523830]
[kostas-VirtualBox:02688] [15] python2(_start+0x29)[0x4975a9]
[kostas-VirtualBox:02688] *** End of error message ***
[warn] Epoll ADD(4) on fd 38 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node kostas-VirtualBox exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

If you want to reproduce the problem, please keep the same parameters I used
in the code, since the algorithm imposes some constraints on them. Also,
please use 6 MPI processes (rank 0 is the master and the remaining N = 5 are
the workers), and keep the matrix dimensions r, s, t equal (s == r == t) and
even.
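For example, assuming the attached script were saved as matmul.py (a
placeholder name, not the real filename), the run would be:

mpirun -np 6 python2 matmul.py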

I am using MPI4py 3.0.0 along with Python 2.7.14, NumPy 1.14.3, and Open MPI
2.1.2.

I cannot understand how I can get a bad file descriptor when I am not writing
to any file.
Nathan Hjelm
2018-05-31 21:22:43 UTC
This is a known bug due to the incorrect (or incomplete) documentation for Linux CMA. I believe it is fixed in 2.1.3.

-Nathan
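The "Read -1, expected 4000000, errno = 14" line appears to be what the vader
BTL prints when its CMA single-copy read (process_vm_readv) fails, which is
consistent with the explanation above. Until an upgrade to a fixed release is
possible, a workaround that is often suggested is to disable the vader BTL's
single-copy mechanism, e.g. (again using the placeholder script name):

mpirun --mca btl_vader_single_copy_mechanism none -np 6 python2 matmul.py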
