Discussion:
[OMPI users] mpi send/recv pair hanging
Noam Bernstein
2018-04-05 14:16:21 UTC
Hi all - I have a code that uses MPI (VASP), and it’s hanging in a strange way. There’s a Cartesian communicator, 4x16 (64 processes total), and even though the communication pattern is quite regular, one particular send/recv pair hangs consistently. Across each row of 4, task 0 receives from tasks 1, 2, and 3, and tasks 1, 2, and 3 send to 0. On most of the 16 such sets all those send/recv pairs complete; however, on 2 of them both the send and the recv hang. I have stack traces (taken with gdb -p on the running processes) from what I believe are corresponding send/recv pairs.
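
For concreteness, a minimal standalone sketch of that communication pattern (this is not VASP's code; the communicator split, buffer size, and tag are placeholders, the 200 simply matching the tag that shows up in the debug traces later in this thread):

/* Sketch of the pattern described above: 64 ranks split into rows of 4;
 * within each row, ranks 1..3 send one message to the row leader (row
 * rank 0), which receives them in turn. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* One communicator per row of 4, analogous to one dimension of the 4x16 grid. */
    MPI_Comm row;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / 4, world_rank % 4, &row);

    int row_rank, row_size;
    MPI_Comm_rank(row, &row_rank);
    MPI_Comm_size(row, &row_size);

    double buf[1024] = {0};
    const int tag = 200;   /* placeholder tag */

    if (row_rank == 0) {
        /* Row leader receives from each other rank in the row, in turn. */
        for (int src = 1; src < row_size; src++)
            MPI_Recv(buf, 1024, MPI_DOUBLE, src, tag, row, MPI_STATUS_IGNORE);
    } else {
        MPI_Send(buf, 1024, MPI_DOUBLE, 0, tag, row);
    }

    MPI_Comm_free(&row);
    MPI_Finalize();
    return 0;
}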

receiving:
0x00002b06eeed0eb2 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#0 0x00002b06eeed0eb2 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#1 0x00002b06f0a5d2de in poll_device () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#2 0x00002b06f0a5e0af in btl_openib_component_progress () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#3 0x00002b06dd3c00b0 in opal_progress () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40
#4 0x00002b06f1c9232d in mca_pml_ob1_recv () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_pml_ob1.so
#5 0x00002b06dce56bb7 in PMPI_Recv () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi.so.40
#6 0x00002b06dcbd1e0b in pmpi_recv__ () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40
#7 0x000000000042887b in m_recv_z (comm=..., node=-858993460, zvec=) at mpi.F:680
#8 0x000000000123e0b7 in fileio::outwav (io=..., wdes=..., w=) at fileio.F:952
#9 0x0000000002abfccf in vamp () at main.F:4204
#10 0x00000000004139de in main ()
#11 0x000000314561ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x00000000004138e9 in _start ()
sending:
0x00002abc32ed0ea1 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#0 0x00002abc32ed0ea1 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#1 0x00002abc34a5d2de in poll_device () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#2 0x00002abc34a5e0af in btl_openib_component_progress () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#3 0x00002abc238800b0 in opal_progress () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40
#4 0x00002abc35c95955 in mca_pml_ob1_send () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_pml_ob1.so
#5 0x00002abc2331c412 in PMPI_Send () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi.so.40
#6 0x00002abc230927e0 in pmpi_send__ () from /usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40
#7 0x0000000000428798 in m_send_z (comm=..., node=) at mpi.F:655
#8 0x000000000123d0a9 in fileio::outwav (io=..., wdes=) at fileio.F:942
#9 0x0000000002abfccf in vamp () at main.F:4204
#10 0x00000000004139de in main ()
#11 0x0000003cec81ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x00000000004138e9 in _start ()

This is with OpenMPI 3.0.1 (same for 3.0.0; I haven’t checked older versions) and Intel compilers (17.2.174). It seems to be independent of which nodes are used, it always happens on this pair of calls, it happens only after the code has been running for a while, and the same code works fine for the other 14 sets of 4, all of which suggests an MPI issue rather than an obvious bug in this code or a hardware problem. Does anyone have any ideas, either about possible causes or how to debug things further?

thanks,
Noam
Reuti
2018-04-05 15:03:56 UTC
Hi,

> On 05.04.2018 at 16:16, Noam Bernstein <***@nrl.navy.mil> wrote:
>
> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across each row of 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. On most of the 16 such sets all those send/recv pairs complete. However, on 2 of them, it hangs (both the send and recv). I have stack traces (with gdb -p on the running processes) from what I believe are corresponding send/recv pairs.
>
> <snip>
>
> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), Intel compilers (17.2.174). It seems to be independent of which nodes, always happens on this pair of calls and happens after the code has been running for a while, and the same code for the other 14 sets of 4 work fine, suggesting that it’s an MPI issue, rather than an obvious bug in this code or a hardware problem. Does anyone have any ideas, either about possible causes or how to debug things further?

Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the Intel compilers for VASP and found that a self-compiled scaLAPACK works fine in combination with Open MPI. Intel scaLAPACK with Intel MPI also works fine. What I never got working was the combination of Intel scaLAPACK and Open MPI – at one point one process got a message from a wrong rank, IIRC. I tried both the Intel-supplied Open MPI version of scaLAPACK and compiling the necessary interface for Open MPI myself in $MKLROOT/interfaces/mklmpi, with identical results.

-- Reuti
Noam Bernstein
2018-04-05 15:15:24 UTC
> On Apr 5, 2018, at 11:03 AM, Reuti <***@staff.uni-marburg.de> wrote:
>
> Hi,
>
>> On 05.04.2018 at 16:16, Noam Bernstein <***@nrl.navy.mil> wrote:
>>
>> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across each row of 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. On most of the 16 such sets all those send/recv pairs complete. However, on 2 of them, it hangs (both the send and recv). I have stack traces (with gdb -p on the running processes) from what I believe are corresponding send/recv pairs.
>>
>> <snip>
>>
>> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), Intel compilers (17.2.174). It seems to be independent of which nodes, always happens on this pair of calls and happens after the code has been running for a while, and the same code for the other 14 sets of 4 work fine, suggesting that it’s an MPI issue, rather than an obvious bug in this code or a hardware problem. Does anyone have any ideas, either about possible causes or how to debug things further?
>
> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the Intel compilers for VASP and found, that using in addition a self-compiled scaLAPACK is working fine in combination with Open MPI. Using Intel scaLAPACK and Intel MPI is also working fine. What I never got working was the combination Intel scaLAPACK and Open MPI – at one point one process got a message from a wrong rank IIRC. I tried both: the Intel supplied Open MPI version of scaLAPACK and also compiling the necessary interface on my own for Open MPI in $MKLROOT/interfaces/mklmpi with identical results.

MKL BLAS/LAPACK, with my own self-compiled scaLAPACK, but in this run I set LSCALAPACK=.FALSE. I suppose I could try compiling without it just to test. In any case, this happens when it’s writing out the wavefunctions, which I would assume to be unrelated to scaLAPACK operations (unless they’re corrupting some low-level MPI thing, I guess).

Noam
Noam Bernstein
2018-04-05 15:39:23 UTC
> On Apr 5, 2018, at 11:32 AM, Edgar Gabriel <***@Central.UH.EDU> wrote:
>
> is the file I/O that you mentioned using MPI I/O for that? If yes, what file system are you writing to?

No MPI I/O. Just MPI calls to gather the data, and plain Fortran I/O on the head node only.

I should also say that in lots of other circumstances (different node numbers, computational systems, etc) it works fine. But the hang is completely repeatable for this particular set of parameters (MPI and physical simulation). I haven’t explored to see what variations do/don’t lead to this kind of hanging.

Noam


____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Edgar Gabriel
2018-04-05 16:01:24 UTC
is the file I/O that you mentioned using MPI I/O for that? If yes, what
file system are you writing to?

Edgar


On 4/5/2018 10:15 AM, Noam Bernstein wrote:
>> On Apr 5, 2018, at 11:03 AM, Reuti <***@staff.uni-marburg.de> wrote:
>>
>> Hi,
>>
>>> On 05.04.2018 at 16:16, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>
>>> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and despite the fact that the communication pattern is rather regular, one particular send/recv pair hangs consistently. Basically, across each row of 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. On most of the 16 such sets all those send/recv pairs complete. However, on 2 of them, it hangs (both the send and recv). I have stack traces (with gdb -p on the running processes) from what I believe are corresponding send/recv pairs.
>>>
>>> <snip>
>>>
>>> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), Intel compilers (17.2.174). It seems to be independent of which nodes, always happens on this pair of calls and happens after the code has been running for a while, and the same code for the other 14 sets of 4 work fine, suggesting that it’s an MPI issue, rather than an obvious bug in this code or a hardware problem. Does anyone have any ideas, either about possible causes or how to debug things further?
>> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the Intel compilers for VASP and found, that using in addition a self-compiled scaLAPACK is working fine in combination with Open MPI. Using Intel scaLAPACK and Intel MPI is also working fine. What I never got working was the combination Intel scaLAPACK and Open MPI – at one point one process got a message from a wrong rank IIRC. I tried both: the Intel supplied Open MPI version of scaLAPACK and also compiling the necessary interface on my own for Open MPI in $MKLROOT/interfaces/mklmpi with identical results.
> MKL BLAS/LAPACK, with my own self-compiled scalapack, but in this run I set LSCALAPCK=.FALSE. I suppose I could try compiling without it just to test. In any case, this is when it’s writing out the wavefunctions, which I would assume be unrelated to scalapack operations (unless they’re corrupting some low level MPI thing, I guess).
>
> Noam
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
George Bosilca
2018-04-05 19:55:52 UTC
Noam,

The OB1 PML provides a mechanism to dump all pending communications in a
particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
1), with comm being the MPI_Comm and 1 being the verbose mode. I have no
idea how you can find the pointer to the communicator from your code, but
if you compile OMPI in debug mode you will see it as an argument to the
mca_pml_ob1_send and mca_pml_ob1_recv functions.

This information will give us a better idea of what happened to the
message, whether it has been sent (or not), and what source and tag were
used for the matching.

George.



On Thu, Apr 5, 2018 at 12:01 PM, Edgar Gabriel <***@central.uh.edu>
wrote:

> is the file I/O that you mentioned using MPI I/O for that? If yes, what
> file system are you writing to?
>
> Edgar
>
>
>
> On 4/5/2018 10:15 AM, Noam Bernstein wrote:
>
>> On Apr 5, 2018, at 11:03 AM, Reuti <***@staff.uni-marburg.de> wrote:
>>>
>>> Hi,
>>>
>>>> On 05.04.2018 at 16:16, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>>
>>>> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a
>>>> strange way. Basically, there’s a Cartesian communicator, 4x16 (64
>>>> processes total), and despite the fact that the communication pattern is
>>>> rather regular, one particular send/recv pair hangs consistently.
>>>> Basically, across each row of 4, task 0 receives from 1,2,3, and tasks
>>>> 1,2,3 send to 0. On most of the 16 such sets all those send/recv pairs
>>>> complete. However, on 2 of them, it hangs (both the send and recv). I
>>>> have stack traces (with gdb -p on the running processes) from what I
>>>> believe are corresponding send/recv pairs.
>>>>
>>>> <snip>
>>>>
>>>> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older
>>>> versions), Intel compilers (17.2.174). It seems to be independent of which
>>>> nodes, always happens on this pair of calls and happens after the code has
>>>> been running for a while, and the same code for the other 14 sets of 4 work
>>>> fine, suggesting that it’s an MPI issue, rather than an obvious bug in this
>>>> code or a hardware problem. Does anyone have any ideas, either about
>>>> possible causes or how to debug things further?
>>>>
>>> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL
>>> with the Intel compilers for VASP and found, that using in addition a
>>> self-compiled scaLAPACK is working fine in combination with Open MPI. Using
>>> Intel scaLAPACK and Intel MPI is also working fine. What I never got
>>> working was the combination Intel scaLAPACK and Open MPI – at one point one
>>> process got a message from a wrong rank IIRC. I tried both: the Intel
>>> supplied Open MPI version of scaLAPACK and also compiling the necessary
>>> interface on my own for Open MPI in $MKLROOT/interfaces/mklmpi with
>>> identical results.
>>>
>> MKL BLAS/LAPACK, with my own self-compiled scalapack, but in this run I
>> set LSCALAPCK=.FALSE. I suppose I could try compiling without it just to
>> test. In any case, this is when it’s writing out the wavefunctions, which
>> I would assume be unrelated to scalapack operations (unless they’re
>> corrupting some low level MPI thing, I guess).
>>
>>
>> Noam
>>
>> _______________________________________________
>> users mailing list
>> ***@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Noam Bernstein
2018-04-05 20:03:45 UTC
> On Apr 5, 2018, at 3:55 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> The OB1 provide a mechanism to dump all pending communications in a particular communicator. To do this I usually call mca_pml_ob1_dump(comm, 1), with comm being the MPI_Comm and 1 being the verbose mode. I have no idea how you can find the pointer to the communicator out of your code, but if you compile OMPI in debug mode you will see it as an argument to the mca_pml_ob1_send and mca_pml_ob1_recv function.
>
> This information will give us a better idea on what happened to the message, where is has been sent (or not), and what were the source and tag used for the matching.

Interesting. How would you do this in a hung program? Call it before you call the things that you expect will hang? And any ideas on how to get a communicator pointer from Fortran?

Noam
George Bosilca
2018-04-05 20:11:52 UTC
I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm, 1)".
This allows the debugger to call our function and output internal
information about the library status.
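
In rough outline, something like this (the PID and frame number are placeholders; use whichever frame shows the comm argument):

gdb -p <pid of a hung rank>
(gdb) bt                                # find the mca_pml_ob1_recv / mca_pml_ob1_send frame
(gdb) frame 6                           # placeholder frame number
(gdb) call mca_pml_ob1_dump(comm, 1)    # or paste the comm pointer value directly
(gdb) detach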

George.



On Thu, Apr 5, 2018 at 4:03 PM, Noam Bernstein <***@nrl.navy.mil>
wrote:

> On Apr 5, 2018, at 3:55 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> The OB1 provide a mechanism to dump all pending communications in a
> particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
> 1), with comm being the MPI_Comm and 1 being the verbose mode. I have no
> idea how you can find the pointer to the communicator out of your code, but
> if you compile OMPI in debug mode you will see it as an argument to the mca_pml_ob1_send
> and mca_pml_ob1_recv function.
>
> This information will give us a better idea on what happened to the
> message, where is has been sent (or not), and what were the source and tag
> used for the matching.
>
>
> Interesting. How would you do this in a hung program? Call it before you
> call the things that you expect will hang? And any ideas how to get a
> communicator pointer from fortran?
>
> Noam
>
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Noam Bernstein
2018-04-05 20:20:54 UTC
> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm, 1)". This allows the debugger to make a call our function, and output internal information about the library status.

Great. But I guess I need to recompile ompi in debug mode? Is that just a flag to configure?

thanks,
Noam



____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
George Bosilca
2018-04-05 20:31:34 UTC
Yes, you can do this by adding --enable-debug to the OMPI configure (and make
sure you don't have the configure flag --with-platform=optimize).
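
Something along these lines (the install prefix and any other options are placeholders for whatever you normally use):

./configure --prefix=<install-dir> --enable-debug <your usual options>
make -j 8 all
make install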

George.


On Thu, Apr 5, 2018 at 4:20 PM, Noam Bernstein <***@nrl.navy.mil>
wrote:

>
> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
> 1)". This allows the debugger to make a call our function, and output
> internal information about the library status.
>
>
> Great. But I guess I need to recompile ompi in debug mode? Is that just
> a flag to configure?
>
> thanks,
> Noam
>
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Gilles Gouaillardet
2018-04-05 23:54:09 UTC
Noam,

you might also want to try

mpirun --mca btl tcp,self ...

to rule out btl (shared memory and/or infiniband) related issues.


Once you rebuild Open MPI with --enable-debug, I recommend you first
check the arguments of the MPI_Send() and MPI_Recv() functions and
make sure that:
- the same communicator is used (in C, check comm->c_contextid)
- the same tag is used
- the MPI tasks really do wait for each other (in C, check
comm->c_my_rank, source and dest)
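
With the debug build, a quick way to make those checks is from the PMPI frame of each hung process, e.g. (the frame number is a placeholder; pick whichever frame shows the arguments):

(gdb) frame 7                  # the PMPI_Recv frame (PMPI_Send on the sending side)
(gdb) print comm->c_contextid
(gdb) print comm->c_my_rank
(gdb) print source             # "dest" on the sending side
(gdb) print tag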


Cheers,

Gilles

On Fri, Apr 6, 2018 at 5:31 AM, George Bosilca <***@icl.utk.edu> wrote:
> Yes, you can do this by adding --enable-debug to OMPI configure (and make
> sure your don't have the configure flag --with-platform=optimize).
>
> George.
>
>
> On Thu, Apr 5, 2018 at 4:20 PM, Noam Bernstein <***@nrl.navy.mil>
> wrote:
>>
>>
>> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>>
>> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
>> 1)". This allows the debugger to make a call our function, and output
>> internal information about the library status.
>>
>>
>> Great. But I guess I need to recompile ompi in debug mode? Is that just
>> a flag to configure?
>>
>> thanks,
>> Noam
>>
>>
>>
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628 F +1 202 404 7546
>> https://www.nrl.navy.mil
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
Noam Bernstein
2018-04-06 16:53:07 UTC
> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm, 1)". This allows the debugger to make a call our function, and output internal information about the library status.

OK - after a number of missteps, I recompiled openmpi with debugging mode active, reran the executable (didn’t recompile it, just used the new library), and got the comm pointer by attaching to the process and looking at the stack trace:

#0 0x00002b8a7599c42b in ibv_poll_cq (cq=0xec66010, num_entries=256, wc=0x7ffdea76d680) at /usr/include/infiniband/verbs.h:1272
#1 0x00002b8a759a8194 in poll_device (device=0xebc5300, count=0) at btl_openib_component.c:3608
#2 0x00002b8a759a871f in progress_one_device (device=0xebc5300) at btl_openib_component.c:3741
#3 0x00002b8a759a87be in btl_openib_component_progress () at btl_openib_component.c:3765
#4 0x00002b8a64b9da42 in opal_progress () at runtime/opal_progress.c:222
#5 0x00002b8a76c2c199 in ompi_request_wait_completion (req=0xec22600) at ../../../../ompi/request/request.h:392
#6 0x00002b8a76c2d642 in mca_pml_ob1_recv (addr=0x2b8a8a99bf20, count=5423600, datatype=0x2b8a64832b80, src=1, tag=200, comm=0xed932d0, status=0x385dd90) at pml_ob1_irecv.c:135
#7 0x00002b8a6454c857 in PMPI_Recv (buf=0x2b8a8a99bf20, count=5423600, type=0x2b8a64832b80, source=1, tag=200, comm=0xed932d0, status=0x385dd90) at precv.c:79
#8 0x00002b8a6428ca7c in ompi_recv_f (buf=0x2b8a8a99bf20 "DB»\373\v{\277\204\333\336\306[B\205\277\030ҀҶ\250v\277\225\377qW\001\251w?\240\020\202&=)S\277\202+\214\067\224\345R?\272\221Co\236\206\217?", count=0x7ffdea770eb4, datatype=0x2d43bec, source=0x7ffdea770a38,
tag=0x2d43bf0, comm=0x5d30a68, status=0x385dd90, ierr=0x7ffdea770a3c) at precv_f.c:85
#9 0x000000000042887b in m_recv_z (comm=..., node=-858993460, zvec=Cannot access memory at address 0x2d
) at mpi.F:680
#10 0x000000000123e0f1 in fileio::outwav (io=..., wdes=..., w=Cannot access memory at address 0x2d
) at fileio.F:952
#11 0x0000000002abfd8f in vamp () at main.F:4204
#12 0x00000000004139de in main ()
#13 0x0000003f0c81ed1d in __libc_start_main () from /lib64/libc.so.6
#14 0x00000000004138e9 in _start ()

The comm value is different in ompi_recv_f and the things below it, so I tried both. With the value from the lower-level functions I get nothing useful:
(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0
and with the value from ompi_recv_f I get a seg fault:
(gdb) call mca_pml_ob1_dump(0x5d30a68, 1)

Program received signal SIGSEGV, Segmentation fault.
0x00002b8a76c26d0d in mca_pml_ob1_dump (comm=0x5d30a68, verbose=1) at pml_ob1.c:577
577 opal_output(0, "Communicator %s [%p](%d) rank %d recv_seq %d num_procs %lu last_probed %lu\n",
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mca_pml_ob1_dump) will be abandoned.
When the function is done executing, GDB will silently stop.

Should this have worked, or am I doing something wrong?

thanks,
Noam
George Bosilca
2018-04-06 17:41:56 UTC
Noam,

According to your stack trace, the correct way to call mca_pml_ob1_dump
is with the communicator from the PMPI call. Thus, this call was successful:

(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0


I should have been clearer: the output does not appear in gdb but on the
output stream of your application. If you run your application by hand with
mpirun, the output should be on the terminal where you started mpirun. If
you start your job with a batch scheduler, the output should be in the
output file associated with your job.

George.



On Fri, Apr 6, 2018 at 12:53 PM, Noam Bernstein <***@nrl.navy.mil
> wrote:

> On Apr 5, 2018, at 4:11 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> I attach with gdb on the processes and do a "call mca_pml_ob1_dump(comm,
> 1)". This allows the debugger to make a call our function, and output
> internal information about the library status.
>
>
> OK - after a number of missteps, I recompiled openmpi with debugging mode
> active, reran the executable (didn’t recompile, just using the new
> library), and got the comm pointer by attaching to the process and looking
> at the stack trace:
>
> #0 0x00002b8a7599c42b in ibv_poll_cq (cq=0xec66010, num_entries=256,
> wc=0x7ffdea76d680) at /usr/include/infiniband/verbs.h:1272
> #1 0x00002b8a759a8194 in poll_device (device=0xebc5300, count=0) at
> btl_openib_component.c:3608
> #2 0x00002b8a759a871f in progress_one_device (device=0xebc5300) at
> btl_openib_component.c:3741
> #3 0x00002b8a759a87be in btl_openib_component_progress () at
> btl_openib_component.c:3765
> #4 0x00002b8a64b9da42 in opal_progress () at runtime/opal_progress.c:222
> #5 0x00002b8a76c2c199 in ompi_request_wait_completion (req=0xec22600) at
> ../../../../ompi/request/request.h:392
> #6 0x00002b8a76c2d642 in mca_pml_ob1_recv (addr=0x2b8a8a99bf20,
> count=5423600, datatype=0x2b8a64832b80, src=1, tag=200, comm=0xed932d0,
> status=0x385dd90) at pml_ob1_irecv.c:135
> #7 0x00002b8a6454c857 in PMPI_Recv (buf=0x2b8a8a99bf20, count=5423600,
> type=0x2b8a64832b80, source=1, tag=200, comm=0xed932d0, status=0x385dd90)
> at precv.c:79
> #8 0x00002b8a6428ca7c in ompi_recv_f (buf=0x2b8a8a99bf20
> "DB»\373\v{\277\204\333\336\306[B\205\277\030ҀҶ\250v\277\
> 225\377qW\001\251w?\240\020\202&=)S\277\202+\214\067\224\345R?\272\221Co\236\206\217?",
> count=0x7ffdea770eb4, datatype=0x2d43bec, source=0x7ffdea770a38,
> tag=0x2d43bf0, comm=0x5d30a68, status=0x385dd90, ierr=0x7ffdea770a3c)
> at precv_f.c:85
> #9 0x000000000042887b in m_recv_z (comm=..., node=-858993460, zvec=Cannot
> access memory at address 0x2d
> ) at mpi.F:680
> #10 0x000000000123e0f1 in fileio::outwav (io=..., wdes=..., w=Cannot
> access memory at address 0x2d
> ) at fileio.F:952
> #11 0x0000000002abfd8f in vamp () at main.F:4204
> #12 0x00000000004139de in main ()
> #13 0x0000003f0c81ed1d in __libc_start_main () from /lib64/libc.so.6
> #14 0x00000000004138e9 in _start ()
>
>
> The comm value is different in omp_recv_f and things below, so I tried
> both. With the value of the lower level functions I get nothing useful
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
> and the value from omp_recv_f I get a seg fault:
>
> (gdb) call mca_pml_ob1_dump(0x5d30a68, 1)
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00002b8a76c26d0d in mca_pml_ob1_dump (comm=0x5d30a68, verbose=1) at
> pml_ob1.c:577
> 577 opal_output(0, "Communicator %s [%p](%d) rank %d recv_seq %d
> num_procs %lu last_probed %lu\n",
> The program being debugged was signaled while in a function called from
> GDB.
> GDB remains in the frame where the signal was received.
> To change this behavior use "set unwindonsignal on".
> Evaluation of the expression containing the function
> (mca_pml_ob1_dump) will be abandoned.
> When the function is done executing, GDB will silently stop.
>
> Should this have worked, or am I doing something wrong?
>
> thanks,
> Noam
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Noam Bernstein
2018-04-06 18:42:29 UTC
> On Apr 6, 2018, at 1:41 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> According to your stack trace the correct way to call the mca_pml_ob1_dump is with the communicator from the PMPI call. Thus, this call was successful:
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
> I should have been more clear, the output is not on gdb but on the output stream of your application. If you run your application by hand with mpirun, the output should be on the terminal where you started mpirun. If you start your job with a batch schedule, the output should be in the output file associated with your job.
>

OK, that makes sense. Here’s what I get from the two relevant processes. compute-1-9 should be receiving, and 1-10 sending, I believe. Is it possible that the fact that all the send/recv pairs (nodes 1-3 on each set of 4 sending to 0, which receives from each one in turn) are using the same tag (200) is confusing things?

[compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
[compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq 8941
[compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq 385
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [ ] ctx 5 src 2 tag 200 seq 126 msg_length 86777600
[compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90 send_seq 5
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [ ] ctx 5 src 3 tag 200 seq 8557 msg_length 86777600

[compute-1-10:15673] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xe9cc6a0](5) rank 1 recv_seq 9119 num_procs 4 last_probed 0
[compute-1-10:15673] [Rank 0] expected_seq 8942 ompi_proc 0xe8e1db0 send_seq 174
[compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq 8561
[compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0 send_seq 385



____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
George Bosilca
2018-04-08 19:58:13 UTC
Noam,

Thanks for your output, it highlights an unusual outcome. It shows that a
process (29662) has pending messages from other processes that are tagged
with a past sequence number, something that should not have happened. The
only way to get that is if somehow we screwed up the sending part and pushed
the same sequence number twice ...

More digging is required.

George.



On Fri, Apr 6, 2018 at 2:42 PM, Noam Bernstein <***@nrl.navy.mil>
wrote:

>
> On Apr 6, 2018, at 1:41 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> According to your stack trace the correct way to call the mca_pml_ob1_dump
> is with the communicator from the PMPI call. Thus, this call was successful:
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
>
> I should have been more clear, the output is not on gdb but on the output
> stream of your application. If you run your application by hand with
> mpirun, the output should be on the terminal where you started mpirun. If
> you start your job with a batch schedule, the output should be in the
> output file associated with your job.
>
>
> OK, that makes sense. Here’s what I get from the two relevant processes.
> compute-1-9 should be receiving, and 1-10 sending, I believe. Is it
> possible that the fact that all send send/recv pairs (nodes 1-3 on each set
> of 4 sending to 0, which is receiving from each one in turn) are using the
> same tag (200) is confusing things?
>
> [compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
> [compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq
> 8941
> [compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq
> 385
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [ ] ctx 5 src 2 tag 200 seq 126
> msg_length 86777600
> [compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90
> send_seq 5
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [ ] ctx 5 src 3 tag 200 seq 8557
> msg_length 86777600
>
> [compute-1-10:15673] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [0xe9cc6a0](5) rank 1 recv_seq 9119 num_procs 4 last_probed 0
> [compute-1-10:15673] [Rank 0] expected_seq 8942 ompi_proc 0xe8e1db0
> send_seq 174
> [compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq
> 8561
> [compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0
> send_seq 385
>
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
Noam Bernstein
2018-04-08 22:00:27 UTC
> On Apr 8, 2018, at 3:58 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> Thanks for your output, it highlight an usual outcome. It shows that a process (29662) has pending messages from other processes that are tagged with a past sequence number, something that should have not happened. The only way to get that is if somehow we screwed-up the sending part and push the same sequence number twice ...
>
> More digging is required.

OK - these sequence numbers are unrelated to the send/recv tags, right? I’m happy to do any further debugging. I can’t share code, since we do have access but it’s not open source, but I’d be happy to test out anything you can suggest.

thanks,
Noam


____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
George Bosilca
2018-04-09 00:30:51 UTC
Right, it has nothing to do with the tag. The sequence number is an
internal counter that helps OMPI deliver the messages in the MPI-required
order (FIFO ordering per communicator per peer).
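
A simplified sketch of that bookkeeping (these are made-up structures for illustration, not the actual OB1 code): every sender keeps a per-peer counter that it stamps on outgoing messages, and the receiver only matches a message from a peer when the stamp equals the sequence it currently expects from that peer. These correspond to the send_seq / expected_seq fields in the dump output above.

/* Illustrative sketch only, not the actual OB1 data structures. */
#include <stdint.h>

#define MAX_PEERS 64

typedef struct {
    uint16_t send_seq[MAX_PEERS];      /* next stamp for a send to peer i  */
    uint16_t expected_seq[MAX_PEERS];  /* next stamp accepted from peer i  */
} seq_state_t;

/* Sender side: stamp the message and advance the counter. */
static uint16_t stamp_send(seq_state_t *s, int dst)
{
    return s->send_seq[dst]++;
}

/* Receiver side: only the in-order message is matched now; anything else
 * has to wait until the earlier stamps from that peer have been consumed. */
static int matches_now(seq_state_t *s, int src, uint16_t seq)
{
    if (seq != s->expected_seq[src])
        return 0;
    s->expected_seq[src]++;
    return 1;
}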

Thanks for offering your help to debug this issue. We'll need to figure out
how this can happen, and we will get back to you for further debugging.

George.



On Sun, Apr 8, 2018 at 6:00 PM, Noam Bernstein <***@nrl.navy.mil>
wrote:

> On Apr 8, 2018, at 3:58 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> Thanks for your output, it highlight an usual outcome. It shows that a
> process (29662) has pending messages from other processes that are tagged
> with a past sequence number, something that should have not happened. The
> only way to get that is if somehow we screwed-up the sending part and push
> the same sequence number twice ...
>
> More digging is required.
>
>
> OK - these sequence numbers are unrelated to the send/recv tags, right?
> I’m happy to do any further debugging. I can’t share code, since we do
> have access but it’s not open source, but I’d be happy to test out anything
> you can suggest.
>
> thanks,
> Noam
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
George Bosilca
2018-04-09 22:36:52 UTC
Noam,

I have a few questions for you. According to your original email you are
using OMPI 3.0.1 (but the hang can also be reproduced with 3.0.0). Also,
according to your stack trace I assume it is x86_64, compiled with icc.

Is your application multithreaded? How did you initialize MPI (which
level of threading)? Can you send us the opal_config.h file, please?

Thanks,
George.




On Sun, Apr 8, 2018 at 8:30 PM, George Bosilca <***@icl.utk.edu> wrote:

> Right, it has nothing to do with the tag. The sequence number is an
> internal counter that help OMPI to deliver the messages in the MPI required
> order (FIFO ordering per communicator per peer).
>
> Thanks for offering your help to debug this issue. We'll need to figure
> out how this can happen, and we will get back to you for further debugging.
>
> George.
>
>
>
> On Sun, Apr 8, 2018 at 6:00 PM, Noam Bernstein <
> ***@nrl.navy.mil> wrote:
>
>> On Apr 8, 2018, at 3:58 PM, George Bosilca <***@icl.utk.edu> wrote:
>>
>> Noam,
>>
>> Thanks for your output, it highlight an usual outcome. It shows that a
>> process (29662) has pending messages from other processes that are
>> tagged with a past sequence number, something that should have not
>> happened. The only way to get that is if somehow we screwed-up the sending
>> part and push the same sequence number twice ...
>>
>> More digging is required.
>>
>>
>> OK - these sequence numbers are unrelated to the send/recv tags, right?
>> I’m happy to do any further debugging. I can’t share code, since we do
>> have access but it’s not open source, but I’d be happy to test out anything
>> you can suggest.
>>
>> thanks,
>> Noam
>>
>>
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628 F +1 202 404 7546
>> https://www.nrl.navy.mil
>>
>>
>> _______________________________________________
>> users mailing list
>> ***@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
Noam Bernstein
2018-04-09 23:04:16 UTC
> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>
> Noam,
>
> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).

Correct.

> Also according to your stacktrace I assume it is an x86_64, compiled with icc.

x86_64, yes, but gcc + ifort. I can test with gcc + gfortran if that’s helpful.

>
> Is your application multithreaded ? How did you initialized MPI (which level of threading) ? Can you send us the opal_config.h file please.

No, no multithreading, at least not intentionally. I can run with OMP_NUM_THREADS explicitly set to 1 if you’d like to exclude that as a possibility. opal_config.h is attached, from ./opal/include/opal_config.h in the build directory.

Noam



____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Reuti
2018-04-10 08:20:42 UTC
> On 10.04.2018 at 01:04, Noam Bernstein <***@nrl.navy.mil> wrote:
>
>> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>>
>> Noam,
>>
>> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).
>
> Correct.
>
>> Also according to your stacktrace I assume it is an x86_64, compiled with icc.
>
> x86_64, yes, but, gcc + ifort. I can test with gcc+gfortran if that’s helpful.

Was there any reason not to choose icc + ifort?

-- Reuti


>
>> Is your application multithreaded ? How did you initialized MPI (which level of threading) ? Can you send us the opal_config.h file please.
>
> No, no multithreading, at least not intentionally. I can run with OMP_NUM_THREADS explicitly 1 if you’d like to exclude that as a possibility. opal_config.h is attached, from ./opal/include/opal_config.h in the build directory.
>
> Noam
>
>
>
> ____________
> ||
> |U.S. NAVAL|
> |_RESEARCH_|
> LABORATORY
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
> <opal_config.h>
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
Noam Bernstein
2018-04-10 11:37:21 UTC
> On Apr 10, 2018, at 4:20 AM, Reuti <***@staff.uni-marburg.de> wrote:
>
>>
>> On 10.04.2018 at 01:04, Noam Bernstein <***@nrl.navy.mil> wrote:
>>
>>> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>>>
>>> Noam,
>>>
>>> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).
>>
>> Correct.
>>
>>> Also according to your stacktrace I assume it is an x86_64, compiled with icc.
>>
>> x86_64, yes, but, gcc + ifort. I can test with gcc+gfortran if that’s helpful.
>
> Was there any reason not to choose icc + ifort?

For historical reasons, we only bought ifort, not the complete compiler suite. But VASP is 99% fortran, so I doubt it makes a difference in this case.

Noam


____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil
Reuti
2018-04-10 11:56:55 UTC
> On 10.04.2018 at 13:37, Noam Bernstein <***@nrl.navy.mil> wrote:
>
>> On Apr 10, 2018, at 4:20 AM, Reuti <***@staff.uni-marburg.de> wrote:
>>
>>>
>>> On 10.04.2018 at 01:04, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>
>>>> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>>>>
>>>> Noam,
>>>>
>>>> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).
>>>
>>> Correct.
>>>
>>>> Also according to your stacktrace I assume it is an x86_64, compiled with icc.
>>>
>>> x86_64, yes, but, gcc + ifort. I can test with gcc+gfortran if that’s helpful.
>>
>> Was there any reason not to choose icc + ifort?
>
> For historical reasons, we only bought ifort, not the complete compiler suite. But VASP is 99% fortran, so I doubt it makes a difference in this case.

I see. Sure, it's nothing that would change the behavior of VASP itself, but maybe the interplay with an Open MPI compiled with gcc. In my compilations I try to stay with one vendor, be it GCC, PGI or Intel.

It looks like icc/icpc is freely available now: https://software.intel.com/en-us/system-studio/choose-download#technical Choosing Linux + Linux as the platform to develop and execute on seems to give the full icc/icpc incl. the MKL (except the Fortran libs and scaLAPACK – but both are freely available in another package). The only point to take care of is the location intel/system_studio_2018, where the usual compiler directories are located, not one level above.

-- Reuti
Nathan Hjelm
2018-04-10 12:46:15 UTC
Using icc will not change anything unless there is a bug in the gcc version. I personally never build Open MPI with icc as it is slow and provides no benefit over gcc these days. I do, however, use ifort for the Fortran bindings.
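
For what it's worth, a hypothetical configure line for that mix (everything besides the compiler choices is a placeholder):

./configure CC=gcc CXX=g++ FC=ifort --prefix=<install-dir> <your other options>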

-Nathan

> On Apr 10, 2018, at 5:56 AM, Reuti <***@staff.uni-marburg.de> wrote:
>
>
>>> On 10.04.2018 at 13:37, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>
>>>> On Apr 10, 2018, at 4:20 AM, Reuti <***@staff.uni-marburg.de> wrote:
>>>>
>>>>
>>>>> On 10.04.2018 at 01:04, Noam Bernstein <***@nrl.navy.mil> wrote:
>>>>>
>>>>> On Apr 9, 2018, at 6:36 PM, George Bosilca <***@icl.utk.edu> wrote:
>>>>>
>>>>> Noam,
>>>>>
>>>>> I have few questions for you. According to your original email you are using OMPI 3.0.1 (but the hang can also be reproduced with the 3.0.0).
>>>>
>>>> Correct.
>>>>
>>>>> Also according to your stacktrace I assume it is an x86_64, compiled with icc.
>>>>
>>>> x86_64, yes, but, gcc + ifort. I can test with gcc+gfortran if that’s helpful.
>>>
>>> Was there any reason not to choose icc + ifort?
>>
>> For historical reasons, we only bought ifort, not the complete compiler suite. But VASP is 99% fortran, so I doubt it makes a difference in this case.
>
> I see. Sure it's nothing which would change the behavior of VASP, but maybe the interplay with Open MPI compiled with gcc. I try in my compilations to stay with one vendor, being it GCC, PGI or Intel.
>
> Looks like icc/icpc is freely available now: https://software.intel.com/en-us/system-studio/choose-download#technical Choosing Linux + Linux as platform to develop and execute seems to be the full icc/icpc incl. the MKL (except the Fortran libs and scaLAPACK – but both are freely available in another package). Only point to take care of, is the location intel/system_studio_2018 where the usual compiler directories are located and not one level above.
>
> -- Reuti
> _______________________________________________
> users mailing list
> ***@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users