Discussion:
[OMPI users] MPI_WAIT hangs after a call to MPI_CANCEL
McGrattan, Kevin B. Dr. (Fed)
2017-04-01 16:40:07 UTC
I am running a large computational fluid dynamics code on a Linux cluster (CentOS 6.8, Open MPI 1.8.4). The code is written in Fortran and compiled with Intel Fortran 16.0.3. The cluster has 36 nodes; each node has two sockets, and each socket has six cores. I have noticed that the code hangs when the size of the packages exchanged via persistent send and receive calls becomes large. I cannot say exactly how large, but generally on the order of 10 MB. Rather than let the code just hang, I implemented a timing loop using MPI_TESTALL. If MPI_TESTALL fails to return successfully after, say, 10 minutes, I attempt to MPI_CANCEL the unsuccessful request(s) and continue with the calculation, even if the communication(s) did not succeed. It would not necessarily cripple the calculation if a few MPI communications were unsuccessful. This is a snippet of code that tests whether the communications are successful and attempts to cancel them if not:

START_TIME = MPI_WTIME()
FLAG = .FALSE.
DO WHILE(.NOT.FLAG)
   CALL MPI_TESTALL(NREQ,REQ(1:NREQ),FLAG,ARRAY_OF_STATUSES,IERR)
   WAIT_TIME = MPI_WTIME() - START_TIME
   IF (WAIT_TIME>TIMEOUT) THEN
      WRITE(LU_ERR,'(A,I6,A,A)') 'Request timed out for MPI process ',MYID,' running on ',PNAME(1:PNAMELEN)
      ! Try to cancel each request that has not completed, then wait for the cancellation.
      DO NNN=1,NREQ
         IF (ARRAY_OF_STATUSES(1,NNN)==MPI_SUCCESS) CYCLE
         CALL MPI_CANCEL(REQ(NNN),IERR)
         WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_CANCEL'
         CALL MPI_WAIT(REQ(NNN),STATUS,IERR)
         WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_WAIT'
         CALL MPI_TEST_CANCELLED(STATUS,FLAG2,IERR)
         WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_TEST_CANCELLED'
      ENDDO
   ENDIF
ENDDO

The job still hangs, and when I look at the error file, I see that on MPI process A one of the sends has not completed, and on process B one of the receives has not completed. The failed send and failed receive are consistent; that is, they match each other. What I do not understand is that for both the uncompleted send and the uncompleted receive, the code hangs in MPI_WAIT. That is, I never get the printout that says the process has returned from MPI_WAIT. I interpret this to mean that part of the large message has been sent or received, but not all of it. The MPI standard seems a bit vague on what is supposed to happen if part of a message simply disappears due to some network glitch. These errors occur after hundreds or thousands of successful exchanges. They never happen at the same point in the calculation. They are random, but they occur only when the messages are large (MBs). When the messages are not large, the code can run for days or weeks without errors.

So why does MPI_WAIT hang? The MPI standard says:

"If a communication is marked for cancellation, then an MPI_Wait<https://www.open-mpi.org/doc/v2.0/man3/MPI_Wait.3.php> call for that communication is guaranteed to return, irrespective of the activities of other processes (i.e., MPI_Wait<https://www.open-mpi.org/doc/v2.0/man3/MPI_Wait.3.php> behaves as a local function)" (https://www.open-mpi.org/doc/v2.0/man3/MPI_Cancel.3.php).

Could the problem be with my cluster - in that the large message is broken up into smaller packets, and one of these packets disappears and there is no way to cancel it? That's really what I am looking for - a way to cancel the failed communication but still continue the calculation.
George Bosilca
2017-04-03 18:28:40 UTC
Kevin,

In Open MPI we only support cancelling not-yet-matched receives. So you cannot cancel sends, nor receive requests that have already been matched. While the latter are supposed to complete (otherwise they would not have been matched), the former are trickier to complete if the corresponding receive is never posted.

To sum this up, the bad news is that there is no way to correctly cancel MPI requests without risking deadlock.
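
For what it's worth, here is a minimal Fortran sketch of the one case that can be cancelled cleanly, namely a receive that is never matched by any send; the program name, buffer size, and tag are made up for illustration only.

PROGRAM CANCEL_SKETCH
USE MPI
IMPLICIT NONE
INTEGER :: REQ, IERR
INTEGER :: STATUS(MPI_STATUS_SIZE)
LOGICAL :: CANCELLED
REAL :: BUF(1000)
CALL MPI_INIT(IERR)
! Post a receive on a tag that no rank ever sends to, so it can never be matched.
CALL MPI_IRECV(BUF,1000,MPI_REAL,MPI_ANY_SOURCE,999,MPI_COMM_WORLD,REQ,IERR)
CALL MPI_CANCEL(REQ,IERR)
! Because the receive was never matched, the cancellation can succeed and MPI_WAIT returns.
CALL MPI_WAIT(REQ,STATUS,IERR)
CALL MPI_TEST_CANCELLED(STATUS,CANCELLED,IERR)
IF (CANCELLED) WRITE(*,*) 'Receive was successfully cancelled'
CALL MPI_FINALIZE(IERR)
END PROGRAM CANCEL_SKETCH

The MPI_WAIT here returns precisely because the receive was never matched; in your case the send and receive have already been matched, which is why the wait does not return.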

That being said, I can hardly see how Open MPI could drop a message. There might be something else going on here that is more difficult to spot. We do have an internal way to dump all pending (or known) communications. Assuming you are using the OB1 PML, here is how you dump all known communications: attach to a process, find the communicator pointer (you will need to convert between the F90 communicator and the C pointer), and then call mca_pml.pml_dump( commptr, 1).

Also, is it possible to check how one of the more recent versions of Open MPI (> 2.1) behaves with your code?

George.




McGrattan, Kevin B. Dr. (Fed)
2017-04-03 20:47:50 UTC
Thanks, George.

Are persistent send/receives matched from the start of the calculation? If so, then I guess MPI_CANCEL won’t work.

I don’t think Open MPI is the problem. I think there is something wrong with our cluster, in that it just seems to hang up on these big packages. The calculation successfully exchanges hundreds or thousands of messages before hanging.

I’m not sure I completely understand your recommendation for dumping diagnostics. Is this documented somewhere?

Thanks

Kevin



George Bosilca
2017-04-03 21:58:52 UTC
On Mon, Apr 3, 2017 at 4:47 PM, McGrattan, Kevin B. Dr. (Fed) wrote:
Post by McGrattan, Kevin B. Dr. (Fed)
Thanks, George.
Are persistent send/receives matched from the start of the calculation? If so, then I guess MPI_CANCEL won’t work.
A persistent request is only matched when it is started. MPI_Cancel on a persistent receive doesn't affect the persistent request itself; it only cancels the started instance of the request.
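
A minimal Fortran sketch of that lifecycle, with a made-up buffer, tag, and step count, might look like the following; the cancel can only succeed here because nothing ever sends on that tag.

PROGRAM PERSISTENT_SKETCH
USE MPI
IMPLICIT NONE
INTEGER :: REQ, IERR, STEP
INTEGER :: STATUS(MPI_STATUS_SIZE)
LOGICAL :: CANCELLED
REAL :: BUF(100)
CALL MPI_INIT(IERR)
! Create the persistent receive; nothing is matched at this point.
CALL MPI_RECV_INIT(BUF,100,MPI_REAL,MPI_ANY_SOURCE,999,MPI_COMM_WORLD,REQ,IERR)
DO STEP=1,3
   CALL MPI_START(REQ,IERR)          ! activate one instance; matching applies to this instance
   CALL MPI_CANCEL(REQ,IERR)         ! cancels only the started instance
   CALL MPI_WAIT(REQ,STATUS,IERR)    ! completes the instance; REQ itself remains allocated
   CALL MPI_TEST_CANCELLED(STATUS,CANCELLED,IERR)
   WRITE(*,*) 'Step ',STEP,' cancelled: ',CANCELLED
ENDDO
CALL MPI_REQUEST_FREE(REQ,IERR)      ! only this releases the persistent request itself
CALL MPI_FINALIZE(IERR)
END PROGRAM PERSISTENT_SKETCH
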
Post by McGrattan, Kevin B. Dr. (Fed)
I don’t think Open MPI is the problem. I think there is something wrong with our cluster, in that it just seems to hang up on these big packages. The calculation successfully exchanges hundreds or thousands of messages before hanging.
While possible, it is highly unlikely that a message gets dropped by the network without some kind of warning (in the system log at least). You might want to take a look at dmesg to see if there is anything unexpected there.
Post by McGrattan, Kevin B. Dr. (Fed)
I’m not sure I completely understand your recommendation for dumping diagnostics. Is this documented somewhere?
Unfortunately not; this is basically a developer trick to dump the state of the MPI library. It goes a little like this. Once you have attached a debugger to your process (let's assume gdb), you need to find the communicator where you have posted your requests (I can't help here; this is not part of the code you sent). With <communicator_index> set to this value:

gdb$ p ompi_comm_f_to_c_table.addr[<communicator_index>]

will give you the C pointer of the communicator.

gdb$ call mca_pml.pml_dump( ompi_comm_f_to_c_table.addr[<communicator_index>], 1)

should print all the messages known locally to the MPI library, including pending sends and receives. This will also print additional information (the status of the requests, the tag, the size, and so on) that can be understood by the developers. If you post the info here, we might be able to provide additional insight into the issue.

George.
McGrattan, Kevin B. Dr. (Fed)
2017-04-05 20:16:26 UTC
George

Thanks for the advice. I still don’t know what’s wrong with my cluster. I get errors like this:

[[39827,1],182][btl_openib_component.c:3497:handle_wc] from burn001 to: burn005-ib error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 2b428467dc00 opcode 1 vendor error 0 qp_idx 0
[[39827,1],114][btl_openib_component.c:3497:handle_wc] from burn023 to: burn005-ib error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 8dd8080 opcode 128 vendor error 0 qp_idx 0

I did some searching on these error messages, and I think they imply there’s something amiss with our IB fabric. But I am able to bypass some of the timeouts by doing this:

CALL MPI_CANCEL
CALL MPI_TEST
CALL MPI_TEST_CANCELLED

I don’t think that the calls to MPI_TEST or MPI_TEST_CANCELLED do anything, but at least they don’t block. I am going to see if I can just ignore a dropped packet now and again, or try to figure out what’s wrong with our IB.
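
Concretely, the change amounts to replacing the blocking MPI_WAIT in the timeout branch of my earlier snippet with a non-blocking MPI_TEST, roughly as sketched below; this reuses the variables from that snippet and is not the exact production code.

! Sketch only: same timeout branch as before, but with MPI_TEST instead of MPI_WAIT,
! so a request that can never complete does not hang the process.
DO NNN=1,NREQ
   IF (ARRAY_OF_STATUSES(1,NNN)==MPI_SUCCESS) CYCLE
   CALL MPI_CANCEL(REQ(NNN),IERR)
   CALL MPI_TEST(REQ(NNN),FLAG2,STATUS,IERR)      ! returns immediately, completed or not
   IF (FLAG2) THEN
      CALL MPI_TEST_CANCELLED(STATUS,FLAG2,IERR)
      IF (FLAG2) WRITE(LU_ERR,*) 'Request ',NNN,' was cancelled'
   ELSE
      WRITE(LU_ERR,*) 'Request ',NNN,' still pending; continuing anyway'
   ENDIF
ENDDO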

Thanks

Kevin

George Bosilca
2017-04-06 22:31:16 UTC
Kevin,

You are right: changing to MPI_TEST only hides the issue. A message has been dropped, and the corresponding request will therefore never finish.

Your error messages indicate that at least two processes have issues sending data to burn005-ib. Has that node received any messages in your run before? If yes, could the process somehow be reaching MPI_Finalize and starting to tear down connections to its peers?

George.



McGrattan, Kevin B. Dr. (Fed)
2017-04-07 20:34:34 UTC
Post by George Bosilca
Your error messages indicate that at least two processes have issues sending data to burn005-ib. Has that node received any messages in your run before? If yes, could the process somehow be reaching MPI_Finalize and starting to tear down connections to its peers?
By the time my jobs fail, the nodes have successfully exchanged thousands of messages. The failures are random and can take days to appear. The exact same calculations run successfully on all the other clusters I’ve tried, which makes me think there’s something wrong with our IB network. We’re going to “diff” our cluster against another one here to see if we can find a setting that differs. Thanks for your help.

Kevin
