Discussion:
[OMPI users] MPI vendor error
Ludovic Raess
2017-09-15 12:08:04 UTC
Permalink
Hi,


we have an issue on our 32-node Linux cluster regarding the use of Open MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, GCC 5.4, CUDA 7).


On long runs (over ~10 days) involving more than one node (usually 64 MPI processes distributed over 16 nodes [node01-node16]), we observe a freeze of the simulation due to an internal error: "error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1 vendor error 136 qp_idx 0" (see attached file for full output).


The job hangs: no computation or communication occurs anymore, but the job neither exits nor releases the nodes. The job can be killed normally, but the affected nodes then do not fully recover. A relaunch of the simulation usually survives a couple of iterations (a few minutes of runtime) before hanging again for similar reasons. The only workaround so far is to reboot the involved nodes.


Since we didn't find any hints on the web regarding this strange behaviour, I am wondering if this is a known issue. We don't know what causes this to happen or why, so any hints on where to start investigating, or possible reasons for this behaviour, are welcome.


Thanks


Ludovic
John Hearns via users
2017-09-28 09:17:11 UTC
Permalink
Google turns this up:
https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls


John Hearns via users
2017-09-28 09:17:59 UTC
Permalink
ps. Before you reboot a compute node, have you run 'ibdiagnet'?

George Bosilca
2017-09-28 15:03:43 UTC
Permalink
John,

In the ULFM thread you pointed to, we converged toward a hardware
issue. Resources associated with the dead process were not correctly freed,
and follow-up processes on the same setup would inherit issues related to
these lingering messages. Keep in mind, however, that the setup was
different: there we were talking about losing a process.

The proposed workaround of forcing the timeout to a large value did not fix
the problem; it only delayed it long enough for the application to run to
completion.

George.


Richard Graham
2017-09-28 16:09:58 UTC
Permalink
I just talked with George, who brought me up to speed on this particular problem.

I would suggest a couple of things:

- Look at the HW error counters, and see if you have many retransmits. This would indicate a potential issue with the particular HW in use, such as a cable that is not seated well, or some type of similar problem (see the counter-scanning sketch after this list).

- If you have the ability, reseat your cables from the HCA to the switch, and see if this addresses the problem.
Also, if you have the ability (e.g., can modify the Open MPI source code), set the retransmit count to 0, and see if you hit the same issue. This would just speed up reaching the problem, if this is indeed the cause.
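
As a rough illustration only (not a definitive tool, and the exact counter names depend on your kernel/OFED version), a small Python sketch like the one below could be run on each node to dump the non-zero port counters exposed under the standard Linux sysfs layout; 'perfquery' or 'ibdiagnet' from infiniband-diags report the same information:

#!/usr/bin/env python
# Sketch: print non-zero InfiniBand port counters from sysfs.
# Assumes the standard /sys/class/infiniband/<hca>/ports/<n>/counters/ layout;
# the set of available counters varies with the driver and OFED version.
import glob
import os

for path in sorted(glob.glob("/sys/class/infiniband/*/ports/*/counters/*")):
    try:
        with open(path) as f:
            value = int(f.read().strip())
    except (IOError, ValueError):
        continue  # skip entries that are unreadable or not plain integers
    if value != 0:
        parts = path.split(os.sep)
        hca, port, name = parts[-5], parts[-3], parts[-1]
        print("%s port %s: %s = %d" % (hca, port, name, value))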

Rich


Ludovic Raess
2017-09-28 21:19:26 UTC
Permalink
Dear John, George, Rich,


thank you for the suggestions and potential paths towards understanding the reason for the observed freeze. Although a HW issue is possible, it seems unlikely, since the error appears only after long runs and not randomly. Also, it is more or less fixed by a reboot, until another long run is started.


Cables and connections seem OK; we already reseated all connections.


Currently, we are investigating two paths towards a fix. We implemented a slightly modified version of our MPI point-to-point communication routine, to check whether a hidden programming issue is still the cause. Additionally, I am running the problematic setup with MVAPICH to see whether the problem is related to Open MPI in particular, thus excluding a HW or implementation issue.
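
For context, a minimal sketch of the kind of nonblocking point-to-point exchange we use (illustrative only, written here with mpi4py and hypothetical neighbour ranks and buffer sizes, not our actual production routine) looks like:

# Illustrative nonblocking point-to-point exchange: each rank swaps a
# buffer with its left/right neighbour (hypothetical sizes, not production code).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1024  # hypothetical halo size
send_left  = np.full(n, rank, dtype=np.float64)
send_right = np.full(n, rank, dtype=np.float64)
recv_left  = np.empty(n, dtype=np.float64)
recv_right = np.empty(n, dtype=np.float64)

left  = rank - 1 if rank > 0        else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Post the receives first, then the sends, and wait on everything.
reqs = [comm.Irecv(recv_left,  source=left,  tag=0),
        comm.Irecv(recv_right, source=right, tag=1),
        comm.Isend(send_right, dest=right,   tag=0),
        comm.Isend(send_left,  dest=left,    tag=1)]
MPI.Request.Waitall(reqs)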


In both cases, I will run 'ibdiagnet' if the freeze occurs again, as suggested. Lastly, we could try to set the retransmit count to 0, as suggested by Rich.


Thanks for the suggestions; I'll write back if I have new hints (it will take some days for the runs to potentially freeze).


Ludovic

Richard Graham
2017-09-29 15:53:23 UTC
Permalink
BTW, another cause for retransmission is the lack of posted receive buffers.
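
At the application level, the analogous precaution is to post receives before the matching sends go out, so buffers are already available when messages arrive. A toy sketch of that idea follows (hypothetical message counts and sizes, using mpi4py, and only an analogue of the receive buffers the transport layer manages internally):

# Toy sketch: pre-post all receives before issuing the matching sends,
# so receive buffers are already available when messages arrive.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank          # assumes exactly 2 ranks for this toy example
nmsg, n = 8, 4096        # hypothetical message count and size

recv_bufs = [np.empty(n, dtype=np.float64) for _ in range(nmsg)]
recv_reqs = [comm.Irecv(buf, source=peer, tag=i) for i, buf in enumerate(recv_bufs)]

# Receives are posted up front; only now start sending.
send_bufs = [np.full(n, float(rank), dtype=np.float64) for _ in range(nmsg)]
send_reqs = [comm.Isend(buf, dest=peer, tag=i) for i, buf in enumerate(send_bufs)]

MPI.Request.Waitall(recv_reqs + send_reqs)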

Rich
