Discussion:
[OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
Lev Givon
2015-05-19 22:29:38 UTC
Permalink
I'm encountering intermittent errors while trying to use the Multi-Process
Service with CUDA 7.0 to improve concurrent access to a Kepler K20Xm GPU by
multiple MPI processes that perform GPU-to-GPU communication with each other
(i.e., GPU pointers are passed to the MPI transmission primitives). I'm using
GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5, which is in turn
built against CUDA 7.0. In my current configuration, I have 4 MPS server daemons
running, each of which controls access to one of 4 GPUs; the MPI processes
spawned by my program are partitioned into 4 groups (which might contain
different numbers of processes) that each talk to a separate daemon. For certain
transmission patterns between these processes, the program runs without any
problems. For others (e.g., 16 processes partitioned into 4 groups), however, it
dies with the following error:

[node05:20562] Failed to register remote memory, rc=-1
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
cuIpcOpenMemHandle return value: 21199360
address: 0x1
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send]
mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
[node05:20564] Failed to register remote memory, rc=-1
[node05:20564] [[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20566] Failed to register remote memory, rc=-1
[node05:20566] [[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20567] Failed to register remote memory, rc=-1
[node05:20567] [[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node05:20569] Failed to register remote memory, rc=-1
[node05:20569] [[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20571] Failed to register remote memory, rc=-1
[node05:20571] [[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20572] Failed to register remote memory, rc=-1
[node05:20572] [[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477

After the above error occurs, I notice that /dev/shm/ is littered with
cuda.shm.* files. I tried cleaning up /dev/shm before running my program, but
that doesn't seem to have any effect upon the problem. Rebooting the machine
also doesn't have any effect. I should also add that my program runs without any
error if the groups of MPI processes talk directly to the GPUs instead of via
MPS.

Does anyone have any ideas as to what could be going on?
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Rolf vandeVaart
2015-05-20 00:28:46 UTC
Permalink
I am not sure why you are seeing this. One thing that is clear is that you have found a bug in our error reporting: the message you got is a little garbled, and I will fix that.

If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0? My expectation is that you will not see any errors, but you may lose some performance.
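For example, something along these lines (the process count and script name are just placeholders for your job):

    mpirun --mca btl_smcuda_use_cuda_ipc 0 -np 16 python my_program.py

or, equivalently, by setting the parameter through the environment:

    export OMPI_MCA_btl_smcuda_use_cuda_ipc=0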

What does your hardware configuration look like? Can you send me output from "nvidia-smi topo -m"

Thanks,
Rolf
-----Original Message-----
Sent: Tuesday, May 19, 2015 6:30 PM
Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
1.8.5 with CUDA 7.0 and Multi-Process Service
(snip)
Lev Givon
2015-05-20 00:32:53 UTC
Permalink
Post by Rolf vandeVaart
(snip)
Post by Rolf vandeVaart
What does your hardware configuration look like? Can you send me output from
"nvidia-smi topo -m"
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SOC     SOC     0-23
GPU1    PHB      X      SOC     SOC     0-23
GPU2    SOC     SOC      X      PHB     0-23
GPU3    SOC     SOC     PHB      X      0-23

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Lev Givon
2015-05-20 02:25:20 UTC
Permalink
Post by Rolf vandeVaart
(snip)
Post by Rolf vandeVaart
If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0? My
expectation is that you will not see any errors, but you may lose some
performance.
The error does indeed go away when IPC is disabled, although I do want to
avoid degrading the performance of data transfers between GPU memory locations.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Rolf vandeVaart
2015-05-20 11:48:15 UTC
Permalink
-----Original Message-----
Sent: Tuesday, May 19, 2015 10:25 PM
To: Open MPI Users
Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
(snip)
Post by Lev Givon
The error does indeed go away when IPC is disabled, although I do want to
avoid degrading the performance of data transfers between GPU memory
locations.
I see that you mentioned you are starting 4 MPS daemons. Are you following the instructions here?

http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html

This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for CUDA IPC. Since you are using CUDA 7 there is no more need to start multiple daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single MPS control daemon which will handle all GPUs. Can you try that? Because of this question, we realized we need to update our documentation as well.
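A minimal sketch of that setup (the job launch line is just a placeholder):

    # start a single control daemon that serves all GPUs on the node;
    # CUDA_VISIBLE_DEVICES and CUDA_MPS_PIPE_DIRECTORY are left unset
    nvidia-cuda-mps-control -d

    # launch the MPI job as usual
    mpirun -np 16 python my_program.py

    # shut the daemon down when you are finished
    echo quit | nvidia-cuda-mps-control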

Thanks,
Rolf


Lev Givon
2015-05-21 15:32:33 UTC
Permalink
Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:

(snip)
Post by Rolf vandeVaart
I see that you mentioned you are starting 4 MPS daemons. Are you following
the instructions here?
http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
Yes - also
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
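Concretely, the four daemons are brought up roughly as follows (the directories
are illustrative), and each group of MPI processes is then pointed at the
corresponding CUDA_MPS_PIPE_DIRECTORY:

    for i in 0 1 2 3; do
        export CUDA_VISIBLE_DEVICES=$i
        export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_pipe_$i
        export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i
        mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
        # one control daemon per GPU, pinned to that device
        nvidia-cuda-mps-control -d
    done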
Post by Rolf vandeVaart
This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for CUDA
IPC. Since you are using CUDA 7 there is no more need to start multiple
daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single
MPS control daemon which will handle all GPUs. Can you try that?
I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value should be
passed to all MPI processes.

Several questions related to your comment above:

- Should the MPI processes select and initialize the GPUs they respectively need
to access as they normally would when MPS is not in use?
- Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS (and
hence the client processes)? I ask because SLURM uses CUDA_VISIBLE_DEVICES to
control GPU resource allocation, and I would like to run my program (and the
MPS control daemon) on a cluster via SLURM.
- Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU setting
with CUDA 6.5 even when one starts multiple MPS control daemons as described
in the aforementioned blog post?
Post by Rolf vandeVaart
Because of this question, we realized we need to update our documentation as
well.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Lev Givon
2015-05-21 18:18:33 UTC
Permalink
Post by Rolf vandeVaart
(snip)
Post by Rolf vandeVaart
This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for CUDA
IPC. Since you are using CUDA 7 there is no more need to start multiple
daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single
MPS control daemon which will handle all GPUs. Can you try that?
Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to solve
the problem when IPC is enabled.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Rolf vandeVaart
2015-05-21 19:04:28 UTC
Permalink
Answers below...
-----Original Message-----
Sent: Thursday, May 21, 2015 2:19 PM
To: Open MPI Users
Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
Post by Lev Givon
(snip)
Post by Lev Givon
I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value
should be passed to all MPI processes.
There is no need to do anything with CUDA_MPS_PIPE_DIRECTORY with CUDA 7.
Post by Lev Givon
- Should the MPI processes select and initialize the GPUs they respectively need
to access as they normally would when MPS is not in use?
Yes.
Post by Lev Givon
- Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS (and
hence the client processes)? I ask because SLURM uses CUDA_VISIBLE_DEVICES to
control GPU resource allocation, and I would like to run my program (and the
MPS control daemon) on a cluster via SLURM.
Yes, I believe that is true.
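That is, roughly (the device list shown is only an example of what SLURM might set):

    # inside the SLURM allocation, CUDA_VISIBLE_DEVICES is already set by SLURM
    echo $CUDA_VISIBLE_DEVICES          # e.g. 0,1
    # a control daemon started here should only see (and expose) those GPUs
    nvidia-cuda-mps-control -d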
Post by Lev Givon
- Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU setting
with CUDA 6.5 even when one starts multiple MPS control daemons as described
in the aforementioned blog post?
Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to
solve the problem when IPC is enabled.
--
Glad to see this worked. And you are correct that CUDA IPC will not work between devices if they are segregated by the use of CUDA_VISIBLE_DEVICES, as is done with MPS under CUDA 6.5.

Rolf