Discussion:
[OMPI users] One question about progression of operations in MPI
Weicheng Xue
2018-11-14 01:52:23 UTC
Hi,

I am a student whose research involves using MPI and OpenACC to
accelerate our in-house CFD code on multiple GPUs. I am having a
significant issue related to the "progression of operations in MPI" and I
think your input would be very helpful.

I am testing the performance of overlapping communication and computation
in this code. Communication happens between hosts (CPUs), while
computation is done on devices (GPUs). However, in my case the actual
communication only starts when the computation finishes. So even though I
wrote the code in an overlapping style, no overlap occurs, apparently
because Open MPI is not making asynchronous progress. I found that MPI
often only progresses (i.e., actually sends or receives the data) while I
am blocking in a call to MPI_Wait, at which point no overlap is possible.
My goal is to use overlap to hide the communication latency and thus
improve the performance of my code. Is there an approach you could
suggest? Thank you very much!
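
To make the intended pattern concrete, here is a minimal sketch in C (not
our actual code, which uses OpenACC; the buffer names, neighbor rank, and
kernel functions below are placeholders):

    #include <mpi.h>

    /* Placeholders for the device (OpenACC) kernels in the real code. */
    static void compute_interior_on_gpu(void) { /* ... */ }
    static void compute_boundary_on_gpu(void) { /* ... */ }

    /* Sketch of one halo exchange written for communication/computation
     * overlap: post the non-blocking transfers, do the work that does not
     * need the halo, then wait and finish the boundary work. */
    void exchange_and_compute(double *send_buf, double *recv_buf, int count,
                              int neighbor, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recv_buf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(send_buf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

        /* Interior work does not depend on the halo, so it is expected to
         * overlap with the transfers posted above. */
        compute_interior_on_gpu();

        /* Only the boundary work has to wait for the halo data. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_boundary_on_gpu();
    }

The hope is that the messages posted by MPI_Isend/MPI_Irecv make progress
while compute_interior_on_gpu() runs; in practice the transfer only seems
to happen inside MPI_Waitall.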

I am using the PGI/17.5 compiler and openmpi/2.0.0. A 100 Gbps EDR
InfiniBand network is used for MPI traffic. "ompi_info" reports the thread
support as "Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support:
yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)".

Best Regards,

Weicheng Xue
Jeff Squyres (jsquyres) via users
2018-11-16 22:35:58 UTC
Post by Weicheng Xue
I am a student whose research involves using MPI and OpenACC to accelerate our in-house CFD code on multiple GPUs. I am having a significant issue related to the "progression of operations in MPI" and I think your input would be very helpful.
Someone asked me about an Open MPI + OpenACC issue this past week at the Supercomputing trade show.

I'm not sure if anyone in the Open MPI development community is testing with Open MPI + OpenACC. I don't know much about it -- I would *hope* that it "just works", but I don't know that for sure.
Post by Weicheng Xue
I am testing the performance of overlapping communication and computation in this code. Communication happens between hosts (CPUs), while computation is done on devices (GPUs). However, in my case the actual communication only starts when the computation finishes. So even though I wrote the code in an overlapping style, no overlap occurs, apparently because Open MPI is not making asynchronous progress. I found that MPI often only progresses (i.e., actually sends or receives the data) while I am blocking in a call to MPI_Wait, at which point no overlap is possible. My goal is to use overlap to hide the communication latency and thus improve the performance of my code. Is there an approach you could suggest? Thank you very much!
Nearly all transports in Open MPI support asynchronous progress -- but only some of them offer hardware- and/or OS-assisted asynchronous progress (which is probably what you are assuming). Specifically: I'm quibbling with your choice of wording, but the end effect you are observing is likely a) correct, and b) dependent upon the network transport that you are using.
Post by Weicheng Xue
I am using the PGI/17.5 compiler and openmpi/2.0.0. A 100 Gbps EDR InfiniBand network is used for MPI traffic. "ompi_info" reports the thread support as "Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)".
That's a little surprising -- IB should be one of the transports that actually supports asynchronous progress.

Are you using UCX for the IB transport?
--
Jeff Squyres
***@cisco.com
Weicheng Xue
2018-11-17 03:50:30 UTC
Hi Jeff,

Thank you very much for your reply! I am using a cluster at my
university (https://www.arc.vt.edu/computing/newriver/). I cannot find any
information about Unified Communication X (UCX) being used there, so I
would guess the cluster does not use it (though I am not sure). I did call
MPI_Test at several places in my code where the communication was supposed
to have finished, but the communication did not actually complete until
the code finally called MPI_WAITALL. I learned this from the NVIDIA
profiler: it showed that the GPU kernel immediately after MPI_WAITALL only
started once the CPUs had finished the communication, even though there
was enough time for the CPUs to finish that transfer in the background
before MPI_WAITALL. If the communication overhead is not hidden, there is
no point in writing the code in an overlapping way. I am wondering whether
the Open MPI on the cluster was built with asynchronous progression
enabled, since "ompi_info" reports "OMPI progress: no, ORTE progress:
yes". I do not really know the difference between "OMPI progress" and
"ORTE progress", as I am not a CS person. I am also wondering whether
MVAPICH2 is worth trying, since it provides an environment variable to
control progression, which would be easier. I would greatly appreciate
your help!

Best Regards,

Weicheng Xue

Jeff Squyres (jsquyres) via users
2018-11-27 11:32:46 UTC
Sorry for the delay in replying; the SC'18 show and then the US Thanksgiving holiday got in the way. More below.
Post by Weicheng Xue
Hi Jeff,
Thank you very much for your reply! I am using a cluster at my university (https://www.arc.vt.edu/computing/newriver/). I cannot find any information about Unified Communication X (UCX) being used there, so I would guess the cluster does not use it (though I am not sure).
You might want to try compiling UCX yourself (it's just a user-level library -- it can even be installed under your $HOME) and then try compiling Open MPI against it and using that. Make sure to configure/compile UCX with CUDA support -- I believe you need a very recent version of UCX for that.
Post by Weicheng Xue
I did call MPI_Test at several places in my code where the communication was supposed to have finished, but the communication did not actually complete until the code finally called MPI_WAITALL.
You might want to try calling MPI_TEST in a loop many times, just to see what is happening.

Specifically: in Open MPI (and probably in other MPI implementations), MPI_TEST dips into the MPI progression engine (essentially) once, whereas MPI_WAIT dips into the MPI progression engine as many times as necessary in order to complete the request(s). So it's just a difference of looping.
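
As a rough sketch (not code from this thread -- the chunked-work loop and helper function are hypothetical), interleaving the computation with repeated MPI_Testall calls gives the library many more chances to progress the transfer than a single MPI_Test would:

    #include <mpi.h>

    /* Placeholder for one piece of the real computation. */
    static void do_one_chunk_of_work(int chunk)
    {
        (void)chunk;
    }

    /* Poll the outstanding requests between chunks of work so the MPI
     * progress engine runs many times instead of once. */
    void compute_with_polling(MPI_Request *reqs, int nreqs, int nchunks)
    {
        int done = 0;

        for (int chunk = 0; chunk < nchunks; chunk++) {
            do_one_chunk_of_work(chunk);

            if (!done) {
                /* Each MPI_Testall dips into the progress engine (roughly) once. */
                MPI_Testall(nreqs, reqs, &done, MPI_STATUSES_IGNORE);
            }
        }

        /* Complete anything still outstanding. */
        if (!done)
            MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    }

Whether that actually hides the transfer still depends on the transport underneath; without hardware- or OS-assisted progress, those polling calls are what drive the communication forward.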

How large is the message you're sending?
Post by Weicheng Xue
I learned this from the NVIDIA profiler: it showed that the GPU kernel immediately after MPI_WAITALL only started once the CPUs had finished the communication, even though there was enough time for the CPUs to finish that transfer in the background before MPI_WAITALL. If the communication overhead is not hidden, there is no point in writing the code in an overlapping way. I am wondering whether the Open MPI on the cluster was built with asynchronous progression enabled, since "ompi_info" reports "OMPI progress: no, ORTE progress: yes". I do not really know the difference between "OMPI progress" and "ORTE progress", as I am not a CS person.
I applaud your initiative to find that phrase in the ompi_info output!

However, don't get caught up in it -- that phrase isn't specifically oriented to the exact issue you're discussing here (for lack of a longer explanation).
Post by Weicheng Xue
I am also wondering whether MVAPICH2 is worth trying, since it provides an environment variable to control progression, which would be easier. I would greatly appreciate your help!
Sure, try MVAPICH2 -- that's kinda the strength of the MPI ecosystem (that there are multiple different MPI implementations to try).
--
Jeff Squyres
***@cisco.com
Weicheng Xue
2018-11-27 16:01:32 UTC
Hi Jeff,

Thank you very much for these useful suggestions! I may try MVAPICH2
first. In my case, I transfer different data twice, and each transfer is
3.146 MB. I have also tested problems of other sizes, and none of them
worked as expected.

Best Regards,

Weicheng Xue
Weicheng Xue
2018-11-27 21:41:38 UTC
Hi Xin,

Thanks a lot for offering to help! I may wait until the technicians for
the university cluster finish their work, as they have agreed to install
Open MPI 4.0.0 first (and perhaps MVAPICH2 with the PGI compiler). Let me
try those options first; I will let you know if I decide to install UCX
with CUDA support.

Best regards,

Weicheng Xue
Post by Xin
Hi Weicheng,
I am an engineer from Mellanox and we saw your thread about testing with
OMPI on your cluster.
It is quite easy and fast to build UCX with CUDA support in your home
directory. You just need to download from github (
https://github.com/openucx/ucx/releases) and run configure and make. You
need to pass --with-cuda=(YOUR_CUDA_DIR) during configure.
I can help you set up UCX and OMPI on your cluster. We can set up a call
to do this together if that is convenient for you.
Xin