Discussion:
[OMPI users] Multi-threaded MPI communication
saiyedul islam
2017-09-21 20:57:35 UTC
Hi all,

I am working on the parallelization of a data clustering algorithm in which
I follow the MPMD pattern of MPI (i.e., 1 master process and p slave
processes in the same communicator). It is an iterative algorithm in which
2 loops inside each iteration are parallelized separately.

The first loop is parallelized by partitioning the input data of size N into
(almost) equal parts among the p slaves. Each slave produces a contiguous
chunk of about (p * N/p) double values as the result of its local processing.
This local chunk from each slave is collected back on the master process,
where it is merged with the chunks from the other slaves.
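
For illustration, a minimal sketch of how such an (almost) equal partitioning
is typically computed; the names counts and displs are placeholders, not
taken from the actual code:

    /* Hypothetical sketch: split N items as evenly as possible among p slaves.
     * counts[i] = number of items assigned to slave i, displs[i] = its offset. */
    void partition(int N, int p, int *counts, int *displs)
    {
        int base = N / p, rem = N % p, offset = 0;
        for (int i = 0; i < p; i++) {
            counts[i] = base + (i < rem ? 1 : 0);  /* first rem slaves get one extra */
            displs[i] = offset;
            offset += counts[i];
        }
    }

Similar counts/displs arrays are what the receive side of an MPI_Gatherv-based
collection (strategy 1 below) would expect.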
If a blocking call (MPI_Send / MPI_Recv) is placed in a loop on the master
so that it receives the data one by one from the slaves in rank order, then
each slave takes about 75 seconds for its local computation (as measured by
MPI_Wtime()) and about 1.5 seconds to transfer its chunk to the master. But,
since the transfers happen in order, by the time the last slave process is
done, the total time becomes 75 seconds of computation plus 50 seconds of
communication.
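
For reference, the ordered receive loop described above looks roughly like
this minimal sketch; the buffer layout, the tag, and the assumption that the
slaves sit at ranks 1..p are illustrative, not taken from the actual code:

    #include <mpi.h>

    /* Hypothetical sketch of strategy 0: master receives chunks in rank order. */
    void gather_ordered(double *result, int chunk_len, int p, MPI_Comm comm)
    {
        for (int src = 1; src <= p; src++) {      /* slaves assumed at ranks 1..p */
            MPI_Recv(result + (long)(src - 1) * chunk_len, chunk_len, MPI_DOUBLE,
                     src, 0 /* tag */, comm, MPI_STATUS_IGNORE);
        }
    }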
These timings are for a cluster of 31 machines where a single process
executes in each machine. All these machines are connected directly via a
private Gigabit network switch. In order to be effectively parallelize the
algorithm, the overall execution time needs to come below 80 seconds.

I have tried the following strategies to solve this problem:
0. Ordered transfer, as explained above.
1. Collecting the data through MPI_Gatherv and assuming that internally it
will transfer the data in parallel.
2. Creating p threads on the master using OpenMP and calling MPI_Recv (or
MPI_Irecv with MPI_Wait) from those threads, with the data received from each
slave put in a separate buffer (see the sketch after this list). My
installation supports MPI_THREAD_MULTIPLE.
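
As referenced in strategy 2, a minimal sketch of the threaded receive; it
assumes MPI was initialized via MPI_Init_thread with MPI_THREAD_MULTIPLE
granted, and the buffer names and tag are illustrative:

    #include <mpi.h>
    #include <omp.h>

    /* Hypothetical sketch of strategy 2: one OpenMP thread per slave, each
     * doing a blocking receive into its own buffer. */
    void gather_threaded(double **bufs, int chunk_len, int p, MPI_Comm comm)
    {
        #pragma omp parallel for num_threads(p)
        for (int src = 1; src <= p; src++) {
            MPI_Recv(bufs[src - 1], chunk_len, MPI_DOUBLE,
                     src, 0 /* tag */, comm, MPI_STATUS_IGNORE);
        }
    }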

The problem is that strategies 1 & 2 take almost the same time as
strategy 0.
*Is there a way through which I can receive the data in parallel and
substantially decrease the overall execution time?*

Hoping to get your help soon. Sorry for the long question.

Regards,
Saiyedul Islam

PS: Specifications of the cluster: GCC 5.10, Open MPI 2.0.1, CentOS 6.5 (as
part of a Rocks Cluster installation).
George Bosilca
2017-09-21 21:48:34 UTC
All your processes send their data to a single destination at the same time.
Clearly you are reaching the capacity of your network, and your data
transfers will be bound by it. This is a physical constraint that you can
only overcome by adding network capacity to your cluster.

At the software level, the only possibility is to make each of the p slave
processes send its data to your centralized resource at a different time, so
that the data has time to be transferred through the network before the next
slave is ready to submit its result.
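
One possible realization of this idea (an interpretation, not code from the
thread): the master pre-posts a nonblocking receive per slave and merges
chunks in completion order, so that transfers from slaves that finish early
overlap the computation still running on the others. Buffer layout, tag and
rank layout are assumptions:

    #include <mpi.h>

    /* Hypothetical sketch: pre-post one MPI_Irecv per slave and handle chunks
     * as they arrive, instead of forcing rank order. Assumes p <= 64. */
    void gather_as_ready(double *result, int chunk_len, int p, MPI_Comm comm)
    {
        MPI_Request reqs[64];
        for (int src = 1; src <= p; src++)
            MPI_Irecv(result + (long)(src - 1) * chunk_len, chunk_len, MPI_DOUBLE,
                      src, 0 /* tag */, comm, &reqs[src - 1]);

        for (int done = 0; done < p; done++) {
            int idx;
            MPI_Waitany(p, reqs, &idx, MPI_STATUS_IGNORE);
            /* chunk from slave idx+1 is now complete; merge it here */
        }
    }

This only helps to the extent that the slaves actually finish at different
times; if they all complete together, the single Gigabit link into the master
remains the limit, as noted above.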

George
saiyedul islam
2017-09-21 22:12:16 UTC
Thank you very much for the reply, Sir.

Yes, I can observe this pattern by eyeballing the network communication graph
of my cluster (through the Ganglia Cluster Monitor:
http://ganglia.sourceforge.net/).
During this loop's execution, the master is receiving data at ~100 MB/s (of
the theoretical 125 MB/s of Gigabit Ethernet), while each of the 30 slave
processes is sending around ~3-4 MB/s.

Is there a way to get exact numbers about network utilization from within the
MPI code, instead of just visualizing the graph?
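
In the meantime, one rough way to get a number out of the MPI code itself (a
sketch, not something from the thread) is to time each transfer with
MPI_Wtime() and divide the bytes moved by the elapsed time; note this reports
only what MPI delivers, not total link utilization:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical sketch: report the achieved bandwidth of one receive. */
    void timed_recv(double *buf, int chunk_len, int src, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        MPI_Recv(buf, chunk_len, MPI_DOUBLE, src, 0 /* tag */, comm,
                 MPI_STATUS_IGNORE);
        double dt = MPI_Wtime() - t0;
        double mb = chunk_len * sizeof(double) / 1.0e6;
        printf("slave %d -> master: %.1f MB in %.3f s = %.1f MB/s\n",
               src, mb, dt, mb / dt);
    }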

Regards,
Saiyed
Jeff Hammond
2017-09-22 03:29:00 UTC
Can you pad the data so that you can use MPI_Gather instead? It's possible
that Gatherv doesn't use recursive doubling.

Or you can implement your own aggregation tree to work around the incast
problem.
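
For illustration, a minimal sketch of the second idea: a binomial-tree gather
in which each intermediate rank collects its subtree's chunks and forwards
one combined message upward, so the root talks to only about log2(size)
senders instead of all of them at once. It assumes every rank, including the
root, contributes an equal-length chunk; names and tag are placeholders:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch: gather equal-sized chunks to rank 0 over a binomial
     * tree; the result ends up in rank order in root_buf. */
    void tree_gather(const double *mychunk, int chunk_len, double *root_buf,
                     MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Working buffer large enough for this rank's whole subtree. */
        double *buf = (rank == 0) ? root_buf
                                  : malloc((size_t)size * chunk_len * sizeof(double));
        memcpy(buf, mychunk, (size_t)chunk_len * sizeof(double));

        int held = 1;                                /* chunks currently in buf */
        for (int step = 1; step < size; step <<= 1) {
            if (rank & step) {                       /* send whole subtree, then stop */
                MPI_Send(buf, held * chunk_len, MPI_DOUBLE, rank - step, 0, comm);
                break;
            } else if (rank + step < size) {         /* receive partner's subtree */
                int from = rank + step;
                int nrecv = (size - from < step) ? size - from : step;
                MPI_Recv(buf + (long)held * chunk_len, nrecv * chunk_len,
                         MPI_DOUBLE, from, 0, comm, MPI_STATUS_IGNORE);
                held += nrecv;
            }
        }
        if (rank != 0) free(buf);
    }

Note the root still pulls the same total number of bytes through its single
Gigabit link, so on a flat switch the gain comes from avoiding 30
simultaneous senders (the incast effect Jeff mentions) rather than from
reducing the volume arriving at the master.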

Jeff
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/