Discussion:
[OMPI users] How can I measure synchronization time of MPI_Bcast()
Konstantinos Konstantinidis
2017-10-20 22:23:46 UTC
Hi,

I am running some tests on Amazon EC2 and they require a lot of
communication among m3.large instances.

I would like to give you an idea of what kind of communication takes place.
There are 40 m3.large instances. Now, 28672 groups of 5 instances are
formed in a specific manner (let's skip the details). Within each group,
each instance broadcasts some unsigned char data to the other 4 instances
in the group. So within each group, exactly 5 broadcasts take place.

The problem is that if I increase the group size from 5 to 10, there is a
significant drop in the effective transmission rate, which, based on some
theoretical results, should not happen.

I want to check whether part of the reason is the time the instances need to
synchronize when they call MPI_Bcast(), since it is a collective function. As
far as I know, all of the machines participating in the broadcast need to call
it and then synchronize before the actual data transfer starts. Is there any
way to measure this synchronization time?

The code is in C++, and the MPI installation is described in the attached file.
Jeff Hammond
2017-10-21 03:32:21 UTC
Broadcast is collective but not necessarily synchronous in the sense you
imply. If you broadcast a message smaller than the eager limit, the root may
return before any non-root process enters the function. Data transfer may
happen before processes enter the function. Only the rendezvous protocol
forces synchronization between any two processes, and even then there may
still be asynchrony between different levels of the broadcast tree.
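
You can see this yourself with a toy test along these lines (just a sketch,
not code from this thread; the message size and the sleep are arbitrary, and
the eager limit depends on your BTL/PML settings):

#include <mpi.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1024;                     // try values below and above the eager limit
    std::vector<unsigned char> buf(count, 0);

    if (rank != 0)
        sleep(2);                               // non-roots enter the broadcast late

    double t0 = MPI_Wtime();
    MPI_Bcast(buf.data(), count, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    // For a small (eager) message the root typically returns long before the
    // sleeping ranks enter; for a large (rendezvous) message it usually blocks.
    printf("rank %d: %.6f s in MPI_Bcast\n", rank, t1 - t0);

    MPI_Finalize();
    return 0;
}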

Jeff

--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Konstantinos Konstantinidis
2017-10-23 07:16:47 UTC
In any case, do you think that the time NOT spent on actual data transmission
can affect the total time of the broadcast, especially when there are so many
groups communicating (please refer to the numbers I gave earlier to get an
idea)?

Also, is there any way to quantify this impact, i.e. to measure the time not
spent on actual data transmission?

Kostas
Gilles Gouaillardet
2017-10-23 11:19:07 UTC
Konstantinos,

A simple way is to write your own MPI_Bcast() that inserts a timer and a
PMPI_Barrier() call before invoking the real PMPI_Bcast().
The time spent in PMPI_Barrier() can be seen as time NOT spent on actual
data transmission, and since all tasks are synchronized upon exiting the
barrier, the time spent in PMPI_Bcast() can be seen as time spent on actual
data transmission.
This is not perfect, but it is a pretty good approximation.
You can add extra timers so you end up with an idea of how much time
is spent in PMPI_Barrier() vs PMPI_Bcast().
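
Something like the following minimal wrapper (just a sketch: the counters and
the report in MPI_Finalize() are one possible way to collect the numbers, and
if your code uses the C++ bindings you should check that they end up in the C
MPI_Bcast()):

#include <mpi.h>
#include <cstdio>

// accumulated per rank, reported at MPI_Finalize()
static double sync_time  = 0.0;   // time waiting in the barrier
static double bcast_time = 0.0;   // time in the actual broadcast

extern "C" int MPI_Bcast(void *buf, int count, MPI_Datatype type,
                         int root, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    PMPI_Barrier(comm);                 // wait until everyone has arrived
    double t1 = MPI_Wtime();
    int rc = PMPI_Bcast(buf, count, type, root, comm);
    double t2 = MPI_Wtime();

    sync_time  += t1 - t0;
    bcast_time += t2 - t1;
    return rc;
}

extern "C" int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: barrier (sync) %.6f s, bcast (data) %.6f s\n",
           rank, sync_time, bcast_time);
    return PMPI_Finalize();
}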

Cheers,

Gilles

Barrett, Brian via users
2017-10-23 15:20:27 UTC
Gilles suggested your best next course of action; time the MPI_Bcast and MPI_Barrier calls and see if there’s a non-linear scaling effect as you increase group size.
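
Something along these lines would do it (a rough sketch, not your code: the
payload size is a placeholder, and I'm splitting MPI_COMM_WORLD into groups
with MPI_Comm_split just for illustration):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    const int count = 1 << 20;                  // 1 MiB payload, adjust to your case
    std::vector<unsigned char> buf(count, 0);
    const int group_sizes[] = { 5, 10 };

    for (int g : group_sizes) {
        MPI_Comm group_comm;
        MPI_Comm_split(MPI_COMM_WORLD, world_rank / g, world_rank, &group_comm);

        double t0 = MPI_Wtime();
        MPI_Barrier(group_comm);                // synchronization cost
        double t1 = MPI_Wtime();
        MPI_Bcast(buf.data(), count, MPI_UNSIGNED_CHAR, 0, group_comm);
        double t2 = MPI_Wtime();

        if (world_rank == 0)                    // report the first group's timings
            printf("group size %d: barrier %.6f s, bcast %.6f s\n",
                   g, t1 - t0, t2 - t1);

        MPI_Comm_free(&group_comm);
        MPI_Barrier(MPI_COMM_WORLD);            // separate the two measurements
    }

    MPI_Finalize();
    return 0;
}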

You mention that you’re using m3.large instances; while this isn’t the list for in-depth discussion about EC2 instances (the AWS Forums are better for that), I’ll note that unless you’re tied to m3 for organizational or reserved instance reasons, you’ll probably be happier on another instance type. m3 was one of the last instance families released which does not support Enhanced Networking. There’s significantly more jitter and latency in the m3 network stack compared to platforms which support Enhanced Networking (including the m4 platform). If networking costs are causing your scaling problems, the first step will be migrating instance types.

Brian
Konstantinos Konstantinidis
2017-10-23 19:27:43 UTC
I do not completely understand whether that involves changing some MPI code. I
have no prior experience with that.

But if I get the idea correctly, something like this could potentially work
(assume that comm is the communicator of the group that communicates at each
iteration):


clock_t total_time = clock();
clock_t sync_time = 0;

for each transmission {

    sync_time = sync_time - clock();
    comm.Barrier();
    sync_time = sync_time + clock();

    comm.Bcast(...);
}

total_time = clock() - total_time;

// Total time
double t_time = double(total_time) / CLOCKS_PER_SEC;

// Synchronization time
double s_time = double(sync_time) / CLOCKS_PER_SEC;

// Actual data transmission time
double d_time = t_time - s_time;


I know that I have added a useless barrier call, but do you think that this
can work the way I think it will and at least give some idea of the
synchronization time?

Barrett, I am also working on switching to m4.large instances and will
check if this helps.

Regards,
Kostas
Gilles Gouaillardet
2017-10-24 00:44:35 UTC
Konstantinos,


I previously suggested you use the profiling interface (aka PMPI) specified in
the MPI standard.

An example is available at
http://mpi-forum.org/docs/mpi-3.1/mpi31-report/node363.htm#Node363

The advantage is that you only need to rewrite MPI_Bcast() once, instead of
adding code around each MPI_Bcast() invocation.

Your option will also work, and fwiw, I suggest you use the standard
MPI_Wtime() instead of clock().

Strictly speaking, you should time PMPI_Barrier(), take the maximum across
all ranks, and then sum these maxima in order to get the total time spent in
synchronization.
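
For example, a rough sketch (the helper name timed_bcast() and the payload
are made up for illustration, and the MPI_Reduce itself adds a little
overhead to the measurement):

#include <mpi.h>
#include <cstdio>
#include <vector>

// One timed barrier + broadcast on communicator comm.
// Returns the maximum barrier time across the ranks of comm (valid on root).
double timed_bcast(unsigned char *buf, int count, int root, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    MPI_Barrier(comm);
    double my_sync = MPI_Wtime() - t0;

    MPI_Bcast(buf, count, MPI_UNSIGNED_CHAR, root, comm);

    double max_sync = 0.0;                      // only meaningful on root
    MPI_Reduce(&my_sync, &max_sync, 1, MPI_DOUBLE, MPI_MAX, root, comm);
    return max_sync;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<unsigned char> buf(1 << 20, 0);  // placeholder payload
    double total_sync = 0.0;

    for (int it = 0; it < 100; ++it)             // stand-in for your transmission loop
        total_sync += timed_bcast(buf.data(), (int)buf.size(), 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total synchronization time (sum of per-iteration maxima): %.6f s\n",
               total_sync);

    MPI_Finalize();
    return 0;
}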


Cheers,


Gilles