Discussion:
[OMPI users] How can I measure synchronization time of MPI_Bcast()
Konstantinos Konstantinidis
2017-10-20 22:23:46 UTC
Hi,

I am running some tests on Amazon EC2 and they require a lot of
communication among m3.large instances.

I would like to give you an idea of what kind of communication takes place.
There are 40 m3.large instances. Now, 28672 groups of 5 instances are
formed in a specific manner (let's skip the details). Within each group,
each instance broadcasts some unsigned char data to the other 4 instances
in the group. So within each group, exactly 5 broadcasts take place.

The problem is that if I increase the group size from 5 to 10, there is a
significant drop in the effective transmission rate, which, based on some
theoretical results, should not happen.

I want to check whether part of the reason is the time the instances need to
synchronize when they call MPI_Bcast(), since it is a collective function. As
far as I know, all of the machines participating in the broadcast need to call
it and then synchronize before the actual data transfer starts. Is there any
way to measure this synchronization time?

The code is in C++, and the MPI installation is described in the attached file.
Jeff Hammond
2017-10-21 03:32:21 UTC
Broadcast is collective but not necessarily synchronous in the sense you
imply. If you broadcast a message smaller than the eager limit, the root may
return before any non-root process enters the function. Data transfer may
happen before processes enter the function. Only the rendezvous protocol
forces synchronization between any two processes, and even then there may
still be asynchrony between different levels of the broadcast tree.
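
You can see this yourself with a toy test along these lines (just a sketch,
not code from this thread; the message size and the sleep are arbitrary, and
the eager limit depends on your BTL/PML settings):

#include <mpi.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1024;                     // try values below and above the eager limit
    std::vector<unsigned char> buf(count, 0);

    if (rank != 0)
        sleep(2);                               // non-roots enter the broadcast late

    double t0 = MPI_Wtime();
    MPI_Bcast(buf.data(), count, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    // For a small (eager) message the root typically returns long before the
    // sleeping ranks enter; for a large (rendezvous) message it usually blocks.
    printf("rank %d: %.6f s in MPI_Bcast\n", rank, t1 - t0);

    MPI_Finalize();
    return 0;
}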

Jeff

--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Konstantinos Konstantinidis
2017-10-23 07:16:47 UTC
In any case, do you think that the time NOT spent on actual data transmission
can affect the total time of the broadcast, especially when there are so many
groups communicating (please refer to the numbers I gave earlier to get an
idea)?

Also, is there any way to quantify this impact, i.e. to measure the time not
spent on actual data transmission?

Kostas
Gilles Gouaillardet
2017-10-23 11:19:07 UTC
Konstantinos,

A simple way is to write your own MPI_Bcast() that inserts a timer and a
PMPI_Barrier() call before invoking the real PMPI_Bcast().
The time spent in PMPI_Barrier() can be seen as time NOT spent on actual
data transmission, and since all tasks are synchronized upon exiting the
barrier, the time spent in PMPI_Bcast() can be seen as time spent on actual
data transmission.
This is not perfect, but it is a pretty good approximation.
You can add extra timers so you end up with an idea of how much time
is spent in PMPI_Barrier() vs PMPI_Bcast().
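
Something like the following minimal wrapper (just a sketch: the counters and
the report in MPI_Finalize() are one possible way to collect the numbers, and
if your code uses the C++ bindings you should check that they end up in the C
MPI_Bcast()):

#include <mpi.h>
#include <cstdio>

// accumulated per rank, reported at MPI_Finalize()
static double sync_time  = 0.0;   // time waiting in the barrier
static double bcast_time = 0.0;   // time in the actual broadcast

extern "C" int MPI_Bcast(void *buf, int count, MPI_Datatype type,
                         int root, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    PMPI_Barrier(comm);                 // wait until everyone has arrived
    double t1 = MPI_Wtime();
    int rc = PMPI_Bcast(buf, count, type, root, comm);
    double t2 = MPI_Wtime();

    sync_time  += t1 - t0;
    bcast_time += t2 - t1;
    return rc;
}

extern "C" int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: barrier (sync) %.6f s, bcast (data) %.6f s\n",
           rank, sync_time, bcast_time);
    return PMPI_Finalize();
}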

Cheers,

Gilles

Barrett, Brian via users
2017-10-23 15:20:27 UTC
Gilles suggested your best next course of action; time the MPI_Bcast and MPI_Barrier calls and see if there’s a non-linear scaling effect as you increase group size.
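
Something along these lines would do it (a rough sketch, not your code: the
payload size is a placeholder, and I'm splitting MPI_COMM_WORLD into groups
with MPI_Comm_split just for illustration):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    const int count = 1 << 20;                  // 1 MiB payload, adjust to your case
    std::vector<unsigned char> buf(count, 0);
    const int group_sizes[] = { 5, 10 };

    for (int g : group_sizes) {
        MPI_Comm group_comm;
        MPI_Comm_split(MPI_COMM_WORLD, world_rank / g, world_rank, &group_comm);

        double t0 = MPI_Wtime();
        MPI_Barrier(group_comm);                // synchronization cost
        double t1 = MPI_Wtime();
        MPI_Bcast(buf.data(), count, MPI_UNSIGNED_CHAR, 0, group_comm);
        double t2 = MPI_Wtime();

        if (world_rank == 0)                    // report the first group's timings
            printf("group size %d: barrier %.6f s, bcast %.6f s\n",
                   g, t1 - t0, t2 - t1);

        MPI_Comm_free(&group_comm);
        MPI_Barrier(MPI_COMM_WORLD);            // separate the two measurements
    }

    MPI_Finalize();
    return 0;
}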

You mention that you’re using m3.large instances; while this isn’t the list for in-depth discussion about EC2 instances (the AWS Forums are better for that), I’ll note that unless you’re tied to m3 for organizational or reserved instance reasons, you’ll probably be happier on another instance type. m3 was one of the last instance families released which does not support Enhanced Networking. There’s significantly more jitter and latency in the m3 network stack compared to platforms which support Enhanced Networking (including the m4 platform). If networking costs are causing your scaling problems, the first step will be migrating instance types.

Brian
Konstantinos Konstantinidis
2017-10-23 19:27:43 UTC
I do not completely understand whether that involves changing some MPI code. I
have no prior experience with that.

But if I get the idea correctly, something like this could potentially work
(assume that comm is the communicator of the group that communicates at each
iteration):


clock_t total_time = clock();
clock_t sync_time = 0;

for each transmission {

    sync_time = sync_time - clock();
    comm.Barrier();
    sync_time = sync_time + clock();

    comm.Bcast(...);
}

total_time = clock() - total_time;

// Total time
double t_time = double(total_time) / CLOCKS_PER_SEC;

// Synchronization time
double s_time = double(sync_time) / CLOCKS_PER_SEC;

// Actual data transmission time
double d_time = t_time - s_time;


I know that I have added a useless barrier call, but do you think that this
can work the way I think it will and at least give some idea of the
synchronization time?

Barrett, I am also working on switching to m4.large instances and will
check if this helps.

Regards,
Kostas
Gilles Gouaillardet
2017-10-24 00:44:35 UTC
Konstantinos,


I previously suggested you use the profiling interface (aka PMPI) specified in
the MPI standard.

An example is available at
http://mpi-forum.org/docs/mpi-3.1/mpi31-report/node363.htm#Node363

The advantage is that you only need to rewrite MPI_Bcast() once, instead of
adding code around each MPI_Bcast() invocation.

Your option will also work, and fwiw, I suggest you use the standard
MPI_Wtime() instead of clock().

Strictly speaking, you should time PMPI_Barrier(), take the maximum across
all ranks, and then sum these maxima in order to get the total time spent in
synchronization.
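
For example, a rough sketch (the helper name timed_bcast() and the payload
are made up for illustration, and the MPI_Reduce itself adds a little
overhead to the measurement):

#include <mpi.h>
#include <cstdio>
#include <vector>

// One timed barrier + broadcast on communicator comm.
// Returns the maximum barrier time across the ranks of comm (valid on root).
double timed_bcast(unsigned char *buf, int count, int root, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    MPI_Barrier(comm);
    double my_sync = MPI_Wtime() - t0;

    MPI_Bcast(buf, count, MPI_UNSIGNED_CHAR, root, comm);

    double max_sync = 0.0;                      // only meaningful on root
    MPI_Reduce(&my_sync, &max_sync, 1, MPI_DOUBLE, MPI_MAX, root, comm);
    return max_sync;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<unsigned char> buf(1 << 20, 0);  // placeholder payload
    double total_sync = 0.0;

    for (int it = 0; it < 100; ++it)             // stand-in for your transmission loop
        total_sync += timed_bcast(buf.data(), (int)buf.size(), 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total synchronization time (sum of per-iteration maxima): %.6f s\n",
               total_sync);

    MPI_Finalize();
    return 0;
}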


Cheers,


Gilles