Discussion: [OMPI users] tuning sm/vader for large messages
Joshua Mora
2017-03-17 17:47:26 UTC
Hello,
I am trying to get the maximum bandwidth for shared-memory communication using
the osu_[bw,bibw,mbw_mr] benchmarks.
I observe a peak at ~64K/128K message size, after which the bandwidth drops
instead of being sustained.
What parameters or Linux configuration do I need to add to the default Open MPI
settings to improve this? I am already using vader and knem.

The one-way bandwidth below peaks at 64K.

# Size Bandwidth (MB/s)
1 1.02
2 2.13
4 4.03
8 8.48
16 11.90
32 23.29
64 47.33
128 88.08
256 136.77
512 245.06
1024 263.79
2048 405.49
4096 1040.46
8192 1964.81
16384 2983.71
32768 5705.11
65536 7181.11
131072 6490.55
262144 4449.59
524288 4898.14
1048576 5324.45
2097152 5539.79
4194304 5669.76

Thanks,
Joshua
George Bosilca
2017-03-17 19:14:09 UTC
Joshua,

In shared memory the bandwidth depends on many parameters, including process
placement and the sizes of the different cache levels. In your particular case
my guess is that past 128K you fall outside the L2 cache (half of the cache, in
fact), and the bandwidth drops because the data needs to be flushed
to main memory.
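
If you want to see the cache effect in isolation, without MPI in the picture, a
rough standalone sketch like the one below shows the same cliff once the buffer
no longer fits in cache. The sizes, the iteration counts and the use of a plain
memcpy are arbitrary choices for illustration; it does not pin the process or
flush caches, so take the numbers with a grain of salt (compile with something
like cc -O2, plus -lrt on older glibc).

/* Rough sketch: copy buffers of increasing size and report MB/s.  The
 * bandwidth should drop once the working set no longer fits in the caches. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    for (size_t size = 1024; size <= 8u * 1024 * 1024; size *= 2) {
        char *src = malloc(size), *dst = malloc(size);
        memset(src, 1, size);
        int iters = (int)(256u * 1024 * 1024 / size);   /* ~256 MB copied per size */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            memcpy(dst, src, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        volatile char sink = dst[size - 1];   /* keep the copies from being optimized out */
        (void)sink;
        printf("%8zu %10.2f MB/s\n", size, (double)size * iters / sec / 1e6);
        free(src);
        free(dst);
    }
    return 0;
}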

George.
Joshua Mora
2017-03-17 19:33:11 UTC
Thanks for the quick reply.
This test is between two cores that sit on different CPUs, so the data has to
traverse the coherent fabric (e.g. QPI, UPI, or cHT). My assumption was that it
has to go through main memory regardless of cache size. Is that wrong?
Can data be evicted from the cache of one core and placed into the cache of a
core on the other CPU without first landing in main memory?
I was rather thinking that there is a parameter that splits large messages into
smaller ones at 64K or 128K. That seems (wrong assumption?) like the kind of
parameter I would need for large messages on a NIC: coalescing data, a large
MTU, and so on.

Joshua
George Bosilca
2017-03-17 20:06:09 UTC
Post by Joshua Mora
This test is between two cores that sit on different CPUs, so the data has to
traverse the coherent fabric (e.g. QPI, UPI, or cHT). It has to go through main
memory regardless of cache size. Is that a wrong assumption?
Depends on the usage pattern. Some benchmarks have options to clean/flush
the cache before each round of tests.
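To make that concrete: on x86 a benchmark can start each round with cold caches
by flushing every cache line of its buffers, along the lines of the hypothetical
helper below (the name and the 64-byte line size are my assumptions; this is not
something Open MPI itself provides).

/* Hypothetical helper: evict a buffer from the caches by flushing every line. */
#include <emmintrin.h>   /* _mm_clflush and _mm_mfence (SSE2, x86 only) */
#include <stdlib.h>

static void flush_buffer(const void *buf, size_t len)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < len; i += 64)    /* assume 64-byte cache lines */
        _mm_clflush(p + i);
    _mm_mfence();                           /* wait for the flushes to complete */
}

int main(void)
{
    size_t len = 1 << 20;
    char *buf = malloc(len);
    /* ... touch buf as the benchmark would ... */
    flush_buffer(buf, len);                 /* next round starts with cold caches */
    free(buf);
    return 0;
}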
Post by Joshua Mora
Can data be evicted from the cache of one core and placed into the cache of a
core on the other CPU without first landing in main memory?
It would depend on the memory coherency protocol. Usually it gets marked as
shared, and as a result it might not need to be pushed into main memory
right away.
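(As a concrete example: the MOESI protocol used over AMD's coherent
HyperTransport lets a core supply a dirty line straight from its cache to a
requester on the other socket without writing it back to DRAM first; whether and
when that happens is up to the processor, not to MPI or to vader.)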
Post by Joshua Mora
I was rather thinking that there is a parameter that splits large messages into
smaller ones at 64K or 128K.
Pipelining is not the answer in every situation. Once your messages are larger
than the caches you have already built up memory pressure, so the pipelining is
bound by the memory bandwidth.
Post by Joshua Mora
That seems like the kind of parameter I would need for large messages on a NIC:
coalescing data, a large MTU, and so on.
Sure, but there are hard limits imposed by the hardware, especially with regard
to intranode communication. Once you saturate the memory bus, you hit a pretty
hard limit.

George.
Joshua Mora
2017-03-20 16:45:16 UTC
If at a certain message size x you achieve X performance (MB/s), and at 2x or
larger you achieve Y performance, with Y significantly lower than X, is it
possible to have a parameter that chops messages internally into pieces of size
x in order to sustain X performance rather than letting it choke? A sort of
flow control to avoid congestion?
If that is possible, what would that parameter be for vader?
Other than the source code, is there any detailed documentation or study of
vader-related parameters for improving bandwidth at large message sizes? I did
see some documentation for sm, but not for vader.

Thanks,
Joshua
George Bosilca
2017-03-20 22:11:06 UTC
Post by Joshua Mora
If at a certain message size x you achieve X performance (MB/s), and at 2x or
larger you achieve Y performance, with Y significantly lower than X, is it
possible to have a parameter that chops messages internally into pieces of size
x in order to sustain X performance rather than letting it choke?
Unfortunately not. After a certain message size you hit the hardware memory
bandwidth limit, and no pipeline can help. To push it up you will need to
have a single copy instead of 2, but vader should do this by default as
long as KNEM or CMA are available on the machine.
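As a rough back-of-the-envelope check (arithmetic, not a measurement): with a
copy-in/copy-out protocol every byte is copied twice, user buffer to shared
segment and shared segment to user buffer, so the ~5.5 GB/s you see at 4 MB
already corresponds to a multiple of that in raw memory and inter-socket
traffic. To see which mechanism vader is actually using, I believe
ompi_info --param btl vader --level 9 should list the relevant parameters
(btl_vader_single_copy_mechanism among them).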

George.
Post by Joshua Mora
A sort of flow control to avoid congestion? If that is possible, what would
that parameter be for vader? Other than the source code, is there any detailed
documentation or study of vader-related parameters for improving bandwidth at
large message sizes? I did see some documentation for sm, but not for vader.
Joshua Mora
2017-03-21 00:11:57 UTC
I don't want to push the peak higher.
I just want to sustain the same bandwidth by sending at that optimal size: a
constant bandwidth from that size upward, not a significant drop once I cross a
certain message size.

Gilles Gouaillardet
2017-03-21 00:21:16 UTC
Joshua,


As George explained, you are limited by the size of your level-X cache.

That means you may get optimal performance for a given message size, say when
everything fits in the L2 cache. When you increase the message size, the L2
cache becomes too small and you have to go through the L3 cache, which is
obviously slower, hence the drop in performance.

So sending/receiving the same small message twice can be faster than
sending/receiving one message twice as large, simply because of the cache size.
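
If you want to experiment with the chopping yourself at the application level
(which is essentially what you were asking the library to do), a rough sketch
like the one below compares one big send against streaming the same buffer in
64 KB pieces between two ranks. The sizes, the iteration count and the crude
barrier-based timing are arbitrary choices of mine, and as George pointed out,
every byte still has to come from main memory once you are out of cache, so do
not expect it to recover the 64 KB peak. Build with mpicc and run with two ranks.

/* Sketch only: compare one 4 MB MPI_Send per iteration against the same 4 MB
 * streamed as 64 KB chunks.  Assumes exactly two ranks; not a careful
 * benchmark like osu_bw. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { TOTAL = 4 * 1024 * 1024, CHUNK = 64 * 1024, ITERS = 100 };

static double run(char *buf, int rank, int chunk)
{
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        for (int off = 0; off < TOTAL; off += chunk) {
            if (rank == 0)
                MPI_Send(buf + off, chunk, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf + off, chunk, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);               /* crude way to close the timing */
    return (double)TOTAL * ITERS / (MPI_Wtime() - t0) / 1e6;   /* MB/s */
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(TOTAL);
    memset(buf, 0, TOTAL);

    double whole   = run(buf, rank, TOTAL);    /* one 4 MB send per iteration    */
    double chunked = run(buf, rank, CHUNK);    /* 64 x 64 KB sends per iteration */
    if (rank == 0)
        printf("whole message: %.1f MB/s, 64 KB chunks: %.1f MB/s\n",
               whole, chunked);

    free(buf);
    MPI_Finalize();
    return 0;
}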


Cheers,


Gilles