Discussion: [OMPI users] MPI Windows: performance of local memory access
Joseph Schuchart
2018-05-23 10:45:12 UTC
All,

We are observing some strange/interesting performance issues in
accessing memory that has been allocated through MPI_Win_allocate. I am
attaching our test case, which allocates memory for 100M integer values
on each process both through malloc and MPI_Win_allocate and writes to
the local ranges sequentially.
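
(The attached test case is not reproduced inline here; a minimal sketch of the access pattern it times could look like the following -- buffer size and variable names are illustrative, and the real test repeats the timed loops several times.)

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_ELEMS 100000000UL   /* 100M integers per process */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* local allocation through malloc */
    int *mem_malloc = malloc(NUM_ELEMS * sizeof(int));

    /* local allocation through an MPI window */
    int *mem_win;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)(NUM_ELEMS * sizeof(int)), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &mem_win, &win);

    /* sequential writes to the local range, timed separately */
    double t = MPI_Wtime();
    for (size_t i = 0; i < NUM_ELEMS; ++i) mem_malloc[i] = (int)i;
    double t_malloc = MPI_Wtime() - t;

    t = MPI_Wtime();
    for (size_t i = 0; i < NUM_ELEMS; ++i) mem_win[i] = (int)i;
    double t_win = MPI_Wtime() - t;

    printf("malloc: %.3fs  MPI_Win_allocate: %.3fs\n", t_malloc, t_win);

    MPI_Win_free(&win);
    free(mem_malloc);
    MPI_Finalize();
    return 0;
}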

On different systems (incl. SuperMUC and a Bull Cluster), we see that
accessing the memory allocated through MPI is significantly slower than
accessing the malloc'ed memory when multiple processes run on a single
node, and the effect grows with the number of processes per node. As an
example, with 24 processes per node the attached test completes the
operations on the malloc'ed memory in ~0.4s, while the same operations
on the MPI-allocated memory take up to 10s.

After some experiments, I think there are two factors involved:

1) Initialization: it appears that the first iteration is significantly
slower than any subsequent access (1.1s vs 0.4s with 12 processes on a
single socket). Excluding the first iteration from the timing or
memsetting the range beforehand leads to comparable performance. I
assume this is due to page faults triggered when the mmap'ed memory
that backs the window's shared memory is first touched. The effect of
presetting the malloc'ed memory seems smaller (0.4s vs 0.6s).
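
(The "memsetting" above simply means faulting the pages in once before the timed loops, e.g. -- using the illustrative names from the sketch above, and <string.h>:

    memset(mem_win, 0, NUM_ELEMS * sizeof(int));
    memset(mem_malloc, 0, NUM_ELEMS * sizeof(int));

so that every process touches its own local range before the measurement starts.)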

2) NUMA effects: even with proper initialization, running on two sockets
still leads to fluctuating performance degradation for the MPI window
memory, up to 20x in extreme cases. The performance of accessing the
malloc'ed memory is rather stable. The difference seems to shrink (but
does not disappear) with an increasing number of repetitions. I am not
sure what causes these effects, as each process should first-touch its
own local memory.

Are these known issues? Does anyone have any thoughts on my analysis?

It is problematic for us that replacing local memory allocation with MPI
memory allocation leads to this performance degradation, as we rely on
this mechanism in our distributed data structures. While we can ensure
proper initialization of the memory to mitigate 1) for performance
measurements, I don't see a way to control the NUMA effects. If there is
one, I'd be happy about any hints :)

I should note that we also tested MPICH-based implementations, which
showed similar effects (as they also mmap their window memory). Not
surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic
window does not cause these effects, while using shared memory windows
does. I ran my experiments with Open MPI 3.1.0 using the following
command lines:

- 12 cores / 1 socket:
mpirun -n 12 --bind-to socket --map-by ppr:12:socket
- 24 cores / 2 sockets:
mpirun -n 24 --bind-to socket

and verified the binding using --report-bindings.

Any help or comment would be much appreciated.

Cheers
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: ***@hlrs.de
Nathan Hjelm
2018-05-23 12:04:22 UTC
What Open MPI version are you using? Does this happen when you run on a single node or multiple nodes?

-Nathan
Joseph Schuchart
2018-05-23 12:11:39 UTC
I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC
7.1.0 on the Bull Cluster. I only ran on a single node; I haven't yet
tested what happens when more than one node is involved.

Joseph
Nathan Hjelm
2018-05-23 12:26:09 UTC
Odd. I wonder if this is related to where your session directory lives. It might be worth moving the segment to /dev/shm. I don't expect it to have an impact, but you could try the following patch:


diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
index f7211cd93c..bfc26b39f2 100644
--- a/ompi/mca/osc/sm/osc_sm_component.c
+++ b/ompi/mca/osc/sm/osc_sm_component.c
@@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
     posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
     if (0 == ompi_comm_rank (module->comm)) {
         char *data_file;
-        if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
-                     ompi_process_info.proc_session_dir,
+        if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
+                     ompi_process_info.my_name.jobid,
                      ompi_comm_get_cid(module->comm),
                      ompi_process_info.nodename) < 0) {
            return OMPI_ERR_OUT_OF_RESOURCE;
Joseph Schuchart
2018-05-24 12:46:57 UTC
Thank you all for your input!

Nathan: thanks for that hint, this seems to be the culprit: with your
patch I no longer observe a performance difference between the two
memory allocations. I also remembered that Open MPI allows changing the
shmem allocator on the command line: with vanilla Open MPI 3.1.0,
raising the priority of the POSIX shmem component using `--mca
shmem_posix_priority 100` leads to good performance, too. The reason
could be that on the Bull machine /tmp is mounted on a disk partition
(an SSD, iirc). Maybe there is actual I/O involved that hurts performance
if the shm backing file is located on a disk (even though the file is
unlinked before the memory is accessed)?
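
(For reference, an illustrative command line -- the binary name is made up:

mpirun --mca shmem_posix_priority 100 -n 24 --bind-to socket ./mpiwin_vs_malloc
)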

Regarding the other hints: I tried using MPI_Win_allocate_shared with
the noncontig hint. Using POSIX shmem, I do not observe a performance
difference compared to the other two options. With the disk-backed shmem
file, the performance fluctuations are similar to MPI_Win_allocate.
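
For completeness, a minimal sketch of how the noncontig hint is passed (illustrative, not the exact test code; NUM_ELEMS as in the earlier sketch):

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "alloc_shared_noncontig", "true");

    int *baseptr;
    MPI_Win win;
    /* all processes run on one node here, so MPI_COMM_WORLD is node-local */
    MPI_Win_allocate_shared((MPI_Aint)(NUM_ELEMS * sizeof(int)), sizeof(int),
                            info, MPI_COMM_WORLD, &baseptr, &win);
    MPI_Info_free(&info);
    /* ... write to baseptr[0 .. NUM_ELEMS-1] as before ... */
    MPI_Win_free(&win);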

On this machine /proc/sys/kernel/numa_balancing is not available, so I
assume NUMA balancing is not the cause in this case. It is good to know,
though, that this might become an issue on other systems in the future.

Cheers
Joseph
Nathan Hjelm
2018-05-24 13:09:12 UTC
Ok, thanks for testing that. I will open a PR against master changing the default backing location to /dev/shm on Linux; it will then be PR'd to v3.0.x and v3.1.x.

-Nathan
Nathan Hjelm
2018-05-24 14:28:01 UTC
PR is up

https://github.com/open-mpi/ompi/pull/5193


-Nathan
Joseph Schuchart
2018-05-24 14:53:57 UTC
Nathan, thanks for taking care of this! I looked at the PR and wonder:
why don't we move the whole session directory to /dev/shm on Linux
instead of introducing a new MCA parameter?

Joseph
Jeff Hammond
2018-05-23 15:16:41 UTC
This is very interesting. Thanks for providing a test case. I have two
suggestions for understanding this better.

1) Use MPI_Win_allocate_shared instead and measure the difference with and
without alloc_shared_noncontig. I think this info is not available for
MPI_Win_allocate because MPI_Win_shared_query is not permitted on
MPI_Win_allocate windows. This is a flaw in MPI-3 that I would like to see
fixed (https://github.com/mpi-forum/mpi-issues/issues/23).

2) Extend your test to allocate with mmap and measure with various sets of
map flags (http://man7.org/linux/man-pages/man2/mmap.2.html). Comparing
MAP_SHARED and MAP_PRIVATE is the right place to start; this experiment
should make the cause unambiguous.
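
A minimal sketch of such an mmap comparison (anonymous mappings; size and names are illustrative -- the timing code is left out):

#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_ELEMS 100000000UL   /* 100M ints, as in the test case */

int main(void)
{
    size_t bytes = NUM_ELEMS * sizeof(int);

    /* private anonymous mapping: what malloc typically uses for large blocks */
    int *priv = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* shared anonymous mapping: closer to (though not identical to) the
     * file-backed MAP_SHARED mapping used for shared-memory windows */
    int *shrd = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (priv == MAP_FAILED || shrd == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* time these sequential writes separately, as in the MPI test */
    for (size_t i = 0; i < NUM_ELEMS; ++i) priv[i] = (int)i;
    for (size_t i = 0; i < NUM_ELEMS; ++i) shrd[i] = (int)i;

    munmap(priv, bytes);
    munmap(shrd, bytes);
    return EXIT_SUCCESS;
}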

Most likely this is due to shared vs. private mapping, but there is a
tradeoff w.r.t. RMA performance. It depends on your network and how the
MPI implementation uses it, but MPI_Win_create_dynamic likely leads to much
worse RMA performance than MPI_Win_allocate, and MPI_Win_create with a
malloc'd buffer may perform worse than MPI_Win_allocate for internode RMA
if the MPI implementation is lazy and doesn't cache page registration in
MPI_Win_create.

Jeff
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
George Bosilca
2018-05-23 15:45:56 UTC
We had a similar issue a few months back. After investigation it turned
out to be related to NUMA balancing [1] being enabled by default in
recent releases of Linux-based OSes.

In our case, turning off NUMA balancing fixed most of the performance
inconsistencies we saw. You can check its status in
/proc/sys/kernel/numa_balancing.
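
For example, on Linux systems where it is available you can check it, and (as root) disable it, with:

cat /proc/sys/kernel/numa_balancing
echo 0 > /proc/sys/kernel/numa_balancing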

George.