Joseph Schuchart
2017-08-24 13:31:35 UTC
All,
I have been experimenting with large window allocations recently and
have made some interesting observations that I would like to share.
The system under test:
- Linux cluster equipped with IB
- Open MPI 2.1.1
- 128GB main memory per node
- 6GB /tmp filesystem per node
My observations:
1) Running with 1 process on a single node, I can allocate and write to
memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and
MPI_Win_allocate_shared.
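In essence, the per-call pattern is the following (a simplified sketch,
not the attached file; the 1 GiB SIZE here is only a placeholder):
```
/* Simplified sketch of observation 1 (not the attached benchmark):
 * allocate SIZE bytes, touch every page, free, for each of the three
 * calls. SIZE is a placeholder; the attached code sizes this differently. */
#include <mpi.h>
#include <string.h>

#define SIZE ((MPI_Aint)1 << 30)   /* 1 GiB, placeholder */

int main(int argc, char **argv)
{
    char   *buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);

    /* plain allocation, no window */
    MPI_Alloc_mem(SIZE, MPI_INFO_NULL, &buf);
    memset(buf, 0, SIZE);
    MPI_Free_mem(buf);

    /* RMA window with MPI-allocated memory */
    MPI_Win_allocate(SIZE, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);
    memset(buf, 0, SIZE);
    MPI_Win_free(&win);

    /* shared-memory window; with a single process MPI_COMM_WORLD is fine,
     * otherwise a node-local communicator would be used */
    MPI_Win_allocate_shared(SIZE, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);
    memset(buf, 0, SIZE);
    MPI_Win_free(&win);

    MPI_Finalize();
    return 0;
}
```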
2) If running with 1 process per node on 2 nodes, single large
allocations succeed, but with the repeated allocate/free cycle in the
attached code the application is reproducibly killed by the OOM killer
at a 25GB allocation with MPI_Win_allocate_shared. When I try to run it
under Valgrind I get an error from MPI_Win_allocate at ~50GB that I
cannot make sense of:
```
MPI_Alloc_mem: 53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[n131302:11989] *** and potentially your MPI job)
```
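The allocate/free cycle itself looks roughly like this (again a
simplified sketch; the growth step and the MEM_PER_NODE default are
placeholders, the attached code differs in detail):
```
/* Sketch of the allocate/free cycle: grow the shared window size until
 * MEM_PER_NODE is reached, freeing each window before the next one. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MEM_PER_NODE ((MPI_Aint)100 << 30)   /* adjust to your system */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* node-local communicator for the shared windows */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    for (MPI_Aint size = (MPI_Aint)5 << 30; size <= MEM_PER_NODE;
         size += (MPI_Aint)5 << 30) {
        char   *buf;
        MPI_Win win;

        printf("MPI_Alloc_mem: %lld B\n", (long long)size);
        MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, nodecomm, &buf, &win);
        memset(buf, 0, size);     /* touch every page */
        MPI_Win_free(&win);       /* freed before the next, larger allocation */
    }

    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}
```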
3) If running with 2 processes on a node, I get the following error from
both MPI_Win_allocate and MPI_Win_allocate_shared:
```
--------------------------------------------------------------------------
It appears as if there is not enough space for
/tmp/openmpi-sessions-***@n131702_0/23041/1/0/shared_window_4.n131702
(the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.
Local host: n131702
Space Requested: 6710890760 B
Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Alloc_mem works
as expected, i.e., I can allocate ~50GB per process. I understand that I
can set $TMP to a bigger filesystem (such as Lustre), but then I am
greeted with a warning on each allocation and performance seems to drop.
Is there a way to fall back to the allocation strategy used in case 2)?
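For illustration, this is the back-of-the-envelope check behind that
conclusion: with 2 processes per node the mmap backing file apparently
has to hold the sum of the per-process segments, which is what exceeds
the 6GB /tmp. The per-process size below is only chosen to roughly
reproduce the "Space Requested" figure in the warning:
```
/* Back-of-the-envelope check: sum of per-process segments vs. space
 * available in /tmp (queried with statvfs()). Numbers are illustrative. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    const long long per_proc   = 3355443200LL;          /* ~3.1 GiB per process */
    const int       procs_node = 2;
    const long long requested  = per_proc * procs_node; /* ~6.7 GB on the node */

    struct statvfs vfs;
    if (statvfs("/tmp", &vfs) == 0) {
        long long available = (long long)vfs.f_bavail * (long long)vfs.f_frsize;
        printf("requested ~%lld B, /tmp has %lld B free -> %s\n",
               requested, available,
               requested <= available ? "fits" : "does not fit");
    }
    return 0;
}
```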
4) It is also worth noting the time it takes to allocate the memory:
while the allocations are in the sub-millisecond range for both
MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate
100GB using MPI_Win_allocate, with the time increasing linearly with the
allocation size.
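A sketch of how such timings can be obtained (the attached benchmark may
do it differently) is simply wrapping the allocation call in MPI_Wtime():
```
/* Time MPI_Win_allocate for a range of sizes. */
#include <mpi.h>
#include <stdio.h>

static double time_win_allocate(MPI_Aint size)
{
    char   *buf;
    MPI_Win win;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);
    double t1 = MPI_Wtime();
    MPI_Win_free(&win);
    return t1 - t0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (MPI_Aint size = (MPI_Aint)10 << 30; size <= (MPI_Aint)100 << 30;
         size += (MPI_Aint)10 << 30) {
        printf("MPI_Win_allocate(%lld B): %.3f s\n",
               (long long)size, time_win_allocate(size));
    }
    MPI_Finalize();
    return 0;
}
```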
Are these issues known? Is there documentation describing work-arounds
(especially for 3) and 4))?
I am attaching a small benchmark. Please make sure to adjust the
MEM_PER_NODE macro to suit your system before you run it :) I'm happy to
provide additional details if needed.
Best
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: ***@hlrs.de