Discussion:
[OMPI users] Issues with Large Window Allocations
Joseph Schuchart
2017-08-24 13:31:35 UTC
All,

I have been experimenting with large window allocations recently and
have made some interesting observations that I would like to share.

The system under test:
- Linux cluster equipped with IB,
- Open MPI 2.1.1,
- 128GB main memory per node
- 6GB /tmp filesystem per node

My observations:
1) Running with 1 process on a single node, I can allocate and write to
memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and
MPI_Win_allocate_shared.

2) When running with 1 process per node on 2 nodes, single large
allocations succeed, but with the repeated allocate/free cycle in the
attached code the application is reproducibly killed by the OOM killer
at a 25GB allocation with MPI_Win_allocate_shared. When I try to run it
under Valgrind, I get an error from MPI_Win_allocate at ~50GB that I
cannot make sense of:

```
MPI_Alloc_mem: 53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[n131302:11989] *** and potentially your MPI job)
```

3) If running with 2 processes on a node, I get the following error from
both MPI_Win_allocate and MPI_Win_allocate_shared:
```
--------------------------------------------------------------------------
It appears as if there is not enough space for
/tmp/openmpi-sessions-***@n131702_0/23041/1/0/shared_window_4.n131702
(the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.

Local host: n131702
Space Requested: 6710890760 B
Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Alloc_mem works
as expected, i.e., I can allocate ~50GB per process. I understand that I
can set $TMP to a bigger filesystem (such as Lustre), but then I am
greeted with a warning on each allocation and performance seems to drop.
Is there a way to fall back to the allocation strategy used in case 2)?

4) It is also worth noting the time it takes to allocate the memory:
while the allocations are in the sub-millisecond range for both
MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate
100GB using MPI_Win_allocate, with the time increasing linearly with the
allocation size.

Are these issues known? Is there documentation describing workarounds
(especially for 3) and 4))?

I am attaching a small benchmark. Please make sure to adjust the
MEM_PER_NODE macro to suit your system before you run it :) I'm happy to
provide additional details if needed.
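
For reference, this is roughly the pattern the benchmark exercises; a
minimal sketch, not the attached code itself (the node-local
communicator, the 5 GB step size, and the MEM_PER_NODE value are
illustrative):

```
/* Minimal sketch of the allocate/touch/free cycle described above.
 * Not the attached benchmark; step size and MEM_PER_NODE are examples. */
#include <mpi.h>
#include <string.h>

#define MEM_PER_NODE (100L << 30)   /* adjust to your system */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    int nodesize;
    MPI_Comm_size(nodecomm, &nodesize);

    /* grow the per-node allocation in 5 GB steps up to MEM_PER_NODE */
    for (MPI_Aint total = 5L << 30; total <= MEM_PER_NODE; total += 5L << 30) {
        MPI_Aint bytes = total / nodesize;   /* per-process share */
        char    *base;
        MPI_Win  win;

        MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, nodecomm,
                                &base, &win);
        memset(base, 0, bytes);              /* touch every page */
        MPI_Win_free(&win);                  /* free before the next size */
    }

    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}
```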

Best
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: ***@hlrs.de
Gilles Gouaillardet
2017-08-24 13:49:33 UTC
Joseph,

The error message suggests that memory allocated with
MPI_Win_allocate[_shared] is backed by a file that is created and then
mmap'ed.
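
(For illustration, a minimal sketch of that mechanism outside of MPI;
the path below is a placeholder, not the session-directory name Open MPI
actually uses.)

```
/* Sketch of the file-backed shared-memory pattern: create a file,
 * size it with ftruncate, and map it MAP_SHARED so that all processes
 * opening the same file see the same memory. The path is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/shared_backing_example";  /* placeholder name */
    size_t size = 1UL << 30;                            /* 1 GiB */

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, size) != 0) { perror("ftruncate"); return 1; }

    /* The mapping consumes space in the filesystem backing /tmp as pages
     * are touched, which is where the "not enough space" warning comes from. */
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) { perror("mmap"); return 1; }

    munmap(ptr, size);
    close(fd);
    unlink(path);
    return 0;
}
```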
How much space do you have in /dev/shm? (This is a tmpfs, i.e. a
RAM-backed file system.)
There is likely quite some space there, so as a workaround I suggest
you use it as the shared-memory backing directory.

/* I am AFK and do not remember the exact syntax; ompi_info --all | grep
backing is likely to help */

Cheers,

Gilles
Joseph Schuchart
2017-08-24 14:41:36 UTC
Gilles,

Thanks for your swift response. On this system, /dev/shm only has 256M
available, so that is unfortunately not an option. I tried disabling
both the vader and sm btl via `--mca btl ^vader,sm`, but Open MPI still
seems to allocate the shmem backing file under /tmp. From my point of
view, missing out on the performance benefits of file-backed shared
memory would be acceptable as long as large allocations work, but I
don't know the implementation details and whether that is possible. It
seems that the mmap does not happen if there is only one process per node.

Cheers,
Joseph
Jeff Hammond
2017-08-25 19:17:05 UTC
There's no reason to do anything special for shared memory with a
single-process job because MPI_Win_allocate_shared(MPI_COMM_SELF) ~=
MPI_Alloc_mem(). However, it would help debugging if MPI implementers at
least had an option to take the code path that allocates shared memory even
when np=1.
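
A minimal sketch of the equivalence, for illustration (both variants
yield plain local memory when there is only one process in the
communicator):

```
/* Sketch of the equivalence described above: on a single process, a
 * shared window over MPI_COMM_SELF provides essentially the same thing
 * as MPI_Alloc_mem, i.e. local memory usable for RMA. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Aint bytes = 1 << 20;

    /* Variant A: plain allocation */
    void *buf_a;
    MPI_Alloc_mem(bytes, MPI_INFO_NULL, &buf_a);
    MPI_Free_mem(buf_a);

    /* Variant B: "shared" window over a single-process communicator */
    void    *buf_b;
    MPI_Win  win;
    MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, MPI_COMM_SELF,
                            &buf_b, &win);
    MPI_Win_free(&win);

    MPI_Finalize();
    return 0;
}
```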

Jeff
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Joseph Schuchart
2017-08-29 13:15:35 UTC
Jeff, all,

Thanks for the clarification. My measurements show that global memory
allocations do not require the backing file if there is only one process
per node, for an arbitrary number of processes. So I was wondering if it
was possible to use the same allocation strategy even with multiple
processes per node when there is not enough space available in /tmp.
However, I am not sure whether the IB devices can be used to perform
intra-node RMA. At least that would retain the functionality on this
kind of system (which arguably might be a rare case).

On a different note, I found over the weekend that Valgrind only
supports allocations up to 60GB, so my second point reported below may
be invalid. Number 4 still seems curious to me, though.

Best
Joseph
Jeff Hammond
2017-08-29 16:12:41 UTC
I don't know any reason why you shouldn't be able to use IB for intra-node
transfers. There are, of course, arguments against doing it in general
(e.g. IB/PCI bandwidth less than DDR4 bandwidth), but it likely behaves
less synchronously than shared-memory, since I'm not aware of any MPI RMA
library that dispatches the intranode RMA operations to an asynchronous
agent (e.g. communication helper thread).

Regarding 4, faulting 100GB in 24s corresponds to about 1us per 4K page,
which doesn't sound unreasonable to me. You might investigate if/how you
can use 2M or 1G pages instead. It's possible Open MPI already supports
this, if the underlying system does. You may need to twiddle your OS
settings to get hugetlbfs working.
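
As a sanity check on the arithmetic: 100 GB / 4 KiB is roughly 26
million pages, and 24 s / 26e6 is roughly 0.9 us per fault. For
illustration, a minimal Linux-specific sketch of the huge-page idea
(this is not Open MPI's internal code path, and it assumes huge pages
have already been reserved on the system):

```
/* Sketch (Linux-specific): back a large buffer with 2 MiB huge pages via
 * MAP_HUGETLB, cutting the number of page faults by a factor of 512
 * compared to 4 KiB pages. Requires hugepages to be reserved, e.g. via
 * /proc/sys/vm/nr_hugepages; not what Open MPI does internally. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 1UL << 30;   /* 1 GiB */

    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* fails if no huge pages are reserved */
        return 1;
    }

    memset(buf, 0, size);              /* ~512 faults instead of ~262144 */
    munmap(buf, size);
    return 0;
}
```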

Jeff
Joseph Schuchart
2017-09-04 13:13:21 UTC
Jeff, all,

Unfortunately, I (as a user) have no control over the page size on our
cluster. My interest in this is more of a general nature because I am
concerned that our users who use Open MPI underneath our code run into
this issue on their machine.

I took a look at the code for the various window creation methods and
now have a better picture of the allocation process in Open MPI. I
realized that memory in windows allocated through MPI_Win_allocate or
created through MPI_Win_create is registered with the IB device using
ibv_reg_mr, which takes significant time for large allocations (I assume
this is where hugepages would help?). In contrast, it seems that memory
attached through MPI_Win_attach is not registered, which explains the
lower latency for these allocations that I am observing (I seem to
remember having observed higher communication latencies as well).

Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix
component that uses shm_open to create a POSIX shared memory object
instead of a file on disk, which is then mmap'ed. Unfortunately, if I
raise the priority of this component above that of the default mmap
component I end up with a SIGBUS during MPI_Init. No other errors are
reported by MPI. Should I open a ticket on GitHub for this?

As an alternative, would it be possible to use anonymous shared memory
mappings to avoid the backing file for large allocations (maybe above a
certain threshold) on systems that support MAP_ANONYMOUS and distribute
the result of the mmap call among the processes on the node?

Thanks,
Joseph
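
Regarding the ibv_reg_mr cost mentioned above, a minimal standalone
sketch of how one might time registration outside of MPI (device
selection, buffer size, and access flags are assumptions; compile with
-libverbs):

```
/* Sketch: time ibv_reg_mr on a large buffer to see how registration cost
 * scales with size. Device choice and access flags are illustrative. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t size = 4UL << 30;                    /* 4 GiB */
    void  *buf  = malloc(size);
    if (!buf) return 1;

    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first device */
    struct ibv_pd      *pd  = ibv_alloc_pd(ctx);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("ibv_reg_mr(%zu B): %.3f s\n", size, secs);

    if (mr) ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```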
Gilles Gouaillardet
2017-09-04 13:22:35 UTC
Joseph,

Please open a GitHub issue regarding the SIGBUS error.

As far as I understand, MAP_ANONYMOUS+MAP_SHARED can only be used
between related processes (e.g. a parent and its children).
In the case of Open MPI, MPI tasks are siblings, so this is not an option.
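
(A minimal sketch of why this only works for related processes: the
anonymous shared mapping has no name, so the only way another process
can reach it is by inheriting it across fork():)

```
/* Sketch: an anonymous shared mapping is only visible to processes that
 * inherit it across fork(). Independently launched MPI ranks cannot
 * attach to it, which is why a named backing object (file, POSIX or
 * System V shm) is needed instead. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {              /* child inherits the mapping */
        *shared = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees %d\n", *shared);   /* prints 42 */

    munmap(shared, sizeof(int));
    return 0;
}
```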

Cheers,

Gilles
Joseph Schuchart
2017-09-04 14:11:45 UTC
Gilles,
Post by Gilles Gouaillardet
Joseph,
please open a github issue regarding the SIGBUS error.
Done: https://github.com/open-mpi/ompi/issues/4166
Post by Gilles Gouaillardet
as far as i understand, MAP_ANONYMOUS+MAP_SHARED can only be used
between related processes. (e.g. parent and children)
in the case of Open MPI, MPI tasks are siblings, so this is not an option.
You are right, it doesn't work the way I expected. Should have tested it
before :)

Best
Joseph
Jeff Hammond
2017-09-04 16:09:42 UTC
Post by Joseph Schuchart
Jeff, all,
Unfortunately, I (as a user) have no control over the page size on our
cluster. My interest in this is more of a general nature because I am
concerned that our users who use Open MPI underneath our code run into this
issue on their machine.
Sure, but I assume you are able to suggest such changes to the HLRS
operations team. Cray XC machines like Hazel Hen already support large
pages by default and Cray recommends their use to improve MPI performance,
so I don't think it is a surprising or unreasonable request to support them
on your non-Cray systems.
Post by Joseph Schuchart
I took a look at the code for the various window creation methods and now
have a better picture of the allocation process in Open MPI. I realized
that memory in windows allocated through MPI_Win_alloc or created through
MPI_Win_create is registered with the IB device using ibv_reg_mr, which
takes significant time for large allocations (I assume this is where
hugepages would help?). In contrast to this, it seems that memory attached
through MPI_Win_attach is not registered, which explains the lower latency
for these allocation I am observing (I seem to remember having observed
higher communication latencies as well).
There's a reason for this. The way MPI dynamic windows are defined,
caching registration keys is not practical without implementation-defined
info keys to assert no reattach. That is why allocation latency is lower
and communication latency is higher.
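
For reference, a minimal sketch of the dynamic-window usage being
discussed; the ability to attach and detach memory at arbitrary times is
what makes caching registrations impractical (sizes and the omitted
address exchange are illustrative):

```
/* Sketch of MPI dynamic windows: memory is attached and detached at run
 * time, so the target region (and any registration) can change between
 * RMA operations, which is why registrations are hard to cache. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Aint bytes = 1 << 20;
    void *buf = malloc(bytes);

    MPI_Win_attach(win, buf, bytes);    /* expose local memory for RMA */
    /* ... exchange addresses via MPI_Get_address and perform RMA ... */
    MPI_Win_detach(win, buf);           /* memory may now be reused/freed */

    free(buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```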

Jeff
Joseph Schuchart
2017-09-08 16:02:24 UTC
Permalink
We are currently discussing internally how to proceed with this issue on
our machine. We did a small survey of the setup of some of the machines
we have access to, which include an IBM system, a Bull machine, and
two Cray XC40 machines. To summarize our findings:

1) On the Cray systems, both /tmp and /dev/shm are mounted as tmpfs and
each is limited to half of the main memory per node.
2) On the IBM system, nodes have 64 GB of memory; /tmp is limited to 20 GB
and mounted from a disk partition, while /dev/shm is sized at 63 GB.
3) On the above systems, /proc/sys/kernel/shm* is set up to allow the
full memory of the node to be used as System V shared memory.
4) On the Bull machine, /tmp is mounted from a disk and fixed at ~100 GB,
while /dev/shm is limited to half the node's memory (there are nodes
with 2 TB of memory, and huge-page support is available). System V shmem,
on the other hand, is limited to 4 GB.

Overall, it seems that there is no globally optimal allocation strategy,
as the best-matching source of shared memory is machine-dependent.

Open MPI treats System V shared memory as the least favorable option,
even giving it a lower priority than POSIX shared memory, where
conflicting names might occur. What is the reason for preferring /tmp and
POSIX shared memory over System V? To me, the latter seems a cleaner and
safer approach (provided that shared memory is not constrained by the
limits in /proc, which could easily be detected), while mmap'ing large
files feels somewhat hacky. Maybe I am missing an important aspect here,
though.
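To illustrate the "easily detected" part, here is a quick sketch (my own,
not code from any existing shmem component) that reads the System V limits
from /proc, with the values as documented in shmget(2):

```
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read a single unsigned integer from a /proc file, 0 on failure. */
static uint64_t read_u64(const char *path)
{
    uint64_t v = 0;
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%" SCNu64, &v) != 1)
            v = 0;
        fclose(f);
    }
    return v;
}

int main(void)
{
    uint64_t shmmax = read_u64("/proc/sys/kernel/shmmax"); /* max bytes per segment  */
    uint64_t shmall = read_u64("/proc/sys/kernel/shmall"); /* max pages, system-wide */
    uint64_t page   = (uint64_t)sysconf(_SC_PAGESIZE);

    /* Guard against overflow: modern kernels set these limits to huge values. */
    uint64_t total = (shmall > UINT64_MAX / page) ? UINT64_MAX : shmall * page;
    uint64_t limit = shmmax < total ? shmmax : total;

    printf("usable System V shared memory: ~%" PRIu64 " bytes\n", limit);
    return 0;
}
```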

The reason I am interested in this issue is that our PGAS library is
built on top of MPI and allocates pretty much all memory exposed to the
user through MPI windows. Thus, any limitation in the underlying MPI
implementation (or the system, for that matter) limits the amount of
usable memory for our users.

Given our observations above, I would like to propose a change to the
shared memory allocator: the priorities would be derived from the
percentage of main memory each component can cover, i.e.,

Priority = 99*(min(Memory, SpaceAvail) / Memory)

At startup, each shm component would determine the available size (by
looking at /tmp, /dev/shm, and /proc/sys/kernel/shm*, respectively) and
set its priority between 0 and 99. A user could force Open MPI to use a
specific component by manually setting its priority to 100 (which would,
of course, have to be documented). The priority could factor in other
aspects as well, such as whether /tmp is actually tmpfs or disk-based,
if that makes a difference in performance.
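To make this a bit more concrete, here is a minimal sketch of the priority
computation I have in mind; the statvfs-based probing and the exact scaling
are just my assumptions, not existing Open MPI code:

```
#include <stdint.h>
#include <stdio.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Free bytes on the mount point backing a component, e.g. /tmp or /dev/shm. */
static uint64_t avail_bytes(const char *path)
{
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0)
        return 0;
    return (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;
}

/* Priority = 99 * (min(Memory, SpaceAvail) / Memory), as proposed above. */
static int shmem_priority(uint64_t space_avail)
{
    uint64_t mem    = (uint64_t)sysconf(_SC_PHYS_PAGES) * (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t usable = space_avail < mem ? space_avail : mem;
    return (int)(99.0 * (double)usable / (double)mem);
}

int main(void)
{
    printf("mmap  (/tmp):     priority %d\n", shmem_priority(avail_bytes("/tmp")));
    printf("posix (/dev/shm): priority %d\n", shmem_priority(avail_bytes("/dev/shm")));
    /* A sysv component would derive its space from the /proc limits shown earlier. */
    return 0;
}
```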

This proposal of course assumes that shared memory size is the sole
optimization goal. Maybe there are other aspects to consider? I'd be
happy to work on a patch but would like to get some feedback before
getting my hands dirty. IMO, the current situation is less than ideal
and prone to cause pain to the average user. In my recent experience,
debugging this has been tedious and the user in general shouldn't have
to care about how shared memory is allocated (and administrators don't
always seem to care, see above).

Any feedback is highly appreciated.

Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: ***@hlrs.de
Gilles Gouaillardet
2017-09-08 16:16:52 UTC
Permalink
Joseph,

Thanks for sharing this !

sysv is imho the worst option: if something goes really wrong, Open MPI
might leave some shared memory segments behind when a job crashes. From
that perspective, leaving a big file in /tmp can be seen as the lesser evil.
That being said, there might be other reasons that drove this design.

Cheers,

Gilles
Jeff Hammond
2017-09-08 16:29:54 UTC
Permalink
In my experience, POSIX is much more reliable than Sys5. Sys5 depends on
the value of SHMMAX, which is often set to a small fraction of the node
memory. I've probably seen the error described on
http://verahill.blogspot.com/2012/04/solution-to-nwchem-shmmax-too-small.html
with NWChem a thousand times because of this. POSIX, on the other hand,
isn't limited by SHMMAX (https://community.oracle.com/thread/3828422).
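To illustrate the difference, here is a stand-alone sketch of mine (the size
is arbitrary; link with -lrt on older glibc): the Sys5 request fails right at
shmget once it exceeds SHMMAX, while the POSIX object of the same size is
created without complaint and is only constrained by /dev/shm later.

```
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

int main(void)
{
    size_t req = 4UL << 30;  /* 4 GiB, assumed to exceed SHMMAX on this node */

    /* System V: the limit bites right here. */
    int id = shmget(IPC_PRIVATE, req, IPC_CREAT | 0600);
    if (id < 0)
        perror("shmget");              /* EINVAL/ENOMEM when req > SHMMAX */
    else
        shmctl(id, IPC_RMID, NULL);    /* don't leave the segment behind  */

    /* POSIX: shm_open/ftruncate succeed; only /dev/shm capacity matters. */
    int fd = shm_open("/shmmax_demo", O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd >= 0) {
        if (ftruncate(fd, (off_t)req) != 0)
            perror("ftruncate");
        shm_unlink("/shmmax_demo");
        close(fd);
    }
    return 0;
}
```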

POSIX is newer than Sys5, and while Sys5 is supported by Linux and thus
almost ubiquitous, it wasn't supported by Blue Gene, so in an HPC context,
one can argue that POSIX is more portable.

Jeff

--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Joseph Schuchart
2017-09-09 15:35:06 UTC
Permalink
Jeff, Gilles,

Thanks for your input. I am aware of the limitations of Sys5 shmem (the
links you posted do not accurately reflect the description of SHMMAX,
SHMALL, and SHMMNI found in the standard, though; see
http://man7.org/linux/man-pages/man2/shmget.2.html).

However, these limitations can easily be checked by looking at the
entries in /proc. If the limits for Sys5 shmem are smaller than those for
/tmp, the Sys5 component would receive a lower priority and thus not be
used under my proposal.
While Sys5 shmem is limited by SHMMAX, POSIX shmem (shm_open, that is)
is, at least on Linux, limited by the size of /dev/shm, and ftruncate
does not complain if the shared memory allocation grows beyond what is
possible there. A user will learn this the hard way by debugging SIGBUS
signals upon memory access. This is imo a flaw in the way shm_open is
implemented on Linux (and/or a flaw of POSIX shm in not providing a way
to check such arbitrary limits): you have to guess the limit by looking
at the space available in /dev/shm and hope that the implementation
doesn't change. Apparently, some BSD flavors (such as FreeBSD) have
implemented POSIX shmem as system calls rather than relying on files in
some tmpfs mount, which gets rid of such size limitations.
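One way I could imagine turning that silent over-commit into a detectable
error (just a sketch of the idea, not something Open MPI does today; the
object name is made up) is to force the tmpfs to reserve the space up
front, so the failure surfaces as ENOSPC instead of a later SIGBUS:

```
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/shm_probe_demo";
    size_t len = 8UL << 30;             /* 8 GiB, assumed larger than /dev/shm */

    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    if (ftruncate(fd, (off_t)len) != 0) /* typically succeeds regardless */
        perror("ftruncate");

    /* Ask the filesystem to actually reserve the blocks now. */
    int rc = posix_fallocate(fd, 0, (off_t)len);
    if (rc != 0)                        /* ENOSPC: /dev/shm cannot back this */
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
    else
        printf("backing space is reserved, mmap'ing is safe\n");

    shm_unlink(name);
    close(fd);
    return 0;
}
```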

I wasn't aware of the possibility of a memory leak when using Sys5. From
what I understand, this might happen if the process receives a signal
between shmget and the immediately following call to shmctl that marks
the segment deletable. We could go to some lengths and install a signal
handler in case another thread causes a SIGSEGV or the user tries to
abort at exactly that moment (unless SIGKILL is used, that is). I agree
that all of this is not nice, but I would argue that it doesn't
disqualify Sys5 in cases where it is the only way of allocating decent
amounts of shared memory due to size limitations of the tmpfs mounts.
What's more, Open MPI is not the only implementation that supports Sys5
(it's a compile-time option in MPICH, and I'm sure there are others using
it), so automatic job epilogue scripts should clean up Sys5 shmem as well
(which I'm sure they mostly don't).
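To spell out the window in question, the usual pattern looks roughly like
this (again only a sketch of the common idiom, not the actual Open MPI
sysv component):

```
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t len = 1UL << 30;                      /* 1 GiB, for illustration    */

    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }  /* also fails if len > SHMMAX */

    /* A crash right here leaves the segment behind (visible with `ipcs -m`). */

    void *ptr = shmat(id, NULL, 0);
    if (ptr == (void *)-1) {
        perror("shmat");
        shmctl(id, IPC_RMID, NULL);              /* clean up on the error path */
        return 1;
    }

    shmctl(id, IPC_RMID, NULL);                  /* from now on the segment is
                                                    freed automatically when the
                                                    last process detaches       */
    /* ... use ptr ... */
    shmdt(ptr);
    return 0;
}
```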

Open MPI currently supports three different allocation strategies for
shared memory, so it can choose based on what is available (on a Blue
Gene, only POSIX shmem or mmap'ing from /tmp would be considered then)
to maximize portability. I'm not proposing to make Sys5 the default
(although I was wondering why it's not preferred by Open MPI, which I
have a better understanding of now, thanks to your input). I would just
like to draw on my recent experience and make Open MPI more user friendly
by deciding automatically which shmem component to use :)

Best
Joseph
Post by Jeff Hammond
In my experience, POSIX is much more reliable than Sys5. Sys5 depends
on the value of shmmax, which is often set to a small fraction of node
memory. I've probably seen the error described on
http://verahill.blogspot.com/2012/04/solution-to-nwchem-shmmax-too-small.html
with NWChem a 1000 times because of this. POSIX, on the other hand,
isn't limited by SHMMAX (https://community.oracle.com/thread/3828422).
POSIX is newer than Sys5, and while Sys5 is supported by Linux and thus
almost ubiquitous, it wasn't supported by Blue Gene, so in an HPC
context, one can argue that POSIX is more portable.
Jeff
On Fri, Sep 8, 2017 at 9:16 AM, Gilles Gouaillardet
Joseph,
Thanks for sharing this !
sysv is imho the worst option because if something goes really
wrong, Open MPI might leave some shared memory segments behind when
a job crashes. From that perspective, leaving a big file in /tmp can
be seen as the lesser evil.
That being said, there might be other reasons that drove this design
Cheers,
Gilles
Post by Joseph Schuchart
We are currently discussing internally how to proceed with this
issue on
Post by Joseph Schuchart
our machine. We did a little survey to see the setup of some of the
machines we have access to, which includes an IBM, a Bull machine, and
1) On the Cray systems, both /tmp and /dev/shm are mounted tmpfs and
each limited to half of the main memory size per node.
2) On the IBM system, nodes have 64GB and /tmp is limited to 20 GB and
mounted from a disk partition. /dev/shm, on the other hand, is
sized at
Post by Joseph Schuchart
63GB.
3) On the above systems, /proc/sys/kernel/shm* is set up to allow the
full memory of the node to be used as System V shared memory.
4) On the Bull machine, /tmp is mounted from a disk and fixed to
~100GB
Post by Joseph Schuchart
while /dev/shm is limited to half the node's memory (there are nodes
with 2TB memory, huge page support is available). System V shmem
on the
Post by Joseph Schuchart
other hand is limited to 4GB.
Overall, it seems that there is no globally optimal allocation
strategy
Post by Joseph Schuchart
as the best matching source of shared memory is machine dependent.
Open MPI treats System V shared memory as the least favorable option,
even giving it a lower priority than POSIX shared memory, where
conflicting names might occur. What's the reason for preferring
/tmp and
Post by Joseph Schuchart
POSIX shared memory over System V? It seems to me that the latter is a
cleaner and safer way (provided that shared memory is not
constrained by
Post by Joseph Schuchart
/proc, which could easily be detected) while mmap'ing large files
feels
Post by Joseph Schuchart
somewhat hacky. Maybe I am missing an important aspect here though.
The reason I am interested in this issue is that our PGAS library is
build on top of MPI and allocates pretty much all memory exposed
to the
Post by Joseph Schuchart
user through MPI windows. Thus, any limitation from the underlying MPI
implementation (or system for that matter) limits the amount of usable
memory for our users.
Given our observations above, I would like to propose a change to the
shared memory allocator: the priorities would be derived from the
percentage of main memory each component can cover, i.e.,
Priority = 99*(min(Memory, SpaceAvail) / Memory)
At startup, each shm component would determine the available size (by
looking at /tmp, /dev/shm, and /proc/sys/kernel/shm*,
respectively) and
Post by Joseph Schuchart
set its priority between 0 and 99. A user could force Open MPI to
use a
Post by Joseph Schuchart
specific component by manually settings its priority to 100 (which of
course has to be documented). The priority could factor in other
aspects
Post by Joseph Schuchart
as well, such as whether /tmp is actually tmpfs or disk-based if that
makes a difference in performance.
This proposal of course assumes that shared memory size is the sole
optimization goal. Maybe there are other aspects to consider? I'd be
happy to work on a patch but would like to get some feedback before
getting my hands dirty. IMO, the current situation is less than ideal
and prone to cause pain to the average user. In my recent experience,
debugging this has been tedious and the user in general shouldn't have
to care about how shared memory is allocated (and administrators don't
always seem to care, see above).
Any feedback is highly appreciated.
Joseph
Post by Joseph Schuchart
Jeff, all,
Unfortunately, I (as a user) have no control over the page size
on our
Post by Joseph Schuchart
Post by Joseph Schuchart
cluster. My interest in this is more of a general nature because
I am
Post by Joseph Schuchart
Post by Joseph Schuchart
concerned that our users who use Open MPI underneath our code
run into
Post by Joseph Schuchart
Post by Joseph Schuchart
this issue on their machine.
I took a look at the code for the various window creation
methods and
Post by Joseph Schuchart
Post by Joseph Schuchart
now have a better picture of the allocation process in Open MPI. I
realized that memory in windows allocated through MPI_Win_alloc or
created through MPI_Win_create is registered with the IB device
using
Post by Joseph Schuchart
Post by Joseph Schuchart
ibv_reg_mr, which takes significant time for large allocations
(I assume
Post by Joseph Schuchart
Post by Joseph Schuchart
this is where hugepages would help?). In contrast to this, it
seems that
Post by Joseph Schuchart
Post by Joseph Schuchart
memory attached through MPI_Win_attach is not registered, which
explains
Post by Joseph Schuchart
Post by Joseph Schuchart
the lower latency for these allocation I am observing (I seem to
remember having observed higher communication latencies as well).
Regarding the size limitation of /tmp: I found an
opal/mca/shmem/posix
Post by Joseph Schuchart
Post by Joseph Schuchart
component that uses shmem_open to create a POSIX shared memory
object
Post by Joseph Schuchart
Post by Joseph Schuchart
instead of a file on disk, which is then mmap'ed. Unfortunately,
if I
Post by Joseph Schuchart
Post by Joseph Schuchart
raise the priority of this component above that of the default mmap
component I end up with a SIGBUS during MPI_Init. No other
errors are
Post by Joseph Schuchart
Post by Joseph Schuchart
reported by MPI. Should I open a ticket on Github for this?
As an alternative, would it be possible to use anonymous shared
memory
Post by Joseph Schuchart
Post by Joseph Schuchart
mappings to avoid the backing file for large allocations (maybe
above a
Post by Joseph Schuchart
Post by Joseph Schuchart
certain threshold) on systems that support MAP_ANONYMOUS and
distribute
Post by Joseph Schuchart
Post by Joseph Schuchart
the result of the mmap call among the processes on the node?
Thanks,
Joseph
Post by Jeff Hammond
I don't know any reason why you shouldn't be able to use IB for
intra-node transfers. There are, of course, arguments against
doing
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
it in general (e.g. IB/PCI bandwidth less than DDR4 bandwidth),
but it
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
likely behaves less synchronously than shared-memory, since I'm not
aware of any MPI RMA library that dispatches the intranode RMA
operations to an asynchronous agent (e.g. communication helper
thread).
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
Regarding 4, faulting 100GB in 24s corresponds to 1us per 4K page,
which doesn't sound unreasonable to me. You might investigate
if/how
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
you can use 2M or 1G pages instead. It's possible Open-MPI already
supports this, if the underlying system does. You may need to
twiddle
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
your OS settings to get hugetlbfs working.
Jeff
On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart
Jeff, all,
Thanks for the clarification. My measurements show that global
memory allocations do not require the backing file if there
is only
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
one process per node, for arbitrary number of processes. So
I was
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
wondering if it was possible to use the same allocation
process even
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
with multiple processes per node if there is not enough space
available in /tmp. However, I am not sure whether the IB
devices can
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
be used to perform intra-node RMA. At least that would
retain the
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
functionality on this kind of system (which arguably might
be a rare
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
case).
On a different note, I found during the weekend that
Valgrind only
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
supports allocations up to 60GB, so my second point
reported below
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
may be invalid. Number 4 seems still seems curious to me,
though.
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
Best
Joseph
There's no reason to do anything special for shared
memory with
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
a single-process job because
MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem().
However, it would help debugging if MPI implementers at
least
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
had an option to take the code path that allocates
shared memory
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
even when np=1.
Jeff
On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart
Gilles,
Thanks for your swift response. On this system,
/dev/shm
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
only has
256M available so that is no option unfortunately.
I tried
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
disabling
both vader and sm btl via `--mca btl ^vader,sm`
but Open
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
MPI still
seems to allocate the shmem backing file under
/tmp. From
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
my point
of view, missing the performance benefits of file
backed
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
shared
memory as long as large allocations work but I
don't know
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
the
implementation details and whether that is
possible. It
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
seems that
the mmap does not happen if there is only one
process per
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
node.
Cheers,
Joseph
Joseph,
the error message suggests that allocating
memory with
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
MPI_Win_allocate[_shared] is done by creating
a file
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
and then
mmap'ing
it.
how much space do you have in /dev/shm ? (this
is a
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
tmpfs e.g. a RAM
file system)
there is likely quite some space here, so as a
workaround, i suggest
you use this as the shared-memory backing
directory
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
/* i am afk and do not remember the syntax,
ompi_info
Post by Joseph Schuchart
Post by Joseph Schuchart
Post by Jeff Hammond
--all | grep
backing is likely to help */
Cheers,
Gilles
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: ***@hlrs.de