[OMPI users] Tracking Open MPI memory usage
Adam Sylvester
2017-11-26 13:44:32 UTC
I have an application running across 20 machines where each machine has 60
GB RAM. For some large inputs, some ranks require 45-50 GB RAM. The
behavior I'm seeing is that for some of these large cases, my application
will run for 10-15 minutes and then one rank will be killed. Based on
watching top in the past, that rank's memory usage gradually increases
until it eventually hits 60 GB and the process is killed (presumably by
the OOM killer).

There are a few possibilities that come to mind...
1. While I compute all memory requirements up front and allocate one large
ping/pong buffer to reuse throughout the application, there are some other
(believed to be small) allocations here and there. For large inputs, some
of these may not be quite as small as I think.
2. There's a memory leak.
3. Open MPI is allocating very large buffers for transferring data,
potentially because throughout the application I am *not* using synchronous

I can track down 1 and 2, but I'm wondering if there's some kind of
debug/logging mode I can run in to see Open MPI's buffer management. All I
really care about is the total amount of memory it allocates, but if I need
to parse a list of buffers and sizes to infer the total allocation size,
that's fine.
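
For 1 and 2, a minimal sketch of the kind of per-rank logging I have in
mind is below. It assumes Linux and reads the VmRSS field from
/proc/self/status. The function names (parse_vmrss_kb, current_rss_kb) are
hypothetical helpers, not part of Open MPI, and the number is the resident
size of the whole process, so it cannot by itself separate Open MPI's
buffers from my own allocations.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse the "VmRSS:" line from text in /proc/<pid>/status format.
 * Returns the resident set size in kB, or -1 if the field is absent. */
long parse_vmrss_kb(const char *status_text)
{
    const char *p = status_text;
    while (p && *p) {
        /* p always points at the start of a line here. */
        if (strncmp(p, "VmRSS:", 6) == 0) {
            /* strtol skips the leading whitespace and stops at " kB". */
            return strtol(p + 6, NULL, 10);
        }
        p = strchr(p, '\n');
        if (p) {
            p++; /* step past the newline to the next line */
        }
    }
    return -1;
}

/* Read /proc/self/status (Linux only, an assumption of this sketch) and
 * return the current resident set size in kB, or -1 on failure. */
long current_rss_kb(void)
{
    char buf[8192];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        return -1;
    }
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_vmrss_kb(buf);
}
```

Each rank could call current_rss_kb() before and after its large
communication phases and print the value along with its rank (from
MPI_Comm_rank). Growth that shows up only across communication calls
would point toward 3, while steady growth elsewhere would point toward
1 or 2.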

Thanks for the help.
