Discussion: [OMPI users] [Singularity] OpenMPI, Slurm & portability
victor sv
2017-08-25 08:46:03 UTC
Dear Singularity & OpenMPI teams, Greg and Ralph,

Going back to Ralph Castain's response to this thread:

https://groups.google.com/a/lbl.gov/forum/#!topic/singularity/lQ6sWCWhIWY

In order to get portability of Singularity images containing OpenMPI
distributed applications, he suggested mixing different OpenMPI versions with
different external PMIx versions to check interoperability across versions
while using the Singularity MPI hybrid approach (see his response in the
thread).

I did some experiments and I would like to share my results with you and
discuss the conclusions.

First of all, I'm going to describe the environment (some scripts are
attached).

- I performed this test on the CESGA FinisTerrae II cluster
(https://www.cesga.es/en/infraestructuras/computacion/FinisTerrae2).
- The compiler used was GCC/6.3.0, and I had to compile some external
dependencies to be linked from PMIx or OpenMPI:
  - hwloc/1.11.5
  - libevent/2.0.22
- PMIx versions used in these experiments:
  - 1.2.1
  - 1.2.2
  - 2.0.0
- I configured PMIx with the following options (see the build sketch after
this list):
  - ./configure --with-hwloc= --with-munge-libdir=
    --with-platform=optimized --with-libevent=
- OpenMPI versions used in these experiments:
  - 2.0.X
  - 2.1.1
  - 3.0.0_rcX
- I configured OpenMPI with the following options:
  - ./configure --with-hwloc= --enable-shared --with-slurm
    --enable-mpi-thread-multiple --with-verbs-libdir=
    --enable-mpirun-prefix-by-default --disable-dlopen --with-pmix=
    --with-libevent= --with-knem
  - Version 2.1.1 was also compiled with the flag --disable-pmix-dstore.
- I used the well-known "Ring" OpenMPI example application.
- I used mpirun as the process manager.
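To make the setup more concrete, here is a minimal sketch of how the two
builds fit together. The install prefixes and the munge/verbs library
directories are hypothetical placeholders (the real values were left blank
above), while the configure flags are the ones listed:

```
# Hypothetical prefixes; the real FinisTerrae II paths are not shown above.
HWLOC_DIR=/opt/hwloc/1.11.5
LIBEVENT_DIR=/opt/libevent/2.0.22
PMIX_DIR=/opt/pmix/1.2.2
OMPI_DIR=/opt/openmpi/3.0.0rc3

# External PMIx build (repeated for 1.2.1, 1.2.2 and 2.0.0)
cd pmix-1.2.2
./configure --prefix=$PMIX_DIR --with-hwloc=$HWLOC_DIR \
    --with-libevent=$LIBEVENT_DIR --with-munge-libdir=/usr/lib64 \
    --with-platform=optimized
make all install

# OpenMPI build against that external PMIx (repeated for 2.0.X, 2.1.1 and
# 3.0.0_rcX; 2.1.1 was additionally configured with --disable-pmix-dstore)
cd ../openmpi-3.0.0rc3
./configure --prefix=$OMPI_DIR --with-hwloc=$HWLOC_DIR \
    --with-libevent=$LIBEVENT_DIR --with-pmix=$PMIX_DIR \
    --with-verbs-libdir=/usr/lib64 --with-slurm --with-knem \
    --enable-shared --enable-mpi-thread-multiple \
    --enable-mpirun-prefix-by-default --disable-dlopen
make all install
```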

What I expected, based on Ralph's previous response, was full cross-version
compatibility using any OpenMPI >= 2.0.0 linked against PMIx 1.2.X, both
inside the container and on the host.
In general, my results were not as good as expected, but they are promising.

- The worst result: OpenMPI 2.X versions need exactly the same version of
OpenMPI inside and outside the container, although I can mix PMIx 1.2.1
and 1.2.2.
- The best result: if OpenMPI 3.0.0_rc3 is present inside or outside the
container, it seems to work when mixing any other OpenMPI >= 2.X version,
and also when mixing PMIx 1.2.1 and 1.2.2. Some notes* on this result:
  - OpenMPI 2.0.0 with PMIx 1.2.2 (inside & outside the container) never worked.
  - After getting the expected output from the "Ring" app, I randomly get a
    SEGFAULT if OpenMPI 3.0.0_rcX is involved.
  - As Ralph said, PMIx 1.2.X and 2.0.X are not interoperable.
  - I was not able to compile OpenMPI 2.1.0 with an external PMIx.

I can conclude that PMIx 1.2.1 and 1.2.2 are interoperable, but only
OpenMPI 3.0.0_rc3 can work*, in general, with other OpenMPI versions (> 2).

Going back again to Ralph Castain's mail in this thread, I would expect full
support for interoperability across different PMIx versions (> 1.2) to come
through PMIx 2.1 (not yet released).

Some questions about these experiments and conclusions:

- What do you think about these results? Do you have any suggestions? Am I
missing something?
- Are these results aligned with your expectations?
- I know that PMIx 2.1 is being developed, but is any version already
available to test? How can I get it?
- Is the SEGFAULT I get with OpenMPI 3.0.0_rcX something that is already
tracked?

Hope this is helpful!

BR,

Víctor
Hi Victor,
The area of ABI compatibility I am referring to is the container's
underlying library stack. Meaning that if you link in the libraries
compiled on the host, and the container you want to run is newer than what
is installed on the host (or potentially vice versa), you may end up with
a conflict between the binary and the library.
This is what Nvidia has mitigated by building their libraries on a very
recent toolchain, so the libraries are backwards compatible with older
binaries.
Does that make sense?
Greg
Hi Greg and Ralph,
Yes Greg, I agree with you that the mentioned strategy could be dangerous
and goes against the principles of containment.
Sorry for the basic question, but what do you mean by ABI-compatible
containers? Which components of the container environment are involved in
this ABI compatibility?
If we talk about libc or the kernel itself, as you say on your web page,
"If you require kernel dependent features, a container platform is probably
not the right solution for you."
If we focus on OpenMPI ABI compatibility, I figure that the variables
involved in this compatibility could be (1) the compiler (vendor) and (2)
the OpenMPI library itself.
Am I right, or am I missing any other variables?
An interesting project called ABI-tracker has published an OpenMPI ABI
compatibility timeline:
https://abi-laboratory.pro/tracker/timeline/openmpi/
I think that, at least for OpenMPI 2.X, although it is a dangerous
approach, the ABI compatibility seems reasonable.
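As a rough, purely illustrative sanity check of that assumption, one could
compare which libmpi the binary actually resolves to on the host and inside
an image (the image name ring.img and the binary path /usr/bin/ring are just
placeholders, not something defined earlier in this thread):

```
# Which Open MPI library (and soname) does the binary pick up on the host?
ldd $(which ring) | grep libmpi

# And inside the container? (image name and binary path are hypothetical)
singularity exec ring.img ldd /usr/bin/ring | grep libmpi

# Matching sonames (e.g. libmpi.so.20) are only a coarse hint that the two
# builds claim ABI compatibility; it does not guarantee wire compatibility.
```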
What do you think?
BR,
Víctor.
Hi Victor,
I will let Ralph comment on the OMPI versions and compatibilities, but
using the host MPI libraries within a container is dangerous for the reason
you are mentioning. If you are running containers that are ABI compatible
with the host, then things *might* work as expected. But this breaks
container portability and goes against the principles of containment.
We do however do exactly this for the Nvidia driver libraries, but...
Nvidia builds these libraries with careful attention to ABI compatibility
such that these binary libraries are indeed reasonably portable across
containers.
The only way to do this portably is to use a launcher on the host, outside
the container, to spin up the container and launch the MPI processes within.
PMIx is a fantastic approach to solving this.
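For illustration, the hybrid launch described above might look like the
following sketch (the hostfile, image name and binary path are assumptions,
not taken from this thread):

```
# mpirun runs on the host and wires the ranks up via PMIx/ORTE;
# each rank is started inside the container by singularity exec.
mpirun -np 2 --hostfile ./hosts \
    singularity exec ring.img /usr/bin/ring
```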
Hope that helps!
Greg
Hi Greg and Ralph,
Thank you for your precise and elaborated answers.
Only for confirmation and to sum up some conclusions (if I understood correctly):
- OpenMPI process management compatibility depends on PMIx.
- Complete backward/forward compatibility for OpenMPI (and also Slurm)
will come (hopefully) in the future by means of PMIx 2.1.
- Nowadays, there exists compatibility with OpenMPI 2.X if we compile
it with default PMIx (1.X) support.
- OpenMPI 2.1 must be compiled with --disable-pmix-dstore due to a
compatibility break.
- OpenMPI 1.X does not support PMIx and we can ignore it in this
thread.
Am I right?
I'm interested in performing the tests you propose. I will try to build
all three OMPI versions (2.0, 2.1 and 3.0) against the same external PMIx
library to check the compatibility. Which PMIx version (1.2.0, 1.2.1 or
1.2.2) do you recommend as a starting point?
I will report these results to this thread ASAP.
On the other hand, although we are planning to add support to PMIx,
unfortunately, our Slurm version (14.11.10-Bull.1.0) does not support it
yet.
The second strategy we are testing to get compatibility between the OpenMPI
inside and outside a Singularity container relies on replacing the OpenMPI
libraries inside the container with the host library hierarchy (see the
sketch below).
This approach rests on the assumption that OpenMPI symbols and data
structures are compatible across several versions of OpenMPI, at least
when combining releases that share the same major version.
Although this approach empirically seems to work properly with some
tests, benchmarks and real applications, I'm afraid of getting
unexpected errors/warnings (segfaults, data errors, etc.) in the future.
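A minimal sketch of that replacement strategy, assuming a host Open MPI
install under a hypothetical prefix and a Singularity version that honours
the SINGULARITYENV_ environment prefix and --bind, could look like this:

```
# Hypothetical host Open MPI prefix; adjust to the real module path.
HOST_OMPI=/opt/cesga/openmpi/2.0.2

# Make the host libraries visible inside the container and have the
# dynamic loader prefer them over the ones shipped in the image.
export SINGULARITYENV_LD_LIBRARY_PATH=$HOST_OMPI/lib
mpirun -np 2 singularity exec --bind $HOST_OMPI ring.img /usr/bin/ring
```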
What do you think about this approach?
Can you confirm that OpenMPI is compatible in this way?
Finally, I think this thread could be very interesting for other users
too and I would like to keep it alive with your help.
Thank you again for your support!
BR,
Víctor
Hiya Victor, et al.,
I didn't realize this but Ralph had to drop off of the Singularity
list. Hopefully we will get him back again, as he is a fantastic resource
for all of the OMPI questions and always a great source of information and
ideas (poke, poke Ralph!). Ralph did send me this in response to the
...
As Greg said, we have been concerned about this since we started
looking at Singularity support. Just for clarity, the version of PMI OMPI
uses is PMIx (https://pmix.github.io/pmix/). While our plan from the
beginning was to support cross-versions specifically to address this
problem, we fell behind on its implementation due to priorities. We just
committed the code to the PMIx repo in the last week, and it won’t be
released into production for a few months while we shake it down.
I fear it will be impossible to get the OMPI 1.10 series to work with
anything other than itself as it pre-dates PMIx.
The OMPI 2.0 and 2.1 series should work across each other as they both
include PMIx 1.x. However, you probably will need to configure the 2.1
series with --disable-pmix-dstore as there was an unintended compatibility
break there (the shared memory store was added during the PMIx 1.x series
and we didn’t catch the compatibility break it introduced).
Looking into the future, OMPI 3.0 is about to be released. It includes
PMIx 2.0, which isn’t backwards compatible at this time, and so it won’t
cross-version with OMPI 2.x “out-of-the-box”. We haven’t tested this, but
one thing you could try is to build all three OMPI versions against the
same PMIx external library (you would probably have to experiment a bit
with PMIx versions to see which works across the different OMPI versions as
the glue between the two also changed a bit). This will ensure that the
shared memory store in PMIx is compatible across the versions, and things
should work since OMPI doesn’t care how the data is moved across the
host-container boundary.
As I said, we will be adding cross-version support to the PMIx release
series soon, without changing the API, that will ensure support across all
PMIx versions starting with v1.2. Thus, you could (once that happens) build
OMPI 2.0, 2.1, and 3.0 against the new PMIx release (probably PMIx v2.1.0)
and the resulting containers would be future-proof as OMPI moves ahead. The
RMs plan to follow that path as well, so you should be in good shape once
this is done if you prefer to “direct launch” your containers (e.g., “srun
./mycontainer” under SLURM).
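For context, a direct launch of a containerized MPI application under Slurm
would look roughly like the sketch below; it assumes a Slurm built with PMIx
support (which the 14.11 installation mentioned earlier in this thread does
not have) and an illustrative image name and binary path:

```
# srun, not mpirun, starts one container per rank and provides the PMIx
# wire-up itself; requires a Slurm with the PMIx plugin enabled.
srun --mpi=pmix -N 2 --ntasks-per-node=1 \
    singularity exec ring.img /usr/bin/ring
```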
Sorry if that is all confusing - we sometimes get lost in the
numbering schemes between OMPI and PMIx ourselves. Feel free to contact me
directly, or on the OMPI or PMIx mailing lists, if you have more questions
or encounter problems. We definitely want to make this work.
Ralph
On Sun, Jul 9, 2017 at 12:19 PM, Gregory M. Kurtzer wrote:
Hi Victor,
Sorry for the latency, I'm on email overload.
Open MPI uses PMI to communicate both inside and outside of the
container. Ralph Castain (on this list, but possibly not monitoring
actively) is leading the PMI effort and he is an active Open MPI developer.
We have had several talks about how to achieve "hetero-versionistic"
compatibility through the PMI handshake. I was under the impression that
PMI now supports that, as long as you are running an equal or newer version on
the host (outside the container). Also, I don't know what version of PMI
this feature was introduced in, nor do I know what version of Open MPI
includes that compatibility.
I have CC'ed Ralph, and hopefully he will be able to offer some
suggestions.
Regarding your question about supporting the MPI libraries in the
same manner that we handle the Nvidia libraries: that would be hard.
Nvidia specifically builds their libraries to be as generally compatible as
possible (e.g. the same libraries/binaries work on a large array of Linux
distributions). Most people do not build host libraries in a manner that
would be generally compatible as Nvidia does.
Hope that helps!
Greg
Dear Singularity team,
first of all, thanks for the great work with Singularity. It looks
amazing!
Sorry if this topic is duplicated and for the length of the email,
but I want to share my experience about Singularity and OpenMPI
compatibility, and also ask some questions.
I've been reading a lot about OpenMPI and Singularity compatibility
because we are trying to find a generic way to run OpenMPI applications
within Singularity containers. It was not so clear (to me) from the
documentation, forums and mailing lists, and this is why we've performed an
empirical OpenMPI compatibility study.
We ran these comparisons on the CESGA FinisTerrae II cluster (
https://www.cesga.es/en/infraestructuras/computacion/FinisTerrae2).
We used several versions of OpenMPI. The chosen versions were:
- openmpi/1.10.2
- openmpi/2.0.0
- openmpi/2.0.1
- openmpi/2.0.2
- openmpi/2.1.1
We have created Singularity images containing the same versions of
OpenMPI plus the basic OpenMPI ring example. I share the bootstrap definition below:
```
BootStrap: docker
From: ubuntu:16.04
IncludeCmd: yes

%post
    # Enable the restricted and universe repositories
    sed -i 's/main/main restricted universe/g' /etc/apt/sources.list
    apt-get update

    # Build tools plus the verbs/PMI/Slurm client libraries needed by Open MPI
    apt-get install -y bash git wget build-essential gcc time \
        libc6-dev libgcc-5-dev
    apt-get install -y dapl2-utils libdapl-dev libdapl2 \
        libibverbs1 librdmacm1 libcxgb3-1 libipathverbs1 libmlx4-1 libmlx5-1 \
        libmthca1 libnes1 libpmi0 libpmi0-dev libslurm29 libslurm-dev

    ## Install OpenMPI (X.X.X is the version being tested)
    cd /tmp
    wget 'https://www.open-mpi.org/software/ompi/vX.X/downloads/openmpi-X.X.X.tar.gz' \
        -O openmpi-X.X.X.tar.gz
    tar -xzf openmpi-X.X.X.tar.gz    # extracts into openmpi-X.X.X/
    mkdir -p /tmp/openmpi-X.X.X/build
    cd /tmp/openmpi-X.X.X/build
    ../configure --enable-shared --enable-mpi-thread-multiple \
        --with-verbs --enable-mpirun-prefix-by-default --with-hwloc \
        --disable-dlopen --with-pmi --prefix=/usr
    make all install

    # Install the ring example
    cd /tmp
    wget https://raw.githubusercontent.com/open-mpi/ompi/master/examples/ring_c.c
    mpicc ring_c.c -o /usr/bin/ring
```
Once the containers were created, we ran the ring app with mpirun,
using 2 cores on 2 different nodes and mixing all possible combinations of
those OpenMPI versions inside and outside the container.
The results obtained show that we need the same version of OpenMPI
inside and outside the container to successfully run the contained
application in parallel with mpirun.
Is this the expected behaviour or am I missing something?
Will this be the expected behaviour in the future (with newer
versions of OpenMPI)?
Currently, we have Slurm 14.11.10-Bull.1.0 installed as the job
scheduler on FinisTerrae II. We found the following tip/trick to use srun:
http://singularity.lbl.gov/tutorial-gpu-drivers-open-mpi-mtls
In order to run any Singularity image containing OpenMPI
applications using Slurm, we've adapted it to our infrastructure and
checked the same test cases, running them with srun. It seems to be
working properly (no real-world applications have been tested yet).
What do you think about this strategy?
Can you confirm that it provides portability of Singularity images
containing OpenMPI applications?
I think this strategy is similar to the one you are following with
the "--nv" option for NVIDIA drivers.
Why not follow the same strategy for MPI, PMI, libibverbs, etc.?
Thanks in advance and congrats again for your great work!
Víctor.