Discussion: [OMPI users] Problem with double shared library
Sean Ahern
2016-10-17 19:53:27 UTC
Folks,

For our code, we have a communication layer that abstracts the code that
does the actual transfer of data. We call these "transports", and we link
them as shared libraries. We have created an MPI transport that
compiles/links against OpenMPI 2.0.1 using the compiler wrappers. When I
compile OpenMPI with the --disable-dlopen option (thus cramming all of
OpenMPI's plugins into the MPI library directly), things work great with
our transport shared library. But when I have a "normal" OpenMPI (without
--disable-dlopen) and create the same transport shared library, things
fail. Upon launch, it appears that OpenMPI is unable to find the
appropriate plugins:

[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to
open mca_patcher_overwrite:
/home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_patcher_overwrite.so:
undefined symbol: mca_patcher_base_patch_t_class (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to
open mca_shmem_mmap:
/home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_mmap.so:
undefined symbol: opal_show_help (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to
open mca_shmem_posix:
/home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_posix.so:
undefined symbol: opal_show_help (ignored)
[hyperion.ceintl.com:25595] mca_base_component_repository_open: unable to
open mca_shmem_sysv:
/home/sean/work/ceisvn/apex/branches/OpenMPI/apex32/machines/linux_2.6_64/openmpi-2.0.1/lib/openmpi/mca_shmem_sysv.so:
undefined symbol: opal_show_help (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)


If I skip our shared libraries and instead write a standard MPI-based
"hello, world" program that links against MPI directly (without
--disable-dlopen), everything is again fine.

It seems that having the double dlopen is causing problems for OpenMPI
finding its own shared libraries.

Note: I do have LD_LIBRARY_PATH pointing to "openmpi-2.0.1/lib", as well
as OPAL_PREFIX pointing to "openmpi-2.0.1".

Any thoughts about how I can try to tease out what's going wrong here?

-Sean

--
Sean Ahern
Computational Engineering International
919-363-0883
Gilles Gouaillardet
2016-10-18 01:45:42 UTC
Sean,


if I understand correctly, you built a libtransport_mpi.so library that
depends on Open MPI, and your main program dlopens libtransport_mpi.so.

in this case, and at least for the time being, you need to use
RTLD_GLOBAL in your dlopen flags.
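
for example, something along these lines (just a minimal sketch: the
"transport_init" entry point and the error handling are placeholders; the
key part is passing RTLD_GLOBAL so the Open MPI libraries pulled in by
libtransport_mpi.so are visible to the MCA plugins Open MPI dlopens later):

    /* sketch: dlopen the MPI transport with RTLD_GLOBAL (link with -ldl) */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* RTLD_GLOBAL exports the transport's symbols -- and those of the
         * Open MPI libraries it depends on -- to the global symbol scope,
         * so later-loaded MCA components can resolve opal_* symbols */
        void *handle = dlopen("libtransport_mpi.so", RTLD_NOW | RTLD_GLOBAL);
        if (handle == NULL) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        /* "transport_init" is a placeholder for your transport's entry point */
        int (*transport_init)(void) = (int (*)(void)) dlsym(handle, "transport_init");
        if (transport_init == NULL) {
            fprintf(stderr, "dlsym failed: %s\n", dlerror());
            return 1;
        }
        transport_init();

        dlclose(handle);
        return 0;
    }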


Cheers,


Gilles
Post by Sean Ahern
Folks,
For our code, we have a communication layer that abstracts the code
that does the actual transfer of data. We call these "transports", and
we link them as shared libraries. We have created an MPI transport
that compiles/links against OpenMPI 2.0.1 using the compiler wrappers.
When I compile OpenMPI with the --disable-dlopen option (thus cramming
all of OpenMPI's plugins into the MPI library directly), things work
great with our transport shared library. But when I have a "normal"
OpenMPI (without --disable-dlopen) and create the same transport
shared library, things fail. Upon launch, it appears that OpenMPI is
unable to find the appropriate plugins.
Sean Ahern
2016-10-28 19:02:20 UTC
Gilles,

You described the problem exactly. I think we were able to nail down a
solution to this one through judicious use of the -rpath $MPI_DIR/lib
linker flag, allowing the runtime linker to properly find OpenMPI symbols
at runtime. We're operational. Thanks for your help.
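
Concretely, the transport's link line now looks roughly like this (the
source and library names are placeholders; the relevant part is passing
-rpath through the mpicc wrapper so the resulting .so records where the
OpenMPI libraries live):

    mpicc -shared -fPIC -o libtransport_mpi.so transport_mpi.c \
          -Wl,-rpath,$MPI_DIR/lib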

-Sean

--
Sean Ahern
Computational Engineering International
919-363-0883
Post by Gilles Gouaillardet
Sean,
if I understand correctly, you built a libtransport_mpi.so library that
depends on Open MPI, and your main program dlopens libtransport_mpi.so.
in this case, and at least for the time being, you need to use
RTLD_GLOBAL in your dlopen flags.
Cheers,
Gilles