Discussion:
[OMPI users] OpenMPI in docker container
Ender GÜLER
2017-03-11 13:49:01 UTC
Hi there,

I'm trying to use Open MPI in a Docker container. My host and container OS are
both CentOS 7 (7.2.1511, to be exact). When I try to run a simple MPI hello-world
application, the app core dumps every time with a bus error. The Open MPI
version is 2.0.2, and I compiled it inside the container. When I copied the
installation from the container to the host, it ran without any problem.

Have you ever run Open MPI in a container and hit a problem like this one?
If so, what could be wrong? What should I do to find the root cause and solve
the problem? The very same application runs with Intel MPI in the
container without any problem.
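For reference, mpi_hello.x is nothing more than the usual minimal MPI skeleton.
The source below is a reconstruction (the real code isn't shown in this thread);
the traces that follow put the crash inside MPI_Init:

```c
/* Minimal MPI hello world; built with e.g.: mpicc -o mpi_hello.x mpi_hello.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* the traces below fail in here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of ranks */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```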

My mpirun command and its output are below.

[***@cn15 ~]# mpirun --allow-run-as-root -mca btl sm -np 2 -machinefile mpd.hosts ./mpi_hello.x
[cn15:25287] *** Process received signal ***
[cn15:25287] Signal: Bus error (7)
[cn15:25287] Signal code: Non-existant physical address (2)
[cn15:25287] Failing at address: 0x7fe2d0fbf000
[cn15:25287] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7fe2d53e9100]
[cn15:25287] [ 1] /lib64/libpsm2.so.2(+0x4b034)[0x7fe2d5a9a034]
[cn15:25287] [ 2] /lib64/libpsm2.so.2(+0xc45f)[0x7fe2d5a5b45f]
[cn15:25287] [ 3] /lib64/libpsm2.so.2(+0xc706)[0x7fe2d5a5b706]
[cn15:25287] [ 4] /lib64/libpsm2.so.2(+0x10d60)[0x7fe2d5a5fd60]
[cn15:25287] [ 5] /lib64/libpsm2.so.2(psm2_ep_open+0x41e)[0x7fe2d5a5e8de]
[cn15:25287] [ 6]
/opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mtl_psm2_module_init+0x1df)[0x7fe2d69b5d5b]
[cn15:25287] [ 7]
/opt/openmpi/2.0.2/lib/libmpi.so.20(+0x1b3249)[0x7fe2d69b7249]
[cn15:25287] [ 8]
/opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mtl_base_select+0xc2)[0x7fe2d69b2956]
[cn15:25287] [ 9]
/opt/openmpi/2.0.2/lib/libmpi.so.20(+0x216c9f)[0x7fe2d6a1ac9f]
[cn15:25287] [10]
/opt/openmpi/2.0.2/lib/libmpi.so.20(mca_pml_base_select+0x29b)[0x7fe2d69f7566]
[cn15:25287] [11]
/opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mpi_init+0x665)[0x7fe2d687e0f4]
[cn15:25287] [12]
/opt/openmpi/2.0.2/lib/libmpi.so.20(MPI_Init+0x99)[0x7fe2d68b1cb4]
[cn15:25287] [13] ./mpi_hello.x[0x400927]
[cn15:25287] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe2d5039b15]
[cn15:25287] [15] ./mpi_hello.x[0x400839]
[cn15:25287] *** End of error message ***
[cn15:25286] *** Process received signal ***
[cn15:25286] Signal: Bus error (7)
[cn15:25286] Signal code: Non-existant physical address (2)
[cn15:25286] Failing at address: 0x7fd4abb18000
[cn15:25286] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7fd4b3f56100]
[cn15:25286] [ 1] /lib64/libpsm2.so.2(+0x4b034)[0x7fd4b4607034]
[cn15:25286] [ 2] /lib64/libpsm2.so.2(+0xc45f)[0x7fd4b45c845f]
[cn15:25286] [ 3] /lib64/libpsm2.so.2(+0xc706)[0x7fd4b45c8706]
[cn15:25286] [ 4] /lib64/libpsm2.so.2(+0x10d60)[0x7fd4b45ccd60]
[cn15:25286] [ 5] /lib64/libpsm2.so.2(psm2_ep_open+0x41e)[0x7fd4b45cb8de]
[cn15:25286] [ 6]
/opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mtl_psm2_module_init+0x1df)[0x7fd4b5522d5b]
[cn15:25286] [ 7]
/opt/openmpi/2.0.2/lib/libmpi.so.20(+0x1b3249)[0x7fd4b5524249]
[cn15:25286] [ 8]
/opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mtl_base_select+0xc2)[0x7fd4b551f956]
[cn15:25286] [ 9]
/opt/openmpi/2.0.2/lib/libmpi.so.20(+0x216c9f)[0x7fd4b5587c9f]
[cn15:25286] [10]
/opt/openmpi/2.0.2/lib/libmpi.so.20(mca_pml_base_select+0x29b)[0x7fd4b5564566]
[cn15:25286] [11]
/opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mpi_init+0x665)[0x7fd4b53eb0f4]
[cn15:25286] [12]
/opt/openmpi/2.0.2/lib/libmpi.so.20(MPI_Init+0x99)[0x7fd4b541ecb4]
[cn15:25286] [13] ./mpi_hello.x[0x400927]
[cn15:25286] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd4b3ba6b15]
[cn15:25286] [15] ./mpi_hello.x[0x400839]
[cn15:25286] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node cn15 exited on signal
7 (Bus error).
--------------------------------------------------------------------------

Thanks in advance,

Ender
Josh Hursey
2017-03-11 15:17:07 UTC
From the stack trace it looks like it's failing in the PSM2 MTL, which you
shouldn't need (or want?) in this scenario.

Try adding this additional MCA parameter to your command line:
-mca pml ob1

That will force Open MPI's selection such that it avoids that component.
That might get you further along.
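Applied to the original command line, that would be something like the sketch
below (the added `self` BTL is an assumption, since Open MPI generally needs it
for a process to send to itself):

```shell
# Pin the PML to ob1 so the cm PML (and with it the PSM2 MTL) is never
# initialized; the shared-memory BTL carries the on-node traffic.
mpirun --allow-run-as-root \
       -mca pml ob1 -mca btl sm,self \
       -np 2 -machinefile mpd.hosts ./mpi_hello.x
```

The same selection can also be made through the environment
(`export OMPI_MCA_pml=ob1`) instead of the command line.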
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
--
Josh Hursey
IBM Spectrum MPI Developer
Ender GÜLER
2017-03-11 19:09:55 UTC
Hi Josh,

Thanks for your suggestion. When I add "-mca pml ob1", it works. I do actually
need PSM2 support, just not in this scenario. Here's the story:

I compiled the Open MPI source with PSM2 support because the host has an
Omni-Path device, and my first goal was to test whether I could use that
hardware. I ended up testing the compiled Open MPI against the different
transport modes without success.

PSM2 works when running directly on the physical host, so I suppose the Docker
layer has something to do with this error. But I cannot figure out what causes
this situation.

Do you have any idea what to look at next? I'll ask for opinions on the Docker
forums, but before that I want to gather more information, and I wondered
whether anyone else has hit this kind of problem before.
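For reference, the build described above would be configured roughly like this
(a sketch: the prefix is taken from the library paths in the traces, and the
exact flags are assumptions):

```shell
# Build Open MPI 2.0.2 with the PSM2 MTL enabled; requires libpsm2 and
# its development headers to be present inside the container.
./configure --prefix=/opt/openmpi/2.0.2 --with-psm2
make -j"$(nproc)"
make install
```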

Regards,

Ender
r***@open-mpi.org
2017-03-11 19:14:26 UTC
Past attempts have indicated that only TCP works well with Docker - if you want to use OPA, you’re probably better off using Singularity as your container.

http://singularity.lbl.gov/

The OMPI master has some optimized integration for Singularity, but 2.0.2 will work with it just fine as well.
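A typical pattern with Singularity (a sketch; the image name and the path
inside the image are illustrative, not from this thread) is to let the host's
PSM2-enabled mpirun launch the binary inside the container image:

```shell
# The host-side mpirun drives the OPA/PSM2 transport; Singularity only
# supplies the userland the application was built against.
mpirun -np 2 -machinefile mpd.hosts \
       singularity exec mpi_hello.img /opt/app/mpi_hello.x
```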
Ender GÜLER
2017-03-11 20:11:14 UTC
Thank you very much for suggesting Singularity. I'm new to containers
and wasn't aware of the Singularity project.