Ender GÜLER
2017-03-11 13:49:01 UTC
Hi there,
I am trying to use Open MPI in a Docker container. Both my host and container
OS are CentOS 7 (7.2.1511 to be exact). When I run a simple MPI hello world
application, it dumps core every time with a bus error. The Open MPI version
is 2.0.2, and I compiled it in the container. When I copy the installation
from the container to the host, it runs there without any problem.
Have you ever run Open MPI in a container and encountered a problem like this?
If so, what could be wrong? What should I do to find the root cause and solve
the problem? The very same application runs with Intel MPI in the container
without any problem.
I have pasted my mpirun command and its output below.
[***@cn15 ~]# mpirun --allow-run-as-root -mca btl sm -np 2 -machinefile mpd.hosts ./mpi_hello.x
[cn15:25287] *** Process received signal ***
[cn15:25287] Signal: Bus error (7)
[cn15:25287] Signal code: Non-existant physical address (2)
[cn15:25287] Failing at address: 0x7fe2d0fbf000
[cn15:25287] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7fe2d53e9100]
[cn15:25287] [ 1] /lib64/libpsm2.so.2(+0x4b034)[0x7fe2d5a9a034]
[cn15:25287] [ 2] /lib64/libpsm2.so.2(+0xc45f)[0x7fe2d5a5b45f]
[cn15:25287] [ 3] /lib64/libpsm2.so.2(+0xc706)[0x7fe2d5a5b706]
[cn15:25287] [ 4] /lib64/libpsm2.so.2(+0x10d60)[0x7fe2d5a5fd60]
[cn15:25287] [ 5] /lib64/libpsm2.so.2(psm2_ep_open+0x41e)[0x7fe2d5a5e8de]
[cn15:25287] [ 6] /opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mtl_psm2_module_init+0x1df)[0x7fe2d69b5d5b]
[cn15:25287] [ 7] /opt/openmpi/2.0.2/lib/libmpi.so.20(+0x1b3249)[0x7fe2d69b7249]
[cn15:25287] [ 8] /opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mtl_base_select+0xc2)[0x7fe2d69b2956]
[cn15:25287] [ 9] /opt/openmpi/2.0.2/lib/libmpi.so.20(+0x216c9f)[0x7fe2d6a1ac9f]
[cn15:25287] [10] /opt/openmpi/2.0.2/lib/libmpi.so.20(mca_pml_base_select+0x29b)[0x7fe2d69f7566]
[cn15:25287] [11] /opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mpi_init+0x665)[0x7fe2d687e0f4]
[cn15:25287] [12] /opt/openmpi/2.0.2/lib/libmpi.so.20(MPI_Init+0x99)[0x7fe2d68b1cb4]
[cn15:25287] [13] ./mpi_hello.x[0x400927]
[cn15:25287] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe2d5039b15]
[cn15:25287] [15] ./mpi_hello.x[0x400839]
[cn15:25287] *** End of error message ***
[cn15:25286] *** Process received signal ***
[cn15:25286] Signal: Bus error (7)
[cn15:25286] Signal code: Non-existant physical address (2)
[cn15:25286] Failing at address: 0x7fd4abb18000
[cn15:25286] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7fd4b3f56100]
[cn15:25286] [ 1] /lib64/libpsm2.so.2(+0x4b034)[0x7fd4b4607034]
[cn15:25286] [ 2] /lib64/libpsm2.so.2(+0xc45f)[0x7fd4b45c845f]
[cn15:25286] [ 3] /lib64/libpsm2.so.2(+0xc706)[0x7fd4b45c8706]
[cn15:25286] [ 4] /lib64/libpsm2.so.2(+0x10d60)[0x7fd4b45ccd60]
[cn15:25286] [ 5] /lib64/libpsm2.so.2(psm2_ep_open+0x41e)[0x7fd4b45cb8de]
[cn15:25286] [ 6] /opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mtl_psm2_module_init+0x1df)[0x7fd4b5522d5b]
[cn15:25286] [ 7] /opt/openmpi/2.0.2/lib/libmpi.so.20(+0x1b3249)[0x7fd4b5524249]
[cn15:25286] [ 8] /opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mtl_base_select+0xc2)[0x7fd4b551f956]
[cn15:25286] [ 9] /opt/openmpi/2.0.2/lib/libmpi.so.20(+0x216c9f)[0x7fd4b5587c9f]
[cn15:25286] [10] /opt/openmpi/2.0.2/lib/libmpi.so.20(mca_pml_base_select+0x29b)[0x7fd4b5564566]
[cn15:25286] [11] /opt/openmpi/2.0.2/lib/libmpi.so.20(ompi_mpi_init+0x665)[0x7fd4b53eb0f4]
[cn15:25286] [12] /opt/openmpi/2.0.2/lib/libmpi.so.20(MPI_Init+0x99)[0x7fd4b541ecb4]
[cn15:25286] [13] ./mpi_hello.x[0x400927]
[cn15:25286] [14] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd4b3ba6b15]
[cn15:25286] [15] ./mpi_hello.x[0x400839]
[cn15:25286] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node cn15 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
Thanks in advance,
Ender