Discussion:
[OMPI users] Openmpi 1.10.4 crashes with 1024 processes
Götz Waschk
2017-03-22 12:25:59 UTC
Permalink
Hi everyone,

I'm testing a new machine with 32 nodes of 32 cores each using the IMB
benchmark. It is working fine with 512 processes, but it crashes with
1024 processes after running for a minute:

[pax11-17:16978] *** Process received signal ***
[pax11-17:16978] Signal: Bus error (7)
[pax11-17:16978] Signal code: Non-existant physical address (2)
[pax11-17:16978] Failing at address: 0x2b147b785450
[pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370]
[pax11-17:16978] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e]
[pax11-17:16978] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_free_list_grow+0x199)[0x2b147384f309]
[pax11-17:16978] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(+0x270d)[0x2b14794a270d]
[pax11-17:16978] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13]
[pax11-17:16978] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca]
[pax11-17:16978] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41]
[pax11-17:16978] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_Allreduce+0x17b)[0x2b147387d6bb]
[pax11-17:16978] [ 8] IMB-MPI1[0x40b316]
[pax11-17:16978] [ 9] IMB-MPI1[0x407284]
[pax11-17:16978] [10] IMB-MPI1[0x40250e]
[pax11-17:16978] [11]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35]
[pax11-17:16978] [12] IMB-MPI1[0x401f79]
[pax11-17:16978] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 552 with PID 0 on node pax11-17
exited on signal 7 (Bus error).
--------------------------------------------------------------------------

The program is started from the Slurm batch system using mpirun. The
same application works fine when using MVAPICH2 instead.
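
For reference, a minimal sketch of the kind of batch script used here (node and task counts are from above; the module names are placeholders, not taken from the original post):

#!/bin/bash
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=32
# load an OpenHPC-style toolchain (names are assumptions)
module load gnu openmpi imb
# mpirun picks up the Slurm allocation, so no -np is needed
mpirun IMB-MPI1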

Regards, Götz Waschk
Howard Pritchard
2017-03-22 18:46:20 UTC
Permalink
Hi Goetz,

Would you mind testing against the 2.1.0 release or the latest from the
1.10.x series (1.10.6)?

Thanks,

Howard
Post by Götz Waschk
Hi everyone,
I'm testing a new machine with 32 nodes of 32 cores each using the IMB
benchmark. It is working fine with 512 processes, but it crashes with
[pax11-17:16978] *** Process received signal ***
[pax11-17:16978] Signal: Bus error (7)
[pax11-17:16978] Signal code: Non-existant physical address (2)
[pax11-17:16978] Failing at address: 0x2b147b785450
[pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370]
[pax11-17:16978] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e]
[pax11-17:16978] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_free_list_grow+0x199)[0x2b147384f309]
[pax11-17:16978] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(+0x270d)[0x2b14794a270d]
[pax11-17:16978] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13]
[pax11-17:16978] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca]
[pax11-17:16978] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41]
[pax11-17:16978] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_Allreduce+0x17b)[0x2b147387d6bb]
[pax11-17:16978] [ 8] IMB-MPI1[0x40b316]
[pax11-17:16978] [ 9] IMB-MPI1[0x407284]
[pax11-17:16978] [10] IMB-MPI1[0x40250e]
[pax11-17:16978] [11]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35]
[pax11-17:16978] [12] IMB-MPI1[0x401f79]
[pax11-17:16978] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 552 with PID 0 on node pax11-17
exited on signal 7 (Bus error).
--------------------------------------------------------------------------
The program is started from the slurm batch system using mpirun. The
same application is working fine when using mvapich2 instead.
Regards, Götz Waschk
Götz Waschk
2017-03-22 19:07:51 UTC
Permalink
Post by Howard Pritchard
Hi Goetz,
Would you mind testing against the 2.1.0 release or the latest from the
1.10.x series (1.10.6)?
Hi Howard,

after sending my mail, I tested both 1.10.6 and 2.1.0 and received
the same error. I have also tested outside of Slurm using ssh; same
problem.

Here's the message from 2.1.0:
[pax11-10:21920] *** Process received signal ***
[pax11-10:21920] Signal: Bus error (7)
[pax11-10:21920] Signal code: Non-existant physical address (2)
[pax11-10:21920] Failing at address: 0x2b5d5b752290
[pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370]
[pax11-10:21920] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0]
[pax11-10:21920] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1]
[pax11-10:21920] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51]
[pax11-10:21920] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f]
[pax11-10:21920] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa]
[pax11-10:21920] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429]
[pax11-10:21920] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d444486ab]
[pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff]
[pax11-10:21920] [ 9] IMB-MPI1[0x402646]
[pax11-10:21920] [10]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35]
[pax11-10:21920] [11] IMB-MPI1[0x401f79]
[pax11-10:21920] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 320 with PID 21920 on node pax11-10
exited on signal 7 (Bus error).
--------------------------------------------------------------------------


Regards, Götz Waschk
Howard Pritchard
2017-03-23 00:11:15 UTC
Permalink
Hi Goetz

Thanks for trying these other versions. Looks like a bug. Could you post
the config.log output from your 2.1.0 build to the list?

Also could you try running the job using this extra command line arg to see
if the problem goes away?

mpirun --mca btl ^vader (rest of your args)

Howard
Post by Howard Pritchard
Hi Goetz,
Would you mind testing against the 2.1.0 release or the latest from the
1.10.x series (1.10.6)?
Hi Howard,

after sending my mail I have tested both 1.10.6 and 2.1.0 and I have
received the same error. I have also tested outside of slurm using
ssh, same problem.

Here's the message from 2.1.0:
[pax11-10:21920] *** Process received signal ***
[pax11-10:21920] Signal: Bus error (7)
[pax11-10:21920] Signal code: Non-existant physical address (2)
[pax11-10:21920] Failing at address: 0x2b5d5b752290
[pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370]
[pax11-10:21920] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0]
[pax11-10:21920] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1]
[pax11-10:21920] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51]
[pax11-10:21920] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f]
[pax11-10:21920] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa]
[pax11-10:21920] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429]
[pax11-10:21920] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d444486ab]
[pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff]
[pax11-10:21920] [ 9] IMB-MPI1[0x402646]
[pax11-10:21920] [10]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35]
[pax11-10:21920] [11] IMB-MPI1[0x401f79]
[pax11-10:21920] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 320 with PID 21920 on node pax11-10
exited on signal 7 (Bus error).
--------------------------------------------------------------------------


Regards, Götz Waschk
Howard Pritchard
2017-03-23 00:29:36 UTC
Permalink
Forgot to mention: you probably need an equal sign after the btl arg.
Post by Howard Pritchard
Hi Goetz
Thanks for trying these other versions. Looks like a bug. Could you post
the config.log output from your build of the 2.1.0 to the list?
Also could you try running the job using this extra command line arg to
see if the problem goes away?
mpirun --mca btl ^vader (rest of your args)
Howard
Post by Howard Pritchard
Hi Goetz,
Would you mind testing against the 2.1.0 release or the latest from the
1.10.x series (1.10.6)?
Hi Howard,
after sending my mail I have tested both 1.10.6 and 2.1.0 and I have
received the same error. I have also tested outside of slurm using
ssh, same problem.
[pax11-10:21920] *** Process received signal ***
[pax11-10:21920] Signal: Bus error (7)
[pax11-10:21920] Signal code: Non-existant physical address (2)
[pax11-10:21920] Failing at address: 0x2b5d5b752290
[pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370]
[pax11-10:21920] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0]
[pax11-10:21920] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1]
[pax11-10:21920] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51]
[pax11-10:21920] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f]
[pax11-10:21920] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa]
[pax11-10:21920] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429]
[pax11-10:21920] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d444486ab]
[pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff]
[pax11-10:21920] [ 9] IMB-MPI1[0x402646]
[pax11-10:21920] [10]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35]
[pax11-10:21920] [11] IMB-MPI1[0x401f79]
[pax11-10:21920] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 320 with PID 21920 on node pax11-10
exited on signal 7 (Bus error).
--------------------------------------------------------------------------
Regards, Götz Waschk
Götz Waschk
2017-03-23 07:45:18 UTC
Permalink
Hi Howard,

I have attached my config.log file for version 2.1.0. I based the build
on the OpenHPC package. Unfortunately, it still crashes even with the
vader btl disabled, using this command line:
mpirun --mca btl "^vader" IMB-MPI1


[pax11-10:44753] *** Process received signal ***
[pax11-10:44753] Signal: Bus error (7)
[pax11-10:44753] Signal code: Non-existant physical address (2)
[pax11-10:44753] Failing at address: 0x2b3989e27a00
[pax11-10:44753] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b3976f44370]
[pax11-10:44753] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(+0x559a)[0x2b398545259a]
[pax11-10:44753] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x1df)[0x2b39777bb78f]
[pax11-10:44753] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(mca_btl_sm_sendi+0x272)[0x2b3985450562]
[pax11-10:44753] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(+0x8a3f)[0x2b3985d78a3f]
[pax11-10:44753] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x4a7)[0x2b3985d79ad7]
[pax11-10:44753] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_sendrecv_nonzero_actual+0x110)[0x2b3976cda620]
[pax11-10:44753] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x860)[0x2b3976cdb8f0]
[pax11-10:44753] [ 8]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b3976ca36ab]
[pax11-10:44753] [ 9] IMB-MPI1[0x40b2ff]
[pax11-10:44753] [10] IMB-MPI1[0x402646]
[pax11-10:44753] [11]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3977172b35]
[pax11-10:44753] [12] IMB-MPI1[0x401f79]
[pax11-10:44753] *** End of error message ***
[pax11-10:44752] *** Process received signal ***
[pax11-10:44752] Signal: Bus error (7)
[pax11-10:44752] Signal code: Non-existant physical address (2)
[pax11-10:44752] Failing at address: 0x2ab0d270d3e8
[pax11-10:44752] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2ab0bf7ec370]
[pax11-10:44752] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x89)[0x2ab0c2eed1c9]
[pax11-10:44752] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmca_common_sm.so.20(+0x1495)[0x2ab0cde8d495]
[pax11-10:44752] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x277)[0x2ab0c0063827]
[pax11-10:44752] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(mca_btl_sm_sendi+0x272)[0x2ab0cdc87562]
[pax11-10:44752] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(+0x8a3f)[0x2ab0ce630a3f]
[pax11-10:44752] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x4a7)[0x2ab0ce631ad7]
[pax11-10:44752] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_sendrecv_nonzero_actual+0x110)[0x2ab0bf582620]
[pax11-10:44752] [ 8]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x860)[0x2ab0bf5838f0]
[pax11-10:44752] [ 9]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2ab0bf54b6ab]
[pax11-10:44752] [10] IMB-MPI1[0x40b2ff]
[pax11-10:44752] [11] IMB-MPI1[0x402646]
[pax11-10:44752] [12]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab0bfa1ab35]
[pax11-10:44752] [13] IMB-MPI1[0x401f79]
[pax11-10:44752] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 340 with PID 44753 on node pax11-10
exited on signal 7 (Bus error).
--------------------------------------------------------------------------
Åke Sandgren
2017-03-23 08:28:52 UTC
Permalink
Since I'm seeing similar bus errors from both Open MPI and other places
on our system, I'm wondering: what hardware do you have?

CPUs, interconnect, etc.
Post by Götz Waschk
Hi Howard,
I have attached my config.log file for version 2.1.0. I have based it
on the OpenHPC package. Unfortunately, it still crashes with disabling
mpirun --mca btl "^vader" IMB-MPI1
[pax11-10:44753] *** Process received signal ***
[pax11-10:44753] Signal: Bus error (7)
[pax11-10:44753] Signal code: Non-existant physical address (2)
[pax11-10:44753] Failing at address: 0x2b3989e27a00
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ***@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Götz Waschk
2017-03-23 08:53:17 UTC
Permalink
Hi Åke,

I have E5-2697A CPUs and Mellanox ConnectX-3 FDR InfiniBand. I'm using
EL7.3 as the operating system.

Regards, Götz Waschk
Post by Åke Sandgren
Since i'm seeing similar Bus errors from both openmpi and other places
on our system I'm wondering, what hardware do you have?
CPU:s, interconnect etc.
Post by Götz Waschk
Hi Howard,
I have attached my config.log file for version 2.1.0. I have based it
on the OpenHPC package. Unfortunately, it still crashes with disabling
mpirun --mca btl "^vader" IMB-MPI1
[pax11-10:44753] *** Process received signal ***
[pax11-10:44753] Signal: Bus error (7)
[pax11-10:44753] Signal code: Non-existant physical address (2)
[pax11-10:44753] Failing at address: 0x2b3989e27a00
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
--
AL I:40: Do what thou wilt shall be the whole of the Law.
Åke Sandgren
2017-03-23 08:59:25 UTC
Permalink
E5-2697A which version? v4?
Post by Götz Waschk
Hi Åke,
I have E5-2697A CPUs and Mellanox ConnectX-3 FDR Infiniband. I'm using
EL7.3 as the operating system.
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ***@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Götz Waschk
2017-03-23 09:11:40 UTC
Permalink
Post by Åke Sandgren
E5-2697A which version? v4?
Hi, yes, that one:
Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz

Regards, Götz
Åke Sandgren
2017-03-23 09:19:06 UTC
Permalink
OK, we have E5-2690 v4s and Connect-IB.
Post by Götz Waschk
Post by Åke Sandgren
E5-2697A which version? v4?
Regards, Götz
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ***@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
Gilles Gouaillardet
2017-03-23 09:33:09 UTC
Permalink
Can you please try
mpirun --mca btl tcp,self ...
And if it works
mpirun --mca btl openib,self ...

Then can you try
mpirun --mca coll ^tuned --mca btl tcp,self ...

That will help figure out whether the error is in the pml or the coll
framework/module.
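
Spelled out against the IMB binary used in this thread, those three runs would look roughly like this (just a sketch; depending on the shell, the caret may need quoting, e.g. "^tuned"):

# 1) TCP only (rules out the openib and vader/sm BTLs)
mpirun --mca btl tcp,self IMB-MPI1
# 2) InfiniBand (openib) only
mpirun --mca btl openib,self IMB-MPI1
# 3) the TCP run again, with the tuned collective component disabled
mpirun --mca coll ^tuned --mca btl tcp,self IMB-MPI1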

Cheers,

Gilles
Post by Götz Waschk
Hi Howard,
I have attached my config.log file for version 2.1.0. I have based it
on the OpenHPC package. Unfortunately, it still crashes with disabling
mpirun --mca btl "^vader" IMB-MPI1
[pax11-10:44753] *** Process received signal ***
[pax11-10:44753] Signal: Bus error (7)
[pax11-10:44753] Signal code: Non-existant physical address (2)
[pax11-10:44753] Failing at address: 0x2b3989e27a00
[pax11-10:44753] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b3976f44370]
[pax11-10:44753] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(+0x559a)[0x2b398545259a]
[pax11-10:44753] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x1df)[0x2b39777bb78f]
[pax11-10:44753] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(mca_btl_sm_sendi+0x272)[0x2b3985450562]
[pax11-10:44753] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(+0x8a3f)[0x2b3985d78a3f]
[pax11-10:44753] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x4a7)[0x2b3985d79ad7]
[pax11-10:44753] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_sendrecv_nonzero_actual+0x110)[0x2b3976cda620]
[pax11-10:44753] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x860)[0x2b3976cdb8f0]
[pax11-10:44753] [ 8]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b3976ca36ab]
[pax11-10:44753] [ 9] IMB-MPI1[0x40b2ff]
[pax11-10:44753] [10] IMB-MPI1[0x402646]
[pax11-10:44753] [11]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3977172b35]
[pax11-10:44753] [12] IMB-MPI1[0x401f79]
[pax11-10:44753] *** End of error message ***
[pax11-10:44752] *** Process received signal ***
[pax11-10:44752] Signal: Bus error (7)
[pax11-10:44752] Signal code: Non-existant physical address (2)
[pax11-10:44752] Failing at address: 0x2ab0d270d3e8
[pax11-10:44752] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2ab0bf7ec370]
[pax11-10:44752] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x89)[0x2ab0c2eed1c9]
[pax11-10:44752] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmca_common_sm.so.20(+0x1495)[0x2ab0cde8d495]
[pax11-10:44752] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x277)[0x2ab0c0063827]
[pax11-10:44752] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(mca_btl_sm_sendi+0x272)[0x2ab0cdc87562]
[pax11-10:44752] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(+0x8a3f)[0x2ab0ce630a3f]
[pax11-10:44752] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x4a7)[0x2ab0ce631ad7]
[pax11-10:44752] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_sendrecv_nonzero_actual+0x110)[0x2ab0bf582620]
[pax11-10:44752] [ 8]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x860)[0x2ab0bf5838f0]
[pax11-10:44752] [ 9]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2ab0bf54b6ab]
[pax11-10:44752] [10] IMB-MPI1[0x40b2ff]
[pax11-10:44752] [11] IMB-MPI1[0x402646]
[pax11-10:44752] [12]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab0bfa1ab35]
[pax11-10:44752] [13] IMB-MPI1[0x401f79]
[pax11-10:44752] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 340 with PID 44753 on node pax11-10
exited on signal 7 (Bus error).
--------------------------------------------------------------------------
Götz Waschk
2017-03-23 10:16:56 UTC
Permalink
Hi Gilles,

I'm currently testing and here are some preliminary results:

On Thu, Mar 23, 2017 at 10:33 AM, Gilles Gouaillardet
Post by Gilles Gouaillardet
Can you please try
mpirun --mca btl tcp,self ...
This failed to produce any program output; there were lots of errors like this:
[pax11-00][[54124,1],31][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.225.202 failed: Connection timed out (110)

I had to terminate the job.

That's why I added the option --mca btl_tcp_if_exclude ib0. In
this case, the program started to produce output, but began hanging
early on with this error:
[pax11-00][[61232,1],31][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 127.0.0.1 failed: Connection refused (111)
[pax11-01][[61232,1],63][btl_tcp_endpoint.c:649:mca_btl_tcp_endpoint_recv_connect_ack]
received unexpected process identifier [[61232,1],33]

I have aborted that job as well.
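
One likely explanation for the 127.0.0.1 errors: setting btl_tcp_if_exclude replaces the default exclude list (which normally covers the loopback interface), so lo has to be excluded explicitly as well. A sketch of the TCP-only run with that taken into account:

mpirun --mca btl tcp,self --mca btl_tcp_if_exclude lo,ib0 IMB-MPI1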
Post by Gilles Gouaillardet
And if it works
mpirun --mca btl openib,self ...
This is running fine so far but will take some more time.

Regards, Götz
Götz Waschk
2017-03-23 13:37:23 UTC
Permalink
Hi Gilles,

On Thu, Mar 23, 2017 at 10:33 AM, Gilles Gouaillardet
Post by Gilles Gouaillardet
mpirun --mca btl openib,self ...
Looks like this didn't finish; I had to terminate the job during the
Gather step with 32 processes.
Post by Gilles Gouaillardet
Then can you try
mpirun --mca coll ^tuned --mca btl tcp,self ...
As mentioned, this didn't produce any program output, just the errors above.

I have also tried mpirun --mca coll ^tuned --mca btl tcp,openib; this
finished fine, but was quite slow. I am currently testing with mpirun
--mca coll ^tuned.

Regards, Götz
Götz Waschk
2017-03-23 15:01:45 UTC
Permalink
Post by Götz Waschk
I have also tried mpirun --mca coll ^tuned --mca btl tcp,openib , this
finished fine, but was quite slow. I am currently testing with mpirun
--mca coll ^tuned
This one also ran fine.
Götz Waschk
2017-03-28 14:19:27 UTC
Permalink
Hi everyone,

so how do I proceed with this problem? Do you need more information?
Should I open a bug report on GitHub?

Regards, Götz Waschk
Gilles Gouaillardet
2017-03-29 08:26:11 UTC
Permalink
Hi,


Yes, please open an issue on GitHub, and post your configure and mpirun
command lines.

Ideally, could you try the latest v1.10.6 and v2.1.0?

If you can reproduce the issue with a smaller number of MPI tasks, that
would be great too.


Cheers,


Gilles
Post by Götz Waschk
Hi everyone,
so how do I proceed with this problem, do you need more information?
Should I open a bug report on github?
Regards, Götz Waschk
Götz Waschk
2017-11-30 14:53:27 UTC
Permalink
Hi everyone,

I have managed to solve the first part of this problem. It was caused
by the quota on /tmp, which is where the Open MPI session directory
was stored. There's an XFS default quota of 100 MB to prevent users from
filling up /tmp. Instead of an over-quota message, the result was the
Open MPI crash with a bus error.

After setting TMPDIR in Slurm, I was finally able to run IMB-MPI1 with
1024 cores and Open MPI 1.10.6.
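
For reference, a sketch of one way the session directory can be redirected in the batch script (the path is an example, not from the original setup; it has to exist on every node and live on a local filesystem without the small quota):

export TMPDIR=/scratch/$SLURM_JOB_ID
# create the directory on every allocated node before launching
srun --ntasks-per-node=1 mkdir -p "$TMPDIR"
mpirun IMB-MPI1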

But now for the new problem: with openmpi3, the same test (IMB-MPI1,
1024 cores, 32 nodes) hangs after about 30 minutes of runtime. Any
idea on this?

Regards, Götz Waschk
Jeff Squyres (jsquyres)
2017-11-30 15:24:23 UTC
Permalink
Can you upgrade to 1.10.7? That's the last release in the v1.10 series, and has all the latest bug fixes.
Post by Götz Waschk
Hi everyone,
I have managed to solve the first part of this problem. It was caused
by the quota on /tmp, that's where the session directory of openmpi
was stored. There's a XFS default quota of 100MB to prevent users from
filling up /tmp. Instead of an over quota message, the result was the
openmpi crash from a bus error.
After setting TMPDIR in slurm, I was finally able to run IMB-MPI1 with
1024 cores and openmpi 1.10.6.
But now for the new problem: with openmpi3, the same test (IMB-MPI1,
1024 cores, 32 nodes) hangs after about 30 minutes of runtime. Any
idea on this?
Regards, Götz Waschk
--
Jeff Squyres
***@cisco.com
Götz Waschk
2017-11-30 16:10:53 UTC
Permalink
Dear Jeff,

I'm using Open MPI as shipped by OpenHPC, so I'll upgrade 1.10 to
1.10.7 when they do. But it isn't 1.10 that is failing for me; it's
Open MPI 3.0.0.

Regards, Götz

On Thu, Nov 30, 2017 at 4:24 PM, Jeff Squyres (jsquyres)
Post by Jeff Squyres (jsquyres)
Can you upgrade to 1.10.7? That's the last release in the v1.10 series, and has all the latest bug fixes.
Post by Götz Waschk
Hi everyone,
I have managed to solve the first part of this problem. It was caused
by the quota on /tmp, that's where the session directory of openmpi
was stored. There's a XFS default quota of 100MB to prevent users from
filling up /tmp. Instead of an over quota message, the result was the
openmpi crash from a bus error.
After setting TMPDIR in slurm, I was finally able to run IMB-MPI1 with
1024 cores and openmpi 1.10.6.
But now for the new problem: with openmpi3, the same test (IMB-MPI1,
1024 cores, 32 nodes) hangs after about 30 minutes of runtime. Any
idea on this?
Regards, Götz Waschk
Jeff Squyres (jsquyres)
2017-11-30 17:32:46 UTC
Permalink
Ah, I was misled by the subject.

Can you provide more information about "hangs", and your environment?

You previously cited:

- E5-2697A v4 CPUs and Mellanox ConnectX-3 FDR Infiniband
- Slurm
- Open MPI v3.0.0
- IMB-MPI1

Can you send the information listed here:

https://www.open-mpi.org/community/help/

BTW, given that you fixed the last error by growing the tmpdir size (admittedly, we should probably have a better error message here, and shouldn't just segv like you were seeing -- I'll open a bug on that), you can probably remove "--mca btl ^vader" and other similar CLI options. vader and sm were [probably?] failing because the memory-mapped files on that filesystem ran out of space and Open MPI didn't handle it well. Meaning: in general, you don't want to turn off shared memory support, because it will likely always be the fastest option for on-node communication.
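
If exporting TMPDIR on every node is awkward, an alternative (a sketch, assuming the OpenHPC build exposes the standard MCA parameter) is to point Open MPI's session-directory base somewhere roomier on the mpirun command line:

# /local/scratch is a placeholder path with enough space on each node
mpirun --mca orte_tmpdir_base /local/scratch IMB-MPI1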
Post by Götz Waschk
Dear Jeff,
I'm using openmpi as shipped by OpenHPC, so I'll upgrade 1.10 to
1.10.7 when they do. But it isn't 1.10 that is failing for me but
openmpi 3.0.0.
Regards, Götz
On Thu, Nov 30, 2017 at 4:24 PM, Jeff Squyres (jsquyres)
Post by Jeff Squyres (jsquyres)
Can you upgrade to 1.10.7? That's the last release in the v1.10 series, and has all the latest bug fixes.
Post by Götz Waschk
Hi everyone,
I have managed to solve the first part of this problem. It was caused
by the quota on /tmp, that's where the session directory of openmpi
was stored. There's a XFS default quota of 100MB to prevent users from
filling up /tmp. Instead of an over quota message, the result was the
openmpi crash from a bus error.
After setting TMPDIR in slurm, I was finally able to run IMB-MPI1 with
1024 cores and openmpi 1.10.6.
But now for the new problem: with openmpi3, the same test (IMB-MPI1,
1024 cores, 32 nodes) hangs after about 30 minutes of runtime. Any
idea on this?
Regards, Götz Waschk
--
Jeff Squyres
***@cisco.com
Götz Waschk
2017-12-01 09:13:52 UTC
Permalink
On Thu, Nov 30, 2017 at 6:32 PM, Jeff Squyres (jsquyres)
Post by Jeff Squyres (jsquyres)
Ah, I was misled by the subject.
Can you provide more information about "hangs", and your environment?
- E5-2697A v4 CPUs and Mellanox ConnectX-3 FDR Infiniband
- Slurm
- Open MPI v3.0.0
- IMB-MPI1
https://www.open-mpi.org/community/help/
BTW, the fact that you fixed the last error by growing the tmpdir size (admittedly: we should probably have a better error message here, and shouldn't just segv like you were seeing -- I'll open a bug on that), you can probably remove "--mca btl ^vader" or other similar CLI options. vader and sm were [probably?] failing due to the memory-mapped files on the filesystem running out of space and Open MPI not handling it well. Meaning: in general, you don't want to turn off shared memory support, because that will likely always be the fastest for on-node communication.
Hi Jeff,

Yes, it was wrong to simply close the issue with Open MPI 1.10. But now
about the current problem:

I am using the packages provided by OpenHPC, so I didn't build Open MPI
myself and don't have a config.log. The package version is
openmpi3-gnu7-ohpc-3.0.0-35.1.x86_64.
Attached is the output of ompi_info --all.
The FAQ entry must be outdated, as this happened:
% ompi_info -v ompi full --parsable
ompi_info: Error: unknown option "-v"
Type 'ompi_info --help' for usage.

I have attached my Slurm job script; it simply does an mpirun
IMB-MPI1 with 1024 processes. I haven't set any MCA parameters, so, for
instance, vader is enabled.

The bug's effect is that the program produces standard output for
over 30 minutes, then all processes keep running at 100% CPU
until they are killed by the Slurm job time limit (2 hours in this example).

The InfiniBand network seems to be working fine. I'm using Red Hat's
OFED from RHEL 7.4 (it really is Scientific Linux 7.4). I am running
opensm on one of the nodes.


Regards, Götz
Götz Waschk
2017-12-01 13:10:24 UTC
Permalink
Post by Götz Waschk
I have attached my slurm job script, it will simply do an mpirun
IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for
instance, vader is enabled.
I have tested again with
mpirun --mca btl "^vader" IMB-MPI1
and it made no difference.
Noam Bernstein
2017-12-01 14:00:09 UTC
Permalink
Post by Götz Waschk
Post by Götz Waschk
I have attached my slurm job script, it will simply do an mpirun
IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for
instance, vader is enabled.
I have tested again, with
mpirun --mca btl "^vader" IMB-MPI1
it made no difference.
I’ve lost track of the earlier parts of this thread, but has anyone suggested logging into the nodes it’s running on, doing “gdb -p PID” for each of the mpi processes, and doing “where” to see where it’s hanging?

I use this script (trace_all), which depends on a variable, process, that is a grep regexp matching the MPI executable:
echo "where" > /tmp/gf

pids=`ps aux | grep $process | grep -v grep | grep -v trace_all | awk '{print \$2}'`
for pid in $pids; do
echo $pid
prog=`ps auxw | grep " $pid " | grep -v grep | awk '{print $11}'`
gdb -x /tmp/gf -batch $prog $pid
echo ""
done
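
A possible invocation on one of the compute nodes, assuming the script above is saved as trace_all (the pattern and output path are examples):

export process=IMB-MPI1                       # grep pattern for the MPI executable
bash ./trace_all > /tmp/stacks.$(hostname).txt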
Götz Waschk
2017-03-23 08:57:32 UTC
Permalink
Hi Howard,

I had tried to send the config.log of my 2.1.0 build, but I guess it was
too big for the list. I'm trying again with a compressed file.
I have based the build on the OpenHPC package. Unfortunately, it still
crashes even with the vader btl disabled, using this command line:
mpirun --mca btl "^vader" IMB-MPI1


[pax11-10:44753] *** Process received signal ***
[pax11-10:44753] Signal: Bus error (7)
[pax11-10:44753] Signal code: Non-existant physical address (2)
[pax11-10:44753] Failing at address: 0x2b3989e27a00
[pax11-10:44753] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b3976f44370]
[pax11-10:44753] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(+0x559a)[0x2b398545259a]
[pax11-10:44753] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x1df)[0x2b39777bb78f]
[pax11-10:44753] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(mca_btl_sm_sendi+0x272)[0x2b3985450562]
[pax11-10:44753] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(+0x8a3f)[0x2b3985d78a3f]
[pax11-10:44753] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x4a7)[0x2b3985d79ad7]
[pax11-10:44753] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_sendrecv_nonzero_actual+0x110)[0x2b3976cda620]
[pax11-10:44753] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x860)[0x2b3976cdb8f0]
[pax11-10:44753] [ 8]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b3976ca36ab]
[pax11-10:44753] [ 9] IMB-MPI1[0x40b2ff]
[pax11-10:44753] [10] IMB-MPI1[0x402646]
[pax11-10:44753] [11]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3977172b35]
[pax11-10:44753] [12] IMB-MPI1[0x401f79]
[pax11-10:44753] *** End of error message ***
[pax11-10:44752] *** Process received signal ***
[pax11-10:44752] Signal: Bus error (7)
[pax11-10:44752] Signal code: Non-existant physical address (2)
[pax11-10:44752] Failing at address: 0x2ab0d270d3e8
[pax11-10:44752] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2ab0bf7ec370]
[pax11-10:44752] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x89)[0x2ab0c2eed1c9]
[pax11-10:44752] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmca_common_sm.so.20(+0x1495)[0x2ab0cde8d495]
[pax11-10:44752] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x277)[0x2ab0c0063827]
[pax11-10:44752] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_sm.so(mca_btl_sm_sendi+0x272)[0x2ab0cdc87562]
[pax11-10:44752] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(+0x8a3f)[0x2ab0ce630a3f]
[pax11-10:44752] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x4a7)[0x2ab0ce631ad7]
[pax11-10:44752] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_sendrecv_nonzero_actual+0x110)[0x2ab0bf582620]
[pax11-10:44752] [ 8]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x860)[0x2ab0bf5838f0]
[pax11-10:44752] [ 9]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2ab0bf54b6ab]
[pax11-10:44752] [10] IMB-MPI1[0x40b2ff]
[pax11-10:44752] [11] IMB-MPI1[0x402646]
[pax11-10:44752] [12]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab0bfa1ab35]
[pax11-10:44752] [13] IMB-MPI1[0x401f79]
[pax11-10:44752] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 340 with PID 44753 on node pax11-10
Gilles Gouaillardet
2017-12-01 15:23:01 UTC
Permalink
FWIW,

pstack <pid>
is a gdb wrapper that displays the stack trace.

PADB (http://padb.pittman.org.uk) is a great OSS tool that automatically collects the stack traces of all the MPI tasks (and can do some grouping, similar to dshbak).
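
For a quick look without padb, a small sketch that dumps the stack of every local rank with pstack (the binary name is taken from this thread):

for pid in $(pgrep IMB-MPI1); do
    echo "=== PID $pid ==="
    pstack "$pid"
done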

Cheers,

Gilles
Post by Götz Waschk
I have attached my slurm job script, it will simply do an mpirun
IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for
instance, vader is enabled.
I have tested again, with
   mpirun --mca btl "^vader" IMB-MPI1
it made no difference.
I’ve lost track of the earlier parts of this thread, but has anyone suggested logging into the nodes it’s running on, doing “gdb -p PID” for each of the mpi processes, and doing “where” to see where it’s hanging?
echo "where" > /tmp/gf
pids=`ps aux | grep $process | grep -v grep | grep -v trace_all | awk '{print \$2}'`
for pid in $pids; do
   echo $pid
   prog=`ps auxw | grep " $pid " | grep -v grep | awk '{print $11}'`
   gdb -x /tmp/gf -batch $prog $pid
   echo ""
done
Götz Waschk
2017-12-01 20:32:35 UTC
Permalink
Thanks,

I've tried padb first to get stack traces. This is from IMB-MPI1
hanging after one hour; the last output was:
# Benchmarking Alltoall
# #processes = 1024
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
     0         1000        0.04        0.09        0.05
     1         1000      253.40      335.35      293.06
     2         1000      266.93      346.65      306.23
     4         1000      303.52      382.41      342.21
     8         1000      383.89      493.56      439.34
    16         1000      501.27      627.84      569.80
    32         1000     1039.65     1259.70     1163.12
    64         1000     1710.12     2071.47     1910.62
   128         1000     3051.68     3653.44     3398.65

On Fri, Dec 1, 2017 at 4:23 PM, Gilles Gouaillardet
Post by Gilles Gouaillardet
FWIW,
pstack <pid>
Is a gdb wrapper that displays the stack trace.
PADB http://padb.pittman.org.uk is a great OSS that automatically collect
the stack traces of all the MPI tasks (and can do some grouping similar to
dshbak)
Cheers,
Gilles
I have attached my slurm job script, it will simply do an mpirun
IMB-MPI1 with 1024 processes. I haven't set any mca parameters, so for
instance, vader is enabled.
I have tested again, with
mpirun --mca btl "^vader" IMB-MPI1
it made no difference.
I’ve lost track of the earlier parts of this thread, but has anyone
suggested logging into the nodes it’s running on, doing “gdb -p PID” for
each of the mpi processes, and doing “where” to see where it’s hanging?
I use this script (trace_all), which depends on a variable process that is a
echo "where" > /tmp/gf
pids=`ps aux | grep $process | grep -v grep | grep -v trace_all | awk '{print \$2}'`
for pid in $pids; do
echo $pid
prog=`ps auxw | grep " $pid " | grep -v grep | awk '{print $11}'`
gdb -x /tmp/gf -batch $prog $pid
echo ""
done
--
AL I:40: Do what thou wilt shall be the whole of the Law.
Peter Kjellström
2017-12-04 12:06:42 UTC
Permalink
On Fri, 1 Dec 2017 21:32:35 +0100
Götz Waschk <***@gmail.com> wrote:
...
Post by Götz Waschk
# Benchmarking Alltoall
# #processes = 1024
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.04 0.09 0.05
1 1000 253.40 335.35 293.06
2 1000 266.93 346.65 306.23
4 1000 303.52 382.41 342.21
8 1000 383.89 493.56 439.34
16 1000 501.27 627.84 569.80
32 1000 1039.65 1259.70 1163.12
64 1000 1710.12 2071.47 1910.62
128 1000 3051.68 3653.44 3398.65
As a potentially interesting data point, I dug through my archive of
IMB output and found an example that also showed something strange
happening at the 128 to 256 byte transition on Alltoall @ 1024 ranks
(although in my case it didn't completely hang):

# Benchmarking Alltoall
# #processes = 1024
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
     1         1000      417.44      417.59      417.54
     2         1000      410.50      410.72      410.67
     4         1000      365.92      366.21      365.99
     8         1000      583.21      583.51      583.37
    16         1000      652.90      653.09      652.98
    32         1000      982.09      982.42      982.28
    64         1000     2090.70     2091.11     2090.90
   128         1000     2590.91     2591.93     2591.44
   256           93    70077.42    70219.70    70174.85
   512           93    88611.39    88711.53    88672.84

My output was from Open MPI 1.7.6 on CentOS 6 on Mellanox FDR IB
(using the normal verbs/openib transport).

/Peter K
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Götz Waschk
2018-03-26 11:31:56 UTC
Permalink
Hi everyone,

is there anything new on this issue? Should I report it on GitHub as well?

Regards, Götz Waschk
