Discussion:
[OMPI users] Deadly warning "Epoll ADD(4) on fd 2 failed." ?
Filippo Spiga
2014-05-27 08:22:20 UTC
Permalink
Dear all,

I am using an Open MPI v1.8.2 nightly snapshot compiled with SLURM support (version 14.03pre5). The two messages below appeared during a 2048-MPI-process job that died after 24 hours!

[warn] Epoll ADD(1) on fd 0 failed. Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted

[warn] Epoll ADD(4) on fd 2 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Operation not permitted


The first one appeared immediately at start-up and had no apparent effect: the application started computing and successfully called a large parallel eigensolver. The second message appeared after 18~19 hours of non-stop computation, and the application crashed without showing any other error message! I was regularly checking that the MPI processes were not stuck; after this message the processes all aborted without dumping anything to stdout/stderr. It is quite weird.

I believe these messages come from Open MPI (but correct me if I am wrong!). I am going to look at the application and the various libraries to find out whether something is wrong there. In the meantime it would be a great help if anyone could clarify the exact meaning of these warning messages.

Many thanks in advance.

Regards,
Filippo

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert

Ralph Castain
2014-05-27 17:31:52 UTC
Permalink
I'm unaware of any OMPI error message like that - it might be coming from libevent, which we embed and which can use epoll, so it could indeed be caused by us. However, I'm a little concerned about the use of the prerelease version of Slurm, as we know that PMI is having some problems over there.
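For reference, that warning text matches libevent's epoll backend, and the EPERM behind "Operation not permitted" means epoll_ctl() was asked to watch a descriptor that epoll cannot monitor - for example a regular file, which is what fd 0 or fd 2 can become once a batch system redirects stdin/stderr to log files. Below is a minimal sketch, not taken from this thread (the file name is made up), that reproduces the same errno:

/* Minimal illustrative sketch (not from this thread): epoll cannot watch
 * regular files, so EPOLL_CTL_ADD on a descriptor that has been redirected
 * to a file fails with EPERM ("Operation not permitted") - the same errno
 * reported in the libevent warning above. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);
    /* Stand-in for fd 2 after the batch system has redirected stderr to a file. */
    int fd = open("/tmp/job-stderr.log", O_CREAT | O_WRONLY, 0644);

    struct epoll_event ev;
    ev.events = EPOLLOUT;   /* corresponds to "write change was 1 (add)" in the warning */
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        /* errno is EPERM here: "Operation not permitted" */
        fprintf(stderr, "Epoll ADD on fd %d failed: %s\n", fd, strerror(errno));

    close(fd);
    close(epfd);
    return 0;
}

By itself such a failed ADD is only a warning about that one descriptor, which is consistent with the first occurrence having had no visible effect.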

So out of curiosity - how was this job launched? Via mpirun or directly using srun?
Mike Dubman
2014-05-28 06:17:59 UTC
Permalink
[Post body not available in the archive.]
Filippo Spiga
2014-05-28 07:03:08 UTC
Permalink
Dear Ralph,
Post by Ralph Castain
So out of curiosity - how was this job launched? Via mpirun or directly using srun?
The job was submitted using mpirun. However, Open MPI is compiled with SLURM support (and I am starting to believe this might not be ideal after all!). I have a partial job trace dumped by the process when it died:

--------------------------------------------------------------------------
mpirun noticed that process rank 8190 with PID 29319 on node sand-8-39 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine   Line      Source
diag_OMPI-INTEL.x  0000000000537349  Unknown   Unknown   Unknown
diag_OMPI-INTEL.x  0000000000535C1E  Unknown   Unknown   Unknown
diag_OMPI-INTEL.x  000000000050CF52  Unknown   Unknown   Unknown
diag_OMPI-INTEL.x  00000000004F0BB3  Unknown   Unknown   Unknown
diag_OMPI-INTEL.x  00000000004BEB99  Unknown   Unknown   Unknown
libpthread.so.0    00007FE5B5BE5710  Unknown   Unknown   Unknown
libmlx4-rdmav2.so  00007FE5A8C0A867  Unknown   Unknown   Unknown
mca_btl_openib.so  00007FE5ADA36644  Unknown   Unknown   Unknown
libopen-pal.so.6   00007FE5B288262A  Unknown   Unknown   Unknown
mca_pml_ob1.so     00007FE5AC344FAF  Unknown   Unknown   Unknown
libmpi.so.1        00007FE5B5064E7D  Unknown   Unknown   Unknown
libmpi_mpifh.so.2  00007FE5B531919B  Unknown   Unknown   Unknown
libelpa.so.0       00007FE5B82EC0CE  Unknown   Unknown   Unknown
libelpa.so.0       00007FE5B82EBE36  Unknown   Unknown   Unknown
libelpa.so.0       00007FE5B82EBDFD  Unknown   Unknown   Unknown
libelpa.so.0       00007FE5B82EC2CD  Unknown   Unknown   Unknown
libelpa.so.0       00007FE5B82EB798  Unknown   Unknown   Unknown
libelpa.so.0       00007FE5B82E571A  Unknown   Unknown   Unknown
diag_OMPI-INTEL.x  00000000004101C2  MAIN__    562       dirac_exomol_eigen.f90
diag_OMPI-INTEL.x  000000000040A1A6  Unknown   Unknown   Unknown
libc.so.6          00007FE5B4A89D1D  Unknown   Unknown   Unknown
diag_OMPI-INTEL.x  000000000040A099  Unknown   Unknown   Unknown

(plus much more trace information like this)

Unfortunately there is no more information than this, because not every library was built with debug flags. The computation is all concentrated in ScaLAPACK and ELPA, which I recompiled myself. I ran over 8192 MPI processes and the memory allocated per MPI process was below 1 GByte. My compute nodes have 64 GByte of RAM and two eight-core Intel Sandy Bridge CPUs. Since the 512 nodes I have available for this test are 80% of the cluster, I cannot easily reschedule a repetition of the test.

I wonder whether this message, which may be related to libevent, could in principle cause the segfault. I am working to understand the cause on my side, but so far a reduced problem size on fewer nodes has never failed.

Any help is much appreciated!

Regards,
F

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert

Ralph Castain
2014-05-28 12:23:02 UTC
Permalink
The next time you run, I would just add "-mca plm rsh" to your command line. You don't need to rebuild OMPI to avoid issues with the Slurm integration. This will still allow OMPI to read the Slurm allocation so it knows which nodes to use, but it won't use Slurm to launch the job.

If it is a slurm PMI issue, this should resolve it.
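
For concreteness, and purely as a sketch (the real command line is not shown in this thread; the executable name is taken from the traceback, the process count from Filippo's earlier message, and the application arguments are omitted), the suggestion amounts to something like:

  mpirun -mca plm rsh -np 8192 ./diag_OMPI-INTEL.x

With the rsh PLM, mpirun starts its daemons over ssh/rsh rather than through Slurm's srun/PMI, while still reading the node list from the Slurm allocation.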