Discussion:
[OMPI users] MPI_Spawn error: "Data unpack would read past end of buffer" (-26) instead of "Success"
Simone Pellegrini
2011-09-06 09:01:18 UTC
Permalink
Dear all,
I am developing an MPI application which makes heavy use of MPI_Comm_spawn. Usually
everything works fine for the first hundred spawns, but after a while the
application exits with a curious message:

[arch-top:27712] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file base/grpcomm_base_modex.c at line 349
[arch-top:27712] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file grpcomm_bad_module.c at line 518
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_proc_set_arch failed
--> Returned "Data unpack would read past end of buffer" (-26)
instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[arch-top:27712] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
[arch-top:27714] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file base/grpcomm_base_modex.c at line 349
[arch-top:27714] [[36904,165],0] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file grpcomm_bad_module.c at line 518
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[arch-top:27714] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
[arch-top:27226] 1 more process has sent help message help-mpi-runtime /
mpi_init:startup:internal-failure
[arch-top:27226] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages

Also using MPI_Init instead of MPI_Init_thread does not help; the same
error occurs.
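
For reference, the initialization looks roughly like this (a minimal sketch, not the actual application code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* Threaded variant: request full thread support and check what was
       granted. Plain MPI_Init(&argc, &argv) is equivalent to requesting
       MPI_THREAD_SINGLE. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: only thread level %d provided\n", provided);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}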

Strangely, the error does not occur if I run the code with debugging
enabled (-mca plm_base_verbose 5 -mca rmaps_base_verbose 5).
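That is, an invocation like this (same flags, hypothetical executable name) does not trigger the error:

mpirun -mca plm_base_verbose 5 -mca rmaps_base_verbose 5 --np 1 ./executable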

I am using Open MPI 1.5.3.

cheers, Simone
Ralph Castain
2011-09-06 14:57:11 UTC
Permalink
Hi Simone

Just to clarify: is your application threaded? Could you please send the OMPI configure cmd you used?

Adding the debug flags just changes the race condition. Interestingly, those values only impact the behavior of mpirun, so it looks like the race condition is occurring there.
Simone Pellegrini
2011-09-06 18:49:46 UTC
Permalink
Post by Ralph Castain
Hi Simone
Just to clarify: is your application threaded? Could you please send the OMPI configure cmd you used?
Yes, it is threaded. There are basically three threads: one for outgoing
messages (MPI_Send), one for incoming messages (MPI_Iprobe / MPI_Recv), and
one for spawning.
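
Roughly, the layout is the following (hypothetical names, all application logic omitted):

#include <mpi.h>
#include <pthread.h>
#include <stddef.h>

void *sender(void *arg)   { /* loop: MPI_Send(...) for outgoing messages */ return NULL; }
void *receiver(void *arg) { /* loop: MPI_Iprobe(...) then MPI_Recv(...) */  return NULL; }
void *spawner(void *arg)  { /* loop: MPI_Comm_spawn(...) for new tasks */   return NULL; }

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    pthread_t threads[3];
    pthread_create(&threads[0], NULL, sender,   NULL);
    pthread_create(&threads[1], NULL, receiver, NULL);
    pthread_create(&threads[2], NULL, spawner,  NULL);
    for (int i = 0; i < 3; ++i)
        pthread_join(threads[i], NULL);

    MPI_Finalize();
    return 0;
}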

I am not sure what you mean by the OMPI configure cmd I used... I simply
run mpirun --np 1 ./executable
Post by Ralph Castain
Adding the debug flags just changes the race condition. Interestingly, those values only impact the behavior of mpirun, so it looks like the race condition is occurring there.
The problem is that the error is totally nondeterministic. Sometimes it
happens, sometimes it does not, but the error message gives me no clue
where it is coming from. Is it a problem in my code or inside MPI?

cheers, Simone
Ralph Castain
2011-09-06 16:58:23 UTC
Permalink
Post by Simone Pellegrini
Post by Ralph Castain
Hi Simone
Just to clarify: is your application threaded? Could you please send the OMPI configure cmd you used?
Yes, it is threaded. There are basically three threads: one for outgoing messages (MPI_Send), one for incoming messages (MPI_Iprobe / MPI_Recv), and one for spawning.
I am not sure what you mean by the OMPI configure cmd I used... I simply run mpirun --np 1 ./executable
How was OMPI configured when it was installed? If you didn't install it, then provide the output of ompi_info - it will tell us.
Post by Simone Pellegrini
Post by Ralph Castain
Adding the debug flags just changes the race condition. Interestingly, those values only impact the behavior of mpirun, so it looks like the race condition is occurring there.
The problem is that the error is totally nondeterministic. Sometimes it happens, sometimes it does not, but the error message gives me no clue where it is coming from. Is it a problem in my code or inside MPI?
Can't tell, but it is likely an impact of threading. Race conditions within threaded environments are common, and OMPI isn't particularly thread safe, especially when it comes to comm_spawn.
Simone Pellegrini
2011-09-06 19:20:27 UTC
Permalink
Post by Ralph Castain
How was OMPI configured when it was installed? If you didn't install it, then provide the output of ompi_info - it will tell us.
[@arch-moto tasksys]$ ompi_info
Package: Open MPI ***@alderaan Distribution
Open MPI: 1.5.3
Open MPI SVN revision: r24532
Open MPI release date: Mar 16, 2011
Open RTE: 1.5.3
Open RTE SVN revision: r24532
Open RTE release date: Mar 16, 2011
OPAL: 1.5.3
OPAL SVN revision: r24532
OPAL release date: Mar 16, 2011
Ident string: 1.5.3
Prefix: /usr
Configured architecture: x86_64-unknown-linux-gnu
Configure host: alderaan
Configured by: nobody
Configured on: Thu Jul 7 13:21:35 UTC 2011
Configure host: alderaan
Built by: nobody
Built on: Thu Jul 7 13:27:08 UTC 2011
Built host: alderaan
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: 4.6.1
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: gfortran
Fortran77 compiler abs: /usr/bin/gfortran
Fortran90 compiler: /usr/bin/gfortran
Fortran90 compiler abs:
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: yes, progress: no)
Sparse Groups: no
Internal debug support: yes
MPI interface warnings: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol vis. support: yes
MPI extensions: affinity example
FT Checkpoint support: no (checkpoint thread: no)
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.5.3)
MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.5.3)
MCA memory: linux (MCA v2.0, API v2.0, Component v1.5.3)
MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.3)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.3)
MCA carto: file (MCA v2.0, API v2.0, Component v1.5.3)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.3)
MCA timer: linux (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.3)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.3)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.3)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.5.3)
MCA io: romio (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: fake (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA pml: bfo (MCA v2.0, API v2.0, Component v1.5.3)
MCA pml: csum (MCA v2.0, API v2.0, Component v1.5.3)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.3)
MCA pml: v (MCA v2.0, API v2.0, Component v1.5.3)
MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.3)
MCA rcache: vma (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.3)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.3)
MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.5.3)
MCA osc: rdma (MCA v2.0, API v2.0, Component v1.5.3)
MCA iof: hnp (MCA v2.0, API v2.0, Component v1.5.3)
MCA iof: orted (MCA v2.0, API v2.0, Component v1.5.3)
MCA iof: tool (MCA v2.0, API v2.0, Component v1.5.3)
MCA oob: tcp (MCA v2.0, API v2.0, Component v1.5.3)
MCA odls: default (MCA v2.0, API v2.0, Component v1.5.3)
MCA ras: cm (MCA v2.0, API v2.0, Component v1.5.3)
MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.5.3)
MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.5.3)
MCA rmaps: resilient (MCA v2.0, API v2.0, Component v1.5.3)
MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.5.3)
MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.5.3)
MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.5.3)
MCA rml: oob (MCA v2.0, API v2.0, Component v1.5.3)
MCA routed: binomial (MCA v2.0, API v2.0, Component v1.5.3)
MCA routed: cm (MCA v2.0, API v2.0, Component v1.5.3)
MCA routed: direct (MCA v2.0, API v2.0, Component v1.5.3)
MCA routed: linear (MCA v2.0, API v2.0, Component v1.5.3)
MCA routed: radix (MCA v2.0, API v2.0, Component v1.5.3)
MCA routed: slave (MCA v2.0, API v2.0, Component v1.5.3)
MCA plm: rsh (MCA v2.0, API v2.0, Component v1.5.3)
MCA plm: rshd (MCA v2.0, API v2.0, Component v1.5.3)
MCA filem: rsh (MCA v2.0, API v2.0, Component v1.5.3)
MCA errmgr: default (MCA v2.0, API v2.0, Component v1.5.3)
MCA ess: env (MCA v2.0, API v2.0, Component v1.5.3)
MCA ess: hnp (MCA v2.0, API v2.0, Component v1.5.3)
MCA ess: singleton (MCA v2.0, API v2.0, Component v1.5.3)
MCA ess: slave (MCA v2.0, API v2.0, Component v1.5.3)
MCA ess: tool (MCA v2.0, API v2.0, Component v1.5.3)
MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.5.3)
MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA grpcomm: hier (MCA v2.0, API v2.0, Component v1.5.3)
MCA notifier: command (MCA v2.0, API v1.0, Component v1.5.3)
MCA notifier: syslog (MCA v2.0, API v1.0, Component v1.5.3)
Ralph Castain
2011-09-06 18:11:27 UTC
Permalink
Hmmm...well, nothing definitive there, I'm afraid.

All I can suggest is to remove/reduce the threading. Like I said, we aren't terribly thread safe at this time. I suspect you're stepping into one of those non-safe areas here.

Hopefully we will do better in later releases.
Simone Pellegrini
2011-09-07 20:03:22 UTC
Permalink
Post by Ralph Castain
Hmmm...well, nothing definitive there, I'm afraid.
All I can suggest is to remove/reduce the threading. Like I said, we aren't terribly thread safe at this time. I suspect you're stepping into one of those non-safe areas here.
Hopefully we will do better in later releases.
Hi again,
I made some progress on this problem myself. It looks like it is not
related to threading and/or race conditions, but rather to the behavior
of MPI_Finalize as invoked by the spawned processes. Apparently, although
the spawned processes all invoke MPI_Finalize, they remain alive, blocked
on a semaphore. Therefore, by spawning more and more processes I end up
with hundreds of processes and slowly fill up all the available file
descriptors.

I got this hint by running my code with MPICH2. After a while I also got
an error there related to file descriptors, and from that it was easy to
understand what was going on (you should make the errors semantically more
meaningful in Open MPI).

By the way, I solved the problem by invoking MPI_Comm_disconnect on the
intercommunicator I receive from the spawning call (MPI_Finalize is not
enough). This makes the spawned tasks close the parent communicator
and terminate.
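
In code, the change amounts to something like the following (a minimal hypothetical sketch, not my actual application; the executable name and spawn count are placeholders):

/* parent.c -- spawn repeatedly, then disconnect from each child */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (int i = 0; i < 200; ++i) {   /* arbitrary number of spawns */
        MPI_Comm child;
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        /* ... exchange messages over the intercommunicator ... */
        MPI_Comm_disconnect(&child);  /* without this, finished children linger */
    }
    MPI_Finalize();
    return 0;
}

/* child.c -- disconnect from the parent before finalizing */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    /* ... exchange messages with the parent ... */
    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent); /* detach from the spawning job */
    MPI_Finalize();                   /* the process can now really exit */
    return 0;
}

Note that MPI_Comm_disconnect is collective over the communicator, so both sides have to call it, and all pending communication on it must be completed first.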

After this small change the system is more stable and that specific
error is gone. Unfortunately, a different message showed up:

[arch-moto][[530,1],0][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Bad file descriptor (9)

This error does not cause the program to terminate.

Other times I get a hard error:
[err] event_queue_remove: 0x7fb5fc008c58(fd 14) not on queue 8
[arch-moto][[14492,46],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.88.1 failed: Connection refused (111)
[arch-moto:09536] [[14492,0],0] ORTE_ERROR_LOG: A message is attempting
to be sent to a process whose contact information is unknown in file
rml_oob_send.c at line 145
[arch-moto:09536] [[14492,0],0] attempted to send to [[14492,1],0]: tag 6
[arch-moto:09536] [[14492,0],0] ORTE_ERROR_LOG: A message is attempting
to be sent to a process whose contact information is unknown in file
base/plm_base_receive.c at line 278
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9538 on
node arch-moto exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Any hints from this?

cheers, Simone
Jeff Squyres
2011-09-07 18:07:34 UTC
Permalink
Post by Simone Pellegrini
By the way, I solved the problem by invoking MPI_Comm_disconnect on the intercommunicator I receive from the spawning call (MPI_Finalize is not enough). This makes the spawned tasks close the parent communicator and terminate.
This is correct MPI behavior.

Just having spawned processes call Finalize is not sufficient, because they are still "connected" to the parent(s) who spawned them, meaning that you can eventually run out of resources.

Having your children disconnect before finalizing is definitely a good idea.
--
Jeff Squyres
***@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/