Discussion:
[OMPI users] Signal propagation in 2.0.1
Noel Rycroft
2016-11-28 17:29:58 UTC
I'm seeing different behaviour between Open MPI 1.8.4 and 2.0.1 with
regard to signal propagation.

With version 1.8.4 mpirun seems to propagate SIGTERM to the tasks it starts
which enables the tasks to handle SIGTERM.

In version 2.0.1 mpirun does not seem to propagate SIGTERM and instead I
suspect it's sending SIGKILL immediately. Because the child tasks are not
given a chance to handle SIGTERM they end up orphaning their child
processes.

I have a pretty simple reproducer which consists of:

1. A simple MPI application that sleeps for a number of seconds.
2. A simple bash script which launches mpirun.
3. A second bash script which is used to launch the 'child' MPI
application (the 'sleep' binary).

Both scripts launch their children in the background, and 'wait' on
completion. They both install signal handlers for SIGTERM.

When SIGTERM is sent to the top level script it is explicitly propagated to
'mpirun' via the signal handler.
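
For readers who want to try this, the forwarding pattern described above can be sketched roughly as follows. This is not the original reproducer, only an illustration under assumptions: the function name run_forwarding is made up, and the command passed to it stands in for mpirun (or, in the second script, for the 'sleep' binary).

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): run a command in the
# background, forward SIGTERM to it, and return the command's exit status.
run_forwarding() {
    "$@" &                          # launch the child in the background
    child=$!

    # $child is expanded now, so the handler does not depend on variable
    # scope at the time the signal arrives.
    trap "kill -TERM $child 2>/dev/null" TERM

    # 'wait' returns early with a status above 128 when a trapped signal
    # interrupts it, so the child may still be running at that point.
    wait "$child"
    status=$?
    if kill -0 "$child" 2>/dev/null; then
        # Child not yet reaped: wait again to collect its real status.
        wait "$child"
        status=$?
    fi

    trap - TERM                     # restore default SIGTERM handling
    return "$status"
}
```

In the top level script this would be called as something like `run_forwarding mpirun ...` with whatever arguments the job needs. Running the child in the background and blocking in 'wait' matters here: a shell only runs its trap handler once the foreground command has finished, so a script that runs its child in the foreground would not forward the signal in time.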

In Open MPI 1.8.4 SIGTERM is propagated to the child MPI tasks which in
turn explicitly propagate the signal to the child binary processes.

In Open MPI 2.0.1 I see no evidence that SIGTERM is propagated to the child
MPI tasks. Instead those tasks are killed and their children (the
application binaries) are orphaned.

Is the difference in behaviour between the two versions expected?
r***@open-mpi.org
2016-12-01 18:49:46 UTC
Yeah, that’s a bug - we’ll have to address it

Thanks
Ralph
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
r***@open-mpi.org
2016-12-02 11:48:23 UTC
Fix is on the way: https://github.com/open-mpi/ompi/pull/2498

Thanks
Ralph