[OMPI users] Fwd: [OMPI USERS] Fault Tolerance and migration
Alberto Ortiz
2017-02-27 16:23:59 UTC
Hi,
I am interested in using Open MPI to manage the distribution on a MicroZed
cluster. These MicroZed boards come with a Zynq device, which has a
dual-core ARM Cortex-A9. One of the objectives of the project I am working
on is resilience, so I am truly interested in the fault tolerance provided
by Open MPI.

What I want to know is whether there is any implementation of run-time
migration. For instance, if I have an eight-board MicroZed cluster running an
MPI job and I unplug the Ethernet cable of one board or reboot another,
is there any support in Open MPI to detect these failures and migrate the
ranks to other processors at run time?

Thank you in advance,
Alberto.
George Bosilca
2017-02-27 22:47:06 UTC
Alberto,

In the master branch there is no such support (we had support for migration
a while back, but we have stripped it out). However, at UTK we developed a
fork of Open MPI, called ULFM, which provides fault management
capabilities. This fork can detect failures and lets you handle the fault
in the MPI layer.

I suggest you look at fault-tolerance.org for more info.

George.
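
To make "handling the fault in the MPI layer" a little more concrete, here is a minimal sketch of the bookkeeping an application might do once failures have been detected. It is plain Python and does not call MPI. It only models the idea behind ULFM's shrink operation (MPIX_Comm_shrink in its C interface): the survivors form a new, densely numbered group and the remaining work is re-split among them. The function names `shrink` and `redistribute`, the 16-rank layout (two ranks per board), and the choice of failed ranks are all hypothetical and chosen for illustration. Note that this continues on fewer ranks; it is not the rank migration asked about in the original question.

```python
from typing import Dict, List, Set


def shrink(world: List[int], failed: Set[int]) -> Dict[int, int]:
    """Map each surviving old rank to its rank in the shrunken group.

    Survivors keep their relative order, so the new ranks are dense: 0..k-1.
    """
    survivors = [r for r in world if r not in failed]
    return {old: new for new, old in enumerate(survivors)}


def redistribute(num_items: int, num_ranks: int) -> List[range]:
    """Split num_items work items into contiguous, near-equal blocks."""
    if num_ranks <= 0:
        raise ValueError("no surviving ranks to run on")
    base, extra = divmod(num_items, num_ranks)
    blocks = []
    start = 0
    for rank in range(num_ranks):
        # The first `extra` ranks take one additional item each.
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# Assumption: 8 boards x 2 cores = 16 ranks, placed two per board in order.
world = list(range(16))
# Assumption: one board was unplugged (ranks 4, 5), another rebooted (10, 11).
failed = {4, 5, 10, 11}

mapping = shrink(world, failed)          # old rank -> new rank
blocks = redistribute(1000, len(mapping))  # work items per new rank
```

In a real ULFM program the failure set would not be known up front; it would come from MPI calls returning a process-failure error, after which the application agrees on the survivors and rebuilds its communicator. The ULFM documentation at fault-tolerance.org describes the actual interface.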