[OMPI users] openmpi hang on IB disconnect
Michael Di Domenico
2018-01-17 13:35:23 UTC
openmpi-2.0.2 running on rhel 7.4 with qlogic QDR infiniband
switches/adapters, also using slurm

i have a user who's running a job over multiple days. unfortunately,
after a few days the job will seemingly hang at random. the latest
instance was caused by an infiniband adapter that went offline and
came back online several times.

the card is in a semi-working state at the moment. it's passing
traffic, but i suspect some of the IB messages got lost during the
job run, and now the job is seemingly hung.

is there some mechanism i can put in place to detect this condition,
either in the code or on the system? it's causing two problems at the
moment. first and foremost, the user has no idea that the job hung or
why. second, it's wasting system time.

i'm sure other people have come across wonky IB cards. i'm curious how
everyone else is detecting this condition and dealing with it.
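
to make the question concrete, here is a rough sketch of the kind of
detection i have in mind. it is only an illustration under assumptions,
not anything provided by openmpi or slurm: the job touches a heartbeat
file after each iteration, and a separate watchdog (run from cron, for
example) flags the job if that file goes stale or if an IB port listed
under /sys/class/infiniband is not ACTIVE. the function names, the
heartbeat file, and any thresholds are hypothetical.

```python
import glob
import os
import time


def touch_heartbeat(path):
    """Called by the job (e.g. from one rank) after each finished iteration.

    Hypothetical helper: creates the file if needed and bumps its mtime.
    """
    with open(path, "a"):
        os.utime(path, None)


def is_stalled(path, max_age_s, now=None):
    """Return True if the heartbeat file is missing or older than max_age_s."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
    except FileNotFoundError:
        return True
    return age > max_age_s


def inactive_ib_ports(sysfs_root="/sys/class/infiniband"):
    """List (state_file, state) for every IB port whose state is not ACTIVE.

    Assumes the usual sysfs layout <device>/ports/<n>/state, where the
    file holds a string such as "4: ACTIVE" or "1: DOWN".
    """
    pattern = os.path.join(sysfs_root, "*", "ports", "*", "state")
    bad = []
    for state_file in sorted(glob.glob(pattern)):
        with open(state_file) as f:
            state = f.read().strip()
        if not state.endswith("ACTIVE"):
            bad.append((state_file, state))
    return bad
```

a watchdog could call is_stalled() with a threshold a few times longer
than a normal iteration and notify the user or cancel the job when it
returns True. the port check alone would probably miss my case, since a
card that flapped and came back reads ACTIVE again, which is why the
heartbeat is the main signal here.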
