Discussion:
[OMPI users] no openmpi over IB on new CentOS 7 system
Noam Bernstein
2018-10-09 18:26:47 UTC
Hi - I’m trying to get OpenMPI working on a newly configured CentOS 7 system, and I’m not even sure what information would be useful to provide. I’m using the CentOS built-in libibverbs and/or libfabric, and I configure openmpi with just
--with-verbs --with-ofi --prefix=$DEST
I also tried --without-ofi, with no change. Basically, I can run with "--mca btl self,vader", but if I try "--mca btl,openib" I get an error from each process:
[compute-0-0][[24658,1],5][connect/btl_openib_connect_udcm.c:1245:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument
If I don’t specify the btl, it appears to try to set up openib with the same errors, then crashes with a free()-related segfault, presumably when it tries to actually use vader.

The machine seems to be able to see its IB interface, as reported by things like ibstatus or ibv_devinfo. I’m not sure what else to look for. I also confirmed that “ulimit -l” reports unlimited.

Does anyone have any suggestions as to how to diagnose this issue?

thanks,
Noam
Andy Riebs
2018-10-09 18:34:26 UTC
Noam,

Start with the FAQ, etc., under "Getting Help/Support" in the
left-column menu at https://www.open-mpi.org/

Andy

Dave Love
2018-10-10 08:51:46 UTC
RDMA was just broken in the last-but-one(?) RHEL7 kernel release, in
case that's the problem. (Fixed in 3.10.0-862.14.4.)
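
A quick way to compare the running kernel against the release named above is sketched below. This is only an illustration, not anything from the original poster: the function name kernel_at_least and the suffix stripping are hypothetical, and it assumes GNU sort with -V (version sort) is available, as it is in CentOS 7 coreutils.

```shell
#!/bin/sh
# Sketch: is the running kernel at or past the release reported as fixed?
FIXED="3.10.0-862.14.4"   # version quoted in the message above

# kernel_at_least CURRENT MINIMUM -> exit status 0 if CURRENT >= MINIMUM
# (hypothetical helper, not part of any package)
kernel_at_least() {
    # Drop a distro/arch suffix such as ".el7.x86_64" before comparing.
    current=${1%%.el*}
    # Version-sort the two strings; MINIMUM must come out first (or be equal).
    lowest=$(printf '%s\n%s\n' "$current" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

if kernel_at_least "$(uname -r)" "$FIXED"; then
    echo "running kernel $(uname -r) is at or past $FIXED"
else
    echo "running kernel $(uname -r) predates $FIXED"
fi
```

Note that this only checks the kernel that is currently booted; a newer kernel package that is installed but not yet rebooted into would not show up here.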
John Hearns via users
2018-10-10 11:49:03 UTC
Noam, what does ompi_info say - specifically, which BTLs are available?
Stupid question though - is this a single system with no connection to a switch?
You probably don't have an OpenSM subnet manager running then - could
that be the root cause?
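
One way to pull just the BTL names out of ompi_info is sketched here. It assumes the usual ompi_info line layout of "MCA btl: NAME (MCA v..., ...)"; the helper name list_btls is made up for illustration, and the ompi_info on PATH needs to be the one from the build under test.

```shell
# Sketch: print the BTL component names found in ompi_info output on stdin.
# Assumes lines shaped like "MCA btl: openib (MCA v2.1.0, ...)".
list_btls() {
    awk '$1 == "MCA" && $2 == "btl:" { print $3 }'
}

# Usage on the affected node:
#   ompi_info | list_btls
```

If openib is missing from that list, the build did not pick up verbs support at configure time, which would be a different problem from a runtime QP error.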
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
John Hearns via users
2018-10-10 11:49:54 UTC
On that system please tell us what these return:
ibstat
ibstatus
sminfo
ibdiagnet
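
To gather all four outputs in one go, something like the following sketch could be used. The collect function and the report file name are hypothetical; the tools themselves are the ones listed above, and some of them (ibdiagnet in particular) may need root and may write their own files elsewhere.

```shell
#!/bin/sh
# Sketch: run each requested tool and append its output to a single report.
# collect REPORT TOOL... (hypothetical helper)
collect() {
    out=$1; shift
    : > "$out"                                  # start with an empty report
    for tool in "$@"; do
        echo "===== $tool =====" >> "$out"
        if command -v "$tool" >/dev/null 2>&1; then
            # Capture stdout and stderr; record a non-zero exit status too.
            "$tool" >> "$out" 2>&1 || echo "($tool exited with status $?)" >> "$out"
        else
            echo "($tool not found in PATH)" >> "$out"
        fi
    done
}

# ib-report.txt is an arbitrary file name chosen for this example.
collect ib-report.txt ibstat ibstatus sminfo ibdiagnet
```

The resulting file can then be pasted into a reply to the list.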
Noam Bernstein
2018-10-10 16:09:29 UTC
Post by Dave Love
RDMA was just broken in the last-but-one(?) RHEL7 kernel release, in
case that's the problem. (Fixed in 3.10.0-862.14.4.)
I strongly suspect that this is it. In the process of getting everything organized to collect the info various people suggested would be useful, I noticed some kernel package inconsistencies, and when I made them consistent by upgrading to 862.14, it started working. If the problem comes back, I guess I’ll be back here, but for the moment it appears to be working. Thanks to everyone for the suggestions.

Noam
