Discussion:
[OMPI users] How do I build 3.1.0 (or later) with mellanox's libraries
Alan Wild
2018-09-14 15:20:06 UTC
I apologize if this has been discussed before but I've been unable to find
discussion on the topic.

I recently went to build 3.1.2 on our cluster only to have the build
completely fail during configure due to issues with libnl versions.

Specifically, I had requested support for Mellanox's libraries (mxm,
hcoll, sharp, etc.), which worked fine for me in 3.0.0 and 3.0.1. However, it
appears all of those libraries are built with libnl version 1, but the
netlink component now requires netlink version 3 and aborts the build
if it finds anything else in LIBS that uses version 1.

I don't believe Mellanox is providing releases of these libraries linked
against libnl version 3 (I'd love to find out I'm wrong on that), at least not
for CentOS 6.9.

According to GitHub, it appears bwbarret's commit a543e7f (from one year
ago today), which was merged into 3.1.0, is responsible. However, I'm having
a hard time believing that Open MPI would want to break support for these
libraries, or that there isn't some kind of workaround.

I'm on a short timeline to deliver this build of openmpi to my users but I
know they won't accept a build that doesn't support mellanox's libraries.

Hoping there's an easy fix (short of trying to reverse the commit in
my build) that I'm overlooking.

Thanks,

-Alan
Gilles Gouaillardet
2018-09-14 15:44:22 UTC
Alan,

Can you please compress and post your config.log?

My understanding of the mentioned commit is that it does not build the
reachable/netlink component if libnl version 1 is used (by third-party libs
such as mxm).
I do not believe it should abort configure.

Cheers,

Gilles
Alan Wild
2018-09-14 20:35:47 UTC
As requested, I've attached the config.log. I also included the output from
configure itself.

-Alan
Jeff Squyres (jsquyres) via users
2018-09-19 16:12:55 UTC
Alan --

Sorry for the delay.

I agree with Gilles: Brian's commit had to do with "reachable" plugins in Open MPI -- they do not appear to be the problem here.

From the config.log you sent, it looks like configure aborted because you requested UCX support (via --with-ucx) but configure wasn't able to find it. And it looks like it didn't find it because of libnl v1 vs. v3 issues, as you stated.
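For anyone retracing this diagnosis, a minimal sketch of pulling the libnl-related lines out of a config.log is shown here. The function name `libnl_lines` and the sample text are hypothetical; the sample is a made-up stand-in, not the wording Open MPI's configure actually emits.

```python
def libnl_lines(config_log_text):
    """Return (line_number, text) pairs for every line that mentions libnl."""
    hits = []
    for number, line in enumerate(config_log_text.splitlines(), start=1):
        if "libnl" in line.lower():
            hits.append((number, line.strip()))
    return hits


# Made-up stand-in for a config.log; a real log is far longer and worded differently.
SAMPLE_LOG = (
    "configure: starting\n"
    "checking something about libnl... placeholder\n"
    "configure: done\n"
)

for number, line in libnl_lines(SAMPLE_LOG):
    print(number, line)
```

On a real build tree the text would come from reading config.log instead of the sample string.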

I think we're going to have to refer you to Mellanox support on this one. The libnl situation is kind of a nightmare: your entire stack must be compiled for either libnl v1 *or* v3. If you have both libnl v1 *and* v3 appear in a process together, the process will crash before main() even executes. :-( This is precisely why we have these warnings in Open MPI's configure.
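One way to check for the mixture described above is to look at what `ldd` reports for each library in the stack. The sketch below assumes the usual sonames (`libnl.so.1` for libnl v1; `libnl-3.so.200` and siblings such as `libnl-route-3.so.200` for libnl v3). The helper names, paths, and addresses are hypothetical, and the sample strings stand in for real `ldd` output.

```python
import re

# Matches libnl.so.1 (v1) as well as libnl-3.so.200 / libnl-route-3.so.200 (v3).
# Group 1 is "-3" only for the v3 family.
_LIBNL_RE = re.compile(r"\blibnl(?:-[a-z]+)?(-3)?\.so\.\d+")


def libnl_versions(ldd_output):
    """Return the set of libnl major versions (1 and/or 3) named in ldd output."""
    versions = set()
    for line in ldd_output.splitlines():
        match = _LIBNL_RE.search(line)
        if match:
            versions.add(3 if match.group(1) else 1)
    return versions


def mixes_libnl(ldd_outputs):
    """True if the combined outputs reference both libnl v1 and libnl v3."""
    seen = set()
    for text in ldd_outputs:
        seen |= libnl_versions(text)
    return seen == {1, 3}


# Hypothetical ldd output for two libraries; paths and addresses are made up.
V1_LDD = "\tlibnl.so.1 => /lib64/libnl.so.1 (0x00007f0000000000)\n"
V3_LDD = (
    "\tlibnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007f0000100000)\n"
    "\tlibnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00007f0000200000)\n"
)

print(mixes_libnl([V1_LDD, V3_LDD]))
```

In practice each string would be the captured output of running `ldd` on one shared library that ends up in the MPI process.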
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
--
Jeff Squyres
***@cisco.com
Barrett, Brian via users
2018-09-19 22:28:50 UTC
Yeah, there’s no good answer here from an “automatically do the right thing” point of view. The reachable:netlink component (which is used for the TCP BTL) only works with libnl-3 because libnl-1 is a real pain to deal with if you’re trying to parse route behaviors. It will do the right thing if you’re using OpenIB (the other place the libnl-1/libnl-3 thing comes into play) because OpenIB runs its configure test before reachable:netlink, but UCX’s tests run way later (for reasons that aren’t fixable).

Mellanox should really update everything to use libnl3 so that there’s at least hope of getting the right answer (not just in Open MPI, but in general; libnl-1 is old and not awesome). In the meantime, I *think* you can work around this problem via two paths. The first, which I know will work, is to remove the libnl-3 devel package. That’s probably not optimal for obvious reasons. The second is to specify --enable-mca-no-build=reachable-netlink, which will disable the component that is preferring libnl-3, and then UCX should be happy.
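As a rough sketch of the second path, a build script could assemble the configure line as follows. The only flags taken from this thread are `--enable-mca-no-build=reachable-netlink` and `--with-ucx`; the install prefix and the helper name `configure_command` are hypothetical, and whether this is enough for a given Mellanox stack is untested here.

```python
import shlex


def configure_command(prefix, extra_flags=()):
    """Build an Open MPI configure argv that skips the reachable:netlink component."""
    argv = [
        "./configure",
        f"--prefix={prefix}",
        # Keep the libnl-3-preferring component out of the build.
        "--enable-mca-no-build=reachable-netlink",
    ]
    argv.extend(extra_flags)
    return argv


# The prefix below is a made-up example path.
cmd = configure_command("/opt/openmpi-3.1.2", ["--with-ucx"])
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run` from the source directory; `shlex.join` is used only to show the equivalent shell command.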

Hope this helps,

Brian