[OMPI users] btl_openib_if_include
Marshall2, John (SSC/SPC)
2018-04-20 17:03:16 UTC
Hi

I am trying to verify/determine what the proper setting is for btl_openib_ib_include.

Some background:
* openmpi 2.1.1 (and 1.6.5 - yes it is old)
* lxc containers
* SRIOV (virtual functions) being used
* dedicated IB interface (e.g., ib2) per container

Should the mlx4_X:1 correspond to a specific ibY interface? E.g., for ib26, I find
mlx4_13:1 by:
$ ls /sys/class/net/ib26/device/infiniband
mlx4_13

Does the mlx4_X have to be determined at each location where an mpi task
would run? I suppose it would because the ibY is likely to be different.

On some tests, I have found that the setting:
export OMPI_MCA_btl_openib_if_include=mlx4_0:1

provides better performance than not specifying a value or letting mpirun/orted
figure it out at runtime.

Thanks,
John
Jeff Squyres (jsquyres)
2018-04-23 14:44:57 UTC
Post by Marshall2, John (SSC/SPC)
I am trying to verify/determine what the proper setting is for btl_openib_ib_include.
I think you mean btl_openib_if_include ("if" = "interface").
Post by Marshall2, John (SSC/SPC)
* openmpi 2.1.1 (and 1.6.5 - yes it is old)
* lxc containers
* SRIOV (virtual functions) being used
* dedicated IB interface (e.g., ib2) per container
Should the mlx4_X:1 correspond to a specific ibY interface? E.g., for ib26, I find mlx4_13:1 by:
$ ls /sys/class/net/ib26/device/infiniband
mlx4_13
Does the mlx4_X have to be determined at each location where an mpi task
would run? I suppose it would because the ibY is likely to be different.
Open MPI basically probes its environment at run time. In your case, it will find all available IB interfaces (per MPI process), filter them through if_include / if_exclude, and then use whatever is left.
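Since the probe runs in each MPI process, a hard-coded mlx4_X:1 only helps where that name matches the container's own virtual function. Below is a minimal sketch of deriving the value from the ibY interface on each node, reusing the sysfs lookup from the original post. The function name hca_for_netdev and the optional sysfs-root argument are illustrative assumptions, not part of Open MPI, and port 1 is assumed as in the values quoted in this thread.

```shell
# Hypothetical helper: print the "device:port" string for one IPoIB interface.
#   $1 = interface name (e.g. ib26)
#   $2 = sysfs root (assumption: defaults to /sys, overridable for dry runs)
# Port 1 is assumed, matching the mlx4_X:1 values discussed in this thread.
hca_for_netdev() {
    netdev=$1
    sysfs_root=${2:-/sys}
    dir="$sysfs_root/class/net/$netdev/device/infiniband"

    # No such interface, or it is not backed by an InfiniBand device.
    [ -d "$dir" ] || return 1

    # The directory is expected to hold a single entry such as mlx4_13.
    hca=$(ls "$dir" | head -n 1)
    [ -n "$hca" ] || return 1

    printf '%s:1\n' "$hca"
}
```

One possible use is a small wrapper script started by mpirun that runs `export OMPI_MCA_btl_openib_if_include=$(hca_for_netdev ib26)` before exec'ing the real program, so the value is computed inside each container. Whether that suits a given launch setup has not been tested here.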
Post by Marshall2, John (SSC/SPC)
On some tests, I have found that the setting:
export OMPI_MCA_btl_openib_if_include=mlx4_0:1
provides better performance than not specifying a value or letting mpirun/orted
figure it out at runtime.
That's a little surprising.

Do you have more than 1 IB interface? If not, then Open MPI should likely be independently coming to the same conclusion (i.e., "mlx4_0:1"). If it's not, that's weird.
--
Jeff Squyres
***@cisco.com
Marshall2, John (SSC/SPC)
2018-04-23 15:00:04 UTC
Post by Jeff Squyres (jsquyres)
I think you mean btl_openib_if_include ("if" = "interface").
Yes.
Post by Jeff Squyres (jsquyres)
Do you have more than 1 IB interface? If not, then Open MPI should likely be independently coming to the same conclusion (i.e., "mlx4_0:1"). If it's not, that's weird.
Only one ib interface shows up via ifconfig and at /sys/class/net/ibX.

But, under /sys/class/infiniband and /sys/class/infiniband_cm, all the mlx4_Y do show
up. E.g.,

mlx4_0 mlx4_10 mlx4_12 mlx4_14 mlx4_16 mlx4_3 mlx4_5 mlx4_7 mlx4_9

mlx4_1 mlx4_11 mlx4_13 mlx4_15 mlx4_2 mlx4_4 mlx4_6 mlx4_8

I'm not sure if this can be avoided.

So, where is openmpi looking for the available mlx4_Y? Under one of those two directories
or whatever is at /sys/class/net/ibX/device/infiniband/mlx4_Y?
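
The mismatch described here (one ibX interface, many mlx4_Y entries) can be listed with a short sketch like the following. It only compares the two sysfs views and says nothing about what Open MPI itself consults. The function name hcas_without_netdev and the sysfs-root argument are assumptions made for illustration.

```shell
# Hypothetical helper: list devices under <root>/class/infiniband that no
# network interface under <root>/class/net points to.
#   $1 = sysfs root (assumption: defaults to /sys, overridable for dry runs)
hcas_without_netdev() {
    sysfs_root=${1:-/sys}
    for dev_path in "$sysfs_root"/class/infiniband/*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$dev_path" ] || continue
        dev=${dev_path##*/}

        # Look for any interface whose device directory names this HCA.
        found=no
        for link in "$sysfs_root"/class/net/*/device/infiniband/"$dev"; do
            if [ -e "$link" ]; then
                found=yes
            fi
        done

        [ "$found" = yes ] || printf '%s\n' "$dev"
    done
}
```

Run as `hcas_without_netdev` inside a container, every name it prints is a device that is visible in sysfs but has no matching ibX interface there.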

Thanks,
John
Jeff Squyres (jsquyres)
2018-04-23 15:12:50 UTC
Post by Marshall2, John (SSC/SPC)
Only one ib interface shows up via ifconfig and at /sys/class/net/ibX.
But, under /sys/class/infiniband and /sys/class/infiniband_cm, all the mlx4_Y do show
up. E.g.,
mlx4_0 mlx4_10 mlx4_12 mlx4_14 mlx4_16 mlx4_3 mlx4_5 mlx4_7 mlx4_9
mlx4_1 mlx4_11 mlx4_13 mlx4_15 mlx4_2 mlx4_4 mlx4_6 mlx4_8
I'm not sure if this can be avoided.
So, where is openmpi looking for the available mlx4_Y? Under one of those two directories
or whatever is at /sys/class/net/ibX/device/infiniband/mlx4_Y?
It will use whatever devices libibverbs reports back.

It's been quite a while since I've looked in the libibverbs code, but it *might* return all the devices...? What does ibv_devinfo(1) return inside one of your containers? That's probably the same information that is returned to Open MPI programmatically via the libibverbs API.

If libibverbs is returning all devices vs. just the one that is actually available in your container, then that might explain the performance disparity.
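
A sketch of the check suggested above, assuming `ibv_devinfo -l` prints a header line followed by one device name per line. The helper name unexpected_hcas is hypothetical. It reads that listing on standard input and prints every device other than the one the container is supposed to use.

```shell
# Hypothetical helper: filter an "ibv_devinfo -l" listing read from stdin.
#   $1 = the device the container is expected to use (e.g. mlx4_13)
# Assumes line 1 is a header ("N HCAs found:") and each later non-empty
# line carries one device name, possibly indented.
unexpected_hcas() {
    expected=$1
    awk -v want="$expected" 'NR > 1 && NF > 0 && $1 != want { print $1 }'
}
```

For example, `ibv_devinfo -l | unexpected_hcas mlx4_13` inside a container would print nothing if libibverbs reports only mlx4_13, and would list the extra devices otherwise.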
--
Jeff Squyres
***@cisco.com
Marshall2, John (SSC/SPC)
2018-04-23 16:43:32 UTC
Hi,

That gives me an avenue to pursue.

Thanks,
John
