Discussion:
[OMPI users] performance abnormality with openib and tcp framework
Blade Shieh
2018-05-14 01:44:24 UTC
/********** The problem ***********/

I have a cluster with 10GE Ethernet and 100Gb InfiniBand. While running my
application - CAMx - I found that the performance with IB is not as good as
with Ethernet. That is confusing, because IB latency and bandwidth are
undoubtedly better than Ethernet's, as confirmed by the MPI benchmarks
IMB-MPI1 and OSU.



/********** software stack ***********/

CentOS 7.4 with kernel 4.11.0-45.6.1.el7a.aarch64

MLNX_OFED_LINUX-4.3-1.0.1.0 from
http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers

gnu7.3 from OpenHPC release. yum install
gnu7-compilers-ohpc-7.3.0-43.1.aarch64

openmpi3 from OpenHPC release. yum install
openmpi3-gnu7-ohpc-3.0.0-36.4.aarch64

CAMx 6.4.0 from http://www.camx.com/

IMB from https://github.com/intel/mpi-benchmarks

OSU from http://mvapich.cse.ohio-state.edu/benchmarks/





/********** command lines are ********/



(time mpirun --allow-run-as-root -mca btl self,openib -x OMP_NUM_THREADS=2
-n 32 -mca btl_tcp_if_include eth2
../../src/CAMx.v6.40.openMPI.gfortranomp.ompi) > camx_openib_log 2>&1

(time mpirun --allow-run-as-root -mca btl self,tcp -x OMP_NUM_THREADS=2 -n
32 -mca btl_tcp_if_include eth2
../../src/CAMx.v6.40.openMPI.gfortranomp.ompi) > camx_tcp_log 2>&1



(time mpirun --allow-run-as-root -mca btl self,openib -x OMP_NUM_THREADS=2
-n 32 -mca btl_tcp_if_include eth2 IMB-MPI1 allreduce -msglog 8 -npmin
1000) > IMB_openib_log 2>&1

(time mpirun --allow-run-as-root -mca btl self,tcp -x OMP_NUM_THREADS=2 -n
32 -mca btl_tcp_if_include eth2 IMB-MPI1 allreduce -msglog 8 -npmin 1000) >
IMB_tcp_log 2>&1



(time mpirun --allow-run-as-root -mca btl self,openib -x OMP_NUM_THREADS=2
-n 32 -mca btl_tcp_if_include eth2 osu_latency) > osu_openib_log 2>&1

(time mpirun --allow-run-as-root -mca btl self,tcp -x OMP_NUM_THREADS=2 -n
32 -mca btl_tcp_if_include eth2 osu_latency) > osu_tcp_log 2>&1



/********** about openmpi and network config *************/



Please refer to the relevant log files in the attachment.



Best Regards,

Xie Bin
Nathan Hjelm
2018-05-14 02:22:19 UTC
I see several problems:

1) osu_latency only works with two procs.

2) You explicitly excluded shared memory support by specifying only self and openib (or tcp). If you want to just disable tcp or openib, use --mca btl ^tcp or --mca btl ^openib.

Also, it looks like you have multiple ports active that are on different subnets. You can use --mca btl_openib_if_include to set it to use a specific device or devices (e.g. mlx5_0).
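
As a sketch (reusing the interface and device names already mentioned in this thread; adjust to your setup), the benchmark invocations could look like:

mpirun --allow-run-as-root -mca btl ^openib -mca btl_tcp_if_include eth2 -n 2 osu_latency
mpirun --allow-run-as-root -mca btl ^tcp -mca btl_openib_if_include mlx5_0 -n 2 osu_latency

This disables only the transport you are not testing, so self and the shared-memory BTL stay available.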

See this warning:

--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'localhost', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------


-Nathan
Blade Shieh
2018-05-14 09:04:26 UTC
Hi, Nathan:
Thanks for your reply.
1) It was my mistake not to notice the usage of osu_latency. Now it works
well, but openib is still slower.
2) I did not use sm or vader because I wanted to compare the performance of
tcp and openib. Besides, I will run the application on a cluster, so vader is
not so important.
3) Of course, I tried your suggestions. I used ^tcp/^openib and set
btl_openib_if_include to mlx5_0 in a two-node cluster (IB
direct-connected). The result did not change -- IB is still better in the MPI
benchmarks but worse in my application.

Best Regards,
Xie Bin
John Hearns via users
2018-05-14 09:43:34 UTC
Xie Bin, I do hate to ask this. You say "in a two-node cluster (IB
direct-connected)."
Does that mean that you have no IB switch, and that there is a single IB
cable joining up these two servers?
If so, please run: ibstatus, ibhosts, ibdiagnet
I am trying to check whether the IB fabric is functioning properly in that
situation.
(Also need to check if there is a Subnet Manager - so run sminfo.)
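
A possible check sequence (my own sketch; it assumes the standard InfiniBand diagnostic tools shipped with MLNX_OFED are installed on both hosts):

ibstatus     # each port should report state ACTIVE and the expected 100 Gb/sec rate
sminfo       # should report exactly one master subnet manager
ibhosts      # should list the HCAs of both servers
ibdiagnet    # runs a full fabric sanity check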

But you do say that the IMB test gives good results for IB, so you must
have IB working properly.
Therefore I am an idiot...
Blade Shieh
2018-05-15 01:29:08 UTC
Hi, John:

You are right about the network setup. I have no IB switch and just
connect the two servers directly with an IB cable. I did not even start the
opensmd service because it seemed unnecessary in this situation. Could this
be the reason why IB performs worse?

Interconnection details are in the attachment.



Best Regards,

Xie Bin
John Hearns via users
2018-05-15 07:51:01 UTC
Xie, as far as I know you need to run OpenSM even on two hosts.
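
With MLNX_OFED that is typically something like the following on one of the two hosts (a sketch, assuming the opensmd service shipped with MLNX_OFED):

systemctl start opensmd      # or run opensm directly, e.g. opensm -B
sminfo                       # verify that a master subnet manager is now visible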
Gilles Gouaillardet
2018-05-15 08:11:31 UTC
The long story is you always need a subnet manager to initialize
the fabric.

That means you can run the subnet manager and then stop it once each HCA
has been assigned a LID.

In that case, the commands that interact with the SM (ibhosts,
ibdiagnet) will obviously fail.
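
A quick way to check whether a LID was ever assigned (my own sketch, using the device name mentioned earlier in this thread):

ibstat mlx5_0    # look for "State: Active" and a non-zero "Base lid";
                 # "Initializing" with "Base lid: 0" means no subnet manager has run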


Cheers,


Gilles
George Bosilca
2018-05-14 18:05:38 UTC
Shared memory communication is important for multi-core platforms,
especially when you have multiple processes per node. But this is only part
of your issue here.

You haven't specified how your processes will be mapped on your resources.
As a result ranks 0 and 1 will be on the same node, so you are testing the
shared memory support of whatever BTL you allow. In this case the
performance will be much better for TCP than for IB, simply because you are
not using your network, but its capacity to move data across memory banks.
In such an environment, TCP translates to a memcpy plus a system call,
which is much faster than IB. That being said, it should not matter because
shared memory is there to cover this case.

Add "--map-by node" to your mpirun command to measure the bandwidth between
nodes.
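
For instance (a sketch based on the command lines earlier in this thread; add whatever hostfile or --host option you already use for the two nodes):

mpirun --allow-run-as-root -mca btl ^tcp -mca btl_openib_if_include mlx5_0 --map-by node -n 2 osu_latency

With --map-by node, ranks 0 and 1 land on different nodes, so the benchmark exercises the interconnect instead of shared memory.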

George.
Blade Shieh
2018-05-15 01:45:39 UTC
Hi, George:
My command lines are:
1) single node
mpirun --allow-run-as-root -mca btl self,tcp(or openib) -mca
btl_tcp_if_include eth2 -mca btl_openib_if_include mlx5_0 -x
OMP_NUM_THREADS=2 -n 32 myapp
2) 2-node cluster
mpirun --allow-run-as-root -mca btl ^tcp(or ^openib) -mca
btl_tcp_if_include eth2 -mca btl_openib_if_include mlx5_0 -x
OMP_NUM_THREADS=4 -N 16 myapp

In the 2nd case, I used -N, which is equivalent to --map-by node.

Best regards,
Xie Bin
Gilles Gouaillardet
2018-05-15 02:08:42 UTC
Xie Bin,


According to the man page, -N is equivalent to npernode, which is
equivalent to --map-by ppr:N:node.

This is *not* equivalent to --map-by node:

The former packs tasks onto the same node, and the latter scatters tasks
across the nodes.


[***@login ~]$ mpirun --host n0:2,n1:2 -N 2 --tag-output hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n0
[1,2]<stdout>:n1
[1,3]<stdout>:n1


[***@login ~]$ mpirun --host n0:2,n1:2 -np 4 --tag-output -map-by
node hostname | sort
[1,0]<stdout>:n0
[1,1]<stdout>:n1
[1,2]<stdout>:n0
[1,3]<stdout>:n1


I am pretty sure a subnet manager was run at some point in time (so your
HCAs could get their identifiers).

/* feel free to reboot your nodes and see if ibstat still shows the
adapters as active */


Note you might also use --mca pml ob1 in order to make sure neither mxm nor ucx
is used.
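
For example (a sketch based on your two-node command line from earlier in this thread):

mpirun --allow-run-as-root --mca pml ob1 -mca btl ^tcp -mca btl_tcp_if_include eth2 -mca btl_openib_if_include mlx5_0 -x OMP_NUM_THREADS=4 -N 16 myapp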


Cheers,


Gilles
Blade Shieh
2018-05-15 07:40:04 UTC
Hi Gilles,
Thank you for pointing out my error with -N.
And you are right: I had started the opensmd service before, so the link could
come up correctly. But many IB-related commands cannot be executed
correctly, such as ibhosts and ibdiagnet.
As for the PML, I am pretty sure I was using ob1, because ompi_info shows there
is no ucx or mxm, and ob1 has the highest priority.
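
(For reference, a sketch of that check, assuming a default Open MPI 3 install: ompi_info | grep "MCA pml" lists the pml components that were built, e.g. ob1 and cm.)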

Best regards,
Xie Bin