I am sorry if there have been other replies to this over the weekend.
Post by Boris M. Vulovic
Gus, Gilles and John,
Thanks for the help. Let me first post (below) the output from the checkouts:
ibdiagnet
ibhosts
ibstat (for the login node, for now)
What do you think?
Thanks
--Boris
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-bash-4.1$ *ibdiagnet*
----------
/usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH"
env variable)
Plugin Name Result Comment
libibdiagnet_cable_diag_plugin-2.1.1 Succeeded Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1 Succeeded Plugin loaded
---------------------------------------------
Discovery
-E- Failed to initialize
-E- Fabric Discover failed, err=IBDiag initialize wasn't done
-E- Fabric Discover failed, MAD err=Failed to register SMI class
---------------------------------------------
Summary
-I- Stage Warnings Errors Comment
-I- Discovery NA
-I- Lids Check NA
-I- Links Check NA
-I- Subnet Manager NA
-I- Port Counters NA
-I- Nodes Information NA
-I- Speed / Width checks NA
-I- Partition Keys NA
-I- Alias GUIDs NA
-I- Temperature Sensing NA
-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/
ibdiagnet2.log
-E- A fatal error occurred, exiting...
-bash-4.1$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-bash-4.1$ *ibhosts*
ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
src/ibnetdisc.c:766; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
-bash-4.1$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-bash-4.1$ *ibstat*
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.2020
Hardware version: 0
Node GUID: 0x248a0703005abb1c
System image GUID: 0x248a0703005abb1c
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x3c010000
Port GUID: 0x268a07fffe5abb1c
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.2020
Hardware version: 0
Node GUID: 0x248a0703005abb1d
System image GUID: 0x248a0703005abb1c
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x3c010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_2'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.2020
Hardware version: 0
Node GUID: 0x248a0703005abb30
System image GUID: 0x248a0703005abb30
State: Down
Physical state: Disabled
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x3c010000
Port GUID: 0x268a07fffe5abb30
Link layer: Ethernet
CA 'mlx5_3'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.2020
Hardware version: 0
Node GUID: 0x248a0703005abb31
System image GUID: 0x248a0703005abb30
State: Down
Physical state: Disabled
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x3c010000
Port GUID: 0x268a07fffe5abb31
Link layer: Ethernet
-bash-4.1$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users <
Post by John Hearns via users
Boris, as Gilles says - first do some lower-level checkouts of your
Infiniband network.
ibdiagnet
ibhosts
and then as Gilles says 'ibstat' on each node
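One part of that per-node check - "all nodes should have the same SM lid" - is easy to script. Below is a minimal sketch; the node names and the passwordless-ssh collection loop in the trailing comment are my assumptions, not something established in this thread:

```shell
# Sketch: check that every node reports the same SM lid.
# Reads "hostname sm_lid" pairs on stdin; exits non-zero on a mismatch.
same_sm_lid() {
  awk '
    NR == 1     { first = $2 }   # remember the first lid seen
    $2 != first { bad = 1 }      # flag any node that disagrees
    END         { exit bad }     # exit status 0 = consistent
  '
}
# Hypothetical collection loop (assumes passwordless ssh to node01..node05):
#   for h in node01 node02 node03 node04 node05; do
#     echo "$h $(ssh "$h" "ibstat | awk '/SM lid:/ {print \$3; exit}'")"
#   done | same_sm_lid && echo "SM lids consistent"
```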
Post by Gilles Gouaillardet
Boris,
Open MPI should automatically detect the InfiniBand hardware, and use
openib (and *not* tcp) for inter-node communications
and a shared-memory-optimized btl (e.g. sm or vader) for intra-node
communications.
Note: if you pass "-mca btl openib,self", you tell Open MPI to use the
openib btl between any tasks,
including tasks running on the same node (which is less efficient than
using sm or vader).
First, I suggest you make sure InfiniBand is up and running on all
your nodes.
(Just run ibstat: at least one port should be listed, its state should be
Active, and all nodes should have the same SM lid.)
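Those three ibstat fields (state, SM lid, link layer) can be pulled out with a short filter. A sketch, assuming the output format shown in Boris's paste above:

```shell
# Sketch: summarize each CA from `ibstat` output (state, SM lid, link layer).
# Field positions assumed from the ibstat paste earlier in the thread.
summarize_ibstat() {
  awk '
    /^CA /        { ca = $2 }      # CA name, quotes included
    /State:/      { state = $2 }   # matches "State:", not "Physical state:"
    /SM lid:/     { smlid = $3 }
    /Link layer:/ { print ca, state, "sm_lid=" smlid, $3 }
  '
}
# Usage, on each node: ibstat | summarize_ibstat
```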
Then try to run two tasks on two nodes.
If this does not work, you can run
mpirun --mca btl_base_verbose 100 ...
and post the logs so we can investigate from there.
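Put together, that two-node test could look something like the following. This is a sketch only: the install path, hostfile, and program name are taken from Boris's original commands in this thread, and the btl list is one plausible choice, not a confirmed fix:

```shell
# Sketch: two tasks on two nodes, with verbose BTL selection logging.
# Paths/hostfile from Boris's original command; the btl list is an assumption.
/usr/local/open-mpi/1.10.7/bin/mpiexec \
    --mca btl openib,vader,self \
    --mca btl_base_verbose 100 \
    --hostfile hostfile5 -host node01,node02 \
    -n 2 DoWork 2>&1 | tee btl_verbose.log
```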
Cheers,
Gilles
Post by Boris M. Vulovic
I would like to know how to invoke InfiniBand hardware on a CentOS 6.x
cluster with OpenMPI (static libs.) for running my C++ code. This is how I
compile and run it:
/usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
-Bstatic main.cpp -o DoWork
/usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
Here, "*-mca btl tcp,self*" reveals that *TCP* is used, and the cluster
has InfiniBand.
What should be changed in the compile and run commands for InfiniBand
to be invoked? If I just replace "*-mca btl tcp,self*" with "*-mca btl
/At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated that
it can be used to communicate between these processes. This is an error;
Open MPI requires that all MPI processes be able to reach each other. This
error can sometimes be the result of forgetting to specify the "self" BTL./
Thanks very much!!!
*Boris *
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
--
*Boris M. Vulovic*