Orion Poplawski
2017-02-27 22:49:06 UTC
We have a couple nodes with different IB adapters in them:
font1/var/log/lspci:03:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev 20)
font2/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 InfiniBand HCA [1077:7220] (rev 02)
font3/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 InfiniBand HCA [1077:7220] (rev 02)
With Open MPI 1.10.3 we saw the following error from mpirun:
[font2.cora.nwra.com:13982] [[23220,1],10] selected pml cm, but peer
[[23220,1],0] on font1 selected pml ob1
which crashed MPI_Init.
We worked around this by passing "--mca pml ob1". With Open MPI 2.0.2 and
without that option I no longer see the error, but the MPI program hangs
shortly after startup. Re-adding the option makes it work, so I assume the
underlying problem is the same and Open MPI has simply stopped reporting it.
Thoughts?
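
For anyone hitting the same mismatch, here is a minimal sketch of the workaround set through the environment instead of on each command line. It relies on Open MPI reading MCA parameters from OMPI_MCA_<name> environment variables; the run_job helper and the --host list are placeholders for illustration (only the font1/font2/font3 hostnames come from the lspci output above), not the exact command we run.

```shell
# Force every rank to use the ob1 PML, the same effect as "--mca pml ob1".
# Open MPI picks up MCA parameters from OMPI_MCA_<param> environment variables.
export OMPI_MCA_pml=ob1

# Hypothetical launcher: pass the MPI program and its arguments as "$@".
# The host list is an assumption based on the three nodes named in this post.
run_job() {
    mpirun --host font1,font2,font3 "$@"
}
```

Calling run_job ./my_mpi_program (a hypothetical binary name) should then behave like the explicit "--mca pml ob1" invocation, assuming mpirun forwards the variable to the remote nodes.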
--
Orion Poplawski
Technical Manager 720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane ***@nwra.com
Boulder, CO 80301 http://www.nwra.com