Discussion:
[OMPI users] device failed to appear .. Connection timed out
Daniele Tartarini
2016-12-08 14:16:51 UTC
Hi,

I've installed the Open MPI package distributed via yum on Red Hat 7.2:

openmpi-devel.x86_64 1.10.3-3.el7

With any code I try to run (including the mpitests-*), I get the following
message, with slight variants:

my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device
failed to appear after 15.0 seconds: Connection timed out

Can anyone help me identify the source of the problem?
Note that /dev/hfi1_0 doesn't exist.

If I use an Open MPI version compiled from source (gcc 4.8.5), I have no
issue.

many thanks in advance.

cheers
Daniele
r***@open-mpi.org
2016-12-08 15:45:30 UTC
Sounds like something didn’t quite get configured right, or maybe you have a library installed that isn’t quite set up correctly, or...

Regardless, we generally advise building from source to avoid such problems. Is there some reason not to just do so?
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Howard Pritchard
2016-12-08 17:22:55 UTC
hello Daniele,

Could you post the output of the ompi_info command? I notice that the RPMs
that came with the RHEL 7.2 distro on one of our systems were built to
support psm2/hfi1.
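
For anyone following along, a minimal sketch of how the ompi_info output could be narrowed down to the PSM-related components. It assumes ompi_info is on PATH; the helper name list_psm_components is made up for illustration, and the snippet prints nothing when Open MPI is not installed or no such component was built.

```shell
# List any PSM/PSM2 components this Open MPI build was compiled with.
# list_psm_components is a hypothetical helper name, not an Open MPI tool.
list_psm_components() {
    if command -v ompi_info >/dev/null 2>&1; then
        # "|| true" keeps the function from failing when grep finds nothing.
        ompi_info | grep -i 'psm' || true
    fi
}

list_psm_components
```

If a line mentioning an mtl component named psm2 shows up, the build can try to open the PSM2 library at startup.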

Two things. First, could you try running your applications with

mpirun --mca pml ob1 (all the rest of your args)

and see if that works?
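
A sketch of two common ways to apply that setting. The application name ./my_app and the rank count of 4 are placeholders, not taken from this thread. Forcing the ob1 PML keeps Open MPI from selecting the cm PML, which is the one that drives MTLs such as psm2.

```shell
# One-off, on the command line (placeholder application and rank count):
#   mpirun --mca pml ob1 -np 4 ./my_app

# Or set the same MCA parameter through the environment, so that every
# mpirun started from this shell picks it up.
export OMPI_MCA_pml=ob1

# Launch only if mpirun is on PATH and the placeholder binary exists.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    mpirun -np 4 ./my_app
fi
```

The environment-variable form is handy for batch scripts, while the command-line form makes the override visible in the job's command history.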

Second, what sort of system are you using? Is this a cluster? If it is,
you may want to check whether it has an Omni-Path interconnect with the
psm2/hfi1 packages installed but, for some reason, the Omni-Path HCAs
themselves are not active.

On one of our Omni-Path systems, the following hfi1-related RPMs are installed:

hfidiags-0.8-13.x86_64
hfi1-psm-devel-0.7-244.x86_64
libhfi1verbs-0.5-16.el7.x86_64
hfi1-psm-0.7-244.x86_64
hfi1-firmware-0.9-36.noarch
hfi1-psm-compat-0.7-244.x86_64
libhfi1verbs-devel-0.5-16.el7.x86_64
hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
hfi1-firmware_debug-0.9-36.noarc
hfi1-diagtools-sw-0.8-13.x86_64
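
A list like the one above can probably be reproduced on another RPM-based machine with something along these lines. The helper name list_pkgs and the search pattern are illustrative assumptions; the snippet prints nothing if rpm is unavailable or nothing matches.

```shell
# Print installed packages whose names match a case-insensitive pattern.
# list_pkgs is a hypothetical helper name.
list_pkgs() {
    pattern=$1
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa | grep -i -E "$pattern" || true
    fi
}

# Look for both the hfi1 driver stack and the PSM libraries.
list_pkgs 'hfi|psm'
```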


Howard
Cabral, Matias A
2016-12-08 17:55:25 UTC
Post by Daniele Tartarini
Anyway, /dev/hfi1_0 doesn't exist.
Make sure you have the hfi1 module/driver loaded.
In addition, please use `opainfo` to confirm that the links are in the active state on all the nodes.
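
A rough sketch of the first check, assuming a Linux system. It reads /proc/modules (the same information lsmod shows) and tests for the device node named in the error message; the function name hfi1_status is made up for illustration.

```shell
# Report whether the hfi1 kernel module is loaded and whether the
# device node exists. The device path defaults to the one from the error.
hfi1_status() {
    dev=${1:-/dev/hfi1_0}
    if [ -r /proc/modules ] && grep -q '^hfi1 ' /proc/modules; then
        echo "module: loaded"
    else
        echo "module: not loaded"
    fi
    if [ -e "$dev" ]; then
        echo "device: present"
    else
        echo "device: missing"
    fi
}

hfi1_status
```

On a machine without Omni-Path hardware, both lines are expected to come back negative, which points at the software stack rather than at the fabric.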

_MAC

Daniele Tartarini
2016-12-08 21:17:01 UTC
Hi,
many thanks for your reply.

I have an Intel S2600IP motherboard. It is a standalone server, and I cannot
see any Omni-Path device, so there are no such modules either.
opainfo is not available on my system.

Am I missing anything?
cheers
Daniele
--
Daniele Tartarini

Post-Doctoral Research Associate
Dept. Mechanical Engineering &
INSIGNEO, institute for *in silico* medicine,
University of Sheffield, Sheffield, UK
linkedIn <http://uk.linkedin.com/in/danieletartarini>
Howard Pritchard
2016-12-08 21:45:29 UTC
Hi Daniele,

I bet this psm2 got installed as part of MPSS 3.7. I see something in the
readme for that about installing MPSS with OFED support.
I think if you want to go the route of using the RHEL Open MPI RPMs, you
could use the mca-params.conf file approach to disable the use of psm2.

This file, and a lot of other information about MCA parameters, is described here:

https://www.open-mpi.org/faq/?category=tuning
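
As a minimal sketch of that approach, the snippet below writes a per-user MCA parameter file. It uses `pml = ob1`, mirroring the command-line setting suggested earlier in the thread; the helper name write_pml_ob1 is hypothetical, and the default location $HOME/.openmpi/mca-params.conf is the per-user file Open MPI reads at startup.

```shell
# Append "pml = ob1" to an Open MPI per-user parameter file, once.
# write_pml_ob1 is a made-up helper; the optional argument overrides
# the directory (default: $HOME/.openmpi).
write_pml_ob1() {
    conf_dir=${1:-"$HOME/.openmpi"}
    conf_file="$conf_dir/mca-params.conf"
    mkdir -p "$conf_dir" || return 1
    # Skip if a pml setting is already present, so re-running is harmless.
    if ! grep -q '^pml[[:space:]]*=' "$conf_file" 2>/dev/null; then
        printf '%s\n' \
            '# Use the ob1 PML so the PSM2 path is not tried' \
            'pml = ob1' >> "$conf_file"
    fi
}

# Usage (uncomment to write to ~/.openmpi/mca-params.conf):
# write_pml_ob1
```

A setting given on the mpirun command line still takes precedence over the file, so individual runs can override it.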

Alternatively, you could try and build/install Open MPI yourself from the
download page:

https://www.open-mpi.org/software/ompi/v1.10/

The simplest solution, provided you are confident that nothing is using
the PSM2 software, would be to just use yum to uninstall the psm2 RPM.
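
One way to gain that confidence might be a dry-run removal before touching anything. The sketch below assumes the package is called libpsm2 (the name reported elsewhere in this thread) and uses a made-up helper name, check_removable; `rpm -e --test` goes through the motions without uninstalling and may need root to take the RPM database lock.

```shell
# Dry-run removal of a package: rpm reports any installed packages that
# still depend on it, without changing the system.
check_removable() {
    pkg=${1:-libpsm2}
    if ! command -v rpm >/dev/null 2>&1; then
        echo "rpm not available"
        return 0
    fi
    if ! rpm -q "$pkg" >/dev/null 2>&1; then
        echo "$pkg is not installed"
        return 0
    fi
    rpm -e --test "$pkg" && echo "$pkg can be removed"
}

check_removable libpsm2

# If the dry run is clean, the actual removal would be (as root):
#   yum remove libpsm2
```

yum also lists every dependent package it intends to remove and asks for confirmation, so the transaction summary is worth reading before answering yes.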

Good luck,

Howard
Daniele Tartarini
2016-12-08 21:00:52 UTC
Hi Howard,
Post by Howard Pritchard
hello Daniele,
Could you post the output from ompi_info command? I'm noticing on the
RPMS that came with the rhel7.2 distro on
one of our systems that it was built to support psm2/hfi-1.
Please find attached the output of ompi_info.
Post by Howard Pritchard
Two things, could you try running applications with
mpirun --mca pml ob1 (all the rest of your args)
and see if that works?
It works without complaining!
Post by Howard Pritchard
Second, what sort of system are you using? Is this a cluster? If it is,
you may want to check whether
you have a situation where its an omnipath interconnect and you have the
psm2/hfi1 packages installed
but for some reason the omnipath HCAs themselves are not active.
hfidiags-0.8-13.x86_64
hfi1-psm-devel-0.7-244.x86_64
libhfi1verbs-0.5-16.el7.x86_64
hfi1-psm-0.7-244.x86_64
hfi1-firmware-0.9-36.noarch
hfi1-psm-compat-0.7-244.x86_64
libhfi1verbs-devel-0.5-16.el7.x86_64
hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
hfi1-firmware_debug-0.9-36.noarc
hfi1-diagtools-sw-0.8-13.x86_64
The machine is a dual-processor server with GPUs and an Intel Xeon Phi attached.
MPSS 3.7 is installed.
The Xeon Phi is a 3120A (Knights Corner), so it should not have Omni-Path.

I have no hfi package installed, but I do have

libpsm2.x86_64 10.2.33-1.el7


Any ideas?

cheers
daniele