[OMPI users] KNEM errors when running OMPI 2.0.1

Discussion:

Juan A. Cordero Varelaq

2017-01-17 12:16:05 UTC

Hi, I am running on my SCG cluster the following script (using qsub):

#!/bin/bash
#$-cwd
#$ -S /bin/bash
#$ -V
#$ -q normal
#$ -pe mpi 40
#$ -P Lab219
#$ -o output
#$ -e error
module load PhyML/3.3
mpirun --mca pml yalla -np 40 phyml-mpi -i proteic -b 10 -d aa

where phyml-mpi is the parallel version for OMPI of the program PhyML.
The --mca pml yalla option is passed in order to use MXM (I have Mellanox OFED).

It gives me lots of errors related to KNEM (see error and output files
from qsub in the attachments). However, I specified the KNEM directory
when installing OMPI. I can't really understand such errors, and would
appreciate any hint on this issue. I have run Open MPI on a script of my own
(just a loop that runs something like: command --help) and got no error.

Thanks in advance

Joshua Ladd

2017-01-17 16:55:50 UTC

Can you please attach your configure log? It looks like both MXM and the
Vader BTL (used for OSC) are complaining because they can't find your KNEM
installation.

Josh

_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Juan A. Cordero Varelaq

2017-01-18 09:37:42 UTC

Sure, I have attached the config.log from the Open MPI installation.


Gilles Gouaillardet

2017-01-18 10:36:35 UTC

Juan,

You also need to make sure knem and MOFED drivers are loaded on all the
compute nodes.
You also need to double-check the permissions of /dev/knem.

Cheers,

Gilles
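
The device check suggested above could be scripted roughly as follows. This is a minimal sketch, not something posted in the thread: the function name check_knem_device and the status words it prints are made up for illustration, and only the /dev/knem path comes from the discussion. It inspects the local machine only, so it would have to be run on every compute node (for example through the scheduler).

```shell
#!/bin/sh
# Report whether a device node exists and is usable by the current user.
# Defaults to /dev/knem (the path mentioned in the thread); the function
# name and the status words are illustrative choices, not a KNEM tool.
check_knem_device() {
    _dev="${1:-/dev/knem}"
    if [ ! -e "$_dev" ]; then
        echo "missing"          # module probably not loaded on this node
    elif [ ! -c "$_dev" ]; then
        echo "not-a-device"     # path exists but is not a character device
    elif [ -r "$_dev" ] && [ -w "$_dev" ]; then
        echo "ok"               # current user can open it read/write
    else
        echo "no-access"        # exists, but permissions are too strict
    fi
}

echo "$(uname -n): /dev/knem is $(check_knem_device)"
```

A result of "missing" points at the module not being loaded, while "no-access" points at the permissions of the device node.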


Juan A. Cordero Varelaq

2017-01-18 13:36:57 UTC

Hi,

knem and MOFED drivers are installed in /opt:

* /opt/knem-1.1.90mlnx2
* /opt/mellanox/fca
* /opt/mellanox/mxm
* /opt/mellanox/openshmem

However, /dev/knem does not exist.

Cheers,

Juan


Gilles Gouaillardet

2017-01-18 14:08:39 UTC

Juan,

So you need to load the knem module:
    sudo modprobe knem
and then you can check that it is correctly loaded with lsmod.

Loading the module should automagically create /dev/knem, but maybe not
with the permissions you expect.

Cheers,

Gilles
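
To confirm that the module is loaded after modprobe, the lsmod listing can be searched for an exact module name. A small sketch, relying on lsmod printing the module name in its first column; the helper name module_listed is hypothetical and not part of any tool mentioned in the thread.

```shell
#!/bin/sh
# Succeeds if the module named $1 appears in the first column of
# lsmod-style text read from stdin. (module_listed is an illustrative name.)
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

if lsmod 2>/dev/null | module_listed knem; then
    echo "knem is loaded"
    ls -l /dev/knem    # shows owner, group and mode of the device node
else
    echo "knem is not loaded"
fi
```

Matching on the first column avoids false positives from other modules that merely list knem in their "Used by" column.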


Juan A. Cordero Varelaq

2017-01-18 14:16:30 UTC

Hi,

When I try sudo modprobe knem, I get:
FATAL: Error inserting knem (/lib/modules/3.13.0-37-generic/updates/dkms/knem.ko): Invalid module format

Cheers,

Juan
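
An "Invalid module format" error from modprobe commonly indicates that the .ko file was built against a different kernel than the one currently running, although the thread does not confirm that this is the cause here. A hedged sketch of one way to compare the two, using uname -r and the vermagic field reported by modinfo; the helper name vermagic_matches is an assumption made for illustration. The kernel log (dmesg) usually records the precise reason for the rejection as well.

```shell
#!/bin/sh
# Succeeds if the kernel release in $1 equals the first word of the
# vermagic string in $2. (vermagic_matches is an illustrative helper name.)
vermagic_matches() {
    [ "$1" = "${2%% *}" ]
}

running=$(uname -r)
built_for=$(modinfo -F vermagic knem 2>/dev/null)

if [ -z "$built_for" ]; then
    echo "could not read vermagic for knem"
elif vermagic_matches "$running" "$built_for"; then
    echo "knem was built for the running kernel ($running)"
else
    echo "mismatch: running $running, module built for ${built_for%% *}"
fi
```

If the two versions differ, rebuilding the module against the headers of the running kernel would be the usual next step.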


Juan A. Cordero Varelaq

2017-01-23 14:15:55 UTC

Hi,

I have run the same command with KNEM disabled (with -mca
btl_sm_use_knem 0), and it still gives me the same errors. Furthermore,
I've noticed that when I run, for instance, ompi_info, I get one of the
same errors as when running mpirun:

Warning: Conflicting CPU frequencies detected, using: 2001.000000

Does anyone know what may be happening?

Cheers
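
One thing that may be worth checking: by its name, btl_sm_use_knem belongs to the sm BTL, whereas the earlier reply in this thread pointed at MXM and the Vader BTL, so the setting might not reach the components that print the errors. The sketch below lists whatever KNEM-related MCA parameters the local build reports; knem_lines is a hypothetical helper, and the output depends entirely on the installation.

```shell
#!/bin/sh
# Print only the lines of text on stdin that mention KNEM (case-insensitive).
# Returns success even when nothing matches, so it is safe in pipelines.
# (knem_lines is an illustrative helper name.)
knem_lines() {
    grep -i 'knem' || true
}

if command -v ompi_info >/dev/null 2>&1; then
    # --level 9 asks ompi_info to show parameters of every level
    ompi_info --param all all --level 9 | knem_lines
else
    echo "ompi_info not found in PATH"
fi
```

Parameters whose names start with a different component prefix than btl_sm_ would need to be set separately.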
