Discussion:
[OMPI users] (no subject)
Ioannis Botsis
2017-05-15 09:03:10 UTC
Permalink
Hi

I am trying to run the following simple demo on a cluster of two nodes:

----------------------------------------------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}
-------------------------------------------------------------------------------------------------
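[Editor's note: for reference, a demo like this is typically built with the Open MPI wrapper compiler and launched with mpirun; the executable and host names below are placeholders, not taken from the original post.]

```shell
# Compile with the Open MPI wrapper compiler
mpicc -o mpi_hello mpi_hello.c

# Launch two processes across two nodes (host names are examples)
mpirun -np 2 --host node1,node2 ./mpi_hello
```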

I always get the following message:

------------------------------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------------------------------

Any hint?

Ioannis Botsis
g***@rist.or.jp
2017-05-15 10:46:54 UTC
Permalink
Ioannis,

### What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git
branch name and hash, etc.)



### Describe how Open MPI was installed (e.g., from a source/
distribution tarball, from a git clone, from an operating system
distribution package, etc.)



### Please describe the system on which you are running

* Operating system/version:
* Computer hardware:
* Network type:

Also, what happens if you run

mpirun --mca shmem_base_verbose 100 ...


Cheers,

Gilles
----- Original Message -----
[quoted text trimmed]
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Ioannis Botsis
2017-05-15 18:59:56 UTC
Permalink
Hi Gilles

Thank you for your prompt response.

Here is some information about the system

Ubuntu 16.04 server
Linux-4.4.0-75-generic-x86_64-with-Ubuntu-16.04-xenial

On HP PROLIANT DL320R05 Generation 5, 4GB RAM, 4x120GB raid-1 HDD, 2
ethernet ports 10/100/1000
HP StorageWorks 70 Modular Smart Array with 14x120GB HDD (RAID-5)

44 HP Proliant BL465c server blade, double AMD Opteron Model 2218(2.6GHz,
2MB, 95W), 4 GB RAM, 2 NC370i Multifunction Gigabit Servers Adapters, 120GB

The users' area is shared with the nodes.

The ssh and Torque 6.0.2 services work fine.

Torque and Open MPI 2.1.0 are installed from tarballs. configure
--prefix=/storage/exp_soft/tuc was used for the deployment of Open MPI
2.1.0; after make and make install, the binaries, libraries, and include
files of Open MPI 2.1.0 are located under /storage/exp_soft/tuc .

/storage is a shared file system for all the nodes of the cluster

$PATH:
/storage/exp_soft/tuc/bin
/storage/exp_soft/tuc/sbin
/storage/exp_soft/tuc/torque/bin
/storage/exp_soft/tuc/torque/sbin
/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin
/snap/bin


LD_LIBRARY_PATH=/storage/exp_soft/tuc/lib

C_INCLUDE_PATH=/storage/exp_soft/tuc/include
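[Editor's note: settings like these are usually exported where non-interactive shells on the compute nodes also read them, e.g. near the top of ~/.bashrc before any interactive-shell guard. This is a general note on shared-prefix Open MPI installs, not a confirmed cause of the failure in this thread.]

```shell
# Example ~/.bashrc fragment for an Open MPI install under a shared prefix
export PATH=/storage/exp_soft/tuc/bin:$PATH
export LD_LIBRARY_PATH=/storage/exp_soft/tuc/lib:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=/storage/exp_soft/tuc/include
```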

I also use JupyterHub (with the cluster tab enabled) as a user interface
to the cluster. After the installation of Python and some dependencies,
MPICH and Open MPI are also installed in the system directories.

----------------------------------------------------------------------------
mpirun --allow-run-as-root --mca shmem_base_verbose 100 ...

[se01.grid.tuc.gr:19607] mca: base: components_register: registering
framework shmem components
[se01.grid.tuc.gr:19607] mca: base: components_register: found loaded
component sysv
[se01.grid.tuc.gr:19607] mca: base: components_register: component sysv
register function successful
[se01.grid.tuc.gr:19607] mca: base: components_register: found loaded
component posix
[se01.grid.tuc.gr:19607] mca: base: components_register: component posix
register function successful
[se01.grid.tuc.gr:19607] mca: base: components_register: found loaded
component mmap
[se01.grid.tuc.gr:19607] mca: base: components_register: component mmap
register function successful
[se01.grid.tuc.gr:19607] mca: base: components_open: opening shmem
components
[se01.grid.tuc.gr:19607] mca: base: components_open: found loaded component
sysv
[se01.grid.tuc.gr:19607] mca: base: components_open: component sysv open
function successful
[se01.grid.tuc.gr:19607] mca: base: components_open: found loaded component
posix
[se01.grid.tuc.gr:19607] mca: base: components_open: component posix open
function successful
[se01.grid.tuc.gr:19607] mca: base: components_open: found loaded component
mmap
[se01.grid.tuc.gr:19607] mca: base: components_open: component mmap open
function successful
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: Auto-selecting shmem
components
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Querying
component (run-time) [sysv]
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Query of
component [sysv] set priority to 30
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Querying
component (run-time) [posix]
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Query of
component [posix] set priority to 40
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Querying
component (run-time) [mmap]
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Query of
component [mmap] set priority to 50
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Selected
component [mmap]
[se01.grid.tuc.gr:19607] mca: base: close: unloading component sysv
[se01.grid.tuc.gr:19607] mca: base: close: unloading component posix
[se01.grid.tuc.gr:19607] shmem: base: best_runnable_component_name:
Searching for best runnable component.
[se01.grid.tuc.gr:19607] shmem: base: best_runnable_component_name: Found
best runnable component: (mmap).
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).

Node: se01
Executable: ...
--------------------------------------------------------------------------
2 total processes failed to start
[se01.grid.tuc.gr:19607] mca: base: close: component mmap closed
[se01.grid.tuc.gr:19607] mca: base: close: unloading component mmap


jb


-----Original Message-----
From: users [mailto:users-***@lists.open-mpi.org] On Behalf Of
***@rist.or.jp
Sent: Monday, May 15, 2017 1:47 PM
To: Open MPI Users <***@lists.open-mpi.org>
Subject: Re: [OMPI users] (no subject)

[quoted text trimmed]
Gilles Gouaillardet
2017-05-16 06:42:28 UTC
Permalink
Thanks for all the information,


What I meant by

mpirun --mca shmem_base_verbose 100 ...

is that you modify your mpirun command line (or your Torque script, if
applicable) and add

--mca shmem_base_verbose 100

right after mpirun.
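[Editor's note: concretely, the MCA option goes between mpirun and the executable; a sketch with a placeholder executable name and process count.]

```shell
# MCA options go immediately after mpirun, before the executable
mpirun --mca shmem_base_verbose 100 -np 2 ./mpitest
```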


Cheers,


Gilles
Post by Ioannis Botsis
[quoted text trimmed]
Ioannis Botsis
2017-05-16 07:35:07 UTC
Permalink
Hi Gilles

Here is the PBS script:

---------------------------------------------------------------------------------------------------------------------

#PBS -N mpitest
#PBS -l nodes=2
#PBS -q tuc
#PBS -m abe -M ***@isc.tuc.gr
#PBS -k oe

which mpirun
mpirun --mca shmem_base_verbose 100 mpitest
-----------------------------------------------------------------------------------------------------------------------
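[Editor's note: Torque starts jobs in the user's home directory, so a bare executable name may not resolve there. The variant below changes to the submission directory first; this is an assumption about the intended layout, not a confirmed fix for the failure in this thread, and the mail directives are omitted.]

```shell
#PBS -N mpitest
#PBS -l nodes=2
#PBS -q tuc
#PBS -k oe

# Torque starts the job in $HOME; move to where the job was submitted
cd "$PBS_O_WORKDIR"

which mpirun
mpirun --mca shmem_base_verbose 100 ./mpitest
```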

The two e-mails I receive:

-----------------------------------------------------------------------------------------------------------------------

PBS Job Id: 137.se01.grid.tuc.gr
Job Name: mpitest
Exec host: wn002.grid.tuc.gr/0+wn001.grid.tuc.gr/0
Begun execution


PBS Job Id: 137.se01.grid.tuc.gr
Job Name: mpitest
Exec host: wn002.grid.tuc.gr/0+wn001.grid.tuc.gr/0
Execution terminated
Exit_status=1
resources_used.cput=00:00:00
resources_used.energy_used=0
resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:01
Error_Path: se01.grid.tuc.gr:/storage/tuclocal/jb/mpitest.e137
Output_Path: se01.grid.tuc.gr:/storage/tuclocal/jb/mpitest.o137

------------------------------------------------------------------------------------------------------------

The error file:

---------------------------------------------------------------------------------------------------------------
[wn002.grid.tuc.gr:28619] mca: base: components_register: registering
framework shmem components
[wn002.grid.tuc.gr:28619] mca: base: components_open: opening shmem
components
[wn002.grid.tuc.gr:28619] shmem: base: runtime_query: Auto-selecting
shmem components
[wn002.grid.tuc.gr:28619] shmem: base: runtime_query: (shmem) No
component selected!
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------

The output file:

-----------------------------------------------------------------------------------------------------------

/storage/exp_soft/tuc/bin/mpirun
----------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------

When I start two engines from the IPython Clusters tab in Jupyter,
the PBS script is:
---------------------------------------------------------------------------------------------------------------
#PBS -N ipengine
#PBS -l nodes=2:ppn=4
#PBS -q tuc
#PBS -m abe -M ***@isc.tuc.gr
#PBS -k oe

mpirun --mca shmem_base_verbose 100 -n 2 ipengine3
--profile-dir=/storage/tuclocal/jb/.ipython/profile_pbs --ip=147.27.48.3
---------------------------------------------------------------------------------------------------------------------

The two e-mails I receive:

-----------------------------------------------------------------------------------------------------------------------

PBS Job Id: 138.se01.grid.tuc.gr

Job Name: ipengine
Exec host: wn002.grid.tuc.gr/0-3+wn001.grid.tuc.gr/0-3
Begun execution


PBS Job Id: 138.se01.grid.tuc.gr
Job Name: ipengine
Exec host: wn002.grid.tuc.gr/0-3+wn001.grid.tuc.gr/0-3
Execution terminated
Exit_status=1
resources_used.cput=00:00:00
resources_used.energy_used=0
resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:01
Error_Path: se01.grid.tuc.gr:/storage/tuclocal/jb/.ipython/ipengine.e138
Output_Path: se01.grid.tuc.gr:/storage/tuclocal/jb/.ipython/ipengine.o138
----------------------------------------------------------------------------------

The error file:

--------------------------------------------------------------------------------------------------------------------
[wn002.grid.tuc.gr:28874] mca: base: components_register: registering
framework shmem components
[wn002.grid.tuc.gr:28874] mca: base: components_open: opening shmem
components
[wn002.grid.tuc.gr:28874] shmem: base: runtime_query: Auto-selecting
shmem components
[wn002.grid.tuc.gr:28874] shmem: base: runtime_query: (shmem) No
component selected!
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------

jb
Post by Gilles Gouaillardet
[quoted text trimmed]
Ioannis Botsis
2017-05-16 07:43:47 UTC
Permalink
Gilles

As extra information, I am sending you the config.log from the tarball
directory of openmpi-2.1.0.

jb
Post by Gilles Gouaillardet
Thanks for all the information,
what i meant by
mpirun --mca shmem_base_verbose 100 ...
is really you modify your mpirun command line (or your torque script
if applicable) and add
--mca shmem_base_verbose 100
right after mpirun
Cheers,
Gilles
Post by Ioannis Botsis
Hi Gilles
Thank you for your prompt response.
Here is some information about the system
Ubuntu 16.04 server
Linux-4.4.0-75-generic-x86_64-with-Ubuntu-16.04-xenial
On HP PROLIANT DL320R05 Generation 5, 4GB RAM, 4x120GB raid-1 HDD, 2
ethernet ports 10/100/1000
HP StorageWorks 70 Modular Smart Array with 14x120GB HDD (RAID-5)
44 HP Proliant BL465c server blade, double AMD Opteron Model
2218(2.6GHz,
2MB, 95W), 4 GB RAM, 2 NC370i Multifunction Gigabit Servers Adapters, 120GB
User's area is shared with the nodes.
ssh and torque 6.0.2 services works fine
Torque and openmpi 2.1.0 are installed from tarballs. configure
--prefix=/storage/exp_soft/tuc was used for the deployment of openmpi
2.1.0. After make and make install, the binaries, lib and include files
of openmpi 2.1.0 are located under /storage/exp_soft/tuc .
/storage is a shared file system for all the nodes of the cluster
/storage/exp_soft/tuc/bin
/storage/exp_soft/tuc/sbin
/storage/exp_soft/tuc/torque/bin
/storage/exp_soft/tuc/torque/sbin
/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin
/snap/bin
LD_LIBRARY_PATH=/storage/exp_soft/tuc/lib
C_INCLUDE_PATH=/storage/exp_soft/tuc/include
I also use JupyterHub (with the cluster tab enabled) as a user interface
to the cluster. After the installation of python and some dependencies,
mpich and openmpi were also installed in the system directories.
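[Editor's note: having a distro-installed mpich/openmpi alongside a tarball install is a classic source of mixed-library failures like the opal_shmem one in this thread. A minimal sketch, using the prefix quoted above, of making sure the shared-prefix install wins on every node; the exact variable values are taken from this message, everything else is an assumption:]

```shell
# Sketch (prefix taken from this thread): put the tarball install ahead of
# any system-installed MPI, for every node including non-login shells.
export PATH=/storage/exp_soft/tuc/bin:$PATH
export LD_LIBRARY_PATH=/storage/exp_soft/tuc/lib:${LD_LIBRARY_PATH:-}
echo "${PATH%%:*}"    # prints the first PATH entry: the shared prefix's bin
```

These exports would normally go in a file sourced by non-interactive shells (e.g. ~/.bashrc on each node), since mpirun launches remote processes through non-login shells.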
----------------------------------------------------------------------------------------------------------
mpirun --allow-run-as-root --mca shmem_base_verbose 100 ...
[se01.grid.tuc.gr:19607] mca: base: components_register: registering
framework shmem components
[se01.grid.tuc.gr:19607] mca: base: components_register: found loaded
component sysv
[se01.grid.tuc.gr:19607] mca: base: components_register: component sysv
register function successful
[se01.grid.tuc.gr:19607] mca: base: components_register: found loaded
component posix
[se01.grid.tuc.gr:19607] mca: base: components_register: component posix
register function successful
[se01.grid.tuc.gr:19607] mca: base: components_register: found loaded
component mmap
[se01.grid.tuc.gr:19607] mca: base: components_register: component mmap
register function successful
[se01.grid.tuc.gr:19607] mca: base: components_open: opening shmem
components
[se01.grid.tuc.gr:19607] mca: base: components_open: found loaded component
sysv
[se01.grid.tuc.gr:19607] mca: base: components_open: component sysv open
function successful
[se01.grid.tuc.gr:19607] mca: base: components_open: found loaded component
posix
[se01.grid.tuc.gr:19607] mca: base: components_open: component posix open
function successful
[se01.grid.tuc.gr:19607] mca: base: components_open: found loaded component
mmap
[se01.grid.tuc.gr:19607] mca: base: components_open: component mmap open
function successful
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: Auto-selecting shmem
components
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Querying
component (run-time) [sysv]
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Query of
component [sysv] set priority to 30
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Querying
component (run-time) [posix]
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Query of
component [posix] set priority to 40
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Querying
component (run-time) [mmap]
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Query of
component [mmap] set priority to 50
[se01.grid.tuc.gr:19607] shmem: base: runtime_query: (shmem) Selected
component [mmap]
[se01.grid.tuc.gr:19607] mca: base: close: unloading component sysv
[se01.grid.tuc.gr:19607] mca: base: close: unloading component posix
Searching for best runnable component.
[se01.grid.tuc.gr:19607] shmem: base: best_runnable_component_name: Found
best runnable component: (mmap).
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: se01
Executable: ...
--------------------------------------------------------------------------
2 total processes failed to start
[se01.grid.tuc.gr:19607] mca: base: close: component mmap closed
[se01.grid.tuc.gr:19607] mca: base: close: unloading component mmap
jb
-----Original Message-----
Sent: Monday, May 15, 2017 1:47 PM
Subject: Re: [OMPI users] (no subject)
Ioannis,
### What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git
branch name and hash, etc.)
### Describe how Open MPI was installed (e.g., from a source/
distribution tarball, from a git clone, from an operating system
distribution package, etc.)
### Please describe the system on which you are running
also, what if you
mpirun --mca shmem_base_verbose 100 ...
Cheers,
Gilles
----- Original Message -----
Post by Ioannis Botsis
Hi
I am trying to run the following simple demo on a cluster of two nodes
----------------------------------------------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
MPI_Init(NULL, NULL);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
printf("Hello world from processor %s, rank %d" " out of %d
processors\n", processor_name, world_rank, world_size);
MPI_Finalize();
}
-------------------------------------------------------------------------------------------------
I always get the message
------------------------------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------------------------------
any hint?
Ioannis Botsis
Gilles Gouaillardet
2017-05-17 08:25:35 UTC
Permalink
Folks,

for the record, this was investigated off-list:

- the root cause was bad permissions on the /.../lib/openmpi directory
(so no components could be found)

- it was then found that tm support had not been built in, so mpirun did
not behave as expected under Torque/PBS

Cheers,

Gilles
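[Editor's note: the bad-permissions root cause above can be checked mechanically. A hedged sketch, simulated under /tmp rather than a real install (the directory and component names here are stand-ins): components must be readable and every parent directory traversable by the user running mpirun.]

```shell
# Sketch of the permissions audit (simulated under /tmp; the real directory
# would be <prefix>/lib/openmpi). mpirun's user needs read access on the
# component files and traverse (x) access on every parent directory.
mkdir -p /tmp/demo_prefix/lib/openmpi
touch /tmp/demo_prefix/lib/openmpi/mca_shmem_mmap.so  # stand-in component file
chmod 700 /tmp/demo_prefix/lib       # overly restrictive: other users locked out
find /tmp/demo_prefix/lib -not -perm -o=r             # lists offending entries
chmod -R a+rX /tmp/demo_prefix/lib   # the usual fix: world read + traverse bits
find /tmp/demo_prefix/lib -not -perm -o=r             # now prints nothing
```

`chmod a+rX` (capital X) adds the execute bit only on directories and already-executable files, which is exactly the read-plus-traverse combination shared component trees need.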
Post by Ioannis Botsis
Hi
I am trying to run the following simple demo on a cluster of two nodes
----------------------------------------------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
MPI_Init(NULL, NULL);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);
printf("Hello world from processor %s, rank %d out of %d processors\n",
processor_name, world_rank, world_size);
MPI_Finalize();
return 0;
}
-------------------------------------------------------------------------------------------------
I always get the message
------------------------------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------------------------------
any hint?
Ioannis Botsis