Discussion:
[OMPI users] Slot count parameter in hostfile ignored
Maksym Planeta
2017-09-07 10:33:42 UTC
Hello,

I'm trying to tell OpenMPI how many processes per node I want to use, but mpirun seems to ignore the configuration I provide.

I create the following hostfile:

$ cat hostfile.16
taurusi6344 slots=16
taurusi6348 slots=16

And then I start the application as follows:

$ mpirun --display-map -machinefile hostfile.16 -np 2 hostname
Data for JOB [42099,1] offset 0

======================== JOB MAP ========================

Data for node: taurusi6344 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [42099,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]], socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]:[B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././.]

Data for node: taurusi6348 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [42099,1] App: 0 Process rank: 1 Bound: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]], socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]:[B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././.]

=============================================================
taurusi6344
taurusi6348

If I request more than 2 processes (for example "-np 4"), I get the following error message:

$ mpirun --display-map -machinefile hostfile.16 -np 4 hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

The OpenMPI version is "mpirun (Open MPI) 2.1.0"

SLURM is also installed, version "slurm 16.05.7-Bull.1.1-20170512-1252".

Could you help me make OpenMPI respect the slots parameter?
--
Regards,
Maksym Planeta
r***@open-mpi.org
2017-09-07 12:12:33 UTC
My best guess is that SLURM has only allocated 2 slots, and we respect the RM's allocation regardless of what you say in the hostfile. You can check this by adding --display-allocation to your command line. You probably need to tell SLURM to allocate more CPUs per node.
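For example, a minimal sketch (the node count and the 16 tasks per node here just mirror the hostfile above; the exact SLURM options depend on your site):

$ salloc --nodes=2 --ntasks-per-node=16
$ mpirun --display-allocation --display-map -machinefile hostfile.16 -np 4 hostname

The allocation section printed by --display-allocation should then report 16 slots on each node.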
Maksym Planeta
2017-09-08 07:19:59 UTC
Indeed, mpirun shows slots=1 per node, but I create the allocation with --ntasks-per-node 24, so I do have all cores of the node allocated.

When I use srun I can get all the cores.
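As a rough way to double-check what SLURM actually granted, one can inspect the allocation from inside the shell (a sketch; the exact variable names can vary between SLURM versions):

$ env | grep -E 'SLURM_(NNODES|NTASKS|JOB_CPUS_PER_NODE|TASKS_PER_NODE)'
$ scontrol show job $SLURM_JOB_ID | grep -Ei 'NumNodes|NumCPUs|NumTasks'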
--
Regards,
Maksym Planeta
Gilles Gouaillardet
2017-09-08 07:58:19 UTC
Maksym,


Can you please post your sbatch script?

FWIW, I am unable to reproduce the issue with the latest v2.x from GitHub.


By any chance, would you be able to test the latest Open MPI 2.1.2rc3?


Cheers,


Gilles
Maksym Planeta
2017-09-08 08:20:28 UTC
I start an interactive allocation, and I just noticed that the problem happens when I join this allocation from another shell.

Here is how I join:

srun --pty --x11 --jobid=$(squeue -u $USER -o %A | tail -n 1) bash

And here is how I create the allocation:

srun --pty --nodes 8 --ntasks-per-node 24 --mem 50G --time=3:00:00 --partition=haswell --x11 bash
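For what it's worth, that join shows up in SLURM as its own job step, which should be visible with something like (a sketch, assuming squeue's --steps option is available in this SLURM version):

$ squeue -s -u $USER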
--
Regards,
Maksym Planeta
Gilles Gouaillardet
2017-09-08 08:27:31 UTC
Thanks, now I can reproduce the issue.


Cheers,


Gilles
r***@open-mpi.org
2017-09-08 13:49:17 UTC
It isn't a bug, as there is nothing wrong with OMPI; the problem is your method of joining the allocation. What you have done is create a job step that has only 1 slot per node. We have no choice but to honor that constraint and run within it.

What you should do instead is use salloc to create the allocation. That places your shell inside the main allocation, so we can use all of it.
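A minimal sketch of that workflow, reusing the options from the earlier srun line (adjust as needed for your site):

$ salloc --nodes 8 --ntasks-per-node 24 --mem 50G --time=3:00:00 --partition=haswell
$ mpirun --display-map -machinefile hostfile.16 -np 4 hostname

Inside the shell that salloc starts, mpirun sees the full allocation rather than a 1-slot-per-node job step.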
Maksym Planeta
2017-09-08 08:25:10 UTC
Post by Gilles Gouaillardet
By any chance, would you be able to test the latest Open MPI 2.1.2rc3?
OpenMPI 2.1.0 is the latest on our cluster.
--
Regards,
Maksym Planeta