Discussion:
[OMPI users] Mixing Linux's CPU-shielding with mpirun's bind-to-core
Siddhartha Jana
2013-08-18 03:34:52 UTC
Hi,

My requirement:
1. Prevent the OS from scheduling other tasks on cores 0-7, which are allocated to my process.
2. Prevent my processes from being rescheduled to other cores.

My solution: I use Linux's CPU-shielding.
[ Man page:
http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
]
I create a cpuset called "socket1" with cores 8-15 in the dev fs. I iterate
through all the tasks in /dev/cpuset/tasks and copy them to
/dev/cpuset/socket1/tasks
I create a cpuset called "socket0" with cores 0-7.
At the start of the application (before MPI_Init()), I schedule my MPI
process on the cpuset as follows:
------------------------------------------------------
sprintf(str,"/bin/echo %d >> /dev/cpuset/socket0/tasks ",mypid);
system(str);
------------------------------------------------------
In order to ensure that my processes remain bound to the cores, I am
passing the --bind-to-core option to mpirun. I do this instead of using
sched_setaffinity from within the application. Is there a chance that
mpirun's "bind-to-core" will clash with the above?

While this solution seems to work for now, I am not sure whether this
is a good solution, given mpirun's own techniques of binding to cores,
scheduling processes by slot, and so on.

Will mpirun's bind-by-slot technique guarantee cpu shielding?

I would be highly obliged if someone could point me in the right
direction.

Many thanks
Sincerely
Siddhartha Jana
John Hearns
2013-08-18 06:30:10 UTC
For information, if you use a batch system such as PBS Pro or Torque, it
can be configured to set up the cpuset for a job and start the job within
the cpuset. It will also destroy the cpuset at the end of a job.
This is highly useful for job CPU binding, as you say, and also on a
machine running many separate jobs, where cpusets help isolate jobs and
help allocate resources.
Brice Goglin
2013-08-18 07:36:10 UTC
Post by Siddhartha Jana
Hi,
1. Prevent the OS from scheduling other tasks on cores 0-7, which are allocated to my process.
2. Prevent my processes from being rescheduled to other cores.
My solution: I use Linux's CPU-shielding.
http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
]
I create a cpuset called "socket1" with cores 8-15 in the dev fs. I
iterate through all the tasks in /dev/cpuset/tasks and copy them to
/dev/cpuset/socket1/tasks
Hello,

Most of these existing tasks are system tasks. Some actually *want* to
run on specific cores outside of socket1. For instance some kernel
threads are doing the scheduler load balancing on each core. Others are
doing deferred work in the kernel that your application may need. I
wonder what happens when you move them. The kernel may reject your
request, or it may actually break things.

Also, most of these tasks do nothing but sleep 99.9% of the time
anyway. If you're worried about having too many system tasks on your
application's cores, just make sure you don't install useless packages
(or disable some services at startup).

If you *really* want to have 100% CPU for your application on cores 0-7,
be aware that other things such as interrupts will be stealing some CPU
cycles anyway. You could move these to cores 8-15 as well, but that
seems overkill to me.
Post by Siddhartha Jana
I create a cpuset called "socket0" with cores 0-7.
At the start of the application (before MPI_Init()), I schedule my MPI
process on the cpuset as follows:
------------------------------------------------------
sprintf(str,"/bin/echo %d >> /dev/cpuset/socket0/tasks ",mypid);
system(str);
------------------------------------------------------
In order to ensure that my processes remain bound to the cores, I am
passing the --bind-to-core option to mpirun. I do this, instead of
using sched_setaffinity from within the application. Is there a chance
that mpirun's "binding-to-core" will clash with the above ?
Make sure you also specify the NUMA node in your cpuset "mems" file
too. That's required before the cpuset can be used (otherwise adding a
task will fail). And make sure that the application can add itself to
the cpuset; usually only root can add tasks to cpusets.

And you may want to open/write/close on /dev/cpuset/socket0/tasks and
check the return values instead of this system() call.
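
For example, something along these lines (a rough, untested sketch, not
from this thread; it assumes the cpuset, including its "cpus" and "mems"
files, was already created):
------------------------------------------------------
/* Rough sketch: add the calling process to an existing cpuset by writing
   its PID to the tasks file, checking every return value. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int join_cpuset(const char *tasks_path)
{
    char buf[32];
    int len, fd = open(tasks_path, O_WRONLY);
    if (fd < 0) {
        perror("open cpuset tasks file");
        return -1;
    }
    len = snprintf(buf, sizeof(buf), "%d", (int)getpid());
    if (write(fd, buf, len) != len) {
        perror("write PID to cpuset tasks file");
        close(fd);
        return -1;
    }
    return close(fd);
}

/* e.g. join_cpuset("/dev/cpuset/socket0/tasks"); */
------------------------------------------------------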

If all the above works and does not return errors (you should check that
your application's PID is in /dev/cpuset/socket0/tasks while running),
bind-to-core won't clash with it, at least when using an OMPI that uses
hwloc for binding (v1.5.2 or later if I remember correctly).
Post by Siddhartha Jana
While this solution seems to work temporarily, I am not sure whether
this is good solution.
Usually the administrator or PBS/Torque/... creates the cpuset and
places tasks in there for you.

Brice
Siddhartha Jana
2013-08-18 12:51:31 UTC
Hi,
Thanks for the reply,
Post by Brice Goglin
Post by Siddhartha Jana
1. Prevent the OS from scheduling other tasks on cores 0-7, which are
allocated to my process.
2. Prevent my processes from being rescheduled to other cores.
My solution: I use Linux's CPU-shielding.
http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
]
I create a cpuset called "socket1" with cores 8-15 in the dev fs. I
iterate through all the tasks in /dev/cpuset/tasks and copy them to
/dev/cpuset/socket1/tasks
Most of these existing tasks are system tasks. Some actually *want* to
run on specific cores outside of socket1. For instance some kernel
threads are doing the scheduler load balancing on each core. Others are
doing deferred work in the kernel that your application may need. I
wonder what happens when you move them. The kernel may reject your
request, or it may actually break things.
Yes, when I move all system tasks, the movable kernel tasks are easily
moved without complaint. The ones that can't be moved return an error code.
But since their CPU usage is very low, I decided to ignore them anyway.
Nothing really breaks.
Post by Brice Goglin
Also most of these tasks do nothing but sleeping 99.9% of the times
anyway. If you're worried about having too many system tasks on your
applications' core, just make sure you don't install useless packages
(or disable some services at startup).
For my use case, I have ensured that the heavy tasks that I wanted to be
moved out of socket0 could be moved without complaints. The non-movable
ones, as I mentioned, were left as is.
Post by Brice Goglin
If you *really* want to have 100% CPU for your application on cores 0-7,
be aware that other things such as interrupts will be stealing some CPU
cycles anyway.
Noted. As mentioned, the tasks that really matter were safely moved to a
different socket.
Post by Brice Goglin
Post by Siddhartha Jana
I create a cpuset called "socket0" with cores 0-7.
At the start of the application (before MPI_Init()), I schedule my MPI
process on the cpuset as follows:
------------------------------------------------------
sprintf(str,"/bin/echo %d >> /dev/cpuset/socket0/tasks ",mypid);
system(str);
------------------------------------------------------
In order to ensure that my processes remain bound to the cores, I am
passing the --bind-to-core option to mpirun. I do this, instead of
using sched_setaffinity from within the application. Is there a chance
that mpirun's "binding-to-core" will clash with the above ?
Post by Brice Goglin
Make sure you also specify the NUMA node in your cpuset "mems" file
too. That's required before the cpuset can be used (otherwise adding a
task will fail). And make sure that the application can add itself to
the cpuset; usually only root can add tasks to cpusets.
Yes, I have ensured all of these. The application has enough rights to add
itself to the cpuset.
Post by Brice Goglin
And you may want to open/write/close on /dev/cpuset/socket0/tasks and
check the return values instead of this system() call.
Checked. Everything works as expected.
Post by Brice Goglin
If all the above works and does not return errors (you should check that
your application's PID is in /dev/cpuset/socket0/tasks while running),
bind-to-core won't clash with it, at least when using an OMPI that uses
hwloc for binding (v1.5.2 or later if I remember correctly).
My concern is that hwloc is used before the application begins executing,
and so mpirun might use it to bind the application to different cores than
the ones I want them to bind to. If there were a way to specify the cores
through the hostfile, this problem would be solved. Is it possible to
specify the cores in the hostfile?
Post by Brice Goglin
Post by Siddhartha Jana
While this solution seems to work temporarily, I am not sure whether
this is good solution.
Usually the administrator or PBS/Torque/... creates the cpuset and
places tasks in there for you.
Yes, this is what was done in my case for the kernel tasks.
John Hearns
2013-08-18 12:57:02 UTC
On a bug system you can boot the system into a 'boot cpuset',
so all system processes run on a small number of low-numbered cores, plus
any login sessions. The batch system then creates cpusets on the higher
numbered cores - free from OS interference.
John Hearns
2013-08-18 12:57:26 UTC
Bug system?
Big system!
Brice Goglin
2013-08-18 15:50:47 UTC
Post by Brice Goglin
If all the above works and does not return errors (you should check that
your application's PID is in /dev/cpuset/socket0/tasks while running),
bind-to-core won't clash with it, at least when using an OMPI that uses
hwloc for binding (v1.5.2 or later if I remember correctly).
Post by Siddhartha Jana
My concern is that hwloc is used before the application begins
executing and so mpirun might use it to bind the application to
different cores than the ones I want them to bind to.
Ah right, there could be a problem here. MPI can bind at two different
times: inside mpirun, after ssh but before running the actual program (this
one would ignore your cpuset), and later at MPI_Init inside your program
(this one will ignore your cpuset only if you call MPI_Init before
creating the cpuset).

I'll let OMPI people give more details about this.

Brice
Siddhartha Jana
2013-08-18 12:54:06 UTC
Noted. Thanks. Unfortunately, in my case the cluster is a basic Linux
cluster without any job schedulers.
John Hearns
2013-08-18 13:03:33 UTC
You really should install a job scheduler.
There are free versions.

I'm not sure about cpuset support in Gridengine. Anyone?
Dave Love
2013-08-21 17:50:43 UTC
Post by John Hearns
You really should install a job scheduler.
Indeed (although it's the resource management component that does the
job).
Post by John Hearns
There are free versions.
I'm not sure about cpuset support in Gridengine. Anyone?
Yes, but I've had reports of problems (races?) that I haven't sorted out
yet.

I wouldn't bother anyway. Just use core binding supplied by the
resource manager with well-behaved jobs (Use "-binding linear:slots"
with the current SGE, assuming you want to pack arbitrary jobs onto
nodes.) Cpusets help with badly-behaved jobs, but don't have much to
offer over core binding with well-behaved ones. "Well-behaved" means
they request the right number of slots, don't daemonize, and don't try
to escape the binding they're handed; that includes simple OMPI ones
with tight integration working properly. If you really care about
avoiding specific cores (why?), you could submit other jobs to block them.

["CPU-shielding" is a new one on me.]
--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
John Hearns
2013-08-21 17:57:41 UTC
Agree with what you say, Dave.

Regarding not wanting jobs to use certain cores, i.e. reserving low-numbered
cores for OS processes: surely a good way forward is to use a 'boot
cpuset' of one or two cores and let your jobs run on the rest of the cores.

You're right about cpusets being helpful with 'badly behaved' jobs.
War stories some other time!
Dave Love
2013-08-23 11:36:31 UTC
Post by John Hearns
Agree with what you say, Dave.
Regarding not wanting jobs to use certain cores, i.e. reserving low-numbered
cores for OS processes: surely a good way forward is to use a 'boot
cpuset' of one or two cores and let your jobs run on the rest of the cores.
Maybe, if you make sure the resource manager knows about it, and users
don't mind losing the cores, presumably resulting in a cock-eyed MPI
process distribution. Is it really necessary, compared with simply
using core binding?

I'd expect the bulk of overheads to be due to the resource manager,
especially if it tracks things by grovelling /proc frequently, not to
the OS. In cases I've measured, it's typically ~1%, depending on
parameters, scaling more slowly than core count.
Post by John Hearns
You're right about cpusets being helpful with 'badly behaved' jobs.
War stories some other time!
Well [trying to bring this on topic], things got much more sanitary here
after I replaced the wretched Streamline-supplied setup with tight
integration of OMPI under SGE and then made the SGE core binding
inherited by OMPI work sensibly with partially full nodes.
John Hearns
2013-08-23 13:28:12 UTC
Post by Dave Love
Post by John Hearns
cpuset' of one or two cores and let your jobs run on the rest of the
cores.
Maybe, if you make sure the resource manager knows about it, and users
don't mind losing the cores,
Depends how big your machine is. Having a few cores devoted to OS processes
on a big NUMA machine is not a great loss of resources.
And of course the resource manager knows about it.

Siddhartha Jana
2013-08-18 13:04:57 UTC
Thanks John. But I have an incredibly small system. 2 nodes - 16 cores each.
2-4 MPI processes. :-)
Siddhartha Jana
2013-08-18 13:09:44 UTC
So my question really boils down to:
How does one ensure that mpirun launches the processes on the specific
cores they are expected to be bound to?
As I mentioned, if there were a way to specify the cores through the
hostfile, this problem would be solved.

Thanks for all the quick replies,
-- Sid
Ralph Castain
2013-08-18 14:49:44 UTC
If you require that a specific rank go to a specific core, then use the rankfile mapper - you can see explanations on the syntax in "man mpirun"

If you just want mpirun to respect an external cpuset limitation, it already does so when binding - it will bind within the external limitation
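
For example, a rankfile could look something like this (the hostnames and
socket:core slots below are only placeholders, not taken from this thread;
check "man mpirun" for your version's exact syntax):

$ cat myrankfile
rank 0=nodeA slot=0:0
rank 1=nodeA slot=0:1
rank 2=nodeB slot=1:0
$ mpirun -np 3 --rankfile myrankfile ./a.out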
Siddhartha Jana
2013-08-18 16:38:56 UTC
Firstly, I would like my program to dynamically assign itself to
whichever core it pleases and remain bound to it until it later
reschedules itself.
Ralph Castain wrote:
>> "If you just want mpirun to respect an external cpuset limitation, it
already does so when binding - it will bind within the external limitation"

In my case, the limitation is enforced "internally", by the application
once it begins execution. I enforce this during program execution, after
mpirun has finished "binding within the external limitation".


Brice Goglin said:
>> "MPI can bind at two different times: inside mpirun after ssh before
running the actual program (this one would ignore your cpuset), later at
MPI_Init inside your program (this one will ignore your cpuset only if you
call MPI_Init before creating the cpuset)."

Noted. In that case, during program execution, whose binding is respected -
mpirun's or MPI_Init()'s? From the above, is my understanding correct that
MPI_Init() will be responsible for the 2nd round of attempting to bind
processes to cores and can override what mpirun or the programmer had
enforced before its call (using hwloc/cpuset/sched_load_balance() and
other compatible cousins)?


--------------------------------------------
If this is so, in my case the flow of events is thus:

1. mpirun binds an MPI process which is yet to begin execution. So mpirun
says: "Bind to some core - A" (I don't use any hostfile/rankfile, but I do
use the --bind-to-core flag)

2. Process begins execution on core A

3. I enforce: "Bind to core B". (We must remember, it is only at runtime
that I know which core I want to be bound to, not while launching the
processes using mpirun.) So my process shifts over to core B

4. MPI_Init() once again honors the rankfile mapping (if any; the default
policy otherwise) and rebinds my process to core A

5. Process finishes execution and calls MPI_Finalize(), all the time on
core A

6. mpirun exits
--------------------------------------------

So if I place step-3 above after step-4, my request will hold for the rest
of the execution. Please do let me know if my understanding is correct.

Thanks for all the help

Sincerely,
Siddhartha Jana
HPCTools
Ralph Castain
2013-08-18 17:11:58 UTC
A process can always change its binding by "re-binding" to wherever it wants after MPI_Init completes.
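
For example, a minimal sketch of such a re-bind (untested, not from this
thread; the core number is only an illustration - hwloc_set_cpubind() would
work equally well):
------------------------------------------------------
/* Minimal sketch: re-bind the calling process to a core chosen at runtime,
   after MPI_Init() has completed. Core 3 is only an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

static void rebind_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);  /* any mpirun/MPI_Init binding happens up to here */
    rebind_to_core(3);       /* now move this process to the core it wants */
    /* ... rest of the application ... */
    MPI_Finalize();
    return 0;
}
------------------------------------------------------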
Siddhartha Jana
2013-08-18 22:24:43 UTC
Post by Ralph Castain
A process can always change its binding by "re-binding" to wherever it
wants after MPI_Init completes.
Noted. Thanks. I guess the important thing that I wanted to know was that
the binding needs to happen *after* MPI_Init() completes.

Thanks all

-- Siddhartha
Ralph Castain
2013-08-18 22:40:57 UTC
It only has to come after MPI_Init *if* you are telling mpirun to bind you as well. Otherwise, you could just not tell mpirun to bind (it doesn't by default) and then bind anywhere, anytime you like.
Siddhartha Jana
2013-08-18 23:01:31 UTC
Noted. Thanks again
-- Sid
Jeff Squyres (jsquyres)
2013-08-20 17:18:28 UTC
I know I'm late to this conversation, but I was on vacation last week. Some random points:

1. If you use OMPI's --bind-to-core option and then re-bind yourself to some other core, then all the memory affinity that MPI set up during MPI_Init() will be "wrong" (possibly on a remote NUMA node). I would advise against doing this.

2. Instead of #1, as Ralph stated, if you're going to do your own process affinity, then don't use OMPI's --bind-to-core (or any --bind-to-* option). Then MPI won't set up any affinity stuff, and you're good.

3. Rather than setting up CPU shielding, you can just use simple API calls or scripting calls to bind each MPI process to wherever you want. For example:

$ mpirun --host a,b -np 4 my_binding_script.sh my_mpi_app

Where my_binding_script.sh simply invokes a tool like hwloc-bind to bind yourself to whatever socket/core combination you want, and then invokes my_mpi_app (i.e., the real MPI application). For example:

$ cat my_binding_script.sh
#!/bin/sh
exec hwloc-bind socket.1:core.$OMPI_COMM_WORLD_LOCAL_RANK $1

Where $OMPI_COMM_WORLD_LOCAL_RANK is an environment variable that mpirun will put in the environment of the processes that it launches. Each process will have $OMPI_COMM_WORLD_LOCAL_RANK set to a value in the range of [0,N), where N is the number of processes on that server. In the above example of launching 4 processes (2 on each of servers a and b), each of the 4 processes would get an $OMPI_COMM_WORLD_LOCAL_RANK value of 0 or 1.
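
The same variable can also be read from inside the program if, as discussed
earlier in the thread, a process wants to choose its core at runtime and
re-bind itself; a rough sketch (the helper and its modulo mapping are only
an illustration, not part of any OMPI API):
------------------------------------------------------
/* Rough sketch: derive a core number from the local rank that mpirun
   exports. The modulo mapping is only an example policy. */
#include <stdlib.h>

static int core_for_local_rank(int cores_per_node)
{
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = s ? atoi(s) : 0;   /* 0 if not launched by mpirun */
    return local_rank % cores_per_node;
}
------------------------------------------------------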

If you don't know about hwloc, you should -- it's very, very helpful for all this kind of process affinity stuff. See http://www.open-mpi.org/projects/hwloc/ (hwloc-bind is one of the tools in the hwloc suite).
--
Jeff Squyres
***@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Jeff Squyres (jsquyres)
2013-08-20 17:20:26 UTC
Post by Jeff Squyres (jsquyres)
$ cat my_binding_script.sh
#!/bin/sh
exec hwloc-bind socket.1:core.$OMPI_COMM_WORLD_LOCAL_RANK $1
Oops! Typo. That last line should be:

exec hwloc-bind socket:1.core:$OMPI_COMM_WORLD_LOCAL_RANK $1
--
Jeff Squyres
***@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Siddhartha Jana
2013-08-21 15:19:57 UTC
Hi,

Post by Jeff Squyres (jsquyres)
1. If you use OMPI's --bind-to-core option and then re-bind yourself to
some other core, then all the memory affinity that MPI set up during
MPI_Init() will be "wrong" (possibly on a remote NUMA node). I would
advise against doing this.
Ah yes! Noted.
Post by Jeff Squyres (jsquyres)
3. Rather than setting up CPU shielding, you can just use simple API calls
or scripting calls to bind each MPI process to wherever you want.
The reason for using "cpu shielding" was not to bind processes to cores but
to ensure that no other processes get scheduled on those cores (some
stubborn kernel tasks can still disobey cpuset rules but they are too
lightweight anyway, so that's fine).
Post by Jeff Squyres (jsquyres)
$ mpirun --host a,b -np 4 my_binding_script.sh my_mpi_app
Where my_binding_script.sh simply invokes a tool like hwloc-bind to bind
yourself to whatever socket/core combination you want, and then invokes
$ cat my_binding_script.sh
#!/bin/sh
exec hwloc-bind socket:1.core:$OMPI_COMM_WORLD_LOCAL_RANK $1
As pointed out, it is indeed convenient to use hwloc and its cousins for
binding processes. It is my understanding, however, that coupling hwloc
with cpu-shielding will enable exclusive access to cores within the set.

Thanks again,
Siddhartha Jana