Discussion:
[OMPI users] Mixing Linux's CPU-shielding with mpirun's bind-to-core
Siddhartha Jana
2013-08-18 03:34:52 UTC
Hi,

My requirement:
1. Prevent the OS from scheduling other tasks on cores 0-7, which are allocated to my process.
2. Prevent my processes from being rescheduled to other cores.

My solution: I use Linux's CPU-shielding.
[ Man page:
http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
]
I create a cpuset called "socket1" with cores 8-15 in the dev fs. I iterate
through all the tasks in /dev/cpuset/tasks and copy them to
/dev/cpuset/socket1/tasks
I create a cpuset called "socket0" with cores 0-7.
At the start of the application (before MPI_Init()), I schedule my MPI
process on the cpuset as follows:
------------------------------------------------------
sprintf(str,"/bin/echo %d >> /dev/cpuset/socket0/tasks ",mypid);
system(str);
------------------------------------------------------
In order to ensure that my processes remain bound to the cores, I am
passing the --bind-to-core option to mpirun. I do this instead of using
sched_setaffinity from within the application. Is there a chance that
mpirun's "bind-to-core" will clash with the above?

While this solution seems to work for now, I am not sure whether this
is a good solution, given mpirun's own techniques of binding to cores,
scheduling processes by slot, and so on.

Will mpirun's bind-by-slot technique guarantee cpu shielding?

I would be highly obliged if someone could point me in the right
direction.

Many thanks
Sincerely
Siddhartha Jana
John Hearns
2013-08-18 06:30:10 UTC
For information, if you use a batch system such as PBS Pro or Torque, it
can be configured to set up the cpuset for a job and start the job within
the cpuset. It will also destroy the cpuset at the end of a job.
This is highly useful for job CPU binding, as you say, and also on a
machine running many separate jobs, where cpusets help isolate jobs and
help allocate resources.
Brice Goglin
2013-08-18 07:36:10 UTC
Post by Siddhartha Jana
Hi,
1. Prevent the OS from scheduling other tasks on cores 0-7, which are allocated to my process.
2. Prevent my processes from being rescheduled to other cores.
My solution: I use Linux's CPU-shielding.
http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
]
I create a cpuset called "socket1" with cores 8-15 in the dev fs. I
iterate through all the tasks in /dev/cpuset/tasks and copy them to
/dev/cpuset/socket1/tasks
Hello,

Most of these existing tasks are system tasks. Some actually *want* to
run on specific cores outside of socket1. For instance some kernel
threads are doing the scheduler load balancing on each core. Others are
doing deferred work in the kernel that your application may need. I
wonder what happens when you move them. The kernel may reject your
request, or it may actually break things.

Also, most of these tasks do nothing but sleep 99.9% of the time
anyway. If you're worried about having too many system tasks on your
application's cores, just make sure you don't install useless packages
(or disable some services at startup).

If you *really* want to have 100% CPU for your application on cores 0-7,
be aware that other things such as interrupts will be stealing some CPU
cycles anyway. You could move these to cores 8-15 as well, but that
seems overkill to me.
Post by Siddhartha Jana
I create a cpuset called "socket0" with cores 0-7.
At the start of the application (before MPI_Init()), I schedule my MPI
process on the cpuset as follows:
------------------------------------------------------
sprintf(str,"/bin/echo %d >> /dev/cpuset/socket0/tasks ",mypid);
system(str);
------------------------------------------------------
In order to ensure that my processes remain bound to the cores, I am
passing the --bind-to-core option to mpirun. I do this, instead of
using sched_setaffinity from within the application. Is there a chance
that mpirun's "binding-to-core" will clash with the above ?
Make sure you also specify the NUMA node in your cpuset "mems" file
too. That's required before the cpuset can be used (otherwise adding a
task will fail). And make sure that the application can add itself to
the cpuset; usually only root can add tasks to cpusets.

And you may want to open/write/close on /dev/cpuset/socket0/tasks and
check the return values instead of this system() call.
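
For example, something along these lines (a rough, untested sketch, not
from this thread; it assumes the cpuset, including its "cpus" and "mems"
files, was already created):
------------------------------------------------------
/* Rough sketch: add the calling process to an existing cpuset by writing
   its PID to the tasks file, checking every return value. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int join_cpuset(const char *tasks_path)
{
    char buf[32];
    int len, fd = open(tasks_path, O_WRONLY);
    if (fd < 0) {
        perror("open cpuset tasks file");
        return -1;
    }
    len = snprintf(buf, sizeof(buf), "%d", (int)getpid());
    if (write(fd, buf, len) != len) {
        perror("write PID to cpuset tasks file");
        close(fd);
        return -1;
    }
    return close(fd);
}

/* e.g. join_cpuset("/dev/cpuset/socket0/tasks"); */
------------------------------------------------------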

If all the above works and does not return errors (you should check that
your application's PID is in /dev/cpuset/socket0/tasks while running),
bind-to-core won't clash with it, at least when using an OMPI that uses
hwloc for binding (v1.5.2 or later if I remember correctly).
Post by Siddhartha Jana
While this solution seems to work temporarily, I am not sure whether
this is good solution.
Usually the administrator or PBS/Torque/... creates the cpuset and
places tasks in there for you.

Brice
Siddhartha Jana
2013-08-18 12:51:31 UTC
Hi,
Thanks for the reply,
Post by Brice Goglin
Post by Siddhartha Jana
1. Prevent the OS from scheduling other tasks on cores 0-7, which are
allocated to my process.
2. Prevent my processes from being rescheduled to other cores.
My solution: I use Linux's CPU-shielding.
http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
]
I create a cpuset called "socket1" with cores 8-15 in the dev fs. I
iterate through all the tasks in /dev/cpuset/tasks and copy them to
/dev/cpuset/socket1/tasks
Most of these existing tasks are system tasks. Some actually *want* to
run on specific cores outside of socket1. For instance some kernel
threads are doing the scheduler load balancing on each core. Others are
doing deferred work in the kernel that your application may need. I
wonder what happens when you move them. The kernel may reject your
request, or it may actually break things.
Yes, when I move all system tasks, the movable kernel tasks are easily
moved without complaint. The ones that can't be moved return an error code.
But since their CPU usage is very low, I decided to ignore them anyway.
Nothing really breaks.
Post by Brice Goglin
Also most of these tasks do nothing but sleeping 99.9% of the times
anyway. If you're worried about having too many system tasks on your
applications' core, just make sure you don't install useless packages
(or disable some services at startup).
For my use case, I have ensured that the heavy tasks that I wanted to be
moved out of socket0 could be moved without complaints. The non-movable
ones, as I mentioned, were left as is.
Post by Brice Goglin
If you *really* want to have 100% CPU for your application on cores 0-7,
be aware that other things such as interrupts will be stealing some CPU
cycles anyway.
Noted. As mentioned, the tasks that really matter were safely moved to a
different socket.
Post by Brice Goglin
Post by Siddhartha Jana
I create a cpuset called "socket0" with cores 0-7.
At the start of the application (before MPI_Init()), I schedule my MPI
process on the cpuset as follows:
------------------------------------------------------
sprintf(str,"/bin/echo %d >> /dev/cpuset/socket0/tasks ",mypid);
system(str);
------------------------------------------------------
In order to ensure that my processes remain bound to the cores, I am
passing the --bind-to-core option to mpirun. I do this, instead of
using sched_setaffinity from within the application. Is there a chance
that mpirun's "binding-to-core" will clash with the above ?
Post by Brice Goglin
Make sure you also specify the NUMA node in your cpuset "mems" file
too. That's required before the cpuset can be used (otherwise adding a
task will fail). And make sure that the application can add itself to
the cpuset; usually only root can add tasks to cpusets.
Yes, I have ensured all of these. The application has enough rights to add
itself to the cpuset.
Post by Brice Goglin
And you may want to open/write/close on /dev/cpuset/socket0/tasks and
check the return values instead of this system() call.
Checked. Everything works as expected.
Post by Brice Goglin
If all the above works and does not return errors (you should check that
your application's PID is in /dev/cpuset/socket0/tasks while running),
bind-to-core won't clash with it, at least when using an OMPI that uses
hwloc for binding (v1.5.2 or later if I remember correctly).
My concern is that hwloc is used before the application begins executing,
and so mpirun might use it to bind the application to different cores than
the ones I want them to bind to. If there were a way to specify the cores
through the hostfile, this problem would be solved. Is it possible to
specify the cores in the hostfile?
Post by Brice Goglin
Post by Siddhartha Jana
While this solution seems to work temporarily, I am not sure whether
this is good solution.
Usually the administrator or PBS/Torque/... creates the cpuset and
places tasks in there for you.
Yes, this is what was done in my case for the kernel tasks.
John Hearns
2013-08-18 12:57:02 UTC
On a bug system you can boot the system into a 'boot cpuset',
so all system processes run on a small number of low-numbered cores, plus
any login sessions. The batch system then creates cpusets on the higher
numbered cores - free from OS interference.
John Hearns
2013-08-18 12:57:26 UTC
Bug system?
Big system!
Brice Goglin
2013-08-18 15:50:47 UTC
Post by Brice Goglin
If all the above works and does not return errors (you should check that
your application's PID is in /dev/cpuset/socket0/tasks while running),
bind-to-core won't clash with it, at least when using an OMPI that uses
hwloc for binding (v1.5.2 or later if I remember correctly).
Post by Siddhartha Jana
My concern is that hwloc is used before the application begins
executing and so mpirun might use it to bind the application to
different cores than the ones I want them to bind to.
Ah right, there could be a problem here. MPI can bind at two different
times: inside mpirun, after ssh but before running the actual program (this
one would ignore your cpuset), and later at MPI_Init inside your program
(this one will ignore your cpuset only if you call MPI_Init before
creating the cpuset).

I'll let OMPI people give more details about this.

Brice
Siddhartha Jana
2013-08-18 12:54:06 UTC
Noted. Thanks. Unfortunately, in my case the cluster is a basic Linux
cluster without any job schedulers.
John Hearns
2013-08-18 13:03:33 UTC
You really should install a job scheduler.
There are free versions.

I'm not sure about cpuset support in Gridengine. Anyone?
Dave Love
2013-08-21 17:50:43 UTC
Post by John Hearns
You really should install a job scheduler.
Indeed (although it's the resource management component that does the
job).
Post by John Hearns
There are free versions.
I'm not sure about cpuset support in Gridengine. Anyone?
Yes, but I've had reports of problems (races?) that I haven't sorted out
yet.

I wouldn't bother anyway. Just use core binding supplied by the
resource manager with well-behaved jobs (Use "-binding linear:slots"
with the current SGE, assuming you want to pack arbitrary jobs onto
nodes.) Cpusets help with badly-behaved jobs, but don't have much to
offer over core binding with well-behaved ones. "Well-behaved" means
they request the right number of slots, don't daemonize, and don't try
to escape the binding they're handed; that includes simple OMPI ones
with tight integration working properly. If you really care about
avoiding specific cores (why?), you could submit other jobs to block them.

["CPU-shielding" is a new one on me.]
--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
John Hearns
2013-08-21 17:57:41 UTC
Agree with what you say, Dave.

Regarding not wanting jobs to use certain cores, i.e. reserving low-numbered
cores for OS processes: surely a good way forward is to use a 'boot
cpuset' of one or two cores and let your jobs run on the rest of the cores.

You're right about cpusets being helpful with 'badly behaved' jobs.
War stories some other time!
Dave Love
2013-08-23 11:36:31 UTC
Post by John Hearns
Agree with what you say, Dave.
Regarding not wanting jobs to use certain cores, i.e. reserving low-numbered
cores for OS processes: surely a good way forward is to use a 'boot
cpuset' of one or two cores and let your jobs run on the rest of the cores.
Maybe, if you make sure the resource manager knows about it, and users
don't mind losing the cores, presumably resulting in a cock-eyed MPI
process distribution. Is it really necessary, compared with simply
using core binding?

I'd expect the bulk of overheads to be due to the resource manager,
especially if it tracks things by grovelling /proc frequently, not to
the OS. In cases I've measured, it's typically ~1%, depending on
parameters, scaling more slowly than core count.
Post by John Hearns
You're right about cpusets being helpful with 'badly behaved' jobs.
War stories some other time!
Well [trying to bring this on topic], things got much more sanitary here
after I replaced the wretched Streamline-supplied setup with tight
integration of OMPI under SGE and then made the SGE core binding
inherited by OMPI work sensibly with partially full nodes.
John Hearns
2013-08-23 13:28:12 UTC
Post by Dave Love
Post by John Hearns
cpuset' of one or two cores and let your jobs run on the rest of the
cores.
Maybe, if you make sure the resource manager knows about it, and users
don't mind losing the cores,
Depends how big your machine is. Having a few cores devoted to OS processes
on a big NUMA machine is not a great loss of resources.
And of course the resource manager knows about it.

Siddhartha Jana
2013-08-18 13:04:57 UTC
Thanks John. But I have an incredibly small system. 2 nodes - 16 cores each.
2-4 MPI processes. :-)
Siddhartha Jana
2013-08-18 13:09:44 UTC
So my question really boils down to:
How does one ensure that mpirun launches the processes on the specific
cores they are expected to be bound to?
As I mentioned, if there were a way to specify the cores through the
hostfile, this problem would be solved.

Thanks for all the quick replies,
-- Sid
Ralph Castain
2013-08-18 14:49:44 UTC
If you require that a specific rank go to a specific core, then use the rankfile mapper - you can see explanations on the syntax in "man mpirun"

If you just want mpirun to respect an external cpuset limitation, it already does so when binding - it will bind within the external limitation
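
For example, a rankfile could look something like this (the hostnames and
socket:core slots below are only placeholders, not taken from this thread;
check "man mpirun" for your version's exact syntax):

$ cat myrankfile
rank 0=nodeA slot=0:0
rank 1=nodeA slot=0:1
rank 2=nodeB slot=1:0
$ mpirun -np 3 --rankfile myrankfile ./a.out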
Siddhartha Jana
2013-08-18 16:38:56 UTC
Firstly, I would like my program to dynamically assign itself to
whichever core it pleases and remain bound to it until it later
reschedules itself.
Ralph Castain wrote:
>> "If you just want mpirun to respect an external cpuset limitation, it
already does so when binding - it will bind within the external limitation"

In my case, the limitation is enforced "internally", by the application
once it begins execution. I enforce this during program execution, after
mpirun has finished "binding within the external limitation".


Brice Goglin said:
>> "MPI can bind at two different times: inside mpirun after ssh before
running the actual program (this one would ignore your cpuset), later at
MPI_Init inside your program (this one will ignore your cpuset only if you
call MPI_Init before creating the cpuset)."

Noted. In that case, during program execution, whose binding is respected -
mpirun's or MPI_Init()'s? From the above, is my understanding correct that
MPI_Init() will be responsible for the 2nd round of attempting to bind
processes to cores and can override what mpirun or the programmer had
enforced before its call (using hwloc/cpuset/sched_load_balance() and
other compatible cousins)?


--------------------------------------------
If this is so, in my case the flow of events is thus:

1. mpirun binds an MPI process which is yet to begin execution. So mpirun
says: "Bind to some core - A" (I don't use any hostfile/rankfile, but I do
use the --bind-to-core flag)

2. Process begins execution on core A

3. I enforce: "Bind to core B". (We must remember, it is only at runtime
that I know which core I want to be bound to, not while launching the
processes using mpirun.) So my process shifts over to core B

4. MPI_Init() once again honors the rankfile mapping (if any; the default
policy otherwise) and rebinds my process to core A

5. Process finishes execution and calls MPI_Finalize(), all the time on
core A

6. mpirun exits
--------------------------------------------

So if I place step-3 above after step-4, my request will hold for the rest
of the execution. Please do let me know if my understanding is correct.

Thanks for all the help

Sincerely,
Siddhartha Jana
HPCTools
Ralph Castain
2013-08-18 17:11:58 UTC
A process can always change its binding by "re-binding" to wherever it wants after MPI_Init completes.
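
For example, a minimal sketch of such a re-bind (untested, not from this
thread; the core number is only an illustration - hwloc_set_cpubind() would
work equally well):
------------------------------------------------------
/* Minimal sketch: re-bind the calling process to a core chosen at runtime,
   after MPI_Init() has completed. Core 3 is only an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

static void rebind_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);  /* any mpirun/MPI_Init binding happens up to here */
    rebind_to_core(3);       /* now move this process to the core it wants */
    /* ... rest of the application ... */
    MPI_Finalize();
    return 0;
}
------------------------------------------------------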
Siddhartha Jana
2013-08-18 22:24:43 UTC
Post by Ralph Castain
A process can always change its binding by "re-binding" to wherever it
wants after MPI_Init completes.
Noted. Thanks. I guess the important thing that I wanted to know was that
the binding needs to happen *after* MPI_Init() completes.

Thanks all

-- Siddhartha
Ralph Castain
2013-08-18 22:40:57 UTC
It only has to come after MPI_Init *if* you are telling mpirun to bind you as well. Otherwise, you could just not tell mpirun to bind (it doesn't by default) and then bind anywhere, anytime you like.
Siddhartha Jana
2013-08-18 23:01:31 UTC
Noted. Thanks again
-- Sid
Jeff Squyres (jsquyres)
2013-08-20 17:18:28 UTC
I know I'm late to this conversation, but I was on vacation last week. Some random points:

1. If you use OMPI's --bind-to-core option and then re-bind yourself to some other core, then all the memory affinity that MPI set up during MPI_Init() will be "wrong" (possibly on a remote NUMA node). I would advise against doing this.

2. Instead of #1, as Ralph stated, if you're going to do your own process affinity, then don't use OMPI's --bind-to-core (or any --bind-to-* option). Then MPI won't set up any affinity stuff, and you're good.

3. Rather than setting up CPU shielding, you can just use simple API calls or scripting calls to bind each MPI process to wherever you want. For example:

$ mpirun --host a,b -np 4 my_binding_script.sh my_mpi_app

Where my_binding_script.sh simply invokes a tool like hwloc-bind to bind yourself to whatever socket/core combination you want, and then invokes my_mpi_app (i.e., the real MPI application). For example:

$ cat my_binding_script.sh
#!/bin/sh
exec hwloc-bind socket.1:core.$OMPI_COMM_WORLD_LOCAL_RANK $1

Where $OMPI_COMM_WORLD_LOCAL_RANK is an environment variable that mpirun will put in the environment of the processes that it launches. Each process will have $OMPI_COMM_WORLD_LOCAL_RANK set to a value in the range of [0,N), where N is the number of processes on that server. In the above example of launching 4 processes (2 on each of servers a and b), each of the 4 processes would get an $OMPI_COMM_WORLD_LOCAL_RANK value of 0 or 1.
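
The same variable can also be read from inside the program if, as discussed
earlier in the thread, a process wants to choose its core at runtime and
re-bind itself; a rough sketch (the helper and its modulo mapping are only
an illustration, not part of any OMPI API):
------------------------------------------------------
/* Rough sketch: derive a core number from the local rank that mpirun
   exports. The modulo mapping is only an example policy. */
#include <stdlib.h>

static int core_for_local_rank(int cores_per_node)
{
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = s ? atoi(s) : 0;   /* 0 if not launched by mpirun */
    return local_rank % cores_per_node;
}
------------------------------------------------------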

If you don't know about hwloc, you should -- it's very, very helpful for all this kind of process affinity stuff. See http://www.open-mpi.org/projects/hwloc/ (hwloc-bind is one of the tools in the hwloc suite).
--
Jeff Squyres
***@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Jeff Squyres (jsquyres)
2013-08-20 17:20:26 UTC
Post by Jeff Squyres (jsquyres)
$ cat my_binding_script.sh
#!/bin/sh
exec hwloc-bind socket.1:core.$OMPI_COMM_WORLD_LOCAL_RANK $1
Oops! Typo. That last line should be:

exec hwloc-bind socket:1.core:$OMPI_COMM_WORLD_LOCAL_RANK $1
--
Jeff Squyres
***@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Siddhartha Jana
2013-08-21 15:19:57 UTC
Hi,

Post by Jeff Squyres (jsquyres)
1. If you use OMPI's --bind-to-core option and then re-bind yourself to
some other core, then all the memory affinity that MPI set up during
MPI_Init() will be "wrong" (possibly on a remote NUMA node). I would
advise against doing this.
Ah yes! Noted.
Post by Jeff Squyres (jsquyres)
3. Rather than setting up CPU shielding, you can just use simple API calls
or scripting calls to bind each MPI process to wherever you want.
The reason for using "cpu shielding" was not to bind processes to cores but
to ensure that no other processes get scheduled on those cores (some
stubborn kernel tasks can still disobey cpuset rules but they are too
lightweight anyway, so that's fine).
Post by Jeff Squyres (jsquyres)
$ mpirun --host a,b -np 4 my_binding_script.sh my_mpi_app
Where my_binding_script.sh simply invokes a tool like hwloc-bind to bind
yourself to whatever socket/core combination you want, and then invokes
$ cat my_binding_script.sh
#!/bin/sh
exec hwloc-bind socket:1.core:$OMPI_COMM_WORLD_LOCAL_RANK $1
As pointed out, it is indeed convenient to use hwloc and its cousins for
binding processes. It is my understanding, however, that coupling hwloc
with cpu-shielding will enable exclusive access to cores within the set.

Thanks again,
Siddhartha Jana