WHAT IS A CPUSET
A cpuset is a kernel object to which a set of CPUs can be
assigned, and that can be attached to processes. When a cpuset is
attached to a process, the latter will run only on the CPUs
that have been assigned to the cpuset.
This page describes the BULL CPUSETS, i.e. cpusets with added migration and virtualization patches.
A cpuset can be created with mkdir in the cpuset
pseudo filesystem. CPUs must then be
assigned to it, and it can be attached to processes.
The cpuset of a process is a property that has been inherited
from its parent, or has been changed by attaching the process to another cpuset.
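The create / assign / attach sequence above can be sketched in Python. This is a minimal sketch, not the project's tooling: the function name create_cpuset is hypothetical, and the control file names 'cpus', 'mems' and 'tasks' are an assumption taken from the mainline Linux cpuset filesystem; the patched kernel described here may differ.

```python
from pathlib import Path


def create_cpuset(root, name, cpus, mems="0", pid=None):
    """Create cpuset `name` under `root`, assign CPUs, optionally attach `pid`.

    Assumption: the control files are named 'cpus', 'mems' and 'tasks'.
    """
    path = Path(root) / name
    path.mkdir()                                 # mkdir creates the cpuset
    (path / "cpus").write_text(cpus + "\n")      # assign CPUs, e.g. "3-4"
    (path / "mems").write_text(mems + "\n")      # assign memory nodes
    if pid is not None:
        (path / "tasks").write_text(f"{pid}\n")  # attach the process
    return path
```

On a real system the root argument would be the mount point of the cpuset filesystem and the call would normally need root privileges, for example create_cpuset("/dev/cpuset", "demo", "3-4", pid=1234).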
THE CPUSET TREE
Cpusets can be nested. When a process (attached to cpuset A) creates a cpuset B, B is considered to be a child of A, and B must stay inside A, i.e. only CPUs that have been assigned to A can be assigned to B. This can be used, for instance, to restrict a user shell to some CPUs: the user will be able to create cpusets inside the cpuset that was assigned to the shell, but only inside it.
When the system starts, only the 'top cpuset' (number 1), with all the CPUs of the system, exists. By default all processes will be attached to it. All the other cpusets will be inside it.
               +------+
               | root |       <-- top cpuset, with all CPUs
               | 1234 |
               +------+
                  |
               +------+
               | root |       <-- cpuset created by root, attached to user shell
               |  34  |           with a few CPUs (3 and 4 here)
               +------+
              /   |   \
     +------+     |     +------+
     | user |     |     | user |   <-- cpusets created by user from the shell
     |  4   |<-\  |  /->|  3   |
     +------+  |  |  |  +------+
        |      |  |  |     |
        V      |  V  |     V
     program1 <- shell -> program2
      (user)     (user)    (user)
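The nesting rule (a child may only hold CPUs of its parent) can be modelled in a few lines of Python. This is only a toy model of the tree pictured above, not the kernel data structure; the class name CpusetNode and its fields are hypothetical.

```python
class CpusetNode:
    """Toy model of one node of the cpuset tree (not the kernel structure)."""

    def __init__(self, name, cpus, parent=None):
        cpus = frozenset(cpus)
        # A child must stay inside its parent: only the parent's CPUs allowed.
        if parent is not None and not cpus <= parent.cpus:
            extra = sorted(cpus - parent.cpus)
            raise ValueError(f"{name}: CPUs {extra} are not in {parent.name}")
        self.name, self.cpus, self.parent = name, cpus, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


# The tree from the diagram: top cpuset, the shell's cpuset, two user cpusets.
top = CpusetNode("top", {1, 2, 3, 4})
shell = CpusetNode("shell", {3, 4}, parent=top)
left = CpusetNode("user-4", {4}, parent=shell)
right = CpusetNode("user-3", {3}, parent=shell)
```

Trying to create CpusetNode("bad", {1}, parent=shell) raises ValueError in this model, because CPU 1 was not assigned to the shell's cpuset.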
CPUs can be assigned to cpusets in two ways: strictly or
not. When a cpuset is 'STRICT', the system guarantees that all
the CPUs assigned to it are assigned only to it (not counting
its parents, of course). A cpuset can only be 'STRICT'
if its parent is also 'STRICT'. A CPU may be shared by several
cpusets only if none of them is 'STRICT'.
New: cpusets are no longer 'strict'. The flag is now called 'exclusive', and exists for CPUs and for memory nodes: cpu_exclusive and mem_exclusive.
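The sharing rules stated above can be written down as a small check. This is a sketch of the rule as described on this page only; the function name may_create and its arguments are hypothetical, and the kernel's actual test may be organised differently.

```python
def may_create(cpus, strict, parent_is_strict, others):
    """Check the 'STRICT' rules for a new cpuset.

    others: (cpus, strict) pairs for the existing cpusets that are not
    ancestors of the new one.
    """
    cpus = set(cpus)
    if strict and not parent_is_strict:
        return False          # a strict cpuset needs a strict parent
    for other_cpus, other_strict in others:
        if cpus & set(other_cpus) and (strict or other_strict):
            return False      # a CPU is shared only if neither side is strict
    return True
```

For example, may_create({3, 4}, False, True, [({4}, True)]) is False because CPU 4 already belongs to a strict cpuset, while the same request against a non-strict neighbour is accepted.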
Normally a cpuset exists as a kernel object after being
created with mkdir until it is destroyed with rmdir.
However, when a cpuset has the 'AUTOCLEAN'
attribute, it will be automatically destroyed once the system
finds that it is unused (and has been used, of course), typically
when the last process attached to it exits.
New: this flag has been renamed: now it is 'notify_on_release'.
When this flag is set and the cpuset becomes unused, the kernel will call a userland helper to remove the cpuset.
All information about the cpusets existing in the system can be found in the cpuset filesystem
which should be mounted as /dev/cpuset.
For each process, the cpuset it is attached to can be found in /proc/xxx/cpuset.
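Reading the per-process file is straightforward. The helper below is a minimal sketch; the name cpuset_of and the proc_root parameter (there only so the function can be tried against a scratch directory) are hypothetical.

```python
from pathlib import Path


def cpuset_of(pid="self", proc_root="/proc"):
    """Return the content of <proc_root>/<pid>/cpuset, whitespace stripped."""
    return (Path(proc_root) / str(pid) / "cpuset").read_text().strip()
```

Calling cpuset_of() with no arguments reads /proc/self/cpuset, which only exists on a kernel built with cpuset support.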
CPUs are assigned to cpusets by calling cpuset_alloc with a
mask of the desired CPUs. CPUs can be added to a cpuset at
any time. The processes attached to the cpuset will be updated. Thus
a program can benefit from new CPUs that would be added to
its cpuset. CPUs can also be removed from a cpuset, but only if:
- the resulting cpuset would NOT be empty (no CPUs)
- the CPUs to be removed are not assigned to a child cpuset
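The two removal conditions can be expressed as a small predicate. This is an illustrative model, not the kernel code; the name may_remove and its arguments are hypothetical.

```python
def may_remove(cpuset_cpus, to_remove, children_cpus):
    """Return True if `to_remove` may be taken out of a cpuset.

    children_cpus: iterable of CPU sets, one per child cpuset.
    """
    cpuset_cpus, to_remove = set(cpuset_cpus), set(to_remove)
    if not (cpuset_cpus - to_remove):
        return False                  # the cpuset would become empty
    for child in children_cpus:
        if to_remove & set(child):
            return False              # a CPU is still assigned to a child
    return True
```

With a cpuset holding CPUs 3 and 4 and one child holding CPU 3, removing CPU 4 is allowed, while removing CPU 3 or removing both CPUs is refused.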
The sched_setaffinity() call has been modified to take the
cpusets into account. The mask given to sched_setaffinity()
is now interpreted inside the cpuset, i.e. bit N of the mask refers
to CPU N of the cpuset. If the cpuset does not have enough CPUs (let's
say it has M CPUs), CPU 0 of the cpuset will also be CPU M and CPU 2xM,
CPU 1 will also be CPU M+1, and so on.
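The wrap-around rule amounts to taking the bit number modulo M. The sketch below models that translation in Python; the function name translate_mask is hypothetical, and the assumption that "CPU N of the cpuset" means the N-th CPU in increasing numerical order is mine, not stated on this page.

```python
def translate_mask(mask, cpuset_cpus):
    """Map a cpuset-relative affinity mask to physical CPU numbers.

    Bit N of `mask` selects CPU (N mod M) of the cpuset, where M is the
    number of CPUs in the cpuset (assumed to be taken in increasing order).
    """
    cpus = sorted(cpuset_cpus)
    if not cpus:
        raise ValueError("a cpuset cannot be empty")
    selected = set()
    n = 0
    while mask >> n:                  # stop once no higher bit is set
        if (mask >> n) & 1:
            selected.add(cpus[n % len(cpus)])
        n += 1
    return selected
```

In a cpuset holding physical CPUs 3 and 4, bit 0 and bit 2 of the mask both select CPU 3, and bit 1 and bit 3 both select CPU 4.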
Inside the kernel, the cpusets are represented by a list of
'struct cpuset' structures. The head of this list is the
static 'top cpuset' which holds all the CPUs of the system.
A cpuset_t is just an integer that is used to reference a
particular cpuset. The cpusets list of the kernel is then
searched for a structure with the same integer.
- Each struct task_struct now has a 'struct cpuset *' field pointing to the cpuset of the task. This field is copied by dup_task_struct() during a fork() [ in include/linux/sched.h ]
- During a fork(), the use_cpuset() function is called to register the new user (new process) of a particular cpuset [ in kernel/fork.c ]
- release_cpuset() is called in release_task() [ in kernel/exit.c ]
- The sched_setaffinity system call has been modified, so that the given mask is now considered 'inside the current cpuset'. [ in kernel/sched.c ]
The link between the cpusets and the scheduler has simply been
done using the existing set_cpus_allowed() function from sched.c.
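The user accounting done by use_cpuset() and release_cpuset() can be pictured with a toy model. Everything below is an assumption made for illustration: the class CpusetRef, its fields, the dictionary standing in for the kernel's list, and in particular the idea that the auto-clean destruction happens inside release_cpuset(); this page only says that an auto-clean cpuset is destroyed once it is unused.

```python
class CpusetRef:
    """Toy stand-in for 'struct cpuset' (fields are assumptions)."""

    def __init__(self, ident, autoclean=False):
        self.ident = ident            # the integer a cpuset_t would hold
        self.autoclean = autoclean
        self.users = 0                # number of processes attached


cpusets = {}  # stands in for the kernel's list of cpusets, keyed by integer


def use_cpuset(cs):
    """Register one more process using `cs` (done at fork time)."""
    cs.users += 1


def release_cpuset(cs):
    """Drop one user; an unused auto-clean cpuset is destroyed (assumed)."""
    cs.users -= 1
    if cs.users == 0 and cs.autoclean:
        del cpusets[cs.ident]
```

In this model a cpuset created with autoclean=True disappears from the table when its last user is released, while one without the attribute stays until it is removed explicitly.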
To use the previously described kernel mechanism, some unix commands have been provided.
Basic description of commands
pcreate, pdestroy, pmod
These commands are used to create, destroy and modify cpusets.
pexec does the same thing as the pcreate command, but is used as a wrapper
to launch an application directly inside the newly created cpuset. By default,
it creates auto-cleaned cpusets, which ensures that the cpuset is destroyed right
after the application terminates.
Lists all cpusets and their attributes.
To allocate CPUs, there are a lot of options described in the pexec,pcreate(1)
man pages. The behavior of applications can be predefined in a configuration
file.
To take NUMA concerns into account, a notion of width is defined. This
way, the range of CPUs is seen as a rectangle with a width of W and containing
NumCPU CPUs. CPUs in the same row belong to the same NUMA entity and are
considered close to each other. This is used in cyclic allocation mode and
in the alignment switch.
|0    |1    |2    |3    |..
|W    |W+1  |W+2  |W+3  |..
|...  |     |     |     |
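Read this way, the row of a CPU is simply its number divided by W. The helpers below are a small illustration of that arithmetic; the names numa_row and same_row are hypothetical.

```python
def numa_row(cpu, width):
    """Row of `cpu` in the W-wide rectangle (integer division by W)."""
    return cpu // width


def same_row(cpu_a, cpu_b, width):
    """True if both CPUs sit in the same row, i.e. are considered close."""
    return numa_row(cpu_a, width) == numa_row(cpu_b, width)
```

With W = 4, CPUs 0 and 3 share row 0, while CPU 4 starts row 1.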
Three modes are available: sequential, cyclic and user-defined.
- The sequential mode allocates CPUs in a linear way. It begins with CPU
0, then 1, 2, etc.
- The cyclic mode allocates CPUs in a stepping way. The first CPU
allocated will be 0, the second will be 0+W, the third 0+2xW, wrapping
around when exceeding MaxCPU. See NUMA considerations.
- The user-defined mode allocates CPUs in the way defined by the order
switch.
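The sequential and cyclic orders can be generated as follows. This is a sketch under one stated assumption: when the cyclic step goes past the last CPU, the walk restarts at CPU 1, then CPU 2, and so on. The page only says that the allocation wraps around, so the exact restart point is a guess; the function names are hypothetical.

```python
def sequential_order(num_cpus):
    """CPUs in linear order: 0, 1, 2, ..."""
    return list(range(num_cpus))


def cyclic_order(num_cpus, width):
    """CPUs in steps of `width`: 0, W, 2W, ..., then 1, 1+W, ... (assumed wrap)."""
    return [cpu
            for start in range(width)
            for cpu in range(start, num_cpus, width)]
```

With 8 CPUs and W = 4, sequential_order(8) walks 0 to 7 in order, while cyclic_order(8, 4) gives 0, 4, 1, 5, 2, 6, 3, 7, so consecutive allocations alternate between the two rows.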
- Auto-clean: sets the auto-clean flag of the cpuset in the kernel. When the
last process of the set disappears, the cpuset is cleaned up.
- Strict: sets the strict flag of the cpuset in the kernel. This cpuset will
not be allowed to share any CPU with another existing cpuset.
- Align: tells the allocator to try to get, if possible, a set of
CPUs on the same row.
- Order: specifies the user-defined order for the user-defined mode of
allocation.
To ensure that each critical application runs on different CPUs than
the others, we could always launch them within a cpuset, running pexec -np N
app, N being the number of CPUs needed by the application, for instance 1
if the application is single-threaded.
Concrete case: resource managers
Resource managers know how many processes they can launch on one machine.
They know how many they have already launched. The issue is, they don't know
where they have launched them. If you replace the
command line launched by the resource manager with
pexec -np N commandline, processes should be distributed across the
machine, with the method specified in the configuration file.
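For a resource manager written in Python, the substitution could look like the helper below. It only builds the argument list in the pexec -np N commandline form shown above; the function name wrap_with_pexec is hypothetical, and no pexec option other than -np is assumed.

```python
import shlex


def wrap_with_pexec(command_line, ncpus):
    """Return the argv a resource manager could run instead of `command_line`."""
    return ["pexec", "-np", str(ncpus)] + shlex.split(command_line)
```

The resulting list can be passed as is to subprocess.run() on a machine where pexec is installed.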
Sysadmins could create one cpuset per user to ensure each user will stay on
the part of the machine he was granted access to. To do this, you will have to:
- create one area as root for each user,
- create one area as user inside this one.
The command for each user will be (as root):
pexec -np N --strict su user1
pcreate -np N --strict
[ keep the cpuset number returned by pcreate, let's say it is C1 ]
Then edit pshell.conf(5) and add a line :
Repeat the operation for each user.
The root's cpuset containing the user's one is needed to ensure that the user
won't be able to extend his own cpuset if there are free CPUs. You may skip
it if you want users to be able to use the remaining CPUs.
The user's cpuset is needed because the user may need to reattach processes to
his main cpuset, and this would not be possible with one owned by root.
This page still needs updates.
Wed Oct 13 10:56:22 CEST 2004