A cpuset is a kernel object to which a set of CPUs can be assigned and which can be attached to processes. When a cpuset is attached to a process, that process runs only on the CPUs assigned to the cpuset. This page describes the BULL CPUSETS, i.e. cpusets with added migration and virtualization patches.


A cpuset can be created with mkdir in the cpuset pseudo filesystem. CPUs must then be assigned to it, and it can be attached to processes.
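The three steps above (create, assign CPUs, attach) can be sketched as follows. This is only an illustration: the file names `cpus` and `tasks` are assumptions borrowed from the mainline Linux cpuset filesystem and may differ in the Bull patches, and the `root` argument is a hypothetical parameter so that the sketch can be tried on an ordinary directory.

```python
import os


def create_cpuset(root, name, cpus):
    """Create the cpuset `name` under `root` and assign `cpus` to it."""
    path = os.path.join(root, name)
    os.mkdir(path)  # creating the directory is what creates the cpuset
    # ASSUMPTION: CPUs are assigned by writing a list such as "3,4"
    # to a control file named "cpus".
    with open(os.path.join(path, "cpus"), "w") as f:
        f.write(",".join(str(c) for c in cpus))
    return path


def attach(cpuset_path, pid):
    """Attach process `pid` to the cpuset at `cpuset_path`."""
    # ASSUMPTION: a process is attached by writing its PID to "tasks".
    with open(os.path.join(cpuset_path, "tasks"), "w") as f:
        f.write(str(pid))
```

On a real system one would presumably call `create_cpuset("/dev/cpuset", "demo", [3, 4])` as root and then `attach(path, os.getpid())`.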

The cpuset of a process is a property that is inherited from its parent, unless it has been changed by attaching the process to another cpuset.


Cpusets can be nested. When a process attached to cpuset A creates a cpuset B, B is considered a child of A, and B has to stay inside A: only CPUs that have been assigned to A can be assigned to B. This can be used, for instance, to restrict a user shell to some CPUs. The user will be able to create cpusets, but only inside the cpuset that was assigned to his shell.

When the system starts, only the 'top cpuset' (number 1), with all the CPUs of the system, exists. By default all processes will be attached to it. All the other cpusets will be inside it.

                +------+
                | root | <-- top cpuset, with all CPUs
                | 1234 |
                +------+
                    |
                +------+
                | root | <-- cpuset created by root, attached to user shell
                |   34 |     with a few CPUs (3 and 4 here)
                +------+
               /    |   \
        +------+    |    +------+
        | user |    |    | user | <-- cpusets created by user from the shell
        |    4 |<-\ | /->|   3  |
        +------+  | | |  +------+
          |       | | |      |
          V       | V |      V
      program1 <- shell -> program2
       (user)     (user)    (user)
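The hierarchy in the diagram can be reproduced with a small toy model. It is not the kernel data structure; the class and attribute names are made up for illustration, and only the nesting rule (a child may hold only CPUs of its parent) is taken from the text.

```python
class Cpuset:
    """Toy model of a nested cpuset (not the kernel structure)."""

    def __init__(self, owner, cpus, parent=None):
        cpus = frozenset(cpus)
        # Nesting rule: a child may only be given CPUs that its parent has.
        if parent is not None and not cpus <= parent.cpus:
            extra = sorted(cpus - parent.cpus)
            raise ValueError(f"CPUs {extra} are not in the parent cpuset")
        self.owner = owner
        self.cpus = cpus
        self.parent = parent


# The hierarchy shown in the diagram.
top = Cpuset("root", {1, 2, 3, 4})
shell = Cpuset("root", {3, 4}, parent=top)
prog1 = Cpuset("user", {4}, parent=shell)
prog2 = Cpuset("user", {3}, parent=shell)

# The user cannot reach CPU 1, because it is outside the shell's cpuset.
try:
    Cpuset("user", {1}, parent=shell)
    escaped = True
except ValueError:
    escaped = False
```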


CPUs can be assigned to cpusets in two ways: strictly or not. When a cpuset is 'STRICT', the system guarantees that all the CPUs assigned to it are assigned only to it (not counting its parents). A cpuset can only be 'STRICT' if its parent is also 'STRICT'. A CPU may be shared by several cpusets only if none of them is 'STRICT'.

New: cpusets are no longer 'strict'. The flag is now called 'exclusive', and exists for both CPUs and memory nodes: cpu_exclusive and mem_exclusive.
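The sharing rule can be expressed as a small check. This is a sketch of the rule as stated here, not of the kernel code: the function name and the dictionary layout are invented for the example, and the caller is assumed to pass only cpusets that are not ancestors of one another, since parents are not counted.

```python
from itertools import combinations


def sharing_conflicts(cpusets):
    """Return pairs of cpuset names that break the sharing rule.

    `cpusets` maps a name to (set_of_cpus, is_exclusive).  Two cpusets
    conflict when they share a CPU and at least one of them is exclusive
    ('STRICT' in the older terminology).
    """
    conflicts = []
    for a, b in combinations(sorted(cpusets), 2):
        cpus_a, excl_a = cpusets[a]
        cpus_b, excl_b = cpusets[b]
        if (cpus_a & cpus_b) and (excl_a or excl_b):
            conflicts.append((a, b))
    return conflicts


example = {
    "a": ({1, 2}, True),   # exclusive
    "b": ({2, 3}, False),  # shares CPU 2 with the exclusive cpuset "a"
    "c": ({3}, False),     # shares CPU 3 with "b"; allowed, neither is exclusive
}
result = sharing_conflicts(example)
```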


Normally a cpuset exists as a kernel object from the time it is created with mkdir until it is destroyed with rmdir. However, when a cpuset has the 'AUTOCLEAN' attribute, it is automatically destroyed once the system finds that it is unused (after having been used), typically when the last process attached to it exits.

New: this flag has been renamed: now it is 'notify_on_release'. When this flag is set and the cpuset becomes unused, the kernel will call a userland helper to remove the cpuset.

All information about the cpusets existing in the system can be found in the cpuset filesystem, which should be mounted as /dev/cpuset. For each process, the cpuset it is attached to can be found in /proc/xxx/cpuset, where xxx is the process ID.
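Reading the cpuset of a process is then a plain file read. A minimal sketch follows; the function name and the `proc_root` parameter are hypothetical, and the format of the returned string (a number or a path) depends on the kernel version, so it is returned uninterpreted.

```python
import os


def cpuset_of(pid="self", proc_root="/proc"):
    """Return the raw contents of /proc/<pid>/cpuset, stripped of whitespace."""
    with open(os.path.join(proc_root, str(pid), "cpuset")) as f:
        return f.read().strip()
```

Calling `cpuset_of()` with no arguments reports the cpuset of the calling process, on a kernel that provides this file.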


CPUs are assigned to cpusets by calling cpuset_alloc with a mask of the desired CPUs. CPUs can be added to a cpuset at any time. CPUs can also be removed from a cpuset, but only under certain conditions. When the CPUs of a cpuset change, the processes attached to it are updated, so a program can benefit from new CPUs added to its cpuset.


The sched_setaffinity() call has been modified to take cpusets into account. The mask given to sched_setaffinity() is now interpreted relative to the cpuset, i.e. bit N of the mask refers to CPU N of the cpuset. If the cpuset does not have enough CPUs (say it has M CPUs), CPU 0 of the cpuset is also CPU M and CPU 2xM, CPU 1 is also CPU M+1, and so on.
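The wrap-around rule amounts to taking the bit index modulo M. The sketch below models this translation in user space; it is not the kernel implementation, the function name is invented, and it assumes that "CPU N of the cpuset" means the Nth CPU of the cpuset in ascending physical order.

```python
def translate_mask(mask, cpuset_cpus):
    """Map a cpuset-relative affinity mask to a set of physical CPU numbers."""
    cpus = sorted(cpuset_cpus)  # ASSUMPTION: relative CPU N is the Nth smallest
    m = len(cpus)
    if m == 0:
        raise ValueError("the cpuset has no CPUs")
    physical = set()
    bit = 0
    while mask:
        if mask & 1:
            # Bit N wraps around: relative CPUs N, N+M, N+2M, ... are the same CPU.
            physical.add(cpus[bit % m])
        mask >>= 1
        bit += 1
    return physical


# A cpuset holding physical CPUs 3 and 4 (M = 2).
first = translate_mask(0b01, [3, 4])      # bit 0 -> physical CPU 3
wrapped = translate_mask(0b100, [3, 4])   # bit 2 wraps to relative CPU 0 -> 3
both = translate_mask(0b1010, [3, 4])     # bits 1 and 3 both map to CPU 4
```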


Inside the kernel, the cpusets are represented by a list of 'struct cpuset' structures. The head of this list is the static 'top cpuset' which holds all the CPUs of the system. A cpuset_t is just an integer that is used to reference a particular cpuset. The cpusets list of the kernel is then searched for a structure with the same integer.
  • Each struct task_struct now has a 'struct cpuset *' field pointing to the cpuset of the task. This field is copied by dup_task_struct() during a fork() [ in include/linux/sched.h ]
  • During a fork(), the use_cpuset() function is called to register the new user (the new process) of a particular cpuset [ in kernel/fork.c ]
  • release_cpuset() is called in release_task() [ in kernel/exit.c ]
  • The sched_setaffinity system call has been modified, so that the given mask is now considered 'inside the current cpuset' [ in kernel/sched.c ]
  • The link between the cpusets and the scheduler is made simply by using the existing set_cpus_allowed() function from sched.c.
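The user counting done at fork and exit can be pictured with a toy user-space model. Only the names use_cpuset() and release_cpuset() come from the list above; the class, its fields and the auto-clean behaviour shown are assumptions made to illustrate the idea, not the kernel code.

```python
class CpusetModel:
    """Toy stand-in for 'struct cpuset' with a user count."""

    def __init__(self, autoclean=False):
        self.users = 0            # number of processes attached
        self.autoclean = autoclean
        self.destroyed = False


def use_cpuset(cs):
    """Model of the call made at fork(): register one more user."""
    cs.users += 1


def release_cpuset(cs):
    """Model of the call made at exit: drop one user, clean up if asked to."""
    cs.users -= 1
    # ASSUMPTION: an auto-cleaned cpuset goes away with its last user.
    if cs.users == 0 and cs.autoclean:
        cs.destroyed = True


cs = CpusetModel(autoclean=True)
use_cpuset(cs)        # the first process is attached
use_cpuset(cs)        # it forks: the child inherits the cpuset
release_cpuset(cs)    # the child exits; the cpuset is still in use
still_alive = not cs.destroyed
release_cpuset(cs)    # the last process exits
```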


    To use the kernel mechanism described above, some Unix commands have been written.

    Basic description of commands

    pcreate, pdestroy, pmod

    These commands are used to create, destroy and modify cpusets.


    pexec does the same thing as the pcreate command, but is used as a wrapper to launch an application directly inside the newly created cpuset. By default, it creates auto-cleaned cpusets, which ensures that the cpuset is destroyed right after the application terminates.


    Lists all cpusets and their attributes.

    Allocating cpus

    Many options for allocating CPUs are described in the pexec(1) and pcreate(1) man pages. The behavior of applications can be predefined in a configuration file, pexec.conf(5).

    NUMA considerations.

    To take NUMA concerns into account, a notion of width is defined. The range of CPUs is seen as a rectangle of width W containing NumCPU CPUs. CPUs in the same row belong to the same NUMA entity and are considered close to each other. This is used in cyclic allocation mode and in the alignment switch.
       |0   |1   |2   |3   |..
       |W   |W+1 |W+2 |W+3 |..
       |2W  |2W+1|2W+2|2W+3|..
       |... |    |    |    |
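With this layout, the position of a CPU in the rectangle follows from an integer division by W. A short sketch, with function names invented for the example:

```python
def numa_position(cpu, width):
    """Return (row, column) of `cpu` in a rectangle of the given width."""
    return divmod(cpu, width)


def same_numa_entity(cpu_a, cpu_b, width):
    """CPUs in the same row belong to the same NUMA entity."""
    return cpu_a // width == cpu_b // width


# With W = 4, CPUs 4..7 form the second row.
pos = numa_position(5, 4)
close = same_numa_entity(4, 7, 4)
far = same_numa_entity(3, 4, 4)
```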


    Three modes are available: sequential, cyclic and user-defined.

    Flags options

    Other options

    Usage considerations

    conflicts avoidance

    To ensure that each critical application runs on different CPUs from the others, we could always launch them within a cpuset by running pexec -np N app, N being the number of CPUs needed by the application, for instance 1 if the application is single-threaded.

    Concrete case: resource managers
    Resource managers know how many processes they can launch on one machine, and how many they have already launched. The issue is that they do not know where they launched them. If you replace the command line launched by the manager with
    pexec -np N commandline
    processes should be distributed across the machine, using the method specified in the configuration file.

    user partitions

    Sysadmins could create one cpuset per user to ensure that each user stays on the part of the machine he was granted access to. To do this, run the following commands for each user (as root):
     pexec -np N --strict su user1
     pcreate -np N --strict
    [ keep the cpuset number returned by pcreate, let's say it is C1 ]
    Then edit pshell.conf(5) and add a line:
     user1 C1
    Repeat the operation for each user.

    The root-owned cpuset containing the user's cpuset is needed to ensure that the user cannot extend his own cpuset when free CPUs exist. You can omit it if you want users to be able to use the remaining CPUs. The user's own cpuset is needed because the user may need to reattach processes to his main cpuset, which would not be possible with a cpuset owned by root.

    This page still needs updates. Wed Oct 13 10:56:22 CEST 2004