aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/admin-guide/cgroup-v2.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/admin-guide/cgroup-v2.rst')
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst212
1 files changed, 204 insertions, 8 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193bcb262..7bf3f129c68b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/.
5-3-3-2. IO Latency Interface Files
5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
- 5-6-1. RDMA Interface Files
- 5-7. Misc
- 5-7-1. perf_event
+ 5-5. Cpuset
+ 5.5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+ 5-7-1. RDMA Interface Files
+ 5-8. Misc
+ 5-8-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -966,6 +968,12 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
+ cpu.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for CPU. See
+ Documentation/accounting/psi.txt for details.
+
Memory
------
@@ -1127,6 +1135,10 @@ PAGE_SIZE multiple when read back.
disk readahead. For now OOM in memory cgroup kills
tasks iff shortage has happened inside page fault.
+ This event is not raised if the OOM killer is not
+ considered as an option, e.g. for failed high-order
+ allocations.
+
oom_kill
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
@@ -1271,6 +1283,12 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
+ memory.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for memory. See
+ Documentation/accounting/psi.txt for details.
+
Usage Guidelines
~~~~~~~~~~~~~~~~
@@ -1408,6 +1426,12 @@ IO Interface Files
8:16 rbps=2097152 wbps=max riops=max wiops=max
+ io.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for IO. See
+ Documentation/accounting/psi.txt for details.
+
Writeback
~~~~~~~~~
@@ -1588,6 +1612,176 @@ through fork() or clone(). These will return -EAGAIN if the creation
of a new process would cause a cgroup policy to be violated.
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical. That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+ cpuset.cpus
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the requested CPUs to be used by tasks within this
+ cgroup. The actual list of CPUs to be granted, however, is
+ subjected to constraints imposed by its parent and can differ
+ from the requested CPUs.
+
+ The CPU numbers are comma-separated numbers or ranges.
+ For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.cpus" or all the available CPUs if none is found.
+
+ The value of "cpuset.cpus" stays constant until the next update
+ and won't be affected by any CPU hotplug events.
+
+ cpuset.cpus.effective
+ A read-only multiple values file which exists on all
+ cpuset-enabled cgroups.
+
+ It lists the onlined CPUs that are actually granted to this
+ cgroup by its parent. These CPUs are allowed to be used by
+ tasks within the current cgroup.
+
+ If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+ all the CPUs from the parent cgroup that can be available to
+ be used by this cgroup. Otherwise, it should be a subset of
+ "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+ can be granted. In this case, it will be treated just like an
+ empty "cpuset.cpus".
+
+ Its value will be affected by CPU hotplug events.
+
+ cpuset.mems
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the requested memory nodes to be used by tasks within
+ this cgroup. The actual list of memory nodes granted, however,
+ is subjected to constraints imposed by its parent and can differ
+ from the requested memory nodes.
+
+ The memory node numbers are comma-separated numbers or ranges.
+ For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.mems" or all the available memory nodes if none
+ is found.
+
+ The value of "cpuset.mems" stays constant until the next update
+ and won't be affected by any memory nodes hotplug events.
+
+ cpuset.mems.effective
+ A read-only multiple values file which exists on all
+ cpuset-enabled cgroups.
+
+ It lists the onlined memory nodes that are actually granted to
+ this cgroup by its parent. These memory nodes are allowed to
+ be used by tasks within the current cgroup.
+
+ If "cpuset.mems" is empty, it shows all the memory nodes from the
+ parent cgroup that will be available to be used by this cgroup.
+ Otherwise, it should be a subset of "cpuset.mems" unless none of
+ the memory nodes listed in "cpuset.mems" can be granted. In this
+ case, it will be treated just like an empty "cpuset.mems".
+
+ Its value will be affected by memory nodes hotplug events.
+
+ cpuset.cpus.partition
+ A read-write single value file which exists on non-root
+ cpuset-enabled cgroups. This flag is owned by the parent cgroup
+ and is not delegatable.
+
+ It accepts only the following input values when written to.
+
+ "root" - a paritition root
+ "member" - a non-root member of a partition
+
+ When set to be a partition root, the current cgroup is the
+ root of a new partition or scheduling domain that comprises
+ itself and all its descendants except those that are separate
+ partition roots themselves and their descendants. The root
+ cgroup is always a partition root.
+
+ There are constraints on where a partition root can be set.
+ It can only be set in a cgroup if all the following conditions
+ are true.
+
+ 1) The "cpuset.cpus" is not empty and the list of CPUs are
+ exclusive, i.e. they are not shared by any of its siblings.
+ 2) The parent cgroup is a partition root.
+ 3) The "cpuset.cpus" is also a proper subset of the parent's
+ "cpuset.cpus.effective".
+ 4) There is no child cgroups with cpuset enabled. This is for
+ eliminating corner cases that have to be handled if such a
+ condition is allowed.
+
+ Setting it to partition root will take the CPUs away from the
+ effective CPUs of the parent cgroup. Once it is set, this
+ file cannot be reverted back to "member" if there are any child
+ cgroups with cpuset enabled.
+
+ A parent partition cannot distribute all its CPUs to its
+ child partitions. There must be at least one cpu left in the
+ parent partition.
+
+ Once becoming a partition root, changes to "cpuset.cpus" is
+ generally allowed as long as the first condition above is true,
+ the change will not take away all the CPUs from the parent
+ partition and the new "cpuset.cpus" value is a superset of its
+ children's "cpuset.cpus" values.
+
+ Sometimes, external factors like changes to ancestors'
+ "cpuset.cpus" or cpu hotplug can cause the state of the partition
+ root to change. On read, the "cpuset.sched.partition" file
+ can show the following values.
+
+ "member" Non-root member of a partition
+ "root" Partition root
+ "root invalid" Invalid partition root
+
+ It is a partition root if the first 2 partition root conditions
+ above are true and at least one CPU from "cpuset.cpus" is
+ granted by the parent cgroup.
+
+ A partition root can become invalid if none of CPUs requested
+ in "cpuset.cpus" can be granted by the parent cgroup or the
+ parent cgroup is no longer a partition root itself. In this
+ case, it is not a real partition even though the restriction
+ of the first partition root condition above will still apply.
+ The cpu affinity of all the tasks in the cgroup will then be
+ associated with CPUs in the nearest ancestor partition.
+
+ An invalid partition root can be transitioned back to a
+ real partition root if at least one of the requested CPUs
+ can now be granted by its parent. In this case, the cpu
+ affinity of all the tasks in the formerly invalid partition
+ will be associated to the CPUs of the newly formed partition.
+ Changing the partition state of an invalid partition root to
+ "member" is always allowed even if child cpusets are present.
+
+
Device controller
-----------------
@@ -1857,8 +2051,10 @@ following two functions.
wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and
- associates the bio with the inode's owner cgroup. Can be
- called anytime between bio allocation and submission.
+ associates the bio with the inode's owner cgroup and the
+ corresponding request queue. This must be called after
+ a queue (device) has been associated with the bio and
+ before submission.
wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out.
@@ -1877,7 +2073,7 @@ the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
-cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.