aboutsummaryrefslogtreecommitdiff
path: root/Documentation/cgroup-legacy/unified-hierarchy.txt
diff options
context:
space:
mode:
authorTejun Heo <tj@kernel.org>2015-11-16 11:13:34 -0500
committerAlex Shi <alex.shi@linaro.org>2016-11-29 15:25:22 +0800
commit07d48acbfee2ac01687514ab6bf3144bbd2c0e13 (patch)
tree39e23aea55a16ec6c9e362329a0658119c970e1a /Documentation/cgroup-legacy/unified-hierarchy.txt
parentcd5367ae02a488450b714a5d5f47afb014ea6541 (diff)
cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
In preparation for adding cgroup2 documentation, rename Documentation/cgroups/ to Documentation/cgroup-legacy/. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> (cherry picked from commit 0d942766453f3d23a51e0a2d430340a178b0903e) Signed-off-by: Alex Shi <alex.shi@linaro.org>
Diffstat (limited to 'Documentation/cgroup-legacy/unified-hierarchy.txt')
-rw-r--r--Documentation/cgroup-legacy/unified-hierarchy.txt645
1 files changed, 645 insertions, 0 deletions
diff --git a/Documentation/cgroup-legacy/unified-hierarchy.txt b/Documentation/cgroup-legacy/unified-hierarchy.txt
new file mode 100644
index 000000000000..c1f0e8780960
--- /dev/null
+++ b/Documentation/cgroup-legacy/unified-hierarchy.txt
@@ -0,0 +1,645 @@
+
+Cgroup unified hierarchy
+
+April, 2014 Tejun Heo <tj@kernel.org>
+
+This document describes the changes made by unified hierarchy and
+their rationales. It will eventually be merged into the main cgroup
+documentation.
+
+CONTENTS
+
+1. Background
+2. Basic Operation
+ 2-1. Mounting
+ 2-2. cgroup.subtree_control
+ 2-3. cgroup.controllers
+3. Structural Constraints
+ 3-1. Top-down
+ 3-2. No internal tasks
+4. Delegation
+ 4-1. Model of delegation
+ 4-2. Common ancestor rule
+5. Other Changes
+ 5-1. [Un]populated Notification
+ 5-2. Other Core Changes
+ 5-3. Controller File Conventions
+ 5-3-1. Format
+ 5-3-2. Control Knobs
+ 5-4. Per-Controller Changes
+ 5-4-1. io
+ 5-4-2. cpuset
+ 5-4-3. memory
+6. Planned Changes
+ 6-1. CAP for resource control
+
+
+1. Background
+
+cgroup allows an arbitrary number of hierarchies and each hierarchy
+can host any number of controllers. While this seems to provide a
+high level of flexibility, it isn't quite useful in practice.
+
+For example, as there is only one instance of each controller, utility
+type controllers such as freezer which can be useful in all
+hierarchies can only be used in one. The issue is exacerbated by the
+fact that controllers can't be moved around once hierarchies are
+populated. Another issue is that all controllers bound to a hierarchy
+are forced to have exactly the same view of the hierarchy. It isn't
+possible to vary the granularity depending on the specific controller.
+
+In practice, these issues heavily limit which controllers can be put
+on the same hierarchy and most configurations resort to putting each
+controller on its own hierarchy. Only closely related ones, such as
+the cpu and cpuacct controllers, make sense to put on the same
+hierarchy. This often means that userland ends up managing multiple
+similar hierarchies repeating the same steps on each hierarchy
+whenever a hierarchy management operation is necessary.
+
+Unfortunately, support for multiple hierarchies comes at a steep cost.
+Internal implementation in cgroup core proper is dazzlingly
+complicated but more importantly the support for multiple hierarchies
+restricts how cgroup is used in general and what controllers can do.
+
+There's no limit on how many hierarchies there may be, which means
+that a task's cgroup membership can't be described in finite length.
+The key may contain any varying number of entries and is unlimited in
+length, which makes it highly awkward to handle and leads to addition
+of controllers which exist only to identify membership, which in turn
+exacerbates the original problem.
+
+Also, as a controller can't have any expectation regarding what shape
+of hierarchies other controllers would be on, each controller has to
+assume that all other controllers are operating on completely
+orthogonal hierarchies. This makes it impossible, or at least very
+cumbersome, for controllers to cooperate with each other.
+
+In most use cases, putting controllers on hierarchies which are
+completely orthogonal to each other isn't necessary. What usually is
+called for is the ability to have differing levels of granularity
+depending on the specific controller. In other words, hierarchy may
+be collapsed from leaf towards root when viewed from specific
+controllers. For example, a given configuration might not care about
+how memory is distributed beyond a certain level while still wanting
+to control how CPU cycles are distributed.
+
+Unified hierarchy is the next version of cgroup interface. It aims to
+address the aforementioned issues by having more structure while
+retaining enough flexibility for most use cases. Various other
+general and controller-specific interface issues are also addressed in
+the process.
+
+
+2. Basic Operation
+
+2-1. Mounting
+
+Unified hierarchy can be mounted with the following mount command.
+
+ mount -t cgroup2 none $MOUNT_POINT
+
+All controllers which support the unified hierarchy and are not bound
+to other hierarchies are automatically bound to unified hierarchy and
+show up at the root of it. Controllers which are enabled only in the
+root of unified hierarchy can be bound to other hierarchies. This
+allows mixing unified hierarchy with the traditional multiple
+hierarchies in a fully backward compatible way.
+
+A controller can be moved across hierarchies only after the controller
+is no longer referenced in its current hierarchy. Because per-cgroup
+controller states are destroyed asynchronously and controllers may
+have lingering references, a controller may not show up immediately on
+the unified hierarchy after the final umount of the previous
+hierarchy. Similarly, a controller should be fully disabled to be
+moved out of the unified hierarchy and it may take some time for the
+disabled controller to become available for other hierarchies;
+furthermore, due to dependencies among controllers, other controllers
+may need to be disabled too.
+
+While useful for development and manual configurations, dynamically
+moving controllers between the unified and other hierarchies is
+strongly discouraged for production use. It is recommended to decide
+the hierarchies and controller associations before starting using the
+controllers.
+
+
+2-2. cgroup.subtree_control
+
+All cgroups on unified hierarchy have a "cgroup.subtree_control" file
+which governs which controllers are enabled on the children of the
+cgroup. Let's assume a hierarchy like the following.
+
+ root - A - B - C
+ \ D
+
+root's "cgroup.subtree_control" file determines which controllers are
+enabled on A. A's on B. B's on C and D. This coincides with the
+fact that controllers on the immediate sub-level are used to
+distribute the resources of the parent. In fact, it's natural to
+assume that resource control knobs of a child belong to its parent.
+Enabling a controller in a "cgroup.subtree_control" file declares that
+distribution of the respective resources of the cgroup will be
+controlled. Note that this means that controller enable states are
+shared among siblings.
+
+When read, the file contains a space-separated list of currently
+enabled controllers. A write to the file should contain a
+space-separated list of controllers with '+' or '-' prefixed (without
+the quotes). Controllers prefixed with '+' are enabled and '-'
+disabled. If a controller is listed multiple times, the last entry
+wins. The specific operations are executed atomically - either all
+succeed or fail.
+
+
+2-3. cgroup.controllers
+
+Read-only "cgroup.controllers" file contains a space-separated list of
+controllers which can be enabled in the cgroup's
+"cgroup.subtree_control" file.
+
+In the root cgroup, this lists controllers which are not bound to
+other hierarchies and the content changes as controllers are bound to
+and unbound from other hierarchies.
+
+In non-root cgroups, the content of this file equals that of the
+parent's "cgroup.subtree_control" file as only controllers enabled
+from the parent can be used in its children.
+
+
+3. Structural Constraints
+
+3-1. Top-down
+
+As it doesn't make sense to nest control of an uncontrolled resource,
+all non-root "cgroup.subtree_control" files can only contain
+controllers which are enabled in the parent's "cgroup.subtree_control"
+file. A controller can be enabled only if the parent has the
+controller enabled and a controller can't be disabled if one or more
+children have it enabled.
+
+
+3-2. No internal tasks
+
+One long-standing issue that cgroup faces is the competition between
+tasks belonging to the parent cgroup and its children cgroups. This
+is inherently nasty as two different types of entities compete and
+there is no agreed-upon obvious way to handle it. Different
+controllers are doing different things.
+
+The cpu controller considers tasks and cgroups as equivalents and maps
+nice levels to cgroup weights. This works for some cases but falls
+flat when children should be allocated specific ratios of CPU cycles
+and the number of internal tasks fluctuates - the ratios constantly
+change as the number of competing entities fluctuates. There also are
+other issues. The mapping from nice level to weight isn't obvious or
+universal, and there are various other knobs which simply aren't
+available for tasks.
+
+The io controller implicitly creates a hidden leaf node for each
+cgroup to host the tasks. The hidden leaf has its own copies of all
+the knobs with "leaf_" prefixed. While this allows equivalent control
+over internal tasks, it's with serious drawbacks. It always adds an
+extra layer of nesting which may not be necessary, makes the interface
+messy and significantly complicates the implementation.
+
+The memory controller currently doesn't have a way to control what
+happens between internal tasks and child cgroups and the behavior is
+not clearly defined. There have been attempts to add ad-hoc behaviors
+and knobs to tailor the behavior to specific workloads. Continuing
+this direction will lead to problems which will be extremely difficult
+to resolve in the long term.
+
+Multiple controllers struggle with internal tasks and came up with
+different ways to deal with it; unfortunately, all the approaches in
+use now are severely flawed and, furthermore, the widely different
+behaviors make cgroup as whole highly inconsistent.
+
+It is clear that this is something which needs to be addressed from
+cgroup core proper in a uniform way so that controllers don't need to
+worry about it and cgroup as a whole shows a consistent and logical
+behavior. To achieve that, unified hierarchy enforces the following
+structural constraint:
+
+ Except for the root, only cgroups which don't contain any task may
+ have controllers enabled in their "cgroup.subtree_control" files.
+
+Combined with other properties, this guarantees that, when a
+controller is looking at the part of the hierarchy which has it
+enabled, tasks are always only on the leaves. This rules out
+situations where child cgroups compete against internal tasks of the
+parent.
+
+There are two things to note. Firstly, the root cgroup is exempt from
+the restriction. Root contains tasks and anonymous resource
+consumption which can't be associated with any other cgroup and
+requires special treatment from most controllers. How resource
+consumption in the root cgroup is governed is up to each controller.
+
+Secondly, the restriction doesn't take effect if there is no enabled
+controller in the cgroup's "cgroup.subtree_control" file. This is
+important as otherwise it wouldn't be possible to create children of a
+populated cgroup. To control resource distribution of a cgroup, the
+cgroup must create children and transfer all its tasks to the children
+before enabling controllers in its "cgroup.subtree_control" file.
+
+
+4. Delegation
+
+4-1. Model of delegation
+
+A cgroup can be delegated to a less privileged user by granting write
+access of the directory and its "cgroup.procs" file to the user. Note
+that the resource control knobs in a given directory concern the
+resources of the parent and thus must not be delegated along with the
+directory.
+
+Once delegated, the user can build sub-hierarchy under the directory,
+organize processes as it sees fit and further distribute the resources
+it got from the parent. The limits and other settings of all resource
+controllers are hierarchical and regardless of what happens in the
+delegated sub-hierarchy, nothing can escape the resource restrictions
+imposed by the parent.
+
+Currently, cgroup doesn't impose any restrictions on the number of
+cgroups in or nesting depth of a delegated sub-hierarchy; however,
+this may in the future be limited explicitly.
+
+
+4-2. Common ancestor rule
+
+On the unified hierarchy, to write to a "cgroup.procs" file, in
+addition to the usual write permission to the file and uid match, the
+writer must also have write access to the "cgroup.procs" file of the
+common ancestor of the source and destination cgroups. This prevents
+delegatees from smuggling processes across disjoint sub-hierarchies.
+
+Let's say cgroups C0 and C1 have been delegated to user U0 who created
+C00, C01 under C0 and C10 under C1 as follows.
+
+ ~~~~~~~~~~~~~ - C0 - C00
+ ~ cgroup ~ \ C01
+ ~ hierarchy ~
+ ~~~~~~~~~~~~~ - C1 - C10
+
+C0 and C1 are separate entities in terms of resource distribution
+regardless of their relative positions in the hierarchy. The
+resources the processes under C0 are entitled to are controlled by
+C0's ancestors and may be completely different from C1. It's clear
+that the intention of delegating C0 to U0 is allowing U0 to organize
+the processes under C0 and further control the distribution of C0's
+resources.
+
+On traditional hierarchies, if a task has write access to "tasks" or
+"cgroup.procs" file of a cgroup and its uid agrees with the target, it
+can move the target to the cgroup. In the above example, U0 will not
+only be able to move processes in each sub-hierarchy but also across
+the two sub-hierarchies, effectively allowing it to violate the
+organizational and resource restrictions implied by the hierarchical
+structure above C0 and C1.
+
+On the unified hierarchy, let's say U0 wants to write the pid of a
+process which has a matching uid and is currently in C10 into
+"C00/cgroup.procs". U0 obviously has write access to the file and
+migration permission on the process; however, the common ancestor of
+the source cgroup C10 and the destination cgroup C00 is above the
+points of delegation and U0 would not have write access to its
+"cgroup.procs" and thus be denied with -EACCES.
+
+
+5. Other Changes
+
+5-1. [Un]populated Notification
+
+cgroup users often need a way to determine when a cgroup's
+subhierarchy becomes empty so that it can be cleaned up. cgroup
+currently provides release_agent for it; unfortunately, this mechanism
+is riddled with issues.
+
+- It delivers events by forking and execing a userland binary
+ specified as the release_agent. This is a long deprecated method of
+ notification delivery. It's extremely heavy, slow and cumbersome to
+ integrate with larger infrastructure.
+
+- There is single monitoring point at the root. There's no way to
+ delegate management of a subtree.
+
+- The event isn't recursive. It triggers when a cgroup doesn't have
+ any tasks or child cgroups. Events for internal nodes trigger only
+ after all children are removed. This again makes it impossible to
+ delegate management of a subtree.
+
+- Events are filtered from the kernel side. A "notify_on_release"
+ file is used to subscribe to or suppress release events. This is
+ unnecessarily complicated and probably done this way because event
+ delivery itself was expensive.
+
+Unified hierarchy implements "populated" field in "cgroup.events"
+interface file which can be used to monitor whether the cgroup's
+subhierarchy has tasks in it or not. Its value is 0 if there is no
+task in the cgroup and its descendants; otherwise, 1. poll and
+[id]notify events are triggered when the value changes.
+
+This is significantly lighter and simpler and trivially allows
+delegating management of subhierarchy - subhierarchy monitoring can
+block further propagation simply by putting itself or another process
+in the subhierarchy and monitor events that it's interested in from
+there without interfering with monitoring higher in the tree.
+
+In unified hierarchy, the release_agent mechanism is no longer
+supported and the interface files "release_agent" and
+"notify_on_release" do not exist.
+
+
+5-2. Other Core Changes
+
+- None of the mount options is allowed.
+
+- remount is disallowed.
+
+- rename(2) is disallowed.
+
+- The "tasks" file is removed. Everything should at process
+ granularity. Use the "cgroup.procs" file instead.
+
+- The "cgroup.procs" file is not sorted. pids will be unique unless
+ they got recycled in-between reads.
+
+- The "cgroup.clone_children" file is removed.
+
+- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged
+ to before exiting. If the cgroup is removed before the zombie is
+ reaped, " (deleted)" is appeneded to the path.
+
+
+5-3. Controller File Conventions
+
+5-3-1. Format
+
+In general, all controller files should be in one of the following
+formats whenever possible.
+
+- Values only files
+
+ VAL0 VAL1...\n
+
+- Flat keyed files
+
+ KEY0 VAL0\n
+ KEY1 VAL1\n
+ ...
+
+- Nested keyed files
+
+ KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+ KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+ ...
+
+For a writeable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+
+For both flat and nested keyed files, only the values for a single key
+can be written at a time. For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
+
+
+5-3-2. Control Knobs
+
+- Settings for a single feature should generally be implemented in a
+ single file.
+
+- In general, the root cgroup should be exempt from resource control
+ and thus shouldn't have resource control knobs.
+
+- If a controller implements ratio based resource distribution, the
+ control knob should be named "weight" and have the range [1, 10000]
+ and 100 should be the default value. The values are chosen to allow
+ enough and symmetric bias in both directions while keeping it
+ intuitive (the default is 100%).
+
+- If a controller implements an absolute resource guarantee and/or
+ limit, the control knobs should be named "min" and "max"
+ respectively. If a controller implements best effort resource
+ gurantee and/or limit, the control knobs should be named "low" and
+ "high" respectively.
+
+ In the above four control files, the special token "max" should be
+ used to represent upward infinity for both reading and writing.
+
+- If a setting has configurable default value and specific overrides,
+ the default settings should be keyed with "default" and appear as
+ the first entry in the file. Specific entries can use "default" as
+ its value to indicate inheritance of the default value.
+
+- For events which are not very high frequency, an interface file
+ "events" should be created which lists event key value pairs.
+ Whenever a notifiable event happens, file modified event should be
+ generated on the file.
+
+
+5-4. Per-Controller Changes
+
+5-4-1. io
+
+- blkio is renamed to io. The interface is overhauled anyway. The
+ new name is more in line with the other two major controllers, cpu
+ and memory, and better suited given that it may be used for cgroup
+ writeback without involving block layer.
+
+- Everything including stat is always hierarchical making separate
+ recursive stat files pointless and, as no internal node can have
+ tasks, leaf weights are meaningless. The operation model is
+ simplified and the interface is overhauled accordingly.
+
+ io.stat
+
+ The stat file. The reported stats are from the point where
+ bio's are issued to request_queue. The stats are counted
+ independent of which policies are enabled. Each line in the
+ file follows the following format. More fields may later be
+ added at the end.
+
+ $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS
+
+ io.weight
+
+ The weight setting, currently only available and effective if
+ cfq-iosched is in use for the target device. The weight is
+ between 1 and 10000 and defaults to 100. The first line
+ always contains the default weight in the following format to
+ use when per-device setting is missing.
+
+ default $WEIGHT
+
+ Subsequent lines list per-device weights of the following
+ format.
+
+ $MAJ:$MIN $WEIGHT
+
+ Writing "$WEIGHT" or "default $WEIGHT" changes the default
+ setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
+ while "$MAJ:$MIN default" clears it.
+
+ This file is available only on non-root cgroups.
+
+ io.max
+
+ The maximum bandwidth and/or iops setting, only available if
+ blk-throttle is enabled. The file is of the following format.
+
+ $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS
+
+ ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
+ read/write IOs per second. "max" indicates no limit. Writing
+ to the file follows the same format but the individual
+ settings may be omitted or specified in any order.
+
+ This file is available only on non-root cgroups.
+
+
+5-4-2. cpuset
+
+- Tasks are kept in empty cpusets after hotplug and take on the masks
+ of the nearest non-empty ancestor, instead of being moved to it.
+
+- A task can be moved into an empty cpuset, and again it takes on the
+ masks of the nearest non-empty ancestor.
+
+
+5-4-3. memory
+
+- use_hierarchy is on by default and the cgroup file for the flag is
+ not created.
+
+- The original lower boundary, the soft limit, is defined as a limit
+ that is per default unset. As a result, the set of cgroups that
+ global reclaim prefers is opt-in, rather than opt-out. The costs
+ for optimizing these mostly negative lookups are so high that the
+ implementation, despite its enormous size, does not even provide the
+ basic desirable behavior. First off, the soft limit has no
+ hierarchical meaning. All configured groups are organized in a
+ global rbtree and treated like equal peers, regardless where they
+ are located in the hierarchy. This makes subtree delegation
+ impossible. Second, the soft limit reclaim pass is so aggressive
+ that it not just introduces high allocation latencies into the
+ system, but also impacts system performance due to overreclaim, to
+ the point where the feature becomes self-defeating.
+
+ The memory.low boundary on the other hand is a top-down allocated
+ reserve. A cgroup enjoys reclaim protection when it and all its
+ ancestors are below their low boundaries, which makes delegation of
+ subtrees possible. Secondly, new cgroups have no reserve per
+ default and in the common case most cgroups are eligible for the
+ preferred reclaim pass. This allows the new low boundary to be
+ efficiently implemented with just a minor addition to the generic
+ reclaim code, without the need for out-of-band data structures and
+ reclaim passes. Because the generic reclaim code considers all
+ cgroups except for the ones running low in the preferred first
+ reclaim pass, overreclaim of individual groups is eliminated as
+ well, resulting in much better overall workload performance.
+
+- The original high boundary, the hard limit, is defined as a strict
+ limit that can not budge, even if the OOM killer has to be called.
+ But this generally goes against the goal of making the most out of
+ the available memory. The memory consumption of workloads varies
+ during runtime, and that requires users to overcommit. But doing
+ that with a strict upper limit requires either a fairly accurate
+ prediction of the working set size or adding slack to the limit.
+ Since working set size estimation is hard and error prone, and
+ getting it wrong results in OOM kills, most users tend to err on the
+ side of a looser limit and end up wasting precious resources.
+
+ The memory.high boundary on the other hand can be set much more
+ conservatively. When hit, it throttles allocations by forcing them
+ into direct reclaim to work off the excess, but it never invokes the
+ OOM killer. As a result, a high boundary that is chosen too
+ aggressively will not terminate the processes, but instead it will
+ lead to gradual performance degradation. The user can monitor this
+ and make corrections until the minimal memory footprint that still
+ gives acceptable performance is found.
+
+ In extreme cases, with many concurrent allocations and a complete
+ breakdown of reclaim progress within the group, the high boundary
+ can be exceeded. But even then it's mostly better to satisfy the
+ allocation from the slack available in other groups or the rest of
+ the system than killing the group. Otherwise, memory.max is there
+ to limit this type of spillover and ultimately contain buggy or even
+ malicious applications.
+
+- The original control file names are unwieldy and inconsistent in
+ many different ways. For example, the upper boundary hit count is
+ exported in the memory.failcnt file, but an OOM event count has to
+ be manually counted by listening to memory.oom_control events, and
+ lower boundary / soft limit events have to be counted by first
+ setting a threshold for that value and then counting those events.
+ Also, usage and limit files encode their units in the filename.
+ That makes the filenames very long, even though this is not
+ information that a user needs to be reminded of every time they type
+ out those names.
+
+ To address these naming issues, as well as to signal clearly that
+ the new interface carries a new configuration model, the naming
+ conventions in it necessarily differ from the old interface.
+
+- The original limit files indicate the state of an unset limit with a
+ Very High Number, and a configured limit can be unset by echoing -1
+ into those files. But that very high number is implementation and
+ architecture dependent and not very descriptive. And while -1 can
+ be understood as an underflow into the highest possible value, -2 or
+ -10M etc. do not work, so it's not consistent.
+
+ memory.low, memory.high, and memory.max will use the string "max" to
+ indicate and set the highest possible value.
+
+6. Planned Changes
+
+6-1. CAP for resource control
+
+Unified hierarchy will require one of the capabilities(7), which is
+yet to be decided, for all resource control related knobs. Process
+organization operations - creation of sub-cgroups and migration of
+processes in sub-hierarchies may be delegated by changing the
+ownership and/or permissions on the cgroup directory and
+"cgroup.procs" interface file; however, all operations which affect
+resource control - writes to a "cgroup.subtree_control" file or any
+controller-specific knobs - will require an explicit CAP privilege.
+
+This, in part, is to prevent the cgroup interface from being
+inadvertently promoted to programmable API used by non-privileged
+binaries. cgroup exposes various aspects of the system in ways which
+aren't properly abstracted for direct consumption by regular programs.
+This is an administration interface much closer to sysctl knobs than
+system calls. Even the basic access model, being filesystem path
+based, isn't suitable for direct consumption. There's no way to
+access "my cgroup" in a race-free way or make multiple operations
+atomic against migration to another cgroup.
+
+Another aspect is that, for better or for worse, the cgroup interface
+goes through far less scrutiny than regular interfaces for
+unprivileged userland. The upside is that cgroup is able to expose
+useful features which may not be suitable for general consumption in a
+reasonable time frame. It provides a relatively short path between
+internal details and userland-visible interface. Of course, this
+shortcut comes with high risk. We go through what we go through for
+general kernel APIs for good reasons. It may end up leaking internal
+details in a way which can exert significant pain by locking the
+kernel into a contract that can't be maintained in a reasonable
+manner.
+
+Also, due to the specific nature, cgroup and its controllers don't
+tend to attract attention from a wide scope of developers. cgroup's
+short history is already fraught with severely mis-designed
+interfaces, unnecessary commitments to and exposing of internal
+details, broken and dangerous implementations of various features.
+
+Keeping cgroup as an administration interface is both advantageous for
+its role and imperative given its nature. Some of the cgroup features
+may make sense for unprivileged access. If deemed justified, those
+must be further abstracted and implemented as a different interface,
+be it a system call or process-private filesystem, and survive through
+the scrutiny that any interface for general consumption is required to
+go through.
+
+Requiring CAP is not a complete solution but should serve as a
+significant deterrent against spraying cgroup usages in non-privileged
+programs.