From: Frederic Weisbecker
To: LKML
Cc: Frederic Weisbecker, Paul Menage, Li Zefan, Johannes Weiner, Aditya Kali, Oleg Nesterov, Andrew Morton, Kay Sievers, Tim Hockin, Tejun Heo
Subject: [PATCH 10/12] cgroups: Add documentation for task counter subsystem
Date: Tue, 6 Sep 2011 02:13:04 +0200
Message-Id: <1315267986-28937-11-git-send-email-fweisbec@gmail.com>
X-Mailer: git-send-email 1.7.5.4
In-Reply-To: <1315267986-28937-1-git-send-email-fweisbec@gmail.com>
References: <1315267986-28937-1-git-send-email-fweisbec@gmail.com>

Signed-off-by: Frederic Weisbecker
Cc: Paul Menage
Cc: Li Zefan
Cc: Johannes Weiner
Cc: Aditya Kali
Cc: Oleg Nesterov
Cc: Andrew Morton
Cc: Kay Sievers
Cc: Tim Hockin
Cc: Tejun Heo
---
 Documentation/cgroups/task_counter.txt |  126 ++++++++++++++++++++++++++++++++
 1 files changed, 126 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/task_counter.txt

diff --git a/Documentation/cgroups/task_counter.txt b/Documentation/cgroups/task_counter.txt
new file mode 100644
index 0000000..e93760a
--- /dev/null
+++ b/Documentation/cgroups/task_counter.txt
@@ -0,0 +1,126 @@
+Task counter subsystem
+
+1. Description
+
+The task counter subsystem limits the number of tasks running
+inside a given cgroup. It behaves like the RLIMIT_NPROC rlimit but in
+the scope of a cgroup instead of a user.
+
+It has two typical use cases, although more can probably be found:
+
+- Protect against forkbombs that explode inside a container when
+that container is implemented using a cgroup. The RLIMIT_NPROC rlimit
+is not effective for that because if we have several containers
+running in parallel under the same user, one container could starve
+all the others by spawning a high number of tasks close to the
+rlimit boundary. So in this case we need the limit to be
+enforced at per-cgroup granularity.
+
+- Kill all tasks inside a cgroup without races. By setting the limit
+of running tasks to 0, one can prevent any further fork inside a
+cgroup and then kill all of its tasks without having to retry an
+unbounded number of times due to races between kills and forks running
+in parallel (more details in the "Kill a cgroup safely" section).
+
+
+2. Interface
+
+When a hierarchy is mounted with the task counter subsystem bound, it
+adds two files to every cgroup directory except the root one:
+
+- tasks.usage contains the number of tasks running inside a cgroup and
+its children in the hierarchy (see the Inheritance section).
+
+- tasks.limit contains the maximum number of tasks that can run inside
+a cgroup. This limit is checked when a task forks or when it is migrated
+to a cgroup.
+
+Note that the tasks.limit value can be set below tasks.usage, in which
+case any new task in the cgroup will be rejected until the tasks.usage
+value goes below tasks.limit.
+
+For optimization reasons, the root directory of a hierarchy doesn't have
+a task counter.
+
+
+3. Inheritance
+
+When a task is added to a cgroup, by way of a cgroup migration or a fork,
+it increases the task counter of that cgroup and of all its ancestors.
+Hence a cgroup is also subject to the limits of its ancestors.
+
+In the following hierarchy:
+
+
+             A
+             |
+             B
+           /   \
+          C     D
+
+
+We have one task running in B, one running in C and none running in D.
+It means we have tasks.usage = 1 in C and tasks.usage = 2 in B because
+B counts its own task and those of its children.
+
+Now let's set tasks.limit = 2 in B and tasks.limit = 1 in D.
+If we try to move a new task into D, it will be refused because the
+limit in B has already been reached.
+
+
+4. Kill a cgroup safely
+
+As explained in the description, this subsystem is also helpful to
+kill all tasks in a cgroup safely, after setting tasks.limit to 0,
+so that we don't race against parallel forks over an unbounded number
+of kill iterations.
+
+But there is one detail to be aware of when using the feature this
+way.
+
+A typical way to proceed would be:
+
+    echo 0 > tasks.limit
+    for TASK in $(cat cgroup.procs)
+    do
+        kill -KILL "$TASK"
+    done
+
+However, there is a small race window in which a task is in the middle
+of forking but the fork has not progressed far enough for the PID of
+the child to appear in the cgroup.procs file.
+
+The only way to get it right is to run a loop that reads tasks.usage, kills
+all the tasks in cgroup.procs and exits the loop only if the value read from
+tasks.usage equals the number of tasks that were in cgroup.procs,
+i.e. the number of tasks that were killed.
+
+It works because the new child appears in tasks.usage right before we check,
+in the fork path, whether the parent has a pending signal, in which case the
+fork is cancelled anyway. So relying on tasks.usage is fine and non-racy.
+
+This race window is tiny and unlikely to be hit, so most of the time a single
+kill iteration should be enough. But it's worth knowing about this corner
+case, which was spotted by Oleg Nesterov.
+
+An example of safe use would be:
+
+    echo 0 > tasks.limit
+    END=false
+
+    while [ "$END" = false ]
+    do
+        NR_TASKS=$(cat tasks.usage)    # read usage before killing
+        NR_KILLED=0
+
+        for TASK in $(cat cgroup.procs)
+        do
+            NR_KILLED=$((NR_KILLED + 1))
+            kill -KILL "$TASK"
+        done
+
+        if [ "$NR_TASKS" = "$NR_KILLED" ]    # no fork slipped through
+        then
+            END=true
+        fi
+    done
-- 
1.7.5.4
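
As a rough illustration of the Inheritance section of the document above,
the following bash sketch models the charging rule purely in user space. It
does not read or write any cgroup file. The function names (charge_path,
can_attach, attach) and the in-memory tables are hypothetical and exist only
for this example. It assumes the rule as described in the text: adding a task
charges the target cgroup and each of its ancestors, and is refused if any of
them has already reached its tasks.limit. Treating a cgroup with no entry in
LIMIT as unlimited is an assumption of the sketch. It needs bash 4 or later
for associative arrays.

```shell
#!/usr/bin/env bash
# User-space model of the hierarchical task counter. Nothing here touches
# real cgroup files; all names are hypothetical.

declare -A PARENT=( [A]="" [B]=A [C]=B [D]=B )  # hierarchy from section 3
declare -A USAGE=( [A]=2 [B]=2 [C]=1 [D]=0 )    # one task in B, one in C
declare -A LIMIT=( [B]=2 [D]=1 )                # no entry: unlimited (assumption)

# Print the cgroup and all of its ancestors, one per line.
charge_path() {
    local cg=$1
    while [ -n "$cg" ]; do
        echo "$cg"
        cg=${PARENT[$cg]}
    done
}

# Succeed only if one more task fits in the cgroup and in every ancestor.
can_attach() {
    local cg
    for cg in $(charge_path "$1"); do
        if [ -n "${LIMIT[$cg]:-}" ] && [ "${USAGE[$cg]}" -ge "${LIMIT[$cg]}" ]; then
            echo "refused: limit reached in $cg"
            return 1
        fi
    done
    echo "allowed"
}

# Charge one task to the cgroup and to all its ancestors, if it fits.
attach() {
    local cg
    can_attach "$1" >/dev/null || return 1
    for cg in $(charge_path "$1"); do
        USAGE[$cg]=$(( ${USAGE[$cg]} + 1 ))
    done
}

can_attach D || true    # prints "refused: limit reached in B"
LIMIT[B]=3              # raise the limit in B
attach D && echo "usage: B=${USAGE[B]} D=${USAGE[D]}"   # prints "usage: B=3 D=1"
```

The first call is refused even though D itself is empty, because B is already
at its limit of 2. Once the limit in B is raised to 3, the attach succeeds and
the charge shows up in D, B and A.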