linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
@ 2015-04-24 14:07 Aleksa Sarai
  2015-04-24 15:26 ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2015-04-24 14:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

>>> +     rcu_read_lock();
>>> +     css = task_css(current, pids_cgrp_id);
>>> +     if (!css_tryget_online(css)) {
>>> +             retval = -EBUSY;
>>> +             goto err_rcu_unlock;
>>> +     }
>>> +     rcu_read_unlock();
>>
>> Hmmm... so, the above is guaranteed to succeed in finite amount of
>> time (the race window is actually very narrow) and it'd be silly to
>> fail fork because a task was being moved across cgroups.
>>
>> I think it'd be a good idea to implement task_get_css() which loops
>> and returns the current css for the requested subsystem with reference
>> count bumped and it can use css_tryget() too.  Holding a ref doesn't
>> prevent css from dying anyway, so it doesn't make any difference.
>
> Hmmm, okay. I'll work on this later.

Would something like this suffice?

struct cgroup_subsys_state *task_get_css(struct task_struct *task,
                                         int subsys_id)
{
        bool have_ref = false;
        struct cgroup_subsys_state *css;

        /* Retry until a ref is taken; css_tryget() returns true on success. */
        while (!have_ref) {
                rcu_read_lock();
                css = task_css(task, subsys_id);
                have_ref = css_tryget(css);
                rcu_read_unlock();
        }

        return css;
}

Also, as a side note (in the same vein I guess), does a ref on a
css_set give you an implicit ref on a css inside that css_set, or are
those two orthogonal operations?

--
Aleksa Sarai (cyphar)
www.cyphar.com


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-04-24 14:07 [PATCH v10 4/4] cgroups: implement the PIDs subsystem Aleksa Sarai
@ 2015-04-24 15:26 ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2015-04-24 15:26 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Hello, Aleksa.

On Sat, Apr 25, 2015 at 12:07:34AM +1000, Aleksa Sarai wrote:
> Would something like this suffice?
> 
> struct cgroup_subsys_state *task_get_css(struct task_struct *task,
>                                          int subsys_id)
> {
>         bool have_ref = false;
>         struct cgroup_subsys_state *css;
> 
>         /* Retry until a ref is taken; css_tryget() returns true on success. */
>         while (!have_ref) {
>                 rcu_read_lock();
>                 css = task_css(task, subsys_id);
>                 have_ref = css_tryget(css);
>                 rcu_read_unlock();
>         }
> 
>         return css;
> }

I was thinking why this felt so familiar and realized that I have the
patch pending.

 http://lkml.kernel.org/g/1428350318-8215-8-git-send-email-tj@kernel.org

Please feel free to include it in the patch series.  I'll sort out the
merging later.

> Also, as a side note (in the same vein I guess), does a ref on a
> css_set give you an implicit ref on a css inside that css_set, or are
> those two orthogonal operations?

Yes, it does, but if you're gonna depend on that, please document that.

Thanks.

-- 
tejun


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-05-16  3:59                 ` Aleksa Sarai
@ 2015-05-18  1:24                   ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2015-05-18  1:24 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, Peter Zijlstra, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

On Sat, May 16, 2015 at 01:59:09PM +1000, Aleksa Sarai wrote:
> One question RE: defaults for .config. What is the kernel policy for
> deciding if a particular subsystem should be made enabled-by-default?

Just default to N.

Thanks.

-- 
tejun


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-05-13 17:47               ` Tejun Heo
@ 2015-05-16  3:59                 ` Aleksa Sarai
  2015-05-18  1:24                   ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2015-05-16  3:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, Peter Zijlstra, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

Hi Tejun,

One question RE: defaults for .config. What is the kernel policy for
deciding if a particular subsystem should be made enabled-by-default?

--
Aleksa Sarai (cyphar)
www.cyphar.com


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-05-13 17:44             ` Aleksa Sarai
@ 2015-05-13 17:47               ` Tejun Heo
  2015-05-16  3:59                 ` Aleksa Sarai
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2015-05-13 17:47 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, Peter Zijlstra, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

On Thu, May 14, 2015 at 03:44:24AM +1000, Aleksa Sarai wrote:
> I think it's because we didn't want to expose PIDS_MAX to userspace.
> But we're not *really* exposing it, we're just enforcing the input
> limit for "max".

Ah, PIDS_MAX is fine then.

Thanks.

-- 
tejun


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-05-13 17:29           ` Tejun Heo
@ 2015-05-13 17:44             ` Aleksa Sarai
  2015-05-13 17:47               ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2015-05-13 17:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, Peter Zijlstra, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

>> Would you be okay with this?
>>
>>     if (limit < 0 || limit >= PIDS_MAX)
>>
>> I'd prefer if we used PIDS_MAX as the maximum input value as well as
>> being the internal representation of the maximum, rather than
>> switching to something like INT_MAX.
>
> Yeah, that sounds okay to me but I forgot why we went for INT_MAX in
> the first place.  Do you remember why we tried INT_MAX at all?
>
> Thanks.
>
> --
> tejun

I think it's because we didn't want to expose PIDS_MAX to userspace.
But we're not *really* exposing it, we're just enforcing the input
limit for "max".

--
Aleksa Sarai (cyphar)
www.cyphar.com


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-05-13 17:04         ` Aleksa Sarai
@ 2015-05-13 17:29           ` Tejun Heo
  2015-05-13 17:44             ` Aleksa Sarai
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2015-05-13 17:29 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, Peter Zijlstra, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

Hello,

On Thu, May 14, 2015 at 03:04:52AM +1000, Aleksa Sarai wrote:
> Would you be okay with this?
> 
>     if (limit < 0 || limit >= PIDS_MAX)
> 
> I'd prefer if we used PIDS_MAX as the maximum input value as well as
> being the internal representation of the maximum, rather than
> switching to something like INT_MAX.

Yeah, that sounds okay to me but I forgot why we went for INT_MAX in
the first place.  Do you remember why we tried INT_MAX at all?

Thanks.

-- 
tejun


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-04-24 15:36       ` Tejun Heo
@ 2015-05-13 17:04         ` Aleksa Sarai
  2015-05-13 17:29           ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2015-05-13 17:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, Peter Zijlstra, richard,
	Frédéric Weisbecker, linux-kernel, cgroups

Hi Tejun

>> >> +     /* We use INT_MAX as the maximum value of pid_t. */
>> >> +     if (limit < 0 || limit > INT_MAX)
>> >
>> > This is kinda weird if we're using PIDS_MAX for max as it may end up
>> > showing "max" after some larger number is written to the file.
>>
>> The reason for this is because I believe you said "PIDS_MAX isn't
>> meant to be exposed to userspace" (one of the previous patchsets used
>> PIDS_MAX as the maximum valid value).
>
> Yeah, but wouldn't it be weird to allow the userland to input PIDS_MAX
> (whatever value that may be) and reads back max?  It can be whatever
> maximum input value + 1, no?

Would you be okay with this?

    if (limit < 0 || limit >= PIDS_MAX)

I'd prefer if we used PIDS_MAX as the maximum input value as well as
being the internal representation of the maximum, rather than
switching to something like INT_MAX.
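
As a rough userspace illustration of the check proposed above, here is a minimal sketch, not the patch itself. The names `parse_pids_limit`, `SKETCH_PID_MAX_LIMIT`, `SKETCH_PIDS_MAX` and `SKETCH_PIDS_MAX_STR` are hypothetical, and the value 4 * 1024 * 1024 for PID_MAX_LIMIT is an assumption (the kernel derives it from its configuration). It uses strtoll() rather than kstrtoll() and does not strip whitespace, so it returns -EINVAL for every kind of bad input.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed values for illustration only. */
#define SKETCH_PID_MAX_LIMIT	(4 * 1024 * 1024)
#define SKETCH_PIDS_MAX		(SKETCH_PID_MAX_LIMIT + 1LL)
#define SKETCH_PIDS_MAX_STR	"max"

/*
 * Parse a pids.max-style string into *limit.
 * Returns 0 on success, -EINVAL on malformed or out-of-range input.
 */
static int parse_pids_limit(const char *buf, int64_t *limit)
{
	char *end;
	long long val;

	/* The literal "max" maps to the internal sentinel. */
	if (strcmp(buf, SKETCH_PIDS_MAX_STR) == 0) {
		*limit = SKETCH_PIDS_MAX;
		return 0;
	}

	errno = 0;
	val = strtoll(buf, &end, 0);
	if (errno || end == buf || *end != '\0')
		return -EINVAL;

	/* Numeric input must stay strictly below the "max" sentinel. */
	if (val < 0 || val >= SKETCH_PIDS_MAX)
		return -EINVAL;

	*limit = val;
	return 0;
}
```

With this bound, a numeric write can never be stored as the sentinel, so a later read shows "max" only if "max" was written.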

--
Aleksa Sarai (cyphar)
www.cyphar.com


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-04-23  0:43     ` Aleksa Sarai
@ 2015-04-24 15:36       ` Tejun Heo
  2015-05-13 17:04         ` Aleksa Sarai
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2015-04-24 15:36 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Hello,

On Thu, Apr 23, 2015 at 10:43:12AM +1000, Aleksa Sarai wrote:
> > Why is this safe?  What guarantees that css's ref isn't already zero
> > at this point?
> 
> Because it's already been exposed by pids_fork, so the current css_set

But what prevents against the task being migrated to a different
cgroup?

> (which contains the current css)'s ref has been bumped. There isn't a
> guarantee that there is a ref to css, but there is a guarantee the
> css_set it is in has a ref. The problem with using tryget is that we
> can't fail here.

The guarantee you have there is the css_set wouldn't go away until rcu
lock is dropped and you can deref csses from it.  The way it's
currently implemented, you're guaranteed to have references to the
csses but that's sort of implementation detail.  It can be implemented
in different ways.

A task, as long as it's alive, is guaranteed to have a css associated
with it all the time.  What the tryget protects is races against the
task being migrated to a different cgroup, so retrying until success
is guaranteed to finish in a short amount of time.
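
The retry-until-tryget pattern described above can be modelled in userspace. This is a minimal sketch, not the kernel implementation: `struct ref_obj`, `struct owner`, `obj_tryget()`, `obj_put()` and `owner_get_obj()` are hypothetical stand-ins for a css, the task's css pointer, css_tryget(), css_put() and the proposed task_get_css(). It uses C11 atomics and omits RCU, so it assumes objects are never freed while the loop runs.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a css: only the reference count is modelled. */
struct ref_obj {
	atomic_long refcnt;	/* 0 means the object is being torn down */
};

/* Hypothetical stand-in for the task's pointer to its current css. */
struct owner {
	_Atomic(struct ref_obj *) cur;
};

/* Take a reference unless the count has already dropped to zero. */
static bool obj_tryget(struct ref_obj *obj)
{
	long old = atomic_load(&obj->refcnt);

	while (old > 0) {
		/* On failure, old is reloaded and the check repeats. */
		if (atomic_compare_exchange_weak(&obj->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken by obj_tryget(). */
static void obj_put(struct ref_obj *obj)
{
	atomic_fetch_sub(&obj->refcnt, 1);
}

/*
 * Re-read the owner's current object until a reference is obtained.
 * A failed tryget means the owner is being moved to another object,
 * so a later read observes the new one and the loop terminates.
 */
static struct ref_obj *owner_get_obj(struct owner *o)
{
	struct ref_obj *obj;

	do {
		obj = atomic_load(&o->cur);
	} while (!obj_tryget(obj));

	return obj;
}
```

The caller owns the returned reference and is expected to drop it with obj_put() when done.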

> >> +     /* We use INT_MAX as the maximum value of pid_t. */
> >> +     if (limit < 0 || limit > INT_MAX)
> >
> > This is kinda weird if we're using PIDS_MAX for max as it may end up
> > showing "max" after some larger number is written to the file.
> 
> The reason for this is because I believe you said "PIDS_MAX isn't
> meant to be exposed to userspace" (one of the previous patchsets used
> PIDS_MAX as the maximum valid value).

Yeah, but wouldn't it be weird to allow the userland to input PIDS_MAX
(whatever value that may be) and reads back max?  It can be whatever
maximum input value + 1, no?

Thanks.

-- 
tejun


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-04-22 16:29   ` Tejun Heo
  2015-04-23  0:43     ` Aleksa Sarai
@ 2015-04-24 14:24     ` Aleksa Sarai
  1 sibling, 0 replies; 13+ messages in thread
From: Aleksa Sarai @ 2015-04-24 14:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Also,

>> +struct pids_cgroup {
>> +     struct cgroup_subsys_state      css;
>> +
>> +     /*
>> +      * Use 64-bit types so that we can safely represent "max" as
>> +      * (PID_MAX_LIMIT + 1).
>             ^^^^^^^^^^^^^^^^^
> ...
>> +static struct cgroup_subsys_state *
>> +pids_css_alloc(struct cgroup_subsys_state *parent)
>> +{
>> +     struct pids_cgroup *pids;
>> +
>> +     pids = kzalloc(sizeof(struct pids_cgroup), GFP_KERNEL);
>> +     if (!pids)
>> +             return ERR_PTR(-ENOMEM);
>> +
>> +     pids->limit = PIDS_MAX;
>                       ^^^^^^^^^

%PIDS_MAX = (%PID_MAX_LIMIT + 1). I can clarify this in the comments
if you want.

--
Aleksa Sarai (cyphar)
www.cyphar.com


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-04-22 16:29   ` Tejun Heo
@ 2015-04-23  0:43     ` Aleksa Sarai
  2015-04-24 15:36       ` Tejun Heo
  2015-04-24 14:24     ` Aleksa Sarai
  1 sibling, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2015-04-23  0:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Hi Tejun,

>> +     rcu_read_lock();
>> +     css = task_css(current, pids_cgrp_id);
>> +     if (!css_tryget_online(css)) {
>> +             retval = -EBUSY;
>> +             goto err_rcu_unlock;
>> +     }
>> +     rcu_read_unlock();
>
> Hmmm... so, the above is guaranteed to succeed in finite amount of
> time (the race window is actually very narrow) and it'd be silly to
> fail fork because a task was being moved across cgroups.
>
> I think it'd be a good idea to implement task_get_css() which loops
> and returns the current css for the requested subsystem with reference
> count bumped and it can use css_tryget() too.  Holding a ref doesn't
> prevent css from dying anyway, so it doesn't make any difference.

Hmmm, okay. I'll work on this later.

>> +     rcu_read_lock();
>> +     css = task_css(task, pids_cgrp_id);
>> +     css_get(css);
>
> Why is this safe?  What guarantees that css's ref isn't already zero
> at this point?

Because it's already been exposed by pids_fork, so the current css_set
(which contains the current css)'s ref has been bumped. There isn't a
guarantee that there is a ref to css, but there is a guarantee the
css_set it is in has a ref. The problem with using tryget is that we
can't fail here.

>> +static ssize_t pids_max_write(struct kernfs_open_file *of, char *buf,
>> +                           size_t nbytes, loff_t off)
>> +{
>> +     struct cgroup_subsys_state *css = of_css(of);
>> +     struct pids_cgroup *pids = css_pids(css);
>> +     int64_t limit;
>> +     int err;
>> +
>> +     buf = strstrip(buf);
>> +     if (!strcmp(buf, PIDS_MAX_STR)) {
>> +             limit = PIDS_MAX;
>> +             goto set_limit;
>> +     }
>> +
>> +     err = kstrtoll(buf, 0, &limit);
>> +     if (err)
>> +             return err;
>> +
>> +     /* We use INT_MAX as the maximum value of pid_t. */
>> +     if (limit < 0 || limit > INT_MAX)
>
> This is kinda weird if we're using PIDS_MAX for max as it may end up
> showing "max" after some larger number is written to the file.

The reason for this is because I believe you said "PIDS_MAX isn't
meant to be exposed to userspace" (one of the previous patchsets used
PIDS_MAX as the maximum valid value).

--
Aleksa Sarai (cyphar)
www.cyphar.com


* Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-04-19 12:22 ` [PATCH v10 4/4] cgroups: implement the PIDs subsystem Aleksa Sarai
@ 2015-04-22 16:29   ` Tejun Heo
  2015-04-23  0:43     ` Aleksa Sarai
  2015-04-24 14:24     ` Aleksa Sarai
  0 siblings, 2 replies; 13+ messages in thread
From: Tejun Heo @ 2015-04-22 16:29 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

> @@ -0,0 +1,368 @@
> +/*
> + * Process number limiting controller for cgroups.
> + *
> + * Used to allow a cgroup hierarchy to stop any new processes
> + * from fork()ing after a certain limit is reached.
> + *
> + * Since it is trivial to hit the task limit without hitting
> + * any kmemcg limits in place, PIDs are a fundamental resource.
> + * As such, PID exhaustion must be preventable in the scope of
> + * a cgroup hierarchy by allowing resource limiting of the
> + * number of tasks in a cgroup.
> + *
> + * In order to use the `pids` controller, set the maximum number
> + * of tasks in pids.max (this is not available in the root cgroup
> + * for obvious reasons). The number of processes currently
> + * in the cgroup is given by pids.current. Organisational operations
> + * are not blocked by cgroup policies, so it is possible to have
> + * pids.current > pids.max. However, fork()s will still not work.
> + *
> + * To set a cgroup to have no limit, set pids.max to "max". fork()
> + * will return -EAGAIN if forking would cause a cgroup policy to be
> + * violated.
> + *
> + * pids.current tracks all child cgroup hierarchies, so
> + * parent/pids.current is a superset of parent/child/pids.current.
> + *
> + * Copyright (C) 2015 Aleksa Sarai <cyphar@cyphar.com>

The above text looks wrapped too narrow.

> +struct pids_cgroup {
> +	struct cgroup_subsys_state	css;
> +
> +	/*
> +	 * Use 64-bit types so that we can safely represent "max" as
> +	 * (PID_MAX_LIMIT + 1).
            ^^^^^^^^^^^^^^^^^
...
> +static struct cgroup_subsys_state *
> +pids_css_alloc(struct cgroup_subsys_state *parent)
> +{
> +	struct pids_cgroup *pids;
> +
> +	pids = kzalloc(sizeof(struct pids_cgroup), GFP_KERNEL);
> +	if (!pids)
> +		return ERR_PTR(-ENOMEM);
> +
> +	pids->limit = PIDS_MAX;
                      ^^^^^^^^^

> +	atomic64_set(&pids->counter, 0);
> +	return &pids->css;
> +}
...
> +static void pids_detach(struct cgroup_subsys_state *old_css,
> +			struct task_struct *task)
> +{
> +	struct pids_cgroup *old_pids = css_pids(old_css);
> +
> +	pids_uncharge(old_pids, 1);
> +}

You can do the above as a part of can/cancel.

> +static int pids_can_fork(struct task_struct *task, void **private)

Maybe @priv_p or something which signifies it's of different type from
others?

> +{
...
> +	rcu_read_lock();
> +	css = task_css(current, pids_cgrp_id);
> +	if (!css_tryget_online(css)) {
> +		retval = -EBUSY;
> +		goto err_rcu_unlock;
> +	}
> +	rcu_read_unlock();

Hmmm... so, the above is guaranteed to succeed in finite amount of
time (the race window is actually very narrow) and it'd be silly to
fail fork because a task was being moved across cgroups.

I think it'd be a good idea to implement task_get_css() which loops
and returns the current css for the requested subsystem with reference
count bumped and it can use css_tryget() too.  Holding a ref doesn't
prevent css from dying anyway, so it doesn't make any difference.

> +static void pids_fork(struct task_struct *task, void *private)
> +{
...
> +	rcu_read_lock();
> +	css = task_css(task, pids_cgrp_id);
> +	css_get(css);

Why is this safe?  What guarantees that css's ref isn't already zero
at this point?

> +	rcu_read_unlock();
> +
> +	pids = css_pids(css);
> +
> +	/*
> +	 * The association has changed, we have to revert and reapply the
> +	 * charge/uncharge on the wrong hierarchy to the current one. Since
> +	 * the association can only change due to an organisation event, it's
> +	 * okay for us to ignore the limit in this case.
> +	 */
> +	if (pids != old_pids) {
> +		pids_uncharge(old_pids, 1);
> +		pids_charge(pids, 1);
> +	}
> +
> +	css_put(css);
> +	css_put(old_css);
> +}
...
> +static ssize_t pids_max_write(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct cgroup_subsys_state *css = of_css(of);
> +	struct pids_cgroup *pids = css_pids(css);
> +	int64_t limit;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	if (!strcmp(buf, PIDS_MAX_STR)) {
> +		limit = PIDS_MAX;
> +		goto set_limit;
> +	}
> +
> +	err = kstrtoll(buf, 0, &limit);
> +	if (err)
> +		return err;
> +
> +	/* We use INT_MAX as the maximum value of pid_t. */
> +	if (limit < 0 || limit > INT_MAX)

This is kinda weird if we're using PIDS_MAX for max as it may end up
showing "max" after some larger number is written to the file.

Thanks.

-- 
tejun


* [PATCH v10 4/4] cgroups: implement the PIDs subsystem
  2015-04-19 12:22 [PATCH v10 0/4] cgroups: add pids subsystem Aleksa Sarai
@ 2015-04-19 12:22 ` Aleksa Sarai
  2015-04-22 16:29   ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2015-04-19 12:22 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Adds a new single-purpose PIDs subsystem to limit the number of
tasks that can be forked inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than a
process tree.

However, it should be noted that organisational operations (adding and
removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
the number of tasks in the hierarchy cannot exceed the limit through
forking. This is due to the fact that, in the unified hierarchy, attach
cannot fail (and it is not possible for a task to overcome its PIDs
cgroup policy limit by attaching to a child cgroup).

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup_subsys.h |   5 +
 init/Kconfig                  |  16 ++
 kernel/Makefile               |   1 +
 kernel/cgroup_pids.c          | 368 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 390 insertions(+)
 create mode 100644 kernel/cgroup_pids.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index fdd3551..fc61bc6 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -61,6 +61,11 @@ SUBSYS(hugetlb)
  * Subsystems that implement the can_fork() family of callbacks.
  */
 SUBSYS_TAG(PREFORK_START)
+
+#if IS_ENABLED(CONFIG_CGROUP_PIDS)
+SUBSYS(pids)
+#endif
+
 SUBSYS_TAG(PREFORK_END)
 
 /*
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..1f135b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -952,6 +952,22 @@ config CGROUP_FREEZER
 	  Provides a way to freeze and unfreeze all tasks in a
 	  cgroup.
 
+config CGROUP_PIDS
+	bool "PIDs cgroup subsystem"
+	help
+	  Provides enforcement of process number limits in the scope of a
+	  cgroup. Any attempt to fork more processes than is allowed in the
+	  cgroup will fail. PIDs are fundamentally a global resource because it
+	  is fairly trivial to reach PID exhaustion before you reach even a
+	  conservative kmemcg limit. As a result, it is possible to grind a
+	  system to halt without being limited by other cgroup policies. The
+	  PIDs cgroup subsystem is designed to stop this from happening.
+
+	  It should be noted that organisational operations (such as attaching
+	  to a cgroup hierarchy) will *not* be blocked by the PIDs subsystem,
+	  since the PIDs limit only affects a process's ability to fork, not to
+	  attach to a cgroup.
+
 config CGROUP_DEVICE
 	bool "Device controller for cgroups"
 	help
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..e823592 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
new file mode 100644
index 0000000..c1c89f2
--- /dev/null
+++ b/kernel/cgroup_pids.c
@@ -0,0 +1,368 @@
+/*
+ * Process number limiting controller for cgroups.
+ *
+ * Used to allow a cgroup hierarchy to stop any new processes
+ * from fork()ing after a certain limit is reached.
+ *
+ * Since it is trivial to hit the task limit without hitting
+ * any kmemcg limits in place, PIDs are a fundamental resource.
+ * As such, PID exhaustion must be preventable in the scope of
+ * a cgroup hierarchy by allowing resource limiting of the
+ * number of tasks in a cgroup.
+ *
+ * In order to use the `pids` controller, set the maximum number
+ * of tasks in pids.max (this is not available in the root cgroup
+ * for obvious reasons). The number of processes currently
+ * in the cgroup is given by pids.current. Organisational operations
+ * are not blocked by cgroup policies, so it is possible to have
+ * pids.current > pids.max. However, fork()s will still not work.
+ *
+ * To set a cgroup to have no limit, set pids.max to "max". fork()
+ * will return -EAGAIN if forking would cause a cgroup policy to be
+ * violated.
+ *
+ * pids.current tracks all child cgroup hierarchies, so
+ * parent/pids.current is a superset of parent/child/pids.current.
+ *
+ * Copyright (C) 2015 Aleksa Sarai <cyphar@cyphar.com>
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/threads.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+
+#define PIDS_MAX (PID_MAX_LIMIT + 1ULL)
+#define PIDS_MAX_STR "max"
+
+struct pids_cgroup {
+	struct cgroup_subsys_state	css;
+
+	/*
+	 * Use 64-bit types so that we can safely represent "max" as
+	 * (PID_MAX_LIMIT + 1).
+	 */
+	atomic64_t			counter;
+	int64_t				limit;
+};
+
+static struct pids_cgroup *css_pids(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct pids_cgroup, css);
+}
+
+static struct pids_cgroup *parent_pids(struct pids_cgroup *pids)
+{
+	return css_pids(pids->css.parent);
+}
+
+static struct cgroup_subsys_state *
+pids_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct pids_cgroup *pids;
+
+	pids = kzalloc(sizeof(struct pids_cgroup), GFP_KERNEL);
+	if (!pids)
+		return ERR_PTR(-ENOMEM);
+
+	pids->limit = PIDS_MAX;
+	atomic64_set(&pids->counter, 0);
+	return &pids->css;
+}
+
+static void pids_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_pids(css));
+}
+
+/**
+ * pids_cancel - uncharge the local pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to cancel
+ *
+ * This function will WARN if the pid count goes under 0,
+ * because such a case is a bug in the pids controller proper.
+ */
+static void pids_cancel(struct pids_cgroup *pids, int num)
+{
+	/*
+	 * A negative count (or overflow for that matter) is invalid,
+	 * and indicates a bug in the pids controller proper.
+	 */
+	WARN_ON_ONCE(atomic64_add_negative(-num, &pids->counter));
+}
+
+/**
+ * pids_uncharge - hierarchically uncharge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to uncharge
+ */
+static void pids_uncharge(struct pids_cgroup *pids, int num)
+{
+	struct pids_cgroup *p;
+
+	for (p = pids; p; p = parent_pids(p))
+		pids_cancel(p, num);
+}
+
+/**
+ * pids_charge - hierarchically charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function does *not* follow the pid limit set. It cannot
+ * fail and the new pid count may exceed the limit, because
+ * organisational operations cannot fail in the unified hierarchy.
+ */
+static void pids_charge(struct pids_cgroup *pids, int num)
+{
+	struct pids_cgroup *p;
+
+	for (p = pids; p; p = parent_pids(p))
+		atomic64_add(num, &p->counter);
+}
+
+/**
+ * pids_try_charge - hierarchically try to charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function follows the set limit. It will fail if the charge
+ * would cause the new value to exceed the hierarchical limit.
+ * Returns 0 if the charge succeeded, otherwise -EAGAIN.
+ */
+static int pids_try_charge(struct pids_cgroup *pids, int num)
+{
+	struct pids_cgroup *p, *q;
+
+	for (p = pids; p; p = parent_pids(p)) {
+		int64_t new = atomic64_add_return(num, &p->counter);
+
+		/*
+		 * Since new is capped to the maximum number of pid_t, if
+		 * p->limit is %PIDS_MAX then we know that this test will never
+		 * fail.
+		 */
+		if (new > p->limit)
+			goto revert;
+	}
+
+	return 0;
+
+revert:
+	for (q = pids; q != p; q = parent_pids(q))
+		pids_cancel(q, num);
+	pids_cancel(p, num);
+
+	return -EAGAIN;
+}
+
+static int pids_can_attach(struct cgroup_subsys_state *css,
+			   struct cgroup_taskset *tset)
+{
+	struct pids_cgroup *pids = css_pids(css);
+	struct task_struct *task;
+	int64_t num = 0;
+
+	cgroup_taskset_for_each(task, tset)
+		num++;
+
+	/*
+	 * Attaching to a cgroup is allowed to overcome the PID limit,
+	 * so that organisation operations aren't blocked by the
+	 * `pids` cgroup controller.
+	 */
+	pids_charge(pids, num);
+	return 0;
+}
+
+static void pids_cancel_attach(struct cgroup_subsys_state *css,
+			       struct cgroup_taskset *tset)
+{
+	struct pids_cgroup *pids = css_pids(css);
+	struct task_struct *task;
+	int64_t num = 0;
+
+	cgroup_taskset_for_each(task, tset)
+		num++;
+
+	pids_uncharge(pids, num);
+}
+
+static void pids_detach(struct cgroup_subsys_state *old_css,
+			struct task_struct *task)
+{
+	struct pids_cgroup *old_pids = css_pids(old_css);
+
+	pids_uncharge(old_pids, 1);
+}
+
+static int pids_can_fork(struct task_struct *task, void **private)
+{
+	struct cgroup_subsys_state *css;
+	struct pids_cgroup *pids;
+	int retval;
+
+	/*
+	 * Use the "current" task_css for the pids subsystem as the tentative
+	 * css. It is possible we will charge the wrong hierarchy, in which
+	 * case we will forcefully revert/reapply the charge on the right
+	 * hierarchy after it is committed to the task proper.
+	 */
+	rcu_read_lock();
+	css = task_css(current, pids_cgrp_id);
+	if (!css_tryget_online(css)) {
+		retval = -EBUSY;
+		goto err_rcu_unlock;
+	}
+	rcu_read_unlock();
+	pids = css_pids(css);
+
+	retval = pids_try_charge(pids, 1);
+	if (retval)
+		goto err_css_put;
+
+	*private = css;
+	return 0;
+
+err_rcu_unlock:
+	rcu_read_unlock();
+err_css_put:
+	css_put(css);
+	return retval;
+}
+
+static void pids_cancel_fork(struct task_struct *task, void *private)
+{
+	struct cgroup_subsys_state *css = private;
+	struct pids_cgroup *pids = css_pids(css);
+
+	pids_uncharge(pids, 1);
+	css_put(css);
+}
+
+static void pids_fork(struct task_struct *task, void *private)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup_subsys_state *old_css = private;
+	struct pids_cgroup *pids;
+	struct pids_cgroup *old_pids = css_pids(old_css);
+
+	/*
+	 * Get the current task css. Since the task has already been exposed to
+	 * the system and had its cg_list updated, we know that we already have
+	 * an implicit reference through task.
+	 */
+	rcu_read_lock();
+	css = task_css(task, pids_cgrp_id);
+	css_get(css);
+	rcu_read_unlock();
+
+	pids = css_pids(css);
+
+	/*
+	 * The association has changed, we have to revert and reapply the
+	 * charge/uncharge on the wrong hierarchy to the current one. Since
+	 * the association can only change due to an organisation event, it's
+	 * okay for us to ignore the limit in this case.
+	 */
+	if (pids != old_pids) {
+		pids_uncharge(old_pids, 1);
+		pids_charge(pids, 1);
+	}
+
+	css_put(css);
+	css_put(old_css);
+}
+
+static void pids_exit(struct cgroup_subsys_state *css,
+		      struct cgroup_subsys_state *old_css,
+		      struct task_struct *task)
+{
+	struct pids_cgroup *pids = css_pids(old_css);
+
+	pids_uncharge(pids, 1);
+}
+
+static ssize_t pids_max_write(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct pids_cgroup *pids = css_pids(css);
+	int64_t limit;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, PIDS_MAX_STR)) {
+		limit = PIDS_MAX;
+		goto set_limit;
+	}
+
+	err = kstrtoll(buf, 0, &limit);
+	if (err)
+		return err;
+
+	/* We use INT_MAX as the maximum value of pid_t. */
+	if (limit < 0 || limit > INT_MAX)
+		return -EINVAL;
+
+set_limit:
+	/*
+	 * Limit updates don't need to be mutex'd, since it isn't
+	 * critical that any racing fork()s follow the new limit.
+	 */
+	pids->limit = limit;
+	return nbytes;
+}
+
+static int pids_max_show(struct seq_file *sf, void *v)
+{
+	struct cgroup_subsys_state *css = seq_css(sf);
+	struct pids_cgroup *pids = css_pids(css);
+	int64_t limit = pids->limit;
+
+	if (limit == PIDS_MAX)
+		seq_printf(sf, "%s\n", PIDS_MAX_STR);
+	else
+		seq_printf(sf, "%lld\n", limit);
+
+	return 0;
+}
+
+static s64 pids_current_read(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct pids_cgroup *pids = css_pids(css);
+
+	return atomic64_read(&pids->counter);
+}
+
+static struct cftype files[] = {
+	{
+		.name = "max",
+		.write = pids_max_write,
+		.seq_show = pids_max_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "current",
+		.read_s64 = pids_current_read,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys pids_cgrp_subsys = {
+	.css_alloc	= pids_css_alloc,
+	.css_free	= pids_css_free,
+	.can_attach	= pids_can_attach,
+	.cancel_attach	= pids_cancel_attach,
+	.detach		= pids_detach,
+	.can_fork	= pids_can_fork,
+	.cancel_fork	= pids_cancel_fork,
+	.fork		= pids_fork,
+	.exit		= pids_exit,
+	.legacy_cftypes	= files,
+	.early_init	= 0,
+};
-- 
2.3.5
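
The all-or-nothing hierarchical charge performed by pids_try_charge() in the patch above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: `struct pid_node`, `node_uncharge()` and `node_try_charge()` are hypothetical names, the parent link is an explicit pointer instead of css.parent, and C11 atomics stand in for atomic64_t.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct pids_cgroup. */
struct pid_node {
	struct pid_node *parent;	/* NULL for the root */
	atomic_llong counter;		/* tasks charged to this subtree */
	int64_t limit;
};

/* Unconditionally remove num from node and from every ancestor. */
static void node_uncharge(struct pid_node *node, int num)
{
	for (struct pid_node *p = node; p; p = p->parent)
		atomic_fetch_sub(&p->counter, num);
}

/*
 * Charge num to node and every ancestor, or charge nothing at all.
 * Returns 0 on success, -EAGAIN if any level would exceed its limit.
 */
static int node_try_charge(struct pid_node *node, int num)
{
	struct pid_node *p, *q;

	for (p = node; p; p = p->parent) {
		long long after = atomic_fetch_add(&p->counter, num) + num;

		if (after > p->limit)
			goto revert;
	}
	return 0;

revert:
	/* Undo every level touched so far, including the one that failed. */
	for (q = node; q != p; q = q->parent)
		atomic_fetch_sub(&q->counter, num);
	atomic_fetch_sub(&p->counter, num);
	return -EAGAIN;
}
```

Charging first and checking afterwards keeps each level to a single atomic operation. The cost is that a racing reader may briefly observe a count above the limit before the revert runs.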




Thread overview: 13+ messages
2015-04-24 14:07 [PATCH v10 4/4] cgroups: implement the PIDs subsystem Aleksa Sarai
2015-04-24 15:26 ` Tejun Heo
  -- strict thread matches above, loose matches on Subject: below --
2015-04-19 12:22 [PATCH v10 0/4] cgroups: add pids subsystem Aleksa Sarai
2015-04-19 12:22 ` [PATCH v10 4/4] cgroups: implement the PIDs subsystem Aleksa Sarai
2015-04-22 16:29   ` Tejun Heo
2015-04-23  0:43     ` Aleksa Sarai
2015-04-24 15:36       ` Tejun Heo
2015-05-13 17:04         ` Aleksa Sarai
2015-05-13 17:29           ` Tejun Heo
2015-05-13 17:44             ` Aleksa Sarai
2015-05-13 17:47               ` Tejun Heo
2015-05-16  3:59                 ` Aleksa Sarai
2015-05-18  1:24                   ` Tejun Heo
2015-04-24 14:24     ` Aleksa Sarai
