From: Tejun Heo <tj@kernel.org>
To: Aleksa Sarai <cyphar@cyphar.com>
Cc: lizefan@huawei.com, mingo@redhat.com, peterz@infradead.org,
	richard@nod.at, fweisbec@gmail.com, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: Re: [PATCH v10 4/4] cgroups: implement the PIDs subsystem
Date: Wed, 22 Apr 2015 12:29:54 -0400	[thread overview]
Message-ID: <20150422162954.GF10738@htj.duckdns.org> (raw)
In-Reply-To: <1429446154-10660-5-git-send-email-cyphar@cyphar.com>

> @@ -0,0 +1,368 @@
> +/*
> + * Process number limiting controller for cgroups.
> + *
> + * Used to allow a cgroup hierarchy to stop any new processes
> + * from fork()ing after a certain limit is reached.
> + *
> + * Since it is trivial to hit the task limit without hitting
> + * any kmemcg limits in place, PIDs are a fundamental resource.
> + * As such, PID exhaustion must be preventable in the scope of
> + * a cgroup hierarchy by allowing resource limiting of the
> + * number of tasks in a cgroup.
> + *
> + * In order to use the `pids` controller, set the maximum number
> + * of tasks in pids.max (this is not available in the root cgroup
> + * for obvious reasons). The number of processes currently
> + * in the cgroup is given by pids.current. Organisational operations
> + * are not blocked by cgroup policies, so it is possible to have
> + * pids.current > pids.max. However, fork()s will still not work.
> + *
> + * To set a cgroup to have no limit, set pids.max to "max". fork()
> + * will return -EBUSY if forking would cause a cgroup policy to be
> + * violated.
> + *
> + * pids.current tracks all child cgroup hierarchies, so
> + * parent/pids.current is a superset of parent/child/pids.current.
> + *
> + * Copyright (C) 2015 Aleksa Sarai <cyphar@cyphar.com>

The above text looks wrapped too narrowly.

> +struct pids_cgroup {
> +	struct cgroup_subsys_state	css;
> +
> +	/*
> +	 * Use 64-bit types so that we can safely represent "max" as
> +	 * (PID_MAX_LIMIT + 1).
            ^^^^^^^^^^^^^^^^^
...
> +static struct cgroup_subsys_state *
> +pids_css_alloc(struct cgroup_subsys_state *parent)
> +{
> +	struct pids_cgroup *pids;
> +
> +	pids = kzalloc(sizeof(struct pids_cgroup), GFP_KERNEL);
> +	if (!pids)
> +		return ERR_PTR(-ENOMEM);
> +
> +	pids->limit = PIDS_MAX;
                      ^^^^^^^^^

> +	atomic64_set(&pids->counter, 0);
> +	return &pids->css;
> +}
...
> +static void pids_detach(struct cgroup_subsys_state *old_css,
> +			struct task_struct *task)
> +{
> +	struct pids_cgroup *old_pids = css_pids(old_css);
> +
> +	pids_uncharge(old_pids, 1);
> +}

You can do the above as part of ->can_attach()/->cancel_attach().

> +static int pids_can_fork(struct task_struct *task, void **private)

Maybe @priv_p or something that signifies it's of a different type from
the others?

> +{
...
> +	rcu_read_lock();
> +	css = task_css(current, pids_cgrp_id);
> +	if (!css_tryget_online(css)) {
> +		retval = -EBUSY;
> +		goto err_rcu_unlock;
> +	}
> +	rcu_read_unlock();

Hmmm... so, the above is guaranteed to succeed in a finite amount of
time (the race window is actually very narrow), and it'd be silly to
fail a fork because the task was being moved across cgroups.

I think it'd be a good idea to implement task_get_css(), which loops
and returns the current css for the requested subsystem with its
reference count bumped; it could use css_tryget() too.  Holding a ref
doesn't prevent the css from dying anyway, so it doesn't make any
difference.

> +static void pids_fork(struct task_struct *task, void *private)
> +{
...
> +	rcu_read_lock();
> +	css = task_css(task, pids_cgrp_id);
> +	css_get(css);

Why is this safe?  What guarantees that the css's refcount isn't
already zero at this point?

> +	rcu_read_unlock();
> +
> +	pids = css_pids(css);
> +
> +	/*
> +	 * The association has changed, we have to revert and reapply the
> +	 * charge/uncharge on the wrong hierarchy to the current one. Since
> +	 * the association can only change due to an organisation event, its
> +	 * okay for us to ignore the limit in this case.
> +	 */
> +	if (pids != old_pids) {
> +		pids_uncharge(old_pids, 1);
> +		pids_charge(pids, 1);
> +	}
> +
> +	css_put(css);
> +	css_put(old_css);
> +}
...
> +static ssize_t pids_max_write(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct cgroup_subsys_state *css = of_css(of);
> +	struct pids_cgroup *pids = css_pids(css);
> +	int64_t limit;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	if (!strcmp(buf, PIDS_MAX_STR)) {
> +		limit = PIDS_MAX;
> +		goto set_limit;
> +	}
> +
> +	err = kstrtoll(buf, 0, &limit);
> +	if (err)
> +		return err;
> +
> +	/* We use INT_MAX as the maximum value of pid_t. */
> +	if (limit < 0 || limit > INT_MAX)

This is kinda weird if we're using PIDS_MAX for max: the file may end
up showing "max" after some larger number is written to it.

Thanks.

-- 
tejun

