linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* sched/autogroup: race if !sysctl_sched_autogroup_enabled ?
@ 2016-11-09 16:59 Oleg Nesterov
  2016-11-09 17:50 ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2016-11-09 16:59 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds, Mike Galbraith, Peter Zijlstra
  Cc: hartsjc, vbendel, vlovejoy, linux-kernel

I am trying to investigate a bug-report which looks as an autogroup bug,
and it seems I found the race which _might_ explain the problem. I'll
try to make the fix tomorrow but could you confirm I got it right and
answer the question below?

Let's look at task_wants_autogroup()

	/*
	 * We can only assume the task group can't go away on us if
	 * autogroup_move_group() can see us on ->thread_group list.
	 */
	if (p->flags & PF_EXITING)
		return false;

Firstly, I think that this PF_EXITING check is no longer needed.
sched_change_group() can be called by autogroup or cgroups code.
autogroup is obviously fine, cgroup-attach will never try to migrate
the exiting task (cgroup_threadgroup_rwsem helps), cpu_cgroup_fork()
is obviously fine too.

But!!! at the same time the comment is _correct_ even if very cryptic ;)

We need to ensure that autogroup/tg returned by autogroup_task_group()
can't go away if we race with autogroup_move_group(), and unless the
caller holds ->siglock we rely on fact that autogroup_move_group()
will a) see this task and b) do sched_move_task() which needs the same
same rq->lock.

However. autogroup_move_group() skips for_each_thread/sched_move_task
if sysctl_sched_autogroup_enabled == 0.

So. Doesn't this mean that cgroup migration to the root cgroup can race
with autogroup_move_group() and use the soon-to-be-freed autogroup->tg?

Or, even simpler, cgroup_post_fork()->cpu_cgroup_fork() can hit the
same race if CLONE_TRHEAD?

Or I am totally confused?

---------------------------------------------------------------------
Now the qustions. Can't we simplify autogroup_task_get() and avoid
lock_task_sighand() ?

	struct autogroup *autogroup_task_get(struct task_struct *p)
	{
		struct autogroup *ag
	
		rcu_read_lock();
		for (;;) {
			// it is freed by sched_free_group_rcu() path
			// and thus ->autogroup is rcu-safe too.
			ag = READ_ONCE(p->signal->autogroup);
			if (kref_get_unless_zero(&ag->kref))
				break;
		}
		rcu_read_unlock();

		return ag;
	}

although this is a bit off-topic. Another question is that I fail to
understand why sched_autogroup_create_attach() does autogroup_create()
and changes signal->autogroup even if !sysctl_sched_autogroup_enabled.

IOW, even ignoring the problem above, what is wrong with this patch?

	--- x/kernel/sched/auto_group.c
	+++ x/kernel/sched/auto_group.c
	@@ -152,8 +152,12 @@ out:
	 /* Allocates GFP_KERNEL, cannot be called under any spinlock */
	 void sched_autogroup_create_attach(struct task_struct *p)
	 {
	-	struct autogroup *ag = autogroup_create();
	+	struct autogroup *ag;
	+
	+	if (!sysctl_sched_autogroup_enabled)
	+		return;
	 
	+	ag = autogroup_create();
		autogroup_move_group(p, ag);
		/* drop extra reference added by autogroup_create() */
		autogroup_kref_put(ag);

or even

	--- x/kernel/sched/auto_group.c
	+++ x/kernel/sched/auto_group.c
	@@ -64,9 +64,13 @@ static inline struct autogroup *autogrou
	 
	 static inline struct autogroup *autogroup_create(void)
	 {
	-	struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
	+	struct autogroup *ag;
		struct task_group *tg;
	 
	+	if (!sysctl_sched_autogroup_enabled)
	+		goto xxx;
	+
	+	ag = kzalloc(sizeof(*ag), GFP_KERNEL);
		if (!ag)
			goto out_fail;
	 
	@@ -103,7 +107,7 @@ out_fail:
			printk(KERN_WARNING "autogroup_create: %s failure.\n",
				ag ? "sched_create_group()" : "kmalloc()");
		}
	-
	+xxx:
		return autogroup_kref_get(&autogroup_default);
	 }
	 

Oleg.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: sched/autogroup: race if !sysctl_sched_autogroup_enabled ?
  2016-11-09 16:59 sched/autogroup: race if !sysctl_sched_autogroup_enabled ? Oleg Nesterov
@ 2016-11-09 17:50 ` Peter Zijlstra
  2016-11-10 13:09   ` Oleg Nesterov
  2016-11-12 12:12   ` Mike Galbraith
  0 siblings, 2 replies; 7+ messages in thread
From: Peter Zijlstra @ 2016-11-09 17:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Linus Torvalds, Mike Galbraith, hartsjc, vbendel,
	vlovejoy, linux-kernel

On Wed, Nov 09, 2016 at 05:59:33PM +0100, Oleg Nesterov wrote:

> We need to ensure that autogroup/tg returned by autogroup_task_group()
> can't go away if we race with autogroup_move_group(), and unless the
> caller holds ->siglock we rely on fact that autogroup_move_group()
> will a) see this task and b) do sched_move_task() which needs the same
> same rq->lock.
> 
> However. autogroup_move_group() skips for_each_thread/sched_move_task
> if sysctl_sched_autogroup_enabled == 0.
> 
> So. Doesn't this mean that cgroup migration to the root cgroup can race
> with autogroup_move_group() and use the soon-to-be-freed autogroup->tg?

Argh, its too late for this, also jet-lag. But maybe, I can sort of feel
a hole here but cannot for the life of me still think.


> although this is a bit off-topic. Another question is that I fail to
> understand why sched_autogroup_create_attach() does autogroup_create()
> and changes signal->autogroup even if !sysctl_sched_autogroup_enabled.

I really cannot remember back that far, but it could be to allow
flipping it back on. Then again, I don't think the fork path puts new
tasks in, even if the autogroup exists.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: sched/autogroup: race if !sysctl_sched_autogroup_enabled ?
  2016-11-09 17:50 ` Peter Zijlstra
@ 2016-11-10 13:09   ` Oleg Nesterov
  2016-11-11 16:57     ` Oleg Nesterov
  2016-11-12 12:12   ` Mike Galbraith
  1 sibling, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2016-11-10 13:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Mike Galbraith, hartsjc, vbendel,
	vlovejoy, linux-kernel

On 11/09, Peter Zijlstra wrote:
>
> On Wed, Nov 09, 2016 at 05:59:33PM +0100, Oleg Nesterov wrote:
>
> > We need to ensure that autogroup/tg returned by autogroup_task_group()
> > can't go away if we race with autogroup_move_group(), and unless the
> > caller holds ->siglock we rely on fact that autogroup_move_group()
> > will a) see this task and b) do sched_move_task() which needs the same
> > same rq->lock.
> >
> > However. autogroup_move_group() skips for_each_thread/sched_move_task
> > if sysctl_sched_autogroup_enabled == 0.
> >
> > So. Doesn't this mean that cgroup migration to the root cgroup can race
> > with autogroup_move_group() and use the soon-to-be-freed autogroup->tg?
>
> Argh, its too late for this, also jet-lag. But maybe, I can sort of feel
> a hole here but cannot for the life of me still think.

And the 3rd case which I didn't think about yesterday. And now I really hope
it can explain the vmcore we have.

If sysctl_sched_autogroup_enabled was enabled and then disabled, it is
possible that the "autogrouped" process runs with ag->kref.refcount == 1,
and if it does setsid() it frees its active task_group.

> > although this is a bit off-topic. Another question is that I fail to
> > understand why sched_autogroup_create_attach() does autogroup_create()
> > and changes signal->autogroup even if !sysctl_sched_autogroup_enabled.
>
> I really cannot remember back that far, but it could be to allow
> flipping it back on.

Yes, I thought about this too, but I think it is hardly possible to explain
what do we actually want when sysctl_sched_autogroup_enabled changes from 0
to 1.

So I am going to send the patch which simply moves the sysctl check from
autogroup_move_group() to sched_autogroup_create_attach(), but perhaps I
should split this change?

I mean, the first patch for -stable could just remove the current check,
the 2nd one will add it into sched_autogroup_create_attach().

Oleg.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: sched/autogroup: race if !sysctl_sched_autogroup_enabled ?
  2016-11-10 13:09   ` Oleg Nesterov
@ 2016-11-11 16:57     ` Oleg Nesterov
  2016-11-13 13:59       ` Mike Galbraith
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2016-11-11 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Mike Galbraith, hartsjc, vbendel,
	vlovejoy, linux-kernel

On 11/10, Oleg Nesterov wrote:
>
> And the 3rd case which I didn't think about yesterday. And now I really hope
> it can explain the vmcore we have.
>
> If sysctl_sched_autogroup_enabled was enabled and then disabled, it is
> possible that the "autogrouped" process runs with ag->kref.refcount == 1,
> and if it does setsid() it frees its active task_group.

And yet another problem ;)

The exiting thread must call sched_move_task() somewhere before exit_notify()
or it can run with the freed task_group() after that. And this means that the
no-longer-needed PF_EXITING check in task_wants_autogroup() will be needed
again. Simple, but needs the comments/changelog...

> So I am going to send the patch which simply moves the sysctl check from
> autogroup_move_group() to sched_autogroup_create_attach(), but perhaps I
> should split this change?
>
> I mean, the first patch for -stable could just remove the current check,
> the 2nd one will add it into sched_autogroup_create_attach().

No, this is not enough, see above.

I am starting to think that we should just move ->autogroup from signal_struct
to task_struct. This will simplify the code and fix all these problems. But
I need a simple fix for backporting anyway.

Oleg.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: sched/autogroup: race if !sysctl_sched_autogroup_enabled ?
  2016-11-09 17:50 ` Peter Zijlstra
  2016-11-10 13:09   ` Oleg Nesterov
@ 2016-11-12 12:12   ` Mike Galbraith
  1 sibling, 0 replies; 7+ messages in thread
From: Mike Galbraith @ 2016-11-12 12:12 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov
  Cc: Ingo Molnar, Linus Torvalds, hartsjc, vbendel, vlovejoy, linux-kernel

On Wed, 2016-11-09 at 18:50 +0100, Peter Zijlstra wrote:
> On Wed, Nov 09, 2016 at 05:59:33PM +0100, Oleg Nesterov wrote:
> 
> > We need to ensure that autogroup/tg returned by autogroup_task_group()
> > can't go away if we race with autogroup_move_group(), and unless the
> > caller holds ->siglock we rely on fact that autogroup_move_group()
> > will a) see this task and b) do sched_move_task() which needs the same
> > same rq->lock.
> > 
> > However. autogroup_move_group() skips for_each_thread/sched_move_task
> > if sysctl_sched_autogroup_enabled == 0.
> > 
> > So. Doesn't this mean that cgroup migration to the root cgroup can race
> > with autogroup_move_group() and use the soon-to-be-freed autogroup->tg?
> 
> Argh, its too late for this, also jet-lag. But maybe, I can sort of feel
> a hole here but cannot for the life of me still think.
> 
> 
> > although this is a bit off-topic. Another question is that I fail to
> > understand why sched_autogroup_create_attach() does autogroup_create()
> > and changes signal->autogroup even if !sysctl_sched_autogroup_enabled.
> 
> I really cannot remember back that far, but it could be to allow
> flipping it back on. Then again, I don't think the fork path puts new
> tasks in, even if the autogroup exists.

Yeah, I've forgotten nearly everything about it too.  I do recall that
the autogroup always existed for both manual on/off and migrations
cgroup <--> root.

I think autogroup should go away.  Systemthing infestation is nearly
everywhere these days, and it wants to do ever more stupid shite that
will render it useless anyway.

	-Mike

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: sched/autogroup: race if !sysctl_sched_autogroup_enabled ?
  2016-11-11 16:57     ` Oleg Nesterov
@ 2016-11-13 13:59       ` Mike Galbraith
  2016-11-14 15:14         ` Oleg Nesterov
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Galbraith @ 2016-11-13 13:59 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, hartsjc, vbendel, vlovejoy, linux-kernel

On Fri, 2016-11-11 at 17:57 +0100, Oleg Nesterov wrote:

> I am starting to think that we should just move ->autogroup from signal_struct
> to task_struct. This will simplify the code and fix all these problems. But
> I need a simple fix for backporting anyway.

How about just rip out knobs that should have never been born.  More
can probably go as well.

(sorry for slowness btw.. on vacation, trying the "go outside" thing;)

sched/autogroup: Rip out dynamic knobs

Autogroup never should have had knobs in the first place, and now, Oleg has
discovered that the dynamic enable/disable capability either has become, or
perhaps always was racy.  Rip it all out, leaving only commandline disable.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 fs/proc/base.c               |   34 ---------------------------
 include/linux/sched/sysctl.h |    4 ---
 kernel/sched/auto_group.c    |   53 +++++++------------------------------------
 kernel/sched/auto_group.h    |    5 ----
 kernel/sysctl.c              |   11 --------
 5 files changed, 10 insertions(+), 97 deletions(-)

--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1455,39 +1455,6 @@ static int sched_autogroup_show(struct s
 	return 0;
 }
 
-static ssize_t
-sched_autogroup_write(struct file *file, const char __user *buf,
-	    size_t count, loff_t *offset)
-{
-	struct inode *inode = file_inode(file);
-	struct task_struct *p;
-	char buffer[PROC_NUMBUF];
-	int nice;
-	int err;
-
-	memset(buffer, 0, sizeof(buffer));
-	if (count > sizeof(buffer) - 1)
-		count = sizeof(buffer) - 1;
-	if (copy_from_user(buffer, buf, count))
-		return -EFAULT;
-
-	err = kstrtoint(strstrip(buffer), 0, &nice);
-	if (err < 0)
-		return err;
-
-	p = get_proc_task(inode);
-	if (!p)
-		return -ESRCH;
-
-	err = proc_sched_autogroup_set_nice(p, nice);
-	if (err)
-		count = err;
-
-	put_task_struct(p);
-
-	return count;
-}
-
 static int sched_autogroup_open(struct inode *inode, struct file *filp)
 {
 	int ret;
@@ -1504,7 +1471,6 @@ static int sched_autogroup_open(struct i
 static const struct file_operations proc_pid_sched_autogroup_operations = {
 	.open		= sched_autogroup_open,
 	.read		= seq_read,
-	.write		= sched_autogroup_write,
 	.llseek		= seq_lseek,
 	.release	= single_release,
 };
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -60,10 +60,6 @@ extern int sysctl_sched_rt_runtime;
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 #endif
 
-#ifdef CONFIG_SCHED_AUTOGROUP
-extern unsigned int sysctl_sched_autogroup_enabled;
-#endif
-
 extern int sched_rr_timeslice;
 
 extern int sched_rr_handler(struct ctl_table *table, int write,
--- a/kernel/sched/auto_group.c
+++ b/kernel/sched/auto_group.c
@@ -7,7 +7,7 @@
 #include <linux/security.h>
 #include <linux/export.h>
 
-unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+unsigned int __read_mostly sched_autogroup_enabled = 1;
 static struct autogroup autogroup_default;
 static atomic_t autogroup_seq_nr;
 
@@ -109,7 +109,7 @@ static inline struct autogroup *autogrou
 
 bool task_wants_autogroup(struct task_struct *p, struct task_group *tg)
 {
-	if (tg != &root_task_group)
+	if (!sched_autogroup_enabled || tg != &root_task_group)
 		return false;
 
 	/*
@@ -139,12 +139,9 @@ autogroup_move_group(struct task_struct
 
 	p->signal->autogroup = autogroup_kref_get(ag);
 
-	if (!READ_ONCE(sysctl_sched_autogroup_enabled))
-		goto out;
-
 	for_each_thread(p, t)
 		sched_move_task(t);
-out:
+
 	unlock_task_sighand(p, &flags);
 	autogroup_kref_put(prev);
 }
@@ -152,8 +149,11 @@ autogroup_move_group(struct task_struct
 /* Allocates GFP_KERNEL, cannot be called under any spinlock */
 void sched_autogroup_create_attach(struct task_struct *p)
 {
-	struct autogroup *ag = autogroup_create();
+	struct autogroup *ag;
 
+	if (!sched_autogroup_enabled)
+		return;
+	ag = autogroup_create();
 	autogroup_move_group(p, ag);
 	/* drop extra reference added by autogroup_create() */
 	autogroup_kref_put(ag);
@@ -179,7 +179,7 @@ void sched_autogroup_exit(struct signal_
 
 static int __init setup_autogroup(char *str)
 {
-	sysctl_sched_autogroup_enabled = 0;
+	sched_autogroup_enabled = 0;
 
 	return 1;
 }
@@ -187,41 +187,6 @@ static int __init setup_autogroup(char *
 __setup("noautogroup", setup_autogroup);
 
 #ifdef CONFIG_PROC_FS
-
-int proc_sched_autogroup_set_nice(struct task_struct *p, int nice)
-{
-	static unsigned long next = INITIAL_JIFFIES;
-	struct autogroup *ag;
-	int err;
-
-	if (nice < MIN_NICE || nice > MAX_NICE)
-		return -EINVAL;
-
-	err = security_task_setnice(current, nice);
-	if (err)
-		return err;
-
-	if (nice < 0 && !can_nice(current, nice))
-		return -EPERM;
-
-	/* this is a heavy operation taking global locks.. */
-	if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
-		return -EAGAIN;
-
-	next = HZ / 10 + jiffies;
-	ag = autogroup_task_get(p);
-
-	down_write(&ag->lock);
-	err = sched_group_set_shares(ag->tg, sched_prio_to_weight[nice + 20]);
-	if (!err)
-		ag->nice = nice;
-	up_write(&ag->lock);
-
-	autogroup_kref_put(ag);
-
-	return err;
-}
-
 void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
 {
 	struct autogroup *ag = autogroup_task_get(p);
@@ -230,7 +195,7 @@ void proc_sched_autogroup_show_task(stru
 		goto out;
 
 	down_read(&ag->lock);
-	seq_printf(m, "/autogroup-%ld nice %d\n", ag->id, ag->nice);
+	seq_printf(m, "/autogroup-%ld\n", ag->id);
 	up_read(&ag->lock);
 
 out:
--- a/kernel/sched/auto_group.h
+++ b/kernel/sched/auto_group.h
@@ -13,7 +13,6 @@ struct autogroup {
 	struct task_group	*tg;
 	struct rw_semaphore	lock;
 	unsigned long		id;
-	int			nice;
 };
 
 extern void autogroup_init(struct task_struct *init_task);
@@ -29,9 +28,7 @@ extern bool task_wants_autogroup(struct
 static inline struct task_group *
 autogroup_task_group(struct task_struct *p, struct task_group *tg)
 {
-	int enabled = READ_ONCE(sysctl_sched_autogroup_enabled);
-
-	if (enabled && task_wants_autogroup(p, tg))
+	if (task_wants_autogroup(p, tg))
 		return p->signal->autogroup->tg;
 
 	return tg;
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -437,17 +437,6 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rr_handler,
 	},
-#ifdef CONFIG_SCHED_AUTOGROUP
-	{
-		.procname	= "sched_autogroup_enabled",
-		.data		= &sysctl_sched_autogroup_enabled,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero,
-		.extra2		= &one,
-	},
-#endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.procname	= "sched_cfs_bandwidth_slice_us",

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: sched/autogroup: race if !sysctl_sched_autogroup_enabled ?
  2016-11-13 13:59       ` Mike Galbraith
@ 2016-11-14 15:14         ` Oleg Nesterov
  0 siblings, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2016-11-14 15:14 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, hartsjc, vbendel,
	vlovejoy, linux-kernel

On 11/13, Mike Galbraith wrote:
>
> Autogroup never should have had knobs in the first place, and now, Oleg has
> discovered that the dynamic enable/disable capability either has become, or
> perhaps always was racy.

It was always racy afaics, even if sysctl_sched_autogroup_enabled doesn't
change. So this patch is not enough.

But the main reason I dislike it is that I can't backport it ;) rhel7 runs
with sysctl_sched_autogroup_enabled=0 by default, and users actually use
this knob to enable it dynamically.

Mike, et al, sorry for delay. I'll try to finally force myself to write the
comments and the changelog and send 2 (trivial) fixes today.

Oleg.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-11-14 15:14 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-09 16:59 sched/autogroup: race if !sysctl_sched_autogroup_enabled ? Oleg Nesterov
2016-11-09 17:50 ` Peter Zijlstra
2016-11-10 13:09   ` Oleg Nesterov
2016-11-11 16:57     ` Oleg Nesterov
2016-11-13 13:59       ` Mike Galbraith
2016-11-14 15:14         ` Oleg Nesterov
2016-11-12 12:12   ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).