* Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-07-27  7:11 NeilBrown  (2 replies; 63+ messages in thread)
From: NeilBrown
To: Paul Menage, Ben Blum, Li Zefan, Oleg Nesterov
Cc: containers, Paul E. McKenney, linux-kernel

Hi,
 I've been exploring the use of RCU in the kernel, particularly looking for
things that don't quite look right.  I found cgroup_attach_proc, which was
added a few months ago.

It contains:

	rcu_read_lock();
	if (!thread_group_leader(leader)) {
		/*
		 * a race with de_thread from another thread's exec() may strip
		 * us of our leadership, making while_each_thread unsafe to use
		 * on this task. if this happens, there is no choice but to
		 * throw this task away and try again (from cgroup_procs_write);
		 * this is "double-double-toil-and-trouble-check locking".
		 */
		rcu_read_unlock();
		retval = -EAGAIN;
		goto out_free_group_list;
	}

(and having the comment helps a lot!)

The comment acknowledges a race with de_thread but seems to assume that
rcu_read_lock() will protect against that race.  It won't.
It could possibly protect if the racy code in de_thread() contained a call
to synchronize_rcu(), but it doesn't, so there is no obvious exclusion
between the two.
I note that some other locks are held and maybe some other lock provides
the required exclusion - I haven't explored that too deeply - but if that is
the case, then the use of rcu_read_lock() here is pointless - it isn't
needed just to call thread_group_leader().

The race as I understand it is with this code:

	list_replace_rcu(&leader->tasks, &tsk->tasks);
	list_replace_init(&leader->sibling, &tsk->sibling);

	tsk->group_leader = tsk;
	leader->group_leader = tsk;

which seems to be called with only tasklist_lock held, and that lock doesn't
seem to be held in the cgroup code.

If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before
this chunk is run with the same value for 'leader', but the
while_each_thread is run after, then the while_each_thread() might loop
forever.  rcu_read_lock doesn't prevent this from happening.

The code in de_thread() is actually questionable by itself.
"list_replace_rcu" cannot really be used on the head of a list - it is only
meant to be used on a member of a list.
To move a list from one head to another you should be using
list_splice_init_rcu().
The ->tasks list doesn't seem to have a clearly distinguished 'head', but
whatever is passed as 'g' to while_each_thread() is effectively a head, and
removing it from the list can cause a loop using while_each_thread() to
never find the head and so never complete.

I'm not sure how best to fix this, though possibly changing
while_each_thread to:

	while ((t = next_task(t)) != g && !thread_group_leader(t))

might be part of it.  We would also need to move
	tsk->group_leader = tsk;
in the above up to the top, and probably add some memory barrier.
However I don't know enough about how the list is used to be sure.

Comments?

Thanks,
NeilBrown
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-08-14 17:40 Oleg Nesterov
From: Oleg Nesterov
To: NeilBrown
Cc: Paul Menage, Ben Blum, Li Zefan, containers, Paul E. McKenney, linux-kernel

Sorry for delay, just noticed this thread...

On 07/27, NeilBrown wrote:
>
> The race as I understand it is with this code:
>
> 	list_replace_rcu(&leader->tasks, &tsk->tasks);
> 	list_replace_init(&leader->sibling, &tsk->sibling);
>
> 	tsk->group_leader = tsk;
> 	leader->group_leader = tsk;
>
> which seems to be called with only tasklist_lock held, which doesn't seem to
> be held in the cgroup code.
>
> If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before
> this chunk is run with the same value for 'leader', but the
> while_each_thread is run after, then the while_each_thread() might loop
> forever.  rcu_read_lock doesn't prevent this from happening.

Yes.  This was already discussed.  See http://marc.info/?t=127688987300002

Damn.  I forgot about this completely.

> The code in de_thread() is actually questionable by itself.
> "list_replace_rcu" cannot really be used on the head of a list - it is only
> meant to be used on a member of a list.
> To move a list from one head to another you should be using
> list_splice_init_rcu().

Hmm... can't understand this part.

And just in case... list_replace_rcu() looks fine afaics.  The real problem
is release_task(old_leader), which does list_del_rcu(old_leader->thread_group);
this is what breaks while_each_thread().

> The ->tasks list doesn't seem to have a clearly distinguished 'head'

Exactly.  This is the problem.

But: you seem to have confused ->tasks and ->thread_group ;)

Oleg.
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-08-15  0:11 NeilBrown
From: NeilBrown
To: Oleg Nesterov
Cc: Paul Menage, Ben Blum, Li Zefan, containers, Paul E. McKenney, linux-kernel

On Sun, 14 Aug 2011 19:40:00 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> Sorry for delay, just noticed this thread...
>
> On 07/27, NeilBrown wrote:
[ . . . ]
> Yes.  This was already discussed.  See http://marc.info/?t=127688987300002
>
> Damn.  I forgot about this completely.
>
[ . . . ]
> Hmm... can't understand this part.
>
> And just in case... list_replace_rcu() looks fine afaics.  The real problem
> is release_task(old_leader), which does list_del_rcu(old_leader->thread_group);
> this is what breaks while_each_thread().
>
> > The ->tasks list doesn't seem to have a clearly distinguished 'head'
>
> Exactly.  This is the problem.
>
> But: you seem to have confused ->tasks and ->thread_group ;)

That last point is certainly correct.  I caught myself confusing the two
several times and wouldn't be at all surprised if I did it without catching
myself.

de_thread can change the group_leader of a thread_group, and release_task
can remove a non-leader while leaving the rest of the thread_group intact.
So any while_each_thread() loop needs some extra care to ensure that it
doesn't loop infinitely, because the "head" that it is looking for might
not be there any more.

Maybe there are other rules that ensure this can never happen, but they
sure aren't obvious to me (i.e. if you know them - please tell ;-)

NeilBrown
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-08-15 19:09 Oleg Nesterov
From: Oleg Nesterov
To: NeilBrown
Cc: Ben Blum, containers, linux-kernel, Paul Menage, Paul E. McKenney

On 08/15, NeilBrown wrote:
>
> de_thread can change the group_leader of a thread_group, and release_task
> can remove a non-leader while leaving the rest of the thread_group intact.
> So any while_each_thread() loop needs some extra care to ensure that it
> doesn't loop infinitely, because the "head" that it is looking for might
> not be there any more.
>
> Maybe there are other rules that ensure this can never happen, but they
> sure aren't obvious to me (i.e. if you know them - please tell ;-)

No, I don't know ;)

And note also that if g != leader, then while_each_thread(g, t) can hang
simply because g exits.

I am still trying to invent something simple to fix
while_each_thread-under-rcu.  This looks possible, but I am starting to
think that, say, zap_threads() needs locking anyway.  With any fix I can
imagine, it can miss a thread we should care about.

Oleg.
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-07-27 15:07 Ben Blum
From: Ben Blum
To: NeilBrown
Cc: Paul Menage, Li Zefan, Oleg Nesterov, containers, Paul E. McKenney, linux-kernel

On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote:
>
> Hi,
>  I've been exploring the use of RCU in the kernel, particularly looking for
> things that don't quite look right.  I found cgroup_attach_proc which was
> added a few months ago.

Awesome, thanks! :)

> It contains:
>
> 	rcu_read_lock();
> 	if (!thread_group_leader(leader)) {
> 		/*
> 		 * a race with de_thread from another thread's exec() may strip
> 		 * us of our leadership, making while_each_thread unsafe to use
> 		 * on this task. if this happens, there is no choice but to
> 		 * throw this task away and try again (from cgroup_procs_write);
> 		 * this is "double-double-toil-and-trouble-check locking".
> 		 */
> 		rcu_read_unlock();
> 		retval = -EAGAIN;
> 		goto out_free_group_list;
> 	}
>
> (and having the comment helps a lot!)
>
> The comment acknowledges a race with de_thread but seems to assume that
> rcu_read_lock() will protect against that race.  It won't.
[ . . . ]
> I note that some other locks are held and maybe some other lock provides
> the required exclusion - I haven't explored that too deeply - but if that is
> the case, then the use of rcu_read_lock() here is pointless - it isn't
> needed just to call thread_group_leader().

I wrote this code, and I admit to not having a full understanding of RCU
myself.  The code was once more complicated (before the patches went in,
mind you), and had a series of checks like that leading up to a
list_for_each_entry over the ->thread_group list (in "step 3", instead of
iterating over the flex_array), and had read_lock(&tasklist_lock) around
it. (...)

(The other locks held are just cgroup_mutex and threadgroup_fork_lock,
which wouldn't provide the exclusion.)

> The race as I understand it is with this code:
>
> 	list_replace_rcu(&leader->tasks, &tsk->tasks);
> 	list_replace_init(&leader->sibling, &tsk->sibling);
>
> 	tsk->group_leader = tsk;
> 	leader->group_leader = tsk;
>
> which seems to be called with only tasklist_lock held, which doesn't seem to
> be held in the cgroup code.
>
> If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before
> this chunk is run with the same value for 'leader', but the
> while_each_thread is run after, then the while_each_thread() might loop
> forever.  rcu_read_lock doesn't prevent this from happening.

Somehow I was under the impression that holding tasklist_lock (for
writing) provided exclusion from code that holds rcu_read_lock -
probably because there are other points in the kernel which do
while_each_thread with only RCU-read held (and not tasklist):

- kernel/hung_task.c, check_hung_uninterruptible_tasks()
- kernel/posix-cpu-timers.c, thread_group_cputime()
- fs/ioprio.c, ioprio_set() and ioprio_get()

(There are also places, like kernel/signal.c, where code does
while_each_thread with only sighand->siglock held.  This also seems
sketchy, since de_thread only takes that lock after the code quoted
above.  There's a big comment in fs/exec.c where this is also done, but I
don't quite understand it.)

You seem to imply that rcu_read_lock() doesn't exclude against
write_lock(&tasklist_lock).  If that's true, then we can fix the cgroup
code simply by replacing rcu_read_lock/rcu_read_unlock with read_lock
and read_unlock on tasklist_lock.  (I can put a bugfix patch together
quickly for this if so.)

Wouldn't this mean that the three places listed above are also wrong?

> The code in de_thread() is actually questionable by itself.
> "list_replace_rcu" cannot really be used on the head of a list - it is only
> meant to be used on a member of a list.
> To move a list from one head to another you should be using
> list_splice_init_rcu().
[ . . . ]

I barely understand de_thread() from the reader's perspective, let alone
from the author's perspective, so I can't speak for that one.

Thanks for pointing this out!

-- Ben
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-07-27 23:42 Paul E. McKenney
From: Paul E. McKenney
To: Ben Blum
Cc: NeilBrown, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel

On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote:
> On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote:

[ . . . ]

> Somehow I was under the impression that holding tasklist_lock (for
> writing) provided exclusion from code that holds rcu_read_lock -
> probably because there are other points in the kernel which do
> while_each_thread with only RCU-read held (and not tasklist):
>
> - kernel/hung_task.c, check_hung_uninterruptible_tasks()

This one looks OK to me.  The code is just referencing fields in each
of the task structures, and appears to be making proper use of
rcu_dereference().  All this code requires is that the task structures
remain in existence through the full lifetime of the RCU read-side
critical section, which is guaranteed because of the way the task_struct
is freed.

> - kernel/posix-cpu-timers.c, thread_group_cputime()

Same story here.  The code only needs the task_struct structures to
stick around long enough to compute the CPU time consumed, so this
should be OK.

> - fs/ioprio.c, ioprio_set() and ioprio_get()

Here RCU protects traversal to the task structure as in the two examples
above.  In the ioprio_set() case, the actual change is carried out under
task_lock().  So this code relies on RCU to safely locate the task
structure, and on locking to safely carry out the update.

The ioprio_get() is using RCU to protect finding the task structure.
This returns an integer (the I/O priority), so no locking is required.
Though if compilers continue becoming increasingly aggressive with their
optimizations, there might need to be an ACCESS_ONCE() in
get_task_ioprio().

> (There are also places, like kernel/signal.c, where code does
> while_each_thread with only sighand->siglock held. ...)
>
> You seem to imply that rcu_read_lock() doesn't exclude against
> write_lock(&tasklist_lock).  If that's true, then we can fix the cgroup
> code simply by replacing rcu_read_lock/rcu_read_unlock with read_lock
> and read_unlock on tasklist_lock.  (I can put a bugfix patch together
> quickly for this if so.)

Indeed, rcu_read_lock() does not exclude write_lock(&tasklist_lock).
In fact, it doesn't exclude against -any- lock.  But don't take my word
for it, instead take a look at the heaviest-weight implementation of
rcu_read_lock() that currently exists in mainline:

	void __rcu_read_lock(void)
	{
		current->rcu_read_lock_nesting++;
		barrier();
	}

That really is all there is to it, at least in the absence of
CONFIG_PROVE_RCU.  The first line does nothing more than increment a
counter in the current task's task_struct, while the second line does
nothing more than suppress code-motion optimizations that the compiler
might otherwise be tempted to undertake.

The purpose of rcu_read_lock() is instead to prevent any subsequent
synchronize_rcu() calls (on any CPU or from any task) from returning
until this task executes the matching rcu_read_unlock().  If you make
careful use of rcu_read_lock(), synchronize_rcu(), and friends, you can
get an effect that in some ways resembles mutual exclusion, but that is
much faster and more scalable.

In the case of the task structure, rcu_read_lock() guarantees that any
task_struct that you can get a pointer to will remain in existence until
you execute the matching rcu_read_unlock().  By "remain in existence",
I mean that the task_struct won't be freed (and possibly reused).

So rcu_read_lock() is giving you the following guarantee: if you
legitimately pick up a pointer to an RCU-protected data structure, then
that data structure will not be freed until after you execute the
matching rcu_read_unlock().

> Wouldn't this mean that the three places listed above are also wrong?

I do not believe so, give or take the ACCESS_ONCE().  The issue in your
case is that you are relying not just on the process to exist, but also
on the state of group leadership.  RCU protects the former, but not the
latter.

I suspect that your suggested bugfix patch would do the trick, but I
must defer to people who understand tasklist locking better than I do.

							Thanx, Paul
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-27 23:42 ` Paul E. McKenney [not found] ` <20110727234235.GA2318-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2011-07-28 1:08 ` NeilBrown 2011-07-28 6:26 ` Ben Blum ` (3 more replies) 1 sibling, 4 replies; 63+ messages in thread From: NeilBrown @ 2011-07-28 1:08 UTC (permalink / raw) To: paulmck Cc: Ben Blum, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > [ . . . ] > > > > The race as I understand it is with this code: > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > tsk->group_leader = tsk; > > > leader->group_leader = tsk; > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > be held in the cgroup code. > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > this chunk is run with the same value for 'leader', but the > > > while_each_thread is run after, then the while_read_thread() might loop > > > forever. rcu_read_lock doesn't prevent this from happening. > > > > Somehow I was under the impression that holding tasklist_lock (for > > writing) provided exclusion from code that holds rcu_read_lock - > > probably because there are other points in the kernel which do > > while_each_thread with only RCU-read held (and not tasklist): > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > This one looks OK to me. The code is just referencing fields in each > of the task structures, and appears to be making proper use of > rcu_dereference(). 
All this code requires is that the task structures > remain in existence through the full lifetime of the RCU read-side > critical section, which is guaranteed because of the way the task_struct > is freed. I disagree. It also requires - by virtue of the use of while_each_thread() - that 'g' remains on the list that 't' is walking along. Now for a normal list, the head always stays on the list and is accessible even from an rcu-removed entry. But the thread_group list isn't a normal list. It doesn't have a distinct head. It is a loop of all of the 'task_structs' in a thread group. One of them is designated the 'leader' but de_thread() can change the 'leader' - though it doesn't remove the old leader. __unhash_process in kernel/exit.c looks like it could remove the leader from the list and definitely could remove a non-leader. So if a non-leader calls 'exec' and the leader calls 'exit', then a task_struct that was the leader could become a non-leader and then be removed from the list that kernel/hung_task could be walking along. So I don't think that while_each_thread() is currently safe. It depends on the thread leader not disappearing and I think it can. So I'm imagining a patch like this to ensure that while_each_thread() is actually safe. If it is always safe you can remove that extra check in cgroup_attach_proc() which looked wrong. I just hope someone who understands the process tree is listening. The change in exit.c is the most uncertain part.
NeilBrown diff --git a/fs/exec.c b/fs/exec.c index 6075a1e..c9ea5f0 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -960,6 +960,9 @@ static int de_thread(struct task_struct *tsk) list_replace_init(&leader->sibling, &tsk->sibling); tsk->group_leader = tsk; + smp_mb(); /* ensure that any reader will always be able to see + * a task that claims to be the group leader + */ leader->group_leader = tsk; tsk->exit_signal = SIGCHLD; diff --git a/include/linux/sched.h b/include/linux/sched.h index 14a6c7b..13e0192 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2267,8 +2267,10 @@ extern bool current_is_single_threaded(void); #define do_each_thread(g, t) \ for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do +/* Thread group leader can change, so stop loop when we see one + * even if it isn't 'g' */ #define while_each_thread(g, t) \ - while ((t = next_thread(t)) != g) + while ((t = next_thread(t)) != g && !thread_group_leader(t)) static inline int get_nr_threads(struct task_struct *tsk) { diff --git a/kernel/exit.c b/kernel/exit.c index f2b321b..d6cef25 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) list_del_rcu(&p->tasks); list_del_init(&p->sibling); __this_cpu_dec(process_counts); - } - list_del_rcu(&p->thread_group); + } else + /* only remove members from the thread group. + * The thread group leader must stay so that + * while_each_thread() uses can see the end of + * the list and stop. + */ + list_del_rcu(&p->thread_group); } /* ^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-28 1:08 ` NeilBrown @ 2011-07-28 6:26 ` Ben Blum 2011-07-28 7:13 ` NeilBrown [not found] ` <20110728062616.GC15204-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 2011-07-28 12:17 ` Paul E. McKenney ` (2 subsequent siblings) 3 siblings, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-07-28 6:26 UTC (permalink / raw) To: NeilBrown Cc: paulmck, Ben Blum, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > <paulmck@linux.vnet.ibm.com> wrote: > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > [ . . . ] > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > tsk->group_leader = tsk; > > > > leader->group_leader = tsk; > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > be held in the cgroup code. > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > this chunk is run with the same value for 'leader', but the > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > forever. rcu_read_lock doesn't prevent this from happening. > > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > writing) provided exclusion from code that holds rcu_read_lock - > > > probably because there are other points in the kernel which do > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > This one looks OK to me. 
The code is just referencing fields in each > > of the task structures, and appears to be making proper use of > > rcu_dereference(). All this code requires is that the task structures > > remain in existence through the full lifetime of the RCU read-side > > critical section, which is guaranteed because of the way the task_struct > > is freed. > > I disagree. It also requires - by virtue of the use of while_each_thread() - > that 'g' remains on the list that 't' is walking along. > > Now for a normal list, the head always stays on the list and is accessible > even from an rcu-removed entry. But the thread_group list isn't a normal > list. It doesn't have a distinct head. It is a loop of all of the > 'task_structs' in a thread group. One of them is designated the 'leader' but > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > __unhash_process in mm/exit.c looks like it could remove the leader from the > list and definitely could remove a non-leader. > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > task_struct that was the leader could become a non-leader and then be removed > from the list that kernel/hung_task could be walking along. That agrees with my understanding. > > So I don't think that while_each_thread() is currently safe. It depends on > the thread leader not disappearing and I think it can. I think that while_each_thread is perfectly safe, it just needs to be protected properly while used. it reads the tasklist, and both competing paths (__unhash_process and de_thread) are done with tasklist_lock write locked, so read-locking ought to suffice. all it needs is to be better documented. > [...] 
> > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) this is semantically wrong: it will stop as soon as it finds a thread that has newly become the leader, and not run the loop body code in that thread's case. so the thread that just execed would not get run on, and in the case of my code, would "escape" the cgroup migration. but I argue it is also organisationally wrong. while_each_thread's purpose is just to worry about the structure of the process list, not to account for behavioural details of de_thread. this check belongs outside of the macro, and it should be protected by tasklist_lock in the same critical section in which while_each_thread is used. -- Ben > > static inline int get_nr_threads(struct task_struct *tsk) > { > diff --git a/kernel/exit.c b/kernel/exit.c > index f2b321b..d6cef25 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. > + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); > } > > /* > > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-28 6:26 ` Ben Blum @ 2011-07-28 7:13 ` NeilBrown [not found] ` <20110728171345.67d3797d-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-07-29 14:28 ` Ben Blum [not found] ` <20110728062616.GC15204-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 1 sibling, 2 replies; 63+ messages in thread From: NeilBrown @ 2011-07-28 7:13 UTC (permalink / raw) To: Ben Blum Cc: paulmck, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel On Thu, 28 Jul 2011 02:26:16 -0400 Ben Blum <bblum@andrew.cmu.edu> wrote: > On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > > <paulmck@linux.vnet.ibm.com> wrote: > > > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > > > [ . . . ] > > > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > > > tsk->group_leader = tsk; > > > > > leader->group_leader = tsk; > > > > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > > be held in the cgroup code. > > > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > > this chunk is run with the same value for 'leader', but the > > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > > forever. rcu_read_lock doesn't prevent this from happening. 
> > > > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > > writing) provided exclusion from code that holds rcu_read_lock - > > > > probably because there are other points in the kernel which do > > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > > > This one looks OK to me. The code is just referencing fields in each > > > of the task structures, and appears to be making proper use of > > > rcu_dereference(). All this code requires is that the task structures > > > remain in existence through the full lifetime of the RCU read-side > > > critical section, which is guaranteed because of the way the task_struct > > > is freed. > > > > I disagree. It also requires - by virtue of the use of while_each_thread() - > > that 'g' remains on the list that 't' is walking along. > > > > Now for a normal list, the head always stays on the list and is accessible > > even from an rcu-removed entry. But the thread_group list isn't a normal > > list. It doesn't have a distinct head. It is a loop of all of the > > 'task_structs' in a thread group. One of them is designated the 'leader' but > > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > > > __unhash_process in mm/exit.c looks like it could remove the leader from the > > list and definitely could remove a non-leader. > > > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > > task_struct that was the leader could become a non-leader and then be removed > > from the list that kernel/hung_task could be walking along. > > That agrees with my understanding. > > > > > So I don't think that while_each_thread() is currently safe. It depends on > > the thread leader not disappearing and I think it can. > > I think that while_each_thread is perfectly safe, it just needs to be > protected properly while used. 
it reads the tasklist, and both competing > paths (__unhash_process and de_thread) are done with tasklist_lock write > locked, so read-locking ought to suffice. all it needs is to be better > documented. That might be one answer. However the fact that rcu primitives are used to walk the list (next_thread() contains list_entry_rcu) seems to suggest that the intention was that RCU could be used for read access - which in large part it can. So we have two alternatives: 1/ take a reader spinlock to walk the thread_group list 2/ make RCU work safely for walking thread_group I thought it was best to try the latter first. > > > [...] > > > > +/* Thread group leader can change, so stop loop when we see one > > + * even if it isn't 'g' */ > > #define while_each_thread(g, t) \ > > - while ((t = next_thread(t)) != g) > > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) > > this is semantically wrong: it will stop as soon as it finds a thread > that has newly become the leader, and not run the loop body code in that > thread's case. so the thread that just execed would not get run on, and > in the case of my code, would "escape" the cgroup migration. Yes, you are right ... For your case I think you do want the thread group list to be both safe and stable. RCU can only provide 'safe' for a list. So I think I'm convinced that you need a readlock on tasklist_lock. That doesn't mean that all users of while_each_thread do. > > but I argue it is also organisationally wrong. while_each_thread's > purpose is just to worry about the structure of the process list, not to > account for behavioural details of de_thread. this check belongs outside > of the macro, and it should be protected by tasklist_lock in the same > critical section in which while_each_thread is used. Sound fair. Maybe we need two versions of while_each_thread(). One that assumes tasklist_lock and guarantees getting every thread, and one that is less precise but only needs rcu_read_lock(). 
That would require an audit of each call site to see what is really needed. But yes: if you replace your rcu_read_lock()s with read_lock(&tasklist_lock) then you will be safe, and I don't think you can do any better (i.e. less locking) than that. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-07-28 7:13 ` NeilBrown [not found] ` <20110728171345.67d3797d-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2011-07-29 14:28 ` Ben Blum 2011-08-01 19:31 ` Paul Menage ` (2 more replies) 1 sibling, 3 replies; 63+ messages in thread From: Ben Blum @ 2011-07-29 14:28 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, paulmck, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel, Andrew Morton Fix unstable tasklist locking in cgroup_attach_proc. From: Ben Blum <bblum@andrew.cmu.edu> According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is not sufficient to guarantee the tasklist is stable w.r.t. de_thread and exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures proper exclusion. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> --- kernel/cgroup.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff a/kernel/cgroup.c b/kernel/cgroup.c --- a/kernel/cgroup.c 2011-07-21 19:17:23.000000000 -0700 +++ b/kernel/cgroup.c 2011-07-29 06:17:47.000000000 -0700 @@ -2024,7 +2024,7 @@ goto out_free_group_list; /* prevent changes to the threadgroup list while we take a snapshot. */ - rcu_read_lock(); + read_lock(&tasklist_lock); if (!thread_group_leader(leader)) { /* * a race with de_thread from another thread's exec() may strip @@ -2033,7 +2033,7 @@ * throw this task away and try again (from cgroup_procs_write); * this is "double-double-toil-and-trouble-check locking". */ - rcu_read_unlock(); + read_unlock(&tasklist_lock); retval = -EAGAIN; goto out_free_group_list; } @@ -2054,7 +2054,7 @@ } while_each_thread(leader, tsk); /* remember the number of threads in the array for later. */ group_size = i; - rcu_read_unlock(); + read_unlock(&tasklist_lock); /* * step 1: check that we can legitimately attach to the cgroup. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-07-29 14:28 ` Ben Blum @ 2011-08-01 19:31 ` Paul Menage [not found] ` <20110729142842.GA8462-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 2011-08-15 18:49 ` Oleg Nesterov 2 siblings, 0 replies; 63+ messages in thread From: Paul Menage @ 2011-08-01 19:31 UTC (permalink / raw) To: Ben Blum Cc: NeilBrown, paulmck, Li Zefan, Oleg Nesterov, containers, linux-kernel, Andrew Morton On Fri, Jul 29, 2011 at 7:28 AM, Ben Blum <bblum@andrew.cmu.edu> wrote: > Fix unstable tasklist locking in cgroup_attach_proc. > > From: Ben Blum <bblum@andrew.cmu.edu> > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > ensures proper exclusion. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <menage@google.com> Thanks, Paul > --- > kernel/cgroup.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff a/kernel/cgroup.c b/kernel/cgroup.c > --- a/kernel/cgroup.c 2011-07-21 19:17:23.000000000 -0700 > +++ b/kernel/cgroup.c 2011-07-29 06:17:47.000000000 -0700 > @@ -2024,7 +2024,7 @@ > goto out_free_group_list; > > /* prevent changes to the threadgroup list while we take a snapshot. */ > - rcu_read_lock(); > + read_lock(&tasklist_lock); > if (!thread_group_leader(leader)) { > /* > * a race with de_thread from another thread's exec() may strip > @@ -2033,7 +2033,7 @@ > * throw this task away and try again (from cgroup_procs_write); > * this is "double-double-toil-and-trouble-check locking". > */ > - rcu_read_unlock(); > + read_unlock(&tasklist_lock); > retval = -EAGAIN; > goto out_free_group_list; > } > @@ -2054,7 +2054,7 @@ > } while_each_thread(leader, tsk); > /* remember the number of threads in the array for later. 
*/ > group_size = i; > - rcu_read_unlock(); > + read_unlock(&tasklist_lock); > > /* > * step 1: check that we can legitimately attach to the cgroup. > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-07-29 14:28 ` Ben Blum 2011-08-01 19:31 ` Paul Menage [not found] ` <20110729142842.GA8462-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-08-15 18:49 ` Oleg Nesterov 2011-08-15 22:50 ` Frederic Weisbecker ` (4 more replies) 2 siblings, 5 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-08-15 18:49 UTC (permalink / raw) To: Ben Blum Cc: NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton, Frederic Weisbecker On 07/29, Ben Blum wrote: > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > ensures proper exclusion. Yes. So far I still think we should fix while_each_thread() so that it works under rcu_read_lock() "as expected", I'll try to think more. But whatever we do with while_each_thread(), this can't help cgroup_attach_proc(), it needs the locking. > - rcu_read_lock(); > + read_lock(&tasklist_lock); > if (!thread_group_leader(leader)) { Agreed, this should work. But can't we avoid the global list? thread_group_leader() or not, we do not really care. We only need to ensure we can safely find all threads. How about the patch below? With or without this/your patch this leader can die right after we drop the lock. ss->can_attach(leader) and ss->attach(leader) look suspicious. If a sub-thread execs, this task_struct has nothing to do with the threadgroup. Also. This is off-topic, but... Why do cgroup_attach_proc() and cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() in a different order? cgroup_attach_proc() looks wrong even if it currently doesn't matter. Oleg.
--- x/kernel/cgroup.c
+++ x/kernel/cgroup.c
@@ -2000,6 +2000,7 @@ int cgroup_attach_proc(struct cgroup *cg
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
 	struct flex_array *group;
+	unsigned long flags;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2027,19 +2028,10 @@ int cgroup_attach_proc(struct cgroup *cg
 		goto out_free_group_list;

 	/* prevent changes to the threadgroup list while we take a snapshot. */
-	rcu_read_lock();
-	if (!thread_group_leader(leader)) {
-		/*
-		 * a race with de_thread from another thread's exec() may strip
-		 * us of our leadership, making while_each_thread unsafe to use
-		 * on this task. if this happens, there is no choice but to
-		 * throw this task away and try again (from cgroup_procs_write);
-		 * this is "double-double-toil-and-trouble-check locking".
-		 */
-		rcu_read_unlock();
-		retval = -EAGAIN;
+	retval = -EAGAIN;
+	if (!lock_task_sighand(leader, &flags))
 		goto out_free_group_list;
-	}
+
 	/* take a reference on each task in the group to go in the array. */
 	tsk = leader;
 	i = 0;
@@ -2055,9 +2047,9 @@ int cgroup_attach_proc(struct cgroup *cg
 		BUG_ON(retval != 0);
 		i++;
 	} while_each_thread(leader, tsk);
+	unlock_task_sighand(leader, &flags);
 	/* remember the number of threads in the array for later. */
 	group_size = i;
-	rcu_read_unlock();

 	/*
 	 * step 1: check that we can legitimately attach to the cgroup.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-08-15 18:49 ` Oleg Nesterov @ 2011-08-15 22:50 ` Frederic Weisbecker 2011-08-15 23:04 ` Ben Blum ` (3 more replies) [not found] ` <20110815184957.GA16588-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (3 subsequent siblings) 4 siblings, 4 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-08-15 22:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > On 07/29, Ben Blum wrote: > > > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > > ensures proper exclusion. > > Yes. > > So far I still think we should fix while_each_thread() so that it works > under rcu_read_lock() "as exepected", I'll try to think more. > > But whatever we do with while_each_thread(), this can't help > cgroup_attach_proc(), it needs the locking. > > > - rcu_read_lock(); > > + read_lock(&tasklist_lock); > > if (!thread_group_leader(leader)) { > > Agreed, this should work. > > But can't we avoid the global list? thread_group_leader() or not, we do > not really care. We only need to ensure we can safely find all threads. > > How about the patch below? > > > With or without this/your patch this leader can die right after we > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > suspicious. If a sub-thread execs, this task_struct has nothing to > do with the threadgroup. > > > > Also. This is off-topic, but... Why cgroup_attach_proc() and > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > in the different order? cgroup_attach_proc() looks wrong even > if currently doesn't matter. Right. 
As we concluded in our off-list discussion, if there is no strong reason for that, I'm going to fix that in my task counter patchset because there it really matters. If we can't migrate the thread because it has already exited, we really don't want to call ->attach_task() but rather cancel_attach_task(). Thanks. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-08-15 22:50 ` Frederic Weisbecker @ 2011-08-15 23:04 ` Ben Blum 2011-08-15 23:09 ` Ben Blum [not found] ` <20110815230415.GA6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 2011-08-15 23:04 ` Ben Blum ` (2 subsequent siblings) 3 siblings, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-08-15 23:04 UTC (permalink / raw) To: Frederic Weisbecker Cc: Oleg Nesterov, Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Tue, Aug 16, 2011 at 12:50:06AM +0200, Frederic Weisbecker wrote: > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > On 07/29, Ben Blum wrote: > > > > > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > > > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > > > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > > > ensures proper exclusion. > > > > Yes. > > > > So far I still think we should fix while_each_thread() so that it works > > under rcu_read_lock() "as exepected", I'll try to think more. > > > > But whatever we do with while_each_thread(), this can't help > > cgroup_attach_proc(), it needs the locking. > > > > > - rcu_read_lock(); > > > + read_lock(&tasklist_lock); > > > if (!thread_group_leader(leader)) { > > > > Agreed, this should work. > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > not really care. We only need to ensure we can safely find all threads. > > > > How about the patch below? > > > > > > With or without this/your patch this leader can die right after we > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > suspicious. If a sub-thread execs, this task_struct has nothing to > > do with the threadgroup. > > > > > > > > Also. This is off-topic, but... 
Why cgroup_attach_proc() and > > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > > in the different order? cgroup_attach_proc() looks wrong even > > if currently doesn't matter. > > Right. As we concluded in our off-list discussion, if there > is no strong reason for that, I'm going to fix that in my task > counter patchset because there it really matters. If we can't > migrate the thread because it has already exited, we really > don't want to call ->attach_task() but rather cancel_attach_task(). > > Thanks. > Yes. Um, this must have been a mistake on my part. The lines of code should be the other way around. It should be done in a separate bugfix patch, though, so it goes through faster... -- Ben ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-08-15 23:04 ` Ben Blum @ 2011-08-15 23:09 ` Ben Blum 2011-08-15 23:19 ` Frederic Weisbecker [not found] ` <20110815230900.GB6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> [not found] ` <20110815230415.GA6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 1 sibling, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-08-15 23:09 UTC (permalink / raw) To: Ben Blum Cc: Frederic Weisbecker, Oleg Nesterov, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 07:04:15PM -0400, Ben Blum wrote: > On Tue, Aug 16, 2011 at 12:50:06AM +0200, Frederic Weisbecker wrote: > > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > > On 07/29, Ben Blum wrote: > > > > > > > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > > > > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > > > > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > > > > ensures proper exclusion. > > > > > > Yes. > > > > > > So far I still think we should fix while_each_thread() so that it works > > > under rcu_read_lock() "as exepected", I'll try to think more. > > > > > > But whatever we do with while_each_thread(), this can't help > > > cgroup_attach_proc(), it needs the locking. > > > > > > > - rcu_read_lock(); > > > > + read_lock(&tasklist_lock); > > > > if (!thread_group_leader(leader)) { > > > > > > Agreed, this should work. > > > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > > not really care. We only need to ensure we can safely find all threads. > > > > > > How about the patch below? > > > > > > > > > With or without this/your patch this leader can die right after we > > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > > suspicious. If a sub-thread execs, this task_struct has nothing to > > > do with the threadgroup. 
> > > > > > > > > > > > Also. This is off-topic, but... Why cgroup_attach_proc() and > > > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > > > in the different order? cgroup_attach_proc() looks wrong even > > > if currently doesn't matter. > > > > Right. As we concluded in our off-list discussion, if there > > is no strong reason for that, I'm going to fix that in my task > > counter patchset because there it really matters. If we can't > > migrate the thread because it has already exited, we really > > don't want to call ->attach_task() but rather cancel_attach_task(). > > > > Thanks. > > > > Yes. Um, this must have been a mistake on my part. The lines of code > should be the other way around. It should be done in a separate bugfix > patch, though, so it goes through faster... > > -- Ben > also, there is no cancel_attach_task, afaict. cancel_attach is only for memcg, for the whole operation at once - unless you are changing this, in which case feel free to modify on top of the patch I'm about to send out. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-08-15 23:09 ` Ben Blum @ 2011-08-15 23:19 ` Frederic Weisbecker [not found] ` <20110815230900.GB6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 1 sibling, 0 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-08-15 23:19 UTC (permalink / raw) To: Ben Blum Cc: Oleg Nesterov, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 07:09:00PM -0400, Ben Blum wrote: > On Mon, Aug 15, 2011 at 07:04:15PM -0400, Ben Blum wrote: > > On Tue, Aug 16, 2011 at 12:50:06AM +0200, Frederic Weisbecker wrote: > > > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > > > On 07/29, Ben Blum wrote: > > > > > > > > > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > > > > > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > > > > > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > > > > > ensures proper exclusion. > > > > > > > > Yes. > > > > > > > > So far I still think we should fix while_each_thread() so that it works > > > > under rcu_read_lock() "as exepected", I'll try to think more. > > > > > > > > But whatever we do with while_each_thread(), this can't help > > > > cgroup_attach_proc(), it needs the locking. > > > > > > > > > - rcu_read_lock(); > > > > > + read_lock(&tasklist_lock); > > > > > if (!thread_group_leader(leader)) { > > > > > > > > Agreed, this should work. > > > > > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > > > not really care. We only need to ensure we can safely find all threads. > > > > > > > > How about the patch below? > > > > > > > > > > > > With or without this/your patch this leader can die right after we > > > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > > > suspicious. If a sub-thread execs, this task_struct has nothing to > > > > do with the threadgroup. 
> > > > > > > > > > > > > > > > Also. This is off-topic, but... Why cgroup_attach_proc() and > > > > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > > > > in the different order? cgroup_attach_proc() looks wrong even > > > > if currently doesn't matter. > > > > > > Right. As we concluded in our off-list discussion, if there > > > is no strong reason for that, I'm going to fix that in my task > > > counter patchset because there it really matters. If we can't > > > migrate the thread because it has already exited, we really > > > don't want to call ->attach_task() but rather cancel_attach_task(). > > > > > > Thanks. > > > > > > > Yes. Um, this must have been a mistake on my part. The lines of code > > should be the other way around. It should be done in a separate bugfix > > patch, though, so it goes through faster... > > > > -- Ben > > > > also, there is no cancel_attach_task, afaict. cancel_attach is only for > memcg, for the whole operation at once - unless you are changing this, > in which case feel free to modify on top of the patch I'm about to send > out. Yeah right, I forgot it's something I'm adding with the task counter subsystem. ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH][BUGFIX] cgroups: fix ordering of calls in cgroup_attach_proc 2011-08-15 22:50 ` Frederic Weisbecker 2011-08-15 23:04 ` Ben Blum 2011-08-15 23:04 ` Ben Blum @ 2011-08-15 23:11 ` Ben Blum 2011-08-15 23:20 ` Frederic Weisbecker ` (2 more replies) 2011-08-15 23:11 ` Ben Blum 3 siblings, 3 replies; 63+ messages in thread From: Ben Blum @ 2011-08-15 23:11 UTC (permalink / raw) To: Frederic Weisbecker Cc: Oleg Nesterov, Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton

Fix ordering of cgroup_task_migrate and attach_task in cgroup_attach_proc.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 kernel/cgroup.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff -u a/kernel/cgroup.c b/kernel/cgroup.c
--- a/kernel/cgroup.c	2011-08-15 19:05:37.308401219 -0400
+++ b/kernel/cgroup.c	2011-08-15 19:07:47.376775142 -0400
@@ -2132,14 +2132,17 @@
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
 			continue;
-		/* attach each task to each subsystem */
-		for_each_subsys(root, ss) {
-			if (ss->attach_task)
-				ss->attach_task(cgrp, tsk);
-		}
 		/* if the thread is PF_EXITING, it can just get skipped. */
 		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
-		BUG_ON(retval != 0 && retval != -ESRCH);
+		if (retval == 0) {
+			/* attach each task to each subsystem */
+			for_each_subsys(root, ss) {
+				if (ss->attach_task)
+					ss->attach_task(cgrp, tsk);
+			}
+		} else {
+			BUG_ON(retval != -ESRCH);
+		}
 	}
 	/* nothing is sensitive to fork() after this point. */

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: fix ordering of calls in cgroup_attach_proc 2011-08-15 23:11 ` [PATCH][BUGFIX] cgroups: fix ordering of calls " Ben Blum @ 2011-08-15 23:20 ` Frederic Weisbecker [not found] ` <20110815231156.GC6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 2011-08-15 23:31 ` Paul Menage 2 siblings, 0 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-08-15 23:20 UTC (permalink / raw) To: Ben Blum Cc: Oleg Nesterov, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 07:11:56PM -0400, Ben Blum wrote: > Fix ordering of cgroup_task_migrate and attach_task in cgroup_attach_proc. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: fix ordering of calls in cgroup_attach_proc 2011-08-15 23:11 ` [PATCH][BUGFIX] cgroups: fix ordering of calls " Ben Blum 2011-08-15 23:20 ` Frederic Weisbecker [not found] ` <20110815231156.GC6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-08-15 23:31 ` Paul Menage 2 siblings, 0 replies; 63+ messages in thread From: Paul Menage @ 2011-08-15 23:31 UTC (permalink / raw) To: Ben Blum Cc: Frederic Weisbecker, Oleg Nesterov, NeilBrown, paulmck, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 4:11 PM, Ben Blum <bblum@andrew.cmu.edu> wrote: > Fix ordering of cgroup_task_migrate and attach_task in cgroup_attach_proc. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> BTW, I left Google at the end of last week - if it's not bouncing already, menage@google.com isn't going to work for much longer. I'll try to get a patch out to the maintainers files. Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc [not found] ` <20110815184957.GA16588-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2011-08-15 22:50 ` [PATCH][BUGFIX] cgroups: more safe tasklist locking " Frederic Weisbecker @ 2011-09-01 21:46 ` Ben Blum 1 sibling, 0 replies; 63+ messages in thread From: Ben Blum @ 2011-09-01 21:46 UTC (permalink / raw) To: Oleg Nesterov Cc: Ben Blum, Frederic Weisbecker, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, NeilBrown, Paul Menage, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Andrew Morton On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > - rcu_read_lock(); > > + read_lock(&tasklist_lock); > > if (!thread_group_leader(leader)) { > > Agreed, this should work. > > But can't we avoid the global list? thread_group_leader() or not, we do > not really care. We only need to ensure we can safely find all threads. > > How about the patch below? I was content with the tasklist_lock because cgroup_attach_proc is already a pretty heavyweight operation, and probably pretty rare that a user would want to do multiple of them at once quickly. I asked Andrew to take the simple tasklist_lock patch just now, since it does fix the bug at least. Anyway, looking at this, hmm. I am not sure if this protects adequately? In de_thread, the sighand lock is held only around the first half (around zap_other_threads), and not around the following section where leadership is transferred (esp. around the list_replace calls). tasklist_lock is held here, though, so it seems like the right lock to hold. > > > With or without this/your patch this leader can die right after we > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > suspicious. If a sub-thread execs, this task_struct has nothing to > do with the threadgroup. hmm. I thought I had this case covered, but it's been so long since I actually wrote the code that if I did I can't remember how. 
I think exiting is no issue since we hold a reference on the task_struct, but exec may still be a problem. I'm thinking: - cgroup_attach_proc drops the tasklist_lock - a sub-thread execs, and in exec_mmap (after de_thread) changes the mm - ss->attach, for example in memcg, wants to use leader->mm, which is now wrong this seems to be possible as the code currently is. I wonder if the best fix is just to have exec (maybe around de_thread) bounce off of or hold threadgroup_fork_read_lock somewhere? > > > > Also. This is off-topic, but... Why cgroup_attach_proc() and > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > in the different order? cgroup_attach_proc() looks wrong even > if currently doesn't matter. (already submitted a patch for this) Thanks, Ben > > > Oleg. > > --- x/kernel/cgroup.c > +++ x/kernel/cgroup.c > @@ -2000,6 +2000,7 @@ int cgroup_attach_proc(struct cgroup *cg > /* threadgroup list cursor and array */ > struct task_struct *tsk; > struct flex_array *group; > + unsigned long flags; > /* > * we need to make sure we have css_sets for all the tasks we're > * going to move -before- we actually start moving them, so that in > @@ -2027,19 +2028,10 @@ int cgroup_attach_proc(struct cgroup *cg > goto out_free_group_list; > > /* prevent changes to the threadgroup list while we take a snapshot. */ > - rcu_read_lock(); > - if (!thread_group_leader(leader)) { > - /* > - * a race with de_thread from another thread's exec() may strip > - * us of our leadership, making while_each_thread unsafe to use > - * on this task. if this happens, there is no choice but to > - * throw this task away and try again (from cgroup_procs_write); > - * this is "double-double-toil-and-trouble-check locking". > - */ > - rcu_read_unlock(); > - retval = -EAGAIN; > + retval = -EAGAIN; > + if (!lock_task_sighand(leader, &flags)) > goto out_free_group_list; > - } > + > /* take a reference on each task in the group to go in the array. 
*/ > tsk = leader; > i = 0; > @@ -2055,9 +2047,9 @@ int cgroup_attach_proc(struct cgroup *cg > BUG_ON(retval != 0); > i++; > } while_each_thread(leader, tsk); > + unlock_task_sighand(leader, &flags); > /* remember the number of threads in the array for later. */ > group_size = i; > - rcu_read_unlock(); > > /* > * step 1: check that we can legitimately attach to the cgroup. > > ^ permalink raw reply [flat|nested] 63+ messages in thread
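[A standalone userspace sketch of the exec race Ben describes above — a sub-thread execs, de_thread() transfers leadership, exec_mmap() installs a fresh mm, and the old leader is left with mm == NULL. The struct names here are simplified stand-ins, not the kernel's task_struct/mm_struct:]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the race: after a sub-thread exec()s, the old leader's
 * task_struct keeps mm == NULL, so an ->attach() path that blindly
 * dereferences leader->mm (as memcg effectively did) sees stale state. */
struct mm_struct { int id; };
struct task_struct {
	struct mm_struct *mm;
	struct task_struct *group_leader;
};

/* the naive attach path: trust the leader's mm */
struct mm_struct *naive_attach_mm(struct task_struct *leader)
{
	return leader->mm;
}

/* returns 1 if the naive path loses the group's mm after the exec */
int simulate_exec_race(void)
{
	struct mm_struct old_mm = { .id = 1 }, new_mm = { .id = 2 };
	struct task_struct leader = { .mm = &old_mm };
	struct task_struct thread = { .mm = &old_mm };

	leader.group_leader = &leader;
	thread.group_leader = &leader;

	/* sub-thread execs: leadership moves, old leader loses its mm */
	thread.group_leader = &thread;
	leader.group_leader = &thread;
	thread.mm = &new_mm;
	leader.mm = NULL;

	/* the caller still holds a reference to the *old* leader */
	return naive_attach_mm(&leader) == NULL && thread.mm == &new_mm;
}
```

[The reference on the old leader's task_struct keeps it alive, which is why exiting is not the issue — but the mm it points at is no longer the group's mm.]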
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc [not found] ` <20110901214643.GD10401-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-09-02 12:32 ` Oleg Nesterov 0 siblings, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-09-02 12:32 UTC (permalink / raw) To: Ben Blum Cc: NeilBrown, Frederic Weisbecker, Paul Menage, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On 09/01, Ben Blum wrote: > > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > not really care. We only need to ensure we can safely find all threads. > > > > How about the patch below? > > I was content with the tasklist_lock because cgroup_attach_proc is > already a pretty heavyweight operation, and probably pretty rare that a > user would want to do multiple of them at once quickly. Perhaps. But this is the global lock, no matter what. Why do you prefer to take it instead of per-process ->siglock? > I asked Andrew > to take the simple tasklist_lock patch just now, since it does fix the > bug at least. I disagree. > Anyway, looking at this, hmm. I am not sure if this protects adequately? > In de_thread, the sighand lock is held only around the first half > (around zap_other_threads), We only need this lock to find all threads, > and not around the following section where > leadership is transferred (esp. around the list_replace calls). > tasklist_lock is held here, though, so it seems like the right lock to > hold. This doesn't matter at all, I think. The code under while_each_thread() doesn't need the stable ->group_leader, and it can be changed right after you drop tasklist. list_replace() calls do not play with ->thread_group list. How can tasklist make any difference? 
> > With or without this/your patch this leader can die right after we > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > suspicious. If a sub-thread execs, this task_struct has nothing to > > do with the threadgroup. > > hmm. I thought I had this case covered, but it's been so long since I > actually wrote the code that if I did I can't remember how. This all doesn't look right... Hopefully addressed by Tejun patches. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
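[A minimal illustration, not kernel list_head code, of the while_each_thread() termination concern raised earlier in this thread: the walk `t = next_thread(t)` only stops when it returns to the starting task g, so if g is unlinked from the circular list mid-walk, the loop never terminates. The sketch bounds the walk to make that observable instead of hanging:]

```c
#include <assert.h>

/* Threads sit on a circular singly linked list; walk_group() counts the
 * members reachable from g, giving up after 'cap' hops. */
struct tsk { struct tsk *next; };

int walk_group(struct tsk *g, int cap)
{
	int n = 1;
	struct tsk *t = g;

	while ((t = t->next) != g) {
		if (++n > cap)
			return -1;	/* would spin forever in real code */
	}
	return n;
}

/* returns 1 if both the normal and the unlinked-start cases behave as
 * described above */
int demo(void)
{
	struct tsk a, b, c;
	int ok;

	a.next = &b; b.next = &c; c.next = &a;	/* circular a-b-c */
	ok = (walk_group(&a, 10) == 3);

	/* unlink a from the ring, but keep a pointing into it, then start
	 * the walk from the now-unlinked task */
	b.next = &c; c.next = &b;		/* ring is now b-c */
	a.next = &b;
	return ok && (walk_group(&a, 10) == -1);
}
```

[This is why the locking question is really "what keeps the starting task on the list for the duration of the walk", not "is it still the leader".]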
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc [not found] ` <20110902123251.GA26764-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2011-09-08 2:11 ` Ben Blum 0 siblings, 0 replies; 63+ messages in thread From: Ben Blum @ 2011-09-08 2:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Ben Blum, Frederic Weisbecker, Paul Menage, linux-kernel-u79uwXL29TY76Z2rM5mHXA, NeilBrown, Andrew Morton, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On Fri, Sep 02, 2011 at 02:32:51PM +0200, Oleg Nesterov wrote: > On 09/01, Ben Blum wrote: > > > > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > > not really care. We only need to ensure we can safely find all threads. > > > > > > How about the patch below? > > > > I was content with the tasklist_lock because cgroup_attach_proc is > > already a pretty heavyweight operation, and probably pretty rare that a > > user would want to do multiple of them at once quickly. > > Perhaps. But this is the global lock, no matter what. Why do you prefer > to take it instead of per-process ->siglock? Basically just because I wanted to find the simplest race-free implementation - at that point I was just shooting to have a complete set of patches, and didn't judge the global lock to be bad enough to think about replacing. I could yet be convinced that siglock is better. > > > I asked Andrew > > to take the simple tasklist_lock patch just now, since it does fix the > > bug at least. > > I disagree. > > > Anyway, looking at this, hmm. I am not sure if this protects adequately? > > In de_thread, the sighand lock is held only around the first half > > (around zap_other_threads), > > We only need this lock to find all threads, > > > and not around the following section where > > leadership is transferred (esp. around the list_replace calls). 
> > tasklist_lock is held here, though, so it seems like the right lock to > > hold. > > This doesn't matter at all, I think. The code under while_each_thread() > doesn't need the stable ->group_leader, and it can be changed right after > you drop tasklist. list_replace() calls do not play with ->thread_group > list. > > How can tasklist make any difference? Oops. I misread the list_replace calls as being on ->thread_group, instead of on ->tasks as they actually are. Hm, and I see __exit_signal takes the sighand lock around the call to __unhash_process. OK, I believe it will work. Oh, and I guess the sighand lock also subsumes the group_leader check that you were objecting to in the other thread. Sorry for the delay. Any way this change goes through, though, the commenting should be clear on how de_thread interacts with it. Something like: /* If the leader exits, its links on the thread_group list become * invalid. One way this can happen is if a sub-thread does exec() when * de_thread() calls release_task(leader) (and leader->sighand gets set * to NULL, in which case lock_task_sighand will fail). Since in that * case the threadgroup is still around, cgroup_procs_write should try * again (finding the new leader), which EAGAIN indicates here. This is * "double-double-toil-and-trouble-check locking". */ (I rather like the phrase "double-double-..." to describe the overall approach, and would prefer if it stayed in the comment.) > > > > With or without this/your patch this leader can die right after we > > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > > suspicious. If a sub-thread execs, this task_struct has nothing to > > > do with the threadgroup. > > > > hmm. I thought I had this case covered, but it's been so long since I > > actually wrote the code that if I did I can't remember how. > > This all doesn't look right... Hopefully addressed by Tejun patches. > Will look into those separately. Thanks, Ben > > Oleg. 
> > ^ permalink raw reply [flat|nested] 63+ messages in thread
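[A userspace sketch of the "double-double-toil-and-trouble-check" pattern from the proposed comment above: try to lock the leader's sighand; if the leader was already released (sighand == NULL), return -EAGAIN so the caller can re-resolve the new leader and retry. lock_task_sighand() and the task layout are simplified stand-ins, not the kernel API:]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sighand { int locked; };
struct task { struct sighand *sighand; };

/* stand-in for lock_task_sighand(): fails once release_task() has set
 * leader->sighand to NULL */
struct sighand *lock_task_sighand_sim(struct task *t)
{
	if (!t->sighand)
		return NULL;
	t->sighand->locked = 1;
	return t->sighand;
}

/* one attempt of the attach path; the cgroup_procs_write() side would
 * loop, re-finding the (possibly new) leader, while this returns -EAGAIN */
int attach_proc_once(struct task *leader)
{
	struct sighand *sh = lock_task_sighand_sim(leader);

	if (!sh)
		return -EAGAIN;
	/* ... snapshot the thread list under the lock here ... */
	sh->locked = 0;
	return 0;
}
```

[The point of the pattern: failure is not an error but a signal that the snapshot must be retaken from a fresh leader lookup.]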
* [PATCH 1/2] cgroups: use sighand lock instead of tasklist_lock in cgroup_attach_proc 2011-08-15 18:49 ` Oleg Nesterov ` (2 preceding siblings ...) 2011-09-01 21:46 ` Ben Blum @ 2011-10-14 0:31 ` Ben Blum 2011-10-14 12:15 ` Frederic Weisbecker 2011-10-14 0:36 ` [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) Ben Blum 4 siblings, 1 reply; 63+ messages in thread From: Ben Blum @ 2011-10-14 0:31 UTC (permalink / raw) To: Andrew Morton, Oleg Nesterov, linux-kernel Cc: Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Frederic Weisbecker, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki Use sighand lock instead of tasklist_lock in cgroup_attach_proc. From: Ben Blum <bblum@andrew.cmu.edu> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> --- kernel/cgroup.c | 27 +++++++++++++-------------- 1 files changed, 13 insertions(+), 14 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 893fc3d..32fb4c8 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2003,6 +2003,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) /* threadgroup list cursor and array */ struct task_struct *tsk; struct flex_array *group; + unsigned long flags; /* * we need to make sure we have css_sets for all the tasks we're * going to move -before- we actually start moving them, so that in @@ -2029,20 +2030,18 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) if (retval) goto out_free_group_list; - /* prevent changes to the threadgroup list while we take a snapshot. */ - read_lock(&tasklist_lock); - if (!thread_group_leader(leader)) { - /* - * a race with de_thread from another thread's exec() may strip - * us of our leadership, making while_each_thread unsafe to use - * on this task. if this happens, there is no choice but to - * throw this task away and try again (from cgroup_procs_write); - * this is "double-double-toil-and-trouble-check locking". 
- */ - read_unlock(&tasklist_lock); - retval = -EAGAIN; + /* prevent changes to the threadgroup list while we take a snapshot. + * If the leader exits, its links on the thread_group list become + * invalid. One way this can happen is if a sub-thread does exec() when + * de_thread() calls release_task(leader) (and leader->sighand gets set + * to NULL, in which case lock_task_sighand will fail). Since in that + * case the threadgroup is still around, cgroup_procs_write should try + * again (finding the new leader), which EAGAIN indicates here. This is + * "double-double-toil-and-trouble-check locking". */ + retval = -EAGAIN; + if (!lock_task_sighand(leader, &flags)) goto out_free_group_list; - } + /* take a reference on each task in the group to go in the array. */ tsk = leader; i = 0; @@ -2058,9 +2057,9 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) BUG_ON(retval != 0); i++; } while_each_thread(leader, tsk); + unlock_task_sighand(leader, &flags); /* remember the number of threads in the array for later. */ group_size = i; - read_unlock(&tasklist_lock); /* * step 1: check that we can legitimately attach to the cgroup. ^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [PATCH 1/2] cgroups: use sighand lock instead of tasklist_lock in cgroup_attach_proc 2011-10-14 0:31 ` [PATCH 1/2] cgroups: use sighand lock instead of tasklist_lock " Ben Blum @ 2011-10-14 12:15 ` Frederic Weisbecker 0 siblings, 0 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-10-14 12:15 UTC (permalink / raw) To: Ben Blum Cc: Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki Hi Ben, On Thu, Oct 13, 2011 at 08:31:15PM -0400, Ben Blum wrote: > Use sighand lock instead of tasklist_lock in cgroup_attach_proc. Could you please expand the changelog to explain the reason for the change? Well, the reason for converting tasklist_lock-based locking to something more granular is almost obvious these days, but it would be nice to know why it is safe to do so. And what exactly that protects. Thanks. ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-08-15 18:49 ` Oleg Nesterov ` (3 preceding siblings ...) 2011-10-14 0:31 ` [PATCH 1/2] cgroups: use sighand lock instead of tasklist_lock " Ben Blum @ 2011-10-14 0:36 ` Ben Blum 2011-10-14 12:21 ` Frederic Weisbecker 2011-10-19 5:43 ` Paul Menage 4 siblings, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-10-14 0:36 UTC (permalink / raw) To: Andrew Morton, Oleg Nesterov, linux-kernel Cc: Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Frederic Weisbecker, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki Convert ss->attach to take a flex_array of tasks instead of just the leader. From: Ben Blum <bblum@andrew.cmu.edu> This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and memcontrol) accurately find the group's mm even when a non-leader does exec and leaves the leader with a NULL mm pointer. Also converts cpuset and memcontrol to take the flex_array and iterate down it until an mm is found, instead of just attempting to use the leader's mm. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> --- Documentation/cgroups/cgroups.txt | 7 ++++++- include/linux/cgroup.h | 4 +++- kernel/cgroup.c | 16 +++++++++++++--- kernel/cpuset.c | 16 ++++++++++++++-- mm/memcontrol.c | 17 +++++++++++++++-- 5 files changed, 51 insertions(+), 9 deletions(-) diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 3fa646f..8e900ec 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -637,12 +637,17 @@ For any non-per-thread attachment work that needs to happen before attach_task. Needed by cpuset. 
void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, - struct cgroup *old_cgrp, struct task_struct *task) + struct cgroup *old_cgrp, struct flex_array *group, + int group_size) (cgroup_mutex held by caller) Called after the task has been attached to the cgroup, to allow any post-attachment activity that requires memory allocations or blocking. +The flex_array contains pointers to every task_struct being moved, so +that subsystems can, for example, iterate over a threadgroup's tasks to +find one with an mm that needs to be moved. + void attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk); (cgroup_mutex held by caller) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index ddc13eb..2f97a3b 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -25,6 +25,7 @@ struct cgroupfs_root; struct inode; struct cgroup; struct css_id; +struct flex_array; extern int cgroup_init_early(void); extern int cgroup_init(void); @@ -481,7 +482,8 @@ struct cgroup_subsys { void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk); void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, - struct cgroup *old_cgrp, struct task_struct *tsk); + struct cgroup *old_cgrp, struct flex_array *group, + int group_size); int (*fork)(struct cgroup_subsys *ss, struct task_struct *task); void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *task); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 32fb4c8..f5fc882 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1824,10 +1824,18 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) struct cgroup *oldcgrp; struct cgroupfs_root *root = cgrp->root; + /* Singleton array, for ss->attach (see cgroup_attach_proc). 
*/ + struct flex_array *group = flex_array_alloc(sizeof(tsk), 1, GFP_KERNEL); + if (!group) + return -ENOMEM; + retval = flex_array_put_ptr(group, 0, tsk, GFP_KERNEL); + if (retval < 0) + goto out_free_array; + /* Nothing to do if the task is already in that cgroup */ oldcgrp = task_cgroup_from_root(tsk, root); if (cgrp == oldcgrp) - return 0; + goto out_free_array; for_each_subsys(root, ss) { if (ss->can_attach) { @@ -1862,7 +1870,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) if (ss->attach_task) ss->attach_task(cgrp, oldcgrp, tsk); if (ss->attach) - ss->attach(ss, cgrp, oldcgrp, tsk); + ss->attach(ss, cgrp, oldcgrp, group, 1); } synchronize_rcu(); @@ -1890,6 +1898,8 @@ out: ss->cancel_attach(ss, cgrp, tsk); } } +out_free_array: + flex_array_free(group); return retval; } @@ -2164,7 +2174,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) */ for_each_subsys(root, ss) { if (ss->attach) - ss->attach(ss, cgrp, oldcgrp, leader); + ss->attach(ss, cgrp, oldcgrp, group, group_size); } /* diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 00b3430..fce7841 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -59,6 +59,7 @@ #include <linux/mutex.h> #include <linux/workqueue.h> #include <linux/cgroup.h> +#include <linux/flex_array.h> /* * Workqueue for cpuset related tasks. @@ -1440,11 +1441,13 @@ static void cpuset_attach_task(struct cgroup *cont, struct cgroup *old, } static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, - struct cgroup *oldcont, struct task_struct *tsk) + struct cgroup *oldcont, struct flex_array *group, + int group_size) { struct mm_struct *mm; struct cpuset *cs = cgroup_cs(cont); struct cpuset *oldcs = cgroup_cs(oldcont); + int i; /* * Change mm, possibly for multiple threads in a threadgroup. 
This is @@ -1452,7 +1455,16 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, */ cpuset_attach_nodemask_from = oldcs->mems_allowed; cpuset_attach_nodemask_to = cs->mems_allowed; - mm = get_task_mm(tsk); + /* + * Find the first task in the group that still has its mm. (This could + * not be the first one if another did exec() and the leader exited.) + */ + for (i = 0; i < group_size; i++) { + struct task_struct *tsk = flex_array_get_ptr(group, i); + mm = get_task_mm(tsk); + if (mm) + break; + } if (mm) { mpol_rebind_mm(mm, &cpuset_attach_nodemask_to); if (is_memory_migrate(cs)) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6aff93c..f951a9c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include <linux/page_cgroup.h> #include <linux/cpu.h> #include <linux/oom.h> +#include <linux/flex_array.h> #include "internal.h" #include <asm/uaccess.h> @@ -5455,9 +5456,21 @@ retry: static void mem_cgroup_move_task(struct cgroup_subsys *ss, struct cgroup *cont, struct cgroup *old_cont, - struct task_struct *p) + struct flex_array *group, int group_size) { - struct mm_struct *mm = get_task_mm(p); + struct mm_struct *mm; + int i; + + /* + * Find the first task in the group that still has its mm. (This could + * not be the first one if another did exec() and the leader exited.) + */ + for (i = 0; i < group_size; i++) { + struct task_struct *tsk = flex_array_get_ptr(group, i); + mm = get_task_mm(tsk); + if (mm) + break; + } if (mm) { if (mc.to) ^ permalink raw reply related [flat|nested] 63+ messages in thread
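[A plain-array sketch of the loop this patch adds to cpuset_attach() and mem_cgroup_move_task(): scan the threadgroup snapshot and use the first task that still has an mm, rather than assuming the leader's mm is valid. The structs are simplified stand-ins; note that, unlike the kernel hunk, this version initializes mm to NULL explicitly, so an empty group cannot leave it undefined:]

```c
#include <assert.h>
#include <stddef.h>

struct mm_struct { int id; };
struct task_struct { struct mm_struct *mm; };

/* return the first non-NULL mm in the snapshot, or NULL if none */
struct mm_struct *first_group_mm(struct task_struct **group, int group_size)
{
	struct mm_struct *mm = NULL;
	int i;

	for (i = 0; i < group_size; i++) {
		mm = group[i]->mm;	/* get_task_mm() stand-in */
		if (mm)
			break;
	}
	return mm;
}
```

[With a post-exec leader (mm == NULL) first in the array and a live thread second, the scan returns the live thread's mm — exactly the case the changelog describes.]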
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 0:36 ` [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) Ben Blum @ 2011-10-14 12:21 ` Frederic Weisbecker 2011-10-14 13:53 ` Ben Blum 2011-10-19 5:43 ` Paul Menage 1 sibling, 1 reply; 63+ messages in thread From: Frederic Weisbecker @ 2011-10-14 12:21 UTC (permalink / raw) To: Ben Blum, Tejun Heo Cc: Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > From: Ben Blum <bblum@andrew.cmu.edu> > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > memcontrol) to accurately find the group's mm even when a non-leader does exec > and leaves the leader with a NULL mm pointer. > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > until an mm is found, instead of just attempting to use the leader's mm. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> I think there are patches from Tejun that already did that? https://lkml.org/lkml/2011/8/23/418 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 12:21 ` Frederic Weisbecker @ 2011-10-14 13:53 ` Ben Blum 2011-10-14 13:54 ` Ben Blum 2011-10-14 15:21 ` Frederic Weisbecker 0 siblings, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-10-14 13:53 UTC (permalink / raw) To: Frederic Weisbecker Cc: Ben Blum, Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > and leaves the leader with a NULL mm pointer. > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > until an mm is found, instead of just attempting to use the leader's mm. > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > I think there are patches from Tejun that handle that did that already? > > https://lkml.org/lkml/2011/8/23/418 > Yeah, I'm not hoping to preempt them or anything; I'm just presenting this alternate approach, since (1) this way addresses the ss->attach() problem directly without needing to add locking anywhere else, while the focus of the discussion around Tejun's patches seems to have moved to more global concerns, and (2) these patches and his patches ought to be compatible with each other. Ben ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 13:53 ` Ben Blum @ 2011-10-14 13:54 ` Ben Blum 2011-10-14 15:22 ` Frederic Weisbecker 2011-10-14 15:21 ` Frederic Weisbecker 1 sibling, 1 reply; 63+ messages in thread From: Ben Blum @ 2011-10-14 13:54 UTC (permalink / raw) To: Ben Blum Cc: Frederic Weisbecker, Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 09:53:23AM -0400, Ben Blum wrote: > On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > > and leaves the leader with a NULL mm pointer. > > > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > > until an mm is found, instead of just attempting to use the leader's mm. > > > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > > > I think there are patches from Tejun that handle that did that already? > > > > https://lkml.org/lkml/2011/8/23/418 > > > > Yeah, I'm not hoping to preempt them or anything; I'm just presenting > this alternate approach, since (1) this way addresses the ss->attach() > problem directly without needing to add locking anywhere else, while the > focus of the discussion around Tejun's patches seems to have moved to > more global concerns, and (2) these patches and his patches ought to be > compatible with each other. though, I did wonder if, given this approach, I shouldn't just get rid of attach_task and can_attach_task and use the flex_array for all of the calls. 
would that be nicer? > > Ben > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 13:54 ` Ben Blum @ 2011-10-14 15:22 ` Frederic Weisbecker 2011-10-17 19:11 ` Ben Blum 0 siblings, 1 reply; 63+ messages in thread From: Frederic Weisbecker @ 2011-10-14 15:22 UTC (permalink / raw) To: Ben Blum Cc: Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 09:54:57AM -0400, Ben Blum wrote: > On Fri, Oct 14, 2011 at 09:53:23AM -0400, Ben Blum wrote: > > On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > > > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > > > and leaves the leader with a NULL mm pointer. > > > > > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > > > until an mm is found, instead of just attempting to use the leader's mm. > > > > > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > > > > > I think there are patches from Tejun that handle that did that already? > > > > > > https://lkml.org/lkml/2011/8/23/418 > > > > > > > Yeah, I'm not hoping to preempt them or anything; I'm just presenting > > this alternate approach, since (1) this way addresses the ss->attach() > > problem directly without needing to add locking anywhere else, while the > > focus of the discussion around Tejun's patches seems to have moved to > > more global concerns, and (2) these patches and his patches ought to be > > compatible with each other. 
> > though, I did wonder if, given this approach, I shouldn't just get rid > of attach_task and can_attach_task and use the flex_array for all of > the calls. would that be nicer? That's what Tejun's patches do. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 15:22 ` Frederic Weisbecker @ 2011-10-17 19:11 ` Ben Blum 0 siblings, 0 replies; 63+ messages in thread From: Ben Blum @ 2011-10-17 19:11 UTC (permalink / raw) To: Frederic Weisbecker Cc: Ben Blum, Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 05:22:20PM +0200, Frederic Weisbecker wrote: > On Fri, Oct 14, 2011 at 09:54:57AM -0400, Ben Blum wrote: > > On Fri, Oct 14, 2011 at 09:53:23AM -0400, Ben Blum wrote: > > > On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > > > > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > > > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > > > > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > > > > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > > > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > > > > and leaves the leader with a NULL mm pointer. > > > > > > > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > > > > until an mm is found, instead of just attempting to use the leader's mm. > > > > > > > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > > > > > > > I think there are patches from Tejun that handle that did that already? 
> > > > > > > > https://lkml.org/lkml/2011/8/23/418 > > > > > > > > > > Yeah, I'm not hoping to preempt them or anything; I'm just presenting > > > this alternate approach, since (1) this way addresses the ss->attach() > > > problem directly without needing to add locking anywhere else, while the > > > focus of the discussion around Tejun's patches seems to have moved to > > > more global concerns, and (2) these patches and his patches ought to be > > > compatible with each other. > > > > though, I did wonder if, given this approach, I shouldn't just get rid > > of attach_task and can_attach_task and use the flex_array for all of > > the calls. would that be nicer? > > That's what Tejun's patches do. > Oh, I must have missed that bit. In which case, sorry for the noise, and I support Tejun's change. -- Ben ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 13:53 ` Ben Blum 2011-10-14 13:54 ` Ben Blum @ 2011-10-14 15:21 ` Frederic Weisbecker 1 sibling, 0 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-10-14 15:21 UTC (permalink / raw) To: Ben Blum Cc: Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 09:53:23AM -0400, Ben Blum wrote: > On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > > and leaves the leader with a NULL mm pointer. > > > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > > until an mm is found, instead of just attempting to use the leader's mm. > > > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > > > I think there are patches from Tejun that handle that did that already? > > > > https://lkml.org/lkml/2011/8/23/418 > > > > Yeah, I'm not hoping to preempt them or anything; I'm just presenting > this alternate approach, since (1) this way addresses the ss->attach() > problem directly without needing to add locking anywhere else You basically have the same approach, the locking involved in Tejun's patches seem to address a different issue that you would need to address as well anyway. > , while the > focus of the discussion around Tejun's patches seems to have moved to > more global concerns, and (2) these patches and his patches ought to be > compatible with each other. 
But Tejun's patches seem to address not only your issue but even more. I don't understand the point of this lighter version. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 0:36 ` [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) Ben Blum 2011-10-14 12:21 ` Frederic Weisbecker @ 2011-10-19 5:43 ` Paul Menage 1 sibling, 0 replies; 63+ messages in thread From: Paul Menage @ 2011-10-19 5:43 UTC (permalink / raw) To: Ben Blum Cc: Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Li Zefan, containers, Frederic Weisbecker, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Thu, Oct 13, 2011 at 5:36 PM, Ben Blum <bblum@andrew.cmu.edu> wrote: > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > From: Ben Blum <bblum@andrew.cmu.edu> > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > memcontrol) to accurately find the group's mm even when a non-leader does exec > and leaves the leader with a NULL mm pointer. > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > until an mm is found, instead of just attempting to use the leader's mm. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Reviewed-by: Paul Menage <paul@paulmenage.org> Thanks, Paul > --- > Documentation/cgroups/cgroups.txt | 7 ++++++- > include/linux/cgroup.h | 4 +++- > kernel/cgroup.c | 16 +++++++++++++--- > kernel/cpuset.c | 16 ++++++++++++++-- > mm/memcontrol.c | 17 +++++++++++++++-- > 5 files changed, 51 insertions(+), 9 deletions(-) > > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt > index 3fa646f..8e900ec 100644 > --- a/Documentation/cgroups/cgroups.txt > +++ b/Documentation/cgroups/cgroups.txt > @@ -637,12 +637,17 @@ For any non-per-thread attachment work that needs to happen before > attach_task. Needed by cpuset. 
> > void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, > - struct cgroup *old_cgrp, struct task_struct *task) > + struct cgroup *old_cgrp, struct flex_array *group, > + int group_size) Maybe flex_array should have a size field, so that you can just pass a single pointer here? > @@ -1452,7 +1455,16 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, > */ > cpuset_attach_nodemask_from = oldcs->mems_allowed; > cpuset_attach_nodemask_to = cs->mems_allowed; > - mm = get_task_mm(tsk); > + /* > + * Find the first task in the group that still has its mm. (This could > + * be not the first one if another did exec() and the leader exited. Maybe rephrase this as "This might not be the first one"? Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
[parent not found: <20110728062616.GC15204-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org>]
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110728062616.GC15204-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-07-28 7:13 ` NeilBrown 0 siblings, 0 replies; 63+ messages in thread From: NeilBrown @ 2011-07-28 7:13 UTC (permalink / raw) To: Ben Blum Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Oleg Nesterov On Thu, 28 Jul 2011 02:26:16 -0400 Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote: > On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > > <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > > > [ . . . ] > > > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > > > tsk->group_leader = tsk; > > > > > leader->group_leader = tsk; > > > > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > > be held in the cgroup code. > > > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > > this chunk is run with the same value for 'leader', but the > > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > > forever. rcu_read_lock doesn't prevent this from happening. 
> > > > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > > writing) provided exclusion from code that holds rcu_read_lock - > > > > probably because there are other points in the kernel which do > > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > > > This one looks OK to me. The code is just referencing fields in each > > > of the task structures, and appears to be making proper use of > > > rcu_dereference(). All this code requires is that the task structures > > > remain in existence through the full lifetime of the RCU read-side > > > critical section, which is guaranteed because of the way the task_struct > > > is freed. > > > > I disagree. It also requires - by virtue of the use of while_each_thread() - > > that 'g' remains on the list that 't' is walking along. > > > > Now for a normal list, the head always stays on the list and is accessible > > even from an rcu-removed entry. But the thread_group list isn't a normal > > list. It doesn't have a distinct head. It is a loop of all of the > > 'task_structs' in a thread group. One of them is designated the 'leader' but > > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > > > __unhash_process in mm/exit.c looks like it could remove the leader from the > > list and definitely could remove a non-leader. > > > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > > task_struct that was the leader could become a non-leader and then be removed > > from the list that kernel/hung_task could be walking along. > > That agrees with my understanding. > > > > > So I don't think that while_each_thread() is currently safe. It depends on > > the thread leader not disappearing and I think it can. > > I think that while_each_thread is perfectly safe, it just needs to be > protected properly while used. 
it reads the tasklist, and both competing > paths (__unhash_process and de_thread) are done with tasklist_lock write > locked, so read-locking ought to suffice. all it needs is to be better > documented. That might be one answer. However the fact that rcu primitives are used to walk the list (next_thread() contains list_entry_rcu) seems to suggest that the intention was that RCU could be used for read access - which in large part it can. So we have two alternatives: 1/ take a reader spinlock to walk the thread_group list 2/ make RCU work safely for walking thread_group I thought it was best to try the latter first. > > > [...] > > > > +/* Thread group leader can change, so stop loop when we see one > > + * even if it isn't 'g' */ > > #define while_each_thread(g, t) \ > > - while ((t = next_thread(t)) != g) > > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) > > this is semantically wrong: it will stop as soon as it finds a thread > that has newly become the leader, and not run the loop body code in that > thread's case. so the thread that just execed would not get run on, and > in the case of my code, would "escape" the cgroup migration. Yes, you are right ... For your case I think you do want the thread group list to be both safe and stable. RCU can only provide 'safe' for a list. So I think I'm convinced that you need a readlock on tasklist_lock. That doesn't mean that all users of while_each_thread do. > > but I argue it is also organisationally wrong. while_each_thread's > purpose is just to worry about the structure of the process list, not to > account for behavioural details of de_thread. this check belongs outside > of the macro, and it should be protected by tasklist_lock in the same > critical section in which while_each_thread is used. Sound fair. Maybe we need two versions of while_each_thread(). One that assumes tasklist_lock and guarantees getting every thread, and one that is less precise but only needs rcu_read_lock(). 
That would require an audit of each call site to see what is really needed. But yes: if you replace your rcu_read_lock()s with read_lock(&tasklist_lock) then you will be safe, and I don't think you can do any better (i.e. less locking) than that. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-28 1:08 ` NeilBrown 2011-07-28 6:26 ` Ben Blum @ 2011-07-28 12:17 ` Paul E. McKenney [not found] ` <20110728121741.GB2427-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 2011-08-14 17:51 ` Oleg Nesterov [not found] ` <20110728110813.7ff84b13-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-08-14 17:45 ` Oleg Nesterov 3 siblings, 2 replies; 63+ messages in thread From: Paul E. McKenney @ 2011-07-28 12:17 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > <paulmck@linux.vnet.ibm.com> wrote: > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > [ . . . ] > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > tsk->group_leader = tsk; > > > > leader->group_leader = tsk; > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > be held in the cgroup code. > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > this chunk is run with the same value for 'leader', but the > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > forever. rcu_read_lock doesn't prevent this from happening. 
> > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > writing) provided exclusion from code that holds rcu_read_lock - > > > probably because there are other points in the kernel which do > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > This one looks OK to me. The code is just referencing fields in each > > of the task structures, and appears to be making proper use of > > rcu_dereference(). All this code requires is that the task structures > > remain in existence through the full lifetime of the RCU read-side > > critical section, which is guaranteed because of the way the task_struct > > is freed. > > I disagree. It also requires - by virtue of the use of while_each_thread() - > that 'g' remains on the list that 't' is walking along. Doesn't the following code in the loop body deal with this possibilty? /* Exit if t or g was unhashed during refresh. */ if (t->state == TASK_DEAD || g->state == TASK_DEAD) goto unlock; Yes, a concurrent dethread could cause some of the tasks to be skipped, but there really is a hung thread, it will still be there to be caught next time, right? Thanx, Paul > Now for a normal list, the head always stays on the list and is accessible > even from an rcu-removed entry. But the thread_group list isn't a normal > list. It doesn't have a distinct head. It is a loop of all of the > 'task_structs' in a thread group. One of them is designated the 'leader' but > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > __unhash_process in mm/exit.c looks like it could remove the leader from the > list and definitely could remove a non-leader. > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > task_struct that was the leader could become a non-leader and then be removed > from the list that kernel/hung_task could be walking along. 
> > So I don't think that while_each_thread() is currently safe. It depends on > the thread leader not disappearing and I think it can. > > So I'm imagining a patch like this to ensure that while_each_thread() is > actually safe. If it is always safe you can remove that extra check in > cgroup_attach_proc() which looked wrong. > > I just hope someone who understands the process tree is listening.. > The change in exit.c is the most uncertain part. > > NeilBrown > > diff --git a/fs/exec.c b/fs/exec.c > index 6075a1e..c9ea5f0 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -960,6 +960,9 @@ static int de_thread(struct task_struct *tsk) > list_replace_init(&leader->sibling, &tsk->sibling); > > tsk->group_leader = tsk; > + smp_mb(); /* ensure that any reader will always be able to see > + * a task that claims to be the group leader > + */ > leader->group_leader = tsk; > > tsk->exit_signal = SIGCHLD; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 14a6c7b..13e0192 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2267,8 +2267,10 @@ extern bool current_is_single_threaded(void); > #define do_each_thread(g, t) \ > for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do > > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) > > static inline int get_nr_threads(struct task_struct *tsk) > { > diff --git a/kernel/exit.c b/kernel/exit.c > index f2b321b..d6cef25 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. 
> + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); > } > > /* > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-28 12:17 ` Paul E. McKenney [not found] ` <20110728121741.GB2427-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2011-08-14 17:51 ` Oleg Nesterov [not found] ` <20110814175119.GC2381-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (2 more replies) 1 sibling, 3 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-08-14 17:51 UTC (permalink / raw) To: Paul E. McKenney Cc: NeilBrown, Ben Blum, Paul Menage, Li Zefan, containers, linux-kernel On 07/28, Paul E. McKenney wrote: > > On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > > > > I disagree. It also requires - by virtue of the use of while_each_thread() - > > that 'g' remains on the list that 't' is walking along. > > Doesn't the following code in the loop body deal with this possibilty? > > /* Exit if t or g was unhashed during refresh. */ > if (t->state == TASK_DEAD || g->state == TASK_DEAD) > goto unlock; This code is completely wrong even if while_each_thread() was fine. I sent the patch but it was ignored. [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() http://marc.info/?l=linux-kernel&m=127688790019041 Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-08-14 17:51 ` Oleg Nesterov [not found] ` <20110814175119.GC2381-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2011-08-14 23:58 ` NeilBrown 2011-08-15 18:01 ` Paul E. McKenney 2 siblings, 0 replies; 63+ messages in thread From: NeilBrown @ 2011-08-14 23:58 UTC (permalink / raw) To: Oleg Nesterov Cc: Paul E. McKenney, Ben Blum, Paul Menage, Li Zefan, containers, linux-kernel On Sun, 14 Aug 2011 19:51:19 +0200 Oleg Nesterov <oleg@redhat.com> wrote: > On 07/28, Paul E. McKenney wrote: > > > > On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > > > > > > I disagree. It also requires - by virtue of the use of while_each_thread() - > > > that 'g' remains on the list that 't' is walking along. > > > > Doesn't the following code in the loop body deal with this possibilty? > > > > /* Exit if t or g was unhashed during refresh. */ > > if (t->state == TASK_DEAD || g->state == TASK_DEAD) > > goto unlock; > > This code is completely wrong even if while_each_thread() was fine. > > I sent the patch but it was ignored. > > [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() > http://marc.info/?l=linux-kernel&m=127688790019041 > > Oleg. I agree with that patch. RCU only protects a task_struct until release_task() is called (which removes it from the task list). So holding rcu_lock doesn't stop put_task_struct from freeing the memory unless we *know* that release_task hasn't been called. This is exactly what pid_alive() tests. I must say that handling of task_struct seems to violate the law of least surprise a little too often for my taste. Maybe it is just a difficult problem and it needs a complex solution - but it would be really nice if it were a bit simpler :-( NeilBrown ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110728110813.7ff84b13-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2011-07-28 6:26 ` Ben Blum 2011-07-28 12:17 ` Paul E. McKenney 2011-08-14 17:45 ` Oleg Nesterov 2 siblings, 0 replies; 63+ messages in thread From: Ben Blum @ 2011-07-28 6:26 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Oleg Nesterov On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > [ . . . ] > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > tsk->group_leader = tsk; > > > > leader->group_leader = tsk; > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > be held in the cgroup code. > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > this chunk is run with the same value for 'leader', but the > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > forever. rcu_read_lock doesn't prevent this from happening. 
> > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > writing) provided exclusion from code that holds rcu_read_lock - > > > probably because there are other points in the kernel which do > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > This one looks OK to me. The code is just referencing fields in each > > of the task structures, and appears to be making proper use of > > rcu_dereference(). All this code requires is that the task structures > > remain in existence through the full lifetime of the RCU read-side > > critical section, which is guaranteed because of the way the task_struct > > is freed. > > I disagree. It also requires - by virtue of the use of while_each_thread() - > that 'g' remains on the list that 't' is walking along. > > Now for a normal list, the head always stays on the list and is accessible > even from an rcu-removed entry. But the thread_group list isn't a normal > list. It doesn't have a distinct head. It is a loop of all of the > 'task_structs' in a thread group. One of them is designated the 'leader' but > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > __unhash_process in mm/exit.c looks like it could remove the leader from the > list and definitely could remove a non-leader. > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > task_struct that was the leader could become a non-leader and then be removed > from the list that kernel/hung_task could be walking along. That agrees with my understanding. > > So I don't think that while_each_thread() is currently safe. It depends on > the thread leader not disappearing and I think it can. I think that while_each_thread is perfectly safe, it just needs to be protected properly while used. 
it reads the tasklist, and both competing paths (__unhash_process and de_thread) are done with tasklist_lock write locked, so read-locking ought to suffice. all it needs is to be better documented. > [...] > > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) this is semantically wrong: it will stop as soon as it finds a thread that has newly become the leader, and not run the loop body code in that thread's case. so the thread that just execed would not get run on, and in the case of my code, would "escape" the cgroup migration. but I argue it is also organisationally wrong. while_each_thread's purpose is just to worry about the structure of the process list, not to account for behavioural details of de_thread. this check belongs outside of the macro, and it should be protected by tasklist_lock in the same critical section in which while_each_thread is used. -- Ben > > static inline int get_nr_threads(struct task_struct *tsk) > { > diff --git a/kernel/exit.c b/kernel/exit.c > index f2b321b..d6cef25 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. > + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); > } > > /* > > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110728110813.7ff84b13-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-07-28 6:26 ` Ben Blum @ 2011-07-28 12:17 ` Paul E. McKenney 2011-08-14 17:45 ` Oleg Nesterov 2 siblings, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2011-07-28 12:17 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, Oleg Nesterov On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > [ . . . ] > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > tsk->group_leader = tsk; > > > > leader->group_leader = tsk; > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > be held in the cgroup code. > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > this chunk is run with the same value for 'leader', but the > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > forever. rcu_read_lock doesn't prevent this from happening. > > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > writing) provided exclusion from code that holds rcu_read_lock - > > > probably because there are other points in the kernel which do > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > This one looks OK to me. 
The code is just referencing fields in each > > of the task structures, and appears to be making proper use of > > rcu_dereference(). All this code requires is that the task structures > > remain in existence through the full lifetime of the RCU read-side > > critical section, which is guaranteed because of the way the task_struct > > is freed. > > I disagree. It also requires - by virtue of the use of while_each_thread() - > that 'g' remains on the list that 't' is walking along. Doesn't the following code in the loop body deal with this possibility? /* Exit if t or g was unhashed during refresh. */ if (t->state == TASK_DEAD || g->state == TASK_DEAD) goto unlock; Yes, a concurrent dethread could cause some of the tasks to be skipped, but if there really is a hung thread, it will still be there to be caught next time, right? Thanx, Paul > Now for a normal list, the head always stays on the list and is accessible > even from an rcu-removed entry. But the thread_group list isn't a normal > list. It doesn't have a distinct head. It is a loop of all of the > 'task_structs' in a thread group. One of them is designated the 'leader' but > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > __unhash_process in mm/exit.c looks like it could remove the leader from the > list and definitely could remove a non-leader. > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > task_struct that was the leader could become a non-leader and then be removed > from the list that kernel/hung_task could be walking along. > > So I don't think that while_each_thread() is currently safe. It depends on > the thread leader not disappearing and I think it can. > > So I'm imagining a patch like this to ensure that while_each_thread() is > actually safe. If it is always safe you can remove that extra check in > cgroup_attach_proc() which looked wrong. > > I just hope someone who understands the process tree is listening..
> The change in exit.c is the most uncertain part. > > NeilBrown > > diff --git a/fs/exec.c b/fs/exec.c > index 6075a1e..c9ea5f0 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -960,6 +960,9 @@ static int de_thread(struct task_struct *tsk) > list_replace_init(&leader->sibling, &tsk->sibling); > > tsk->group_leader = tsk; > + smp_mb(); /* ensure that any reader will always be able to see > + * a task that claims to be the group leader > + */ > leader->group_leader = tsk; > > tsk->exit_signal = SIGCHLD; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 14a6c7b..13e0192 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2267,8 +2267,10 @@ extern bool current_is_single_threaded(void); > #define do_each_thread(g, t) \ > for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do > > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) > > static inline int get_nr_threads(struct task_struct *tsk) > { > diff --git a/kernel/exit.c b/kernel/exit.c > index f2b321b..d6cef25 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. > + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); > } > > /* > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110728110813.7ff84b13-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-07-28 6:26 ` Ben Blum 2011-07-28 12:17 ` Paul E. McKenney @ 2011-08-14 17:45 ` Oleg Nesterov 2 siblings, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-08-14 17:45 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 On 07/28, NeilBrown wrote: > > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) This assumes that while_each_thread(g, t) always uses g == group_leader. > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. > + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); Hmm... this looks obviously wrong. This leaves the dead thread on the list. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110727150710.GB5242-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-07-27 23:42 ` Paul E. McKenney 0 siblings, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2011-07-27 23:42 UTC (permalink / raw) To: Ben Blum Cc: NeilBrown, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, Oleg Nesterov On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: [ . . . ] > > The race as I understand it is with this code: > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > tsk->group_leader = tsk; > > leader->group_leader = tsk; > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > be held in the cgroup code. > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > this chunk is run with the same value for 'leader', but the > > while_each_thread is run after, then the while_read_thread() might loop > > forever. rcu_read_lock doesn't prevent this from happening. > > Somehow I was under the impression that holding tasklist_lock (for > writing) provided exclusion from code that holds rcu_read_lock - > probably because there are other points in the kernel which do > while_each_thread with only RCU-read held (and not tasklist): > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() This one looks OK to me. The code is just referencing fields in each of the task structures, and appears to be making proper use of rcu_dereference(). All this code requires is that the task structures remain in existence through the full lifetime of the RCU read-side critical section, which is guaranteed because of the way the task_struct is freed. 
> - kernel/posix-cpu-timers.c, thread_group_cputime() Same story here. The code only needs the task_struct structures to stick around long enough to compute the CPU time consumed, so this should be OK. > - fs/ioprio.c, ioprio_set() and ioprio_get() Here RCU protects traversal to the task structure as in the two examples above. In the ioprio_set() case, the actual change is carried out under task_lock(). So this code relies on RCU to safely locate the task structure, and on locking to safely carry out the update. The ioprio_get() is using RCU to protect finding the task structure. This returns an integer (the I/O priority), so no locking is required. Though if compilers continue becoming increasingly aggressive with their optimizations, there might need to be an ACCESS_ONCE() in get_task_ioprio(). > (There are also places, like kernel/signal.c, where code does > while_each_thread with only sighand->siglock held. this also seems > sketchy, since de_thread only takes that lock after the code quoted > above. there's a big comment in fs/exec.c where this is also done, but I > don't quite understand it.) > > You seem to imply that rcu_read_lock() doesn't exclude against > write_lock(&tasklist_lock). If that's true, then we can fix the cgroup > code simply by replacing rcu_read_lock/rcu_read_unlock with > read_lock and read_unlocck on tasklist_lock. (I can hurry a bugfix patch > for this together if so.) Indeed, rcu_read_lock() does not exclude write_lock(&tasklist_lock). In fact, it doesn't exclude against -any- lock. But don't take my word for it, instead take a look at the heaviest-weight implementation of rcu_read_lock() that currently exists in mainline: void __rcu_read_lock(void) { current->rcu_read_lock_nesting++; barrier(); } That really is all there is to it, at least in the absence of CONFIG_PROVE_RCU. 
The first line does nothing more than increment a counter in the current task's task_struct, while the second line does nothing more than suppress code-motion optimizations that the compiler might otherwise be tempted to undertake. The purpose of rcu_read_lock() is instead to prevent any subsequent synchronize_rcu() calls (on any CPU or from any task) from returning until this task executes the matching rcu_read_unlock(). If you make careful use of rcu_read_lock(), synchronize_rcu(), and friends, you can get an effect that in some ways resembles mutual exclusion, but that is much faster and more scalable. In the case of the task structure, rcu_read_lock() guarantees that any task_struct that you can get a pointer to will remain in existence until you execute the matching rcu_read_unlock(). By "remain in existence", I mean that the task_struct won't be freed (and possibly reused). So rcu_read_lock() is giving you the following guarantee: If you legitimately pick up a pointer to an RCU-protected data structure, then that data structure will not be freed until after you execute the matching rcu_read_unlock(). > Wouldn't this mean that the three places listed above are also wrong? I do not believe so, give or take the ACCESS_ONCE(). The issue in your case is that you are relying not just on the process to exist, but also on the state of group leadership. RCU protects the former, but not the latter. I suspect that your suggested bugfix patch would do the trick, but I must defer to people who understand tasklist locking better than I do. Thanx, Paul > > The code in de_thread() is actually questionable by itself. > > "list_replace_rcu" cannot really be used on the head of a list - it is only > > meant to be used on a member of a list. > > To move a list from one head to another you should be using > > list_splice_init_rcu().
> > The ->tasks list doesn't seem to have a clearly distinguished 'head' but > > whatever is passed as 'g' to while_each_thread() is effectively a head and > > removing it from a list can cause a loop using while_each_thread() can not > > find the head and so never complete. > > > > I' not sure how best to fix this, though possibly changing > > while_each_thead to: > > > > while ((t = next_task(t)) != g && !thread_group_leader(t)) > > > > might be part of it. We would also need to move > > tsk->group_leader = tsk; > > in the above up to the top, and probably add some memory barrier. > > However I don't know enough about how the list is used to be sure. > > > > Comments? > > > > Thanks, > > NeilBrown > > > > > > I barely understand de_thread() from the reader's perspective, let alone > from the author's perspective, so I can't speak for that one. > > Thanks for pointing this out! > > -- Ben ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110727171101.5e32d8eb-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-07-27 15:07 ` Ben Blum @ 2011-08-14 17:40 ` Oleg Nesterov 1 sibling, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-08-14 17:40 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, Paul E. McKenney Sorry for the delay, just noticed this thread... On 07/27, NeilBrown wrote: > > The race as I understand it is with this code: > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > list_replace_init(&leader->sibling, &tsk->sibling); > > tsk->group_leader = tsk; > leader->group_leader = tsk; > > > which seems to be called with only tasklist_lock held, which doesn't seem to > be held in the cgroup code. > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > this chunk is run with the same value for 'leader', but the > while_each_thread is run after, then the while_read_thread() might loop > forever. rcu_read_lock doesn't prevent this from happening. Yes. This was already discussed. See http://marc.info/?t=127688987300002 Damn. I forgot about this completely. > The code in de_thread() is actually questionable by itself. > "list_replace_rcu" cannot really be used on the head of a list - it is only > meant to be used on a member of a list. > To move a list from one head to another you should be using > list_splice_init_rcu(). Hmm... can't understand this part. And just in case... list_replace_rcu() looks fine afaics. The real problem is release_task(old_leader) which does list_del_rcu(old_leader->thread_group); this is what breaks while_each_thread(). > The ->tasks list doesn't seem to have a clearly distinguished 'head' Exactly. This is the problem. But: you seem to have confused ->tasks and ->thread_group ;) Oleg.
^ permalink raw reply [flat|nested] 63+ messages in thread