* Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-07-27  7:11 NeilBrown  (2 replies; 63+ messages in thread)
From: NeilBrown
To: Paul Menage, Ben Blum, Li Zefan, Oleg Nesterov
Cc: containers, Paul E. McKenney, linux-kernel

Hi,
 I've been exploring the use of RCU in the kernel, particularly looking for
things that don't quite look right.  I found cgroup_attach_proc, which was
added a few months ago.

It contains:

	rcu_read_lock();
	if (!thread_group_leader(leader)) {
		/*
		 * a race with de_thread from another thread's exec() may strip
		 * us of our leadership, making while_each_thread unsafe to use
		 * on this task. if this happens, there is no choice but to
		 * throw this task away and try again (from cgroup_procs_write);
		 * this is "double-double-toil-and-trouble-check locking".
		 */
		rcu_read_unlock();
		retval = -EAGAIN;
		goto out_free_group_list;
	}

(and having the comment helps a lot!)

The comment acknowledges a race with de_thread but seems to assume that
rcu_read_lock() will protect against that race.  It won't.
It could possibly protect if the racy code in de_thread() contained a call
to synchronize_rcu(), but it doesn't, so there is no obvious exclusion
between the two.
I note that some other locks are held and maybe some other lock provides
the required exclusion - I haven't explored that too deeply - but if that is
the case, then the use of rcu_read_lock() here is pointless - it isn't
needed just to call thread_group_leader().

The race as I understand it is with this code:

	list_replace_rcu(&leader->tasks, &tsk->tasks);
	list_replace_init(&leader->sibling, &tsk->sibling);

	tsk->group_leader = tsk;
	leader->group_leader = tsk;

which seems to be called with only tasklist_lock held, and that lock doesn't
seem to be held in the cgroup code.

If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before
this chunk is run with the same value for 'leader', but the
while_each_thread is run after, then the while_each_thread() might loop
forever.  rcu_read_lock doesn't prevent this from happening.

The code in de_thread() is actually questionable by itself.
"list_replace_rcu" cannot really be used on the head of a list - it is only
meant to be used on a member of a list.
To move a list from one head to another you should be using
list_splice_init_rcu().
The ->tasks list doesn't seem to have a clearly distinguished 'head', but
whatever is passed as 'g' to while_each_thread() is effectively a head, and
removing it from the list can cause a loop using while_each_thread() to
never find the head and so never complete.

I'm not sure how best to fix this, though possibly changing
while_each_thread to:

	while ((t = next_task(t)) != g && !thread_group_leader(t))

might be part of it.  We would also need to move
	tsk->group_leader = tsk;
in the above up to the top, and probably add some memory barrier.
However I don't know enough about how the list is used to be sure.

Comments?

Thanks,
NeilBrown
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-08-14 17:40 Oleg Nesterov
From: Oleg Nesterov
To: NeilBrown
Cc: Paul Menage, Ben Blum, Li Zefan, containers, Paul E. McKenney, linux-kernel

Sorry for delay, just noticed this thread...

On 07/27, NeilBrown wrote:
>
> The race as I understand it is with this code:
>
> 	list_replace_rcu(&leader->tasks, &tsk->tasks);
> 	list_replace_init(&leader->sibling, &tsk->sibling);
>
> 	tsk->group_leader = tsk;
> 	leader->group_leader = tsk;
>
> which seems to be called with only tasklist_lock held, which doesn't seem to
> be held in the cgroup code.
>
> If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before
> this chunk is run with the same value for 'leader', but the
> while_each_thread is run after, then the while_each_thread() might loop
> forever.  rcu_read_lock doesn't prevent this from happening.

Yes.  This was already discussed.  See http://marc.info/?t=127688987300002

Damn.  I forgot about this completely.

> The code in de_thread() is actually questionable by itself.
> "list_replace_rcu" cannot really be used on the head of a list - it is only
> meant to be used on a member of a list.
> To move a list from one head to another you should be using
> list_splice_init_rcu().

Hmm... can't understand this part.

And just in case... list_replace_rcu() looks fine afaics.  The real problem
is release_task(old_leader), which does list_del_rcu(old_leader->thread_group);
this is what breaks while_each_thread().

> The ->tasks list doesn't seem to have a clearly distinguished 'head'

Exactly.  This is the problem.

But: you seem to have confused ->tasks and ->thread_group ;)

Oleg.
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-08-15  0:11 NeilBrown
From: NeilBrown
To: Oleg Nesterov
Cc: Paul Menage, Ben Blum, Li Zefan, containers, Paul E. McKenney, linux-kernel

On Sun, 14 Aug 2011 19:40:00 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> Sorry for delay, just noticed this thread...
>
> On 07/27, NeilBrown wrote:
[ . . . ]
> Yes.  This was already discussed.  See http://marc.info/?t=127688987300002
>
> Damn.  I forgot about this completely.
>
[ . . . ]
> Hmm... can't understand this part.
>
> And just in case... list_replace_rcu() looks fine afaics.  The real problem
> is release_task(old_leader), which does list_del_rcu(old_leader->thread_group);
> this is what breaks while_each_thread().
>
> > The ->tasks list doesn't seem to have a clearly distinguished 'head'
>
> Exactly.  This is the problem.
>
> But: you seem to have confused ->tasks and ->thread_group ;)

That last point is certainly correct.  I caught myself confusing the two
several times and wouldn't be at all surprised if I did it without catching
myself.

de_thread can change the group_leader of a thread_group, and release_task
can remove a non-leader while leaving the rest of the thread_group intact.
So any while_each_thread() loop needs some extra care to ensure that it
doesn't loop infinitely, because the "head" that it is looking for might
not be there any more.

Maybe there are other rules that ensure this can never happen, but they
sure aren't obvious to me (i.e. if you know them - please tell ;-)

NeilBrown
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-08-15 19:09 Oleg Nesterov
From: Oleg Nesterov
To: NeilBrown
Cc: Ben Blum, containers, linux-kernel, Paul Menage, Paul E. McKenney

On 08/15, NeilBrown wrote:
>
> de_thread can change the group_leader of a thread_group, and release_task
> can remove a non-leader while leaving the rest of the thread_group intact.
> So any while_each_thread() loop needs some extra care to ensure that it
> doesn't loop infinitely, because the "head" that it is looking for might
> not be there any more.
>
> Maybe there are other rules that ensure this can never happen, but they
> sure aren't obvious to me (i.e. if you know them - please tell ;-)

No, I don't know ;)

And note also that if g != leader, then while_each_thread(g, t) can hang
simply because g exits.

I am still trying to invent something simple to fix
while_each_thread-under-rcu.  This looks possible, but I am starting to
think that, say, zap_threads() needs locking anyway.  With any fix I can
imagine, it can miss a thread we should care about.

Oleg.
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-07-27 15:07 Ben Blum
From: Ben Blum
To: NeilBrown
Cc: Paul Menage, Li Zefan, Oleg Nesterov, containers, Paul E. McKenney, linux-kernel

On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote:
>
> Hi,
>  I've been exploring the use of RCU in the kernel, particularly looking for
> things that don't quite look right.  I found cgroup_attach_proc which was
> added a few months ago.

Awesome, thanks! :)

> It contains:
>
> 	rcu_read_lock();
> 	if (!thread_group_leader(leader)) {
> 		/*
> 		 * a race with de_thread from another thread's exec() may strip
> 		 * us of our leadership, making while_each_thread unsafe to use
> 		 * on this task. if this happens, there is no choice but to
> 		 * throw this task away and try again (from cgroup_procs_write);
> 		 * this is "double-double-toil-and-trouble-check locking".
> 		 */
> 		rcu_read_unlock();
> 		retval = -EAGAIN;
> 		goto out_free_group_list;
> 	}
>
> (and having the comment helps a lot!)
>
> The comment acknowledges a race with de_thread but seems to assume that
> rcu_read_lock() will protect against that race.  It won't.
[ . . . ]
> I note that some other locks are held and maybe some other lock provides
> the required exclusion - I haven't explored that too deeply - but if that is
> the case, then the use of rcu_read_lock() here is pointless - it isn't
> needed just to call thread_group_leader().

I wrote this code, and I admit to not having a full understanding of RCU
myself.  The code was once more complicated (before the patches went in,
mind you), and had a series of checks like that leading up to a
list_for_each_entry over the ->thread_group list (in "step 3", instead of
iterating over the flex_array), and had read_lock(&tasklist_lock) around
it. (...)

(The other locks held are just cgroup_mutex and threadgroup_fork_lock,
which wouldn't provide the exclusion.)

> The race as I understand it is with this code:
>
> 	list_replace_rcu(&leader->tasks, &tsk->tasks);
> 	list_replace_init(&leader->sibling, &tsk->sibling);
>
> 	tsk->group_leader = tsk;
> 	leader->group_leader = tsk;
>
> which seems to be called with only tasklist_lock held, which doesn't seem to
> be held in the cgroup code.
>
> If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before
> this chunk is run with the same value for 'leader', but the
> while_each_thread is run after, then the while_each_thread() might loop
> forever.  rcu_read_lock doesn't prevent this from happening.

Somehow I was under the impression that holding tasklist_lock (for
writing) provided exclusion from code that holds rcu_read_lock -
probably because there are other points in the kernel which do
while_each_thread with only RCU-read held (and not tasklist):

- kernel/hung_task.c, check_hung_uninterruptible_tasks()
- kernel/posix-cpu-timers.c, thread_group_cputime()
- fs/ioprio.c, ioprio_set() and ioprio_get()

(There are also places, like kernel/signal.c, where code does
while_each_thread with only sighand->siglock held.  This also seems
sketchy, since de_thread only takes that lock after the code quoted
above.  There's a big comment in fs/exec.c where this is also done, but I
don't quite understand it.)

You seem to imply that rcu_read_lock() doesn't exclude against
write_lock(&tasklist_lock).  If that's true, then we can fix the cgroup
code simply by replacing rcu_read_lock/rcu_read_unlock with read_lock
and read_unlock on tasklist_lock.  (I can put a bugfix patch together
quickly for this if so.)

Wouldn't this mean that the three places listed above are also wrong?

> The code in de_thread() is actually questionable by itself.
> "list_replace_rcu" cannot really be used on the head of a list - it is only
> meant to be used on a member of a list.
> To move a list from one head to another you should be using
> list_splice_init_rcu().
[ . . . ]

I barely understand de_thread() from the reader's perspective, let alone
from the author's perspective, so I can't speak for that one.

Thanks for pointing this out!

-- Ben
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread.
@ 2011-07-27 23:42 Paul E. McKenney
From: Paul E. McKenney
To: Ben Blum
Cc: NeilBrown, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel

On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote:
> On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote:

[ . . . ]

> Somehow I was under the impression that holding tasklist_lock (for
> writing) provided exclusion from code that holds rcu_read_lock -
> probably because there are other points in the kernel which do
> while_each_thread with only RCU-read held (and not tasklist):
>
> - kernel/hung_task.c, check_hung_uninterruptible_tasks()

This one looks OK to me.  The code is just referencing fields in each
of the task structures, and appears to be making proper use of
rcu_dereference().  All this code requires is that the task structures
remain in existence through the full lifetime of the RCU read-side
critical section, which is guaranteed because of the way the task_struct
is freed.

> - kernel/posix-cpu-timers.c, thread_group_cputime()

Same story here.  The code only needs the task_struct structures to
stick around long enough to compute the CPU time consumed, so this
should be OK.

> - fs/ioprio.c, ioprio_set() and ioprio_get()

Here RCU protects traversal to the task structure as in the two examples
above.  In the ioprio_set() case, the actual change is carried out under
task_lock().  So this code relies on RCU to safely locate the task
structure, and on locking to safely carry out the update.

The ioprio_get() is using RCU to protect finding the task structure.
This returns an integer (the I/O priority), so no locking is required.
Though if compilers continue becoming increasingly aggressive with their
optimizations, there might need to be an ACCESS_ONCE() in
get_task_ioprio().

> (There are also places, like kernel/signal.c, where code does
> while_each_thread with only sighand->siglock held. ...)
>
> You seem to imply that rcu_read_lock() doesn't exclude against
> write_lock(&tasklist_lock).  If that's true, then we can fix the cgroup
> code simply by replacing rcu_read_lock/rcu_read_unlock with read_lock
> and read_unlock on tasklist_lock.  (I can put a bugfix patch together
> quickly for this if so.)

Indeed, rcu_read_lock() does not exclude write_lock(&tasklist_lock).
In fact, it doesn't exclude against -any- lock.  But don't take my word
for it, instead take a look at the heaviest-weight implementation of
rcu_read_lock() that currently exists in mainline:

	void __rcu_read_lock(void)
	{
		current->rcu_read_lock_nesting++;
		barrier();
	}

That really is all there is to it, at least in the absence of
CONFIG_PROVE_RCU.  The first line does nothing more than increment a
counter in the current task's task_struct, while the second line does
nothing more than suppress code-motion optimizations that the compiler
might otherwise be tempted to undertake.

The purpose of rcu_read_lock() is instead to prevent any subsequent
synchronize_rcu() calls (on any CPU or from any task) from returning
until this task executes the matching rcu_read_unlock().  If you make
careful use of rcu_read_lock(), synchronize_rcu(), and friends, you can
get an effect that in some ways resembles mutual exclusion, but that is
much faster and more scalable.

In the case of the task structure, rcu_read_lock() guarantees that any
task_struct that you can get a pointer to will remain in existence until
you execute the matching rcu_read_unlock().  By "remain in existence",
I mean that the task_struct won't be freed (and possibly reused).

So rcu_read_lock() is giving you the following guarantee: if you
legitimately pick up a pointer to an RCU-protected data structure, then
that data structure will not be freed until after you execute the
matching rcu_read_unlock().

> Wouldn't this mean that the three places listed above are also wrong?

I do not believe so, give or take the ACCESS_ONCE().  The issue in your
case is that you are relying not just on the process to exist, but also
on the state of group leadership.  RCU protects the former, but not the
latter.

I suspect that your suggested bugfix patch would do the trick, but I
must defer to people who understand tasklist locking better than I do.

							Thanx, Paul
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-27 23:42 ` Paul E. McKenney [not found] ` <20110727234235.GA2318-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2011-07-28 1:08 ` NeilBrown 2011-07-28 6:26 ` Ben Blum ` (3 more replies) 1 sibling, 4 replies; 63+ messages in thread From: NeilBrown @ 2011-07-28 1:08 UTC (permalink / raw) To: paulmck Cc: Ben Blum, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > [ . . . ] > > > > The race as I understand it is with this code: > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > tsk->group_leader = tsk; > > > leader->group_leader = tsk; > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > be held in the cgroup code. > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > this chunk is run with the same value for 'leader', but the > > > while_each_thread is run after, then the while_read_thread() might loop > > > forever. rcu_read_lock doesn't prevent this from happening. > > > > Somehow I was under the impression that holding tasklist_lock (for > > writing) provided exclusion from code that holds rcu_read_lock - > > probably because there are other points in the kernel which do > > while_each_thread with only RCU-read held (and not tasklist): > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > This one looks OK to me. The code is just referencing fields in each > of the task structures, and appears to be making proper use of > rcu_dereference(). 
All this code requires is that the task structures > remain in existence through the full lifetime of the RCU read-side > critical section, which is guaranteed because of the way the task_struct > is freed. I disagree. It also requires - by virtue of the use of while_each_thread() - that 'g' remains on the list that 't' is walking along. Now for a normal list, the head always stays on the list and is accessible even from an rcu-removed entry. But the thread_group list isn't a normal list. It doesn't have a distinct head. It is a loop of all of the 'task_structs' in a thread group. One of them is designated the 'leader' but de_thread() can change the 'leader' - though it doesn't remove the old leader. __unhash_process in kernel/exit.c looks like it could remove the leader from the list and definitely could remove a non-leader. So if a non-leader calls 'exec' and the leader calls 'exit', then a task_struct that was the leader could become a non-leader and then be removed from the list that kernel/hung_task could be walking along. So I don't think that while_each_thread() is currently safe. It depends on the thread leader not disappearing and I think it can. So I'm imagining a patch like this to ensure that while_each_thread() is actually safe. If it is always safe you can remove that extra check in cgroup_attach_proc() which looked wrong. I just hope someone who understands the process tree is listening. The change in exit.c is the most uncertain part.
NeilBrown diff --git a/fs/exec.c b/fs/exec.c index 6075a1e..c9ea5f0 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -960,6 +960,9 @@ static int de_thread(struct task_struct *tsk) list_replace_init(&leader->sibling, &tsk->sibling); tsk->group_leader = tsk; + smp_mb(); /* ensure that any reader will always be able to see + * a task that claims to be the group leader + */ leader->group_leader = tsk; tsk->exit_signal = SIGCHLD; diff --git a/include/linux/sched.h b/include/linux/sched.h index 14a6c7b..13e0192 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2267,8 +2267,10 @@ extern bool current_is_single_threaded(void); #define do_each_thread(g, t) \ for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do +/* Thread group leader can change, so stop loop when we see one + * even if it isn't 'g' */ #define while_each_thread(g, t) \ - while ((t = next_thread(t)) != g) + while ((t = next_thread(t)) != g && !thread_group_leader(t)) static inline int get_nr_threads(struct task_struct *tsk) { diff --git a/kernel/exit.c b/kernel/exit.c index f2b321b..d6cef25 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) list_del_rcu(&p->tasks); list_del_init(&p->sibling); __this_cpu_dec(process_counts); - } - list_del_rcu(&p->thread_group); + } else + /* only remove members from the thread group. + * The thread group leader must stay so that + * while_each_thread() uses can see the end of + * the list and stop. + */ + list_del_rcu(&p->thread_group); } /* ^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-28 1:08 ` NeilBrown @ 2011-07-28 6:26 ` Ben Blum 2011-07-28 7:13 ` NeilBrown [not found] ` <20110728062616.GC15204-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 2011-07-28 12:17 ` Paul E. McKenney ` (2 subsequent siblings) 3 siblings, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-07-28 6:26 UTC (permalink / raw) To: NeilBrown Cc: paulmck, Ben Blum, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > <paulmck@linux.vnet.ibm.com> wrote: > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > [ . . . ] > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > tsk->group_leader = tsk; > > > > leader->group_leader = tsk; > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > be held in the cgroup code. > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > this chunk is run with the same value for 'leader', but the > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > forever. rcu_read_lock doesn't prevent this from happening. > > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > writing) provided exclusion from code that holds rcu_read_lock - > > > probably because there are other points in the kernel which do > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > This one looks OK to me. 
The code is just referencing fields in each > > of the task structures, and appears to be making proper use of > > rcu_dereference(). All this code requires is that the task structures > > remain in existence through the full lifetime of the RCU read-side > > critical section, which is guaranteed because of the way the task_struct > > is freed. > > I disagree. It also requires - by virtue of the use of while_each_thread() - > that 'g' remains on the list that 't' is walking along. > > Now for a normal list, the head always stays on the list and is accessible > even from an rcu-removed entry. But the thread_group list isn't a normal > list. It doesn't have a distinct head. It is a loop of all of the > 'task_structs' in a thread group. One of them is designated the 'leader' but > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > __unhash_process in mm/exit.c looks like it could remove the leader from the > list and definitely could remove a non-leader. > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > task_struct that was the leader could become a non-leader and then be removed > from the list that kernel/hung_task could be walking along. That agrees with my understanding. > > So I don't think that while_each_thread() is currently safe. It depends on > the thread leader not disappearing and I think it can. I think that while_each_thread is perfectly safe, it just needs to be protected properly while used. it reads the tasklist, and both competing paths (__unhash_process and de_thread) are done with tasklist_lock write locked, so read-locking ought to suffice. all it needs is to be better documented. > [...] 
> > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) this is semantically wrong: it will stop as soon as it finds a thread that has newly become the leader, and not run the loop body code in that thread's case. so the thread that just execed would not get run on, and in the case of my code, would "escape" the cgroup migration. but I argue it is also organisationally wrong. while_each_thread's purpose is just to worry about the structure of the process list, not to account for behavioural details of de_thread. this check belongs outside of the macro, and it should be protected by tasklist_lock in the same critical section in which while_each_thread is used. -- Ben > > static inline int get_nr_threads(struct task_struct *tsk) > { > diff --git a/kernel/exit.c b/kernel/exit.c > index f2b321b..d6cef25 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. > + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); > } > > /* > > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-28 6:26 ` Ben Blum @ 2011-07-28 7:13 ` NeilBrown [not found] ` <20110728171345.67d3797d-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-07-29 14:28 ` Ben Blum [not found] ` <20110728062616.GC15204-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 1 sibling, 2 replies; 63+ messages in thread From: NeilBrown @ 2011-07-28 7:13 UTC (permalink / raw) To: Ben Blum Cc: paulmck, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel On Thu, 28 Jul 2011 02:26:16 -0400 Ben Blum <bblum@andrew.cmu.edu> wrote: > On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > > <paulmck@linux.vnet.ibm.com> wrote: > > > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > > > [ . . . ] > > > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > > > tsk->group_leader = tsk; > > > > > leader->group_leader = tsk; > > > > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > > be held in the cgroup code. > > > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > > this chunk is run with the same value for 'leader', but the > > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > > forever. rcu_read_lock doesn't prevent this from happening. 
> > > > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > > writing) provided exclusion from code that holds rcu_read_lock - > > > > probably because there are other points in the kernel which do > > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > > > This one looks OK to me. The code is just referencing fields in each > > > of the task structures, and appears to be making proper use of > > > rcu_dereference(). All this code requires is that the task structures > > > remain in existence through the full lifetime of the RCU read-side > > > critical section, which is guaranteed because of the way the task_struct > > > is freed. > > > > I disagree. It also requires - by virtue of the use of while_each_thread() - > > that 'g' remains on the list that 't' is walking along. > > > > Now for a normal list, the head always stays on the list and is accessible > > even from an rcu-removed entry. But the thread_group list isn't a normal > > list. It doesn't have a distinct head. It is a loop of all of the > > 'task_structs' in a thread group. One of them is designated the 'leader' but > > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > > > __unhash_process in mm/exit.c looks like it could remove the leader from the > > list and definitely could remove a non-leader. > > > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > > task_struct that was the leader could become a non-leader and then be removed > > from the list that kernel/hung_task could be walking along. > > That agrees with my understanding. > > > > > So I don't think that while_each_thread() is currently safe. It depends on > > the thread leader not disappearing and I think it can. > > I think that while_each_thread is perfectly safe, it just needs to be > protected properly while used. 
it reads the tasklist, and both competing > paths (__unhash_process and de_thread) are done with tasklist_lock write > locked, so read-locking ought to suffice. all it needs is to be better > documented. That might be one answer. However the fact that rcu primitives are used to walk the list (next_thread() contains list_entry_rcu) seems to suggest that the intention was that RCU could be used for read access - which in large part it can. So we have two alternatives: 1/ take a reader spinlock to walk the thread_group list 2/ make RCU work safely for walking thread_group I thought it was best to try the latter first. > > > [...] > > > > +/* Thread group leader can change, so stop loop when we see one > > + * even if it isn't 'g' */ > > #define while_each_thread(g, t) \ > > - while ((t = next_thread(t)) != g) > > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) > > this is semantically wrong: it will stop as soon as it finds a thread > that has newly become the leader, and not run the loop body code in that > thread's case. so the thread that just execed would not get run on, and > in the case of my code, would "escape" the cgroup migration. Yes, you are right ... For your case I think you do want the thread group list to be both safe and stable. RCU can only provide 'safe' for a list. So I think I'm convinced that you need a readlock on tasklist_lock. That doesn't mean that all users of while_each_thread do. > > but I argue it is also organisationally wrong. while_each_thread's > purpose is just to worry about the structure of the process list, not to > account for behavioural details of de_thread. this check belongs outside > of the macro, and it should be protected by tasklist_lock in the same > critical section in which while_each_thread is used. Sound fair. Maybe we need two versions of while_each_thread(). One that assumes tasklist_lock and guarantees getting every thread, and one that is less precise but only needs rcu_read_lock(). 
That would require an audit of each call site to see what is really needed. But yes: if you replace your rcu_read_lock()s with read_lock(&tasklist_lock) then you will be safe, and I don't think you can do any better (i.e. less locking) than that. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-07-28 7:13 ` NeilBrown [not found] ` <20110728171345.67d3797d-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2011-07-29 14:28 ` Ben Blum 2011-08-01 19:31 ` Paul Menage ` (2 more replies) 1 sibling, 3 replies; 63+ messages in thread From: Ben Blum @ 2011-07-29 14:28 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, paulmck, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel, Andrew Morton Fix unstable tasklist locking in cgroup_attach_proc. From: Ben Blum <bblum@andrew.cmu.edu> According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is not sufficient to guarantee the tasklist is stable w.r.t. de_thread and exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures proper exclusion. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> --- kernel/cgroup.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff a/kernel/cgroup.c b/kernel/cgroup.c --- a/kernel/cgroup.c 2011-07-21 19:17:23.000000000 -0700 +++ b/kernel/cgroup.c 2011-07-29 06:17:47.000000000 -0700 @@ -2024,7 +2024,7 @@ goto out_free_group_list; /* prevent changes to the threadgroup list while we take a snapshot. */ - rcu_read_lock(); + read_lock(&tasklist_lock); if (!thread_group_leader(leader)) { /* * a race with de_thread from another thread's exec() may strip @@ -2033,7 +2033,7 @@ * throw this task away and try again (from cgroup_procs_write); * this is "double-double-toil-and-trouble-check locking". */ - rcu_read_unlock(); + read_unlock(&tasklist_lock); retval = -EAGAIN; goto out_free_group_list; } @@ -2054,7 +2054,7 @@ } while_each_thread(leader, tsk); /* remember the number of threads in the array for later. */ group_size = i; - rcu_read_unlock(); + read_unlock(&tasklist_lock); /* * step 1: check that we can legitimately attach to the cgroup. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-07-29 14:28 ` Ben Blum @ 2011-08-01 19:31 ` Paul Menage [not found] ` <20110729142842.GA8462-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 2011-08-15 18:49 ` Oleg Nesterov 2 siblings, 0 replies; 63+ messages in thread From: Paul Menage @ 2011-08-01 19:31 UTC (permalink / raw) To: Ben Blum Cc: NeilBrown, paulmck, Li Zefan, Oleg Nesterov, containers, linux-kernel, Andrew Morton On Fri, Jul 29, 2011 at 7:28 AM, Ben Blum <bblum@andrew.cmu.edu> wrote: > Fix unstable tasklist locking in cgroup_attach_proc. > > From: Ben Blum <bblum@andrew.cmu.edu> > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > ensures proper exclusion. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <menage@google.com> Thanks, Paul > --- > kernel/cgroup.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff a/kernel/cgroup.c b/kernel/cgroup.c > --- a/kernel/cgroup.c 2011-07-21 19:17:23.000000000 -0700 > +++ b/kernel/cgroup.c 2011-07-29 06:17:47.000000000 -0700 > @@ -2024,7 +2024,7 @@ > goto out_free_group_list; > > /* prevent changes to the threadgroup list while we take a snapshot. */ > - rcu_read_lock(); > + read_lock(&tasklist_lock); > if (!thread_group_leader(leader)) { > /* > * a race with de_thread from another thread's exec() may strip > @@ -2033,7 +2033,7 @@ > * throw this task away and try again (from cgroup_procs_write); > * this is "double-double-toil-and-trouble-check locking". > */ > - rcu_read_unlock(); > + read_unlock(&tasklist_lock); > retval = -EAGAIN; > goto out_free_group_list; > } > @@ -2054,7 +2054,7 @@ > } while_each_thread(leader, tsk); > /* remember the number of threads in the array for later. 
*/ > group_size = i; > - rcu_read_unlock(); > + read_unlock(&tasklist_lock); > > /* > * step 1: check that we can legitimately attach to the cgroup. > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-07-29 14:28 ` Ben Blum 2011-08-01 19:31 ` Paul Menage [not found] ` <20110729142842.GA8462-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-08-15 18:49 ` Oleg Nesterov 2011-08-15 22:50 ` Frederic Weisbecker ` (4 more replies) 2 siblings, 5 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-08-15 18:49 UTC (permalink / raw) To: Ben Blum Cc: NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton, Frederic Weisbecker On 07/29, Ben Blum wrote: > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > ensures proper exclusion. Yes. So far I still think we should fix while_each_thread() so that it works under rcu_read_lock() "as expected", I'll try to think more. But whatever we do with while_each_thread(), this can't help cgroup_attach_proc(), it needs the locking. > - rcu_read_lock(); > + read_lock(&tasklist_lock); > if (!thread_group_leader(leader)) { Agreed, this should work. But can't we avoid the global list? thread_group_leader() or not, we do not really care. We only need to ensure we can safely find all threads. How about the patch below? With or without this/your patch this leader can die right after we drop the lock. ss->can_attach(leader) and ss->attach(leader) look suspicious. If a sub-thread execs, this task_struct has nothing to do with the threadgroup. Also. This is off-topic, but... Why do cgroup_attach_proc() and cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() in a different order? cgroup_attach_proc() looks wrong even if it currently doesn't matter. Oleg.
--- x/kernel/cgroup.c
+++ x/kernel/cgroup.c
@@ -2000,6 +2000,7 @@ int cgroup_attach_proc(struct cgroup *cg
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
 	struct flex_array *group;
+	unsigned long flags;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2027,19 +2028,10 @@ int cgroup_attach_proc(struct cgroup *cg
 		goto out_free_group_list;

 	/* prevent changes to the threadgroup list while we take a snapshot. */
-	rcu_read_lock();
-	if (!thread_group_leader(leader)) {
-		/*
-		 * a race with de_thread from another thread's exec() may strip
-		 * us of our leadership, making while_each_thread unsafe to use
-		 * on this task. if this happens, there is no choice but to
-		 * throw this task away and try again (from cgroup_procs_write);
-		 * this is "double-double-toil-and-trouble-check locking".
-		 */
-		rcu_read_unlock();
-		retval = -EAGAIN;
+	retval = -EAGAIN;
+	if (!lock_task_sighand(leader, &flags))
 		goto out_free_group_list;
-	}
+
 	/* take a reference on each task in the group to go in the array. */
 	tsk = leader;
 	i = 0;
@@ -2055,9 +2047,9 @@ int cgroup_attach_proc(struct cgroup *cg
 		BUG_ON(retval != 0);
 		i++;
 	} while_each_thread(leader, tsk);
+	unlock_task_sighand(leader, &flags);
 	/* remember the number of threads in the array for later. */
 	group_size = i;
-	rcu_read_unlock();

 	/*
 	 * step 1: check that we can legitimately attach to the cgroup.

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-08-15 18:49 ` Oleg Nesterov @ 2011-08-15 22:50 ` Frederic Weisbecker 2011-08-15 23:04 ` Ben Blum ` (3 more replies) [not found] ` <20110815184957.GA16588-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (3 subsequent siblings) 4 siblings, 4 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-08-15 22:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > On 07/29, Ben Blum wrote: > > > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > > ensures proper exclusion. > > Yes. > > So far I still think we should fix while_each_thread() so that it works > under rcu_read_lock() "as exepected", I'll try to think more. > > But whatever we do with while_each_thread(), this can't help > cgroup_attach_proc(), it needs the locking. > > > - rcu_read_lock(); > > + read_lock(&tasklist_lock); > > if (!thread_group_leader(leader)) { > > Agreed, this should work. > > But can't we avoid the global list? thread_group_leader() or not, we do > not really care. We only need to ensure we can safely find all threads. > > How about the patch below? > > > With or without this/your patch this leader can die right after we > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > suspicious. If a sub-thread execs, this task_struct has nothing to > do with the threadgroup. > > > > Also. This is off-topic, but... Why cgroup_attach_proc() and > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > in the different order? cgroup_attach_proc() looks wrong even > if currently doesn't matter. Right. 
As we concluded in our off-list discussion, if there is no strong reason for that, I'm going to fix that in my task counter patchset because there it really matters. If we can't migrate the thread because it has already exited, we really don't want to call ->attach_task() but rather cancel_attach_task(). Thanks. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-08-15 22:50 ` Frederic Weisbecker @ 2011-08-15 23:04 ` Ben Blum 2011-08-15 23:09 ` Ben Blum [not found] ` <20110815230415.GA6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 2011-08-15 23:04 ` Ben Blum ` (2 subsequent siblings) 3 siblings, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-08-15 23:04 UTC (permalink / raw) To: Frederic Weisbecker Cc: Oleg Nesterov, Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Tue, Aug 16, 2011 at 12:50:06AM +0200, Frederic Weisbecker wrote: > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > On 07/29, Ben Blum wrote: > > > > > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > > > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > > > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > > > ensures proper exclusion. > > > > Yes. > > > > So far I still think we should fix while_each_thread() so that it works > > under rcu_read_lock() "as exepected", I'll try to think more. > > > > But whatever we do with while_each_thread(), this can't help > > cgroup_attach_proc(), it needs the locking. > > > > > - rcu_read_lock(); > > > + read_lock(&tasklist_lock); > > > if (!thread_group_leader(leader)) { > > > > Agreed, this should work. > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > not really care. We only need to ensure we can safely find all threads. > > > > How about the patch below? > > > > > > With or without this/your patch this leader can die right after we > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > suspicious. If a sub-thread execs, this task_struct has nothing to > > do with the threadgroup. > > > > > > > > Also. This is off-topic, but... 
Why cgroup_attach_proc() and > > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > > in the different order? cgroup_attach_proc() looks wrong even > > if currently doesn't matter. > > Right. As we concluded in our off-list discussion, if there > is no strong reason for that, I'm going to fix that in my task > counter patchset because there it really matters. If we can't > migrate the thread because it has already exited, we really > don't want to call ->attach_task() but rather cancel_attach_task(). > > Thanks. > Yes. Um, this must have been a mistake on my part. The lines of code should be the other way around. It should be done in a separate bugfix patch, though, so it goes through faster... -- Ben ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-08-15 23:04 ` Ben Blum @ 2011-08-15 23:09 ` Ben Blum 2011-08-15 23:19 ` Frederic Weisbecker [not found] ` <20110815230900.GB6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> [not found] ` <20110815230415.GA6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 1 sibling, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-08-15 23:09 UTC (permalink / raw) To: Ben Blum Cc: Frederic Weisbecker, Oleg Nesterov, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 07:04:15PM -0400, Ben Blum wrote: > On Tue, Aug 16, 2011 at 12:50:06AM +0200, Frederic Weisbecker wrote: > > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > > On 07/29, Ben Blum wrote: > > > > > > > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > > > > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > > > > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > > > > ensures proper exclusion. > > > > > > Yes. > > > > > > So far I still think we should fix while_each_thread() so that it works > > > under rcu_read_lock() "as exepected", I'll try to think more. > > > > > > But whatever we do with while_each_thread(), this can't help > > > cgroup_attach_proc(), it needs the locking. > > > > > > > - rcu_read_lock(); > > > > + read_lock(&tasklist_lock); > > > > if (!thread_group_leader(leader)) { > > > > > > Agreed, this should work. > > > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > > not really care. We only need to ensure we can safely find all threads. > > > > > > How about the patch below? > > > > > > > > > With or without this/your patch this leader can die right after we > > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > > suspicious. If a sub-thread execs, this task_struct has nothing to > > > do with the threadgroup. 
> > > > > > > > > > > > Also. This is off-topic, but... Why cgroup_attach_proc() and > > > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > > > in the different order? cgroup_attach_proc() looks wrong even > > > if currently doesn't matter. > > > > Right. As we concluded in our off-list discussion, if there > > is no strong reason for that, I'm going to fix that in my task > > counter patchset because there it really matters. If we can't > > migrate the thread because it has already exited, we really > > don't want to call ->attach_task() but rather cancel_attach_task(). > > > > Thanks. > > > > Yes. Um, this must have been a mistake on my part. The lines of code > should be the other way around. It should be done in a separate bugfix > patch, though, so it goes through faster... > > -- Ben > also, there is no cancel_attach_task, afaict. cancel_attach is only for memcg, for the whole operation at once - unless you are changing this, in which case feel free to modify on top of the patch I'm about to send out. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc 2011-08-15 23:09 ` Ben Blum @ 2011-08-15 23:19 ` Frederic Weisbecker [not found] ` <20110815230900.GB6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 1 sibling, 0 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-08-15 23:19 UTC (permalink / raw) To: Ben Blum Cc: Oleg Nesterov, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 07:09:00PM -0400, Ben Blum wrote: > On Mon, Aug 15, 2011 at 07:04:15PM -0400, Ben Blum wrote: > > On Tue, Aug 16, 2011 at 12:50:06AM +0200, Frederic Weisbecker wrote: > > > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > > > On 07/29, Ben Blum wrote: > > > > > > > > > > According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is > > > > > not sufficient to guarantee the tasklist is stable w.r.t. de_thread and > > > > > exit. Taking tasklist_lock for reading, instead of rcu_read_lock, > > > > > ensures proper exclusion. > > > > > > > > Yes. > > > > > > > > So far I still think we should fix while_each_thread() so that it works > > > > under rcu_read_lock() "as exepected", I'll try to think more. > > > > > > > > But whatever we do with while_each_thread(), this can't help > > > > cgroup_attach_proc(), it needs the locking. > > > > > > > > > - rcu_read_lock(); > > > > > + read_lock(&tasklist_lock); > > > > > if (!thread_group_leader(leader)) { > > > > > > > > Agreed, this should work. > > > > > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > > > not really care. We only need to ensure we can safely find all threads. > > > > > > > > How about the patch below? > > > > > > > > > > > > With or without this/your patch this leader can die right after we > > > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > > > suspicious. If a sub-thread execs, this task_struct has nothing to > > > > do with the threadgroup. 
> > > > > > > > > > > > > > > > Also. This is off-topic, but... Why cgroup_attach_proc() and > > > > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > > > > in the different order? cgroup_attach_proc() looks wrong even > > > > if currently doesn't matter. > > > > > > Right. As we concluded in our off-list discussion, if there > > > is no strong reason for that, I'm going to fix that in my task > > > counter patchset because there it really matters. If we can't > > > migrate the thread because it has already exited, we really > > > don't want to call ->attach_task() but rather cancel_attach_task(). > > > > > > Thanks. > > > > > > > Yes. Um, this must have been a mistake on my part. The lines of code > > should be the other way around. It should be done in a separate bugfix > > patch, though, so it goes through faster... > > > > -- Ben > > > > also, there is no cancel_attach_task, afaict. cancel_attach is only for > memcg, for the whole operation at once - unless you are changing this, > in which case feel free to modify on top of the patch I'm about to send > out. Yeah right, I forgot it's something I'm adding with the task counter subsystem. ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH][BUGFIX] cgroups: fix ordering of calls in cgroup_attach_proc 2011-08-15 22:50 ` Frederic Weisbecker 2011-08-15 23:04 ` Ben Blum 2011-08-15 23:04 ` Ben Blum @ 2011-08-15 23:11 ` Ben Blum 2011-08-15 23:20 ` Frederic Weisbecker ` (2 more replies) 2011-08-15 23:11 ` Ben Blum 3 siblings, 3 replies; 63+ messages in thread From: Ben Blum @ 2011-08-15 23:11 UTC (permalink / raw) To: Frederic Weisbecker Cc: Oleg Nesterov, Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton

Fix ordering of cgroup_task_migrate and attach_task in cgroup_attach_proc.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 kernel/cgroup.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff -u a/kernel/cgroup.c b/kernel/cgroup.c
--- a/kernel/cgroup.c	2011-08-15 19:05:37.308401219 -0400
+++ b/kernel/cgroup.c	2011-08-15 19:07:47.376775142 -0400
@@ -2132,14 +2132,17 @@
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
 			continue;
-		/* attach each task to each subsystem */
-		for_each_subsys(root, ss) {
-			if (ss->attach_task)
-				ss->attach_task(cgrp, tsk);
-		}
 		/* if the thread is PF_EXITING, it can just get skipped. */
 		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
-		BUG_ON(retval != 0 && retval != -ESRCH);
+		if (retval == 0) {
+			/* attach each task to each subsystem */
+			for_each_subsys(root, ss) {
+				if (ss->attach_task)
+					ss->attach_task(cgrp, tsk);
+			}
+		} else {
+			BUG_ON(retval != -ESRCH);
+		}
 	}
 	/* nothing is sensitive to fork() after this point. */

^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: fix ordering of calls in cgroup_attach_proc 2011-08-15 23:11 ` [PATCH][BUGFIX] cgroups: fix ordering of calls " Ben Blum @ 2011-08-15 23:20 ` Frederic Weisbecker [not found] ` <20110815231156.GC6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> 2011-08-15 23:31 ` Paul Menage 2 siblings, 0 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-08-15 23:20 UTC (permalink / raw) To: Ben Blum Cc: Oleg Nesterov, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 07:11:56PM -0400, Ben Blum wrote: > Fix ordering of cgroup_task_migrate and attach_task in cgroup_attach_proc. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: fix ordering of calls in cgroup_attach_proc 2011-08-15 23:11 ` [PATCH][BUGFIX] cgroups: fix ordering of calls " Ben Blum 2011-08-15 23:20 ` Frederic Weisbecker [not found] ` <20110815231156.GC6867-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-08-15 23:31 ` Paul Menage 2 siblings, 0 replies; 63+ messages in thread From: Paul Menage @ 2011-08-15 23:31 UTC (permalink / raw) To: Ben Blum Cc: Frederic Weisbecker, Oleg Nesterov, NeilBrown, paulmck, Li Zefan, containers, linux-kernel, Andrew Morton On Mon, Aug 15, 2011 at 4:11 PM, Ben Blum <bblum@andrew.cmu.edu> wrote: > Fix ordering of cgroup_task_migrate and attach_task in cgroup_attach_proc. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> BTW, I left Google at the end of last week - if it's not bouncing already, menage@google.com isn't going to work for much longer. I'll try to get a patch out to the maintainers files. Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc [not found] ` <20110815184957.GA16588-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2011-08-15 22:50 ` [PATCH][BUGFIX] cgroups: more safe tasklist locking " Frederic Weisbecker @ 2011-09-01 21:46 ` Ben Blum 1 sibling, 0 replies; 63+ messages in thread From: Ben Blum @ 2011-09-01 21:46 UTC (permalink / raw) To: Oleg Nesterov Cc: Ben Blum, Frederic Weisbecker, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, NeilBrown, Paul Menage, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Andrew Morton On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > - rcu_read_lock(); > > + read_lock(&tasklist_lock); > > if (!thread_group_leader(leader)) { > > Agreed, this should work. > > But can't we avoid the global list? thread_group_leader() or not, we do > not really care. We only need to ensure we can safely find all threads. > > How about the patch below? I was content with the tasklist_lock because cgroup_attach_proc is already a pretty heavyweight operation, and probably pretty rare that a user would want to do multiple of them at once quickly. I asked Andrew to take the simple tasklist_lock patch just now, since it does fix the bug at least. Anyway, looking at this, hmm. I am not sure if this protects adequately? In de_thread, the sighand lock is held only around the first half (around zap_other_threads), and not around the following section where leadership is transferred (esp. around the list_replace calls). tasklist_lock is held here, though, so it seems like the right lock to hold. > > > With or without this/your patch this leader can die right after we > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > suspicious. If a sub-thread execs, this task_struct has nothing to > do with the threadgroup. hmm. I thought I had this case covered, but it's been so long since I actually wrote the code that if I did I can't remember how. 
I think exiting is no issue since we hold a reference on the task_struct, but exec may still be a problem. I'm thinking: - cgroup_attach_proc drops the tasklist_lock - a sub-thread execs, and in exec_mmap (after de_thread) changes the mm - ss->attach, for example in memcg, wants to use leader->mm, which is now wrong this seems to be possible as the code currently is. I wonder if the best fix is just to have exec (maybe around de_thread) bounce off of or hold threadgroup_fork_read_lock somewhere? > > > > Also. This is off-topic, but... Why cgroup_attach_proc() and > cgroup_attach_task() do ->attach_task() + cgroup_task_migrate() > in the different order? cgroup_attach_proc() looks wrong even > if currently doesn't matter. (already submitted a patch for this) Thanks, Ben > > > Oleg. > > --- x/kernel/cgroup.c > +++ x/kernel/cgroup.c > @@ -2000,6 +2000,7 @@ int cgroup_attach_proc(struct cgroup *cg > /* threadgroup list cursor and array */ > struct task_struct *tsk; > struct flex_array *group; > + unsigned long flags; > /* > * we need to make sure we have css_sets for all the tasks we're > * going to move -before- we actually start moving them, so that in > @@ -2027,19 +2028,10 @@ int cgroup_attach_proc(struct cgroup *cg > goto out_free_group_list; > > /* prevent changes to the threadgroup list while we take a snapshot. */ > - rcu_read_lock(); > - if (!thread_group_leader(leader)) { > - /* > - * a race with de_thread from another thread's exec() may strip > - * us of our leadership, making while_each_thread unsafe to use > - * on this task. if this happens, there is no choice but to > - * throw this task away and try again (from cgroup_procs_write); > - * this is "double-double-toil-and-trouble-check locking". > - */ > - rcu_read_unlock(); > - retval = -EAGAIN; > + retval = -EAGAIN; > + if (!lock_task_sighand(leader, &flags)) > goto out_free_group_list; > - } > + > /* take a reference on each task in the group to go in the array. 
*/ > tsk = leader; > i = 0; > @@ -2055,9 +2047,9 @@ int cgroup_attach_proc(struct cgroup *cg > BUG_ON(retval != 0); > i++; > } while_each_thread(leader, tsk); > + unlock_task_sighand(leader, &flags); > /* remember the number of threads in the array for later. */ > group_size = i; > - rcu_read_unlock(); > > /* > * step 1: check that we can legitimately attach to the cgroup. > > ^ permalink raw reply [flat|nested] 63+ messages in thread
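[A standalone userspace sketch of the exec race Ben describes above — a sub-thread execs, de_thread() transfers leadership, exec_mmap() installs a fresh mm, and the old leader is left with mm == NULL. The struct names here are simplified stand-ins, not the kernel's task_struct/mm_struct:]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the race: after a sub-thread exec()s, the old leader's
 * task_struct keeps mm == NULL, so an ->attach() path that blindly
 * dereferences leader->mm (as memcg effectively did) sees stale state. */
struct mm_struct { int id; };
struct task_struct {
	struct mm_struct *mm;
	struct task_struct *group_leader;
};

/* the naive attach path: trust the leader's mm */
struct mm_struct *naive_attach_mm(struct task_struct *leader)
{
	return leader->mm;
}

/* returns 1 if the naive path loses the group's mm after the exec */
int simulate_exec_race(void)
{
	struct mm_struct old_mm = { .id = 1 }, new_mm = { .id = 2 };
	struct task_struct leader = { .mm = &old_mm };
	struct task_struct thread = { .mm = &old_mm };

	leader.group_leader = &leader;
	thread.group_leader = &leader;

	/* sub-thread execs: leadership moves, old leader loses its mm */
	thread.group_leader = &thread;
	leader.group_leader = &thread;
	thread.mm = &new_mm;
	leader.mm = NULL;

	/* the caller still holds a reference to the *old* leader */
	return naive_attach_mm(&leader) == NULL && thread.mm == &new_mm;
}
```

[The reference on the old leader's task_struct keeps it alive, which is why exiting is not the issue — but the mm it points at is no longer the group's mm.]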
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc [not found] ` <20110901214643.GD10401-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-09-02 12:32 ` Oleg Nesterov 0 siblings, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-09-02 12:32 UTC (permalink / raw) To: Ben Blum Cc: NeilBrown, Frederic Weisbecker, Paul Menage, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On 09/01, Ben Blum wrote: > > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > not really care. We only need to ensure we can safely find all threads. > > > > How about the patch below? > > I was content with the tasklist_lock because cgroup_attach_proc is > already a pretty heavyweight operation, and probably pretty rare that a > user would want to do multiple of them at once quickly. Perhaps. But this is the global lock, no matter what. Why do you prefer to take it instead of per-process ->siglock? > I asked Andrew > to take the simple tasklist_lock patch just now, since it does fix the > bug at least. I disagree. > Anyway, looking at this, hmm. I am not sure if this protects adequately? > In de_thread, the sighand lock is held only around the first half > (around zap_other_threads), We only need this lock to find all threads, > and not around the following section where > leadership is transferred (esp. around the list_replace calls). > tasklist_lock is held here, though, so it seems like the right lock to > hold. This doesn't matter at all, I think. The code under while_each_thread() doesn't need the stable ->group_leader, and it can be changed right after you drop tasklist. list_replace() calls do not play with ->thread_group list. How can tasklist make any difference? 
> > With or without this/your patch this leader can die right after we > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > suspicious. If a sub-thread execs, this task_struct has nothing to > > do with the threadgroup. > > hmm. I thought I had this case covered, but it's been so long since I > actually wrote the code that if I did I can't remember how. This all doesn't look right... Hopefully addressed by Tejun patches. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
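[A minimal illustration, not kernel list_head code, of the while_each_thread() termination concern raised earlier in this thread: the walk `t = next_thread(t)` only stops when it returns to the starting task g, so if g is unlinked from the circular list mid-walk, the loop never terminates. The sketch bounds the walk to make that observable instead of hanging:]

```c
#include <assert.h>

/* Threads sit on a circular singly linked list; walk_group() counts the
 * members reachable from g, giving up after 'cap' hops. */
struct tsk { struct tsk *next; };

int walk_group(struct tsk *g, int cap)
{
	int n = 1;
	struct tsk *t = g;

	while ((t = t->next) != g) {
		if (++n > cap)
			return -1;	/* would spin forever in real code */
	}
	return n;
}

/* returns 1 if both the normal and the unlinked-start cases behave as
 * described above */
int demo(void)
{
	struct tsk a, b, c;
	int ok;

	a.next = &b; b.next = &c; c.next = &a;	/* circular a-b-c */
	ok = (walk_group(&a, 10) == 3);

	/* unlink a from the ring, but keep a pointing into it, then start
	 * the walk from the now-unlinked task */
	b.next = &c; c.next = &b;		/* ring is now b-c */
	a.next = &b;
	return ok && (walk_group(&a, 10) == -1);
}
```

[This is why the locking question is really "what keeps the starting task on the list for the duration of the walk", not "is it still the leader".]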
* Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc [not found] ` <20110902123251.GA26764-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2011-09-08 2:11 ` Ben Blum 0 siblings, 0 replies; 63+ messages in thread From: Ben Blum @ 2011-09-08 2:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Ben Blum, Frederic Weisbecker, Paul Menage, linux-kernel-u79uwXL29TY76Z2rM5mHXA, NeilBrown, Andrew Morton, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On Fri, Sep 02, 2011 at 02:32:51PM +0200, Oleg Nesterov wrote: > On 09/01, Ben Blum wrote: > > > > On Mon, Aug 15, 2011 at 08:49:57PM +0200, Oleg Nesterov wrote: > > > > > > But can't we avoid the global list? thread_group_leader() or not, we do > > > not really care. We only need to ensure we can safely find all threads. > > > > > > How about the patch below? > > > > I was content with the tasklist_lock because cgroup_attach_proc is > > already a pretty heavyweight operation, and probably pretty rare that a > > user would want to do multiple of them at once quickly. > > Perhaps. But this is the global lock, no matter what. Why do you prefer > to take it instead of per-process ->siglock? Basically just because I wanted to find the simplest race-free implementation - at that point I was just shooting to have a complete set of patches, and didn't judge the global lock to be bad enough to think about replacing. I could yet be convinced that siglock is better. > > > I asked Andrew > > to take the simple tasklist_lock patch just now, since it does fix the > > bug at least. > > I disagree. > > > Anyway, looking at this, hmm. I am not sure if this protects adequately? > > In de_thread, the sighand lock is held only around the first half > > (around zap_other_threads), > > We only need this lock to find all threads, > > > and not around the following section where > > leadership is transferred (esp. around the list_replace calls). 
> > tasklist_lock is held here, though, so it seems like the right lock to > > hold. > > This doesn't matter at all, I think. The code under while_each_thread() > doesn't need the stable ->group_leader, and it can be changed right after > you drop tasklist. list_replace() calls do not play with ->thread_group > list. > > How can tasklist make any difference? Oops. I misread the list_replace calls as being on ->thread_group, instead of on ->tasks as they actually are. Hm, and I see __exit_signal takes the sighand lock around the call to __unhash_process. OK, I believe it will work. Oh, and I guess the sighand lock also subsumes the group_leader check that you were objecting to in the other thread. Sorry for the delay. Any way this change goes through, though, the commenting should be clear on how de_thread interacts with it. Something like: /* If the leader exits, its links on the thread_group list become * invalid. One way this can happen is if a sub-thread does exec() when * de_thread() calls release_task(leader) (and leader->sighand gets set * to NULL, in which case lock_task_sighand will fail). Since in that * case the threadgroup is still around, cgroup_procs_write should try * again (finding the new leader), which EAGAIN indicates here. This is * "double-double-toil-and-trouble-check locking". */ (I rather like the phrase "double-double-..." to describe the overall approach, and would prefer if it stayed in the comment.) > > > > With or without this/your patch this leader can die right after we > > > drop the lock. ss->can_attach(leader) and ss->attach(leader) look > > > suspicious. If a sub-thread execs, this task_struct has nothing to > > > do with the threadgroup. > > > > hmm. I thought I had this case covered, but it's been so long since I > > actually wrote the code that if I did I can't remember how. > > This all doesn't look right... Hopefully addressed by Tejun patches. > Will look into those separately. Thanks, Ben > > Oleg. 
> > ^ permalink raw reply [flat|nested] 63+ messages in thread
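[A userspace sketch of the "double-double-toil-and-trouble-check" pattern from the proposed comment above: try to lock the leader's sighand; if the leader was already released (sighand == NULL), return -EAGAIN so the caller can re-resolve the new leader and retry. lock_task_sighand() and the task layout are simplified stand-ins, not the kernel API:]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sighand { int locked; };
struct task { struct sighand *sighand; };

/* stand-in for lock_task_sighand(): fails once release_task() has set
 * leader->sighand to NULL */
struct sighand *lock_task_sighand_sim(struct task *t)
{
	if (!t->sighand)
		return NULL;
	t->sighand->locked = 1;
	return t->sighand;
}

/* one attempt of the attach path; the cgroup_procs_write() side would
 * loop, re-finding the (possibly new) leader, while this returns -EAGAIN */
int attach_proc_once(struct task *leader)
{
	struct sighand *sh = lock_task_sighand_sim(leader);

	if (!sh)
		return -EAGAIN;
	/* ... snapshot the thread list under the lock here ... */
	sh->locked = 0;
	return 0;
}
```

[The point of the pattern: failure is not an error but a signal that the snapshot must be retaken from a fresh leader lookup.]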
* [PATCH 1/2] cgroups: use sighand lock instead of tasklist_lock in cgroup_attach_proc 2011-08-15 18:49 ` Oleg Nesterov ` (2 preceding siblings ...) 2011-09-01 21:46 ` Ben Blum @ 2011-10-14 0:31 ` Ben Blum 2011-10-14 12:15 ` Frederic Weisbecker 2011-10-14 0:36 ` [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) Ben Blum 4 siblings, 1 reply; 63+ messages in thread From: Ben Blum @ 2011-10-14 0:31 UTC (permalink / raw) To: Andrew Morton, Oleg Nesterov, linux-kernel Cc: Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Frederic Weisbecker, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki Use sighand lock instead of tasklist_lock in cgroup_attach_proc. From: Ben Blum <bblum@andrew.cmu.edu> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> --- kernel/cgroup.c | 27 +++++++++++++-------------- 1 files changed, 13 insertions(+), 14 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 893fc3d..32fb4c8 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2003,6 +2003,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) /* threadgroup list cursor and array */ struct task_struct *tsk; struct flex_array *group; + unsigned long flags; /* * we need to make sure we have css_sets for all the tasks we're * going to move -before- we actually start moving them, so that in @@ -2029,20 +2030,18 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) if (retval) goto out_free_group_list; - /* prevent changes to the threadgroup list while we take a snapshot. */ - read_lock(&tasklist_lock); - if (!thread_group_leader(leader)) { - /* - * a race with de_thread from another thread's exec() may strip - * us of our leadership, making while_each_thread unsafe to use - * on this task. if this happens, there is no choice but to - * throw this task away and try again (from cgroup_procs_write); - * this is "double-double-toil-and-trouble-check locking". 
- */ - read_unlock(&tasklist_lock); - retval = -EAGAIN; + /* prevent changes to the threadgroup list while we take a snapshot. + * If the leader exits, its links on the thread_group list become + * invalid. One way this can happen is if a sub-thread does exec() when + * de_thread() calls release_task(leader) (and leader->sighand gets set + * to NULL, in which case lock_task_sighand will fail). Since in that + * case the threadgroup is still around, cgroup_procs_write should try + * again (finding the new leader), which EAGAIN indicates here. This is + * "double-double-toil-and-trouble-check locking". */ + retval = -EAGAIN; + if (!lock_task_sighand(leader, &flags)) goto out_free_group_list; - } + /* take a reference on each task in the group to go in the array. */ tsk = leader; i = 0; @@ -2058,9 +2057,9 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) BUG_ON(retval != 0); i++; } while_each_thread(leader, tsk); + unlock_task_sighand(leader, &flags); /* remember the number of threads in the array for later. */ group_size = i; - read_unlock(&tasklist_lock); /* * step 1: check that we can legitimately attach to the cgroup. ^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [PATCH 1/2] cgroups: use sighand lock instead of tasklist_lock in cgroup_attach_proc 2011-10-14 0:31 ` [PATCH 1/2] cgroups: use sighand lock instead of tasklist_lock " Ben Blum @ 2011-10-14 12:15 ` Frederic Weisbecker 0 siblings, 0 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-10-14 12:15 UTC (permalink / raw) To: Ben Blum Cc: Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki Hi Ben, On Thu, Oct 13, 2011 at 08:31:15PM -0400, Ben Blum wrote: > Use sighand lock instead of tasklist_lock in cgroup_attach_proc. Could you please expand the changelog to explain the reason for the change? Well, the reason for converting tasklist_lock-based locking to something more granular is almost obvious these days, but it would be nice to know why it is safe to do so. And what exactly that protects. Thanks. ^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-08-15 18:49 ` Oleg Nesterov ` (3 preceding siblings ...) 2011-10-14 0:31 ` [PATCH 1/2] cgroups: use sighand lock instead of tasklist_lock " Ben Blum @ 2011-10-14 0:36 ` Ben Blum 2011-10-14 12:21 ` Frederic Weisbecker 2011-10-19 5:43 ` Paul Menage 4 siblings, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-10-14 0:36 UTC (permalink / raw) To: Andrew Morton, Oleg Nesterov, linux-kernel Cc: Ben Blum, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Frederic Weisbecker, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki Convert ss->attach to take a flex_array of tasks instead of just the leader. From: Ben Blum <bblum@andrew.cmu.edu> This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and memcontrol) accurately find the group's mm even when a non-leader does exec and leaves the leader with a NULL mm pointer. Also converts cpuset and memcontrol to take the flex_array and iterate down it until an mm is found, instead of just attempting to use the leader's mm. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> --- Documentation/cgroups/cgroups.txt | 7 ++++++- include/linux/cgroup.h | 4 +++- kernel/cgroup.c | 16 +++++++++++++--- kernel/cpuset.c | 16 ++++++++++++++-- mm/memcontrol.c | 17 +++++++++++++++-- 5 files changed, 51 insertions(+), 9 deletions(-) diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 3fa646f..8e900ec 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -637,12 +637,17 @@ For any non-per-thread attachment work that needs to happen before attach_task. Needed by cpuset. 
void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, - struct cgroup *old_cgrp, struct task_struct *task) + struct cgroup *old_cgrp, struct flex_array *group, + int group_size) (cgroup_mutex held by caller) Called after the task has been attached to the cgroup, to allow any post-attachment activity that requires memory allocations or blocking. +The flex_array contains pointers to every task_struct being moved, so +that subsystems can, for example, iterate over a threadgroup's tasks to +find one with an mm that needs to be moved. + void attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk); (cgroup_mutex held by caller) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index ddc13eb..2f97a3b 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -25,6 +25,7 @@ struct cgroupfs_root; struct inode; struct cgroup; struct css_id; +struct flex_array; extern int cgroup_init_early(void); extern int cgroup_init(void); @@ -481,7 +482,8 @@ struct cgroup_subsys { void (*attach_task)(struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk); void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, - struct cgroup *old_cgrp, struct task_struct *tsk); + struct cgroup *old_cgrp, struct flex_array *group, + int group_size); int (*fork)(struct cgroup_subsys *ss, struct task_struct *task); void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *task); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 32fb4c8..f5fc882 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1824,10 +1824,18 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) struct cgroup *oldcgrp; struct cgroupfs_root *root = cgrp->root; + /* Singleton array, for ss->attach (see cgroup_attach_proc). 
*/ + struct flex_array *group = flex_array_alloc(sizeof(tsk), 1, GFP_KERNEL); + if (!group) + return -ENOMEM; + retval = flex_array_put_ptr(group, 0, tsk, GFP_KERNEL); + if (retval < 0) + goto out_free_array; + /* Nothing to do if the task is already in that cgroup */ oldcgrp = task_cgroup_from_root(tsk, root); if (cgrp == oldcgrp) - return 0; + goto out_free_array; for_each_subsys(root, ss) { if (ss->can_attach) { @@ -1862,7 +1870,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) if (ss->attach_task) ss->attach_task(cgrp, oldcgrp, tsk); if (ss->attach) - ss->attach(ss, cgrp, oldcgrp, tsk); + ss->attach(ss, cgrp, oldcgrp, group, 1); } synchronize_rcu(); @@ -1890,6 +1898,8 @@ out: ss->cancel_attach(ss, cgrp, tsk); } } +out_free_array: + flex_array_free(group); return retval; } @@ -2164,7 +2174,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) */ for_each_subsys(root, ss) { if (ss->attach) - ss->attach(ss, cgrp, oldcgrp, leader); + ss->attach(ss, cgrp, oldcgrp, group, group_size); } /* diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 00b3430..fce7841 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -59,6 +59,7 @@ #include <linux/mutex.h> #include <linux/workqueue.h> #include <linux/cgroup.h> +#include <linux/flex_array.h> /* * Workqueue for cpuset related tasks. @@ -1440,11 +1441,13 @@ static void cpuset_attach_task(struct cgroup *cont, struct cgroup *old, } static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, - struct cgroup *oldcont, struct task_struct *tsk) + struct cgroup *oldcont, struct flex_array *group, + int group_size) { struct mm_struct *mm; struct cpuset *cs = cgroup_cs(cont); struct cpuset *oldcs = cgroup_cs(oldcont); + int i; /* * Change mm, possibly for multiple threads in a threadgroup. 
This is @@ -1452,7 +1455,16 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, */ cpuset_attach_nodemask_from = oldcs->mems_allowed; cpuset_attach_nodemask_to = cs->mems_allowed; - mm = get_task_mm(tsk); + /* + * Find the first task in the group that still has its mm. (This could + * not be the first one if another did exec() and the leader exited.) + */ + for (i = 0; i < group_size; i++) { + struct task_struct *tsk = flex_array_get_ptr(group, i); + mm = get_task_mm(tsk); + if (mm) + break; + } if (mm) { mpol_rebind_mm(mm, &cpuset_attach_nodemask_to); if (is_memory_migrate(cs)) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6aff93c..f951a9c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include <linux/page_cgroup.h> #include <linux/cpu.h> #include <linux/oom.h> +#include <linux/flex_array.h> #include "internal.h" #include <asm/uaccess.h> @@ -5455,9 +5456,21 @@ retry: static void mem_cgroup_move_task(struct cgroup_subsys *ss, struct cgroup *cont, struct cgroup *old_cont, - struct task_struct *p) + struct flex_array *group, int group_size) { - struct mm_struct *mm = get_task_mm(p); + struct mm_struct *mm; + int i; + + /* + * Find the first task in the group that still has its mm. (This could + * not be the first one if another did exec() and the leader exited.) + */ + for (i = 0; i < group_size; i++) { + struct task_struct *tsk = flex_array_get_ptr(group, i); + mm = get_task_mm(tsk); + if (mm) + break; + } if (mm) { if (mc.to) ^ permalink raw reply related [flat|nested] 63+ messages in thread
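[A plain-array sketch of the loop this patch adds to cpuset_attach() and mem_cgroup_move_task(): scan the threadgroup snapshot and use the first task that still has an mm, rather than assuming the leader's mm is valid. The structs are simplified stand-ins; note that, unlike the kernel hunk, this version initializes mm to NULL explicitly, so an empty group cannot leave it undefined:]

```c
#include <assert.h>
#include <stddef.h>

struct mm_struct { int id; };
struct task_struct { struct mm_struct *mm; };

/* return the first non-NULL mm in the snapshot, or NULL if none */
struct mm_struct *first_group_mm(struct task_struct **group, int group_size)
{
	struct mm_struct *mm = NULL;
	int i;

	for (i = 0; i < group_size; i++) {
		mm = group[i]->mm;	/* get_task_mm() stand-in */
		if (mm)
			break;
	}
	return mm;
}
```

[With a post-exec leader (mm == NULL) first in the array and a live thread second, the scan returns the live thread's mm — exactly the case the changelog describes.]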
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 0:36 ` [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) Ben Blum @ 2011-10-14 12:21 ` Frederic Weisbecker 2011-10-14 13:53 ` Ben Blum 2011-10-19 5:43 ` Paul Menage 1 sibling, 1 reply; 63+ messages in thread From: Frederic Weisbecker @ 2011-10-14 12:21 UTC (permalink / raw) To: Ben Blum, Tejun Heo Cc: Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > From: Ben Blum <bblum@andrew.cmu.edu> > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > memcontrol) to accurately find the group's mm even when a non-leader does exec > and leaves the leader with a NULL mm pointer. > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > until an mm is found, instead of just attempting to use the leader's mm. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> I think there are patches from Tejun that already did that? https://lkml.org/lkml/2011/8/23/418 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 12:21 ` Frederic Weisbecker @ 2011-10-14 13:53 ` Ben Blum 2011-10-14 13:54 ` Ben Blum 2011-10-14 15:21 ` Frederic Weisbecker 0 siblings, 2 replies; 63+ messages in thread From: Ben Blum @ 2011-10-14 13:53 UTC (permalink / raw) To: Frederic Weisbecker Cc: Ben Blum, Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > and leaves the leader with a NULL mm pointer. > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > until an mm is found, instead of just attempting to use the leader's mm. > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > I think there are patches from Tejun that handle that did that already? > > https://lkml.org/lkml/2011/8/23/418 > Yeah, I'm not hoping to preempt them or anything; I'm just presenting this alternate approach, since (1) this way addresses the ss->attach() problem directly without needing to add locking anywhere else, while the focus of the discussion around Tejun's patches seems to have moved to more global concerns, and (2) these patches and his patches ought to be compatible with each other. Ben ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 13:53 ` Ben Blum @ 2011-10-14 13:54 ` Ben Blum 2011-10-14 15:22 ` Frederic Weisbecker 2011-10-14 15:21 ` Frederic Weisbecker 1 sibling, 1 reply; 63+ messages in thread From: Ben Blum @ 2011-10-14 13:54 UTC (permalink / raw) To: Ben Blum Cc: Frederic Weisbecker, Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 09:53:23AM -0400, Ben Blum wrote: > On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > > and leaves the leader with a NULL mm pointer. > > > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > > until an mm is found, instead of just attempting to use the leader's mm. > > > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > > > I think there are patches from Tejun that handle that did that already? > > > > https://lkml.org/lkml/2011/8/23/418 > > > > Yeah, I'm not hoping to preempt them or anything; I'm just presenting > this alternate approach, since (1) this way addresses the ss->attach() > problem directly without needing to add locking anywhere else, while the > focus of the discussion around Tejun's patches seems to have moved to > more global concerns, and (2) these patches and his patches ought to be > compatible with each other. though, I did wonder if, given this approach, I shouldn't just get rid of attach_task and can_attach_task and use the flex_array for all of the calls. 
would that be nicer? > > Ben > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 13:54 ` Ben Blum @ 2011-10-14 15:22 ` Frederic Weisbecker 2011-10-17 19:11 ` Ben Blum 0 siblings, 1 reply; 63+ messages in thread From: Frederic Weisbecker @ 2011-10-14 15:22 UTC (permalink / raw) To: Ben Blum Cc: Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 09:54:57AM -0400, Ben Blum wrote: > On Fri, Oct 14, 2011 at 09:53:23AM -0400, Ben Blum wrote: > > On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > > > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > > > and leaves the leader with a NULL mm pointer. > > > > > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > > > until an mm is found, instead of just attempting to use the leader's mm. > > > > > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > > > > > I think there are patches from Tejun that handle that did that already? > > > > > > https://lkml.org/lkml/2011/8/23/418 > > > > > > > Yeah, I'm not hoping to preempt them or anything; I'm just presenting > > this alternate approach, since (1) this way addresses the ss->attach() > > problem directly without needing to add locking anywhere else, while the > > focus of the discussion around Tejun's patches seems to have moved to > > more global concerns, and (2) these patches and his patches ought to be > > compatible with each other. 
> > though, I did wonder if, given this approach, I shouldn't just get rid > of attach_task and can_attach_task and use the flex_array for all of > the calls. would that be nicer? That's what Tejun's patches do. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 15:22 ` Frederic Weisbecker @ 2011-10-17 19:11 ` Ben Blum 0 siblings, 0 replies; 63+ messages in thread From: Ben Blum @ 2011-10-17 19:11 UTC (permalink / raw) To: Frederic Weisbecker Cc: Ben Blum, Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 05:22:20PM +0200, Frederic Weisbecker wrote: > On Fri, Oct 14, 2011 at 09:54:57AM -0400, Ben Blum wrote: > > On Fri, Oct 14, 2011 at 09:53:23AM -0400, Ben Blum wrote: > > > On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > > > > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > > > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > > > > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > > > > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > > > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > > > > and leaves the leader with a NULL mm pointer. > > > > > > > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > > > > until an mm is found, instead of just attempting to use the leader's mm. > > > > > > > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > > > > > > > I think there are patches from Tejun that handle that did that already? 
> > > > > > > > https://lkml.org/lkml/2011/8/23/418 > > > > > > > > > > Yeah, I'm not hoping to preempt them or anything; I'm just presenting > > > this alternate approach, since (1) this way addresses the ss->attach() > > > problem directly without needing to add locking anywhere else, while the > > > focus of the discussion around Tejun's patches seems to have moved to > > > more global concerns, and (2) these patches and his patches ought to be > > > compatible with each other. > > > > though, I did wonder if, given this approach, I shouldn't just get rid > > of attach_task and can_attach_task and use the flex_array for all of > > the calls. would that be nicer? > > That's what Tejun's patches do. > Oh, I must have missed that bit. In which case, sorry for the noise, and I support Tejun's change. -- Ben ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 13:53 ` Ben Blum 2011-10-14 13:54 ` Ben Blum @ 2011-10-14 15:21 ` Frederic Weisbecker 1 sibling, 0 replies; 63+ messages in thread From: Frederic Weisbecker @ 2011-10-14 15:21 UTC (permalink / raw) To: Ben Blum Cc: Tejun Heo, Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Paul Menage, Li Zefan, containers, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Fri, Oct 14, 2011 at 09:53:23AM -0400, Ben Blum wrote: > On Fri, Oct 14, 2011 at 02:21:30PM +0200, Frederic Weisbecker wrote: > > On Thu, Oct 13, 2011 at 08:36:01PM -0400, Ben Blum wrote: > > > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > > > > > From: Ben Blum <bblum@andrew.cmu.edu> > > > > > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > > > memcontrol) to accurately find the group's mm even when a non-leader does exec > > > and leaves the leader with a NULL mm pointer. > > > > > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > > > until an mm is found, instead of just attempting to use the leader's mm. > > > > > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> > > > > I think there are patches from Tejun that handle that did that already? > > > > https://lkml.org/lkml/2011/8/23/418 > > > > Yeah, I'm not hoping to preempt them or anything; I'm just presenting > this alternate approach, since (1) this way addresses the ss->attach() > problem directly without needing to add locking anywhere else You basically have the same approach, the locking involved in Tejun's patches seem to address a different issue that you would need to address as well anyway. > , while the > focus of the discussion around Tejun's patches seems to have moved to > more global concerns, and (2) these patches and his patches ought to be > compatible with each other. 
But Tejun's patches seem to address not only your issue but even more. I don't understand the point of this lighter version. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) 2011-10-14 0:36 ` [PATCH 2/2] cgroups: convert ss->attach to use whole threadgroup flex_array (cpuset, memcontrol) Ben Blum 2011-10-14 12:21 ` Frederic Weisbecker @ 2011-10-19 5:43 ` Paul Menage 1 sibling, 0 replies; 63+ messages in thread From: Paul Menage @ 2011-10-19 5:43 UTC (permalink / raw) To: Ben Blum Cc: Andrew Morton, Oleg Nesterov, linux-kernel, NeilBrown, paulmck, Li Zefan, containers, Frederic Weisbecker, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki On Thu, Oct 13, 2011 at 5:36 PM, Ben Blum <bblum@andrew.cmu.edu> wrote: > Convert ss->attach to take a flex_array of tasks instead of just the leader. > > From: Ben Blum <bblum@andrew.cmu.edu> > > This lets subsystems with whole-threadgroup attach calls (i.e., cpuset and > memcontrol) to accurately find the group's mm even when a non-leader does exec > and leaves the leader with a NULL mm pointer. > > Also converts cpuset and memcontrol to take the flex_array and iterate down it > until an mm is found, instead of just attempting to use the leader's mm. > > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Reviewed-by: Paul Menage <paul@paulmenage.org> Thanks, Paul > --- > Documentation/cgroups/cgroups.txt | 7 ++++++- > include/linux/cgroup.h | 4 +++- > kernel/cgroup.c | 16 +++++++++++++--- > kernel/cpuset.c | 16 ++++++++++++++-- > mm/memcontrol.c | 17 +++++++++++++++-- > 5 files changed, 51 insertions(+), 9 deletions(-) > > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt > index 3fa646f..8e900ec 100644 > --- a/Documentation/cgroups/cgroups.txt > +++ b/Documentation/cgroups/cgroups.txt > @@ -637,12 +637,17 @@ For any non-per-thread attachment work that needs to happen before > attach_task. Needed by cpuset. 
> > void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, > - struct cgroup *old_cgrp, struct task_struct *task) > + struct cgroup *old_cgrp, struct flex_array *group, > + int group_size) Maybe flex_array should have a size field, so that you can just pass a single pointer here? > @@ -1452,7 +1455,16 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont, > */ > cpuset_attach_nodemask_from = oldcs->mems_allowed; > cpuset_attach_nodemask_to = cs->mems_allowed; > - mm = get_task_mm(tsk); > + /* > + * Find the first task in the group that still has its mm. (This could > + * be not the first one if another did exec() and the leader exited. Maybe rephrase this as "This might not be the first one"? Paul ^ permalink raw reply [flat|nested] 63+ messages in thread
[parent not found: <20110728062616.GC15204-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org>]
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110728062616.GC15204-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-07-28 7:13 ` NeilBrown 0 siblings, 0 replies; 63+ messages in thread From: NeilBrown @ 2011-07-28 7:13 UTC (permalink / raw) To: Ben Blum Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Oleg Nesterov On Thu, 28 Jul 2011 02:26:16 -0400 Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote: > On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > > <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > > > [ . . . ] > > > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > > > tsk->group_leader = tsk; > > > > > leader->group_leader = tsk; > > > > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > > be held in the cgroup code. > > > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > > this chunk is run with the same value for 'leader', but the > > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > > forever. rcu_read_lock doesn't prevent this from happening. 
> > > > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > > writing) provided exclusion from code that holds rcu_read_lock - > > > > probably because there are other points in the kernel which do > > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > > > This one looks OK to me. The code is just referencing fields in each > > > of the task structures, and appears to be making proper use of > > > rcu_dereference(). All this code requires is that the task structures > > > remain in existence through the full lifetime of the RCU read-side > > > critical section, which is guaranteed because of the way the task_struct > > > is freed. > > > > I disagree. It also requires - by virtue of the use of while_each_thread() - > > that 'g' remains on the list that 't' is walking along. > > > > Now for a normal list, the head always stays on the list and is accessible > > even from an rcu-removed entry. But the thread_group list isn't a normal > > list. It doesn't have a distinct head. It is a loop of all of the > > 'task_structs' in a thread group. One of them is designated the 'leader' but > > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > > > __unhash_process in mm/exit.c looks like it could remove the leader from the > > list and definitely could remove a non-leader. > > > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > > task_struct that was the leader could become a non-leader and then be removed > > from the list that kernel/hung_task could be walking along. > > That agrees with my understanding. > > > > > So I don't think that while_each_thread() is currently safe. It depends on > > the thread leader not disappearing and I think it can. > > I think that while_each_thread is perfectly safe, it just needs to be > protected properly while used. 
it reads the tasklist, and both competing > paths (__unhash_process and de_thread) are done with tasklist_lock write > locked, so read-locking ought to suffice. all it needs is to be better > documented. That might be one answer. However the fact that rcu primitives are used to walk the list (next_thread() contains list_entry_rcu) seems to suggest that the intention was that RCU could be used for read access - which in large part it can. So we have two alternatives: 1/ take a reader spinlock to walk the thread_group list 2/ make RCU work safely for walking thread_group I thought it was best to try the latter first. > > > [...] > > > > +/* Thread group leader can change, so stop loop when we see one > > + * even if it isn't 'g' */ > > #define while_each_thread(g, t) \ > > - while ((t = next_thread(t)) != g) > > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) > > this is semantically wrong: it will stop as soon as it finds a thread > that has newly become the leader, and not run the loop body code in that > thread's case. so the thread that just execed would not get run on, and > in the case of my code, would "escape" the cgroup migration. Yes, you are right ... For your case I think you do want the thread group list to be both safe and stable. RCU can only provide 'safe' for a list. So I think I'm convinced that you need a readlock on tasklist_lock. That doesn't mean that all users of while_each_thread do. > > but I argue it is also organisationally wrong. while_each_thread's > purpose is just to worry about the structure of the process list, not to > account for behavioural details of de_thread. this check belongs outside > of the macro, and it should be protected by tasklist_lock in the same > critical section in which while_each_thread is used. Sound fair. Maybe we need two versions of while_each_thread(). One that assumes tasklist_lock and guarantees getting every thread, and one that is less precise but only needs rcu_read_lock(). 
That would require an audit of each call site to see what is really needed. But yes: if you replace your rcu_read_lock()s with read_lock(&tasklist_lock) then you will be safe, and I don't think you can do any better (i.e. less locking) than that. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-28 1:08 ` NeilBrown 2011-07-28 6:26 ` Ben Blum @ 2011-07-28 12:17 ` Paul E. McKenney [not found] ` <20110728121741.GB2427-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 2011-08-14 17:51 ` Oleg Nesterov [not found] ` <20110728110813.7ff84b13-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-08-14 17:45 ` Oleg Nesterov 3 siblings, 2 replies; 63+ messages in thread From: Paul E. McKenney @ 2011-07-28 12:17 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, Paul Menage, Li Zefan, Oleg Nesterov, containers, linux-kernel On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > <paulmck@linux.vnet.ibm.com> wrote: > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > [ . . . ] > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > tsk->group_leader = tsk; > > > > leader->group_leader = tsk; > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > be held in the cgroup code. > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > this chunk is run with the same value for 'leader', but the > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > forever. rcu_read_lock doesn't prevent this from happening. 
> > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > writing) provided exclusion from code that holds rcu_read_lock - > > > probably because there are other points in the kernel which do > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > This one looks OK to me. The code is just referencing fields in each > > of the task structures, and appears to be making proper use of > > rcu_dereference(). All this code requires is that the task structures > > remain in existence through the full lifetime of the RCU read-side > > critical section, which is guaranteed because of the way the task_struct > > is freed. > > I disagree. It also requires - by virtue of the use of while_each_thread() - > that 'g' remains on the list that 't' is walking along. Doesn't the following code in the loop body deal with this possibilty? /* Exit if t or g was unhashed during refresh. */ if (t->state == TASK_DEAD || g->state == TASK_DEAD) goto unlock; Yes, a concurrent dethread could cause some of the tasks to be skipped, but there really is a hung thread, it will still be there to be caught next time, right? Thanx, Paul > Now for a normal list, the head always stays on the list and is accessible > even from an rcu-removed entry. But the thread_group list isn't a normal > list. It doesn't have a distinct head. It is a loop of all of the > 'task_structs' in a thread group. One of them is designated the 'leader' but > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > __unhash_process in mm/exit.c looks like it could remove the leader from the > list and definitely could remove a non-leader. > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > task_struct that was the leader could become a non-leader and then be removed > from the list that kernel/hung_task could be walking along. 
> > So I don't think that while_each_thread() is currently safe. It depends on > the thread leader not disappearing and I think it can. > > So I'm imagining a patch like this to ensure that while_each_thread() is > actually safe. If it is always safe you can remove that extra check in > cgroup_attach_proc() which looked wrong. > > I just hope someone who understands the process tree is listening.. > The change in exit.c is the most uncertain part. > > NeilBrown > > diff --git a/fs/exec.c b/fs/exec.c > index 6075a1e..c9ea5f0 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -960,6 +960,9 @@ static int de_thread(struct task_struct *tsk) > list_replace_init(&leader->sibling, &tsk->sibling); > > tsk->group_leader = tsk; > + smp_mb(); /* ensure that any reader will always be able to see > + * a task that claims to be the group leader > + */ > leader->group_leader = tsk; > > tsk->exit_signal = SIGCHLD; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 14a6c7b..13e0192 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2267,8 +2267,10 @@ extern bool current_is_single_threaded(void); > #define do_each_thread(g, t) \ > for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do > > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) > > static inline int get_nr_threads(struct task_struct *tsk) > { > diff --git a/kernel/exit.c b/kernel/exit.c > index f2b321b..d6cef25 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. 
> + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); > } > > /* > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-07-28 12:17 ` Paul E. McKenney [not found] ` <20110728121741.GB2427-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2011-08-14 17:51 ` Oleg Nesterov [not found] ` <20110814175119.GC2381-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (2 more replies) 1 sibling, 3 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-08-14 17:51 UTC (permalink / raw) To: Paul E. McKenney Cc: NeilBrown, Ben Blum, Paul Menage, Li Zefan, containers, linux-kernel On 07/28, Paul E. McKenney wrote: > > On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > > > > I disagree. It also requires - by virtue of the use of while_each_thread() - > > that 'g' remains on the list that 't' is walking along. > > Doesn't the following code in the loop body deal with this possibilty? > > /* Exit if t or g was unhashed during refresh. */ > if (t->state == TASK_DEAD || g->state == TASK_DEAD) > goto unlock; This code is completely wrong even if while_each_thread() was fine. I sent the patch but it was ignored. [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() http://marc.info/?l=linux-kernel&m=127688790019041 Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. 2011-08-14 17:51 ` Oleg Nesterov [not found] ` <20110814175119.GC2381-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2011-08-14 23:58 ` NeilBrown 2011-08-15 18:01 ` Paul E. McKenney 2 siblings, 0 replies; 63+ messages in thread From: NeilBrown @ 2011-08-14 23:58 UTC (permalink / raw) To: Oleg Nesterov Cc: Paul E. McKenney, Ben Blum, Paul Menage, Li Zefan, containers, linux-kernel On Sun, 14 Aug 2011 19:51:19 +0200 Oleg Nesterov <oleg@redhat.com> wrote: > On 07/28, Paul E. McKenney wrote: > > > > On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > > > > > > I disagree. It also requires - by virtue of the use of while_each_thread() - > > > that 'g' remains on the list that 't' is walking along. > > > > Doesn't the following code in the loop body deal with this possibilty? > > > > /* Exit if t or g was unhashed during refresh. */ > > if (t->state == TASK_DEAD || g->state == TASK_DEAD) > > goto unlock; > > This code is completely wrong even if while_each_thread() was fine. > > I sent the patch but it was ignored. > > [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() > http://marc.info/?l=linux-kernel&m=127688790019041 > > Oleg. I agree with that patch. RCU only protects a task_struct until release_task() is called (which removes it from the task list). So holding rcu_lock doesn't stop put_task_struct from freeing the memory unless we *know* that release_task hasn't been called. This is exactly what pid_alive() tests. I must say that handling of task_struct seems to violate the law of least surprise a little too often for my taste. Maybe it is just a difficult problem and it needs a complex solution - but it would be really nice if it were a bit simpler :-( NeilBrown ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110728110813.7ff84b13-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2011-07-28 6:26 ` Ben Blum 2011-07-28 12:17 ` Paul E. McKenney 2011-08-14 17:45 ` Oleg Nesterov 2 siblings, 0 replies; 63+ messages in thread From: Ben Blum @ 2011-07-28 6:26 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Oleg Nesterov On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > [ . . . ] > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > tsk->group_leader = tsk; > > > > leader->group_leader = tsk; > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > be held in the cgroup code. > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > this chunk is run with the same value for 'leader', but the > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > forever. rcu_read_lock doesn't prevent this from happening. 
> > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > writing) provided exclusion from code that holds rcu_read_lock - > > > probably because there are other points in the kernel which do > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > This one looks OK to me. The code is just referencing fields in each > > of the task structures, and appears to be making proper use of > > rcu_dereference(). All this code requires is that the task structures > > remain in existence through the full lifetime of the RCU read-side > > critical section, which is guaranteed because of the way the task_struct > > is freed. > > I disagree. It also requires - by virtue of the use of while_each_thread() - > that 'g' remains on the list that 't' is walking along. > > Now for a normal list, the head always stays on the list and is accessible > even from an rcu-removed entry. But the thread_group list isn't a normal > list. It doesn't have a distinct head. It is a loop of all of the > 'task_structs' in a thread group. One of them is designated the 'leader' but > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > __unhash_process in mm/exit.c looks like it could remove the leader from the > list and definitely could remove a non-leader. > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > task_struct that was the leader could become a non-leader and then be removed > from the list that kernel/hung_task could be walking along. That agrees with my understanding. > > So I don't think that while_each_thread() is currently safe. It depends on > the thread leader not disappearing and I think it can. I think that while_each_thread is perfectly safe, it just needs to be protected properly while used. 
it reads the tasklist, and both competing paths (__unhash_process and de_thread) are done with tasklist_lock write locked, so read-locking ought to suffice. all it needs is to be better documented. > [...] > > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) this is semantically wrong: it will stop as soon as it finds a thread that has newly become the leader, and not run the loop body code in that thread's case. so the thread that just execed would not get run on, and in the case of my code, would "escape" the cgroup migration. but I argue it is also organisationally wrong. while_each_thread's purpose is just to worry about the structure of the process list, not to account for behavioural details of de_thread. this check belongs outside of the macro, and it should be protected by tasklist_lock in the same critical section in which while_each_thread is used. -- Ben > > static inline int get_nr_threads(struct task_struct *tsk) > { > diff --git a/kernel/exit.c b/kernel/exit.c > index f2b321b..d6cef25 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. > + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); > } > > /* > > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110728110813.7ff84b13-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-07-28 6:26 ` Ben Blum @ 2011-07-28 12:17 ` Paul E. McKenney 2011-08-14 17:45 ` Oleg Nesterov 2 siblings, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2011-07-28 12:17 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, Oleg Nesterov On Thu, Jul 28, 2011 at 11:08:13AM +1000, NeilBrown wrote: > On Wed, 27 Jul 2011 16:42:35 -0700 "Paul E. McKenney" > <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > > On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > > > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: > > > > [ . . . ] > > > > > > The race as I understand it is with this code: > > > > > > > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > > > > > tsk->group_leader = tsk; > > > > leader->group_leader = tsk; > > > > > > > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > > > be held in the cgroup code. > > > > > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > > > this chunk is run with the same value for 'leader', but the > > > > while_each_thread is run after, then the while_read_thread() might loop > > > > forever. rcu_read_lock doesn't prevent this from happening. > > > > > > Somehow I was under the impression that holding tasklist_lock (for > > > writing) provided exclusion from code that holds rcu_read_lock - > > > probably because there are other points in the kernel which do > > > while_each_thread with only RCU-read held (and not tasklist): > > > > > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() > > > > This one looks OK to me. 
The code is just referencing fields in each > > of the task structures, and appears to be making proper use of > > rcu_dereference(). All this code requires is that the task structures > > remain in existence through the full lifetime of the RCU read-side > > critical section, which is guaranteed because of the way the task_struct > > is freed. > > I disagree. It also requires - by virtue of the use of while_each_thread() - > that 'g' remains on the list that 't' is walking along. Doesn't the following code in the loop body deal with this possibility? /* Exit if t or g was unhashed during refresh. */ if (t->state == TASK_DEAD || g->state == TASK_DEAD) goto unlock; Yes, a concurrent dethread could cause some of the tasks to be skipped, but if there really is a hung thread, it will still be there to be caught next time, right? Thanx, Paul > Now for a normal list, the head always stays on the list and is accessible > even from an rcu-removed entry. But the thread_group list isn't a normal > list. It doesn't have a distinct head. It is a loop of all of the > 'task_structs' in a thread group. One of them is designated the 'leader' but > de_thread() can change the 'leader' - though it doesn't remove the old leader. > > __unhash_process in mm/exit.c looks like it could remove the leader from the > list and definitely could remove a non-leader. > > So if a non-leader calls 'exec' and the leader calls 'exit', then a > task_struct that was the leader could become a non-leader and then be removed > from the list that kernel/hung_task could be walking along. > > So I don't think that while_each_thread() is currently safe. It depends on > the thread leader not disappearing and I think it can. > > So I'm imagining a patch like this to ensure that while_each_thread() is > actually safe. If it is always safe you can remove that extra check in > cgroup_attach_proc() which looked wrong. > > I just hope someone who understands the process tree is listening..
> The change in exit.c is the most uncertain part. > > NeilBrown > > diff --git a/fs/exec.c b/fs/exec.c > index 6075a1e..c9ea5f0 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -960,6 +960,9 @@ static int de_thread(struct task_struct *tsk) > list_replace_init(&leader->sibling, &tsk->sibling); > > tsk->group_leader = tsk; > + smp_mb(); /* ensure that any reader will always be able to see > + * a task that claims to be the group leader > + */ > leader->group_leader = tsk; > > tsk->exit_signal = SIGCHLD; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 14a6c7b..13e0192 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2267,8 +2267,10 @@ extern bool current_is_single_threaded(void); > #define do_each_thread(g, t) \ > for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do > > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) > > static inline int get_nr_threads(struct task_struct *tsk) > { > diff --git a/kernel/exit.c b/kernel/exit.c > index f2b321b..d6cef25 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. > + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); > } > > /* > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110728110813.7ff84b13-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-07-28 6:26 ` Ben Blum 2011-07-28 12:17 ` Paul E. McKenney @ 2011-08-14 17:45 ` Oleg Nesterov 2 siblings, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-08-14 17:45 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 On 07/28, NeilBrown wrote: > > +/* Thread group leader can change, so stop loop when we see one > + * even if it isn't 'g' */ > #define while_each_thread(g, t) \ > - while ((t = next_thread(t)) != g) > + while ((t = next_thread(t)) != g && !thread_group_leader(t)) This assumes that while_each_thread(g, t) always uses g == group_leader. > @@ -70,8 +70,13 @@ static void __unhash_process(struct task_struct *p, bool group_dead) > list_del_rcu(&p->tasks); > list_del_init(&p->sibling); > __this_cpu_dec(process_counts); > - } > - list_del_rcu(&p->thread_group); > + } else > + /* only remove members from the thread group. > + * The thread group leader must stay so that > + * while_each_thread() uses can see the end of > + * the list and stop. > + */ > + list_del_rcu(&p->thread_group); Hmm... this looks obviously wrong. This leaves the dead thread on the list. Oleg. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110727150710.GB5242-japSPQJXeIlCM1neWV3AGuCmf2DRS9x2@public.gmane.org> @ 2011-07-27 23:42 ` Paul E. McKenney 0 siblings, 0 replies; 63+ messages in thread From: Paul E. McKenney @ 2011-07-27 23:42 UTC (permalink / raw) To: Ben Blum Cc: NeilBrown, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, Oleg Nesterov On Wed, Jul 27, 2011 at 11:07:10AM -0400, Ben Blum wrote: > On Wed, Jul 27, 2011 at 05:11:01PM +1000, NeilBrown wrote: [ . . . ] > > The race as I understand it is with this code: > > > > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > > list_replace_init(&leader->sibling, &tsk->sibling); > > > > tsk->group_leader = tsk; > > leader->group_leader = tsk; > > > > > > which seems to be called with only tasklist_lock held, which doesn't seem to > > be held in the cgroup code. > > > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > > this chunk is run with the same value for 'leader', but the > > while_each_thread is run after, then the while_read_thread() might loop > > forever. rcu_read_lock doesn't prevent this from happening. > > Somehow I was under the impression that holding tasklist_lock (for > writing) provided exclusion from code that holds rcu_read_lock - > probably because there are other points in the kernel which do > while_each_thread with only RCU-read held (and not tasklist): > > - kernel/hung_task.c, check_hung_uninterruptible_tasks() This one looks OK to me. The code is just referencing fields in each of the task structures, and appears to be making proper use of rcu_dereference(). All this code requires is that the task structures remain in existence through the full lifetime of the RCU read-side critical section, which is guaranteed because of the way the task_struct is freed. 
> - kernel/posix-cpu-timers.c, thread_group_cputime() Same story here. The code only needs the task_struct structures to stick around long enough to compute the CPU time consumed, so this should be OK. > - fs/ioprio.c, ioprio_set() and ioprio_get() Here RCU protects traversal to the task structure as in the two examples above. In the ioprio_set() case, the actual change is carried out under task_lock(). So this code relies on RCU to safely locate the task structure, and on locking to safely carry out the update. The ioprio_get() is using RCU to protect finding the task structure. This returns an integer (the I/O priority), so no locking is required. Though if compilers continue becoming increasingly aggressive with their optimizations, there might need to be an ACCESS_ONCE() in get_task_ioprio(). > (There are also places, like kernel/signal.c, where code does > while_each_thread with only sighand->siglock held. this also seems > sketchy, since de_thread only takes that lock after the code quoted > above. there's a big comment in fs/exec.c where this is also done, but I > don't quite understand it.) > > You seem to imply that rcu_read_lock() doesn't exclude against > write_lock(&tasklist_lock). If that's true, then we can fix the cgroup > code simply by replacing rcu_read_lock/rcu_read_unlock with > read_lock and read_unlocck on tasklist_lock. (I can hurry a bugfix patch > for this together if so.) Indeed, rcu_read_lock() does not exclude write_lock(&tasklist_lock). In fact, it doesn't exclude against -any- lock. But don't take my word for it, instead take a look at the heaviest-weight implementation of rcu_read_lock() that currently exists in mainline: void __rcu_read_lock(void) { current->rcu_read_lock_nesting++; barrier(); } That really is all there is to it, at least in the absence of CONFIG_PROVE_RCU. 
The first line does nothing more than increment a counter in the current task's task_struct, while the second line does nothing more than suppress code-motion optimizations that the compiler might otherwise be tempted to undertake. The purpose of rcu_read_lock() is instead to prevent any subsequent synchronize_rcu() calls (on any CPU or from any task) from returning until this task executes the matching rcu_read_unlock(). If you make careful use of rcu_read_lock(), synchronize_rcu(), and friends, you can get an effect that in some ways resembles mutual exclusion, but that is much faster and more scalable. In the case of the task structure, rcu_read_lock() guarantees that any task_struct that you can get a pointer to will remain in existence until you execute the matching rcu_read_unlock(). By "remain in existence", I mean that the task_struct won't be freed (and possibly reused). So rcu_read_lock() is giving you the following guarantee: If you legitimately pick up a pointer to an RCU-protected data structure, then that data structure will not be freed until after you execute the matching rcu_read_unlock(). > Wouldn't this mean that the three places listed above are also wrong? I do not believe so, give or take the ACCESS_ONCE(). The issue in your case is that you are relying not just on the process to exist, but also on the state of group leadership. RCU protects the former, but not the latter. I suspect that your suggested bugfix patch would do the trick, but I must defer to people who understand tasklist locking better than I do. Thanx, Paul > > The code in de_thread() is actually questionable by itself. > > "list_replace_rcu" cannot really be used on the head of a list - it is only > > meant to be used on a member of a list. > > To move a list from one head to another you should be using > > list_splice_init_rcu().
> > The ->tasks list doesn't seem to have a clearly distinguished 'head' but > > whatever is passed as 'g' to while_each_thread() is effectively a head and > > removing it from a list can cause a loop using while_each_thread() can not > > find the head and so never complete. > > > > I' not sure how best to fix this, though possibly changing > > while_each_thead to: > > > > while ((t = next_task(t)) != g && !thread_group_leader(t)) > > > > might be part of it. We would also need to move > > tsk->group_leader = tsk; > > in the above up to the top, and probably add some memory barrier. > > However I don't know enough about how the list is used to be sure. > > > > Comments? > > > > Thanks, > > NeilBrown > > > > > > I barely understand de_thread() from the reader's perspective, let alone > from the author's perspective, so I can't speak for that one. > > Thanks for pointing this out! > > -- Ben ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Possible race between cgroup_attach_proc and de_thread, and questionable code in de_thread. [not found] ` <20110727171101.5e32d8eb-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2011-07-27 15:07 ` Ben Blum @ 2011-08-14 17:40 ` Oleg Nesterov 1 sibling, 0 replies; 63+ messages in thread From: Oleg Nesterov @ 2011-08-14 17:40 UTC (permalink / raw) To: NeilBrown Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Menage, Paul E. McKenney Sorry for the delay, just noticed this thread... On 07/27, NeilBrown wrote: > > The race as I understand it is with this code: > > > list_replace_rcu(&leader->tasks, &tsk->tasks); > list_replace_init(&leader->sibling, &tsk->sibling); > > tsk->group_leader = tsk; > leader->group_leader = tsk; > > > which seems to be called with only tasklist_lock held, which doesn't seem to > be held in the cgroup code. > > If the "thread_group_leader(leader)" call in cgroup_attach_proc() runs before > this chunk is run with the same value for 'leader', but the > while_each_thread is run after, then the while_read_thread() might loop > forever. rcu_read_lock doesn't prevent this from happening. Yes. This was already discussed. See http://marc.info/?t=127688987300002 Damn. I forgot about this completely. > The code in de_thread() is actually questionable by itself. > "list_replace_rcu" cannot really be used on the head of a list - it is only > meant to be used on a member of a list. > To move a list from one head to another you should be using > list_splice_init_rcu(). Hmm... can't understand this part. And just in case... list_replace_rcu() looks fine afaics. The real problem is release_task(old_leader) which does list_del_rcu(old_leader->thread_group); this is what breaks while_each_thread(). > The ->tasks list doesn't seem to have a clearly distinguished 'head' Exactly. This is the problem. But: you seem to have confused ->tasks and ->thread_group ;) Oleg.
^ permalink raw reply [flat|nested] 63+ messages in thread