* [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic @ 2010-06-18 19:02 Oleg Nesterov 2010-06-18 19:34 ` while_each_thread() under rcu_read_lock() is broken? Oleg Nesterov ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-18 19:02 UTC (permalink / raw) To: Andrew Morton Cc: Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by "softlockup: check all tasks in hung_task" commit ce9dbe24 looks absolutely wrong. - rcu_lock_break() does put_task_struct(). If the task has exited it is not safe to even read its ->state, nothing protects this task_struct. - The TASK_DEAD checks are wrong too. Contrary to the comment, we can't use it to check if the task was unhashed. It can be unhashed without TASK_DEAD, or it can be valid with TASK_DEAD. For example, an autoreaping task can do release_task(current) long before it sets TASK_DEAD in do_exit(). Or, a zombie task can have ->state == TASK_DEAD but release_task() was not called, and in this case we must not break the loop. Change this code to check pid_alive() instead, and do this before we drop the reference to the task_struct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/hung_task.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK 2009-12-18 19:05:38.000000000 +0100 +++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200 @@ -113,15 +113,20 @@ static void check_hung_task(struct task_ * For preemptible RCU it is sufficient to call rcu_read_unlock in order * exit the grace period. For classic RCU, a reschedule is required. 
*/ -static void rcu_lock_break(struct task_struct *g, struct task_struct *t) +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t) { + bool can_cont; + get_task_struct(g); get_task_struct(t); rcu_read_unlock(); cond_resched(); rcu_read_lock(); + can_cont = pid_alive(g) && pid_alive(t); put_task_struct(t); put_task_struct(g); + + return can_cont; } /* @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t goto unlock; if (!--batch_count) { batch_count = HUNG_TASK_BATCHING; - rcu_lock_break(g, t); - /* Exit if t or g was unhashed during refresh. */ - if (t->state == TASK_DEAD || g->state == TASK_DEAD) + if (!rcu_lock_break(g, t)) goto unlock; } /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ ^ permalink raw reply [flat|nested] 45+ messages in thread
* while_each_thread() under rcu_read_lock() is broken? 2010-06-18 19:02 [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Oleg Nesterov @ 2010-06-18 19:34 ` Oleg Nesterov 2010-06-18 21:08 ` Roland McGrath ` (2 more replies) 2010-06-18 20:11 ` [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Frederic Weisbecker 2010-06-18 20:38 ` Mandeep Singh Baines 2 siblings, 3 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-18 19:34 UTC (permalink / raw) To: Andrew Morton Cc: Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman, Paul E. McKenney (add cc's) Hmm. Once I sent this patch, I suddenly realized with horror that while_each_thread() is NOT safe under rcu_read_lock(). Both do_each_thread/while_each_thread or do/while_each_thread() can race with exec(). Yes, it is safe to do next_thread() or next_task(). But: #define while_each_thread(g, t) \ while ((t = next_thread(t)) != g) suppose that t is not the group leader, and it does de_thread() and then release_task(g). After that next_thread(t) returns t, not g, and the loop will never stop. I _really_ hope I missed something, will recheck tomorrow with the fresh head. Still I'd like to share my concerns... If I am right, probably we can fix this, something like #define while_each_thread(g, t) \ while ((t = next_thread(t)) != g && pid_alive(g)) [we can't do while (!thread_group_leader(t = next_thread(t)))], but this needs barriers, and we should validate the callers anyway. Or, perhaps, #define XXX(t) ({ struct task_struct *__prev = t; t = next_thread(t); t != g && t != __prev; }) #define while_each_thread(g, t) \ while (XXX(t)) Please tell me I am wrong! Oleg. On 06/18, Oleg Nesterov wrote: > > check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by > "softlockup: check all tasks in hung_task" commit ce9dbe24 looks > absolutely wrong. 
> > - rcu_lock_break() does put_task_struct(). If the task has exited > it is not safe to even read its ->state, nothing protects this > task_struct. > > - The TASK_DEAD checks are wrong too. Contrary to the comment, we > can't use it to check if the task was unhashed. It can be unhashed > without TASK_DEAD, or it can be valid with TASK_DEAD. > > For example, an autoreaping task can do release_task(current) > long before it sets TASK_DEAD in do_exit(). > > Or, a zombie task can have ->state == TASK_DEAD but release_task() > was not called, and in this case we must not break the loop. > > Change this code to check pid_alive() instead, and do this before we > drop the reference to the task_struct. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > > kernel/hung_task.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) > > --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK 2009-12-18 19:05:38.000000000 +0100 > +++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200 > @@ -113,15 +113,20 @@ static void check_hung_task(struct task_ > * For preemptible RCU it is sufficient to call rcu_read_unlock in order > * exit the grace period. For classic RCU, a reschedule is required. > */ > -static void rcu_lock_break(struct task_struct *g, struct task_struct *t) > +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t) > { > + bool can_cont; > + > get_task_struct(g); > get_task_struct(t); > rcu_read_unlock(); > cond_resched(); > rcu_read_lock(); > + can_cont = pid_alive(g) && pid_alive(t); > put_task_struct(t); > put_task_struct(g); > + > + return can_cont; > } > > /* > @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t > goto unlock; > if (!--batch_count) { > batch_count = HUNG_TASK_BATCHING; > - rcu_lock_break(g, t); > - /* Exit if t or g was unhashed during refresh. 
*/ > - if (t->state == TASK_DEAD || g->state == TASK_DEAD) > + if (!rcu_lock_break(g, t)) > goto unlock; > } > /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-18 19:34 ` while_each_thread() under rcu_read_lock() is broken? Oleg Nesterov @ 2010-06-18 21:08 ` Roland McGrath 2010-06-18 22:37 ` Oleg Nesterov 2010-06-18 22:33 ` Paul E. McKenney 2010-06-19 5:00 ` Mandeep Baines 2 siblings, 1 reply; 45+ messages in thread From: Roland McGrath @ 2010-06-18 21:08 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman, Paul E. McKenney I think you're right. I can't see what would prevent that race. So for_each_process and do_each_thread are safe only under read_lock(&tasklist_lock) and while_each_thread is only safe under either that or siglock. (Also a few places using next_thread in similar loops outside those macros.) Perhaps we could move those del's from __unhash_process to __put_task_struct (or just delayed_put_task_struct?) and then they wouldn't need to be rculist.h calls after all. But we would need a careful audit to figure out the assumptions about being on the list meaning not reaped yet. I think de_thread() in exec-by-nonleader is the only case where this can happen, right? So then perhaps we could make it call release_task only via call_rcu? Thanks, Roland ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-18 21:08 ` Roland McGrath @ 2010-06-18 22:37 ` Oleg Nesterov 0 siblings, 0 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-18 22:37 UTC (permalink / raw) To: Roland McGrath Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman, Paul E. McKenney On 06/18, Roland McGrath wrote: > > I think you're right. I can't see what would prevent that race. How sad. > So for_each_process for_each_process() looks fine. It uses init_task as the anchor, it can't go away, it is swapper. > and do_each_thread are safe only under > read_lock(&tasklist_lock) and while_each_thread is only safe under > either that or siglock. Yes. > (Also a few places using next_thread in > similar loops outside those macros.) I hope that most (all?) of next_thread() users can be converted to use while_each_thread(). > Perhaps we could move those del's from __unhash_process to > __put_task_struct (or just delayed_put_task_struct?) This needs write_lock_irq(tasklist), we can't take it in atomic context. And I bet this change (at least right now) has other implications. > I think de_thread() in exec-by-nonleader is the only case where this > can happen, right? So then perhaps we could make it call release_task > only via call_rcu? Hmm, perhaps... I am already sleeping, will try to check this idea tomorrow. At first glance, it looks promising to me. And I see the email from Paul which is too late to read for me today ;) In any case, I _think_ we can fix while_each_thread(), say XXX(t) from the previous email. But then we should audit the users like zap_threads() which assume we should not miss any "interesting" task. Probably zap_threads() is fine because of mmap_sem, but I can't think properly now. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-18 19:34 ` while_each_thread() under rcu_read_lock() is broken? Oleg Nesterov 2010-06-18 21:08 ` Roland McGrath @ 2010-06-18 22:33 ` Paul E. McKenney 2010-06-21 17:09 ` Oleg Nesterov 2010-06-19 5:00 ` Mandeep Baines 2 siblings, 1 reply; 45+ messages in thread From: Paul E. McKenney @ 2010-06-18 22:33 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote: > (add cc's) > > Hmm. Once I sent this patch, I suddenly realized with horror that > while_each_thread() is NOT safe under rcu_read_lock(). Both > do_each_thread/while_each_thread or do/while_each_thread() can > race with exec(). > > Yes, it is safe to do next_thread() or next_task(). But: > > #define while_each_thread(g, t) \ > while ((t = next_thread(t)) != g) > > suppose that t is not the group leader, and it does de_thread() and then > release_task(g). After that next_thread(t) returns t, not g, and the loop > will never stop. > > I _really_ hope I missed something, will recheck tomorrow with the fresh > head. Still I'd like to share my concerns... > > If I am right, probably we can fix this, something like > > #define while_each_thread(g, t) \ > while ((t = next_thread(t)) != g && pid_alive(g)) > > [we can't do while (!thread_group_leadr(t = next_thread(t)))]. > but this needs barrires, and we should validate the callers anyway. > > Or, perhaps, > > #define XXX(t) ({ > struct task_struct *__prev = t; > t = next_thread(t); > t != g && t != __prev; > }) > > #define while_each_thread(g, t) \ > while (XXX(t)) Isn't the above vulnerable to a pthread_create() immediately following the offending exec()? Especially if the task doing the traversal is preempted? 
I cannot claim to understand the task-list code, but here are some techniques that might (or might not) help: o Check ACCESS_ONCE(p->group_leader) == g; if false, restart the traversal. Any race on update of p->group_leader would sort itself out on later iterations. This of course might require careful attention to the order of updates to ->group_leader and the list pointers. I also don't like it much from a worst-case response-time viewpoint. ;-) Plus it requires all operations on the tasks be idempotent, which is a bit ugly and restrictive. o Maintain an ->oldnext field that tracks the old pointer to the next task for one RCU grace period after a de_thread() operation. When the grace period expires (presumably via call_rcu()), the ->oldnext field is set to NULL. If the ->oldnext field is non-NULL, any subsequent de_thread() operations wait until it is NULL. (I convinced myself that pthread_create() need -not- wait, but perhaps mistakenly -- the idea is that any recent de_thread() victim remains group leader, so is skipped by while_each_thread(), but you would know better than I.) Then while_each_thread() does checks similar to what you have above, possibly with the addition of the ->group_leader check, but follows the ->oldnext pointer if the checks indicate that this task has de_thread()ed itself. The ->oldnext access will need to be preceded by a memory barrier, but this is off the fast path, so should be OK. There would also need to be memory barriers on the update side. o Do the de_thread() incrementally. So if the list is tasks A, B, and C, in that order, and if we are de-thread()ing B, then make A's pointer refer to C, wait a grace period, then complete the de_thread() operation. I would be surprised if people would actually be happy with the resulting increase in exec() overhead, but mentioning it for completeness. Of course, synchronize_rcu_expedited() has much shorter latency, and might work in this situation. 
(Though please do let me know if you choose this approach -- it will mean that I need to worry about synchronize_rcu_expedited() scalability sooner rather than later! Which is OK as long as I know.) This all assumes that is OK for de_thread() to block, but I have no idea if this is the case. > Please tell me I am wrong! It would take a braver man than me to say that Oleg Nesterov is wrong! Thanx, Paul > Oleg. > > On 06/18, Oleg Nesterov wrote: > > > > check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by > > "softlockup: check all tasks in hung_task" commit ce9dbe24 looks > > absolutely wrong. > > > > - rcu_lock_break() does put_task_struct(). If the task has exited > > it is not safe to even read its ->state, nothing protects this > > task_struct. > > > > - The TASK_DEAD checks are wrong too. Contrary to the comment, we > > can't use it to check if the task was unhashed. It can be unhashed > > without TASK_DEAD, or it can be valid with TASK_DEAD. > > > > For example, an autoreaping task can do release_task(current) > > long before it sets TASK_DEAD in do_exit(). > > > > Or, a zombie task can have ->state == TASK_DEAD but release_task() > > was not called, and in this case we must not break the loop. > > > > Change this code to check pid_alive() instead, and do this before we > > drop the reference to the task_struct. > > > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > --- > > > > kernel/hung_task.c | 11 +++++++---- > > 1 file changed, 7 insertions(+), 4 deletions(-) > > > > --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK 2009-12-18 19:05:38.000000000 +0100 > > +++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200 > > @@ -113,15 +113,20 @@ static void check_hung_task(struct task_ > > * For preemptible RCU it is sufficient to call rcu_read_unlock in order > > * exit the grace period. For classic RCU, a reschedule is required. 
> > */ > > -static void rcu_lock_break(struct task_struct *g, struct task_struct *t) > > +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t) > > { > > + bool can_cont; > > + > > get_task_struct(g); > > get_task_struct(t); > > rcu_read_unlock(); > > cond_resched(); > > rcu_read_lock(); > > + can_cont = pid_alive(g) && pid_alive(t); > > put_task_struct(t); > > put_task_struct(g); > > + > > + return can_cont; > > } > > > > /* > > @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t > > goto unlock; > > if (!--batch_count) { > > batch_count = HUNG_TASK_BATCHING; > > - rcu_lock_break(g, t); > > - /* Exit if t or g was unhashed during refresh. */ > > - if (t->state == TASK_DEAD || g->state == TASK_DEAD) > > + if (!rcu_lock_break(g, t)) > > goto unlock; > > } > > /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-18 22:33 ` Paul E. McKenney @ 2010-06-21 17:09 ` Oleg Nesterov 2010-06-21 17:44 ` Oleg Nesterov 2010-06-21 20:51 ` Paul E. McKenney 0 siblings, 2 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-21 17:09 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/18, Paul E. McKenney wrote: > > On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote: > > > > #define XXX(t) ({ > > struct task_struct *__prev = t; > > t = next_thread(t); > > t != g && t != __prev; > > }) > > > > #define while_each_thread(g, t) \ > > while (XXX(t)) > > Isn't the above vulnerable to a pthread_create() immediately following > the offending exec()? Especially if the task doing the traversal is > preempted? Yes, thanks! > here are some techniques that might (or might not) help: To simplify, let's consider the concrete example, rcu_read_lock(); g = t = returns_the_rcu_safe_task_struct_ptr(); do { printk("%d\n", t->pid); } while_each_thread(g, t); rcu_read_unlock(); Whatever we do, without tasklist/siglock this can obviously race with fork/exit/exec. It is OK to miss a thread, or print the pid of the already exited/released task. But it should not loop forever (the subject), and it should not print the same pid twice (ignoring pid reuse, of course). And, afaics, there are no problems with rcu magic per se, next_thread() always returns the task_struct we can safely dereference. The only problem is that while_each_thread() assumes that sooner or later next_thread() must reach the starting point, g. (zap_threads() is different, it must not miss a thread with ->mm we are going to dump, but it holds mmap_sem). > o Check ACCESS_ONCE(p->group_leader == g), but this assumes that while_each_thread(g, t) always uses g == group_leader. 
zap_other_threads(), for example, starts with g == current and not necessarily the leader. > if false, restart > the traversal. I am not sure we should restart (we must not print the same pid twice), probably we should break the loop. So, I am thinking about the first attempt #define while_each_thread(g, t) \ while ((t = next_thread(t)) != g && pid_alive(g)) again. But this means while_each_thread() can miss more threads than it currently can under the same conditions. Correct, but not good. And, > This of course might > require careful attention of the order of updates to ->group_leader > and the list pointers. Yes. At least the code above needs barriers, and I am not sure it can be really correct without more complications. > o Maintain an ->oldnext field that tracks the old pointer to > the next task for one RCU grace period after a de_thread() > operation. When the grace period expires (presumably via > call_rcu()), the ->oldnext field is set to NULL. Well, another field in task_struct... > o Do the de_thread() incrementally. So if the list is tasks A, > B, and C, in that order, and if we are de-thread()ing B, > then make A's pointer refer to C, This breaks while_each_thread() under tasklist/siglock. It must see all unhashed tasks. Oh. I'll try to think more, but I also hope someone else can suggest a simple solution. > > Please tell me I am wrong! > > It would take a braver man than me to say that Oleg Nesterov is wrong! Hmm. Given that you proved I was wrong in the first lines of your email, I do not know what to think ;) Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 17:09 ` Oleg Nesterov @ 2010-06-21 17:44 ` Oleg Nesterov 2010-06-21 18:00 ` Oleg Nesterov 2010-06-21 19:02 ` Roland McGrath 1 sibling, 2 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-21 17:44 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/21, Oleg Nesterov wrote: > > So, I am thinking about the first attempt > > #define while_each_thread(g, t) \ > while ((t = next_thread(t)) != g && pid_alive(g)) > > again. But this means while_each_thread() can miss more threads > than it currently can under the same conditions. Correct, but > not good. Not good, but correct ;) Probably it makes sense to fix the problem anyway, then think about the more optimal fix. static inline struct task_struct * next_thread_careful(const struct task_struct *g, const struct task_struct *t) { t = next_thread(t); /* * this pairs with the implicit barrier between detach_pid() * and list_del_rcu(g->thread_group) in __unhash_process(g). */ smp_rmb(); if (likely(pid_alive(g))) return t; else return g; } #define while_each_thread(g, t) \ while ((t = next_thread_careful(g, t)) != g) I think this should work. detach_pid() does unlock + lock at least once and thus we have the barrier (this is worth a comment or we can add the explicit wmb() in __unhash_process). Paul, Roland, do you see any problems from the correctness pov, or a better fix for now? Perhaps it also makes sense to keep the old variant renamed to while_each_thread_locked(), I dunno. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 17:44 ` Oleg Nesterov @ 2010-06-21 18:00 ` Oleg Nesterov 2010-06-21 19:02 ` Roland McGrath 1 sibling, 0 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-21 18:00 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/21, Oleg Nesterov wrote: > > On 06/21, Oleg Nesterov wrote: > > > > So, I am thinking about the first attempt > > > > #define while_each_thread(g, t) \ > > while ((t = next_thread(t)) != g && pid_alive(g)) > > > > again. But this means while_each_thread() can miss more threads > > than it currently can under the same conditions. Correct, but > > not good. > > Not good, but correct ;) Probably it makes sense to fix the problem > anyway, then think about the more optimal fix. > > static inline struct task_struct * > next_thread_careful(const struct task_struct *g, const struct task_struct *t) > { > t = next_thread(t); > /* > * this pairs with the implicit barrier between detach_pid() > * and list_del_rcu(g->thread_group) in __unhash_process(g). > */ > smp_rmb(); > if (likely(pid_alive(g))) > return t; > else > return g; > } > > #define while_each_thread(g, t) \ > while ((t = next_thread_careful(t)) != g) > > I think this should work. detach_pid() does unlock + lock at least > once and thus we have the barrier (this worth a comment or we > can add the explicit wmb() in __unhash_process). > > Paul, Roland, do you see any problems from the correctness pov, > or a better fix for now? > > Perhaps it also makes sense to keep the old variant renamed to > while_each_thread_locked(), I dunno. Well. but current_is_single_threaded() and zap_threads() have to use next_thread() or while_each_thread_locked() in this case... Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 17:44 ` Oleg Nesterov 2010-06-21 18:00 ` Oleg Nesterov @ 2010-06-21 19:02 ` Roland McGrath 2010-06-21 20:06 ` Oleg Nesterov 1 sibling, 1 reply; 45+ messages in thread From: Roland McGrath @ 2010-06-21 19:02 UTC (permalink / raw) To: Oleg Nesterov Cc: Paul E. McKenney, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman > Paul, Roland, do you see any problems from the correctness pov, > or a better fix for now? > > Perhaps it also makes sense to keep the old variant renamed to > while_each_thread_locked(), I dunno. Did we verify that only de_thread() can create the situation where a while_each_thread-style loop without either lock can be confused? If that's so, then just changing it to avoid the situation seems like it would be less invasive overall. Thanks, Roland ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 19:02 ` Roland McGrath @ 2010-06-21 20:06 ` Oleg Nesterov 2010-06-21 21:19 ` Eric W. Biederman 2010-07-08 23:59 ` Roland McGrath 0 siblings, 2 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-21 20:06 UTC (permalink / raw) To: Roland McGrath Cc: Paul E. McKenney, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman On 06/21, Roland McGrath wrote: > > > Paul, Roland, do you see any problems from the correctness pov, > > or a better fix for now? > > > > Perhaps it also makes sense to keep the old variant renamed to > > while_each_thread_locked(), I dunno. > > Did we verify that only de_thread() can create the situation where a > while_each_thread-style loop without either lock can be confused? I think yes, this is the only case. I mean, while_each_thread(group_leader, t). If g != group_leader, then the lockless while_each_thread() has problems with the plain exit(g). Afaics. The more I think about this, the more I feel confused ;) But if we start from ->group_leader, then while_each_thread() must stop eventually. Otherwise we should assume that the dead (unhashed) tasks can create the circular list, obviously this is not possible. > If > that's so, then just changing it to avoid the situation seems like it > would be less invasive overall. How? We should change ->group_leader under write_lock_irq(tasklist), synchronize_rcu() is not an option. We can't do call_rcu(release_task), we can't take tasklist for writing in the softirq context. But even if we could, this can't help in fact or I missed something. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 20:06 ` Oleg Nesterov @ 2010-06-21 21:19 ` Eric W. Biederman 2010-06-22 14:34 ` Oleg Nesterov 2010-07-08 23:59 ` Roland McGrath 1 sibling, 1 reply; 45+ messages in thread From: Eric W. Biederman @ 2010-06-21 21:19 UTC (permalink / raw) To: Oleg Nesterov Cc: Roland McGrath, Paul E. McKenney, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable Oleg Nesterov <oleg@redhat.com> writes: > On 06/21, Roland McGrath wrote: >> >> > Paul, Roland, do you see any problems from the correctness pov, >> > or a better fix for now? >> > >> > Perhaps it also makes sense to keep the old variant renamed to >> > while_each_thread_locked(), I dunno. >> >> Did we verify that only de_thread() can create the situation where a >> while_each_thread-style loop without either lock can be confused? > > I think yes, this is is the only case. > > I mean, while_each_thread(group_leader, t). If g != group_leader, then > the lockless while_each_thread() has problems with the plain exit(g). > > Afaics. The more I think about this, the more I feel confused ;) > > But if we start from ->group_leader, then while_each_thread() must > stop eventually. Otherwise we should assume that the dead (unhashed) > tasks can create the circular list, obviously this is not possible. > >> If >> that's so, then just changing it to avoid the situation seems like it >> would be less invasive overall. > > How? We should change ->group_leader uner write_lock_irq(tasklist), > synchronize_rcu() is not an option. We can't do call_rcu(release_task), > we can't take tasklist for writing in the softirq context. But even > if we could, this can't help in fact or I missed something. We already do: call_rcu(&p->rcu, delayed_put_task_struct); in release_task. We don't call release_task until after we have removed it as leader and dropped the write lock. 
At first glance it sounds like the group leader is safe as a stopping point for an rcu while_each_thread, and I expect the fact that de_thread takes everything down to a single thread could have nice properties here. If pid_alive were only to fail on the group leader when de_thread is called, I think we could legitimately say that is an event we won't worry about. It is close enough to a new thread being created anyway. Eric ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 21:19 ` Eric W. Biederman @ 2010-06-22 14:34 ` Oleg Nesterov 0 siblings, 0 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-22 14:34 UTC (permalink / raw) To: Eric W. Biederman Cc: Roland McGrath, Paul E. McKenney, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable On 06/21, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > >> If > >> that's so, then just changing it to avoid the situation seems like it > >> would be less invasive overall. > > > > How? We should change ->group_leader uner write_lock_irq(tasklist), > > synchronize_rcu() is not an option. We can't do call_rcu(release_task), > > we can't take tasklist for writing in the softirq context. But even > > if we could, this can't help in fact or I missed something. > > We already do: call_rcu(&p->rcu, delayed_put_task_struct); in release_task. > We don't call release_task until after we have removed it as leader and > dropped the write lock. Yes. I meant that while this is needed to ensure that next_thread/etc returns the rcu-safe data, this (or more call_rcu's) can help to fix while_each_thread. I think I was unclear. de_thread() changes ->group_leader, but this does not matter at all. I mentioned this only because we discussed the possibility to check ->group_leader in while_each_thread. What does matter is the single line in __unhash_process() list_del_rcu(&p->thread_group); this breaks while_each_thread(). IOW. Why does list_for_each_rcu(head) actually work? It works because the head itself can't be removed from the list. while_each_thread(g, t) is almost equal to list_for_each_rcu() and it can only work if g can't be removed from the list (OK, strictly speaking until other sub-threads go away, but this doesn't matter). However, g can be removed if a) it is not ->group_leader and exits, or b) its subthread execs. 
> At first glance it sounds like the group leader is safe as a stopping > point for a rcu while_each_thread, and I expect the fact that > de_thread takes everything down to a single thread, could have nice > properties here. If pid_alive were only to fail on the group leader > when de_thread is called I think we could legitimately say that an event > we won't worry about. It is close enough to a new thread being created > anyway. Not sure I understand exactly... I mean, I don't understand whether you agree or not with the fix which adds pid_alive() check into next_thread_careful(). Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 20:06 ` Oleg Nesterov 2010-06-21 21:19 ` Eric W. Biederman @ 2010-07-08 23:59 ` Roland McGrath 2010-07-09 0:41 ` Paul E. McKenney 1 sibling, 1 reply; 45+ messages in thread From: Roland McGrath @ 2010-07-08 23:59 UTC (permalink / raw) To: Oleg Nesterov Cc: Paul E. McKenney, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman > We can't do call_rcu(release_task), > we can't take tasklist for writing in the softirq context. But even > if we could, this can't help in fact or I missed something. Ah, I had missed that constraint of call_rcu. (It's not mentioned in the kerneldoc comment directly, and is only buried in a couple of brief mentions in Documentation/RCU/.) I was thinking that would suffice because it does it between rcu_read_lock critical sections, but I guess it actually only does it after the last one and doesn't prevent a new one from having started on another CPU. (And we can't use it anyway.) Thanks, Roland ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-07-08 23:59 ` Roland McGrath @ 2010-07-09 0:41 ` Paul E. McKenney 2010-07-09 1:01 ` Roland McGrath 0 siblings, 1 reply; 45+ messages in thread From: Paul E. McKenney @ 2010-07-09 0:41 UTC (permalink / raw) To: Roland McGrath Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman On Thu, Jul 08, 2010 at 04:59:25PM -0700, Roland McGrath wrote: > > We can't do call_rcu(release_task), > > we can't take tasklist for writing in the softirq context. But even > > if we could, this can't help in fact or I missed something. > > Ah, I had missed that constraint of call_rcu. (It's not mentioned in the > kerneldoc comment directly, and is only buried in a couple of brief > mentions in Documentation/RCU/.) I was thinking that would suffice because > it does it between rcu_read_lock critical sections, but I guess it actually > only does it after the last one and doesn't prevent a new one from having > started on another CPU. (And we can't use it anyway.) Hmmm... The kerneldoc for rcu_read_lock() looks like it says this pretty unambiguously. Then again, that is easy for me to say, given that I was the one who made most of this stuff up. ;-) But it has been awhile since I looked after the kerneldoc. Please see below for the changes I would make if left to myself. Any other suggestions for improvement to the kerneldoc? Thanx, Paul ------------------------------------------------------------------------ commit b12483332f1a7ee2db11f4bdc3c4e61edb608b01 Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Date: Thu Jul 8 17:38:59 2010 -0700 rcu: improve kerneldoc for rcu_read_lock(), call_rcu(), and synchronize_rcu() Make it explicit that new RCU read-side critical sections that start after call_rcu() and synchronize_rcu() start might still be running after the end of the relevant grace period. Signed-off-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 508ebba..27b44b3 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -444,7 +444,7 @@ extern int rcu_my_thread_group_empty(void); * until after the all the other CPUs exit their critical sections. * * Note, however, that RCU callbacks are permitted to run concurrently - * with RCU read-side critical sections. One way that this can happen + * with new RCU read-side critical sections. One way that this can happen * is via the following sequence of events: (1) CPU 0 enters an RCU * read-side critical section, (2) CPU 1 invokes call_rcu() to register * an RCU callback, (3) CPU 0 exits the RCU read-side critical section, @@ -602,11 +602,13 @@ extern void wakeme_after_rcu(struct rcu_head *head); /** * call_rcu() - Queue an RCU callback for invocation after a grace period. * @head: structure to be used for queueing the RCU updates. - * @func: actual update function to be invoked after the grace period + * @func: actual callback function to be invoked after the grace period * - * The update function will be invoked some time after a full grace - * period elapses, in other words after all currently executing RCU - * read-side critical sections have completed. RCU read-side critical + * The callback function will be invoked some time after a full grace + * period elapses, in other words after all pre-existing RCU read-side + * critical sections have completed. However, the callback function + * might well execute concurrently with RCU read-side critical sections + * that started after call_rcu() was invoked. RCU read-side critical * sections are delimited by rcu_read_lock() and rcu_read_unlock(), * and may be nested. */ @@ -616,9 +618,9 @@ extern void call_rcu(struct rcu_head *head, /** * call_rcu_bh() - Queue an RCU for invocation after a quicker grace period. * @head: structure to be used for queueing the RCU updates. 
- * @func: actual update function to be invoked after the grace period + * @func: actual callback function to be invoked after the grace period * - * The update function will be invoked some time after a full grace + * The callback function will be invoked some time after a full grace * period elapses, in other words after all currently executing RCU * read-side critical sections have completed. call_rcu_bh() assumes * that the read-side critical sections end on completion of a softirq diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index 9906f85..63bb771 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -546,9 +546,11 @@ EXPORT_SYMBOL_GPL(call_rcu); * * Control will return to the caller some time after a full grace * period has elapsed, in other words after all currently executing RCU - * read-side critical sections have completed. RCU read-side critical - * sections are delimited by rcu_read_lock() and rcu_read_unlock(), - * and may be nested. + * read-side critical sections have completed. Note, however, that + * upon return from synchronize_rcu(), the caller might well be executing + * concurrently with new RCU read-side critical sections that began while + * synchronize_rcu() was waiting. RCU read-side critical sections are + * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested. */ void synchronize_rcu(void) { ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-07-09 0:41 ` Paul E. McKenney @ 2010-07-09 1:01 ` Roland McGrath 2010-07-09 16:18 ` Paul E. McKenney 0 siblings, 1 reply; 45+ messages in thread From: Roland McGrath @ 2010-07-09 1:01 UTC (permalink / raw) To: paulmck Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman > Hmmm... The kerneldoc for rcu_read_lock() looks like it says this > pretty unambiguously. Then again, that is easy for me to say, given > that I was the one who made most of this stuff up. ;-) Yeah, well, I thought I knew what rcu_read_lock did and the only new thing I had in mind was using call_rcu, so that's the only place I looked. It wasn't entirely unreasonable to have to read through Documentation/RCU/ a bit for a subtlety of this kind, and I did find it quickly enough once Oleg's attentiveness alerted me to be freshly afraid of the subtleties. > But it has been awhile since I looked after the kerneldoc. Please see > below for the changes I would make if left to myself. Those look good to me! > Any other suggestions for improvement to the kerneldoc? Oh, I don't know, send over someone to read it to me in a sultry voice so I get appropriately motivated to cast about for all of it I can find and listen closely to every word? It already looks like if I'd actually read all of it twice that day, I would have had the information I needed. Thanks, Roland ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-07-09 1:01 ` Roland McGrath @ 2010-07-09 16:18 ` Paul E. McKenney 0 siblings, 0 replies; 45+ messages in thread From: Paul E. McKenney @ 2010-07-09 16:18 UTC (permalink / raw) To: Roland McGrath Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman On Thu, Jul 08, 2010 at 06:01:37PM -0700, Roland McGrath wrote: > > Hmmm... The kerneldoc for rcu_read_lock() looks like it says this > > pretty unambiguously. Then again, that is easy for me to say, given > > that I was the one who made most of this stuff up. ;-) > > Yeah, well, I thought I knew what rcu_read_lock did and the only new thing > I had in mind was using call_rcu, so that's the only place I looked. It > wasn't entirely unreasonable to have to read through Documentation/RCU/ a > bit for a subtlety of this kind, and I did find it quickly enough once > Oleg's attentiveness alerted me to be freshly afraid of the subtleties. > > > But it has been awhile since I looked after the kerneldoc. Please see > > below for the changes I would make if left to myself. > > Those look good to me! Very good, I have them queued. I have some messy dependencies in my tree, so it will take some time for them to appear. > > Any other suggestions for improvement to the kerneldoc? > > Oh, I don't know, send over someone to read it to me in a sultry voice > so I get appropriately motivated to cast about for all of it I can find > and listen closely to every word? It already looks like if I'd actually > read all of it twice that day, I would have had the information I needed. Hmmm... Despite being happily married for well over two decades, I do feel the need to advise you to be careful what you wish for. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 17:09 ` Oleg Nesterov 2010-06-21 17:44 ` Oleg Nesterov @ 2010-06-21 20:51 ` Paul E. McKenney 2010-06-21 21:22 ` Eric W. Biederman 2010-06-22 21:23 ` Oleg Nesterov 1 sibling, 2 replies; 45+ messages in thread From: Paul E. McKenney @ 2010-06-21 20:51 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On Mon, Jun 21, 2010 at 07:09:19PM +0200, Oleg Nesterov wrote: > On 06/18, Paul E. McKenney wrote: > > > > On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote: > > > > > > #define XXX(t) ({ > > > struct task_struct *__prev = t; > > > t = next_thread(t); > > > t != g && t != __prev; > > > }) > > > > > > #define while_each_thread(g, t) \ > > > while (XXX(t)) > > > > Isn't the above vulnerable to a pthread_create() immediately following > > the offending exec()? Especially if the task doing the traversal is > > preempted? > > Yes, thanks! > > > here are some techniques that might (or might not) help: > > To simplify, let's consider the concrete example, Sounds very good! > rcu_read_lock(); > > g = t = returns_the_rcu_safe_task_struct_ptr(); This returns a pointer to the task struct of the current thread? Or might this return a pointer some other thread's task struct? > do { > printk("%d\n", t->pid); > } while_each_thread(g, t); > > rcu_read_unlock(); > > Whatever we do, without tasklist/siglock this can obviously race > with fork/exit/exec. It is OK to miss a thread, or print the pid > of the already exited/released task. > > But it should not loop forever (the subject), and it should not > print the same pid twice (ignoring pid reuse, of course). > > And, afaics, there are no problems with rcu magic per se, next_thread() > always returns the task_struct we can safely dereference. 
The only > problem is that while_each_thread() assumes that sooner or later > next_thread() must reach the starting point, g. > > (zap_threads() is different, it must not miss a thread with ->mm > we are going to dump, but it holds mmap_sem). Indeed, the tough part is figuring out when you are done given that things can come and go at will. Some additional tricks, in no particular order: 1. Always start at the group leader. Of course, the group leader is probably permitted to leave any time it wants to, so this is not sufficient in and of itself. 2. Maintain a separate task structure that flags the head of the list. This separate structure is freed one RCU grace period following the disappearance of the current group leader. This should be quite robust, but "holy overhead, Batman!!!" (Apologies for the American pop culture reference, but nothing else seemed appropriate.) 3. Like #2, but reduce the storage overhead by using a structure containing only the pointers and perhaps a few additional fields. Set one of the bottom bits of one of the pointers to flag the fact that this is a special structure, allowing the reader to take note and do any additional checks required. Of course, it is possible for a reader to be diverted onto a new thread group, so it is then also possible that it needs to return back to the previous thread group. This smaller structure might contain the pointers required for it to make this transition. Please note that a given thread group will have several of these structures if several thread-group leaders depart in quick succession. The smaller structure would need to contain enough information to allow the reader to thread back to whatever group-leader task it started from. Which will likely make managing the lifetimes of these smaller structures "interesting"... Though this could be eased by maintaining forward links as well as backwards links.
Then when the grace period expired, the corresponding backwards links could be NULLed -- and a second grace period would likely be needed. It is tempting to leave these special structs out of any thread groups that have not yet had a leader depart, but this is probably vulnerable to looping. (It might be possible to avoid this, but additional checks are required.) There are other approaches, but it is probably time for you to teach me more about the requirements you face. > > o Check ACCESS_ONCE(p->group_leader == g), > > but this assumes that while_each_thread(g, t) always > uses g == group_leader. zap_other_threads(), for example, > starts with g == current and not necessarily the leader. > > > if false, restart > > the traversal. > > I am not sure we should restart (don't print the same pid twice), > probably we should break the loop. > > So, I am thinking about the first attempt > > #define while_each_thread(g, t) \ > while ((t = next_thread(t)) != g && pid_alive(g)) > > again. But this means while_each_thread() can miss more threads > than it currently can under the same conditions. Correct, but > not good. OK, so we really want to complete the scan through the thread group, or at least avoid introducing additional tendency to not complete the scan. Good to know. ;-) > And, > > > This of course might > > require careful attention of the order of updates to ->group_leader > > and the list pointers. > > Yes. At least the code above needs barriers, and I am not sure > it can be really correct without more complications. Agreed! > > o Maintain an ->oldnext field that tracks the old pointer to > > the next task for one RCU grace period after a de_thread() > > operation. When the grace period expires (presumably via > > call_rcu()), the ->oldnext field is set to NULL. > > Well, another field in task_struct... Yeah, would be good to avoid this. Not sure it can be avoided, though. > > o Do the de_thread() incrementally.
So if the list is tasks A, > > B, and C, in that order, and if we are de-thread()ing B, > > then make A's pointer refer to C, > > This breaks while_each_thread() under tasklist/siglock. It must > see all unhashed tasks. Could de_thread() hold those locks in order to avoid that breakage? > Oh. I'll try to think more, but I also hope someone else can suggest > a simple solution. ;-) > > > Please tell me I am wrong! > > > > It would take a braver man than me to say that Oleg Nesterov is wrong! > > Hmm. Given that you proved I was wrong in the first lines of your > email, I do not know what to think ;) Ah, but I didn't prove you wrong. Instead, I simply asked if you were wrong. You supplied the proof. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
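Oleg's "first attempt" quoted above — `while ((t = next_thread(t)) != g && pid_alive(g))` — trades the infinite loop for early termination. A user-space toy model (all names here are stand-ins, not kernel code; pid_alive() is modeled as a simple "hashed" flag cleared on unhash) makes his "can miss more threads" caveat concrete:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: circular thread list plus a pid_alive()-like flag. */
struct task {
    struct task *next;
    int hashed;                 /* cleared by unhash(), like detach_pid() */
};

static struct task *next_thread(struct task *t) { return t->next; }
static int pid_alive(struct task *t) { return t->hashed; }

/* Remove victim from the live list, leaving victim->next intact. */
static void unhash(struct task *victim)
{
    struct task *p = victim->next;

    while (p->next != victim)
        p = p->next;
    p->next = victim->next;
    victim->hashed = 0;
}

/*
 * The "first attempt": give up as soon as g itself has been unhashed.
 * Returns the number of threads visited after g, or -1 if the capped
 * walk failed to terminate.
 */
static int walk_guarded(struct task *g, int cap)
{
    struct task *t = g;
    int n = 0;

    while ((t = next_thread(t)) != g && pid_alive(g)) {
        if (++n > cap)
            return -1;
    }
    return n;
}
```

With threads A, B, C the guarded walk from A sees both others; after `unhash(&A)` it terminates immediately (no infinite loop) but reports zero of the two still-live threads — exactly the "miss more threads than it currently can" objection.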
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 20:51 ` Paul E. McKenney @ 2010-06-21 21:22 ` Eric W. Biederman 2010-06-21 21:38 ` Paul E. McKenney 2010-06-22 21:23 ` Oleg Nesterov 1 sibling, 1 reply; 45+ messages in thread From: Eric W. Biederman @ 2010-06-21 21:22 UTC (permalink / raw) To: paulmck Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes: > On Mon, Jun 21, 2010 at 07:09:19PM +0200, Oleg Nesterov wrote: >> On 06/18, Paul E. McKenney wrote: >> > >> > On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote: >> > > >> > > #define XXX(t) ({ >> > > struct task_struct *__prev = t; >> > > t = next_thread(t); >> > > t != g && t != __prev; >> > > }) >> > > >> > > #define while_each_thread(g, t) \ >> > > while (XXX(t)) >> > >> > Isn't the above vulnerable to a pthread_create() immediately following >> > the offending exec()? Especially if the task doing the traversal is >> > preempted? >> >> Yes, thanks! >> >> > here are some techniques that might (or might not) help: >> >> To simplify, let's consider the concrete example, > > Sounds very good! > >> rcu_read_lock(); >> >> g = t = returns_the_rcu_safe_task_struct_ptr(); > > This returns a pointer to the task struct of the current thread? > Or might this return a pointer some other thread's task struct? > >> do { >> printk("%d\n", t->pid); >> } while_each_thread(g, t); >> >> rcu_read_unlock(); >> >> Whatever we do, without tasklist/siglock this can obviously race >> with fork/exit/exec. It is OK to miss a thread, or print the pid >> of the already exited/released task. >> >> But it should not loop forever (the subject), and it should not >> print the same pid twice (ignoring pid reuse, of course). 
>> >> And, afaics, there are no problems with rcu magic per se, next_thread() >> always returns the task_struct we can safely dereference. The only >> problem is that while_each_thread() assumes that sooner or later >> next_thread() must reach the starting point, g. >> >> (zap_threads() is different, it must not miss a thread with ->mm >> we are going to dump, but it holds mmap_sem). > > Indeed, the tough part is figuring out when you are done given that things > can come and go at will. Some additional tricks, in no particular order: > > 1. Always start at the group leader. Of course, the group leader > is probably permitted to leave any time it wants to, so this > is not sufficient in and of itself. No. The group leader must exist as long as the group exists. Modulo de_thread weirdness. The group_leader can be a zombie but it can not go away completely. > 2. Maintain a separate task structure that flags the head of the > list. This separate structure is freed one RCU grace period > following the disappearance of the current group leader. This > should be quite robust, but "holy overhead, Batman!!!" (Apologies > for the American pop culture reference, but nothing else seemed > appropriate.) That is roughly what we have in the group leader right now. Eric ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 21:22 ` Eric W. Biederman @ 2010-06-21 21:38 ` Paul E. McKenney 0 siblings, 0 replies; 45+ messages in thread From: Paul E. McKenney @ 2010-06-21 21:38 UTC (permalink / raw) To: Eric W. Biederman Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable On Mon, Jun 21, 2010 at 02:22:59PM -0700, Eric W. Biederman wrote: > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes: > > > On Mon, Jun 21, 2010 at 07:09:19PM +0200, Oleg Nesterov wrote: > >> On 06/18, Paul E. McKenney wrote: > >> > > >> > On Fri, Jun 18, 2010 at 09:34:03PM +0200, Oleg Nesterov wrote: > >> > > > >> > > #define XXX(t) ({ > >> > > struct task_struct *__prev = t; > >> > > t = next_thread(t); > >> > > t != g && t != __prev; > >> > > }) > >> > > > >> > > #define while_each_thread(g, t) \ > >> > > while (XXX(t)) > >> > > >> > Isn't the above vulnerable to a pthread_create() immediately following > >> > the offending exec()? Especially if the task doing the traversal is > >> > preempted? > >> > >> Yes, thanks! > >> > >> > here are some techniques that might (or might not) help: > >> > >> To simplify, let's consider the concrete example, > > > > Sounds very good! > > > >> rcu_read_lock(); > >> > >> g = t = returns_the_rcu_safe_task_struct_ptr(); > > > > This returns a pointer to the task struct of the current thread? > > Or might this return a pointer some other thread's task struct? > > > >> do { > >> printk("%d\n", t->pid); > >> } while_each_thread(g, t); > >> > >> rcu_read_unlock(); > >> > >> Whatever we do, without tasklist/siglock this can obviously race > >> with fork/exit/exec. It is OK to miss a thread, or print the pid > >> of the already exited/released task. > >> > >> But it should not loop forever (the subject), and it should not > >> print the same pid twice (ignoring pid reuse, of course). 
> >> > >> And, afaics, there are no problems with rcu magic per se, next_thread() > >> always returns the task_struct we can safely dereference. The only > >> problem is that while_each_thread() assumes that sooner or later > >> next_thread() must reach the starting point, g. > >> > >> (zap_threads() is different, it must not miss a thread with ->mm > >> we are going to dump, but it holds mmap_sem). > > > > Indeed, the tough part is figuring out when you are done given that things > > can come and go at will. Some additional tricks, in no particular order: > > > > 1. Always start at the group leader. Of course, the group leader > > is probably permitted to leave any time it wants to, so this > > is not sufficient in and of itself. > > No. The group leader must exist as long as the group exists. > Modulo de_thread weirdness. The group_leader can be a zombie but > it can not go away completely. Ah, OK -- now that you mention it, all the things that I can think of that remove a thread from a group have the side effect of destroying the old group (exec() and exit()). Other things that create a new thread group leave the old thread group intact. Or am I forgetting some odd operation? > > 2. Maintain a separate task structure that flags the head of the > > list. This separate structure is freed one RCU grace period > > following the disappearance of the current group leader. This > > should be quite robust, but "holy overhead, Batman!!!" (Apologies > > for the American pop culture reference, but nothing else seemed > > appropriate.) > > That is roughly what we have in the group leader right now. But can't the group leader do an exec(), becoming the leader of a new thread group without waiting for a grace period? Or is this possibility already covered? Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-21 20:51 ` Paul E. McKenney 2010-06-21 21:22 ` Eric W. Biederman @ 2010-06-22 21:23 ` Oleg Nesterov 2010-06-22 22:12 ` Paul E. McKenney 1 sibling, 1 reply; 45+ messages in thread From: Oleg Nesterov @ 2010-06-22 21:23 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/21, Paul E. McKenney wrote: > > Indeed, the tough part is figuring out when you are done given that things > can come and go at will. Some additional tricks, in no particular order: > > 1. Always start at the group leader. We can't. We have users which start at an arbitrary thread. > 2. Maintain a separate task structure that flags the head of the > list. This separate structure is freed one RCU grace period > following the disappearance of the current group leader. Even simpler, we can just add list_head into signal_struct. I thought about this, but this breaks thread_group_empty (this is fixable) and, again, I'd like very much to avoid adding new fields into task_struct or signal_struct. > > Well, another field in task_struct... > > Yeah, would be good to avoid this. Not sure it can be avoided, though. Why? I think next_thread_careful() from http://marc.info/?l=linux-kernel&m=127714242731448 should work. If the caller holds tasklist or siglock, this change has no effect. If the caller does while_each_thread() under rcu_read_lock(), then it is OK to break the loop earlier than we do now. The lockless while_each_thread() works in a "best effort" manner anyway, if it races with exit_group() or exec() it can miss some/most/all sub-threads (including the new leader) with or without this change. Yes, zap_threads() needs additional fixes.
But I think it is better to complicate a couple of lockless callers (or just change them to take tasklist) which must not miss an "interesting" thread. > > > o Do the de_thread() incrementally. So if the list is tasks A, > > > B, and C, in that order, and if we are de-thread()ing B, > > > then make A's pointer refer to C, > > > > This breaks while_each_thread() under tasklist/siglock. It must > > see all unhashed tasks. > > Could de_thread() hold those locks in order to avoid that breakage? How can it hold, say, siglock? We need to wait a grace period. To clarify. de_thread() kills all threads except the group_leader, so we have only 2 threads: group_leader A and B. If we add synchronize_rcu() before release_task(leader) (as Roland suggested), then we don't need to change A's pointer. This probably fixes while_each_thread() in the common case. But this disallows the tricks like rcu_lock_break(). And. Whatever we do with de_thread(), this can't fix the lockless while_each_thread(not_a_group_leader, t). I do not know if there is any user which does this though. fastpath_timer_check()->thread_group_cputimer() does this, but this is wrong and we already have the patch which removes it. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
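For reference, the shape of the next_thread_careful() idea under discussion (the real proposal lives behind the marc.info link above; this is a user-space paraphrase, not Oleg's patch, and it folds in both guards mentioned in the thread — a pid_alive()-style check on the thread just reached and one on the starting point). All identifiers are modeling inventions:

```c
#include <assert.h>
#include <stddef.h>

struct task {
    struct task *next;
    int hashed;                   /* stands in for pid_alive() */
};

static struct task *next_thread(struct task *t) { return t->next; }
static int pid_alive(struct task *t) { return t->hashed; }

/* list_del_rcu()-style removal: victim->next is left intact. */
static void unhash(struct task *victim)
{
    struct task *p = victim->next;

    while (p->next != victim)
        p = p->next;
    p->next = victim->next;
    victim->hashed = 0;
}

/*
 * Paraphrase of next_thread_careful(g, t): if the thread we stepped
 * to has been unhashed we may never get back to g, so bail out by
 * returning g, which ends the walk early.
 */
static struct task *next_thread_careful(struct task *g, struct task *t)
{
    t = next_thread(t);
    if (!pid_alive(t))
        t = g;
    return t;
}

/* Walk guarded both ways; returns threads visited, -1 if capped. */
static int walk_careful(struct task *g, int cap)
{
    struct task *t = g;
    int n = 0;

    while (pid_alive(g) && (t = next_thread_careful(g, t)) != g) {
        if (++n > cap)
            return -1;
    }
    return n;
}
```

In this model a walk started at a non-leader that then exits terminates instead of spinning, and walks from live threads still complete — though, as the rest of the thread makes clear, the real kernel version additionally depends on memory-barrier ordering in __unhash_process(), which this sketch ignores.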
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-22 21:23 ` Oleg Nesterov @ 2010-06-22 22:12 ` Paul E. McKenney 2010-06-23 15:24 ` Oleg Nesterov 0 siblings, 1 reply; 45+ messages in thread From: Paul E. McKenney @ 2010-06-22 22:12 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On Tue, Jun 22, 2010 at 11:23:57PM +0200, Oleg Nesterov wrote: > On 06/21, Paul E. McKenney wrote: > > > > Indeed, the tough part is figuring out when you are done given that things > > can come and go at will. Some additional tricks, in no particular order: > > > > 1. Always start at the group leader. > > We can't. We have users which start at the arbitrary thread. OK. > > 2. Maintain a separate task structure that flags the head of the > > list. This separate structure is freed one RCU grace period > > following the disappearance of the current group leader. > > Even simpler, we can just add list_head into signal_struct. I thought > about this, but this breaks thread_group_empty (this is fixeable) and, > again, I'd like very much to avoid adding new fields into task_struct > or signal_struct. Understood. > > > Well, another field in task_struct... > > > > Yeah, would be good to avoid this. Not sure it can be avoided, though. > > Why? I think next_thread_careful() from > http://marc.info/?l=linux-kernel&m=127714242731448 > should work. > > If the caller holds tasklist or siglock, this change has no effect. > > If the caller does while_each_thread() under rcu_read_lock(), then > it is OK to break the loop earlier than we do now. The lockless > while_each_thread() works in a "best effort" manner anyway, if it > races with exit_group() or exec() it can miss some/most/all sub-threads > (including the new leader) with or without this change. > > Yes, zap_threads() needs additional fixes. 
But I think it is better > to complicate a couple of lockless callers (or just change them > to take tasklist) which must not miss an "interesting" thread. Is it the case that creating a new group leader from an existing group always destroys the old group? It certainly is the case for exec(). In my earlier emails, I was assuming that it was possible to create a new thread group without destroying the old one, and that the thread group leader might leave the thread group and form a new one, so that a new thread group leader would be selected for the old group. I suspect that I was confused. ;-) Anyway, if creating a new thread group implies destroying the old one, and if the thread group leader cannot be concurrently creating a new thread group and traversing the old one, then yes, I do believe your code at http://marc.info/?l=linux-kernel&m=127714242731448 will work. Assuming that the call to next_thread_careful(t) in the definition of while_each_thread() is replaced with next_thread_careful(g,t). And give or take memory barriers. The implied memory barrier mentioned in the comment in your example code is the spin_lock_irqsave() and spin_unlock_irqrestore() in free_pid(), which is called from __change_pid() which is called from detach_pid()? On some platforms, code may be reordered from both sides into the resulting critical section, so you actually need two non-overlapping lock-based critical sections to guarantee full memory-barrier semantics. And the atomic_inc() in free_pidmap() is not required to provide memory-barrier semantics, and does not do so on all platforms. Or does the implied memory barrier correspond to the first of three calls to detach_pid() in __unhash_process()? (The above analysis assumes that it corresponds to the last of the three.) > > > > o Do the de_thread() incrementally.
So if the list is tasks A, > > > > B, and C, in that order, and if we are de-thread()ing B, > > > > then make A's pointer refer to C, > > > > > > This breaks while_each_thread() under tasklist/siglock. It must > > > see all unhashed tasks. > > > > Could de_thread() hold those locks in order to avoid that breakage? > > How can it hold, say, siglock? We need to wait a grace period. > To clarify. de_thread() kills all threads except the group_leader, > so we have only 2 threads: group_leader A and B. > > If we add synchronize_rcu() before release_task(leader) (as Roland > suggested), then we don't need to change A's pointer. This probably > fixes while_each_thread() in the common case. But this disallows > the tricks like rcu_lock_break(). > > > And. Whatever we do with de_thread(), this can't fix the lockless > while_each_thread(not_a_group_leader, t). I do not know if there is > any user which does this though. > fastpath_timer_check()->thread_group_cputimer() does this, but this > is wrong and we already have the patch which removes it. Indeed. Suppose that the starting task is the one immediately preceding the task group leader. You get a pointer to the task in question and traverse to the next task (the task group leader), during which time the thread group leader does exec() and maybe a pthread_create() or two. Oops! You are now traversing the wrong thread group! There are ways of fixing this, but all the ones I know of require more fields in the task structure, so best if we don't need to start somewhere other than a group leader. Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-22 22:12 ` Paul E. McKenney @ 2010-06-23 15:24 ` Oleg Nesterov 2010-06-24 18:07 ` Paul E. McKenney 0 siblings, 1 reply; 45+ messages in thread From: Oleg Nesterov @ 2010-06-23 15:24 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/22, Paul E. McKenney wrote: > > On Tue, Jun 22, 2010 at 11:23:57PM +0200, Oleg Nesterov wrote: > > On 06/21, Paul E. McKenney wrote: > > > > > > Yeah, would be good to avoid this. Not sure it can be avoided, though. > > > > Why? I think next_thread_careful() from > > http://marc.info/?l=linux-kernel&m=127714242731448 > > should work. > > > > If the caller holds tasklist or siglock, this change has no effect. > > > > If the caller does while_each_thread() under rcu_read_lock(), then > > it is OK to break the loop earlier than we do now. The lockless > > while_each_thread() works in a "best effort" manner anyway, if it > > races with exit_group() or exec() it can miss some/most/all sub-threads > > (including the new leader) with or without this change. > > > > Yes, zap_threads() needs additional fixes. But I think it is better > > to complicate a couple of lockless callers (or just change them > > to take tasklist) which must not miss an "interesting" thread. > > Is it the case that creating a new group leader from an existing group > always destroys the old group? It certainly is the case for exec(). Yes. And only exec() can change the leader. > Anyway, if creating a new thread group implies destroying the old one, > and if the thread group leader cannot be concurrently creating a new > thread group and traversing the old one, then yes, I do believe your > code at http://marc.info/?l=linux-kernel&m=127714242731448 will work. 
> > Assuming that the call to next_thread_careful(t) in the definition of > while_each_thread() is replaced with next_thread_careful(g,t). Great. > And give or take memory barriers. > > The implied memory barrier mentioned in the comment in your example code > is the spin_lock_irqsave() and spin_unlock_irqrestore() Argh. No, no, no. I meant unlock + lock, not lock + unlock. But when I look at __change_pid() now I do not see the 2nd lock() after unlock(pidmap_lock). I was wrong, thanks. And, even if I was correct it is not nice to rely on the internals of detach_pid(). If we use next_thread_careful(), __unhash_process() needs wmb. Thanks! > > And. Whatever we do with de_thread(), this can't fix the lockless > > while_each_thread(not_a_group_leader, t). I do not know if there is > > any user which does this though. > > fastpath_timer_check()->thread_group_cputimer() does this, but this > > is wrong and we already have the patch which removes it. > > Indeed. Suppose that the starting task is the one immediately preceding > the task group leader. You get a pointer to the task in question > and traverse to the next task (the task group leader), during which > time the thread group leader does exec() and maybe a pthread_create() > or two. Oops! You are now now traversing the wrong thread group! Yes, but there is another more simple and much more feasible race. while_each_thread(non_leader, t) will loop forever if non_leader simply exits and passes __unhash_process(). After that next_thread(t) can never return non_leader. > There are ways of fixing this, but all the ones I know of require more > fields in the task structure, Just in case, I hope that next_thread_careful() handles this case too. > so best if we don't need to start somewhere > other than a group leader. (I assume, you mean the lockless case) I am not sure. Because it is not trivial to enforce this rule even if we add a fat comment. Please look at check_hung_uninterruptible_tasks(). 
It does do_each_thread/while_each_thread and thus it always starts at the leader. But in fact this is not true due to rcu_lock_break(). Or proc_task_readdir(). It finds the leader, does get_task_struct(leader) and drops rcu lock. After that first_tid() takes rcu_read_lock() again and does the open coded while_each_thread(). At first glance this case looks fine, "for (pos = leader; nr > 0; --nr)" can't loop forever. But in fact it is not: - proc_task_readdir() finds the leader L, does get_task_struct() and drops rcu lock. Suppose that filp->f_pos >= 2. Suppose also that the previous tid cached in filp->f_version is no longer valid. The caller is preempted. - a sub-thread T does exec, and does release_task(L). But L->thread_group->next == T, so far everything is good - T spawns other sub-threads (so that get_nr_threads() > nr) and exits. - grace period passes, T's task_struct is freed/unmapped/reused - proc_task_readdir() resumes and calls first_tid(L). next_thread(L) returns T == nowhere It is very possible that I missed something here, my only point is that I think it would be safer to assume nothing about the leaderness. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-23 15:24 ` Oleg Nesterov @ 2010-06-24 18:07 ` Paul E. McKenney 2010-06-24 18:50 ` Chris Friesen ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Paul E. McKenney @ 2010-06-24 18:07 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On Wed, Jun 23, 2010 at 05:24:21PM +0200, Oleg Nesterov wrote: > On 06/22, Paul E. McKenney wrote: > > > > On Tue, Jun 22, 2010 at 11:23:57PM +0200, Oleg Nesterov wrote: > > > On 06/21, Paul E. McKenney wrote: > > > > > > > > Yeah, would be good to avoid this. Not sure it can be avoided, though. > > > > > > Why? I think next_thread_careful() from > > > http://marc.info/?l=linux-kernel&m=127714242731448 > > > should work. > > > > > > If the caller holds tasklist or siglock, this change has no effect. > > > > > > If the caller does while_each_thread() under rcu_read_lock(), then > > > it is OK to break the loop earlier than we do now. The lockless > > > while_each_thread() works in a "best effort" manner anyway, if it > > > races with exit_group() or exec() it can miss some/most/all sub-threads > > > (including the new leader) with or without this change. > > > > > > Yes, zap_threads() needs additional fixes. But I think it is better > > > to complicate a couple of lockless callers (or just change them > > > to take tasklist) which must not miss an "interesting" thread. > > > > Is it the case that creating a new group leader from an existing group > > always destroys the old group? It certainly is the case for exec(). > > Yes. And only exec() can change the leader. 
> > > Anyway, if creating a new thread group implies destroying the old one, > > and if the thread group leader cannot be concurrently creating a new > > thread group and traversing the old one, then yes, I do believe your > > code at http://marc.info/?l=linux-kernel&m=127714242731448 will work. > > > > Assuming that the call to next_thread_careful(t) in the definition of > > while_each_thread() is replaced with next_thread_careful(g,t). > > Great. > > > And give or take memory barriers. > > > > The implied memory barrier mentioned in the comment in your example code > > is the spin_lock_irqsave() and spin_unlock_irqrestore() > > Argh. No, no, no. I meant unlock + lock, not lock + unlock. But when > I look at __change_pid() now I do not see the 2nd lock() after > unlock(pidmap_lock). I was wrong, thanks. > > And, even if I was correct it is not nice to rely on the internals > of detach_pid(). If we use next_thread_careful(), __unhash_process() > needs wmb. > > Thanks! > > > > And. Whatever we do with de_thread(), this can't fix the lockless > > > while_each_thread(not_a_group_leader, t). I do not know if there is > > > any user which does this though. > > > fastpath_timer_check()->thread_group_cputimer() does this, but this > > > is wrong and we already have the patch which removes it. > > > > Indeed. Suppose that the starting task is the one immediately preceding > > the task group leader. You get a pointer to the task in question > > and traverse to the next task (the task group leader), during which > > time the thread group leader does exec() and maybe a pthread_create() > > or two. Oops! You are now now traversing the wrong thread group! > > Yes, but there is another more simple and much more feasible race. > while_each_thread(non_leader, t) will loop forever if non_leader > simply exits and passes __unhash_process(). After that next_thread(t) > can never return non_leader. 
> > > There are ways of fixing this, but all the ones I know of require more > > fields in the task structure, > > Just in case, I hope that next_thread_careful() handles this case too. > > > so best if we don't need to start somewhere > > other than a group leader. > > (I assume, you mean the lockless case) > > I am not sure. Because it is not trivial to enforce this rule even if > we add a fat comment. Please look at check_hung_uninterruptible_tasks(). > It does do_each_thread/while_each_thread and thus it always starts at > the leader. But in fact this is not true due to rcu_lock_break(). > > Or proc_task_readdir(). It finds the leader, does get_task_struct(leader) > and drops rcu lock. After that first_tid() takes rcu_read_lock() again > and does the open coded while_each_thread(). At first glance this case > looks fine, "for (pos = leader; nr > 0; --nr)" can't loop forever. > But in fact it is not: > > - proc_task_readdir() finds the leader L, does get_task_struct() > and drops rcu lock. > > Suppose that filp->f_pos >= 2. > Suppose also that the previous tid cached in filp->f_version > is no longer valid. > > The caller is preempted. > > - a sub-thread T does exec, and does release_task(L). > > But L->thread_group->next == T, so far everything is good > > - T spawns other sub-threads (so that get_nr_threads() > nr) > and exits. > > - grace period passes, T's task_struct is freed/unmapped/reused > > - proc_task_readdir() resumes and calls first_tid(L). > > next_thread(L) returns T == nowhere > > It is very possible that I missed something here, my only point is > that I think it would be safer to assume nothing about the leaderness. It is past time that I list out my assumptions more carefully. ;-) First, what "bad things" can happen to a reader scanning a thread group? 1. The thread-group leader might do exec(), destroying the old list and forming a new one. In this case, we want any readers to stop scanning. 2. 
Some other thread might do exec(), destroying the old list and forming a new one. In this case, we also want any readers to stop scanning. 3. The thread-group leader might do pthread_exit(), removing itself from the thread group -- and might do so while the hapless reader is referencing that thread. But isn't this prohibited? Or is it really legal to do a pthread_create() to create a new thread and then have the parent thread call pthread_exit()? Not something I would consider trying in my own code! Well, I might, just to be perverse, but... ;-) 4. Some other thread might do pthread_exit(), removing itself from the thread group, and again might do so while the hapless reader is referencing that thread. In this case, we want the hapless reader to continue scanning the remainder of the thread group. 5. The thread-group leader might do exit(), destroying the old list without forming a new one. In this case, we want any readers to stop scanning. 6. Some other thread might do exit(), destroying the old list without forming a new one. In this case, we also want any readers to stop scanning. Anything else I might be missing? And proc_task_readdir() does look quite interesting, and I don't claim to understand it yet. Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-24 18:07 ` Paul E. McKenney @ 2010-06-24 18:50 ` Chris Friesen 2010-06-24 22:00 ` Oleg Nesterov 2010-06-24 21:14 ` Roland McGrath 2010-06-24 21:57 ` Oleg Nesterov 2 siblings, 1 reply; 45+ messages in thread From: Chris Friesen @ 2010-06-24 18:50 UTC (permalink / raw) To: paulmck Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/24/2010 12:07 PM, Paul E. McKenney wrote: > 3. The thread-group leader might do pthread_exit(), removing itself > from the thread group -- and might do so while the hapless reader > is referencing that thread. > > But isn't this prohibited? Or is it really legal to do a > pthread_create() to create a new thread and then have the > parent thread call pthread_exit()? Not something I would > consider trying in my own code! Well, I might, just to > be perverse, but... ;-) I believe SUS allows the main thread to explicitly call pthread_exit(), leaving the other threads to run. If the main() routine just returns then it implicitly calls exit(). Chris -- Chris Friesen Software Developer GENBAND chris.friesen@genband.com www.genband.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-24 18:50 ` Chris Friesen @ 2010-06-24 22:00 ` Oleg Nesterov 2010-06-25 0:08 ` Eric W. Biederman 0 siblings, 1 reply; 45+ messages in thread From: Oleg Nesterov @ 2010-06-24 22:00 UTC (permalink / raw) To: Chris Friesen Cc: paulmck, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/24, Chris Friesen wrote: > > On 06/24/2010 12:07 PM, Paul E. McKenney wrote: > > > 3. The thread-group leader might do pthread_exit(), removing itself > > from the thread group -- and might do so while the hapless reader > > is referencing that thread. > > > > But isn't this prohibited? Or is it really legal to do a > > pthread_create() to create a new thread and then have the > > parent thread call pthread_exit()? Not something I would > > consider trying in my own code! Well, I might, just to > > be perverse, but... ;-) > > I believe SUS allows the main thread to explicitly call pthread_exit(), > leaving the other threads to run. If the main() routine just returns > then it implicitly calls exit(). Correct. But, to clarify, if the main thread does pthread_exit() (sys_exit, actually), it won't be removed from the group. It will be zombie until all other threads exit. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-24 22:00 ` Oleg Nesterov @ 2010-06-25 0:08 ` Eric W. Biederman 2010-06-25 3:42 ` Paul E. McKenney ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Eric W. Biederman @ 2010-06-25 0:08 UTC (permalink / raw) To: Oleg Nesterov Cc: Chris Friesen, paulmck, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable Oleg Nesterov <oleg@redhat.com> writes: > On 06/24, Chris Friesen wrote: >> >> On 06/24/2010 12:07 PM, Paul E. McKenney wrote: >> >> > 3. The thread-group leader might do pthread_exit(), removing itself >> > from the thread group -- and might do so while the hapless reader >> > is referencing that thread. >> > >> > But isn't this prohibited? Or is it really legal to do a >> > pthread_create() to create a new thread and then have the >> > parent thread call pthread_exit()? Not something I would >> > consider trying in my own code! Well, I might, just to >> > be perverse, but... ;-) >> >> I believe SUS allows the main thread to explicitly call pthread_exit(), >> leaving the other threads to run. If the main() routine just returns >> then it implicitly calls exit(). > > Correct. > > But, to clarify, if the main thread does pthread_exit() (sys_exit, > actually), it won't be removed from the group. It will be zombie > until all other threads exit. That we don't clean up those zombie leaders is unfortunate really, it means we have the entire de_thread special case. But short of fixing libpthread to not make bad assumptions there is little we can do about it really. I'm only half following this conversation. 
If what we are looking for is a stable list_head that won't disappear on us we should be able to put one in sighand_struct or signal_struct (I forget which is which at the moment) and have a list_head that lives for the life of the longest living thread, and that won't get messed up by things like de_thread, and then next_thread could simply return NULL when we hit the end of the list. Eric ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-25 0:08 ` Eric W. Biederman @ 2010-06-25 3:42 ` Paul E. McKenney 2010-06-25 10:08 ` Oleg Nesterov 2010-07-09 0:52 ` Roland McGrath 2 siblings, 0 replies; 45+ messages in thread From: Paul E. McKenney @ 2010-06-25 3:42 UTC (permalink / raw) To: Eric W. Biederman Cc: Oleg Nesterov, Chris Friesen, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable On Thu, Jun 24, 2010 at 05:08:10PM -0700, Eric W. Biederman wrote: > Oleg Nesterov <oleg@redhat.com> writes: > > > On 06/24, Chris Friesen wrote: > >> > >> On 06/24/2010 12:07 PM, Paul E. McKenney wrote: > >> > >> > 3. The thread-group leader might do pthread_exit(), removing itself > >> > from the thread group -- and might do so while the hapless reader > >> > is referencing that thread. > >> > > >> > But isn't this prohibited? Or is it really legal to do a > >> > pthread_create() to create a new thread and then have the > >> > parent thread call pthread_exit()? Not something I would > >> > consider trying in my own code! Well, I might, just to > >> > be perverse, but... ;-) > >> > >> I believe SUS allows the main thread to explicitly call pthread_exit(), > >> leaving the other threads to run. If the main() routine just returns > >> then it implicitly calls exit(). > > > > Correct. > > > > But, to clarify, if the main thread does pthread_exit() (sys_exit, > > actually), it won't be removed from the group. It will be zombie > > until all other threads exit. > > That we don't cleanup that zombie leaders is unfortunate really, it > means we have the entire de_thread special case. But short fixing > libpthread to not make bad assumptions there is little we can do about > it really. Keeping the zombie leaders does make at least one of the lockless scan cases quite a bit simpler. I think, anyway. > I'm only half following this conversation. 
> > If what we are looking for is a stable list_head that won't disappear > on us we should be able to put one in sighand_struct or signal_struct > (I forget which is which at the moment) and have a list_head that > lives for the life of the longest living thread, and that won't get > messed up by things like de_thread, and then next_thread could simply > return NULL when we hit the end of the list. Oleg did suggest this possibility, but there were complications that I do not claim to fully understand. Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-25 0:08 ` Eric W. Biederman 2010-06-25 3:42 ` Paul E. McKenney @ 2010-06-25 10:08 ` Oleg Nesterov 2010-07-09 0:52 ` Roland McGrath 2 siblings, 0 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-25 10:08 UTC (permalink / raw) To: Eric W. Biederman Cc: Chris Friesen, paulmck, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable On 06/24, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > If what we are looking for is a stable list_head that won't disappear > on us we should be able to put one in sighand_struct or signal_struct > (I forget which is which at the moment) and have a list_head that > lives for the life of the longest living thread, and that won't get > messed up by things like de_thread, and then next_thread could simply > return NULL when we hit the end of the list. This was already discussed. Yes, we can add list_head into signal_struct. But this breaks, say, thread_group_empty() and the similar code. This is fixeable. But. We need to convert the code which does while_each_thread(g, t) to use list_for_each(t, head). And, if g != group_leader this can't work. Not to mention the increase of sizeof(signal_struct). Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-25 0:08 ` Eric W. Biederman 2010-06-25 3:42 ` Paul E. McKenney 2010-06-25 10:08 ` Oleg Nesterov @ 2010-07-09 0:52 ` Roland McGrath 2 siblings, 0 replies; 45+ messages in thread From: Roland McGrath @ 2010-07-09 0:52 UTC (permalink / raw) To: Eric W. Biederman Cc: Oleg Nesterov, Chris Friesen, paulmck, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable > That we don't cleanup that zombie leaders is unfortunate really, it > means we have the entire de_thread special case. But short fixing > libpthread to not make bad assumptions there is little we can do about > it really. To be fair (perish the thought!), the semantics for these things were clearly specified in POSIX-1996 years before we had made any attempt in Linux at implementing them in the current fashion or any other, so we have no one but ourselves to blame for our current implementation choices. There are no required semantics about "zombie leaders". The semantics are that wait* calls refer to the whole process (thread group) and those wait events don't occur until all threads (the initial thread or others) are dead (or all stopped). Nothing really says we have to keep the leader's task_struct around, we just need to keep the tgid assigned. We could have the signal_struct be the thing that holds the tgid and is on the children list that wait looks at, and let dead leaders "self-reap" like other threads do. Since we've already separated the ptrace list and pseudo-wait from the proper children/wait list, it's only some of these existing implementation assumptions and perhaps some /proc vagaries that prevent us doing that. > I'm only half following this conversation. Going for a third here. Thanks, Roland ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-24 18:07 ` Paul E. McKenney 2010-06-24 18:50 ` Chris Friesen @ 2010-06-24 21:14 ` Roland McGrath 2010-06-25 3:37 ` Paul E. McKenney 2010-06-24 21:57 ` Oleg Nesterov 2 siblings, 1 reply; 45+ messages in thread From: Roland McGrath @ 2010-06-24 21:14 UTC (permalink / raw) To: paulmck Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman > First, what "bad things" can happen to a reader scanning a thread > group? > > 1. The thread-group leader might do exec(), destroying the old > list and forming a new one. In this case, we want any readers > to stop scanning. This doesn't do anything different (for these concerns) from just all the other threads happening to exit right before the exec. There is no "destroying the old" and "forming the new", it's just that all the other threads are convinced to die now. There is no problem here. > 2. Some other thread might do exec(), destroying the old list and > forming a new one. In this case, we also want any readers to > stop scanning. Again, the list is not really destroyed, just everybody dies. What is different here is that ->group_leader changes. This is the only time that ever happens. Moreover, it's the only time that a task that was previously pointed to by any ->group_leader can be reaped before the rest of the group has already been reaped first (and thus the thread_group made a singleton). > 3. The thread-group leader might do pthread_exit(), removing itself > from the thread group -- and might do so while the hapless reader > is referencing that thread. This is called the delay_group_leader() case. It doesn't happen in a way that has the problems you are concerned with. The group_leader remains in EXIT_ZOMBIE state and can't be reaped until all the other threads have been reaped. 
There is no time at which any thread in the group is in any hashes or accessible by any means after the (final) group_leader is reaped. > 4. Some other thread might do pthread_exit(), removing itself > from the thread group, and again might do so while the hapless > reader is referencing that thread. In this case, we want > the hapless reader to continue scanning the remainder of the > thread group. This is the most normal case (and #1 is effectively just this repeated by every thread in parallel). > 5. The thread-group leader might do exit(), destroying the old > list without forming a new one. In this case, we want any > readers to stop scanning. All this means is everybody is convinced to die, and the group_leader dies too. It is not discernibly different from #6. > 6. Some other thread might do exit(), destroying the old list > without forming a new one. In this case, we also want any > readers to stop scanning. This just means everybody is convinced to die and is not materially different from each individual thread all happening to die at the same time. You've described all these cases as "we want any readers to stop scanning". That is far too strong, and sounds like some kind of guaranteed synchronization, which does not and need not exist. Any reader that needs a dead thread to be off the list holds siglock and/or tasklist_lock. For the casual readers that only use rcu_read_lock, we only "want any readers' loops eventually to terminate and never to dereference stale pointers". That's why normal RCU listiness is generally fine. The only problem we have is in #2. This is only a problem because readers' loops may be using the old ->group_leader pointer as the anchor for their circular-list round-robin loop. Once the former leader is removed from the list, that loop termination condition can never be met. Thanks, Roland ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-24 21:14 ` Roland McGrath @ 2010-06-25 3:37 ` Paul E. McKenney 2010-07-09 0:41 ` Roland McGrath 0 siblings, 1 reply; 45+ messages in thread From: Paul E. McKenney @ 2010-06-25 3:37 UTC (permalink / raw) To: Roland McGrath Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman On Thu, Jun 24, 2010 at 02:14:46PM -0700, Roland McGrath wrote: > > First, what "bad things" can happen to a reader scanning a thread > > group? > > > > 1. The thread-group leader might do exec(), destroying the old > > list and forming a new one. In this case, we want any readers > > to stop scanning. > > This doesn't do anything different (for these concerns) from just all the > other threads happening to exit right before the exec. There is no > "destroying the old" and "forming the new", it's just that all the other > threads are convinced to die now. There is no problem here. Understood -- I wasn't saying that each category posed a unique problem, but rather making sure that I really had enumerated all the possibilities. The reason for my "destroying the old" and "forming the new" is the possibility of someone doing proc_task_readdir() when the group leader does exec(), which causes all to die, and then the new process does pthread_create(), forming a new thread group. Because proc_task_readdir() neither holds a lock nor stays in an RCU read-side critical section for the full /proc scan, the old group might really be destroyed from the reader's point of view. That said, I freely admit that I am not very familiar with this code. > > 2. Some other thread might do exec(), destroying the old list and > > forming a new one. In this case, we also want any readers to > > stop scanning. > > Again, the list is not really destroyed, just everybody dies. What is > different here is that ->group_leader changes. 
This is the only time > that ever happens. Moreover, it's the only time that a task that was > previously pointed to by any ->group_leader can be reaped before the > rest of the group has already been reaped first (and thus the > thread_group made a singleton). Yep! Same proc_task_readdir() situation as before. The group leader cannot go away because proc_task_readdir() takes a reference. > > 3. The thread-group leader might do pthread_exit(), removing itself > > from the thread group -- and might do so while the hapless reader > > is referencing that thread. > > This is called the delay_group_leader() case. It doesn't happen in a > way that has the problems you are concerned with. The group_leader > remains in EXIT_ZOMBIE state and can't be reaped until all the other > threads have been reaped. There is no time at which any thread in the > group is in any hashes or accessible by any means after the (final) > group_leader is reaped. OK, good to know -- that does make things simpler. > > 4. Some other thread might do pthread_exit(), removing itself > > from the thread group, and again might do so while the hapless > > reader is referencing that thread. In this case, we want > > the hapless reader to continue scanning the remainder of the > > thread group. > > This is the most normal case (and #1 is effectively just this repeated > by every thread in parallel). Agreed. One possible difference is that in #1, no one is going to complain about the reader quitting, while in this case someone might well be annoyed if the list of threads is incomplete. > > 5. The thread-group leader might do exit(), destroying the old > > list without forming a new one. In this case, we want any > > readers to stop scanning. > > All this means is everybody is convinced to die, and the group_leader > dies too. It is not discernibly different from #6. Seems reasonable. > > 6. Some other thread might do exit(), destroying the old list > > without forming a new one. 
In this case, we also want any > > readers to stop scanning. > > This just means everybody is convinced to die and is not materially > different from each individual thread all happening to die at the same > time. Fair enough. Again, my goal was to ensure that I had covered all the cases as opposed to ensuring that I had described them minimally. > You've described all these cases as "we want any readers to stop > scanning". That is far too strong, and sounds like some kind of > guaranteed synchronization, which does not and need not exist. Any > reader that needs a dead thread to be off the list holds siglock > and/or tasklist_lock. For the casual readers that only use > rcu_read_lock, we only "want any readers' loops eventually to > terminate and never to dereference stale pointers". That's why > normal RCU listiness is generally fine. OK, so maybe "it is OK for readers to stop scanning" is a better way of putting it? > The only problem we have is in #2. This is only a problem because > readers' loops may be using the old ->group_leader pointer as the > anchor for their circular-list round-robin loop. Once the former > leader is removed from the list, that loop termination condition can > never be met. Does Oleg's checking for the group leader still being alive look correct to you? Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-25 3:37 ` Paul E. McKenney @ 2010-07-09 0:41 ` Roland McGrath 0 siblings, 0 replies; 45+ messages in thread From: Roland McGrath @ 2010-07-09 0:41 UTC (permalink / raw) To: paulmck Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, linux-kernel, stable, Eric W. Biederman > The reason for my "destroying the old" and "forming the new" is the > possibility of someone doing proc_task_readdir() when the group leader > does exec(), which causes all to die, and then the new process does > pthread_create(), forming a new thread group. Because proc_task_readdir() > neither holds a lock nor stays in an RCU read-side critical section > for the full /proc scan, the old group might really be destroyed from > the reader's point of view. I haven't tried to understand the /proc code. From the userland perspective, there is one thing called a "process" with a single ID that the kernel calls the TGID and everybody else calls the PID, and that container survives execs regardless of which of its threads do them. Listing /proc/TGID/task is the way userland (i.e. debuggers) enumerate all the threads (e.g. for attaching them all with ptrace). It's of course expected that threads will be coming and going, so userland expects to read it several times, until there were no new threads in the list after it attached all the ones from the last read (so it would know if those ones created any more). I can't quite tell but it sounds like you may be saying that this procedure won't work with rewinding the same fd. After an exec, that fd may point to a reaped former leader and yield no results. (Looking at the code now, it looks like readdir will fail with the unusual error ENOENT in that case, so userland could detect that case easily now.) 
To pick up the next iteration of that procedure correctly, you'd have to reopen /proc/TGID/task by name to get an fd associated with the new leader. That is the only thing I can think of that is meaningful in userland terms and might be what you mean by "destroying the old and forming the new". Is that it? But it also sounds like you may be saying that the lack of locking in proc_task_readdir means it could just punt out of its listing loop early at any time that the task it just looked at is reaped. Is that right? That is OK for userland if any given readdir call returns a short list--but not if it returns a premature EOF. It looks to me like that's possible too. If so, that is startling off hand, and breaks the userland expectation by the "least astonishment" principle. (That is, you can sometimes get a short listing showing a subset of threads that does not include some threads you previously saw as alive and are still alive.) It can also actually break the procedure I described above if one false EOF causes the reader to miss a new thread it hasn't seen before, so it thinks it has already stopped all the threads that are alive. I don't know of anything in userland using that procedure. But it's what I would have recommended if asked, before you brought this issue up. (strace -p does a single pass and is only intending to offer any guarantee if you've already finished stopping the thing with SIGSTOP first. gdb uses other means that amount to setting a breakpoint inside pthread_create before reading libpthread's data structures from user memory for the userland thread list without regard to the actual kernel thread situation.) I suppose we can just say that proc_task_readdir is entirely unreliable unless you're sure otherwise that threads are not being reaped while you read it, since that seems to be the truth of it. 
I would sure be happier as a userland programmer if the kernel gave something that I could rely on by some feasible race-noticing procedure akin to that above, but it's not the end of the world. Thanks, Roland ^ permalink raw reply [flat|nested] 45+ messages in thread
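Roland's rescan-until-stable procedure can be sketched from userland. This is a hypothetical illustration, not code from this thread: the helper names (`scan_tasks`, `enumerate_all`) are invented, and it assumes a Linux-style `/proc`. Per the discussion above, it is still defeated by a premature-EOF readdir, so treat it as a sketch of the intent, not a guarantee:

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* One pass over /proc/<pid>/task, appending any TIDs not already in
 * seen[].  Returns the number of new TIDs found, or -1 if the directory
 * vanished (the ENOENT case discussed above, e.g. a reaped leader). */
static int scan_tasks(pid_t pid, pid_t *seen, int *nseen, int max)
{
    char path[64];
    struct dirent *d;
    DIR *dir;
    int added = 0;

    snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
    dir = opendir(path);
    if (!dir)
        return -1;
    while ((d = readdir(dir)) != NULL) {
        pid_t tid = (pid_t)atoi(d->d_name);
        int i, known = 0;
        if (tid <= 0)
            continue;                   /* skip "." and ".." */
        for (i = 0; i < *nseen; i++)
            if (seen[i] == tid)
                known = 1;
        if (!known && *nseen < max) {
            seen[(*nseen)++] = tid;
            added++;
        }
    }
    closedir(dir);
    return added;
}

/* Roland's procedure: rescan until a pass turns up no new threads
 * (at which point a debugger would have attached every one it saw). */
static int enumerate_all(pid_t pid, pid_t *seen, int max)
{
    int nseen = 0;
    while (scan_tasks(pid, seen, &nseen, max) > 0)
        ;
    return nseen;
}
```

The loop terminates when a full pass adds nothing new; as Roland notes, one false EOF during a pass can still cause a new thread to be missed.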
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-24 18:07 ` Paul E. McKenney 2010-06-24 18:50 ` Chris Friesen 2010-06-24 21:14 ` Roland McGrath @ 2010-06-24 21:57 ` Oleg Nesterov 2010-06-25 3:41 ` Paul E. McKenney 2 siblings, 1 reply; 45+ messages in thread From: Oleg Nesterov @ 2010-06-24 21:57 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/24, Paul E. McKenney wrote: > > On Wed, Jun 23, 2010 at 05:24:21PM +0200, Oleg Nesterov wrote: > > It is very possible that I missed something here, my only point is > > that I think it would be safer to assume nothing about the leaderness. > > It is past time that I list out my assumptions more carefully. ;-) > > First, what "bad things" can happen to a reader scanning a thread > group? (I assume you mean the lockless case) Currently, the only bad thing is that while_each_thread(g) can loop forever if we race with exec(), or exit() if g is not leader. And, to simplify, let's consider the same example again t = g; do { printk("pid %d\n", t->pid); } while_each_thread(g, t); > 1. The thread-group leader might do exec(), destroying the old > list and forming a new one. In this case, we want any readers > to stop scanning. I'd say, it is not that we want to stop scanning, it is OK to stop scanning after we printed g->pid > 2. Some other thread might do exec(), destroying the old list and > forming a new one. In this case, we also want any readers to > stop scanning. The same. If the code above runs under for_each_process(g) or it did "g = find_task_by_pid(tgid)", we will see either new or old leader and print its pid at least. > 3. The thread-group leader might do pthread_exit(), removing itself > from the thread group No. It can exit, but it won't be removed from thread group. It will be zombie until all sub-threads disappear. > 4. 
Some other thread might do pthread_exit(), removing itself > from the thread group, and again might do so while the hapless > reader is referencing that thread. In this case, we want > the hapless reader to continue scanning the remainder of the > thread group. Yes. But, if that thread was used as a starting point g, then before the patch: loop forever after the patch: break > 5. The thread-group leader might do exit(), destroying the old > list without forming a new one. In this case, we want any > readers to stop scanning. > > 6. Some other thread might do exit(), destroying the old list > without forming a new one. In this case, we also want any > readers to stop scanning. Yes. But again, it is fine to print more pids as far as we know it is safe to iterate over the exiting thread group. However, next_thread_careful() can stop earlier compared to next_thread(). Either way, we can miss none/some/most/all threads if we race with exit_group(). > Anything else I might be missing? I think this is all. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
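The before/after behavior Oleg describes for a released starting point g can be modeled entirely in user space. The sketch below is a hypothetical toy, not the kernel's `task_struct`/`list_head` machinery: it only imitates the fact that after `release_task()` the unhashed task's `next_thread()` effectively returns the task itself, which is what makes the plain loop spin, while a `next_thread_careful()`-style stop condition (break when the iterator stops advancing) terminates:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the thread-group ring: each "task" just points at
 * the next thread, like next_thread() following the CLONE_THREAD list. */
struct toy_task {
    int pid;
    struct toy_task *next;      /* next_thread() in this model */
};

/* Plain while_each_thread() with a safety bound: returns the number of
 * iteration steps, or -1 if the bound is exceeded (livelock detected). */
static int visit_plain(struct toy_task *g, int bound)
{
    struct toy_task *t = g;
    int n = 0;
    do {
        if (n++ > bound)
            return -1;          /* would loop forever */
        t = t->next;
    } while (t != g);
    return n;
}

/* Careful variant in the spirit of Oleg's proposal: also stop when
 * next_thread(t) returns t itself, the state left after release_task(). */
static int visit_careful(struct toy_task *g)
{
    struct toy_task *t = g, *prev;
    int n = 0;
    do {
        prev = t;
        t = t->next;
        n++;
    } while (t != g && t != prev);
    return n;
}
```

With an intact three-thread ring both loops complete; once the ring collapses to a self-loop (the exec/release race), only the careful variant stops, matching the "before the patch: loop forever / after the patch: break" summary above.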
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-24 21:57 ` Oleg Nesterov @ 2010-06-25 3:41 ` Paul E. McKenney 2010-06-25 9:55 ` Oleg Nesterov 0 siblings, 1 reply; 45+ messages in thread From: Paul E. McKenney @ 2010-06-25 3:41 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On Thu, Jun 24, 2010 at 11:57:02PM +0200, Oleg Nesterov wrote: > On 06/24, Paul E. McKenney wrote: > > > > On Wed, Jun 23, 2010 at 05:24:21PM +0200, Oleg Nesterov wrote: > > > It is very possible that I missed something here, my only point is > > > that I think it would be safer to assume nothing about the leaderness. > > > > It is past time that I list out my assumptions more carefully. ;-) > > > > First, what "bad things" can happen to a reader scanning a thread > > group? > > (I assume you mean the lockless case) You are quite right -- I should have stated that explicitly. > Currently, the only bad thing is that while_each_thread(g) can loop > forever if we race with exec(), or exit() if g is not leader. > > And, to simplify, let's consider the same example again > > t = g; > do { > printk("pid %d\n", t->pid); > } while_each_thread(g, t); > > > > 1. The thread-group leader might do exec(), destroying the old > > list and forming a new one. In this case, we want any readers > > to stop scanning. > > I'd say, it is not that we want to stop scanning, it is OK to stop > scanning after we printed g->pid Fair enough. > > 2. Some other thread might do exec(), destroying the old list and > > forming a new one. In this case, we also want any readers to > > stop scanning. > > The same. > > If the code above runs under for_each_process(g) or it did > "g = find_task_by_pid(tgid)", we will see either new or old leader > and print its pid at least. OK. > > 3. 
The thread-group leader might do pthread_exit(), removing itself > > from the thread group > > No. It can exit, but it won't be removed from thread group. It will > be zombie until all sub-threads disappear. This does make things easier! Whew!!! ;-) > > 4. Some other thread might do pthread_exit(), removing itself > > from the thread group, and again might do so while the hapless > > reader is referencing that thread. In this case, we want > > the hapless reader to continue scanning the remainder of the > > thread group. > > Yes. > > But, if that thread was used as a starting point g, then > > before the patch: loop forever > after the patch: break So it is OK to skip some of the other threads in this case, even though they were present throughout the whole procedure? > > 5. The thread-group leader might do exit(), destroying the old > > list without forming a new one. In this case, we want any > > readers to stop scanning. > > > > 6. Some other thread might do exit(), destroying the old list > > without forming a new one. In this case, we also want any > > readers to stop scanning. > > Yes. But again, it is fine to print more pids as far as we know it > is safe to iterate over the exiting thread group. However, > next_thread_careful() can stop earlier compared to next_thread(). > Either way, we can miss none/some/most/all threads if we race with > exit_group(). Yes, if there is an exit(), it makes sense that you might not see all of the threads -- they could reasonably have disappeared before you got done listing them. > > Anything else I might be missing? > > I think this is all. OK, thank you (and Roland) for the tutorial! Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-25 3:41 ` Paul E. McKenney @ 2010-06-25 9:55 ` Oleg Nesterov 2010-06-28 23:43 ` Paul E. McKenney 0 siblings, 1 reply; 45+ messages in thread From: Oleg Nesterov @ 2010-06-25 9:55 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/24, Paul E. McKenney wrote: > > On Thu, Jun 24, 2010 at 11:57:02PM +0200, Oleg Nesterov wrote: > > On 06/24, Paul E. McKenney wrote: > > > > > > 4. Some other thread might do pthread_exit(), removing itself > > > from the thread group, and again might do so while the hapless > > > reader is referencing that thread. In this case, we want > > > the hapless reader to continue scanning the remainder of the > > > thread group. > > > > Yes. > > > > But, if that thread was used as a starting point g, then > > > > before the patch: loop forever > > after the patch: break > > So it is OK to skip some of the other threads in this case, even > though they were present throughout the whole procedure? I think, yes. We can miss them in any case, they can go away before while_each_thread(g, t) starts the scan. If g == group_leader (old or new), then we should notice this thread at least. Otherwise we can miss them all, with or without next_thread_careful(). Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-25 9:55 ` Oleg Nesterov @ 2010-06-28 23:43 ` Paul E. McKenney 2010-06-29 13:05 ` Oleg Nesterov 0 siblings, 1 reply; 45+ messages in thread From: Paul E. McKenney @ 2010-06-28 23:43 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote: > On 06/24, Paul E. McKenney wrote: > > > > On Thu, Jun 24, 2010 at 11:57:02PM +0200, Oleg Nesterov wrote: > > > On 06/24, Paul E. McKenney wrote: > > > > > > > > 4. Some other thread might do pthread_exit(), removing itself > > > > from the thread group, and again might do so while the hapless > > > > reader is referencing that thread. In this case, we want > > > > the hapless reader to continue scanning the remainder of the > > > > thread group. > > > > > > Yes. > > > > > > But, if that thread was used as a starting point g, then > > > > > > before the patch: loop forever > > > after the patch: break > > > > So it is OK to skip some of the other threads in this case, even > > though they were present throughout the whole procedure? > > I think, yes. We can miss them in any case, they can go away before > while_each_thread(g, t) starts the scan. > > If g == group_leader (old or new), then we should notice this thread > at least. > > Otherwise we can miss them all, with or without next_thread_careful(). Just to be sure that we are actually talking about the same scenario... Suppose that a task group is lead by 2908 and has member 2909, 2910, 2911, and 2912. Suppose that 2910 does pthread_exit() just as some other task is "ls"ing the relevant /proc entry. Is it really OK for "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912 were alive and kicking the entire time? Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-28 23:43 ` Paul E. McKenney @ 2010-06-29 13:05 ` Oleg Nesterov 2010-06-29 15:34 ` Paul E. McKenney 0 siblings, 1 reply; 45+ messages in thread From: Oleg Nesterov @ 2010-06-29 13:05 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/28, Paul E. McKenney wrote: > > On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote: > > On 06/24, Paul E. McKenney wrote: > > > > > > So it is OK to skip some of the other threads in this case, even > > > though they were present throughout the whole procedure? > > > > I think, yes. We can miss them in any case, they can go away before > > while_each_thread(g, t) starts the scan. > > > > If g == group_leader (old or new), then we should notice this thread > > at least. > > > > Otherwise we can miss them all, with or without next_thread_careful(). > > Just to be sure that we are actually talking about the same scenario... > > Suppose that a task group is lead by 2908 and has member 2909, 2910, > 2911, and 2912. Suppose that 2910 does pthread_exit() just as some > other task is "ls"ing the relevant /proc entry. Is it really OK for > "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912 > were alive and kicking the entire time? Confused. Let's return to do printk("%d\n", t->pid); while_each_thread(g, t); for the moment. In that case, if g != 2910 (the exiting thread) we will print all pids, except we can miss 2910. With or without next_thread_careful(). Only if we start at g == 2910, then current code: print 2910, then spin forever printing other pids next_thread_careful: stop printing when we notice that 2910 was unhashed. So, yes, in this case we can miss all other threads. As for "ls"ing the relevant /proc entry. proc_task_readdir() is complicated, it can drop rcu lock, sleep, etc. 
But basically it mimics while_each_thread() logic. Let's assume that proc_task_fill_cache() never fails. proc_task_readdir() always starts at the group_leader, 2908. So, with or without next_thread_careful() we can only miss the exiting 2910. But (again, unless I missed something) the current code can race with exec, and s/next_thread/next_thread_careful/ in first_tid() can fix the race. (just in case, we can fix it differently). But, of course, if you do "ls /proc/2910/task" instead of "ls /proc/2908/task" you can miss _all_ threads if 2910 exits before proc_task_readdir() finds its leader, 2908. Again, this is with or without next_thread_careful(). Paul, please let me know if I misunderstood your concerns, or if I missed something. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-29 13:05 ` Oleg Nesterov @ 2010-06-29 15:34 ` Paul E. McKenney 2010-06-29 17:54 ` Oleg Nesterov 0 siblings, 1 reply; 45+ messages in thread From: Paul E. McKenney @ 2010-06-29 15:34 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On Tue, Jun 29, 2010 at 03:05:03PM +0200, Oleg Nesterov wrote: > On 06/28, Paul E. McKenney wrote: > > > > On Fri, Jun 25, 2010 at 11:55:48AM +0200, Oleg Nesterov wrote: > > > On 06/24, Paul E. McKenney wrote: > > > > > > > > So it is OK to skip some of the other threads in this case, even > > > > though they were present throughout the whole procedure? > > > > > > I think, yes. We can miss them in any case, they can go away before > > > while_each_thread(g, t) starts the scan. > > > > > > If g == group_leader (old or new), then we should notice this thread > > > at least. > > > > > > Otherwise we can miss them all, with or without next_thread_careful(). > > > > Just to be sure that we are actually talking about the same scenario... > > > > Suppose that a task group is lead by 2908 and has member 2909, 2910, > > 2911, and 2912. Suppose that 2910 does pthread_exit() just as some > > other task is "ls"ing the relevant /proc entry. Is it really OK for > > "ls" to show 2909 but not 2911 and 2912, even though 2911 and 2912 > > were alive and kicking the entire time? > > Confused. > > Let's return to > > do > printk("%d\n", t->pid); > while_each_thread(g, t); > > for the moment. > > In that case, if g != 2910 (the exiting thread) we will print all pids, > except we can miss 2910. With or without next_thread_careful(). > > Only if we start at g == 2910, then > > current code: print 2910, then spin forever printing > other pids > > next_thread_careful: stop printing when we notice that 2910 > was unhashed. 
> > So, yes, in this case we can miss all > other threads. > > As for "ls"ing the relevant /proc entry. proc_task_readdir() is complicated, > it can drop rcu lock, sleep, etc. But basically it mimics while_each_thread() > logic. Let's assume that proc_task_fill_cache() never fails. > > proc_task_readdir() always starts at the group_leader, 2908. So, with or > without next_thread_careful() we can only miss the exiting 2910. > > But (again, unless I missed something) the current code can race with exec, > and s/next_thread/next_thread_careful/ in first_tid() can fix the race. > (just in case, we can fix it differently). > > But, of course, if you do "ls /proc/2910/task" instead of "ls /proc/2908/task" > you can miss _all_ threads if 2910 exits before proc_task_readdir() finds > its leader, 2908. Again, this is with or without next_thread_careful(). > > > Paul, please let me know if I misunderstood your concerns, or if I missed > something. Thank you very much for laying this out completely! I was having a hard time believing that it was OK to miss threads in the "ls /proc/2910/task" case. But of course similar issues can arise when running "ls" on a directory with lots of files that are coming and going quickly in the meantime, I guess. And if proc_task_fill_cache() fails, we can miss tasks as well, correct? Given all this, I believe that your fix really does work. Thanx, Paul ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-29 15:34 ` Paul E. McKenney @ 2010-06-29 17:54 ` Oleg Nesterov 0 siblings, 0 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-29 17:54 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable, Eric W. Biederman On 06/29, Paul E. McKenney wrote: > > On Tue, Jun 29, 2010 at 03:05:03PM +0200, Oleg Nesterov wrote: > > > > Paul, please let me know if I misunderstood your concerns, or if I missed > > something. > > Thank you very much for laying this out completely! I was having a hard > time believing that it was OK to miss threads in the "ls /proc/2910/task" > case. But of course similar issues can arise when running "ls" on a > directory with lots of files that are coming and going quickly in the > meantime, I guess. Yes. And again, even if 2910 is not the group leader and it is exiting, "ls /proc/2910/task" will work because proc_task_readdir() always starts at 2910->group_leader == 2908. It doesn't work only if proc_task_readdir() can't find its leader, in this particular case this just means 2910 no longer exists, and thus /proc/2910/ is dead even if we can still find this dentry. > And if proc_task_fill_cache() fails, we can miss > tasks as well, correct? Well, yes and no. Sure, if proc_task_fill_cache() fails we didn't report all threads. But if /bin/ls does readdir() again after that, proc_task_readdir() tries to continue starting from the last-pid-we-failed-to-report. If there is no task with that pid, we start from the group_leader and skip the number-of-already-reported-threads. So, we have a lot of issues here, we can miss some thread because "skip the number-of-already-reported-threads" can't be really accurate. But, to clarify, this has almost nothing to do with the original problem. 
Afaics, if we change first_tid() to use next_thread_careful() instead of next_thread(), we close the pure-theoretical race with exec but that is all. (and I am still not sure this race does exist, and even if it does we can fix it without next_thread_careful). > Given all this, I believe that your fix really does work. Great. I'll send the patch once I inspect zap_threads() and current_is_single_threaded() to figure out which changes they need. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-18 19:34 ` while_each_thread() under rcu_read_lock() is broken? Oleg Nesterov 2010-06-18 21:08 ` Roland McGrath 2010-06-18 22:33 ` Paul E. McKenney @ 2010-06-19 5:00 ` Mandeep Baines 2010-06-19 5:35 ` Frederic Weisbecker 2010-06-19 19:19 ` Oleg Nesterov 2 siblings, 2 replies; 45+ messages in thread From: Mandeep Baines @ 2010-06-19 5:00 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Roland McGrath, linux-kernel, stable, Eric W. Biederman, Paul E. McKenney On Fri, Jun 18, 2010 at 12:34 PM, Oleg Nesterov <oleg@redhat.com> wrote: > (add cc's) > > Hmm. Once I sent this patch, I suddenly realized with horror that > while_each_thread() is NOT safe under rcu_read_lock(). Both > do_each_thread/while_each_thread or do/while_each_thread() can > race with exec(). > > Yes, it is safe to do next_thread() or next_task(). But: > > #define while_each_thread(g, t) \ > while ((t = next_thread(t)) != g) > > suppose that t is not the group leader, and it does de_thread() and then > release_task(g). After that next_thread(t) returns t, not g, and the loop > will never stop. > > I _really_ hope I missed something, will recheck tomorrow with the fresh > head. Still I'd like to share my concerns... > Yep. You're right. Not sure what I was thinking. This is only case where do_each_thread is protected by an rcu_read_lock. All others, correctly use read_lock. > If I am right, probably we can fix this, something like > > #define while_each_thread(g, t) \ > while ((t = next_thread(t)) != g && pid_alive(g)) > This seems like a reasonable approach. Maybe call it: while_each_thread_maybe_rcu() :) This makes hung_task a little less useful for failure fencing since this (and rcu_lock_break) increases the potential for never examining all threads but its still a nice lightweight diagnostic for finding bugs. 
> [we can't do while (!thread_group_leader(t = next_thread(t)))]. > but this needs barriers, and we should validate the callers anyway. > > Or, perhaps, > > #define XXX(t) ({ > struct task_struct *__prev = t; > t = next_thread(t); > t != g && t != __prev; > }) > > #define while_each_thread(g, t) \ > while (XXX(t)) > > Please tell me I am wrong! > > Oleg. > > On 06/18, Oleg Nesterov wrote: >> >> check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by >> "softlockup: check all tasks in hung_task" commit ce9dbe24 looks >> absolutely wrong. >> >> - rcu_lock_break() does put_task_struct(). If the task has exited >> it is not safe to even read its ->state, nothing protects this >> task_struct. >> >> - The TASK_DEAD checks are wrong too. Contrary to the comment, we >> can't use it to check if the task was unhashed. It can be unhashed >> without TASK_DEAD, or it can be valid with TASK_DEAD. >> >> For example, an autoreaping task can do release_task(current) >> long before it sets TASK_DEAD in do_exit(). >> >> Or, a zombie task can have ->state == TASK_DEAD but release_task() >> was not called, and in this case we must not break the loop. >> >> Change this code to check pid_alive() instead, and do this before we >> drop the reference to the task_struct. >> >> Signed-off-by: Oleg Nesterov <oleg@redhat.com> >> --- >> >> kernel/hung_task.c | 11 +++++++---- >> 1 file changed, 7 insertions(+), 4 deletions(-) >> >> --- 35-rc2/kernel/hung_task.c~CHT_FIX_RCU_LOCK_BREAK 2009-12-18 19:05:38.000000000 +0100 >> +++ 35-rc2/kernel/hung_task.c 2010-06-18 20:06:11.000000000 +0200 >> @@ -113,15 +113,20 @@ static void check_hung_task(struct task_ >> * For preemptible RCU it is sufficient to call rcu_read_unlock in order >> * exit the grace period. For classic RCU, a reschedule is required. 
>> */ >> -static void rcu_lock_break(struct task_struct *g, struct task_struct *t) >> +static bool rcu_lock_break(struct task_struct *g, struct task_struct *t) >> { >> + bool can_cont; >> + >> get_task_struct(g); >> get_task_struct(t); >> rcu_read_unlock(); >> cond_resched(); >> rcu_read_lock(); >> + can_cont = pid_alive(g) && pid_alive(t); >> put_task_struct(t); >> put_task_struct(g); >> + >> + return can_cont; >> } >> >> /* >> @@ -148,9 +153,7 @@ static void check_hung_uninterruptible_t >> goto unlock; >> if (!--batch_count) { >> batch_count = HUNG_TASK_BATCHING; >> - rcu_lock_break(g, t); >> - /* Exit if t or g was unhashed during refresh. */ >> - if (t->state == TASK_DEAD || g->state == TASK_DEAD) >> + if (!rcu_lock_break(g, t)) >> goto unlock; >> } >> /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ > > ^ permalink raw reply [flat|nested] 45+ messages in thread
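The shape of the fixed rcu_lock_break() quoted above — pin both tasks, drop the read-side lock, then revalidate with pid_alive() *before* dropping the references — can be imitated in a user-space model. Everything below is a toy sketch, not kernel code: the refcount and the `hashed` flag stand in for the task_struct usage count and for being hashed in the pid tables, and the names are invented. It only demonstrates the ordering that the patch fixes, namely that the aliveness check happens while references are still held:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for get_task_struct()/put_task_struct()/pid_alive(). */
struct pinned_task {
    int refs;       /* models the task_struct usage count */
    bool hashed;    /* models "still hashed", i.e. pid_alive() */
};

static void pin_task(struct pinned_task *t)   { t->refs++; }
static void unpin_task(struct pinned_task *t) { t->refs--; }
static bool toy_pid_alive(struct pinned_task *t) { return t->hashed; }

/* Mirrors the patched rcu_lock_break(): returns whether the caller may
 * keep iterating after the break.  The check runs before the unpins. */
static bool toy_lock_break(struct pinned_task *g, struct pinned_task *t)
{
    bool can_cont;

    pin_task(g);
    pin_task(t);
    /* rcu_read_unlock(); cond_resched(); rcu_read_lock(); happen here
     * in the real code -- during this window either task may be
     * released, which our test models by clearing ->hashed. */
    can_cont = toy_pid_alive(g) && toy_pid_alive(t);
    unpin_task(t);
    unpin_task(g);

    return can_cont;
}
```

The point of the ordering is exactly the bug described in the patch: once the references are dropped, nothing protects the task_struct, so even reading ->state (as the old TASK_DEAD check did) is unsafe.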
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-19 5:00 ` Mandeep Baines @ 2010-06-19 5:35 ` Frederic Weisbecker 2010-06-19 15:44 ` Mandeep Baines 2010-06-19 19:19 ` Oleg Nesterov 1 sibling, 1 reply; 45+ messages in thread From: Frederic Weisbecker @ 2010-06-19 5:35 UTC (permalink / raw) To: Mandeep Baines Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Ingo Molnar, Jerome Marchand, Roland McGrath, linux-kernel, stable, Eric W. Biederman, Paul E. McKenney On Fri, Jun 18, 2010 at 10:00:54PM -0700, Mandeep Baines wrote: > On Fri, Jun 18, 2010 at 12:34 PM, Oleg Nesterov <oleg@redhat.com> wrote: > > (add cc's) > > > > Hmm. Once I sent this patch, I suddenly realized with horror that > > while_each_thread() is NOT safe under rcu_read_lock(). Both > > do_each_thread/while_each_thread or do/while_each_thread() can > > race with exec(). > > > > Yes, it is safe to do next_thread() or next_task(). But: > > > > #define while_each_thread(g, t) \ > > while ((t = next_thread(t)) != g) > > > > suppose that t is not the group leader, and it does de_thread() and then > > release_task(g). After that next_thread(t) returns t, not g, and the loop > > will never stop. > > > > I _really_ hope I missed something, will recheck tomorrow with the fresh > > head. Still I'd like to share my concerns... > > > > Yep. You're right. Not sure what I was thinking. This is only case > where do_each_thread > is protected by an rcu_read_lock. All others, correctly use read_lock. cgroup does too. taskstats also uses rcu with while_each_thread, and maybe some others. It's not your fault, these iterators are supposed to be rcu safe, we are just encountering a bad race :) > > If I am right, probably we can fix this, something like > > > > #define while_each_thread(g, t) \ > > while ((t = next_thread(t)) != g && pid_alive(g)) > > > > This seems like a reasonable approach. Maybe call it: > > while_each_thread_maybe_rcu() :) Hmm, no while_each_thread must really be rcu_safe. 
> > This makes hung_task a little less useful for failure fencing since > this (and rcu_lock_break) > increases the potential for never examining all threads but its still > a nice lightweight diagnostic > for finding bugs. In fact may be we could just drop the rcu break, people who really care about latencies can use the preemptable version. I know the worry is more about delaying too much the grace period if we walk a too long task list, but I don't think it's really a problem. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-19 5:35 ` Frederic Weisbecker @ 2010-06-19 15:44 ` Mandeep Baines 0 siblings, 0 replies; 45+ messages in thread From: Mandeep Baines @ 2010-06-19 15:44 UTC (permalink / raw) To: Frederic Weisbecker Cc: Oleg Nesterov, Andrew Morton, Don Zickus, Ingo Molnar, Jerome Marchand, Roland McGrath, linux-kernel, stable, Eric W. Biederman, Paul E. McKenney On Fri, Jun 18, 2010 at 10:35 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote: > On Fri, Jun 18, 2010 at 10:00:54PM -0700, Mandeep Baines wrote: >> On Fri, Jun 18, 2010 at 12:34 PM, Oleg Nesterov <oleg@redhat.com> wrote: >> > (add cc's) >> > >> > Hmm. Once I sent this patch, I suddenly realized with horror that >> > while_each_thread() is NOT safe under rcu_read_lock(). Both >> > do_each_thread/while_each_thread or do/while_each_thread() can >> > race with exec(). >> > >> > Yes, it is safe to do next_thread() or next_task(). But: >> > >> > #define while_each_thread(g, t) \ >> > while ((t = next_thread(t)) != g) >> > >> > suppose that t is not the group leader, and it does de_thread() and then >> > release_task(g). After that next_thread(t) returns t, not g, and the loop >> > will never stop. >> > >> > I _really_ hope I missed something, will recheck tomorrow with the fresh >> > head. Still I'd like to share my concerns... >> > >> >> Yep. You're right. Not sure what I was thinking. This is only case >> where do_each_thread >> is protected by an rcu_read_lock. All others, correctly use read_lock. > > > > cgroup does too. > taskstats also uses rcu with while_each_threads, and may be some > others. > > It's not your fault, theses iterators are supposed to be rcu safe, > we are just encountering a bad race :) > Thanks:) Feel less dumb now. My quick grep only turned up hung_task: $ find . 
-name \*.c | xargs fgrep -B 10 do_each_thread | grep rcu ./kernel/hung_task.c- rcu_read_lock(); > > >> > If I am right, probably we can fix this, something like >> > >> > #define while_each_thread(g, t) \ >> > while ((t = next_thread(t)) != g && pid_alive(g)) >> > >> >> This seems like a reasonable approach. Maybe call it: >> >> while_each_thread_maybe_rcu() :) > > > > Hmm, no while_each_thread must really be rcu_safe. > I didn't realize there were other cases which need while_each_thread to be rcu-safe. For hung_task, it's OK to break out on a release_task(g). We'll just check the threads we missed on the next iteration. > > >> >> This makes hung_task a little less useful for failure fencing since >> this (and rcu_lock_break) >> increases the potential for never examining all threads but its still >> a nice lightweight diagnostic >> for finding bugs. > > > > In fact may be we could just drop the rcu break, people who really > care about latencies can use the preemptable version. > For large systems, you'd pin a CPU for a very long time checking for hung_tasks. You'd cause a lot of memory pressure by delaying the grace period for such a long time. You'd also cause softlockups with the huge burst of call_rcus being processed by rcu_process_callbacks. > I know the worry is more about delaying too much the grace period if > we walk a too long task list, but I don't think it's really a problem. > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: while_each_thread() under rcu_read_lock() is broken? 2010-06-19 5:00 ` Mandeep Baines 2010-06-19 5:35 ` Frederic Weisbecker @ 2010-06-19 19:19 ` Oleg Nesterov 1 sibling, 0 replies; 45+ messages in thread From: Oleg Nesterov @ 2010-06-19 19:19 UTC (permalink / raw) To: Mandeep Baines Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar, Jerome Marchand, Roland McGrath, linux-kernel, stable, Eric W. Biederman, Paul E. McKenney On 06/18, Mandeep Baines wrote: > > On Fri, Jun 18, 2010 at 12:34 PM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > I _really_ hope I missed something, will recheck tomorrow with the fresh > > head. Still I'd like to share my concerns... > > Yep. You're right. Not sure what I was thinking. This is only case > where do_each_thread > is protected by an rcu_read_lock. All others, correctly use read_lock. No, no. while_each_thread() is supposed to be rcu-safe, we should fix it. It has many rcu_read_lock() users. The patch which fixes rcu_lock_break() is orthogonal to this problem. Oleg. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic 2010-06-18 19:02 [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Oleg Nesterov 2010-06-18 19:34 ` while_each_thread() under rcu_read_lock() is broken? Oleg Nesterov @ 2010-06-18 20:11 ` Frederic Weisbecker 2010-06-18 20:38 ` Mandeep Singh Baines 2 siblings, 0 replies; 45+ messages in thread From: Frederic Weisbecker @ 2010-06-18 20:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Don Zickus, Ingo Molnar, Jerome Marchand, Mandeep Singh Baines, Roland McGrath, linux-kernel, stable On Fri, Jun 18, 2010 at 09:02:51PM +0200, Oleg Nesterov wrote: > check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by > "softlockup: check all tasks in hung_task" commit ce9dbe24 looks > absolutely wrong. > > - rcu_lock_break() does put_task_struct(). If the task has exited > it is not safe to even read its ->state, nothing protects this > task_struct. > > - The TASK_DEAD checks are wrong too. Contrary to the comment, we > can't use it to check if the task was unhashed. It can be unhashed > without TASK_DEAD, or it can be valid with TASK_DEAD. > > For example, an autoreaping task can do release_task(current) > long before it sets TASK_DEAD in do_exit(). > > Or, a zombie task can have ->state == TASK_DEAD but release_task() > was not called, and in this case we must not break the loop. > > Change this code to check pid_alive() instead, and do this before we > drop the reference to the task_struct. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> Very nice! Acked-by: Frederic Weisbecker <fweisbec@gmail.com> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic
  2010-06-18 19:02 [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Oleg Nesterov
  2010-06-18 19:34 ` while_each_thread() under rcu_read_lock() is broken? Oleg Nesterov
  2010-06-18 20:11 ` [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Frederic Weisbecker
@ 2010-06-18 20:38 ` Mandeep Singh Baines
  2 siblings, 0 replies; 45+ messages in thread
From: Mandeep Singh Baines @ 2010-06-18 20:38 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Don Zickus, Frederic Weisbecker, Ingo Molnar,
	Jerome Marchand, Roland McGrath, linux-kernel, stable

Oleg Nesterov (oleg@redhat.com) wrote:
> check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
> "softlockup: check all tasks in hung_task" commit ce9dbe24 looks
> absolutely wrong.
>
> - rcu_lock_break() does put_task_struct(). If the task has exited
>   it is not safe to even read its ->state, nothing protects this
>   task_struct.
>
> - The TASK_DEAD checks are wrong too. Contrary to the comment, we
>   can't use it to check if the task was unhashed. It can be unhashed
>   without TASK_DEAD, or it can be valid with TASK_DEAD.
>
>   For example, an autoreaping task can do release_task(current)
>   long before it sets TASK_DEAD in do_exit().
>
>   Or, a zombie task can have ->state == TASK_DEAD but release_task()
>   was not called, and in this case we must not break the loop.
>
> Change this code to check pid_alive() instead, and do this before we
> drop the reference to the task_struct.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---

Acked-by: Mandeep Singh Baines <msb@google.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread
end of thread, other threads:[~2010-07-09 16:18 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-18 19:02 [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Oleg Nesterov
2010-06-18 19:34 ` while_each_thread() under rcu_read_lock() is broken? Oleg Nesterov
2010-06-18 21:08   ` Roland McGrath
2010-06-18 22:37     ` Oleg Nesterov
2010-06-18 22:33   ` Paul E. McKenney
2010-06-21 17:09     ` Oleg Nesterov
2010-06-21 17:44       ` Oleg Nesterov
2010-06-21 18:00         ` Oleg Nesterov
2010-06-21 19:02           ` Roland McGrath
2010-06-21 20:06             ` Oleg Nesterov
2010-06-21 21:19               ` Eric W. Biederman
2010-06-22 14:34                 ` Oleg Nesterov
2010-07-08 23:59                   ` Roland McGrath
2010-07-09  0:41                     ` Paul E. McKenney
2010-07-09  1:01                       ` Roland McGrath
2010-07-09 16:18                         ` Paul E. McKenney
2010-06-21 20:51       ` Paul E. McKenney
2010-06-21 21:22         ` Eric W. Biederman
2010-06-21 21:38           ` Paul E. McKenney
2010-06-22 21:23             ` Oleg Nesterov
2010-06-22 22:12               ` Paul E. McKenney
2010-06-23 15:24                 ` Oleg Nesterov
2010-06-24 18:07                   ` Paul E. McKenney
2010-06-24 18:50                     ` Chris Friesen
2010-06-24 22:00                       ` Oleg Nesterov
2010-06-25  0:08                         ` Eric W. Biederman
2010-06-25  3:42                           ` Paul E. McKenney
2010-06-25 10:08                             ` Oleg Nesterov
2010-07-09  0:52                               ` Roland McGrath
2010-06-24 21:14                     ` Roland McGrath
2010-06-25  3:37                       ` Paul E. McKenney
2010-07-09  0:41                         ` Roland McGrath
2010-06-24 21:57                     ` Oleg Nesterov
2010-06-25  3:41                       ` Paul E. McKenney
2010-06-25  9:55                         ` Oleg Nesterov
2010-06-28 23:43                           ` Paul E. McKenney
2010-06-29 13:05                             ` Oleg Nesterov
2010-06-29 15:34                               ` Paul E. McKenney
2010-06-29 17:54                                 ` Oleg Nesterov
2010-06-19  5:00   ` Mandeep Baines
2010-06-19  5:35     ` Frederic Weisbecker
2010-06-19 15:44       ` Mandeep Baines
2010-06-19 19:19     ` Oleg Nesterov
2010-06-18 20:11 ` [PATCH] fix the racy check_hung_uninterruptible_tasks()->rcu_lock_break() logic Frederic Weisbecker
2010-06-18 20:38 ` Mandeep Singh Baines