* [PATCH 0/2] introduce for_each_process_thread_break() and for_each_process_thread_continue()
@ 2018-09-12 16:33 Oleg Nesterov
From: Oleg Nesterov @ 2018-09-12 16:33 UTC
  To: Andrew Morton
  Cc: Dmitry Vyukov, Michal Hocko, Paul E. McKenney, Peter Zijlstra,
	linux-kernel

Resend.

IMHO this makes sense anyway, but mostly this is preparation for other changes.
show_state_filter() and other "slow" users of for_each_process() can livelock
and even trigger hard lockups.
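
For illustration, the problematic pattern is simply (do_something_slow() is a
placeholder, not the actual show_state_filter() code):

	rcu_read_lock();
	for_each_process_thread(p, t) {
		/*
		 * A slow enough body means this walk never leaves the
		 * RCU read-side critical section; under a heavy fork/exit
		 * load it can livelock on the ever-growing task list or
		 * trigger a hard lockup.
		 */
		do_something_slow(p, t);
	}
	rcu_read_unlock();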

Peter, you already reviewed at least 1/2 (heh, two years ago) and iirc you
mostly agreed, but pointed out that the for_each_xxx() loops in _continue()
could be "SPENT_TOO_MUCH_TIME" themselves. I added a note to the changelog.

Oleg.

 include/linux/sched/signal.h | 10 ++++++++++
 kernel/exit.c                | 42 ++++++++++++++++++++++++++++++++++++++++++
 kernel/hung_task.c           | 25 +++++++++----------------
 3 files changed, 61 insertions(+), 16 deletions(-)



* [PATCH 1/2] introduce for_each_process_thread_break() and for_each_process_thread_continue()
@ 2018-09-12 16:33 Oleg Nesterov
From: Oleg Nesterov @ 2018-09-12 16:33 UTC
  To: Andrew Morton
  Cc: Dmitry Vyukov, Michal Hocko, Paul E. McKenney, Peter Zijlstra,
	linux-kernel

Usage:

	rcu_read_lock();
	for_each_process_thread(p, t) {
		do_something_slow(p, t);

		if (SPENT_TOO_MUCH_TIME) {
			for_each_process_thread_break(p, t);
			rcu_read_unlock();
			schedule();
			rcu_read_lock();
			for_each_process_thread_continue(&p, &t);
		}
	}
	rcu_read_unlock();

This looks similar to rcu_lock_break(), but is much better, and the next patch
changes check_hung_uninterruptible_tasks() to use these new helpers. But my
real target is show_state_filter(), which can trivially lead to a lockup.

Compared to rcu_lock_break(), for_each_process_thread_continue() never gives
up; it relies on the fact that both the process and thread lists are sorted by
the task->start_time key. So, for example, even if both the leader and the
thread are already dead, we can find the next alive process and continue.

Strictly speaking, the for_each_process/for_each_thread loops in _continue()
could be "SPENT_TOO_MUCH_TIME" by themselves, so perhaps we will add another
"max_scan" argument later, or do something else. But at least they cannot
livelock under heavy fork/exit loads; they are bounded by PID_MAX_DEFAULT
iterations in the worst case.
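
Just to illustrate the idea, such a "max_scan" bound would simply cap the key
search in _continue(); a sketch on top of the loop below, with max_scan being
the hypothetical new argument (not part of this patch):

	for_each_process(next) {
		if (next->start_time > start_time)
			break;
		if (!--max_scan)
			break;	/* hypothetical cap: stop searching, let the caller decide */
		prev = next;
	}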

NOTE: it seems that, contrary to the comment, task_struct->start_time is not
really monotonic, and this should probably be fixed. Until then, _continue()
might skip more threads with the same ->start_time than necessary.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/sched/signal.h | 10 ++++++++++
 kernel/exit.c                | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1be3572..1c957d4 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -565,6 +565,16 @@ extern bool current_is_single_threaded(void);
 #define for_each_process_thread(p, t)	\
 	for_each_process(p) for_each_thread(p, t)
 
+static inline void
+for_each_process_thread_break(struct task_struct *p, struct task_struct *t)
+{
+	get_task_struct(p);
+	get_task_struct(t);
+}
+
+extern void
+for_each_process_thread_continue(struct task_struct **, struct task_struct **);
+
 typedef int (*proc_visitor)(struct task_struct *p, void *data);
 void walk_process_tree(struct task_struct *top, proc_visitor, void *);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 0e21e6d..71380c7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -319,6 +319,48 @@ void rcuwait_wake_up(struct rcuwait *w)
 	rcu_read_unlock();
 }
 
+void for_each_process_thread_continue(struct task_struct **p_leader,
+				      struct task_struct **p_thread)
+{
+	struct task_struct *leader = *p_leader, *thread = *p_thread;
+	struct task_struct *prev, *next;
+	u64 start_time;
+
+	if (pid_alive(thread)) {
+		/* mt exec could change the leader */
+		*p_leader = thread->group_leader;
+	} else if (pid_alive(leader)) {
+		start_time = thread->start_time;
+		prev = leader;
+
+		for_each_thread(leader, next) {
+			if (next->start_time > start_time)
+				break;
+			prev = next;
+		}
+
+		*p_thread = prev;
+	} else {
+		start_time = leader->start_time;
+		prev = &init_task;
+
+		for_each_process(next) {
+			if (next->start_time > start_time)
+				break;
+			prev = next;
+		}
+
+		*p_leader = prev;
+		/* a new thread can come after that, but this is fine */
+		*p_thread = list_last_entry(&prev->signal->thread_head,
+						struct task_struct,
+						thread_node);
+	}
+
+	put_task_struct(leader);
+	put_task_struct(thread);
+}
+
 /*
  * Determine if a process group is "orphaned", according to the POSIX
  * definition in 2.2.2.52.  Orphaned process groups are not to be affected
-- 
2.5.0




* [PATCH 2/2] hung_task: change check_hung_uninterruptible_tasks() to use for_each_process_thread_break/continue
@ 2018-09-12 16:33 Oleg Nesterov
From: Oleg Nesterov @ 2018-09-12 16:33 UTC
  To: Andrew Morton
  Cc: Dmitry Vyukov, Michal Hocko, Paul E. McKenney, Peter Zijlstra,
	linux-kernel

Change check_hung_uninterruptible_tasks() to use the new helpers and remove
the "can_cont" logic from rcu_lock_break().

We could add for_each_process_thread_break/continue right into rcu_lock_break(),
but see the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/hung_task.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index b9132d1..13af558 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -148,20 +148,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
  * For preemptible RCU it is sufficient to call rcu_read_unlock in order
  * to exit the grace period. For classic RCU, a reschedule is required.
  */
-static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
+static void rcu_lock_break(void)
 {
-	bool can_cont;
-
-	get_task_struct(g);
-	get_task_struct(t);
 	rcu_read_unlock();
 	cond_resched();
 	rcu_read_lock();
-	can_cont = pid_alive(g) && pid_alive(t);
-	put_task_struct(t);
-	put_task_struct(g);
-
-	return can_cont;
 }
 
 /*
@@ -185,16 +176,18 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	hung_task_show_lock = false;
 	rcu_read_lock();
 	for_each_process_thread(g, t) {
-		if (!max_count--)
+		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
+		if (t->state == TASK_UNINTERRUPTIBLE)
+			check_hung_task(t, timeout);
+
+		if (!--max_count)
 			goto unlock;
 		if (!--batch_count) {
 			batch_count = HUNG_TASK_BATCHING;
-			if (!rcu_lock_break(g, t))
-				goto unlock;
+			for_each_process_thread_break(g, t);
+			rcu_lock_break();
+			for_each_process_thread_continue(&g, &t);
 		}
-		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
-		if (t->state == TASK_UNINTERRUPTIBLE)
-			check_hung_task(t, timeout);
 	}
  unlock:
 	rcu_read_unlock();
-- 
2.5.0




* Re: [PATCH 1/2] introduce for_each_process_thread_break() and for_each_process_thread_continue()
@ 2018-09-12 19:25 Andrew Morton
From: Andrew Morton @ 2018-09-12 19:25 UTC
  To: Oleg Nesterov
  Cc: Dmitry Vyukov, Michal Hocko, Paul E. McKenney, Peter Zijlstra,
	linux-kernel

On Wed, 12 Sep 2018 18:33:35 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> Usage:
> 
> 	rcu_read_lock();
> 	for_each_process_thread(p, t) {
> 		do_something_slow(p, t);
> 
> 		if (SPENT_TOO_MUCH_TIME) {
> 			for_each_process_thread_break(p, t);
> 			rcu_read_unlock();
> 			schedule();
> 			rcu_read_lock();
> 			for_each_process_thread_continue(&p, &t);
> 		}
> 	}
> 	rcu_read_unlock();
> 
> This looks similar to rcu_lock_break(), but is much better, and the next patch
> changes check_hung_uninterruptible_tasks() to use these new helpers. But my
> real target is show_state_filter(), which can trivially lead to a lockup.
> 
> Compared to rcu_lock_break(), for_each_process_thread_continue() never gives
> up; it relies on the fact that both the process and thread lists are sorted by
> the task->start_time key. So, for example, even if both the leader and the
> thread are already dead, we can find the next alive process and continue.
> 
> Strictly speaking, the for_each_process/for_each_thread loops in _continue()
> could be "SPENT_TOO_MUCH_TIME" by themselves, so perhaps we will add another
> "max_scan" argument later, or do something else. But at least they cannot
> livelock under heavy fork/exit loads; they are bounded by PID_MAX_DEFAULT
> iterations in the worst case.
> 
> NOTE: it seems that, contrary to the comment, task_struct->start_time is not
> really monotonic, and this should probably be fixed. Until then, _continue()
> might skip more threads with the same ->start_time than necessary.
> 
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -565,6 +565,16 @@ extern bool current_is_single_threaded(void);
>  #define for_each_process_thread(p, t)	\
>  	for_each_process(p) for_each_thread(p, t)
>  
> +static inline void
> +for_each_process_thread_break(struct task_struct *p, struct task_struct *t)
> +{
> +	get_task_struct(p);
> +	get_task_struct(t);
> +}
> +
> +extern void
> +for_each_process_thread_continue(struct task_struct **, struct task_struct **);

These things will need some documentation, please.  What they do, why
they do it, how people should use them, when and why they should use
them.  Etcetera!  This is tricky stuff.


> +void for_each_process_thread_continue(struct task_struct **p_leader,
> +				      struct task_struct **p_thread)
> +{
> +	struct task_struct *leader = *p_leader, *thread = *p_thread;
> +	struct task_struct *prev, *next;
> +	u64 start_time;
> +
> +	if (pid_alive(thread)) {
> +		/* mt exec could change the leader */
> +		*p_leader = thread->group_leader;
> +	} else if (pid_alive(leader)) {
> +		start_time = thread->start_time;
> +		prev = leader;
> +
> +		for_each_thread(leader, next) {
> +			if (next->start_time > start_time)
> +				break;
> +			prev = next;
> +		}
> +
> +		*p_thread = prev;
> +	} else {
> +		start_time = leader->start_time;
> +		prev = &init_task;
> +
> +		for_each_process(next) {
> +			if (next->start_time > start_time)
> +				break;
> +			prev = next;
> +		}
> +
> +		*p_leader = prev;
> +		/* a new thread can come after that, but this is fine */
> +		*p_thread = list_last_entry(&prev->signal->thread_head,
> +						struct task_struct,
> +						thread_node);
> +	}
> +
> +	put_task_struct(leader);
> +	put_task_struct(thread);
> +}

Should these be available to modules, like the rest of these things
appear to be?  Or we could do that later if a need is shown.


* Re: [PATCH 1/2] introduce for_each_process_thread_break() and for_each_process_thread_continue()
@ 2018-09-13 15:55 Oleg Nesterov
From: Oleg Nesterov @ 2018-09-13 15:55 UTC
  To: Andrew Morton
  Cc: Dmitry Vyukov, Michal Hocko, Paul E. McKenney, Peter Zijlstra,
	linux-kernel

On 09/12, Andrew Morton wrote:
>
> > Usage:
> >
> > 	rcu_read_lock();
> > 	for_each_process_thread(p, t) {
> > 		do_something_slow(p, t);
> >
> > 		if (SPENT_TOO_MUCH_TIME) {
> > 			for_each_process_thread_break(p, t);
> > 			rcu_read_unlock();
> > 			schedule();
> > 			rcu_read_lock();
> > 			for_each_process_thread_continue(&p, &t);
> > 		}
> > 	}
> > 	rcu_read_unlock();

...

> > +static inline void
> > +for_each_process_thread_break(struct task_struct *p, struct task_struct *t)
> > +{
> > +	get_task_struct(p);
> > +	get_task_struct(t);
> > +}
> > +
> > +extern void
> > +for_each_process_thread_continue(struct task_struct **, struct task_struct **);
>
> These things will need some documentation, please.  What they do, why
> they do it, how people should use them, when and why they should use
> them.  Etcetera!  This is tricky stuff.

See "Usage" above, this is as simple as list_for_each_entry_continue_rcu(),
just you need to call _break() first.
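
In fact the whole break/resume step fits in one small helper, more or less
what 2/2 open-codes in check_hung_uninterruptible_tasks() (a sketch; the
helper name is made up and the caller is assumed to hold rcu_read_lock()):

	static void process_thread_lock_break(struct task_struct **p,
					      struct task_struct **t)
	{
		for_each_process_thread_break(*p, *t);	/* pin both tasks */
		rcu_read_unlock();
		cond_resched();
		rcu_read_lock();
		/* drop the refs and find the next alive p/t to resume from */
		for_each_process_thread_continue(p, t);
	}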

OK, I'll try to add some comments and send V2.

> Should these be available to modules, like the rest of these things
> appear to be?  Or we could do that later if a need is shown.

Yes, I think this can be exported on demand,
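i.e. just a one-liner next to the definition in kernel/exit.c once a modular
user shows up (a sketch, assuming the GPL-only export flavor is wanted):

	EXPORT_SYMBOL_GPL(for_each_process_thread_continue);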

Oleg.


^ permalink raw reply	[flat|nested] 5+ messages in thread
