linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE
@ 2015-05-08 12:47 Peter Zijlstra
  2015-05-09  8:49 ` Julian Anastasov
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2015-05-08 12:47 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds, Oleg Nesterov; +Cc: linux-kernel, ja, neilb

Subject: sched: Introduce TASK_NOLOAD and TASK_IDLE
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri May 8 14:23:45 CEST 2015

Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
all idle kthreads contribute to the loadavg is somewhat silly.

Now mostly this works OK, because kthreads have all their signals
masked. However there's a few sites where this is causing problems and
TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.

This patch adds TASK_NOLOAD which, when combined with
TASK_UNINTERRUPTIBLE avoids the loadavg accounting.

As most of imagined usage sites are loops where a thread wants to
idle, waiting for work, a helper TASK_IDLE is introduced.

Cc: Julian Anastasov <ja@ssi.bg>
Cc: NeilBrown <neilb@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h        |   10 +++++++---
 include/trace/events/sched.h |    3 ++-
 2 files changed, 9 insertions(+), 4 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -213,9 +213,10 @@ print_cfs_rq(struct seq_file *m, int cpu
 #define TASK_WAKEKILL		128
 #define TASK_WAKING		256
 #define TASK_PARKED		512
-#define TASK_STATE_MAX		1024
+#define TASK_NOLOAD		1024
+#define TASK_STATE_MAX		2048
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPN"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
@@ -225,6 +226,8 @@ extern char ___assert_task_state[1 - 2*!
 #define TASK_STOPPED		(TASK_WAKEKILL | __TASK_STOPPED)
 #define TASK_TRACED		(TASK_WAKEKILL | __TASK_TRACED)
 
+#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
+
 /* Convenience macros for the sake of wake_up */
 #define TASK_NORMAL		(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
 #define TASK_ALL		(TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED)
@@ -240,7 +243,8 @@ extern char ___assert_task_state[1 - 2*!
 			((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
 #define task_contributes_to_load(task)	\
 				((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
-				 (task->flags & PF_FROZEN) == 0)
+				 (task->flags & PF_FROZEN) == 0 && \
+				 (task->state & TASK_NOLOAD) == 0)
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -147,7 +147,8 @@ TRACE_EVENT(sched_switch,
 		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
 				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
 				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
-				{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
+				{ 128, "K" }, { 256, "W" }, { 512, "P" },
+				{ 1024, "N" }) : "R",
 		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
 		__entry->next_comm, __entry->next_pid, __entry->next_prio)
 );

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE
  2015-05-08 12:47 [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE Peter Zijlstra
@ 2015-05-09  8:49 ` Julian Anastasov
  2015-05-11  7:11   ` NeilBrown
  2015-05-11 14:22   ` Peter Zijlstra
  0 siblings, 2 replies; 7+ messages in thread
From: Julian Anastasov @ 2015-05-09  8:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Oleg Nesterov, linux-kernel, neilb


	Hello,

On Fri, 8 May 2015, Peter Zijlstra wrote:

> Subject: sched: Introduce TASK_NOLOAD and TASK_IDLE
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri May 8 14:23:45 CEST 2015
> 
> Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
> 'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
> all idle kthreads contribute to the loadavg is somewhat silly.
> 
> Now mostly this works OK, because kthreads have all their signals
> masked. However there's a few sites where this is causing problems and
> TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
> 
> This patch adds TASK_NOLOAD which, when combined with
> TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
> 
> As most of imagined usage sites are loops where a thread wants to
> idle, waiting for work, a helper TASK_IDLE is introduced.

	After checking our code in net/netfilter/ipvs/ip_vs_sync.c,
sync_thread_master(), we may also need some wrappers:

- schedule_timeout_idle (instead of schedule_timeout call):
	__set_current_state(TASK_IDLE);
	return schedule_timeout(timeout);

	- we here are really idle, so "N" looks ok

- pair of __wait_event_idle(wq, condition) and
	wait_event_idle(wq, condition) macros

	- we here are write-blocked for socket, not sure
	if this blocked vs idle difference is useful to
	represent, in new bit for this blocked state "B"=2048,
	with 2 TASK_NOLOAD variants: N(idle) and B(blocked,
	2|1024|2048, eg. for read-blocked or write-blocked).
	It will need additional argument 'state'/'blocked' for 
	*wait_event_idle().

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE
  2015-05-09  8:49 ` Julian Anastasov
@ 2015-05-11  7:11   ` NeilBrown
  2015-05-11 14:22   ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: NeilBrown @ 2015-05-11  7:11 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Oleg Nesterov, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1993 bytes --]

On Sat, 9 May 2015 11:49:01 +0300 (EEST) Julian Anastasov <ja@ssi.bg> wrote:

> 
> 	Hello,
> 
> On Fri, 8 May 2015, Peter Zijlstra wrote:
> 
> > Subject: sched: Introduce TASK_NOLOAD and TASK_IDLE
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Fri May 8 14:23:45 CEST 2015
> > 
> > Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
> > 'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
> > all idle kthreads contribute to the loadavg is somewhat silly.
> > 
> > Now mostly this works OK, because kthreads have all their signals
> > masked. However there's a few sites where this is causing problems and
> > TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
> > 
> > This patch adds TASK_NOLOAD which, when combined with
> > TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
> > 
> > As most of imagined usage sites are loops where a thread wants to
> > idle, waiting for work, a helper TASK_IDLE is introduced.
> 
> 	After checking our code in net/netfilter/ipvs/ip_vs_sync.c,
> sync_thread_master(), we may also need some wrappers:
> 
> - schedule_timeout_idle (instead of schedule_timeout call):
> 	__set_current_state(TASK_IDLE);
> 	return schedule_timeout(timeout);
> 
> 	- we here are really idle, so "N" looks ok

yes, I would want
 wait_event_idle_timeout() and wait_event_idle()

but I'll be happy to take whatever I can get :-)

Thanks,
NeilBrown


> 
> - pair of __wait_event_idle(wq, condition) and
> 	wait_event_idle(wq, condition) macros
> 
> 	- we here are write-blocked for socket, not sure
> 	if this blocked vs idle difference is useful to
> 	represent, in new bit for this blocked state "B"=2048,
> 	with 2 TASK_NOLOAD variants: N(idle) and B(blocked,
> 	2|1024|2048, eg. for read-blocked or write-blocked).
> 	It will need additional argument 'state'/'blocked' for 
> 	*wait_event_idle().
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE
  2015-05-09  8:49 ` Julian Anastasov
  2015-05-11  7:11   ` NeilBrown
@ 2015-05-11 14:22   ` Peter Zijlstra
  2015-05-11 19:34     ` Julian Anastasov
  1 sibling, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2015-05-11 14:22 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Ingo Molnar, Linus Torvalds, Oleg Nesterov, linux-kernel, neilb

On Sat, May 09, 2015 at 11:49:01AM +0300, Julian Anastasov wrote:

> 	After checking our code in net/netfilter/ipvs/ip_vs_sync.c,
> sync_thread_master(), we may also need some wrappers:
> 
> - schedule_timeout_idle (instead of schedule_timeout call):
> 	__set_current_state(TASK_IDLE);
> 	return schedule_timeout(timeout);
> 
> 	- we here are really idle, so "N" looks ok

So I don't get the point of the schedule_timeout_*() stubs. What are
they for? Why would one use an unconditional schedule_timeout() call?
Isn't that what msleep() is for?

> - pair of __wait_event_idle(wq, condition) and
> 	wait_event_idle(wq, condition) macros
> 
> 	- we here are write-blocked for socket, not sure
> 	if this blocked vs idle difference is useful to
> 	represent, in new bit for this blocked state "B"=2048,
> 	with 2 TASK_NOLOAD variants: N(idle) and B(blocked,
> 	2|1024|2048, eg. for read-blocked or write-blocked).
> 	It will need additional argument 'state'/'blocked' for 
> 	*wait_event_idle().

Completely untested.

---
 include/linux/wait.h | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2db83349865b..b9ef6bacf633 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -307,6 +307,26 @@ do {									\
 	__ret;								\
 })
 
+#define __wait_event_idle(wq, condition)				\
+	(void)___wait_event(wq, condition, TASK_IDLE, 0, 0,		\
+			    schedule())
+
+
+/**
+ * wait_event_idle - sleep until a condition get true; do not contribute to the loadavg
+ * @wq: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ *
+ * See wait_event()
+ */
+#define wait_event_idle(wq, condition)					\
+do {									\
+	might_sleep();							\
+	if (condition)							\
+		break;							\
+	__wait_event_idle(wq, condition);				\
+} while (0)
+
 #define __wait_event_timeout(wq, condition, timeout)			\
 	___wait_event(wq, ___wait_cond_timeout(condition),		\
 		      TASK_UNINTERRUPTIBLE, 0, timeout,			\
@@ -346,8 +366,8 @@ do {									\
 		      __ret = schedule_timeout(__ret); try_to_freeze())
 
 /*
- * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
- * increasing load and is freezable.
+ * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE
+ * and is freezable.
  */
 #define wait_event_freezable_timeout(wq, condition, timeout)		\
 ({									\
@@ -358,6 +378,23 @@ do {									\
 	__ret;								\
 })
 
+#define __wait_event_idle_timeout(wq, condition, timeout)		\
+	___wait_event(wq, ___wait_cond_timeout(condition),		\
+		      TASK_IDLE, 0, timeout,				\
+		      __ret = schedule_timeout(__ret))
+
+/*
+ * like wait_event_timeout() -- except it uses TASK_IDLE to avoid loadavg
+ */
+#define wait_event_idle_timeout(wq, condition, timeout)			\
+({									\
+	long __ret = timeout;						\
+	might_sleep();							\
+	if (!___wait_cond_timeout(condition))				\
+		ret = __wait_event_idle_timeout(wq, condition, timeout);\
+	__ret;								\
+})
+
 #define __wait_event_cmd(wq, condition, cmd1, cmd2)			\
 	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
 			    cmd1; schedule(); cmd2)

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE
  2015-05-11 14:22   ` Peter Zijlstra
@ 2015-05-11 19:34     ` Julian Anastasov
  2015-05-12 12:25       ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Julian Anastasov @ 2015-05-11 19:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Oleg Nesterov, linux-kernel, neilb


	Hello,

On Mon, 11 May 2015, Peter Zijlstra wrote:

> > - schedule_timeout_idle (instead of schedule_timeout call):
> > 	__set_current_state(TASK_IDLE);
> > 	return schedule_timeout(timeout);
> > 
> > 	- we here are really idle, so "N" looks ok
> 
> So I don't get the point of the schedule_timeout_*() stubs. What are
> they for? Why would one use an unconditional schedule_timeout() call?
> Isn't that what msleep() is for?

	msleep will not return until timeout has expired.
Instead, we want to notice the kthread_should_stop() event
immediately. Additionally, TASK_UNINTERRUPTIBLE will increase
the load average. We can do it with extra wait queue
and the new __wait_event_idle_timeout but I guess
schedule_timeout_idle will be a good replacement for
schedule_timeout_interruptible calls when used for kthreads.

> + * like wait_event_timeout() -- except it uses TASK_IDLE to avoid loadavg
> + */
> +#define wait_event_idle_timeout(wq, condition, timeout)			\
> +({									\
> +	long __ret = timeout;						\
> +	might_sleep();							\
> +	if (!___wait_cond_timeout(condition))				\
> +		ret = __wait_event_idle_timeout(wq, condition, timeout);\

	ret may need underscores here...

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE
  2015-05-11 19:34     ` Julian Anastasov
@ 2015-05-12 12:25       ` Peter Zijlstra
  2015-05-13  7:22         ` Julian Anastasov
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2015-05-12 12:25 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Ingo Molnar, Linus Torvalds, Oleg Nesterov, linux-kernel, neilb

On Mon, May 11, 2015 at 10:34:11PM +0300, Julian Anastasov wrote:
> On Mon, 11 May 2015, Peter Zijlstra wrote:
> 
> > > - schedule_timeout_idle (instead of schedule_timeout call):
> > > 	__set_current_state(TASK_IDLE);
> > > 	return schedule_timeout(timeout);
> > > 
> > > 	- we here are really idle, so "N" looks ok
> > 
> > So I don't get the point of the schedule_timeout_*() stubs. What are
> > they for? Why would one use an unconditional schedule_timeout() call?
> > Isn't that what msleep() is for?
> 
> 	msleep will not return until timeout has expired.
> Instead, we want to notice the kthread_should_stop() event
> immediately. Additionally, TASK_UNINTERRUPTIBLE will increase
> the load average. We can do it with extra wait queue
> and the new __wait_event_idle_timeout but I guess
> schedule_timeout_idle will be a good replacement for
> schedule_timeout_interruptible calls when used for kthreads.

Fair enough I suppose, but then calling it schedule*() is just plain
wrong, it does not behave/act like a normal schedule() call.

Lemme go look at how widely abused that is.

*sigh*, its all over the place :/

$ git grep "schedule_timeout_\(interruptible\|killable\|uninterruptible\)" | wc -l
392

That said; I still don't see the point of schedule_timeout_idle(), we
should not sleep poll for state like that. We should only use TASK_IDLE
when we are in fact _IDLE_ and do not have work to do, at which point
one should use an wait_event() like construct to wait for new work.

> > + * like wait_event_timeout() -- except it uses TASK_IDLE to avoid loadavg
> > + */
> > +#define wait_event_idle_timeout(wq, condition, timeout)			\
> > +({									\
> > +	long __ret = timeout;						\
> > +	might_sleep();							\
> > +	if (!___wait_cond_timeout(condition))				\
> > +		ret = __wait_event_idle_timeout(wq, condition, timeout);\
> 
> 	ret may need underscores here...

I'm fairly sure that might aid in compilation indeed :-)


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE
  2015-05-12 12:25       ` Peter Zijlstra
@ 2015-05-13  7:22         ` Julian Anastasov
  0 siblings, 0 replies; 7+ messages in thread
From: Julian Anastasov @ 2015-05-13  7:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Oleg Nesterov, linux-kernel, neilb


	Hello,

On Tue, 12 May 2015, Peter Zijlstra wrote:

> > 	msleep will not return until timeout has expired.
> > Instead, we want to notice the kthread_should_stop() event
> > immediately. Additionally, TASK_UNINTERRUPTIBLE will increase
> > the load average. We can do it with extra wait queue
> > and the new __wait_event_idle_timeout but I guess
> > schedule_timeout_idle will be a good replacement for
> > schedule_timeout_interruptible calls when used for kthreads.
> 
> Fair enough I suppose, but then calling it schedule*() is just plain
> wrong, it does not behave/act like a normal schedule() call.
> 
> Lemme go look at how widely abused that is.
> 
> *sigh*, its all over the place :/
> 
> $ git grep "schedule_timeout_\(interruptible\|killable\|uninterruptible\)" | wc -l
> 392
> 
> That said; I still don't see the point of schedule_timeout_idle(), we
> should not sleep poll for state like that. We should only use TASK_IDLE
> when we are in fact _IDLE_ and do not have work to do, at which point
> one should use an wait_event() like construct to wait for new work.

	Probably. But some kthreads may want to sleep,
like in the IPVS case where there is a more complex
mechanism to wake up the kthread which is a socket writer
and does not poll the socket all time.

	But I see that kthreads always need to check with
kthread_should_stop(), so if we add schedule_timeout_idle()
it should not be so simple, may be something like that
is race free on kthread_stop() event, if needed at all:

/* state: TASK_IDLE (idle) or TASK_UNINTERRUPTIBLE (busy) */
kthread_schedule_timeout(timeout, state)
{
	/* note: no underscores => set_mb */
	set_current_state(state);
	/* test_bit after memory barrier */
	if (kthread_should_stop())
		return timeout;
	return schedule_timeout(timeout);
}

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-05-13  7:24 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-08 12:47 [PATCH] sched: Introduce TASK_NOLOAD and TASK_IDLE Peter Zijlstra
2015-05-09  8:49 ` Julian Anastasov
2015-05-11  7:11   ` NeilBrown
2015-05-11 14:22   ` Peter Zijlstra
2015-05-11 19:34     ` Julian Anastasov
2015-05-12 12:25       ` Peter Zijlstra
2015-05-13  7:22         ` Julian Anastasov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).