* [RFC PATCH 00/11] sched: CFS low-latency features
@ 2010-08-26 18:09 Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 01/11] sched: fix string comparison in features Mathieu Desnoyers
                   ` (12 more replies)
  0 siblings, 13 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 2108 bytes --]

Hi,

Following the findings I presented a few months ago
(http://lkml.org/lkml/2010/4/18/13) about CFS having large vruntime spread
issues, Peter Zijlstra and I pursued the discussion and the implementation
effort (my work on this is funded by Nokia). I recently put the result together
and came up with this patchset, combining both his work and mine.

With this patchset, I got the following results with wakeup-latency.c (a 10ms
periodic timer), running periodic-fork.sh, Xorg, make -j3 and firefox (playing a
youtube video), with Xorg moving terminal windows around, in parallel on a UP
system (links to the test program source in the dyn min_vruntime patch). The
Xorg interactivity is very good with the new features enabled, but was poor
originally with the vanilla mainline scheduler. The 10ms timer delays are as
follows:

                         2.6.35.2 mainline*   with low-latency features**
maximum latency:                34465.2 µs                    8261.4 µs
average latency:                 6445.5 µs                     211.2 µs
missed timer events:                yes                           no

* 2.6.35.2 mainline test needs to run periodic-fork.sh for a few minutes first
  to let it rip the spread apart.

** low-latency features:

with the patchset applied and CONFIG_SCHED_DEBUG=y
(with debugfs mounted in /sys/debugfs)

for opt in DYN_MIN_VRUNTIME \
        NO_FAIR_SLEEPERS FAIR_SLEEPERS_INTERACTIVE FAIR_SLEEPERS_TIMER \
        INTERACTIVE TIMER \
        INTERACTIVE_FORK_EXPEDITED TIMER_FORK_EXPEDITED;
do echo $opt > /sys/debugfs/sched_features;
done

These patches are designed to allow individual enabling of each feature and to
make sure that the object size of sched.o does not grow when the features are
disabled on a CONFIG_SCHED_DEBUG=n kernel. Optimization of the try_to_wake_up()
fast path when features are enabled could be done later by merging some of these
features together. This patchset is based on 2.6.35.2.

Feedback is welcome,

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH 01/11] sched: fix string comparison in features
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 02/11] sched: debug spread check account for nr_running Mathieu Desnoyers
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-features-fix-string-compare.patch --]
[-- Type: text/plain, Size: 1231 bytes --]

Incorrect handling of the following case:

INTERACTIVE
INTERACTIVE_SOMETHING_ELSE

The comparison only checks up to each feature name's length, so a written
string that merely begins with a known feature name is treated as a match for
that feature. Compare the lengths first.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng.git/kernel/sched.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched.c
+++ linux-2.6-lttng.git/kernel/sched.c
@@ -722,7 +722,7 @@ sched_feat_write(struct file *filp, cons
 {
 	char buf[64];
 	char *cmp = buf;
-	int neg = 0;
+	int neg = 0, cmplen;
 	int i;
 
 	if (cnt > 63)
@@ -732,15 +732,24 @@ sched_feat_write(struct file *filp, cons
 		return -EFAULT;
 
 	buf[cnt] = 0;
+	for (i = 0; i < cnt; i++) {
+		if (buf[i] == '\n' || buf[i] == ' ') {
+			buf[i] = 0;
+			break;
+		}
+	}
 
 	if (strncmp(buf, "NO_", 3) == 0) {
 		neg = 1;
 		cmp += 3;
 	}
 
+	cmplen = strlen(cmp);
 	for (i = 0; sched_feat_names[i]; i++) {
 		int len = strlen(sched_feat_names[i]);
 
+		if (cmplen != len)
+			continue;
 		if (strncmp(cmp, sched_feat_names[i], len) == 0) {
 			if (neg)
 				sysctl_sched_features &= ~(1UL << i);



* [RFC PATCH 02/11] sched: debug spread check account for nr_running
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 01/11] sched: fix string comparison in features Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 03/11] sched: FAIR_SLEEPERS feature Mathieu Desnoyers
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fair-check-spread-with-running.patch --]
[-- Type: text/plain, Size: 1497 bytes --]

Extracted from a "sched-misc-bits.patch" patch from
Peter Zijlstra <a.p.zijlstra@chello.nl>.

This patch also moves the check_spread test to a point where nr_running is
always non-zero.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.git/kernel/sched_fair.c
@@ -715,7 +715,7 @@ static void check_spread(struct cfs_rq *
 	if (d < 0)
 		d = -d;
 
-	if (d > 3*sysctl_sched_latency)
+	if (d > 3*cfs_rq->nr_running*sysctl_sched_latency)
 		schedstat_inc(cfs_rq, nr_spread_over);
 #endif
 }
@@ -803,6 +803,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	check_spread(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -932,11 +933,9 @@ static void put_prev_entity(struct cfs_r
 	 * If still on the runqueue then deactivate_task()
 	 * was not called and update_curr() has to be done:
 	 */
-	if (prev->on_rq)
-		update_curr(cfs_rq);
-
-	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
+		update_curr(cfs_rq);
+		check_spread(cfs_rq, prev);
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);



* [RFC PATCH 03/11] sched: FAIR_SLEEPERS feature
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 01/11] sched: fix string comparison in features Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 02/11] sched: debug spread check account for nr_running Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 04/11] sched: debug cleanup place entity Mathieu Desnoyers
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith

[-- Attachment #1: sched-add-sleeper.patch --]
[-- Type: text/plain, Size: 1659 bytes --]

Add a FAIR_SLEEPERS feature flag that allows disabling the extra sleeper
vruntime boost on wakeup. Disabling it makes the DYN_MIN_VRUNTIME feature
behave better by keeping the min_vruntime value somewhere between MIN_vruntime
and max_vruntime (see /proc/sched_debug output with CONFIG_SCHED_DEBUG=y).

Disabling fair sleepers globally is typically bad for interactivity. This is
why a later patch introduces the "FAIR_SLEEPERS_INTERACTIVE" feature, which
provides this combination of features:

NO_FAIR_SLEEPERS
FAIR_SLEEPERS_INTERACTIVE

So that fair sleeper fairness is only given to interactive wakeup chains.
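With debugfs mounted at /sys/debugfs as in the cover letter, that combination
can be applied at runtime like so (a configuration fragment for the patched
kernel, assuming CONFIG_SCHED_DEBUG=y):

```shell
# Disable the global sleeper boost, then re-enable it only for
# interactive wakeup chains (patched kernel, CONFIG_SCHED_DEBUG=y,
# debugfs mounted at /sys/debugfs as in the cover letter).
echo NO_FAIR_SLEEPERS > /sys/debugfs/sched_features
echo FAIR_SLEEPERS_INTERACTIVE > /sys/debugfs/sched_features
```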

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 kernel/sched_fair.c     |    2 +-
 kernel/sched_features.h |    1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.git/kernel/sched_fair.c
@@ -735,7 +735,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		vruntime += sched_vslice(cfs_rq, se);
 
 	/* sleeps up to a single latency don't count. */
-	if (!initial) {
+	if (sched_feat(FAIR_SLEEPERS) && !initial) {
 		unsigned long thresh = sysctl_sched_latency;
 
 		/*
Index: linux-2.6-lttng.git/kernel/sched_features.h
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_features.h
+++ linux-2.6-lttng.git/kernel/sched_features.h
@@ -3,6 +3,7 @@
  * them to run sooner, but does not allow tons of sleepers to
  * rip the spread apart.
  */
+SCHED_FEAT(FAIR_SLEEPERS, 1)
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, 1)
 
 /*



* [RFC PATCH 04/11] sched: debug cleanup place entity
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 03/11] sched: FAIR_SLEEPERS feature Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 05/11] sched buddy enable buddy logic starting at 2 running threads Mathieu Desnoyers
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-debug-cleanup-place-entity.patch --]
[-- Type: text/plain, Size: 780 bytes --]

Extracted from a "sched-misc-bits.patch" patch from
Peter Zijlstra <a.p.zijlstra@chello.nl>.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.git/kernel/sched_fair.c
@@ -749,9 +749,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-	vruntime = max_vruntime(se->vruntime, vruntime);
-
-	se->vruntime = vruntime;
+	se->vruntime = max_vruntime(se->vruntime, vruntime);
 }
 
 static void



* [RFC PATCH 05/11] sched buddy enable buddy logic starting at 2 running threads
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 04/11] sched: debug cleanup place entity Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 06/11] sched: dynamic min_vruntime Mathieu Desnoyers
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-buddy-enable-at-2.patch --]
[-- Type: text/plain, Size: 1489 bytes --]

Extracted from a "sched-misc-bits.patch" patch from
Peter Zijlstra <a.p.zijlstra@chello.nl>.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.git/kernel/sched_fair.c
@@ -1647,7 +1647,11 @@ static void check_preempt_wakeup(struct 
 	struct task_struct *curr = rq->curr;
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-	int scale = cfs_rq->nr_running >= sched_nr_latency;
+	/*
+	 * The buddy logic doesn't work well when there's not actually enough
+	 * tasks for there to be buddies.
+	 */
+	int buddies = (cfs_rq->nr_running >= 2);
 
 	if (unlikely(rt_prio(p->prio)))
 		goto preempt;
@@ -1658,7 +1662,7 @@ static void check_preempt_wakeup(struct 
 	if (unlikely(se == pse))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
+	if (sched_feat(NEXT_BUDDY) && buddies && !(wake_flags & WF_FORK))
 		set_next_buddy(pse);
 
 	/*
@@ -1704,7 +1708,7 @@ preempt:
 	if (unlikely(!se->on_rq || curr == rq->idle))
 		return;
 
-	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
+	if (sched_feat(LAST_BUDDY) && buddies && entity_is_task(se))
 		set_last_buddy(se);
 }
 



* [RFC PATCH 06/11] sched: dynamic min_vruntime
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 05/11] sched buddy enable buddy logic starting at 2 running threads Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 07/11] sched rename struct task in_iowait field to sched_in_iowait Mathieu Desnoyers
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-dyn-min-vruntime.patch --]
[-- Type: text/plain, Size: 5113 bytes --]

[ Impact: Fixes the large vruntime spread problems I identified last fall, but
	  might have bad side-effects on Xorg interactivity. See the INTERACTIVE
          feature in a following patch that addresses this. ]

Push the scheduler's min_vruntime floor forward dynamically upon deschedule.
This ensures that the following workload won't grow the spread to insanely
large values over time (give it 1-2 minutes). A large spread makes the
scheduler behave oddly when Xorg runs alongside latency-sensitive threads:
Xorg ends up at the beginning of the spread, and the latency-sensitive
workloads end up somewhere in the middle of it.

periodic-fork.sh:

#!/bin/sh

while ((1)); do
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
        tac /etc/passwd > /dev/null;
sleep 1;
done

My test program is wakeup-latency.c, provided by Nokia originally. A 10ms timer
spawns a thread which reads the time, and shows a warning if the expected
deadline has been missed by too much. It also warns about timer overruns.
It's available at:

http://www.efficios.com/pub/elc2010/wakeup-latency-0.1.tar.bz2

With periodic-fork.sh running and Xorg, without the DYN_MIN_VRUNTIME feature,
but with the INTERACTIVE, INTERACTIVE_FORK_EXPEDITED, TIMER and
TIMER_FORK_EXPEDITED features enabled:

....
min priority: 0, max priority: 0
late by: 6765.8 µs
late by: 5536.1 µs
overruns: 1
late by: 12212.3 µs
late by: 5477.5 µs
overruns: 1
late by: 12259.3 µs
overruns: 1
late by: 12224.9 µs
overruns: 1
late by: 12214.3 µs
overruns: 1
late by: 12196.2 µs

maximum latency: 12259.3 µs
average latency: 46.4 µs
missed timer events: 5

Now same workload with the DYN_MIN_VRUNTIME feature enabled:

min priority: 0, max priority: 0

maximum latency: 2908.3 µs
average latency: 6.9 µs
missed timer events: 0

Inspired from a patch done by Peter Zijlstra.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c     |   15 ++++++++++-----
 kernel/sched_features.h |    6 ++++++
 2 files changed, 16 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.git/kernel/sched_fair.c
@@ -301,9 +301,9 @@ static inline s64 entity_key(struct cfs_
 	return se->vruntime - cfs_rq->min_vruntime;
 }
 
-static void update_min_vruntime(struct cfs_rq *cfs_rq)
+static void update_min_vruntime(struct cfs_rq *cfs_rq, unsigned long delta_exec)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = cfs_rq->min_vruntime, new_vruntime;
 
 	if (cfs_rq->curr)
 		vruntime = cfs_rq->curr->vruntime;
@@ -319,7 +319,12 @@ static void update_min_vruntime(struct c
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+	new_vruntime = cfs_rq->min_vruntime;
+	if (sched_feat(DYN_MIN_VRUNTIME) && delta_exec)
+		new_vruntime += calc_delta_mine(delta_exec, NICE_0_LOAD,
+						&cfs_rq->load);
+
+	cfs_rq->min_vruntime = max_vruntime(new_vruntime, vruntime);
 }
 
 /*
@@ -513,7 +518,7 @@ __update_curr(struct cfs_rq *cfs_rq, str
 	delta_exec_weighted = calc_delta_fair(delta_exec, curr);
 
 	curr->vruntime += delta_exec_weighted;
-	update_min_vruntime(cfs_rq);
+	update_min_vruntime(cfs_rq, delta_exec);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -822,7 +827,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	account_entity_dequeue(cfs_rq, se);
-	update_min_vruntime(cfs_rq);
+	update_min_vruntime(cfs_rq, 0);
 
 	/*
 	 * Normalize the entity after updating the min_vruntime because the
Index: linux-2.6-lttng.git/kernel/sched_features.h
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_features.h
+++ linux-2.6-lttng.git/kernel/sched_features.h
@@ -57,6 +57,12 @@ SCHED_FEAT(LB_SHARES_UPDATE, 1)
 SCHED_FEAT(ASYM_EFF_LOAD, 1)
 
 /*
+ * Push the min_vruntime spread floor value when descheduling a task. This
+ * ensures the spread does not grow beyond control.
+ */
+SCHED_FEAT(DYN_MIN_VRUNTIME, 0)
+
+/*
  * Spin-wait on mutex acquisition when the mutex owner is running on
  * another cpu -- assumes that when the owner is running, it will soon
  * release the lock. Decreases scheduling overhead.



* [RFC PATCH 07/11] sched rename struct task in_iowait field to sched_in_iowait
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 06/11] sched: dynamic min_vruntime Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 08/11] sched input interactivity-driven next buddy Mathieu Desnoyers
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-rename-task-in-iowait.patch --]
[-- Type: text/plain, Size: 2427 bytes --]

Rename the task_struct in_iowait field to sched_in_iowait, since it is used
only by the scheduler. Extracted from a patch originally from Peter Zijlstra.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    7 +++----
 kernel/sched.c        |    8 ++++----
 kernel/sched_fair.c   |    2 +-
 3 files changed, 8 insertions(+), 9 deletions(-)

Index: linux-2.6-lttng.git/include/linux/sched.h
===================================================================
--- linux-2.6-lttng.git.orig/include/linux/sched.h
+++ linux-2.6-lttng.git/include/linux/sched.h
@@ -1233,11 +1233,10 @@ struct task_struct {
 	unsigned did_exec:1;
 	unsigned in_execve:1;	/* Tell the LSMs that the process is doing an
 				 * execve */
-	unsigned in_iowait:1;
 
-
-	/* Revert to default priority/policy when forking */
-	unsigned sched_reset_on_fork:1;
+	unsigned sched_in_iowait:1;		/* Called io_schedule() */
+	unsigned sched_reset_on_fork:1;		/* Revert to default
+						 * priority/policy on fork */
 
 	pid_t pid;
 	pid_t tgid;
Index: linux-2.6-lttng.git/kernel/sched.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched.c
+++ linux-2.6-lttng.git/kernel/sched.c
@@ -4957,9 +4957,9 @@ void __sched io_schedule(void)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	current->in_iowait = 1;
+	current->sched_in_iowait = 1;
 	schedule();
-	current->in_iowait = 0;
+	current->sched_in_iowait = 0;
 	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 }
@@ -4972,9 +4972,9 @@ long __sched io_schedule_timeout(long ti
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	current->in_iowait = 1;
+	current->sched_in_iowait = 1;
 	ret = schedule_timeout(timeout);
-	current->in_iowait = 0;
+	current->sched_in_iowait = 0;
 	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 	return ret;
Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.git/kernel/sched_fair.c
@@ -693,7 +693,7 @@ static void enqueue_sleeper(struct cfs_r
 		se->statistics.sum_sleep_runtime += delta;
 
 		if (tsk) {
-			if (tsk->in_iowait) {
+			if (tsk->sched_in_iowait) {
 				se->statistics.iowait_sum += delta;
 				se->statistics.iowait_count++;
 				trace_sched_stat_iowait(tsk, delta);



* [RFC PATCH 08/11] sched input interactivity-driven next buddy
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 07/11] sched rename struct task in_iowait field to sched_in_iowait Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 09/11] sched: timer-driven " Mathieu Desnoyers
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-input-buddy.patch --]
[-- Type: text/plain, Size: 8083 bytes --]

[ Impact: implement INTERACTIVE feature to increase Xorg responsiveness. ]

Apply next buddy logic to interactivity-driven wakeups. Don't pass the
interactivity flag across forks to defuse interactivity-based fork-bombs. The
goal of this patch is to ensure that Xorg keeps a good interactivity level by
ensuring that Xorg and its related threads quickly respond to wakeups caused by
user inputs.

Derived from a patch from Peter Zijlstra.

* This patch also makes sure that as soon as an iowait is perceived, the
  interactivity chain is stopped.
* This patch removes the previously available "NEXT_BUDDY" scheduler feature
  altogether.

On my 2.0GHz uniprocessor desktop, enabling the INTERACTIVE feature makes
firefox very responsive even if I overcommit my CPU with a make -j5 kernel
build.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 drivers/input/evdev.c   |    2 ++
 include/linux/sched.h   |   31 +++++++++++++++++++++++--------
 kernel/sched.c          |   12 +++++++++++-
 kernel/sched_fair.c     |   26 +++++++++++++++++++-------
 kernel/sched_features.h |   11 ++++-------
 5 files changed, 59 insertions(+), 23 deletions(-)

Index: linux-2.6-lttng.laptop/drivers/input/evdev.c
===================================================================
--- linux-2.6-lttng.laptop.orig/drivers/input/evdev.c
+++ linux-2.6-lttng.laptop/drivers/input/evdev.c
@@ -78,6 +78,7 @@ static void evdev_event(struct input_han
 	event.code = code;
 	event.value = value;
 
+	sched_wake_interactive_enable();
 	rcu_read_lock();
 
 	client = rcu_dereference(evdev->grab);
@@ -90,6 +91,7 @@ static void evdev_event(struct input_han
 	rcu_read_unlock();
 
 	wake_up_interruptible(&evdev->wait);
+	sched_wake_interactive_disable();
 }
 
 static int evdev_fasync(int fd, struct file *file, int on)
Index: linux-2.6-lttng.laptop/include/linux/sched.h
===================================================================
--- linux-2.6-lttng.laptop.orig/include/linux/sched.h
+++ linux-2.6-lttng.laptop/include/linux/sched.h
@@ -1024,14 +1024,17 @@ struct sched_domain;
 /*
  * wake flags
  */
-#define WF_SYNC		0x01		/* waker goes to sleep after wakup */
-#define WF_FORK		0x02		/* child wakeup after fork */
+#define WF_SYNC		(1 << 0)	/* waker goes to sleep after wakup */
+#define WF_FORK		(1 << 1)	/* child wakeup after fork */
+#define WF_INTERACTIVE	(1 << 2)	/* interactivity-driven wakeup */
+
+#define ENQUEUE_WAKEUP	(1 << 0)
+#define ENQUEUE_WAKING	(1 << 1)
+#define ENQUEUE_HEAD	(1 << 2)
+#define ENQUEUE_IO	(1 << 3)
+#define ENQUEUE_LATENCY	(1 << 4)
 
-#define ENQUEUE_WAKEUP		1
-#define ENQUEUE_WAKING		2
-#define ENQUEUE_HEAD		4
-
-#define DEQUEUE_SLEEP		1
+#define DEQUEUE_SLEEP	(1 << 0)
 
 struct sched_class {
 	const struct sched_class *next;
@@ -1124,7 +1127,8 @@ struct sched_entity {
 	struct load_weight	load;		/* for load-balancing */
 	struct rb_node		run_node;
 	struct list_head	group_node;
-	unsigned int		on_rq;
+	unsigned int		on_rq:1,
+				interactive:1;
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
@@ -1237,6 +1241,7 @@ struct task_struct {
 	unsigned sched_in_iowait:1;		/* Called io_schedule() */
 	unsigned sched_reset_on_fork:1;		/* Revert to default
 						 * priority/policy on fork */
+	unsigned sched_wake_interactive:4;	/* User-driven wakeup */
 
 	pid_t pid;
 	pid_t tgid;
@@ -1502,6 +1507,16 @@ struct task_struct {
 #endif
 };
 
+static inline void sched_wake_interactive_enable(void)
+{
+	current->sched_wake_interactive++;
+}
+
+static inline void sched_wake_interactive_disable(void)
+{
+	current->sched_wake_interactive--;
+}
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
Index: linux-2.6-lttng.laptop/kernel/sched.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched.c
+++ linux-2.6-lttng.laptop/kernel/sched.c
@@ -2288,6 +2288,13 @@ static int try_to_wake_up(struct task_st
 	unsigned long en_flags = ENQUEUE_WAKEUP;
 	struct rq *rq;
 
+	if (sched_feat(INTERACTIVE) && !(wake_flags & WF_FORK)) {
+		if (current->sched_wake_interactive ||
+				wake_flags & WF_INTERACTIVE ||
+				current->se.interactive)
+			en_flags |= ENQUEUE_LATENCY;
+	}
+
 	this_cpu = get_cpu();
 
 	smp_wmb();
@@ -3613,8 +3620,11 @@ need_resched_nonpreemptible:
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev)))
 			prev->state = TASK_RUNNING;
-		else
+		else {
+			if (sched_feat(INTERACTIVE))
+				prev->se.interactive = 0;
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
+		}
 		switch_count = &prev->nvcsw;
 	}
 
Index: linux-2.6-lttng.laptop/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.laptop/kernel/sched_fair.c
@@ -774,6 +774,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP) {
+		if (sched_feat(INTERACTIVE)
+		    && flags & ENQUEUE_LATENCY && !(flags & ENQUEUE_IO))
+			se->interactive = 1;
 		place_entity(cfs_rq, se, 0);
 		enqueue_sleeper(cfs_rq, se);
 	}
@@ -916,14 +919,14 @@ static struct sched_entity *pick_next_en
 	struct sched_entity *se = __pick_next_entity(cfs_rq);
 	struct sched_entity *left = se;
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
-		se = cfs_rq->next;
+	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+		se = cfs_rq->last;
 
 	/*
-	 * Prefer last buddy, try to return the CPU to a preempted task.
+	 * Prefer the next buddy, only set through the interactivity logic.
 	 */
-	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
-		se = cfs_rq->last;
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+		se = cfs_rq->next;
 
 	clear_buddies(cfs_rq, se);
 
@@ -1046,6 +1049,9 @@ enqueue_task_fair(struct rq *rq, struct
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 
+	if (p->sched_in_iowait)
+		flags |= ENQUEUE_IO;
+
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
@@ -1657,6 +1663,7 @@ static void check_preempt_wakeup(struct
 	 * tasks for there to be buddies.
 	 */
 	int buddies = (cfs_rq->nr_running >= 2);
+	int preempt = 0;
 
 	if (unlikely(rt_prio(p->prio)))
 		goto preempt;
@@ -1667,8 +1674,13 @@ static void check_preempt_wakeup(struct
 	if (unlikely(se == pse))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && buddies && !(wake_flags & WF_FORK))
+	if (sched_feat(INTERACTIVE)
+	    && !(wake_flags & WF_FORK) && pse->interactive) {
+		clear_buddies(cfs_rq, NULL);
 		set_next_buddy(pse);
+		preempt = 1;
+		buddies = 0;
+	}
 
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
@@ -1694,7 +1706,7 @@ static void check_preempt_wakeup(struct
 	update_curr(cfs_rq);
 	find_matching_se(&se, &pse);
 	BUG_ON(!pse);
-	if (wakeup_preempt_entity(se, pse) == 1)
+	if (preempt || wakeup_preempt_entity(se, pse) == 1)
 		goto preempt;
 
 	return;
Index: linux-2.6-lttng.laptop/kernel/sched_features.h
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched_features.h
+++ linux-2.6-lttng.laptop/kernel/sched_features.h
@@ -26,13 +26,6 @@ SCHED_FEAT(WAKEUP_PREEMPT, 1)
 SCHED_FEAT(AFFINE_WAKEUPS, 1)
 
 /*
- * Prefer to schedule the task we woke last (assuming it failed
- * wakeup-preemption), since its likely going to consume data we
- * touched, increases cache locality.
- */
-SCHED_FEAT(NEXT_BUDDY, 0)
-
-/*
  * Prefer to schedule the task that ran last (when we did
  * wake-preempt) as that likely will touch the same data, increases
  * cache locality.
@@ -61,6 +54,10 @@ SCHED_FEAT(ASYM_EFF_LOAD, 1)
  * ensures the spread does not grow beyond control.
  */
 SCHED_FEAT(DYN_MIN_VRUNTIME, 0)
+/*
+ * Input subsystem next buddy affinity. Not transitive across new task wakeups.
+ */
+SCHED_FEAT(INTERACTIVE, 0)
 
 /*
  * Spin-wait on mutex acquisition when the mutex owner is running on


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH 09/11] sched: timer-driven next buddy
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 08/11] sched input interactivity-driven next buddy Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-27 18:02   ` [RFC PATCH 09/11] sched: timer-driven next buddy (update) Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 10/11] sched: fork expedited Mathieu Desnoyers
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-timer-buddy.patch --]
[-- Type: text/plain, Size: 8810 bytes --]

[ Impact: implement TIMER feature to diminish the latencies induced by wakeups
          performed by timer callbacks ]

Ensure that timer callbacks triggering wakeups get served ASAP by giving
timer-driven wakeups next-buddy affinity.

My test program is wakeup-latency.c, originally provided by Nokia. A 10 ms timer
spawns a thread which reads the time and prints a warning if the expected
deadline has been missed by too much. It also warns about timer overruns.

Without the TIMER and TIMER_FORK_EXPEDITED features:

min priority: 0, max priority: 0
[....]
maximum latency: 41453.6 µs
average latency: 4127.0 µs
missed timer events: 0

With the features enabled:

min priority: 0, max priority: 0
[...]
maximum latency: 10013.5 µs
average latency: 162.9 µs
missed timer events: 0

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h     |   16 +++++++++++++++-
 kernel/hrtimer.c          |    2 ++
 kernel/itimer.c           |    2 ++
 kernel/posix-cpu-timers.c |    2 ++
 kernel/posix-timers.c     |    2 ++
 kernel/sched.c            |    9 +++++++++
 kernel/sched_fair.c       |   11 ++++++++---
 kernel/sched_features.h   |    4 ++++
 kernel/timer.c            |    2 ++
 9 files changed, 46 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng.laptop/include/linux/sched.h
===================================================================
--- linux-2.6-lttng.laptop.orig/include/linux/sched.h
+++ linux-2.6-lttng.laptop/include/linux/sched.h
@@ -1027,12 +1027,14 @@ struct sched_domain;
 #define WF_SYNC		(1 << 0)	/* waker goes to sleep after wakup */
 #define WF_FORK		(1 << 1)	/* child wakeup after fork */
 #define WF_INTERACTIVE	(1 << 2)	/* interactivity-driven wakeup */
+#define WF_TIMER	(1 << 3)	/* timer-driven wakeup */
 
 #define ENQUEUE_WAKEUP	(1 << 0)
 #define ENQUEUE_WAKING	(1 << 1)
 #define ENQUEUE_HEAD	(1 << 2)
 #define ENQUEUE_IO	(1 << 3)
 #define ENQUEUE_LATENCY	(1 << 4)
+#define ENQUEUE_TIMER	(1 << 5)
 
 #define DEQUEUE_SLEEP	(1 << 0)
 
@@ -1128,7 +1130,8 @@ struct sched_entity {
 	struct rb_node		run_node;
 	struct list_head	group_node;
 	unsigned int		on_rq:1,
-				interactive:1;
+				interactive:1,
+				timer:1;
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
@@ -1242,6 +1245,7 @@ struct task_struct {
 	unsigned sched_reset_on_fork:1;		/* Revert to default
 						 * priority/policy on fork */
 	unsigned sched_wake_interactive:4;	/* User-driven wakeup */
+	unsigned sched_wake_timer:4;		/* Timer-driven wakeup */
 
 	pid_t pid;
 	pid_t tgid;
@@ -1517,6 +1521,16 @@ static inline void sched_wake_interactiv
 	current->sched_wake_interactive--;
 }
 
+static inline void sched_wake_timer_enable(void)
+{
+	current->sched_wake_timer++;
+}
+
+static inline void sched_wake_timer_disable(void)
+{
+	current->sched_wake_timer--;
+}
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
Index: linux-2.6-lttng.laptop/kernel/sched_features.h
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched_features.h
+++ linux-2.6-lttng.laptop/kernel/sched_features.h
@@ -58,6 +58,10 @@ SCHED_FEAT(DYN_MIN_VRUNTIME, 0)
  * Input subsystem next buddy affinity. Not transitive across new task wakeups.
  */
 SCHED_FEAT(INTERACTIVE, 0)
+/*
+ * Timer subsystem next buddy affinity. Not transitive across new task wakeups.
+ */
+SCHED_FEAT(TIMER, 0)
 
 /*
  * Spin-wait on mutex acquisition when the mutex owner is running on
Index: linux-2.6-lttng.laptop/kernel/sched.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched.c
+++ linux-2.6-lttng.laptop/kernel/sched.c
@@ -2295,6 +2295,13 @@ static int try_to_wake_up(struct task_st
 			en_flags |= ENQUEUE_LATENCY;
 	}
 
+	if (sched_feat(TIMER) && !(wake_flags & WF_FORK)) {
+		if (current->sched_wake_timer ||
+				wake_flags & WF_TIMER ||
+				current->se.timer)
+			en_flags |= ENQUEUE_TIMER;
+	}
+
 	this_cpu = get_cpu();
 
 	smp_wmb();
@@ -3623,6 +3630,8 @@ need_resched_nonpreemptible:
 		else {
 			if (sched_feat(INTERACTIVE))
 				prev->se.interactive = 0;
+			if (sched_feat(TIMER))
+				prev->se.timer = 0;
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 		}
 		switch_count = &prev->nvcsw;
Index: linux-2.6-lttng.laptop/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.laptop/kernel/sched_fair.c
@@ -777,6 +777,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 		if (sched_feat(INTERACTIVE)
 		    && flags & ENQUEUE_LATENCY && !(flags & ENQUEUE_IO))
 			se->interactive = 1;
+		if (sched_feat(TIMER)
+		    && flags & ENQUEUE_TIMER && !(flags & ENQUEUE_IO))
+			se->timer = 1;
 		place_entity(cfs_rq, se, 0);
 		enqueue_sleeper(cfs_rq, se);
 	}
@@ -923,7 +926,8 @@ static struct sched_entity *pick_next_en
 		se = cfs_rq->last;
 
 	/*
-	 * Prefer the next buddy, only set through the interactivity logic.
+	 * Prefer the next buddy, only set through the interactivity and timer
+	 * logic.
 	 */
 	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;
@@ -1674,8 +1678,9 @@ static void check_preempt_wakeup(struct
 	if (unlikely(se == pse))
 		return;
 
-	if (sched_feat(INTERACTIVE)
-	    && !(wake_flags & WF_FORK) && pse->interactive) {
+	if (!(wake_flags & WF_FORK)
+	    && ((sched_feat(INTERACTIVE) && pse->interactive)
+		|| (sched_feat(TIMER) && pse->timer))) {
 		clear_buddies(cfs_rq, NULL);
 		set_next_buddy(pse);
 		preempt = 1;
Index: linux-2.6-lttng.laptop/kernel/posix-timers.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/posix-timers.c
+++ linux-2.6-lttng.laptop/kernel/posix-timers.c
@@ -402,6 +402,7 @@ static enum hrtimer_restart posix_timer_
 	int si_private = 0;
 	enum hrtimer_restart ret = HRTIMER_NORESTART;
 
+	sched_wake_timer_enable();
 	timr = container_of(timer, struct k_itimer, it.real.timer);
 	spin_lock_irqsave(&timr->it_lock, flags);
 
@@ -456,6 +457,7 @@ static enum hrtimer_restart posix_timer_
 	}
 
 	unlock_timer(timr, flags);
+	sched_wake_timer_disable();
 	return ret;
 }
 
Index: linux-2.6-lttng.laptop/kernel/timer.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/timer.c
+++ linux-2.6-lttng.laptop/kernel/timer.c
@@ -1038,6 +1038,7 @@ static void call_timer_fn(struct timer_l
 	 */
 	struct lockdep_map lockdep_map = timer->lockdep_map;
 #endif
+	sched_wake_timer_enable();
 	/*
 	 * Couple the lock chain with the lock chain at
 	 * del_timer_sync() by acquiring the lock_map around the fn()
@@ -1062,6 +1063,7 @@ static void call_timer_fn(struct timer_l
 		 */
 		preempt_count() = preempt_count;
 	}
+	sched_wake_timer_disable();
 }
 
 #define INDEX(N) ((base->timer_jiffies >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)
Index: linux-2.6-lttng.laptop/kernel/hrtimer.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/hrtimer.c
+++ linux-2.6-lttng.laptop/kernel/hrtimer.c
@@ -1212,6 +1212,7 @@ static void __run_hrtimer(struct hrtimer
 
 	WARN_ON(!irqs_disabled());
 
+	sched_wake_timer_enable();
 	debug_deactivate(timer);
 	__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
 	timer_stats_account_hrtimer(timer);
@@ -1238,6 +1239,7 @@ static void __run_hrtimer(struct hrtimer
 		enqueue_hrtimer(timer, base);
 	}
 	timer->state &= ~HRTIMER_STATE_CALLBACK;
+	sched_wake_timer_disable();
 }
 
 #ifdef CONFIG_HIGH_RES_TIMERS
Index: linux-2.6-lttng.laptop/kernel/itimer.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/itimer.c
+++ linux-2.6-lttng.laptop/kernel/itimer.c
@@ -129,7 +129,9 @@ enum hrtimer_restart it_real_fn(struct h
 
 	trace_itimer_expire(ITIMER_REAL, sig->leader_pid, 0);
 	trace_timer_itimer_expired(sig);
+	sched_wake_timer_enable();
 	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);
+	sched_wake_timer_disable();
 
 	return HRTIMER_NORESTART;
 }
Index: linux-2.6-lttng.laptop/kernel/posix-cpu-timers.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/posix-cpu-timers.c
+++ linux-2.6-lttng.laptop/kernel/posix-cpu-timers.c
@@ -610,6 +610,7 @@ static void arm_timer(struct k_itimer *t
  */
 static void cpu_timer_fire(struct k_itimer *timer)
 {
+	sched_wake_timer_enable();
 	if ((timer->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
 		/*
 		 * User don't want any signal.
@@ -637,6 +638,7 @@ static void cpu_timer_fire(struct k_itim
 		 */
 		posix_cpu_timer_schedule(timer);
 	}
+	sched_wake_timer_disable();
 }
 
 /*



* [RFC PATCH 10/11] sched: fork expedited
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (8 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 09/11] sched: timer-driven " Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:09 ` [RFC PATCH 11/11] sched: fair sleepers for timer and interactive Mathieu Desnoyers
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fork-expedited.patch --]
[-- Type: text/plain, Size: 4780 bytes --]

[ Impact: implement fork vruntime boosting when forks are performed from an
          interactive or timer wakeup chain. ]

Add new features:
INTERACTIVE_FORK_EXPEDITED
TIMER_FORK_EXPEDITED

to expedite forks performed from interactive and timer wakeup chains.

TIMER_FORK_EXPEDITED is needed to give the POSIX timer_create() API with
sigev_notify = SIGEV_THREAD lower latencies than it currently has. Yes,
spawning a new thread each time the timer fires is utterly ugly, but this
is a standard API people rely on. We seem to have two choices here: either

1) we push for SIGEV_THREAD deprecation. This is, after all, an utter glibc
   mess, where failing thread creation and memory allocation are not dealt with,
   and where the helper thread waiting for the signal is created the first time
   timer_create() is invoked, and therefore keeps the cgroup/scheduler/etc. state
   of the first caller.
or
2) We try to support this standard behavior at the kernel level, with
   TIMER_FORK_EXPEDITED.


This patch brings the average latency of wakeup-latency.c down from 4000 µs to
160 µs by making sure the thread spawned when the timer fires is not put at
the end of the current period, but rather gets a vruntime boost.

This fork vruntime boost, given by executing through an interactive or timer
wakeup chain, is not transferable to children. This is intended to provide some
degree of safety against timer-based fork bombs.

Disabling START_DEBIT instead of adding these *_FORK_EXPEDITED features does not
give good results under a make -j5 kernel build on a uniprocessor machine: Xorg
interactivity suffers a lot.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h   |    3 ++-
 kernel/sched.c          |    8 ++++++++
 kernel/sched_fair.c     |    8 ++++++++
 kernel/sched_features.h |   11 +++++++++++
 4 files changed, 29 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng.laptop/include/linux/sched.h
===================================================================
--- linux-2.6-lttng.laptop.orig/include/linux/sched.h
+++ linux-2.6-lttng.laptop/include/linux/sched.h
@@ -1131,7 +1131,8 @@ struct sched_entity {
 	struct list_head	group_node;
 	unsigned int		on_rq:1,
 				interactive:1,
-				timer:1;
+				timer:1,
+				fork_expedited:1;
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
Index: linux-2.6-lttng.laptop/kernel/sched.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched.c
+++ linux-2.6-lttng.laptop/kernel/sched.c
@@ -2504,6 +2504,14 @@ void sched_fork(struct task_struct *p, i
 	if (!rt_prio(p->prio))
 		p->sched_class = &fair_sched_class;
 
+	if ((sched_feat(INTERACTIVE_FORK_EXPEDITED)
+	     && (current->sched_wake_interactive || current->se.interactive))
+	    || (sched_feat(TIMER_FORK_EXPEDITED)
+	     && (current->sched_wake_timer || current->se.timer)))
+		p->se.fork_expedited = 1;
+	else
+		p->se.fork_expedited = 0;
+
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 
Index: linux-2.6-lttng.laptop/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.laptop/kernel/sched_fair.c
@@ -731,6 +731,14 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	u64 vruntime = cfs_rq->min_vruntime;
 
 	/*
+	 * Expedite forks when requested rather than putting forked thread in a
+	 * delayed slot.
+	 */
+	if ((sched_feat(INTERACTIVE_FORK_EXPEDITED)
+	    || sched_feat(TIMER_FORK_EXPEDITED)) && se->fork_expedited)
+		initial = 0;
+
+	/*
 	 * The 'current' period is already promised to the current tasks,
 	 * however the extra weight of the new task will slow them down a
 	 * little, place the new task so that it fits in the slot that
Index: linux-2.6-lttng.laptop/kernel/sched_features.h
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched_features.h
+++ linux-2.6-lttng.laptop/kernel/sched_features.h
@@ -59,9 +59,20 @@ SCHED_FEAT(DYN_MIN_VRUNTIME, 0)
  */
 SCHED_FEAT(INTERACTIVE, 0)
 /*
+ * Expedite forks performed from a wakeup chain coming from the input subsystem.
+ * Depends on the INTERACTIVE feature for following the wakeup chain across
+ * threads.
+ */
+SCHED_FEAT(INTERACTIVE_FORK_EXPEDITED, 0)
+/*
  * Timer subsystem next buddy affinity. Not transitive across new task wakeups.
  */
 SCHED_FEAT(TIMER, 0)
+/*
+ * Expedite forks performed from a wakeup chain coming from the timer subsystem.
+ * Depends on the TIMER feature for following the wakeup chain across threads.
+ */
+SCHED_FEAT(TIMER_FORK_EXPEDITED, 0)
 
 /*
  * Spin-wait on mutex acquisition when the mutex owner is running on



* [RFC PATCH 11/11] sched: fair sleepers for timer and interactive
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (9 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 10/11] sched: fork expedited Mathieu Desnoyers
@ 2010-08-26 18:09 ` Mathieu Desnoyers
  2010-08-26 18:57 ` [RFC PATCH 00/11] sched: CFS low-latency features Peter Zijlstra
  2010-08-27 10:47 ` Indan Zupancic
  12 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 18:09 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Mathieu Desnoyers, Tony Lindgren,
	Mike Galbraith

[-- Attachment #1: sched-fair-sleeper-for-timer-and-interactive.patch --]
[-- Type: text/plain, Size: 1812 bytes --]

Add FAIR_SLEEPERS_TIMER and FAIR_SLEEPERS_INTERACTIVE tuning knobs.

Turning off FAIR_SLEEPERS and turning on FAIR_SLEEPERS_INTERACTIVE does a pretty
good job for interactivity.

Setting FAIR_SLEEPERS to off helps keep the min_vruntime value somewhere
between MIN_vruntime and max_vruntime rather than to the right of all running
threads.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 kernel/sched_fair.c     |    5 ++++-
 kernel/sched_features.h |    4 +++-
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.git/kernel/sched_fair.c
@@ -748,7 +748,10 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		vruntime += sched_vslice(cfs_rq, se);
 
 	/* sleeps up to a single latency don't count. */
-	if (sched_feat(FAIR_SLEEPERS) && !initial) {
+	if (!initial
+	    && (sched_feat(FAIR_SLEEPERS)
+	       || (sched_feat(FAIR_SLEEPERS_TIMER) && se->timer)
+	       || (sched_feat(FAIR_SLEEPERS_INTERACTIVE) && se->interactive))) {
 		unsigned long thresh = sysctl_sched_latency;
 
 		/*
Index: linux-2.6-lttng.git/kernel/sched_features.h
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_features.h
+++ linux-2.6-lttng.git/kernel/sched_features.h
@@ -3,7 +3,9 @@
  * them to run sooner, but does not allow tons of sleepers to
  * rip the spread apart.
  */
-SCHED_FEAT(FAIR_SLEEPERS, 1)
+SCHED_FEAT(FAIR_SLEEPERS, 1)		 /* Apply to all wakeups */
+SCHED_FEAT(FAIR_SLEEPERS_INTERACTIVE, 0) /* Selects interactive wakeups */
+SCHED_FEAT(FAIR_SLEEPERS_TIMER, 0)	 /* Selects timer-driven wakeups */
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, 1)
 
 /*



* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (10 preceding siblings ...)
  2010-08-26 18:09 ` [RFC PATCH 11/11] sched: fair sleepers for timer and interactive Mathieu Desnoyers
@ 2010-08-26 18:57 ` Peter Zijlstra
  2010-08-26 21:25   ` Thomas Gleixner
  2010-08-26 23:49   ` Mathieu Desnoyers
  2010-08-27 10:47 ` Indan Zupancic
  12 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-26 18:57 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Tony Lindgren, Mike Galbraith

On Thu, 2010-08-26 at 14:09 -0400, Mathieu Desnoyers wrote:
> Feedback is welcome,
> 
So we have the following components to this patch:

  - dynamic min_vruntime -- push the min_vruntime ahead at the
       rate of the runqueue wide virtual clock. This approximates
       the virtual clock, esp. when turning off sleeper fairness.
       And is cheaper than actually computing the virtual clock.

       It allows for better insertion and re-weighting behaviour,
       but it does increase overhead somewhat.

  - special wakeups using the next-buddy to get scheduled 'soon',
       used by all wakeups from the input system and timers.

  - special fork semantics related to those special wakeups.


So while I would love to simply compute the virtual clock, it would add an s64
mult to every enqueue/dequeue and an s64 div to each enqueue/re-weight, which
might be somewhat prohibitive; the dyn min_vruntime approximation seems to work
well enough and costs a u32 div per enqueue.

Adding a preference to all user generated wakeups (input) and propagating that
state along the wakeup chain seems to make sense; adding the same to all timers
is something that needs to be discussed. I can well imagine that not all timers
are equally important -- do we want to extend the timer interface?

If we do decide we want both, we should at the very least merge the
try_to_wake_up() conditional blob (they're really identical). Preferably
we should reduce ttwu(), not add more to it... 

Fudging fork seems dubious at best, it seems generated by the use of
timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
broken thing to do, it has very ill defined semantics and is utterly
unable to properly cope with error cases. Furthermore its trivial to
actually correctly implement the desired behaviour, so I'm really
skeptical on this front; friends don't let friends use SIGEV_THREAD.




* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 18:57 ` [RFC PATCH 00/11] sched: CFS low-latency features Peter Zijlstra
@ 2010-08-26 21:25   ` Thomas Gleixner
  2010-08-26 22:22     ` Thomas Gleixner
  2010-08-26 23:49   ` Mathieu Desnoyers
  1 sibling, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2010-08-26 21:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> 
> Fudging fork seems dubious at best, it seems generated by the use of
> timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> broken thing to do, it has very ill defined semantics and is utterly
> unable to properly cope with error cases. Furthermore its trivial to
> actually correctly implement the desired behaviour, so I'm really
> skeptical on this front; friends don't let friends use SIGEV_THREAD.

SIGEV_THREAD is the best proof that the whole posix timer interface
was comitte[e]d under the influence of not to be revealed
mind-altering substances.

I completely object to add timer specific wakeup magic and support for
braindead fork orgies to the kernel proper. All that mess can be fixed
in user space by using sensible functionality.

Providing support for misdesigned crap just for POSIX compliance
reasons and to make some of the blind abusers of that very same crap
happy would be a completely stupid decision.

In fact that would make a brilliant precedence case for forcing the
kernel to solve user space madness at the expense of kernel
complexity. If we follow down that road we get requests for extra
functionality for AIO, networking and whatever in a split second with
no real good reason to reject them anymore.

Thanks,

	tglx


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 21:25   ` Thomas Gleixner
@ 2010-08-26 22:22     ` Thomas Gleixner
  2010-08-26 23:09       ` Mathieu Desnoyers
  2010-08-26 23:18       ` Paul E. McKenney
  0 siblings, 2 replies; 49+ messages in thread
From: Thomas Gleixner @ 2010-08-26 22:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > 
> > Fudging fork seems dubious at best, it seems generated by the use of
> > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > broken thing to do, it has very ill defined semantics and is utterly
> > unable to properly cope with error cases. Furthermore its trivial to
> > actually correctly implement the desired behaviour, so I'm really
> > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> 
> SIGEV_THREAD is the best proof that the whole posix timer interface
> was comitte[e]d under the influence of not to be revealed
> mind-altering substances.
> 
> I completely object to add timer specific wakeup magic and support for
> braindead fork orgies to the kernel proper. All that mess can be fixed
> in user space by using sensible functionality.
> 
> Providing support for misdesigned crap just for POSIX compliance
> reasons and to make some of the blind abusers of that very same crap
> happy would be a completely stupid decision.
> 
> In fact that would make a brilliant precedence case for forcing the
> kernel to solve user space madness at the expense of kernel
> complexity. If we follow down that road we get requests for extra
> functionality for AIO, networking and whatever in a split second with
> no real good reason to reject them anymore.

I really risked eye cancer and digged into the glibc code.

	/* There is not much we can do if the allocation fails.  */
	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);

So if the helper thread which gets the signal fails to create the
thread then everything is toast.

What about fixing the f*cked up glibc implementation in the first place
instead of fiddling in the kernel to support this utter madness? 

WTF can't the damned delivery thread not be created when timer_create
is called and the signal be delivered to that very thread directly via
SIGEV_THREAD_ID ?

Thanks,

	tglx


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 22:22     ` Thomas Gleixner
@ 2010-08-26 23:09       ` Mathieu Desnoyers
  2010-08-26 23:36         ` Mathieu Desnoyers
  2010-08-27  7:37         ` Peter Zijlstra
  2010-08-26 23:18       ` Paul E. McKenney
  1 sibling, 2 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 23:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Tony Lindgren, Mike Galbraith

* Thomas Gleixner (tglx@linutronix.de) wrote:
> On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > 
> > > Fudging fork seems dubious at best, it seems generated by the use of
> > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > broken thing to do, it has very ill defined semantics and is utterly
> > > unable to properly cope with error cases. Furthermore its trivial to
> > > actually correctly implement the desired behaviour, so I'm really
> > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > 
> > SIGEV_THREAD is the best proof that the whole posix timer interface
> > was comitte[e]d under the influence of not to be revealed
> > mind-altering substances.
> > 
> > I completely object to add timer specific wakeup magic and support for
> > braindead fork orgies to the kernel proper. All that mess can be fixed
> > in user space by using sensible functionality.
> > 
> > Providing support for misdesigned crap just for POSIX compliance
> > reasons and to make some of the blind abusers of that very same crap
> > happy would be a completely stupid decision.
> > 
> > In fact that would make a brilliant precedence case for forcing the
> > kernel to solve user space madness at the expense of kernel
> > complexity. If we follow down that road we get requests for extra
> > functionality for AIO, networking and whatever in a split second with
> > no real good reason to reject them anymore.
> 
> I really risked eye cancer and digged into the glibc code.
> 
> 	/* There is not much we can do if the allocation fails.  */
> 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> 
> So if the helper thread which gets the signal fails to create the
> thread then everything is toast.
> 
> What about fixing the f*cked up glibc implementation in the first place
> instead of fiddling in the kernel to support this utter madness? 
> 
> WTF can't the damned delivery thread not be created when timer_create
> is called and the signal be delivered to that very thread directly via
> SIGEV_THREAD_ID ?

Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
pretty sure that would work.

The only thing we might have to be careful about is what happens if the timer
re-fires before the thread completes its execution. We might want to let the
signal handler detect these overruns somehow.

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 22:22     ` Thomas Gleixner
  2010-08-26 23:09       ` Mathieu Desnoyers
@ 2010-08-26 23:18       ` Paul E. McKenney
  2010-08-26 23:28         ` Mathieu Desnoyers
  1 sibling, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2010-08-26 23:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > 
> > > Fudging fork seems dubious at best, it seems generated by the use of
> > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > broken thing to do, it has very ill defined semantics and is utterly
> > > unable to properly cope with error cases. Furthermore its trivial to
> > > actually correctly implement the desired behaviour, so I'm really
> > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > 
> > SIGEV_THREAD is the best proof that the whole posix timer interface
> > was comitte[e]d under the influence of not to be revealed
> > mind-altering substances.
> > 
> > I completely object to add timer specific wakeup magic and support for
> > braindead fork orgies to the kernel proper. All that mess can be fixed
> > in user space by using sensible functionality.
> > 
> > Providing support for misdesigned crap just for POSIX compliance
> > reasons and to make some of the blind abusers of that very same crap
> > happy would be a completely stupid decision.
> > 
> > In fact that would make a brilliant precedence case for forcing the
> > kernel to solve user space madness at the expense of kernel
> > complexity. If we follow down that road we get requests for extra
> > functionality for AIO, networking and whatever in a split second with
> > no real good reason to reject them anymore.
> 
> I really risked eye cancer and digged into the glibc code.
> 
> 	/* There is not much we can do if the allocation fails.  */
> 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> 
> So if the helper thread which gets the signal fails to create the
> thread then everything is toast.
> 
> What about fixing the f*cked up glibc implementation in the first place
> instead of fiddling in the kernel to support this utter madness? 
> 
> WTF can't the damned delivery thread not be created when timer_create
> is called and the signal be delivered to that very thread directly via
> SIGEV_THREAD_ID ?

C'mon, Thomas!!!  That is entirely too sensible!!!  ;-)

But if you are going to create the thread at timer_create() time,
why not just have the new thread block for the desired duration?

							Thanx, Paul


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:18       ` Paul E. McKenney
@ 2010-08-26 23:28         ` Mathieu Desnoyers
  2010-08-26 23:38           ` Paul E. McKenney
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 23:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > 
> > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > actually correctly implement the desired behaviour, so I'm really
> > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > 
> > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > was comitte[e]d under the influence of not to be revealed
> > > mind-altering substances.
> > > 
> > > I completely object to add timer specific wakeup magic and support for
> > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > in user space by using sensible functionality.
> > > 
> > > Providing support for misdesigned crap just for POSIX compliance
> > > reasons and to make some of the blind abusers of that very same crap
> > > happy would be a completely stupid decision.
> > > 
> > > In fact that would make a brilliant precedence case for forcing the
> > > kernel to solve user space madness at the expense of kernel
> > > complexity. If we follow down that road we get requests for extra
> > > functionality for AIO, networking and whatever in a split second with
> > > no real good reason to reject them anymore.
> > 
> > I really risked eye cancer and digged into the glibc code.
> > 
> > 	/* There is not much we can do if the allocation fails.  */
> > 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > 
> > So if the helper thread which gets the signal fails to create the
> > thread then everything is toast.
> > 
> > What about fixing the f*cked up glibc implementation in the first place
> > instead of fiddling in the kernel to support this utter madness? 
> > 
> > WTF can't the damned delivery thread not be created when timer_create
> > is called and the signal be delivered to that very thread directly via
> > SIGEV_THREAD_ID ?
> 
> C'mon, Thomas!!!  That is entirely too sensible!!!  ;-)
> 
> But if you are going to create the thread at timer_create() time,
> why not just have the new thread block for the desired duration?

The timer infrastructure allows things like periodic timers which restart when
they fire, detection of missed timer events, etc. If you try doing this in a
userland thread context with a simple sleep, then your period becomes however
long you sleep for _plus_ the thread execution time. This is all in all quite
different from the timer semantics.

Thanks,

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:09       ` Mathieu Desnoyers
@ 2010-08-26 23:36         ` Mathieu Desnoyers
  2010-08-27  7:38           ` Peter Zijlstra
  2010-08-27  8:43           ` Thomas Gleixner
  2010-08-27  7:37         ` Peter Zijlstra
  1 sibling, 2 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 23:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Tony Lindgren, Mike Galbraith

* Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> * Thomas Gleixner (tglx@linutronix.de) wrote:
> > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > 
> > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > actually correctly implement the desired behaviour, so I'm really
> > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > 
> > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > was comitte[e]d under the influence of not to be revealed
> > > mind-altering substances.
> > > 
> > > I completely object to add timer specific wakeup magic and support for
> > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > in user space by using sensible functionality.
> > > 
> > > Providing support for misdesigned crap just for POSIX compliance
> > > reasons and to make some of the blind abusers of that very same crap
> > > happy would be a completely stupid decision.
> > > 
> > > In fact that would make a brilliant precedence case for forcing the
> > > kernel to solve user space madness at the expense of kernel
> > > complexity. If we follow down that road we get requests for extra
> > > functionality for AIO, networking and whatever in a split second with
> > > no real good reason to reject them anymore.
> > 
> > I really risked eye cancer and digged into the glibc code.
> > 
> > 	/* There is not much we can do if the allocation fails.  */
> > 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > 
> > So if the helper thread which gets the signal fails to create the
> > thread then everything is toast.
> > 
> > What about fixing the f*cked up glibc implementation in the first place
> > instead of fiddling in the kernel to support this utter madness? 
> > 
> > WTF can't the damned delivery thread not be created when timer_create
> > is called and the signal be delivered to that very thread directly via
> > SIGEV_THREAD_ID ?
> 
> Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
> pretty sure that would work.
> 
> The only thing we might have to be careful about is what happens if the timer
> re-fires before the thread completes its execution. We might want to let the
> signal handler detect these overruns somehow.

Hrm, thinking about it a little more, one of the "plus" sides of these
SIGEV_THREAD timers is that a single timer can fork threads that will run on
many cores of a multi-core system. If we go for preallocation of a single
thread, we lose that. Maybe we could think of a way to preallocate a thread
pool instead?

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
> > 
> > Thanks,
> > 
> > 	tglx
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:28         ` Mathieu Desnoyers
@ 2010-08-26 23:38           ` Paul E. McKenney
  2010-08-26 23:53             ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2010-08-26 23:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

On Thu, Aug 26, 2010 at 07:28:58PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > > 
> > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > > 
> > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > was comitte[e]d under the influence of not to be revealed
> > > > mind-altering substances.
> > > > 
> > > > I completely object to add timer specific wakeup magic and support for
> > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > in user space by using sensible functionality.
> > > > 
> > > > Providing support for misdesigned crap just for POSIX compliance
> > > > reasons and to make some of the blind abusers of that very same crap
> > > > happy would be a completely stupid decision.
> > > > 
> > > > In fact that would make a brilliant precedence case for forcing the
> > > > kernel to solve user space madness at the expense of kernel
> > > > complexity. If we follow down that road we get requests for extra
> > > > functionality for AIO, networking and whatever in a split second with
> > > > no real good reason to reject them anymore.
> > > 
> > > I really risked eye cancer and digged into the glibc code.
> > > 
> > > 	/* There is not much we can do if the allocation fails.  */
> > > 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > > 
> > > So if the helper thread which gets the signal fails to create the
> > > thread then everything is toast.
> > > 
> > > What about fixing the f*cked up glibc implementation in the first place
> > > instead of fiddling in the kernel to support this utter madness? 
> > > 
> > > WTF can't the damned delivery thread not be created when timer_create
> > > is called and the signal be delivered to that very thread directly via
> > > SIGEV_THREAD_ID ?
> > 
> > C'mon, Thomas!!!  That is entirely too sensible!!!  ;-)
> > 
> > But if you are going to create the thread at timer_create() time,
> > why not just have the new thread block for the desired duration?
> 
> The timer infrastructure allows things like periodic timer which restarts when
> it fires, detection of missed timer events, etc. If you try doing this in a
> userland thread context with a simple sleep, then your period becomes however
> long you sleep for _and_ the thread execution time. This is all in all quite
> different from the timer semantic.

Hmmm...  Why couldn't the thread in question set the next sleep time based
on the timer period?  Yes, if the function ran for longer than the period,
there would be a delay, but the POSIX semantics allow such a delay, right?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 18:57 ` [RFC PATCH 00/11] sched: CFS low-latency features Peter Zijlstra
  2010-08-26 21:25   ` Thomas Gleixner
@ 2010-08-26 23:49   ` Mathieu Desnoyers
  2010-08-27  7:42     ` Peter Zijlstra
  1 sibling, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 23:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Tony Lindgren, Mike Galbraith

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2010-08-26 at 14:09 -0400, Mathieu Desnoyers wrote:
> > Feedback is welcome,
> > 
> So we have the following components to this patch:
> 
>   - dynamic min_vruntime -- push the min_vruntime ahead at the
>        rate of the runqueue wide virtual clock. This approximates
>        the virtual clock, esp. when turning off sleeper fairness.
>        And is cheaper than actually computing the virtual clock.
> 
>        It allows for better insertion and re-weighting behaviour,
>        but it does increase overhead somewhat.
> 
>   - special wakeups using the next-buddy to get scheduled 'soon',
>        used by all wakeups from the input system and timers.
> 
>   - special fork semantics related to those special wakeups.
> 
> 
> So while I would love to simply compute the virtual clock, it would add
> a s64 mult to every enqueue/dequeue and a s64 div to each
> enqueue/re-weight, which might be somewhat prohibitive, the dyn
> min_vruntime approximation seems to work well enough and costs a u32 div
> per enqueue.

Yep, it's cheap enough and seems to work very well as far as my testing has
shown.

> Adding a preference to all user generated wakeups (input) and
> propagating that state along the wakeup chain seems to make sense,

Yes, this is what lets us kill FAIR_SLEEPERS (and thus let the dynamic
min_vruntime behave as expected), while keeping good Xorg interactivity.

> adding the same to all timers is something that needs to be discussed, I
> can well imagine not all timers are equally important -- do we want to
> extend the timer interface?

I just thought it made sense that when a timer fires and wakes up a thread,
there are pretty good chances that we want to wake up this thread quickly. But
it raises the question in a more general sense: would we want this kind of
behavior also available for network packets, disk I/O, etc.? IOW, would it make
sense to have next-buddy selection on all these input-triggered wakeups? Since
we're only using next-buddy selection, it will only take effect if the selected
buddy is within a specific vruntime range from the minimum, so
AFAIK, I don't think we would end up starving the system in any possible way.

So far I cannot see a situation where selecting the next buddy would _not_ make
sense for any kind of input-driven wakeup (interactive, timer, disk, network,
etc.). But maybe it's just a lack of imagination on my part.


> If we do decide we want both, we should at the very least merge the
> try_to_wake_up() conditional blob (they're really identical). Preferably
> we should reduce ttwu(), not add more to it... 

Sure. Or maybe we'll find out we want to target even more input paths... so far
interactive input seemed absolutely needed, and timers seemed logical and
self-rate-limited. For the others, I don't know. It might actually make sense
too.

> 
> Fudging fork seems dubious at best, it seems generated by the use of
> timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> broken thing to do, it has very ill defined semantics and is utterly
> unable to properly cope with error cases. Furthermore its trivial to
> actually correctly implement the desired behaviour, so I'm really
> skeptical on this front; friends don't let friends use SIGEV_THREAD.

Agreed for the timer-tied case. I think the direction Thomas proposes for fixing
glibc makes much more sense. However, I am wondering about the interactive case
too: e.g., if you click on "open terminal", it needs to fork a process, and you
would normally expect it to come up more quickly than the background kernel
compile you're running. So there might be some interest in this fork vruntime
boost for interactivity-driven wakeups. Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:38           ` Paul E. McKenney
@ 2010-08-26 23:53             ` Mathieu Desnoyers
  2010-08-27  0:09               ` Paul E. McKenney
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-26 23:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Aug 26, 2010 at 07:28:58PM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > > > 
> > > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > > > 
> > > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > > was comitte[e]d under the influence of not to be revealed
> > > > > mind-altering substances.
> > > > > 
> > > > > I completely object to add timer specific wakeup magic and support for
> > > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > > in user space by using sensible functionality.
> > > > > 
> > > > > Providing support for misdesigned crap just for POSIX compliance
> > > > > reasons and to make some of the blind abusers of that very same crap
> > > > > happy would be a completely stupid decision.
> > > > > 
> > > > > In fact that would make a brilliant precedence case for forcing the
> > > > > kernel to solve user space madness at the expense of kernel
> > > > > complexity. If we follow down that road we get requests for extra
> > > > > functionality for AIO, networking and whatever in a split second with
> > > > > no real good reason to reject them anymore.
> > > > 
> > > > I really risked eye cancer and digged into the glibc code.
> > > > 
> > > > 	/* There is not much we can do if the allocation fails.  */
> > > > 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > > > 
> > > > So if the helper thread which gets the signal fails to create the
> > > > thread then everything is toast.
> > > > 
> > > > What about fixing the f*cked up glibc implementation in the first place
> > > > instead of fiddling in the kernel to support this utter madness? 
> > > > 
> > > > WTF can't the damned delivery thread not be created when timer_create
> > > > is called and the signal be delivered to that very thread directly via
> > > > SIGEV_THREAD_ID ?
> > > 
> > > C'mon, Thomas!!!  That is entirely too sensible!!!  ;-)
> > > 
> > > But if you are going to create the thread at timer_create() time,
> > > why not just have the new thread block for the desired duration?
> > 
> > The timer infrastructure allows things like periodic timer which restarts when
> > it fires, detection of missed timer events, etc. If you try doing this in a
> > userland thread context with a simple sleep, then your period becomes however
> > long you sleep for _and_ the thread execution time. This is all in all quite
> > different from the timer semantic.
> 
> Hmmm...  Why couldn't the thread in question set the next sleep time based
> on the timer period?  Yes, if the function ran for longer than the period,
> there would be a delay, but the POSIX semantics allow such a delay, right?

I'm afraid you'd get large error accumulation over time, and getting a precise
picture of how much time remains between now and the expected end of the period
is hard to do precisely from user-space. In a few words, this solution would be
terrible for jitter. This is why we usually rely on timers rather than delays
for these periodic workloads.

Thanks,

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:53             ` Mathieu Desnoyers
@ 2010-08-27  0:09               ` Paul E. McKenney
  2010-08-27 15:18                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2010-08-27  0:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

On Thu, Aug 26, 2010 at 07:53:23PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Thu, Aug 26, 2010 at 07:28:58PM -0400, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > > > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > > > > 
> > > > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > > > > 
> > > > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > > > was comitte[e]d under the influence of not to be revealed
> > > > > > mind-altering substances.
> > > > > > 
> > > > > > I completely object to add timer specific wakeup magic and support for
> > > > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > > > in user space by using sensible functionality.
> > > > > > 
> > > > > > Providing support for misdesigned crap just for POSIX compliance
> > > > > > reasons and to make some of the blind abusers of that very same crap
> > > > > > happy would be a completely stupid decision.
> > > > > > 
> > > > > > In fact that would make a brilliant precedence case for forcing the
> > > > > > kernel to solve user space madness at the expense of kernel
> > > > > > complexity. If we follow down that road we get requests for extra
> > > > > > functionality for AIO, networking and whatever in a split second with
> > > > > > no real good reason to reject them anymore.
> > > > > 
> > > > > I really risked eye cancer and digged into the glibc code.
> > > > > 
> > > > > 	/* There is not much we can do if the allocation fails.  */
> > > > > 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > > > > 
> > > > > So if the helper thread which gets the signal fails to create the
> > > > > thread then everything is toast.
> > > > > 
> > > > > What about fixing the f*cked up glibc implementation in the first place
> > > > > instead of fiddling in the kernel to support this utter madness? 
> > > > > 
> > > > > WTF can't the damned delivery thread not be created when timer_create
> > > > > is called and the signal be delivered to that very thread directly via
> > > > > SIGEV_THREAD_ID ?
> > > > 
> > > > C'mon, Thomas!!!  That is entirely too sensible!!!  ;-)
> > > > 
> > > > But if you are going to create the thread at timer_create() time,
> > > > why not just have the new thread block for the desired duration?
> > > 
> > > The timer infrastructure allows things like periodic timer which restarts when
> > > it fires, detection of missed timer events, etc. If you try doing this in a
> > > userland thread context with a simple sleep, then your period becomes however
> > > long you sleep for _and_ the thread execution time. This is all in all quite
> > > different from the timer semantic.
> > 
> > Hmmm...  Why couldn't the thread in question set the next sleep time based
> > on the timer period?  Yes, if the function ran for longer than the period,
> > there would be a delay, but the POSIX semantics allow such a delay, right?
> 
> I'm afraid you'll have a large error accumulation over time, and getting the
> precise picture of how much time between now and where the the period end is
> expected to be is kind of hard to do precisely from user-space. In a few words,
> this solution would be terrible for jitter. This is why we usually rely on
> timers rather than delays in these periodic workloads.

Why couldn't the timer_create() call record the start time, and then
compute the sleeps from that time?  So if timer_create() executed at
time t=100 and the period is 5, upon awakening and completing the first
invocation of the function in question, the thread does a sleep calculated
to wake at t=110.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:09       ` Mathieu Desnoyers
  2010-08-26 23:36         ` Mathieu Desnoyers
@ 2010-08-27  7:37         ` Peter Zijlstra
  2010-08-27 15:21           ` Mathieu Desnoyers
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-27  7:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

On Thu, 2010-08-26 at 19:09 -0400, Mathieu Desnoyers wrote:
> > WTF can't the damned delivery thread not be created when timer_create
> > is called and the signal be delivered to that very thread directly via
> > SIGEV_THREAD_ID ?
> 
> Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
> pretty sure that would work.
> 
> The only thing we might have to be careful about is what happens if the timer
> re-fires before the thread completes its execution. We might want to let the
> signal handler detect these overruns somehow.

Simply don't use SIGEV_THREAD; spawn your own thread and use
SIGEV_THREAD_ID yourself. The programmer knows the semantics and knows
whether he cares about overlapping timers etc.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:36         ` Mathieu Desnoyers
@ 2010-08-27  7:38           ` Peter Zijlstra
  2010-08-27 15:23             ` Mathieu Desnoyers
  2010-08-27  8:43           ` Thomas Gleixner
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-27  7:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

On Thu, 2010-08-26 at 19:36 -0400, Mathieu Desnoyers wrote:
> > The only thing we might have to be careful about is what happens if the timer
> > re-fires before the thread completes its execution. We might want to let the
> > signal handler detect these overruns somehow.
> 
> Hrm, thinking about it a little more, one of the "plus" sides of these
> SIGEV_THREAD timers is that a single timer can fork threads that will run on
> many cores on a multi-core system. If we go for preallocation of a single
> thread, we lose that. Maybe we could think of a way to preallocate a thread pool
> instead ? 

Why try to fix a broken thing? Just let the app spawn threads and use
SIGEV_THREAD_ID itself; it knows its requirements and can do whatever suits
the situation best. No need to try to patch up braindead POSIX stuff.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:49   ` Mathieu Desnoyers
@ 2010-08-27  7:42     ` Peter Zijlstra
  2010-08-27  8:19       ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-27  7:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Tony Lindgren, Mike Galbraith

On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> AFAIK, I don't think we would end up starving the system in any possible way.

Correct, it does maintain fairness.

> So far I cannot see a situation where selecting the next buddy would _not_ make
> sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> etc). But maybe it's just a lack of imagination on my part. 

The risk is that you end up always using next-buddy; we tried that a while
back and it didn't work well for some workloads, as Mike might remember.

Also, when you use timers for things like time-outs, you really couldn't care
less whether it's handled sooner rather than later.

Disk is usually so slow you really don't want to consider it
interactive, but then sometimes you might... it's a really hard problem.

The only clear situation is the direct input, that's a direct link
between the user and our wakeup chain and the user is always important.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27  7:42     ` Peter Zijlstra
@ 2010-08-27  8:19       ` Mike Galbraith
  2010-08-27 15:43         ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2010-08-27  8:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Thomas Gleixner, Tony Lindgren

On Fri, 2010-08-27 at 09:42 +0200, Peter Zijlstra wrote:
> On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> > AFAIK, I don't think we would end up starving the system in any possible way.
> 
> Correct, it does maintain fairness.
> 
> > So far I cannot see a situation where selecting the next buddy would _not_ make
> > sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> > etc). But maybe it's just a lack of imagination on my part. 
> 
> The risk is that you end up with always using next-buddy, and we tried
> that a while back and that didn't work well for some, Mike might
> remember.

I turned it off because it was ripping spread apart badly, and last
buddy did a better job of improving scalability without it.

> Also, when you use timers things like time-outs you really couldn't care
> less if its handled sooner rather than later.
> 
> Disk is usually so slow you really don't want to consider it
> interactive, but then sometimes you might,.. its a really hard problem.

(very hard)

> The only clear situation is the direct input, that's a direct link
> between the user and our wakeup chain and the user is always important.

Yeah, directly linked wakeups using next could be a good thing, but the
trouble with using any linkage to the user is that you have to pass it
on to reap the benefit... so when do you disconnect?

	-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 23:36         ` Mathieu Desnoyers
  2010-08-27  7:38           ` Peter Zijlstra
@ 2010-08-27  8:43           ` Thomas Gleixner
  2010-08-27 15:50             ` Mathieu Desnoyers
  1 sibling, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2010-08-27  8:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Tony Lindgren, Mike Galbraith

On Thu, 26 Aug 2010, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> > * Thomas Gleixner (tglx@linutronix.de) wrote:
> > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > > 
> > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > > 
> > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > was comitte[e]d under the influence of not to be revealed
> > > > mind-altering substances.
> > > > 
> > > > I completely object to adding timer-specific wakeup magic and support for
> > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > in user space by using sensible functionality.
> > > > 
> > > > Providing support for misdesigned crap just for POSIX compliance
> > > > reasons and to make some of the blind abusers of that very same crap
> > > > happy would be a completely stupid decision.
> > > > 
> > > > In fact that would make a brilliant precedent for forcing the
> > > > kernel to solve user space madness at the expense of kernel
> > > > complexity. If we follow down that road we get requests for extra
> > > > functionality for AIO, networking and whatever in a split second with
> > > > no real good reason to reject them anymore.
> > > 
> > > I really risked eye cancer and dug into the glibc code.
> > > 
> > > 	/* There is not much we can do if the allocation fails.  */
> > > 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > > 
> > > So if the helper thread which gets the signal fails to create the
> > > thread then everything is toast.
> > > 
> > > What about fixing the f*cked up glibc implementation in the first place
> > > instead of fiddling in the kernel to support this utter madness? 
> > > 
> > > WTF can't the damned delivery thread be created when timer_create
> > > is called and the signal be delivered to that very thread directly via
> > > SIGEV_THREAD_ID ?
> > 
> > Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
> > pretty sure that would work.
> > 
> > The only thing we might have to be careful about is what happens if the timer
> > re-fires before the thread completes its execution. We might want to let the
> > signal handler detect these overruns somehow.

That's easy.
 
> Hrm, thinking about it a little more, one of the "plus" sides of these
> SIGEV_THREAD timers is that a single timer can fork threads that will run on
> many cores on a multi-core system. If we go for preallocation of a single
> thread, we lose that. Maybe we could think of a way to preallocate a thread pool
> instead ?

Why should a single timer fork many threads? Just because a previous
thread did not complete before the timer fires again? That's
braindamage as all threads call the same function which then needs to
be serialized anyway. We really do not need a function which creates
tons of threads which get all stuck on the same resource.

Thanks,

	tglx


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
                   ` (11 preceding siblings ...)
  2010-08-26 18:57 ` [RFC PATCH 00/11] sched: CFS low-latency features Peter Zijlstra
@ 2010-08-27 10:47 ` Indan Zupancic
  2010-08-27 10:58   ` Peter Zijlstra
  12 siblings, 1 reply; 49+ messages in thread
From: Indan Zupancic @ 2010-08-27 10:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Peter Zijlstra, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Thomas Gleixner, Mathieu Desnoyers

Hello,

On Thu, August 26, 2010 20:09, Mathieu Desnoyers wrote:
> Hi,
>
> Following the findings I presented a few months ago
> (http://lkml.org/lkml/2010/4/18/13) about CFS having large vruntime spread
> issues, Peter Zijlstra and I pursued the discussion and the implementation
> effort (my work on this is funded by Nokia). I recently put the result
> together
> and came up with this patchset, combining both his work and mine.
>
> With this patchset, I got the following results with wakeup-latency.c (a 10ms
> periodic timer), running periodic-fork.sh, Xorg, make -j3 and firefox (playing
> a
> youtube video), with Xorg moving terminal windows around, in parallel on a UP
> system (links to the test program source in the dyn min_vruntime patch). The
> Xorg interactivity is very good with the new features enabled, but was poor
> originally with the vanilla mainline scheduler. The 10ms timer delays are as
> follow:
>
>                          2.6.35.2 mainline*   with low-latency features**
> maximum latency:                34465.2 µs                    8261.4 µs
> average latency:                 6445.5 µs                     211.2 µs
> missed timer events:                yes                           no
>
> * 2.6.35.2 mainline test needs to run periodic-fork.sh for a few minutes first
>   to let it rip the spread apart.
>
> ** low-latency features:
>
> with the patchset applied and CONFIG_SCHED_DEBUG=y
> (with debugfs mounted in /sys/debugfs)
>
> for opt in DYN_MIN_VRUNTIME \
>         NO_FAIR_SLEEPERS FAIR_SLEEPERS_INTERACTIVE FAIR_SLEEPERS_TIMER \
>         INTERACTIVE TIMER \
>         INTERACTIVE_FORK_EXPEDITED TIMER_FORK_EXPEDITED;
> do echo $opt > /sys/debugfs/sched_features;
> done
>
> These patches are designed to allow individual enabling of each feature and to
> make sure that the object size of sched.o does not grow when the features are
> disabled on a CONFIG_SCHED_DEBUG=n kernel.

Please don't hide scheduler improvements behind obscure CONFIG_SCHED_DEBUG
options. If it doesn't make the scheduler better just don't merge it. If it
does then it should be enabled by default.

Thank you,

Indan


P.S. Tony Luck seems to be chopped off the CC list.




* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 10:47 ` Indan Zupancic
@ 2010-08-27 10:58   ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-27 10:58 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Thomas Gleixner

On Fri, 2010-08-27 at 12:47 +0200, Indan Zupancic wrote:
> 
> Please don't hide scheduler improvements behind obscure CONFIG_SCHED_DEBUG
> options. If it doesn't make the scheduler better just don't merge it. If it
> does then it should be enabled by default. 

It's an RFC; features are nice for testing things and seeing if/how they work.
Features are not a user option, so you don't have to worry about them.

Also, 'better' is a very hard thing to quantify. Some people like
throughput, some like latency, and others like raw context switch
performance.

Hopefully we'll soon have a non-privileged sporadic task scheduler for
all our soft-realtime (SRT) media needs.


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27  0:09               ` Paul E. McKenney
@ 2010-08-27 15:18                 ` Mathieu Desnoyers
  2010-08-27 15:20                   ` Thomas Gleixner
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 15:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Aug 26, 2010 at 07:53:23PM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Thu, Aug 26, 2010 at 07:28:58PM -0400, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > > > > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > > > > > 
> > > > > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > > > > broken thing to do; it has very ill-defined semantics and is utterly
> > > > > > > > unable to properly cope with error cases. Furthermore it's trivial to
> > > > > > > > actually implement the desired behaviour correctly, so I'm really
> > > > > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > > > > > 
> > > > > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > > > > was comitte[e]d under the influence of not to be revealed
> > > > > > > mind-altering substances.
> > > > > > > 
> > > > > > > I completely object to adding timer-specific wakeup magic and support for
> > > > > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > > > > in user space by using sensible functionality.
> > > > > > > 
> > > > > > > Providing support for misdesigned crap just for POSIX compliance
> > > > > > > reasons and to make some of the blind abusers of that very same crap
> > > > > > > happy would be a completely stupid decision.
> > > > > > > 
> > > > > > > In fact that would make a brilliant precedent for forcing the
> > > > > > > kernel to solve user space madness at the expense of kernel
> > > > > > > complexity. If we follow down that road we get requests for extra
> > > > > > > functionality for AIO, networking and whatever in a split second with
> > > > > > > no real good reason to reject them anymore.
> > > > > > 
> > > > > > I really risked eye cancer and dug into the glibc code.
> > > > > > 
> > > > > > 	/* There is not much we can do if the allocation fails.  */
> > > > > > 	(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > > > > > 
> > > > > > So if the helper thread which gets the signal fails to create the
> > > > > > thread then everything is toast.
> > > > > > 
> > > > > > What about fixing the f*cked up glibc implementation in the first place
> > > > > > instead of fiddling in the kernel to support this utter madness? 
> > > > > > 
> > > > > > WTF can't the damned delivery thread be created when timer_create
> > > > > > is called and the signal be delivered to that very thread directly via
> > > > > > SIGEV_THREAD_ID ?
> > > > > 
> > > > > C'mon, Thomas!!!  That is entirely too sensible!!!  ;-)
> > > > > 
> > > > > But if you are going to create the thread at timer_create() time,
> > > > > why not just have the new thread block for the desired duration?
> > > > 
> > > > The timer infrastructure allows things like periodic timers which restart when
> > > > they fire, detection of missed timer events, etc. If you try doing this in a
> > > > userland thread context with a simple sleep, then your period becomes however
> > > > long you sleep for _and_ the thread execution time. This is all in all quite
> > > > different from the timer semantics.
> > > 
> > > Hmmm...  Why couldn't the thread in question set the next sleep time based
> > > on the timer period?  Yes, if the function ran for longer than the period,
> > > there would be a delay, but the POSIX semantics allow such a delay, right?
> > 
> > I'm afraid you'll have a large error accumulation over time, and getting a
> > precise picture of how much time remains between now and where the period end
> > is expected to be is kind of hard to do precisely from user-space. In a few words,
> > this solution would be terrible for jitter. This is why we usually rely on
> > timers rather than delays in these periodic workloads.
> 
> Why couldn't the timer_create() call record the start time, and then
> compute the sleeps from that time?  So if timer_create() executed at
> time t=100 and the period is 5, upon awakening and completing the first
> invocation of the function in question, the thread does a sleep calculated
> to wake at t=110.

Let's focus on the userspace thread execution, right between the sampling of the
current time and the call to sleep:

  Thread A
  current_time = read_current_time();
  sleep(period_end - current_time);

If the thread is preempted between these two operations, then we end up sleeping
for longer than what is needed. This kind of imprecision will add up over time,
so that after e.g. one day, instead of having the expected number of timer
executions, we'll have fewer than that. This kind of accumulated drift is an
unwanted side-effect of using delays in lieu of real periodic timers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 15:18                 ` Mathieu Desnoyers
@ 2010-08-27 15:20                   ` Thomas Gleixner
  2010-08-27 15:30                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2010-08-27 15:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Peter Zijlstra, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

On Fri, 27 Aug 2010, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > Why couldn't the timer_create() call record the start time, and then
> > compute the sleeps from that time?  So if timer_create() executed at
> > time t=100 and the period is 5, upon awakening and completing the first
> > invocation of the function in question, the thread does a sleep calculated
> > to wake at t=110.
> 
> Let's focus on the userspace thread execution, right between the sampling of the
> current time and the call to sleep:
> 
>   Thread A
>   current_time = read_current_time();
>   sleep(period_end - current_time);
> 
> If the thread is preempted between these two operations, then we end up sleeping
> for longer than what is needed. This kind of imprecision will add up over time,
> so that after e.g. one day, instead of having the expected number of timer
> executions, we'll have fewer than that. This kind of accumulated drift is an
> unwanted side-effect of using delays in lieu of real periodic timers.

Nonsense, that's why we provide clock_nanosleep(ABSTIME)

Thanks,

	tglx


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27  7:37         ` Peter Zijlstra
@ 2010-08-27 15:21           ` Mathieu Desnoyers
  2010-08-27 15:41             ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 15:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2010-08-26 at 19:09 -0400, Mathieu Desnoyers wrote:
> > > WTF can't the damned delivery thread be created when timer_create
> > > is called and the signal be delivered to that very thread directly via
> > > SIGEV_THREAD_ID ?
> > 
> > Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
> > pretty sure that would work.
> > 
> > The only thing we might have to be careful about is what happens if the timer
> > re-fires before the thread completes its execution. We might want to let the
> > signal handler detect these overruns somehow.
> 
> Simply don't use SIGEV_THREAD and spawn you own thread and use
> SIGEV_THREAD_ID yourself, the programmer knows the semantics and knows
> if he cares about overlapping timers etc.

From man timer_create:

       SIGEV_THREAD
              Upon  timer  expiration,  invoke  sigev_notify_function as if it
              were the start function of a new thread.  (Among the implementa‐
              tion  possibilities  here are that each timer notification could
              result in the creation of a new thread, or that a single  thread
              is  created  to  receive  all  notifications.)   The function is
              invoked   with   sigev_value   as   its   sole   argument.    If
              sigev_notify_attributes  is  not  NULL,  it  should  point  to a
              pthread_attr_t structure that defines  attributes  for  the  new
              thread (see pthread_attr_init(3)).

So basically, it's the glibc implementation that is broken, not the standard.

The programmer should expect that thread execution can overlap though.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27  7:38           ` Peter Zijlstra
@ 2010-08-27 15:23             ` Mathieu Desnoyers
  0 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2010-08-26 at 19:36 -0400, Mathieu Desnoyers wrote:
> > > The only thing we might have to be careful about is what happens if the timer
> > > re-fires before the thread completes its execution. We might want to let the
> > > signal handler detect these overruns somehow.
> > 
> > Hrm, thinking about it a little more, one of the "plus" sides of these
> > SIGEV_THREAD timers is that a single timer can fork threads that will run on
> > many cores on a multi-core system. If we go for preallocation of a single
> > thread, we lose that. Maybe we could think of a way to preallocate a thread pool
> > instead ? 
> 
> Why try and fix a broken thing, just let the app spawn threads and use
> SIGEV_THREAD_ID itself, it knows its requirements and can do what suits
> the situation best. No need to try and patch up braindead posix stuff.

As I dig through the code and learn more about the POSIX standard, my
understanding is that the _implementation_ is broken. The standard just leaves
enough rope for the implementation to either do the right thing or to hang
itself. Sadly, glibc seems to have chosen the second option.

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 15:20                   ` Thomas Gleixner
@ 2010-08-27 15:30                     ` Mathieu Desnoyers
  2010-08-27 15:41                       ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 15:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul E. McKenney, Peter Zijlstra, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

* Thomas Gleixner (tglx@linutronix.de) wrote:
> On Fri, 27 Aug 2010, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > Why couldn't the timer_create() call record the start time, and then
> > > compute the sleeps from that time?  So if timer_create() executed at
> > > time t=100 and the period is 5, upon awakening and completing the first
> > > invocation of the function in question, the thread does a sleep calculated
> > > to wake at t=110.
> > 
> > Let's focus on the userspace thread execution, right between the sampling of the
> > current time and the call to sleep:
> > 
> >   Thread A
> >   current_time = read_current_time();
> >   sleep(period_end - current_time);
> > 
> > If the thread is preempted between these two operations, then we end up sleeping
> > for longer than what is needed. This kind of imprecision will add up over time,
> > so that after e.g. one day, instead of having the expected number of timer
> > executions, we'll have fewer than that. This kind of accumulated drift is an
> > unwanted side-effect of using delays in lieu of real periodic timers.
> 
> Nonsense, that's why we provide clock_nanosleep(ABSTIME)

If we're using CLOCK_MONOTONIC, you're right, this could work. I was only
thinking of relative delays.

So do you think Paul's idea would be a good candidate for the timer_create
SIGEV_THREAD glibc implementation then?

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 15:21           ` Mathieu Desnoyers
@ 2010-08-27 15:41             ` Peter Zijlstra
  2010-08-27 16:09               ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-27 15:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

On Fri, 2010-08-27 at 11:21 -0400, Mathieu Desnoyers wrote:
>        SIGEV_THREAD
>               Upon  timer  expiration,  invoke  sigev_notify_function as if it
>               were the start function of a new thread.  (Among the implementa‐
>               tion  possibilities  here are that each timer notification could
>               result in the creation of a new thread, or that a single  thread
>               is  created  to  receive  all  notifications.)   The function is
>               invoked   with   sigev_value   as   its   sole   argument.    If
>               sigev_notify_attributes  is  not  NULL,  it  should  point  to a
>               pthread_attr_t structure that defines  attributes  for  the  new
>               thread (see pthread_attr_init(3)).
> 
> So basically, it's the glibc implementation that is broken, not the standard.

The standard is broken too: what context will the new thread inherit?
The pthread_attr_t stuff tries to cover some of that, but pthread_attr_t
doesn't cover all inherited task attributes, and allows for some very
'interesting' bugs [1].

The specification also doesn't cover the case where the handler takes
more time to execute than the timer interval.

[1] - consider the case where pthread_attr_t includes the stack and we
use a spawn thread on expire policy and then run into the situation
where the handler is delayed past the next expiration.


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 15:30                     ` Mathieu Desnoyers
@ 2010-08-27 15:41                       ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-27 15:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Paul E. McKenney, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Tony Lindgren,
	Mike Galbraith

On Fri, 2010-08-27 at 11:30 -0400, Mathieu Desnoyers wrote:
> So do you think Paul's ideal would be a good candidate for the timer_create
> SIGEV_THREAD glibc implementation then ? 

Simply do not _ever_ use SIGEV_THREAD, there really is no reason to.




* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27  8:19       ` Mike Galbraith
@ 2010-08-27 15:43         ` Mathieu Desnoyers
  2010-08-27 18:38           ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 15:43 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Thomas Gleixner, Tony Lindgren

* Mike Galbraith (efault@gmx.de) wrote:
> On Fri, 2010-08-27 at 09:42 +0200, Peter Zijlstra wrote:
> > On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> > > AFAIK, I don't think we would end up starving the system in any possible way.
> > 
> > Correct, it does maintain fairness.
> > 
> > > So far I cannot see a situation where selecting the next buddy would _not_ make
> > > sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> > > etc). But maybe it's just a lack of imagination on my part. 
> > 
> > The risk is that you end up with always using next-buddy, and we tried
> > that a while back and that didn't work well for some, Mike might
> > remember.
> 
> I turned it off because it was ripping spread apart badly, and last
> buddy did a better job of improving scalability without it.

Maybe with the dyn min_vruntime feature proposed in this patchset we should
reconsider this. Spread being ripped apart is exactly what it addresses.

> 
> > Also, when you use timers for things like time-outs you really couldn't care
> > less if it's handled sooner rather than later.

Maybe we could have a timer delay threshold under which we consider the wakeup
to be "short" or "long". E.g. a timer firing every ms is very likely to need its
wakeups to proceed quickly, but if the timer fires every hour, we couldn't care
less.

> > 
> > Disk is usually so slow you really don't want to consider it
> > interactive, but then sometimes you might,.. its a really hard problem.
> 
> (very hard)
> 
> > The only clear situation is the direct input, that's a direct link
> > between the user and our wakeup chain and the user is always important.
> 
> Yeah, directly linked wakeups using next could be a good thing, but the
> trouble with using any linkage to the user is that you have to pass it
> on to reap benefit.. so when do you disconnect?

What we do here is that we pass it on across wakeups, but not across wakeups of
new threads (so it's not passed across forks). However, AFAIU, this only
generates a bias that will select the wakeup target as next buddy iff it is
within a certain range from the leftmost entity. The disconnect is done (setting
se.interactive to 0) on a per-thread basis when threads go to sleep.

I've seen a version of Peter's patches which decremented a counter as the token
is passed across a wakeup, but I have not found that necessary so far.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27  8:43           ` Thomas Gleixner
@ 2010-08-27 15:50             ` Mathieu Desnoyers
  0 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 15:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Tony Lindgren, Mike Galbraith

* Thomas Gleixner (tglx@linutronix.de) wrote:
> On Thu, 26 Aug 2010, Mathieu Desnoyers wrote:
[...] 
> > Hrm, thinking about it a little more, one of the "plus" sides of these
> > SIGEV_THREAD timers is that a single timer can fork threads that will run on
> > many cores on a multi-core system. If we go for preallocation of a single
> > thread, we lose that. Maybe we could think of a way to preallocate a thread pool
> > instead ?
> 
> Why should a single timer fork many threads? Just because a previous
> thread did not complete before the timer fires again? That's
> braindamage as all threads call the same function which then needs to
> be serialized anyway. We really do not need a function which creates
> tons of threads which get all stuck on the same resource.

It could make sense if the workload is mostly CPU-bound and there is only a very
short critical section shared between the threads. But I agree that in many
cases this will generate an utter contention mess.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 15:41             ` Peter Zijlstra
@ 2010-08-27 16:09               ` Mathieu Desnoyers
  2010-08-27 17:27                 ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Fri, 2010-08-27 at 11:21 -0400, Mathieu Desnoyers wrote:
> >        SIGEV_THREAD
> >               Upon  timer  expiration,  invoke  sigev_notify_function as if it
> >               were the start function of a new thread.  (Among the implementa‐
> >               tion  possibilities  here are that each timer notification could
> >               result in the creation of a new thread, or that a single  thread
> >               is  created  to  receive  all  notifications.)   The function is
> >               invoked   with   sigev_value   as   its   sole   argument.    If
> >               sigev_notify_attributes  is  not  NULL,  it  should  point  to a
> >               pthread_attr_t structure that defines  attributes  for  the  new
> >               thread (see pthread_attr_init(3)).
> > 
> > So basically, it's the glibc implementation that is broken, not the standard.
> 
> The standard is broken too, what context will the new thread inherit?

Besides pthread_attr_t, thinking of the scheduler/cgroups/etc stuff, I'd think
it might be expected to inherit the state of the thread which calls
timer_create(). But this is not what glibc does right now, and it is not spelled
out clearly by the standard.

> The pthread_attr_t stuff tries to cover some of that, but pthread_attr_t
> doesn't cover all inherited task attributes, and allows for some very
> 'interesting' bugs [1].

(see below)

> 
> The specification also doesn't cover the case where the handler takes
> more time to execute than the timer interval.

Why should it ? It seems valid for a workload to result in spawning many threads
bound to more than a single core on a multi-core system. So concurrency
management should be performed by the application.

> 
> [1] - consider the case where pthread_attr_t includes the stack and we
> use a spawn thread on expire policy and then run into the situation
> where the handler is delayed past the next expiration.

Setting a thread stack and generating the signal more than once is taken into
account in the standard. It leads to unspecified results (IOW: don't do this):

http://www.opengroup.org/onlinepubs/009695399/functions/timer_create.html

"If evp->sigev_sigev_notify is SIGEV_THREAD and sev->sigev_notify_attributes is
not NULL, if the attribute pointed to by sev->sigev_notify_attributes has a
thread stack address specified by a call to pthread_attr_setstack() or
pthread_attr_setstackaddr(), the results are unspecified if the signal is
generated more than once."

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 16:09               ` Mathieu Desnoyers
@ 2010-08-27 17:27                 ` Peter Zijlstra
  2010-08-27 18:32                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-27 17:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

On Fri, 2010-08-27 at 12:09 -0400, Mathieu Desnoyers wrote:
> > The specification also doesn't cover the case where the handler takes
> > more time to execute than the timer interval.
> 
> Why should it ? It seems valid for a workload to result in spawning many threads
> bound to more than a single core on a multi-core system. So concurrency
> management should be performed by the application. 

The problem with allowing concurrency is that the moment you want to do
that you get the spawner context and error propagation problems.

Thus we're limited to spawning a single thread at timer_create() and
handling expirations in there. At that point you have to specify what
happens to an expiration while the handler is running: will it queue
handlers, or will you have to carefully craft your handler using
timer_getoverrun()?

But I really don't understand why POSIX would provide this composite
functionality instead of providing the separate bits to implement it;
I think SIGEV_THREAD_ID is the only missing piece.




* Re: [RFC PATCH 09/11] sched: timer-driven next buddy (update)
  2010-08-26 18:09 ` [RFC PATCH 09/11] sched: timer-driven " Mathieu Desnoyers
@ 2010-08-27 18:02   ` Mathieu Desnoyers
  2010-08-27 18:14     ` Thomas Gleixner
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 18:02 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt,
	Thomas Gleixner, Tony Lindgren, Mike Galbraith, Peter Zijlstra

Sorry, I forgot that I based this patchset on top of my LTTng tree all along, so
this new patch version is needed for reject-less application of the patchset on
a vanilla 2.6.35.2 tree.

Thanks,

Mathieu

Subject: sched: timer-driven next buddy

[ Impact: implement TIMER feature to diminish the latencies induced by wakeups
          performed by timer callbacks ]

Ensure that timer callbacks triggering wakeups get served ASAP by giving
timer-driven wakeups next-buddy affinity.

My test program is wakeup-latency.c, originally provided by Nokia. A 10ms timer
spawns a thread which reads the time and warns if the expected deadline has
been missed by too much. It also warns about timer overruns.

Without the TIMER and TIMER_FORK_EXPEDITED features:

min priority: 0, max priority: 0
[....]
maximum latency: 41453.6 µs
average latency: 4127.0 µs
missed timer events: 0

With the features enabled:

min priority: 0, max priority: 0
[...]
maximum latency: 10013.5 µs
average latency: 162.9 µs
missed timer events: 0

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h     |   16 +++++++++++++++-
 kernel/hrtimer.c          |    2 ++
 kernel/itimer.c           |    2 ++
 kernel/posix-cpu-timers.c |    2 ++
 kernel/posix-timers.c     |    2 ++
 kernel/sched.c            |    9 +++++++++
 kernel/sched_fair.c       |   11 ++++++++---
 kernel/sched_features.h   |    4 ++++
 kernel/timer.c            |    2 ++
 9 files changed, 46 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng.laptop/include/linux/sched.h
===================================================================
--- linux-2.6-lttng.laptop.orig/include/linux/sched.h
+++ linux-2.6-lttng.laptop/include/linux/sched.h
@@ -1027,12 +1027,14 @@ struct sched_domain;
 #define WF_SYNC		(1 << 0)	/* waker goes to sleep after wakup */
 #define WF_FORK		(1 << 1)	/* child wakeup after fork */
 #define WF_INTERACTIVE	(1 << 2)	/* interactivity-driven wakeup */
+#define WF_TIMER	(1 << 3)	/* timer-driven wakeup */
 
 #define ENQUEUE_WAKEUP	(1 << 0)
 #define ENQUEUE_WAKING	(1 << 1)
 #define ENQUEUE_HEAD	(1 << 2)
 #define ENQUEUE_IO	(1 << 3)
 #define ENQUEUE_LATENCY	(1 << 4)
+#define ENQUEUE_TIMER	(1 << 5)
 
 #define DEQUEUE_SLEEP	(1 << 0)
 
@@ -1128,7 +1130,8 @@ struct sched_entity {
 	struct rb_node		run_node;
 	struct list_head	group_node;
 	unsigned int		on_rq:1,
-				interactive:1;
+				interactive:1,
+				timer:1;
 
 	u64			exec_start;
 	u64			sum_exec_runtime;
@@ -1242,6 +1245,7 @@ struct task_struct {
 	unsigned sched_reset_on_fork:1;		/* Revert to default
 						 * priority/policy on fork */
 	unsigned sched_wake_interactive:4;	/* User-driven wakeup */
+	unsigned sched_wake_timer:4;		/* Timer-driven wakeup */
 
 	pid_t pid;
 	pid_t tgid;
@@ -1514,6 +1518,16 @@ static inline void sched_wake_interactiv
 	current->sched_wake_interactive--;
 }
 
+static inline void sched_wake_timer_enable(void)
+{
+	current->sched_wake_timer++;
+}
+
+static inline void sched_wake_timer_disable(void)
+{
+	current->sched_wake_timer--;
+}
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
Index: linux-2.6-lttng.laptop/kernel/sched_features.h
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched_features.h
+++ linux-2.6-lttng.laptop/kernel/sched_features.h
@@ -58,6 +58,10 @@ SCHED_FEAT(DYN_MIN_VRUNTIME, 0)
  * Input subsystem next buddy affinity. Not transitive across new task wakeups.
  */
 SCHED_FEAT(INTERACTIVE, 0)
+/*
+ * Timer subsystem next buddy affinity. Not transitive across new task wakeups.
+ */
+SCHED_FEAT(TIMER, 0)
 
 /*
  * Spin-wait on mutex acquisition when the mutex owner is running on
Index: linux-2.6-lttng.laptop/kernel/sched.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched.c
+++ linux-2.6-lttng.laptop/kernel/sched.c
@@ -2295,6 +2295,13 @@ static int try_to_wake_up(struct task_st
 			en_flags |= ENQUEUE_LATENCY;
 	}
 
+	if (sched_feat(TIMER) && !(wake_flags & WF_FORK)) {
+		if (current->sched_wake_timer ||
+				wake_flags & WF_TIMER ||
+				current->se.timer)
+			en_flags |= ENQUEUE_TIMER;
+	}
+
 	this_cpu = get_cpu();
 
 	smp_wmb();
@@ -3623,6 +3630,8 @@ need_resched_nonpreemptible:
 		else {
 			if (sched_feat(INTERACTIVE))
 				prev->se.interactive = 0;
+			if (sched_feat(TIMER))
+				prev->se.timer = 0;
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 		}
 		switch_count = &prev->nvcsw;
Index: linux-2.6-lttng.laptop/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.laptop/kernel/sched_fair.c
@@ -780,6 +780,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 		if (sched_feat(INTERACTIVE)
 		    && flags & ENQUEUE_LATENCY && !(flags & ENQUEUE_IO))
 			se->interactive = 1;
+		if (sched_feat(TIMER)
+		    && flags & ENQUEUE_TIMER && !(flags & ENQUEUE_IO))
+			se->timer = 1;
 		place_entity(cfs_rq, se, 0);
 		enqueue_sleeper(cfs_rq, se);
 	}
@@ -926,7 +929,8 @@ static struct sched_entity *pick_next_en
 		se = cfs_rq->last;
 
 	/*
-	 * Prefer the next buddy, only set through the interactivity logic.
+	 * Prefer the next buddy, only set through the interactivity and timer
+	 * logic.
 	 */
 	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;
@@ -1677,8 +1681,9 @@ static void check_preempt_wakeup(struct
 	if (unlikely(se == pse))
 		return;
 
-	if (sched_feat(INTERACTIVE)
-	    && !(wake_flags & WF_FORK) && pse->interactive) {
+	if (!(wake_flags & WF_FORK)
+	    && ((sched_feat(INTERACTIVE) && pse->interactive)
+		|| (sched_feat(TIMER) && pse->timer))) {
 		clear_buddies(cfs_rq, NULL);
 		set_next_buddy(pse);
 		preempt = 1;
Index: linux-2.6-lttng.laptop/kernel/posix-timers.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/posix-timers.c
+++ linux-2.6-lttng.laptop/kernel/posix-timers.c
@@ -402,6 +402,7 @@ static enum hrtimer_restart posix_timer_
 	int si_private = 0;
 	enum hrtimer_restart ret = HRTIMER_NORESTART;
 
+	sched_wake_timer_enable();
 	timr = container_of(timer, struct k_itimer, it.real.timer);
 	spin_lock_irqsave(&timr->it_lock, flags);
 
@@ -456,6 +457,7 @@ static enum hrtimer_restart posix_timer_
 	}
 
 	unlock_timer(timr, flags);
+	sched_wake_timer_disable();
 	return ret;
 }
 
Index: linux-2.6-lttng.laptop/kernel/timer.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/timer.c
+++ linux-2.6-lttng.laptop/kernel/timer.c
@@ -1031,6 +1031,7 @@ static void call_timer_fn(struct timer_l
 	 */
 	struct lockdep_map lockdep_map = timer->lockdep_map;
 #endif
+	sched_wake_timer_enable();
 	/*
 	 * Couple the lock chain with the lock chain at
 	 * del_timer_sync() by acquiring the lock_map around the fn()
@@ -1055,6 +1056,7 @@ static void call_timer_fn(struct timer_l
 		 */
 		preempt_count() = preempt_count;
 	}
+	sched_wake_timer_disable();
 }
 
 #define INDEX(N) ((base->timer_jiffies >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)
Index: linux-2.6-lttng.laptop/kernel/hrtimer.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/hrtimer.c
+++ linux-2.6-lttng.laptop/kernel/hrtimer.c
@@ -1212,6 +1212,7 @@ static void __run_hrtimer(struct hrtimer
 
 	WARN_ON(!irqs_disabled());
 
+	sched_wake_timer_enable();
 	debug_deactivate(timer);
 	__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
 	timer_stats_account_hrtimer(timer);
@@ -1238,6 +1239,7 @@ static void __run_hrtimer(struct hrtimer
 		enqueue_hrtimer(timer, base);
 	}
 	timer->state &= ~HRTIMER_STATE_CALLBACK;
+	sched_wake_timer_disable();
 }
 
 #ifdef CONFIG_HIGH_RES_TIMERS
Index: linux-2.6-lttng.laptop/kernel/itimer.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/itimer.c
+++ linux-2.6-lttng.laptop/kernel/itimer.c
@@ -124,7 +124,9 @@ enum hrtimer_restart it_real_fn(struct h
 		container_of(timer, struct signal_struct, real_timer);
 
 	trace_itimer_expire(ITIMER_REAL, sig->leader_pid, 0);
+	sched_wake_timer_enable();
 	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);
+	sched_wake_timer_disable();
 
 	return HRTIMER_NORESTART;
 }
Index: linux-2.6-lttng.laptop/kernel/posix-cpu-timers.c
===================================================================
--- linux-2.6-lttng.laptop.orig/kernel/posix-cpu-timers.c
+++ linux-2.6-lttng.laptop/kernel/posix-cpu-timers.c
@@ -610,6 +610,7 @@ static void arm_timer(struct k_itimer *t
  */
 static void cpu_timer_fire(struct k_itimer *timer)
 {
+	sched_wake_timer_enable();
 	if ((timer->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
 		/*
 		 * User don't want any signal.
@@ -637,6 +638,7 @@ static void cpu_timer_fire(struct k_itim
 		 */
 		posix_cpu_timer_schedule(timer);
 	}
+	sched_wake_timer_disable();
 }
 
 /*
-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 09/11] sched: timer-driven next buddy (update)
  2010-08-27 18:02   ` [RFC PATCH 09/11] sched: timer-driven next buddy (update) Mathieu Desnoyers
@ 2010-08-27 18:14     ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2010-08-27 18:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Peter Zijlstra, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Tony Lindgren, Mike Galbraith, Peter Zijlstra



On Fri, 27 Aug 2010, Mathieu Desnoyers wrote:

> Sorry, I forgot that I based this patchset on top of my LTTng tree all along, so
> this new patch version is needed for reject-less application of the patchset on
> a vanilla 2.6.35.2 tree.

That does not change my opinion much.

As I said before, I strongly object to implementing special handling for a
particular wakeup scenario, as this will open a can of worms for all
other special cases and workloads.

Thanks,

	tglx



 


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 17:27                 ` Peter Zijlstra
@ 2010-08-27 18:32                   ` Mathieu Desnoyers
  2010-08-27 19:23                     ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 18:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Fri, 2010-08-27 at 12:09 -0400, Mathieu Desnoyers wrote:
> > > The specification also doesn't cover the case where the handler takes
> > > more time to execute than the timer interval.
> > 
> > Why should it ? It seems valid for a workload to result in spawning many threads
> > bound to more than a single core on a multi-core system. So concurrency
> > management should be performed by the application. 
> 
> The problem with allowing concurrency is that the moment you want to do
> that you get the spawner context and error propagation problems.

These are problems you get only when you allow spawning any number of threads.
If, instead, you create a thread pool at timer creation, then you can allow
concurrency without problems with spawner context and error propagation.

> Thus we're limited to spawning a single thread at timer_create() and
> handling expirations in there. At that point you have to specify what
> happens to an expiration while the handler is running, will it queue
> handlers or will you have to carefully craft your handler using
> timer_getoverrun().

With my statement above in mind, it's true that we'd have to handle what
happens when the thread pool is exhausted. But couldn't we simply increment an
overrun counter from userland? (It could be a counter kept internally in
userland, separate from the one the kernel keeps.)

> But I really don't understand why POSIX would provide this composite
> functionality instead of providing the separate bits to implement this,
> which I think is only SIGEV_THREAD_ID which is missing.

Higher abstraction API vs low-level control. A long running battle. ;)

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 15:43         ` Mathieu Desnoyers
@ 2010-08-27 18:38           ` Mathieu Desnoyers
  2010-08-28  7:33             ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 18:38 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Thomas Gleixner, Tony Lindgren

* Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> * Mike Galbraith (efault@gmx.de) wrote:
> > On Fri, 2010-08-27 at 09:42 +0200, Peter Zijlstra wrote:
> > > On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> > > > AFAIK, I don't think we would end up starving the system in any possible way.
> > > 
> > > Correct, it does maintain fairness.
> > > 
> > > > So far I cannot see a situation where selecting the next buddy would _not_ make
> > > > sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> > > > etc). But maybe it's just a lack of imagination on my part. 
> > > 
> > > The risk is that you end up with always using next-buddy, and we tried
> > > that a while back and that didn't work well for some, Mike might
> > > remember.
> > 
> > I turned it off because it was ripping spread apart badly, and last
> > buddy did a better job of improving scalability without it.
> 
> Maybe with the dyn min_vruntime feature proposed in this patchset we should
> reconsider this. Spread being ripped apart is exactly what it addresses.

I'm curious: which workload was showing this kind of problem exactly ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 18:32                   ` Mathieu Desnoyers
@ 2010-08-27 19:23                     ` Peter Zijlstra
  2010-08-27 19:57                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2010-08-27 19:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

On Fri, 2010-08-27 at 14:32 -0400, Mathieu Desnoyers wrote:
> 
> These are problems you get only when you allow spawning any number of threads.
> If, instead, you create a thread pool at timer creation, then you can allow
> concurrency without problems with spawner context and error propagation.

That would be a massive resource waste for the normal case where the
interval > handler runtime.




* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 19:23                     ` Peter Zijlstra
@ 2010-08-27 19:57                       ` Mathieu Desnoyers
  2010-08-31 15:02                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-27 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Fri, 2010-08-27 at 14:32 -0400, Mathieu Desnoyers wrote:
> > 
> > These are problems you get only when you allow spawning any number of threads.
> > If, instead, you create a thread pool at timer creation, then you can allow
> > concurrency without problems with spawner context and error propagation.
> 
> That would be a massive resource waste for the normal case where the
> interval > handler runtime.

Not if the application can configure this value. So the "normal" case could be
set to a single thread. Only resource-intensive apps would set this differently,
e.g. to the detected number of CPUs.

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 18:38           ` Mathieu Desnoyers
@ 2010-08-28  7:33             ` Mike Galbraith
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2010-08-28  7:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Steven Rostedt, Thomas Gleixner, Tony Lindgren

On Fri, 2010-08-27 at 14:38 -0400, Mathieu Desnoyers wrote: 
> * Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> > * Mike Galbraith (efault@gmx.de) wrote:
> > > On Fri, 2010-08-27 at 09:42 +0200, Peter Zijlstra wrote:
> > > > On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> > > > > AFAIK, I don't think we would end up starving the system in any possible way.
> > > > 
> > > > Correct, it does maintain fairness.
> > > > 
> > > > > So far I cannot see a situation where selecting the next buddy would _not_ make
> > > > > sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> > > > > etc). But maybe it's just a lack of imagination on my part. 
> > > > 
> > > > The risk is that you end up with always using next-buddy, and we tried
> > > > that a while back and that didn't work well for some, Mike might
> > > > remember.
> > > 
> > > I turned it off because it was ripping spread apart badly, and last
> > > buddy did a better job of improving scalability without it.
> > 
> > Maybe with the dyn min_vruntime feature proposed in this patchset we should
> > reconsider this. Spread being ripped apart is exactly what it addresses.

Dunno.  Messing with spread is a very sharp double-edged sword.

I took the patch set out for a spin, and saw some negative effects in
that regard.  I did see some positive effects as well, though: x264, for
example, really wants round robin, so it profits.  Things where preemption
translates to throughput don't care for the idea much.  vmark rather
surprised me: it hated this feature for some reason; I expected the
opposite.

It's an interesting knob, but I wouldn't turn it on by default on
anything but maybe a UP desktop box.

(I kinda like cgroups classified by pgid for desktop interactivity under
load; it works pretty darn well.)

> I'm curious: which workload was showing this kind of problem exactly ?

Hm, I don't recall exact details.  I was looking at a lot of different
load mixes at the time, mostly interactive and batch, trying to shrink
the too darn high latencies seen with modest load mixes.

I never tried to figure out why next buddy had a worse effect on spread
than last buddy (it's not obvious to me); I just noted the fact that it
did.  I recall that it had a large negative effect on x264 throughput as well.

	-Mike

some numbers:

35.3x = DYN_MIN_VRUNTIME

netperf TCP_RR
35.3              97624.82   96982.62   97387.50   avg 97331.646 RR/sec  1.000
35.3x             95044.98   95365.52   94581.92   avg 94997.473 RR/sec   .976

tbench 8
35.3               1200.36    1200.56    1199.29   avg 1200.070 MB/sec   1.000
35.3x              1106.92    1110.50    1106.58   avg 1108.000 MB/sec    .923

x264 8
35.3                407.12     408.80     414.60   avg 410.173 fps       1.000
35.3x               428.07     436.43     438.16   avg 434.220 fps       1.058

vmark
35.3                149678     149269     150584   avg 149843.666 m/sec  1.000
35.3x               120872     120932     121247   avg 121017.000 m/sec   .807

mysql+oltp
                      1          2          4          8         16         32         64        128        256
35.3           10956.33   20747.86   37139.27   36898.70   36575.90   36104.63   34390.26   31574.46   29148.01
               10938.44   20835.17   37058.51   37051.71   36630.06   35930.90   34464.88   32024.50   28989.14
               10935.72   20792.54   37238.17   36989.97   36568.37   35961.00   34342.54   31532.39   29235.20
         avg   10943.49   20791.85   37145.31   36980.12   36591.44   35998.84   34399.22   31710.45   29124.11

35.3x          10944.22   20851.09   35609.32   35744.05   35137.49   33362.16   30796.03   28286.87   25105.84
               10958.68   20811.93   35604.57   35610.71   35147.65   33371.81   30877.52   28325.79   25113.85
               10962.72   20745.81   35728.36   35638.23   35124.56   33336.20   30794.99   28225.99   25202.88
         avg   10955.20   20802.94   35647.41   35664.33   35136.56   33356.72   30822.84   28279.55   25140.85
    vs 35.3       1.001      1.000       .959       .964       .960       .926       .896       .891       .863




* Re: [RFC PATCH 00/11] sched: CFS low-latency features
  2010-08-27 19:57                       ` Mathieu Desnoyers
@ 2010-08-31 15:02                         ` Mathieu Desnoyers
  0 siblings, 0 replies; 49+ messages in thread
From: Mathieu Desnoyers @ 2010-08-31 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Steven Rostedt, Tony Lindgren, Mike Galbraith

* Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Fri, 2010-08-27 at 14:32 -0400, Mathieu Desnoyers wrote:
> > > 
> > > These are problems you get only when you allow spawning any number of threads.
> > > If, instead, you create a thread pool at timer creation, then you can allow
> > > concurrency without problems with spawner context and error propagation.
> > 
> > That would be a massive resource waste for the normal case where the
> > interval > handler runtime.
> 
> Not if the application can configure this value. So the "normal" case could be
> set to a single thread. Only resource-intensive apps would set this differently,
> e.g. to the detected number of CPUs.

... but as we discussed in private, this would involve adding extra attribute
fields beyond what the standard proposes. I think the problem comes from the
fact that POSIX expects the pthread attributes to contain every possible thread
attribute one can imagine, which does not take into account the internal kernel
state set by other interfaces.

So this leaves us with a single-threaded SIGEV_THREAD, which is pretty much
useless. But at least it is not totally impossible for glibc to get it right by
moving to an implementation which uses only one thread.

So until someone finds the time to fix glibc's SIGEV_THREAD timers, I would
strongly recommend against using them.

Thanks,

Mathieu

> 
> Mathieu
> 
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


end of thread, other threads:[~2010-08-31 15:02 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-26 18:09 [RFC PATCH 00/11] sched: CFS low-latency features Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 01/11] sched: fix string comparison in features Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 02/11] sched: debug spread check account for nr_running Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 03/11] sched: FAIR_SLEEPERS feature Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 04/11] sched: debug cleanup place entity Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 05/11] sched buddy enable buddy logic starting at 2 running threads Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 06/11] sched: dynamic min_vruntime Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 07/11] sched rename struct task in_iowait field to sched_in_iowait Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 08/11] sched input interactivity-driven next buddy Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 09/11] sched: timer-driven " Mathieu Desnoyers
2010-08-27 18:02   ` [RFC PATCH 09/11] sched: timer-driven next buddy (update) Mathieu Desnoyers
2010-08-27 18:14     ` Thomas Gleixner
2010-08-26 18:09 ` [RFC PATCH 10/11] sched: fork expedited Mathieu Desnoyers
2010-08-26 18:09 ` [RFC PATCH 11/11] sched: fair sleepers for timer and interactive Mathieu Desnoyers
2010-08-26 18:57 ` [RFC PATCH 00/11] sched: CFS low-latency features Peter Zijlstra
2010-08-26 21:25   ` Thomas Gleixner
2010-08-26 22:22     ` Thomas Gleixner
2010-08-26 23:09       ` Mathieu Desnoyers
2010-08-26 23:36         ` Mathieu Desnoyers
2010-08-27  7:38           ` Peter Zijlstra
2010-08-27 15:23             ` Mathieu Desnoyers
2010-08-27  8:43           ` Thomas Gleixner
2010-08-27 15:50             ` Mathieu Desnoyers
2010-08-27  7:37         ` Peter Zijlstra
2010-08-27 15:21           ` Mathieu Desnoyers
2010-08-27 15:41             ` Peter Zijlstra
2010-08-27 16:09               ` Mathieu Desnoyers
2010-08-27 17:27                 ` Peter Zijlstra
2010-08-27 18:32                   ` Mathieu Desnoyers
2010-08-27 19:23                     ` Peter Zijlstra
2010-08-27 19:57                       ` Mathieu Desnoyers
2010-08-31 15:02                         ` Mathieu Desnoyers
2010-08-26 23:18       ` Paul E. McKenney
2010-08-26 23:28         ` Mathieu Desnoyers
2010-08-26 23:38           ` Paul E. McKenney
2010-08-26 23:53             ` Mathieu Desnoyers
2010-08-27  0:09               ` Paul E. McKenney
2010-08-27 15:18                 ` Mathieu Desnoyers
2010-08-27 15:20                   ` Thomas Gleixner
2010-08-27 15:30                     ` Mathieu Desnoyers
2010-08-27 15:41                       ` Peter Zijlstra
2010-08-26 23:49   ` Mathieu Desnoyers
2010-08-27  7:42     ` Peter Zijlstra
2010-08-27  8:19       ` Mike Galbraith
2010-08-27 15:43         ` Mathieu Desnoyers
2010-08-27 18:38           ` Mathieu Desnoyers
2010-08-28  7:33             ` Mike Galbraith
2010-08-27 10:47 ` Indan Zupancic
2010-08-27 10:58   ` Peter Zijlstra
