* [PATCH 0/7] Cure some vaux idle wrackage
@ 2013-11-20 16:04 Peter Zijlstra
  2013-11-20 16:04 ` [PATCH 1/7] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
                   ` (7 more replies)
  0 siblings, 8 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:04 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, linux-pm

I ran head-first into intel_powerclamp and acpi_pad yesterday -- afaict they
try and do pretty much the same thing but both had competing sets of bugs and
issues.

Related to that I went over the tree looking for preempt_enable_no_resched()
abuse.

I know that patch 3 isn't going to be popular with Arjan but you guys are going
to have to make that work. That stuff is hideously terrible.

A note to Zhang Rui: _never_ merge EXPORTs in another maintainer's guts without
an explicit ACK from him; Thomas was quite horrified when I asked him how
tick_nohz_idle_{enter,exit} got exported.

That said, thanks for a 'fun' two days :-(



* [PATCH 1/7] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
@ 2013-11-20 16:04 ` Peter Zijlstra
  2013-11-20 16:04 ` [PATCH 2/7] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:04 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki

[-- Attachment #1: peter_zijlstra-x86_acpi_idle-restructure_the_mwait_idle_routines.patch --]
[-- Type: text/plain, Size: 6942 bytes --]

People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.

This leaves mwait_play_dead() the sole remaining user of __mwait() and
new __mwait() users are probably doing it wrong.

Also remove __sti_mwait() as it's unused.
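
A sketch of what a caller looks like after the collapse; the helper and
its hint encoding are illustrative (MWAIT hints are model-specific), but
the constants and mwait_idle_with_hints() are the ones from this patch:

#include <asm/mwait.h>

/*
 * Hypothetical caller: encode a C-state/sub-state pair into the MWAIT
 * hint (EAX) using the MWAIT_SUBSTATE_* constants, and ask for wakeup
 * on interrupt even with IRQs disabled (MWAIT_ECX_INTERRUPT_BREAK).
 */
static void example_enter_cstate(unsigned int cstate, unsigned int substate)
{
	unsigned long eax = (cstate << MWAIT_SUBSTATE_SIZE) |
			    (substate & MWAIT_SUBSTATE_MASK);
	unsigned long ecx = MWAIT_ECX_INTERRUPT_BREAK;

	/* The one remaining place that does the monitor/mwait dance. */
	mwait_idle_with_hints(eax, ecx);
}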

Cc: arjan@linux.intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: lenb@kernel.org
Cc: rui.zhang@intel.com
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/mwait.h       |   40 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   |   23 ---------------------
 arch/x86/kernel/acpi/cstate.c      |   23 ---------------------
 drivers/acpi/acpi_pad.c            |    5 ----
 drivers/acpi/processor_idle.c      |   15 -------------
 drivers/idle/intel_idle.c          |    8 -------
 drivers/thermal/intel_powerclamp.c |    4 ---
 7 files changed, 43 insertions(+), 75 deletions(-)

--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_X86_MWAIT_H
 #define _ASM_X86_MWAIT_H
 
+#include <linux/sched.h>
+
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
 #define MWAIT_SUBSTATE_SIZE		4
@@ -13,4 +15,42 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
+{
+	if (!current_set_polling_and_test()) {
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
+		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		if (!need_resched())
+			__mwait(eax, ecx);
+	}
+	__current_clr_polling();
+}
+
 #endif /* _ASM_X86_MWAIT_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -700,29 +700,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsi
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
-	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(ax, cx);
-	}
-}
-
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 {
 	unsigned int cpu = smp_processor_id();
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,10 +193,7 @@ static int power_saving_thread(void *dat
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
-			if (!need_resched())
-				__mwait(power_saving_mwait_eax, 1);
+			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
 			start_critical_timings();
 			if (lapic_marked_unstable)
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpu
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	lapic_timer_state_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 
@@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
@@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpu
 		}
 	}
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	acpi_unlazy_tlb(smp_processor_id());
 
 	/* Tell the scheduler that we are going deep-idle: */
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_dev
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	if (!current_set_polling_and_test()) {
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(eax, ecx);
-	}
+	mwait_idle_with_hints(eax, ecx);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			mwait_idle_with_hints(eax, ecx);
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}




* [PATCH 2/7] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding
  2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
  2013-11-20 16:04 ` [PATCH 1/7] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
@ 2013-11-20 16:04 ` Peter Zijlstra
  2013-11-21  8:25   ` Peter Zijlstra
  2013-11-20 16:04 ` [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations Peter Zijlstra
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:04 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, linux-pm

[-- Attachment #1: peterz-preempt_fold_need_resched.patch --]
[-- Type: text/plain, Size: 3313 bytes --]

With various drivers wanting to inject idle time, we get people
calling idle routines outside of the idle loop proper.

Therefore we need to be extra careful about not missing
TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations.

While looking at this, I also realized there's a small window in the
existing idle loop where we can miss TIF_NEED_RESCHED: when it hits
right after the tif_need_resched() test at the end of the loop but
right before the need_resched() test at the start of the loop.

So move preempt_fold_need_resched() out of the loop, to the point where
we are guaranteed to have TIF_NEED_RESCHED set.
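
In effect the idle loop ends up shaped like this (a condensed sketch of
the kernel/cpu/idle.c hunk below, details elided):

	while (1) {
		tick_nohz_idle_enter();

		while (!need_resched()) {
			/* ... poll, or arch_cpu_idle() via cpuidle ... */
		}

		/*
		 * TIF_NEED_RESCHED is guaranteed set here, so folding it
		 * into PREEMPT_NEED_RESCHED cannot race with the inner
		 * loop's tests.
		 */
		preempt_fold_need_resched();
		tick_nohz_idle_exit();
		schedule_preempt_disabled();
	}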

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/mwait.h |    2 +-
 include/linux/preempt.h      |   10 ++++++++++
 include/linux/sched.h        |   15 +++++++++++++++
 kernel/cpu/idle.c            |   14 +++++++-------
 kernel/sched/core.c          |    3 +--
 5 files changed, 34 insertions(+), 10 deletions(-)

--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -50,7 +50,7 @@ static inline void mwait_idle_with_hints
 		if (!need_resched())
 			__mwait(eax, ecx);
 	}
-	__current_clr_polling();
+	current_clr_polling();
 }
 
 #endif /* _ASM_X86_MWAIT_H */
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -116,6 +116,16 @@ do { \
 
 #endif /* CONFIG_PREEMPT_COUNT */
 
+#ifdef CONFIG_PREEMPT
+#define preempt_fold_need_resched() \
+do { \
+	if (tif_need_resched()) \
+		set_preempt_need_resched(); \
+} while (0)
+#else
+#define preempt_fold_need_resched() do { } while (0)
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 
 struct preempt_notifier;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2630,6 +2630,21 @@ static inline bool __must_check current_
 }
 #endif
 
+static inline void current_clr_polling(void)
+{
+	__current_clr_polling();
+
+	/*
+	 * Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
+	 * Once the bit is cleared, we'll get IPIs with every new
+	 * TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
+	 * fold.
+	 */
+	smp_mb(); /* paired with resched_task() */
+
+	preempt_fold_need_resched();
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -105,14 +105,14 @@ static void cpu_idle_loop(void)
 				__current_set_polling();
 			}
 			arch_cpu_idle_exit();
-			/*
-			 * We need to test and propagate the TIF_NEED_RESCHED
-			 * bit here because we might not have send the
-			 * reschedule IPI to idle tasks.
-			 */
-			if (tif_need_resched())
-				set_preempt_need_resched();
 		}
+
+		/*
+		 * We need to test and propagate the TIF_NEED_RESCHED bit here
+		 * because we might not have sent the reschedule IPI to idle
+		 * tasks.
+		 */
+		preempt_fold_need_resched();
 		tick_nohz_idle_exit();
 		schedule_preempt_disabled();
 	}
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1499,8 +1499,7 @@ void scheduler_ipi(void)
 	 * TIF_NEED_RESCHED remotely (for the first time) will also send
 	 * this IPI.
 	 */
-	if (tif_need_resched())
-		set_preempt_need_resched();
+	preempt_fold_need_resched();
 
 	if (llist_empty(&this_rq()->wake_list)
 			&& !tick_nohz_full_cpu(smp_processor_id())




* [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
  2013-11-20 16:04 ` [PATCH 1/7] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
  2013-11-20 16:04 ` [PATCH 2/7] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
@ 2013-11-20 16:04 ` Peter Zijlstra
  2013-11-20 16:40   ` Arjan van de Ven
  2013-11-21  0:54   ` Jacob Pan
  2013-11-20 16:04 ` [PATCH 4/7] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:04 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki

[-- Attachment #1: peterz-fixup-intel_clamp-mess.patch --]
[-- Type: text/plain, Size: 9667 bytes --]

People are starting to grow their own idle implementations in various
disgusting ways. Collapse the lot and use the generic idle code to
provide a proper idle cycle implementation.

This does not fully preserve existing behaviour in that the generic
idle cycle function calls into the normal cpuidle-governed idle
routines and should thus respect things like QoS parameters and the
like.

If people want to override the idle state they should talk to the
cpuidle folks about extending the interface and attempt to preserve
QoS guarantees, instead of jumping straight to the deepest possible C
state.

Compile tested only -- I've no idea how to actually use these vile
things.
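
For reference, the shape a converted injection thread takes (a hedged
sketch: the thread function and its timings are made up, but the
SCHED_FIFO / per-CPU kthread constraints mirror the WARN_ON_ONCE()s in
play_idle() below):

#include <linux/cpu.h>
#include <linux/jiffies.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/sched/rt.h>

static int idle_inject_thread(void *arg)	/* hypothetical */
{
	static const struct sched_param param = {
		.sched_priority = MAX_USER_RT_PRIO / 2,
	};

	/* play_idle() insists on a per-CPU SCHED_FIFO kthread. */
	sched_setscheduler(current, SCHED_FIFO, &param);

	while (!kthread_should_stop()) {
		/*
		 * Inject one idle period through the generic idle cycle;
		 * cpuidle governors and QoS constraints still apply.
		 */
		play_idle(msecs_to_jiffies(10));
		schedule_timeout_interruptible(msecs_to_jiffies(40));
	}
	return 0;
}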

Cc: hpa@zytor.com
Cc: arjan@linux.intel.com
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: lenb@kernel.org
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/acpi/acpi_pad.c            |   41 ------------
 drivers/thermal/intel_powerclamp.c |   38 -----------
 include/linux/cpu.h                |    2 
 kernel/cpu/idle.c                  |  123 ++++++++++++++++++++++---------------
 kernel/time/tick-sched.c           |    2 
 5 files changed, 82 insertions(+), 124 deletions(-)

--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -41,9 +41,7 @@ static DEFINE_MUTEX(round_robin_lock);
 static unsigned long power_saving_mwait_eax;
 
 static unsigned char tsc_detected_unstable;
-static unsigned char tsc_marked_unstable;
 static unsigned char lapic_detected_unstable;
-static unsigned char lapic_marked_unstable;
 
 static void power_saving_mwait_init(void)
 {
@@ -153,10 +151,9 @@ static int power_saving_thread(void *dat
 	unsigned int tsk_index = (unsigned long)data;
 	u64 last_jiffies = 0;
 
-	sched_setscheduler(current, SCHED_RR, &param);
+	sched_setscheduler(current, SCHED_FIFO, &param);
 
 	while (!kthread_should_stop()) {
-		int cpu;
 		u64 expire_time;
 
 		try_to_freeze();
@@ -171,41 +168,7 @@ static int power_saving_thread(void *dat
 
 		expire_time = jiffies + HZ * (100 - idle_pct) / 100;
 
-		while (!need_resched()) {
-			if (tsc_detected_unstable && !tsc_marked_unstable) {
-				/* TSC could halt in idle, so notify users */
-				mark_tsc_unstable("TSC halts in idle");
-				tsc_marked_unstable = 1;
-			}
-			if (lapic_detected_unstable && !lapic_marked_unstable) {
-				int i;
-				/* LAPIC could halt in idle, so notify users */
-				for_each_online_cpu(i)
-					clockevents_notify(
-						CLOCK_EVT_NOTIFY_BROADCAST_ON,
-						&i);
-				lapic_marked_unstable = 1;
-			}
-			local_irq_disable();
-			cpu = smp_processor_id();
-			if (lapic_marked_unstable)
-				clockevents_notify(
-					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
-			stop_critical_timings();
-
-			mwait_idle_with_hints(power_saving_mwait_eax, 1);
-
-			start_critical_timings();
-			if (lapic_marked_unstable)
-				clockevents_notify(
-					CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
-			local_irq_enable();
-
-			if (jiffies > expire_time) {
-				do_sleep = 1;
-				break;
-			}
-		}
+		play_idle(expire_time);
 
 		/*
 		 * current sched_rt has threshold for rt task running time.
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -247,11 +247,6 @@ static u64 pkg_state_counter(void)
 	return count;
 }
 
-static void noop_timer(unsigned long foo)
-{
-	/* empty... just the fact that we get the interrupt wakes us up */
-}
-
 static unsigned int get_compensation(int ratio)
 {
 	unsigned int comp = 0;
@@ -356,7 +351,6 @@ static bool powerclamp_adjust_controls(u
 static int clamp_thread(void *arg)
 {
 	int cpunr = (unsigned long)arg;
-	DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
 	static const struct sched_param param = {
 		.sched_priority = MAX_USER_RT_PRIO/2,
 	};
@@ -365,11 +359,9 @@ static int clamp_thread(void *arg)
 
 	set_bit(cpunr, cpu_clamping_mask);
 	set_freezable();
-	init_timer_on_stack(&wakeup_timer);
 	sched_setscheduler(current, SCHED_FIFO, &param);
 
-	while (true == clamping && !kthread_should_stop() &&
-		cpu_online(cpunr)) {
+	while (clamping && !kthread_should_stop() && cpu_online(cpunr)) {
 		int sleeptime;
 		unsigned long target_jiffies;
 		unsigned int guard;
@@ -417,35 +409,11 @@ static int clamp_thread(void *arg)
 		if (should_skip)
 			continue;
 
-		target_jiffies = jiffies + duration_jiffies;
-		mod_timer(&wakeup_timer, target_jiffies);
 		if (unlikely(local_softirq_pending()))
 			continue;
-		/*
-		 * stop tick sched during idle time, interrupts are still
-		 * allowed. thus jiffies are updated properly.
-		 */
-		preempt_disable();
-		tick_nohz_idle_enter();
-		/* mwait until target jiffies is reached */
-		while (time_before(jiffies, target_jiffies)) {
-			unsigned long ecx = 1;
-			unsigned long eax = target_mwait;
-
-			/*
-			 * REVISIT: may call enter_idle() to notify drivers who
-			 * can save power during cpu idle. same for exit_idle()
-			 */
-			local_touch_nmi();
-			stop_critical_timings();
-			mwait_idle_with_hints(eax, ecx);
-			start_critical_timings();
-			atomic_inc(&idle_wakeup_counter);
-		}
-		tick_nohz_idle_exit();
-		preempt_enable_no_resched();
+
+		play_idle(duration_jiffies);
 	}
-	del_timer_sync(&wakeup_timer);
 	clear_bit(cpunr, cpu_clamping_mask);
 
 	return 0;
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -215,6 +215,8 @@ enum cpuhp_state {
 	CPUHP_ONLINE,
 };
 
+void play_idle(unsigned long jiffies);
+
 void cpu_startup_entry(enum cpuhp_state state);
 void cpu_idle(void);
 
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -63,62 +63,88 @@ void __weak arch_cpu_idle(void)
 }
 
 /*
- * Generic idle loop implementation
+ * Generic idle cycle.
  */
-static void cpu_idle_loop(void)
+static void do_idle(void)
 {
-	while (1) {
-		tick_nohz_idle_enter();
+	tick_nohz_idle_enter();
 
-		while (!need_resched()) {
-			check_pgt_cache();
-			rmb();
-
-			if (cpu_is_offline(smp_processor_id()))
-				arch_cpu_idle_dead();
-
-			local_irq_disable();
-			arch_cpu_idle_enter();
-
-			/*
-			 * In poll mode we reenable interrupts and spin.
-			 *
-			 * Also if we detected in the wakeup from idle
-			 * path that the tick broadcast device expired
-			 * for us, we don't want to go deep idle as we
-			 * know that the IPI is going to arrive right
-			 * away
-			 */
-			if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
-				cpu_idle_poll();
-			} else {
-				if (!current_clr_polling_and_test()) {
-					stop_critical_timings();
-					rcu_idle_enter();
-					arch_cpu_idle();
-					WARN_ON_ONCE(irqs_disabled());
-					rcu_idle_exit();
-					start_critical_timings();
-				} else {
-					local_irq_enable();
-				}
-				__current_set_polling();
-			}
-			arch_cpu_idle_exit();
-		}
+	while (!need_resched()) {
+		check_pgt_cache();
+		rmb();
+
+		if (cpu_is_offline(smp_processor_id()))
+			arch_cpu_idle_dead();
+
+		local_irq_disable();
+		arch_cpu_idle_enter();
 
 		/*
-		 * We need to test and propagate the TIF_NEED_RESCHED bit here
-		 * because we might not have sent the reschedule IPI to idle
-		 * tasks.
+		 * In poll mode we reenable interrupts and spin.
+		 *
+		 * Also if we detected in the wakeup from idle path that the
+		 * tick broadcast device expired for us, we don't want to go
+		 * deep idle as we know that the IPI is going to arrive right
+		 * away
 		 */
-		preempt_fold_need_resched();
-		tick_nohz_idle_exit();
-		schedule_preempt_disabled();
+		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+			cpu_idle_poll();
+		} else {
+			if (!current_clr_polling_and_test()) {
+				stop_critical_timings();
+				rcu_idle_enter();
+				arch_cpu_idle();
+				WARN_ON_ONCE(irqs_disabled());
+				rcu_idle_exit();
+				start_critical_timings();
+			} else {
+				local_irq_enable();
+			}
+			__current_set_polling();
+		}
+		arch_cpu_idle_exit();
 	}
+
+	/*
+	 * We need to test and propagate the TIF_NEED_RESCHED bit here
+	 * because we might not have sent the reschedule IPI to idle
+	 * tasks.
+	 */
+	preempt_fold_need_resched();
+	tick_nohz_idle_exit();
+	schedule_preempt_disabled();
+}
+
+static void play_idle_timer(unsigned long foo)
+{
+	set_tsk_need_resched(current);
+}
+
+void play_idle(unsigned long duration)
+{
+	DEFINE_TIMER(wakeup_timer, play_idle_timer, 0, 0);
+
+	/*
+	 * Only FIFO tasks can disable the tick since they don't need the forced
+	 * preemption.
+	 */
+	WARN_ON_ONCE(current->policy != SCHED_FIFO);
+	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
+	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
+	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
+
+	init_timer_on_stack(&wakeup_timer);
+	mod_timer_pinned(&wakeup_timer, jiffies + duration);
+
+	preempt_disable();
+	do_idle();
+	del_timer_sync(&wakeup_timer);
+	preempt_fold_need_resched();
+	preempt_enable();
 }
+EXPORT_SYMBOL_GPL(play_idle);
 
-void cpu_startup_entry(enum cpuhp_state state)
+__noreturn void cpu_startup_entry(enum cpuhp_state state)
 {
 	/*
 	 * This #ifdef needs to die, but it's too late in the cycle to
@@ -137,5 +163,6 @@ void cpu_startup_entry(enum cpuhp_state
 #endif
 	__current_set_polling();
 	arch_cpu_idle_prepare();
-	cpu_idle_loop();
+	while (1)
+		do_idle();
 }
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -804,7 +804,6 @@ void tick_nohz_idle_enter(void)
 
 	local_irq_enable();
 }
-EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
 
 /**
  * tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -932,7 +931,6 @@ void tick_nohz_idle_exit(void)
 
 	local_irq_enable();
 }
-EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
 
 static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
 {




* [PATCH 4/7] preempt, locking: Rework local_bh_{dis,en}able()
  2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
                   ` (2 preceding siblings ...)
  2013-11-20 16:04 ` [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations Peter Zijlstra
@ 2013-11-20 16:04 ` Peter Zijlstra
  2013-11-20 16:04 ` [PATCH 5/7] locking: Optimize lock_bh functions Peter Zijlstra
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:04 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, linux-pm

[-- Attachment #1: peter_zijlstra-inline-local_bh_disable.patch --]
[-- Type: text/plain, Size: 4831 bytes --]

Currently local_bh_disable() is out-of-line for no apparent reason.
So inline it to save a few cycles on call/return nonsense; the
function body is a single add on x86 (a few extra loads and stores on
load/store archs).

Also expose two new local_bh functions:

  __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);

which implement the actual local_bh_{dis,en}able() behaviour.

The next patch uses the exposed @cnt argument to optimize bh lock
functions.
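
For illustration, a (hypothetical) bh section after the change; callers
are untouched, only the generated code differs:

#include <linux/bottom_half.h>

static void my_bh_section(void)
{
	local_bh_disable();	/* now inlines to __local_bh_disable_ip(
				 * _THIS_IP_, SOFTIRQ_DISABLE_OFFSET):
				 * one preempt_count add plus a barrier */
	/* ... softirq-safe work ... */
	local_bh_enable();	/* thin inline wrapper; the heavy lifting
				 * stays out-of-line in
				 * __local_bh_enable_ip() */
}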

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/bottom_half.h  |   32 +++++++++++++++++++++++++++++---
 include/linux/hardirq.h      |    1 +
 include/linux/preempt_mask.h |    1 -
 kernel/softirq.c             |   33 ++++-----------------------------
 4 files changed, 34 insertions(+), 33 deletions(-)

--- a/include/linux/bottom_half.h
+++ b/include/linux/bottom_half.h
@@ -1,9 +1,35 @@
 #ifndef _LINUX_BH_H
 #define _LINUX_BH_H
 
-extern void local_bh_disable(void);
+#include <linux/preempt.h>
+#include <linux/preempt_mask.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
+#else
+static __always_inline void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
+{
+	preempt_count_add(cnt);
+	barrier();
+}
+#endif
+
+static inline void local_bh_disable(void)
+{
+	__local_bh_disable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
+
 extern void _local_bh_enable(void);
-extern void local_bh_enable(void);
-extern void local_bh_enable_ip(unsigned long ip);
+extern void __local_bh_enable_ip(unsigned long ip, unsigned int cnt);
+
+static inline void local_bh_enable_ip(unsigned long ip)
+{
+	__local_bh_enable_ip(ip, SOFTIRQ_DISABLE_OFFSET);
+}
+
+static inline void local_bh_enable(void)
+{
+	__local_bh_enable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
 
 #endif /* _LINUX_BH_H */
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <asm/hardirq.h>
 
 
 extern void synchronize_irq(unsigned int irq);
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -2,7 +2,6 @@
 #define LINUX_PREEMPT_MASK_H
 
 #include <linux/preempt.h>
-#include <asm/hardirq.h>
 
 /*
  * We put the hardirq and softirq counter into the preemption
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -89,7 +89,7 @@ static void wakeup_softirqd(void)
  * where hardirqs are disabled legitimately:
  */
 #ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip, unsigned int cnt)
+void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 {
 	unsigned long flags;
 
@@ -114,21 +114,8 @@ static void __local_bh_disable(unsigned
 	if (preempt_count() == cnt)
 		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
-#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
-{
-	preempt_count_add(cnt);
-	barrier();
-}
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
-void local_bh_disable(void)
-{
-	__local_bh_disable(_RET_IP_, SOFTIRQ_DISABLE_OFFSET);
-}
-
-EXPORT_SYMBOL(local_bh_disable);
-
 static void __local_bh_enable(unsigned int cnt)
 {
 	WARN_ON_ONCE(!irqs_disabled());
@@ -151,7 +138,7 @@ void _local_bh_enable(void)
 
 EXPORT_SYMBOL(_local_bh_enable);
 
-static inline void _local_bh_enable_ip(unsigned long ip)
+void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 {
 	WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -166,7 +153,7 @@ static inline void _local_bh_enable_ip(u
 	 * Keep preemption disabled until we are done with
 	 * softirq processing:
  	 */
-	preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);
+	preempt_count_sub(cnt - 1);
 
 	if (unlikely(!in_interrupt() && local_softirq_pending())) {
 		/*
@@ -183,18 +170,6 @@ static inline void _local_bh_enable_ip(u
 	preempt_check_resched();
 }
 
-void local_bh_enable(void)
-{
-	_local_bh_enable_ip(_RET_IP_);
-}
-EXPORT_SYMBOL(local_bh_enable);
-
-void local_bh_enable_ip(unsigned long ip)
-{
-	_local_bh_enable_ip(ip);
-}
-EXPORT_SYMBOL(local_bh_enable_ip);
-
 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
  * but break the loop if need_resched() is set or after 2 ms.
@@ -268,7 +243,7 @@ asmlinkage void __do_softirq(void)
 	pending = local_softirq_pending();
 	account_irq_enter_time(current);
 
-	__local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET);
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	lockdep_softirq_start();
 
 	cpu = smp_processor_id();




* [PATCH 5/7] locking: Optimize lock_bh functions
  2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
                   ` (3 preceding siblings ...)
  2013-11-20 16:04 ` [PATCH 4/7] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
@ 2013-11-20 16:04 ` Peter Zijlstra
  2013-11-20 16:04 ` [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:04 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, linux-pm

[-- Attachment #1: peter_zijlstra-remove_preempt_enable_no_sched-from-locks.patch --]
[-- Type: text/plain, Size: 5839 bytes --]

Currently all _bh_ lock functions do two preempt_count operations:

  local_bh_disable();
  preempt_disable();

and for the unlock:

  preempt_enable_no_resched();
  local_bh_enable();

Since it's a waste of perfectly good cycles to modify the same variable
twice when you can do it in one go, use the new
__local_bh_{dis,en}able_ip() functions that allow us to provide a
preempt_count value to add/sub.

So define SOFTIRQ_LOCK_OFFSET as the combined offset a _bh_ lock needs
to add/sub in one go.

As a bonus it gets rid of the preempt_enable_no_resched() usage.
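
A stand-alone model of the resulting preempt_count arithmetic, assuming
the usual offset values (SOFTIRQ_DISABLE_OFFSET = 0x200,
PREEMPT_CHECK_OFFSET = 1; the real values live in preempt_mask.h):

#include <assert.h>

#define SOFTIRQ_DISABLE_OFFSET	0x200
#define PREEMPT_CHECK_OFFSET	1
#define SOFTIRQ_LOCK_OFFSET	(SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET)

int main(void)
{
	int two_step = 0, one_step = 0;

	/* Old spin_lock_bh(): two read-modify-writes of preempt_count. */
	two_step += SOFTIRQ_DISABLE_OFFSET;	/* local_bh_disable() */
	two_step += PREEMPT_CHECK_OFFSET;	/* preempt_disable()  */

	/* New spin_lock_bh(): a single combined update. */
	one_step += SOFTIRQ_LOCK_OFFSET;
	assert(two_step == one_step);

	/* Unlock: __local_bh_enable_ip() subtracts (cnt - 1), may run
	 * pending softirqs, then does the final dec + resched check.  */
	one_step -= SOFTIRQ_LOCK_OFFSET - 1;
	one_step -= 1;
	assert(one_step == 0);
	return 0;
}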

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/preempt_mask.h     |   15 +++++++++++++++
 include/linux/rwlock_api_smp.h   |   12 ++++--------
 include/linux/spinlock_api_smp.h |   12 ++++--------
 include/linux/spinlock_api_up.h  |   16 +++++++++++-----
 4 files changed, 34 insertions(+), 21 deletions(-)

--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -78,6 +78,21 @@
 #endif
 
 /*
+ * The preempt_count offset needed for things like:
+ *
+ *  spin_lock_bh()
+ *
+ * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and
+ * softirqs, such that unlock sequences of:
+ *
+ *  spin_unlock();
+ *  local_bh_enable();
+ *
+ * Work as expected.
+ */
+#define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET)
+
+/*
  * Are we running in atomic context?  WARNING: this macro cannot
  * always detect atomic context; in particular, it cannot know about
  * held spinlocks in non-preemptible kernels.  Thus it should not be
--- a/include/linux/rwlock_api_smp.h
+++ b/include/linux/rwlock_api_smp.h
@@ -172,8 +172,7 @@ static inline void __raw_read_lock_irq(r
 
 static inline void __raw_read_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_read_trylock, do_raw_read_lock);
 }
@@ -200,8 +199,7 @@ static inline void __raw_write_lock_irq(
 
 static inline void __raw_write_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
 }
@@ -250,8 +248,7 @@ static inline void __raw_read_unlock_bh(
 {
 	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_read_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
@@ -275,8 +272,7 @@ static inline void __raw_write_unlock_bh
 {
 	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_write_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 #endif /* __LINUX_RWLOCK_API_SMP_H */
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -131,8 +131,7 @@ static inline void __raw_spin_lock_irq(r
 
 static inline void __raw_spin_lock_bh(raw_spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
 }
@@ -174,20 +173,17 @@ static inline void __raw_spin_unlock_bh(
 {
 	spin_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_spin_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 static inline int __raw_spin_trylock_bh(raw_spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	if (do_raw_spin_trylock(lock)) {
 		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 		return 1;
 	}
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	return 0;
 }
 
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -24,11 +24,14 @@
  * flags straight, to suppress compiler warnings of unused lock
  * variables, and to add the proper checker annotations:
  */
+#define ___LOCK(lock) \
+  do { __acquire(lock); (void)(lock); } while (0)
+
 #define __LOCK(lock) \
-  do { preempt_disable(); __acquire(lock); (void)(lock); } while (0)
+  do { preempt_disable(); ___LOCK(lock); } while (0);
 
 #define __LOCK_BH(lock) \
-  do { local_bh_disable(); __LOCK(lock); } while (0)
+  do { __local_bh_disable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); ___LOCK(lock); } while (0)
 
 #define __LOCK_IRQ(lock) \
   do { local_irq_disable(); __LOCK(lock); } while (0)
@@ -36,12 +39,15 @@
 #define __LOCK_IRQSAVE(lock, flags) \
   do { local_irq_save(flags); __LOCK(lock); } while (0)
 
+#define ___UNLOCK(lock) \
+  do { __release(lock); (void)(lock); } while (0)
+
 #define __UNLOCK(lock) \
-  do { preempt_enable(); __release(lock); (void)(lock); } while (0)
+  do { preempt_enable(); ___UNLOCK(lock); } while (0)
 
 #define __UNLOCK_BH(lock) \
-  do { preempt_enable_no_resched(); local_bh_enable(); \
-	  __release(lock); (void)(lock); } while (0)
+  do { __local_bh_enable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); \
+       ___UNLOCK(lock); } while (0)
 
 #define __UNLOCK_IRQ(lock) \
   do { local_irq_enable(); __UNLOCK(lock); } while (0)




* [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
                   ` (4 preceding siblings ...)
  2013-11-20 16:04 ` [PATCH 5/7] locking: Optimize lock_bh functions Peter Zijlstra
@ 2013-11-20 16:04 ` Peter Zijlstra
  2013-11-20 18:02   ` Eliezer Tamir
  2013-11-20 16:04 ` [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
  2013-11-20 16:34 ` [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:04 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, linux-pm

[-- Attachment #1: peterz-fixup-weird-preempt_enable_no_resched-usage.patch --]
[-- Type: text/plain, Size: 2815 bytes --]

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule(), or if we know preemption cannot actually be enabled
by that statement because further preempt_count 'refs' are known to be
held.
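
Schematically (the helpers are hypothetical; the point is the control
flow):

/* Fine: the skipped preemption check is immediately followed by an
 * explicit schedule(), so a pending reschedule is not lost. */
static void good_pattern(void)
{
	preempt_enable_no_resched();
	schedule();
}

/* Broken: a reschedule that arrived while preemption was disabled is
 * silently deferred until some unrelated preemption point. */
static void bad_pattern(void)
{
	preempt_enable_no_resched();
	long_running_work();	/* hypothetical */
}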

As to the busy_poll mess: that looks to be completely and utterly
broken. sched_clock() can return utter garbage with interrupts enabled
(rare, but still), and it can drift unbounded between CPUs, so if you
get preempted/migrated and your new CPU is years behind the previous
CPU we get to busy-spin for a _very_ long time. There is a _REASON_
sched_clock() warns about preemptibility; papering over it with a
preempt_disable()/preempt_enable_no_resched() pair is just terminal
brain damage on so many levels.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Chris Leech <christopher.leech@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/net/busy_poll.h |   20 ++++++++------------
 net/ipv4/tcp.c          |    4 ++--
 2 files changed, 10 insertions(+), 14 deletions(-)

--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,23 @@ static inline bool net_busy_loop_on(void
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
 static inline u64 busy_loop_us_clock(void)
 {
 	u64 rc;
 
+	/*
+	 * XXX with interrupts enabled sched_clock() can return utter garbage
+	 * Furthermore, it can have unbounded drift between CPUs, so the below
+	 * usage is terminally broken and only serves to shut up a valid debug
+	 * warning.
+	 */
+
 	preempt_disable_notrace();
 	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
+	preempt_enable_notrace();
 
 	return rc >> 10;
 }
-#else /* CONFIG_DEBUG_PREEMPT */
-static inline u64 busy_loop_us_clock(void)
-{
-	return sched_clock() >> 10;
-}
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
 		    !sysctl_tcp_low_latency &&
 		    net_dma_find_channel()) {
-			preempt_enable_no_resched();
+			preempt_enable();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
 		} else {
-			preempt_enable_no_resched();
+			preempt_enable();
 		}
 	}
 #endif




* [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules
  2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
                   ` (5 preceding siblings ...)
  2013-11-20 16:04 ` [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
@ 2013-11-20 16:04 ` Peter Zijlstra
  2013-11-20 18:54   ` Jacob Pan
  2013-11-20 16:34 ` [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:04 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Rusty Russell

[-- Attachment #1: peterz-hide-preempt_enable_no_resched-modules.patch --]
[-- Type: text/plain, Size: 1515 bytes --]

Discourage drivers/modules from being creative with preemption.

Sadly all of this is implemented in macros and inlines, so if they want
to do evil they still can, but at least try and discourage some of it.
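
From the module side the result is a hard compile failure rather than a
runtime surprise; roughly (illustrative, exact diagnostics vary by
compiler):

/* Module code, built with -DMODULE, after this patch: */
static void mod_code(void)
{
	preempt_disable();
	/* ... */
	preempt_enable_no_resched();	/* error: implicit declaration of
					 * function; the macro is
					 * #undef'ed for modules */
	preempt_enable();		/* still fine */
}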

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Chris Leech <christopher.leech@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/preempt.h |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -64,7 +64,11 @@ do { \
 } while (0)
 
 #else
-#define preempt_enable() preempt_enable_no_resched()
+#define preempt_enable() \
+do { \
+	barrier(); \
+	preempt_count_dec(); \
+} while (0)
 #define preempt_check_resched() do { } while (0)
 #endif
 
@@ -126,6 +130,16 @@ do { \
 #define preempt_fold_need_resched() do { } while (0)
 #endif
 
+#ifdef MODULE
+/*
+ * Modules have no business playing preemption tricks.
+ */
+#undef sched_preempt_enable_no_resched
+#undef preempt_enable_no_resched
+#undef preempt_enable_no_resched_notrace
+#undef preempt_check_resched
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 
 struct preempt_notifier;




* Re: [PATCH 0/7] Cure some vaux idle wrackage
  2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
                   ` (6 preceding siblings ...)
  2013-11-20 16:04 ` [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
@ 2013-11-20 16:34 ` Peter Zijlstra
  2013-11-20 17:19   ` Jacob Pan
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:34 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner
  Cc: linux-kernel, linux-pm

On Wed, Nov 20, 2013 at 05:04:50PM +0100, Peter Zijlstra wrote:
> I ran head-first into intel_powerclamp and acpi_pad yesterday -- afaict they
> try and do pretty much the same thing but both had competing sets of bugs and
> issues.
> 
> Related to that I went over the tree looking for preempt_enable_no_resched()
> abuse.
> 
> I know that patch 3 isn't going to be popular with Arjan but you guys are going
> to have to make that work. That stuff is hideously terrible.
> 
> A note to Zhang Rui: _never_ merge EXPORTs in another maintainer's guts without
> an explicit ACK from him; Thomas was quite horrified when I asked him how
> tick_nohz_idle_{enter,exit} got exported.
> 
> That said, thanks for a 'fun' two days :-(

s/vaux/faux/ .. it's been a long day.. and there are far too many dead
email addresses @intel.com


* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 16:04 ` [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations Peter Zijlstra
@ 2013-11-20 16:40   ` Arjan van de Ven
  2013-11-20 16:59     ` Peter Zijlstra
  2013-11-20 17:23     ` Thomas Gleixner
  2013-11-21  0:54   ` Jacob Pan
  1 sibling, 2 replies; 47+ messages in thread
From: Arjan van de Ven @ 2013-11-20 16:40 UTC (permalink / raw)
  To: Peter Zijlstra, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner
  Cc: linux-kernel, linux-pm, Rafael J. Wysocki

On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> This does not fully preserve existing behaviour in that the generic
> idle cycle function calls into the normal cpuidle-governed idle
> routines and should thus respect things like QoS parameters and the
> like.


NAK on the powerclamp side.

powerclamp MUST NOT do that....
it is needed to go to the deepest state no matter what
(this is for when your system is overheating. there is not a lot of choice
here... alternative is an emergency reset that the hardware does for safety)




* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 16:40   ` Arjan van de Ven
@ 2013-11-20 16:59     ` Peter Zijlstra
  2013-11-20 17:23     ` Thomas Gleixner
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:59 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: lenb, rjw, Eliezer Tamir, Chris Leech, David Miller, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, Thomas Gleixner,
	linux-kernel, linux-pm, Rafael J. Wysocki

On Wed, Nov 20, 2013 at 08:40:49AM -0800, Arjan van de Ven wrote:
> On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> >This does not fully preserve existing behaviour in that the generic
> >idle cycle function calls into the normal cpuidle-governed idle
> >routines and should thus respect things like QoS parameters and the
> >like.
> 
> 
> NAK on the powerclamp side.
> 
> powerclamp MUST NOT do that....
> it is needed to go to the deepest state no matter what
> (this is for when your system is overheating. there is not a lot of choice
> here... alternative is an emergency reset that the hardware does for safety)

Then it's an even more broken piece of shit than I thought it was.

The only way to do what you want is to pretty much do stop_machine().

There's no guarantee your current FIFO-50 tasks will ever get to run.

Also, since when does Intel hardware do emergency resets on thermal
events? It used to force throttle stuff.


* Re: [PATCH 0/7] Cure some vaux idle wrackage
  2013-11-20 16:34 ` [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
@ 2013-11-20 17:19   ` Jacob Pan
  2013-11-20 17:24     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2013-11-20 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On Wed, 20 Nov 2013 17:34:53 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 20, 2013 at 05:04:50PM +0100, Peter Zijlstra wrote:
> > I ran head-first into intel_powerclamp and acpi_pad yesterday --
> > afaict they try and do pretty much the same thing but both had
> > competing sets of bugs and issues.
> > 
> > Related to that I went over the tree looking for
> > preempt_enable_no_resched() abuse.
> > 
> > I know that patch 3 isn't going to be popular with Arjan but you
> > guys are going to have to make that work. That stuff is hideously
> > terrible.
> > 
> > A note to Zhang Rui: _never_ merge EXPORTs in another maintainer's
> > guts without an explicit ACK from him; Thomas was quite horrified
> > when I asked him how tick_nohz_idle_{enter,exit} got exported.
> > 
> > That said, thanks for a 'fun' two days :-(
> 
> s/vaux/faux/ .. its been a long day.. and there's far too many dead
> email addresses @intel.com
Hi Peter,
I cannot cleanly apply the patch set on the upstream kernel. It failed on
Subject: [PATCH 4/7] preempt, locking: Rework local_bh_{dis,en}able()

Can you tell me which commit id I should do git am against?

Thanks,

Jacob


* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 16:40   ` Arjan van de Ven
  2013-11-20 16:59     ` Peter Zijlstra
@ 2013-11-20 17:23     ` Thomas Gleixner
  2013-11-20 17:23       ` Arjan van de Ven
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2013-11-20 17:23 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, linux-kernel, linux-pm, Rafael J. Wysocki

On Wed, 20 Nov 2013, Arjan van de Ven wrote:

> On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> > This does not fully preserve existing behaviour in that the generic
> > idle cycle function calls into the normal cpuidle-governed idle
> > routines and should thus respect things like QoS parameters and the
> > like.
> 
> 
> NAK on the powerclamp side.
> 
> powerclamp MUST NOT do that....
> it is needed to go to the deepest state no matter what
> (this is for when your system is overheating. there is not a lot of choice
> here... alternative is an emergency reset that the hardware does for safety)

So that whole machinery falls apart when the thing which is running on
that hot core is a while(1) loop with a higher or equal FIFO priority
than this thread. Even if you'd run at prio 99, then there is no
guarantee that the cpu hog does not run with prio 99 as well and due
to FIFO and being on the CPU it's not going to let you on.

And even if the issue was caused by a lower prio CPU hog, the system
might just be in a RT throttled state because of the hog. So depending
on the throttler settings it might take quite some time for the
clamper thread to get on the CPU.

Amazingly naive and stupid approach.

Thanks,

	tglx


* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 17:23     ` Thomas Gleixner
@ 2013-11-20 17:23       ` Arjan van de Ven
  2013-11-20 17:55         ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2013-11-20 17:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, linux-kernel, linux-pm, Rafael J. Wysocki

On 11/20/2013 9:23 AM, Thomas Gleixner wrote:
> On Wed, 20 Nov 2013, Arjan van de Ven wrote:
>
>> On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
>>> This does not fully preserve existing behaviour in that the generic
>>> idle cycle function calls into the normal cpuidle-governed idle
>>> routines and should thus respect things like QoS parameters and the
>>> like.
>>
>>
>> NAK on the powerclamp side.
>>
>> powerclamp MUST NOT do that....
>> it is needed to go to the deepest state no matter what
>> (this is for when your system is overheating. there is not a lot of choice
>> here... alternative is an emergency reset that the hardware does for safety)
>
> So that whole machinery falls apart when the thing which is running on
> that hot core is a while(1) loop with a higher or equal FIFO priority
> than this thread. Even if you'd run at prio 99, then there is no
> guarantee that the cpu hog does not run with prio 99 as well and due
> to FIFO and being on the CPU it's not going to let you on.

the idea was to at least give people who know what they're doing a chance to run



* Re: [PATCH 0/7] Cure some vaux idle wrackage
  2013-11-20 17:19   ` Jacob Pan
@ 2013-11-20 17:24     ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 17:24 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On Wed, Nov 20, 2013 at 09:19:21AM -0800, Jacob Pan wrote:
> Hi Peter,
> I cannot cleanly apply the patch set on upstream kernel. Failed on
> Subject: [PATCH 4/7] preempt, locking: Rework local_bh_{dis,en}able()
> 
> Can you tell me which commit id should I do git am?

Oh, I wrote them against tip/master. I don't suppose there's much in
tip/master that's not (yet) in Linus' tree though, so the conflict
should be an easy one.

I suppose the patch giving you grief is the one below.

---
commit f1a83e652bedef88d6d77d3dc58250e08e7062bd
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Nov 19 16:42:47 2013 +0100

    lockdep: Correctly annotate hardirq context in irq_exit()
    
    There was a reported deadlock on -rt which lockdep didn't report.
    
    It turns out that in irq_exit() we tell lockdep that the hardirq
    context ends and then do all kinds of locking afterwards.
    
    To fix it, move trace_hardirq_exit() to the very end of irq_exit(); this
    ensures all locking in tick_irq_exit() and rcu_irq_exit() is properly
    recorded as happening from hardirq context.
    
    This however leads to the 'fun' little problem of running softirqs
    while in hardirq context. To cure this make the softirq code a little
    more complex (in the CONFIG_TRACE_IRQFLAGS case).
    
    Due to stack swizzling arch dependent trickery we cannot pass an
    argument to __do_softirq() to tell it if it was done from hardirq
    context or not; so use a side-band argument.
    
    When we do __do_softirq() from hardirq context, 'atomically' flip to
    softirq context and back, so that no locking goes without being in
    either hard- or soft-irq context.
    
    I didn't find any new problems in mainline using this patch, but it
    did show the -rt problem.
    
    Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/n/tip-dgwc5cdksbn0jk09vbmcc9sa@git.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/kernel/softirq.c b/kernel/softirq.c
index b24988353458..eb0acf44b063 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -213,14 +213,52 @@ EXPORT_SYMBOL(local_bh_enable_ip);
 #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
 #define MAX_SOFTIRQ_RESTART 10
 
+#ifdef CONFIG_TRACE_IRQFLAGS
+/*
+ * Convoluted means of passing __do_softirq() a message through the various
+ * architecture execute_on_stack() bits.
+ *
+ * When we run softirqs from irq_exit() and thus on the hardirq stack we need
+ * to keep the lockdep irq context tracking as tight as possible in order to
+ * not miss-qualify lock contexts and miss possible deadlocks.
+ */
+static DEFINE_PER_CPU(int, softirq_from_hardirq);
+
+static inline void lockdep_softirq_from_hardirq(void)
+{
+	this_cpu_write(softirq_from_hardirq, 1);
+}
+
+static inline void lockdep_softirq_start(void)
+{
+	if (this_cpu_read(softirq_from_hardirq))
+		trace_hardirq_exit();
+	lockdep_softirq_enter();
+}
+
+static inline void lockdep_softirq_end(void)
+{
+	lockdep_softirq_exit();
+	if (this_cpu_read(softirq_from_hardirq)) {
+		this_cpu_write(softirq_from_hardirq, 0);
+		trace_hardirq_enter();
+	}
+}
+
+#else
+static inline void lockdep_softirq_from_hardirq(void) { }
+static inline void lockdep_softirq_start(void) { }
+static inline void lockdep_softirq_end(void) { }
+#endif
+
 asmlinkage void __do_softirq(void)
 {
-	struct softirq_action *h;
-	__u32 pending;
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
-	int cpu;
 	unsigned long old_flags = current->flags;
 	int max_restart = MAX_SOFTIRQ_RESTART;
+	struct softirq_action *h;
+	__u32 pending;
+	int cpu;
 
 	/*
 	 * Mask out PF_MEMALLOC as current task context is borrowed for the
@@ -233,7 +271,7 @@ asmlinkage void __do_softirq(void)
 	account_irq_enter_time(current);
 
 	__local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET);
-	lockdep_softirq_enter();
+	lockdep_softirq_start();
 
 	cpu = smp_processor_id();
 restart:
@@ -280,16 +318,13 @@ asmlinkage void __do_softirq(void)
 		wakeup_softirqd();
 	}
 
-	lockdep_softirq_exit();
-
+	lockdep_softirq_end();
 	account_irq_exit_time(current);
 	__local_bh_enable(SOFTIRQ_OFFSET);
 	WARN_ON_ONCE(in_interrupt());
 	tsk_restore_flags(current, old_flags, PF_MEMALLOC);
 }
 
-
-
 asmlinkage void do_softirq(void)
 {
 	__u32 pending;
@@ -332,6 +367,7 @@ void irq_enter(void)
 static inline void invoke_softirq(void)
 {
 	if (!force_irqthreads) {
+		lockdep_softirq_from_hardirq();
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*
 		 * We can safely execute softirq on the current stack if
@@ -377,13 +413,13 @@ void irq_exit(void)
 #endif
 
 	account_irq_exit_time(current);
-	trace_hardirq_exit();
 	preempt_count_sub(HARDIRQ_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
 
 	tick_irq_exit();
 	rcu_irq_exit();
+	trace_hardirq_exit(); /* must be last! */
 }
 
 /*

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 17:23       ` Arjan van de Ven
@ 2013-11-20 17:55         ` Thomas Gleixner
  2013-11-20 18:21           ` Arjan van de Ven
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2013-11-20 17:55 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, linux-kernel, linux-pm, Rafael J. Wysocki

On Wed, 20 Nov 2013, Arjan van de Ven wrote:
> On 11/20/2013 9:23 AM, Thomas Gleixner wrote:
> > On Wed, 20 Nov 2013, Arjan van de Ven wrote:
> > 
> > > On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> > > > This does not fully preserve existing behaviour in that the generic
> > > > idle cycle function calls into the normal cpuidle governed idle
> > > > routines and should thus respect things like QoS parameters and the
> > > > like.
> > > 
> > > 
> > > NAK on the powerclamp side.
> > > 
> > > powerclamp MUST NOT do that....
> > > it is needed to go to the deepest state no matter what
> > > (this is for when your system is overheating. there is not a lot of choice
> > > here... alternative is an emergency reset that the hardware does for
> > > safety)
> > 
> > So that whole machinery falls apart when the thing which is running on
> > that hot core is a while(1) loop with a higher or equal FIFO priority
> > than this thread. Even if you'd run at prio 99, then there is no
> > guarantee that the cpu hog does not run with prio 99 as well and due
> > to FIFO and being on the CPU it's not going to let you on.
> 
> the idea was to at least give people who know what they're doing a chance to
> run

You can't be serious about that. Even if people know what they are
doing, they have no chance to prevent the occasional high prio runaway
bug. And then you shrug your shoulders and tell that guy who spent
time tuning the thing properly "bad luck", or what?

This is the worst of all approaches and you know that. You fought such
crap tooth and nail not so long ago.
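
For illustration, a hypothetical (untested) userspace sketch of such a
runaway; an equal-priority FIFO injection thread pinned to the same CPU
never gets on again:

#define _GNU_SOURCE
#include <sched.h>

int main(void)
{
	struct sched_param param = { .sched_priority = 99 };
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* pin to CPU 0 */
	sched_setscheduler(0, SCHED_FIFO, &param);	/* top FIFO prio */

	for (;;)
		;	/* never blocks, never yields */
}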

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-20 16:04 ` [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
@ 2013-11-20 18:02   ` Eliezer Tamir
  2013-11-20 18:15     ` Peter Zijlstra
  2013-11-21 10:10     ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Eliezer Tamir @ 2013-11-20 18:02 UTC (permalink / raw)
  To: Peter Zijlstra, Arjan van de Ven, lenb, rjw, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner
  Cc: linux-kernel, linux-pm

On 20/11/2013 18:04, Peter Zijlstra wrote:
> The only valid use of preempt_enable_no_resched() is if the very next
> line is schedule() or if we know preemption cannot actually be enabled
> by that statement due to more known preempt_count 'refs'.

The reason I used the no resched version is that busy_poll_end_time()
is almost always called with the RCU read lock held, so it seemed the more
correct option.

I have no issue with you changing this.

> As to the busy_poll mess; that looks to be completely and utterly
> broken, sched_clock() can return utter garbage with interrupts enabled
> (rare but still), it can drift unbounded between CPUs, so if you get
> preempted/migrated and your new CPU is years behind on the previous
> CPU we get to busy spin for a _very_ long time. There is a _REASON_
> sched_clock() warns about preemptability - papering over it with a
> preempt_disable()/preempt_enable_no_resched() is just terminal brain
> damage on so many levels.

IMHO This has been reviewed thoroughly.

When Ben Hutchings voiced concerns I rewrote the code to use time_after,
so even if you do get switched over to a CPU where the time is random
you will at most poll another full interval.

Linus asked me to remove this since it makes us use two time values
instead of one. see https://lkml.org/lkml/2013/7/8/345.

Cheers,
Eliezer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-20 18:02   ` Eliezer Tamir
@ 2013-11-20 18:15     ` Peter Zijlstra
  2013-11-20 20:14       ` Eliezer Tamir
  2013-11-21 10:10     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 18:15 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Arjan van de Ven, lenb, rjw, Chris Leech, David Miller,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> On 20/11/2013 18:04, Peter Zijlstra wrote:
> > The only valid use of preempt_enable_no_resched() is if the very next
> > line is schedule() or if we know preemption cannot actually be enabled
> > by that statement due to more known preempt_count 'refs'.
> 
> The reason I used the no resched version is that busy_poll_end_time()
> is almost always called with the RCU read lock held, so it seemed the more
> correct option.
> 
> I have no issue with you changing this.

There are options (CONFIG_PREEMPT_RCU) that allow scheduling while
holding rcu_read_lock().

Also, preempt_enable() only schedules when it's possible to schedule, so
calling it when you know you cannot schedule is no issue.
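
For reference, the CONFIG_PREEMPT preempt_enable() at this point in the
series looks roughly like this (paraphrased from memory, not copied):

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

So the resched only happens once the count actually drops to zero and
preemption is really possible.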

> > As to the busy_poll mess; that looks to be completely and utterly
> > broken, sched_clock() can return utter garbage with interrupts enabled
> > (rare but still), it can drift unbounded between CPUs, so if you get
> > preempted/migrated and your new CPU is years behind on the previous
> > CPU we get to busy spin for a _very_ long time. There is a _REASON_
> > sched_clock() warns about preemptability - papering over it with a
> > preempt_disable()/preempt_enable_no_resched() is just terminal brain
> > damage on so many levels.
> 
> IMHO This has been reviewed thoroughly.

At the very least you completely forgot to preserve any of that. The
changelog that introduced it is completely void of anything useful and
the code has a distinct lack of comments.

> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> so even if you do get switched over to a CPU where the time is random
> you will at most poll another full interval.
> 
> Linus asked me to remove this since it makes us use two time values
> instead of one. see https://lkml.org/lkml/2013/7/8/345.

My brain is fried for today, I'll have a look tomorrow.

But note that with patch 7/7 in place modular code can no longer use
preempt_enable_no_resched(). I'm not sure net/ipv4/tcp.c can be built
modular -- but ISTR a time when it was.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 17:55         ` Thomas Gleixner
@ 2013-11-20 18:21           ` Arjan van de Ven
  2013-11-20 19:38             ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2013-11-20 18:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, linux-kernel, linux-pm, Rafael J. Wysocki

On 11/20/2013 9:55 AM, Thomas Gleixner wrote:
> On Wed, 20 Nov 2013, Arjan van de Ven wrote:
>> On 11/20/2013 9:23 AM, Thomas Gleixner wrote:
>>> On Wed, 20 Nov 2013, Arjan van de Ven wrote:
>>>
>>>> On 11/20/2013 8:04 AM, Peter Zijlstra wrote:
> > > > > This does not fully preserve existing behaviour in that the generic
>>>>> idle cycle function calls into the normal cpuidle governed idle
>>>>> routines and should thus respect things like QoS parameters and the
>>>>> like.
>>>>
>>>>
>>>> NAK on the powerclamp side.
>>>>
>>>> powerclamp MUST NOT do that....
>>>> it is needed to go to the deepest state no matter what
>>>> (this is for when your system is overheating. there is not a lot of choice
>>>> here... alternative is an emergency reset that the hardware does for
>>>> safety)
>>>
>>> So that whole machinery falls apart when the thing which is running on
>>> that hot core is a while(1) loop with a higher or equal FIFO priority
>>> than this thread. Even if you'd run at prio 99, then there is no
>>> guarantee that the cpu hog does not run with prio 99 as well and due
>>> to FIFO and being on the CPU it's not going to let you on.
>>
>> the idea was to at least give people who know what they're doing a chance to
>> run
>
> You can't be serious about that. Even if people know what they are
> doing, they have no chance to prevent the occasional high prio runaway
> bug. And then you shrug your shoulders and tell that guy who spent
> time to tune the thing proper "bad luck" or what?

his system will crawl to a halt (and if it's for a long time on a system with
submarginal cooling, potentially reboot) due to the hardware protecting itself.
What powerclamp is trying to do is avoid these steps from the hardware (they are both
very harsh and very inefficient).

but for powerclamp to work, it needs to inject a deep idle....
I'm very ok using generic abstractions for that, but the abstraction needs to then
include a "don't be nice about picking shallow C states for performance reasons, just
pick something very deep" parameter.

(powerclamp will adjust if you have some periodic realtime task or interrupts or ..
taking away idle from it, by over time injecting a bit more idle)
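
concretely, a hypothetical sketch of the interface I mean (names made
up, not an actual proposal):

/* let play_idle() callers opt out of governor niceness and ask for
 * the deepest available C state; QoS interaction TBD */
void play_idle(unsigned long duration_jiffies, bool force_deepest);

/* powerclamp would then do: */
play_idle(duration_jiffies, true);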



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules
  2013-11-20 16:04 ` [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
@ 2013-11-20 18:54   ` Jacob Pan
  2013-11-20 19:00     ` Peter Zijlstra
  2013-11-20 19:18     ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Jacob Pan @ 2013-11-20 18:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rusty Russell

On Wed, 20 Nov 2013 17:04:57 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Discourage drivers/modules from being creative with preemption.
> 
> Sadly it is all implemented in macros and inlines, so if they want to
> do evil they still can, but at least try to discourage some of it.
> 
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: lenb@kernel.org
> Cc: rjw@rjwysocki.net
> Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
> Cc: Chris Leech <christopher.leech@intel.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: rui.zhang@intel.com
> Cc: jacob.jun.pan@linux.intel.com
> Cc: Mike Galbraith <bitbucket@online.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: hpa@zytor.com
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  include/linux/preempt.h |   16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -64,7 +64,11 @@ do { \
>  } while (0)
>  
>  #else
> -#define preempt_enable() preempt_enable_no_resched()
> +#define preempt_enable() \
> +do { \
> +	barrier(); \
> +	preempt_count_dec(); \
> +} while (0)
>  #define preempt_check_resched() do { } while (0)
>  #endif
>  
> @@ -126,6 +130,16 @@ do { \
>  #define preempt_fold_need_resched() do { } while (0)
>  #endif
>  
> +#ifdef MODULE
> +/*
> + * Modules have no business playing preemption tricks.
> + */
> +#undef sched_preempt_enable_no_resched
> +#undef preempt_enable_no_resched
> +#undef preempt_enable_no_resched_notrace
> +#undef preempt_check_resched
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  
>  struct preempt_notifier;
> 
> 

I ran into a couple of compile issues.

1)
include/linux/rcupdate.h: In function ‘rcu_read_unlock_sched_notrace’:
include/linux/rcupdate.h:889:2: error: implicit declaration of function
‘preempt_enable_no_resched_notrace’ [-Werror=implicit-function-declaration]

2)
In file included from drivers/cpufreq/acpi-cpufreq.c:41:0:
include/linux/uaccess.h: In function ‘pagefault_enable’:
include/linux/uaccess.h:34:2: error: implicit declaration of function
‘preempt_check_resched’ [-Werror=implicit-function-declaration]

1) happens when CONFIG_PREEMPT is not set, perhaps add a dummy function?
2) I am not sure if we should take pagefault_enable away from modules

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules
  2013-11-20 18:54   ` Jacob Pan
@ 2013-11-20 19:00     ` Peter Zijlstra
  2013-11-20 19:18     ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 19:00 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rusty Russell

On Wed, Nov 20, 2013 at 10:54:57AM -0800, Jacob Pan wrote:
> I ran into a couple of compile issues.
> 
> 1)
> include/linux/rcupdate.h: In function ‘rcu_read_unlock_sched_notrace’:
> include/linux/rcupdate.h:889:2: error: implicit declaration of function
> ‘preempt_enable_no_resched_notrace’ [-Werror=implicit-function-declaration]
> 
> 2)
> In file included from drivers/cpufreq/acpi-cpufreq.c:41:0:
> include/linux/uaccess.h: In function ‘pagefault_enable’:
> include/linux/uaccess.h:34:2: error: implicit declaration of function
> ‘preempt_check_resched’ [-Werror=implicit-function-declaration]
> 
> 1) happens when CONFIG_PREEMPT is not set, perhaps add a dummy function?

Ah, indeed, will fix.

> 2) I am not sure if we should take pagefault_enable away from modules

We can't, I think; I'll have to fix that up. Easiest would be to make
preempt_check_resched() available again. That function actually adds
preemption points, so it's not bad.
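
For reference, that macro is (roughly, from memory) just:

#define preempt_check_resched() \
do { \
	if (should_resched()) \
		__preempt_schedule(); \
} while (0)

i.e. a pure preemption point, no count twiddling, so exposing it to
modules again is harmless.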

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules
  2013-11-20 18:54   ` Jacob Pan
  2013-11-20 19:00     ` Peter Zijlstra
@ 2013-11-20 19:18     ` Peter Zijlstra
  2013-11-20 19:29       ` Jacob Pan
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-20 19:18 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rusty Russell

On Wed, Nov 20, 2013 at 10:54:57AM -0800, Jacob Pan wrote:
> I ran into a couple of compile issues.
> 
> 1)
> include/linux/rcupdate.h: In function ‘rcu_read_unlock_sched_notrace’:
> include/linux/rcupdate.h:889:2: error: implicit declaration of function
> ‘preempt_enable_no_resched_notrace’ [-Werror=implicit-function-declaration]
> 
> 2)
> In file included from drivers/cpufreq/acpi-cpufreq.c:41:0:
> include/linux/uaccess.h: In function ‘pagefault_enable’:
> include/linux/uaccess.h:34:2: error: implicit declaration of function
> ‘preempt_check_resched’ [-Werror=implicit-function-declaration]
> 
> 1) happens when CONFIG_PREEMPT is not set, perhaps add a dummy function?
> 2) I am not sure if we should take pagefault_enable away from modules

I think the below will cure both -- will fold in the proper patches
tomorrow, need to get away from the computer now.

---
 include/linux/preempt.h | 6 +++++-
 include/linux/uaccess.h | 5 ++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index a3d9dc8c2c00..41bc3fc4cdc5 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -93,7 +93,11 @@ do { \
 		__preempt_schedule_context(); \
 } while (0)
 #else
-#define preempt_enable_notrace() preempt_enable_no_resched_notrace()
+#define preempt_enable_notrace() \
+do { \
+	barrier(); \
+	__preempt_count_dec(); \
+} while (0)
 #endif
 
 #else /* !CONFIG_PREEMPT_COUNT */
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 9d8cf056e661..ecd3319dac33 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -25,13 +25,16 @@ static inline void pagefault_disable(void)
 
 static inline void pagefault_enable(void)
 {
+#ifndef CONFIG_PREEMPT
 	/*
 	 * make sure to issue those last loads/stores before enabling
 	 * the pagefault handler again.
 	 */
 	barrier();
 	preempt_count_dec();
-	preempt_check_resched();
+#else
+	preempt_enable();
+#endif
 }
 
 #ifndef ARCH_HAS_NOCACHE_UACCESS

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules
  2013-11-20 19:18     ` Peter Zijlstra
@ 2013-11-20 19:29       ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2013-11-20 19:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rusty Russell

On Wed, 20 Nov 2013 20:18:41 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> I think the below will cure both -- will fold in the proper patches
> tomorrow, need to get away from the computer now.
[Jacob Pan] looks good to me. one more thing for tomorrow :)


diff --git a/kernel/softirq.c b/kernel/softirq.c
index d3283d3..750e853 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -114,6 +114,7 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 	if (preempt_count() == cnt)
 		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
+EXPORT_SYMBOL_GPL(__local_bh_disable_ip);
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
 static void __local_bh_enable(unsigned int cnt)
@@ -169,7 +170,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 #endif
        preempt_check_resched();
 }
-
+EXPORT_SYMBOL_GPL(__local_bh_enable_ip);
 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
  * but break the loop if need_resched() is set or after 2 ms.

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 18:21           ` Arjan van de Ven
@ 2013-11-20 19:38             ` Thomas Gleixner
  2013-11-20 22:08               ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2013-11-20 19:38 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, linux-kernel, linux-pm, Rafael J. Wysocki

On Wed, 20 Nov 2013, Arjan van de Ven wrote:
> 
> but for powerclamp to work, it needs to inject a deep idle....  I'm
> very ok using generic abstractions for that, but the abstraction
> needs to then include a "don't be nice about picking shallow C
> states for performance reasons, just pick something very deep"
> parameter.

And that's what you should have done in the first place. Make the
generic code take a parameter to indicate that. Or tell the scheduler
to throttle the machine and go deep idle. That would also be helpful
to others who might need some similar thing.

No, you went with the worst design:

    - Hack it into some random driver
    - Export random core interfaces so it's harder to change them
    - Let others deal with the fallout

I'm cleaning up that kind of mess for more than 9 years now and I'm
really disappointed that you went over to the "who cares, works for
me" camp.

I can vividly remember our discussions when we were cleaning up the
whole timer mess together in order to make NOHZ actually useful. Your
cursing about such code was definitely impressive back then.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-20 18:15     ` Peter Zijlstra
@ 2013-11-20 20:14       ` Eliezer Tamir
  0 siblings, 0 replies; 47+ messages in thread
From: Eliezer Tamir @ 2013-11-20 20:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Chris Leech, David Miller,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Eliezer Tamir

On 20/11/2013 20:15, Peter Zijlstra wrote:
> There are options (CONFIG_PREEMPT_RCU) that allow scheduling while
> holding rcu_read_lock().
> 
> Also, preempt_enable() only schedules when it's possible to schedule, so
> calling it when you know you cannot schedule is no issue.
> 

I have no issue with you changing busy_loop_us_clock() to use a regular
preempt enable.

I think that we still need to do this only if preempt debugging
(CONFIG_DEBUG_PREEMPT) is on. When it's off we should use the alternate
implementation.
We are silencing a warning, but this is a performance critical path,
and we think we know what we are doing.
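
Concretely, something like this sketch (from memory, not the exact
merged code):

#ifdef CONFIG_DEBUG_PREEMPT
static inline u64 busy_loop_us_clock(void)
{
	u64 rc;

	/* only here to silence the debug warning; see the comments */
	preempt_disable_notrace();
	rc = sched_clock();
	preempt_enable_no_resched_notrace();

	return rc >> 10;
}
#else
static inline u64 busy_loop_us_clock(void)
{
	return sched_clock() >> 10;
}
#endif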

I tried to explain this in the comments. If you think my comments are
not clear enough, I'm open to suggestions.

Cheers,
Eliezer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 19:38             ` Thomas Gleixner
@ 2013-11-20 22:08               ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2013-11-20 22:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Arjan van de Ven, Peter Zijlstra, lenb, rjw, Eliezer Tamir,
	Chris Leech, David Miller, rui.zhang, Mike Galbraith,
	Ingo Molnar, hpa, linux-kernel, linux-pm, Rafael J. Wysocki

On Wed, 20 Nov 2013 20:38:03 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 20 Nov 2013, Arjan van de Ven wrote:
> > 
> > but for powerclamp to work, it needs to inject a deep idle....  I'm
> > very ok using generic abstractions for that, but the abstraction
> > needs to then include a "don't be nice about picking shallow C
> > states for performance reasons, just pick something very deep"
> > parameter.
> 
> And that's what you should have done in the first place. Make the
> generic code take a parameter to indicate that. Or tell the scheduler
> to throttle the machine and go deep idle. That would also be helpful
> to others who might need some similar thing.
> 
I thought about that. But the generic code is in a performance-critical
path and idle injection is a rare case. So the question is: do we want
to sacrifice the 99% for the 1% usage?

> No, you went with the worst design:
> 
>     - Hack it into some random driver
>     - Export random core interfaces so it's harder to change them
>     - Let others deal with the fallout
> 
> I'm cleaning up that kind of mess for more than 9 years now and I'm
> really disappointed that you went over to the "who cares, works for
> me" camp.
> 
> I can vividly remember our discussions when we were cleaning up the
> whole timer mess together in order to make NOHZ actually useful. Your
> cursing about such code was definitely impressive back then.
> 
> Thanks,
> 
> 	tglx
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-20 16:04 ` [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations Peter Zijlstra
  2013-11-20 16:40   ` Arjan van de Ven
@ 2013-11-21  0:54   ` Jacob Pan
  2013-11-21  8:21     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2013-11-21  0:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rafael J. Wysocki,
	Paul E. McKenney

On Wed, 20 Nov 2013 17:04:53 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> People are starting to grow their own idle implementations in various
> disgusting ways. Collapse the lot and use the generic idle code to
> provide a proper idle cycle implementation.
> 
+Paul

RCU and others that rely on is_idle_task() might be broken with the
consolidated idle code, since the caller of do_idle() may have pid != 0.

Should we use TS_POLL or introduce a new flag to identify the idle task?
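
For context, the check in question is currently just a pid test
(quoting from memory):

static inline bool is_idle_task(const struct task_struct *p)
{
	return p->pid == 0;
}

and the injection threads obviously fail that test.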

The reason the idle injection code does not inform RCU is that we have a
known short period of idle time which does not impact the RCU grace
period.

On the other side, I see the idle injection code working with this
patchset with a workaround in is_idle_task() that checks the TS_POLL
flag. But the efficiency is down by ~30%, i.e.

before: inject 25% time to get 23-24% package idle
after: inject 25% time to get 16-17% package idle

Still looking into improvement.

Jacob



> This does not fully preserve existing behaviour in that the generic
> idle cycle function calls into the normal cpuidle governed idle
> routines and should thus respect things like QoS parameters and the
> like.
> 
> If people want to over-ride the idle state they should talk to the
> cpuidle folks about extending the interface and attempt to preserve
> QoS guarantees, instead of jumping straight to the deepest possible C
> state.
> 
> Compile tested only -- I've no idea how to actually use these vile
> things.
> 
> Cc: hpa@zytor.com
> Cc: arjan@linux.intel.com
> Cc: rui.zhang@intel.com
> Cc: jacob.jun.pan@linux.intel.com
> Cc: lenb@kernel.org
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  drivers/acpi/acpi_pad.c            |   41 ------------
>  drivers/thermal/intel_powerclamp.c |   38 -----------
>  include/linux/cpu.h                |    2 
>  kernel/cpu/idle.c                  |  123 ++++++++++++++++++++++---------------
>  kernel/time/tick-sched.c           |    2 
>  5 files changed, 82 insertions(+), 124 deletions(-)
> 
> --- a/drivers/acpi/acpi_pad.c
> +++ b/drivers/acpi/acpi_pad.c
> @@ -41,9 +41,7 @@ static DEFINE_MUTEX(round_robin_lock);
>  static unsigned long power_saving_mwait_eax;
>  
>  static unsigned char tsc_detected_unstable;
> -static unsigned char tsc_marked_unstable;
>  static unsigned char lapic_detected_unstable;
> -static unsigned char lapic_marked_unstable;
>  
>  static void power_saving_mwait_init(void)
>  {
> @@ -153,10 +151,9 @@ static int power_saving_thread(void *dat
>  	unsigned int tsk_index = (unsigned long)data;
>  	u64 last_jiffies = 0;
>  
> -	sched_setscheduler(current, SCHED_RR, &param);
> +	sched_setscheduler(current, SCHED_FIFO, &param);
>  
>  	while (!kthread_should_stop()) {
> -		int cpu;
>  		u64 expire_time;
>  
>  		try_to_freeze();
> @@ -171,41 +168,7 @@ static int power_saving_thread(void *dat
>  
>  		expire_time = jiffies + HZ * (100 - idle_pct) / 100;
>  
> -		while (!need_resched()) {
> -			if (tsc_detected_unstable && !tsc_marked_unstable) {
> -				/* TSC could halt in idle, so notify users */
> -				mark_tsc_unstable("TSC halts in idle");
> -				tsc_marked_unstable = 1;
> -			}
> -			if (lapic_detected_unstable && !lapic_marked_unstable) {
> -				int i;
> -				/* LAPIC could halt in idle, so notify users */
> -				for_each_online_cpu(i)
> -					clockevents_notify(
> -						CLOCK_EVT_NOTIFY_BROADCAST_ON,
> -						&i);
> -				lapic_marked_unstable = 1;
> -			}
> -			local_irq_disable();
> -			cpu = smp_processor_id();
> -			if (lapic_marked_unstable)
> -				clockevents_notify(
> -					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
> -			stop_critical_timings();
> -
> -			mwait_idle_with_hints(power_saving_mwait_eax, 1);
> -
> -			start_critical_timings();
> -			if (lapic_marked_unstable)
> -				clockevents_notify(
> -					CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
> -			local_irq_enable();
> -
> -			if (jiffies > expire_time) {
> -				do_sleep = 1;
> -				break;
> -			}
> -		}
> +		play_idle(expire_time);
>  
>  		/*
>  		 * current sched_rt has threshold for rt task running time.
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -247,11 +247,6 @@ static u64 pkg_state_counter(void)
>  	return count;
>  }
>  
> -static void noop_timer(unsigned long foo)
> -{
> -	/* empty... just the fact that we get the interrupt wakes us
> up */ -}
> -
>  static unsigned int get_compensation(int ratio)
>  {
>  	unsigned int comp = 0;
> @@ -356,7 +351,6 @@ static bool powerclamp_adjust_controls(u
>  static int clamp_thread(void *arg)
>  {
>  	int cpunr = (unsigned long)arg;
> -	DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
>  	static const struct sched_param param = {
>  		.sched_priority = MAX_USER_RT_PRIO/2,
>  	};
> @@ -365,11 +359,9 @@ static int clamp_thread(void *arg)
>  
>  	set_bit(cpunr, cpu_clamping_mask);
>  	set_freezable();
> -	init_timer_on_stack(&wakeup_timer);
>  	sched_setscheduler(current, SCHED_FIFO, &param);
>  
> -	while (true == clamping && !kthread_should_stop() &&
> -		cpu_online(cpunr)) {
> +	while (clamping && !kthread_should_stop() && cpu_online(cpunr)) {
>  		int sleeptime;
>  		unsigned long target_jiffies;
>  		unsigned int guard;
> @@ -417,35 +409,11 @@ static int clamp_thread(void *arg)
>  		if (should_skip)
>  			continue;
>  
> -		target_jiffies = jiffies + duration_jiffies;
> -		mod_timer(&wakeup_timer, target_jiffies);
>  		if (unlikely(local_softirq_pending()))
>  			continue;
> -		/*
> -		 * stop tick sched during idle time, interrupts are still
> -		 * allowed. thus jiffies are updated properly.
> -		 */
> -		preempt_disable();
> -		tick_nohz_idle_enter();
> -		/* mwait until target jiffies is reached */
> -		while (time_before(jiffies, target_jiffies)) {
> -			unsigned long ecx = 1;
> -			unsigned long eax = target_mwait;
> -
> -			/*
> -			 * REVISIT: may call enter_idle() to notify drivers who
> -			 * can save power during cpu idle. same for exit_idle()
> -			 */
> -			local_touch_nmi();
> -			stop_critical_timings();
> -			mwait_idle_with_hints(eax, ecx);
> -			start_critical_timings();
> -			atomic_inc(&idle_wakeup_counter);
> -		}
> -		tick_nohz_idle_exit();
> -		preempt_enable_no_resched();
> +
> +		play_idle(duration_jiffies);
>  	}
> -	del_timer_sync(&wakeup_timer);
>  	clear_bit(cpunr, cpu_clamping_mask);
>  
>  	return 0;
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -215,6 +215,8 @@ enum cpuhp_state {
>  	CPUHP_ONLINE,
>  };
>  
> +void play_idle(unsigned long jiffies);
> +
>  void cpu_startup_entry(enum cpuhp_state state);
>  void cpu_idle(void);
>  
> --- a/kernel/cpu/idle.c
> +++ b/kernel/cpu/idle.c
> @@ -63,62 +63,88 @@ void __weak arch_cpu_idle(void)
>  }
>  
>  /*
> - * Generic idle loop implementation
> + * Generic idle cycle.
>   */
> -static void cpu_idle_loop(void)
> +static void do_idle(void)
>  {
> -	while (1) {
> -		tick_nohz_idle_enter();
> +	tick_nohz_idle_enter();
>  
> -		while (!need_resched()) {
> -			check_pgt_cache();
> -			rmb();
> -
> -			if (cpu_is_offline(smp_processor_id()))
> -				arch_cpu_idle_dead();
> -
> -			local_irq_disable();
> -			arch_cpu_idle_enter();
> -
> -			/*
> -			 * In poll mode we reenable interrupts and spin.
> -			 *
> -			 * Also if we detected in the wakeup from idle
> -			 * path that the tick broadcast device expired
> -			 * for us, we don't want to go deep idle as we
> -			 * know that the IPI is going to arrive right
> -			 * away
> -			 */
> -			if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> -				cpu_idle_poll();
> -			} else {
> -				if (!current_clr_polling_and_test()) {
> -					stop_critical_timings();
> -					rcu_idle_enter();
> -					arch_cpu_idle();
> -					WARN_ON_ONCE(irqs_disabled());
> -					rcu_idle_exit();
> -					start_critical_timings();
> -				} else {
> -					local_irq_enable();
> -				}
> -				__current_set_polling();
> -			}
> -			arch_cpu_idle_exit();
> -		}
> +	while (!need_resched()) {
> +		check_pgt_cache();
> +		rmb();
> +
> +		if (cpu_is_offline(smp_processor_id()))
> +			arch_cpu_idle_dead();
> +
> +		local_irq_disable();
> +		arch_cpu_idle_enter();
>  
>  		/*
> -		 * We need to test and propagate the TIF_NEED_RESCHED bit here
> -		 * because we might not have send the reschedule IPI to idle
> -		 * tasks.
> +		 * In poll mode we reenable interrupts and spin.
> +		 *
> +		 * Also if we detected in the wakeup from idle path that the
> +		 * tick broadcast device expired for us, we don't want to go
> +		 * deep idle as we know that the IPI is going to arrive right
> +		 * away
>  		 */
> -		preempt_fold_need_resched();
> -		tick_nohz_idle_exit();
> -		schedule_preempt_disabled();
> +		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> +			cpu_idle_poll();
> +		} else {
> +			if (!current_clr_polling_and_test()) {
> +				stop_critical_timings();
> +				rcu_idle_enter();
> +				arch_cpu_idle();
> +				WARN_ON_ONCE(irqs_disabled());
> +				rcu_idle_exit();
> +				start_critical_timings();
> +			} else {
> +				local_irq_enable();
> +			}
> +			__current_set_polling();
> +		}
> +		arch_cpu_idle_exit();
>  	}
> +
> +	/*
> +	 * We need to test and propagate the TIF_NEED_RESCHED bit here
> +	 * because we might not have send the reschedule IPI to idle
> +	 * tasks.
> +	 */
> +	preempt_fold_need_resched();
> +	tick_nohz_idle_exit();
> +	schedule_preempt_disabled();
> +}
> +
> +static void play_idle_timer(unsigned long foo)
> +{
> +	set_tsk_need_resched(current);
> +}
> +
> +void play_idle(unsigned long duration)
> +{
> +	DEFINE_TIMER(wakeup_timer, play_idle_timer, 0, 0);
> +
> +	/*
> +	 * Only FIFO tasks can disable the tick since they don't
> need the forced
> +	 * preemption.
> +	 */
> +	WARN_ON_ONCE(current->policy != SCHED_FIFO);
> +	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
> +	WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
> +	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
> +
> +	init_timer_on_stack(&wakeup_timer);
> +	mod_timer_pinned(&wakeup_timer, jiffies + duration);
> +
> +	preempt_disable();
> +	do_idle();
> +	del_timer_sync(&wakeup_timer);
> +	preempt_fold_need_resched();
> +	preempt_enable();
>  }
> +EXPORT_SYMBOL_GPL(play_idle);
>  
> -void cpu_startup_entry(enum cpuhp_state state)
> +__noreturn void cpu_startup_entry(enum cpuhp_state state)
>  {
>  	/*
>  	 * This #ifdef needs to die, but it's too late in the cycle
> to @@ -137,5 +163,6 @@ void cpu_startup_entry(enum cpuhp_state
>  #endif
>  	__current_set_polling();
>  	arch_cpu_idle_prepare();
> -	cpu_idle_loop();
> +	while (1)
> +		do_idle();
>  }
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -804,7 +804,6 @@ void tick_nohz_idle_enter(void)
>  
>  	local_irq_enable();
>  }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
>  
>  /**
>   * tick_nohz_irq_exit - update next tick event from interrupt exit
> @@ -932,7 +931,6 @@ void tick_nohz_idle_exit(void)
>  
>  	local_irq_enable();
>  }
> -EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
>  
>  static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
>  {
> 
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21  0:54   ` Jacob Pan
@ 2013-11-21  8:21     ` Peter Zijlstra
  2013-11-21 16:07       ` Paul E. McKenney
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-21  8:21 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rafael J. Wysocki,
	Paul E. McKenney

On Wed, Nov 20, 2013 at 04:54:06PM -0800, Jacob Pan wrote:
> On Wed, 20 Nov 2013 17:04:53 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > People are starting to grow their own idle implementations in various
> > disgusting ways. Collapse the lot and use the generic idle code to
> > provide a proper idle cycle implementation.
> > 
> +Paul
> 
> RCU and others rely on is_idle_task() might be broken with the
> consolidated idle code since caller of do_idle may have pid != 0.
> 
> Should we use TS_POLL or introduce a new flag to identify idle task?

PF_IDLE would be my preference; I checked, and we seem to have a grand
total of 2 unused task_struct::flags bits left ;-)
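
A hypothetical sketch (the bit value is made up; it would have to be
one of the two actually-free bits):

#define PF_IDLE		0x00000002	/* I am an IDLE thread, possibly injected */

static inline bool is_idle_task(const struct task_struct *p)
{
	return !!(p->flags & PF_IDLE);
}

The real idle tasks would set it at boot; play_idle() would set and
clear it around the injected window.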

> The reason why idle injection code does not inform RCU is that we have
> known short period of idle time which does not impact RCU grace period.
> 
> On the other side, I see idle injection code is working with this
> patchset with workaround in s_idle_task() by checking TS_POLL flag.
> But the efficiency is down by ~30%. i.e.
> 
> before: inject 25% time to get 23-24% package idle
> after: inject 25% time to get 16-17% package idle
> 
> Still looking into improvement.

So the quick hack is to make acpi_idle/intel_idle use the highest
possible C-state when pid!=0 && PF_IDLE.
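
Roughly this, in the state selection path (a hypothetical sketch, names
approximate, not a real patch):

static int idle_select_state(struct cpuidle_driver *drv,
			     struct cpuidle_device *dev)
{
	/* injected idle: PF_IDLE set but not the real (pid 0) idle task */
	if ((current->flags & PF_IDLE) && current->pid != 0)
		return drv->state_count - 1;	/* deepest state, ignores QoS */

	return cpuidle_curr_governor->select(drv, dev);
}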

Ideally though I'd see some of the QoS ramifications explored. Because
forcing the CPU into the highest C-state basically invalidates the
entire QoS stack.

So either make QoS and this idle injection stuff mutually exclusive in a
very explicit way -- disable the QoS interface when you enable one of
these idle injectors AND fail to engage the idle injectors when an
incompatible QoS setting is pre-existing.

Or come up with something smarter.

You also have to explore the case of higher-priority tasks messing with
the proper operation of your injectors; this one is harder to deal with.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/7] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding
  2013-11-20 16:04 ` [PATCH 2/7] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
@ 2013-11-21  8:25   ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-21  8:25 UTC (permalink / raw)
  To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner
  Cc: linux-kernel, linux-pm

On Wed, Nov 20, 2013 at 05:04:52PM +0100, Peter Zijlstra wrote:
> --- a/kernel/cpu/idle.c
> +++ b/kernel/cpu/idle.c
> @@ -105,14 +105,14 @@ static void cpu_idle_loop(void)
>  				__current_set_polling();
>  			}
>  			arch_cpu_idle_exit();
> -			/*
> -			 * We need to test and propagate the TIF_NEED_RESCHED
> -			 * bit here because we might not have send the
> -			 * reschedule IPI to idle tasks.
> -			 */
> -			if (tif_need_resched())
> -				set_preempt_need_resched();
>  		}


		/*
		 * Ensure our read doesn't slip before the read that
		 * terminates the loop above; in that situation we might
		 * actually fail to observe the TIF_NEED_RESCHED we
		 * _know_ must be set here.
		 */

		smp_rmb(); /* pairs with resched_task() */

> +
> +		/*
> +		 * We need to test and propagate the TIF_NEED_RESCHED bit here
> +		 * because we might not have send the reschedule IPI to idle
> +		 * tasks.
> +		 */
> +		preempt_fold_need_resched();
>  		tick_nohz_idle_exit();
>  		schedule_preempt_disabled();
>  	}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-20 18:02   ` Eliezer Tamir
  2013-11-20 18:15     ` Peter Zijlstra
@ 2013-11-21 10:10     ` Peter Zijlstra
  2013-11-21 13:26       ` Eliezer Tamir
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-21 10:10 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Arjan van de Ven, lenb, rjw, Chris Leech, David Miller,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> IMHO This has been reviewed thoroughly.
> 
> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> so even if you do get switched over to a CPU where the time is random
> you will at most poll another full interval.
> 
> Linus asked me to remove this since it makes us use two time values
> instead of one. see https://lkml.org/lkml/2013/7/8/345.

I'm not sure I see how this would be true.

So the do_select() code basically does:

  for (;;) {

    /* actual poll loop */

    if (!need_resched()) {
      if (!busy_end) {
	busy_end = now() + busypoll;
	continue;
      }
      if (!((long)(busy_end - now()) < 0))
	continue;
    }

    /* go sleep */

  }

So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and we start by:

  busy_end = now() + busypoll; /* CPU0: 60e9 + d */

but then we migrate to CPU1 and do:

  busy_end - now() /* CPU1: 60e9 + d' */

and find we're still a minute out; and in fact we'll keep spinning for
that entire minute barring a need_resched().

Surely that's not intended and desired?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-21 10:10     ` Peter Zijlstra
@ 2013-11-21 13:26       ` Eliezer Tamir
  2013-11-21 13:39         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Eliezer Tamir @ 2013-11-21 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Chris Leech, David Miller,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On 21/11/2013 12:10, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
>> IMHO This has been reviewed thoroughly.
>>
>> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
>> so even if you do get switched over to a CPU where the time is random
>> you will at most poll another full interval.
>>
>> Linus asked me to remove this since it makes us use two time values
>> instead of one. see https://lkml.org/lkml/2013/7/8/345.
> 
> I'm not sure I see how this would be true.
> 
> So the do_select() code basically does:
> 
>   for (;;) {
> 
>     /* actual poll loop */
> 
>     if (!need_resched()) {
>       if (!busy_end) {
> 	busy_end = now() + busypoll;
> 	continue;
>       }
>       if (!((long)(busy_end - now()) < 0))
> 	continue;
>     }
> 
>     /* go sleep */
> 
>   }
> 
> So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and we start by:
> 
>   busy_end = now() + busypoll; /* CPU0: 60e9 + d */
> 
> but then we migrate to CPU1 and do:
> 
>   busy_end - now() /* CPU1: 60e9 + d' */
> 
> and find we're still a minute out; and in fact we'll keep spinning for
> that entire minute barring a need_resched().

Not exactly; poll will return if there are any events to report or if
a signal is pending.

> Surely that's not intended and desired?

This limit is an extra safety net: because busy polling is expensive,
we limit the time we are willing to do it.

We don't override any limit the user has put on the system call.
A signal or having events to report will also stop the looping.
So we are mostly capping the resources an _idle_ system will waste
on busy polling.

We want to globally cap the amount of time the system busy polls, on
average. Nothing catastrophic will happen on the extremely rare occasion
that we miss.

The alternative is to use one more int on every poll/select all the
time, this seems like a bigger cost.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-21 13:26       ` Eliezer Tamir
@ 2013-11-21 13:39         ` Peter Zijlstra
  2013-11-22  6:56           ` Eliezer Tamir
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-21 13:39 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Arjan van de Ven, lenb, rjw, Chris Leech, David Miller,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> On 21/11/2013 12:10, Peter Zijlstra wrote:
> > On Wed, Nov 20, 2013 at 08:02:54PM +0200, Eliezer Tamir wrote:
> >> IMHO This has been reviewed thoroughly.
> >>
> >> When Ben Hutchings voiced concerns I rewrote the code to use time_after,
> >> so even if you do get switched over to a CPU where the time is random
> >> you will at most poll another full interval.
> >>
> >> Linus asked me to remove this since it makes us use two time values
> >> instead of one. see https://lkml.org/lkml/2013/7/8/345.
> > 
> > I'm not sure I see how this would be true.
> > 
> > So the do_select() code basically does:
> > 
> >   for (;;) {
> > 
> >     /* actual poll loop */
> > 
> >     if (!need_resched()) {
> >       if (!busy_end) {
> > 	busy_end = now() + busypoll;
> > 	continue;
> >       }
> >       if (!((long)(busy_end - now()) < 0))
> > 	continue;
> >     }
> > 
> >     /* go sleep */
> > 
> >   }
> > 
> > So imagine our CPU0 timebase is 1 minute ahead of CPU1 (60e9 vs 0), and we start by:
> > 
> >   busy_end = now() + busypoll; /* CPU0: 60e9 + d */
> > 
> > but then we migrate to CPU1 and do:
> > 
> >   busy_end - now() /* CPU1: 60e9 + d' */
> > 
> > and find we're still a minute out; and in fact we'll keep spinning for
> > that entire minute barring a need_resched().
> 
> Not exactly; poll will return if there are any events to report or if
> a signal is pending.

Sure, but lacking any of those, you're now busy waiting for a minute.

> > Surely that's not intended and desired?
> 
> This limit is an extra safety net: because busy polling is expensive,
> we limit the time we are willing to do it.

I just said your limit 'sysctl_net_busy_poll' isn't meaningful in any
way, shape or form.

> We don't override any limit the user has put on the system call.

You are, in fact: note how the normal select @endtime argument is only
set up _after_ you're done polling. So if the syscall had a timeout of 5
seconds, you just blew it by 55.

> A signal or having events to report will also stop the looping.
> So we are mostly capping the resources an _idle_ system will waste
> on busy polling.

Repeat, you're not actually capping anything.

> We want to globally cap the amount of time the system busy polls, on
> average. Nothing catastrophic will happen in the extremely rare occasion
> that we miss.
> 
> The alternative is to use one more int on every poll/select all the
> time, this seems like a bigger cost.

No, the 'int' has nothing to do with it; using a semi-sane timesource does.
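
E.g. something along these lines (a sketch, not part of this series):

/* local_clock() bounds cross-CPU drift, unlike a raw sched_clock(),
 * so a stale busy_end can overshoot by at most that bounded drift */
static inline u64 busy_loop_us_clock(void)
{
	return local_clock() >> 10;	/* ns -> ~us */
}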

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21  8:21     ` Peter Zijlstra
@ 2013-11-21 16:07       ` Paul E. McKenney
  2013-11-21 16:21         ` Arjan van de Ven
  2013-11-21 16:29         ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Paul E. McKenney @ 2013-11-21 16:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jacob Pan, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	Chris Leech, David Miller, rui.zhang, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, linux-kernel, linux-pm,
	Rafael J. Wysocki

On Thu, Nov 21, 2013 at 09:21:51AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 04:54:06PM -0800, Jacob Pan wrote:
> > On Wed, 20 Nov 2013 17:04:53 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > People are starting to grow their own idle implementations in various
> > > disgusting ways. Collapse the lot and use the generic idle code to
> > > provide a proper idle cycle implementation.
> > > 
> > +Paul
> > 
> > RCU and others rely on is_idle_task() might be broken with the
> > consolidated idle code since caller of do_idle may have pid != 0.
> > 
> > Should we use TS_POLL or introduce a new flag to identify idle task?
> 
> PF_IDLE would be my preference, I checked and we seem to have a grand
> total of 2 unused task_struct::flags left ;-)

As long as RCU has some reliable way to identify an idle task, I am
good.  But I have to ask -- why can't idle injection coordinate with
the existing idle tasks rather than temporarily making alternative
idle tasks?

							Thanx, Paul

> > The reason why idle injection code does not inform RCU is that we have
> > known short period of idle time which does not impact RCU grace period.
> > 
> > On the other side, I see idle injection code is working with this
> > patchset with workaround in s_idle_task() by checking TS_POLL flag.
> > But the efficiency is down by ~30%. i.e.
> > 
> > before: inject 25% time to get 23-24% package idle
> > after: inject 25% time to get 16-17% package idle
> > 
> > Still looking into improvement.
> 
> So the quick hack is to make acpi_idle/intel_idle use the highest
> possible C-state when pid!=0 && PF_IDLE.
> 
> Ideally though I'd see some of the QoS ramifications explored. Because
> forcing the CPU into the highest C-state basically invalidates the
> entire QoS stack.
> 
> So either make QoS and this idle injection stuff mutually exclusive in a
> very explicit way -- disable the QoS interface when you enable one of
> these idle injectors AND fail to engage the idle injectors when an
> incompatible QoS setting is pre-existing.
> 
> Or come up with something smarter.
> 
> You also have to explore the case of higher-priority tasks messing with
> the proper operation of your injectors; this one is harder to deal with.
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21 16:07       ` Paul E. McKenney
@ 2013-11-21 16:21         ` Arjan van de Ven
  2013-11-21 19:19           ` Paul E. McKenney
  2013-11-21 16:29         ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2013-11-21 16:21 UTC (permalink / raw)
  To: paulmck, Peter Zijlstra
  Cc: Jacob Pan, lenb, rjw, Eliezer Tamir, Chris Leech, David Miller,
	rui.zhang, Mike Galbraith, Ingo Molnar, hpa, Thomas Gleixner,
	linux-kernel, linux-pm, Rafael J. Wysocki

On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> As long as RCU has some reliable way to identify an idle task, I am
> good.  But I have to ask -- why can't idle injection coordinate with
> the existing idle tasks rather than temporarily making alternative
> idle tasks?

it's not a real idle. that's the whole problem of the situation.
to the rest of the OS, this is being BUSY (busy saving power using
a CPU instruction, but it might as well have been an mdelay() operation)
and it's also what end users expect; they want to be able to see
where there performance (read: cpu time in "top") is going.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21 16:07       ` Paul E. McKenney
  2013-11-21 16:21         ` Arjan van de Ven
@ 2013-11-21 16:29         ` Peter Zijlstra
  2013-11-21 17:27           ` Paul E. McKenney
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-21 16:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Jacob Pan, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	Chris Leech, David Miller, rui.zhang, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, linux-kernel, linux-pm,
	Rafael J. Wysocki

On Thu, Nov 21, 2013 at 08:07:16AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 21, 2013 at 09:21:51AM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 20, 2013 at 04:54:06PM -0800, Jacob Pan wrote:
> > > On Wed, 20 Nov 2013 17:04:53 +0100
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > People are starting to grow their own idle implementations in various
> > > > disgusting ways. Collapse the lot and use the generic idle code to
> > > > provide a proper idle cycle implementation.
> > > > 
> > > +Paul
> > > 
> > > RCU and others rely on is_idle_task() might be broken with the
> > > consolidated idle code since caller of do_idle may have pid != 0.
> > > 
> > > Should we use TS_POLL or introduce a new flag to identify idle task?
> > 
> > PF_IDLE would be my preference, I checked and we seem to have a grand
> > total of 2 unused task_struct::flags left ;-)
> 
> As long as RCU has some reliable way to identify an idle task, I am
> good.  But I have to ask -- why can't idle injection coordinate with
> the existing idle tasks rather than temporarily making alternative
> idle tasks?

Because that'd completely wreck how the scheduler selects tasks for just
these 2 arguably insane drivers.

We'd have to somehow teach it to pick the actual idle task instead of
this one task, but keep scheduling the rest of the tasks like normal --
we very much should keep higher priority tasks running like normal.

And we'd need a way to make it stop doing this 'proxy' execution.

That said, once we manage to replace the entire PI implementation with a
proper proxy execution scheme, the above would be possible by having a
resource (rt_mutex) associated with every idle task, and always held by
that task.

At that point we can do something like:

  rt_mutex_lock_timeout(cpu_idle_lock(cpu), jiffies);

And get the idle thread executing in our stead.

That said, idle is _special_ and I'd not be surprised we'd find a few
'funnies' along the way of trying to get that to actually work.

For now I'd rather not go there quite yet.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21 16:29         ` Peter Zijlstra
@ 2013-11-21 17:27           ` Paul E. McKenney
  0 siblings, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2013-11-21 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jacob Pan, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	Chris Leech, David Miller, rui.zhang, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, linux-kernel, linux-pm,
	Rafael J. Wysocki

On Thu, Nov 21, 2013 at 05:29:56PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 08:07:16AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 21, 2013 at 09:21:51AM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 20, 2013 at 04:54:06PM -0800, Jacob Pan wrote:
> > > > On Wed, 20 Nov 2013 17:04:53 +0100
> > > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > > 
> > > > > People are starting to grow their own idle implementations in various
> > > > > disgusting ways. Collapse the lot and use the generic idle code to
> > > > > provide a proper idle cycle implementation.
> > > > > 
> > > > +Paul
> > > > 
> > > > RCU and others rely on is_idle_task() might be broken with the
> > > > consolidated idle code since caller of do_idle may have pid != 0.
> > > > 
> > > > Should we use TS_POLL or introduce a new flag to identify idle task?
> > > 
> > > PF_IDLE would be my preference, I checked and we seem to have a grand
> > > total of 2 unused task_struct::flags left ;-)
> > 
> > As long as RCU has some reliable way to identify an idle task, I am
> > good.  But I have to ask -- why can't idle injection coordinate with
> > the existing idle tasks rather than temporarily making alternative
> > idle tasks?
> 
> Because that'd completely wreck how the scheduler selects tasks for just
> these 2 arguably insane drivers.
> 
> We'd have to somehow teach it to pick the actual idle task instead of
> this one task, but keep scheduling the rest of the tasks like normal --
> we very much should keep higher priority tasks running like normal.
> 
> And we'd need a way to make it stop doing this 'proxy' execution.
> 
> That said, once we manage to replace the entire PI implementation with a
> proper proxy execution scheme, the above would be possible by having a
> resource (rt_mutex) associated with every idle task, and always held by
> that task.
> 
> At that point we can do something like:
> 
>   rt_mutex_lock_timeout(cpu_idle_lock(cpu), jiffies);
> 
> And get the idle thread executing in our stead.
> 
> That said, idle is _special_ and I'd not be surprised if we found a few
> 'funnies' along the way of trying to get that to actually work.
> 
> For now I'd rather not go there quite yet.

Fair enough!

							Thanx, Paul
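
(A sketch of the PF_IDLE idea quoted above. Assumptions: the
0x00000002 value is one of the two free task_struct::flags bits Peter
mentions, is_idle_task() switches from the old pid == 0 test to the
flag, and injection_thread_stint()/do_idle() are illustrative names
for the consolidated idle entry of patch 3.)

  #define PF_IDLE		0x00000002	/* assumed free bit: I am an idle thread */

  static inline bool is_idle_task(const struct task_struct *p)
  {
  	return !!(p->flags & PF_IDLE);
  }

  /* Injection path, marking itself idle only for the stint: */
  static void injection_thread_stint(void)
  {
  	current->flags |= PF_IDLE;
  	do_idle();		/* consolidated idle loop, assumed */
  	current->flags &= ~PF_IDLE;
  }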


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21 16:21         ` Arjan van de Ven
@ 2013-11-21 19:19           ` Paul E. McKenney
  2013-11-21 19:45             ` Arjan van de Ven
  0 siblings, 1 reply; 47+ messages in thread
From: Paul E. McKenney @ 2013-11-21 19:19 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Jacob Pan, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rafael J. Wysocki

On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
> On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> >As long as RCU has some reliable way to identify an idle task, I am
> >good.  But I have to ask -- why can't idle injection coordinate with
> >the existing idle tasks rather than temporarily making alternative
> >idle tasks?
> 
> it's not a real idle. that's the whole problem of the situation.
> to the rest of the OS, this is being BUSY (busy saving power using
> a CPU instruction, but it might as well have been an mdelay() operation)
> and it's also what end users expect; they want to be able to see
> where their performance (read: cpu time in "top") is going.

My concern is keeping RCU's books straight.  Suppose that there is a need
to call for idle in the middle of a preemptible RCU read-side critical
section.  Now, if that call for idle involves a context switch, all is
well -- RCU will see the task as still being in its RCU read-side critical
section, which means that it is OK for RCU to see the CPU as idle.

However, if there is no context switch and RCU sees the CPU as idle,
preemptible RCU could prematurely end the grace period.  If there is no
context switch and RCU sees the CPU as non-idle for too long, we start
getting RCU CPU stall warning splats.

Another approach would be to only inject idle when the CPU is not
doing anything that could possibly be in an RCU read-side critical
section.  But things might get a bit hot in case of an overly
long RCU read-side critical section.

One approach that might work would be to hook into RCU's context-switch
code going in and coming out, then telling RCU that the CPU is idle,
even though top and friends see it as non-idle.  This last is in fact
similar to how RCU handles userspace execution for NO_HZ_FULL.

Or were you thinking of yet another possible approach for this?

							Thanx, Paul
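
(A sketch of the hook Paul describes above, assuming the injection
runs in its own kthread with the period computed elsewhere;
mwait_idle_with_hints() is the helper patch 1 consolidates, while
clamp_one_period() and its arguments are made up for illustration:)

  static void clamp_one_period(unsigned long eax, unsigned long ecx)
  {
  	local_irq_disable();
  	rcu_idle_enter();	/* RCU bookkeeping: this window is idle */
  	stop_critical_timings();
  	mwait_idle_with_hints(eax, ecx);	/* low-power wait; 'busy' in top */
  	start_critical_timings();
  	rcu_idle_exit();	/* back to non-idle for RCU */
  	local_irq_enable();
  }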


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21 19:19           ` Paul E. McKenney
@ 2013-11-21 19:45             ` Arjan van de Ven
  2013-11-21 20:07               ` Paul E. McKenney
  0 siblings, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2013-11-21 19:45 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Jacob Pan, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rafael J. Wysocki

On 11/21/2013 11:19 AM, Paul E. McKenney wrote:
> On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
>> On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
>>> As long as RCU has some reliable way to identify an idle task, I am
>>> good.  But I have to ask -- why can't idle injection coordinate with
>>> the existing idle tasks rather than temporarily making alternative
>>> idle tasks?
>>
>> it's not a real idle. that's the whole problem of the situation.
>> to the rest of the OS, this is being BUSY (busy saving power using
>> a CPU instruction, but it might as well have been an mdelay() operation)
>> and it's also what end users expect; they want to be able to see
>> where their performance (read: cpu time in "top") is going.
>
> My concern is keeping RCU's books straight.  Suppose that there is a need
> to call for idle in the middle of a preemptible RCU read-side critical
> section.  Now, if that call for idle involves a context switch, all is
> well -- RCU will see the task as still being in its RCU read-side critical
> section, which means that it is OK for RCU to see the CPU as idle.
>
> However, if there is no context switch and RCU sees the CPU as idle,
> preemptible RCU could prematurely end the grace period.  If there is no
> context switch and RCU sees the CPU as non-idle for too long, we start
> getting RCU CPU stall warning splats.
>
> Another approach would be to only inject idle when the CPU is not
> doing anything that could possibly be in an RCU read-side critical
> section.  But things might get a bit hot in case of an overly
> long RCU read-side critical section.
>
> One approach that might work would be to hook into RCU's context-switch
> code going in and coming out, then telling RCU that the CPU is idle,
> even though top and friends see it as non-idle.  This last is in fact
> similar to how RCU handles userspace execution for NO_HZ_FULL.
>

so powerclamp and such are not "idle".
They are "busy" from everything except the lowest level of the CPU hardware.
once you start thinking of them as idle, all hell breaks loose in terms of implications
(including sysadmin visibility etc).... (hence some of the explosions in this thread
as well).

but it's not "idle".

it's "put the cpu in a low power state for a specified amount of time". sure it uses the same
instruction to do so that the idle loop uses.

(now to make it messy, the current driver does a bunch of things similar to the idle loop
which is a mess and fair to be complained about)



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21 19:45             ` Arjan van de Ven
@ 2013-11-21 20:07               ` Paul E. McKenney
  2013-11-22  0:10                 ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Paul E. McKenney @ 2013-11-21 20:07 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Jacob Pan, lenb, rjw, Eliezer Tamir, Chris Leech,
	David Miller, rui.zhang, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm, Rafael J. Wysocki

On Thu, Nov 21, 2013 at 11:45:20AM -0800, Arjan van de Ven wrote:
> On 11/21/2013 11:19 AM, Paul E. McKenney wrote:
> >On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
> >>On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> >>>As long as RCU has some reliable way to identify an idle task, I am
> >>>good.  But I have to ask -- why can't idle injection coordinate with
> >>>the existing idle tasks rather than temporarily making alternative
> >>>idle tasks?
> >>
> >>it's not a real idle. that's the whole problem of the situation.
> >>to the rest of the OS, this is being BUSY (busy saving power using
> >>a CPU instruction, but it might as well have been an mdelay() operation)
> >>and it's also what end users expect; they want to be able to see
> >>where their performance (read: cpu time in "top") is going.
> >
> >My concern is keeping RCU's books straight.  Suppose that there is a need
> >to call for idle in the middle of a preemptible RCU read-side critical
> >section.  Now, if that call for idle involves a context switch, all is
> >well -- RCU will see the task as still being in its RCU read-side critical
> >section, which means that it is OK for RCU to see the CPU as idle.
> >
> >However, if there is no context switch and RCU sees the CPU as idle,
> >preemptible RCU could prematurely end the grace period.  If there is no
> >context switch and RCU sees the CPU as non-idle for too long, we start
> >getting RCU CPU stall warning splats.
> >
> >Another approach would be to only inject idle when the CPU is not
> >doing anything that could possibly be in an RCU read-side critical
> >section.  But things might get a bit hot in case of an overly
> >long RCU read-side critical section.
> >
> >One approach that might work would be to hook into RCU's context-switch
> >code going in and coming out, then telling RCU that the CPU is idle,
> >even though top and friends see it as non-idle.  This last is in fact
> >similar to how RCU handles userspace execution for NO_HZ_FULL.
> >
> 
> so powerclamp and such are not "idle".
> They are "busy" from everything except the lowest level of the CPU hardware.
> once you start thinking of them as idle, all hell breaks loose in terms of implications
> (including sysadmin visibility etc).... (hence some of the explosions in this thread
> as well).
> 
> but it's not "idle".
> 
> it's "put the cpu in a low power state for a specified amount of time". sure it uses the same
> instruction to do so that the idle loop uses.
> 
> (now to make it messy, the current driver does a bunch of things similar to the idle loop
> which is a mess and fair to be complained about)

Then from an RCU viewpoint, they need to be short in duration.  Otherwise
you risk getting CPU stall-warning explosions from RCU.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-21 20:07               ` Paul E. McKenney
@ 2013-11-22  0:10                 ` Jacob Pan
  2013-11-22  4:20                   ` Paul E. McKenney
  0 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2013-11-22  0:10 UTC (permalink / raw)
  To: paulmck
  Cc: Arjan van de Ven, Peter Zijlstra, lenb, rjw, Eliezer Tamir,
	Chris Leech, David Miller, rui.zhang, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, linux-kernel, linux-pm,
	Rafael J. Wysocki

On Thu, 21 Nov 2013 12:07:17 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Thu, Nov 21, 2013 at 11:45:20AM -0800, Arjan van de Ven wrote:
> > On 11/21/2013 11:19 AM, Paul E. McKenney wrote:
> > >On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
> > >>On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> > >>>As long as RCU has some reliable way to identify an idle task, I
> > >>>am good.  But I have to ask -- why can't idle injection
> > >>>coordinate with the existing idle tasks rather than temporarily
> > >>>making alternative idle tasks?
> > >>
> > >>it's not a real idle. that's the whole problem of the situation.
> > >>to the rest of the OS, this is being BUSY (busy saving power using
> > >>a CPU instruction, but it might as well have been an mdelay()
> > >>operation) and it's also what end users expect; they want to be
> > >>able to see where their performance (read: cpu time in "top") is
> > >>going.
> > >
> > >My concern is keeping RCU's books straight.  Suppose that there is
> > >a need to call for idle in the middle of a preemptible RCU
> > >read-side critical section.  Now, if that call for idle involves a
> > >context switch, all is well -- RCU will see the task as still
> > >being in its RCU read-side critical section, which means that it
> > >is OK for RCU to see the CPU as idle.
> > >
> > >However, if there is no context switch and RCU sees the CPU as
> > >idle, preemptible RCU could prematurely end the grace period.  If
> > >there is no context switch and RCU sees the CPU as non-idle for
> > >too long, we start getting RCU CPU stall warning splats.
> > >
> > >Another approach would be to only inject idle when the CPU is not
> > >doing anything that could possibly be in an RCU read-side critical
> > >section.  But things might get a bit hot in case of an overly
> > >long RCU read-side critical section.
> > >
> > >One approach that might work would be to hook into RCU's
> > >context-switch code going in and coming out, then telling RCU that
> > >the CPU is idle, even though top and friends see it as non-idle.
> > >This last is in fact similar to how RCU handles userspace
> > >execution for NO_HZ_FULL.
> > >
> > 
> > so powerclamp and such are not "idle".
> > They are "busy" from everything except the lowest level of the CPU
> > hardware. once you start thinking of them as idle, all hell breaks
> > loose in terms of implications (including sysadmin visibility
> > etc).... (hence some of the explosions in this thread as well).
> > 
> > but it's not "idle".
> > 
> > it's "put the cpu in a low power state for a specified amount of
> > time". sure it uses the same instruction to do so that the idle
> > loop uses.
> > 
> > (now to make it messy, the current driver does a bunch of things
> > similar to the idle loop which is a mess and fair to be complained
> > about)
> 
> Then from an RCU viewpoint, they need to be short in duration.
> Otherwise you risk getting CPU stall-warning explosions from RCU.  ;-)
> 
> 							Thanx, Paul
> 
Currently powerclamp allows idle injection durations between 6 and 25ms.
I guess that is short, considering the stall check is in seconds?
	return till_stall_check * HZ + RCU_STALL_DELAY_DELTA;
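
(That check is rcu_jiffies_till_stall_check(); sketched from memory of
the era's RCU code, so treat details as approximate. With the default
rcu_cpu_stall_timeout of 21 seconds this returns roughly 21 * HZ
jiffies, three orders of magnitude above a 25ms injection.)

  int rcu_jiffies_till_stall_check(void)
  {
  	int till_stall_check = ACCESS_ONCE(rcu_cpu_stall_timeout);

  	/* Clamp to the CONFIG_RCU_CPU_STALL_TIMEOUT Kconfig limits. */
  	if (till_stall_check < 3) {
  		ACCESS_ONCE(rcu_cpu_stall_timeout) = 3;
  		till_stall_check = 3;
  	} else if (till_stall_check > 300) {
  		ACCESS_ONCE(rcu_cpu_stall_timeout) = 300;
  		till_stall_check = 300;
  	}
  	return till_stall_check * HZ + RCU_STALL_DELAY_DELTA;
  }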

BTW, by forcing intel_idle to use the deepest C-states for the idle
injection thread, the efficiency problem is gone. I am surprised that
cpuidle would not pick the deepest C-states, given that the powerclamp
driver is asking for 6ms of idle time and the wakeup latencies are in
the usec range.
Anyway, from what I have tested so far, powerclamp with this patchset
works as well as the code before.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Jacob Pan]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-22  0:10                 ` Jacob Pan
@ 2013-11-22  4:20                   ` Paul E. McKenney
  2013-11-22 11:33                     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Paul E. McKenney @ 2013-11-22  4:20 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, Peter Zijlstra, lenb, rjw, Eliezer Tamir,
	Chris Leech, David Miller, rui.zhang, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, linux-kernel, linux-pm,
	Rafael J. Wysocki

On Thu, Nov 21, 2013 at 04:10:05PM -0800, Jacob Pan wrote:
> On Thu, 21 Nov 2013 12:07:17 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Nov 21, 2013 at 11:45:20AM -0800, Arjan van de Ven wrote:
> > > On 11/21/2013 11:19 AM, Paul E. McKenney wrote:
> > > >On Thu, Nov 21, 2013 at 08:21:03AM -0800, Arjan van de Ven wrote:
> > > >>On 11/21/2013 8:07 AM, Paul E. McKenney wrote:
> > > >>>As long as RCU has some reliable way to identify an idle task, I
> > > >>>am good.  But I have to ask -- why can't idle injection
> > > >>>coordinate with the existing idle tasks rather than temporarily
> > > >>>making alternative idle tasks?
> > > >>
> > > >>it's not a real idle. that's the whole problem of the situation.
> > > >>to the rest of the OS, this is being BUSY (busy saving power using
> > > >>a CPU instruction, but it might as well have been an mdelay()
> > > >>operation) and it's also what end users expect; they want to be
> > > >>able to see where their performance (read: cpu time in "top") is
> > > >>going.
> > > >
> > > >My concern is keeping RCU's books straight.  Suppose that there is
> > > >a need to call for idle in the middle of a preemptible RCU
> > > >read-side critical section.  Now, if that call for idle involves a
> > > >context switch, all is well -- RCU will see the task as still
> > > >being in its RCU read-side critical section, which means that it
> > > >is OK for RCU to see the CPU as idle.
> > > >
> > > >However, if there is no context switch and RCU sees the CPU as
> > > >idle, preemptible RCU could prematurely end the grace period.  If
> > > >there is no context switch and RCU sees the CPU as non-idle for
> > > >too long, we start getting RCU CPU stall warning splats.
> > > >
> > > >Another approach would be to only inject idle when the CPU is not
> > > >doing anything that could possibly be in an RCU read-side critical
> > > >section.  But things might get a bit hot in case of an overly
> > > >long RCU read-side critical section.
> > > >
> > > >One approach that might work would be to hook into RCU's
> > > >context-switch code going in and coming out, then telling RCU that
> > > >the CPU is idle, even though top and friends see it as non-idle.
> > > >This last is in fact similar to how RCU handles userspace
> > > >execution for NO_HZ_FULL.
> > > >
> > > 
> > > so powerclamp and such are not "idle".
> > > They are "busy" from everything except the lowest level of the CPU
> > > hardware. once you start thinking of them as idle, all hell breaks
> > > loose in terms of implications (including sysadmin visibility
> > > etc).... (hence some of the explosions in this thread as well).
> > > 
> > > but it's not "idle".
> > > 
> > > it's "put the cpu in a low power state for a specified amount of
> > > time". sure it uses the same instruction to do so that the idle
> > > loop uses.
> > > 
> > > (now to make it messy, the current driver does a bunch of things
> > > similar to the idle loop which is a mess and fair to be complained
> > > about)
> > 
> > Then from an RCU viewpoint, they need to be short in duration.
> > Otherwise you risk getting CPU stall-warning explosions from RCU.  ;-)
> > 
> > 							Thanx, Paul
> > 
> Currently powerclamp allows idle injection durations between 6 and 25ms.
> I guess that is short, considering the stall check is in seconds?
> 	return till_stall_check * HZ + RCU_STALL_DELAY_DELTA;

The 6ms to 25ms range should be just fine as far as normal RCU grace
periods are concerned.  However, it does mean that expedited grace
periods could be delayed: They normally take a few tens of microseconds,
but if they were unlucky enough to show up during an idle injection,
they would be magnified by two to three orders of magnitude, which is
not pretty.

Hence my suggestion of hooking into RCU on idle-injection start and end
so that RCU considers that time period to be idle.  This is just like
what it does for user-mode execution on NO_HZ_FULL kernels, so I still
don't see this approach as a problem.  I must confess that I still
don't understand what Arjan doesn't like about it.

							Thanx, Paul

> BTW, by forcing intel_idle to use the deepest C-states for the idle
> injection thread, the efficiency problem is gone. I am surprised that
> cpuidle would not pick the deepest C-states, given that the powerclamp
> driver is asking for 6ms of idle time and the wakeup latencies are in
> the usec range.
> Anyway, from what I have tested so far, powerclamp with this patchset
> works as well as the code before.
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> [Jacob Pan]
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-21 13:39         ` Peter Zijlstra
@ 2013-11-22  6:56           ` Eliezer Tamir
  2013-11-22 11:30             ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Eliezer Tamir @ 2013-11-22  6:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Chris Leech, David Miller,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On 21/11/2013 15:39, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:

>> We don't override any limit the user has put on the system call.
> 
> You are, in fact; note how the normal select @endtime argument is only
> set up _after_ you're done polling. So if the syscall had a timeout of 5
> seconds, you just blew it by 55.
> 

That's a bug, we will fix it.

Cheers,
Eliezer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-22  6:56           ` Eliezer Tamir
@ 2013-11-22 11:30             ` Peter Zijlstra
  2013-11-26  7:15               ` Eliezer Tamir
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-22 11:30 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Arjan van de Ven, lenb, rjw, Chris Leech, David Miller,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
> On 21/11/2013 15:39, Peter Zijlstra wrote:
> > On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> 
> >> We don't override any limit the user has put on the system call.
> > 
> > You are, in fact; note how the normal select @endtime argument is only
> > set up _after_ you're done polling. So if the syscall had a timeout of 5
> > seconds, you just blew it by 55.
> > 
> 
> That's a bug, we will fix it.

The entire thing is a bug, and if I could take away sched_clock() from
you I would, but sadly it's all inlines so I can't :-(

Please use local_clock(); yes, it's slightly more expensive, but I doubt
you can actually measure the effects on sane hardware.
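
(A sketch of that swap against the then-current include/net/busy_poll.h
helper, reconstructed from memory, so the exact context may differ:)

 static inline u64 busy_loop_us_clock(void)
 {
-	u64 rc;
-
-	preempt_disable_notrace();
-	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
-
-	return rc >> 10;
+	return local_clock() >> 10;	/* ns -> ~us, as before */
 }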

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-22  4:20                   ` Paul E. McKenney
@ 2013-11-22 11:33                     ` Peter Zijlstra
  2013-11-22 17:17                       ` Paul E. McKenney
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2013-11-22 11:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Jacob Pan, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	Chris Leech, David Miller, rui.zhang, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, linux-kernel, linux-pm,
	Rafael J. Wysocki

On Thu, Nov 21, 2013 at 08:20:36PM -0800, Paul E. McKenney wrote:
> The 6ms to 25ms range should be just fine as far as normal RCU grace
> periods are concerned.  However, it does mean that expedited grace
> periods could be delayed: They normally take a few tens of microseconds,
> but if they were unlucky enough to show up during an idle injection,
> they would be magnified by two to three orders of magnitude, which is
> not pretty.
> 
> Hence my suggestion of hooking into RCU on idle-injection start and end
> so that RCU considers that time period to be idle.  This is just like
> what it does for user-mode execution on NO_HZ_FULL kernels, so I still
> don't see this approach as a problem.  I must confess that I still
> don't understand what Arjan doesn't like about it.

Using these patches it would indeed use the RCU idle machinery as per
the normal idle path.

If you want I can add more WARN_ON()s in play_idle() to ensure we're not
called while holding any RCU locks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations
  2013-11-22 11:33                     ` Peter Zijlstra
@ 2013-11-22 17:17                       ` Paul E. McKenney
  0 siblings, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2013-11-22 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jacob Pan, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	Chris Leech, David Miller, rui.zhang, Mike Galbraith,
	Ingo Molnar, hpa, Thomas Gleixner, linux-kernel, linux-pm,
	Rafael J. Wysocki

On Fri, Nov 22, 2013 at 12:33:00PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 21, 2013 at 08:20:36PM -0800, Paul E. McKenney wrote:
> > The 6ms to 25ms range should be just fine as far as normal RCU grace
> > periods are concerned.  However, it does mean that expedited grace
> > periods could be delayed: They normally take a few tens of microseconds,
> > but if they were unlucky enough to show up during an idle injection,
> > they would be magnified by two to three orders of magnitude, which is
> > not pretty.
> > 
> > Hence my suggestion of hooking into RCU on idle-injection start and end
> > so that RCU considers that time period to be idle.  This is just like
> > what it does for user-mode execution on NO_HZ_FULL kernels, so I still
> > don't see this approach as a problem.  I must confess that I still
> > don't understand what Arjan doesn't like about it.
> 
> Using these patches it would indeed use the RCU idle machinery as per
> the normal idle path.

OK, sorry for my confusion!

> If you want I can add more WARN_ON()s in play_idle() to ensure we're not
> called while holding any RCU locks.

An rcu_sleep_check() or something similar, please!

							Thanx, Paul
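
(Concretely, the entry of play_idle() from patch 3 could grow checks
along these lines -- a sketch, with the surrounding function assumed
here rather than quoted:)

  	/* Only a dedicated, pinned kthread may inject idle ... */
  	WARN_ON_ONCE(current->nr_cpus_allowed != 1);
  	WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
  	/* ... and it must not hold any RCU read-side locks. */
  	rcu_sleep_check();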


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-22 11:30             ` Peter Zijlstra
@ 2013-11-26  7:15               ` Eliezer Tamir
  2013-11-26 10:51                 ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Eliezer Tamir @ 2013-11-26  7:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
	rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
	Thomas Gleixner, linux-kernel, linux-pm

On 22/11/2013 13:30, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
>> On 21/11/2013 15:39, Peter Zijlstra wrote:
>>> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> 
> Please use local_clock(); yes, it's slightly more expensive, but I doubt
> you can actually measure the effects on sane hardware.

If we limit the discussion to sane hardware, I should mention that on
current Intel CPUs TSC is guaranteed to be monotonic for anything up to
8 sockets. Even on slightly older HW, TSC skew is very small and should
not be an issue for this use case.

So:
Modern sane HW does not have this issue.
The people that do busy polling typically pin tasks to cores anyway.
You need cap_net_admin to use this setting.
There is no real damage if the issue happens.
This is a fast, low-latency path, so we are very sensitive to adding
even a small cost.
Linus really didn't like adding to the cost of poll/select when busy
polling is not being used.

Having said that, since we need to fix the timeout issue you pointed
out, we will test the use of local_clock() and see if it matters or
not.

Again, I have no objection to changing the use of
preempt_enable_no_resched() to a plain preempt_enable().
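
(The mechanical part of that swap, sketched without reference to any
particular call site:)

-	preempt_enable_no_resched();
+	preempt_enable();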

Cheers,
Eliezer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse
  2013-11-26  7:15               ` Eliezer Tamir
@ 2013-11-26 10:51                 ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2013-11-26 10:51 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: Peter Zijlstra, Arjan van de Ven, lenb, rjw, Eliezer Tamir,
	David Miller, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, hpa, linux-kernel, linux-pm

On Tue, 26 Nov 2013, Eliezer Tamir wrote:

> On 22/11/2013 13:30, Peter Zijlstra wrote:
> > On Fri, Nov 22, 2013 at 08:56:00AM +0200, Eliezer Tamir wrote:
> >> On 21/11/2013 15:39, Peter Zijlstra wrote:
> >>> On Thu, Nov 21, 2013 at 03:26:17PM +0200, Eliezer Tamir wrote:
> > 
> > Please use local_clock(); yes, it's slightly more expensive, but I doubt
> > you can actually measure the effects on sane hardware.
> 
> If we limit the discussion to sane hardware, I should mention that on
> current Intel CPUs TSC is guaranteed to be monotonic for anything up to
> 8 sockets. Even on slightly older HS TSC skew is very small and should
> not be an issue for this use case.

> Modern sane HW does not have this issue.

That's wrong to begin with. There is no such thing which qualifies as
"sane hardware". Especially not if we are talking about timers.

> The people that do busy polling typically pin tasks to cores anyway.

This is completely irrelevant. If stuff falls apart when the task is not
pinned, then you lose nevertheless.

> You need cap_net_admin to use this setting.

And how is that relevant? cap_net_admin does not change the fact, that
you violate your constraints.

> There is no real damage if the issue happens.

You're violating the constraints, which is not fatal, but not desired
either.

> This is a fast, low-latency path, so we are very sensitive to adding
> even a small cost.
> Linus really didn't like adding to the cost of poll/select when busy
> polling is not being used.
 
And that justifies exposing those who do not have access to "sane"
hardware and/or did not pin their tasks to constraint violation?

> Having said that, since we need to fix the timeout issue you pointed
> out, we will test the use of local_clock() and see if it matters or
> not.

If the hardware provides an indicator that the TSC is sane to use,
then sched_clock_stable is 1, so local_clock() will not do the slow
update dance at all. So for "sane" hardware the overhead is minimal
and on crappy hardware the correctness is still ensured with more
overhead.
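
(A condensed sketch, from memory, of the 3.12-era kernel/sched/clock.c
path Thomas describes; the remote-CPU handling is trimmed, so treat it
as approximate:)

  u64 local_clock(void)
  {
  	u64 clock;
  	unsigned long flags;

  	local_irq_save(flags);
  	clock = sched_clock_cpu(smp_processor_id());
  	local_irq_restore(flags);

  	return clock;
  }

  u64 sched_clock_cpu(int cpu)
  {
  	if (sched_clock_stable)
  		return sched_clock();	/* sane TSC: raw read, no update dance */

  	return sched_clock_local(cpu_sdc(cpu));	/* unstable: filtered/clamped */
  }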

If you are really concerned about the minimal overhead in the
sched_clock_stable == 1 case, then you better fix that (it's doable
with some brain) instead of hacking broken crap, based on even more
broken assumptions, into the networking code.

It's not the kernel's fault that we need to deal with
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK at all. And we have to deal with it
no matter what, so we cannot make it go away with magic assumptions.

Complain to those who forced us to do this. Hint: It's only ONE CPU
vendor who thought that providing useless timestamps is a brilliant
idea.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2013-11-26 10:52 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-20 16:04 [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
2013-11-20 16:04 ` [PATCH 1/7] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
2013-11-20 16:04 ` [PATCH 2/7] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
2013-11-21  8:25   ` Peter Zijlstra
2013-11-20 16:04 ` [PATCH 3/7] idle, thermal, acpi: Remove home grown idle implementations Peter Zijlstra
2013-11-20 16:40   ` Arjan van de Ven
2013-11-20 16:59     ` Peter Zijlstra
2013-11-20 17:23     ` Thomas Gleixner
2013-11-20 17:23       ` Arjan van de Ven
2013-11-20 17:55         ` Thomas Gleixner
2013-11-20 18:21           ` Arjan van de Ven
2013-11-20 19:38             ` Thomas Gleixner
2013-11-20 22:08               ` Jacob Pan
2013-11-21  0:54   ` Jacob Pan
2013-11-21  8:21     ` Peter Zijlstra
2013-11-21 16:07       ` Paul E. McKenney
2013-11-21 16:21         ` Arjan van de Ven
2013-11-21 19:19           ` Paul E. McKenney
2013-11-21 19:45             ` Arjan van de Ven
2013-11-21 20:07               ` Paul E. McKenney
2013-11-22  0:10                 ` Jacob Pan
2013-11-22  4:20                   ` Paul E. McKenney
2013-11-22 11:33                     ` Peter Zijlstra
2013-11-22 17:17                       ` Paul E. McKenney
2013-11-21 16:29         ` Peter Zijlstra
2013-11-21 17:27           ` Paul E. McKenney
2013-11-20 16:04 ` [PATCH 4/7] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
2013-11-20 16:04 ` [PATCH 5/7] locking: Optimize lock_bh functions Peter Zijlstra
2013-11-20 16:04 ` [PATCH 6/7] sched: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
2013-11-20 18:02   ` Eliezer Tamir
2013-11-20 18:15     ` Peter Zijlstra
2013-11-20 20:14       ` Eliezer Tamir
2013-11-21 10:10     ` Peter Zijlstra
2013-11-21 13:26       ` Eliezer Tamir
2013-11-21 13:39         ` Peter Zijlstra
2013-11-22  6:56           ` Eliezer Tamir
2013-11-22 11:30             ` Peter Zijlstra
2013-11-26  7:15               ` Eliezer Tamir
2013-11-26 10:51                 ` Thomas Gleixner
2013-11-20 16:04 ` [PATCH 7/7] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
2013-11-20 18:54   ` Jacob Pan
2013-11-20 19:00     ` Peter Zijlstra
2013-11-20 19:18     ` Peter Zijlstra
2013-11-20 19:29       ` Jacob Pan
2013-11-20 16:34 ` [PATCH 0/7] Cure some vaux idle wrackage Peter Zijlstra
2013-11-20 17:19   ` Jacob Pan
2013-11-20 17:24     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).