linux-kernel.vger.kernel.org archive mirror
* acpi_pad mwait usage
@ 2013-11-19  9:00 Peter Zijlstra
  2013-11-19  9:08 ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19  9:00 UTC (permalink / raw)
  To: lenb, rjw, linux-acpi, linux-kernel; +Cc: shaohua.li

Hi Len, Rafael,

I stumbled over acpi_pad (yuck! ;-), but noticed that you don't set the
task in polling mode while using mwait. This means we'll still happily
send an IPI to wake you up.

A little something like the below should do; you might even be able to
remove the smp_mb() but since it is completely undocumented (another
fail) I couldn't tell if the implied barrier in
current_set_polling_and_test() suffices to replace it, so I left it.
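
For reference, here is roughly what the wakeup side does in this era (a
simplified, from-memory sketch of resched_task(), not a verbatim copy).
With the polling flag set, the set_tsk_need_resched() store into the
MONITORed thread_info->flags is already enough to pop the target CPU out
of mwait, so the IPI can be skipped:

void resched_task(struct task_struct *p)
{
	if (test_tsk_need_resched(p))
		return;

	set_tsk_need_resched(p);	/* store to the MONITORed thread_info->flags */

	if (task_cpu(p) == smp_processor_id())
		return;

	/* NEED_RESCHED must be visible before we test polling. */
	smp_mb();
	if (!tsk_is_polling(p))		/* acpi_pad never sets polling ... */
		smp_send_reschedule(task_cpu(p));	/* ... so it always eats this IPI */
}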

If it compiles and works; change it to a proper SOB:

Maybe-Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/acpi/acpi_pad.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index fc6008fbce35..e9126efe7786 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -195,8 +195,10 @@ static int power_saving_thread(void *data)
 
 			__monitor((void *)&current_thread_info()->flags, 0, 0);
 			smp_mb();
-			if (!need_resched())
+			if (!current_set_polling_and_test()) {
 				__mwait(power_saving_mwait_eax, 1);
+				__current_clr_polling();
+			}
 
 			start_critical_timings();
 			if (lapic_marked_unstable)


* Re: acpi_pad mwait usage
  2013-11-19  9:00 acpi_pad mwait usage Peter Zijlstra
@ 2013-11-19  9:08 ` Peter Zijlstra
  2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19  9:08 UTC (permalink / raw)
  To: lenb, rjw, linux-acpi, linux-kernel
  Cc: shaohua.li, rui.zhang, arjan, jacob.jun.pan

On Tue, Nov 19, 2013 at 10:00:19AM +0100, Peter Zijlstra wrote:
> Hi Len, Rafael,
> 
> I stumbled over acpi_pad (yuck! ;-), but noticed that you don't set the
> task in polling mode while using mwait. This means we'll still happily
> send an IPI to wake you up.
> 
> A little something like the below should do; you might even be able to
> remove the smp_mb() but since it is completely undocumented (another
> fail) I couldn't tell if the implied barrier in
> current_set_polling_and_test() suffices to replace it, so I left it.
> 
> If it compiles and works; change it to a proper SOB:
> 
> Maybe-Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Of course one such driver is insufficient and you lot needed a second;
same issue:

---
 drivers/thermal/intel_powerclamp.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3f842b..d8111e1ac62e 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -439,8 +439,11 @@ static int clamp_thread(void *arg)
 			local_touch_nmi();
 			stop_critical_timings();
 			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			if (!current_set_polling_and_test()) {
+				cpu_relax(); /* allow HT sibling to run */
+				__mwait(eax, ecx);
+				__current_clr_polling();
+			}
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}


* [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19  9:08 ` Peter Zijlstra
@ 2013-11-19 11:31   ` Peter Zijlstra
  2013-11-19 13:06     ` Rafael J. Wysocki
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19 11:31 UTC (permalink / raw)
  To: lenb, rjw, linux-acpi, linux-kernel
  Cc: shaohua.li, rui.zhang, arjan, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, Thomas Gleixner, hpa

People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.

This leaves mwait_play_dead() the sole remaining user of __mwait() and
new __mwait() users are probably doing it wrong.

Also remove __sti_mwait() as it's unused.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---

Mike, does this cure your core2?

 arch/x86/include/asm/mwait.h       | 42 ++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   | 23 ---------------------
 arch/x86/kernel/acpi/cstate.c      | 23 ---------------------
 drivers/acpi/acpi_pad.c            |  5 +----
 drivers/acpi/processor_idle.c      | 15 --------------
 drivers/idle/intel_idle.c          |  8 +-------
 drivers/thermal/intel_powerclamp.c |  4 +---
 7 files changed, 45 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 2f366d0ac6b4..80014dade987 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_X86_MWAIT_H
 #define _ASM_X86_MWAIT_H
 
+#include <linux/sched.h>
+
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
 #define MWAIT_SUBSTATE_SIZE		4
@@ -13,4 +15,44 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
+{
+	if (need_resched())
+		return;
+
+	if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+		clflush((void *)&current_thread_info()->flags);
+
+	__monitor((void *)&current_thread_info()->flags, 0, 0);
+	if (!current_set_polling_and_test())
+		__mwait(eax, ecx);
+
+	__current_clr_polling();
+}
+
 #endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7b034a4057f9..24821f5768bc 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -700,29 +700,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index d2b7f27781bc..e69182fd01cf 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
-	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(ax, cx);
-	}
-}
-
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 {
 	unsigned int cpu = smp_processor_id();
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index fc6008fbce35..509452a62f96 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,10 +193,7 @@ static int power_saving_thread(void *data)
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
-			if (!need_resched())
-				__mwait(power_saving_mwait_eax, 1);
+			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
 			start_critical_timings();
 			if (lapic_marked_unstable)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 644516d9bde6..f90c56c8379e 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	lapic_timer_state_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 
@@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
@@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 		}
 	}
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	acpi_unlazy_tlb(smp_processor_id());
 
 	/* Tell the scheduler that we are going deep-idle: */
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3226ce98fb18..3b56d76a5bca 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_device *dev,
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	if (!current_set_polling_and_test()) {
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(eax, ecx);
-	}
+	mwait_idle_with_hints(eax, ecx);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3f842b..e8275f2df9af 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			mwait_idle_with_hints(eax, ecx);
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
@ 2013-11-19 13:06     ` Rafael J. Wysocki
  2013-11-19 13:21     ` Mike Galbraith
  2013-11-19 14:22     ` Arjan van de Ven
  2 siblings, 0 replies; 22+ messages in thread
From: Rafael J. Wysocki @ 2013-11-19 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lenb, linux-acpi, linux-kernel, shaohua.li, rui.zhang, arjan,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Tuesday, November 19, 2013 12:31:53 PM Peter Zijlstra wrote:
> People seem to delight in writing wrong and broken mwait idle routines;
> collapse the lot.
> 
> This leaves mwait_play_dead() the sole remaining user of __mwait() and
> new __mwait() users are probably doing it wrong.
> 
> Also remove __sti_mwait() as it's unused.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

For the ACPI part:

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
> 
> Mike, does this cure your core2?
> 
>  arch/x86/include/asm/mwait.h       | 42 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/processor.h   | 23 ---------------------
>  arch/x86/kernel/acpi/cstate.c      | 23 ---------------------
>  drivers/acpi/acpi_pad.c            |  5 +----
>  drivers/acpi/processor_idle.c      | 15 --------------
>  drivers/idle/intel_idle.c          |  8 +-------
>  drivers/thermal/intel_powerclamp.c |  4 +---
>  7 files changed, 45 insertions(+), 75 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 2f366d0ac6b4..80014dade987 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -1,6 +1,8 @@
>  #ifndef _ASM_X86_MWAIT_H
>  #define _ASM_X86_MWAIT_H
>  
> +#include <linux/sched.h>
> +
>  #define MWAIT_SUBSTATE_MASK		0xf
>  #define MWAIT_CSTATE_MASK		0xf
>  #define MWAIT_SUBSTATE_SIZE		4
> @@ -13,4 +15,44 @@
>  
>  #define MWAIT_ECX_INTERRUPT_BREAK	0x1
>  
> +static inline void __monitor(const void *eax, unsigned long ecx,
> +			     unsigned long edx)
> +{
> +	/* "monitor %eax, %ecx, %edx;" */
> +	asm volatile(".byte 0x0f, 0x01, 0xc8;"
> +		     :: "a" (eax), "c" (ecx), "d"(edx));
> +}
> +
> +static inline void __mwait(unsigned long eax, unsigned long ecx)
> +{
> +	/* "mwait %eax, %ecx;" */
> +	asm volatile(".byte 0x0f, 0x01, 0xc9;"
> +		     :: "a" (eax), "c" (ecx));
> +}
> +
> +/*
> + * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
> + * which can obviate IPI to trigger checking of need_resched.
> + * We execute MONITOR against need_resched and enter optimized wait state
> + * through MWAIT. Whenever someone changes need_resched, we would be woken
> + * up from MWAIT (without an IPI).
> + *
> + * New with Core Duo processors, MWAIT can take some hints based on CPU
> + * capability.
> + */
> +static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
> +{
> +	if (need_resched())
> +		return;
> +
> +	if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> +		clflush((void *)&current_thread_info()->flags);
> +
> +	__monitor((void *)&current_thread_info()->flags, 0, 0);
> +	if (!current_set_polling_and_test())
> +		__mwait(eax, ecx);
> +
> +	__current_clr_polling();
> +}
> +
>  #endif /* _ASM_X86_MWAIT_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 7b034a4057f9..24821f5768bc 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -700,29 +700,6 @@ static inline void sync_core(void)
>  #endif
>  }
>  
> -static inline void __monitor(const void *eax, unsigned long ecx,
> -			     unsigned long edx)
> -{
> -	/* "monitor %eax, %ecx, %edx;" */
> -	asm volatile(".byte 0x0f, 0x01, 0xc8;"
> -		     :: "a" (eax), "c" (ecx), "d"(edx));
> -}
> -
> -static inline void __mwait(unsigned long eax, unsigned long ecx)
> -{
> -	/* "mwait %eax, %ecx;" */
> -	asm volatile(".byte 0x0f, 0x01, 0xc9;"
> -		     :: "a" (eax), "c" (ecx));
> -}
> -
> -static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
> -{
> -	trace_hardirqs_on();
> -	/* "mwait %eax, %ecx;" */
> -	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
> -		     :: "a" (eax), "c" (ecx));
> -}
> -
>  extern void select_idle_routine(const struct cpuinfo_x86 *c);
>  extern void init_amd_e400_c1e_mask(void);
>  
> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> index d2b7f27781bc..e69182fd01cf 100644
> --- a/arch/x86/kernel/acpi/cstate.c
> +++ b/arch/x86/kernel/acpi/cstate.c
> @@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
>  }
>  EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
>  
> -/*
> - * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
> - * which can obviate IPI to trigger checking of need_resched.
> - * We execute MONITOR against need_resched and enter optimized wait state
> - * through MWAIT. Whenever someone changes need_resched, we would be woken
> - * up from MWAIT (without an IPI).
> - *
> - * New with Core Duo processors, MWAIT can take some hints based on CPU
> - * capability.
> - */
> -void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
> -{
> -	if (!need_resched()) {
> -		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> -			clflush((void *)&current_thread_info()->flags);
> -
> -		__monitor((void *)&current_thread_info()->flags, 0, 0);
> -		smp_mb();
> -		if (!need_resched())
> -			__mwait(ax, cx);
> -	}
> -}
> -
>  void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>  {
>  	unsigned int cpu = smp_processor_id();
> diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
> index fc6008fbce35..509452a62f96 100644
> --- a/drivers/acpi/acpi_pad.c
> +++ b/drivers/acpi/acpi_pad.c
> @@ -193,10 +193,7 @@ static int power_saving_thread(void *data)
>  					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
>  			stop_critical_timings();
>  
> -			__monitor((void *)&current_thread_info()->flags, 0, 0);
> -			smp_mb();
> -			if (!need_resched())
> -				__mwait(power_saving_mwait_eax, 1);
> +			mwait_idle_with_hints(power_saving_mwait_eax, 1);
>  
>  			start_critical_timings();
>  			if (lapic_marked_unstable)
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 644516d9bde6..f90c56c8379e 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
>  	if (unlikely(!pr))
>  		return -EINVAL;
>  
> -	if (cx->entry_method == ACPI_CSTATE_FFH) {
> -		if (current_set_polling_and_test())
> -			return -EINVAL;
> -	}
> -
>  	lapic_timer_state_broadcast(pr, cx, 1);
>  	acpi_idle_do_entry(cx);
>  
> @@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
>  	if (unlikely(!pr))
>  		return -EINVAL;
>  
> -	if (cx->entry_method == ACPI_CSTATE_FFH) {
> -		if (current_set_polling_and_test())
> -			return -EINVAL;
> -	}
> -
>  	/*
>  	 * Must be done before busmaster disable as we might need to
>  	 * access HPET !
> @@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
>  		}
>  	}
>  
> -	if (cx->entry_method == ACPI_CSTATE_FFH) {
> -		if (current_set_polling_and_test())
> -			return -EINVAL;
> -	}
> -
>  	acpi_unlazy_tlb(smp_processor_id());
>  
>  	/* Tell the scheduler that we are going deep-idle: */
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 3226ce98fb18..3b56d76a5bca 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_device *dev,
>  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
>  
> -	if (!current_set_polling_and_test()) {
> -
> -		__monitor((void *)&current_thread_info()->flags, 0, 0);
> -		smp_mb();
> -		if (!need_resched())
> -			__mwait(eax, ecx);
> -	}
> +	mwait_idle_with_hints(eax, ecx);
>  
>  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 8f181b3f842b..e8275f2df9af 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
>  			 */
>  			local_touch_nmi();
>  			stop_critical_timings();
> -			__monitor((void *)&current_thread_info()->flags, 0, 0);
> -			cpu_relax(); /* allow HT sibling to run */
> -			__mwait(eax, ecx);
> +			mwait_idle_with_hints(eax, ecx);
>  			start_critical_timings();
>  			atomic_inc(&idle_wakeup_counter);
>  		}
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
  2013-11-19 13:06     ` Rafael J. Wysocki
@ 2013-11-19 13:21     ` Mike Galbraith
  2013-11-19 14:22     ` Arjan van de Ven
  2 siblings, 0 replies; 22+ messages in thread
From: Mike Galbraith @ 2013-11-19 13:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	arjan, jacob.jun.pan, Ingo Molnar, Thomas Gleixner, hpa

On Tue, 2013-11-19 at 12:31 +0100, Peter Zijlstra wrote: 
> People seem to delight in writing wrong and broken mwait idle routines;
> collapse the lot.
> 
> This leaves mwait_play_dead() the sole remaining user of __mwait() and
> new __mwait() users are probably doing it wrong.
> 
> Also remove __sti_mwait() as it's unused.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
> 
> Mike, does this cure your core2?

Nope.  Maybe an acpi/bios thingy on this box since diags that fired on
lappy during boot did not fire on desktop, and both are core2 booting
same kernel.  I kinda suspect I'll either be stuck with default_idle or
have to resurrect mwait_idle and carry it locally if I want the thing to
work well.  Guess I'll find out if/when I have time to squabble with it.

Meanwhile, desktop box works fine modulo benchmarks, lappy works fine
too, modulo max_cstate=1 scaring mwait_idle_with_hints away, which I
don't care about much.  Neither box is wonderful at rt testing where I
usually boot max_cstate=1, both just became a bit _less_ wonderful :)

-Mike



* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
  2013-11-19 13:06     ` Rafael J. Wysocki
  2013-11-19 13:21     ` Mike Galbraith
@ 2013-11-19 14:22     ` Arjan van de Ven
  2013-11-19 14:46       ` Peter Zijlstra
  2013-11-19 14:51       ` Peter Zijlstra
  2 siblings, 2 replies; 22+ messages in thread
From: Arjan van de Ven @ 2013-11-19 14:22 UTC (permalink / raw)
  To: Peter Zijlstra, lenb, rjw, linux-acpi, linux-kernel
  Cc: shaohua.li, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, Thomas Gleixner, hpa

> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 8f181b3f842b..e8275f2df9af 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
>   			 */
>   			local_touch_nmi();
>   			stop_critical_timings();
> -			__monitor((void *)&current_thread_info()->flags, 0, 0);
> -			cpu_relax(); /* allow HT sibling to run */
> -			__mwait(eax, ecx);
> +			mwait_idle_with_hints(eax, ecx);
>   			start_critical_timings();
>   			atomic_inc(&idle_wakeup_counter);
>   		}
>

hmm I take it that mwait_idle_with_hints is the one that also checks need_resched() ?
if so... powerclamp may not want to use that (the whole point is to NOT give the cpu
to tasks!)


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 14:22     ` Arjan van de Ven
@ 2013-11-19 14:46       ` Peter Zijlstra
  2013-11-19 14:51       ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19 14:46 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Tue, Nov 19, 2013 at 06:22:43AM -0800, Arjan van de Ven wrote:
> >diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> >index 8f181b3f842b..e8275f2df9af 100644
> >--- a/drivers/thermal/intel_powerclamp.c
> >+++ b/drivers/thermal/intel_powerclamp.c
> >@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
> >  			 */
> >  			local_touch_nmi();
> >  			stop_critical_timings();
> >-			__monitor((void *)&current_thread_info()->flags, 0, 0);
> >-			cpu_relax(); /* allow HT sibling to run */
> >-			__mwait(eax, ecx);
> >+			mwait_idle_with_hints(eax, ecx);
> >  			start_critical_timings();
> >  			atomic_inc(&idle_wakeup_counter);
> >  		}
> >
> 
> hmm I take it that mwait_idle_with_hints is the one that also checks need_resched() ?
> if so... powerclamp may not want to use that (the whole point is to NOT give the cpu
> to tasks!)

Of course, you're very much not getting something that wins from
stop_machine or a fifo-99 task.


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 14:22     ` Arjan van de Ven
  2013-11-19 14:46       ` Peter Zijlstra
@ 2013-11-19 14:51       ` Peter Zijlstra
  2013-11-19 15:13         ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19 14:51 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Tue, Nov 19, 2013 at 06:22:43AM -0800, Arjan van de Ven wrote:
> >diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> >index 8f181b3f842b..e8275f2df9af 100644
> >--- a/drivers/thermal/intel_powerclamp.c
> >+++ b/drivers/thermal/intel_powerclamp.c
> >@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
> >  			 */
> >  			local_touch_nmi();
> >  			stop_critical_timings();
> >-			__monitor((void *)&current_thread_info()->flags, 0, 0);
> >-			cpu_relax(); /* allow HT sibling to run */
> >-			__mwait(eax, ecx);
> >+			mwait_idle_with_hints(eax, ecx);
> >  			start_critical_timings();
> >  			atomic_inc(&idle_wakeup_counter);
> >  		}
> >
> 
> hmm I take it that mwait_idle_with_hints is the one that also checks need_resched() ?
> if so... powerclamp may not want to use that (the whole point is to NOT give the cpu
> to tasks!)

That said, that driver is completely wrecked. It uses
preempt_enable_no_resched() wrong too, it has uncommented barriers..

Dude, wtf are you guys smoking?


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 14:51       ` Peter Zijlstra
@ 2013-11-19 15:13         ` Peter Zijlstra
  2013-11-19 21:06           ` Jacob Pan
                             ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19 15:13 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Tue, Nov 19, 2013 at 03:51:43PM +0100, Peter Zijlstra wrote:
> That said, that driver is completely wrecked. It uses
> preempt_enable_no_resched() wrong too, it has uncommented barriers..
> 
> Dude, wtf are you guys smoking?

---
Subject: sched: Take away preempt_enable_no_resched() and friends from modules

There is no way in hell modules are going to play preemption tricks like
this.

Cc: eliezer.tamir@linux.intel.com
Cc: arjan@linux.intel.com
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/thermal/intel_powerclamp.c |  2 +-
 include/linux/preempt.h            |  8 +++++++-
 include/net/busy_poll.h            | 15 +++------------
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3f842b..0a12ddc2eb4c 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -445,7 +445,7 @@ static int clamp_thread(void *arg)
 			atomic_inc(&idle_wakeup_counter);
 		}
 		tick_nohz_idle_exit();
-		preempt_enable_no_resched();
+		preempt_enable();
 	}
 	del_timer_sync(&wakeup_timer);
 	clear_bit(cpunr, cpu_clamping_mask);
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index a3d9dc8c2c00..3ed2b5335ab4 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -64,7 +64,7 @@ do { \
 } while (0)
 
 #else
-#define preempt_enable() preempt_enable_no_resched()
+#define preempt_enable() sched_preempt_enable_no_resched()
 #define preempt_check_resched() do { } while (0)
 #endif
 
@@ -116,6 +116,12 @@ do { \
 
 #endif /* CONFIG_PREEMPT_COUNT */
 
+#ifdef MODULE
+#undef preempt_enable_no_resched
+#undef preempt_enable_no_resched_notrace
+#undef preempt_check_resched
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 
 struct preempt_notifier;
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 829627d7b846..756827a86c2d 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,18 @@ static inline bool net_busy_loop_on(void)
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
 static inline u64 busy_loop_us_clock(void)
 {
 	u64 rc;
 
+	/* XXX with interrupts enabled sched_clock() can return utter garbage */
+
 	preempt_disable_notrace();
 	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
+	preempt_enable_notrace();
 
 	return rc >> 10;
 }
-#else /* CONFIG_DEBUG_PREEMPT */
-static inline u64 busy_loop_us_clock(void)
-{
-	return sched_clock() >> 10;
-}
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {



* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 15:13         ` Peter Zijlstra
@ 2013-11-19 21:06           ` Jacob Pan
  2013-11-20 10:28             ` Peter Zijlstra
  2014-01-12 18:44           ` [tip:sched/idle] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
                             ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Jacob Pan @ 2013-11-19 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, linux-acpi, linux-kernel,
	shaohua.li, rui.zhang, Mike Galbraith, Ingo Molnar,
	Thomas Gleixner, hpa

On Tue, 19 Nov 2013 16:13:38 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Nov 19, 2013 at 03:51:43PM +0100, Peter Zijlstra wrote:
> > That said, that driver is completely wrecked. It uses
> > preempt_enable_no_resched() wrong too, it has uncommented barriers..
> > 
> > Dude, wtf are you guys smoking?
> 
I applied this patch on top of upstream kernel (801a760) and found out
my machine completely failed to enter idle when nothing is running.
turbostat shows 100% C0. ftrace shows kernel coming in and out of idle
frequently.

Both ACPI idle and intel_idle behaves the same way. I have to do the
following change to allow entering C-states again.

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 80014da..b51d1e1 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -49,10 +49,8 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
                clflush((void *)&current_thread_info()->flags);

        __monitor((void *)&current_thread_info()->flags, 0, 0);
-       if (!current_set_polling_and_test())
+        if (!need_resched())
                __mwait(eax, ecx);
-
-       __current_clr_polling();
 }
 
 #endif /* _ASM_X86_MWAIT_H */

Did I miss any other patches?

Jacob

> ---
> Subject: sched: Take away preempt_enable_no_resched() and friends
> from modules
> 
> There is no way in hell modules are going to play preemption tricks
> like this.
> 
> Cc: eliezer.tamir@linux.intel.com
> Cc: arjan@linux.intel.com
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  drivers/thermal/intel_powerclamp.c |  2 +-
>  include/linux/preempt.h            |  8 +++++++-
>  include/net/busy_poll.h            | 15 +++------------
>  3 files changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/thermal/intel_powerclamp.c
> b/drivers/thermal/intel_powerclamp.c index 8f181b3f842b..0a12ddc2eb4c
> 100644 --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -445,7 +445,7 @@ static int clamp_thread(void *arg)
>  			atomic_inc(&idle_wakeup_counter);
>  		}
>  		tick_nohz_idle_exit();
> -		preempt_enable_no_resched();
> +		preempt_enable();
>  	}
>  	del_timer_sync(&wakeup_timer);
>  	clear_bit(cpunr, cpu_clamping_mask);
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index a3d9dc8c2c00..3ed2b5335ab4 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -64,7 +64,7 @@ do { \
>  } while (0)
>  
>  #else
> -#define preempt_enable() preempt_enable_no_resched()
> +#define preempt_enable() sched_preempt_enable_no_resched()
>  #define preempt_check_resched() do { } while (0)
>  #endif
>  
> @@ -116,6 +116,12 @@ do { \
>  
>  #endif /* CONFIG_PREEMPT_COUNT */
>  
> +#ifdef MODULE
> +#undef preempt_enable_no_resched
> +#undef preempt_enable_no_resched_notrace
> +#undef preempt_check_resched
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  
>  struct preempt_notifier;
> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> index 829627d7b846..756827a86c2d 100644
> --- a/include/net/busy_poll.h
> +++ b/include/net/busy_poll.h
> @@ -42,27 +42,18 @@ static inline bool net_busy_loop_on(void)
>  	return sysctl_net_busy_poll;
>  }
>  
> -/* a wrapper to make debug_smp_processor_id() happy
> - * we can use sched_clock() because we don't care much about
> precision
> - * we only care that the average is bounded
> - */
> -#ifdef CONFIG_DEBUG_PREEMPT
>  static inline u64 busy_loop_us_clock(void)
>  {
>  	u64 rc;
>  
> +	/* XXX with interrupts enabled sched_clock() can return
> utter garbage */ +
>  	preempt_disable_notrace();
>  	rc = sched_clock();
> -	preempt_enable_no_resched_notrace();
> +	preempt_enable_notrace();
>  
>  	return rc >> 10;
>  }
> -#else /* CONFIG_DEBUG_PREEMPT */
> -static inline u64 busy_loop_us_clock(void)
> -{
> -	return sched_clock() >> 10;
> -}
> -#endif /* CONFIG_DEBUG_PREEMPT */
>  
>  static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
>  {
> 

[Jacob Pan]


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 21:06           ` Jacob Pan
@ 2013-11-20 10:28             ` Peter Zijlstra
  2013-11-20 10:58               ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-20 10:28 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, lenb, rjw, linux-acpi, linux-kernel,
	shaohua.li, rui.zhang, Mike Galbraith, Ingo Molnar,
	Thomas Gleixner, hpa

On Tue, Nov 19, 2013 at 01:06:30PM -0800, Jacob Pan wrote:
> On Tue, 19 Nov 2013 16:13:38 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, Nov 19, 2013 at 03:51:43PM +0100, Peter Zijlstra wrote:
> > > That said, that driver is completely wrecked. It uses
> > > preempt_enable_no_resched() wrong too, it has uncommented barriers..
> > > 
> > > Dude, wtf are you guys smoking?
> > 
> I applied this patch on top of upstream kernel (801a760) and found out
> my machine completely failed to enter idle when nothing is running.
> turbostat shows 100% C0. ftrace shows kernel coming in and out of idle
> frequently.
> 
> Both ACPI idle and intel_idle behaves the same way. I have to do the
> following change to allow entering C-states again.
> 
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 80014da..b51d1e1 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -49,10 +49,8 @@ static inline void mwait_idle_with_hints(unsigned
> long eax, unsigned long ecx) clflush((void
> *)&current_thread_info()->flags); 
>         __monitor((void *)&current_thread_info()->flags, 0, 0);
> -       if (!current_set_polling_and_test())
> +        if (!need_resched())
>                 __mwait(eax, ecx);
> -
> -       __current_clr_polling();
>  }
>  
>  #endif /* _ASM_X86_MWAIT_H */

That doesn't make any sense; current_set_polling_and_test() returns the
same thing need_resched() does.
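
For reference, a rough sketch of that helper as it stands around v3.12 (from
memory, simplified); the return value is indeed just need_resched(), but note
the store it does first:

static inline bool __must_check current_set_polling_and_test(void)
{
	/* A store into thread_info: TS_POLLING lives in ->status. */
	current_thread_info()->status |= TS_POLLING;

	/*
	 * Polling state must be visible before we test NEED_RESCHED,
	 * paired with resched_task().
	 */
	smp_mb();

	return unlikely(need_resched());
}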

But you're right, intel_idle resides 100% in C0 and acpi_idle has 100%
C1 residency... most weird.

/me goes prod at it


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 10:28             ` Peter Zijlstra
@ 2013-11-20 10:58               ` Peter Zijlstra
  2013-11-20 16:24                 ` Arjan van de Ven
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-20 10:58 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, lenb, rjw, linux-acpi, linux-kernel,
	shaohua.li, rui.zhang, Mike Galbraith, Ingo Molnar,
	Thomas Gleixner, hpa

On Wed, Nov 20, 2013 at 11:28:03AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 19, 2013 at 01:06:30PM -0800, Jacob Pan wrote:
> > I applied this patch on top of upstream kernel (801a760) and found out
> > my machine completely failed to enter idle when nothing is running.
> > turbostat shows 100% C0. ftrace shows kernel coming in and out of idle
> > frequently.
> > 
> > Both ACPI idle and intel_idle behaves the same way. I have to do the
> > following change to allow entering C-states again.

> That doesn't make any sense; current_set_polling_and_test() returns the
> same thing need_resched() does.
> 
> But you're right, intel_idle resides 100% in C0 and acpi_idle has 100%
> C1 residency... most weird.

So pretty silly actually; you cannot do a store (any store) in between
monitor and mwait.
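
Condensing the two orderings side by side (the cacheline detail is spelled
out further down the thread):

/* v1 ordering (broken): current_set_polling_and_test() stores TS_POLLING
 * into thread_info->status, which shares a cacheline with the MONITORed
 * thread_info->flags, so the armed monitor fires and MWAIT falls straight
 * through; hence the 100% C0 residency. */
__monitor((void *)&current_thread_info()->flags, 0, 0);
if (!current_set_polling_and_test())	/* store between MONITOR and MWAIT */
	__mwait(eax, ecx);

/* v2 ordering (works): do the store first, then arm the monitor and only
 * *load* need_resched() before MWAIT. */
if (!current_set_polling_and_test()) {
	__monitor((void *)&current_thread_info()->flags, 0, 0);
	if (!need_resched())
		__mwait(eax, ecx);
}
__current_clr_polling();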

The below version seems to work properly again with both acpi_idle and
intel_idle.

Now to go make that preempt_enable_no_resched() cleanup compile.. :-)

---
Subject: x86, acpi, idle: Restructure the mwait idle routines
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 19 Nov 2013 12:31:53 +0100

People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.

This leaves mwait_play_dead() the sole remaining user of __mwait() and
new __mwait() users are probably doing it wrong.

Also remove __sti_mwait() as it's unused.

Cc: arjan@linux.intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: lenb@kernel.org
Cc: shaohua.li@intel.com
Cc: rui.zhang@intel.com
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/mwait.h       |   40 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   |   23 ---------------------
 arch/x86/kernel/acpi/cstate.c      |   23 ---------------------
 drivers/acpi/acpi_pad.c            |    5 ----
 drivers/acpi/processor_idle.c      |   15 -------------
 drivers/idle/intel_idle.c          |    8 -------
 drivers/thermal/intel_powerclamp.c |    4 ---
 7 files changed, 43 insertions(+), 75 deletions(-)

--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_X86_MWAIT_H
 #define _ASM_X86_MWAIT_H
 
+#include <linux/sched.h>
+
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
 #define MWAIT_SUBSTATE_SIZE		4
@@ -13,4 +15,42 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
+{
+	if (!current_set_polling_and_test()) {
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
+		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		if (!need_resched())
+			__mwait(eax, ecx);
+	}
+	__current_clr_polling();
+}
+
 #endif /* _ASM_X86_MWAIT_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -700,29 +700,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsi
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
-	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(ax, cx);
-	}
-}
-
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 {
 	unsigned int cpu = smp_processor_id();
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,10 +193,7 @@ static int power_saving_thread(void *dat
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
-			if (!need_resched())
-				__mwait(power_saving_mwait_eax, 1);
+			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
 			start_critical_timings();
 			if (lapic_marked_unstable)
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpu
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	lapic_timer_state_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 
@@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
@@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpu
 		}
 	}
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	acpi_unlazy_tlb(smp_processor_id());
 
 	/* Tell the scheduler that we are going deep-idle: */
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_dev
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	if (!current_set_polling_and_test()) {
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(eax, ecx);
-	}
+	mwait_idle_with_hints(eax, ecx);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			mwait_idle_with_hints(eax, ecx);
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 10:58               ` Peter Zijlstra
@ 2013-11-20 16:24                 ` Arjan van de Ven
  2013-11-20 16:33                   ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Arjan van de Ven @ 2013-11-20 16:24 UTC (permalink / raw)
  To: Peter Zijlstra, Jacob Pan
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On 11/20/2013 2:58 AM, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 11:28:03AM +0100, Peter Zijlstra wrote:
>> On Tue, Nov 19, 2013 at 01:06:30PM -0800, Jacob Pan wrote:
>>> I applied this patch on top of upstream kernel (801a760) and found out
>>> my machine completely failed to enter idle when nothing is running.
>>> turbostat shows 100% C0. ftrace shows kernel coming in and out of idle
>>> frequently.
>>>
>>> Both ACPI idle and intel_idle behaves the same way. I have to do the
>>> following change to allow entering C-states again.
>
>> That doesn't make any sense; current_set_polling_and_test() returns the
>> same thing need_resched() does.
>>
>> But you're right, intel_idle resides 100% in C0 and acpi_idle has 100%
>> C1 residency... most weird.
>
> So pretty silly actually; you cannot do a store (any store) in between
> monitor and mwait.

you can
just not to the cacheline you are watching (or things that alias with that)



* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 16:24                 ` Arjan van de Ven
@ 2013-11-20 16:33                   ` Peter Zijlstra
  2013-11-20 16:38                     ` Arjan van de Ven
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jacob Pan, lenb, rjw, linux-acpi, linux-kernel, shaohua.li,
	rui.zhang, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Wed, Nov 20, 2013 at 08:24:47AM -0800, Arjan van de Ven wrote:
> On 11/20/2013 2:58 AM, Peter Zijlstra wrote:
> >So pretty silly actually; you cannot do a store (any store) in between
> >monitor and mwait.
> 
> you can
> just not to the cacheline you are watching (or things that alias with that)

Ah indeed, the thread_info::status is in the same cacheline.


* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 16:33                   ` Peter Zijlstra
@ 2013-11-20 16:38                     ` Arjan van de Ven
  0 siblings, 0 replies; 22+ messages in thread
From: Arjan van de Ven @ 2013-11-20 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jacob Pan, lenb, rjw, linux-acpi, linux-kernel, shaohua.li,
	rui.zhang, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On 11/20/2013 8:33 AM, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 08:24:47AM -0800, Arjan van de Ven wrote:
>> On 11/20/2013 2:58 AM, Peter Zijlstra wrote:
>>> So pretty silly actually; you cannot do a store (any store) in between
>>> monitor and mwait.
>>
>> you can
>> just not to the cacheline you are watching (or things that alias with that)
>
> Ah indeed, the thread_info::status is in the same cacheline.
>

you told it to wake on any write to that cacheline
and then you write to it ;-)



* [tip:sched/idle] sched/preempt, locking: Rework local_bh_{dis, en}able()
  2013-11-19 15:13         ` Peter Zijlstra
  2013-11-19 21:06           ` Jacob Pan
@ 2014-01-12 18:44           ` tip-bot for Peter Zijlstra
  2014-01-12 18:44           ` [tip:sched/idle] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
                             ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-12 18:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, arjan, peterz, bitbucket, tglx

Commit-ID:  c795eb55e74041a4b1e46a82bb07d8bd513fb066
Gitweb:     http://git.kernel.org/tip/c795eb55e74041a4b1e46a82bb07d8bd513fb066
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 12 Jan 2014 10:48:24 +0100

sched/preempt, locking: Rework local_bh_{dis,en}able()

Currently local_bh_disable() is out-of-line for no apparent reason.
So inline it to save a few cycles on call/return nonsense; the
function body is a single add on x86 (a few extra loads and stores on
load/store archs).

Also expose two new local_bh functions:

  __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);

Which implement the actual local_bh_{dis,en}able() behaviour.

The next patch uses the exposed @cnt argument to optimize bh lock
functions.

With build fixes from Jacob Pan.

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/bottom_half.h  | 32 +++++++++++++++++++++++++++++---
 include/linux/hardirq.h      |  1 +
 include/linux/preempt_mask.h |  1 -
 kernel/softirq.c             | 35 ++++++-----------------------------
 4 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/include/linux/bottom_half.h b/include/linux/bottom_half.h
index 27b1bcf..86c12c9 100644
--- a/include/linux/bottom_half.h
+++ b/include/linux/bottom_half.h
@@ -1,9 +1,35 @@
 #ifndef _LINUX_BH_H
 #define _LINUX_BH_H
 
-extern void local_bh_disable(void);
+#include <linux/preempt.h>
+#include <linux/preempt_mask.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
+#else
+static __always_inline void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
+{
+	preempt_count_add(cnt);
+	barrier();
+}
+#endif
+
+static inline void local_bh_disable(void)
+{
+	__local_bh_disable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
+
 extern void _local_bh_enable(void);
-extern void local_bh_enable(void);
-extern void local_bh_enable_ip(unsigned long ip);
+extern void __local_bh_enable_ip(unsigned long ip, unsigned int cnt);
+
+static inline void local_bh_enable_ip(unsigned long ip)
+{
+	__local_bh_enable_ip(ip, SOFTIRQ_DISABLE_OFFSET);
+}
+
+static inline void local_bh_enable(void)
+{
+	__local_bh_enable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
 
 #endif /* _LINUX_BH_H */
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d9cf963..12d5f97 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <asm/hardirq.h>
 
 
 extern void synchronize_irq(unsigned int irq);
diff --git a/include/linux/preempt_mask.h b/include/linux/preempt_mask.h
index d169820..30820ec 100644
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -2,7 +2,6 @@
 #define LINUX_PREEMPT_MASK_H
 
 #include <linux/preempt.h>
-#include <asm/hardirq.h>
 
 /*
  * We put the hardirq and softirq counter into the preemption
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 9a4500e..2a4b5f4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -89,7 +89,7 @@ static void wakeup_softirqd(void)
  * where hardirqs are disabled legitimately:
  */
 #ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip, unsigned int cnt)
+void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 {
 	unsigned long flags;
 
@@ -114,21 +114,9 @@ static void __local_bh_disable(unsigned long ip, unsigned int cnt)
 	if (preempt_count() == cnt)
 		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
-#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
-{
-	preempt_count_add(cnt);
-	barrier();
-}
+EXPORT_SYMBOL(__local_bh_disable_ip);
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
-void local_bh_disable(void)
-{
-	__local_bh_disable(_RET_IP_, SOFTIRQ_DISABLE_OFFSET);
-}
-
-EXPORT_SYMBOL(local_bh_disable);
-
 static void __local_bh_enable(unsigned int cnt)
 {
 	WARN_ON_ONCE(!irqs_disabled());
@@ -151,7 +139,7 @@ void _local_bh_enable(void)
 
 EXPORT_SYMBOL(_local_bh_enable);
 
-static inline void _local_bh_enable_ip(unsigned long ip)
+void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 {
 	WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -166,7 +154,7 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 	 * Keep preemption disabled until we are done with
 	 * softirq processing:
  	 */
-	preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);
+	preempt_count_sub(cnt - 1);
 
 	if (unlikely(!in_interrupt() && local_softirq_pending())) {
 		/*
@@ -182,18 +170,7 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 #endif
 	preempt_check_resched();
 }
-
-void local_bh_enable(void)
-{
-	_local_bh_enable_ip(_RET_IP_);
-}
-EXPORT_SYMBOL(local_bh_enable);
-
-void local_bh_enable_ip(unsigned long ip)
-{
-	_local_bh_enable_ip(ip);
-}
-EXPORT_SYMBOL(local_bh_enable_ip);
+EXPORT_SYMBOL(__local_bh_enable_ip);
 
 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
@@ -264,7 +241,7 @@ asmlinkage void __do_softirq(void)
 	pending = local_softirq_pending();
 	account_irq_enter_time(current);
 
-	__local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET);
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
 
 	cpu = smp_processor_id();


* [tip:sched/idle] sched, net: Clean up preempt_enable_no_resched() abuse
  2013-11-19 15:13         ` Peter Zijlstra
  2013-11-19 21:06           ` Jacob Pan
  2014-01-12 18:44           ` [tip:sched/idle] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
@ 2014-01-12 18:44           ` tip-bot for Peter Zijlstra
  2014-01-12 18:44           ` [tip:sched/idle] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
                             ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-12 18:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket,
	davem, eliezer.tamir, akpm, tglx

Commit-ID:  3a5cd0b09443edb2f960497928de047e9c8ab848
Gitweb:     http://git.kernel.org/tip/3a5cd0b09443edb2f960497928de047e9c8ab848
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 12 Jan 2014 14:50:35 +0100

sched, net: Clean up preempt_enable_no_resched() abuse

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule(), or if we know preemption cannot actually be enabled
by that statement because other preempt_count 'refs' are known to still be held.
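
A minimal illustration of the two cases (hypothetical function and helper
names, purely for contrast):

/* Valid: the very next statement is schedule(), so a reschedule raised
 * while we were non-preemptible cannot get lost. */
static void ok_pattern(void)
{
	preempt_disable();
	/* ... */
	preempt_enable_no_resched();
	schedule();
}

/* Invalid (the style of use being removed here): preemption is nominally
 * back on, but a pending reschedule is ignored until some unrelated later
 * preemption point. */
static void broken_pattern(void)
{
	preempt_disable();
	/* ... */
	preempt_enable_no_resched();
	do_more_work();		/* hypothetical; may run for a long time */
}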

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 net/ipv4/tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c4638e6..82de786 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
 		    !sysctl_tcp_low_latency &&
 		    net_dma_find_channel()) {
-			preempt_enable_no_resched();
+			preempt_enable();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
 		} else {
-			preempt_enable_no_resched();
+			preempt_enable();
 		}
 	}
 #endif

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/idle] sched, net: Fixup busy_loop_us_clock()
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (2 preceding siblings ...)
  2014-01-12 18:44           ` [tip:sched/idle] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
@ 2014-01-12 18:44           ` tip-bot for Peter Zijlstra
  2014-01-13 15:56           ` [tip:sched/core] locking: Optimize lock_bh functions tip-bot for Peter Zijlstra
                             ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-12 18:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket,
	davem, eliezer.tamir, akpm, tglx

Commit-ID:  29fc2f66b552770909591d89ee87ab8690d67e61
Gitweb:     http://git.kernel.org/tip/29fc2f66b552770909591d89ee87ab8690d67e61
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 12 Jan 2014 14:50:37 +0100

sched, net: Fixup busy_loop_us_clock()

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement due to additional known preempt_count 'refs'.

This busy_poll stuff looks to be completely and utterly broken:
sched_clock() can return utter garbage with interrupts enabled (rare,
but still) and it can drift unbounded between CPUs.

This means that if you get preempted/migrated and your new CPU is
years behind the previous CPU, we get to busy spin for a _very_ long
time.

There is a _REASON_ sched_clock() warns about preemptability -
papering over it with a preempt_disable()/preempt_enable_no_resched()
is just terminal brain damage on so many levels.

Replace sched_clock() usage with local_clock() which has a bounded
drift between CPUs (<2 jiffies).

There is a further problem with the entire busy wait poll thing in
that the spin time is additive to the syscall timeout, not inclusive.
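
As a sketch of why the unbounded drift matters for the spin loop
(simplified, not the actual net code; nonblocking_rx_ready() and
poll_usecs are made up for illustration):

  unsigned long end = busy_loop_us_clock() + poll_usecs;  /* read on CPU A */

  do {
          if (nonblocking_rx_ready(sk))
                  break;
          cpu_relax();
          /*
           * After a migration this read can come from CPU B.  With
           * sched_clock() CPU B may be arbitrarily far behind CPU A,
           * so the loop can spin for a very long time; local_clock()
           * bounds the error to a couple of jiffies.
           */
  } while (busy_loop_us_clock() < end);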

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/net/busy_poll.h | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 829627d..1d67fb6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,10 @@ static inline bool net_busy_loop_on(void)
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
-static inline u64 busy_loop_us_clock(void)
-{
-	u64 rc;
-
-	preempt_disable_notrace();
-	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
-
-	return rc >> 10;
-}
-#else /* CONFIG_DEBUG_PREEMPT */
 static inline u64 busy_loop_us_clock(void)
 {
-	return sched_clock() >> 10;
+	return local_clock() >> 10;
 }
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] locking: Optimize lock_bh functions
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (3 preceding siblings ...)
  2014-01-12 18:44           ` [tip:sched/idle] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
@ 2014-01-13 15:56           ` tip-bot for Peter Zijlstra
  2014-01-13 16:42           ` [tip:sched/core] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
                             ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 15:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket, akpm, tglx

Commit-ID:  9ea4c380066fbe23fe0da7f4abfabc444f2467f4
Gitweb:     http://git.kernel.org/tip/9ea4c380066fbe23fe0da7f4abfabc444f2467f4
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:47:36 +0100

locking: Optimize lock_bh functions

Currently all _bh_ lock functions do two preempt_count operations:

  local_bh_disable();
  preempt_disable();

and for the unlock:

  preempt_enable_no_resched();
  local_bh_enable();

Since it's a waste of perfectly good cycles to modify the same variable
twice when you can do it in one go, use the new
__local_bh_{dis,en}able_ip() functions that allow us to provide a
preempt_count value to add/sub.

So define SOFTIRQ_LOCK_OFFSET as the offset a _bh_ lock needs to
add/sub so that both updates can be done in one go.

As a bonus it gets rid of the preempt_enable_no_resched() usage.

This reduces 1000 loops of:

  spin_lock_bh(&bh_lock);
  spin_unlock_bh(&bh_lock);

from 53596 cycles to 51995 cycles. I didn't do enough measurements to
say for absolutely sure that the result is significant, but the few
runs I did for each suggest it is so.
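
In rough terms, for spin_lock_bh() (a sketch of the intended
preempt_count accounting, not actual generated code):

  /* Before: two read-modify-write operations on preempt_count. */
  local_bh_disable();     /* preempt_count += SOFTIRQ_DISABLE_OFFSET */
  preempt_disable();      /* preempt_count += 1                      */

  /* After: one combined update via the new @cnt argument, with
   * SOFTIRQ_LOCK_OFFSET == SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET. */
  __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);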

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/preempt_mask.h     | 15 +++++++++++++++
 include/linux/rwlock_api_smp.h   | 12 ++++--------
 include/linux/spinlock_api_smp.h | 12 ++++--------
 include/linux/spinlock_api_up.h  | 16 +++++++++++-----
 kernel/softirq.c                 |  4 ++--
 5 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/include/linux/preempt_mask.h b/include/linux/preempt_mask.h
index d169820..b8d96bc 100644
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -79,6 +79,21 @@
 #endif
 
 /*
+ * The preempt_count offset needed for things like:
+ *
+ *  spin_lock_bh()
+ *
+ * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and
+ * softirqs, such that unlock sequences of:
+ *
+ *  spin_unlock();
+ *  local_bh_enable();
+ *
+ * Work as expected.
+ */
+#define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET)
+
+/*
  * Are we running in atomic context?  WARNING: this macro cannot
  * always detect atomic context; in particular, it cannot know about
  * held spinlocks in non-preemptible kernels.  Thus it should not be
diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h
index 9c9f049..5b9b84b 100644
--- a/include/linux/rwlock_api_smp.h
+++ b/include/linux/rwlock_api_smp.h
@@ -172,8 +172,7 @@ static inline void __raw_read_lock_irq(rwlock_t *lock)
 
 static inline void __raw_read_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_read_trylock, do_raw_read_lock);
 }
@@ -200,8 +199,7 @@ static inline void __raw_write_lock_irq(rwlock_t *lock)
 
 static inline void __raw_write_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
 }
@@ -250,8 +248,7 @@ static inline void __raw_read_unlock_bh(rwlock_t *lock)
 {
 	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_read_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
@@ -275,8 +272,7 @@ static inline void __raw_write_unlock_bh(rwlock_t *lock)
 {
 	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_write_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 #endif /* __LINUX_RWLOCK_API_SMP_H */
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index bdb9993..42dfab8 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -131,8 +131,7 @@ static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
 
 static inline void __raw_spin_lock_bh(raw_spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
 }
@@ -174,20 +173,17 @@ static inline void __raw_spin_unlock_bh(raw_spinlock_t *lock)
 {
 	spin_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_spin_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 static inline int __raw_spin_trylock_bh(raw_spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	if (do_raw_spin_trylock(lock)) {
 		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 		return 1;
 	}
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	return 0;
 }
 
diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h
index af1f472..d0d1888 100644
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -24,11 +24,14 @@
  * flags straight, to suppress compiler warnings of unused lock
  * variables, and to add the proper checker annotations:
  */
+#define ___LOCK(lock) \
+  do { __acquire(lock); (void)(lock); } while (0)
+
 #define __LOCK(lock) \
-  do { preempt_disable(); __acquire(lock); (void)(lock); } while (0)
+  do { preempt_disable(); ___LOCK(lock); } while (0)
 
 #define __LOCK_BH(lock) \
-  do { local_bh_disable(); __LOCK(lock); } while (0)
+  do { __local_bh_disable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); ___LOCK(lock); } while (0)
 
 #define __LOCK_IRQ(lock) \
   do { local_irq_disable(); __LOCK(lock); } while (0)
@@ -36,12 +39,15 @@
 #define __LOCK_IRQSAVE(lock, flags) \
   do { local_irq_save(flags); __LOCK(lock); } while (0)
 
+#define ___UNLOCK(lock) \
+  do { __release(lock); (void)(lock); } while (0)
+
 #define __UNLOCK(lock) \
-  do { preempt_enable(); __release(lock); (void)(lock); } while (0)
+  do { preempt_enable(); ___UNLOCK(lock); } while (0)
 
 #define __UNLOCK_BH(lock) \
-  do { preempt_enable_no_resched(); local_bh_enable(); \
-	  __release(lock); (void)(lock); } while (0)
+  do { __local_bh_enable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); \
+       ___UNLOCK(lock); } while (0)
 
 #define __UNLOCK_IRQ(lock) \
   do { local_irq_enable(); __UNLOCK(lock); } while (0)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 11025cc..7500cce1 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -107,7 +107,7 @@ static void __local_bh_disable(unsigned long ip, unsigned int cnt)
 	/*
 	 * Were softirqs turned off above:
 	 */
-	if (softirq_count() == cnt)
+	if (softirq_count() == (cnt & SOFTIRQ_MASK))
 		trace_softirqs_off(ip);
 	raw_local_irq_restore(flags);
 
@@ -133,7 +133,7 @@ static void __local_bh_enable(unsigned int cnt)
 {
 	WARN_ON_ONCE(!irqs_disabled());
 
-	if (softirq_count() == cnt)
+	if (softirq_count() == (cnt & SOFTIRQ_MASK))
 		trace_softirqs_on(_RET_IP_);
 	preempt_count_sub(cnt);
 }

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] sched/preempt, locking: Rework local_bh_{dis, en}able()
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (4 preceding siblings ...)
  2014-01-13 15:56           ` [tip:sched/core] locking: Optimize lock_bh functions tip-bot for Peter Zijlstra
@ 2014-01-13 16:42           ` tip-bot for Peter Zijlstra
  2014-01-13 16:45           ` [tip:sched/core] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
  2014-01-13 16:45           ` [tip:sched/core] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 16:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, arjan, peterz, bitbucket, tglx

Commit-ID:  0bd3a173d711857fc9f583eb5825386cc08f3948
Gitweb:     http://git.kernel.org/tip/0bd3a173d711857fc9f583eb5825386cc08f3948
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 17:32:27 +0100

sched/preempt, locking: Rework local_bh_{dis,en}able()

Currently local_bh_disable() is out-of-line for no apparent reason.
So inline it to save a few cycles on call/return nonsense; the
function body is a single add on x86 (a few extra loads and stores on
load/store archs).

Also expose two new local_bh functions:

  __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);

Which implement the actual local_bh_{dis,en}able() behaviour.

The next patch uses the exposed @cnt argument to optimize bh lock
functions.

With build fixes from Jacob Pan.
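
For existing callers nothing changes; a minimal, hypothetical example of
the pattern the now-inline variants serve (per-CPU data also touched from
softirq context):

  static DEFINE_PER_CPU(unsigned long, my_stat);  /* made-up example data */

  static void bump_my_stat(void)
  {
          local_bh_disable();      /* now a single preempt_count_add() inline */
          __this_cpu_inc(my_stat); /* softirqs on this CPU cannot interleave  */
          local_bh_enable();       /* runs pending softirqs, may reschedule   */
  }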

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/bottom_half.h  | 32 +++++++++++++++++++++++++++++---
 include/linux/hardirq.h      |  1 +
 include/linux/preempt_mask.h |  1 -
 kernel/softirq.c             | 35 ++++++-----------------------------
 4 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/include/linux/bottom_half.h b/include/linux/bottom_half.h
index 27b1bcf..86c12c9 100644
--- a/include/linux/bottom_half.h
+++ b/include/linux/bottom_half.h
@@ -1,9 +1,35 @@
 #ifndef _LINUX_BH_H
 #define _LINUX_BH_H
 
-extern void local_bh_disable(void);
+#include <linux/preempt.h>
+#include <linux/preempt_mask.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
+#else
+static __always_inline void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
+{
+	preempt_count_add(cnt);
+	barrier();
+}
+#endif
+
+static inline void local_bh_disable(void)
+{
+	__local_bh_disable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
+
 extern void _local_bh_enable(void);
-extern void local_bh_enable(void);
-extern void local_bh_enable_ip(unsigned long ip);
+extern void __local_bh_enable_ip(unsigned long ip, unsigned int cnt);
+
+static inline void local_bh_enable_ip(unsigned long ip)
+{
+	__local_bh_enable_ip(ip, SOFTIRQ_DISABLE_OFFSET);
+}
+
+static inline void local_bh_enable(void)
+{
+	__local_bh_enable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
 
 #endif /* _LINUX_BH_H */
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d9cf963..12d5f97 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <asm/hardirq.h>
 
 
 extern void synchronize_irq(unsigned int irq);
diff --git a/include/linux/preempt_mask.h b/include/linux/preempt_mask.h
index b8d96bc..dbeec4d 100644
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -2,7 +2,6 @@
 #define LINUX_PREEMPT_MASK_H
 
 #include <linux/preempt.h>
-#include <asm/hardirq.h>
 
 /*
  * We put the hardirq and softirq counter into the preemption
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 7500cce1..9e368ef 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -89,7 +89,7 @@ static void wakeup_softirqd(void)
  * where hardirqs are disabled legitimately:
  */
 #ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip, unsigned int cnt)
+void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 {
 	unsigned long flags;
 
@@ -114,21 +114,9 @@ static void __local_bh_disable(unsigned long ip, unsigned int cnt)
 	if (preempt_count() == cnt)
 		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
-#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
-{
-	preempt_count_add(cnt);
-	barrier();
-}
+EXPORT_SYMBOL(__local_bh_disable_ip);
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
-void local_bh_disable(void)
-{
-	__local_bh_disable(_RET_IP_, SOFTIRQ_DISABLE_OFFSET);
-}
-
-EXPORT_SYMBOL(local_bh_disable);
-
 static void __local_bh_enable(unsigned int cnt)
 {
 	WARN_ON_ONCE(!irqs_disabled());
@@ -151,7 +139,7 @@ void _local_bh_enable(void)
 
 EXPORT_SYMBOL(_local_bh_enable);
 
-static inline void _local_bh_enable_ip(unsigned long ip)
+void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 {
 	WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -166,7 +154,7 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 	 * Keep preemption disabled until we are done with
 	 * softirq processing:
  	 */
-	preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);
+	preempt_count_sub(cnt - 1);
 
 	if (unlikely(!in_interrupt() && local_softirq_pending())) {
 		/*
@@ -182,18 +170,7 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 #endif
 	preempt_check_resched();
 }
-
-void local_bh_enable(void)
-{
-	_local_bh_enable_ip(_RET_IP_);
-}
-EXPORT_SYMBOL(local_bh_enable);
-
-void local_bh_enable_ip(unsigned long ip)
-{
-	_local_bh_enable_ip(ip);
-}
-EXPORT_SYMBOL(local_bh_enable_ip);
+EXPORT_SYMBOL(__local_bh_enable_ip);
 
 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
@@ -230,7 +207,7 @@ asmlinkage void __do_softirq(void)
 	pending = local_softirq_pending();
 	account_irq_enter_time(current);
 
-	__local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET);
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	lockdep_softirq_enter();
 
 	cpu = smp_processor_id();

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] sched, net: Clean up preempt_enable_no_resched() abuse
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (5 preceding siblings ...)
  2014-01-13 16:42           ` [tip:sched/core] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
@ 2014-01-13 16:45           ` tip-bot for Peter Zijlstra
  2014-01-13 16:45           ` [tip:sched/core] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 16:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket,
	davem, eliezer.tamir, akpm, tglx

Commit-ID:  1774e9f3e5c8b38de3b3bc8bd0eacd280f655baf
Gitweb:     http://git.kernel.org/tip/1774e9f3e5c8b38de3b3bc8bd0eacd280f655baf
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 17:39:04 +0100

sched, net: Clean up preempt_enable_no_resched() abuse

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement due to additional known preempt_count 'refs'.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 net/ipv4/tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c4638e6..82de786 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
 		    !sysctl_tcp_low_latency &&
 		    net_dma_find_channel()) {
-			preempt_enable_no_resched();
+			preempt_enable();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
 		} else {
-			preempt_enable_no_resched();
+			preempt_enable();
 		}
 	}
 #endif

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] sched, net: Fixup busy_loop_us_clock()
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (6 preceding siblings ...)
  2014-01-13 16:45           ` [tip:sched/core] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
@ 2014-01-13 16:45           ` tip-bot for Peter Zijlstra
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 16:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket,
	davem, eliezer.tamir, akpm, tglx

Commit-ID:  37089834528be3ef8cbf927e47c753b3e272a856
Gitweb:     http://git.kernel.org/tip/37089834528be3ef8cbf927e47c753b3e272a856
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 17:39:11 +0100

sched, net: Fixup busy_loop_us_clock()

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement due to additional known preempt_count 'refs'.

This busy_poll stuff looks to be completely and utterly broken:
sched_clock() can return utter garbage with interrupts enabled (rare,
but still) and it can drift unbounded between CPUs.

This means that if you get preempted/migrated and your new CPU is
years behind the previous CPU, we get to busy spin for a _very_ long
time.

There is a _REASON_ sched_clock() warns about preemptability -
papering over it with a preempt_disable()/preempt_enable_no_resched()
is just terminal brain damage on so many levels.

Replace sched_clock() usage with local_clock() which has a bounded
drift between CPUs (<2 jiffies).

There is a further problem with the entire busy wait poll thing in
that the spin time is additive to the syscall timeout, not inclusive.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/net/busy_poll.h | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 829627d..1d67fb6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,10 @@ static inline bool net_busy_loop_on(void)
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
-static inline u64 busy_loop_us_clock(void)
-{
-	u64 rc;
-
-	preempt_disable_notrace();
-	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
-
-	return rc >> 10;
-}
-#else /* CONFIG_DEBUG_PREEMPT */
 static inline u64 busy_loop_us_clock(void)
 {
-	return sched_clock() >> 10;
+	return local_clock() >> 10;
 }
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-01-13 16:46 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-19  9:00 acpi_pad mwait usage Peter Zijlstra
2013-11-19  9:08 ` Peter Zijlstra
2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
2013-11-19 13:06     ` Rafael J. Wysocki
2013-11-19 13:21     ` Mike Galbraith
2013-11-19 14:22     ` Arjan van de Ven
2013-11-19 14:46       ` Peter Zijlstra
2013-11-19 14:51       ` Peter Zijlstra
2013-11-19 15:13         ` Peter Zijlstra
2013-11-19 21:06           ` Jacob Pan
2013-11-20 10:28             ` Peter Zijlstra
2013-11-20 10:58               ` Peter Zijlstra
2013-11-20 16:24                 ` Arjan van de Ven
2013-11-20 16:33                   ` Peter Zijlstra
2013-11-20 16:38                     ` Arjan van de Ven
2014-01-12 18:44           ` [tip:sched/idle] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
2014-01-12 18:44           ` [tip:sched/idle] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
2014-01-12 18:44           ` [tip:sched/idle] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
2014-01-13 15:56           ` [tip:sched/core] locking: Optimize lock_bh functions tip-bot for Peter Zijlstra
2014-01-13 16:42           ` [tip:sched/core] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
2014-01-13 16:45           ` [tip:sched/core] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
2014-01-13 16:45           ` [tip:sched/core] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).