* acpi_pad mwait usage
@ 2013-11-19  9:00 Peter Zijlstra
  2013-11-19  9:08 ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19  9:00 UTC (permalink / raw)
  To: lenb, rjw, linux-acpi, linux-kernel; +Cc: shaohua.li

Hi Len, Rafael,

I stumbled over acpi_pad (yuck! ;-), but noticed that you don't set the
task in polling mode while using mwait. This means we'll still happily
send an IPI to wake you up.

A little something like the below should do; you might even be able to
remove the smp_mb() but since it is completely undocumented (another
fail) I couldn't tell if the implied barrier in
current_set_polling_and_test() suffices to replace it, so I left it.
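
For reference, the scheduler's remote wakeup path at the time looked roughly
like the sketch below (simplified, not verbatim kernel code); the polling flag
is what lets it skip the IPI, which is why acpi_pad should advertise it:

	static void resched_task_sketch(struct task_struct *p)
	{
		set_tsk_need_resched(p);	/* store hits the monitored thread_info */
		smp_mb();
		/*
		 * A task that advertised polling mode is sitting in mwait on
		 * its thread_info cacheline, so the store above already woke
		 * it; only a non-polling task needs the reschedule IPI.
		 */
		if (!tsk_is_polling(p))
			smp_send_reschedule(task_cpu(p));
	}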

If it compiles and works; change it to a proper SOB:

Maybe-Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/acpi/acpi_pad.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index fc6008fbce35..e9126efe7786 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -195,8 +195,10 @@ static int power_saving_thread(void *data)
 
 			__monitor((void *)&current_thread_info()->flags, 0, 0);
 			smp_mb();
-			if (!need_resched())
+			if (!current_set_polling_and_test()) {
 				__mwait(power_saving_mwait_eax, 1);
+				__current_clr_polling();
+			}
 
 			start_critical_timings();
 			if (lapic_marked_unstable)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: acpi_pad mwait usage
  2013-11-19  9:00 acpi_pad mwait usage Peter Zijlstra
@ 2013-11-19  9:08 ` Peter Zijlstra
  2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19  9:08 UTC (permalink / raw)
  To: lenb, rjw, linux-acpi, linux-kernel
  Cc: shaohua.li, rui.zhang, arjan, jacob.jun.pan

On Tue, Nov 19, 2013 at 10:00:19AM +0100, Peter Zijlstra wrote:
> Hi Len, Rafael,
> 
> I stumbled over acpi_pad (yuck! ;-), but noticed that you don't set the
> task in polling mode while using mwait. This means we'll still happily
> send an IPI to wake you up.
> 
> A little something like the below should do; you might even be able to
> remove the smp_mb() but since it is completely undocumented (another
> fail) I couldn't tell if the implied barrier in
> current_set_polling_and_test() suffices to replace it, so I left it.
> 
> If it compiles and works; change it to a proper SOB:
> 
> Maybe-Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Of course one such driver is insufficient and you lot needed a second;
same issue:

---
 drivers/thermal/intel_powerclamp.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3f842b..d8111e1ac62e 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -439,8 +439,11 @@ static int clamp_thread(void *arg)
 			local_touch_nmi();
 			stop_critical_timings();
 			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			if (!current_set_polling_and_test()) {
+				cpu_relax(); /* allow HT sibling to run */
+				__mwait(eax, ecx);
+				__current_clr_polling();
+			}
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19  9:08 ` Peter Zijlstra
@ 2013-11-19 11:31   ` Peter Zijlstra
  2013-11-19 13:06     ` Rafael J. Wysocki
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19 11:31 UTC (permalink / raw)
  To: lenb, rjw, linux-acpi, linux-kernel
  Cc: shaohua.li, rui.zhang, arjan, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, Thomas Gleixner, hpa

People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.

This leaves mwait_play_dead() the sole remaining user of __mwait() and
new __mwait() users are probably doing it wrong.

Also remove __sti_mwait() as it's unused.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---

Mike, does this cure your core2?

 arch/x86/include/asm/mwait.h       | 42 ++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   | 23 ---------------------
 arch/x86/kernel/acpi/cstate.c      | 23 ---------------------
 drivers/acpi/acpi_pad.c            |  5 +----
 drivers/acpi/processor_idle.c      | 15 --------------
 drivers/idle/intel_idle.c          |  8 +-------
 drivers/thermal/intel_powerclamp.c |  4 +---
 7 files changed, 45 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 2f366d0ac6b4..80014dade987 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_X86_MWAIT_H
 #define _ASM_X86_MWAIT_H
 
+#include <linux/sched.h>
+
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
 #define MWAIT_SUBSTATE_SIZE		4
@@ -13,4 +15,44 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
+{
+	if (need_resched())
+		return;
+
+	if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+		clflush((void *)&current_thread_info()->flags);
+
+	__monitor((void *)&current_thread_info()->flags, 0, 0);
+	if (!current_set_polling_and_test())
+		__mwait(eax, ecx);
+
+	__current_clr_polling();
+}
+
 #endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7b034a4057f9..24821f5768bc 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -700,29 +700,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index d2b7f27781bc..e69182fd01cf 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
-	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(ax, cx);
-	}
-}
-
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 {
 	unsigned int cpu = smp_processor_id();
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index fc6008fbce35..509452a62f96 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,10 +193,7 @@ static int power_saving_thread(void *data)
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
-			if (!need_resched())
-				__mwait(power_saving_mwait_eax, 1);
+			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
 			start_critical_timings();
 			if (lapic_marked_unstable)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 644516d9bde6..f90c56c8379e 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	lapic_timer_state_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 
@@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
@@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 		}
 	}
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	acpi_unlazy_tlb(smp_processor_id());
 
 	/* Tell the scheduler that we are going deep-idle: */
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3226ce98fb18..3b56d76a5bca 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_device *dev,
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	if (!current_set_polling_and_test()) {
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(eax, ecx);
-	}
+	mwait_idle_with_hints(eax, ecx);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3f842b..e8275f2df9af 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			mwait_idle_with_hints(eax, ecx);
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
@ 2013-11-19 13:06     ` Rafael J. Wysocki
  2013-11-19 13:21     ` Mike Galbraith
  2013-11-19 14:22     ` Arjan van de Ven
  2 siblings, 0 replies; 22+ messages in thread
From: Rafael J. Wysocki @ 2013-11-19 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lenb, linux-acpi, linux-kernel, shaohua.li, rui.zhang, arjan,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Tuesday, November 19, 2013 12:31:53 PM Peter Zijlstra wrote:
> People seem to delight in writing wrong and broken mwait idle routines;
> collapse the lot.
> 
> This leaves mwait_play_dead() the sole remaining user of __mwait() and
> new __mwait() users are probably doing it wrong.
> 
> Also remove __sti_mwait() as it's unused.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

For the ACPI part:

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
> 
> Mike, does this cure your core2?
> 
>  arch/x86/include/asm/mwait.h       | 42 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/processor.h   | 23 ---------------------
>  arch/x86/kernel/acpi/cstate.c      | 23 ---------------------
>  drivers/acpi/acpi_pad.c            |  5 +----
>  drivers/acpi/processor_idle.c      | 15 --------------
>  drivers/idle/intel_idle.c          |  8 +-------
>  drivers/thermal/intel_powerclamp.c |  4 +---
>  7 files changed, 45 insertions(+), 75 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 2f366d0ac6b4..80014dade987 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -1,6 +1,8 @@
>  #ifndef _ASM_X86_MWAIT_H
>  #define _ASM_X86_MWAIT_H
>  
> +#include <linux/sched.h>
> +
>  #define MWAIT_SUBSTATE_MASK		0xf
>  #define MWAIT_CSTATE_MASK		0xf
>  #define MWAIT_SUBSTATE_SIZE		4
> @@ -13,4 +15,44 @@
>  
>  #define MWAIT_ECX_INTERRUPT_BREAK	0x1
>  
> +static inline void __monitor(const void *eax, unsigned long ecx,
> +			     unsigned long edx)
> +{
> +	/* "monitor %eax, %ecx, %edx;" */
> +	asm volatile(".byte 0x0f, 0x01, 0xc8;"
> +		     :: "a" (eax), "c" (ecx), "d"(edx));
> +}
> +
> +static inline void __mwait(unsigned long eax, unsigned long ecx)
> +{
> +	/* "mwait %eax, %ecx;" */
> +	asm volatile(".byte 0x0f, 0x01, 0xc9;"
> +		     :: "a" (eax), "c" (ecx));
> +}
> +
> +/*
> + * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
> + * which can obviate IPI to trigger checking of need_resched.
> + * We execute MONITOR against need_resched and enter optimized wait state
> + * through MWAIT. Whenever someone changes need_resched, we would be woken
> + * up from MWAIT (without an IPI).
> + *
> + * New with Core Duo processors, MWAIT can take some hints based on CPU
> + * capability.
> + */
> +static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
> +{
> +	if (need_resched())
> +		return;
> +
> +	if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> +		clflush((void *)&current_thread_info()->flags);
> +
> +	__monitor((void *)&current_thread_info()->flags, 0, 0);
> +	if (!current_set_polling_and_test())
> +		__mwait(eax, ecx);
> +
> +	__current_clr_polling();
> +}
> +
>  #endif /* _ASM_X86_MWAIT_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 7b034a4057f9..24821f5768bc 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -700,29 +700,6 @@ static inline void sync_core(void)
>  #endif
>  }
>  
> -static inline void __monitor(const void *eax, unsigned long ecx,
> -			     unsigned long edx)
> -{
> -	/* "monitor %eax, %ecx, %edx;" */
> -	asm volatile(".byte 0x0f, 0x01, 0xc8;"
> -		     :: "a" (eax), "c" (ecx), "d"(edx));
> -}
> -
> -static inline void __mwait(unsigned long eax, unsigned long ecx)
> -{
> -	/* "mwait %eax, %ecx;" */
> -	asm volatile(".byte 0x0f, 0x01, 0xc9;"
> -		     :: "a" (eax), "c" (ecx));
> -}
> -
> -static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
> -{
> -	trace_hardirqs_on();
> -	/* "mwait %eax, %ecx;" */
> -	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
> -		     :: "a" (eax), "c" (ecx));
> -}
> -
>  extern void select_idle_routine(const struct cpuinfo_x86 *c);
>  extern void init_amd_e400_c1e_mask(void);
>  
> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> index d2b7f27781bc..e69182fd01cf 100644
> --- a/arch/x86/kernel/acpi/cstate.c
> +++ b/arch/x86/kernel/acpi/cstate.c
> @@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
>  }
>  EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
>  
> -/*
> - * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
> - * which can obviate IPI to trigger checking of need_resched.
> - * We execute MONITOR against need_resched and enter optimized wait state
> - * through MWAIT. Whenever someone changes need_resched, we would be woken
> - * up from MWAIT (without an IPI).
> - *
> - * New with Core Duo processors, MWAIT can take some hints based on CPU
> - * capability.
> - */
> -void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
> -{
> -	if (!need_resched()) {
> -		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> -			clflush((void *)&current_thread_info()->flags);
> -
> -		__monitor((void *)&current_thread_info()->flags, 0, 0);
> -		smp_mb();
> -		if (!need_resched())
> -			__mwait(ax, cx);
> -	}
> -}
> -
>  void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>  {
>  	unsigned int cpu = smp_processor_id();
> diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
> index fc6008fbce35..509452a62f96 100644
> --- a/drivers/acpi/acpi_pad.c
> +++ b/drivers/acpi/acpi_pad.c
> @@ -193,10 +193,7 @@ static int power_saving_thread(void *data)
>  					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
>  			stop_critical_timings();
>  
> -			__monitor((void *)&current_thread_info()->flags, 0, 0);
> -			smp_mb();
> -			if (!need_resched())
> -				__mwait(power_saving_mwait_eax, 1);
> +			mwait_idle_with_hints(power_saving_mwait_eax, 1);
>  
>  			start_critical_timings();
>  			if (lapic_marked_unstable)
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 644516d9bde6..f90c56c8379e 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
>  	if (unlikely(!pr))
>  		return -EINVAL;
>  
> -	if (cx->entry_method == ACPI_CSTATE_FFH) {
> -		if (current_set_polling_and_test())
> -			return -EINVAL;
> -	}
> -
>  	lapic_timer_state_broadcast(pr, cx, 1);
>  	acpi_idle_do_entry(cx);
>  
> @@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
>  	if (unlikely(!pr))
>  		return -EINVAL;
>  
> -	if (cx->entry_method == ACPI_CSTATE_FFH) {
> -		if (current_set_polling_and_test())
> -			return -EINVAL;
> -	}
> -
>  	/*
>  	 * Must be done before busmaster disable as we might need to
>  	 * access HPET !
> @@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
>  		}
>  	}
>  
> -	if (cx->entry_method == ACPI_CSTATE_FFH) {
> -		if (current_set_polling_and_test())
> -			return -EINVAL;
> -	}
> -
>  	acpi_unlazy_tlb(smp_processor_id());
>  
>  	/* Tell the scheduler that we are going deep-idle: */
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 3226ce98fb18..3b56d76a5bca 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_device *dev,
>  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
>  
> -	if (!current_set_polling_and_test()) {
> -
> -		__monitor((void *)&current_thread_info()->flags, 0, 0);
> -		smp_mb();
> -		if (!need_resched())
> -			__mwait(eax, ecx);
> -	}
> +	mwait_idle_with_hints(eax, ecx);
>  
>  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 8f181b3f842b..e8275f2df9af 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
>  			 */
>  			local_touch_nmi();
>  			stop_critical_timings();
> -			__monitor((void *)&current_thread_info()->flags, 0, 0);
> -			cpu_relax(); /* allow HT sibling to run */
> -			__mwait(eax, ecx);
> +			mwait_idle_with_hints(eax, ecx);
>  			start_critical_timings();
>  			atomic_inc(&idle_wakeup_counter);
>  		}
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
  2013-11-19 13:06     ` Rafael J. Wysocki
@ 2013-11-19 13:21     ` Mike Galbraith
  2013-11-19 14:22     ` Arjan van de Ven
  2 siblings, 0 replies; 22+ messages in thread
From: Mike Galbraith @ 2013-11-19 13:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	arjan, jacob.jun.pan, Ingo Molnar, Thomas Gleixner, hpa

On Tue, 2013-11-19 at 12:31 +0100, Peter Zijlstra wrote: 
> People seem to delight in writing wrong and broken mwait idle routines;
> collapse the lot.
> 
> This leaves mwait_play_dead() the sole remaining user of __mwait() and
> new __mwait() users are probably doing it wrong.
> 
> Also remove __sti_mwait() as it's unused.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
> 
> Mike, does this cure your core2?

Nope.  Maybe an acpi/bios thingy on this box since diags that fired on
lappy during boot did not fire on desktop, and both are core2 booting
same kernel.  I kinda suspect I'll either be stuck with default_idle or
have to resurrect mwait_idle and carry it locally if I want the thing to
work well.  Guess I'll find out if/when I have time to squabble with it.

Meanwhile, desktop box works fine modulo benchmarks, lappy works fine
too, modulo max_cstate=1 scaring mwait_idle_with_hints away, which I
don't care about much.  Neither box is wonderful at rt testing where I
usually boot max_cstate=1, both just became a bit _less_ wonderful :)

-Mike


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
  2013-11-19 13:06     ` Rafael J. Wysocki
  2013-11-19 13:21     ` Mike Galbraith
@ 2013-11-19 14:22     ` Arjan van de Ven
  2013-11-19 14:46       ` Peter Zijlstra
  2013-11-19 14:51       ` Peter Zijlstra
  2 siblings, 2 replies; 22+ messages in thread
From: Arjan van de Ven @ 2013-11-19 14:22 UTC (permalink / raw)
  To: Peter Zijlstra, lenb, rjw, linux-acpi, linux-kernel
  Cc: shaohua.li, rui.zhang, jacob.jun.pan, Mike Galbraith,
	Ingo Molnar, Thomas Gleixner, hpa

> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 8f181b3f842b..e8275f2df9af 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
>   			 */
>   			local_touch_nmi();
>   			stop_critical_timings();
> -			__monitor((void *)&current_thread_info()->flags, 0, 0);
> -			cpu_relax(); /* allow HT sibling to run */
> -			__mwait(eax, ecx);
> +			mwait_idle_with_hints(eax, ecx);
>   			start_critical_timings();
>   			atomic_inc(&idle_wakeup_counter);
>   		}
>

hmm I take it that mwait_idle_with_hints is the one that also checks need_resched() ?
if so... powerclamp may not want to use that (the whole point is to NOT give the cpu
to tasks!)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 14:22     ` Arjan van de Ven
@ 2013-11-19 14:46       ` Peter Zijlstra
  2013-11-19 14:51       ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19 14:46 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Tue, Nov 19, 2013 at 06:22:43AM -0800, Arjan van de Ven wrote:
> >diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> >index 8f181b3f842b..e8275f2df9af 100644
> >--- a/drivers/thermal/intel_powerclamp.c
> >+++ b/drivers/thermal/intel_powerclamp.c
> >@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
> >  			 */
> >  			local_touch_nmi();
> >  			stop_critical_timings();
> >-			__monitor((void *)&current_thread_info()->flags, 0, 0);
> >-			cpu_relax(); /* allow HT sibling to run */
> >-			__mwait(eax, ecx);
> >+			mwait_idle_with_hints(eax, ecx);
> >  			start_critical_timings();
> >  			atomic_inc(&idle_wakeup_counter);
> >  		}
> >
> 
> hmm I take it that mwait_idle_with_hints is the one that also checks need_resched() ?
> if so... powerclamp may not want to use that (the whole point is to NOT give the cpu
> to tasks!)

Of course, you're very much not getting something that wins from
stop_machine or a fifo-99 task.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 14:22     ` Arjan van de Ven
  2013-11-19 14:46       ` Peter Zijlstra
@ 2013-11-19 14:51       ` Peter Zijlstra
  2013-11-19 15:13         ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19 14:51 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Tue, Nov 19, 2013 at 06:22:43AM -0800, Arjan van de Ven wrote:
> >diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> >index 8f181b3f842b..e8275f2df9af 100644
> >--- a/drivers/thermal/intel_powerclamp.c
> >+++ b/drivers/thermal/intel_powerclamp.c
> >@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
> >  			 */
> >  			local_touch_nmi();
> >  			stop_critical_timings();
> >-			__monitor((void *)&current_thread_info()->flags, 0, 0);
> >-			cpu_relax(); /* allow HT sibling to run */
> >-			__mwait(eax, ecx);
> >+			mwait_idle_with_hints(eax, ecx);
> >  			start_critical_timings();
> >  			atomic_inc(&idle_wakeup_counter);
> >  		}
> >
> 
> hmm I take it that mwait_idle_with_hints is the one that also checks need_resched() ?
> if so... powerclamp may not want to use that (the whole point is to NOT give the cpu
> to tasks!)

That said, that driver is completely wrecked. It uses
preempt_enable_no_resched() wrong too, and it has uncommented barriers...

Dude, wtf are you guys smoking?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 14:51       ` Peter Zijlstra
@ 2013-11-19 15:13         ` Peter Zijlstra
  2013-11-19 21:06           ` Jacob Pan
                             ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-19 15:13 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	jacob.jun.pan, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Tue, Nov 19, 2013 at 03:51:43PM +0100, Peter Zijlstra wrote:
> That said, that driver is completely wrecked. It uses
> preempt_enable_no_resched() wrong too, and it has uncommented barriers...
> 
> Dude, wtf are you guys smoking?

---
Subject: sched: Take away preempt_enable_no_resched() and friends from modules

There is no way in hell modules are going to play preemption tricks like
this.

Cc: eliezer.tamir@linux.intel.com
Cc: arjan@linux.intel.com
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/thermal/intel_powerclamp.c |  2 +-
 include/linux/preempt.h            |  8 +++++++-
 include/net/busy_poll.h            | 15 +++------------
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3f842b..0a12ddc2eb4c 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -445,7 +445,7 @@ static int clamp_thread(void *arg)
 			atomic_inc(&idle_wakeup_counter);
 		}
 		tick_nohz_idle_exit();
-		preempt_enable_no_resched();
+		preempt_enable();
 	}
 	del_timer_sync(&wakeup_timer);
 	clear_bit(cpunr, cpu_clamping_mask);
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index a3d9dc8c2c00..3ed2b5335ab4 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -64,7 +64,7 @@ do { \
 } while (0)
 
 #else
-#define preempt_enable() preempt_enable_no_resched()
+#define preempt_enable() sched_preempt_enable_no_resched()
 #define preempt_check_resched() do { } while (0)
 #endif
 
@@ -116,6 +116,12 @@ do { \
 
 #endif /* CONFIG_PREEMPT_COUNT */
 
+#ifdef MODULE
+#undef preempt_enable_no_resched
+#undef preempt_enable_no_resched_notrace
+#undef preempt_check_resched
+#endif
+
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 
 struct preempt_notifier;
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 829627d7b846..756827a86c2d 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,18 @@ static inline bool net_busy_loop_on(void)
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
 static inline u64 busy_loop_us_clock(void)
 {
 	u64 rc;
 
+	/* XXX with interrupts enabled sched_clock() can return utter garbage */
+
 	preempt_disable_notrace();
 	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
+	preempt_enable_notrace();
 
 	return rc >> 10;
 }
-#else /* CONFIG_DEBUG_PREEMPT */
-static inline u64 busy_loop_us_clock(void)
-{
-	return sched_clock() >> 10;
-}
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 15:13         ` Peter Zijlstra
@ 2013-11-19 21:06           ` Jacob Pan
  2013-11-20 10:28             ` Peter Zijlstra
  2014-01-12 18:44           ` [tip:sched/idle] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
                             ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Jacob Pan @ 2013-11-19 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, lenb, rjw, linux-acpi, linux-kernel,
	shaohua.li, rui.zhang, Mike Galbraith, Ingo Molnar,
	Thomas Gleixner, hpa

On Tue, 19 Nov 2013 16:13:38 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Nov 19, 2013 at 03:51:43PM +0100, Peter Zijlstra wrote:
> > That said, that driver is completely wrecked. It uses
> > preempt_enable_no_resched() wrong too, and it has uncommented barriers...
> > 
> > Dude, wtf are you guys smoking?
> 
I applied this patch on top of upstream kernel (801a760) and found out
my machine completely failed to enter idle when nothing is running.
turbostat shows 100% C0. ftrace shows kernel coming in and out of idle
frequently.

Both ACPI idle and intel_idle behaves the same way. I have to do the
following change to allow entering C-states again.

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 80014da..b51d1e1 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -49,10 +49,8 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
                clflush((void *)&current_thread_info()->flags);
 
        __monitor((void *)&current_thread_info()->flags, 0, 0);
-       if (!current_set_polling_and_test())
+        if (!need_resched())
                __mwait(eax, ecx);
-
-       __current_clr_polling();
 }
 
 #endif /* _ASM_X86_MWAIT_H */

Did I miss any other patches?

Jacob

> ---
> Subject: sched: Take away preempt_enable_no_resched() and friends from modules
> 
> There is no way in hell modules are going to play preemption tricks
> like this.
> 
> Cc: eliezer.tamir@linux.intel.com
> Cc: arjan@linux.intel.com
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  drivers/thermal/intel_powerclamp.c |  2 +-
>  include/linux/preempt.h            |  8 +++++++-
>  include/net/busy_poll.h            | 15 +++------------
>  3 files changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 8f181b3f842b..0a12ddc2eb4c 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -445,7 +445,7 @@ static int clamp_thread(void *arg)
>  			atomic_inc(&idle_wakeup_counter);
>  		}
>  		tick_nohz_idle_exit();
> -		preempt_enable_no_resched();
> +		preempt_enable();
>  	}
>  	del_timer_sync(&wakeup_timer);
>  	clear_bit(cpunr, cpu_clamping_mask);
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index a3d9dc8c2c00..3ed2b5335ab4 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -64,7 +64,7 @@ do { \
>  } while (0)
>  
>  #else
> -#define preempt_enable() preempt_enable_no_resched()
> +#define preempt_enable() sched_preempt_enable_no_resched()
>  #define preempt_check_resched() do { } while (0)
>  #endif
>  
> @@ -116,6 +116,12 @@ do { \
>  
>  #endif /* CONFIG_PREEMPT_COUNT */
>  
> +#ifdef MODULE
> +#undef preempt_enable_no_resched
> +#undef preempt_enable_no_resched_notrace
> +#undef preempt_check_resched
> +#endif
> +
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  
>  struct preempt_notifier;
> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> index 829627d7b846..756827a86c2d 100644
> --- a/include/net/busy_poll.h
> +++ b/include/net/busy_poll.h
> @@ -42,27 +42,18 @@ static inline bool net_busy_loop_on(void)
>  	return sysctl_net_busy_poll;
>  }
>  
> -/* a wrapper to make debug_smp_processor_id() happy
> - * we can use sched_clock() because we don't care much about
> precision
> - * we only care that the average is bounded
> - */
> -#ifdef CONFIG_DEBUG_PREEMPT
>  static inline u64 busy_loop_us_clock(void)
>  {
>  	u64 rc;
>  
> +	/* XXX with interrupts enabled sched_clock() can return utter garbage */
> +
>  	preempt_disable_notrace();
>  	rc = sched_clock();
> -	preempt_enable_no_resched_notrace();
> +	preempt_enable_notrace();
>  
>  	return rc >> 10;
>  }
> -#else /* CONFIG_DEBUG_PREEMPT */
> -static inline u64 busy_loop_us_clock(void)
> -{
> -	return sched_clock() >> 10;
> -}
> -#endif /* CONFIG_DEBUG_PREEMPT */
>  
>  static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
>  {
> 

[Jacob Pan]

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-19 21:06           ` Jacob Pan
@ 2013-11-20 10:28             ` Peter Zijlstra
  2013-11-20 10:58               ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-20 10:28 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, lenb, rjw, linux-acpi, linux-kernel,
	shaohua.li, rui.zhang, Mike Galbraith, Ingo Molnar,
	Thomas Gleixner, hpa

On Tue, Nov 19, 2013 at 01:06:30PM -0800, Jacob Pan wrote:
> On Tue, 19 Nov 2013 16:13:38 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, Nov 19, 2013 at 03:51:43PM +0100, Peter Zijlstra wrote:
> > > That said, that driver is completely wrecked. It uses
> > > preempt_enable_no_resched() wrong too, and it has uncommented barriers...
> > > 
> > > Dude, wtf are you guys smoking?
> > 
> I applied this patch on top of upstream kernel (801a760) and found out
> my machine completely failed to enter idle when nothing is running.
> turbostat shows 100% C0. ftrace shows kernel coming in and out of idle
> frequently.
> 
> Both ACPI idle and intel_idle behaves the same way. I have to do the
> following change to allow entering C-states again.
> 
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 80014da..b51d1e1 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -49,10 +49,8 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
>                 clflush((void *)&current_thread_info()->flags);
> 
>         __monitor((void *)&current_thread_info()->flags, 0, 0);
> -       if (!current_set_polling_and_test())
> +        if (!need_resched())
>                 __mwait(eax, ecx);
> -
> -       __current_clr_polling();
>  }
>  
>  #endif /* _ASM_X86_MWAIT_H */

That doesn't make any sense; current_set_polling_and_test() returns the
same thing need_resched() does.
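
For reference, current_set_polling_and_test() at the time boiled down to
roughly this (a from-memory sketch; note that the polling store goes to the
task's thread_info, which becomes relevant below):

	static inline bool current_set_polling_and_test_sketch(void)
	{
		__current_set_polling();	/* store: advertise polling mode */
		/*
		 * The polling store must be visible before we read
		 * NEED_RESCHED; pairs with the barrier on the wakeup side.
		 */
		smp_mb();
		return unlikely(tif_need_resched());
	}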

But you're right, intel_idle resides 100% in C0 and acpi_idle has 100%
C1 residency... most weird.

/me goes prod at it

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 10:28             ` Peter Zijlstra
@ 2013-11-20 10:58               ` Peter Zijlstra
  2013-11-20 16:24                 ` Arjan van de Ven
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-20 10:58 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Arjan van de Ven, lenb, rjw, linux-acpi, linux-kernel,
	shaohua.li, rui.zhang, Mike Galbraith, Ingo Molnar,
	Thomas Gleixner, hpa

On Wed, Nov 20, 2013 at 11:28:03AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 19, 2013 at 01:06:30PM -0800, Jacob Pan wrote:
> > I applied this patch on top of upstream kernel (801a760) and found out
> > my machine completely failed to enter idle when nothing is running.
> > turbostat shows 100% C0. ftrace shows kernel coming in and out of idle
> > frequently.
> > 
> > Both ACPI idle and intel_idle behaves the same way. I have to do the
> > following change to allow entering C-states again.

> That doesn't make any sense; current_set_polling_and_test() returns the
> same thing need_resched() does.
> 
> But you're right, intel_idle resides 100% in C0 and acpi_idle has 100%
> C1 residency... most weird.

So pretty silly actually; you cannot do a store (any store) in between
monitor and mwait.
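
Concretely, a minimal sketch of the two orderings (the second is what the
updated patch below does):

	/* broken: current_set_polling_and_test() stores to thread_info after
	 * MONITOR has armed that cacheline, so MWAIT returns immediately */
	__monitor((void *)&current_thread_info()->flags, 0, 0);
	if (!current_set_polling_and_test())
		__mwait(eax, ecx);

	/* fixed: do the polling store first, then arm the monitor */
	if (!current_set_polling_and_test()) {
		__monitor((void *)&current_thread_info()->flags, 0, 0);
		if (!need_resched())
			__mwait(eax, ecx);
	}
	__current_clr_polling();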

The below version seems to work properly again with both acpi_idle and
intel_idle.

Now to go make that preempt_enable_no_resched() cleanup compile... :-)

---
Subject: x86, acpi, idle: Restructure the mwait idle routines
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 19 Nov 2013 12:31:53 +0100

People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.

This leaves mwait_play_dead() the sole remaining user of __mwait() and
new __mwait() users are probably doing it wrong.

Also remove __sti_mwait() as it's unused.

Cc: arjan@linux.intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: lenb@kernel.org
Cc: shaohua.li@intel.com
Cc: rui.zhang@intel.com
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/mwait.h       |   40 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   |   23 ---------------------
 arch/x86/kernel/acpi/cstate.c      |   23 ---------------------
 drivers/acpi/acpi_pad.c            |    5 ----
 drivers/acpi/processor_idle.c      |   15 -------------
 drivers/idle/intel_idle.c          |    8 -------
 drivers/thermal/intel_powerclamp.c |    4 ---
 7 files changed, 43 insertions(+), 75 deletions(-)

--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_X86_MWAIT_H
 #define _ASM_X86_MWAIT_H
 
+#include <linux/sched.h>
+
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
 #define MWAIT_SUBSTATE_SIZE		4
@@ -13,4 +15,42 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
+{
+	if (!current_set_polling_and_test()) {
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
+		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		if (!need_resched())
+			__mwait(eax, ecx);
+	}
+	__current_clr_polling();
+}
+
 #endif /* _ASM_X86_MWAIT_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -700,29 +700,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsi
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
-	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(ax, cx);
-	}
-}
-
 void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 {
 	unsigned int cpu = smp_processor_id();
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,10 +193,7 @@ static int power_saving_thread(void *dat
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
-			if (!need_resched())
-				__mwait(power_saving_mwait_eax, 1);
+			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
 			start_critical_timings();
 			if (lapic_marked_unstable)
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpu
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	lapic_timer_state_broadcast(pr, cx, 1);
 	acpi_idle_do_entry(cx);
 
@@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	/*
 	 * Must be done before busmaster disable as we might need to
 	 * access HPET !
@@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpu
 		}
 	}
 
-	if (cx->entry_method == ACPI_CSTATE_FFH) {
-		if (current_set_polling_and_test())
-			return -EINVAL;
-	}
-
 	acpi_unlazy_tlb(smp_processor_id());
 
 	/* Tell the scheduler that we are going deep-idle: */
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_dev
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	if (!current_set_polling_and_test()) {
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(eax, ecx);
-	}
+	mwait_idle_with_hints(eax, ecx);
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			cpu_relax(); /* allow HT sibling to run */
-			__mwait(eax, ecx);
+			mwait_idle_with_hints(eax, ecx);
 			start_critical_timings();
 			atomic_inc(&idle_wakeup_counter);
 		}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 10:58               ` Peter Zijlstra
@ 2013-11-20 16:24                 ` Arjan van de Ven
  2013-11-20 16:33                   ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Arjan van de Ven @ 2013-11-20 16:24 UTC (permalink / raw)
  To: Peter Zijlstra, Jacob Pan
  Cc: lenb, rjw, linux-acpi, linux-kernel, shaohua.li, rui.zhang,
	Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On 11/20/2013 2:58 AM, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 11:28:03AM +0100, Peter Zijlstra wrote:
>> On Tue, Nov 19, 2013 at 01:06:30PM -0800, Jacob Pan wrote:
>>> I applied this patch on top of upstream kernel (801a760) and found out
>>> my machine completely failed to enter idle when nothing is running.
>>> turbostate shows 100% C0. ftrace shows kernel coming in and out of idle
>>> frequently.
>>>
>>> Both ACPI idle and intel_idle behaves the same way. I have to do the
>>> following change to allow entering C-states again.
>
>> That doesn't make any sense; current_set_polling_and_test() returns the
>> same thing need_resched() does.
>>
>> But you're right, intel_idle resides 100% in C0 and acpi_idle has 100%
>> C1 residency... most weird.
>
> So pretty silly actually; you cannot do a store (any store) in between
> monitor and mwait.

you can
just not to the cacheline you are watching (or things that alias with that)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 16:24                 ` Arjan van de Ven
@ 2013-11-20 16:33                   ` Peter Zijlstra
  2013-11-20 16:38                     ` Arjan van de Ven
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-20 16:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jacob Pan, lenb, rjw, linux-acpi, linux-kernel, shaohua.li,
	rui.zhang, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On Wed, Nov 20, 2013 at 08:24:47AM -0800, Arjan van de Ven wrote:
> On 11/20/2013 2:58 AM, Peter Zijlstra wrote:
> >So pretty silly actually; you cannot do a store (any store) in between
> >monitor and mwait.
> 
> you can
> just not to the cacheline you are watching (or things that alias with that)

Ah indeed, the thread_info::status is in the same cacheline.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] x86, acpi, idle: Restructure the mwait idle routines
  2013-11-20 16:33                   ` Peter Zijlstra
@ 2013-11-20 16:38                     ` Arjan van de Ven
  0 siblings, 0 replies; 22+ messages in thread
From: Arjan van de Ven @ 2013-11-20 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jacob Pan, lenb, rjw, linux-acpi, linux-kernel, shaohua.li,
	rui.zhang, Mike Galbraith, Ingo Molnar, Thomas Gleixner, hpa

On 11/20/2013 8:33 AM, Peter Zijlstra wrote:
> On Wed, Nov 20, 2013 at 08:24:47AM -0800, Arjan van de Ven wrote:
>> On 11/20/2013 2:58 AM, Peter Zijlstra wrote:
>>> So pretty silly actually; you cannot do a store (any store) in between
>>> monitor and mwait.
>>
>> you can
>> just not to the cacheline you are watching (or things that alias with that)
>
> Ah indeed, the thread_info::status is in the same cacheline.
>

you told it to wake on any write to that cacheline
and then you write to it ;-)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:sched/idle] sched/preempt, locking: Rework local_bh_{dis, en}able()
  2013-11-19 15:13         ` Peter Zijlstra
  2013-11-19 21:06           ` Jacob Pan
@ 2014-01-12 18:44           ` tip-bot for Peter Zijlstra
  2014-01-12 18:44           ` [tip:sched/idle] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
                             ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-12 18:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, arjan, peterz, bitbucket, tglx

Commit-ID:  c795eb55e74041a4b1e46a82bb07d8bd513fb066
Gitweb:     http://git.kernel.org/tip/c795eb55e74041a4b1e46a82bb07d8bd513fb066
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 12 Jan 2014 10:48:24 +0100

sched/preempt, locking: Rework local_bh_{dis,en}able()

Currently local_bh_disable() is out-of-line for no apparent reason.
So inline it to save a few cycles on call/return nonsense; the
function body is a single add on x86 (a few loads and a store extra on
load/store archs).

Also expose two new local_bh functions:

  __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);

Which implement the actual local_bh_{dis,en}able() behaviour.

The next patch uses the exposed @cnt argument to optimize bh lock
functions.
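
As an illustration of what the @cnt argument buys (a hedged sketch only; the
actual follow-up patch may differ), a bh-lock can fold its softirq-disable and
preempt-disable increments into a single preempt_count addition:

	static inline void raw_spin_lock_bh_sketch(raw_spinlock_t *lock)
	{
		/* one preempt_count add covers softirq-disable plus the
		 * lock's implicit preempt-disable */
		__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
		spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
		LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
	}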

With build fixes from Jacob Pan.

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/bottom_half.h  | 32 +++++++++++++++++++++++++++++---
 include/linux/hardirq.h      |  1 +
 include/linux/preempt_mask.h |  1 -
 kernel/softirq.c             | 35 ++++++-----------------------------
 4 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/include/linux/bottom_half.h b/include/linux/bottom_half.h
index 27b1bcf..86c12c9 100644
--- a/include/linux/bottom_half.h
+++ b/include/linux/bottom_half.h
@@ -1,9 +1,35 @@
 #ifndef _LINUX_BH_H
 #define _LINUX_BH_H
 
-extern void local_bh_disable(void);
+#include <linux/preempt.h>
+#include <linux/preempt_mask.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
+#else
+static __always_inline void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
+{
+	preempt_count_add(cnt);
+	barrier();
+}
+#endif
+
+static inline void local_bh_disable(void)
+{
+	__local_bh_disable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
+
 extern void _local_bh_enable(void);
-extern void local_bh_enable(void);
-extern void local_bh_enable_ip(unsigned long ip);
+extern void __local_bh_enable_ip(unsigned long ip, unsigned int cnt);
+
+static inline void local_bh_enable_ip(unsigned long ip)
+{
+	__local_bh_enable_ip(ip, SOFTIRQ_DISABLE_OFFSET);
+}
+
+static inline void local_bh_enable(void)
+{
+	__local_bh_enable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
 
 #endif /* _LINUX_BH_H */
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d9cf963..12d5f97 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <asm/hardirq.h>
 
 
 extern void synchronize_irq(unsigned int irq);
diff --git a/include/linux/preempt_mask.h b/include/linux/preempt_mask.h
index d169820..30820ec 100644
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -2,7 +2,6 @@
 #define LINUX_PREEMPT_MASK_H
 
 #include <linux/preempt.h>
-#include <asm/hardirq.h>
 
 /*
  * We put the hardirq and softirq counter into the preemption
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 9a4500e..2a4b5f4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -89,7 +89,7 @@ static void wakeup_softirqd(void)
  * where hardirqs are disabled legitimately:
  */
 #ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip, unsigned int cnt)
+void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 {
 	unsigned long flags;
 
@@ -114,21 +114,9 @@ static void __local_bh_disable(unsigned long ip, unsigned int cnt)
 	if (preempt_count() == cnt)
 		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
-#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
-{
-	preempt_count_add(cnt);
-	barrier();
-}
+EXPORT_SYMBOL(__local_bh_disable_ip);
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
-void local_bh_disable(void)
-{
-	__local_bh_disable(_RET_IP_, SOFTIRQ_DISABLE_OFFSET);
-}
-
-EXPORT_SYMBOL(local_bh_disable);
-
 static void __local_bh_enable(unsigned int cnt)
 {
 	WARN_ON_ONCE(!irqs_disabled());
@@ -151,7 +139,7 @@ void _local_bh_enable(void)
 
 EXPORT_SYMBOL(_local_bh_enable);
 
-static inline void _local_bh_enable_ip(unsigned long ip)
+void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 {
 	WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -166,7 +154,7 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 	 * Keep preemption disabled until we are done with
 	 * softirq processing:
  	 */
-	preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);
+	preempt_count_sub(cnt - 1);
 
 	if (unlikely(!in_interrupt() && local_softirq_pending())) {
 		/*
@@ -182,18 +170,7 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 #endif
 	preempt_check_resched();
 }
-
-void local_bh_enable(void)
-{
-	_local_bh_enable_ip(_RET_IP_);
-}
-EXPORT_SYMBOL(local_bh_enable);
-
-void local_bh_enable_ip(unsigned long ip)
-{
-	_local_bh_enable_ip(ip);
-}
-EXPORT_SYMBOL(local_bh_enable_ip);
+EXPORT_SYMBOL(__local_bh_enable_ip);
 
 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
@@ -264,7 +241,7 @@ asmlinkage void __do_softirq(void)
 	pending = local_softirq_pending();
 	account_irq_enter_time(current);
 
-	__local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET);
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
 
 	cpu = smp_processor_id();

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/idle] sched, net: Clean up preempt_enable_no_resched() abuse
  2013-11-19 15:13         ` Peter Zijlstra
  2013-11-19 21:06           ` Jacob Pan
  2014-01-12 18:44           ` [tip:sched/idle] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
@ 2014-01-12 18:44           ` tip-bot for Peter Zijlstra
  2014-01-12 18:44           ` [tip:sched/idle] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
                             ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-12 18:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket,
	davem, eliezer.tamir, akpm, tglx

Commit-ID:  3a5cd0b09443edb2f960497928de047e9c8ab848
Gitweb:     http://git.kernel.org/tip/3a5cd0b09443edb2f960497928de047e9c8ab848
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 12 Jan 2014 14:50:35 +0100

sched, net: Clean up preempt_enable_no_resched() abuse

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement due to known more preempt_count 'refs'.
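
In other words (illustrative only; do_other_work() is a stand-in):

	/* valid: the preemption point follows immediately */
	preempt_enable_no_resched();
	schedule();

	/* broken: a wakeup that set NEED_RESCHED while preemption was off
	 * is now silently delayed until some later preemption point */
	preempt_enable_no_resched();
	do_other_work();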

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 net/ipv4/tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c4638e6..82de786 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
 		    !sysctl_tcp_low_latency &&
 		    net_dma_find_channel()) {
-			preempt_enable_no_resched();
+			preempt_enable();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
 		} else {
-			preempt_enable_no_resched();
+			preempt_enable();
 		}
 	}
 #endif

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/idle] sched, net: Fixup busy_loop_us_clock()
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (2 preceding siblings ...)
  2014-01-12 18:44           ` [tip:sched/idle] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
@ 2014-01-12 18:44           ` tip-bot for Peter Zijlstra
  2014-01-13 15:56           ` [tip:sched/core] locking: Optimize lock_bh functions tip-bot for Peter Zijlstra
                             ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-12 18:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket,
	davem, eliezer.tamir, akpm, tglx

Commit-ID:  29fc2f66b552770909591d89ee87ab8690d67e61
Gitweb:     http://git.kernel.org/tip/29fc2f66b552770909591d89ee87ab8690d67e61
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 12 Jan 2014 14:50:37 +0100

sched, net: Fixup busy_loop_us_clock()

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule(), or if we know preemption cannot actually be enabled
by that statement because of other preempt_count 'refs' known to be held.

This busy_poll stuff looks to be completely and utterly broken:
sched_clock() can return utter garbage with interrupts enabled (rare,
but still) and it can drift unbounded between CPUs.

This means that if you get preempted/migrated and your new CPU is
years behind the previous CPU, we get to busy spin for a _very_ long
time.

There is a _REASON_ sched_clock() warns about preemptability -
papering over it with a preempt_disable()/preempt_enable_no_resched()
is just terminal brain damage on so many levels.

Replace sched_clock() usage with local_clock() which has a bounded
drift between CPUs (<2 jiffies).

There is a further problem with the entire busy wait poll thing in
that the spin time is additive to the syscall timeout, not inclusive.
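
As a sketch of the failure mode (simplified, not the actual busy-poll
code; packet_ready() and busy_poll_usecs are placeholders):

  u64 end_time = busy_loop_us_clock() + busy_poll_usecs;  /* read on CPU A */

  /* ... preempted and migrated to CPU B ... */

  while (!packet_ready() && busy_loop_us_clock() < end_time)
          cpu_relax();

If CPU B's sched_clock() lags far behind CPU A's, end_time is
effectively unreachable and the loop spins way past the intended
budget; local_clock() bounds that cross-CPU error to under 2 jiffies.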

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/net/busy_poll.h | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 829627d..1d67fb6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,10 @@ static inline bool net_busy_loop_on(void)
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
-static inline u64 busy_loop_us_clock(void)
-{
-	u64 rc;
-
-	preempt_disable_notrace();
-	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
-
-	return rc >> 10;
-}
-#else /* CONFIG_DEBUG_PREEMPT */
 static inline u64 busy_loop_us_clock(void)
 {
-	return sched_clock() >> 10;
+	return local_clock() >> 10;
 }
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] locking: Optimize lock_bh functions
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (3 preceding siblings ...)
  2014-01-12 18:44           ` [tip:sched/idle] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
@ 2014-01-13 15:56           ` tip-bot for Peter Zijlstra
  2014-01-13 16:42           ` [tip:sched/core] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
                             ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 15:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket, akpm, tglx

Commit-ID:  9ea4c380066fbe23fe0da7f4abfabc444f2467f4
Gitweb:     http://git.kernel.org/tip/9ea4c380066fbe23fe0da7f4abfabc444f2467f4
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:47:36 +0100

locking: Optimize lock_bh functions

Currently all _bh_ lock functions do two preempt_count operations:

  local_bh_disable();
  preempt_disable();

and for the unlock:

  preempt_enable_no_resched();
  local_bh_enable();

Since it's a waste of perfectly good cycles to modify the same variable
twice when you can do it in one go, use the new
__local_bh_{dis,en}able_ip() functions that allow us to provide a
preempt_count value to add/sub.

So define SOFTIRQ_LOCK_OFFSET as the offset a _bh_ lock needs to
add/sub to be done in one go.

As a bonus it gets rid of the preempt_enable_no_resched() usage.
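
In other words (sketch only, the real definitions are in the patch
below), the lock side goes from two preempt_count updates to one:

  /* before: two separate preempt_count additions */
  local_bh_disable();     /* + SOFTIRQ_DISABLE_OFFSET */
  preempt_disable();      /* + PREEMPT_CHECK_OFFSET   */

  /* after: one combined addition */
  __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);

with SOFTIRQ_LOCK_OFFSET = SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET,
and the unlock side similarly does a single preempt_count_sub().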

This reduces 1000 loops of:

  spin_lock_bh(&bh_lock);
  spin_unlock_bh(&bh_lock);

from 53596 cycles to 51995 cycles. I didn't do enough measurements to
say for absolutely sure that the result is significant, but the few
runs I did for each suggest it is.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/preempt_mask.h     | 15 +++++++++++++++
 include/linux/rwlock_api_smp.h   | 12 ++++--------
 include/linux/spinlock_api_smp.h | 12 ++++--------
 include/linux/spinlock_api_up.h  | 16 +++++++++++-----
 kernel/softirq.c                 |  4 ++--
 5 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/include/linux/preempt_mask.h b/include/linux/preempt_mask.h
index d169820..b8d96bc 100644
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -79,6 +79,21 @@
 #endif
 
 /*
+ * The preempt_count offset needed for things like:
+ *
+ *  spin_lock_bh()
+ *
+ * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and
+ * softirqs, such that unlock sequences of:
+ *
+ *  spin_unlock();
+ *  local_bh_enable();
+ *
+ * Work as expected.
+ */
+#define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET)
+
+/*
  * Are we running in atomic context?  WARNING: this macro cannot
  * always detect atomic context; in particular, it cannot know about
  * held spinlocks in non-preemptible kernels.  Thus it should not be
diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h
index 9c9f049..5b9b84b 100644
--- a/include/linux/rwlock_api_smp.h
+++ b/include/linux/rwlock_api_smp.h
@@ -172,8 +172,7 @@ static inline void __raw_read_lock_irq(rwlock_t *lock)
 
 static inline void __raw_read_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_read_trylock, do_raw_read_lock);
 }
@@ -200,8 +199,7 @@ static inline void __raw_write_lock_irq(rwlock_t *lock)
 
 static inline void __raw_write_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
 }
@@ -250,8 +248,7 @@ static inline void __raw_read_unlock_bh(rwlock_t *lock)
 {
 	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_read_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
@@ -275,8 +272,7 @@ static inline void __raw_write_unlock_bh(rwlock_t *lock)
 {
 	rwlock_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_write_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 #endif /* __LINUX_RWLOCK_API_SMP_H */
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index bdb9993..42dfab8 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -131,8 +131,7 @@ static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
 
 static inline void __raw_spin_lock_bh(raw_spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
 }
@@ -174,20 +173,17 @@ static inline void __raw_spin_unlock_bh(raw_spinlock_t *lock)
 {
 	spin_release(&lock->dep_map, 1, _RET_IP_);
 	do_raw_spin_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 }
 
 static inline int __raw_spin_trylock_bh(raw_spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	if (do_raw_spin_trylock(lock)) {
 		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 		return 1;
 	}
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
 	return 0;
 }
 
diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h
index af1f472..d0d1888 100644
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -24,11 +24,14 @@
  * flags straight, to suppress compiler warnings of unused lock
  * variables, and to add the proper checker annotations:
  */
+#define ___LOCK(lock) \
+  do { __acquire(lock); (void)(lock); } while (0)
+
 #define __LOCK(lock) \
-  do { preempt_disable(); __acquire(lock); (void)(lock); } while (0)
+  do { preempt_disable(); ___LOCK(lock); } while (0)
 
 #define __LOCK_BH(lock) \
-  do { local_bh_disable(); __LOCK(lock); } while (0)
+  do { __local_bh_disable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); ___LOCK(lock); } while (0)
 
 #define __LOCK_IRQ(lock) \
   do { local_irq_disable(); __LOCK(lock); } while (0)
@@ -36,12 +39,15 @@
 #define __LOCK_IRQSAVE(lock, flags) \
   do { local_irq_save(flags); __LOCK(lock); } while (0)
 
+#define ___UNLOCK(lock) \
+  do { __release(lock); (void)(lock); } while (0)
+
 #define __UNLOCK(lock) \
-  do { preempt_enable(); __release(lock); (void)(lock); } while (0)
+  do { preempt_enable(); ___UNLOCK(lock); } while (0)
 
 #define __UNLOCK_BH(lock) \
-  do { preempt_enable_no_resched(); local_bh_enable(); \
-	  __release(lock); (void)(lock); } while (0)
+  do { __local_bh_enable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); \
+       ___UNLOCK(lock); } while (0)
 
 #define __UNLOCK_IRQ(lock) \
   do { local_irq_enable(); __UNLOCK(lock); } while (0)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 11025cc..7500cce1 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -107,7 +107,7 @@ static void __local_bh_disable(unsigned long ip, unsigned int cnt)
 	/*
 	 * Were softirqs turned off above:
 	 */
-	if (softirq_count() == cnt)
+	if (softirq_count() == (cnt & SOFTIRQ_MASK))
 		trace_softirqs_off(ip);
 	raw_local_irq_restore(flags);
 
@@ -133,7 +133,7 @@ static void __local_bh_enable(unsigned int cnt)
 {
 	WARN_ON_ONCE(!irqs_disabled());
 
-	if (softirq_count() == cnt)
+	if (softirq_count() == (cnt & SOFTIRQ_MASK))
 		trace_softirqs_on(_RET_IP_);
 	preempt_count_sub(cnt);
 }

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] sched/preempt, locking: Rework local_bh_{dis, en}able()
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (4 preceding siblings ...)
  2014-01-13 15:56           ` [tip:sched/core] locking: Optimize lock_bh functions tip-bot for Peter Zijlstra
@ 2014-01-13 16:42           ` tip-bot for Peter Zijlstra
  2014-01-13 16:45           ` [tip:sched/core] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
  2014-01-13 16:45           ` [tip:sched/core] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 16:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, arjan, peterz, bitbucket, tglx

Commit-ID:  0bd3a173d711857fc9f583eb5825386cc08f3948
Gitweb:     http://git.kernel.org/tip/0bd3a173d711857fc9f583eb5825386cc08f3948
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 17:32:27 +0100

sched/preempt, locking: Rework local_bh_{dis,en}able()

Currently local_bh_disable() is out-of-line for no apparent reason.
So inline it to save a few cycles on call/return nonsense; the
function body is a single add on x86 (a few extra loads and stores on
load/store archs).

Also expose two new local_bh functions:

  __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);

Which implement the actual local_bh_{dis,en}able() behaviour.

The next patch uses the exposed @cnt argument to optimize bh lock
functions.

With build fixes from Jacob Pan.

Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/bottom_half.h  | 32 +++++++++++++++++++++++++++++---
 include/linux/hardirq.h      |  1 +
 include/linux/preempt_mask.h |  1 -
 kernel/softirq.c             | 35 ++++++-----------------------------
 4 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/include/linux/bottom_half.h b/include/linux/bottom_half.h
index 27b1bcf..86c12c9 100644
--- a/include/linux/bottom_half.h
+++ b/include/linux/bottom_half.h
@@ -1,9 +1,35 @@
 #ifndef _LINUX_BH_H
 #define _LINUX_BH_H
 
-extern void local_bh_disable(void);
+#include <linux/preempt.h>
+#include <linux/preempt_mask.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
+#else
+static __always_inline void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
+{
+	preempt_count_add(cnt);
+	barrier();
+}
+#endif
+
+static inline void local_bh_disable(void)
+{
+	__local_bh_disable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
+
 extern void _local_bh_enable(void);
-extern void local_bh_enable(void);
-extern void local_bh_enable_ip(unsigned long ip);
+extern void __local_bh_enable_ip(unsigned long ip, unsigned int cnt);
+
+static inline void local_bh_enable_ip(unsigned long ip)
+{
+	__local_bh_enable_ip(ip, SOFTIRQ_DISABLE_OFFSET);
+}
+
+static inline void local_bh_enable(void)
+{
+	__local_bh_enable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
 
 #endif /* _LINUX_BH_H */
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index d9cf963..12d5f97 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <asm/hardirq.h>
 
 
 extern void synchronize_irq(unsigned int irq);
diff --git a/include/linux/preempt_mask.h b/include/linux/preempt_mask.h
index b8d96bc..dbeec4d 100644
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -2,7 +2,6 @@
 #define LINUX_PREEMPT_MASK_H
 
 #include <linux/preempt.h>
-#include <asm/hardirq.h>
 
 /*
  * We put the hardirq and softirq counter into the preemption
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 7500cce1..9e368ef 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -89,7 +89,7 @@ static void wakeup_softirqd(void)
  * where hardirqs are disabled legitimately:
  */
 #ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip, unsigned int cnt)
+void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 {
 	unsigned long flags;
 
@@ -114,21 +114,9 @@ static void __local_bh_disable(unsigned long ip, unsigned int cnt)
 	if (preempt_count() == cnt)
 		trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
 }
-#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
-{
-	preempt_count_add(cnt);
-	barrier();
-}
+EXPORT_SYMBOL(__local_bh_disable_ip);
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
-void local_bh_disable(void)
-{
-	__local_bh_disable(_RET_IP_, SOFTIRQ_DISABLE_OFFSET);
-}
-
-EXPORT_SYMBOL(local_bh_disable);
-
 static void __local_bh_enable(unsigned int cnt)
 {
 	WARN_ON_ONCE(!irqs_disabled());
@@ -151,7 +139,7 @@ void _local_bh_enable(void)
 
 EXPORT_SYMBOL(_local_bh_enable);
 
-static inline void _local_bh_enable_ip(unsigned long ip)
+void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 {
 	WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -166,7 +154,7 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 	 * Keep preemption disabled until we are done with
 	 * softirq processing:
  	 */
-	preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);
+	preempt_count_sub(cnt - 1);
 
 	if (unlikely(!in_interrupt() && local_softirq_pending())) {
 		/*
@@ -182,18 +170,7 @@ static inline void _local_bh_enable_ip(unsigned long ip)
 #endif
 	preempt_check_resched();
 }
-
-void local_bh_enable(void)
-{
-	_local_bh_enable_ip(_RET_IP_);
-}
-EXPORT_SYMBOL(local_bh_enable);
-
-void local_bh_enable_ip(unsigned long ip)
-{
-	_local_bh_enable_ip(ip);
-}
-EXPORT_SYMBOL(local_bh_enable_ip);
+EXPORT_SYMBOL(__local_bh_enable_ip);
 
 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
@@ -230,7 +207,7 @@ asmlinkage void __do_softirq(void)
 	pending = local_softirq_pending();
 	account_irq_enter_time(current);
 
-	__local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET);
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	lockdep_softirq_enter();
 
 	cpu = smp_processor_id();

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] sched, net: Clean up preempt_enable_no_resched() abuse
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (5 preceding siblings ...)
  2014-01-13 16:42           ` [tip:sched/core] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
@ 2014-01-13 16:45           ` tip-bot for Peter Zijlstra
  2014-01-13 16:45           ` [tip:sched/core] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 16:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket,
	davem, eliezer.tamir, akpm, tglx

Commit-ID:  1774e9f3e5c8b38de3b3bc8bd0eacd280f655baf
Gitweb:     http://git.kernel.org/tip/1774e9f3e5c8b38de3b3bc8bd0eacd280f655baf
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 17:39:04 +0100

sched, net: Clean up preempt_enable_no_resched() abuse

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule(), or if we know preemption cannot actually be enabled
by that statement because of other preempt_count 'refs' known to be held.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 net/ipv4/tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c4638e6..82de786 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
 		    !sysctl_tcp_low_latency &&
 		    net_dma_find_channel()) {
-			preempt_enable_no_resched();
+			preempt_enable();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
 		} else {
-			preempt_enable_no_resched();
+			preempt_enable();
 		}
 	}
 #endif

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:sched/core] sched, net: Fixup busy_loop_us_clock()
  2013-11-19 15:13         ` Peter Zijlstra
                             ` (6 preceding siblings ...)
  2014-01-13 16:45           ` [tip:sched/core] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
@ 2014-01-13 16:45           ` tip-bot for Peter Zijlstra
  7 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 16:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, arjan, peterz, bitbucket,
	davem, eliezer.tamir, akpm, tglx

Commit-ID:  37089834528be3ef8cbf927e47c753b3e272a856
Gitweb:     http://git.kernel.org/tip/37089834528be3ef8cbf927e47c753b3e272a856
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 19 Nov 2013 16:13:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 17:39:11 +0100

sched, net: Fixup busy_loop_us_clock()

The only valid use of preempt_enable_no_resched() is if the very next
line is schedule(), or if we know preemption cannot actually be enabled
by that statement because of other preempt_count 'refs' known to be held.

This busy_poll stuff looks to be completely and utterly broken:
sched_clock() can return utter garbage with interrupts enabled (rare,
but still) and it can drift unbounded between CPUs.

This means that if you get preempted/migrated and your new CPU is
years behind the previous CPU, we get to busy spin for a _very_ long
time.

There is a _REASON_ sched_clock() warns about preemptability -
papering over it with a preempt_disable()/preempt_enable_no_resched()
is just terminal brain damage on so many levels.

Replace sched_clock() usage with local_clock() which has a bounded
drift between CPUs (<2 jiffies).

There is a further problem with the entire busy wait poll thing in
that the spin time is additive to the syscall timeout, not inclusive.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/net/busy_poll.h | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 829627d..1d67fb6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,10 @@ static inline bool net_busy_loop_on(void)
 	return sysctl_net_busy_poll;
 }
 
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
-static inline u64 busy_loop_us_clock(void)
-{
-	u64 rc;
-
-	preempt_disable_notrace();
-	rc = sched_clock();
-	preempt_enable_no_resched_notrace();
-
-	return rc >> 10;
-}
-#else /* CONFIG_DEBUG_PREEMPT */
 static inline u64 busy_loop_us_clock(void)
 {
-	return sched_clock() >> 10;
+	return local_clock() >> 10;
 }
-#endif /* CONFIG_DEBUG_PREEMPT */
 
 static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
 {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-01-13 16:46 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-19  9:00 acpi_pad mwait usage Peter Zijlstra
2013-11-19  9:08 ` Peter Zijlstra
2013-11-19 11:31   ` [PATCH] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
2013-11-19 13:06     ` Rafael J. Wysocki
2013-11-19 13:21     ` Mike Galbraith
2013-11-19 14:22     ` Arjan van de Ven
2013-11-19 14:46       ` Peter Zijlstra
2013-11-19 14:51       ` Peter Zijlstra
2013-11-19 15:13         ` Peter Zijlstra
2013-11-19 21:06           ` Jacob Pan
2013-11-20 10:28             ` Peter Zijlstra
2013-11-20 10:58               ` Peter Zijlstra
2013-11-20 16:24                 ` Arjan van de Ven
2013-11-20 16:33                   ` Peter Zijlstra
2013-11-20 16:38                     ` Arjan van de Ven
2014-01-12 18:44           ` [tip:sched/idle] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
2014-01-12 18:44           ` [tip:sched/idle] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
2014-01-12 18:44           ` [tip:sched/idle] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
2014-01-13 15:56           ` [tip:sched/core] locking: Optimize lock_bh functions tip-bot for Peter Zijlstra
2014-01-13 16:42           ` [tip:sched/core] sched/preempt, locking: Rework local_bh_{dis, en}able() tip-bot for Peter Zijlstra
2014-01-13 16:45           ` [tip:sched/core] sched, net: Clean up preempt_enable_no_resched() abuse tip-bot for Peter Zijlstra
2014-01-13 16:45           ` [tip:sched/core] sched, net: Fixup busy_loop_us_clock() tip-bot for Peter Zijlstra
