linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
@ 2016-11-21  0:53 Jiri Olsa
  2016-11-21  9:02 ` Paul E. McKenney
  2016-11-21  9:28 ` Peter Zijlstra
  0 siblings, 2 replies; 48+ messages in thread
From: Jiri Olsa @ 2016-11-21  0:53 UTC (permalink / raw)
  To: Steven Rostedt, Paul E. McKenney
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Josh Triplett,
	Andi Kleen, Jan Stancek

hi,
Jan hit following output when msr tracepoints are enabled on amd server:

[   91.585653] ===============================
[   91.589840] [ INFO: suspicious RCU usage. ]
[   91.594025] 4.9.0-rc1+ #1 Not tainted
[   91.597691] -------------------------------
[   91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
[   91.610222] 
[   91.610222] other info that might help us debug this:
[   91.610222] 
[   91.618224] 
[   91.618224] RCU used illegally from idle CPU!
[   91.618224] rcu_scheduler_active = 1, debug_locks = 0
[   91.629081] RCU used illegally from extended quiescent state!
[   91.634820] no locks held by swapper/1/0.
[   91.638832] 
[   91.638832] stack backtrace:
[   91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
[   91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03   ' 05/09/2008
[   91.656159]  ffffc900018fbdf8 ffffffff813ed43c ffff88017ede8000 0000000000000001
[   91.663637]  ffffc900018fbe28 ffffffff810fdcd7 ffff880233f95dd0 00000000c0010055
[   91.671107]  0000000000000000 0000000000000000 ffffc900018fbe58 ffffffff814297ac
[   91.678560] Call Trace:
[   91.681022]  [<ffffffff813ed43c>] dump_stack+0x85/0xc9
[   91.686164]  [<ffffffff810fdcd7>] lockdep_rcu_suspicious+0xe7/0x120
[   91.692429]  [<ffffffff814297ac>] do_trace_read_msr+0x14c/0x1b0
[   91.698349]  [<ffffffff8106ddb2>] native_read_msr+0x32/0x40
[   91.703921]  [<ffffffff8103b2be>] amd_e400_idle+0x7e/0x110
[   91.709407]  [<ffffffff8103b78f>] arch_cpu_idle+0xf/0x20
[   91.714720]  [<ffffffff8181cd33>] default_idle_call+0x23/0x40
[   91.720467]  [<ffffffff810f306a>] cpu_startup_entry+0x1da/0x2b0
[   91.726387]  [<ffffffff81058b1f>] start_secondary+0x17f/0x1f0


it got away with attached change.. but this rcu logic
is far beyond me, so it's just wild guess.. ;-)

thanks,
jirka


---
diff --git a/arch/x86/lib/msr.c b/arch/x86/lib/msr.c
index d1dee753b949..ca15becea3b6 100644
--- a/arch/x86/lib/msr.c
+++ b/arch/x86/lib/msr.c
@@ -115,14 +115,14 @@ int msr_clear_bit(u32 msr, u8 bit)
 #ifdef CONFIG_TRACEPOINTS
 void do_trace_write_msr(unsigned msr, u64 val, int failed)
 {
-	trace_write_msr(msr, val, failed);
+	trace_write_msr_rcuidle(msr, val, failed);
 }
 EXPORT_SYMBOL(do_trace_write_msr);
 EXPORT_TRACEPOINT_SYMBOL(write_msr);
 
 void do_trace_read_msr(unsigned msr, u64 val, int failed)
 {
-	trace_read_msr(msr, val, failed);
+	trace_read_msr_rcuidle(msr, val, failed);
 }
 EXPORT_SYMBOL(do_trace_read_msr);
 EXPORT_TRACEPOINT_SYMBOL(read_msr);

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  0:53 [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! Jiri Olsa
@ 2016-11-21  9:02 ` Paul E. McKenney
  2016-11-21  9:43   ` Peter Zijlstra
  2016-11-21  9:28 ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Paul E. McKenney @ 2016-11-21  9:02 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> hi,
> Jan hit following output when msr tracepoints are enabled on amd server:
> 
> [   91.585653] ===============================
> [   91.589840] [ INFO: suspicious RCU usage. ]
> [   91.594025] 4.9.0-rc1+ #1 Not tainted
> [   91.597691] -------------------------------
> [   91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
> [   91.610222] 
> [   91.610222] other info that might help us debug this:
> [   91.610222] 
> [   91.618224] 
> [   91.618224] RCU used illegally from idle CPU!
> [   91.618224] rcu_scheduler_active = 1, debug_locks = 0
> [   91.629081] RCU used illegally from extended quiescent state!
> [   91.634820] no locks held by swapper/1/0.
> [   91.638832] 
> [   91.638832] stack backtrace:
> [   91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
> [   91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03   ' 05/09/2008
> [   91.656159]  ffffc900018fbdf8 ffffffff813ed43c ffff88017ede8000 0000000000000001
> [   91.663637]  ffffc900018fbe28 ffffffff810fdcd7 ffff880233f95dd0 00000000c0010055
> [   91.671107]  0000000000000000 0000000000000000 ffffc900018fbe58 ffffffff814297ac
> [   91.678560] Call Trace:
> [   91.681022]  [<ffffffff813ed43c>] dump_stack+0x85/0xc9
> [   91.686164]  [<ffffffff810fdcd7>] lockdep_rcu_suspicious+0xe7/0x120
> [   91.692429]  [<ffffffff814297ac>] do_trace_read_msr+0x14c/0x1b0
> [   91.698349]  [<ffffffff8106ddb2>] native_read_msr+0x32/0x40
> [   91.703921]  [<ffffffff8103b2be>] amd_e400_idle+0x7e/0x110
> [   91.709407]  [<ffffffff8103b78f>] arch_cpu_idle+0xf/0x20
> [   91.714720]  [<ffffffff8181cd33>] default_idle_call+0x23/0x40
> [   91.720467]  [<ffffffff810f306a>] cpu_startup_entry+0x1da/0x2b0
> [   91.726387]  [<ffffffff81058b1f>] start_secondary+0x17f/0x1f0
> 
> 
> it got away with attached change.. but this rcu logic
> is far beyond me, so it's just wild guess.. ;-)

If in idle, the _rcuidle() is needed, so:

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> thanks,
> jirka
> 
> 
> ---
> diff --git a/arch/x86/lib/msr.c b/arch/x86/lib/msr.c
> index d1dee753b949..ca15becea3b6 100644
> --- a/arch/x86/lib/msr.c
> +++ b/arch/x86/lib/msr.c
> @@ -115,14 +115,14 @@ int msr_clear_bit(u32 msr, u8 bit)
>  #ifdef CONFIG_TRACEPOINTS
>  void do_trace_write_msr(unsigned msr, u64 val, int failed)
>  {
> -	trace_write_msr(msr, val, failed);
> +	trace_write_msr_rcuidle(msr, val, failed);
>  }
>  EXPORT_SYMBOL(do_trace_write_msr);
>  EXPORT_TRACEPOINT_SYMBOL(write_msr);
> 
>  void do_trace_read_msr(unsigned msr, u64 val, int failed)
>  {
> -	trace_read_msr(msr, val, failed);
> +	trace_read_msr_rcuidle(msr, val, failed);
>  }
>  EXPORT_SYMBOL(do_trace_read_msr);
>  EXPORT_TRACEPOINT_SYMBOL(read_msr);
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  0:53 [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! Jiri Olsa
  2016-11-21  9:02 ` Paul E. McKenney
@ 2016-11-21  9:28 ` Peter Zijlstra
  2016-11-21  9:34   ` Jiri Olsa
                     ` (2 more replies)
  1 sibling, 3 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-11-21  9:28 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> hi,
> Jan hit following output when msr tracepoints are enabled on amd server:
> 
> [   91.585653] ===============================
> [   91.589840] [ INFO: suspicious RCU usage. ]
> [   91.594025] 4.9.0-rc1+ #1 Not tainted
> [   91.597691] -------------------------------
> [   91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
> [   91.610222] 
> [   91.610222] other info that might help us debug this:
> [   91.610222] 
> [   91.618224] 
> [   91.618224] RCU used illegally from idle CPU!
> [   91.618224] rcu_scheduler_active = 1, debug_locks = 0
> [   91.629081] RCU used illegally from extended quiescent state!
> [   91.634820] no locks held by swapper/1/0.
> [   91.638832] 
> [   91.638832] stack backtrace:
> [   91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
> [   91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03   ' 05/09/2008
> [   91.656159]  ffffc900018fbdf8 ffffffff813ed43c ffff88017ede8000 0000000000000001
> [   91.663637]  ffffc900018fbe28 ffffffff810fdcd7 ffff880233f95dd0 00000000c0010055
> [   91.671107]  0000000000000000 0000000000000000 ffffc900018fbe58 ffffffff814297ac
> [   91.678560] Call Trace:
> [   91.681022]  [<ffffffff813ed43c>] dump_stack+0x85/0xc9
> [   91.686164]  [<ffffffff810fdcd7>] lockdep_rcu_suspicious+0xe7/0x120
> [   91.692429]  [<ffffffff814297ac>] do_trace_read_msr+0x14c/0x1b0
> [   91.698349]  [<ffffffff8106ddb2>] native_read_msr+0x32/0x40
> [   91.703921]  [<ffffffff8103b2be>] amd_e400_idle+0x7e/0x110
> [   91.709407]  [<ffffffff8103b78f>] arch_cpu_idle+0xf/0x20
> [   91.714720]  [<ffffffff8181cd33>] default_idle_call+0x23/0x40
> [   91.720467]  [<ffffffff810f306a>] cpu_startup_entry+0x1da/0x2b0
> [   91.726387]  [<ffffffff81058b1f>] start_secondary+0x17f/0x1f0
> 
> 
> it got away with attached change.. but this rcu logic
> is far beyond me, so it's just wild guess.. ;-)

I think I prefer something like the below, that only annotates the one
RDMSR in question, instead of all of them.


diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0888a879120f..d6c6aa80675f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -357,7 +357,7 @@ static void amd_e400_idle(void)
 	if (!amd_e400_c1e_detected) {
 		u32 lo, hi;
 
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
 
 		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
 			amd_e400_c1e_detected = true;

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  9:28 ` Peter Zijlstra
@ 2016-11-21  9:34   ` Jiri Olsa
  2016-11-21  9:42     ` Peter Zijlstra
  2016-11-21 12:58     ` Peter Zijlstra
  2016-11-21 17:06   ` Andi Kleen
  2017-02-23 12:24   ` Jiri Olsa
  2 siblings, 2 replies; 48+ messages in thread
From: Jiri Olsa @ 2016-11-21  9:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> > hi,
> > Jan hit following output when msr tracepoints are enabled on amd server:
> > 
> > [   91.585653] ===============================
> > [   91.589840] [ INFO: suspicious RCU usage. ]
> > [   91.594025] 4.9.0-rc1+ #1 Not tainted
> > [   91.597691] -------------------------------
> > [   91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
> > [   91.610222] 
> > [   91.610222] other info that might help us debug this:
> > [   91.610222] 
> > [   91.618224] 
> > [   91.618224] RCU used illegally from idle CPU!
> > [   91.618224] rcu_scheduler_active = 1, debug_locks = 0
> > [   91.629081] RCU used illegally from extended quiescent state!
> > [   91.634820] no locks held by swapper/1/0.
> > [   91.638832] 
> > [   91.638832] stack backtrace:
> > [   91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
> > [   91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03   ' 05/09/2008
> > [   91.656159]  ffffc900018fbdf8 ffffffff813ed43c ffff88017ede8000 0000000000000001
> > [   91.663637]  ffffc900018fbe28 ffffffff810fdcd7 ffff880233f95dd0 00000000c0010055
> > [   91.671107]  0000000000000000 0000000000000000 ffffc900018fbe58 ffffffff814297ac
> > [   91.678560] Call Trace:
> > [   91.681022]  [<ffffffff813ed43c>] dump_stack+0x85/0xc9
> > [   91.686164]  [<ffffffff810fdcd7>] lockdep_rcu_suspicious+0xe7/0x120
> > [   91.692429]  [<ffffffff814297ac>] do_trace_read_msr+0x14c/0x1b0
> > [   91.698349]  [<ffffffff8106ddb2>] native_read_msr+0x32/0x40
> > [   91.703921]  [<ffffffff8103b2be>] amd_e400_idle+0x7e/0x110
> > [   91.709407]  [<ffffffff8103b78f>] arch_cpu_idle+0xf/0x20
> > [   91.714720]  [<ffffffff8181cd33>] default_idle_call+0x23/0x40
> > [   91.720467]  [<ffffffff810f306a>] cpu_startup_entry+0x1da/0x2b0
> > [   91.726387]  [<ffffffff81058b1f>] start_secondary+0x17f/0x1f0
> > 
> > 
> > it got away with attached change.. but this rcu logic
> > is far beyond me, so it's just wild guess.. ;-)
> 
> I think I prefer something like the below, that only annotates the one
> RDMSR in question, instead of all of them.

I was wondering about that, but haven't found RCU_NONIDLE 

thanks,
jirka

> 
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0888a879120f..d6c6aa80675f 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
>  	if (!amd_e400_c1e_detected) {
>  		u32 lo, hi;
>  
> -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
>  
>  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
>  			amd_e400_c1e_detected = true;

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  9:34   ` Jiri Olsa
@ 2016-11-21  9:42     ` Peter Zijlstra
  2016-11-21 11:22       ` Jiri Olsa
  2016-11-21 12:58     ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2016-11-21  9:42 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:

> > I think I prefer something like the below, that only annotates the one
> > RDMSR in question, instead of all of them.
> 
> I was wondering about that, but haven't found RCU_NONIDLE 

Its in rcupdate.h ;-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  9:02 ` Paul E. McKenney
@ 2016-11-21  9:43   ` Peter Zijlstra
  2016-11-21 11:00     ` Paul E. McKenney
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2016-11-21  9:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Jiri Olsa, Steven Rostedt, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 01:02:25AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> > 
> > it got away with attached change.. but this rcu logic
> > is far beyond me, so it's just wild guess.. ;-)
> 
> If in idle, the _rcuidle() is needed, so:

Well, the point is, only this one rdmsr users is in idle, all the others
are not, so we should not be annotating _all_ of them, should we?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  9:43   ` Peter Zijlstra
@ 2016-11-21 11:00     ` Paul E. McKenney
  0 siblings, 0 replies; 48+ messages in thread
From: Paul E. McKenney @ 2016-11-21 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Steven Rostedt, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:43:21AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 01:02:25AM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> > > 
> > > it got away with attached change.. but this rcu logic
> > > is far beyond me, so it's just wild guess.. ;-)
> > 
> > If in idle, the _rcuidle() is needed, so:
> 
> Well, the point is, only this one rdmsr users is in idle, all the others
> are not, so we should not be annotating _all_ of them, should we?

Fair enough!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  9:42     ` Peter Zijlstra
@ 2016-11-21 11:22       ` Jiri Olsa
  2016-11-21 11:31         ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2016-11-21 11:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:42:27AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> > On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> 
> > > I think I prefer something like the below, that only annotates the one
> > > RDMSR in question, instead of all of them.
> > 
> > I was wondering about that, but haven't found RCU_NONIDLE 
> 
> Its in rcupdate.h ;-)

it was too late in the morning.. ;-) I'm assuming you'll queue this patch.. ?

thanks,
jirka

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 11:22       ` Jiri Olsa
@ 2016-11-21 11:31         ` Peter Zijlstra
  2016-11-21 12:49           ` Jiri Olsa
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2016-11-21 11:31 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 12:22:29PM +0100, Jiri Olsa wrote:
> On Mon, Nov 21, 2016 at 10:42:27AM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> > > On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> > 
> > > > I think I prefer something like the below, that only annotates the one
> > > > RDMSR in question, instead of all of them.
> > > 
> > > I was wondering about that, but haven't found RCU_NONIDLE 
> > 
> > Its in rcupdate.h ;-)
> 
> it was too late in the morning.. ;-) I'm assuming you'll queue this patch.. ?

Probably, I still have to make pretty the uncore patch from last week,
and I have a PEBS patch that needs looking at :-)

BTW, if you have any opinion on that PEBS register set crap, do holler
:-)

  lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.net

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 11:31         ` Peter Zijlstra
@ 2016-11-21 12:49           ` Jiri Olsa
  0 siblings, 0 replies; 48+ messages in thread
From: Jiri Olsa @ 2016-11-21 12:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 12:31:31PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 12:22:29PM +0100, Jiri Olsa wrote:
> > On Mon, Nov 21, 2016 at 10:42:27AM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> > > > On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> > > 
> > > > > I think I prefer something like the below, that only annotates the one
> > > > > RDMSR in question, instead of all of them.
> > > > 
> > > > I was wondering about that, but haven't found RCU_NONIDLE 
> > > 
> > > Its in rcupdate.h ;-)
> > 
> > it was too late in the morning.. ;-) I'm assuming you'll queue this patch.. ?
> 
> Probably, I still have to make pretty the uncore patch from last week,
> and I have a PEBS patch that needs looking at :-)
> 
> BTW, if you have any opinion on that PEBS register set crap, do holler
> :-)
> 
>   lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.net

ook, will check

thanks,
jirka

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  9:34   ` Jiri Olsa
  2016-11-21  9:42     ` Peter Zijlstra
@ 2016-11-21 12:58     ` Peter Zijlstra
  2016-11-21 14:15       ` Steven Rostedt
  2016-11-21 14:20       ` Steven Rostedt
  1 sibling, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-11-21 12:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov

On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:

> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 0888a879120f..d6c6aa80675f 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> >  	if (!amd_e400_c1e_detected) {
> >  		u32 lo, hi;
> >  
> > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> >  
> >  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
> >  			amd_e400_c1e_detected = true;

OK, so while looking at this again, I don't like this ether :/

Problem with this one is that it always adds the RCU fiddling overhead,
even when we're not tracing.

I could do an rdmsr_notrace() for this one, dunno if its important.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 12:58     ` Peter Zijlstra
@ 2016-11-21 14:15       ` Steven Rostedt
  2016-11-21 14:37         ` Peter Zijlstra
  2016-11-21 14:20       ` Steven Rostedt
  1 sibling, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-21 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov

On Mon, 21 Nov 2016 13:58:30 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> 
> > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > > index 0888a879120f..d6c6aa80675f 100644
> > > --- a/arch/x86/kernel/process.c
> > > +++ b/arch/x86/kernel/process.c
> > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> > >  	if (!amd_e400_c1e_detected) {
> > >  		u32 lo, hi;
> > >  
> > > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> > >  
> > >  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
> > >  			amd_e400_c1e_detected = true;  
> 
> OK, so while looking at this again, I don't like this ether :/
> 
> Problem with this one is that it always adds the RCU fiddling overhead,
> even when we're not tracing.
> 
> I could do an rdmsr_notrace() for this one, dunno if its important.

But that would neglect the point of tracing rdmsr. What about:

	/* tracepoints require RCU enabled */
	if (trace_read_msr_enabled())
		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
	else
		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 12:58     ` Peter Zijlstra
  2016-11-21 14:15       ` Steven Rostedt
@ 2016-11-21 14:20       ` Steven Rostedt
  1 sibling, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2016-11-21 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov

On Mon, 21 Nov 2016 13:58:30 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> 
> > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > > index 0888a879120f..d6c6aa80675f 100644
> > > --- a/arch/x86/kernel/process.c
> > > +++ b/arch/x86/kernel/process.c
> > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> > >  	if (!amd_e400_c1e_detected) {
> > >  		u32 lo, hi;
> > >  
> > > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> > >  
> > >  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
> > >  			amd_e400_c1e_detected = true;  
> 
> OK, so while looking at this again, I don't like this ether :/
> 
> Problem with this one is that it always adds the RCU fiddling overhead,
> even when we're not tracing.
> 
> I could do an rdmsr_notrace() for this one, dunno if its important.

But that would neglect the point of tracing rdmsr. What about:

	/* tracepoints require RCU enabled */
	if (trace_read_msr_enabled())
		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
	else
		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 14:15       ` Steven Rostedt
@ 2016-11-21 14:37         ` Peter Zijlstra
  2016-11-21 15:35           ` Borislav Petkov
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2016-11-21 14:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov

On Mon, Nov 21, 2016 at 09:15:43AM -0500, Steven Rostedt wrote:
> On Mon, 21 Nov 2016 13:58:30 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> > 
> > > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > > > index 0888a879120f..d6c6aa80675f 100644
> > > > --- a/arch/x86/kernel/process.c
> > > > +++ b/arch/x86/kernel/process.c
> > > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> > > >  	if (!amd_e400_c1e_detected) {
> > > >  		u32 lo, hi;
> > > >  
> > > > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > > > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> > > >  
> > > >  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
> > > >  			amd_e400_c1e_detected = true;  
> > 
> > OK, so while looking at this again, I don't like this ether :/
> > 
> > Problem with this one is that it always adds the RCU fiddling overhead,
> > even when we're not tracing.
> > 
> > I could do an rdmsr_notrace() for this one, dunno if its important.
> 
> But that would neglect the point of tracing rdmsr. What about:
> 
> 	/* tracepoints require RCU enabled */
> 	if (trace_read_msr_enabled())
> 		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> 	else
> 		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);

Yeah, but this one does a printk() when it hits the contidion it checks
for, so not tracing it would be fine I think.

Also, Boris, why do we need to redo that rdmsr until we see that bit
set? Can't we simply do the rdmsr once and then be done with it?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 14:37         ` Peter Zijlstra
@ 2016-11-21 15:35           ` Borislav Petkov
  2016-11-21 15:41             ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Borislav Petkov @ 2016-11-21 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 03:37:16PM +0100, Peter Zijlstra wrote:
> Yeah, but this one does a printk() when it hits the contidion it checks
> for, so not tracing it would be fine I think.

FWIW, there's already:

static inline void notrace
native_write_msr(unsigned int msr, u32 low, u32 high)
{
        __native_write_msr_notrace(msr, low, high);
        if (msr_tracepoint_active(__tracepoint_write_msr))
                do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
}

so could be mirrored.

> Also, Boris, why do we need to redo that rdmsr until we see that bit
> set? Can't we simply do the rdmsr once and then be done with it?

Hmm, so the way I read the definition of those bits -
K8_INTP_C1E_ACTIVE_MASK - it looks like they get set when the cores
enter halt:

28 C1eOnCmpHalt: C1E on chip multi-processing halt. Read-write.

1=When all cores of a processor have entered the halt state, the processor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
generates an IO cycle as specified by IORd, IOMsgData, and IOMsgAddr.
When this bit is set, SmiOnCmpHalt and IntPndMsg must be 0, otherwise
the behavior is undefined. For revision DA-C, this bit is only supported
for dual-core processors. For revision C3 and E, this bit is supported
for any number of cores. See 2.4.3.3.3 [Hardware Initiated C1E].

27 SmiOnCmpHalt: SMI on chip multi-processing halt. Read-write.

1=When all cores of the processor have entered the halt state, the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

processor generates an SMI-trigger IO cycle as specified by IORd,
IOMsgData, and IOMsgAddr. When this bit is set C1eOnCmpHalt and
IntPndMsg must be 0, otherwise the behavior is undefined. The status is
stored in SMMFEC4[SmiOnCmpHaltSts]. See 2.4.3.3.1 [SMI Initiated C1E].

But the thing is, we do it only once until amd_e400_c1e_detected is true.

FWIW, I'd love it if we could do

	if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))

but I'm afraid we need to look at that MSR at least once until we set
the boolean.

I could ask if that is really the case and whether we can detect it
differently...

In any case, I have a box here in case you want me to test patches.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 15:35           ` Borislav Petkov
@ 2016-11-21 15:41             ` Peter Zijlstra
  2016-11-21 16:06               ` Borislav Petkov
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2016-11-21 15:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 04:35:38PM +0100, Borislav Petkov wrote:
> On Mon, Nov 21, 2016 at 03:37:16PM +0100, Peter Zijlstra wrote:
> > Yeah, but this one does a printk() when it hits the contidion it checks
> > for, so not tracing it would be fine I think.
> 
> FWIW, there's already:
> 
> static inline void notrace
> native_write_msr(unsigned int msr, u32 low, u32 high)
> {
>         __native_write_msr_notrace(msr, low, high);
>         if (msr_tracepoint_active(__tracepoint_write_msr))
>                 do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
> }
> 
> so could be mirrored.

Yeah, I know. I caused that to appear ;-)

> > Also, Boris, why do we need to redo that rdmsr until we see that bit
> > set? Can't we simply do the rdmsr once and then be done with it?
> 
> Hmm, so the way I read the definition of those bits -
> K8_INTP_C1E_ACTIVE_MASK - it looks like they get set when the cores
> enter halt:
> 
> 28 C1eOnCmpHalt: C1E on chip multi-processing halt. Read-write.
> 
> 1=When all cores of a processor have entered the halt state, the processor
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> generates an IO cycle as specified by IORd, IOMsgData, and IOMsgAddr.
> When this bit is set, SmiOnCmpHalt and IntPndMsg must be 0, otherwise
> the behavior is undefined. For revision DA-C, this bit is only supported
> for dual-core processors. For revision C3 and E, this bit is supported
> for any number of cores. See 2.4.3.3.3 [Hardware Initiated C1E].
> 
> 27 SmiOnCmpHalt: SMI on chip multi-processing halt. Read-write.
> 
> 1=When all cores of the processor have entered the halt state, the
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> processor generates an SMI-trigger IO cycle as specified by IORd,
> IOMsgData, and IOMsgAddr. When this bit is set C1eOnCmpHalt and
> IntPndMsg must be 0, otherwise the behavior is undefined. The status is
> stored in SMMFEC4[SmiOnCmpHaltSts]. See 2.4.3.3.1 [SMI Initiated C1E].
> 
> But the thing is, we do it only once until amd_e400_c1e_detected is true.
> 
> FWIW, I'd love it if we could do
> 
> 	if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))
> 
> but I'm afraid we need to look at that MSR at least once until we set
> the boolean.
> 
> I could ask if that is really the case and whether we can detect it
> differently...
> 
> In any case, I have a box here in case you want me to test patches.

Right, I just wondered about the !c1e present case. In that case we'll
forever read the msr in the hope of finding that bit set, and we never
will.

Don't care too much, just idle curiousity. Let me do the rdmsr_notrace()
thing.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 15:41             ` Peter Zijlstra
@ 2016-11-21 16:06               ` Borislav Petkov
  2016-11-29 13:16                 ` Borislav Petkov
  0 siblings, 1 reply; 48+ messages in thread
From: Borislav Petkov @ 2016-11-21 16:06 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

+ tglx.

On Mon, Nov 21, 2016 at 04:41:04PM +0100, Peter Zijlstra wrote:
> Right, I just wondered about the !c1e present case. In that case we'll
> forever read the msr in the hope of finding that bit set, and we never
> will.

Well, erratum 400 flag is set on everything F10h from model 0x2 onwards,
which is basically everything that's still out there. And we set it for
most of the K8s too. So I don't think there's an affected machine out
there where we don't enable that erratum for so I'd assume it is turned
on everywhere.

IOW, what's the worst thing that can happen if we did this below?

We basically get rid of the detection and switch the timer to broadcast
mode immediately on the halting CPU.

amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
check so it will run on the affected CPUs only...

Thoughts?

---
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0888a879120f..7ab9d0254b4d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -354,41 +354,30 @@ void amd_e400_remove_cpu(int cpu)
  */
 static void amd_e400_idle(void)
 {
-	if (!amd_e400_c1e_detected) {
-		u32 lo, hi;
+	int cpu = smp_processor_id();
 
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+	if (!static_cpu_has(X86_FEATURE_NONSTOP_TSC))
+		mark_tsc_unstable("TSC halt in AMD C1E");
 
-		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
-			amd_e400_c1e_detected = true;
-			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
-				mark_tsc_unstable("TSC halt in AMD C1E");
-			pr_info("System has AMD C1E enabled\n");
-		}
-	}
-
-	if (amd_e400_c1e_detected) {
-		int cpu = smp_processor_id();
+	pr_info("System has AMD C1E enabled\n");
 
-		if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) {
-			cpumask_set_cpu(cpu, amd_e400_c1e_mask);
-			/* Force broadcast so ACPI can not interfere. */
-			tick_broadcast_force();
-			pr_info("Switch to broadcast mode on CPU%d\n", cpu);
-		}
-		tick_broadcast_enter();
+	if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) {
+		cpumask_set_cpu(cpu, amd_e400_c1e_mask);
+		/* Force broadcast so ACPI can not interfere. */
+		tick_broadcast_force();
+		pr_info("Switch to broadcast mode on CPU%d\n", cpu);
+	}
+	tick_broadcast_enter();
 
-		default_idle();
+	default_idle();
 
-		/*
-		 * The switch back from broadcast mode needs to be
-		 * called with interrupts disabled.
-		 */
-		local_irq_disable();
-		tick_broadcast_exit();
-		local_irq_enable();
-	} else
-		default_idle();
+	/*
+	 * The switch back from broadcast mode needs to be
+	 * called with interrupts disabled.
+	 */
+	local_irq_disable();
+	tick_broadcast_exit();
+	local_irq_enable();
 }
 
 /*

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  9:28 ` Peter Zijlstra
  2016-11-21  9:34   ` Jiri Olsa
@ 2016-11-21 17:06   ` Andi Kleen
  2016-11-21 17:18     ` Peter Zijlstra
  2017-02-23 12:24   ` Jiri Olsa
  2 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-11-21 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Steven Rostedt, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

> > it got away with attached change.. but this rcu logic
> > is far beyond me, so it's just wild guess.. ;-)
> 
> I think I prefer something like the below, that only annotates the one
> RDMSR in question, instead of all of them.

It would be far better to just fix trace points that they always work.

This whole thing is a travesty: we have tens of thousands of lines of code in
ftrace to support tracing in NMIs, but then "debug features"[1] like
this come around and make trace points unusable for far more code than
just the NMI handlers.

-Andi

[1] unclear what they  actually "debug" here.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 17:06   ` Andi Kleen
@ 2016-11-21 17:18     ` Peter Zijlstra
  2016-11-21 17:45       ` Andi Kleen
  2016-11-21 17:55       ` Steven Rostedt
  0 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-11-21 17:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jiri Olsa, Steven Rostedt, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote:
> > > it got away with attached change.. but this rcu logic
> > > is far beyond me, so it's just wild guess.. ;-)
> > 
> > I think I prefer something like the below, that only annotates the one
> > RDMSR in question, instead of all of them.
> 
> It would be far better to just fix trace points that they always work.
> 
> This whole thing is a travesty: we have tens of thousands of lines of code in
> ftrace to support tracing in NMIs, but then "debug features"[1] like
> this come around and make trace points unusable for far more code than
> just the NMI handlers.

Not sure the idle handlers are more code than we have NMI code, but yes,
its annoying.

Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid
locking, as one does.

Biggest objection would be that the rcu_irq_enter_irqson() thing does
POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint
with that is quite a few cycles.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 17:18     ` Peter Zijlstra
@ 2016-11-21 17:45       ` Andi Kleen
  2016-11-21 18:01         ` Steven Rostedt
  2016-11-21 17:55       ` Steven Rostedt
  1 sibling, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-11-21 17:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Jiri Olsa, Steven Rostedt, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, Nov 21, 2016 at 06:18:53PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote:
> > > > it got away with attached change.. but this rcu logic
> > > > is far beyond me, so it's just wild guess.. ;-)
> > > 
> > > I think I prefer something like the below, that only annotates the one
> > > RDMSR in question, instead of all of them.
> > 
> > It would be far better to just fix trace points that they always work.
> > 
> > This whole thing is a travesty: we have tens of thousands of lines of code in
> > ftrace to support tracing in NMIs, but then "debug features"[1] like
> > this come around and make trace points unusable for far more code than
> > just the NMI handlers.
> 
> Not sure the idle handlers are more code than we have NMI code, but yes,
> its annoying.
> 
> Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid
> locking, as one does.
> 
> Biggest objection would be that the rcu_irq_enter_irqson() thing does
> POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint
> with that is quite a few cycles.

This is only when the trace point is enabled right?

ftrace is already executing a lot of instructions when enabled. Adding two POPF
won't be end of the world.

-Andi


264171.903577780:  ffffffff810d6040 trace_event_raw_event_sched_switch	push %rbp
264171.903577780:  ffffffff810d6041 trace_event_raw_event_sched_switch	mov %rsp, %rbp
264171.903577780:  ffffffff810d6044 trace_event_raw_event_sched_switch	push %r15
264171.903577780:  ffffffff810d6046 trace_event_raw_event_sched_switch	push %r14
264171.903577780:  ffffffff810d6048 trace_event_raw_event_sched_switch	push %r13
264171.903577780:  ffffffff810d604a trace_event_raw_event_sched_switch	push %r12
264171.903577780:  ffffffff810d604c trace_event_raw_event_sched_switch	mov %esi, %r13d
264171.903577780:  ffffffff810d604f trace_event_raw_event_sched_switch	push %rbx
264171.903577780:  ffffffff810d6050 trace_event_raw_event_sched_switch	mov %rdi, %r12
264171.903577780:  ffffffff810d6053 trace_event_raw_event_sched_switch	mov %rdx, %r14
264171.903577780:  ffffffff810d6056 trace_event_raw_event_sched_switch	mov %rcx, %r15
264171.903577780:  ffffffff810d6059 trace_event_raw_event_sched_switch	sub $0x30, %rsp
264171.903577780:  ffffffff810d605d trace_event_raw_event_sched_switch	mov 0x48(%rdi), %rbx
264171.903577780:  ffffffff810d6061 trace_event_raw_event_sched_switch	test $0x80, %bl
264171.903577780:  ffffffff810d6064 trace_event_raw_event_sched_switch	jnz 0x810d608c
264171.903577780:  ffffffff810d6066 trace_event_raw_event_sched_switch	test $0x40, %bl
264171.903577780:  ffffffff810d6069 trace_event_raw_event_sched_switch	jnz trace_event_raw_event_sched_switch+216
264171.903577780:  ffffffff810d606f trace_event_raw_event_sched_switch	test $0x20, %bl
264171.903577780:  ffffffff810d6072 trace_event_raw_event_sched_switch	jz 0x810d6083
264171.903577780:  ffffffff810d6083 trace_event_raw_event_sched_switch	and $0x1, %bh
264171.903577780:  ffffffff810d6086 trace_event_raw_event_sched_switch	jnz trace_event_raw_event_sched_switch+237
264171.903578113:  ffffffff810d608c trace_event_raw_event_sched_switch	mov $0x40, %edx
264171.903578113:  ffffffff810d6091 trace_event_raw_event_sched_switch	mov %r12, %rsi
264171.903578113:  ffffffff810d6094 trace_event_raw_event_sched_switch	mov %rsp, %rdi
264171.903578113:  ffffffff810d6097 trace_event_raw_event_sched_switch	call trace_event_buffer_reserve
264171.903578113:  ffffffff8114f150 trace_event_buffer_reserve	test $0x1, 0x49(%rsi)
264171.903578113:  ffffffff8114f154 trace_event_buffer_reserve	push %rbp
264171.903578113:  ffffffff8114f155 trace_event_buffer_reserve	mov 0x10(%rsi), %r10
264171.903578113:  ffffffff8114f159 trace_event_buffer_reserve	mov %rsp, %rbp
264171.903578113:  ffffffff8114f15c trace_event_buffer_reserve	push %rbx
264171.903578113:  ffffffff8114f15d trace_event_buffer_reserve	jz 0x8114f17e
264171.903578113:  ffffffff8114f17e trace_event_buffer_reserve	mov %rdx, %rcx
264171.903578113:  ffffffff8114f181 trace_event_buffer_reserve	mov %rdi, %rbx
264171.903578113:  ffffffff8114f184 trace_event_buffer_reserve	pushfq 
264171.903578113:  ffffffff8114f185 trace_event_buffer_reserve	pop %r8
264171.903578113:  ffffffff8114f187 trace_event_buffer_reserve	mov %gs:0x7eebd2c1(%rip), %r9d
264171.903578113:  ffffffff8114f18f trace_event_buffer_reserve	and $0x7fffffff, %r9d
264171.903578113:  ffffffff8114f196 trace_event_buffer_reserve	mov %r8, 0x20(%rdi)
264171.903578113:  ffffffff8114f19a trace_event_buffer_reserve	mov %rsi, 0x10(%rdi)
264171.903578113:  ffffffff8114f19e trace_event_buffer_reserve	mov %r9d, 0x28(%rdi)
264171.903578113:  ffffffff8114f1a2 trace_event_buffer_reserve	mov 0x40(%r10), %edx
264171.903578113:  ffffffff8114f1a6 trace_event_buffer_reserve	call trace_event_buffer_lock_reserve
264171.903578113:  ffffffff81142210 trace_event_buffer_lock_reserve	push %rbp
264171.903578113:  ffffffff81142211 trace_event_buffer_lock_reserve	mov %rsp, %rbp
264171.903578113:  ffffffff81142214 trace_event_buffer_lock_reserve	push %r15
264171.903578113:  ffffffff81142216 trace_event_buffer_lock_reserve	push %r14
264171.903578113:  ffffffff81142218 trace_event_buffer_lock_reserve	push %r13
264171.903578113:  ffffffff8114221a trace_event_buffer_lock_reserve	push %r12
264171.903578113:  ffffffff8114221c trace_event_buffer_lock_reserve	mov %rdi, %r14
264171.903578113:  ffffffff8114221f trace_event_buffer_lock_reserve	push %rbx
264171.903578113:  ffffffff81142220 trace_event_buffer_lock_reserve	mov %edx, %r12d
264171.903578113:  ffffffff81142223 trace_event_buffer_lock_reserve	mov %rsi, %rbx
264171.903578113:  ffffffff81142226 trace_event_buffer_lock_reserve	mov %rcx, %r13
264171.903578113:  ffffffff81142229 trace_event_buffer_lock_reserve	mov %r9d, %r15d
264171.903578113:  ffffffff8114222c trace_event_buffer_lock_reserve	sub $0x10, %rsp
264171.903578113:  ffffffff81142230 trace_event_buffer_lock_reserve	mov 0x28(%rsi), %rax
264171.903578113:  ffffffff81142234 trace_event_buffer_lock_reserve	mov %r8, -0x30(%rbp)
264171.903578113:  ffffffff81142238 trace_event_buffer_lock_reserve	mov 0x20(%rax), %rdi
264171.903578113:  ffffffff8114223c trace_event_buffer_lock_reserve	mov %rdi, (%r14)
264171.903578113:  ffffffff8114223f trace_event_buffer_lock_reserve	test $0x24, 0x48(%rsi)
264171.903578113:  ffffffff81142243 trace_event_buffer_lock_reserve	jz 0x8114226d
264171.903578113:  ffffffff8114226d trace_event_buffer_lock_reserve	mov -0x30(%rbp), %rcx
264171.903578113:  ffffffff81142271 trace_event_buffer_lock_reserve	mov %r15d, %r8d
264171.903578113:  ffffffff81142274 trace_event_buffer_lock_reserve	mov %r13, %rdx
264171.903578113:  ffffffff81142277 trace_event_buffer_lock_reserve	mov %r12d, %esi
264171.903578113:  ffffffff8114227a trace_event_buffer_lock_reserve	call trace_buffer_lock_reserve
264171.903578113:  ffffffff811421c0 trace_buffer_lock_reserve	push %rbp
264171.903578113:  ffffffff811421c1 trace_buffer_lock_reserve	mov %rsp, %rbp
264171.903578113:  ffffffff811421c4 trace_buffer_lock_reserve	push %r14
264171.903578113:  ffffffff811421c6 trace_buffer_lock_reserve	push %r13
264171.903578113:  ffffffff811421c8 trace_buffer_lock_reserve	push %r12
264171.903578113:  ffffffff811421ca trace_buffer_lock_reserve	push %rbx
264171.903578113:  ffffffff811421cb trace_buffer_lock_reserve	mov %esi, %r12d
264171.903578113:  ffffffff811421ce trace_buffer_lock_reserve	mov %rdx, %rsi
264171.903578113:  ffffffff811421d1 trace_buffer_lock_reserve	mov %rcx, %r13
264171.903578113:  ffffffff811421d4 trace_buffer_lock_reserve	mov %r8d, %r14d
264171.903578113:  ffffffff811421d7 trace_buffer_lock_reserve	call ring_buffer_lock_reserve
264171.903578113:  ffffffff8113d950 ring_buffer_lock_reserve	mov 0x8(%rdi), %eax
264171.903578113:  ffffffff8113d953 ring_buffer_lock_reserve	test %eax, %eax
264171.903578113:  ffffffff8113d955 ring_buffer_lock_reserve	jnz ring_buffer_lock_reserve+166
264171.903578113:  ffffffff8113d95b ring_buffer_lock_reserve	mov %gs:0x7eecc826(%rip), %eax
264171.903578113:  ffffffff8113d962 ring_buffer_lock_reserve	mov %eax, %ecx
264171.903578113:  ffffffff8113d964 ring_buffer_lock_reserve	bt %rcx, 0x10(%rdi)
264171.903578113:  ffffffff8113d969 ring_buffer_lock_reserve	setb %cl
264171.903578113:  ffffffff8113d96c ring_buffer_lock_reserve	test %cl, %cl
264171.903578113:  ffffffff8113d96e ring_buffer_lock_reserve	jz ring_buffer_lock_reserve+166
264171.903578113:  ffffffff8113d974 ring_buffer_lock_reserve	mov 0x60(%rdi), %rdx
264171.903578113:  ffffffff8113d978 ring_buffer_lock_reserve	push %rbp
264171.903578113:  ffffffff8113d979 ring_buffer_lock_reserve	cdqe 
264171.903578113:  ffffffff8113d97b ring_buffer_lock_reserve	mov %rsp, %rbp
264171.903578113:  ffffffff8113d97e ring_buffer_lock_reserve	push %rbx
264171.903578113:  ffffffff8113d97f ring_buffer_lock_reserve	mov (%rdx,%rax,8), %rbx
264171.903578113:  ffffffff8113d983 ring_buffer_lock_reserve	mov 0x4(%rbx), %eax
264171.903578113:  ffffffff8113d986 ring_buffer_lock_reserve	test %eax, %eax
264171.903578113:  ffffffff8113d988 ring_buffer_lock_reserve	jnz 0x8113d9f1
264171.903578113:  ffffffff8113d98a ring_buffer_lock_reserve	cmp $0xfe8, %rsi
264171.903578113:  ffffffff8113d991 ring_buffer_lock_reserve	ja 0x8113d9f1
264171.903578113:  ffffffff8113d993 ring_buffer_lock_reserve	mov %gs:0x7eeceab6(%rip), %edx
264171.903578113:  ffffffff8113d99a ring_buffer_lock_reserve	test $0x1fff00, %edx
264171.903578113:  ffffffff8113d9a0 ring_buffer_lock_reserve	mov 0x20(%rbx), %ecx
264171.903578113:  ffffffff8113d9a3 ring_buffer_lock_reserve	mov $0x8, %eax
264171.903578113:  ffffffff8113d9a8 ring_buffer_lock_reserve	jnz 0x8113d9c6
264171.903578113:  ffffffff8113d9aa ring_buffer_lock_reserve	test %eax, %ecx
264171.903578113:  ffffffff8113d9ac ring_buffer_lock_reserve	jnz 0x8113d9f1
264171.903578113:  ffffffff8113d9ae ring_buffer_lock_reserve	or %ecx, %eax
264171.903578113:  ffffffff8113d9b0 ring_buffer_lock_reserve	mov %rsi, %rdx
264171.903578113:  ffffffff8113d9b3 ring_buffer_lock_reserve	mov %rbx, %rsi
264171.903578113:  ffffffff8113d9b6 ring_buffer_lock_reserve	mov %eax, 0x20(%rbx)
264171.903578113:  ffffffff8113d9b9 ring_buffer_lock_reserve	call rb_reserve_next_event
264171.903578113:  ffffffff8113d610 rb_reserve_next_event	inc 0x88(%rsi)
264171.903578113:  ffffffff8113d617 rb_reserve_next_event	inc 0x90(%rsi)
264171.903578113:  ffffffff8113d61e rb_reserve_next_event	mov 0x8(%rsi), %rax
264171.903578113:  ffffffff8113d622 rb_reserve_next_event	cmp %rdi, %rax
264171.903578113:  ffffffff8113d625 rb_reserve_next_event	jnz rb_reserve_next_event+678
264171.903578113:  ffffffff8113d62b rb_reserve_next_event	push %rbp
264171.903578113:  ffffffff8113d62c rb_reserve_next_event	mov %edx, %eax
264171.903578113:  ffffffff8113d62e rb_reserve_next_event	mov $0x8, %ecx
264171.903578113:  ffffffff8113d633 rb_reserve_next_event	mov %rsp, %rbp
264171.903578113:  ffffffff8113d636 rb_reserve_next_event	push %r15
264171.903578113:  ffffffff8113d638 rb_reserve_next_event	push %r14
264171.903578113:  ffffffff8113d63a rb_reserve_next_event	push %r13
264171.903578113:  ffffffff8113d63c rb_reserve_next_event	push %r12
264171.903578113:  ffffffff8113d63e rb_reserve_next_event	push %rbx
264171.903578113:  ffffffff8113d63f rb_reserve_next_event	lea 0x88(%rsi), %rbx
264171.903578113:  ffffffff8113d646 rb_reserve_next_event	sub $0x50, %rsp
264171.903578113:  ffffffff8113d64a rb_reserve_next_event	test %edx, %edx
264171.903578113:  ffffffff8113d64c rb_reserve_next_event	jnz rb_reserve_next_event+430
264171.903578113:  ffffffff8113d7be rb_reserve_next_event	cmp $0x70, %edx
264171.903578113:  ffffffff8113d7c1 rb_reserve_next_event	jbe 0x8113d7c6
264171.903578113:  ffffffff8113d7c6 rb_reserve_next_event	add $0x7, %eax
264171.903578113:  ffffffff8113d7c9 rb_reserve_next_event	mov $0x10, %ecx
264171.903578113:  ffffffff8113d7ce rb_reserve_next_event	and $0xfffffffc, %eax
264171.903578113:  ffffffff8113d7d1 rb_reserve_next_event	mov %eax, %edx
264171.903578113:  ffffffff8113d7d3 rb_reserve_next_event	cmp $0xc, %eax
264171.903578113:  ffffffff8113d7d6 rb_reserve_next_event	cmovnz %rdx, %rcx
264171.903578113:  ffffffff8113d7da rb_reserve_next_event	jmp rb_reserve_next_event+66
264171.903578113:  ffffffff8113d652 rb_reserve_next_event	lea -0x50(%rbp), %rax
264171.903578113:  ffffffff8113d656 rb_reserve_next_event	lea 0xa8(%rsi), %r12
264171.903578113:  ffffffff8113d65d rb_reserve_next_event	mov %rsi, %r13
264171.903578113:  ffffffff8113d660 rb_reserve_next_event	mov %rcx, -0x40(%rbp)
264171.903578113:  ffffffff8113d664 rb_reserve_next_event	mov $0x0, -0x30(%rbp)
264171.903578113:  ffffffff8113d66b rb_reserve_next_event	lea 0x18(%rax), %r15
264171.903578113:  ffffffff8113d66f rb_reserve_next_event	lea 0x10(%rax), %r14
264171.903578113:  ffffffff8113d673 rb_reserve_next_event	mov $0x0, -0x48(%rbp)
264171.903578113:  ffffffff8113d67b rb_reserve_next_event	mov $0x3e8, -0x74(%rbp)
264171.903578113:  ffffffff8113d682 rb_reserve_next_event	mov 0x8(%r13), %rax
264171.903578113:  ffffffff8113d686 rb_reserve_next_event	call 0x80(%rax)
264171.903578113:  ffffffff81134520 trace_clock_local	push %rbp
264171.903578113:  ffffffff81134521 trace_clock_local	mov %rsp, %rbp
264171.903578113:  ffffffff81134524 trace_clock_local	call native_sched_clock
264171.903578113:  ffffffff81078270 native_sched_clock	push %rbp
264171.903578113:  ffffffff81078271 native_sched_clock	mov %rsp, %rbp
264171.903578113:  ffffffff81078274 native_sched_clock	and $0xfffffffffffffff0, %rsp
264171.903578113:  ffffffff81078278 native_sched_clock	nop (%rax,%rax)
264171.903578113:  ffffffff8107827d native_sched_clock	rdtsc 
264171.903578113:  ffffffff8107827f native_sched_clock	shl $0x20, %rdx
264171.903578113:  ffffffff81078283 native_sched_clock	or %rdx, %rax
264171.903578113:  ffffffff81078286 native_sched_clock	mov %gs:0x7ef9cea2(%rip), %rsi
264171.903578113:  ffffffff8107828e native_sched_clock	mov %gs:0x7ef9cea2(%rip), %rdx
264171.903578113:  ffffffff81078296 native_sched_clock	cmp %rdx, %rsi
264171.903578113:  ffffffff81078299 native_sched_clock	jnz 0x810782d1
264171.903578113:  ffffffff8107829b native_sched_clock	mov (%rsi), %edx
264171.903578113:  ffffffff8107829d native_sched_clock	mov 0x4(%rsi), %ecx
264171.903578113:  ffffffff810782a0 native_sched_clock	mul %rdx
264171.903578113:  ffffffff810782a3 native_sched_clock	shrd %cl, %rdx, %rax
264171.903578113:  ffffffff810782a7 native_sched_clock	shr %cl, %rdx
264171.903578113:  ffffffff810782aa native_sched_clock	test $0x40, %cl
264171.903578113:  ffffffff810782ad native_sched_clock	cmovnz %rdx, %rax
264171.903578113:  ffffffff810782b1 native_sched_clock	add 0x8(%rsi), %rax
264171.903578113:  ffffffff810782b5 native_sched_clock	leave 
264171.903578113:  ffffffff810782b6 native_sched_clock	ret 
264171.903578113:  ffffffff81134529 trace_clock_local	pop %rbp
264171.903578113:  ffffffff8113452a trace_clock_local	ret 
264171.903578113:  ffffffff8113d68c rb_reserve_next_event	mov %rax, -0x50(%rbp)
264171.903578113:  ffffffff8113d690 rb_reserve_next_event	sub 0xa8(%r13), %rax
264171.903578113:  ffffffff8113d697 rb_reserve_next_event	mov 0xa8(%r13), %rcx
264171.903578113:  ffffffff8113d69e rb_reserve_next_event	cmp %rcx, -0x50(%rbp)
264171.903578113:  ffffffff8113d6a2 rb_reserve_next_event	jb 0x8113d6b4
264171.903578113:  ffffffff8113d6a4 rb_reserve_next_event	test $0xfffffffff8000000, %rax
264171.903578113:  ffffffff8113d6aa rb_reserve_next_event	mov %rax, -0x48(%rbp)
264171.903578113:  ffffffff8113d6ae rb_reserve_next_event	jnz rb_reserve_next_event+476
264171.903578113:  ffffffff8113d6b4 rb_reserve_next_event	mov -0x30(%rbp), %esi
264171.903578113:  ffffffff8113d6b7 rb_reserve_next_event	test %esi, %esi
264171.903578113:  ffffffff8113d6b9 rb_reserve_next_event	jnz rb_reserve_next_event+718

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 17:18     ` Peter Zijlstra
  2016-11-21 17:45       ` Andi Kleen
@ 2016-11-21 17:55       ` Steven Rostedt
  2016-11-21 18:24         ` Steven Rostedt
  1 sibling, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-21 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, 21 Nov 2016 18:18:53 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid
> locking, as one does.

Just to be clear, as ftrace in the kernel mostly represents function
tracing, which doesn't use RCU. This is a tracepoint feature.

> 
> Biggest objection would be that the rcu_irq_enter_irqson() thing does
> POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint
> with that is quite a few cycles.

Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not
concerned enough to put a heavy weight rcu idle code in for all
tracepoints.

Although, what about a percpu flag that can be checked in the
tracepoint code to see if it should enable RCU or not?

Hmm, I wonder if "rcu_is_watching()" is light enough to have in all
tracepoints?

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 17:45       ` Andi Kleen
@ 2016-11-21 18:01         ` Steven Rostedt
  2016-11-21 18:06           ` Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-21 18:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, 21 Nov 2016 09:45:04 -0800
Andi Kleen <andi@firstfloor.org> wrote:

> On Mon, Nov 21, 2016 at 06:18:53PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote:  
> > > > > it got away with attached change.. but this rcu logic
> > > > > is far beyond me, so it's just wild guess.. ;-)  
> > > > 
> > > > I think I prefer something like the below, that only annotates the one
> > > > RDMSR in question, instead of all of them.  
> > > 
> > > It would be far better to just fix trace points that they always work.
> > > 
> > > This whole thing is a travesty: we have tens of thousands of lines of code in
> > > ftrace to support tracing in NMIs, but then "debug features"[1] like
> > > this come around and make trace points unusable for far more code than
> > > just the NMI handlers.  
> > 
> > Not sure the idle handlers are more code than we have NMI code, but yes,
> > its annoying.
> > 
> > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid
> > locking, as one does.
> > 
> > Biggest objection would be that the rcu_irq_enter_irqson() thing does
> > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint
> > with that is quite a few cycles.  
> 
> This is only when the trace point is enabled right?
> 
> ftrace is already executing a lot of instructions when enabled. Adding two POPF
> won't be end of the world.

And a popf can be much more expensive than any of these. You should
know, not all instructions are equal.

Using perf, I've seen popf take up almst 30% of a function the size of
this.

-- Steve

> 
> -Andi
> 
> 
> 264171.903577780:  ffffffff810d6040 trace_event_raw_event_sched_switch	push %rbp
> 264171.903577780:  ffffffff810d6041 trace_event_raw_event_sched_switch	mov %rsp, %rbp
> 264171.903577780:  ffffffff810d6044 trace_event_raw_event_sched_switch	push %r15
> 264171.903577780:  ffffffff810d6046 trace_event_raw_event_sched_switch	push %r14
> 264171.903577780:  ffffffff810d6048 trace_event_raw_event_sched_switch	push %r13
> 264171.903577780:  ffffffff810d604a trace_event_raw_event_sched_switch	push %r12
> 264171.903577780:  ffffffff810d604c trace_event_raw_event_sched_switch	mov %esi, %r13d
> 264171.903577780:  ffffffff810d604f trace_event_raw_event_sched_switch	push %rbx
> 264171.903577780:  ffffffff810d6050 trace_event_raw_event_sched_switch	mov %rdi, %r12
> 264171.903577780:  ffffffff810d6053 trace_event_raw_event_sched_switch	mov %rdx, %r14
> 264171.903577780:  ffffffff810d6056 trace_event_raw_event_sched_switch	mov %rcx, %r15
> 264171.903577780:  ffffffff810d6059 trace_event_raw_event_sched_switch	sub $0x30, %rsp
> 264171.903577780:  ffffffff810d605d trace_event_raw_event_sched_switch	mov 0x48(%rdi), %rbx
> 264171.903577780:  ffffffff810d6061 trace_event_raw_event_sched_switch	test $0x80, %bl
> 264171.903577780:  ffffffff810d6064 trace_event_raw_event_sched_switch	jnz 0x810d608c
> 264171.903577780:  ffffffff810d6066 trace_event_raw_event_sched_switch	test $0x40, %bl
> 264171.903577780:  ffffffff810d6069 trace_event_raw_event_sched_switch	jnz trace_event_raw_event_sched_switch+216
> 264171.903577780:  ffffffff810d606f trace_event_raw_event_sched_switch	test $0x20, %bl
> 264171.903577780:  ffffffff810d6072 trace_event_raw_event_sched_switch	jz 0x810d6083
> 264171.903577780:  ffffffff810d6083 trace_event_raw_event_sched_switch	and $0x1, %bh
> 264171.903577780:  ffffffff810d6086 trace_event_raw_event_sched_switch	jnz trace_event_raw_event_sched_switch+237
> 264171.903578113:  ffffffff810d608c trace_event_raw_event_sched_switch	mov $0x40, %edx
> 264171.903578113:  ffffffff810d6091 trace_event_raw_event_sched_switch	mov %r12, %rsi
> 264171.903578113:  ffffffff810d6094 trace_event_raw_event_sched_switch	mov %rsp, %rdi
> 264171.903578113:  ffffffff810d6097 trace_event_raw_event_sched_switch	call trace_event_buffer_reserve
> 264171.903578113:  ffffffff8114f150 trace_event_buffer_reserve	test $0x1, 0x49(%rsi)
> 264171.903578113:  ffffffff8114f154 trace_event_buffer_reserve	push %rbp
> 264171.903578113:  ffffffff8114f155 trace_event_buffer_reserve	mov 0x10(%rsi), %r10
> 264171.903578113:  ffffffff8114f159 trace_event_buffer_reserve	mov %rsp, %rbp
> 264171.903578113:  ffffffff8114f15c trace_event_buffer_reserve	push %rbx
> 264171.903578113:  ffffffff8114f15d trace_event_buffer_reserve	jz 0x8114f17e
> 264171.903578113:  ffffffff8114f17e trace_event_buffer_reserve	mov %rdx, %rcx
> 264171.903578113:  ffffffff8114f181 trace_event_buffer_reserve	mov %rdi, %rbx
> 264171.903578113:  ffffffff8114f184 trace_event_buffer_reserve	pushfq 
> 264171.903578113:  ffffffff8114f185 trace_event_buffer_reserve	pop %r8
> 264171.903578113:  ffffffff8114f187 trace_event_buffer_reserve	mov %gs:0x7eebd2c1(%rip), %r9d
> 264171.903578113:  ffffffff8114f18f trace_event_buffer_reserve	and $0x7fffffff, %r9d
> 264171.903578113:  ffffffff8114f196 trace_event_buffer_reserve	mov %r8, 0x20(%rdi)
> 264171.903578113:  ffffffff8114f19a trace_event_buffer_reserve	mov %rsi, 0x10(%rdi)
> 264171.903578113:  ffffffff8114f19e trace_event_buffer_reserve	mov %r9d, 0x28(%rdi)
> 264171.903578113:  ffffffff8114f1a2 trace_event_buffer_reserve	mov 0x40(%r10), %edx
> 264171.903578113:  ffffffff8114f1a6 trace_event_buffer_reserve	call trace_event_buffer_lock_reserve
> 264171.903578113:  ffffffff81142210 trace_event_buffer_lock_reserve	push %rbp
> 264171.903578113:  ffffffff81142211 trace_event_buffer_lock_reserve	mov %rsp, %rbp
> 264171.903578113:  ffffffff81142214 trace_event_buffer_lock_reserve	push %r15
> 264171.903578113:  ffffffff81142216 trace_event_buffer_lock_reserve	push %r14
> 264171.903578113:  ffffffff81142218 trace_event_buffer_lock_reserve	push %r13
> 264171.903578113:  ffffffff8114221a trace_event_buffer_lock_reserve	push %r12
> 264171.903578113:  ffffffff8114221c trace_event_buffer_lock_reserve	mov %rdi, %r14
> 264171.903578113:  ffffffff8114221f trace_event_buffer_lock_reserve	push %rbx
> 264171.903578113:  ffffffff81142220 trace_event_buffer_lock_reserve	mov %edx, %r12d
> 264171.903578113:  ffffffff81142223 trace_event_buffer_lock_reserve	mov %rsi, %rbx
> 264171.903578113:  ffffffff81142226 trace_event_buffer_lock_reserve	mov %rcx, %r13
> 264171.903578113:  ffffffff81142229 trace_event_buffer_lock_reserve	mov %r9d, %r15d
> 264171.903578113:  ffffffff8114222c trace_event_buffer_lock_reserve	sub $0x10, %rsp
> 264171.903578113:  ffffffff81142230 trace_event_buffer_lock_reserve	mov 0x28(%rsi), %rax
> 264171.903578113:  ffffffff81142234 trace_event_buffer_lock_reserve	mov %r8, -0x30(%rbp)
> 264171.903578113:  ffffffff81142238 trace_event_buffer_lock_reserve	mov 0x20(%rax), %rdi
> 264171.903578113:  ffffffff8114223c trace_event_buffer_lock_reserve	mov %rdi, (%r14)
> 264171.903578113:  ffffffff8114223f trace_event_buffer_lock_reserve	test $0x24, 0x48(%rsi)
> 264171.903578113:  ffffffff81142243 trace_event_buffer_lock_reserve	jz 0x8114226d
> 264171.903578113:  ffffffff8114226d trace_event_buffer_lock_reserve	mov -0x30(%rbp), %rcx
> 264171.903578113:  ffffffff81142271 trace_event_buffer_lock_reserve	mov %r15d, %r8d
> 264171.903578113:  ffffffff81142274 trace_event_buffer_lock_reserve	mov %r13, %rdx
> 264171.903578113:  ffffffff81142277 trace_event_buffer_lock_reserve	mov %r12d, %esi
> 264171.903578113:  ffffffff8114227a trace_event_buffer_lock_reserve	call trace_buffer_lock_reserve
> 264171.903578113:  ffffffff811421c0 trace_buffer_lock_reserve	push %rbp
> 264171.903578113:  ffffffff811421c1 trace_buffer_lock_reserve	mov %rsp, %rbp
> 264171.903578113:  ffffffff811421c4 trace_buffer_lock_reserve	push %r14
> 264171.903578113:  ffffffff811421c6 trace_buffer_lock_reserve	push %r13
> 264171.903578113:  ffffffff811421c8 trace_buffer_lock_reserve	push %r12
> 264171.903578113:  ffffffff811421ca trace_buffer_lock_reserve	push %rbx
> 264171.903578113:  ffffffff811421cb trace_buffer_lock_reserve	mov %esi, %r12d
> 264171.903578113:  ffffffff811421ce trace_buffer_lock_reserve	mov %rdx, %rsi
> 264171.903578113:  ffffffff811421d1 trace_buffer_lock_reserve	mov %rcx, %r13
> 264171.903578113:  ffffffff811421d4 trace_buffer_lock_reserve	mov %r8d, %r14d
> 264171.903578113:  ffffffff811421d7 trace_buffer_lock_reserve	call ring_buffer_lock_reserve
> 264171.903578113:  ffffffff8113d950 ring_buffer_lock_reserve	mov 0x8(%rdi), %eax
> 264171.903578113:  ffffffff8113d953 ring_buffer_lock_reserve	test %eax, %eax
> 264171.903578113:  ffffffff8113d955 ring_buffer_lock_reserve	jnz ring_buffer_lock_reserve+166
> 264171.903578113:  ffffffff8113d95b ring_buffer_lock_reserve	mov %gs:0x7eecc826(%rip), %eax
> 264171.903578113:  ffffffff8113d962 ring_buffer_lock_reserve	mov %eax, %ecx
> 264171.903578113:  ffffffff8113d964 ring_buffer_lock_reserve	bt %rcx, 0x10(%rdi)
> 264171.903578113:  ffffffff8113d969 ring_buffer_lock_reserve	setb %cl
> 264171.903578113:  ffffffff8113d96c ring_buffer_lock_reserve	test %cl, %cl
> 264171.903578113:  ffffffff8113d96e ring_buffer_lock_reserve	jz ring_buffer_lock_reserve+166
> 264171.903578113:  ffffffff8113d974 ring_buffer_lock_reserve	mov 0x60(%rdi), %rdx
> 264171.903578113:  ffffffff8113d978 ring_buffer_lock_reserve	push %rbp
> 264171.903578113:  ffffffff8113d979 ring_buffer_lock_reserve	cdqe 
> 264171.903578113:  ffffffff8113d97b ring_buffer_lock_reserve	mov %rsp, %rbp
> 264171.903578113:  ffffffff8113d97e ring_buffer_lock_reserve	push %rbx
> 264171.903578113:  ffffffff8113d97f ring_buffer_lock_reserve	mov (%rdx,%rax,8), %rbx
> 264171.903578113:  ffffffff8113d983 ring_buffer_lock_reserve	mov 0x4(%rbx), %eax
> 264171.903578113:  ffffffff8113d986 ring_buffer_lock_reserve	test %eax, %eax
> 264171.903578113:  ffffffff8113d988 ring_buffer_lock_reserve	jnz 0x8113d9f1
> 264171.903578113:  ffffffff8113d98a ring_buffer_lock_reserve	cmp $0xfe8, %rsi
> 264171.903578113:  ffffffff8113d991 ring_buffer_lock_reserve	ja 0x8113d9f1
> 264171.903578113:  ffffffff8113d993 ring_buffer_lock_reserve	mov %gs:0x7eeceab6(%rip), %edx
> 264171.903578113:  ffffffff8113d99a ring_buffer_lock_reserve	test $0x1fff00, %edx
> 264171.903578113:  ffffffff8113d9a0 ring_buffer_lock_reserve	mov 0x20(%rbx), %ecx
> 264171.903578113:  ffffffff8113d9a3 ring_buffer_lock_reserve	mov $0x8, %eax
> 264171.903578113:  ffffffff8113d9a8 ring_buffer_lock_reserve	jnz 0x8113d9c6
> 264171.903578113:  ffffffff8113d9aa ring_buffer_lock_reserve	test %eax, %ecx
> 264171.903578113:  ffffffff8113d9ac ring_buffer_lock_reserve	jnz 0x8113d9f1
> 264171.903578113:  ffffffff8113d9ae ring_buffer_lock_reserve	or %ecx, %eax
> 264171.903578113:  ffffffff8113d9b0 ring_buffer_lock_reserve	mov %rsi, %rdx
> 264171.903578113:  ffffffff8113d9b3 ring_buffer_lock_reserve	mov %rbx, %rsi
> 264171.903578113:  ffffffff8113d9b6 ring_buffer_lock_reserve	mov %eax, 0x20(%rbx)
> 264171.903578113:  ffffffff8113d9b9 ring_buffer_lock_reserve	call rb_reserve_next_event
> 264171.903578113:  ffffffff8113d610 rb_reserve_next_event	inc 0x88(%rsi)
> 264171.903578113:  ffffffff8113d617 rb_reserve_next_event	inc 0x90(%rsi)
> 264171.903578113:  ffffffff8113d61e rb_reserve_next_event	mov 0x8(%rsi), %rax
> 264171.903578113:  ffffffff8113d622 rb_reserve_next_event	cmp %rdi, %rax
> 264171.903578113:  ffffffff8113d625 rb_reserve_next_event	jnz rb_reserve_next_event+678
> 264171.903578113:  ffffffff8113d62b rb_reserve_next_event	push %rbp
> 264171.903578113:  ffffffff8113d62c rb_reserve_next_event	mov %edx, %eax
> 264171.903578113:  ffffffff8113d62e rb_reserve_next_event	mov $0x8, %ecx
> 264171.903578113:  ffffffff8113d633 rb_reserve_next_event	mov %rsp, %rbp
> 264171.903578113:  ffffffff8113d636 rb_reserve_next_event	push %r15
> 264171.903578113:  ffffffff8113d638 rb_reserve_next_event	push %r14
> 264171.903578113:  ffffffff8113d63a rb_reserve_next_event	push %r13
> 264171.903578113:  ffffffff8113d63c rb_reserve_next_event	push %r12
> 264171.903578113:  ffffffff8113d63e rb_reserve_next_event	push %rbx
> 264171.903578113:  ffffffff8113d63f rb_reserve_next_event	lea 0x88(%rsi), %rbx
> 264171.903578113:  ffffffff8113d646 rb_reserve_next_event	sub $0x50, %rsp
> 264171.903578113:  ffffffff8113d64a rb_reserve_next_event	test %edx, %edx
> 264171.903578113:  ffffffff8113d64c rb_reserve_next_event	jnz rb_reserve_next_event+430
> 264171.903578113:  ffffffff8113d7be rb_reserve_next_event	cmp $0x70, %edx
> 264171.903578113:  ffffffff8113d7c1 rb_reserve_next_event	jbe 0x8113d7c6
> 264171.903578113:  ffffffff8113d7c6 rb_reserve_next_event	add $0x7, %eax
> 264171.903578113:  ffffffff8113d7c9 rb_reserve_next_event	mov $0x10, %ecx
> 264171.903578113:  ffffffff8113d7ce rb_reserve_next_event	and $0xfffffffc, %eax
> 264171.903578113:  ffffffff8113d7d1 rb_reserve_next_event	mov %eax, %edx
> 264171.903578113:  ffffffff8113d7d3 rb_reserve_next_event	cmp $0xc, %eax
> 264171.903578113:  ffffffff8113d7d6 rb_reserve_next_event	cmovnz %rdx, %rcx
> 264171.903578113:  ffffffff8113d7da rb_reserve_next_event	jmp rb_reserve_next_event+66
> 264171.903578113:  ffffffff8113d652 rb_reserve_next_event	lea -0x50(%rbp), %rax
> 264171.903578113:  ffffffff8113d656 rb_reserve_next_event	lea 0xa8(%rsi), %r12
> 264171.903578113:  ffffffff8113d65d rb_reserve_next_event	mov %rsi, %r13
> 264171.903578113:  ffffffff8113d660 rb_reserve_next_event	mov %rcx, -0x40(%rbp)
> 264171.903578113:  ffffffff8113d664 rb_reserve_next_event	mov $0x0, -0x30(%rbp)
> 264171.903578113:  ffffffff8113d66b rb_reserve_next_event	lea 0x18(%rax), %r15
> 264171.903578113:  ffffffff8113d66f rb_reserve_next_event	lea 0x10(%rax), %r14
> 264171.903578113:  ffffffff8113d673 rb_reserve_next_event	mov $0x0, -0x48(%rbp)
> 264171.903578113:  ffffffff8113d67b rb_reserve_next_event	mov $0x3e8, -0x74(%rbp)
> 264171.903578113:  ffffffff8113d682 rb_reserve_next_event	mov 0x8(%r13), %rax
> 264171.903578113:  ffffffff8113d686 rb_reserve_next_event	call 0x80(%rax)
> 264171.903578113:  ffffffff81134520 trace_clock_local	push %rbp
> 264171.903578113:  ffffffff81134521 trace_clock_local	mov %rsp, %rbp
> 264171.903578113:  ffffffff81134524 trace_clock_local	call native_sched_clock
> 264171.903578113:  ffffffff81078270 native_sched_clock	push %rbp
> 264171.903578113:  ffffffff81078271 native_sched_clock	mov %rsp, %rbp
> 264171.903578113:  ffffffff81078274 native_sched_clock	and $0xfffffffffffffff0, %rsp
> 264171.903578113:  ffffffff81078278 native_sched_clock	nop (%rax,%rax)
> 264171.903578113:  ffffffff8107827d native_sched_clock	rdtsc 
> 264171.903578113:  ffffffff8107827f native_sched_clock	shl $0x20, %rdx
> 264171.903578113:  ffffffff81078283 native_sched_clock	or %rdx, %rax
> 264171.903578113:  ffffffff81078286 native_sched_clock	mov %gs:0x7ef9cea2(%rip), %rsi
> 264171.903578113:  ffffffff8107828e native_sched_clock	mov %gs:0x7ef9cea2(%rip), %rdx
> 264171.903578113:  ffffffff81078296 native_sched_clock	cmp %rdx, %rsi
> 264171.903578113:  ffffffff81078299 native_sched_clock	jnz 0x810782d1
> 264171.903578113:  ffffffff8107829b native_sched_clock	mov (%rsi), %edx
> 264171.903578113:  ffffffff8107829d native_sched_clock	mov 0x4(%rsi), %ecx
> 264171.903578113:  ffffffff810782a0 native_sched_clock	mul %rdx
> 264171.903578113:  ffffffff810782a3 native_sched_clock	shrd %cl, %rdx, %rax
> 264171.903578113:  ffffffff810782a7 native_sched_clock	shr %cl, %rdx
> 264171.903578113:  ffffffff810782aa native_sched_clock	test $0x40, %cl
> 264171.903578113:  ffffffff810782ad native_sched_clock	cmovnz %rdx, %rax
> 264171.903578113:  ffffffff810782b1 native_sched_clock	add 0x8(%rsi), %rax
> 264171.903578113:  ffffffff810782b5 native_sched_clock	leave 
> 264171.903578113:  ffffffff810782b6 native_sched_clock	ret 
> 264171.903578113:  ffffffff81134529 trace_clock_local	pop %rbp
> 264171.903578113:  ffffffff8113452a trace_clock_local	ret 
> 264171.903578113:  ffffffff8113d68c rb_reserve_next_event	mov %rax, -0x50(%rbp)
> 264171.903578113:  ffffffff8113d690 rb_reserve_next_event	sub 0xa8(%r13), %rax
> 264171.903578113:  ffffffff8113d697 rb_reserve_next_event	mov 0xa8(%r13), %rcx
> 264171.903578113:  ffffffff8113d69e rb_reserve_next_event	cmp %rcx, -0x50(%rbp)
> 264171.903578113:  ffffffff8113d6a2 rb_reserve_next_event	jb 0x8113d6b4
> 264171.903578113:  ffffffff8113d6a4 rb_reserve_next_event	test $0xfffffffff8000000, %rax
> 264171.903578113:  ffffffff8113d6aa rb_reserve_next_event	mov %rax, -0x48(%rbp)
> 264171.903578113:  ffffffff8113d6ae rb_reserve_next_event	jnz rb_reserve_next_event+476
> 264171.903578113:  ffffffff8113d6b4 rb_reserve_next_event	mov -0x30(%rbp), %esi
> 264171.903578113:  ffffffff8113d6b7 rb_reserve_next_event	test %esi, %esi
> 264171.903578113:  ffffffff8113d6b9 rb_reserve_next_event	jnz rb_reserve_next_event+718

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 18:01         ` Steven Rostedt
@ 2016-11-21 18:06           ` Andi Kleen
  2016-11-21 18:22             ` Steven Rostedt
  0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-11-21 18:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek

> And a popf can be much more expensive than any of these. You should
> know, not all instructions are equal.
> 
> Using perf, I've seen popf take up almst 30% of a function the size of
> this.

In any case it's a small fraction of the 600+ instructions which are currently
executed for every enabled trace point.

If ftrace was actually optimized code this would make some sense, but it
clearly isn't ...

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 18:06           ` Andi Kleen
@ 2016-11-21 18:22             ` Steven Rostedt
  2016-11-21 18:37               ` Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-21 18:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, 21 Nov 2016 10:06:54 -0800
Andi Kleen <andi@firstfloor.org> wrote:

> > And a popf can be much more expensive than any of these. You should
> > know, not all instructions are equal.
> > 
> > Using perf, I've seen popf take up almst 30% of a function the size of
> > this.  
> 
> In any case it's a small fraction of the 600+ instructions which are currently
> executed for every enabled trace point.
> 
> If ftrace was actually optimized code this would make some sense, but it
> clearly isn't ...

It tries to be optimized. I "unoptimized" it a while back to pull out
all the inlines that were done in the tracepoint itself. That is, the
trace_<tracepoint>() function is inlined in the code itself. By
breaking that up a bit, I was able to save a bunch of text because the
tracepoints were bloating the kernel tremendously.

There can be more optimization done too. But just because it's not
optimized to the best it can be (which should be our goal) is not
excuse to bloat it more with popf!

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 17:55       ` Steven Rostedt
@ 2016-11-21 18:24         ` Steven Rostedt
  2016-11-21 20:12           ` Paul E. McKenney
  0 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-21 18:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Andi Kleen, Jiri Olsa, linux-kernel, Ingo Molnar,
	Josh Triplett, Jan Stancek


Paul,


On Mon, 21 Nov 2016 12:55:01 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 21 Nov 2016 18:18:53 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid
> > locking, as one does.  
> 
> Just to be clear, as ftrace in the kernel mostly represents function
> tracing, which doesn't use RCU. This is a tracepoint feature.
> 
> > 
> > Biggest objection would be that the rcu_irq_enter_irqson() thing does
> > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint
> > with that is quite a few cycles.  
> 
> Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not
> concerned enough to put a heavy weight rcu idle code in for all
> tracepoints.
> 
> Although, what about a percpu flag that can be checked in the
> tracepoint code to see if it should enable RCU or not?
> 
> Hmm, I wonder if "rcu_is_watching()" is light enough to have in all
> tracepoints?

Is it possible to make rcu_is_watching() an inlined call to prevent the
overhead of doing a function call? This way we could use this in the
fast path of the tracepoint.

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 18:22             ` Steven Rostedt
@ 2016-11-21 18:37               ` Andi Kleen
  2016-11-21 19:06                 ` Steven Rostedt
                                   ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Andi Kleen @ 2016-11-21 18:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek

> It tries to be optimized. I "unoptimized" it a while back to pull out
> all the inlines that were done in the tracepoint itself. That is, the
> trace_<tracepoint>() function is inlined in the code itself. By
> breaking that up a bit, I was able to save a bunch of text because the
> tracepoints were bloating the kernel tremendously.

Just adding a few inlines won't fix the gigantic bloat that is currently
there. See the PT trace I posted earlier (it was even truncated, it's
actually worse). Just a single enabled trace point took about a us.

POPF can cause some serializion but it won't be more than a few tens
of cycles, which would be a few percent at best.

Here is it again untruncated:

http://halobates.de/tracepoint-trace

$ wc -l tracepoint-trace 
640 tracepoint-trace
$ head -1 tracepoint-trace 
[001] 264171.903577780:  ffffffff810d6040 trace_event_raw_event_sched_switch	push %rbp
$ tail -1 tracepoint-trace
[001] 264171.903578780:  ffffffff810d6117 trace_event_raw_event_sched_switch	ret


> There can be more optimization done too. But just because it's not
> optimized to the best it can be (which should be our goal) is not
> excuse to bloat it more with popf!

Ok so how should tracing in idle code work then in your opinion?

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 18:37               ` Andi Kleen
@ 2016-11-21 19:06                 ` Steven Rostedt
  2016-11-21 19:15                 ` Steven Rostedt
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2016-11-21 19:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, 21 Nov 2016 10:37:00 -0800
Andi Kleen <andi@firstfloor.org> wrote:

> Ok so how should tracing in idle code work then in your opinion?

As I suggested already. If we can get a light weight rcu_is_watching()
then we can do the rcu_idle work when needed, and not when we don't
need it. Sure this will add more branches, but that's still cheaper
than popf.

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 18:37               ` Andi Kleen
  2016-11-21 19:06                 ` Steven Rostedt
@ 2016-11-21 19:15                 ` Steven Rostedt
  2016-11-21 20:44                   ` Andi Kleen
  2016-11-22 14:39                 ` Steven Rostedt
  2016-11-24  2:06                 ` Steven Rostedt
  3 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-21 19:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, 21 Nov 2016 10:37:00 -0800
Andi Kleen <andi@firstfloor.org> wrote:


> Just adding a few inlines won't fix the gigantic bloat that is currently
> there. See the PT trace I posted earlier (it was even truncated, it's
> actually worse). Just a single enabled trace point took about a us.

If you want to play with that, enable "Add tracepoint that benchmarks
tracepoints", and enable /sys/kernel/debug/tracing/events/benchmark

I just did on my box, and I have a max of 1.27 us, average .184 us.

Thus, on average, a tracepoint takes 184 nanoseconds.

Note, a cold cache tracepoint (the first one it traced which is not
saved in the "max") was 3.3 us.

Yes it can still be improved, and I do spend time doing that.

> 
> POPF can cause some serializion but it won't be more than a few tens
> of cycles, which would be a few percent at best.
> 
> Here is it again untruncated:
> 
> http://halobates.de/tracepoint-trace

There's a lot of push and pop regs due to function calling. There's
places that inlines can still improve things, and perhaps even some
likely unlikelys well placed.

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 18:24         ` Steven Rostedt
@ 2016-11-21 20:12           ` Paul E. McKenney
  0 siblings, 0 replies; 48+ messages in thread
From: Paul E. McKenney @ 2016-11-21 20:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Andi Kleen, Jiri Olsa, linux-kernel, Ingo Molnar,
	Josh Triplett, Jan Stancek

On Mon, Nov 21, 2016 at 01:24:38PM -0500, Steven Rostedt wrote:
> 
> Paul,
> 
> 
> On Mon, 21 Nov 2016 12:55:01 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Mon, 21 Nov 2016 18:18:53 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid
> > > locking, as one does.  
> > 
> > Just to be clear, as ftrace in the kernel mostly represents function
> > tracing, which doesn't use RCU. This is a tracepoint feature.
> > 
> > > 
> > > Biggest objection would be that the rcu_irq_enter_irqson() thing does
> > > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint
> > > with that is quite a few cycles.  
> > 
> > Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not
> > concerned enough to put a heavy weight rcu idle code in for all
> > tracepoints.
> > 
> > Although, what about a percpu flag that can be checked in the
> > tracepoint code to see if it should enable RCU or not?
> > 
> > Hmm, I wonder if "rcu_is_watching()" is light enough to have in all
> > tracepoints?
> 
> Is it possible to make rcu_is_watching() an inlined call to prevent the
> overhead of doing a function call? This way we could use this in the
> fast path of the tracepoint.

It would mean exposing the rcu_dynticks structure to the rest of the
kernel, but I guess that wouldn't be the end of the world.  Are you
calling rcu_is_watching() or __rcu_is_watching()?  The latter is
appropriate if preemption is already  disabled.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 19:15                 ` Steven Rostedt
@ 2016-11-21 20:44                   ` Andi Kleen
  2016-11-22  8:19                     ` Paul E. McKenney
  0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-11-21 20:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek

> > http://halobates.de/tracepoint-trace
> 
> There's a lot of push and pop regs due to function calling. There's
> places that inlines can still improve things, and perhaps even some
> likely unlikelys well placed.

Assuming you avoid all the push/pop and all the call/ret this would only be
~25% of the total instructions. There is just far too much logic and
computation in there.

% awk ' { print $5 } '  tracepoint-trace | sort | uniq -c | sort -rn
    222 mov
     57 push
     57 pop
     35 test
     34 cmp
     32 and
     28 jz
     25 jnz
     21 ret
     20 call
     16 lea
     11 add

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 20:44                   ` Andi Kleen
@ 2016-11-22  8:19                     ` Paul E. McKenney
  0 siblings, 0 replies; 48+ messages in thread
From: Paul E. McKenney @ 2016-11-22  8:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, Peter Zijlstra, Jiri Olsa, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, Nov 21, 2016 at 12:44:45PM -0800, Andi Kleen wrote:
> > > http://halobates.de/tracepoint-trace
> > 
> > There's a lot of push and pop regs due to function calling. There's
> > places that inlines can still improve things, and perhaps even some
> > likely unlikelys well placed.
> 
> Assuming you avoid all the push/pop and all the call/ret this would only be
> ~25% of the total instructions. There is just far too much logic and
> computation in there.
> 
> % awk ' { print $5 } '  tracepoint-trace | sort | uniq -c | sort -rn
>     222 mov
>      57 push
>      57 pop
>      35 test
>      34 cmp
>      32 and
>      28 jz
>      25 jnz
>      21 ret
>      20 call
>      16 lea
>      11 add

Hmmm...  It does indeed look like some performance analysis would be good...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 18:37               ` Andi Kleen
  2016-11-21 19:06                 ` Steven Rostedt
  2016-11-21 19:15                 ` Steven Rostedt
@ 2016-11-22 14:39                 ` Steven Rostedt
  2016-11-22 19:05                   ` Andi Kleen
  2016-11-24  2:06                 ` Steven Rostedt
  3 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-22 14:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, 21 Nov 2016 10:37:00 -0800
Andi Kleen <andi@firstfloor.org> wrote:


> Here is it again untruncated:
> 
> http://halobates.de/tracepoint-trace

BTW, what tool did you use to generate this?

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-22 14:39                 ` Steven Rostedt
@ 2016-11-22 19:05                   ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-11-22 19:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek

On Tue, Nov 22, 2016 at 09:39:45AM -0500, Steven Rostedt wrote:
> On Mon, 21 Nov 2016 10:37:00 -0800
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> 
> > Here is it again untruncated:
> > 
> > http://halobates.de/tracepoint-trace
> 
> BTW, what tool did you use to generate this?

perf and Intel PT, with the disassembler patches added and udis86
installed:

https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/log/?h=perf/disassembler-2
http://udis86.sourceforge.net/

You need a CPU that supports PT, like Broadwell, Skylake, Goldmont.

# enable trace point
perf record -e intel_pt// -a sleep X
perf script --itrace=i0ns --ns -F cpu,ip,time,asm,sym

On Skylake can also get more accurate timing by enabling cycle timing
(at the cost of some overhead)

perf record -e intel_pt/cyc=1,cyc_thresh=1/ -a ...

The earlier trace didn't have that enabled.

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 18:37               ` Andi Kleen
                                   ` (2 preceding siblings ...)
  2016-11-22 14:39                 ` Steven Rostedt
@ 2016-11-24  2:06                 ` Steven Rostedt
  2016-11-28 21:48                   ` Andi Kleen
  3 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-24  2:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, 21 Nov 2016 10:37:00 -0800
Andi Kleen <andi@firstfloor.org> wrote:

> > It tries to be optimized. I "unoptimized" it a while back to pull out
> > all the inlines that were done in the tracepoint itself. That is, the
> > trace_<tracepoint>() function is inlined in the code itself. By
> > breaking that up a bit, I was able to save a bunch of text because the
> > tracepoints were bloating the kernel tremendously.  
> 
> Just adding a few inlines won't fix the gigantic bloat that is currently
> there. See the PT trace I posted earlier (it was even truncated, it's
> actually worse). Just a single enabled trace point took about a us.
> 
> POPF can cause some serializion but it won't be more than a few tens
> of cycles, which would be a few percent at best.
> 
> Here is it again untruncated:
> 
> http://halobates.de/tracepoint-trace
> 

I took a look at this and forced some more functions to be inlined. I
did a little tweaking here and there. Could you pull my tree and see if
things are better?  I don't currently have the hardware to run this
myself.

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git

  branch: ftrace/core


Thanks!

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-24  2:06                 ` Steven Rostedt
@ 2016-11-28 21:48                   ` Andi Kleen
  2016-11-28 22:46                     ` Steven Rostedt
  0 siblings, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2016-11-28 21:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek

> I took a look at this and forced some more functions to be inlined. I
> did a little tweaking here and there. Could you pull my tree and see if
> things are better?  I don't currently have the hardware to run this
> myself.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> 
>   branch: ftrace/core

Didn't help much unfortunately. I think you need a real fast path here.

http://halobates.de/tracepoint-trace2

$ wc -l tracepoint-trace2
595 tracepoint-trace2

Opcode Histogram: 

    201 mov
     51 push
     51 pop
     34 test
     32 cmp
     26 jz
     24 jnz
     24 and
     16 ret
     16 lea
     15 call
     12 add
     10 jmp

Functions with # of instructions:

    25 trace_event_raw_event_sched_switch
     19 trace_event_buffer_reserve
     21 trace_event_buffer_lock_reserve
     57 ring_buffer_lock_reserve
      3 trace_clock_local
      3 sched_clock
     24 native_sched_clock
      3 sched_clock
      2 trace_clock_local
     11 ring_buffer_lock_reserve
     81 __rb_reserve_next
     11 ring_buffer_lock_reserve
      6 trace_event_buffer_lock_reserve
     18 ring_buffer_event_data
      5 trace_event_buffer_lock_reserve
     38 tracing_generic_entry_update
     11 trace_event_buffer_lock_reserve
      5 trace_event_buffer_reserve
     18 ring_buffer_event_data
      4 trace_event_buffer_reserve
     28 trace_event_raw_event_sched_switch
     30 trace_event_buffer_commit
     19 trace_buffer_unlock_commit_regs
     11 ring_buffer_unlock_commit
     85 rb_commit
     18 ring_buffer_unlock_commit
     10 trace_buffer_unlock_commit_regs
      3 ftrace_trace_userstack
      7 trace_buffer_unlock_commit_regs
     11 trace_event_buffer_commit
      8 trace_event_raw_event_sched_switch

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-28 21:48                   ` Andi Kleen
@ 2016-11-28 22:46                     ` Steven Rostedt
  2016-11-28 23:10                       ` Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2016-11-28 22:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Jan Stancek

On Mon, 28 Nov 2016 13:48:25 -0800
Andi Kleen <andi@firstfloor.org> wrote:

> > I took a look at this and forced some more functions to be inlined. I
> > did a little tweaking here and there. Could you pull my tree and see if
> > things are better?  I don't currently have the hardware to run this
> > myself.
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> > 
> >   branch: ftrace/core  
> 
> Didn't help much unfortunately. I think you need a real fast path here.

Well, actually it's been optimized a lot from the first instances.
Note, tracepoints need to be quite generic, so where it can be optimized
is not obvious.

> 
> http://halobates.de/tracepoint-trace2
> 
> $ wc -l tracepoint-trace2
> 595 tracepoint-trace2
> 
> Opcode Histogram: 
> 
>     201 mov
>      51 push
>      51 pop
>      34 test
>      32 cmp
>      26 jz
>      24 jnz
>      24 and
>      16 ret
>      16 lea
>      15 call
>      12 add
>      10 jmp
> 
> Functions with # of instructions:
> 
>     25 trace_event_raw_event_sched_switch

This is from the TRACE_EVENT macro.

trace_event_raw_event_##call(void *__data, proto)                       \
{                                                                       \
        struct trace_event_file *trace_file = __data;                   \
        struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
        struct trace_event_buffer fbuffer;                              \
        struct trace_event_raw_##call *entry;                           \
        int __data_size;                                                \
                                                                        \
        if (trace_trigger_soft_disabled(trace_file))                    \
                return;                                                 \
                                                                        \
        __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
									\
        entry = trace_event_buffer_reserve(&fbuffer, trace_file,        \
                                 sizeof(*entry) + __data_size);         \


>      19 trace_event_buffer_reserve

void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
                                 struct trace_event_file *trace_file,
                                 unsigned long len)
{
        struct trace_event_call *event_call = trace_file->event_call;

        if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
            trace_event_ignore_this_pid(trace_file))
                return NULL;

        local_save_flags(fbuffer->flags);
        fbuffer->pc = preempt_count();
        /*
         * If CONFIG_PREEMPT is enabled, then the tracepoint itself disables
         * preemption (adding one to the preempt_count). Since we are
         * interested in the preempt_count at the time the tracepoint was
         * hit, we need to subtract one to offset the increment.
         */
        if (IS_ENABLED(CONFIG_PREEMPT))
                fbuffer->pc--;
        fbuffer->trace_file = trace_file;

        fbuffer->event =
                trace_event_buffer_lock_reserve(&fbuffer->buffer, trace_file,
                                                event_call->event.type, len,
                                                fbuffer->flags, fbuffer->pc);


>      21 trace_event_buffer_lock_reserve

struct ring_buffer_event *
trace_event_buffer_lock_reserve(struct ring_buffer **current_rb,
                          struct trace_event_file *trace_file,
                          int type, unsigned long len,
                          unsigned long flags, int pc)
{
        struct ring_buffer_event *entry;
        int val;

        *current_rb = trace_file->tr->trace_buffer.buffer;

        if ((trace_file->flags &
             (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) &&
            (entry = this_cpu_read(trace_buffered_event))) {


Note: Some of these if statements can be encapsulated with jump labels, as they
are false pretty much all of the time.


                /* Try to use the per cpu buffer first */
                val = this_cpu_inc_return(trace_buffered_event_cnt);
                if (val == 1) {
                        trace_event_setup(entry, type, flags, pc);
                        entry->array[0] = len;
                        return entry;
                }
                this_cpu_dec(trace_buffered_event_cnt);
        }

        entry = trace_buffer_lock_reserve(*current_rb,
                                         type, len, flags, pc);


>      57 ring_buffer_lock_reserve

struct ring_buffer_event *
ring_buffer_lock_reserve(struct ring_buffer *buffer, unsigned long length)
{
        struct ring_buffer_per_cpu *cpu_buffer;
        struct ring_buffer_event *event;
        int cpu;

        /* If we are tracing schedule, we don't want to recurse */
        preempt_disable_notrace();

        if (unlikely(atomic_read(&buffer->record_disabled)))
                goto out;

        cpu = raw_smp_processor_id();

        if (unlikely(!cpumask_test_cpu(cpu, buffer->cpumask)))
                goto out;

        cpu_buffer = buffer->buffers[cpu];

        if (unlikely(atomic_read(&cpu_buffer->record_disabled)))
                goto out;

        if (unlikely(length > BUF_MAX_DATA_SIZE))
                goto out;

        if (unlikely(trace_recursive_lock(cpu_buffer)))
                goto out;

        event = rb_reserve_next_event(buffer, cpu_buffer, length);
        if (!event)
                goto out_unlock;

static struct ring_buffer_event *
rb_reserve_next_event(struct ring_buffer *buffer,
                      struct ring_buffer_per_cpu *cpu_buffer,
                      unsigned long length)
{
        struct ring_buffer_event *event;
        struct rb_event_info info;
        int nr_loops = 0;
        u64 diff;

        rb_start_commit(cpu_buffer);

#ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
        /*
         * Due to the ability to swap a cpu buffer from a buffer
         * it is possible it was swapped before we committed.
         * (committing stops a swap). We check for it here and
         * if it happened, we have to fail the write.
         */
        barrier();
        if (unlikely(ACCESS_ONCE(cpu_buffer->buffer) != buffer)) {
                local_dec(&cpu_buffer->committing);
                local_dec(&cpu_buffer->commits);
                return NULL;
        }
#endif

        info.length = rb_calculate_event_length(length);
 again:
        info.add_timestamp = 0;
        info.delta = 0;

        /*
         * We allow for interrupts to reenter here and do a trace.
         * If one does, it will cause this original code to loop
         * back here. Even with heavy interrupts happening, this
         * should only happen a few times in a row. If this happens
         * 1000 times in a row, there must be either an interrupt
         * storm or we have something buggy.
         * Bail!
         */
        if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
                goto out_fail;

        info.ts = rb_time_stamp(cpu_buffer->buffer);
        diff = info.ts - cpu_buffer->write_stamp;

        /* make sure this diff is calculated here */
        barrier();

        /* Did the write stamp get updated already? */
        if (likely(info.ts >= cpu_buffer->write_stamp)) {
                info.delta = diff;
                if (unlikely(test_time_stamp(info.delta)))
                        rb_handle_timestamp(cpu_buffer, &info);
        }

        event = __rb_reserve_next(cpu_buffer, &info);


>       3 trace_clock_local
>       3 sched_clock
>      24 native_sched_clock

Not much we can do about the clock

>       3 sched_clock
>       2 trace_clock_local
>      11 ring_buffer_lock_reserve
>      81 __rb_reserve_next

static struct ring_buffer_event *
__rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
                  struct rb_event_info *info)
{
        struct ring_buffer_event *event;
        struct buffer_page *tail_page;
        unsigned long tail, write;

        /*
         * If the time delta since the last event is too big to
         * hold in the time field of the event, then we append a
         * TIME EXTEND event ahead of the data event.
         */
        if (unlikely(info->add_timestamp))
                info->length += RB_LEN_TIME_EXTEND;

        /* Don't let the compiler play games with cpu_buffer->tail_page */
        tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
        write = local_add_return(info->length, &tail_page->write);

        /* set write to only the index of the write */
        write &= RB_WRITE_MASK;
        tail = write - info->length;

        /*
         * If this is the first commit on the page, then it has the same
         * timestamp as the page itself.
         */
        if (!tail)
                info->delta = 0;

        /* See if we shot pass the end of this buffer page */
        if (unlikely(write > BUF_PAGE_SIZE))
                return rb_move_tail(cpu_buffer, tail, info);

        /* We reserved something on the buffer */

        event = __rb_page_index(tail_page, tail);
        kmemcheck_annotate_bitfield(event, bitfield);
        rb_update_event(cpu_buffer, event, info);

        local_inc(&tail_page->entries);

        /*
         * If this is the first commit on the page, then update
         * its timestamp.
         */
        if (!tail)
                tail_page->page->time_stamp = info->ts;

        /* account for these added bytes */
        local_add(info->length, &cpu_buffer->entries_bytes);

        return event;
}

>      11 ring_buffer_lock_reserve
>       6 trace_event_buffer_lock_reserve

        if (!fbuffer->event)
                return NULL;

        fbuffer->entry = ring_buffer_event_data(fbuffer->event);
        return fbuffer->entry;


>      18 ring_buffer_event_data

static void *
rb_event_data(struct ring_buffer_event *event)
{
        if (event->type_len == RINGBUF_TYPE_TIME_EXTEND)
                event = skip_time_extend(event);
        BUG_ON(event->type_len > RINGBUF_TYPE_DATA_TYPE_LEN_MAX);
        /* If length is in len field, then array[0] has the data */
        if (event->type_len)
                return (void *)&event->array[0];
        /* Otherwise length is in array[0] and array[1] has the data */
        return (void *)&event->array[1];
}

Hmm, the above could probably be optimized better.

>       5 trace_event_buffer_lock_reserve
>      38 tracing_generic_entry_update

I tried to optimize this one, but nothing really helped.

void
tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
                             int pc)
{
        struct task_struct *tsk = current;

        entry->preempt_count            = pc & 0xff;
        entry->pid                      = (tsk) ? tsk->pid : 0;
        entry->flags =
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
                (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
#else
                TRACE_FLAG_IRQS_NOSUPPORT |
#endif
                ((pc & NMI_MASK    ) ? TRACE_FLAG_NMI     : 0) |
                ((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
                ((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
                (tif_need_resched() ? TRACE_FLAG_NEED_RESCHED : 0) |
                (test_preempt_need_resched() ? TRACE_FLAG_PREEMPT_RESCHED : 0);
}

>      11 trace_event_buffer_lock_reserve
>       5 trace_event_buffer_reserve
>      18 ring_buffer_event_data

Strange, this is called twice?

>       4 trace_event_buffer_reserve
>      28 trace_event_raw_event_sched_switch
>      30 trace_event_buffer_commit
>      19 trace_buffer_unlock_commit_regs
>      11 ring_buffer_unlock_commit
>      85 rb_commit

The rb_commit is a pain due to lockless commits.

Namely this part:

static void
rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
{
        unsigned long max_count;

        /*
         * We only race with interrupts and NMIs on this CPU.
         * If we own the commit event, then we can commit
         * all others that interrupted us, since the interruptions
         * are in stack format (they finish before they come
         * back to us). This allows us to do a simple loop to
         * assign the commit to the tail.
         */
 again:
        max_count = cpu_buffer->nr_pages * 100;

        while (cpu_buffer->commit_page != READ_ONCE(cpu_buffer->tail_page)) {
                if (RB_WARN_ON(cpu_buffer, !(--max_count)))
                        return;
                if (RB_WARN_ON(cpu_buffer,
                               rb_is_reader_page(cpu_buffer->tail_page)))
                        return;
                local_set(&cpu_buffer->commit_page->page->commit,
                          rb_page_write(cpu_buffer->commit_page));
                rb_inc_page(cpu_buffer, &cpu_buffer->commit_page);
                /* Only update the write stamp if the page has an event */
                if (rb_page_write(cpu_buffer->commit_page))
                        cpu_buffer->write_stamp =
                                cpu_buffer->commit_page->page->time_stamp;
                /* add barrier to keep gcc from optimizing too much */
                barrier();
        }
        while (rb_commit_index(cpu_buffer) !=
               rb_page_write(cpu_buffer->commit_page)) {

                local_set(&cpu_buffer->commit_page->page->commit,
                          rb_page_write(cpu_buffer->commit_page));
                RB_WARN_ON(cpu_buffer,
                           local_read(&cpu_buffer->commit_page->page->commit) &
                           ~RB_WRITE_MASK);
                barrier();
        }

        /* again, keep gcc from optimizing */
        barrier();

        /*
         * If an interrupt came in just after the first while loop
         * and pushed the tail page forward, we will be left with
         * a dangling commit that will never go forward.
         */
        if (unlikely(cpu_buffer->commit_page != READ_ONCE(cpu_buffer->tail_page)))
                goto again;
}

>      18 ring_buffer_unlock_commit
>      10 trace_buffer_unlock_commit_regs
>       3 ftrace_trace_userstack

I should make this call a jump label, as it's seldom used.

>       7 trace_buffer_unlock_commit_regs
>      11 trace_event_buffer_commit
>       8 trace_event_raw_event_sched_switch
> 


There are a few things that can still be optimized. But the issue
really is that trace_events are generic and macro created, which I need
to balance how much gets inlined vs how much is passed around to
functions.

And the ring buffer is lockless, thus it needs to do various hoops to
make sure things don't break.

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-28 22:46                     ` Steven Rostedt
@ 2016-11-28 23:10                       ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2016-11-28 23:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek

> Well, actually it's been optimized a lot from the first instances.
> Note, tracepoints need to be quite generic, so where it can be optimized
> is not obvious.

There is quite a bit of stuff that doesn't change from trace point to
trace point. So you could just cache all these decisions into a single
per cpu variable, and invalidate if anything changes the conditions.

> > 
> > Functions with # of instructions:
> > 
> >     25 trace_event_raw_event_sched_switch
> 
> This is from the TRACE_EVENT macro.
> 
>         if (trace_trigger_soft_disabled(trace_file))                    \
>                 return;                                                 \

Cache?

>         if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
>             trace_event_ignore_this_pid(trace_file))
>                 return NULL;
> 
>         *current_rb = trace_file->tr->trace_buffer.buffer;

Indirection could be cached.

> 
>         if ((trace_file->flags &
>              (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) &&
>             (entry = this_cpu_read(trace_buffered_event))) {

Multiple tests of trace_file->flags in different functions, 
could be all folded into a single test.

> 
> 
> Note: Some of these if statements can be encapsulated with jump labels, as they
> are false pretty much all of the time.
> 
> 
>                 /* Try to use the per cpu buffer first */
>                 val = this_cpu_inc_return(trace_buffered_event_cnt);
>                 if (val == 1) {
>                         trace_event_setup(entry, type, flags, pc);
>                         entry->array[0] = len;
>                         return entry;
>                 }
>                 this_cpu_dec(trace_buffered_event_cnt);

This can be pre cached in a fast path. If true just point per cpu
variable directly to buffer and invalidate if buffer changes or is full.

>         preempt_disable_notrace();
> 
>         if (unlikely(atomic_read(&buffer->record_disabled)))
>                 goto out;

Cache. 
> 
>         cpu = raw_smp_processor_id();
> 
>         if (unlikely(!cpumask_test_cpu(cpu, buffer->cpumask)))
>                 goto out;

Not needed for per cpu buffers in a fast path.

> 
>         cpu_buffer = buffer->buffers[cpu];
> 
>         if (unlikely(atomic_read(&cpu_buffer->record_disabled)))
>                 goto out;
cache

> 
>         if (unlikely(length > BUF_MAX_DATA_SIZE))
>                 goto out;
> 
>         if (unlikely(trace_recursive_lock(cpu_buffer)))
>                 goto out;

locking should invalidate state on this cpu, so shouldn't be needed

>         u64 diff;
> 
>         rb_start_commit(cpu_buffer);
> 
> #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
>         /*
>          * Due to the ability to swap a cpu buffer from a buffer
>          * it is possible it was swapped before we committed.
>          * (committing stops a swap). We check for it here and
>          * if it happened, we have to fail the write.
>          */
>         barrier();
>         if (unlikely(ACCESS_ONCE(cpu_buffer->buffer) != buffer)) {
>                 local_dec(&cpu_buffer->committing);
>                 local_dec(&cpu_buffer->commits);
>                 return NULL;
>         }

swapping should invalidate state centrally (and then run slow path like
current path)

>         /*
>          * We allow for interrupts to reenter here and do a trace.
>          * If one does, it will cause this original code to loop
>          * back here. Even with heavy interrupts happening, this
>          * should only happen a few times in a row. If this happens
>          * 1000 times in a row, there must be either an interrupt
>          * storm or we have something buggy.
>          * Bail!
>          */
>         if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
>                 goto out_fail;

didn't you disable interrupts earlier?

>         /* Did the write stamp get updated already? */
>         if (likely(info.ts >= cpu_buffer->write_stamp)) {
>                 info.delta = diff;
>                 if (unlikely(test_time_stamp(info.delta)))
>                         rb_handle_timestamp(cpu_buffer, &info);
>         }

Not sure why a write stamp is needed. isn't that the last few entries?

...

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21 16:06               ` Borislav Petkov
@ 2016-11-29 13:16                 ` Borislav Petkov
  2016-11-29 13:59                   ` Thomas Gleixner
  2016-11-29 14:04                   ` Jiri Olsa
  0 siblings, 2 replies; 48+ messages in thread
From: Borislav Petkov @ 2016-11-29 13:16 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote:
> IOW, what's the worst thing that can happen if we did this below?
> 
> We basically get rid of the detection and switch the timer to broadcast
> mode immediately on the halting CPU.
> 
> amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
> check so it will run on the affected CPUs only...
> 
> Thoughts?

Actually, here's a better version. The E400 detection works only after
ACPI has been enabled so we piggyback the end of acpi_init().

We don't need the MSR read now - we do

	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E))

on the idle path which is as fast as it gets.

Any complaints about this before I go and test it everywhere?

It builds and boots in my guest here ok, not that it means a whole lot.

The good news is, I have collected a fleet of boxes which all have that
erratum so testing should be pretty reliable. Something that doesn't
happen everytime!

:-)

---
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 5391b0ae7cc3..aa8c879e814f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -94,7 +94,8 @@ static inline unsigned int acpi_processor_cstate_check(unsigned int max_cstate)
 	    boot_cpu_data.x86_model <= 0x05 &&
 	    boot_cpu_data.x86_mask < 0x0A)
 		return 1;
-	else if (amd_e400_c1e_detected)
+
+	else if (boot_cpu_has(X86_BUG_AMD_APIC_C1E))
 		return 1;
 	else
 		return max_cstate;
@@ -131,6 +132,7 @@ static inline bool acpi_has_cpu_in_madt(void)
 	return !!acpi_lapic;
 }
 
+extern void __init __weak arch_post_acpi_init(void);
 #else /* !CONFIG_ACPI */
 
 #define acpi_lapic 0
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 1d2b69fc0ceb..b3918609f6f1 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -204,6 +204,7 @@ static __always_inline __pure bool _static_cpu_has(u16 bit)
 
 #define static_cpu_has_bug(bit)		static_cpu_has((bit))
 #define boot_cpu_has_bug(bit)		cpu_has_bug(&boot_cpu_data, (bit))
+#define setup_clear_cpu_bug(bit)	setup_clear_cpu_cap((bit))
 
 #define MAX_CPU_FEATURES		(NCAPINTS * 32)
 #define cpu_have_feature		boot_cpu_has
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1f6a92903b09..d98c5c432209 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -636,7 +636,6 @@ extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
 extern unsigned long		boot_option_idle_override;
-extern bool			amd_e400_c1e_detected;
 
 enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, IDLE_NOMWAIT,
 			 IDLE_POLL};
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ee023919e476..56653b7322f9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -289,9 +289,6 @@ void stop_this_cpu(void *dummy)
 		halt();
 }
 
-bool amd_e400_c1e_detected;
-EXPORT_SYMBOL(amd_e400_c1e_detected);
-
 static cpumask_var_t amd_e400_c1e_mask;
 
 void amd_e400_remove_cpu(int cpu)
@@ -301,26 +298,14 @@ void amd_e400_remove_cpu(int cpu)
 }
 
 /*
- * AMD Erratum 400 aware idle routine. We check for C1E active in the interrupt
- * pending message MSR. If we detect C1E, then we handle it the same
- * way as C3 power states (local apic timer and TSC stop)
+ * AMD Erratum 400 aware idle routine.
+ *
+ *
+ * We handle it the same way as C3 power states (local apic timer and TSC stop)
  */
 static void amd_e400_idle(void)
 {
-	if (!amd_e400_c1e_detected) {
-		u32 lo, hi;
-
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
-
-		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
-			amd_e400_c1e_detected = true;
-			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
-				mark_tsc_unstable("TSC halt in AMD C1E");
-			pr_info("System has AMD C1E enabled\n");
-		}
-	}
-
-	if (amd_e400_c1e_detected) {
+	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) {
 		int cpu = smp_processor_id();
 
 		if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) {
@@ -419,6 +404,30 @@ void __init init_amd_e400_c1e_mask(void)
 		zalloc_cpumask_var(&amd_e400_c1e_mask, GFP_KERNEL);
 }
 
+void __init arch_post_acpi_init(void)
+{
+	u32 lo, hi;
+
+	if (!boot_cpu_has_bug(X86_BUG_AMD_APIC_C1E))
+		return;
+
+	/*
+	 * AMD E400 detection needs to happen *after* ACPI has been enabled. We
+	 * check for C1E active in the interrupt pending message MSR. If we detect
+	 * C1E enabled, machine is affected.
+	 */
+	rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+
+	if (!(lo & K8_INTP_C1E_ACTIVE_MASK)) {
+		setup_clear_cpu_bug(X86_BUG_AMD_APIC_C1E);
+		return;
+	}
+
+	if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+		mark_tsc_unstable("TSC halt in AMD C1E");
+	pr_info("System has AMD C1E enabled\n");
+}
+
 static int __init idle_setup(char *str)
 {
 	if (!str)
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 56190d00fd87..da42fe354ffc 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -34,6 +34,7 @@
 #include <linux/reboot.h>
 #include <linux/delay.h>
 #ifdef CONFIG_X86
+#include <asm/acpi.h>
 #include <asm/mpspec.h>
 #endif
 #include <linux/acpi_iort.h>
@@ -1166,6 +1167,13 @@ static int __init acpi_bus_init(void)
 struct kobject *acpi_kobj;
 EXPORT_SYMBOL_GPL(acpi_kobj);
 
+/*
+ * Arch-specific one to override this.
+ */
+void __init __weak arch_post_acpi_init(void)
+{
+}
+
 static int __init acpi_init(void)
 {
 	int result;
@@ -1198,6 +1206,9 @@ static int __init acpi_init(void)
 	acpi_debugger_init();
 	acpi_setup_sb_notify_handler();
 	acpi_set_processor_mapping();
+
+	arch_post_acpi_init();
+
 	return 0;
 }
 
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 2237d3f24f0e..8eec3255d6f2 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -141,7 +141,7 @@ static void lapic_timer_check_state(int state, struct acpi_processor *pr,
 	if (cpu_has(&cpu_data(pr->id), X86_FEATURE_ARAT))
 		return;
 
-	if (amd_e400_c1e_detected)
+	if (boot_cpu_has(X86_BUG_AMD_APIC_C1E))
 		type = ACPI_STATE_C1;
 
 	/*

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-29 13:16                 ` Borislav Petkov
@ 2016-11-29 13:59                   ` Thomas Gleixner
  2016-11-30  8:48                     ` Borislav Petkov
  2016-11-29 14:04                   ` Jiri Olsa
  1 sibling, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2016-11-29 13:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen,
	Jan Stancek

On Tue, 29 Nov 2016, Borislav Petkov wrote:
> On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote:
> > IOW, what's the worst thing that can happen if we did this below?
> > 
> > We basically get rid of the detection and switch the timer to broadcast
> > mode immediately on the halting CPU.
> > 
> > amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
> > check so it will run on the affected CPUs only...
> > 
> > Thoughts?
> 
> Actually, here's a better version. The E400 detection works only after
> ACPI has been enabled so we piggyback the end of acpi_init().
> 
> We don't need the MSR read now - we do
> 
> 	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E))
> 
> on the idle path which is as fast as it gets.
> 
> Any complaints about this before I go and test it everywhere?

The issue is that you obvioulsy start with the assumption, that the machine
has this bug. As a consequence the machine is brute forced into tick
broadcast mode, which cannot be reverted when you clear that misfeature
after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is
forced into broadcast mode, which is not what you want.

As far as I understood the whole magic, this C1E misfeature takes only
effect _after_ ACPI has been initialized. So instead of setting the bug in
early boot and therefor forcing the broadcast nonsense, we should only set
it when ACPI has actually detected it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-29 13:16                 ` Borislav Petkov
  2016-11-29 13:59                   ` Thomas Gleixner
@ 2016-11-29 14:04                   ` Jiri Olsa
  1 sibling, 0 replies; 48+ messages in thread
From: Jiri Olsa @ 2016-11-29 14:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Thomas Gleixner, Steven Rostedt,
	Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett,
	Andi Kleen, Jan Stancek

On Tue, Nov 29, 2016 at 02:16:49PM +0100, Borislav Petkov wrote:
> On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote:
> > IOW, what's the worst thing that can happen if we did this below?
> > 
> > We basically get rid of the detection and switch the timer to broadcast
> > mode immediately on the halting CPU.
> > 
> > amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
> > check so it will run on the affected CPUs only...
> > 
> > Thoughts?
> 
> Actually, here's a better version. The E400 detection works only after
> ACPI has been enabled so we piggyback the end of acpi_init().
> 
> We don't need the MSR read now - we do
> 
> 	if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E))
> 
> on the idle path which is as fast as it gets.
> 
> Any complaints about this before I go and test it everywhere?
> 
> It builds and boots in my guest here ok, not that it means a whole lot.
> 
> The good news is, I have collected a fleet of boxes which all have that
> erratum so testing should be pretty reliable. Something that doesn't
> happen everytime!

+1 one test server over here.. will check ;-)

jirka

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-29 13:59                   ` Thomas Gleixner
@ 2016-11-30  8:48                     ` Borislav Petkov
  2016-11-30  8:54                       ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Borislav Petkov @ 2016-11-30  8:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen,
	Jan Stancek

On Tue, Nov 29, 2016 at 02:59:01PM +0100, Thomas Gleixner wrote:
> The issue is that you obvioulsy start with the assumption, that the machine
> has this bug. As a consequence the machine is brute forced into tick
> broadcast mode, which cannot be reverted when you clear that misfeature
> after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is
> forced into broadcast mode, which is not what you want.
> 
> As far as I understood the whole magic, this C1E misfeature takes only
> effect _after_ ACPI has been initialized. So instead of setting the bug in
> early boot and therefor forcing the broadcast nonsense, we should only set
> it when ACPI has actually detected it.

Problem is, select_idle_routine() runs a lot earlier than acpi_init() so
there's a window where we don't definitively know yet whether the box is
actually going to enter C1E or not.

  [ I presume the reason why we have to do the proper detection after
    ACPI has been initialized is because the frickelware decides whether
    to do C1E entry or not and then sets those bits in the MSR (or not). ]

If in that window we enter idle and we're on an affected machine and we
*don't* switch to broadcast mode, we risk not waking up from C1E, i.e.,
the main reason this fix was even done.

So, if we "prematurely" switch to broadcast mode on the affected CPUs,
we're ok, it will be detected properly later and we're in broadcast
mode already.

Now, on those machines which are not affected and we clear
X86_BUG_AMD_APIC_C1E because they don't enter C1E at all, I was thinking
of maybe doing amd_e400_remove_cpu() and clearing that e400 mask and
even freeing it so that they can do default_idle().

But you're saying tick_broadcast_enter() is irreversible?

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-30  8:48                     ` Borislav Petkov
@ 2016-11-30  8:54                       ` Thomas Gleixner
  2016-11-30  9:07                         ` Borislav Petkov
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2016-11-30  8:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen,
	Jan Stancek

On Wed, 30 Nov 2016, Borislav Petkov wrote:
> On Tue, Nov 29, 2016 at 02:59:01PM +0100, Thomas Gleixner wrote:
> > The issue is that you obvioulsy start with the assumption, that the machine
> > has this bug. As a consequence the machine is brute forced into tick
> > broadcast mode, which cannot be reverted when you clear that misfeature
> > after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is
> > forced into broadcast mode, which is not what you want.
> > 
> > As far as I understood the whole magic, this C1E misfeature takes only
> > effect _after_ ACPI has been initialized. So instead of setting the bug in
> > early boot and therefor forcing the broadcast nonsense, we should only set
> > it when ACPI has actually detected it.
> 
> Problem is, select_idle_routine() runs a lot earlier than acpi_init() so
> there's a window where we don't definitively know yet whether the box is
> actually going to enter C1E or not.
> 
>   [ I presume the reason why we have to do the proper detection after
>     ACPI has been initialized is because the frickelware decides whether
>     to do C1E entry or not and then sets those bits in the MSR (or not). ]
> 
> If in that window we enter idle and we're on an affected machine and we
> *don't* switch to broadcast mode, we risk not waking up from C1E, i.e.,
> the main reason this fix was even done.
> 
> So, if we "prematurely" switch to broadcast mode on the affected CPUs,
> we're ok, it will be detected properly later and we're in broadcast
> mode already.

Right, that's the safe bet. But I'm quite sure that the C1E crap only
starts to work _after_ ACPI initialization.

> Now, on those machines which are not affected and we clear
> X86_BUG_AMD_APIC_C1E because they don't enter C1E at all, I was thinking
> of maybe doing amd_e400_remove_cpu() and clearing that e400 mask and
> even freeing it so that they can do default_idle().
> 
> But you're saying tick_broadcast_enter() is irreversible?

tick_force_broadcast() is irreversible

Thanks,

	tglx
	

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-30  8:54                       ` Thomas Gleixner
@ 2016-11-30  9:07                         ` Borislav Petkov
  2016-11-30  9:14                           ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Borislav Petkov @ 2016-11-30  9:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen,
	Jan Stancek

On Wed, Nov 30, 2016 at 09:54:58AM +0100, Thomas Gleixner wrote:
> Right, that's the safe bet. But I'm quite sure that the C1E crap only
> starts to work _after_ ACPI initialization.

Yap, I think it is an ACPI decision whether to enter C1E or not. And all
those boxes which are unaffected - they actually are but since the FW
doesn't enter C1E, they don't hit the bug.

> tick_force_broadcast() is irreversible

So if I let the cores in that small window use amd_e400_idle() and
then when I detect the machine doesn't enter C1E after all, I do
tick_broadcast_exit() on all cores in amd_e400_c1e_mask, then clear it
and free it, that would work?

Or do you see a better solution?

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-30  9:07                         ` Borislav Petkov
@ 2016-11-30  9:14                           ` Thomas Gleixner
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2016-11-30  9:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney,
	linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen,
	Jan Stancek

On Wed, 30 Nov 2016, Borislav Petkov wrote:
> On Wed, Nov 30, 2016 at 09:54:58AM +0100, Thomas Gleixner wrote:
> > Right, that's the safe bet. But I'm quite sure that the C1E crap only
> > starts to work _after_ ACPI initialization.
> 
> Yap, I think it is an ACPI decision whether to enter C1E or not. And all
> those boxes which are unaffected - they actually are but since the FW
> doesn't enter C1E, they don't hit the bug.
> 
> > tick_force_broadcast() is irreversible
> 
> So if I let the cores in that small window use amd_e400_idle() and
> then when I detect the machine doesn't enter C1E after all, I do
> tick_broadcast_exit() on all cores in amd_e400_c1e_mask, then clear it
> and free it, that would work?
> 
> Or do you see a better solution?

Start with the bug cleared and select amd_e400_idle(), which will then use
default_idle() and not do any of the broadcast crap.

After ACPI gets initialized check the C1E misfeature and if it's detected,
then set the CPU BUG and send an IPI to all online CPUs.

I think this is safe because the CPU running the ACPI initialization is not
in idle and if any of the other CPUs would be stuck due to the C1E
enablement, then the IPI will kick them out.

At least worth a try.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2016-11-21  9:28 ` Peter Zijlstra
  2016-11-21  9:34   ` Jiri Olsa
  2016-11-21 17:06   ` Andi Kleen
@ 2017-02-23 12:24   ` Jiri Olsa
  2017-02-23 17:11     ` Andi Kleen
  2017-02-23 17:28     ` Peter Zijlstra
  2 siblings, 2 replies; 48+ messages in thread
From: Jiri Olsa @ 2017-02-23 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> > hi,
> > Jan hit following output when msr tracepoints are enabled on amd server:
> > 
> > [   91.585653] ===============================
> > [   91.589840] [ INFO: suspicious RCU usage. ]
> > [   91.594025] 4.9.0-rc1+ #1 Not tainted
> > [   91.597691] -------------------------------
> > [   91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
> > [   91.610222] 
> > [   91.610222] other info that might help us debug this:
> > [   91.610222] 
> > [   91.618224] 
> > [   91.618224] RCU used illegally from idle CPU!
> > [   91.618224] rcu_scheduler_active = 1, debug_locks = 0
> > [   91.629081] RCU used illegally from extended quiescent state!
> > [   91.634820] no locks held by swapper/1/0.
> > [   91.638832] 
> > [   91.638832] stack backtrace:
> > [   91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
> > [   91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03   ' 05/09/2008
> > [   91.656159]  ffffc900018fbdf8 ffffffff813ed43c ffff88017ede8000 0000000000000001
> > [   91.663637]  ffffc900018fbe28 ffffffff810fdcd7 ffff880233f95dd0 00000000c0010055
> > [   91.671107]  0000000000000000 0000000000000000 ffffc900018fbe58 ffffffff814297ac
> > [   91.678560] Call Trace:
> > [   91.681022]  [<ffffffff813ed43c>] dump_stack+0x85/0xc9
> > [   91.686164]  [<ffffffff810fdcd7>] lockdep_rcu_suspicious+0xe7/0x120
> > [   91.692429]  [<ffffffff814297ac>] do_trace_read_msr+0x14c/0x1b0
> > [   91.698349]  [<ffffffff8106ddb2>] native_read_msr+0x32/0x40
> > [   91.703921]  [<ffffffff8103b2be>] amd_e400_idle+0x7e/0x110
> > [   91.709407]  [<ffffffff8103b78f>] arch_cpu_idle+0xf/0x20
> > [   91.714720]  [<ffffffff8181cd33>] default_idle_call+0x23/0x40
> > [   91.720467]  [<ffffffff810f306a>] cpu_startup_entry+0x1da/0x2b0
> > [   91.726387]  [<ffffffff81058b1f>] start_secondary+0x17f/0x1f0
> > 
> > 
> > it got away with attached change.. but this rcu logic
> > is far beyond me, so it's just wild guess.. ;-)
> 
> I think I prefer something like the below, that only annotates the one
> RDMSR in question, instead of all of them.
> 
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0888a879120f..d6c6aa80675f 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
>  	if (!amd_e400_c1e_detected) {
>  		u32 lo, hi;
>  
> -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
>  
>  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
>  			amd_e400_c1e_detected = true;

hum, I might have missed some other solution in discussion,
and can't see this one being pulled in.. should I resend this?

thanks,
jirka

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2017-02-23 12:24   ` Jiri Olsa
@ 2017-02-23 17:11     ` Andi Kleen
  2017-02-23 17:28     ` Peter Zijlstra
  1 sibling, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2017-02-23 17:11 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Steven Rostedt, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 0888a879120f..d6c6aa80675f 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> >  	if (!amd_e400_c1e_detected) {
> >  		u32 lo, hi;
> >  
> > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> >  
> >  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
> >  			amd_e400_c1e_detected = true;
> 
> hum, I might have missed some other solution in discussion,
> and can't see this one being pulled in.. should I resend this?

I suppose you could also use the new notrace msr variant.

BTW this check is quite annoying.

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2017-02-23 12:24   ` Jiri Olsa
  2017-02-23 17:11     ` Andi Kleen
@ 2017-02-23 17:28     ` Peter Zijlstra
  2017-02-23 17:56       ` Borislav Petkov
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2017-02-23 17:28 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar,
	Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov,
	Thomas Gleixner

On Thu, Feb 23, 2017 at 01:24:34PM +0100, Jiri Olsa wrote:
> On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> > 
> > I think I prefer something like the below, that only annotates the one
> > RDMSR in question, instead of all of them.
> > 
> > 
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 0888a879120f..d6c6aa80675f 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> >  	if (!amd_e400_c1e_detected) {
> >  		u32 lo, hi;
> >  
> > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> >  
> >  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
> >  			amd_e400_c1e_detected = true;
> 
> hum, I might have missed some other solution in discussion,
> and can't see this one being pulled in.. should I resend this?
> 

I think Boris and Thomas were talking about moving the entire c1e
detection out of there.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  2017-02-23 17:28     ` Peter Zijlstra
@ 2017-02-23 17:56       ` Borislav Petkov
  0 siblings, 0 replies; 48+ messages in thread
From: Borislav Petkov @ 2017-02-23 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Steven Rostedt, Paul E. McKenney, linux-kernel,
	Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek,
	Thomas Gleixner

On Thu, Feb 23, 2017 at 06:28:01PM +0100, Peter Zijlstra wrote:
> I think Boris and Thomas were talking about moving the entire c1e
> detection out of there.

Yeah, see

07c94a381253 ("x86/amd: Simplify AMD E400 aware idle routine")

tglx and I got rid of reading that MSR in the idle routine.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2017-02-23 18:24 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-21  0:53 [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! Jiri Olsa
2016-11-21  9:02 ` Paul E. McKenney
2016-11-21  9:43   ` Peter Zijlstra
2016-11-21 11:00     ` Paul E. McKenney
2016-11-21  9:28 ` Peter Zijlstra
2016-11-21  9:34   ` Jiri Olsa
2016-11-21  9:42     ` Peter Zijlstra
2016-11-21 11:22       ` Jiri Olsa
2016-11-21 11:31         ` Peter Zijlstra
2016-11-21 12:49           ` Jiri Olsa
2016-11-21 12:58     ` Peter Zijlstra
2016-11-21 14:15       ` Steven Rostedt
2016-11-21 14:37         ` Peter Zijlstra
2016-11-21 15:35           ` Borislav Petkov
2016-11-21 15:41             ` Peter Zijlstra
2016-11-21 16:06               ` Borislav Petkov
2016-11-29 13:16                 ` Borislav Petkov
2016-11-29 13:59                   ` Thomas Gleixner
2016-11-30  8:48                     ` Borislav Petkov
2016-11-30  8:54                       ` Thomas Gleixner
2016-11-30  9:07                         ` Borislav Petkov
2016-11-30  9:14                           ` Thomas Gleixner
2016-11-29 14:04                   ` Jiri Olsa
2016-11-21 14:20       ` Steven Rostedt
2016-11-21 17:06   ` Andi Kleen
2016-11-21 17:18     ` Peter Zijlstra
2016-11-21 17:45       ` Andi Kleen
2016-11-21 18:01         ` Steven Rostedt
2016-11-21 18:06           ` Andi Kleen
2016-11-21 18:22             ` Steven Rostedt
2016-11-21 18:37               ` Andi Kleen
2016-11-21 19:06                 ` Steven Rostedt
2016-11-21 19:15                 ` Steven Rostedt
2016-11-21 20:44                   ` Andi Kleen
2016-11-22  8:19                     ` Paul E. McKenney
2016-11-22 14:39                 ` Steven Rostedt
2016-11-22 19:05                   ` Andi Kleen
2016-11-24  2:06                 ` Steven Rostedt
2016-11-28 21:48                   ` Andi Kleen
2016-11-28 22:46                     ` Steven Rostedt
2016-11-28 23:10                       ` Andi Kleen
2016-11-21 17:55       ` Steven Rostedt
2016-11-21 18:24         ` Steven Rostedt
2016-11-21 20:12           ` Paul E. McKenney
2017-02-23 12:24   ` Jiri Olsa
2017-02-23 17:11     ` Andi Kleen
2017-02-23 17:28     ` Peter Zijlstra
2017-02-23 17:56       ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).