* [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Jiri Olsa @ 2016-11-21  0:53 UTC (48+ messages in thread)
  To: Steven Rostedt, Paul E. McKenney
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

hi,
Jan hit following output when msr tracepoints are enabled on amd server:

[   91.585653] ===============================
[   91.589840] [ INFO: suspicious RCU usage. ]
[   91.594025] 4.9.0-rc1+ #1 Not tainted
[   91.597691] -------------------------------
[   91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
[   91.610222]
[   91.610222] other info that might help us debug this:
[   91.610222]
[   91.618224]
[   91.618224] RCU used illegally from idle CPU!
[   91.618224] rcu_scheduler_active = 1, debug_locks = 0
[   91.629081] RCU used illegally from extended quiescent state!
[   91.634820] no locks held by swapper/1/0.
[   91.638832]
[   91.638832] stack backtrace:
[   91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1
[   91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03   ' 05/09/2008
[   91.656159]  ffffc900018fbdf8 ffffffff813ed43c ffff88017ede8000 0000000000000001
[   91.663637]  ffffc900018fbe28 ffffffff810fdcd7 ffff880233f95dd0 00000000c0010055
[   91.671107]  0000000000000000 0000000000000000 ffffc900018fbe58 ffffffff814297ac
[   91.678560] Call Trace:
[   91.681022]  [<ffffffff813ed43c>] dump_stack+0x85/0xc9
[   91.686164]  [<ffffffff810fdcd7>] lockdep_rcu_suspicious+0xe7/0x120
[   91.692429]  [<ffffffff814297ac>] do_trace_read_msr+0x14c/0x1b0
[   91.698349]  [<ffffffff8106ddb2>] native_read_msr+0x32/0x40
[   91.703921]  [<ffffffff8103b2be>] amd_e400_idle+0x7e/0x110
[   91.709407]  [<ffffffff8103b78f>] arch_cpu_idle+0xf/0x20
[   91.714720]  [<ffffffff8181cd33>] default_idle_call+0x23/0x40
[   91.720467]  [<ffffffff810f306a>] cpu_startup_entry+0x1da/0x2b0
[   91.726387]  [<ffffffff81058b1f>] start_secondary+0x17f/0x1f0

it got away with attached change.. but this rcu logic
is far beyond me, so it's just wild guess.. ;-)

thanks,
jirka

---
diff --git a/arch/x86/lib/msr.c b/arch/x86/lib/msr.c
index d1dee753b949..ca15becea3b6 100644
--- a/arch/x86/lib/msr.c
+++ b/arch/x86/lib/msr.c
@@ -115,14 +115,14 @@ int msr_clear_bit(u32 msr, u8 bit)
 #ifdef CONFIG_TRACEPOINTS
 void do_trace_write_msr(unsigned msr, u64 val, int failed)
 {
-	trace_write_msr(msr, val, failed);
+	trace_write_msr_rcuidle(msr, val, failed);
 }
 EXPORT_SYMBOL(do_trace_write_msr);
 EXPORT_TRACEPOINT_SYMBOL(write_msr);
 
 void do_trace_read_msr(unsigned msr, u64 val, int failed)
 {
-	trace_read_msr(msr, val, failed);
+	trace_read_msr_rcuidle(msr, val, failed);
 }
 EXPORT_SYMBOL(do_trace_read_msr);
 EXPORT_TRACEPOINT_SYMBOL(read_msr);
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Paul E. McKenney @ 2016-11-21  9:02 UTC
  To: Jiri Olsa
  Cc: Steven Rostedt, linux-kernel, Peter Zijlstra, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> hi,
> Jan hit following output when msr tracepoints are enabled on amd server:
>
> [   91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
> [   91.618224] RCU used illegally from idle CPU!
> [   91.629081] RCU used illegally from extended quiescent state!
> [...]
>
> it got away with attached change.. but this rcu logic
> is far beyond me, so it's just wild guess.. ;-)

If in idle, the _rcuidle() is needed, so:

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Peter Zijlstra @ 2016-11-21  9:43 UTC
  To: Paul E. McKenney
  Cc: Jiri Olsa, Steven Rostedt, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 01:02:25AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> >
> > it got away with attached change.. but this rcu logic
> > is far beyond me, so it's just wild guess.. ;-)
>
> If in idle, the _rcuidle() is needed, so:

Well, the point is, only this one rdmsr user is in idle, all the others
are not, so we should not be annotating _all_ of them, should we?
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Paul E. McKenney @ 2016-11-21 11:00 UTC
  To: Peter Zijlstra
  Cc: Jiri Olsa, Steven Rostedt, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:43:21AM +0100, Peter Zijlstra wrote:
> > If in idle, the _rcuidle() is needed, so:
>
> Well, the point is, only this one rdmsr user is in idle, all the others
> are not, so we should not be annotating _all_ of them, should we?

Fair enough!

							Thanx, Paul
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Peter Zijlstra @ 2016-11-21  9:28 UTC
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote:
> hi,
> Jan hit following output when msr tracepoints are enabled on amd server:
>
> [   91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage!
> [   91.618224] RCU used illegally from idle CPU!
> [...]
> [   91.692429]  [<ffffffff814297ac>] do_trace_read_msr+0x14c/0x1b0
> [   91.698349]  [<ffffffff8106ddb2>] native_read_msr+0x32/0x40
> [   91.703921]  [<ffffffff8103b2be>] amd_e400_idle+0x7e/0x110
> [...]
>
> it got away with attached change.. but this rcu logic
> is far beyond me, so it's just wild guess.. ;-)

I think I prefer something like the below, that only annotates the one
RDMSR in question, instead of all of them.


diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0888a879120f..d6c6aa80675f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -357,7 +357,7 @@ static void amd_e400_idle(void)
 	if (!amd_e400_c1e_detected) {
 		u32 lo, hi;
 
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
 
 		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
 			amd_e400_c1e_detected = true;
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Jiri Olsa @ 2016-11-21  9:34 UTC
  To: Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
> I think I prefer something like the below, that only annotates the one
> RDMSR in question, instead of all of them.
>
> @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
>  	if (!amd_e400_c1e_detected) {
>  		u32 lo, hi;
>  
> -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
>  
>  		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
>  			amd_e400_c1e_detected = true;

I was wondering about that, but haven't found RCU_NONIDLE

thanks,
jirka
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Peter Zijlstra @ 2016-11-21  9:42 UTC
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote:
>
> > I think I prefer something like the below, that only annotates the one
> > RDMSR in question, instead of all of them.
>
> I was wondering about that, but haven't found RCU_NONIDLE

It's in rcupdate.h ;-)
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Jiri Olsa @ 2016-11-21 11:22 UTC
  To: Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 10:42:27AM +0100, Peter Zijlstra wrote:
> > I was wondering about that, but haven't found RCU_NONIDLE
>
> It's in rcupdate.h ;-)

it was too late in the morning.. ;-) I'm assuming you'll queue this patch.. ?

thanks,
jirka
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Peter Zijlstra @ 2016-11-21 11:31 UTC
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 12:22:29PM +0100, Jiri Olsa wrote:
> it was too late in the morning.. ;-) I'm assuming you'll queue this patch.. ?

Probably, I still have to make pretty the uncore patch from last week,
and I have a PEBS patch that needs looking at :-)

BTW, if you have any opinion on that PEBS register set crap, do holler :-)

  lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.net
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Jiri Olsa @ 2016-11-21 12:49 UTC
  To: Peter Zijlstra
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 12:31:31PM +0100, Peter Zijlstra wrote:
> BTW, if you have any opinion on that PEBS register set crap, do holler :-)
>
>   lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.net

ook, will check

thanks,
jirka
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Peter Zijlstra @ 2016-11-21 12:58 UTC
  To: Jiri Olsa
  Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov

On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote:
> > @@ -357,7 +357,7 @@ static void amd_e400_idle(void)
> >  	if (!amd_e400_c1e_detected) {
> >  		u32 lo, hi;
> >  
> > -		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
> > +		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));

OK, so while looking at this again, I don't like this either :/

Problem with this one is that it always adds the RCU fiddling overhead,
even when we're not tracing.

I could do an rdmsr_notrace() for this one, dunno if it's important.
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Steven Rostedt @ 2016-11-21 14:15 UTC
  To: Peter Zijlstra
  Cc: Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov

On Mon, 21 Nov 2016 13:58:30 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Problem with this one is that it always adds the RCU fiddling overhead,
> even when we're not tracing.
>
> I could do an rdmsr_notrace() for this one, dunno if it's important.

But that would neglect the point of tracing rdmsr. What about:

	/* tracepoints require RCU enabled */
	if (trace_read_msr_enabled())
		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
	else
		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);

-- Steve
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Peter Zijlstra @ 2016-11-21 14:37 UTC
  To: Steven Rostedt
  Cc: Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov

On Mon, Nov 21, 2016 at 09:15:43AM -0500, Steven Rostedt wrote:
> But that would neglect the point of tracing rdmsr. What about:
>
> 	/* tracepoints require RCU enabled */
> 	if (trace_read_msr_enabled())
> 		RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi));
> 	else
> 		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);

Yeah, but this one does a printk() when it hits the condition it checks
for, so not tracing it would be fine I think.

Also, Boris, why do we need to redo that rdmsr until we see that bit
set? Can't we simply do the rdmsr once and then be done with it?
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Borislav Petkov @ 2016-11-21 15:35 UTC
  To: Peter Zijlstra
  Cc: Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 03:37:16PM +0100, Peter Zijlstra wrote:
> Yeah, but this one does a printk() when it hits the condition it checks
> for, so not tracing it would be fine I think.

FWIW, there's already:

	static inline void notrace
	native_write_msr(unsigned int msr, u32 low, u32 high)
	{
		__native_write_msr_notrace(msr, low, high);
		if (msr_tracepoint_active(__tracepoint_write_msr))
			do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
	}

so could be mirrored.

> Also, Boris, why do we need to redo that rdmsr until we see that bit
> set? Can't we simply do the rdmsr once and then be done with it?

Hmm, so the way I read the definition of those bits -
K8_INTP_C1E_ACTIVE_MASK - it looks like they get set when the cores
enter halt:

  28 C1eOnCmpHalt: C1E on chip multi-processing halt. Read-write.
     1=When all cores of a processor have entered the halt state, the
     processor generates an IO cycle as specified by IORd, IOMsgData, and
     IOMsgAddr. When this bit is set, SmiOnCmpHalt and IntPndMsg must be 0,
     otherwise the behavior is undefined. For revision DA-C, this bit is
     only supported for dual-core processors. For revision C3 and E, this
     bit is supported for any number of cores. See 2.4.3.3.3 [Hardware
     Initiated C1E].

  27 SmiOnCmpHalt: SMI on chip multi-processing halt. Read-write.
     1=When all cores of the processor have entered the halt state, the
     processor generates an SMI-trigger IO cycle as specified by IORd,
     IOMsgData, and IOMsgAddr. When this bit is set C1eOnCmpHalt and
     IntPndMsg must be 0, otherwise the behavior is undefined. The status
     is stored in SMMFEC4[SmiOnCmpHaltSts]. See 2.4.3.3.1 [SMI Initiated
     C1E].

But the thing is, we do it only once until amd_e400_c1e_detected is true.

FWIW, I'd love it if we could do

	if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))

but I'm afraid we need to look at that MSR at least once until we set
the boolean. I could ask if that is really the case and whether we can
detect it differently...

In any case, I have a box here in case you want me to test patches.

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Peter Zijlstra @ 2016-11-21 15:41 UTC
  To: Borislav Petkov
  Cc: Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

On Mon, Nov 21, 2016 at 04:35:38PM +0100, Borislav Petkov wrote:
> FWIW, there's already:
>
> 	static inline void notrace
> 	native_write_msr(unsigned int msr, u32 low, u32 high)
> 	{
> 		__native_write_msr_notrace(msr, low, high);
> 		if (msr_tracepoint_active(__tracepoint_write_msr))
> 			do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
> 	}
>
> so could be mirrored.

Yeah, I know. I caused that to appear ;-)

> But the thing is, we do it only once until amd_e400_c1e_detected is true.
>
> FWIW, I'd love it if we could do
>
> 	if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))
>
> but I'm afraid we need to look at that MSR at least once until we set
> the boolean.
>
> In any case, I have a box here in case you want me to test patches.

Right, I just wondered about the !c1e present case. In that case we'll
forever read the msr in the hope of finding that bit set, and we never
will. Don't care too much, just idle curiosity.

Let me do the rdmsr_notrace() thing.
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage!
  From: Borislav Petkov @ 2016-11-21 16:06 UTC
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek

+ tglx.

On Mon, Nov 21, 2016 at 04:41:04PM +0100, Peter Zijlstra wrote:
> Right, I just wondered about the !c1e present case. In that case we'll
> forever read the msr in the hope of finding that bit set, and we never
> will.

Well, erratum 400 flag is set on everything F10h from model 0x2 onwards,
which is basically everything that's still out there. And we set it for
most of the K8s too. So I don't think there's an affected machine out
there where we don't enable that erratum for, so I'd assume it is turned
on everywhere.

IOW, what's the worst thing that can happen if we did this below?

We basically get rid of the detection and switch the timer to broadcast
mode immediately on the halting CPU.

amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))"
check so it will run on the affected CPUs only...

Thoughts?

---
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0888a879120f..7ab9d0254b4d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -354,41 +354,30 @@ void amd_e400_remove_cpu(int cpu)
  */
 static void amd_e400_idle(void)
 {
-	if (!amd_e400_c1e_detected) {
-		u32 lo, hi;
+	int cpu = smp_processor_id();
 
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+	if (!static_cpu_has(X86_FEATURE_NONSTOP_TSC))
+		mark_tsc_unstable("TSC halt in AMD C1E");
 
-		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
-			amd_e400_c1e_detected = true;
-			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
-				mark_tsc_unstable("TSC halt in AMD C1E");
-			pr_info("System has AMD C1E enabled\n");
-		}
-	}
-
-	if (amd_e400_c1e_detected) {
-		int cpu = smp_processor_id();
+	pr_info("System has AMD C1E enabled\n");
 
-		if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) {
-			cpumask_set_cpu(cpu, amd_e400_c1e_mask);
-			/* Force broadcast so ACPI can not interfere. */
-			tick_broadcast_force();
-			pr_info("Switch to broadcast mode on CPU%d\n", cpu);
-		}
-		tick_broadcast_enter();
+	if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) {
+		cpumask_set_cpu(cpu, amd_e400_c1e_mask);
+		/* Force broadcast so ACPI can not interfere. */
+		tick_broadcast_force();
+		pr_info("Switch to broadcast mode on CPU%d\n", cpu);
+	}
+	tick_broadcast_enter();
 
-		default_idle();
+	default_idle();
 
-		/*
-		 * The switch back from broadcast mode needs to be
-		 * called with interrupts disabled.
-		 */
-		local_irq_disable();
-		tick_broadcast_exit();
-		local_irq_enable();
-	} else
-		default_idle();
+	/*
+	 * The switch back from broadcast mode needs to be
+	 * called with interrupts disabled.
+	 */
+	local_irq_disable();
+	tick_broadcast_exit();
+	local_irq_enable();
 }
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 16:06 ` Borislav Petkov @ 2016-11-29 13:16 ` Borislav Petkov 2016-11-29 13:59 ` Thomas Gleixner 2016-11-29 14:04 ` Jiri Olsa 0 siblings, 2 replies; 48+ messages in thread From: Borislav Petkov @ 2016-11-29 13:16 UTC (permalink / raw) To: Peter Zijlstra, Thomas Gleixner Cc: Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote: > IOW, what's the worst thing that can happen if we did this below? > > We basically get rid of the detection and switch the timer to broadcast > mode immediately on the halting CPU. > > amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))" > check so it will run on the affected CPUs only... > > Thoughts? Actually, here's a better version. The E400 detection works only after ACPI has been enabled, so we piggyback on the end of acpi_init(). We don't need the MSR read now - we do if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) on the idle path which is as fast as it gets. Any complaints about this before I go and test it everywhere? It builds and boots in my guest here ok, not that it means a whole lot. The good news is, I have collected a fleet of boxes which all have that erratum so testing should be pretty reliable. Something that doesn't happen every time! 
:-) --- diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h index 5391b0ae7cc3..aa8c879e814f 100644 --- a/arch/x86/include/asm/acpi.h +++ b/arch/x86/include/asm/acpi.h @@ -94,7 +94,8 @@ static inline unsigned int acpi_processor_cstate_check(unsigned int max_cstate) boot_cpu_data.x86_model <= 0x05 && boot_cpu_data.x86_mask < 0x0A) return 1; - else if (amd_e400_c1e_detected) + + else if (boot_cpu_has(X86_BUG_AMD_APIC_C1E)) return 1; else return max_cstate; @@ -131,6 +132,7 @@ static inline bool acpi_has_cpu_in_madt(void) return !!acpi_lapic; } +extern void __init __weak arch_post_acpi_init(void); #else /* !CONFIG_ACPI */ #define acpi_lapic 0 diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 1d2b69fc0ceb..b3918609f6f1 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -204,6 +204,7 @@ static __always_inline __pure bool _static_cpu_has(u16 bit) #define static_cpu_has_bug(bit) static_cpu_has((bit)) #define boot_cpu_has_bug(bit) cpu_has_bug(&boot_cpu_data, (bit)) +#define setup_clear_cpu_bug(bit) setup_clear_cpu_cap((bit)) #define MAX_CPU_FEATURES (NCAPINTS * 32) #define cpu_have_feature boot_cpu_has diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 1f6a92903b09..d98c5c432209 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -636,7 +636,6 @@ extern void select_idle_routine(const struct cpuinfo_x86 *c); extern void init_amd_e400_c1e_mask(void); extern unsigned long boot_option_idle_override; -extern bool amd_e400_c1e_detected; enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, IDLE_NOMWAIT, IDLE_POLL}; diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index ee023919e476..56653b7322f9 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -289,9 +289,6 @@ void stop_this_cpu(void *dummy) halt(); } -bool amd_e400_c1e_detected; -EXPORT_SYMBOL(amd_e400_c1e_detected); 
- static cpumask_var_t amd_e400_c1e_mask; void amd_e400_remove_cpu(int cpu) @@ -301,26 +298,14 @@ void amd_e400_remove_cpu(int cpu) } /* - * AMD Erratum 400 aware idle routine. We check for C1E active in the interrupt - * pending message MSR. If we detect C1E, then we handle it the same - * way as C3 power states (local apic timer and TSC stop) + * AMD Erratum 400 aware idle routine. + * + * + * We handle it the same way as C3 power states (local apic timer and TSC stop) */ static void amd_e400_idle(void) { - if (!amd_e400_c1e_detected) { - u32 lo, hi; - - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); - - if (lo & K8_INTP_C1E_ACTIVE_MASK) { - amd_e400_c1e_detected = true; - if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) - mark_tsc_unstable("TSC halt in AMD C1E"); - pr_info("System has AMD C1E enabled\n"); - } - } - - if (amd_e400_c1e_detected) { + if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) { int cpu = smp_processor_id(); if (!cpumask_test_cpu(cpu, amd_e400_c1e_mask)) { @@ -419,6 +404,30 @@ void __init init_amd_e400_c1e_mask(void) zalloc_cpumask_var(&amd_e400_c1e_mask, GFP_KERNEL); } +void __init arch_post_acpi_init(void) +{ + u32 lo, hi; + + if (!boot_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) + return; + + /* + * AMD E400 detection needs to happen *after* ACPI has been enabled. We + * check for C1E active in the interrupt pending message MSR. If we detect + * C1E enabled, machine is affected. 
+ */ + rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); + + if (!(lo & K8_INTP_C1E_ACTIVE_MASK)) { + setup_clear_cpu_bug(X86_BUG_AMD_APIC_C1E); + return; + } + + if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) + mark_tsc_unstable("TSC halt in AMD C1E"); + pr_info("System has AMD C1E enabled\n"); +} + static int __init idle_setup(char *str) { if (!str) diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c index 56190d00fd87..da42fe354ffc 100644 --- a/drivers/acpi/bus.c +++ b/drivers/acpi/bus.c @@ -34,6 +34,7 @@ #include <linux/reboot.h> #include <linux/delay.h> #ifdef CONFIG_X86 +#include <asm/acpi.h> #include <asm/mpspec.h> #endif #include <linux/acpi_iort.h> @@ -1166,6 +1167,13 @@ static int __init acpi_bus_init(void) struct kobject *acpi_kobj; EXPORT_SYMBOL_GPL(acpi_kobj); +/* + * Arch-specific one to override this. + */ +void __init __weak arch_post_acpi_init(void) +{ +} + static int __init acpi_init(void) { int result; @@ -1198,6 +1206,9 @@ static int __init acpi_init(void) acpi_debugger_init(); acpi_setup_sb_notify_handler(); acpi_set_processor_mapping(); + + arch_post_acpi_init(); + return 0; } diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 2237d3f24f0e..8eec3255d6f2 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -141,7 +141,7 @@ static void lapic_timer_check_state(int state, struct acpi_processor *pr, if (cpu_has(&cpu_data(pr->id), X86_FEATURE_ARAT)) return; - if (amd_e400_c1e_detected) + if (boot_cpu_has(X86_BUG_AMD_APIC_C1E)) type = ACPI_STATE_C1; /* -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-29 13:16 ` Borislav Petkov @ 2016-11-29 13:59 ` Thomas Gleixner 2016-11-30 8:48 ` Borislav Petkov 0 siblings, 1 reply; 48+ messages in thread From: Thomas Gleixner @ 2016-11-29 13:59 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek On Tue, 29 Nov 2016, Borislav Petkov wrote: > On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote: > > IOW, what's the worst thing that can happen if we did this below? > > > > We basically get rid of the detection and switch the timer to broadcast > > mode immediately on the halting CPU. > > > > amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))" > > check so it will run on the affected CPUs only... > > > > Thoughts? > > Actually, here's a better version. The E400 detection works only after > ACPI has been enabled so we piggyback the end of acpi_init(). > > We don't need the MSR read now - we do > > if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) > > on the idle path which is as fast as it gets. > > Any complaints about this before I go and test it everywhere? The issue is that you obviously start with the assumption that the machine has this bug. As a consequence the machine is brute forced into tick broadcast mode, which cannot be reverted when you clear that misfeature after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is forced into broadcast mode, which is not what you want. As far as I understood the whole magic, this C1E misfeature takes only effect _after_ ACPI has been initialized. So instead of setting the bug in early boot and therefore forcing the broadcast nonsense, we should only set it when ACPI has actually detected it. Thanks, tglx ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-29 13:59 ` Thomas Gleixner @ 2016-11-30 8:48 ` Borislav Petkov 2016-11-30 8:54 ` Thomas Gleixner 0 siblings, 1 reply; 48+ messages in thread From: Borislav Petkov @ 2016-11-30 8:48 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek On Tue, Nov 29, 2016 at 02:59:01PM +0100, Thomas Gleixner wrote: > The issue is that you obviously start with the assumption that the machine > has this bug. As a consequence the machine is brute forced into tick > broadcast mode, which cannot be reverted when you clear that misfeature > after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is > forced into broadcast mode, which is not what you want. > > As far as I understood the whole magic, this C1E misfeature takes only > effect _after_ ACPI has been initialized. So instead of setting the bug in > early boot and therefore forcing the broadcast nonsense, we should only set > it when ACPI has actually detected it. Problem is, select_idle_routine() runs a lot earlier than acpi_init() so there's a window where we don't definitively know yet whether the box is actually going to enter C1E or not. [ I presume the reason why we have to do the proper detection after ACPI has been initialized is because the frickelware decides whether to do C1E entry or not and then sets those bits in the MSR (or not). ] If in that window we enter idle and we're on an affected machine and we *don't* switch to broadcast mode, we risk not waking up from C1E, i.e., the main reason this fix was even done. So, if we "prematurely" switch to broadcast mode on the affected CPUs, we're ok, it will be detected properly later and we're in broadcast mode already. 
Now, on those machines which are not affected and we clear X86_BUG_AMD_APIC_C1E because they don't enter C1E at all, I was thinking of maybe doing amd_e400_remove_cpu() and clearing that e400 mask and even freeing it so that they can do default_idle(). But you're saying tick_broadcast_enter() is irreversible? Thanks. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-30 8:48 ` Borislav Petkov @ 2016-11-30 8:54 ` Thomas Gleixner 2016-11-30 9:07 ` Borislav Petkov 0 siblings, 1 reply; 48+ messages in thread From: Thomas Gleixner @ 2016-11-30 8:54 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek On Wed, 30 Nov 2016, Borislav Petkov wrote: > On Tue, Nov 29, 2016 at 02:59:01PM +0100, Thomas Gleixner wrote: > > The issue is that you obviously start with the assumption that the machine > > has this bug. As a consequence the machine is brute forced into tick > > broadcast mode, which cannot be reverted when you clear that misfeature > > after ACPI init. So in case of !NOHZ and !HIGHRES the periodic tick is > > forced into broadcast mode, which is not what you want. > > > > As far as I understood the whole magic, this C1E misfeature takes only > > effect _after_ ACPI has been initialized. So instead of setting the bug in > > early boot and therefore forcing the broadcast nonsense, we should only set > > it when ACPI has actually detected it. > > Problem is, select_idle_routine() runs a lot earlier than acpi_init() so > there's a window where we don't definitively know yet whether the box is > actually going to enter C1E or not. > > [ I presume the reason why we have to do the proper detection after > ACPI has been initialized is because the frickelware decides whether > to do C1E entry or not and then sets those bits in the MSR (or not). ] > > If in that window we enter idle and we're on an affected machine and we > *don't* switch to broadcast mode, we risk not waking up from C1E, i.e., > the main reason this fix was even done. > > So, if we "prematurely" switch to broadcast mode on the affected CPUs, > we're ok, it will be detected properly later and we're in broadcast > mode already. Right, that's the safe bet. 
But I'm quite sure that the C1E crap only starts to work _after_ ACPI initialization. > Now, on those machines which are not affected and we clear > X86_BUG_AMD_APIC_C1E because they don't enter C1E at all, I was thinking > of maybe doing amd_e400_remove_cpu() and clearing that e400 mask and > even freeing it so that they can do default_idle(). > > But you're saying tick_broadcast_enter() is irreversible? tick_force_broadcast() is irreversible Thanks, tglx ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-30 8:54 ` Thomas Gleixner @ 2016-11-30 9:07 ` Borislav Petkov 2016-11-30 9:14 ` Thomas Gleixner 0 siblings, 1 reply; 48+ messages in thread From: Borislav Petkov @ 2016-11-30 9:07 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek On Wed, Nov 30, 2016 at 09:54:58AM +0100, Thomas Gleixner wrote: > Right, that's the safe bet. But I'm quite sure that the C1E crap only > starts to work _after_ ACPI initialization. Yap, I think it is an ACPI decision whether to enter C1E or not. And all those boxes which are unaffected - they actually are affected, but since the FW doesn't enter C1E, they don't hit the bug. > tick_force_broadcast() is irreversible So if I let the cores in that small window use amd_e400_idle() and then when I detect the machine doesn't enter C1E after all, I do tick_broadcast_exit() on all cores in amd_e400_c1e_mask, then clear it and free it, that would work? Or do you see a better solution? Thanks. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-30 9:07 ` Borislav Petkov @ 2016-11-30 9:14 ` Thomas Gleixner 0 siblings, 0 replies; 48+ messages in thread From: Thomas Gleixner @ 2016-11-30 9:14 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, Steven Rostedt, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek On Wed, 30 Nov 2016, Borislav Petkov wrote: > On Wed, Nov 30, 2016 at 09:54:58AM +0100, Thomas Gleixner wrote: > > Right, that's the safe bet. But I'm quite sure that the C1E crap only > > starts to work _after_ ACPI initialization. > > Yap, I think it is an ACPI decision whether to enter C1E or not. And all > those boxes which are unaffected - they actually are but since the FW > doesn't enter C1E, they don't hit the bug. > > > tick_force_broadcast() is irreversible > > So if I let the cores in that small window use amd_e400_idle() and > then when I detect the machine doesn't enter C1E after all, I do > tick_broadcast_exit() on all cores in amd_e400_c1e_mask, then clear it > and free it, that would work? > > Or do you see a better solution? Start with the bug cleared and select amd_e400_idle(), which will then use default_idle() and not do any of the broadcast crap. After ACPI gets initialized check the C1E misfeature and if it's detected, then set the CPU BUG and send an IPI to all online CPUs. I think this is safe because the CPU running the ACPI initialization is not in idle and if any of the other CPUs would be stuck due to the C1E enablement, then the IPI will kick them out. At least worth a try. Thanks, tglx ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-29 13:16 ` Borislav Petkov 2016-11-29 13:59 ` Thomas Gleixner @ 2016-11-29 14:04 ` Jiri Olsa 1 sibling, 0 replies; 48+ messages in thread From: Jiri Olsa @ 2016-11-29 14:04 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek On Tue, Nov 29, 2016 at 02:16:49PM +0100, Borislav Petkov wrote: > On Mon, Nov 21, 2016 at 05:06:54PM +0100, Borislav Petkov wrote: > > IOW, what's the worst thing that can happen if we did this below? > > > > We basically get rid of the detection and switch the timer to broadcast > > mode immediately on the halting CPU. > > > > amd_e400_idle() is behind an "if (cpu_has_bug(c, X86_BUG_AMD_APIC_C1E))" > > check so it will run on the affected CPUs only... > > > > Thoughts? > > Actually, here's a better version. The E400 detection works only after > ACPI has been enabled, so we piggyback on the end of acpi_init(). > > We don't need the MSR read now - we do > > if (static_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) > > on the idle path which is as fast as it gets. > > Any complaints about this before I go and test it everywhere? > > It builds and boots in my guest here ok, not that it means a whole lot. > > The good news is, I have collected a fleet of boxes which all have that > erratum so testing should be pretty reliable. Something that doesn't > happen every time! +1 one test server over here.. will check ;-) jirka ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 12:58 ` Peter Zijlstra 2016-11-21 14:15 ` Steven Rostedt @ 2016-11-21 14:20 ` Steven Rostedt 1 sibling, 0 replies; 48+ messages in thread From: Steven Rostedt @ 2016-11-21 14:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov On Mon, 21 Nov 2016 13:58:30 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Nov 21, 2016 at 10:34:25AM +0100, Jiri Olsa wrote: > > > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > > > index 0888a879120f..d6c6aa80675f 100644 > > > --- a/arch/x86/kernel/process.c > > > +++ b/arch/x86/kernel/process.c > > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void) > > > if (!amd_e400_c1e_detected) { > > > u32 lo, hi; > > > > > > - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); > > > + RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); > > > > > > if (lo & K8_INTP_C1E_ACTIVE_MASK) { > > > amd_e400_c1e_detected = true; > > OK, so while looking at this again, I don't like this ether :/ > > Problem with this one is that it always adds the RCU fiddling overhead, > even when we're not tracing. > > I could do an rdmsr_notrace() for this one, dunno if its important. But that would neglect the point of tracing rdmsr. What about: /* tracepoints require RCU enabled */ if (trace_read_msr_enabled()) RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); else rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 9:28 ` Peter Zijlstra 2016-11-21 9:34 ` Jiri Olsa @ 2016-11-21 17:06 ` Andi Kleen 2016-11-21 17:18 ` Peter Zijlstra 2017-02-23 12:24 ` Jiri Olsa 2 siblings, 1 reply; 48+ messages in thread From: Andi Kleen @ 2016-11-21 17:06 UTC (permalink / raw) To: Peter Zijlstra Cc: Jiri Olsa, Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek > > it got away with attached change.. but this rcu logic > > is far beyond me, so it's just wild guess.. ;-) > > I think I prefer something like the below, that only annotates the one > RDMSR in question, instead of all of them. It would be far better to just fix trace points so that they always work. This whole thing is a travesty: we have tens of thousands of lines of code in ftrace to support tracing in NMIs, but then "debug features"[1] like this come around and make trace points unusable for far more code than just the NMI handlers. -Andi [1] unclear what they actually "debug" here. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 17:06 ` Andi Kleen @ 2016-11-21 17:18 ` Peter Zijlstra 2016-11-21 17:45 ` Andi Kleen 2016-11-21 17:55 ` Steven Rostedt 0 siblings, 2 replies; 48+ messages in thread From: Peter Zijlstra @ 2016-11-21 17:18 UTC (permalink / raw) To: Andi Kleen Cc: Jiri Olsa, Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote: > > > it got away with attached change.. but this rcu logic > > > is far beyond me, so it's just wild guess.. ;-) > > > > I think I prefer something like the below, that only annotates the one > > RDMSR in question, instead of all of them. > > It would be far better to just fix trace points so that they always work. > > This whole thing is a travesty: we have tens of thousands of lines of code in > ftrace to support tracing in NMIs, but then "debug features"[1] like > this come around and make trace points unusable for far more code than > just the NMI handlers. Not sure the idle handlers are more code than we have NMI code, but yes, it's annoying. It's not ftrace as such though, it's RCU; ftrace simply uses RCU to avoid locking, as one does. Biggest objection would be that the rcu_irq_enter_irqson() thing does POPF and rcu_irq_exit_irqson() does so again. So wrapping every tracepoint with that is quite a few cycles. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 17:18 ` Peter Zijlstra @ 2016-11-21 17:45 ` Andi Kleen 2016-11-21 18:01 ` Steven Rostedt 0 siblings, 1 reply; 48+ messages in thread From: Andi Kleen @ 2016-11-21 17:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Andi Kleen, Jiri Olsa, Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, Nov 21, 2016 at 06:18:53PM +0100, Peter Zijlstra wrote: > On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote: > > > > it got away with attached change.. but this rcu logic > > > > is far beyond me, so it's just wild guess.. ;-) > > > > > > I think I prefer something like the below, that only annotates the one > > > RDMSR in question, instead of all of them. > > > > It would be far better to just fix trace points so that they always work. > > > > This whole thing is a travesty: we have tens of thousands of lines of code in > > ftrace to support tracing in NMIs, but then "debug features"[1] like > > this come around and make trace points unusable for far more code than > > just the NMI handlers. > > Not sure the idle handlers are more code than we have NMI code, but yes, > its annoying. > > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid > locking, as one does. > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint > with that is quite a few cycles. This is only when the trace point is enabled, right? ftrace is already executing a lot of instructions when enabled. Adding two POPF won't be the end of the world. 
-Andi 264171.903577780: ffffffff810d6040 trace_event_raw_event_sched_switch push %rbp 264171.903577780: ffffffff810d6041 trace_event_raw_event_sched_switch mov %rsp, %rbp 264171.903577780: ffffffff810d6044 trace_event_raw_event_sched_switch push %r15 264171.903577780: ffffffff810d6046 trace_event_raw_event_sched_switch push %r14 264171.903577780: ffffffff810d6048 trace_event_raw_event_sched_switch push %r13 264171.903577780: ffffffff810d604a trace_event_raw_event_sched_switch push %r12 264171.903577780: ffffffff810d604c trace_event_raw_event_sched_switch mov %esi, %r13d 264171.903577780: ffffffff810d604f trace_event_raw_event_sched_switch push %rbx 264171.903577780: ffffffff810d6050 trace_event_raw_event_sched_switch mov %rdi, %r12 264171.903577780: ffffffff810d6053 trace_event_raw_event_sched_switch mov %rdx, %r14 264171.903577780: ffffffff810d6056 trace_event_raw_event_sched_switch mov %rcx, %r15 264171.903577780: ffffffff810d6059 trace_event_raw_event_sched_switch sub $0x30, %rsp 264171.903577780: ffffffff810d605d trace_event_raw_event_sched_switch mov 0x48(%rdi), %rbx 264171.903577780: ffffffff810d6061 trace_event_raw_event_sched_switch test $0x80, %bl 264171.903577780: ffffffff810d6064 trace_event_raw_event_sched_switch jnz 0x810d608c 264171.903577780: ffffffff810d6066 trace_event_raw_event_sched_switch test $0x40, %bl 264171.903577780: ffffffff810d6069 trace_event_raw_event_sched_switch jnz trace_event_raw_event_sched_switch+216 264171.903577780: ffffffff810d606f trace_event_raw_event_sched_switch test $0x20, %bl 264171.903577780: ffffffff810d6072 trace_event_raw_event_sched_switch jz 0x810d6083 264171.903577780: ffffffff810d6083 trace_event_raw_event_sched_switch and $0x1, %bh 264171.903577780: ffffffff810d6086 trace_event_raw_event_sched_switch jnz trace_event_raw_event_sched_switch+237 264171.903578113: ffffffff810d608c trace_event_raw_event_sched_switch mov $0x40, %edx 264171.903578113: ffffffff810d6091 trace_event_raw_event_sched_switch mov %r12, %rsi 
264171.903578113: ffffffff810d6094 trace_event_raw_event_sched_switch mov %rsp, %rdi 264171.903578113: ffffffff810d6097 trace_event_raw_event_sched_switch call trace_event_buffer_reserve 264171.903578113: ffffffff8114f150 trace_event_buffer_reserve test $0x1, 0x49(%rsi) 264171.903578113: ffffffff8114f154 trace_event_buffer_reserve push %rbp 264171.903578113: ffffffff8114f155 trace_event_buffer_reserve mov 0x10(%rsi), %r10 264171.903578113: ffffffff8114f159 trace_event_buffer_reserve mov %rsp, %rbp 264171.903578113: ffffffff8114f15c trace_event_buffer_reserve push %rbx 264171.903578113: ffffffff8114f15d trace_event_buffer_reserve jz 0x8114f17e 264171.903578113: ffffffff8114f17e trace_event_buffer_reserve mov %rdx, %rcx 264171.903578113: ffffffff8114f181 trace_event_buffer_reserve mov %rdi, %rbx 264171.903578113: ffffffff8114f184 trace_event_buffer_reserve pushfq 264171.903578113: ffffffff8114f185 trace_event_buffer_reserve pop %r8 264171.903578113: ffffffff8114f187 trace_event_buffer_reserve mov %gs:0x7eebd2c1(%rip), %r9d 264171.903578113: ffffffff8114f18f trace_event_buffer_reserve and $0x7fffffff, %r9d 264171.903578113: ffffffff8114f196 trace_event_buffer_reserve mov %r8, 0x20(%rdi) 264171.903578113: ffffffff8114f19a trace_event_buffer_reserve mov %rsi, 0x10(%rdi) 264171.903578113: ffffffff8114f19e trace_event_buffer_reserve mov %r9d, 0x28(%rdi) 264171.903578113: ffffffff8114f1a2 trace_event_buffer_reserve mov 0x40(%r10), %edx 264171.903578113: ffffffff8114f1a6 trace_event_buffer_reserve call trace_event_buffer_lock_reserve 264171.903578113: ffffffff81142210 trace_event_buffer_lock_reserve push %rbp 264171.903578113: ffffffff81142211 trace_event_buffer_lock_reserve mov %rsp, %rbp 264171.903578113: ffffffff81142214 trace_event_buffer_lock_reserve push %r15 264171.903578113: ffffffff81142216 trace_event_buffer_lock_reserve push %r14 264171.903578113: ffffffff81142218 trace_event_buffer_lock_reserve push %r13 264171.903578113: ffffffff8114221a 
trace_event_buffer_lock_reserve push %r12 264171.903578113: ffffffff8114221c trace_event_buffer_lock_reserve mov %rdi, %r14 264171.903578113: ffffffff8114221f trace_event_buffer_lock_reserve push %rbx 264171.903578113: ffffffff81142220 trace_event_buffer_lock_reserve mov %edx, %r12d 264171.903578113: ffffffff81142223 trace_event_buffer_lock_reserve mov %rsi, %rbx 264171.903578113: ffffffff81142226 trace_event_buffer_lock_reserve mov %rcx, %r13 264171.903578113: ffffffff81142229 trace_event_buffer_lock_reserve mov %r9d, %r15d 264171.903578113: ffffffff8114222c trace_event_buffer_lock_reserve sub $0x10, %rsp 264171.903578113: ffffffff81142230 trace_event_buffer_lock_reserve mov 0x28(%rsi), %rax 264171.903578113: ffffffff81142234 trace_event_buffer_lock_reserve mov %r8, -0x30(%rbp) 264171.903578113: ffffffff81142238 trace_event_buffer_lock_reserve mov 0x20(%rax), %rdi 264171.903578113: ffffffff8114223c trace_event_buffer_lock_reserve mov %rdi, (%r14) 264171.903578113: ffffffff8114223f trace_event_buffer_lock_reserve test $0x24, 0x48(%rsi) 264171.903578113: ffffffff81142243 trace_event_buffer_lock_reserve jz 0x8114226d 264171.903578113: ffffffff8114226d trace_event_buffer_lock_reserve mov -0x30(%rbp), %rcx 264171.903578113: ffffffff81142271 trace_event_buffer_lock_reserve mov %r15d, %r8d 264171.903578113: ffffffff81142274 trace_event_buffer_lock_reserve mov %r13, %rdx 264171.903578113: ffffffff81142277 trace_event_buffer_lock_reserve mov %r12d, %esi 264171.903578113: ffffffff8114227a trace_event_buffer_lock_reserve call trace_buffer_lock_reserve 264171.903578113: ffffffff811421c0 trace_buffer_lock_reserve push %rbp 264171.903578113: ffffffff811421c1 trace_buffer_lock_reserve mov %rsp, %rbp 264171.903578113: ffffffff811421c4 trace_buffer_lock_reserve push %r14 264171.903578113: ffffffff811421c6 trace_buffer_lock_reserve push %r13 264171.903578113: ffffffff811421c8 trace_buffer_lock_reserve push %r12 264171.903578113: ffffffff811421ca trace_buffer_lock_reserve push %rbx 
264171.903578113: ffffffff811421cb trace_buffer_lock_reserve mov %esi, %r12d 264171.903578113: ffffffff811421ce trace_buffer_lock_reserve mov %rdx, %rsi 264171.903578113: ffffffff811421d1 trace_buffer_lock_reserve mov %rcx, %r13 264171.903578113: ffffffff811421d4 trace_buffer_lock_reserve mov %r8d, %r14d 264171.903578113: ffffffff811421d7 trace_buffer_lock_reserve call ring_buffer_lock_reserve 264171.903578113: ffffffff8113d950 ring_buffer_lock_reserve mov 0x8(%rdi), %eax 264171.903578113: ffffffff8113d953 ring_buffer_lock_reserve test %eax, %eax 264171.903578113: ffffffff8113d955 ring_buffer_lock_reserve jnz ring_buffer_lock_reserve+166 264171.903578113: ffffffff8113d95b ring_buffer_lock_reserve mov %gs:0x7eecc826(%rip), %eax 264171.903578113: ffffffff8113d962 ring_buffer_lock_reserve mov %eax, %ecx 264171.903578113: ffffffff8113d964 ring_buffer_lock_reserve bt %rcx, 0x10(%rdi) 264171.903578113: ffffffff8113d969 ring_buffer_lock_reserve setb %cl 264171.903578113: ffffffff8113d96c ring_buffer_lock_reserve test %cl, %cl 264171.903578113: ffffffff8113d96e ring_buffer_lock_reserve jz ring_buffer_lock_reserve+166 264171.903578113: ffffffff8113d974 ring_buffer_lock_reserve mov 0x60(%rdi), %rdx 264171.903578113: ffffffff8113d978 ring_buffer_lock_reserve push %rbp 264171.903578113: ffffffff8113d979 ring_buffer_lock_reserve cdqe 264171.903578113: ffffffff8113d97b ring_buffer_lock_reserve mov %rsp, %rbp 264171.903578113: ffffffff8113d97e ring_buffer_lock_reserve push %rbx 264171.903578113: ffffffff8113d97f ring_buffer_lock_reserve mov (%rdx,%rax,8), %rbx 264171.903578113: ffffffff8113d983 ring_buffer_lock_reserve mov 0x4(%rbx), %eax 264171.903578113: ffffffff8113d986 ring_buffer_lock_reserve test %eax, %eax 264171.903578113: ffffffff8113d988 ring_buffer_lock_reserve jnz 0x8113d9f1 264171.903578113: ffffffff8113d98a ring_buffer_lock_reserve cmp $0xfe8, %rsi 264171.903578113: ffffffff8113d991 ring_buffer_lock_reserve ja 0x8113d9f1 264171.903578113: ffffffff8113d993 
ring_buffer_lock_reserve mov %gs:0x7eeceab6(%rip), %edx 264171.903578113: ffffffff8113d99a ring_buffer_lock_reserve test $0x1fff00, %edx 264171.903578113: ffffffff8113d9a0 ring_buffer_lock_reserve mov 0x20(%rbx), %ecx 264171.903578113: ffffffff8113d9a3 ring_buffer_lock_reserve mov $0x8, %eax 264171.903578113: ffffffff8113d9a8 ring_buffer_lock_reserve jnz 0x8113d9c6 264171.903578113: ffffffff8113d9aa ring_buffer_lock_reserve test %eax, %ecx 264171.903578113: ffffffff8113d9ac ring_buffer_lock_reserve jnz 0x8113d9f1 264171.903578113: ffffffff8113d9ae ring_buffer_lock_reserve or %ecx, %eax 264171.903578113: ffffffff8113d9b0 ring_buffer_lock_reserve mov %rsi, %rdx 264171.903578113: ffffffff8113d9b3 ring_buffer_lock_reserve mov %rbx, %rsi 264171.903578113: ffffffff8113d9b6 ring_buffer_lock_reserve mov %eax, 0x20(%rbx) 264171.903578113: ffffffff8113d9b9 ring_buffer_lock_reserve call rb_reserve_next_event 264171.903578113: ffffffff8113d610 rb_reserve_next_event inc 0x88(%rsi) 264171.903578113: ffffffff8113d617 rb_reserve_next_event inc 0x90(%rsi) 264171.903578113: ffffffff8113d61e rb_reserve_next_event mov 0x8(%rsi), %rax 264171.903578113: ffffffff8113d622 rb_reserve_next_event cmp %rdi, %rax 264171.903578113: ffffffff8113d625 rb_reserve_next_event jnz rb_reserve_next_event+678 264171.903578113: ffffffff8113d62b rb_reserve_next_event push %rbp 264171.903578113: ffffffff8113d62c rb_reserve_next_event mov %edx, %eax 264171.903578113: ffffffff8113d62e rb_reserve_next_event mov $0x8, %ecx 264171.903578113: ffffffff8113d633 rb_reserve_next_event mov %rsp, %rbp 264171.903578113: ffffffff8113d636 rb_reserve_next_event push %r15 264171.903578113: ffffffff8113d638 rb_reserve_next_event push %r14 264171.903578113: ffffffff8113d63a rb_reserve_next_event push %r13 264171.903578113: ffffffff8113d63c rb_reserve_next_event push %r12 264171.903578113: ffffffff8113d63e rb_reserve_next_event push %rbx 264171.903578113: ffffffff8113d63f rb_reserve_next_event lea 0x88(%rsi), %rbx 
264171.903578113: ffffffff8113d646 rb_reserve_next_event sub $0x50, %rsp 264171.903578113: ffffffff8113d64a rb_reserve_next_event test %edx, %edx 264171.903578113: ffffffff8113d64c rb_reserve_next_event jnz rb_reserve_next_event+430 264171.903578113: ffffffff8113d7be rb_reserve_next_event cmp $0x70, %edx 264171.903578113: ffffffff8113d7c1 rb_reserve_next_event jbe 0x8113d7c6 264171.903578113: ffffffff8113d7c6 rb_reserve_next_event add $0x7, %eax 264171.903578113: ffffffff8113d7c9 rb_reserve_next_event mov $0x10, %ecx 264171.903578113: ffffffff8113d7ce rb_reserve_next_event and $0xfffffffc, %eax 264171.903578113: ffffffff8113d7d1 rb_reserve_next_event mov %eax, %edx 264171.903578113: ffffffff8113d7d3 rb_reserve_next_event cmp $0xc, %eax 264171.903578113: ffffffff8113d7d6 rb_reserve_next_event cmovnz %rdx, %rcx 264171.903578113: ffffffff8113d7da rb_reserve_next_event jmp rb_reserve_next_event+66 264171.903578113: ffffffff8113d652 rb_reserve_next_event lea -0x50(%rbp), %rax 264171.903578113: ffffffff8113d656 rb_reserve_next_event lea 0xa8(%rsi), %r12 264171.903578113: ffffffff8113d65d rb_reserve_next_event mov %rsi, %r13 264171.903578113: ffffffff8113d660 rb_reserve_next_event mov %rcx, -0x40(%rbp) 264171.903578113: ffffffff8113d664 rb_reserve_next_event mov $0x0, -0x30(%rbp) 264171.903578113: ffffffff8113d66b rb_reserve_next_event lea 0x18(%rax), %r15 264171.903578113: ffffffff8113d66f rb_reserve_next_event lea 0x10(%rax), %r14 264171.903578113: ffffffff8113d673 rb_reserve_next_event mov $0x0, -0x48(%rbp) 264171.903578113: ffffffff8113d67b rb_reserve_next_event mov $0x3e8, -0x74(%rbp) 264171.903578113: ffffffff8113d682 rb_reserve_next_event mov 0x8(%r13), %rax 264171.903578113: ffffffff8113d686 rb_reserve_next_event call 0x80(%rax) 264171.903578113: ffffffff81134520 trace_clock_local push %rbp 264171.903578113: ffffffff81134521 trace_clock_local mov %rsp, %rbp 264171.903578113: ffffffff81134524 trace_clock_local call native_sched_clock 264171.903578113: 
ffffffff81078270 native_sched_clock push %rbp 264171.903578113: ffffffff81078271 native_sched_clock mov %rsp, %rbp 264171.903578113: ffffffff81078274 native_sched_clock and $0xfffffffffffffff0, %rsp 264171.903578113: ffffffff81078278 native_sched_clock nop (%rax,%rax) 264171.903578113: ffffffff8107827d native_sched_clock rdtsc 264171.903578113: ffffffff8107827f native_sched_clock shl $0x20, %rdx 264171.903578113: ffffffff81078283 native_sched_clock or %rdx, %rax 264171.903578113: ffffffff81078286 native_sched_clock mov %gs:0x7ef9cea2(%rip), %rsi 264171.903578113: ffffffff8107828e native_sched_clock mov %gs:0x7ef9cea2(%rip), %rdx 264171.903578113: ffffffff81078296 native_sched_clock cmp %rdx, %rsi 264171.903578113: ffffffff81078299 native_sched_clock jnz 0x810782d1 264171.903578113: ffffffff8107829b native_sched_clock mov (%rsi), %edx 264171.903578113: ffffffff8107829d native_sched_clock mov 0x4(%rsi), %ecx 264171.903578113: ffffffff810782a0 native_sched_clock mul %rdx 264171.903578113: ffffffff810782a3 native_sched_clock shrd %cl, %rdx, %rax 264171.903578113: ffffffff810782a7 native_sched_clock shr %cl, %rdx 264171.903578113: ffffffff810782aa native_sched_clock test $0x40, %cl 264171.903578113: ffffffff810782ad native_sched_clock cmovnz %rdx, %rax 264171.903578113: ffffffff810782b1 native_sched_clock add 0x8(%rsi), %rax 264171.903578113: ffffffff810782b5 native_sched_clock leave 264171.903578113: ffffffff810782b6 native_sched_clock ret 264171.903578113: ffffffff81134529 trace_clock_local pop %rbp 264171.903578113: ffffffff8113452a trace_clock_local ret 264171.903578113: ffffffff8113d68c rb_reserve_next_event mov %rax, -0x50(%rbp) 264171.903578113: ffffffff8113d690 rb_reserve_next_event sub 0xa8(%r13), %rax 264171.903578113: ffffffff8113d697 rb_reserve_next_event mov 0xa8(%r13), %rcx 264171.903578113: ffffffff8113d69e rb_reserve_next_event cmp %rcx, -0x50(%rbp) 264171.903578113: ffffffff8113d6a2 rb_reserve_next_event jb 0x8113d6b4 264171.903578113: ffffffff8113d6a4 
rb_reserve_next_event test $0xfffffffff8000000, %rax 264171.903578113: ffffffff8113d6aa rb_reserve_next_event mov %rax, -0x48(%rbp) 264171.903578113: ffffffff8113d6ae rb_reserve_next_event jnz rb_reserve_next_event+476 264171.903578113: ffffffff8113d6b4 rb_reserve_next_event mov -0x30(%rbp), %esi 264171.903578113: ffffffff8113d6b7 rb_reserve_next_event test %esi, %esi 264171.903578113: ffffffff8113d6b9 rb_reserve_next_event jnz rb_reserve_next_event+718 ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 17:45 ` Andi Kleen @ 2016-11-21 18:01 ` Steven Rostedt 2016-11-21 18:06 ` Andi Kleen 0 siblings, 1 reply; 48+ messages in thread From: Steven Rostedt @ 2016-11-21 18:01 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, 21 Nov 2016 09:45:04 -0800 Andi Kleen <andi@firstfloor.org> wrote: > On Mon, Nov 21, 2016 at 06:18:53PM +0100, Peter Zijlstra wrote: > > On Mon, Nov 21, 2016 at 09:06:13AM -0800, Andi Kleen wrote: > > > > > it got away with attached change.. but this rcu logic > > > > > is far beyond me, so it's just wild guess.. ;-) > > > > > > > > I think I prefer something like the below, that only annotates the one > > > > RDMSR in question, instead of all of them. > > > > > > It would be far better to just fix trace points that they always work. > > > > > > This whole thing is a travesty: we have tens of thousands of lines of code in > > > ftrace to support tracing in NMIs, but then "debug features"[1] like > > > this come around and make trace points unusable for far more code than > > > just the NMI handlers. > > > > Not sure the idle handlers are more code than we have NMI code, but yes, > > its annoying. > > > > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid > > locking, as one does. > > > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint > > with that is quite a few cycles. > > This is only when the trace point is enabled right? > > ftrace is already executing a lot of instructions when enabled. Adding two POPF > won't be end of the world. And a popf can be much more expensive than any of these. You should know, not all instructions are equal. Using perf, I've seen popf take up almost 30% of a function the size of this. 
-- Steve > > -Andi > > > 264171.903577780: ffffffff810d6040 trace_event_raw_event_sched_switch push %rbp > 264171.903577780: ffffffff810d6041 trace_event_raw_event_sched_switch mov %rsp, %rbp > 264171.903577780: ffffffff810d6044 trace_event_raw_event_sched_switch push %r15 > 264171.903577780: ffffffff810d6046 trace_event_raw_event_sched_switch push %r14 > 264171.903577780: ffffffff810d6048 trace_event_raw_event_sched_switch push %r13 > 264171.903577780: ffffffff810d604a trace_event_raw_event_sched_switch push %r12 > 264171.903577780: ffffffff810d604c trace_event_raw_event_sched_switch mov %esi, %r13d > 264171.903577780: ffffffff810d604f trace_event_raw_event_sched_switch push %rbx > 264171.903577780: ffffffff810d6050 trace_event_raw_event_sched_switch mov %rdi, %r12 > 264171.903577780: ffffffff810d6053 trace_event_raw_event_sched_switch mov %rdx, %r14 > 264171.903577780: ffffffff810d6056 trace_event_raw_event_sched_switch mov %rcx, %r15 > 264171.903577780: ffffffff810d6059 trace_event_raw_event_sched_switch sub $0x30, %rsp > 264171.903577780: ffffffff810d605d trace_event_raw_event_sched_switch mov 0x48(%rdi), %rbx > 264171.903577780: ffffffff810d6061 trace_event_raw_event_sched_switch test $0x80, %bl > 264171.903577780: ffffffff810d6064 trace_event_raw_event_sched_switch jnz 0x810d608c > 264171.903577780: ffffffff810d6066 trace_event_raw_event_sched_switch test $0x40, %bl > 264171.903577780: ffffffff810d6069 trace_event_raw_event_sched_switch jnz trace_event_raw_event_sched_switch+216 > 264171.903577780: ffffffff810d606f trace_event_raw_event_sched_switch test $0x20, %bl > 264171.903577780: ffffffff810d6072 trace_event_raw_event_sched_switch jz 0x810d6083 > 264171.903577780: ffffffff810d6083 trace_event_raw_event_sched_switch and $0x1, %bh > 264171.903577780: ffffffff810d6086 trace_event_raw_event_sched_switch jnz trace_event_raw_event_sched_switch+237 > 264171.903578113: ffffffff810d608c trace_event_raw_event_sched_switch mov $0x40, %edx > 264171.903578113: 
ffffffff810d6091 trace_event_raw_event_sched_switch mov %r12, %rsi > 264171.903578113: ffffffff810d6094 trace_event_raw_event_sched_switch mov %rsp, %rdi > 264171.903578113: ffffffff810d6097 trace_event_raw_event_sched_switch call trace_event_buffer_reserve > 264171.903578113: ffffffff8114f150 trace_event_buffer_reserve test $0x1, 0x49(%rsi) > 264171.903578113: ffffffff8114f154 trace_event_buffer_reserve push %rbp > 264171.903578113: ffffffff8114f155 trace_event_buffer_reserve mov 0x10(%rsi), %r10 > 264171.903578113: ffffffff8114f159 trace_event_buffer_reserve mov %rsp, %rbp > 264171.903578113: ffffffff8114f15c trace_event_buffer_reserve push %rbx > 264171.903578113: ffffffff8114f15d trace_event_buffer_reserve jz 0x8114f17e > 264171.903578113: ffffffff8114f17e trace_event_buffer_reserve mov %rdx, %rcx > 264171.903578113: ffffffff8114f181 trace_event_buffer_reserve mov %rdi, %rbx > 264171.903578113: ffffffff8114f184 trace_event_buffer_reserve pushfq > 264171.903578113: ffffffff8114f185 trace_event_buffer_reserve pop %r8 > 264171.903578113: ffffffff8114f187 trace_event_buffer_reserve mov %gs:0x7eebd2c1(%rip), %r9d > 264171.903578113: ffffffff8114f18f trace_event_buffer_reserve and $0x7fffffff, %r9d > 264171.903578113: ffffffff8114f196 trace_event_buffer_reserve mov %r8, 0x20(%rdi) > 264171.903578113: ffffffff8114f19a trace_event_buffer_reserve mov %rsi, 0x10(%rdi) > 264171.903578113: ffffffff8114f19e trace_event_buffer_reserve mov %r9d, 0x28(%rdi) > 264171.903578113: ffffffff8114f1a2 trace_event_buffer_reserve mov 0x40(%r10), %edx > 264171.903578113: ffffffff8114f1a6 trace_event_buffer_reserve call trace_event_buffer_lock_reserve > 264171.903578113: ffffffff81142210 trace_event_buffer_lock_reserve push %rbp > 264171.903578113: ffffffff81142211 trace_event_buffer_lock_reserve mov %rsp, %rbp > 264171.903578113: ffffffff81142214 trace_event_buffer_lock_reserve push %r15 > 264171.903578113: ffffffff81142216 trace_event_buffer_lock_reserve push %r14 > 264171.903578113: 
ffffffff81142218 trace_event_buffer_lock_reserve push %r13 > 264171.903578113: ffffffff8114221a trace_event_buffer_lock_reserve push %r12 > 264171.903578113: ffffffff8114221c trace_event_buffer_lock_reserve mov %rdi, %r14 > 264171.903578113: ffffffff8114221f trace_event_buffer_lock_reserve push %rbx > 264171.903578113: ffffffff81142220 trace_event_buffer_lock_reserve mov %edx, %r12d > 264171.903578113: ffffffff81142223 trace_event_buffer_lock_reserve mov %rsi, %rbx > 264171.903578113: ffffffff81142226 trace_event_buffer_lock_reserve mov %rcx, %r13 > 264171.903578113: ffffffff81142229 trace_event_buffer_lock_reserve mov %r9d, %r15d > 264171.903578113: ffffffff8114222c trace_event_buffer_lock_reserve sub $0x10, %rsp > 264171.903578113: ffffffff81142230 trace_event_buffer_lock_reserve mov 0x28(%rsi), %rax > 264171.903578113: ffffffff81142234 trace_event_buffer_lock_reserve mov %r8, -0x30(%rbp) > 264171.903578113: ffffffff81142238 trace_event_buffer_lock_reserve mov 0x20(%rax), %rdi > 264171.903578113: ffffffff8114223c trace_event_buffer_lock_reserve mov %rdi, (%r14) > 264171.903578113: ffffffff8114223f trace_event_buffer_lock_reserve test $0x24, 0x48(%rsi) > 264171.903578113: ffffffff81142243 trace_event_buffer_lock_reserve jz 0x8114226d > 264171.903578113: ffffffff8114226d trace_event_buffer_lock_reserve mov -0x30(%rbp), %rcx > 264171.903578113: ffffffff81142271 trace_event_buffer_lock_reserve mov %r15d, %r8d > 264171.903578113: ffffffff81142274 trace_event_buffer_lock_reserve mov %r13, %rdx > 264171.903578113: ffffffff81142277 trace_event_buffer_lock_reserve mov %r12d, %esi > 264171.903578113: ffffffff8114227a trace_event_buffer_lock_reserve call trace_buffer_lock_reserve > 264171.903578113: ffffffff811421c0 trace_buffer_lock_reserve push %rbp > 264171.903578113: ffffffff811421c1 trace_buffer_lock_reserve mov %rsp, %rbp > 264171.903578113: ffffffff811421c4 trace_buffer_lock_reserve push %r14 > 264171.903578113: ffffffff811421c6 trace_buffer_lock_reserve push %r13 > 
264171.903578113: ffffffff811421c8 trace_buffer_lock_reserve push %r12 > 264171.903578113: ffffffff811421ca trace_buffer_lock_reserve push %rbx > 264171.903578113: ffffffff811421cb trace_buffer_lock_reserve mov %esi, %r12d > 264171.903578113: ffffffff811421ce trace_buffer_lock_reserve mov %rdx, %rsi > 264171.903578113: ffffffff811421d1 trace_buffer_lock_reserve mov %rcx, %r13 > 264171.903578113: ffffffff811421d4 trace_buffer_lock_reserve mov %r8d, %r14d > 264171.903578113: ffffffff811421d7 trace_buffer_lock_reserve call ring_buffer_lock_reserve > 264171.903578113: ffffffff8113d950 ring_buffer_lock_reserve mov 0x8(%rdi), %eax > 264171.903578113: ffffffff8113d953 ring_buffer_lock_reserve test %eax, %eax > 264171.903578113: ffffffff8113d955 ring_buffer_lock_reserve jnz ring_buffer_lock_reserve+166 > 264171.903578113: ffffffff8113d95b ring_buffer_lock_reserve mov %gs:0x7eecc826(%rip), %eax > 264171.903578113: ffffffff8113d962 ring_buffer_lock_reserve mov %eax, %ecx > 264171.903578113: ffffffff8113d964 ring_buffer_lock_reserve bt %rcx, 0x10(%rdi) > 264171.903578113: ffffffff8113d969 ring_buffer_lock_reserve setb %cl > 264171.903578113: ffffffff8113d96c ring_buffer_lock_reserve test %cl, %cl > 264171.903578113: ffffffff8113d96e ring_buffer_lock_reserve jz ring_buffer_lock_reserve+166 > 264171.903578113: ffffffff8113d974 ring_buffer_lock_reserve mov 0x60(%rdi), %rdx > 264171.903578113: ffffffff8113d978 ring_buffer_lock_reserve push %rbp > 264171.903578113: ffffffff8113d979 ring_buffer_lock_reserve cdqe > 264171.903578113: ffffffff8113d97b ring_buffer_lock_reserve mov %rsp, %rbp > 264171.903578113: ffffffff8113d97e ring_buffer_lock_reserve push %rbx > 264171.903578113: ffffffff8113d97f ring_buffer_lock_reserve mov (%rdx,%rax,8), %rbx > 264171.903578113: ffffffff8113d983 ring_buffer_lock_reserve mov 0x4(%rbx), %eax > 264171.903578113: ffffffff8113d986 ring_buffer_lock_reserve test %eax, %eax > 264171.903578113: ffffffff8113d988 ring_buffer_lock_reserve jnz 0x8113d9f1 > 
264171.903578113: ffffffff8113d98a ring_buffer_lock_reserve cmp $0xfe8, %rsi > 264171.903578113: ffffffff8113d991 ring_buffer_lock_reserve ja 0x8113d9f1 > 264171.903578113: ffffffff8113d993 ring_buffer_lock_reserve mov %gs:0x7eeceab6(%rip), %edx > 264171.903578113: ffffffff8113d99a ring_buffer_lock_reserve test $0x1fff00, %edx > 264171.903578113: ffffffff8113d9a0 ring_buffer_lock_reserve mov 0x20(%rbx), %ecx > 264171.903578113: ffffffff8113d9a3 ring_buffer_lock_reserve mov $0x8, %eax > 264171.903578113: ffffffff8113d9a8 ring_buffer_lock_reserve jnz 0x8113d9c6 > 264171.903578113: ffffffff8113d9aa ring_buffer_lock_reserve test %eax, %ecx > 264171.903578113: ffffffff8113d9ac ring_buffer_lock_reserve jnz 0x8113d9f1 > 264171.903578113: ffffffff8113d9ae ring_buffer_lock_reserve or %ecx, %eax > 264171.903578113: ffffffff8113d9b0 ring_buffer_lock_reserve mov %rsi, %rdx > 264171.903578113: ffffffff8113d9b3 ring_buffer_lock_reserve mov %rbx, %rsi > 264171.903578113: ffffffff8113d9b6 ring_buffer_lock_reserve mov %eax, 0x20(%rbx) > 264171.903578113: ffffffff8113d9b9 ring_buffer_lock_reserve call rb_reserve_next_event > 264171.903578113: ffffffff8113d610 rb_reserve_next_event inc 0x88(%rsi) > 264171.903578113: ffffffff8113d617 rb_reserve_next_event inc 0x90(%rsi) > 264171.903578113: ffffffff8113d61e rb_reserve_next_event mov 0x8(%rsi), %rax > 264171.903578113: ffffffff8113d622 rb_reserve_next_event cmp %rdi, %rax > 264171.903578113: ffffffff8113d625 rb_reserve_next_event jnz rb_reserve_next_event+678 > 264171.903578113: ffffffff8113d62b rb_reserve_next_event push %rbp > 264171.903578113: ffffffff8113d62c rb_reserve_next_event mov %edx, %eax > 264171.903578113: ffffffff8113d62e rb_reserve_next_event mov $0x8, %ecx > 264171.903578113: ffffffff8113d633 rb_reserve_next_event mov %rsp, %rbp > 264171.903578113: ffffffff8113d636 rb_reserve_next_event push %r15 > 264171.903578113: ffffffff8113d638 rb_reserve_next_event push %r14 > 264171.903578113: ffffffff8113d63a 
rb_reserve_next_event push %r13 > 264171.903578113: ffffffff8113d63c rb_reserve_next_event push %r12 > 264171.903578113: ffffffff8113d63e rb_reserve_next_event push %rbx > 264171.903578113: ffffffff8113d63f rb_reserve_next_event lea 0x88(%rsi), %rbx > 264171.903578113: ffffffff8113d646 rb_reserve_next_event sub $0x50, %rsp > 264171.903578113: ffffffff8113d64a rb_reserve_next_event test %edx, %edx > 264171.903578113: ffffffff8113d64c rb_reserve_next_event jnz rb_reserve_next_event+430 > 264171.903578113: ffffffff8113d7be rb_reserve_next_event cmp $0x70, %edx > 264171.903578113: ffffffff8113d7c1 rb_reserve_next_event jbe 0x8113d7c6 > 264171.903578113: ffffffff8113d7c6 rb_reserve_next_event add $0x7, %eax > 264171.903578113: ffffffff8113d7c9 rb_reserve_next_event mov $0x10, %ecx > 264171.903578113: ffffffff8113d7ce rb_reserve_next_event and $0xfffffffc, %eax > 264171.903578113: ffffffff8113d7d1 rb_reserve_next_event mov %eax, %edx > 264171.903578113: ffffffff8113d7d3 rb_reserve_next_event cmp $0xc, %eax > 264171.903578113: ffffffff8113d7d6 rb_reserve_next_event cmovnz %rdx, %rcx > 264171.903578113: ffffffff8113d7da rb_reserve_next_event jmp rb_reserve_next_event+66 > 264171.903578113: ffffffff8113d652 rb_reserve_next_event lea -0x50(%rbp), %rax > 264171.903578113: ffffffff8113d656 rb_reserve_next_event lea 0xa8(%rsi), %r12 > 264171.903578113: ffffffff8113d65d rb_reserve_next_event mov %rsi, %r13 > 264171.903578113: ffffffff8113d660 rb_reserve_next_event mov %rcx, -0x40(%rbp) > 264171.903578113: ffffffff8113d664 rb_reserve_next_event mov $0x0, -0x30(%rbp) > 264171.903578113: ffffffff8113d66b rb_reserve_next_event lea 0x18(%rax), %r15 > 264171.903578113: ffffffff8113d66f rb_reserve_next_event lea 0x10(%rax), %r14 > 264171.903578113: ffffffff8113d673 rb_reserve_next_event mov $0x0, -0x48(%rbp) > 264171.903578113: ffffffff8113d67b rb_reserve_next_event mov $0x3e8, -0x74(%rbp) > 264171.903578113: ffffffff8113d682 rb_reserve_next_event mov 0x8(%r13), %rax > 
264171.903578113: ffffffff8113d686 rb_reserve_next_event call 0x80(%rax) > 264171.903578113: ffffffff81134520 trace_clock_local push %rbp > 264171.903578113: ffffffff81134521 trace_clock_local mov %rsp, %rbp > 264171.903578113: ffffffff81134524 trace_clock_local call native_sched_clock > 264171.903578113: ffffffff81078270 native_sched_clock push %rbp > 264171.903578113: ffffffff81078271 native_sched_clock mov %rsp, %rbp > 264171.903578113: ffffffff81078274 native_sched_clock and $0xfffffffffffffff0, %rsp > 264171.903578113: ffffffff81078278 native_sched_clock nop (%rax,%rax) > 264171.903578113: ffffffff8107827d native_sched_clock rdtsc > 264171.903578113: ffffffff8107827f native_sched_clock shl $0x20, %rdx > 264171.903578113: ffffffff81078283 native_sched_clock or %rdx, %rax > 264171.903578113: ffffffff81078286 native_sched_clock mov %gs:0x7ef9cea2(%rip), %rsi > 264171.903578113: ffffffff8107828e native_sched_clock mov %gs:0x7ef9cea2(%rip), %rdx > 264171.903578113: ffffffff81078296 native_sched_clock cmp %rdx, %rsi > 264171.903578113: ffffffff81078299 native_sched_clock jnz 0x810782d1 > 264171.903578113: ffffffff8107829b native_sched_clock mov (%rsi), %edx > 264171.903578113: ffffffff8107829d native_sched_clock mov 0x4(%rsi), %ecx > 264171.903578113: ffffffff810782a0 native_sched_clock mul %rdx > 264171.903578113: ffffffff810782a3 native_sched_clock shrd %cl, %rdx, %rax > 264171.903578113: ffffffff810782a7 native_sched_clock shr %cl, %rdx > 264171.903578113: ffffffff810782aa native_sched_clock test $0x40, %cl > 264171.903578113: ffffffff810782ad native_sched_clock cmovnz %rdx, %rax > 264171.903578113: ffffffff810782b1 native_sched_clock add 0x8(%rsi), %rax > 264171.903578113: ffffffff810782b5 native_sched_clock leave > 264171.903578113: ffffffff810782b6 native_sched_clock ret > 264171.903578113: ffffffff81134529 trace_clock_local pop %rbp > 264171.903578113: ffffffff8113452a trace_clock_local ret > 264171.903578113: ffffffff8113d68c rb_reserve_next_event mov %rax, 
-0x50(%rbp) > 264171.903578113: ffffffff8113d690 rb_reserve_next_event sub 0xa8(%r13), %rax > 264171.903578113: ffffffff8113d697 rb_reserve_next_event mov 0xa8(%r13), %rcx > 264171.903578113: ffffffff8113d69e rb_reserve_next_event cmp %rcx, -0x50(%rbp) > 264171.903578113: ffffffff8113d6a2 rb_reserve_next_event jb 0x8113d6b4 > 264171.903578113: ffffffff8113d6a4 rb_reserve_next_event test $0xfffffffff8000000, %rax > 264171.903578113: ffffffff8113d6aa rb_reserve_next_event mov %rax, -0x48(%rbp) > 264171.903578113: ffffffff8113d6ae rb_reserve_next_event jnz rb_reserve_next_event+476 > 264171.903578113: ffffffff8113d6b4 rb_reserve_next_event mov -0x30(%rbp), %esi > 264171.903578113: ffffffff8113d6b7 rb_reserve_next_event test %esi, %esi > 264171.903578113: ffffffff8113d6b9 rb_reserve_next_event jnz rb_reserve_next_event+718 ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 18:01 ` Steven Rostedt @ 2016-11-21 18:06 ` Andi Kleen 2016-11-21 18:22 ` Steven Rostedt 0 siblings, 1 reply; 48+ messages in thread From: Andi Kleen @ 2016-11-21 18:06 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek > And a popf can be much more expensive than any of these. You should > know, not all instructions are equal. > > Using perf, I've seen popf take up almost 30% of a function the size of > this. In any case it's a small fraction of the 600+ instructions which are currently executed for every enabled trace point. If ftrace was actually optimized code this would make some sense, but it clearly isn't ... -Andi ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 18:06 ` Andi Kleen @ 2016-11-21 18:22 ` Steven Rostedt 2016-11-21 18:37 ` Andi Kleen 0 siblings, 1 reply; 48+ messages in thread From: Steven Rostedt @ 2016-11-21 18:22 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, 21 Nov 2016 10:06:54 -0800 Andi Kleen <andi@firstfloor.org> wrote: > > And a popf can be much more expensive than any of these. You should > > know, not all instructions are equal. > > > > Using perf, I've seen popf take up almost 30% of a function the size of > > this. > > In any case it's a small fraction of the 600+ instructions which are currently > executed for every enabled trace point. > > If ftrace was actually optimized code this would make some sense, but it > clearly isn't ... It tries to be optimized. I "unoptimized" it a while back to pull out all the inlines that were done in the tracepoint itself. That is, the trace_<tracepoint>() function is inlined in the code itself. By breaking that up a bit, I was able to save a bunch of text because the tracepoints were bloating the kernel tremendously. There can be more optimization done too. But just because it's not optimized to the best it can be (which should be our goal) is no excuse to bloat it more with popf! -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 18:22 ` Steven Rostedt @ 2016-11-21 18:37 ` Andi Kleen 2016-11-21 19:06 ` Steven Rostedt ` (3 more replies) 0 siblings, 4 replies; 48+ messages in thread From: Andi Kleen @ 2016-11-21 18:37 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek > It tries to be optimized. I "unoptimized" it a while back to pull out > all the inlines that were done in the tracepoint itself. That is, the > trace_<tracepoint>() function is inlined in the code itself. By > breaking that up a bit, I was able to save a bunch of text because the > tracepoints were bloating the kernel tremendously. Just adding a few inlines won't fix the gigantic bloat that is currently there. See the PT trace I posted earlier (it was even truncated, it's actually worse). Just a single enabled trace point took about a us. POPF can cause some serialization but it won't be more than a few tens of cycles, which would be a few percent at best. Here is it again untruncated: http://halobates.de/tracepoint-trace $ wc -l tracepoint-trace 640 tracepoint-trace $ head -1 tracepoint-trace [001] 264171.903577780: ffffffff810d6040 trace_event_raw_event_sched_switch push %rbp $ tail -1 tracepoint-trace [001] 264171.903578780: ffffffff810d6117 trace_event_raw_event_sched_switch ret > There can be more optimization done too. But just because it's not > optimized to the best it can be (which should be our goal) is no > excuse to bloat it more with popf! Ok so how should tracing in idle code work then in your opinion? -Andi ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 18:37 ` Andi Kleen @ 2016-11-21 19:06 ` Steven Rostedt 2016-11-21 19:15 ` Steven Rostedt ` (2 subsequent siblings) 3 siblings, 0 replies; 48+ messages in thread From: Steven Rostedt @ 2016-11-21 19:06 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, 21 Nov 2016 10:37:00 -0800 Andi Kleen <andi@firstfloor.org> wrote: > Ok so how should tracing in idle code work then in your opinion? As I suggested already. If we can get a lightweight rcu_is_watching() then we can do the rcu_idle work when needed, and not when we don't need it. Sure this will add more branches, but that's still cheaper than popf. -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 18:37 ` Andi Kleen 2016-11-21 19:06 ` Steven Rostedt @ 2016-11-21 19:15 ` Steven Rostedt 2016-11-21 20:44 ` Andi Kleen 2016-11-22 14:39 ` Steven Rostedt 2016-11-24 2:06 ` Steven Rostedt 3 siblings, 1 reply; 48+ messages in thread From: Steven Rostedt @ 2016-11-21 19:15 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, 21 Nov 2016 10:37:00 -0800 Andi Kleen <andi@firstfloor.org> wrote: > Just adding a few inlines won't fix the gigantic bloat that is currently > there. See the PT trace I posted earlier (it was even truncated, it's > actually worse). Just a single enabled trace point took about a us. If you want to play with that, enable "Add tracepoint that benchmarks tracepoints", and enable /sys/kernel/debug/tracing/events/benchmark I just did on my box, and I have a max of 1.27 us, average 0.184 us. Thus, on average, a tracepoint takes 184 nanoseconds. Note, a cold cache tracepoint (the first one it traced which is not saved in the "max") was 3.3 us. Yes it can still be improved, and I do spend time doing that. > > POPF can cause some serialization but it won't be more than a few tens > of cycles, which would be a few percent at best. > > Here is it again untruncated: > > http://halobates.de/tracepoint-trace There's a lot of push and pop regs due to function calling. There's places that inlines can still improve things, and perhaps even some well-placed likely()/unlikely() annotations. -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 19:15 ` Steven Rostedt @ 2016-11-21 20:44 ` Andi Kleen 2016-11-22 8:19 ` Paul E. McKenney 0 siblings, 1 reply; 48+ messages in thread From: Andi Kleen @ 2016-11-21 20:44 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek > > http://halobates.de/tracepoint-trace > > There's a lot of push and pop regs due to function calling. There's > places that inlines can still improve things, and perhaps even some > likely unlikelys well placed. Assuming you avoid all the push/pop and all the call/ret this would only be ~25% of the total instructions. There is just far too much logic and computation in there. % awk ' { print $5 } ' tracepoint-trace | sort | uniq -c | sort -rn 222 mov 57 push 57 pop 35 test 34 cmp 32 and 28 jz 25 jnz 21 ret 20 call 16 lea 11 add -Andi ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 20:44 ` Andi Kleen @ 2016-11-22 8:19 ` Paul E. McKenney 0 siblings, 0 replies; 48+ messages in thread From: Paul E. McKenney @ 2016-11-22 8:19 UTC (permalink / raw) To: Andi Kleen Cc: Steven Rostedt, Peter Zijlstra, Jiri Olsa, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, Nov 21, 2016 at 12:44:45PM -0800, Andi Kleen wrote: > > > http://halobates.de/tracepoint-trace > > > > There's a lot of push and pop regs due to function calling. There's > > places that inlines can still improve things, and perhaps even some > > likely unlikelys well placed. > > Assuming you avoid all the push/pop and all the call/ret this would only be > ~25% of the total instructions. There is just far too much logic and > computation in there. > > % awk ' { print $5 } ' tracepoint-trace | sort | uniq -c | sort -rn > 222 mov > 57 push > 57 pop > 35 test > 34 cmp > 32 and > 28 jz > 25 jnz > 21 ret > 20 call > 16 lea > 11 add Hmmm... It does indeed look like some performance analysis would be good... Thanx, Paul ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 18:37 ` Andi Kleen 2016-11-21 19:06 ` Steven Rostedt 2016-11-21 19:15 ` Steven Rostedt @ 2016-11-22 14:39 ` Steven Rostedt 2016-11-22 19:05 ` Andi Kleen 2016-11-24 2:06 ` Steven Rostedt 3 siblings, 1 reply; 48+ messages in thread From: Steven Rostedt @ 2016-11-22 14:39 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, 21 Nov 2016 10:37:00 -0800 Andi Kleen <andi@firstfloor.org> wrote: > Here is it again untruncated: > > http://halobates.de/tracepoint-trace BTW, what tool did you use to generate this? -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-22 14:39 ` Steven Rostedt @ 2016-11-22 19:05 ` Andi Kleen 0 siblings, 0 replies; 48+ messages in thread From: Andi Kleen @ 2016-11-22 19:05 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Tue, Nov 22, 2016 at 09:39:45AM -0500, Steven Rostedt wrote: > On Mon, 21 Nov 2016 10:37:00 -0800 > Andi Kleen <andi@firstfloor.org> wrote: > > > > Here is it again untruncated: > > > > http://halobates.de/tracepoint-trace > > BTW, what tool did you use to generate this? perf and Intel PT, with the disassembler patches added and udis86 installed: https://git.kernel.org/cgit/linux/kernel/git/ak/linux-misc.git/log/?h=perf/disassembler-2 http://udis86.sourceforge.net/ You need a CPU that supports PT, like Broadwell, Skylake, Goldmont. # enable trace point perf record -e intel_pt// -a sleep X perf script --itrace=i0ns --ns -F cpu,ip,time,asm,sym On Skylake you can also get more accurate timing by enabling cycle timing (at the cost of some overhead) perf record -e intel_pt/cyc=1,cyc_thresh=1/ -a ... The earlier trace didn't have that enabled. -Andi ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 18:37 ` Andi Kleen ` (2 preceding siblings ...) 2016-11-22 14:39 ` Steven Rostedt @ 2016-11-24 2:06 ` Steven Rostedt 2016-11-28 21:48 ` Andi Kleen 3 siblings, 1 reply; 48+ messages in thread From: Steven Rostedt @ 2016-11-24 2:06 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, 21 Nov 2016 10:37:00 -0800 Andi Kleen <andi@firstfloor.org> wrote: > > It tries to be optimized. I "unoptimized" it a while back to pull out > > all the inlines that were done in the tracepoint itself. That is, the > > trace_<tracepoint>() function is inlined in the code itself. By > > breaking that up a bit, I was able to save a bunch of text because the > > tracepoints were bloating the kernel tremendously. > > Just adding a few inlines won't fix the gigantic bloat that is currently > there. See the PT trace I posted earlier (it was even truncated, it's > actually worse). Just a single enabled trace point took about a us. > > POPF can cause some serializion but it won't be more than a few tens > of cycles, which would be a few percent at best. > > Here is it again untruncated: > > http://halobates.de/tracepoint-trace > I took a look at this and forced some more functions to be inlined. I did a little tweaking here and there. Could you pull my tree and see if things are better? I don't currently have the hardware to run this myself. git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git branch: ftrace/core Thanks! -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-24 2:06 ` Steven Rostedt @ 2016-11-28 21:48 ` Andi Kleen 2016-11-28 22:46 ` Steven Rostedt 0 siblings, 1 reply; 48+ messages in thread From: Andi Kleen @ 2016-11-28 21:48 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek > I took a look at this and forced some more functions to be inlined. I > did a little tweaking here and there. Could you pull my tree and see if > things are better? I don't currently have the hardware to run this > myself. > > git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git > > branch: ftrace/core Didn't help much unfortunately. I think you need a real fast path here. http://halobates.de/tracepoint-trace2 $ wc -l tracepoint-trace2 595 tracepoint-trace2 Opcode Histogram: 201 mov 51 push 51 pop 34 test 32 cmp 26 jz 24 jnz 24 and 16 ret 16 lea 15 call 12 add 10 jmp Functions with # of instructions: 25 trace_event_raw_event_sched_switch 19 trace_event_buffer_reserve 21 trace_event_buffer_lock_reserve 57 ring_buffer_lock_reserve 3 trace_clock_local 3 sched_clock 24 native_sched_clock 3 sched_clock 2 trace_clock_local 11 ring_buffer_lock_reserve 81 __rb_reserve_next 11 ring_buffer_lock_reserve 6 trace_event_buffer_lock_reserve 18 ring_buffer_event_data 5 trace_event_buffer_lock_reserve 38 tracing_generic_entry_update 11 trace_event_buffer_lock_reserve 5 trace_event_buffer_reserve 18 ring_buffer_event_data 4 trace_event_buffer_reserve 28 trace_event_raw_event_sched_switch 30 trace_event_buffer_commit 19 trace_buffer_unlock_commit_regs 11 ring_buffer_unlock_commit 85 rb_commit 18 ring_buffer_unlock_commit 10 trace_buffer_unlock_commit_regs 3 ftrace_trace_userstack 7 trace_buffer_unlock_commit_regs 11 trace_event_buffer_commit 8 trace_event_raw_event_sched_switch -Andi ^ permalink raw reply [flat|nested] 48+ messages in thread
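[Editor's note: the opcode histogram above can be derived from a `perf script` asm dump by counting the first token of each disassembled line. The sketch below is a hedged illustration of that counting step only — it is not the tool Andi used, and the sample lines are hypothetical.]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count how many lines of a disassembly listing start with the given
 * mnemonic. Assumes the mnemonic is the first whitespace-delimited
 * token on each line, as in a typical asm dump. */
static int count_opcode(const char **lines, int n, const char *op)
{
	int count = 0;

	for (int i = 0; i < n; i++) {
		char mnemonic[32];

		if (sscanf(lines[i], "%31s", mnemonic) == 1 &&
		    strcmp(mnemonic, op) == 0)
			count++;
	}
	return count;
}
```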
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-28 21:48 ` Andi Kleen @ 2016-11-28 22:46 ` Steven Rostedt 2016-11-28 23:10 ` Andi Kleen 0 siblings, 1 reply; 48+ messages in thread From: Steven Rostedt @ 2016-11-28 22:46 UTC (permalink / raw) To: Andi Kleen Cc: Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, 28 Nov 2016 13:48:25 -0800 Andi Kleen <andi@firstfloor.org> wrote: > > I took a look at this and forced some more functions to be inlined. I > > did a little tweaking here and there. Could you pull my tree and see if > > things are better? I don't currently have the hardware to run this > > myself. > > > > git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git > > > > branch: ftrace/core > > Didn't help much unfortunately. I think you need a real fast path here. Well, actually it's been optimized a lot from the first instances. Note, tracepoints need to be quite generic, so where it can be optimized is not obvious. > > http://halobates.de/tracepoint-trace2 > > $ wc -l tracepoint-trace2 > 595 tracepoint-trace2 > > Opcode Histogram: > > 201 mov > 51 push > 51 pop > 34 test > 32 cmp > 26 jz > 24 jnz > 24 and > 16 ret > 16 lea > 15 call > 12 add > 10 jmp > > Functions with # of instructions: > > 25 trace_event_raw_event_sched_switch This is from the TRACE_EVENT macro. 
trace_event_raw_event_##call(void *__data, proto) \ { \ struct trace_event_file *trace_file = __data; \ struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\ struct trace_event_buffer fbuffer; \ struct trace_event_raw_##call *entry; \ int __data_size; \ \ if (trace_trigger_soft_disabled(trace_file)) \ return; \ \ __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \ \ entry = trace_event_buffer_reserve(&fbuffer, trace_file, \ sizeof(*entry) + __data_size); \ > 19 trace_event_buffer_reserve void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer, struct trace_event_file *trace_file, unsigned long len) { struct trace_event_call *event_call = trace_file->event_call; if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) && trace_event_ignore_this_pid(trace_file)) return NULL; local_save_flags(fbuffer->flags); fbuffer->pc = preempt_count(); /* * If CONFIG_PREEMPT is enabled, then the tracepoint itself disables * preemption (adding one to the preempt_count). Since we are * interested in the preempt_count at the time the tracepoint was * hit, we need to subtract one to offset the increment. */ if (IS_ENABLED(CONFIG_PREEMPT)) fbuffer->pc--; fbuffer->trace_file = trace_file; fbuffer->event = trace_event_buffer_lock_reserve(&fbuffer->buffer, trace_file, event_call->event.type, len, fbuffer->flags, fbuffer->pc); > 21 trace_event_buffer_lock_reserve struct ring_buffer_event * trace_event_buffer_lock_reserve(struct ring_buffer **current_rb, struct trace_event_file *trace_file, int type, unsigned long len, unsigned long flags, int pc) { struct ring_buffer_event *entry; int val; *current_rb = trace_file->tr->trace_buffer.buffer; if ((trace_file->flags & (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) && (entry = this_cpu_read(trace_buffered_event))) { Note: Some of these if statements can be encapsulated with jump labels, as they are false pretty much all of the time. 
/* Try to use the per cpu buffer first */ val = this_cpu_inc_return(trace_buffered_event_cnt); if (val == 1) { trace_event_setup(entry, type, flags, pc); entry->array[0] = len; return entry; } this_cpu_dec(trace_buffered_event_cnt); } entry = trace_buffer_lock_reserve(*current_rb, type, len, flags, pc); > 57 ring_buffer_lock_reserve struct ring_buffer_event * ring_buffer_lock_reserve(struct ring_buffer *buffer, unsigned long length) { struct ring_buffer_per_cpu *cpu_buffer; struct ring_buffer_event *event; int cpu; /* If we are tracing schedule, we don't want to recurse */ preempt_disable_notrace(); if (unlikely(atomic_read(&buffer->record_disabled))) goto out; cpu = raw_smp_processor_id(); if (unlikely(!cpumask_test_cpu(cpu, buffer->cpumask))) goto out; cpu_buffer = buffer->buffers[cpu]; if (unlikely(atomic_read(&cpu_buffer->record_disabled))) goto out; if (unlikely(length > BUF_MAX_DATA_SIZE)) goto out; if (unlikely(trace_recursive_lock(cpu_buffer))) goto out; event = rb_reserve_next_event(buffer, cpu_buffer, length); if (!event) goto out_unlock; static struct ring_buffer_event * rb_reserve_next_event(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer, unsigned long length) { struct ring_buffer_event *event; struct rb_event_info info; int nr_loops = 0; u64 diff; rb_start_commit(cpu_buffer); #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP /* * Due to the ability to swap a cpu buffer from a buffer * it is possible it was swapped before we committed. * (committing stops a swap). We check for it here and * if it happened, we have to fail the write. */ barrier(); if (unlikely(ACCESS_ONCE(cpu_buffer->buffer) != buffer)) { local_dec(&cpu_buffer->committing); local_dec(&cpu_buffer->commits); return NULL; } #endif info.length = rb_calculate_event_length(length); again: info.add_timestamp = 0; info.delta = 0; /* * We allow for interrupts to reenter here and do a trace. * If one does, it will cause this original code to loop * back here. 
Even with heavy interrupts happening, this * should only happen a few times in a row. If this happens * 1000 times in a row, there must be either an interrupt * storm or we have something buggy. * Bail! */ if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)) goto out_fail; info.ts = rb_time_stamp(cpu_buffer->buffer); diff = info.ts - cpu_buffer->write_stamp; /* make sure this diff is calculated here */ barrier(); /* Did the write stamp get updated already? */ if (likely(info.ts >= cpu_buffer->write_stamp)) { info.delta = diff; if (unlikely(test_time_stamp(info.delta))) rb_handle_timestamp(cpu_buffer, &info); } event = __rb_reserve_next(cpu_buffer, &info); > 3 trace_clock_local > 3 sched_clock > 24 native_sched_clock Not much we can do about the clock > 3 sched_clock > 2 trace_clock_local > 11 ring_buffer_lock_reserve > 81 __rb_reserve_next static struct ring_buffer_event * __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer, struct rb_event_info *info) { struct ring_buffer_event *event; struct buffer_page *tail_page; unsigned long tail, write; /* * If the time delta since the last event is too big to * hold in the time field of the event, then we append a * TIME EXTEND event ahead of the data event. */ if (unlikely(info->add_timestamp)) info->length += RB_LEN_TIME_EXTEND; /* Don't let the compiler play games with cpu_buffer->tail_page */ tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page); write = local_add_return(info->length, &tail_page->write); /* set write to only the index of the write */ write &= RB_WRITE_MASK; tail = write - info->length; /* * If this is the first commit on the page, then it has the same * timestamp as the page itself. 
*/ if (!tail) info->delta = 0; /* See if we shot pass the end of this buffer page */ if (unlikely(write > BUF_PAGE_SIZE)) return rb_move_tail(cpu_buffer, tail, info); /* We reserved something on the buffer */ event = __rb_page_index(tail_page, tail); kmemcheck_annotate_bitfield(event, bitfield); rb_update_event(cpu_buffer, event, info); local_inc(&tail_page->entries); /* * If this is the first commit on the page, then update * its timestamp. */ if (!tail) tail_page->page->time_stamp = info->ts; /* account for these added bytes */ local_add(info->length, &cpu_buffer->entries_bytes); return event; } > 11 ring_buffer_lock_reserve > 6 trace_event_buffer_lock_reserve if (!fbuffer->event) return NULL; fbuffer->entry = ring_buffer_event_data(fbuffer->event); return fbuffer->entry; > 18 ring_buffer_event_data static void * rb_event_data(struct ring_buffer_event *event) { if (event->type_len == RINGBUF_TYPE_TIME_EXTEND) event = skip_time_extend(event); BUG_ON(event->type_len > RINGBUF_TYPE_DATA_TYPE_LEN_MAX); /* If length is in len field, then array[0] has the data */ if (event->type_len) return (void *)&event->array[0]; /* Otherwise length is in array[0] and array[1] has the data */ return (void *)&event->array[1]; } Hmm, the above could probably be optimized better. > 5 trace_event_buffer_lock_reserve > 38 tracing_generic_entry_update I tried to optimize this one, but nothing really helped. void tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags, int pc) { struct task_struct *tsk = current; entry->preempt_count = pc & 0xff; entry->pid = (tsk) ? tsk->pid : 0; entry->flags = #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) | #else TRACE_FLAG_IRQS_NOSUPPORT | #endif ((pc & NMI_MASK ) ? TRACE_FLAG_NMI : 0) | ((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) | ((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) | (tif_need_resched() ? TRACE_FLAG_NEED_RESCHED : 0) | (test_preempt_need_resched() ? 
TRACE_FLAG_PREEMPT_RESCHED : 0); } > 11 trace_event_buffer_lock_reserve > 5 trace_event_buffer_reserve > 18 ring_buffer_event_data Strange, this is called twice? > 4 trace_event_buffer_reserve > 28 trace_event_raw_event_sched_switch > 30 trace_event_buffer_commit > 19 trace_buffer_unlock_commit_regs > 11 ring_buffer_unlock_commit > 85 rb_commit The rb_commit is a pain due to lockless commits. Namely this part: static void rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer) { unsigned long max_count; /* * We only race with interrupts and NMIs on this CPU. * If we own the commit event, then we can commit * all others that interrupted us, since the interruptions * are in stack format (they finish before they come * back to us). This allows us to do a simple loop to * assign the commit to the tail. */ again: max_count = cpu_buffer->nr_pages * 100; while (cpu_buffer->commit_page != READ_ONCE(cpu_buffer->tail_page)) { if (RB_WARN_ON(cpu_buffer, !(--max_count))) return; if (RB_WARN_ON(cpu_buffer, rb_is_reader_page(cpu_buffer->tail_page))) return; local_set(&cpu_buffer->commit_page->page->commit, rb_page_write(cpu_buffer->commit_page)); rb_inc_page(cpu_buffer, &cpu_buffer->commit_page); /* Only update the write stamp if the page has an event */ if (rb_page_write(cpu_buffer->commit_page)) cpu_buffer->write_stamp = cpu_buffer->commit_page->page->time_stamp; /* add barrier to keep gcc from optimizing too much */ barrier(); } while (rb_commit_index(cpu_buffer) != rb_page_write(cpu_buffer->commit_page)) { local_set(&cpu_buffer->commit_page->page->commit, rb_page_write(cpu_buffer->commit_page)); RB_WARN_ON(cpu_buffer, local_read(&cpu_buffer->commit_page->page->commit) & ~RB_WRITE_MASK); barrier(); } /* again, keep gcc from optimizing */ barrier(); /* * If an interrupt came in just after the first while loop * and pushed the tail page forward, we will be left with * a dangling commit that will never go forward. 
*/ if (unlikely(cpu_buffer->commit_page != READ_ONCE(cpu_buffer->tail_page))) goto again; } > 18 ring_buffer_unlock_commit > 10 trace_buffer_unlock_commit_regs > 3 ftrace_trace_userstack I should make this call a jump label, as it's seldom used. > 7 trace_buffer_unlock_commit_regs > 11 trace_event_buffer_commit > 8 trace_event_raw_event_sched_switch > There are a few things that can still be optimized. But the issue really is that trace_events are generic and macro created, which I need to balance how much gets inlined vs how much is passed around to functions. And the ring buffer is lockless, thus it needs to do various hoops to make sure things don't break. -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
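[Editor's note: the flag packing in tracing_generic_entry_update() quoted above reduces to a handful of bit tests against the preempt count. A userspace model of that packing is sketched below; the mask and flag values are illustrative stand-ins, not the kernel's actual preempt_count layout.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative masks: softirq/hardirq/NMI nesting live in distinct
 * bit ranges of the preempt count (values here are assumptions). */
#define SOFTIRQ_MASK  0x0000ff00u
#define HARDIRQ_MASK  0x000f0000u
#define NMI_MASK      0x00f00000u

#define TRACE_FLAG_IRQS_OFF 0x01
#define TRACE_FLAG_NMI      0x02
#define TRACE_FLAG_HARDIRQ  0x04
#define TRACE_FLAG_SOFTIRQ  0x08

/* Pack the current context into one byte, as the quoted kernel
 * function does for each trace entry. */
static uint8_t pack_trace_flags(int irqs_disabled, unsigned int pc)
{
	return (irqs_disabled      ? TRACE_FLAG_IRQS_OFF : 0) |
	       ((pc & NMI_MASK)     ? TRACE_FLAG_NMI      : 0) |
	       ((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ  : 0) |
	       ((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ  : 0);
}
```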
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-28 22:46 ` Steven Rostedt @ 2016-11-28 23:10 ` Andi Kleen 0 siblings, 0 replies; 48+ messages in thread From: Andi Kleen @ 2016-11-28 23:10 UTC (permalink / raw) To: Steven Rostedt Cc: Andi Kleen, Peter Zijlstra, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek > Well, actually it's been optimized a lot from the first instances. > Note, tracepoints need to be quite generic, so where it can be optimized > is not obvious. There is quite a bit of stuff that doesn't change from trace point to trace point. So you could just cache all these decisions into a single per cpu variable, and invalidate if anything changes the conditions. > > > > Functions with # of instructions: > > > > 25 trace_event_raw_event_sched_switch > > This is from the TRACE_EVENT macro. > > if (trace_trigger_soft_disabled(trace_file)) \ > return; \ Cache? > if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) && > trace_event_ignore_this_pid(trace_file)) > return NULL; > > *current_rb = trace_file->tr->trace_buffer.buffer; Indirection could be cached. > > if ((trace_file->flags & > (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED)) && > (entry = this_cpu_read(trace_buffered_event))) { Multiple tests of trace_file->flags in different functions, could be all folded into a single test. > > > Note: Some of these if statements can be encapsulated with jump labels, as they > are false pretty much all of the time. > > > /* Try to use the per cpu buffer first */ > val = this_cpu_inc_return(trace_buffered_event_cnt); > if (val == 1) { > trace_event_setup(entry, type, flags, pc); > entry->array[0] = len; > return entry; > } > this_cpu_dec(trace_buffered_event_cnt); This can be pre cached in a fast path. If true just point per cpu variable directly to buffer and invalidate if buffer changes or is full. 
> preempt_disable_notrace(); > > if (unlikely(atomic_read(&buffer->record_disabled))) > goto out; Cache. > > cpu = raw_smp_processor_id(); > > if (unlikely(!cpumask_test_cpu(cpu, buffer->cpumask))) > goto out; Not needed for per cpu buffers in a fast path. > > cpu_buffer = buffer->buffers[cpu]; > > if (unlikely(atomic_read(&cpu_buffer->record_disabled))) > goto out; cache > > if (unlikely(length > BUF_MAX_DATA_SIZE)) > goto out; > > if (unlikely(trace_recursive_lock(cpu_buffer))) > goto out; locking should invalidate state on this cpu, so shouldn't be needed > u64 diff; > > rb_start_commit(cpu_buffer); > > #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP > /* > * Due to the ability to swap a cpu buffer from a buffer > * it is possible it was swapped before we committed. > * (committing stops a swap). We check for it here and > * if it happened, we have to fail the write. > */ > barrier(); > if (unlikely(ACCESS_ONCE(cpu_buffer->buffer) != buffer)) { > local_dec(&cpu_buffer->committing); > local_dec(&cpu_buffer->commits); > return NULL; > } swapping should invalidate state centrally (and then run slow path like current path) > /* > * We allow for interrupts to reenter here and do a trace. > * If one does, it will cause this original code to loop > * back here. Even with heavy interrupts happening, this > * should only happen a few times in a row. If this happens > * 1000 times in a row, there must be either an interrupt > * storm or we have something buggy. > * Bail! > */ > if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)) > goto out_fail; didn't you disable interrupts earlier? > /* Did the write stamp get updated already? */ > if (likely(info.ts >= cpu_buffer->write_stamp)) { > info.delta = diff; > if (unlikely(test_time_stamp(info.delta))) > rb_handle_timestamp(cpu_buffer, &info); > } Not sure why a write stamp is needed. isn't that the last few entries? ... -Andi ^ permalink raw reply [flat|nested] 48+ messages in thread
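[Editor's note: the caching Andi proposes — fold the per-event checks into one precomputed flag that slow-path state changes invalidate — can be sketched in userspace as below. The structure and fields are hypothetical; the kernel has no such cached flag.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-buffer state with a cached fast-path verdict. */
struct buf_state {
	bool record_disabled;
	bool filtered;
	bool fast_path_ok;	/* cached result of the checks below */
};

/* Recompute the cached verdict; called from the slow path whenever
 * any of the underlying conditions changes. */
static void revalidate(struct buf_state *s)
{
	s->fast_path_ok = !s->record_disabled && !s->filtered;
}

static void set_record_disabled(struct buf_state *s, bool v)
{
	s->record_disabled = v;
	revalidate(s);
}

/* The hot path then does one load instead of several tests. */
static bool can_take_fast_path(const struct buf_state *s)
{
	return s->fast_path_ok;
}
```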
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 17:18 ` Peter Zijlstra 2016-11-21 17:45 ` Andi Kleen @ 2016-11-21 17:55 ` Steven Rostedt 2016-11-21 18:24 ` Steven Rostedt 1 sibling, 1 reply; 48+ messages in thread From: Steven Rostedt @ 2016-11-21 17:55 UTC (permalink / raw) To: Peter Zijlstra Cc: Andi Kleen, Jiri Olsa, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, 21 Nov 2016 18:18:53 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid > locking, as one does. Just to be clear, as ftrace in the kernel mostly represents function tracing, which doesn't use RCU. This is a tracepoint feature. > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint > with that is quite a few cycles. Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not concerned enough to put a heavy weight rcu idle code in for all tracepoints. Although, what about a percpu flag that can be checked in the tracepoint code to see if it should enable RCU or not? Hmm, I wonder if "rcu_is_watching()" is light enough to have in all tracepoints? -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 17:55 ` Steven Rostedt @ 2016-11-21 18:24 ` Steven Rostedt 2016-11-21 20:12 ` Paul E. McKenney 0 siblings, 1 reply; 48+ messages in thread From: Steven Rostedt @ 2016-11-21 18:24 UTC (permalink / raw) To: Paul E. McKenney Cc: Peter Zijlstra, Andi Kleen, Jiri Olsa, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek Paul, On Mon, 21 Nov 2016 12:55:01 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 21 Nov 2016 18:18:53 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid > > locking, as one does. > > Just to be clear, as ftrace in the kernel mostly represents function > tracing, which doesn't use RCU. This is a tracepoint feature. > > > > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint > > with that is quite a few cycles. > > Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not > concerned enough to put a heavy weight rcu idle code in for all > tracepoints. > > Although, what about a percpu flag that can be checked in the > tracepoint code to see if it should enable RCU or not? > > Hmm, I wonder if "rcu_is_watching()" is light enough to have in all > tracepoints? Is it possible to make rcu_is_watching() an inlined call to prevent the overhead of doing a function call? This way we could use this in the fast path of the tracepoint. -- Steve ^ permalink raw reply [flat|nested] 48+ messages in thread
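[Editor's note: the guard Steve is musing about — check rcu_is_watching() in the tracepoint fast path and fall back to the heavier _rcuidle variant only when RCU is not watching — looks roughly like the toy model below. The state is stubbed out with a plain flag; the real rcu_is_watching() reads per-CPU dynticks state.]

```c
#include <assert.h>
#include <stdbool.h>

static bool rcu_watching = true;	/* stub for rcu_is_watching() */
static int normal_fires, rcuidle_fires;

static void trace_point(void)         { normal_fires++; }
static void trace_point_rcuidle(void) { rcuidle_fires++; }

/* Fast path stays cheap when RCU is watching; only the idle case
 * pays the rcu_irq_enter_irqson()/exit cost. */
static void maybe_trace(void)
{
	if (rcu_watching)
		trace_point();
	else
		trace_point_rcuidle();
}
```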
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 18:24 ` Steven Rostedt @ 2016-11-21 20:12 ` Paul E. McKenney 0 siblings, 0 replies; 48+ messages in thread From: Paul E. McKenney @ 2016-11-21 20:12 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, Andi Kleen, Jiri Olsa, linux-kernel, Ingo Molnar, Josh Triplett, Jan Stancek On Mon, Nov 21, 2016 at 01:24:38PM -0500, Steven Rostedt wrote: > > Paul, > > > On Mon, 21 Nov 2016 12:55:01 -0500 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > On Mon, 21 Nov 2016 18:18:53 +0100 > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > Its not ftrace as such though, its RCU, ftrace simply uses RCU to avoid > > > locking, as one does. > > > > Just to be clear, as ftrace in the kernel mostly represents function > > tracing, which doesn't use RCU. This is a tracepoint feature. > > > > > > > > Biggest objection would be that the rcu_irq_enter_irqson() thing does > > > POPF and rcu_irq_exit_irqson() does again. So wrapping every tracepoint > > > with that is quite a few cycles. > > > > Agree. Even though this ends up being a whack-a-mole(TM) fix, I'm not > > concerned enough to put a heavy weight rcu idle code in for all > > tracepoints. > > > > Although, what about a percpu flag that can be checked in the > > tracepoint code to see if it should enable RCU or not? > > > > Hmm, I wonder if "rcu_is_watching()" is light enough to have in all > > tracepoints? > > Is it possible to make rcu_is_watching() an inlined call to prevent the > overhead of doing a function call? This way we could use this in the > fast path of the tracepoint. It would mean exposing the rcu_dynticks structure to the rest of the kernel, but I guess that wouldn't be the end of the world. Are you calling rcu_is_watching() or __rcu_is_watching()? The latter is appropriate if preemption is already disabled. Thanx, Paul ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2016-11-21 9:28 ` Peter Zijlstra 2016-11-21 9:34 ` Jiri Olsa 2016-11-21 17:06 ` Andi Kleen @ 2017-02-23 12:24 ` Jiri Olsa 2017-02-23 17:11 ` Andi Kleen 2017-02-23 17:28 ` Peter Zijlstra 2 siblings, 2 replies; 48+ messages in thread From: Jiri Olsa @ 2017-02-23 12:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote: > On Mon, Nov 21, 2016 at 01:53:43AM +0100, Jiri Olsa wrote: > > hi, > > Jan hit following output when msr tracepoints are enabled on amd server: > > > > [ 91.585653] =============================== > > [ 91.589840] [ INFO: suspicious RCU usage. ] > > [ 91.594025] 4.9.0-rc1+ #1 Not tainted > > [ 91.597691] ------------------------------- > > [ 91.601877] ./arch/x86/include/asm/msr-trace.h:42 suspicious rcu_dereference_check() usage! > > [ 91.610222] > > [ 91.610222] other info that might help us debug this: > > [ 91.610222] > > [ 91.618224] > > [ 91.618224] RCU used illegally from idle CPU! > > [ 91.618224] rcu_scheduler_active = 1, debug_locks = 0 > > [ 91.629081] RCU used illegally from extended quiescent state! > > [ 91.634820] no locks held by swapper/1/0. 
> > [ 91.638832] > > [ 91.638832] stack backtrace: > > [ 91.643192] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc1+ #1 > > [ 91.649457] Hardware name: empty empty/S3992, BIOS 'V2.03 ' 05/09/2008 > > [ 91.656159] ffffc900018fbdf8 ffffffff813ed43c ffff88017ede8000 0000000000000001 > > [ 91.663637] ffffc900018fbe28 ffffffff810fdcd7 ffff880233f95dd0 00000000c0010055 > > [ 91.671107] 0000000000000000 0000000000000000 ffffc900018fbe58 ffffffff814297ac > > [ 91.678560] Call Trace: > > [ 91.681022] [<ffffffff813ed43c>] dump_stack+0x85/0xc9 > > [ 91.686164] [<ffffffff810fdcd7>] lockdep_rcu_suspicious+0xe7/0x120 > > [ 91.692429] [<ffffffff814297ac>] do_trace_read_msr+0x14c/0x1b0 > > [ 91.698349] [<ffffffff8106ddb2>] native_read_msr+0x32/0x40 > > [ 91.703921] [<ffffffff8103b2be>] amd_e400_idle+0x7e/0x110 > > [ 91.709407] [<ffffffff8103b78f>] arch_cpu_idle+0xf/0x20 > > [ 91.714720] [<ffffffff8181cd33>] default_idle_call+0x23/0x40 > > [ 91.720467] [<ffffffff810f306a>] cpu_startup_entry+0x1da/0x2b0 > > [ 91.726387] [<ffffffff81058b1f>] start_secondary+0x17f/0x1f0 > > > > > > it got away with attached change.. but this rcu logic > > is far beyond me, so it's just wild guess.. ;-) > > I think I prefer something like the below, that only annotates the one > RDMSR in question, instead of all of them. > > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > index 0888a879120f..d6c6aa80675f 100644 > --- a/arch/x86/kernel/process.c > +++ b/arch/x86/kernel/process.c > @@ -357,7 +357,7 @@ static void amd_e400_idle(void) > if (!amd_e400_c1e_detected) { > u32 lo, hi; > > - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); > + RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); > > if (lo & K8_INTP_C1E_ACTIVE_MASK) { > amd_e400_c1e_detected = true; hum, I might have missed some other solution in discussion, and can't see this one being pulled in.. should I resend this? thanks, jirka ^ permalink raw reply [flat|nested] 48+ messages in thread
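[Editor's note: Peter's RCU_NONIDLE() fix wraps the single offending rdmsr so that RCU treats the CPU as momentarily non-idle around it. A toy model of the wrapper's shape is below; the counters stand in for rcu_irq_enter_irqson()/rcu_irq_exit_irqson(), and rdmsr_stub() is a placeholder, not the real MSR accessor.]

```c
#include <assert.h>

static int rcu_nonidle_depth;	/* stands in for RCU's dynticks bookkeeping */

/* Mark the wrapped statement as running in a non-idle RCU context. */
#define RCU_NONIDLE(stmt) do {		\
	rcu_nonidle_depth++;		\
	stmt;				\
	rcu_nonidle_depth--;		\
} while (0)

static int msr_reads;
static void rdmsr_stub(void) { msr_reads++; }
```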
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2017-02-23 12:24 ` Jiri Olsa @ 2017-02-23 17:11 ` Andi Kleen 2017-02-23 17:28 ` Peter Zijlstra 1 sibling, 0 replies; 48+ messages in thread From: Andi Kleen @ 2017-02-23 17:11 UTC (permalink / raw) To: Jiri Olsa Cc: Peter Zijlstra, Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > > index 0888a879120f..d6c6aa80675f 100644 > > --- a/arch/x86/kernel/process.c > > +++ b/arch/x86/kernel/process.c > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void) > > if (!amd_e400_c1e_detected) { > > u32 lo, hi; > > > > - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); > > + RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); > > > > if (lo & K8_INTP_C1E_ACTIVE_MASK) { > > amd_e400_c1e_detected = true; > > hum, I might have missed some other solution in discussion, > and can't see this one being pulled in.. should I resend this? I suppose you could also use the new notrace msr variant. BTW this check is quite annoying. -Andi ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2017-02-23 12:24 ` Jiri Olsa 2017-02-23 17:11 ` Andi Kleen @ 2017-02-23 17:28 ` Peter Zijlstra 2017-02-23 17:56 ` Borislav Petkov 1 sibling, 1 reply; 48+ messages in thread From: Peter Zijlstra @ 2017-02-23 17:28 UTC (permalink / raw) To: Jiri Olsa Cc: Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek, Borislav Petkov, Thomas Gleixner On Thu, Feb 23, 2017 at 01:24:34PM +0100, Jiri Olsa wrote: > On Mon, Nov 21, 2016 at 10:28:50AM +0100, Peter Zijlstra wrote: > > > > I think I prefer something like the below, that only annotates the one > > RDMSR in question, instead of all of them. > > > > > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > > index 0888a879120f..d6c6aa80675f 100644 > > --- a/arch/x86/kernel/process.c > > +++ b/arch/x86/kernel/process.c > > @@ -357,7 +357,7 @@ static void amd_e400_idle(void) > > if (!amd_e400_c1e_detected) { > > u32 lo, hi; > > > > - rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi); > > + RCU_NONIDLE(rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi)); > > > > if (lo & K8_INTP_C1E_ACTIVE_MASK) { > > amd_e400_c1e_detected = true; > > hum, I might have missed some other solution in discussion, > and can't see this one being pulled in.. should I resend this? > I think Boris and Thomas were talking about moving the entire c1e detection out of there. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [BUG] msr-trace.h:42 suspicious rcu_dereference_check() usage! 2017-02-23 17:28 ` Peter Zijlstra @ 2017-02-23 17:56 ` Borislav Petkov 0 siblings, 0 replies; 48+ messages in thread From: Borislav Petkov @ 2017-02-23 17:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Jiri Olsa, Steven Rostedt, Paul E. McKenney, linux-kernel, Ingo Molnar, Josh Triplett, Andi Kleen, Jan Stancek, Thomas Gleixner On Thu, Feb 23, 2017 at 06:28:01PM +0100, Peter Zijlstra wrote: > I think Boris and Thomas were talking about moving the entire c1e > detection out of there. Yeah, see 07c94a381253 ("x86/amd: Simplify AMD E400 aware idle routine") tglx and I got rid of reading that MSR in the idle routine. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 48+ messages in thread