* [PATCH] x86 trace: Fix page fault tracing bug @ 2014-02-28 15:33 Jiri Olsa 2014-02-28 15:47 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 115+ messages in thread From: Jiri Olsa @ 2014-02-28 15:33 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, H. Peter Anvin, Seiji Aguchi The trace_do_page_fault function trigger tracepoint and then handles the actual page fault. This could lead to error if the tracepoint caused page fault. The original cr2 value gets lost and the original page fault handler kills current process with SIGSEGV. This happens if you record page faults with callchain data, the user part of it will cause tracepoint handler to page fault: # perf record -g -e exceptions:page_fault_user ls Fixing this by saving the original cr2 value and using it after tracepoint handler is done. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> --- arch/x86/mm/fault.c | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 9d591c8..52fad6c 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1016,11 +1016,11 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) * routines. */ static void __kprobes -__do_page_fault(struct pt_regs *regs, unsigned long error_code) +__do_page_fault(struct pt_regs *regs, unsigned long error_code, + unsigned long address) { struct vm_area_struct *vma; struct task_struct *tsk; - unsigned long address; struct mm_struct *mm; int fault; unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; @@ -1028,9 +1028,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) tsk = current; mm = tsk->mm; - /* Get the faulting address: */ - address = read_cr2(); - /* * Detect and handle instructions that would cause a page fault for * both a tracked kernel page and a userspace page. @@ -1248,9 +1245,14 @@ dotraplinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; + unsigned long address; prev_state = exception_enter(); - __do_page_fault(regs, error_code); + + /* Get the faulting address: */ + address = read_cr2(); + + __do_page_fault(regs, error_code, address); exception_exit(prev_state); } @@ -1267,9 +1269,18 @@ dotraplinkage void __kprobes trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; + unsigned long address; prev_state = exception_enter(); + + /* + * The tracepoint processing could trigger another page + * fault (user space callchain reading) and destroy the + * original cr2 value, so read the faulting address now. + */ + address = read_cr2(); + trace_page_fault_entries(regs, error_code); - __do_page_fault(regs, error_code); + __do_page_fault(regs, error_code, address); exception_exit(prev_state); } -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [PATCH] x86 trace: Fix page fault tracing bug 2014-02-28 15:33 [PATCH] x86 trace: Fix page fault tracing bug Jiri Olsa @ 2014-02-28 15:47 ` Peter Zijlstra 2014-02-28 16:05 ` [PATCHv2] " Jiri Olsa 2014-02-28 15:47 ` [PATCH] x86 trace: Fix page fault tracing bug Jiri Olsa 2014-02-28 16:01 ` Steven Rostedt 2 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 15:47 UTC (permalink / raw) To: Jiri Olsa Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, H. Peter Anvin, Seiji Aguchi On Fri, Feb 28, 2014 at 04:33:40PM +0100, Jiri Olsa wrote: While I like the idea of just pushing up the CR2 read; the below does the read too late still, exception_enter() also has a tracepoint in. > @@ -1267,9 +1269,18 @@ dotraplinkage void __kprobes > trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + unsigned long address; > > prev_state = exception_enter(); > + > + /* > + * The tracepoint processing could trigger another page > + * fault (user space callchain reading) and destroy the > + * original cr2 value, so read the faulting address now. > + */ > + address = read_cr2(); > + > trace_page_fault_entries(regs, error_code); > - __do_page_fault(regs, error_code); > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } > -- > 1.7.11.7 > ^ permalink raw reply [flat|nested] 115+ messages in thread
* [PATCHv2] x86 trace: Fix page fault tracing bug 2014-02-28 15:47 ` Peter Zijlstra @ 2014-02-28 16:05 ` Jiri Olsa 2014-02-28 16:11 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 115+ messages in thread From: Jiri Olsa @ 2014-02-28 16:05 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, H. Peter Anvin, Seiji Aguchi, Steven Rostedt, Vince Weaver On Fri, Feb 28, 2014 at 04:47:08PM +0100, Peter Zijlstra wrote: > On Fri, Feb 28, 2014 at 04:33:40PM +0100, Jiri Olsa wrote: > > While I like the idea of just pushing up the CR2 read; the below does > the read too late still, exception_enter() also has a tracepoint in. please check v2, thanks jirka --- The trace_do_page_fault function trigger tracepoint and then handles the actual page fault. This could lead to error if the tracepoint caused page fault. The original cr2 value gets lost and the original page fault handler kills current process with SIGSEGV. This happens if you record page faults with callchain data, the user part of it will cause tracepoint handler to page fault: # perf record -g -e exceptions:page_fault_user ls Fixing this by saving the original cr2 value and using it after tracepoint handler is done. v2: Moving the cr2 read before exception_enter, because it could trigger tracepoint as well. Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Steven Rostedt <rostedt@goodmis.org> --- arch/x86/mm/fault.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 9d591c8..dd59031 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1016,11 +1016,11 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) * routines. */ static void __kprobes -__do_page_fault(struct pt_regs *regs, unsigned long error_code) +__do_page_fault(struct pt_regs *regs, unsigned long error_code, + unsigned long address) { struct vm_area_struct *vma; struct task_struct *tsk; - unsigned long address; struct mm_struct *mm; int fault; unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; @@ -1028,9 +1028,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) tsk = current; mm = tsk->mm; - /* Get the faulting address: */ - address = read_cr2(); - /* * Detect and handle instructions that would cause a page fault for * both a tracked kernel page and a userspace page. @@ -1248,9 +1245,11 @@ dotraplinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; + /* Get the faulting address: */ + unsigned long address = read_cr2(); prev_state = exception_enter(); - __do_page_fault(regs, error_code); + __do_page_fault(regs, error_code, address); exception_exit(prev_state); } @@ -1267,9 +1266,16 @@ dotraplinkage void __kprobes trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; + /* + * The exception_enter and tracepoint processing could + * trigger another page faults (user space callchain + * reading) and destroy the original cr2 value, so read + * the faulting address now. + */ + unsigned long address = read_cr2(); prev_state = exception_enter(); trace_page_fault_entries(regs, error_code); - __do_page_fault(regs, error_code); + __do_page_fault(regs, error_code, address); exception_exit(prev_state); } -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [PATCHv2] x86 trace: Fix page fault tracing bug 2014-02-28 16:05 ` [PATCHv2] " Jiri Olsa @ 2014-02-28 16:11 ` H. Peter Anvin 2014-02-28 16:23 ` Steven Rostedt 2014-02-28 16:15 ` Steven Rostedt 2014-03-05 0:03 ` [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults tip-bot for Jiri Olsa 2 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-28 16:11 UTC (permalink / raw) To: Jiri Olsa, Peter Zijlstra Cc: linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Seiji Aguchi, Steven Rostedt, Vince Weaver On 02/28/2014 08:05 AM, Jiri Olsa wrote: > On Fri, Feb 28, 2014 at 04:47:08PM +0100, Peter Zijlstra wrote: >> On Fri, Feb 28, 2014 at 04:33:40PM +0100, Jiri Olsa wrote: >> >> While I like the idea of just pushing up the CR2 read; the below does >> the read too late still, exception_enter() also has a tracepoint in. > > please check v2, thanks Is there any way something could come in even earlier (perhaps via __fentry__)? If so, do we need to hoist the reading of %cr2 all the way into assembly or something else? -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [PATCHv2] x86 trace: Fix page fault tracing bug 2014-02-28 16:11 ` H. Peter Anvin @ 2014-02-28 16:23 ` Steven Rostedt 0 siblings, 0 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 16:23 UTC (permalink / raw) To: H. Peter Anvin Cc: Jiri Olsa, Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Seiji Aguchi, Vince Weaver On Fri, 28 Feb 2014 08:11:59 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote: > If so, do we need to hoist the reading of %cr2 all the way into assembly > or something else? Function tracing code should not fault. Peter and Jiri were discussing on IRC to make sure that perf could not enable userspace stack tracing for function tracing. At this point, I would say, lets get rid of any users of function tracing callbacks that can fault. There callbacks are limited in context even more so than any other context (faults, interrupts, and NMIs). -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [PATCHv2] x86 trace: Fix page fault tracing bug 2014-02-28 16:05 ` [PATCHv2] " Jiri Olsa 2014-02-28 16:11 ` H. Peter Anvin @ 2014-02-28 16:15 ` Steven Rostedt 2014-03-05 0:03 ` [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults tip-bot for Jiri Olsa 2 siblings, 0 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 16:15 UTC (permalink / raw) To: Jiri Olsa Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, H. Peter Anvin, Seiji Aguchi, Vince Weaver Vince, you probably missed my other emails, as I sent from mutt, and not my normal email client. I see that mutt uses my own box to send, and I got these error messages (need to put back sending via my ISP instead of my local mail server): >>> vincent.weaver@maine.edu (after MAIL FROM): 550 5.5.4 <rostedt@home.goodmis.org>... Domain of sender address rostedt@home.goodmis.org refused. MX points to hosts without valid addresses. This violates RFC 1035/2181. -- Steve On Fri, 28 Feb 2014 17:05:26 +0100 Jiri Olsa <jolsa@redhat.com> wrote: > On Fri, Feb 28, 2014 at 04:47:08PM +0100, Peter Zijlstra wrote: > > On Fri, Feb 28, 2014 at 04:33:40PM +0100, Jiri Olsa wrote: > > > > While I like the idea of just pushing up the CR2 read; the below does > > the read too late still, exception_enter() also has a tracepoint in. > > please check v2, thanks > > jirka > > > --- > The trace_do_page_fault function trigger tracepoint > and then handles the actual page fault. > > This could lead to error if the tracepoint caused page > fault. The original cr2 value gets lost and the original > page fault handler kills current process with SIGSEGV. > > This happens if you record page faults with callchain > data, the user part of it will cause tracepoint handler > to page fault: > > # perf record -g -e exceptions:page_fault_user ls > > Fixing this by saving the original cr2 value > and using it after tracepoint handler is done. > > v2: Moving the cr2 read before exception_enter, because > it could trigger tracepoint as well. > > Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > Cc: Paul Mackerras <paulus@samba.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> > Cc: H. Peter Anvin <hpa@zytor.com> > Cc: Seiji Aguchi <seiji.aguchi@hds.com> > Cc: Vince Weaver <vincent.weaver@maine.edu> > Cc: Steven Rostedt <rostedt@goodmis.org> > --- > arch/x86/mm/fault.c | 20 +++++++++++++------- > 1 file changed, 13 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 9d591c8..dd59031 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1016,11 +1016,11 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) > * routines. > */ > static void __kprobes > -__do_page_fault(struct pt_regs *regs, unsigned long error_code) > +__do_page_fault(struct pt_regs *regs, unsigned long error_code, > + unsigned long address) > { > struct vm_area_struct *vma; > struct task_struct *tsk; > - unsigned long address; > struct mm_struct *mm; > int fault; > unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; > @@ -1028,9 +1028,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) > tsk = current; > mm = tsk->mm; > > - /* Get the faulting address: */ > - address = read_cr2(); > - > /* > * Detect and handle instructions that would cause a page fault for > * both a tracked kernel page and a userspace page. > @@ -1248,9 +1245,11 @@ dotraplinkage void __kprobes > do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + /* Get the faulting address: */ > + unsigned long address = read_cr2(); > > prev_state = exception_enter(); > - __do_page_fault(regs, error_code); > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } > > @@ -1267,9 +1266,16 @@ dotraplinkage void __kprobes > trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + /* > + * The exception_enter and tracepoint processing could > + * trigger another page faults (user space callchain > + * reading) and destroy the original cr2 value, so read > + * the faulting address now. > + */ > + unsigned long address = read_cr2(); > > prev_state = exception_enter(); > trace_page_fault_entries(regs, error_code); > - __do_page_fault(regs, error_code); > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } ^ permalink raw reply [flat|nested] 115+ messages in thread
* [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-02-28 16:05 ` [PATCHv2] " Jiri Olsa 2014-02-28 16:11 ` H. Peter Anvin 2014-02-28 16:15 ` Steven Rostedt @ 2014-03-05 0:03 ` tip-bot for Jiri Olsa 2014-03-05 11:14 ` Peter Zijlstra 2 siblings, 1 reply; 115+ messages in thread From: tip-bot for Jiri Olsa @ 2014-03-05 0:03 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, acme, seiji.aguchi, jolsa, vincent.weaver, rostedt, tglx, hpa Commit-ID: 0ac09f9f8cd1fb028a48330edba6023d347d3cea Gitweb: http://git.kernel.org/tip/0ac09f9f8cd1fb028a48330edba6023d347d3cea Author: Jiri Olsa <jolsa@redhat.com> AuthorDate: Fri, 28 Feb 2014 17:05:26 +0100 Committer: H. Peter Anvin <hpa@linux.intel.com> CommitDate: Tue, 4 Mar 2014 16:00:14 -0800 x86, trace: Fix CR2 corruption when tracing page faults The trace_do_page_fault function trigger tracepoint and then handles the actual page fault. This could lead to error if the tracepoint caused page fault. The original cr2 value gets lost and the original page fault handler kills current process with SIGSEGV. This happens if you record page faults with callchain data, the user part of it will cause tracepoint handler to page fault: # perf record -g -e exceptions:page_fault_user ls Fixing this by saving the original cr2 value and using it after tracepoint handler is done. v2: Moving the cr2 read before exception_enter, because it could trigger tracepoint as well. Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1402211701380.6395@vincent-weaver-1.um.maine.edu Link: http://lkml.kernel.org/r/20140228160526.GD1133@krava.brq.redhat.com --- arch/x86/mm/fault.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 6dea040..e7fa28b 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1022,11 +1022,11 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) * routines. */ static void __kprobes -__do_page_fault(struct pt_regs *regs, unsigned long error_code) +__do_page_fault(struct pt_regs *regs, unsigned long error_code, + unsigned long address) { struct vm_area_struct *vma; struct task_struct *tsk; - unsigned long address; struct mm_struct *mm; int fault; unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; @@ -1034,9 +1034,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) tsk = current; mm = tsk->mm; - /* Get the faulting address: */ - address = read_cr2(); - /* * Detect and handle instructions that would cause a page fault for * both a tracked kernel page and a userspace page. @@ -1252,9 +1249,11 @@ dotraplinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; + /* Get the faulting address: */ + unsigned long address = read_cr2(); prev_state = exception_enter(); - __do_page_fault(regs, error_code); + __do_page_fault(regs, error_code, address); exception_exit(prev_state); } @@ -1271,9 +1270,16 @@ dotraplinkage void __kprobes trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; + /* + * The exception_enter and tracepoint processing could + * trigger another page faults (user space callchain + * reading) and destroy the original cr2 value, so read + * the faulting address now. + */ + unsigned long address = read_cr2(); prev_state = exception_enter(); trace_page_fault_entries(regs, error_code); - __do_page_fault(regs, error_code); + __do_page_fault(regs, error_code, address); exception_exit(prev_state); } ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 0:03 ` [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults tip-bot for Jiri Olsa @ 2014-03-05 11:14 ` Peter Zijlstra 2014-03-05 12:20 ` Steven Rostedt 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-03-05 11:14 UTC (permalink / raw) To: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, rostedt, tglx, hpa Cc: linux-tip-commits On Tue, Mar 04, 2014 at 04:03:25PM -0800, tip-bot for Jiri Olsa wrote: FWIW I also prefer this patch. > @@ -1252,9 +1249,11 @@ dotraplinkage void __kprobes > do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + /* Get the faulting address: */ > + unsigned long address = read_cr2(); > > prev_state = exception_enter(); > - __do_page_fault(regs, error_code); > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } > > @@ -1271,9 +1270,16 @@ dotraplinkage void __kprobes > trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + /* > + * The exception_enter and tracepoint processing could > + * trigger another page faults (user space callchain > + * reading) and destroy the original cr2 value, so read > + * the faulting address now. > + */ > + unsigned long address = read_cr2(); > > prev_state = exception_enter(); > trace_page_fault_entries(regs, error_code); > - __do_page_fault(regs, error_code); > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } How about also marking these two functions as notrace? That would also avoid getting __mcount calls from before we read CR2. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 11:14 ` Peter Zijlstra @ 2014-03-05 12:20 ` Steven Rostedt 2014-03-05 12:25 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-03-05 12:20 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, 5 Mar 2014 12:14:15 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Mar 04, 2014 at 04:03:25PM -0800, tip-bot for Jiri Olsa wrote: > > FWIW I also prefer this patch. > > > @@ -1252,9 +1249,11 @@ dotraplinkage void __kprobes > > do_page_fault(struct pt_regs *regs, unsigned long error_code) > > { > > enum ctx_state prev_state; > > + /* Get the faulting address: */ > > + unsigned long address = read_cr2(); > > > > prev_state = exception_enter(); > > - __do_page_fault(regs, error_code); > > + __do_page_fault(regs, error_code, address); > > exception_exit(prev_state); > > } > > > > @@ -1271,9 +1270,16 @@ dotraplinkage void __kprobes > > trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > > { > > enum ctx_state prev_state; > > + /* > > + * The exception_enter and tracepoint processing could > > + * trigger another page faults (user space callchain > > + * reading) and destroy the original cr2 value, so read > > + * the faulting address now. > > + */ > > + unsigned long address = read_cr2(); > > > > prev_state = exception_enter(); > > trace_page_fault_entries(regs, error_code); > > - __do_page_fault(regs, error_code); > > + __do_page_fault(regs, error_code, address); > > exception_exit(prev_state); > > } > > How about also marking these two functions as notrace? That would also > avoid getting __mcount calls from before we read CR2. Please no! I used tracing of the do_page_fault function all the time. It is very useful. Call backs from function tracing are to be very limited. They are more restrictive than NMI contexts. One thing you shouldn't do is to have page faults. Thus, fix the callback. -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 12:20 ` Steven Rostedt @ 2014-03-05 12:25 ` Peter Zijlstra 2014-03-05 12:33 ` Steven Rostedt 2014-03-05 12:36 ` Peter Zijlstra 0 siblings, 2 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-03-05 12:25 UTC (permalink / raw) To: Steven Rostedt Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, Mar 05, 2014 at 07:20:22AM -0500, Steven Rostedt wrote: > On Wed, 5 Mar 2014 12:14:15 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > Please no! I used tracing of the do_page_fault function all the time. > It is very useful. Why do you trace do_page_fault and not the __do_page_fault() function? do_page_fault() is a minimal wrapper which doesn't actually do all that much? ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 12:25 ` Peter Zijlstra @ 2014-03-05 12:33 ` Steven Rostedt 2014-03-05 12:54 ` Peter Zijlstra 2014-03-05 12:36 ` Peter Zijlstra 1 sibling, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-03-05 12:33 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, 5 Mar 2014 13:25:35 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Mar 05, 2014 at 07:20:22AM -0500, Steven Rostedt wrote: > > On Wed, 5 Mar 2014 12:14:15 +0100 > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > Please no! I used tracing of the do_page_fault function all the time. > > It is very useful. > > Why do you trace do_page_fault and not the __do_page_fault() function? > do_page_fault() is a minimal wrapper which doesn't actually do all that > much? Then we better make sure that __do_page_fault() is never inlined. Otherwise, it wont be available to trace. I'm fine with adding "notrace" to do_page_fault() and to trace_do_page_fault() as long as we also include a "noinline" to __do_page_fault(). Would need a comment stating why that noinline is there though. -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 12:33 ` Steven Rostedt @ 2014-03-05 12:54 ` Peter Zijlstra 2014-03-05 13:02 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-03-05 12:54 UTC (permalink / raw) To: Steven Rostedt Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, Mar 05, 2014 at 07:33:44AM -0500, Steven Rostedt wrote: > Then we better make sure that __do_page_fault() is never inlined. > Otherwise, it wont be available to trace. > > I'm fine with adding "notrace" to do_page_fault() and to > trace_do_page_fault() as long as we also include a "noinline" to > __do_page_fault(). Would need a comment stating why that noinline is > there though. When CONFIG_TRACING there's two callers, which makes it highly unlikely GCC would inline the massive __do_page_fault() function, but sure. How about something like so then; still has the normal_do_page_fault() thing, although I suppose we could drop that. It also puts trace_page_fault_entries() and trace_do_page_fault() under CONFIG_TRACING. I could only find the entry_32.S user; I suppose the 64bit one is hidden by CPP goo somewhere? --- arch/x86/include/asm/traps.h | 2 +- arch/x86/kernel/entry_32.S | 2 +- arch/x86/kernel/entry_64.S | 2 +- arch/x86/kernel/kvm.c | 2 +- arch/x86/mm/fault.c | 42 +++++++++++++++++++++++++++--------------- 5 files changed, 31 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 58d66fe06b61..1280f72deea8 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -71,7 +71,7 @@ dotraplinkage void do_double_fault(struct pt_regs *, long); asmlinkage __kprobes struct pt_regs *sync_regs(struct pt_regs *); #endif dotraplinkage void do_general_protection(struct pt_regs *, long); -dotraplinkage void do_page_fault(struct pt_regs *, unsigned long); +dotraplinkage void normal_do_page_fault(struct pt_regs *, unsigned long); #ifdef CONFIG_TRACING dotraplinkage void trace_do_page_fault(struct pt_regs *, unsigned long); #endif diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index a2a4f4697889..9a9f64755da8 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -1257,7 +1257,7 @@ END(trace_page_fault) ENTRY(page_fault) RING0_EC_FRAME ASM_CLAC - pushl_cfi $do_page_fault + pushl_cfi $normal_do_page_fault ALIGN error_code: /* the function address is in %gs's slot on the stack */ diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 1e96c3628bf2..7d49812741ac 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1491,7 +1491,7 @@ zeroentry xen_int3 do_int3 errorentry xen_stack_segment do_stack_segment #endif errorentry general_protection do_general_protection -trace_errorentry page_fault do_page_fault +trace_errorentry page_fault normal_do_page_fault #ifdef CONFIG_KVM_GUEST errorentry async_page_fault do_async_page_fault #endif diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 713f1b3bad52..9e7db22ec437 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -259,7 +259,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code) switch (kvm_read_and_reset_pf_reason()) { default: - do_page_fault(regs, error_code); + normal_do_page_fault(regs, error_code); break; case KVM_PV_REASON_PAGE_NOT_PRESENT: /* page is swapped out by the host. */ diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e7fa28bf3262..8134e5ada329 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1020,10 +1020,13 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate * routines. + * + * This function must have noinline because both callers + * {normal,trace}_do_page_fault() have notrace on. Having this an actual function + * guarantees there's a function trace entry. */ -static void __kprobes -__do_page_fault(struct pt_regs *regs, unsigned long error_code, - unsigned long address) +static void __kprobes noinline +do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address) { struct vm_area_struct *vma; struct task_struct *tsk; @@ -1245,31 +1248,38 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, up_read(&mm->mmap_sem); } -dotraplinkage void __kprobes -do_page_fault(struct pt_regs *regs, unsigned long error_code) +dotraplinkage void __kprobes notrace +normal_do_page_fault(struct pt_regs *regs, unsigned long error_code) { + unsigned long address = read_cr2(); /* Get the faulting address */ enum ctx_state prev_state; - /* Get the faulting address: */ - unsigned long address = read_cr2(); + + /* + * We must have this function tagged with __kprobes, notrace and call + * read_cr2() before calling anything else. To avoid calling any kind + * of tracing machinery before we've observed the CR2 value. + * + * exception_{enter,exit}() contain all sorts of tracepoints. + */ prev_state = exception_enter(); - __do_page_fault(regs, error_code, address); + do_page_fault(regs, error_code, address); exception_exit(prev_state); } -static void trace_page_fault_entries(struct pt_regs *regs, +#ifdef CONFIG_TRACING +static void trace_page_fault_entries(unsigned long address, struct pt_regs *regs, unsigned long error_code) { if (user_mode(regs)) - trace_page_fault_user(read_cr2(), regs, error_code); + trace_page_fault_user(address, regs, error_code); else - trace_page_fault_kernel(read_cr2(), regs, error_code); + trace_page_fault_kernel(address, regs, error_code); } -dotraplinkage void __kprobes +dotraplinkage void __kprobes notrace trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { - enum ctx_state prev_state; /* * The exception_enter and tracepoint processing could * trigger another page faults (user space callchain @@ -1277,9 +1287,11 @@ trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) * the faulting address now. */ unsigned long address = read_cr2(); + enum ctx_state prev_state; prev_state = exception_enter(); - trace_page_fault_entries(regs, error_code); - __do_page_fault(regs, error_code, address); + trace_page_fault_entries(address, regs, error_code); + do_page_fault(regs, error_code, address); exception_exit(prev_state); } +#endif ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 12:54 ` Peter Zijlstra @ 2014-03-05 13:02 ` Peter Zijlstra 2014-03-05 13:07 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-03-05 13:02 UTC (permalink / raw) To: Steven Rostedt Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, Mar 05, 2014 at 01:54:21PM +0100, Peter Zijlstra wrote: > On Wed, Mar 05, 2014 at 07:33:44AM -0500, Steven Rostedt wrote: > > Then we better make sure that __do_page_fault() is never inlined. > > Otherwise, it wont be available to trace. > > > > I'm fine with adding "notrace" to do_page_fault() and to > > trace_do_page_fault() as long as we also include a "noinline" to > > __do_page_fault(). Would need a comment stating why that noinline is > > there though. > > When CONFIG_TRACING there's two callers, which makes it highly unlikely > GCC would inline the massive __do_page_fault() function, but sure. > > How about something like so then; still has the normal_do_page_fault() > thing, although I suppose we could drop that. > > It also puts trace_page_fault_entries() and trace_do_page_fault() under > CONFIG_TRACING. I could only find the entry_32.S user; I suppose the > 64bit one is hidden by CPP goo somewhere? > +trace_errorentry page_fault normal_do_page_fault ah found it; its in there. That also means the normal_do_page_fault() thing won't actually compile proper. Lemme do one with that. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 13:02 ` Peter Zijlstra @ 2014-03-05 13:07 ` Peter Zijlstra 0 siblings, 0 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-03-05 13:07 UTC (permalink / raw) To: Steven Rostedt Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, Mar 05, 2014 at 02:02:24PM +0100, Peter Zijlstra wrote: > > It also puts trace_page_fault_entries() and trace_do_page_fault() under > > CONFIG_TRACING. I could only find the entry_32.S user; I suppose the > > 64bit one is hidden by CPP goo somewhere? > > > +trace_errorentry page_fault normal_do_page_fault > > ah found it; its in there. That also means the normal_do_page_fault() > thing won't actually compile proper. > > Lemme do one with that. --- arch/x86/mm/fault.c | 33 +++++++++++++++++++++++---------- 1 file changed, 23 insertions(+), 10 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e7fa28bf3262..a10c8c792161 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1020,8 +1020,12 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate * routines. + * + * This function must have noinline because both callers + * {,trace_}do_page_fault() have notrace on. Having this an actual function + * guarantees there's a function trace entry. */ -static void __kprobes +static void __kprobes noinline __do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address) { @@ -1245,31 +1249,38 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, up_read(&mm->mmap_sem); } -dotraplinkage void __kprobes +dotraplinkage void __kprobes notrace do_page_fault(struct pt_regs *regs, unsigned long error_code) { + unsigned long address = read_cr2(); /* Get the faulting address */ enum ctx_state prev_state; - /* Get the faulting address: */ - unsigned long address = read_cr2(); + + /* + * We must have this function tagged with __kprobes, notrace and call + * read_cr2() before calling anything else. To avoid calling any kind + * of tracing machinery before we've observed the CR2 value. + * + * exception_{enter,exit}() contain all sorts of tracepoints. + */ prev_state = exception_enter(); __do_page_fault(regs, error_code, address); exception_exit(prev_state); } -static void trace_page_fault_entries(struct pt_regs *regs, +#ifdef CONFIG_TRACING +static void trace_page_fault_entries(unsigned long address, struct pt_regs *regs, unsigned long error_code) { if (user_mode(regs)) - trace_page_fault_user(read_cr2(), regs, error_code); + trace_page_fault_user(address, regs, error_code); else - trace_page_fault_kernel(read_cr2(), regs, error_code); + trace_page_fault_kernel(address, regs, error_code); } -dotraplinkage void __kprobes +dotraplinkage void __kprobes notrace trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { - enum ctx_state prev_state; /* * The exception_enter and tracepoint processing could * trigger another page faults (user space callchain @@ -1277,9 +1288,11 @@ trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) * the faulting address now. */ unsigned long address = read_cr2(); + enum ctx_state prev_state; prev_state = exception_enter(); - trace_page_fault_entries(regs, error_code); + trace_page_fault_entries(address, regs, error_code); __do_page_fault(regs, error_code, address); exception_exit(prev_state); } +#endif /* CONFIG_TRACING */ ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 12:25 ` Peter Zijlstra 2014-03-05 12:33 ` Steven Rostedt @ 2014-03-05 12:36 ` Peter Zijlstra 2014-03-05 13:00 ` Steven Rostedt 1 sibling, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-03-05 12:36 UTC (permalink / raw) To: Steven Rostedt Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, Mar 05, 2014 at 01:25:35PM +0100, Peter Zijlstra wrote: > On Wed, Mar 05, 2014 at 07:20:22AM -0500, Steven Rostedt wrote: > > On Wed, 5 Mar 2014 12:14:15 +0100 > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > Please no! I used tracing of the do_page_fault function all the time. > > It is very useful. > > Why do you trace do_page_fault and not the __do_page_fault() function? > do_page_fault() is a minimal wrapper which doesn't actually do all that > much? Here, something like so; then you can still trace your do_page_fault(). It also fixes up trace_page_fault_entries() to not re-read the cr2. --- arch/x86/include/asm/traps.h | 2 +- arch/x86/kernel/entry_32.S | 2 +- arch/x86/kernel/entry_64.S | 2 +- arch/x86/kernel/kvm.c | 2 +- arch/x86/mm/fault.c | 21 ++++++++++----------- 5 files changed, 14 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 58d66fe06b61..1280f72deea8 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -71,7 +71,7 @@ dotraplinkage void do_double_fault(struct pt_regs *, long); asmlinkage __kprobes struct pt_regs *sync_regs(struct pt_regs *); #endif dotraplinkage void do_general_protection(struct pt_regs *, long); -dotraplinkage void do_page_fault(struct pt_regs *, unsigned long); +dotraplinkage void normal_do_page_fault(struct pt_regs *, unsigned long); #ifdef CONFIG_TRACING dotraplinkage void trace_do_page_fault(struct pt_regs *, unsigned long); #endif diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index a2a4f4697889..9a9f64755da8 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -1257,7 +1257,7 @@ END(trace_page_fault) ENTRY(page_fault) RING0_EC_FRAME ASM_CLAC - pushl_cfi $do_page_fault + pushl_cfi $normal_do_page_fault ALIGN error_code: /* the function address is in %gs's slot on the stack */ diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 1e96c3628bf2..7d49812741ac 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1491,7 +1491,7 @@ zeroentry xen_int3 do_int3 errorentry xen_stack_segment do_stack_segment #endif errorentry general_protection do_general_protection -trace_errorentry page_fault do_page_fault +trace_errorentry page_fault normal_do_page_fault #ifdef CONFIG_KVM_GUEST errorentry async_page_fault do_async_page_fault #endif diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 713f1b3bad52..9e7db22ec437 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -259,7 +259,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code) switch (kvm_read_and_reset_pf_reason()) { default: - do_page_fault(regs, error_code); + normal_do_page_fault(regs, error_code); break; case KVM_PV_REASON_PAGE_NOT_PRESENT: /* page is swapped out by the host. */ diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e7fa28bf3262..576cfc3d0086 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1022,8 +1022,7 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) * routines. */ static void __kprobes -__do_page_fault(struct pt_regs *regs, unsigned long error_code, - unsigned long address) +do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address) { struct vm_area_struct *vma; struct task_struct *tsk; @@ -1245,28 +1244,28 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, up_read(&mm->mmap_sem); } -dotraplinkage void __kprobes -do_page_fault(struct pt_regs *regs, unsigned long error_code) +dotraplinkage void __kprobes notrace +normal_do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; /* Get the faulting address: */ unsigned long address = read_cr2(); prev_state = exception_enter(); - __do_page_fault(regs, error_code, address); + do_page_fault(regs, error_code, address); exception_exit(prev_state); } -static void trace_page_fault_entries(struct pt_regs *regs, +static void trace_page_fault_entries(unsigned long address, struct pt_regs *regs, unsigned long error_code) { if (user_mode(regs)) - trace_page_fault_user(read_cr2(), regs, error_code); + trace_page_fault_user(address, regs, error_code); else - trace_page_fault_kernel(read_cr2(), regs, error_code); + trace_page_fault_kernel(address, regs, error_code); } -dotraplinkage void __kprobes +dotraplinkage void __kprobes notrace trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; @@ -1279,7 +1278,7 @@ trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) unsigned long address = read_cr2(); prev_state = exception_enter(); - trace_page_fault_entries(regs, error_code); - __do_page_fault(regs, error_code, address); + trace_page_fault_entries(address, regs, error_code); + do_page_fault(regs, error_code, address); exception_exit(prev_state); } ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 12:36 ` Peter Zijlstra @ 2014-03-05 13:00 ` Steven Rostedt 2014-03-05 13:08 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-03-05 13:00 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, 5 Mar 2014 13:36:35 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Mar 05, 2014 at 01:25:35PM +0100, Peter Zijlstra wrote: > > On Wed, Mar 05, 2014 at 07:20:22AM -0500, Steven Rostedt wrote: > > > On Wed, 5 Mar 2014 12:14:15 +0100 > > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > Please no! I used tracing of the do_page_fault function all the time. > > > It is very useful. > > > > Why do you trace do_page_fault and not the __do_page_fault() function? > > do_page_fault() is a minimal wrapper which doesn't actually do all that > > much? > > Here, something like so; then you can still trace your do_page_fault(). > > It also fixes up trace_page_fault_entries() to not re-read the cr2. > > --- > arch/x86/include/asm/traps.h | 2 +- > arch/x86/kernel/entry_32.S | 2 +- > arch/x86/kernel/entry_64.S | 2 +- > arch/x86/kernel/kvm.c | 2 +- > arch/x86/mm/fault.c | 21 ++++++++++----------- > 5 files changed, 14 insertions(+), 15 deletions(-) > > diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h > index 58d66fe06b61..1280f72deea8 100644 > --- a/arch/x86/include/asm/traps.h > +++ b/arch/x86/include/asm/traps.h > @@ -71,7 +71,7 @@ dotraplinkage void do_double_fault(struct pt_regs *, long); > asmlinkage __kprobes struct pt_regs *sync_regs(struct pt_regs *); > #endif > dotraplinkage void do_general_protection(struct pt_regs *, long); > -dotraplinkage void do_page_fault(struct pt_regs *, unsigned long); > +dotraplinkage void normal_do_page_fault(struct pt_regs *, unsigned long); Why not call it "do_do_page_fault()" ;-) > #ifdef CONFIG_TRACING > dotraplinkage void trace_do_page_fault(struct pt_regs *, unsigned long); > #endif > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S > index a2a4f4697889..9a9f64755da8 100644 > --- a/arch/x86/kernel/entry_32.S > +++ b/arch/x86/kernel/entry_32.S > @@ -1257,7 +1257,7 @@ END(trace_page_fault) > ENTRY(page_fault) > RING0_EC_FRAME > ASM_CLAC > - pushl_cfi $do_page_fault > + pushl_cfi $normal_do_page_fault > ALIGN > error_code: > /* the function address is in %gs's slot on the stack */ > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S > index 1e96c3628bf2..7d49812741ac 100644 > --- a/arch/x86/kernel/entry_64.S > +++ b/arch/x86/kernel/entry_64.S > @@ -1491,7 +1491,7 @@ zeroentry xen_int3 do_int3 > errorentry xen_stack_segment do_stack_segment > #endif > errorentry general_protection do_general_protection > -trace_errorentry page_fault do_page_fault > +trace_errorentry page_fault normal_do_page_fault > #ifdef CONFIG_KVM_GUEST > errorentry async_page_fault do_async_page_fault > #endif > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > index 713f1b3bad52..9e7db22ec437 100644 > --- a/arch/x86/kernel/kvm.c > +++ b/arch/x86/kernel/kvm.c > @@ -259,7 +259,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code) > > switch (kvm_read_and_reset_pf_reason()) { > default: > - do_page_fault(regs, error_code); > + normal_do_page_fault(regs, error_code); > break; > case KVM_PV_REASON_PAGE_NOT_PRESENT: > /* page is swapped out by the host. */ > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index e7fa28bf3262..576cfc3d0086 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1022,8 +1022,7 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) > * routines. > */ > static void __kprobes Still requires the "noinline" above -- Steve > -__do_page_fault(struct pt_regs *regs, unsigned long error_code, > - unsigned long address) > +do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address) > { > struct vm_area_struct *vma; > struct task_struct *tsk; > @@ -1245,28 +1244,28 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, > up_read(&mm->mmap_sem); > } > > -dotraplinkage void __kprobes > -do_page_fault(struct pt_regs *regs, unsigned long error_code) > +dotraplinkage void __kprobes notrace > +normal_do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > /* Get the faulting address: */ > unsigned long address = read_cr2(); > > prev_state = exception_enter(); > - __do_page_fault(regs, error_code, address); > + do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } > > -static void trace_page_fault_entries(struct pt_regs *regs, > +static void trace_page_fault_entries(unsigned long address, struct pt_regs *regs, > unsigned long error_code) > { > if (user_mode(regs)) > - trace_page_fault_user(read_cr2(), regs, error_code); > + trace_page_fault_user(address, regs, error_code); > else > - trace_page_fault_kernel(read_cr2(), regs, error_code); > + trace_page_fault_kernel(address, regs, error_code); > } > > -dotraplinkage void __kprobes > +dotraplinkage void __kprobes notrace > trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > @@ -1279,7 +1278,7 @@ trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > unsigned long address = read_cr2(); > > prev_state = exception_enter(); > - trace_page_fault_entries(regs, error_code); > - __do_page_fault(regs, error_code, address); > + trace_page_fault_entries(address, regs, error_code); > + do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 13:00 ` Steven Rostedt @ 2014-03-05 13:08 ` Peter Zijlstra 2014-03-05 21:37 ` H. Peter Anvin 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-03-05 13:08 UTC (permalink / raw) To: Steven Rostedt Cc: mingo, hpa, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, Mar 05, 2014 at 08:00:18AM -0500, Steven Rostedt wrote: > > static void __kprobes > > Still requires the "noinline" above Trim trim trim trim your emails... also, you're like 3 version behind. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 13:08 ` Peter Zijlstra @ 2014-03-05 21:37 ` H. Peter Anvin 2014-03-06 8:40 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-03-05 21:37 UTC (permalink / raw) To: Peter Zijlstra, Steven Rostedt Cc: mingo, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On 03/05/2014 05:08 AM, Peter Zijlstra wrote: > On Wed, Mar 05, 2014 at 08:00:18AM -0500, Steven Rostedt wrote: >>> static void __kprobes >> >> Still requires the "noinline" above > > Trim trim trim trim your emails... > > also, you're like 3 version behind. > OK, I have to admit to having lost track of this thread. Please let me know when there is anything actionable for me. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-05 21:37 ` H. Peter Anvin @ 2014-03-06 8:40 ` Peter Zijlstra 2014-03-06 11:02 ` Steven Rostedt 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-03-06 8:40 UTC (permalink / raw) To: H. Peter Anvin Cc: Steven Rostedt, mingo, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Wed, Mar 05, 2014 at 01:37:25PM -0800, H. Peter Anvin wrote: > OK, I have to admit to having lost track of this thread. Please let me > know when there is anything actionable for me. lkml.kernel.org/r/20140305130749.GR3104@twins.programming.kicks-ass.net Was the latest; Rostedt are you good with that one? If so I suppose I should make it a proper patch and get hpa to look at it :-) ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults 2014-03-06 8:40 ` Peter Zijlstra @ 2014-03-06 11:02 ` Steven Rostedt 2014-03-06 14:53 ` [PATCH] x86: Further robustify CR2 handling vs tracing Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-03-06 11:02 UTC (permalink / raw) To: Peter Zijlstra Cc: H. Peter Anvin, mingo, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits On Thu, 6 Mar 2014 09:40:50 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Mar 05, 2014 at 01:37:25PM -0800, H. Peter Anvin wrote: > > > OK, I have to admit to having lost track of this thread. Please let me > > know when there is anything actionable for me. > > lkml.kernel.org/r/20140305130749.GR3104@twins.programming.kicks-ass.net > > Was the latest; Rostedt are you good with that one? If so I suppose I > should make it a proper patch and get hpa to look at it :-) Yeah, looks good. Acked-by: Steven Rostedt <rostedt@goodmis.org> -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* [PATCH] x86: Further robustify CR2 handling vs tracing 2014-03-06 11:02 ` Steven Rostedt @ 2014-03-06 14:53 ` Peter Zijlstra 2014-03-07 23:07 ` [tip:x86/urgent] x86, trace: " tip-bot for Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-03-06 14:53 UTC (permalink / raw) To: Steven Rostedt Cc: H. Peter Anvin, mingo, paulus, linux-kernel, acme, seiji.aguchi, jolsa, vincent.weaver, tglx, hpa, linux-tip-commits Subject: x86: Further robustify CR2 handling vs tracing From: Peter Zijlstra <peterz@infradead.org> Date: Wed, 5 Mar 2014 14:07:49 +0100 Building on commit 0ac09f9f8cd1 ("x86, trace: Fix CR2 corruption when tracing page faults") this patch addresses another few issues: - Now that read_cr2() is lifted into trace_do_page_fault(), we should pass the address to trace_page_fault_entries() to avoid it re-reading a potentially changed cr2. - Put both trace_do_page_fault() and trace_page_fault_entries() under CONFIG_TRACING. - Mark both fault entry functions {,trace_}do_page_fault() as notrace to avoid getting __mcount or other function entry trace callbacks before we've observed CR2. - Mark __do_page_fault() as noinline to guarantee the function tracer does get to see the fault. Cc: jolsa@redhat.com Cc: vincent.weaver@maine.edu Cc: tglx@linutronix.de Cc: hpa@linux.intel.com Cc: mingo@kernel.org Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> --- arch/x86/mm/fault.c | 33 +++++++++++++++++++++++---------- 1 file changed, 23 insertions(+), 10 deletions(-) --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1020,8 +1020,12 @@ static inline bool smap_violation(int er * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate * routines. + * + * This function must have noinline because both callers + * {,trace_}do_page_fault() have notrace on. Having this an actual function + * guarantees there's a function trace entry. */ -static void __kprobes +static void __kprobes noinline __do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address) { @@ -1245,31 +1249,38 @@ __do_page_fault(struct pt_regs *regs, un up_read(&mm->mmap_sem); } -dotraplinkage void __kprobes +dotraplinkage void __kprobes notrace do_page_fault(struct pt_regs *regs, unsigned long error_code) { + unsigned long address = read_cr2(); /* Get the faulting address */ enum ctx_state prev_state; - /* Get the faulting address: */ - unsigned long address = read_cr2(); + + /* + * We must have this function tagged with __kprobes, notrace and call + * read_cr2() before calling anything else. To avoid calling any kind + * of tracing machinery before we've observed the CR2 value. + * + * exception_{enter,exit}() contain all sorts of tracepoints. + */ prev_state = exception_enter(); __do_page_fault(regs, error_code, address); exception_exit(prev_state); } -static void trace_page_fault_entries(struct pt_regs *regs, +#ifdef CONFIG_TRACING +static void trace_page_fault_entries(unsigned long address, struct pt_regs *regs, unsigned long error_code) { if (user_mode(regs)) - trace_page_fault_user(read_cr2(), regs, error_code); + trace_page_fault_user(address, regs, error_code); else - trace_page_fault_kernel(read_cr2(), regs, error_code); + trace_page_fault_kernel(address, regs, error_code); } -dotraplinkage void __kprobes +dotraplinkage void __kprobes notrace trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { - enum ctx_state prev_state; /* * The exception_enter and tracepoint processing could * trigger another page faults (user space callchain @@ -1277,9 +1288,11 @@ trace_do_page_fault(struct pt_regs *regs * the faulting address now. */ unsigned long address = read_cr2(); + enum ctx_state prev_state; prev_state = exception_enter(); - trace_page_fault_entries(regs, error_code); + trace_page_fault_entries(address, regs, error_code); __do_page_fault(regs, error_code, address); exception_exit(prev_state); } +#endif /* CONFIG_TRACING */ ^ permalink raw reply [flat|nested] 115+ messages in thread
* [tip:x86/urgent] x86, trace: Further robustify CR2 handling vs tracing 2014-03-06 14:53 ` [PATCH] x86: Further robustify CR2 handling vs tracing Peter Zijlstra @ 2014-03-07 23:07 ` tip-bot for Peter Zijlstra 0 siblings, 0 replies; 115+ messages in thread From: tip-bot for Peter Zijlstra @ 2014-03-07 23:07 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, peterz, jolsa, vincent.weaver, rostedt, tglx, hpa Commit-ID: d4078e232267ff53f3b030b9698a3c001db4dbec Gitweb: http://git.kernel.org/tip/d4078e232267ff53f3b030b9698a3c001db4dbec Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Wed, 5 Mar 2014 14:07:49 +0100 Committer: H. Peter Anvin <hpa@linux.intel.com> CommitDate: Thu, 6 Mar 2014 10:58:18 -0800 x86, trace: Further robustify CR2 handling vs tracing Building on commit 0ac09f9f8cd1 ("x86, trace: Fix CR2 corruption when tracing page faults") this patch addresses another few issues: - Now that read_cr2() is lifted into trace_do_page_fault(), we should pass the address to trace_page_fault_entries() to avoid it re-reading a potentially changed cr2. - Put both trace_do_page_fault() and trace_page_fault_entries() under CONFIG_TRACING. - Mark both fault entry functions {,trace_}do_page_fault() as notrace to avoid getting __mcount or other function entry trace callbacks before we've observed CR2. - Mark __do_page_fault() as noinline to guarantee the function tracer does get to see the fault. Cc: <jolsa@redhat.com> Cc: <vincent.weaver@maine.edu> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140306145300.GO9987@twins.programming.kicks-ass.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> --- arch/x86/mm/fault.c | 33 +++++++++++++++++++++++---------- 1 file changed, 23 insertions(+), 10 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e7fa28b..a10c8c7 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1020,8 +1020,12 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate * routines. + * + * This function must have noinline because both callers + * {,trace_}do_page_fault() have notrace on. Having this an actual function + * guarantees there's a function trace entry. */ -static void __kprobes +static void __kprobes noinline __do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address) { @@ -1245,31 +1249,38 @@ good_area: up_read(&mm->mmap_sem); } -dotraplinkage void __kprobes +dotraplinkage void __kprobes notrace do_page_fault(struct pt_regs *regs, unsigned long error_code) { + unsigned long address = read_cr2(); /* Get the faulting address */ enum ctx_state prev_state; - /* Get the faulting address: */ - unsigned long address = read_cr2(); + + /* + * We must have this function tagged with __kprobes, notrace and call + * read_cr2() before calling anything else. To avoid calling any kind + * of tracing machinery before we've observed the CR2 value. + * + * exception_{enter,exit}() contain all sorts of tracepoints. + */ prev_state = exception_enter(); __do_page_fault(regs, error_code, address); exception_exit(prev_state); } -static void trace_page_fault_entries(struct pt_regs *regs, +#ifdef CONFIG_TRACING +static void trace_page_fault_entries(unsigned long address, struct pt_regs *regs, unsigned long error_code) { if (user_mode(regs)) - trace_page_fault_user(read_cr2(), regs, error_code); + trace_page_fault_user(address, regs, error_code); else - trace_page_fault_kernel(read_cr2(), regs, error_code); + trace_page_fault_kernel(address, regs, error_code); } -dotraplinkage void __kprobes +dotraplinkage void __kprobes notrace trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { - enum ctx_state prev_state; /* * The exception_enter and tracepoint processing could * trigger another page faults (user space callchain @@ -1277,9 +1288,11 @@ trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) * the faulting address now. */ unsigned long address = read_cr2(); + enum ctx_state prev_state; prev_state = exception_enter(); - trace_page_fault_entries(regs, error_code); + trace_page_fault_entries(address, regs, error_code); __do_page_fault(regs, error_code, address); exception_exit(prev_state); } +#endif /* CONFIG_TRACING */ ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [PATCH] x86 trace: Fix page fault tracing bug 2014-02-28 15:33 [PATCH] x86 trace: Fix page fault tracing bug Jiri Olsa 2014-02-28 15:47 ` Peter Zijlstra @ 2014-02-28 15:47 ` Jiri Olsa 2014-02-28 16:00 ` Steven Rostedt 2014-02-28 16:01 ` Steven Rostedt 2 siblings, 1 reply; 115+ messages in thread From: Jiri Olsa @ 2014-02-28 15:47 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, H. Peter Anvin, Seiji Aguchi On Fri, Feb 28, 2014 at 04:33:40PM +0100, Jiri Olsa wrote: > The trace_do_page_fault function trigger tracepoint > and then handles the actual page fault. > > This could lead to error if the tracepoint caused page > fault. The original cr2 value gets lost and the original > page fault handler kills current process with SIGSEGV. > > This happens if you record page faults with callchain > data, the user part of it will cause tracepoint handler > to page fault: > > # perf record -g -e exceptions:page_fault_user ls > > Fixing this by saving the original cr2 value > and using it after tracepoint handler is done. > sry, forgot.. Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> jirka > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > Cc: Paul Mackerras <paulus@samba.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> > Cc: H. Peter Anvin <hpa@zytor.com> > Cc: Seiji Aguchi <seiji.aguchi@hds.com> > --- > arch/x86/mm/fault.c | 25 ++++++++++++++++++------- > 1 file changed, 18 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 9d591c8..52fad6c 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1016,11 +1016,11 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) > * routines. > */ > static void __kprobes > -__do_page_fault(struct pt_regs *regs, unsigned long error_code) > +__do_page_fault(struct pt_regs *regs, unsigned long error_code, > + unsigned long address) > { > struct vm_area_struct *vma; > struct task_struct *tsk; > - unsigned long address; > struct mm_struct *mm; > int fault; > unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; > @@ -1028,9 +1028,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) > tsk = current; > mm = tsk->mm; > > - /* Get the faulting address: */ > - address = read_cr2(); > - > /* > * Detect and handle instructions that would cause a page fault for > * both a tracked kernel page and a userspace page. > @@ -1248,9 +1245,14 @@ dotraplinkage void __kprobes > do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + unsigned long address; > > prev_state = exception_enter(); > - __do_page_fault(regs, error_code); > + > + /* Get the faulting address: */ > + address = read_cr2(); > + > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } > > @@ -1267,9 +1269,18 @@ dotraplinkage void __kprobes > trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + unsigned long address; > > prev_state = exception_enter(); > + > + /* > + * The tracepoint processing could trigger another page > + * fault (user space callchain reading) and destroy the > + * original cr2 value, so read the faulting address now. > + */ > + address = read_cr2(); > + > trace_page_fault_entries(regs, error_code); > - __do_page_fault(regs, error_code); > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } > -- > 1.7.11.7 > ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [PATCH] x86 trace: Fix page fault tracing bug 2014-02-28 15:47 ` [PATCH] x86 trace: Fix page fault tracing bug Jiri Olsa @ 2014-02-28 16:00 ` Steven Rostedt 0 siblings, 0 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 16:00 UTC (permalink / raw) To: Jiri Olsa Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, H. Peter Anvin, Seiji Aguchi, Vince Weaver Vince, can you test this patch instead. Seems that the bug you found was found by others. You can remove all patches again, and modify this patch such that the read of cr2 is before the exception_enter() call (in both locations) On Fri, Feb 28, 2014 at 04:47:15PM +0100, Jiri Olsa wrote: > On Fri, Feb 28, 2014 at 04:33:40PM +0100, Jiri Olsa wrote: > > The trace_do_page_fault function trigger tracepoint > > and then handles the actual page fault. > > > > This could lead to error if the tracepoint caused page > > fault. The original cr2 value gets lost and the original > > page fault handler kills current process with SIGSEGV. > > > > This happens if you record page faults with callchain > > data, the user part of it will cause tracepoint handler > > to page fault: > > > > # perf record -g -e exceptions:page_fault_user ls > > > > Fixing this by saving the original cr2 value > > and using it after tracepoint handler is done. > > > > sry, forgot.. > > Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Please also add: Reported-by: Vince Weaver <vincent.weaver@maine.edu> And after verificiation, add a Tested-by for him too. He's been doing a lot to help us solve this bug (from a different angle). You may also add: Link: http://lkml.kernel.org/r/20140224172536.GD9987@twins.programming.kicks-ass.net -- Steve > > jirka > > > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > > Cc: Paul Mackerras <paulus@samba.org> > > Cc: Ingo Molnar <mingo@redhat.com> > > Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> > > Cc: H. Peter Anvin <hpa@zytor.com> > > Cc: Seiji Aguchi <seiji.aguchi@hds.com> > > --- > > arch/x86/mm/fault.c | 25 ++++++++++++++++++------- > > 1 file changed, 18 insertions(+), 7 deletions(-) > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index 9d591c8..52fad6c 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -1016,11 +1016,11 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) > > * routines. > > */ > > static void __kprobes > > -__do_page_fault(struct pt_regs *regs, unsigned long error_code) > > +__do_page_fault(struct pt_regs *regs, unsigned long error_code, > > + unsigned long address) > > { > > struct vm_area_struct *vma; > > struct task_struct *tsk; > > - unsigned long address; > > struct mm_struct *mm; > > int fault; > > unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; > > @@ -1028,9 +1028,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) > > tsk = current; > > mm = tsk->mm; > > > > - /* Get the faulting address: */ > > - address = read_cr2(); > > - > > /* > > * Detect and handle instructions that would cause a page fault for > > * both a tracked kernel page and a userspace page. > > @@ -1248,9 +1245,14 @@ dotraplinkage void __kprobes > > do_page_fault(struct pt_regs *regs, unsigned long error_code) > > { > > enum ctx_state prev_state; > > + unsigned long address; > > > > prev_state = exception_enter(); > > - __do_page_fault(regs, error_code); > > + > > + /* Get the faulting address: */ > > + address = read_cr2(); > > + > > + __do_page_fault(regs, error_code, address); > > exception_exit(prev_state); > > } > > > > @@ -1267,9 +1269,18 @@ dotraplinkage void __kprobes > > trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > > { > > enum ctx_state prev_state; > > + unsigned long address; > > > > prev_state = exception_enter(); > > + > > + /* > > + * The tracepoint processing could trigger another page > > + * fault (user space callchain reading) and destroy the > > + * original cr2 value, so read the faulting address now. > > + */ > > + address = read_cr2(); > > + > > trace_page_fault_entries(regs, error_code); > > - __do_page_fault(regs, error_code); > > + __do_page_fault(regs, error_code, address); > > exception_exit(prev_state); > > } > > -- > > 1.7.11.7 > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [PATCH] x86 trace: Fix page fault tracing bug 2014-02-28 15:33 [PATCH] x86 trace: Fix page fault tracing bug Jiri Olsa 2014-02-28 15:47 ` Peter Zijlstra 2014-02-28 15:47 ` [PATCH] x86 trace: Fix page fault tracing bug Jiri Olsa @ 2014-02-28 16:01 ` Steven Rostedt 2 siblings, 0 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 16:01 UTC (permalink / raw) To: Jiri Olsa Cc: linux-kernel, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, H. Peter Anvin, Seiji Aguchi, Vince Weaver On Fri, Feb 28, 2014 at 04:33:40PM +0100, Jiri Olsa wrote: > The trace_do_page_fault function trigger tracepoint > and then handles the actual page fault. > > This could lead to error if the tracepoint caused page > fault. The original cr2 value gets lost and the original > page fault handler kills current process with SIGSEGV. > > This happens if you record page faults with callchain > data, the user part of it will cause tracepoint handler > to page fault: > > # perf record -g -e exceptions:page_fault_user ls > > Fixing this by saving the original cr2 value > and using it after tracepoint handler is done. > > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> > Cc: Paul Mackerras <paulus@samba.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> > Cc: H. Peter Anvin <hpa@zytor.com> > Cc: Seiji Aguchi <seiji.aguchi@hds.com> > --- > arch/x86/mm/fault.c | 25 ++++++++++++++++++------- > 1 file changed, 18 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 9d591c8..52fad6c 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1016,11 +1016,11 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs) > * routines. > */ > static void __kprobes > -__do_page_fault(struct pt_regs *regs, unsigned long error_code) > +__do_page_fault(struct pt_regs *regs, unsigned long error_code, > + unsigned long address) > { > struct vm_area_struct *vma; > struct task_struct *tsk; > - unsigned long address; > struct mm_struct *mm; > int fault; > unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; > @@ -1028,9 +1028,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code) > tsk = current; > mm = tsk->mm; > > - /* Get the faulting address: */ > - address = read_cr2(); > - > /* > * Detect and handle instructions that would cause a page fault for > * both a tracked kernel page and a userspace page. > @@ -1248,9 +1245,14 @@ dotraplinkage void __kprobes > do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + unsigned long address; > > prev_state = exception_enter(); > - __do_page_fault(regs, error_code); > + > + /* Get the faulting address: */ > + address = read_cr2(); > + As Peter already stated, the address = read_cr2() needs to go before the exception_enter() call. > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } > > @@ -1267,9 +1269,18 @@ dotraplinkage void __kprobes > trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > enum ctx_state prev_state; > + unsigned long address; > > prev_state = exception_enter(); > + > + /* > + * The tracepoint processing could trigger another page > + * fault (user space callchain reading) and destroy the > + * original cr2 value, so read the faulting address now. > + */ > + address = read_cr2(); Same here. After that, you can add: Acked-by: Steven Rostedt <rostedt@goodmis.org> -- Steve > + > trace_page_fault_entries(regs, error_code); > - __do_page_fault(regs, error_code); > + __do_page_fault(regs, error_code, address); > exception_exit(prev_state); > } > -- > 1.7.11.7 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 115+ messages in thread
* perf_fuzzer causes reboot @ 2014-02-21 20:25 Vince Weaver 2014-02-21 22:13 ` perf_fuzzer compiled for x32 " Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-21 20:25 UTC (permalink / raw) To: Linux Kernel; +Cc: Peter Zijlstra, Ingo Molnar So I'm not sure who exactly to report this to. Some perf people CC'd as I trigger it while using the perf_fuzzer. This is with 3.14-rc3 on a core2 machine, although I've had the reboots happen throughout at least 3.14-rc* I'm having a hard time coming up with a reproducible test case. Using the random seed that caused the below will cause the perf_fuzzer to segfault but not reboot. The log isn't very helpful, it reboots so fast that the oops doesn't finish printing and the serial log just moves to the bootloader... [ 4466.804123] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 [ 4466.808014] IP: [<ffffffff81111783>] cache_reap+0x5e/0x1c5 [ 4466.808014] PGD 0 [ 4466.808014] Oops: 0000 [#1] GNU GRUB version 2.00-17 Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-21 20:25 perf_fuzzer causes reboot Vince Weaver @ 2014-02-21 22:13 ` Vince Weaver 2014-02-21 22:34 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-21 22:13 UTC (permalink / raw) To: Vince Weaver Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H. Peter Anvin, H.J. Lu cc'ing x32 people On Fri, 21 Feb 2014, Vince Weaver wrote: > So I'm not sure who exactly to report this to. Some perf people CC'd as > I trigger it while using the perf_fuzzer. > > This is with 3.14-rc3 on a core2 machine, although I've had the reboots > happen throughout at least 3.14-rc* > > I'm having a hard time coming up with a reproducible test case. Using the > random seed that caused the below will cause the perf_fuzzer to segfault > but not reboot. > > The log isn't very helpful, it reboots so fast that the oops doesn't > finish printing and the serial log just moves to the bootloader... > > [ 4466.804123] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 > [ 4466.808014] IP: [<ffffffff81111783>] cache_reap+0x5e/0x1c5 > [ 4466.808014] PGD 0 > [ 4466.808014] Oops: 0000 [#1] GNU GRUB version 2.00-17 Maybe related, this is on an x32-compiled binary. When trying to reproduce the perf_fuzzer myseriously segfaults on what appears to be perfectly valid mmap'd perf ring-buffers. (running under gdb) Program received signal SIGSEGV, Segmentation fault. 0x0041efbb in __memset_sse2 () => 0x0041efbb <+2203>: movdqa %xmm0,(%rdi) rdi 0xf7f61000 4160098304 f7f61000-f7f72000 rw-s 00000000 00:08 4475 anon_inode:[perf_event] So I'm not sure if somehow something is wrong with the page mapping, that makes a valid write fail and sometimes (possibly due to address space randomization) reboot the system? Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-21 22:13 ` perf_fuzzer compiled for x32 " Vince Weaver @ 2014-02-21 22:34 ` Vince Weaver 2014-02-22 4:50 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-21 22:34 UTC (permalink / raw) To: Vince Weaver Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H. Peter Anvin, H.J. Lu On Fri, 21 Feb 2014, Vince Weaver wrote: > On Fri, 21 Feb 2014, Vince Weaver wrote: > > > So I'm not sure who exactly to report this to. Some perf people CC'd as > > I trigger it while using the perf_fuzzer. > > > > This is with 3.14-rc3 on a core2 machine, although I've had the reboots > > happen throughout at least 3.14-rc* > > > > I'm having a hard time coming up with a reproducible test case. Using the > > random seed that caused the below will cause the perf_fuzzer to segfault > > but not reboot. > > > > The log isn't very helpful, it reboots so fast that the oops doesn't > > finish printing and the serial log just moves to the bootloader... > > > > [ 4466.804123] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 > > [ 4466.808014] IP: [<ffffffff81111783>] cache_reap+0x5e/0x1c5 > > [ 4466.808014] PGD 0 > > [ 4466.808014] Oops: 0000 [#1] GNU GRUB version 2.00-17 > > Maybe related, this is on an x32-compiled binary. > > When trying to reproduce the perf_fuzzer myseriously segfaults on what > appears to be perfectly valid mmap'd perf ring-buffers. > > (running under gdb) > > Program received signal SIGSEGV, Segmentation fault. > 0x0041efbb in __memset_sse2 () > > => 0x0041efbb <+2203>: movdqa %xmm0,(%rdi) > > rdi 0xf7f61000 4160098304 > > f7f61000-f7f72000 rw-s 00000000 00:08 4475 anon_inode:[perf_event] > > So I'm not sure if somehow something is wrong with the page mapping, that > makes a valid write fail and sometimes (possibly due to address space > randomization) reboot the system? also strange, when I look at the core dumps it always shows the bad memory address happening at the beginning of an mmap page as expected for the ip listed, but the segfault listed in the kernel happens at some completely unrelated address that isn't even page aligned and shouldn't be possible based on the gdb/coredump results? [ 1560.313863] perf_fuzzer[2826]: segfault at 503283ff ip 000000000041efbb sp 00000000ffd367d8 error 6 in perf_fuzzer[400000+d1000] [ 1704.673245] perf_fuzzer[2835]: segfault at 503283ff ip 000000000041efbb sp 00000000ff972be8 error 6 in perf_fuzzer[400000+d1000] [ 2978.101276] perf_fuzzer[2841]: segfault at 503283ff ip 000000000041efbb sp 00000000ff92ba68 error 6 in perf_fuzzer[400000+d1000] [ 4907.185366] perf_fuzzer[2868]: segfault at 503283ff ip 000000000041efbb sp 00000000ffadcd28 error 6 in perf_fuzzer[400000+d1000] [ 9570.793746] perf_fuzzer[6183]: segfault at 4d0bf28e ip 000000000041efbb sp 00000000ff83f688 error 6 in perf_fuzzer[400000+d1000] [ 9743.888431] perf_fuzzer[6187]: segfault at 91734d5 ip 000000000041efbb sp 00000000ffb4e288 error 6 in perf_fuzzer[400000+d1000] Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-21 22:34 ` Vince Weaver @ 2014-02-22 4:50 ` Vince Weaver 2014-02-22 5:03 ` H. Peter Anvin 2014-02-22 6:26 ` H. Peter Anvin 0 siblings, 2 replies; 115+ messages in thread From: Vince Weaver @ 2014-02-22 4:50 UTC (permalink / raw) To: Vince Weaver Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H. Peter Anvin, H.J. Lu So I changed the perf_fuzzer so when it randomly stomps all over the perf_event_mmap_page, it uses a constant value of 0xdeadbeef rather than a random value. The result is below. The segfaults make a bit more sense now, it almost looks like what is happening is we are corrupting an address value somehow (head? tail?) and the kernel then uses the corrupt address and writes to memory outside of the mmap ring buffer. I still haven't figured out how to trigger this exactly, but you can see when over-written with 0xdeadbeef the memory address written to is consistently some small multiple of 0x120. I imagine it would be a bad thing if it turned out to be possible to select what memory address got written to. Although since I've only reproduced this on x32 maybe it won't be possible to over-write the kernel; but I have seen this bug cause a reboot when the wrong thing got over-written. [28002.850192] perf_fuzzer[7083]: segfault at 2be0 ip 000000000041efab sp 00000000ff826748 error 6 in perf_fuzzer[400000+d1000] [28639.769869] perf_fuzzer[7100]: segfault at 1320 ip 000000000041efab sp 00000000ffa65038 error 6 in perf_fuzzer[400000+d1000] [29396.986242] perf_fuzzer[7120]: segfault at 10e0 ip 000000000041efab sp 00000000ffd48e68 error 6 in perf_fuzzer[400000+d1000] [29738.892931] perf_fuzzer[7128]: segfault at 18c0 ip 000000000041efab sp 00000000ffcdcd88 error 6 in perf_fuzzer[400000+d1000] [29815.550210] perf_fuzzer[7132]: segfault at 120 ip 000000000041efab sp 00000000ffe673b8 error 6 in perf_fuzzer[400000+d1000] [30173.455348] perf_fuzzer[7141]: segfault at 120 ip 000000000041efab sp 00000000ffda1948 error 6 in perf_fuzzer[400000+d1000] [30570.625642] perf_fuzzer[7156]: segfault at 1680 ip 000000000041efab sp 00000000ffaad028 error 6 in perf_fuzzer[400000+d1000] [31047.887784] perf_fuzzer[7169]: segfault at 60c0 ip 000000000041efab sp 00000000ffaa86e8 error 6 in perf_fuzzer[400000+d1000] [31300.168714] perf_fuzzer[7175]: segfault at 3a80 ip 000000000041efab sp 00000000ffd83228 error 6 in perf_fuzzer[400000+d1000] [31984.727278] perf_fuzzer[7193]: segfault at 7e0 ip 000000000041efab sp 00000000ff9db1f8 error 6 in perf_fuzzer[400000+d1000] Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-22 4:50 ` Vince Weaver @ 2014-02-22 5:03 ` H. Peter Anvin 2014-02-22 6:26 ` H. Peter Anvin 1 sibling, 0 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-22 5:03 UTC (permalink / raw) To: Vince Weaver; +Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu Those are segfaults in user space, though? On February 21, 2014 8:50:38 PM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: > >So I changed the perf_fuzzer so when it randomly stomps all over the >perf_event_mmap_page, it uses a constant value of 0xdeadbeef rather >than a random value. > >The result is below. The segfaults make a bit more sense now, it >almost looks like what is happening is we are corrupting an address >value somehow (head? tail?) and the kernel then uses the corrupt >address >and writes to memory outside of the mmap ring buffer. > >I still haven't figured out how to trigger this exactly, but you can >see when over-written with 0xdeadbeef the memory address written to is >consistently some small multiple of 0x120. > >I imagine it would be a bad thing if it turned out to be possible to >select what memory address got written to. Although since I've >only reproduced this on x32 maybe it won't be possible to over-write >the kernel; but I have seen this bug cause a reboot when the >wrong thing got over-written. > >[28002.850192] perf_fuzzer[7083]: segfault at 2be0 ip 000000000041efab >sp 00000000ff826748 error 6 in perf_fuzzer[400000+d1000] >[28639.769869] perf_fuzzer[7100]: segfault at 1320 ip 000000000041efab >sp 00000000ffa65038 error 6 in perf_fuzzer[400000+d1000] >[29396.986242] perf_fuzzer[7120]: segfault at 10e0 ip 000000000041efab >sp 00000000ffd48e68 error 6 in perf_fuzzer[400000+d1000] >[29738.892931] perf_fuzzer[7128]: segfault at 18c0 ip 000000000041efab >sp 00000000ffcdcd88 error 6 in perf_fuzzer[400000+d1000] >[29815.550210] perf_fuzzer[7132]: segfault at 120 ip 000000000041efab >sp 00000000ffe673b8 error 6 in perf_fuzzer[400000+d1000] >[30173.455348] perf_fuzzer[7141]: segfault at 120 ip 000000000041efab >sp 00000000ffda1948 error 6 in perf_fuzzer[400000+d1000] >[30570.625642] perf_fuzzer[7156]: segfault at 1680 ip 000000000041efab >sp 00000000ffaad028 error 6 in perf_fuzzer[400000+d1000] >[31047.887784] perf_fuzzer[7169]: segfault at 60c0 ip 000000000041efab >sp 00000000ffaa86e8 error 6 in perf_fuzzer[400000+d1000] >[31300.168714] perf_fuzzer[7175]: segfault at 3a80 ip 000000000041efab >sp 00000000ffd83228 error 6 in perf_fuzzer[400000+d1000] >[31984.727278] perf_fuzzer[7193]: segfault at 7e0 ip 000000000041efab >sp 00000000ff9db1f8 error 6 in perf_fuzzer[400000+d1000] > >Vince -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-22 4:50 ` Vince Weaver 2014-02-22 5:03 ` H. Peter Anvin @ 2014-02-22 6:26 ` H. Peter Anvin 2014-02-23 5:18 ` Vince Weaver 1 sibling, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-22 6:26 UTC (permalink / raw) To: Vince Weaver; +Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu On 02/21/2014 08:50 PM, Vince Weaver wrote: > > So I changed the perf_fuzzer so when it randomly stomps all over the > perf_event_mmap_page, it uses a constant value of 0xdeadbeef rather > than a random value. > > The result is below. The segfaults make a bit more sense now, it > almost looks like what is happening is we are corrupting an address > value somehow (head? tail?) and the kernel then uses the corrupt address > and writes to memory outside of the mmap ring buffer. > That seems unlikely: handle->page = (offset >> page_shift) & (rb->nr_pages - 1); offset &= (1UL << page_shift) - 1; The masking to the number of pages should make that not possible, even if a completely bogus value is written. > I still haven't figured out how to trigger this exactly, but you can > see when over-written with 0xdeadbeef the memory address written to is > consistently some small multiple of 0x120. > > I imagine it would be a bad thing if it turned out to be possible to > select what memory address got written to. Although since I've > only reproduced this on x32 maybe it won't be possible to over-write > the kernel; but I have seen this bug cause a reboot when the > wrong thing got over-written. > > [28002.850192] perf_fuzzer[7083]: segfault at 2be0 ip 000000000041efab sp 00000000ff826748 error 6 in perf_fuzzer[400000+d1000] > [28639.769869] perf_fuzzer[7100]: segfault at 1320 ip 000000000041efab sp 00000000ffa65038 error 6 in perf_fuzzer[400000+d1000] > [29396.986242] perf_fuzzer[7120]: segfault at 10e0 ip 000000000041efab sp 00000000ffd48e68 error 6 in perf_fuzzer[400000+d1000] > [29738.892931] perf_fuzzer[7128]: segfault at 18c0 ip 000000000041efab sp 00000000ffcdcd88 error 6 in perf_fuzzer[400000+d1000] > [29815.550210] perf_fuzzer[7132]: segfault at 120 ip 000000000041efab sp 00000000ffe673b8 error 6 in perf_fuzzer[400000+d1000] > [30173.455348] perf_fuzzer[7141]: segfault at 120 ip 000000000041efab sp 00000000ffda1948 error 6 in perf_fuzzer[400000+d1000] > [30570.625642] perf_fuzzer[7156]: segfault at 1680 ip 000000000041efab sp 00000000ffaad028 error 6 in perf_fuzzer[400000+d1000] > [31047.887784] perf_fuzzer[7169]: segfault at 60c0 ip 000000000041efab sp 00000000ffaa86e8 error 6 in perf_fuzzer[400000+d1000] > [31300.168714] perf_fuzzer[7175]: segfault at 3a80 ip 000000000041efab sp 00000000ffd83228 error 6 in perf_fuzzer[400000+d1000] > [31984.727278] perf_fuzzer[7193]: segfault at 7e0 ip 000000000041efab sp 00000000ff9db1f8 error 6 in perf_fuzzer[400000+d1000] Error 6 reflects a write in userspace to a not-present page. Since your previous trace indicates that the value of the register in question is a different one, I'm guessing that what we have here is PEBS getting activated. 0x120 is 2*0x90, and 0x90 is the size of a 64-bit PEBS record. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-22 6:26 ` H. Peter Anvin @ 2014-02-23 5:18 ` Vince Weaver 2014-02-23 5:24 ` H. Peter Anvin 2014-02-23 6:07 ` H. Peter Anvin 0 siblings, 2 replies; 115+ messages in thread From: Vince Weaver @ 2014-02-23 5:18 UTC (permalink / raw) To: H. Peter Anvin Cc: Vince Weaver, Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu On Fri, 21 Feb 2014, H. Peter Anvin wrote: > Error 6 reflects a write in userspace to a not-present page. > > Since your previous trace indicates that the value of the register in question > is a different one, I'm guessing that what we have here is PEBS getting > activated. 0x120 is 2*0x90, and 0x90 is the size of a 64-bit PEBS record. I'm having problems generating a replayable syscall trace that exhibits the problem. It turns out that the segfault address listed (the multiple of 0x120) happens to be the value in the RBP register at the time of the segfault. That's odd, as the instruction is movdqa %xmm0,(%rdi) and rdi is the valid mmap address of the perf ring buffer rdi 0xf7768000 4151738368 so I'm not sure why RBP is involved at all. In all of the cases I've investigated the precise_ip value has been set for the problem event... but none of the events have been hardware events (software and breakpoint so far). So probably not PEBS related? Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-23 5:18 ` Vince Weaver @ 2014-02-23 5:24 ` H. Peter Anvin 2014-02-23 6:07 ` H. Peter Anvin 1 sibling, 0 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-23 5:24 UTC (permalink / raw) To: Vince Weaver; +Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu What is the instructions around it, by any chance? On February 22, 2014 9:18:17 PM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: >On Fri, 21 Feb 2014, H. Peter Anvin wrote: > >> Error 6 reflects a write in userspace to a not-present page. >> >> Since your previous trace indicates that the value of the register in >question >> is a different one, I'm guessing that what we have here is PEBS >getting >> activated. 0x120 is 2*0x90, and 0x90 is the size of a 64-bit PEBS >record. > >I'm having problems generating a replayable syscall trace that exhibits > >the problem. > >It turns out that the segfault address listed (the multiple of 0x120) >happens to be the value in the RBP register at the time of the >segfault. > >That's odd, as the instruction is > movdqa %xmm0,(%rdi) >and rdi is the valid mmap address of the perf ring buffer > rdi 0xf7768000 4151738368 > >so I'm not sure why RBP is involved at all. > >In all of the cases I've investigated the precise_ip value has been set > >for the problem event... but none of the events have been hardware >events >(software and breakpoint so far). So probably not PEBS related? > >Vince -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-23 5:18 ` Vince Weaver 2014-02-23 5:24 ` H. Peter Anvin @ 2014-02-23 6:07 ` H. Peter Anvin 2014-02-23 14:05 ` Vince Weaver 1 sibling, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-23 6:07 UTC (permalink / raw) To: Vince Weaver; +Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu I'd be interested in how rbp gets set, too. It might just be a coincidence and the value in rbp has some other meaning here. On February 22, 2014 9:18:17 PM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: >On Fri, 21 Feb 2014, H. Peter Anvin wrote: > >> Error 6 reflects a write in userspace to a not-present page. >> >> Since your previous trace indicates that the value of the register in >question >> is a different one, I'm guessing that what we have here is PEBS >getting >> activated. 0x120 is 2*0x90, and 0x90 is the size of a 64-bit PEBS >record. > >I'm having problems generating a replayable syscall trace that exhibits > >the problem. > >It turns out that the segfault address listed (the multiple of 0x120) >happens to be the value in the RBP register at the time of the >segfault. > >That's odd, as the instruction is > movdqa %xmm0,(%rdi) >and rdi is the valid mmap address of the perf ring buffer > rdi 0xf7768000 4151738368 > >so I'm not sure why RBP is involved at all. > >In all of the cases I've investigated the precise_ip value has been set > >for the problem event... but none of the events have been hardware >events >(software and breakpoint so far). So probably not PEBS related? > >Vince -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-23 6:07 ` H. Peter Anvin @ 2014-02-23 14:05 ` Vince Weaver 2014-02-24 3:02 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-23 14:05 UTC (permalink / raw) To: H. Peter Anvin Cc: Vince Weaver, Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu On Sat, 22 Feb 2014, H. Peter Anvin wrote: > I'd be interested in how rbp gets set, too. It might just be a > coincidence and the value in rbp has some other meaning here. The code in question does this: i=find_random_active_event(); if (i<0) return; if ((event_data[i].mmap)) { value=0xdeadbeef; memset(event_data[i].mmap,value,getpagesize()); [New LWP 10526] Core was generated by `./perf_fuzzer -t OCIRMQWPpAi -r 1392938876'. Program terminated with signal 11, Segmentation fault. #0 0x0041efab in __memset_sse2 () (gdb) bt #0 0x0041efab in __memset_sse2 () #1 0x004017ec in trash_random_mmap () at perf_fuzzer.c:808 #2 main (argc=<optimized out>, argv=<optimized out>) at perf_fuzzer.c:1604 So rbp is set by the imul below, it is the offset into the event_data[i] array where the elements have size of 0x120 0x004017bd <+3085>: callq 0x402ee0 <find_random_active_event> 0x004017c2 <+3090>: test %eax,%eax 0x004017c4 <+3092>: js 0x4011e8 <main+1592> 0x004017ca <+3098>: imul $0x120,%eax,%ebp 0x004017d0 <+3104>: mov 0x756b2c(%ebp),%eax 0x004017d7 <+3111>: test %eax,%eax 0x004017d9 <+3113>: je 0x40183b <main+3211> 0x004017db <+3115>: mov 0xc(%esp),%edx 0x004017e0 <+3120>: mov %eax,%edi 0x004017e2 <+3122>: mov $0xdeadbeef,%esi 0x004017e7 <+3127>: callq 0x400260 0x004017ec <+3132>: testb $0x20,0x353e76(%rip) # 0x755669 <logging+$ 400260: ff 25 ce 0e 2d 00 jmpq *0x2d0ece(%rip) # 6d1134 $ 0x6d1134: 0x0041e710 Dump of assembler code for function __memset_sse2: 0x0041e710 <+0>: cmp $0x1,%rdx 0x0041e714 <+4>: mov %rdi,%rax 0x0041e717 <+7>: jne 0x41e71d <__memset_sse2+13> 0x0041e719 <+9>: mov %sil,(%rdi) and as far as I can tell nothing touches rbp again until the segfault. Nothing in _memset_sse2 does as far as I can tell. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-23 14:05 ` Vince Weaver @ 2014-02-24 3:02 ` Vince Weaver 2014-02-24 5:22 ` H. Peter Anvin 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-24 3:02 UTC (permalink / raw) To: Vince Weaver Cc: H. Peter Anvin, Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu On Sun, 23 Feb 2014, Vince Weaver wrote: > > and as far as I can tell nothing touches rbp again until the segfault. > Nothing in _memset_sse2 does as far as I can tell. I only know enough about ftrace to be dangerous, but here is what I think is the trace of the problem: perf_fuzzer-11492 [000] 197077.488363: function: intel_get_event_constraints perf_fuzzer-11492 [000] 197077.488363: function: intel_pebs_constraints perf_fuzzer-11492 [000] 197077.488365: function: intel_put_event_constraints perf_fuzzer-11492 [000] 197077.488365: function: intel_pmu_enable_all perf_fuzzer-11492 [000] 197077.488366: function: intel_pmu_pebs_enable_all perf_fuzzer-11492 [000] 197077.488367: function: intel_pmu_lbr_enable_all perf_fuzzer-11492 [000] 197077.488366: function: intel_pmu_pebs_enable_all perf_fuzzer-11492 [000] 197077.488367: function: intel_pmu_lbr_enable_all perf_fuzzer-11492 [000] 197077.488368: function: mutex_unlock perf_fuzzer-11492 [000] 197077.488368: function: mutex_lock perf_fuzzer-11492 [000] 197077.488369: function: _cond_resched perf_fuzzer-11492 [000] 197077.488370: function: _raw_spin_lock_irq perf_fuzzer-11492 [000] 197077.488370: function: mutex_unlock perf_fuzzer-11492 [000] 197077.488371: function: mutex_lock perf_fuzzer-11492 [000] 197077.488371: function: _cond_resched perf_fuzzer-11492 [000] 197077.488372: function: _raw_spin_lock_irq perf_fuzzer-11492 [000] 197077.488373: function: mutex_unlock perf_fuzzer-11492 [000] 197077.488373: function: mutex_lock perf_fuzzer-11492 [000] 197077.488374: function: _cond_resched perf_fuzzer-11492 [000] 197077.488374: function: _raw_spin_lock_irq perf_fuzzer-11492 [000] 197077.488375: function: smp_call_function_single perf_fuzzer-11492 [000] 197077.488376: function: _raw_spin_lock perf_fuzzer-11492 [000] 197077.488377: function: mutex_unlock perf_fuzzer-11492 [000] 197077.488378: function: mutex_lock perf_fuzzer-11492 [000] 197077.488378: function: _cond_resched perf_fuzzer-11492 [000] 197077.488379: function: _raw_spin_lock_irq perf_fuzzer-11492 [000] 197077.488380: function: smp_call_function_single perf_fuzzer-11492 [000] 197077.488380: function: _raw_spin_lock perf_fuzzer-11492 [000] 197077.488381: function: mutex_unlock perf_fuzzer-11492 [000] 197077.488382: function: mutex_unlock perf_fuzzer-11492 [000] 197077.488383: function: syscall_trace_leave perf_fuzzer-11492 [000] 197077.488383: sys_exit: NR 1073741981 = 0 perf_fuzzer-11492 [000] 197077.488387: function: do_device_not_available perf_fuzzer-11492 [000] 197077.488387: function: math_state_restore perf_fuzzer-11492 [000] 197077.488390: function: trace_do_page_fault perf_fuzzer-11492 [000] 197077.488391: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6 perf_fuzzer-11492 [000] 197077.488395: function: perf_callchain perf_fuzzer-11492 [000] 197077.488396: function: copy_from_user_nmi perf_fuzzer-11492 [000] 197077.488397: function: trace_do_page_fault perf_fuzzer-11492 [000] 197077.488398: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0 perf_fuzzer-11492 [000] 197077.488399: function: __do_page_fault perf_fuzzer-11492 [000] 197077.488400: function: bad_area_nosemaphore perf_fuzzer-11492 [000] 197077.488401: function: __bad_area_nosemaphore perf_fuzzer-11492 [000] 197077.488401: function: no_context perf_fuzzer-11492 [000] 197077.488402: function: fixup_exception perf_fuzzer-11492 [000] 197077.488403: function: search_exception_tables perf_fuzzer-11492 [000] 197077.488403: function: search_extable perf_fuzzer-11492 [000] 197077.488405: function: copy_user_handle_tail perf_fuzzer-11492 [000] 197077.488406: function: trace_do_page_fault perf_fuzzer-11492 [000] 197077.488406: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0 perf_fuzzer-11492 [000] 197077.488407: function: __do_page_fault perf_fuzzer-11492 [000] 197077.488408: function: bad_area_nosemaphore perf_fuzzer-11492 [000] 197077.488409: function: __bad_area_nosemaphore perf_fuzzer-11492 [000] 197077.488409: function: no_context perf_fuzzer-11492 [000] 197077.488410: function: fixup_exception perf_fuzzer-11492 [000] 197077.488410: function: search_exception_tables perf_fuzzer-11492 [000] 197077.488411: function: search_extable perf_fuzzer-11492 [000] 197077.488413: function: perf_output_begin perf_fuzzer-11492 [000] 197077.488414: function: perf_output_copy perf_fuzzer-11492 [000] 197077.488415: function: perf_output_copy perf_fuzzer-11492 [000] 197077.488415: function: perf_output_copy perf_fuzzer-11492 [000] 197077.488416: function: perf_output_copy perf_fuzzer-11492 [000] 197077.488418: function: perf_output_copy perf_fuzzer-11492 [000] 197077.488419: function: perf_output_copy perf_fuzzer-11492 [000] 197077.488419: function: perf_output_end perf_fuzzer-11492 [000] 197077.488420: function: perf_output_put_handle perf_fuzzer-11492 [000] 197077.488421: function: __do_page_fault perf_fuzzer-11492 [000] 197077.488422: function: down_read_trylock perf_fuzzer-11492 [000] 197077.488423: function: _cond_resched perf_fuzzer-11492 [000] 197077.488423: function: find_vma perf_fuzzer-11492 [000] 197077.488424: function: bad_area perf_fuzzer-11492 [000] 197077.488425: function: up_read perf_fuzzer-11492 [000] 197077.488426: function: __bad_area_nosemaphore perf_fuzzer-11492 [000] 197077.488426: function: is_prefetch perf_fuzzer-11492 [000] 197077.488427: function: convert_ip_to_linear perf_fuzzer-11492 [000] 197077.488428: function: unhandled_signal perf_fuzzer-11492 [000] 197077.488429: function: __printk_ratelimit perf_fuzzer-11492 [000] 197077.488430: function: _raw_spin_trylock perf_fuzzer-11492 [000] 197077.488430: function: _raw_spin_unlock_irqrestore perf_fuzzer-11492 [000] 197077.488431: function: printk perf_fuzzer-11492 [000] 197077.488432: function: vprintk_emit perf_fuzzer-11492 [000] 197077.488434: function: _raw_spin_lock perf_fuzzer-11492 [000] 197077.488443: function: cont_add perf_fuzzer-11492 [000] 197077.488444: function: console_trylock perf_fuzzer-11492 [000] 197077.488445: function: down_trylock perf_fuzzer-11492 [000] 197077.488445: function: _raw_spin_lock_irqsave perf_fuzzer-11492 [000] 197077.488446: function: _raw_spin_unlock_irqrestore perf_fuzzer-11492 [000] 197077.488447: function: console_unlock perf_fuzzer-11492 [000] 197077.488447: function: _raw_spin_lock_irqsave perf_fuzzer-11492 [000] 197077.488449: function: print_time perf_fuzzer-11492 [000] 197077.488452: function: T.950 perf_fuzzer-11492 [000] 197077.488453: console: [197179.420735] perf_fuzzer[11492]: segfault at 22e0 ip 000000000041efab sp 00000000ffda0938 error 6 ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 3:02 ` Vince Weaver @ 2014-02-24 5:22 ` H. Peter Anvin 2014-02-24 15:35 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-24 5:22 UTC (permalink / raw) To: Vince Weaver; +Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu On 02/23/2014 07:02 PM, Vince Weaver wrote: > On Sun, 23 Feb 2014, Vince Weaver wrote: >> >> and as far as I can tell nothing touches rbp again until the segfault. >> Nothing in _memset_sse2 does as far as I can tell. > > I only know enough about ftrace to be dangerous, but here is what I think > is the trace of the problem: > > perf_fuzzer-11492 [000] 197077.488420: function: perf_output_put_handle > perf_fuzzer-11492 [000] 197077.488421: function: __do_page_fault So we do a write to the buffer rather immediately before this happens, and in particular that will update the head: rb->user_page->data_head = head; However, that doesn't explain what is going on and in particular the write to whatever address was in %rbp. The rest pretty much seems to be the page fault logic. Incidentally, I doubt that this is x32-related in any way; there seems to be absolutely no difference between x86-64 perf and x32 perf; more likely it just makes the error more reproducible because the address space is so much smaller. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 5:22 ` H. Peter Anvin @ 2014-02-24 15:35 ` Vince Weaver 2014-02-24 16:34 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-24 15:35 UTC (permalink / raw) To: H. Peter Anvin Cc: Vince Weaver, Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu On Sun, 23 Feb 2014, H. Peter Anvin wrote: > So we do a write to the buffer rather immediately before this happens, > and in particular that will update the head: > > rb->user_page->data_head = head; > > However, that doesn't explain what is going on and in particular the > write to whatever address was in %rbp. The rest pretty much seems to be > the page fault logic. It turns out you don't even have to over-write rb->user_page->data_head. Just touching the mmap page with a write of a single byte (it doesn't matter where) is enough to trigger the bug. This is a pain to track down, it would be easier if I could get a replayable syscall trace, but even though the segfault is very reproducible with my fuzzer, it's very sensitive to extra syscalls in the trace path and the fuzzer logger/replayer path has a different number of write syscalls and won't trigger the problem. > Incidentally, I doubt that this is x32-related in any way; there seems > to be absolutely no difference between x86-64 perf and x32 perf; more > likely it just makes the error more reproducible because the address > space is so much smaller. quite possibly. I only began chasing the problem because when compiled for x32 this bug apparently will reboot the machine now and then (not just segfault the program). I never saw that failure mode with x86_64, but again maybe it's just easier to hit with the reduced address space as you say. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 15:35 ` Vince Weaver @ 2014-02-24 16:34 ` Vince Weaver 2014-02-24 16:47 ` H. Peter Anvin 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-24 16:34 UTC (permalink / raw) To: Vince Weaver Cc: H. Peter Anvin, Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu On Mon, 24 Feb 2014, Vince Weaver wrote: > Just touching the mmap page with a write of a single byte (it doesn't > matter where) is enough to trigger the bug. OK, investigating this more. perf_fuzzer-2971 [000] 154.944114: page_fault_user: address=0xf7729000 ip=0x41efab error_code=0x6 perf_fuzzer-2971 [000] 154.944118: function: ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840 perf_fuzzer-2971 [000] 154.944119: function: ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5 perf_fuzzer-2971 [000] 154.944120: function: ip=0xffffffff8153837c parent_ip=0xffffffff81535432 perf_fuzzer-2971 [000] 154.944121: page_fault_kernel: address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0 It looks like there are two page faults. The first is caused by the user code accessing the mmap'd page. It looks sort of normal and what you'd expect if the perf_event mmap ring buffer is being accessed for the first time. What follows is a kernel page fault, and this is the one where for whatever reason CR2 has obtained the value of the userspace RBP register. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 16:34 ` Vince Weaver @ 2014-02-24 16:47 ` H. Peter Anvin 2014-02-24 17:10 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-24 16:47 UTC (permalink / raw) To: Vince Weaver; +Cc: Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu Ok, so the obvious question is what is at that kernel address? On February 24, 2014 8:34:30 AM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: >On Mon, 24 Feb 2014, Vince Weaver wrote: > >> Just touching the mmap page with a write of a single byte (it doesn't > >> matter where) is enough to trigger the bug. > >OK, investigating this more. > >perf_fuzzer-2971 [000] 154.944114: page_fault_user: >address=0xf7729000 ip=0x41efab error_code=0x6 >perf_fuzzer-2971 [000] 154.944118: function: >ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840 >perf_fuzzer-2971 [000] 154.944119: function: >ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5 >perf_fuzzer-2971 [000] 154.944120: function: >ip=0xffffffff8153837c parent_ip=0xffffffff81535432 >perf_fuzzer-2971 [000] 154.944121: page_fault_kernel: >address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0 > >It looks like there are two page faults. The first is caused by the >user >code accessing the mmap'd page. It looks sort of normal and what you'd >expect if the perf_event mmap ring buffer is being accessed for the >first >time. > >What follows is a kernel page fault, and this is the one where for >whatever reason CR2 has obtained the value of the userspace RBP >register. > >Vince -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 16:47 ` H. Peter Anvin @ 2014-02-24 17:10 ` Vince Weaver 2014-02-24 17:25 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-24 17:10 UTC (permalink / raw) To: H. Peter Anvin Cc: Vince Weaver, Linux Kernel, Peter Zijlstra, Ingo Molnar, H.J. Lu On Mon, 24 Feb 2014, H. Peter Anvin wrote: > On February 24, 2014 8:34:30 AM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: > >On Mon, 24 Feb 2014, Vince Weaver wrote: > > > >> Just touching the mmap page with a write of a single byte (it doesn't > > > >> matter where) is enough to trigger the bug. > > > >OK, investigating this more. > > > >perf_fuzzer-2971 [000] 154.944114: page_fault_user: > >address=0xf7729000 ip=0x41efab error_code=0x6 > >perf_fuzzer-2971 [000] 154.944118: function: > >ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840 > >perf_fuzzer-2971 [000] 154.944119: function: > >ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5 > >perf_fuzzer-2971 [000] 154.944120: function: > >ip=0xffffffff8153837c parent_ip=0xffffffff81535432 > >perf_fuzzer-2971 [000] 154.944121: page_fault_kernel: > >address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0 > Ok, so the obvious question is what is at that kernel address? > It's in copy_user_generic_string() rep movsq %ds:(%rsi),%es:(%rdi) And looking at the ftrace: perf_fuzzer-2979 [000] 161.475920: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6 perf_fuzzer-2979 [000] 161.475922: function: perf_callchain perf_fuzzer-2979 [000] 161.475922: function: copy_from_user_nmi perf_fuzzer-2979 [000] 161.475923: function: trace_do_page_fault perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0 What is likely happening is the user page fault is triggering code to do a "perf_callchain" dump, which is calling copy_from_user_nmi() which calls copy_user_generic_string() which is somehow getting the user RBP in the RDI register somehow? Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 17:10 ` Vince Weaver @ 2014-02-24 17:25 ` Peter Zijlstra 2014-02-24 17:32 ` Vince Weaver 2014-02-24 17:52 ` H. Peter Anvin 0 siblings, 2 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-02-24 17:25 UTC (permalink / raw) To: Vince Weaver Cc: H. Peter Anvin, Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On Mon, Feb 24, 2014 at 12:10:44PM -0500, Vince Weaver wrote: > On Mon, 24 Feb 2014, H. Peter Anvin wrote: > > > On February 24, 2014 8:34:30 AM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: > > >On Mon, 24 Feb 2014, Vince Weaver wrote: > > > > > >> Just touching the mmap page with a write of a single byte (it doesn't > > > > > >> matter where) is enough to trigger the bug. > > > > > >OK, investigating this more. > > > > > >perf_fuzzer-2971 [000] 154.944114: page_fault_user: > > >address=0xf7729000 ip=0x41efab error_code=0x6 > > >perf_fuzzer-2971 [000] 154.944118: function: > > >ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840 > > >perf_fuzzer-2971 [000] 154.944119: function: > > >ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5 > > >perf_fuzzer-2971 [000] 154.944120: function: > > >ip=0xffffffff8153837c parent_ip=0xffffffff81535432 > > >perf_fuzzer-2971 [000] 154.944121: page_fault_kernel: > > >address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0 > > > Ok, so the obvious question is what is at that kernel address? > > > > It's in copy_user_generic_string() > rep movsq %ds:(%rsi),%es:(%rdi) > > And looking at the ftrace: > perf_fuzzer-2979 [000] 161.475920: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6 > perf_fuzzer-2979 [000] 161.475922: function: perf_callchain > perf_fuzzer-2979 [000] 161.475922: function: copy_from_user_nmi > perf_fuzzer-2979 [000] 161.475923: function: trace_do_page_fault > perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0 > > What is likely happening is the user page fault is triggering > code to do a "perf_callchain" dump, which is calling copy_from_user_nmi() > which calls copy_user_generic_string() which is somehow getting the user > RBP in the RDI register somehow? So that code very much relies on the 'recursive' NMI/iret magic from Steve, patch 3f3c8b8c4b2a3 (and assorted fixes later). If CR2 is getting corrupted; 7fbb98c5cb075 seems relevant. Peter, does x32 have a slightly different ABI/calling convention that would make any of these patches just slightly 'off'? ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 17:25 ` Peter Zijlstra @ 2014-02-24 17:32 ` Vince Weaver 2014-02-24 17:40 ` H. Peter Anvin ` (2 more replies) 2014-02-24 17:52 ` H. Peter Anvin 1 sibling, 3 replies; 115+ messages in thread From: Vince Weaver @ 2014-02-24 17:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On Mon, 24 Feb 2014, Peter Zijlstra wrote: > On Mon, Feb 24, 2014 at 12:10:44PM -0500, Vince Weaver wrote: > > On Mon, 24 Feb 2014, H. Peter Anvin wrote: > > > > > On February 24, 2014 8:34:30 AM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: > > > >On Mon, 24 Feb 2014, Vince Weaver wrote: > > > > > > > >> Just touching the mmap page with a write of a single byte (it doesn't > > > > > > > >> matter where) is enough to trigger the bug. > > > > > > > >OK, investigating this more. > > > > > > > >perf_fuzzer-2971 [000] 154.944114: page_fault_user: > > > >address=0xf7729000 ip=0x41efab error_code=0x6 > > > >perf_fuzzer-2971 [000] 154.944118: function: > > > >ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840 > > > >perf_fuzzer-2971 [000] 154.944119: function: > > > >ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5 > > > >perf_fuzzer-2971 [000] 154.944120: function: > > > >ip=0xffffffff8153837c parent_ip=0xffffffff81535432 > > > >perf_fuzzer-2971 [000] 154.944121: page_fault_kernel: > > > >address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0 > > > > > Ok, so the obvious question is what is at that kernel address? > > > > > > > It's in copy_user_generic_string() > > rep movsq %ds:(%rsi),%es:(%rdi) > > > > And looking at the ftrace: > > perf_fuzzer-2979 [000] 161.475920: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6 > > perf_fuzzer-2979 [000] 161.475922: function: perf_callchain > > perf_fuzzer-2979 [000] 161.475922: function: copy_from_user_nmi > > perf_fuzzer-2979 [000] 161.475923: function: trace_do_page_fault > > perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0 > > > > What is likely happening is the user page fault is triggering > > code to do a "perf_callchain" dump, which is calling copy_from_user_nmi() > > which calls copy_user_generic_string() which is somehow getting the user > > RBP in the RDI register somehow? > > So that code very much relies on the 'recursive' NMI/iret magic from > Steve, patch 3f3c8b8c4b2a3 (and assorted fixes later). > > If CR2 is getting corrupted; 7fbb98c5cb075 seems relevant. > > Peter, does x32 have a slightly different ABI/calling convention that > would make any of these patches just slightly 'off'? I do note that perf_callchain_user(); Does fp = (void __user *)regs->bp; ... bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); And in my particular executable RBP has nothing to do with a frame pointer, but is instead being used as a general purpose register. Am I missing something here? Though in that case I'm not sure why this wouldn't be easier to trigger. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 17:32 ` Vince Weaver @ 2014-02-24 17:40 ` H. Peter Anvin 2014-02-24 18:00 ` Vince Weaver 2014-02-24 17:40 ` Peter Zijlstra 2014-02-24 17:41 ` Vince Weaver 2 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-24 17:40 UTC (permalink / raw) To: Vince Weaver, Peter Zijlstra Cc: Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On 02/24/2014 09:32 AM, Vince Weaver wrote: >> >> Peter, does x32 have a slightly different ABI/calling convention that >> would make any of these patches just slightly 'off'? > > I do note that > perf_callchain_user(); > > Does > fp = (void __user *)regs->bp; > > ... > > bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); > > > And in my particular executable RBP has nothing to do with a frame > pointer, but is instead being used as a general purpose register. > > Am I missing something here? Though in that case I'm not sure why this > wouldn't be easier to trigger. > Neither x86-64 nor x32 are typically compiled with fixed frame pointers (which would be %rbp if they are). So I'm guessing the perf_callchain logic is only applicable to a user-space binary explicitly compiled with frame pointers turned on. So copy_from_user_nmi() stumbles onto a nonexistent page and takes a page fault. This isn't a big deal, because perf_callchain_user() is set up to handle that (and just terminates the trace), *except* now CR2 is corrupt, and we took this event while handling a page fault already... and apparently before we even did read_cr2() in __do_page_fault. The description of copy_from_user_nmi() states: /* * We rely on the nested NMI work to allow atomic faults from the NMI path; the * nested NMI paths are careful to preserve CR2. */ ... but that doesn't seem to happen here for whatever reason. There is no hint in your trace what happens after the kernel page fault so that makes it hard to know. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 17:40 ` H. Peter Anvin @ 2014-02-24 18:00 ` Vince Weaver 2014-02-24 18:07 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-24 18:00 UTC (permalink / raw) To: H. Peter Anvin Cc: Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt [-- Attachment #1: Type: TEXT/PLAIN, Size: 1855 bytes --] On Mon, 24 Feb 2014, H. Peter Anvin wrote: > On 02/24/2014 09:32 AM, Vince Weaver wrote: > >> > >> Peter, does x32 have a slightly different ABI/calling convention that > >> would make any of these patches just slightly 'off'? > > > > I do note that > > perf_callchain_user(); > > > > Does > > fp = (void __user *)regs->bp; > > > > ... > > > > bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); > > > > > > And in my particular executable RBP has nothing to do with a frame > > pointer, but is instead being used as a general purpose register. > > > > Am I missing something here? Though in that case I'm not sure why this > > wouldn't be easier to trigger. > > > > Neither x86-64 nor x32 are typically compiled with fixed frame pointers > (which would be %rbp if they are). So I'm guessing the perf_callchain > logic is only applicable to a user-space binary explicitly compiled with > frame pointers turned on. > > So copy_from_user_nmi() stumbles onto a nonexistent page and takes a > page fault. This isn't a big deal, because perf_callchain_user() is set > up to handle that (and just terminates the trace), *except* now CR2 is > corrupt, and we took this event while handling a page fault already... > and apparently before we even did read_cr2() in __do_page_fault. > > The description of copy_from_user_nmi() states: > > /* > * We rely on the nested NMI work to allow atomic faults from the NMI > path; the > * nested NMI paths are careful to preserve CR2. > */ > > ... but that doesn't seem to happen here for whatever reason. > > There is no hint in your trace what happens after the kernel page fault > so that makes it hard to know. Ahh, ftrace, the cause of and solution to all my perf_fuzzing problems. Anyway I've attached the full tail end of the trace if you want to see everything that happens. Vince [-- Attachment #2: Type: APPLICATION/octet-stream, Size: 61434 bytes --] ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 18:00 ` Vince Weaver @ 2014-02-24 18:07 ` Vince Weaver 2014-02-24 18:34 ` H. Peter Anvin 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-24 18:07 UTC (permalink / raw) To: Vince Weaver Cc: H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On Mon, 24 Feb 2014, Vince Weaver wrote: > On Mon, 24 Feb 2014, H. Peter Anvin wrote: > > > On 02/24/2014 09:32 AM, Vince Weaver wrote: > > >> > > >> Peter, does x32 have a slightly different ABI/calling convention that > > >> would make any of these patches just slightly 'off'? > > > > > > I do note that > > > perf_callchain_user(); > > > > > > Does > > > fp = (void __user *)regs->bp; > > > > > > ... > > > > > > bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); > > > > > > > > > And in my particular executable RBP has nothing to do with a frame > > > pointer, but is instead being used as a general purpose register. > > > > > > Am I missing something here? Though in that case I'm not sure why this > > > wouldn't be easier to trigger. > > > > > > > Neither x86-64 nor x32 are typically compiled with fixed frame pointers > > (which would be %rbp if they are). So I'm guessing the perf_callchain > > logic is only applicable to a user-space binary explicitly compiled with > > frame pointers turned on. > > > > So copy_from_user_nmi() stumbles onto a nonexistent page and takes a > > page fault. This isn't a big deal, because perf_callchain_user() is set > > up to handle that (and just terminates the trace), *except* now CR2 is > > corrupt, and we took this event while handling a page fault already... > > and apparently before we even did read_cr2() in __do_page_fault. > > > > The description of copy_from_user_nmi() states: > > > > /* > > * We rely on the nested NMI work to allow atomic faults from the NMI > > path; the > > * nested NMI paths are careful to preserve CR2. > > */ > > > > ... but that doesn't seem to happen here for whatever reason. > > > > There is no hint in your trace what happens after the kernel page fault > > so that makes it hard to know. > > Ahh, ftrace, the cause of and solution to all my perf_fuzzing problems. > > Anyway I've attached the full tail end of the trace if you want to see > everything that happens. and then I note there are *two* kernel page faults. perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0 address=0x1 ip=0xffffffff812a7d9c error_code=0x0 perf_fuzzer-2979 [000] 161.475924: function: __do_page_fault perf_fuzzer-2979 [000] 161.475924: function: bad_area_nosemaphore perf_fuzzer-2979 [000] 161.475925: function: __bad_area_nosemaphore perf_fuzzer-2979 [000] 161.475925: function: no_context perf_fuzzer-2979 [000] 161.475925: function: fixup_exception perf_fuzzer-2979 [000] 161.475926: function: search_exception_tables perf_fuzzer-2979 [000] 161.475926: function: search_extable perf_fuzzer-2979 [000] 161.475927: function: copy_user_handle_tail perf_fuzzer-2979 [000] 161.475927: function: trace_do_page_fault perf_fuzzer-2979 [000] 161.475928: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0 address=0x1 ip=0xffffffff812a92bb error_code=0x0 perf_fuzzer-2979 [000] 161.475928: function: __do_page_fault perf_fuzzer-2979 [000] 161.475928: function: bad_area_nosemaphore perf_fuzzer-2979 [000] 161.475929: function: __bad_area_nosemaphore perf_fuzzer-2979 [000] 161.475929: function: no_context perf_fuzzer-2979 [000] 161.475929: function: fixup_exception perf_fuzzer-2979 [000] 161.475929: function: search_exception_tables perf_fuzzer-2979 [000] 161.475930: function: search_extable perf_fuzzer-2979 [000] 161.475931: function: perf_output_begin perf_fuzzer-2979 [000] 161.475931: function: perf_output_copy That second one is in copy_user_handle_tail() Sorry for the sloppy analysis here, I did most of the initial tracing last night at 1am typing one-handed with a sick crying baby draped over one shoulder, so not really operating at my best. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 18:07 ` Vince Weaver @ 2014-02-24 18:34 ` H. Peter Anvin 2014-02-24 19:13 ` Steven Rostedt 0 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-24 18:34 UTC (permalink / raw) To: Vince Weaver Cc: Peter Zijlstra, Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On 02/24/2014 10:07 AM, Vince Weaver wrote: >> >> Anyway I've attached the full tail end of the trace if you want to see >> everything that happens. > > and then I note there are *two* kernel page faults. > > perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0 > address=0x1 ip=0xffffffff812a7d9c error_code=0x0 > perf_fuzzer-2979 [000] 161.475924: function: __do_page_fault > perf_fuzzer-2979 [000] 161.475924: function: bad_area_nosemaphore > perf_fuzzer-2979 [000] 161.475925: function: __bad_area_nosemaphore > perf_fuzzer-2979 [000] 161.475925: function: no_context > perf_fuzzer-2979 [000] 161.475925: function: fixup_exception > perf_fuzzer-2979 [000] 161.475926: function: search_exception_tables > perf_fuzzer-2979 [000] 161.475926: function: search_extable > perf_fuzzer-2979 [000] 161.475927: function: copy_user_handle_tail > perf_fuzzer-2979 [000] 161.475927: function: trace_do_page_fault > perf_fuzzer-2979 [000] 161.475928: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0 > address=0x1 ip=0xffffffff812a92bb error_code=0x0 > perf_fuzzer-2979 [000] 161.475928: function: __do_page_fault > perf_fuzzer-2979 [000] 161.475928: function: bad_area_nosemaphore > perf_fuzzer-2979 [000] 161.475929: function: __bad_area_nosemaphore > perf_fuzzer-2979 [000] 161.475929: function: no_context > perf_fuzzer-2979 [000] 161.475929: function: fixup_exception > perf_fuzzer-2979 [000] 161.475929: function: search_exception_tables > perf_fuzzer-2979 [000] 161.475930: function: search_extable > perf_fuzzer-2979 [000] 161.475931: function: perf_output_begin > perf_fuzzer-2979 [000] 161.475931: function: perf_output_copy > > That second one is in copy_user_handle_tail() > Either way, it really seems like we have a case of CR2 leakage out of the NMI context. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 18:34 ` H. Peter Anvin @ 2014-02-24 19:13 ` Steven Rostedt 2014-02-24 19:15 ` H. Peter Anvin 2014-02-24 19:30 ` Peter Zijlstra 0 siblings, 2 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-24 19:13 UTC (permalink / raw) To: H. Peter Anvin Cc: Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar, H.J. Lu On Mon, 24 Feb 2014 10:34:13 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 02/24/2014 10:07 AM, Vince Weaver wrote: > >> > >> Anyway I've attached the full tail end of the trace if you want to see > >> everything that happens. > > > > and then I note there are *two* kernel page faults. > > > > perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0 > > address=0x1 ip=0xffffffff812a7d9c error_code=0x0 > > perf_fuzzer-2979 [000] 161.475924: function: __do_page_fault > > perf_fuzzer-2979 [000] 161.475924: function: bad_area_nosemaphore > > perf_fuzzer-2979 [000] 161.475925: function: __bad_area_nosemaphore > > perf_fuzzer-2979 [000] 161.475925: function: no_context > > perf_fuzzer-2979 [000] 161.475925: function: fixup_exception > > perf_fuzzer-2979 [000] 161.475926: function: search_exception_tables > > perf_fuzzer-2979 [000] 161.475926: function: search_extable > > perf_fuzzer-2979 [000] 161.475927: function: copy_user_handle_tail > > perf_fuzzer-2979 [000] 161.475927: function: trace_do_page_fault > > perf_fuzzer-2979 [000] 161.475928: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0 > > address=0x1 ip=0xffffffff812a92bb error_code=0x0 > > perf_fuzzer-2979 [000] 161.475928: function: __do_page_fault > > perf_fuzzer-2979 [000] 161.475928: function: bad_area_nosemaphore > > perf_fuzzer-2979 [000] 161.475929: function: __bad_area_nosemaphore > > perf_fuzzer-2979 [000] 161.475929: function: no_context > > perf_fuzzer-2979 [000] 161.475929: function: fixup_exception > > perf_fuzzer-2979 [000] 161.475929: function: search_exception_tables > > perf_fuzzer-2979 [000] 161.475930: function: search_extable > > perf_fuzzer-2979 [000] 161.475931: function: perf_output_begin > > perf_fuzzer-2979 [000] 161.475931: function: perf_output_copy > > > > That second one is in copy_user_handle_tail() > > > > Either way, it really seems like we have a case of CR2 leakage out of > the NMI context. Ah, and x86_64 saves off the cr2 register when entering NMI and restores it before returning. But it seems to be missing from the i386 code. -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 19:13 ` Steven Rostedt @ 2014-02-24 19:15 ` H. Peter Anvin 2014-02-24 19:30 ` Peter Zijlstra 1 sibling, 0 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-24 19:15 UTC (permalink / raw) To: Steven Rostedt Cc: Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar, H.J. Lu On 02/24/2014 11:13 AM, Steven Rostedt wrote: >> >> Either way, it really seems like we have a case of CR2 leakage out of >> the NMI context. > > Ah, and x86_64 saves off the cr2 register when entering NMI and restores > it before returning. But it seems to be missing from the i386 code. > OK, that might be a problem, but this is the 64-bit kernel. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 19:13 ` Steven Rostedt 2014-02-24 19:15 ` H. Peter Anvin @ 2014-02-24 19:30 ` Peter Zijlstra 2014-02-24 19:32 ` Steven Rostedt 2014-02-25 3:49 ` H. Peter Anvin 1 sibling, 2 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-02-24 19:30 UTC (permalink / raw) To: Steven Rostedt Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, H.J. Lu On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote: > Ah, and x86_64 saves off the cr2 register when entering NMI and restores > it before returning. But it seems to be missing from the i386 code. arch/x86/kernel/nmi.c: #define nmi_nesting_preprocess(regs) \ do { \ if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \ this_cpu_write(nmi_state, NMI_LATCHED); \ return; \ } \ this_cpu_write(nmi_state, NMI_EXECUTING); \ this_cpu_write(nmi_cr2, read_cr2()); \ } while (0); \ nmi_restart: #define nmi_nesting_postprocess() \ do { \ if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \ write_cr2(this_cpu_read(nmi_cr2)); \ if (this_cpu_dec_return(nmi_state)) \ goto nmi_restart; \ } while (0) That very much looks like saving/restoring CR2 to me. FWIW; I hate how the x86_64 and i386 versions of this NMI nesting magic are so completely different. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 19:30 ` Peter Zijlstra @ 2014-02-24 19:32 ` Steven Rostedt 2014-02-25 3:49 ` H. Peter Anvin 1 sibling, 0 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-24 19:32 UTC (permalink / raw) To: Peter Zijlstra Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, H.J. Lu On Mon, 24 Feb 2014 20:30:43 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote: > > Ah, and x86_64 saves off the cr2 register when entering NMI and restores > > it before returning. But it seems to be missing from the i386 code. > > arch/x86/kernel/nmi.c: > > #define nmi_nesting_preprocess(regs) \ > do { \ > if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \ > this_cpu_write(nmi_state, NMI_LATCHED); \ > return; \ > } \ > this_cpu_write(nmi_state, NMI_EXECUTING); \ > this_cpu_write(nmi_cr2, read_cr2()); \ > } while (0); \ > nmi_restart: > > #define nmi_nesting_postprocess() \ > do { \ > if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \ > write_cr2(this_cpu_read(nmi_cr2)); \ > if (this_cpu_dec_return(nmi_state)) \ > goto nmi_restart; \ > } while (0) > > That very much looks like saving/restoring CR2 to me. > Ah, I forgot to look in the C file. > FWIW; I hate how the x86_64 and i386 versions of this NMI nesting magic > are so completely different. Yeah, it sucks. But that's because the architecture on handling NMIs is completely different. x86_64 swaps the stack automatically, where as i386 uses the same stack if you are already in the kernel. -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 19:30 ` Peter Zijlstra 2014-02-24 19:32 ` Steven Rostedt @ 2014-02-25 3:49 ` H. Peter Anvin 2014-02-25 14:07 ` Vince Weaver 1 sibling, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-25 3:49 UTC (permalink / raw) To: Peter Zijlstra, Steven Rostedt Cc: Vince Weaver, Linux Kernel, Ingo Molnar, H.J. Lu On 02/24/2014 11:30 AM, Peter Zijlstra wrote: > On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote: >> Ah, and x86_64 saves off the cr2 register when entering NMI and restores >> it before returning. But it seems to be missing from the i386 code. > > arch/x86/kernel/nmi.c: > > #define nmi_nesting_preprocess(regs) \ > do { \ > if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \ > this_cpu_write(nmi_state, NMI_LATCHED); \ > return; \ > } \ > this_cpu_write(nmi_state, NMI_EXECUTING); \ > this_cpu_write(nmi_cr2, read_cr2()); \ > } while (0); \ > nmi_restart: > > #define nmi_nesting_postprocess() \ > do { \ > if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \ > write_cr2(this_cpu_read(nmi_cr2)); \ > if (this_cpu_dec_return(nmi_state)) \ > goto nmi_restart; \ > } while (0) > > That very much looks like saving/restoring CR2 to me. > > FWIW; I hate how the x86_64 and i386 versions of this NMI nesting magic > are so completely different. Is there any way that nmi_cr2 can end up getting overwritten by multiple nestings of some kind? I would have thought it would have made more sense to spill cr2 onto the stack after the stack has been properly set up. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-25 3:49 ` H. Peter Anvin @ 2014-02-25 14:07 ` Vince Weaver 2014-02-25 14:34 ` H. Peter Anvin 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-25 14:07 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, Steven Rostedt, Vince Weaver, Linux Kernel, Ingo Molnar On Mon, 24 Feb 2014, H. Peter Anvin wrote: > On 02/24/2014 11:30 AM, Peter Zijlstra wrote: > > On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote: > >> Ah, and x86_64 saves off the cr2 register when entering NMI and restores > >> it before returning. But it seems to be missing from the i386 code. > > > > arch/x86/kernel/nmi.c: > > > > #define nmi_nesting_preprocess(regs) \ > > do { \ > > if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \ > > this_cpu_write(nmi_state, NMI_LATCHED); \ > > return; \ > > } \ > > this_cpu_write(nmi_state, NMI_EXECUTING); \ > > this_cpu_write(nmi_cr2, read_cr2()); \ > > } while (0); \ > > nmi_restart: > > > > #define nmi_nesting_postprocess() \ > > do { \ > > if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \ > > write_cr2(this_cpu_read(nmi_cr2)); \ > > if (this_cpu_dec_return(nmi_state)) \ > > goto nmi_restart; \ > > } while (0) > > > > That very much looks like saving/restoring CR2 to me. > > > > FWIW; I hate how the x86_64 and i386 versions of this NMI nesting magic > > are so completely different. > > Is there any way that nmi_cr2 can end up getting overwritten by multiple > nestings of some kind? I would have thought it would have made more > sense to spill cr2 onto the stack after the stack has been properly set up. So how can I help with debugging this? While the missing cr2 issue made debugging frustrating, I find the other aspects of the bug more serious: 1. Programs that are doing valid memory accesses can segfault and worse 2. This bug can cause an instant-reboot of the system (I assume somehow with the right combination of memory accesses it causes a triple-fault?) #2 is why I spent all of this time tracking this down, I couldn't leave a machine fuzzing overnight without the machine rebooting. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-25 14:07 ` Vince Weaver @ 2014-02-25 14:34 ` H. Peter Anvin 2014-02-25 14:43 ` Steven Rostedt 0 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-25 14:34 UTC (permalink / raw) To: Vince Weaver; +Cc: Peter Zijlstra, Steven Rostedt, Linux Kernel, Ingo Molnar #2 is what I really don't understand. I worry something else is going on there On February 25, 2014 6:07:51 AM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: >On Mon, 24 Feb 2014, H. Peter Anvin wrote: > >> On 02/24/2014 11:30 AM, Peter Zijlstra wrote: >> > On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote: >> >> Ah, and x86_64 saves off the cr2 register when entering NMI and >restores >> >> it before returning. But it seems to be missing from the i386 >code. >> > >> > arch/x86/kernel/nmi.c: >> > >> > #define nmi_nesting_preprocess(regs) \ >> > do { \ >> > if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \ >> > this_cpu_write(nmi_state, NMI_LATCHED); \ >> > return; \ >> > } \ >> > this_cpu_write(nmi_state, NMI_EXECUTING); \ >> > this_cpu_write(nmi_cr2, read_cr2()); \ >> > } while (0); \ >> > nmi_restart: >> > >> > #define nmi_nesting_postprocess() \ >> > do { \ >> > if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \ >> > write_cr2(this_cpu_read(nmi_cr2)); \ >> > if (this_cpu_dec_return(nmi_state)) \ >> > goto nmi_restart; \ >> > } while (0) >> > >> > That very much looks like saving/restoring CR2 to me. >> > >> > FWIW; I hate how the x86_64 and i386 versions of this NMI nesting >magic >> > are so completely different. >> >> Is there any way that nmi_cr2 can end up getting overwritten by >multiple >> nestings of some kind? I would have thought it would have made more >> sense to spill cr2 onto the stack after the stack has been properly >set up. > >So how can I help with debugging this? > >While the missing cr2 issue made debugging frustrating, I find the >other >aspects of the bug more serious: > > 1. Programs that are doing valid memory accesses can segfault >and worse >2. This bug can cause an instant-reboot of the system (I assume >somehow > with the right combination of memory accesses it causes a > triple-fault?) > >#2 is why I spent all of this time tracking this down, I couldn't leave >a >machine fuzzing overnight without the machine rebooting. > >Vince -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-25 14:34 ` H. Peter Anvin @ 2014-02-25 14:43 ` Steven Rostedt 2014-02-25 15:33 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-25 14:43 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar On Tue, 25 Feb 2014 06:34:55 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote: > #2 is what I really don't understand. > > I worry something else is going on there Yeah, me too. -- Steve > > > > >While the missing cr2 issue made debugging frustrating, I find the > >other > >aspects of the bug more serious: > > > > 1. Programs that are doing valid memory accesses can segfault > >and worse > >2. This bug can cause an instant-reboot of the system (I assume > >somehow > > with the right combination of memory accesses it causes a > > triple-fault?) > > > >#2 is why I spent all of this time tracking this down, I couldn't leave > >a > >machine fuzzing overnight without the machine rebooting. > > > >Vince > ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-25 14:43 ` Steven Rostedt @ 2014-02-25 15:33 ` Vince Weaver 2014-02-26 15:06 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-25 15:33 UTC (permalink / raw) To: Steven Rostedt Cc: H. Peter Anvin, Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar On Tue, 25 Feb 2014, Steven Rostedt wrote: > On Tue, 25 Feb 2014 06:34:55 -0800 > "H. Peter Anvin" <hpa@zytor.com> wrote: > > > #2 is what I really don't understand. > > > > I worry something else is going on there > > Yeah, me too. > OK, well I'll work on isolating that next, I was hoping the segfault issue was related. It's a lot harder getting traces out of a machine that insta-reboots so fast that it can't even finish printing the oops message. I need my testing machine for some unrelated experiments so I might not get a chance to work on this further for a day or two. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-25 15:33 ` Vince Weaver @ 2014-02-26 15:06 ` Vince Weaver 2014-02-27 22:06 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-26 15:06 UTC (permalink / raw) To: Vince Weaver Cc: Steven Rostedt, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Tue, 25 Feb 2014, Vince Weaver wrote: > On Tue, 25 Feb 2014, Steven Rostedt wrote: > > > On Tue, 25 Feb 2014 06:34:55 -0800 > > "H. Peter Anvin" <hpa@zytor.com> wrote: > > > > > #2 is what I really don't understand. > > > > > > I worry something else is going on there > > > > Yeah, me too. > > > > OK, well I'll work on isolating that next, I was hoping the segfault issue > was related. It's a lot harder getting traces out of a machine that > insta-reboots so fast that it can't even finish printing the oops message. OK, dug up the random seeds that are causing the system to reboot, and it turns out that it's hard to reproduce the actual reboot, but in the cases where it doesn't reboot it segfaults in the same code path that was causing the other segfaults in this thread. So I think it's all the same bug, although I can investigate further if needed. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-26 15:06 ` Vince Weaver @ 2014-02-27 22:06 ` Vince Weaver 2014-02-27 22:31 ` Steven Rostedt 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-27 22:06 UTC (permalink / raw) To: Vince Weaver Cc: Steven Rostedt, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar I spent some more time on this. I managed to get a trace that exhibited the bug practically right away, but still unable to generate a reproducible trace :( So instead I'm adding WARN's and trace_printks to see what I can find out. Here's a summary of what I think is happneing. Please let me know if I'm wildly wrong in analyzing this. Userspace accesses the perf ring-buffer-user-page of an event for the first time (in this trace it's a SW_CPU_CLOCK event). This triggers a page_fault to bring in the page. This triggers a TRACEPOINT event (task/task_newtask) which has PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_ID|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_STREAM_ID|PERF_SAMPLE_WEIGHT|PERF_SAMPLE_DATA_SRC|PERF_SAMPLE_TRANSACTION; and exclude_callchain_kernel=1, triggering a perf_callchain(). Here is a dump of the stack right after we enter perf_callchain() vince@core2:~$ [ 202.320444] ------------[ cut here ]------------ [ 202.324001] WARNING: CPU: 1 PID: 2873 at kernel/events/callchain.c:168 perf_callchain+0x67/0x211() hermal_sys ehci_pci ehci_hcd sg sd_mod usbcore usb_common [ 202.324001] CPU: 1 PID: 2873 Comm: perf_fuzzer Not tainted 3.14.0-rc3+ #24 [ 202.324001] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012 [ 202.324001] 00000000000000a8 ffff880119ec1b48 ffffffff81531dc1 00000000000000a8 [ 202.324001] 0000000000000000 ffff880119ec1b88 ffffffff81040ce4 ffff880119ec1b88 [ 202.324001] ffffffff810d416f ffff880119ec1e38 ffff8800cba67800 0000000000000000 [ 202.324001] Call Trace: [ 202.324001] [<ffffffff81531dc1>] dump_stack+0x49/0x60 [ 202.324001] [<ffffffff81040ce4>] warn_slowpath_common+0x81/0x9b [ 202.324001] [<ffffffff810d416f>] ? perf_callchain+0x67/0x211 [ 202.324001] [<ffffffff81040d18>] warn_slowpath_null+0x1a/0x1c [ 202.324001] [<ffffffff810d416f>] perf_callchain+0x67/0x211 [ 202.324001] [<ffffffff8106b0ab>] ? local_clock+0x1b/0x24 [ 202.324001] [<ffffffff810d0872>] perf_prepare_sample+0x7b/0x304 [ 202.324001] [<ffffffff810d1079>] __perf_event_overflow+0x156/0x1c1 [ 202.324001] [<ffffffff810f4613>] ? free_pgtables+0xa7/0xc9 [ 202.324001] [<ffffffff810d125e>] perf_swevent_overflow+0x41/0x5b [ 202.324001] [<ffffffff810d12ea>] perf_swevent_event+0x72/0x74 [ 202.324001] [<ffffffff810d1478>] perf_tp_event+0xea/0x1ef [ 202.324001] [<ffffffff8103515a>] ? perf_trace_x86_exceptions+0x4c/0xba [ 202.324001] [<ffffffff810351b6>] perf_trace_x86_exceptions+0xa8/0xba [ 202.324001] [<ffffffff8103515a>] ? perf_trace_x86_exceptions+0x4c/0xba [ 202.324001] [<ffffffff81538439>] trace_do_page_fault+0x48/0x99 [ 202.324001] [<ffffffff815354b2>] trace_page_fault+0x22/0x30 [ 202.324001] ---[ end trace 31ccd31b4e82cb42 ]--- This triggers a kernel pagefault because the BP register is not valid and copy_from_user_nmi() tries to copy the user stack area from there. This *should* be OK? And it maybe looks like it works, but then copy_user_handle_tail() causes another page fault again at the invaid BP register. But then the code continues to perf_output_begin() 237.265689: page_fault_kernel: address=0x17a0 ip=copy_user_handle_tail error_code=0x0 237.265689: function: __do_page_fault 237.265689: function: bad_area_nosemaphore 237.265690: function: __bad_area_nosemaphore 237.265690: function: no_context 237.265690: function: fixup_exception 237.265690: function: search_exception_tables 237.265690: function: search_extable 237.265691: function: perf_output_begin 237.265692: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e7 how are we back in __do_page_fault again here? 237.265692: function: __do_page_fault 237.265692: function: down_read_trylock 237.265692: function: _cond_resched 237.265693: function: find_vma 237.265693: function: bad_area 237.265693: function: up_read 237.265693: function: __bad_area_nosemaphore 237.265694: function: is_prefetch 237.265694: function: convert_ip_to_linear 237.265695: function: unhandled_signal 237.265695: function: __printk_ratelimit 237.265695: function: _raw_spin_trylock 237.265695: function: _raw_spin_unlock_irqrestore 237.265696: function: printk 237.265696: function: vprintk_emit [ 202.877004] perf_fuzzer[2873]: segfault at 17a0 ip 00000000004017fd sp 00000000ffd19d10 error 6 in perf_fuzzer[400000+d1000] Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-27 22:06 ` Vince Weaver @ 2014-02-27 22:31 ` Steven Rostedt 2014-02-27 22:52 ` H. Peter Anvin 2014-02-28 9:39 ` Peter Zijlstra 0 siblings, 2 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-27 22:31 UTC (permalink / raw) To: Vince Weaver; +Cc: H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Thu, 27 Feb 2014 17:06:36 -0500 (EST) Vince Weaver <vincent.weaver@maine.edu> wrote: > > I spent some more time on this. > I managed to get a trace that exhibited the bug practically right > away, but still unable to generate a reproducible trace :( > > So instead I'm adding WARN's and trace_printks to see what I can find out. > > Here's a summary of what I think is happneing. Please let me know if I'm > wildly wrong in analyzing this. > > > > Userspace accesses the perf > ring-buffer-user-page of an event for the first time (in this trace > it's a SW_CPU_CLOCK event). > > This triggers a page_fault to bring in the page. > > This triggers a TRACEPOINT event (task/task_newtask) which > has PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_ID|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_STREAM_ID|PERF_SAMPLE_WEIGHT|PERF_SAMPLE_DATA_SRC|PERF_SAMPLE_TRANSACTION; > and exclude_callchain_kernel=1, triggering a perf_callchain(). > > Here is a dump of the stack right after we enter perf_callchain() > > vince@core2:~$ [ 202.320444] ------------[ cut here ]------------ > [ 202.324001] WARNING: CPU: 1 PID: 2873 at kernel/events/callchain.c:168 perf_callchain+0x67/0x211() > hermal_sys ehci_pci ehci_hcd sg sd_mod usbcore usb_common > [ 202.324001] CPU: 1 PID: 2873 Comm: perf_fuzzer Not tainted 3.14.0-rc3+ #24 > [ 202.324001] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012 > [ 202.324001] 00000000000000a8 ffff880119ec1b48 ffffffff81531dc1 00000000000000a8 > [ 202.324001] 0000000000000000 ffff880119ec1b88 ffffffff81040ce4 ffff880119ec1b88 > [ 202.324001] ffffffff810d416f ffff880119ec1e38 ffff8800cba67800 0000000000000000 > [ 202.324001] Call Trace: > [ 202.324001] [<ffffffff81531dc1>] dump_stack+0x49/0x60 > [ 202.324001] [<ffffffff81040ce4>] warn_slowpath_common+0x81/0x9b > [ 202.324001] [<ffffffff810d416f>] ? perf_callchain+0x67/0x211 > [ 202.324001] [<ffffffff81040d18>] warn_slowpath_null+0x1a/0x1c > [ 202.324001] [<ffffffff810d416f>] perf_callchain+0x67/0x211 > [ 202.324001] [<ffffffff8106b0ab>] ? local_clock+0x1b/0x24 > [ 202.324001] [<ffffffff810d0872>] perf_prepare_sample+0x7b/0x304 > [ 202.324001] [<ffffffff810d1079>] __perf_event_overflow+0x156/0x1c1 > [ 202.324001] [<ffffffff810f4613>] ? free_pgtables+0xa7/0xc9 > [ 202.324001] [<ffffffff810d125e>] perf_swevent_overflow+0x41/0x5b > [ 202.324001] [<ffffffff810d12ea>] perf_swevent_event+0x72/0x74 > [ 202.324001] [<ffffffff810d1478>] perf_tp_event+0xea/0x1ef > [ 202.324001] [<ffffffff8103515a>] ? perf_trace_x86_exceptions+0x4c/0xba > [ 202.324001] [<ffffffff810351b6>] perf_trace_x86_exceptions+0xa8/0xba > [ 202.324001] [<ffffffff8103515a>] ? perf_trace_x86_exceptions+0x4c/0xba > [ 202.324001] [<ffffffff81538439>] trace_do_page_fault+0x48/0x99 > [ 202.324001] [<ffffffff815354b2>] trace_page_fault+0x22/0x30 > [ 202.324001] ---[ end trace 31ccd31b4e82cb42 ]--- > > This triggers a kernel pagefault because the BP register is not valid and > copy_from_user_nmi() tries to copy the user stack area from there. > > This *should* be OK? And it maybe looks like it works, but then > > copy_user_handle_tail() > > causes another page fault again at the invaid BP register. > > But then the code continues to > perf_output_begin() > > > 237.265689: page_fault_kernel: address=0x17a0 ip=copy_user_handle_tail error_code=0x0 > 237.265689: function: __do_page_fault > 237.265689: function: bad_area_nosemaphore > 237.265690: function: __bad_area_nosemaphore > 237.265690: function: no_context > 237.265690: function: fixup_exception > 237.265690: function: search_exception_tables > 237.265690: function: search_extable > 237.265691: function: perf_output_begin You can also allow perf to be traced. Remove the CFLAGS_REMOVE_core.o line from kernel/events/Makefile. > 237.265692: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e7 > > how are we back in __do_page_fault again here? Well, the perf ring buffer is vmalloced, right? That can cause a page fault too. > > 237.265692: function: __do_page_fault Try adding a trace_printk() to see all the page faults? > 237.265692: function: down_read_trylock > 237.265692: function: _cond_resched > 237.265693: function: find_vma > 237.265693: function: bad_area > 237.265693: function: up_read > 237.265693: function: __bad_area_nosemaphore > 237.265694: function: is_prefetch > 237.265694: function: convert_ip_to_linear > 237.265695: function: unhandled_signal > 237.265695: function: __printk_ratelimit > 237.265695: function: _raw_spin_trylock > 237.265695: function: _raw_spin_unlock_irqrestore > 237.265696: function: printk > 237.265696: function: vprintk_emit > > [ 202.877004] perf_fuzzer[2873]: segfault at 17a0 ip 00000000004017fd sp 00000000ffd19d10 error 6 in perf_fuzzer[400000+d1000] > Yeah, something is getting mesed up. -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-27 22:31 ` Steven Rostedt @ 2014-02-27 22:52 ` H. Peter Anvin 2014-02-27 23:30 ` Steven Rostedt 2014-02-28 1:34 ` Vince Weaver 2014-02-28 9:39 ` Peter Zijlstra 1 sibling, 2 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-27 22:52 UTC (permalink / raw) To: Steven Rostedt, Vince Weaver; +Cc: Peter Zijlstra, Linux Kernel, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 494 bytes --] On 02/27/2014 02:31 PM, Steven Rostedt wrote: > > Yeah, something is getting mesed up. > What it *looks* like to me is that we try to nest the cr2 save/restore, which doesn't nest because it is a percpu variable. ... except in the x86-64 case, we *ALSO* save/restore cr2 inside entry_64.S, which makes the stuff in do_nmi completely redundant and there for no good reason. I would actually suggest we do the equivalent on i386 as well. Vince, could you try this patch as an experiment? [-- Attachment #2: patch --] [-- Type: text/plain, Size: 952 bytes --] diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c index ddf9ecb..938e45c 100644 --- a/arch/x86/lib/usercopy.c +++ b/arch/x86/lib/usercopy.c @@ -10,6 +10,8 @@ #include <asm/word-at-a-time.h> #include <linux/sched.h> +#include <asm/processor.h> + /* * We rely on the nested NMI work to allow atomic faults from the NMI path; the * nested NMI paths are careful to preserve CR2. @@ -18,6 +20,7 @@ unsigned long copy_from_user_nmi(void *to, const void __user *from, unsigned long n) { unsigned long ret; + unsigned long cr2; if (__range_not_ok(from, n, TASK_SIZE)) return 0; @@ -27,9 +30,11 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n) * disable pagefaults so that its behaviour is consistent even when * called form other contexts. */ + cr2 = read_cr2(); pagefault_disable(); ret = __copy_from_user_inatomic(to, from, n); pagefault_enable(); + write_cr2(cr2); return ret; } ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-27 22:52 ` H. Peter Anvin @ 2014-02-27 23:30 ` Steven Rostedt 2014-02-27 23:46 ` H. Peter Anvin 2014-02-28 1:34 ` Vince Weaver 1 sibling, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-27 23:30 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar On Thu, 27 Feb 2014 14:52:54 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote: > On 02/27/2014 02:31 PM, Steven Rostedt wrote: > > > > Yeah, something is getting mesed up. > > > > What it *looks* like to me is that we try to nest the cr2 save/restore, > which doesn't nest because it is a percpu variable. > > ... except in the x86-64 case, we *ALSO* save/restore cr2 inside > entry_64.S, which makes the stuff in do_nmi completely redundant and > there for no good reason. Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32 section. That is, it isn't even executed. That's i386 code. The only place the cr2 is saved for x86_64 is in entry_64.S. -- Steve > > I would actually suggest we do the equivalent on i386 as well. > > Vince, could you try this patch as an experiment? > > ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-27 23:30 ` Steven Rostedt @ 2014-02-27 23:46 ` H. Peter Anvin 2014-02-28 1:00 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-27 23:46 UTC (permalink / raw) To: Steven Rostedt; +Cc: Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar On 02/27/2014 03:30 PM, Steven Rostedt wrote: > On Thu, 27 Feb 2014 14:52:54 -0800 > "H. Peter Anvin" <hpa@zytor.com> wrote: > >> On 02/27/2014 02:31 PM, Steven Rostedt wrote: >>> >>> Yeah, something is getting mesed up. >>> >> >> What it *looks* like to me is that we try to nest the cr2 save/restore, >> which doesn't nest because it is a percpu variable. >> >> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside >> entry_64.S, which makes the stuff in do_nmi completely redundant and >> there for no good reason. > > Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32 > section. That is, it isn't even executed. That's i386 code. The only > place the cr2 is saved for x86_64 is in entry_64.S. > Right, egg on my face. However, I still think it would make more sense for it to nest the way entry_64.S does if at all possible. That makes this even more confusing, though. I would still like to see what happens with the patch I sent Vince. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-27 23:46 ` H. Peter Anvin @ 2014-02-28 1:00 ` Vince Weaver 2014-02-28 20:34 ` Paul E. McKenney 0 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-28 1:00 UTC (permalink / raw) To: H. Peter Anvin Cc: Steven Rostedt, Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar On Thu, 27 Feb 2014, H. Peter Anvin wrote: > On 02/27/2014 03:30 PM, Steven Rostedt wrote: > > On Thu, 27 Feb 2014 14:52:54 -0800 > > "H. Peter Anvin" <hpa@zytor.com> wrote: > > > >> On 02/27/2014 02:31 PM, Steven Rostedt wrote: > >>> > >>> Yeah, something is getting mesed up. > >>> > >> > >> What it *looks* like to me is that we try to nest the cr2 save/restore, > >> which doesn't nest because it is a percpu variable. > >> > >> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside > >> entry_64.S, which makes the stuff in do_nmi completely redundant and > >> there for no good reason. > > > > Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32 > > section. That is, it isn't even executed. That's i386 code. The only > > place the cr2 is saved for x86_64 is in entry_64.S. > > > > Right, egg on my face. However, I still think it would make more sense > for it to nest the way entry_64.S does if at all possible. > > That makes this even more confusing, though. I would still like to see > what happens with the patch I sent Vince. I'll try your patch momentarily, first I had some other changes I started running before I left work (for some reason it recompiled the whole kernel). 8: function: perf_output_begin 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e 8: bputs: perf_output_begin: VMW: before rcu_dereference 9: function: __do_page_fault 9: function: down_read_trylock 9: function: _cond_resched 9: function: find_vma so it looks like the fault happens rcu_read_lock(); 116 /* 117 * For inherited events we send all the output towards the parent. 118 */ 119 if (event->parent) 120 event = event->parent; 121 somewhere between here 122 rb = rcu_dereference(event->rb); 123 if (unlikely(!rb)) 124 goto out; and here 125 126 if (unlikely(!rb->nr_pages)) 127 goto out; although if rcu locks do anything to turn off tracing then this could be suspect. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 1:00 ` Vince Weaver @ 2014-02-28 20:34 ` Paul E. McKenney 2014-02-28 20:47 ` Steven Rostedt 0 siblings, 1 reply; 115+ messages in thread From: Paul E. McKenney @ 2014-02-28 20:34 UTC (permalink / raw) To: Vince Weaver Cc: H. Peter Anvin, Steven Rostedt, Peter Zijlstra, Linux Kernel, Ingo Molnar On Thu, Feb 27, 2014 at 08:00:04PM -0500, Vince Weaver wrote: > On Thu, 27 Feb 2014, H. Peter Anvin wrote: > > > On 02/27/2014 03:30 PM, Steven Rostedt wrote: > > > On Thu, 27 Feb 2014 14:52:54 -0800 > > > "H. Peter Anvin" <hpa@zytor.com> wrote: > > > > > >> On 02/27/2014 02:31 PM, Steven Rostedt wrote: > > >>> > > >>> Yeah, something is getting mesed up. > > >>> > > >> > > >> What it *looks* like to me is that we try to nest the cr2 save/restore, > > >> which doesn't nest because it is a percpu variable. > > >> > > >> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside > > >> entry_64.S, which makes the stuff in do_nmi completely redundant and > > >> there for no good reason. > > > > > > Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32 > > > section. That is, it isn't even executed. That's i386 code. The only > > > place the cr2 is saved for x86_64 is in entry_64.S. > > > > > > > Right, egg on my face. However, I still think it would make more sense > > for it to nest the way entry_64.S does if at all possible. > > > > That makes this even more confusing, though. I would still like to see > > what happens with the patch I sent Vince. > > I'll try your patch momentarily, first I had some other changes I started > running before I left work (for some reason it recompiled the whole > kernel). > > 8: function: perf_output_begin > 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e > 8: bputs: perf_output_begin: VMW: before rcu_dereference > 9: function: __do_page_fault > 9: function: down_read_trylock > 9: function: _cond_resched > 9: function: find_vma > > so it looks like the fault happens > > rcu_read_lock(); > > 116 /* > 117 * For inherited events we send all the output towards the parent. > 118 */ > 119 if (event->parent) > 120 event = event->parent; > 121 > > somewhere between here > > 122 rb = rcu_dereference(event->rb); > 123 if (unlikely(!rb)) > 124 goto out; > > and here > > 125 > 126 if (unlikely(!rb->nr_pages)) > 127 goto out; > > although if rcu locks do anything to turn off tracing then this could be > suspect. The most likely suspect is of course event->rb in the rcu_dereference. I have to defer to Steven on how rcu_read_lock() and rcu_read_unlock() currently interact with tracing. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 20:34 ` Paul E. McKenney @ 2014-02-28 20:47 ` Steven Rostedt 2014-02-28 20:54 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 20:47 UTC (permalink / raw) To: paulmck Cc: Vince Weaver, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014 12:34:05 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Thu, Feb 27, 2014 at 08:00:04PM -0500, Vince Weaver wrote: > > On Thu, 27 Feb 2014, H. Peter Anvin wrote: > > > > > On 02/27/2014 03:30 PM, Steven Rostedt wrote: > > > > On Thu, 27 Feb 2014 14:52:54 -0800 > > > > "H. Peter Anvin" <hpa@zytor.com> wrote: > > > > > > > >> On 02/27/2014 02:31 PM, Steven Rostedt wrote: > > > >>> > > > >>> Yeah, something is getting mesed up. > > > >>> > > > >> > > > >> What it *looks* like to me is that we try to nest the cr2 save/restore, > > > >> which doesn't nest because it is a percpu variable. > > > >> > > > >> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside > > > >> entry_64.S, which makes the stuff in do_nmi completely redundant and > > > >> there for no good reason. > > > > > > > > Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32 > > > > section. That is, it isn't even executed. That's i386 code. The only > > > > place the cr2 is saved for x86_64 is in entry_64.S. > > > > > > > > > > Right, egg on my face. However, I still think it would make more sense > > > for it to nest the way entry_64.S does if at all possible. > > > > > > That makes this even more confusing, though. I would still like to see > > > what happens with the patch I sent Vince. > > > > I'll try your patch momentarily, first I had some other changes I started > > running before I left work (for some reason it recompiled the whole > > kernel). > > > > 8: function: perf_output_begin > > 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e > > 8: bputs: perf_output_begin: VMW: before rcu_dereference > > 9: function: __do_page_fault > > 9: function: down_read_trylock > > 9: function: _cond_resched > > 9: function: find_vma > > > > so it looks like the fault happens > > > > rcu_read_lock(); > > > > 116 /* > > 117 * For inherited events we send all the output towards the parent. > > 118 */ > > 119 if (event->parent) > > 120 event = event->parent; > > 121 > > > > somewhere between here > > > > 122 rb = rcu_dereference(event->rb); > > 123 if (unlikely(!rb)) > > 124 goto out; > > > > and here > > > > 125 > > 126 if (unlikely(!rb->nr_pages)) > > 127 goto out; > > > > although if rcu locks do anything to turn off tracing then this could be > > suspect. > > The most likely suspect is of course event->rb in the rcu_dereference. > I have to defer to Steven on how rcu_read_lock() and rcu_read_unlock() > currently interact with tracing. ;-) These are all perf related. You'll need to defer to Peter Zijlstra ;-) -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 20:47 ` Steven Rostedt @ 2014-02-28 20:54 ` Peter Zijlstra 2014-02-28 21:17 ` Paul E. McKenney 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 20:54 UTC (permalink / raw) To: Steven Rostedt Cc: paulmck, Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Fri, Feb 28, 2014 at 03:47:16PM -0500, Steven Rostedt wrote: > > > I'll try your patch momentarily, first I had some other changes I started > > > running before I left work (for some reason it recompiled the whole > > > kernel). > > > > > > 8: function: perf_output_begin > > > 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e > > > 8: bputs: perf_output_begin: VMW: before rcu_dereference > > > 9: function: __do_page_fault > > > 9: function: down_read_trylock > > > 9: function: _cond_resched > > > 9: function: find_vma > > > > > > so it looks like the fault happens > > > > > > rcu_read_lock(); > > > > > > 116 /* > > > 117 * For inherited events we send all the output towards the parent. > > > 118 */ > > > 119 if (event->parent) > > > 120 event = event->parent; > > > 121 > > > > > > somewhere between here > > > > > > 122 rb = rcu_dereference(event->rb); > > > 123 if (unlikely(!rb)) > > > 124 goto out; > > > > > > and here > > > > > > 125 > > > 126 if (unlikely(!rb->nr_pages)) > > > 127 goto out; > > > > > > although if rcu locks do anything to turn off tracing then this could be > > > suspect. > > > > The most likely suspect is of course event->rb in the rcu_dereference. > > I have to defer to Steven on how rcu_read_lock() and rcu_read_unlock() > > currently interact with tracing. ;-) > > These are all perf related. You'll need to defer to Peter Zijlstra ;-) I'm lost.. :/ pretty much all perf objects are RCU freed. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 20:54 ` Peter Zijlstra @ 2014-02-28 21:17 ` Paul E. McKenney 2014-02-28 21:27 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Paul E. McKenney @ 2014-02-28 21:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Fri, Feb 28, 2014 at 09:54:09PM +0100, Peter Zijlstra wrote: > On Fri, Feb 28, 2014 at 03:47:16PM -0500, Steven Rostedt wrote: > > > > I'll try your patch momentarily, first I had some other changes I started > > > > running before I left work (for some reason it recompiled the whole > > > > kernel). > > > > > > > > 8: function: perf_output_begin > > > > 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e > > > > 8: bputs: perf_output_begin: VMW: before rcu_dereference > > > > 9: function: __do_page_fault > > > > 9: function: down_read_trylock > > > > 9: function: _cond_resched > > > > 9: function: find_vma > > > > > > > > so it looks like the fault happens > > > > > > > > rcu_read_lock(); > > > > > > > > 116 /* > > > > 117 * For inherited events we send all the output towards the parent. > > > > 118 */ > > > > 119 if (event->parent) > > > > 120 event = event->parent; > > > > 121 > > > > > > > > somewhere between here > > > > > > > > 122 rb = rcu_dereference(event->rb); > > > > 123 if (unlikely(!rb)) > > > > 124 goto out; > > > > > > > > and here > > > > > > > > 125 > > > > 126 if (unlikely(!rb->nr_pages)) > > > > 127 goto out; > > > > > > > > although if rcu locks do anything to turn off tracing then this could be > > > > suspect. > > > > > > The most likely suspect is of course event->rb in the rcu_dereference. > > > I have to defer to Steven on how rcu_read_lock() and rcu_read_unlock() > > > currently interact with tracing. ;-) > > > > These are all perf related. You'll need to defer to Peter Zijlstra ;-) > > I'm lost.. :/ > > pretty much all perf objects are RCU freed. This code isn't running in idle context is it? If so, RCU will happily free out from under it. CONFIG_PROVE_RCU should detect this sort of thing, though. Thanx, Paul ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 21:17 ` Paul E. McKenney @ 2014-02-28 21:27 ` Peter Zijlstra 2014-02-28 21:51 ` Paul E. McKenney 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 21:27 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote: > This code isn't running in idle context is it? If so, RCU will happily > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing, > though. Well, interrupts/NMIs can happen when idle, but the interrupt/NMI entry code deals with the idle state AFAIK. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 21:27 ` Peter Zijlstra @ 2014-02-28 21:51 ` Paul E. McKenney 2014-02-28 21:55 ` Peter Zijlstra 0 siblings, 1 reply; 115+ messages in thread From: Paul E. McKenney @ 2014-02-28 21:51 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Fri, Feb 28, 2014 at 10:27:00PM +0100, Peter Zijlstra wrote: > On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote: > > This code isn't running in idle context is it? If so, RCU will happily > > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing, > > though. > > Well, interrupts/NMIs can happen when idle, but the interrupt/NMI > entry code deals with the idle state AFAIK. Yep, rcu_irq_enter() and rcu_nmi_enter() deal with that. More worried about this being code invoked from some energy-efficiency driver or another within the idle loop. Thanx, Paul ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 21:51 ` Paul E. McKenney @ 2014-02-28 21:55 ` Peter Zijlstra 2014-02-28 22:05 ` Steven Rostedt 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 21:55 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Fri, Feb 28, 2014 at 01:51:50PM -0800, Paul E. McKenney wrote: > On Fri, Feb 28, 2014 at 10:27:00PM +0100, Peter Zijlstra wrote: > > On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote: > > > This code isn't running in idle context is it? If so, RCU will happily > > > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing, > > > though. > > > > Well, interrupts/NMIs can happen when idle, but the interrupt/NMI > > entry code deals with the idle state AFAIK. > > Yep, rcu_irq_enter() and rcu_nmi_enter() deal with that. More worried > about this being code invoked from some energy-efficiency driver or > another within the idle loop. Right, so any tracepoint can end up there; but I thought there was already the rule that tracepoints needed RCU enabled. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 21:55 ` Peter Zijlstra @ 2014-02-28 22:05 ` Steven Rostedt 2014-02-28 22:23 ` Paul E. McKenney 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 22:05 UTC (permalink / raw) To: Peter Zijlstra Cc: Paul E. McKenney, Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014 22:55:11 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Feb 28, 2014 at 01:51:50PM -0800, Paul E. McKenney wrote: > > On Fri, Feb 28, 2014 at 10:27:00PM +0100, Peter Zijlstra wrote: > > > On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote: > > > > This code isn't running in idle context is it? If so, RCU will happily > > > > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing, > > > > though. > > > > > > Well, interrupts/NMIs can happen when idle, but the interrupt/NMI > > > entry code deals with the idle state AFAIK. > > > > Yep, rcu_irq_enter() and rcu_nmi_enter() deal with that. More worried > > about this being code invoked from some energy-efficiency driver or > > another within the idle loop. > > Right, so any tracepoint can end up there; but I thought there was > already the rule that tracepoints needed RCU enabled. There is and we have special tracepoint caller for those cases we want a tracepoint out of RCU scope. These will reactivate rcu in the tracepoint code. trace_<tp_name>_rcuidle(...) -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 22:05 ` Steven Rostedt @ 2014-02-28 22:23 ` Paul E. McKenney 0 siblings, 0 replies; 115+ messages in thread From: Paul E. McKenney @ 2014-02-28 22:23 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Fri, Feb 28, 2014 at 05:05:53PM -0500, Steven Rostedt wrote: > On Fri, 28 Feb 2014 22:55:11 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Fri, Feb 28, 2014 at 01:51:50PM -0800, Paul E. McKenney wrote: > > > On Fri, Feb 28, 2014 at 10:27:00PM +0100, Peter Zijlstra wrote: > > > > On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote: > > > > > This code isn't running in idle context is it? If so, RCU will happily > > > > > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing, > > > > > though. > > > > > > > > Well, interrupts/NMIs can happen when idle, but the interrupt/NMI > > > > entry code deals with the idle state AFAIK. > > > > > > Yep, rcu_irq_enter() and rcu_nmi_enter() deal with that. More worried > > > about this being code invoked from some energy-efficiency driver or > > > another within the idle loop. > > > > Right, so any tracepoint can end up there; but I thought there was > > already the rule that tracepoints needed RCU enabled. > > There is and we have special tracepoint caller for those cases we want a > tracepoint out of RCU scope. These will reactivate rcu in the > tracepoint code. > > trace_<tp_name>_rcuidle(...) OK, I finally looked at the emails leading up to this in the thread... I believe that I am doing premature debugging. Thanx, Paul ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-27 22:52 ` H. Peter Anvin 2014-02-27 23:30 ` Steven Rostedt @ 2014-02-28 1:34 ` Vince Weaver 2014-02-28 2:17 ` H. Peter Anvin 2014-02-28 2:57 ` Steven Rostedt 1 sibling, 2 replies; 115+ messages in thread From: Vince Weaver @ 2014-02-28 1:34 UTC (permalink / raw) To: H. Peter Anvin Cc: Steven Rostedt, Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar On Thu, 27 Feb 2014, H. Peter Anvin wrote: > On 02/27/2014 02:31 PM, Steven Rostedt wrote: > > > > Yeah, something is getting mesed up. > > > > What it *looks* like to me is that we try to nest the cr2 save/restore, > which doesn't nest because it is a percpu variable. > > ... except in the x86-64 case, we *ALSO* save/restore cr2 inside > entry_64.S, which makes the stuff in do_nmi completely redundant and > there for no good reason. > > I would actually suggest we do the equivalent on i386 as well. > > Vince, could you try this patch as an experiment? OK with your patch applied it does not segfault. The trace looks like you'd expect too: perf_fuzzer-2910 [000] 255.355331: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0 perf_fuzzer-2910 [000] 255.355331: function: __do_page_fault perf_fuzzer-2910 [000] 255.355331: function: bad_area_nosemaphore perf_fuzzer-2910 [000] 255.355331: function: __bad_area_nosemaphore perf_fuzzer-2910 [000] 255.355332: function: no_context perf_fuzzer-2910 [000] 255.355332: function: fixup_exception perf_fuzzer-2910 [000] 255.355332: function: search_exception_tables perf_fuzzer-2910 [000] 255.355332: function: search_extable perf_fuzzer-2910 [000] 255.355333: function: perf_output_begin perf_fuzzer-2910 [000] 255.355333: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e7 perf_fuzzer-2910 [000] 255.355333: bprint: perf_output_begin: VMW: before event->parent (nil) perf_fuzzer-2910 [000] 255.355334: bprint: perf_output_begin: VMW: before rcu_dereference (nil) perf_fuzzer-2910 [000] 255.355334: function: __do_page_fault perf_fuzzer-2910 [000] 255.355334: function: down_read_trylock perf_fuzzer-2910 [000] 255.355334: function: _cond_resched perf_fuzzer-2910 [000] 255.355335: function: find_vma perf_fuzzer-2910 [000] 255.355335: function: handle_mm_fault perf_fuzzer-2910 [000] 255.355335: function: __do_fault perf_fuzzer-2910 [000] 255.355335: bputs: perf_mmap_fault: VMW: perf_mmap_fault perf_fuzzer-2910 [000] 255.355336: bprint: perf_mmap_fault: VMW: perf_mmap_fault 0xffff8800cba6a980 perf_fuzzer-2910 [000] 255.355336: function: perf_mmap_to_page perf_fuzzer-2910 [000] 255.355336: function: _cond_resched perf_fuzzer-2910 [000] 255.355336: function: unlock_page perf_fuzzer-2910 [000] 255.355336: function: page_waitqueue perf_fuzzer-2910 [000] 255.355337: function: __wake_up_bit perf_fuzzer-2910 [000] 255.355337: bputs: perf_mmap_fault: VMW: perf_mmap_fault perf_fuzzer-2910 [000] 255.355337: function: _cond_resched perf_fuzzer-2910 [000] 255.355337: function: _raw_spin_lock perf_fuzzer-2910 [000] 255.355337: function: page_add_file_rmap perf_fuzzer-2910 [000] 255.355338: function: __inc_zone_page_state perf_fuzzer-2910 [000] 255.355338: function: __inc_zone_state perf_fuzzer-2910 [000] 255.355338: function: set_page_dirty perf_fuzzer-2910 [000] 255.355338: function: page_mapping perf_fuzzer-2910 [000] 255.355338: function: anon_set_page_dirty perf_fuzzer-2910 [000] 255.355339: function: unlock_page Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 1:34 ` Vince Weaver @ 2014-02-28 2:17 ` H. Peter Anvin 2014-02-28 2:57 ` Steven Rostedt 1 sibling, 0 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-28 2:17 UTC (permalink / raw) To: Vince Weaver; +Cc: Steven Rostedt, Peter Zijlstra, Linux Kernel, Ingo Molnar Ok... I think we're definitely talking about a cr2 leak. The reboot might be a race condition in the NMI nesting handling maybe? On February 27, 2014 5:34:34 PM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: >On Thu, 27 Feb 2014, H. Peter Anvin wrote: > >> On 02/27/2014 02:31 PM, Steven Rostedt wrote: >> > >> > Yeah, something is getting mesed up. >> > >> >> What it *looks* like to me is that we try to nest the cr2 >save/restore, >> which doesn't nest because it is a percpu variable. >> >> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside >> entry_64.S, which makes the stuff in do_nmi completely redundant and >> there for no good reason. >> >> I would actually suggest we do the equivalent on i386 as well. >> >> Vince, could you try this patch as an experiment? > >OK with your patch applied it does not segfault. > >The trace looks like you'd expect too: > >perf_fuzzer-2910 [000] 255.355331: page_fault_kernel: >address=irq_stack_union ip=copy_user_handle_tail error_code=0x0 >perf_fuzzer-2910 [000] 255.355331: function: >__do_page_fault >perf_fuzzer-2910 [000] 255.355331: function: >bad_area_nosemaphore >perf_fuzzer-2910 [000] 255.355331: function: >__bad_area_nosemaphore >perf_fuzzer-2910 [000] 255.355332: function: >no_context >perf_fuzzer-2910 [000] 255.355332: function: > fixup_exception >perf_fuzzer-2910 [000] 255.355332: function: > search_exception_tables >perf_fuzzer-2910 [000] 255.355332: function: > search_extable >perf_fuzzer-2910 [000] 255.355333: function: >perf_output_begin >perf_fuzzer-2910 [000] 255.355333: bprint: >perf_output_begin: VMW: event type 2 config 2a st: 2c3e7 >perf_fuzzer-2910 [000] 255.355333: bprint: >perf_output_begin: VMW: before event->parent (nil) >perf_fuzzer-2910 [000] 255.355334: bprint: >perf_output_begin: VMW: before rcu_dereference (nil) >perf_fuzzer-2910 [000] 255.355334: function: >__do_page_fault >perf_fuzzer-2910 [000] 255.355334: function: >down_read_trylock >perf_fuzzer-2910 [000] 255.355334: function: >_cond_resched >perf_fuzzer-2910 [000] 255.355335: function: find_vma >perf_fuzzer-2910 [000] 255.355335: function: >handle_mm_fault >perf_fuzzer-2910 [000] 255.355335: function: >__do_fault >perf_fuzzer-2910 [000] 255.355335: bputs: >perf_mmap_fault: VMW: perf_mmap_fault >perf_fuzzer-2910 [000] 255.355336: bprint: >perf_mmap_fault: VMW: perf_mmap_fault 0xffff8800cba6a980 >perf_fuzzer-2910 [000] 255.355336: function: >perf_mmap_to_page >perf_fuzzer-2910 [000] 255.355336: function: >_cond_resched >perf_fuzzer-2910 [000] 255.355336: function: >unlock_page >perf_fuzzer-2910 [000] 255.355336: function: >page_waitqueue >perf_fuzzer-2910 [000] 255.355337: function: >__wake_up_bit >perf_fuzzer-2910 [000] 255.355337: bputs: >perf_mmap_fault: VMW: perf_mmap_fault >perf_fuzzer-2910 [000] 255.355337: function: >_cond_resched >perf_fuzzer-2910 [000] 255.355337: function: >_raw_spin_lock >perf_fuzzer-2910 [000] 255.355337: function: >page_add_file_rmap >perf_fuzzer-2910 [000] 255.355338: function: >__inc_zone_page_state >perf_fuzzer-2910 [000] 255.355338: function: > __inc_zone_state >perf_fuzzer-2910 [000] 255.355338: function: >set_page_dirty >perf_fuzzer-2910 [000] 255.355338: function: >page_mapping >perf_fuzzer-2910 [000] 255.355338: function: >anon_set_page_dirty >perf_fuzzer-2910 [000] 255.355339: function: >unlock_page > > >Vince -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 1:34 ` Vince Weaver 2014-02-28 2:17 ` H. Peter Anvin @ 2014-02-28 2:57 ` Steven Rostedt 2014-02-28 11:11 ` Peter Zijlstra 2014-02-28 14:15 ` Vince Weaver 1 sibling, 2 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 2:57 UTC (permalink / raw) To: Vince Weaver; +Cc: H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Thu, 27 Feb 2014 20:34:34 -0500 (EST) Vince Weaver <vincent.weaver@maine.edu> wrote: > > I would actually suggest we do the equivalent on i386 as well. > > > > Vince, could you try this patch as an experiment? > > OK with your patch applied it does not segfault. > Vince, Great! Can you remove Peter's patch, and try this one. It removes the crud to save the cr2 from entry_64.S and makes both i386 and x86_64 do the same thing in regards to cr2 handling. -- Steve diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 1e96c36..937cb8d 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1854,29 +1854,11 @@ end_repeat_nmi: call save_paranoid DEFAULT_FRAME 0 - /* - * Save off the CR2 register. If we take a page fault in the NMI then - * it could corrupt the CR2 value. If the NMI preempts a page fault - * handler before it was able to read the CR2 register, and then the - * NMI itself takes a page fault, the page fault that was preempted - * will read the information from the NMI page fault and not the - * origin fault. Save it off and restore it if it changes. - * Use the r12 callee-saved register. - */ - movq %cr2, %r12 - /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */ movq %rsp,%rdi movq $-1,%rsi call do_nmi - /* Did the NMI take a page fault? Restore cr2 if it did */ - movq %cr2, %rcx - cmpq %rcx, %r12 - je 1f - movq %r12, %cr2 -1: - testl %ebx,%ebx /* swapgs needed? */ jnz nmi_restore nmi_swapgs: diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 6fcb49c..f1a6294 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -443,7 +443,6 @@ enum nmi_states { NMI_LATCHED, }; static DEFINE_PER_CPU(enum nmi_states, nmi_state); -static DEFINE_PER_CPU(unsigned long, nmi_cr2); #define nmi_nesting_preprocess(regs) \ do { \ @@ -452,14 +451,11 @@ static DEFINE_PER_CPU(unsigned long, nmi_cr2); return; \ } \ this_cpu_write(nmi_state, NMI_EXECUTING); \ - this_cpu_write(nmi_cr2, read_cr2()); \ } while (0); \ nmi_restart: #define nmi_nesting_postprocess() \ do { \ - if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \ - write_cr2(this_cpu_read(nmi_cr2)); \ if (this_cpu_dec_return(nmi_state)) \ goto nmi_restart; \ } while (0) @@ -512,8 +508,21 @@ static inline void nmi_nesting_postprocess(void) dotraplinkage notrace __kprobes void do_nmi(struct pt_regs *regs, long error_code) { + unsigned long cr2; + nmi_nesting_preprocess(regs); + /* + * Save off the CR2 register. If we take a page fault in the NMI then + * it could corrupt the CR2 value. If the NMI preempts a page fault + * handler before it was able to read the CR2 register, and then the + * NMI itself takes a page fault, the page fault that was preempted + * will read the information from the NMI page fault and not the + * origin fault. Save it off and restore it if it changes. + * Use the r12 callee-saved register. + */ + cr2 = read_cr2(); + nmi_enter(); inc_irq_stat(__nmi_count); @@ -523,6 +532,10 @@ do_nmi(struct pt_regs *regs, long error_code) nmi_exit(); + /* Reads are cheaper than writes */ + if (unlikely(cr2 != read_cr2())) + write_cr2(cr2); + /* On i386, may loop back to preprocess */ nmi_nesting_postprocess(); } ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 2:57 ` Steven Rostedt @ 2014-02-28 11:11 ` Peter Zijlstra 2014-02-28 13:37 ` Steven Rostedt 2014-02-28 14:15 ` Vince Weaver 1 sibling, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 11:11 UTC (permalink / raw) To: Steven Rostedt; +Cc: Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Thu, Feb 27, 2014 at 09:57:26PM -0500, Steven Rostedt wrote: > @@ -512,8 +508,21 @@ static inline void nmi_nesting_postprocess(void) > dotraplinkage notrace __kprobes void > do_nmi(struct pt_regs *regs, long error_code) > { > + unsigned long cr2; > + > nmi_nesting_preprocess(regs); > > + /* > + * Save off the CR2 register. If we take a page fault in the NMI then > + * it could corrupt the CR2 value. If the NMI preempts a page fault > + * handler before it was able to read the CR2 register, and then the > + * NMI itself takes a page fault, the page fault that was preempted > + * will read the information from the NMI page fault and not the > + * origin fault. Save it off and restore it if it changes. > + * Use the r12 callee-saved register. You might want to make that line go away :-) > + */ > + cr2 = read_cr2(); > + > nmi_enter(); > > inc_irq_stat(__nmi_count); > @@ -523,6 +532,10 @@ do_nmi(struct pt_regs *regs, long error_code) > > nmi_exit(); > > + /* Reads are cheaper than writes */ > + if (unlikely(cr2 != read_cr2())) > + write_cr2(cr2); > + > /* On i386, may loop back to preprocess */ > nmi_nesting_postprocess(); > } ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 11:11 ` Peter Zijlstra @ 2014-02-28 13:37 ` Steven Rostedt 0 siblings, 0 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 13:37 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014 12:11:11 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Feb 27, 2014 at 09:57:26PM -0500, Steven Rostedt wrote: > > @@ -512,8 +508,21 @@ static inline void nmi_nesting_postprocess(void) > > dotraplinkage notrace __kprobes void > > do_nmi(struct pt_regs *regs, long error_code) > > { > > + unsigned long cr2; > > + > > nmi_nesting_preprocess(regs); > > > > + /* > > + * Save off the CR2 register. If we take a page fault in the NMI then > > + * it could corrupt the CR2 value. If the NMI preempts a page fault > > + * handler before it was able to read the CR2 register, and then the > > + * NMI itself takes a page fault, the page fault that was preempted > > + * will read the information from the NMI page fault and not the > > + * origin fault. Save it off and restore it if it changes. > > > + * Use the r12 callee-saved register. > > You might want to make that line go away :-) > Gah! I noticed that before sending and deleted the line, recreated the patch and sent that out. The one step I forgot to do was to save the file after deleting it :-p -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 2:57 ` Steven Rostedt 2014-02-28 11:11 ` Peter Zijlstra @ 2014-02-28 14:15 ` Vince Weaver 2014-02-28 14:23 ` Steven Rostedt 1 sibling, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-28 14:15 UTC (permalink / raw) To: Steven Rostedt Cc: Vince Weaver, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Thu, 27 Feb 2014, Steven Rostedt wrote: > On Thu, 27 Feb 2014 20:34:34 -0500 (EST) > Vince Weaver <vincent.weaver@maine.edu> wrote: > > > > > I would actually suggest we do the equivalent on i386 as well. > > > > > > Vince, could you try this patch as an experiment? > > > > OK with your patch applied it does not segfault. > > > > Vince, Great! Can you remove Peter's patch, and try this one. It > removes the crud to save the cr2 from entry_64.S and makes both i386 > and x86_64 do the same thing in regards to cr2 handling. no, with only this patch applied it segfaults as per previous: [ 126.396049] perf_fuzzer[2904]: segfault at 17a0 ip 00000000004017fd sp 00000000ffaff3f0 error 6 in perf_fuzzer[400000+d1000] Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 14:15 ` Vince Weaver @ 2014-02-28 14:23 ` Steven Rostedt 2014-02-28 15:07 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 14:23 UTC (permalink / raw) To: Vince Weaver; +Cc: H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014 09:15:33 -0500 (EST) Vince Weaver <vincent.weaver@maine.edu> wrote: > On Thu, 27 Feb 2014, Steven Rostedt wrote: > > > On Thu, 27 Feb 2014 20:34:34 -0500 (EST) > > Vince Weaver <vincent.weaver@maine.edu> wrote: > > > > > > > > I would actually suggest we do the equivalent on i386 as well. > > > > > > > > Vince, could you try this patch as an experiment? > > > > > > OK with your patch applied it does not segfault. > > > > > > > Vince, Great! Can you remove Peter's patch, and try this one. It > > removes the crud to save the cr2 from entry_64.S and makes both i386 > > and x86_64 do the same thing in regards to cr2 handling. > > no, with only this patch applied it segfaults as per previous: > > [ 126.396049] perf_fuzzer[2904]: segfault at 17a0 ip 00000000004017fd sp 00000000ffaff3f0 error 6 in perf_fuzzer[400000+d1000] Interesting. Are you doing a perf function trace? And just in case, can you add this patch and make sure the copy is called by NMI. -- Steve diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c index ddf9ecb..ca943cd 100644 --- a/arch/x86/lib/usercopy.c +++ b/arch/x86/lib/usercopy.c @@ -29,6 +29,7 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n) */ pagefault_disable(); ret = __copy_from_user_inatomic(to, from, n); + trace_dump_stack(2) pagefault_enable(); return ret; ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 14:23 ` Steven Rostedt @ 2014-02-28 15:07 ` Vince Weaver 2014-02-28 15:13 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 115+ messages in thread From: Vince Weaver @ 2014-02-28 15:07 UTC (permalink / raw) To: Steven Rostedt Cc: Vince Weaver, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014, Steven Rostedt wrote: > Interesting. Are you doing a perf function trace? > > And just in case, can you add this patch and make sure the copy is > called by NMI. 199.900682: function: trace_do_page_fault 199.900683: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6 199.900683: function: perf_swevent_get_recursion_context 199.900684: function: perf_tp_event 199.900684: function: perf_swevent_event 199.900684: function: perf_swevent_overflow 199.900684: function: __perf_event_overflow 199.900685: function: perf_prepare_sample 199.900685: function: __perf_event_header__init_id 199.900685: function: task_tgid_nr_ns 199.900685: function: perf_event_tid 199.900686: function: __task_pid_nr_ns 199.900686: function: perf_callchain 199.900687: function: copy_from_user_nmi 199.900687: function: trace_do_page_fault 199.900687: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0 199.900688: function: __do_page_fault 199.900688: function: bad_area_nosemaphore 199.900688: function: __bad_area_nosemaphore 199.900689: function: no_context 199.900689: function: fixup_exception 199.900689: function: search_exception_tables 199.900689: function: search_extable 199.900691: function: copy_user_handle_tail 199.900691: function: trace_do_page_fault 199.900691: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0 199.900691: function: __do_page_fault 199.900692: function: bad_area_nosemaphore 199.900692: function: __bad_area_nosemaphore 199.900692: function: no_context 199.900692: function: fixup_exception 199.900692: function: search_exception_tables 199.900692: function: search_extable 199.900693: function: save_stack_trace 199.900693: function: dump_trace 199.900694: function: print_context_stack 199.900694: function: __kernel_text_address 199.900694: function: is_module_text_address 199.900695: function: __module_text_address 199.900695: function: __module_address 199.900695: function: __kernel_text_address 199.900695: function: is_module_text_address 199.900696: function: __module_text_address 199.900696: function: __module_address ... 199.900705: function: __kernel_text_address 199.900809: kernel_stack: <stack trace> => perf_callchain (ffffffff810d35a2) => perf_prepare_sample (ffffffff810cfae3) => __perf_event_overflow (ffffffff810d02f4) => perf_swevent_overflow (ffffffff810d04e3) => perf_swevent_event (ffffffff810d0574) => perf_tp_event (ffffffff810d070c) => perf_trace_x86_exceptions (ffffffff810341b6) => trace_do_page_fault (ffffffff81537702) => trace_page_fault (ffffffff81534772) 199.900810: function: perf_output_begin 199.900810: function: __do_page_fault 199.900810: function: __perf_sw_event 199.900810: function: perf_swevent_get_recursion_context 199.900811: function: down_read_trylock 199.900811: function: _cond_resched 199.900811: function: find_vma 199.900811: function: bad_area 199.900812: function: up_read 199.900812: function: __bad_area_nosemaphore 199.900812: function: is_prefetch 199.900812: function: convert_ip_to_linear 199.900813: function: unhandled_signal 199.900813: function: __printk_ratelimit 199.900813: function: _raw_spin_trylock 199.900813: function: _raw_spin_unlock_irqrestore 199.900814: function: printk 199.900814: function: vprintk_emit ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 15:07 ` Vince Weaver @ 2014-02-28 15:13 ` H. Peter Anvin 2014-02-28 15:40 ` Peter Zijlstra 2014-02-28 15:20 ` Steven Rostedt 2014-02-28 20:38 ` H. Peter Anvin 2 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-02-28 15:13 UTC (permalink / raw) To: Vince Weaver, Steven Rostedt; +Cc: Peter Zijlstra, Linux Kernel, Ingo Molnar If I'm reading this right we end up going from the page fault tracepoint to copy_from_user_nmi() without going through NMI, and the cr2 corruption is obvious. I guess the assumption that only the NMI path needed to save cr2 is flawed? On February 28, 2014 7:07:29 AM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: >On Fri, 28 Feb 2014, Steven Rostedt wrote: > >> Interesting. Are you doing a perf function trace? >> >> And just in case, can you add this patch and make sure the copy is >> called by NMI. > >199.900682: function: trace_do_page_fault >199.900683: page_fault_user: address=__per_cpu_end >ip=__per_cpu_end error_code=0x6 >199.900683: function: perf_swevent_get_recursion_context >199.900684: function: perf_tp_event >199.900684: function: perf_swevent_event >199.900684: function: perf_swevent_overflow >199.900684: function: __perf_event_overflow >199.900685: function: perf_prepare_sample >199.900685: function: >__perf_event_header__init_id >199.900685: function: task_tgid_nr_ns >199.900685: function: perf_event_tid >199.900686: function: __task_pid_nr_ns >199.900686: function: perf_callchain >199.900687: function: copy_from_user_nmi >199.900687: function: trace_do_page_fault >199.900687: page_fault_kernel: address=irq_stack_union >ip=copy_user_generic_string error_code=0x0 >199.900688: function: __do_page_fault >199.900688: function: bad_area_nosemaphore >199.900688: function: __bad_area_nosemaphore >199.900689: function: no_context >199.900689: function: fixup_exception >199.900689: function: >search_exception_tables >199.900689: function: search_extable >199.900691: function: copy_user_handle_tail >199.900691: function: trace_do_page_fault >199.900691: page_fault_kernel: address=irq_stack_union >ip=copy_user_handle_tail error_code=0x0 >199.900691: function: __do_page_fault >199.900692: function: bad_area_nosemaphore >199.900692: function: __bad_area_nosemaphore >199.900692: function: no_context >199.900692: function: fixup_exception >199.900692: function: >search_exception_tables >199.900692: function: search_extable >199.900693: function: save_stack_trace >199.900693: function: dump_trace >199.900694: function: print_context_stack >199.900694: function: __kernel_text_address >199.900694: function: is_module_text_address >199.900695: function: __module_text_address >199.900695: function: __module_address >199.900695: function: __kernel_text_address >199.900695: function: is_module_text_address >199.900696: function: __module_text_address >199.900696: function: __module_address >... >199.900705: function: __kernel_text_address >199.900809: kernel_stack: <stack trace> >=> perf_callchain (ffffffff810d35a2) >=> perf_prepare_sample (ffffffff810cfae3) >=> __perf_event_overflow (ffffffff810d02f4) >=> perf_swevent_overflow (ffffffff810d04e3) >=> perf_swevent_event (ffffffff810d0574) >=> perf_tp_event (ffffffff810d070c) >=> perf_trace_x86_exceptions (ffffffff810341b6) >=> trace_do_page_fault (ffffffff81537702) >=> trace_page_fault (ffffffff81534772) >199.900810: function: perf_output_begin >199.900810: function: __do_page_fault >199.900810: function: __perf_sw_event >199.900810: function: >perf_swevent_get_recursion_context >199.900811: function: down_read_trylock >199.900811: function: _cond_resched >199.900811: function: find_vma >199.900811: function: bad_area >199.900812: function: up_read >199.900812: function: __bad_area_nosemaphore >199.900812: function: is_prefetch >199.900812: function: convert_ip_to_linear >199.900813: function: unhandled_signal >199.900813: function: __printk_ratelimit >199.900813: function: _raw_spin_trylock >199.900813: function: _raw_spin_unlock_irqrestore >199.900814: function: printk >199.900814: function: vprintk_emit -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 15:13 ` H. Peter Anvin @ 2014-02-28 15:40 ` Peter Zijlstra 2014-02-28 16:15 ` H. Peter Anvin 0 siblings, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 15:40 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Vince Weaver, Steven Rostedt, Linux Kernel, Ingo Molnar On Fri, Feb 28, 2014 at 07:13:06AM -0800, H. Peter Anvin wrote: > If I'm reading this right we end up going from the page fault > tracepoint to copy_from_user_nmi() without going through NMI, and the > cr2 corruption is obvious. I guess the assumption that only the NMI > path needed to save cr2 is flawed? It was never assumed it would only go through NMI, but that it would be NMI safe -- and as it turns out, it is that. What I did assume was that any other callsites would be safe, seeing how they'd already be running in 'normal' contexts. I had not considered people putting tracepoints _that_ early in the exception paths. Note that there's more tracepoints there than the one mentioned. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 15:40 ` Peter Zijlstra @ 2014-02-28 16:15 ` H. Peter Anvin 2014-02-28 16:29 ` Steven Rostedt 2014-02-28 20:55 ` Peter Zijlstra 0 siblings, 2 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-28 16:15 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Vince Weaver, Steven Rostedt, Linux Kernel, Ingo Molnar On 02/28/2014 07:40 AM, Peter Zijlstra wrote: > On Fri, Feb 28, 2014 at 07:13:06AM -0800, H. Peter Anvin wrote: >> If I'm reading this right we end up going from the page fault >> tracepoint to copy_from_user_nmi() without going through NMI, and the >> cr2 corruption is obvious. I guess the assumption that only the NMI >> path needed to save cr2 is flawed? > > It was never assumed it would only go through NMI, but that it would be > NMI safe -- and as it turns out, it is that. > > What I did assume was that any other callsites would be safe, seeing how > they'd already be running in 'normal' contexts. > > I had not considered people putting tracepoints _that_ early in the > exception paths. > > Note that there's more tracepoints there than the one mentioned. > Well, I was talking about the assumption spelled out in the comment above copy_from_user_nmi() which pretty much states "cr2 is safe because cr2 is saved/restored in the NMI wrappers." -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 16:15 ` H. Peter Anvin @ 2014-02-28 16:29 ` Steven Rostedt 2014-02-28 19:33 ` [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() Steven Rostedt 2014-02-28 20:56 ` perf_fuzzer compiled for x32 causes reboot Peter Zijlstra 2014-02-28 20:55 ` Peter Zijlstra 1 sibling, 2 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 16:29 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Fri, 28 Feb 2014 08:15:11 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote: > Well, I was talking about the assumption spelled out in the comment > above copy_from_user_nmi() which pretty much states "cr2 is safe because > cr2 is saved/restored in the NMI wrappers." Yeah, it seems that the name "copy_from_user_nmi()" is a misnomer. As it can be called outside of nmi context. Perhaps we should have a copy_from_user_trace() that does the save and restore of the cr2. As that's the only place that faults, it may be the best answer. Arnaldo, Can you test this patch and see if it fixes the bug for you too. Vince already said it fixes it for him. I'm attaching it below (it's from H. Peter). Peter Z., as both Jiri's and my patch fixed the callers of the problem area, and as we have been discussing, there may be more problem areas, I'm thinking the best solution is to just use H. Peter's patch instead. And then we should rename it to copy_from_user_trace(). -- Steve diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c index ddf9ecb..938e45c 100644 --- a/arch/x86/lib/usercopy.c +++ b/arch/x86/lib/usercopy.c @@ -10,6 +10,8 @@ #include <asm/word-at-a-time.h> #include <linux/sched.h> +#include <asm/processor.h> + /* * We rely on the nested NMI work to allow atomic faults from the NMI path; the * nested NMI paths are careful to preserve CR2. @@ -18,6 +20,7 @@ unsigned long copy_from_user_nmi(void *to, const void __user *from, unsigned long n) { unsigned long ret; + unsigned long cr2; if (__range_not_ok(from, n, TASK_SIZE)) return 0; @@ -27,9 +30,11 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n) * disable pagefaults so that its behaviour is consistent even when * called form other contexts. */ + cr2 = read_cr2(); pagefault_disable(); ret = __copy_from_user_inatomic(to, from, n); pagefault_enable(); + write_cr2(cr2); return ret; } ^ permalink raw reply related [flat|nested] 115+ messages in thread
* [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() 2014-02-28 16:29 ` Steven Rostedt @ 2014-02-28 19:33 ` Steven Rostedt 2014-02-28 20:46 ` Peter Zijlstra 2014-02-28 20:56 ` perf_fuzzer compiled for x32 causes reboot Peter Zijlstra 1 sibling, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 19:33 UTC (permalink / raw) To: H. Peter Anvin Cc: Peter Zijlstra, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo [ H. Peter, Here's the rename patch. I did not include your update. You can add that first and then massage this patch on top. But this isn't critical for mainline or stable, where as I believe your patch is. ] The tracing utilities sometimes need to read from userspace (stack tracing), and to do this it has as special copy_from_user function called copy_from_user_nmi(). Well, as tracers can call this from outside of nmi context, the "_nmi" part is a misnomer and "_trace" is a better name. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- arch/x86/include/asm/perf_event.h | 2 +- arch/x86/include/asm/uaccess.h | 2 +- arch/x86/kernel/cpu/perf_event.c | 4 ++-- arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +- arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 +- arch/x86/lib/usercopy.c | 10 ++++++---- arch/x86/oprofile/backtrace.c | 4 ++-- 7 files changed, 14 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 8249df4..7bf4b25 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -274,6 +274,6 @@ static inline void perf_check_microcode(void) { } static inline void amd_pmu_disable_virt(void) { } #endif -#define arch_perf_out_copy_user copy_from_user_nmi +#define arch_perf_out_copy_user copy_from_user_trace #endif /* _ASM_X86_PERF_EVENT_H */ diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 8ec57c0..d734baf 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -515,7 +515,7 @@ struct __large_struct { unsigned long buf[100]; }; __put_user_size_ex((__typeof__(*(ptr)))(x), (ptr), sizeof(*(ptr))) extern unsigned long -copy_from_user_nmi(void *to, const void __user *from, unsigned long n); +copy_from_user_trace(void *to, const void __user *from, unsigned long n); extern __must_check long strncpy_from_user(char *dst, const char __user *src, long count); diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 8e13293..7eda4af 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -1988,7 +1988,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry) frame.next_frame = 0; frame.return_address = 0; - bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); + bytes = copy_from_user_trace(&frame, fp, sizeof(frame)); if (bytes != 0) break; @@ -2040,7 +2040,7 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs) frame.next_frame = NULL; frame.return_address = 0; - bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); + bytes = copy_from_user_trace(&frame, fp, sizeof(frame)); if (bytes != 0) break; diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c index ae96cfa..95d76c6 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c @@ -788,7 +788,7 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs) u8 *buf = this_cpu_read(insn_buffer); size = ip - to; /* Must fit our buffer, see above */ - bytes = copy_from_user_nmi(buf, (void __user *)to, size); + bytes = copy_from_user_trace(buf, (void __user *)to, size); if (bytes != 0) return 0; diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c index d82d155..0505ada 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c @@ -490,7 +490,7 @@ static int branch_type(unsigned long from, unsigned long to, int abort) return X86_BR_NONE; /* may fail if text not present */ - bytes = copy_from_user_nmi(buf, (void __user *)from, size); + bytes = copy_from_user_trace(buf, (void __user *)from, size); if (bytes != 0) return X86_BR_NONE; diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c index ddf9ecb..131ff16e 100644 --- a/arch/x86/lib/usercopy.c +++ b/arch/x86/lib/usercopy.c @@ -11,11 +11,13 @@ #include <linux/sched.h> /* - * We rely on the nested NMI work to allow atomic faults from the NMI path; the - * nested NMI paths are careful to preserve CR2. + * Used by tracing, needs to restore state of the cr2 register if + * the copy triggered a page fault. That's because tracing can happen + * between the time a normal page fault occurred and the cr2 is read + * by the handler. */ unsigned long -copy_from_user_nmi(void *to, const void __user *from, unsigned long n) +copy_from_user_trace(void *to, const void __user *from, unsigned long n) { unsigned long ret; @@ -33,4 +35,4 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n) return ret; } -EXPORT_SYMBOL_GPL(copy_from_user_nmi); +EXPORT_SYMBOL_GPL(copy_from_user_trace); diff --git a/arch/x86/oprofile/backtrace.c b/arch/x86/oprofile/backtrace.c index 5d04be5..7831650 100644 --- a/arch/x86/oprofile/backtrace.c +++ b/arch/x86/oprofile/backtrace.c @@ -46,7 +46,7 @@ dump_user_backtrace_32(struct stack_frame_ia32 *head) struct stack_frame_ia32 *fp; unsigned long bytes; - bytes = copy_from_user_nmi(bufhead, head, sizeof(bufhead)); + bytes = copy_from_user_trace(bufhead, head, sizeof(bufhead)); if (bytes != 0) return NULL; @@ -92,7 +92,7 @@ static struct stack_frame *dump_user_backtrace(struct stack_frame *head) struct stack_frame bufhead[2]; unsigned long bytes; - bytes = copy_from_user_nmi(bufhead, head, sizeof(bufhead)); + bytes = copy_from_user_trace(bufhead, head, sizeof(bufhead)); if (bytes != 0) return NULL; -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() 2014-02-28 19:33 ` [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() Steven Rostedt @ 2014-02-28 20:46 ` Peter Zijlstra 2014-02-28 20:51 ` Steven Rostedt 2014-02-28 21:01 ` Steven Rostedt 0 siblings, 2 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 20:46 UTC (permalink / raw) To: Steven Rostedt Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Fri, Feb 28, 2014 at 02:33:16PM -0500, Steven Rostedt wrote: > [ H. Peter, Here's the rename patch. I did not include your update. You > can add that first and then massage this patch on top. But this isn't > critical for mainline or stable, where as I believe your patch is. ] > > The tracing utilities sometimes need to read from userspace (stack tracing), > and to do this it has as special copy_from_user function called > copy_from_user_nmi(). Well, as tracers can call this from outside of > nmi context, the "_nmi" part is a misnomer and "_trace" is a better > name. NAK, spin_lock_irq() is very much an IRQ safe lock. Similarly copy_from_user_nmi() is an NMI safe copy from user. Furthermore, there's exactly 0 trace users, so the proposed name is actively worse. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() 2014-02-28 20:46 ` Peter Zijlstra @ 2014-02-28 20:51 ` Steven Rostedt 2014-02-28 20:58 ` Peter Zijlstra 2014-02-28 21:01 ` Steven Rostedt 1 sibling, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 20:51 UTC (permalink / raw) To: Peter Zijlstra Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Fri, 28 Feb 2014 21:46:21 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Feb 28, 2014 at 02:33:16PM -0500, Steven Rostedt wrote: > > [ H. Peter, Here's the rename patch. I did not include your update. You > > can add that first and then massage this patch on top. But this isn't > > critical for mainline or stable, where as I believe your patch is. ] > > > > The tracing utilities sometimes need to read from userspace (stack tracing), > > and to do this it has as special copy_from_user function called > > copy_from_user_nmi(). Well, as tracers can call this from outside of > > nmi context, the "_nmi" part is a misnomer and "_trace" is a better > > name. > > NAK, spin_lock_irq() is very much an IRQ safe lock. Similarly > copy_from_user_nmi() is an NMI safe copy from user. > > Furthermore, there's exactly 0 trace users, so the proposed name is > actively worse. Heh, I consider perf and oprofile special tracers ;-) -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() 2014-02-28 20:51 ` Steven Rostedt @ 2014-02-28 20:58 ` Peter Zijlstra 0 siblings, 0 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 20:58 UTC (permalink / raw) To: Steven Rostedt Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Fri, Feb 28, 2014 at 03:51:30PM -0500, Steven Rostedt wrote: > On Fri, 28 Feb 2014 21:46:21 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Fri, Feb 28, 2014 at 02:33:16PM -0500, Steven Rostedt wrote: > > > [ H. Peter, Here's the rename patch. I did not include your update. You > > > can add that first and then massage this patch on top. But this isn't > > > critical for mainline or stable, where as I believe your patch is. ] > > > > > > The tracing utilities sometimes need to read from userspace (stack tracing), > > > and to do this it has as special copy_from_user function called > > > copy_from_user_nmi(). Well, as tracers can call this from outside of > > > nmi context, the "_nmi" part is a misnomer and "_trace" is a better > > > name. > > > > NAK, spin_lock_irq() is very much an IRQ safe lock. Similarly > > copy_from_user_nmi() is an NMI safe copy from user. > > > > Furthermore, there's exactly 0 trace users, so the proposed name is > > actively worse. > > Heh, I consider perf and oprofile special tracers ;-) They're first and foremost profilers; which is an entirely different thing. But other people seem to get confused on this distinction too. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() 2014-02-28 20:46 ` Peter Zijlstra 2014-02-28 20:51 ` Steven Rostedt @ 2014-02-28 21:01 ` Steven Rostedt 2014-02-28 21:17 ` Peter Zijlstra 1 sibling, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 21:01 UTC (permalink / raw) To: Peter Zijlstra Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Fri, 28 Feb 2014 21:46:21 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > NAK, spin_lock_irq() is very much an IRQ safe lock. Similarly Actually, that's not true. It doesn't mean it is an IRQ safe lock. That would mean it is safe to use in IRQs. What it does mean is that it disables irqs and spin_unlock_irq() enables irqs. And we have spin_lock_irqsave() which disables irqs and saves the flags for a spin_lock_irqrestore() > copy_from_user_nmi() is an NMI safe copy from user. > > Furthermore, there's exactly 0 trace users, so the proposed name is > actively worse. OK, I'll go with that. So instead of an _nmi() which it has nothing to do with NMIs, how about making it like spin_lock_irq() and a more descriptive name about what it does? copy_from_user_savecr2() ? -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() 2014-02-28 21:01 ` Steven Rostedt @ 2014-02-28 21:17 ` Peter Zijlstra 0 siblings, 0 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 21:17 UTC (permalink / raw) To: Steven Rostedt Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Fri, Feb 28, 2014 at 04:01:39PM -0500, Steven Rostedt wrote: > OK, I'll go with that. So instead of an _nmi() which it has nothing to > do with NMIs, Well, how is 'safe to use from NMI' nothing to do with NMIs? > how about making it like spin_lock_irq() and a more > descriptive name about what it does? > > copy_from_user_savecr2() So how expensive are these CR2 reads? Because if they're more expensive than one or two cycles (normal instructions) I'd really rather not put them in there. I seem to have missed what was wrong with Jiri's patch. Are we worried about the function trace events? So IF we're going to push them into this side of things; I'd really rather see the CR2 foo in perf_callchain_user() and make arch_perf_out_copy_user() a real function which includes the CR2 goo too. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 16:29 ` Steven Rostedt 2014-02-28 19:33 ` [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() Steven Rostedt @ 2014-02-28 20:56 ` Peter Zijlstra 2014-02-28 21:06 ` Steven Rostedt 1 sibling, 1 reply; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 20:56 UTC (permalink / raw) To: Steven Rostedt Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Fri, Feb 28, 2014 at 11:29:46AM -0500, Steven Rostedt wrote: > On Fri, 28 Feb 2014 08:15:11 -0800 > "H. Peter Anvin" <hpa@zytor.com> wrote: > > > Well, I was talking about the assumption spelled out in the comment > > above copy_from_user_nmi() which pretty much states "cr2 is safe because > > cr2 is saved/restored in the NMI wrappers." > > Yeah, it seems that the name "copy_from_user_nmi()" is a misnomer. As > it can be called outside of nmi context. Perhaps we should have a > copy_from_user_trace() that does the save and restore of the cr2. > > As that's the only place that faults, it may be the best answer. > > Arnaldo, > > Can you test this patch and see if it fixes the bug for you too. Vince > already said it fixes it for him. > > I'm attaching it below (it's from H. Peter). > > Peter Z., as both Jiri's and my patch fixed the callers of the problem > area, and as we have been discussing, there may be more problem areas, > I'm thinking the best solution is to just use H. Peter's patch instead. > And then we should rename it to copy_from_user_trace(). Like already said; _trace is an absolutely abysmal name. Also you _really_ don't want an unconditional CR2 write in there, that's just stupidly expensive. Also, this function is called a _LOT_ under certain workloads, I don't know how cheap a CR2 read is, but it had better be really cheap. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 20:56 ` perf_fuzzer compiled for x32 causes reboot Peter Zijlstra @ 2014-02-28 21:06 ` Steven Rostedt 2014-03-01 9:16 ` Ingo Molnar 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 21:06 UTC (permalink / raw) To: Peter Zijlstra Cc: H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Fri, 28 Feb 2014 21:56:38 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > Like already said; _trace is an absolutely abysmal name. Also you > _really_ don't want an unconditional CR2 write in there, that's just > stupidly expensive. But a read isn't. Which is why we only do a write if the copy caused a page fault (which is already expensive). The proposed patch will have: if (cr2 != read_cr2()) write_cr2(cr2); > > Also, this function is called a _LOT_ under certain workloads, I don't > know how cheap a CR2 read is, but it had better be really cheap. That's a HPA question. -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 21:06 ` Steven Rostedt @ 2014-03-01 9:16 ` Ingo Molnar 2014-03-01 9:50 ` Borislav Petkov 2014-03-03 9:16 ` Peter Zijlstra 0 siblings, 2 replies; 115+ messages in thread From: Ingo Molnar @ 2014-03-01 9:16 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo * Steven Rostedt <rostedt@goodmis.org> wrote: > > Also, this function is called a _LOT_ under certain workloads, I > > don't know how cheap a CR2 read is, but it had better be really > > cheap. > > That's a HPA question. We read CR2 in the page fault hot path, so it's on the top of CPU architects' minds and it's reasonably optimized. A couple of cycles IIRC, but would be nice to hear actual numbers. Thanks, Ingo ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-03-01 9:16 ` Ingo Molnar @ 2014-03-01 9:50 ` Borislav Petkov 2014-03-01 16:50 ` H. Peter Anvin 2014-03-03 9:16 ` Peter Zijlstra 1 sibling, 1 reply; 115+ messages in thread From: Borislav Petkov @ 2014-03-01 9:50 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Sat, Mar 01, 2014 at 10:16:50AM +0100, Ingo Molnar wrote: > > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Also, this function is called a _LOT_ under certain workloads, I > > > don't know how cheap a CR2 read is, but it had better be really > > > cheap. > > > > That's a HPA question. > > We read CR2 in the page fault hot path, so it's on the top of CPU > architects' minds and it's reasonably optimized. A couple of cycles > IIRC, but would be nice to hear actual numbers. Yeah, we were discussing this last night on IRC. And hpa actually meant that the optimization potential was there but no one did do it, except maybe Transmeta. :-) So the expensive thing is writing to CR2 because it is a serializing instruction. In fact, all writes to control registers except CR8 are serializing. The reading from CR2 should be cheaper but not as cheap as a normal MOV %reg %reg is. On AMD, MOV %reg, %cr2 is done with microcode so definitely at least a couple of cycles and I'd guess it is not a trivial MOV on Intel too. Maybe a way to hide this cost is the OoO, as hpa suggested, depending on how much parallelism that particular code region can offer (serializing instructions close by). -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-03-01 9:50 ` Borislav Petkov @ 2014-03-01 16:50 ` H. Peter Anvin 2014-03-04 23:05 ` Borislav Petkov 0 siblings, 1 reply; 115+ messages in thread From: H. Peter Anvin @ 2014-03-01 16:50 UTC (permalink / raw) To: Borislav Petkov, Ingo Molnar Cc: Steven Rostedt, Peter Zijlstra, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo The bottom line is that if we want hard numbers we probably have to measure. Hoisting the cr2 read is a no-brainer, might even help performance... On March 1, 2014 1:50:42 AM PST, Borislav Petkov <bp@alien8.de> wrote: >On Sat, Mar 01, 2014 at 10:16:50AM +0100, Ingo Molnar wrote: >> >> * Steven Rostedt <rostedt@goodmis.org> wrote: >> >> > > Also, this function is called a _LOT_ under certain workloads, I >> > > don't know how cheap a CR2 read is, but it had better be really >> > > cheap. >> > >> > That's a HPA question. >> >> We read CR2 in the page fault hot path, so it's on the top of CPU >> architects' minds and it's reasonably optimized. A couple of cycles >> IIRC, but would be nice to hear actual numbers. > >Yeah, we were discussing this last night on IRC. > >And hpa actually meant that the optimization potential was there but no >one did do it, except maybe Transmeta. :-) > >So the expensive thing is writing to CR2 because it is a serializing >instruction. In fact, all writes to control registers except CR8 are >serializing. > >The reading from CR2 should be cheaper but not as cheap as a normal >MOV %reg %reg is. On AMD, MOV %reg, %cr2 is done with microcode so >definitely at least a couple of cycles and I'd guess it is not a >trivial >MOV on Intel too. > >Maybe a way to hide this cost is the OoO, as hpa suggested, depending >on >how much parallelism that particular code region can offer (serializing >instructions close by). -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-03-01 16:50 ` H. Peter Anvin @ 2014-03-04 23:05 ` Borislav Petkov 0 siblings, 0 replies; 115+ messages in thread From: Borislav Petkov @ 2014-03-04 23:05 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Sat, Mar 01, 2014 at 08:50:17AM -0800, H. Peter Anvin wrote: > The bottom line is that if we want hard numbers we probably have to > measure. > > Hoisting the cr2 read is a no-brainer, might even help performance... Btw, I just got word that on AMD, a read from CR2 is 4 cycles on family 0x15 and 3 cycles on family 0x10. If we're in the same ballpark on Intel, I'd say we're fine with those numbers even in hot paths. The OoO is an additional bonus which should hide latency even more. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-03-01 9:16 ` Ingo Molnar 2014-03-01 9:50 ` Borislav Petkov @ 2014-03-03 9:16 ` Peter Zijlstra 1 sibling, 0 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-03-03 9:16 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, H. Peter Anvin, Vince Weaver, Linux Kernel, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo On Sat, Mar 01, 2014 at 10:16:50AM +0100, Ingo Molnar wrote: > We read CR2 in the page fault hot path, so it's on the top of CPU > architects' minds and it's reasonably optimized. A couple of cycles > IIRC, but would be nice to hear actual numbers. Well, going with what Linus found; it looks like they made page faults more expensive on Haswell as compared to previous generations. So its not _that_ high on their list :/ ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 16:15 ` H. Peter Anvin 2014-02-28 16:29 ` Steven Rostedt @ 2014-02-28 20:55 ` Peter Zijlstra 1 sibling, 0 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 20:55 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Vince Weaver, Steven Rostedt, Linux Kernel, Ingo Molnar On Fri, Feb 28, 2014 at 08:15:11AM -0800, H. Peter Anvin wrote: > On 02/28/2014 07:40 AM, Peter Zijlstra wrote: > > On Fri, Feb 28, 2014 at 07:13:06AM -0800, H. Peter Anvin wrote: > >> If I'm reading this right we end up going from the page fault > >> tracepoint to copy_from_user_nmi() without going through NMI, and the > >> cr2 corruption is obvious. I guess the assumption that only the NMI > >> path needed to save cr2 is flawed? > > > > It was never assumed it would only go through NMI, but that it would be > > NMI safe -- and as it turns out, it is that. > > > > What I did assume was that any other callsites would be safe, seeing how > > they'd already be running in 'normal' contexts. > > > > I had not considered people putting tracepoints _that_ early in the > > exception paths. > > > > Note that there's more tracepoints there than the one mentioned. > > > > Well, I was talking about the assumption spelled out in the comment > above copy_from_user_nmi() which pretty much states "cr2 is safe because > cr2 is saved/restored in the NMI wrappers." That is because we assumed NMI was the only thing that could interrupt a fault handler before it would read CR2. Never thinking someone would put a tracepoint there. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 15:07 ` Vince Weaver 2014-02-28 15:13 ` H. Peter Anvin @ 2014-02-28 15:20 ` Steven Rostedt 2014-02-28 15:30 ` Steven Rostedt 2014-02-28 20:38 ` H. Peter Anvin 2 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 15:20 UTC (permalink / raw) To: Vince Weaver Cc: H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar, Seiji Aguchi On Fri, 28 Feb 2014 10:07:29 -0500 (EST) Vince Weaver <vincent.weaver@maine.edu> wrote: > On Fri, 28 Feb 2014, Steven Rostedt wrote: > 199.900696: function: __module_address > ... > 199.900705: function: __kernel_text_address > 199.900809: kernel_stack: <stack trace> > => perf_callchain (ffffffff810d35a2) > => perf_prepare_sample (ffffffff810cfae3) > => __perf_event_overflow (ffffffff810d02f4) > => perf_swevent_overflow (ffffffff810d04e3) > => perf_swevent_event (ffffffff810d0574) > => perf_tp_event (ffffffff810d070c) > => perf_trace_x86_exceptions (ffffffff810341b6) > => trace_do_page_fault (ffffffff81537702) > => trace_page_fault (ffffffff81534772) Thank you!!! You just found the bug :-) The bug was caused by: commit 25c74b10bacead867478480170083f69cfc0db48 x86, trace: Register exception handler to trace IDT With this code: dotraplinkage void __kprobes trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; prev_state = exception_enter(); trace_page_fault_entries(regs, error_code); __do_page_fault(regs, error_code); exception_exit(prev_state); } The trace_page_fault_entries() which is called before the cr2 is saved can fault by perf doing a userspace stack trace. But the cr2 is not restored when calling __do_page_fault() and that gets the wrong cr2. Below is a patch that should fix this. Please remove all other patches and try this out. Thanks, -- Steve > 199.900810: function: perf_output_begin > 199.900810: function: __do_page_fault > 199.900810: function: __perf_sw_event > 199.900810: function: perf_swevent_get_recursion_context > 199.900811: function: down_read_trylock > 199.900811: function: _cond_resched > 199.900811: function: find_vma > 199.900811: function: bad_area > 199.900812: function: up_read > 199.900812: function: __bad_area_nosemaphore > 199.900812: function: is_prefetch > 199.900812: function: convert_ip_to_linear > 199.900813: function: unhandled_signal > 199.900813: function: __printk_ratelimit > 199.900813: function: _raw_spin_trylock > 199.900813: function: _raw_spin_unlock_irqrestore > 199.900814: function: printk > 199.900814: function: vprintk_emit diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 6dea040..66b636d 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1271,9 +1271,15 @@ dotraplinkage void __kprobes trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; + unsigned long cr2; prev_state = exception_enter(); + /* The trace might fault, save the cr2 register */ + cr2 = read_cr2(); trace_page_fault_entries(regs, error_code); + /* Put back the original cr2 if needed */ + if (cr2 != read_cr2()) + write_cr2(cr2); __do_page_fault(regs, error_code); exception_exit(prev_state); } ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 15:20 ` Steven Rostedt @ 2014-02-28 15:30 ` Steven Rostedt 0 siblings, 0 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 15:30 UTC (permalink / raw) To: Vince Weaver Cc: H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar, Seiji Aguchi On Fri, 28 Feb 2014 10:20:00 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > Below is a patch that should fix this. Please remove all other patches > and try this out. Updated patch, as Peter Zijlstra on IRC asked me if the exception_enter() can be traced. And looking at it, it sure can be. -- Steve diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 6dea040..f606b67 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1271,9 +1271,15 @@ dotraplinkage void __kprobes trace_do_page_fault(struct pt_regs *regs, unsigned long error_code) { enum ctx_state prev_state; + unsigned long cr2; + /* The trace might fault, save the cr2 register */ + cr2 = read_cr2(); prev_state = exception_enter(); trace_page_fault_entries(regs, error_code); + /* Put back the original cr2 if needed */ + if (cr2 != read_cr2()) + write_cr2(cr2); __do_page_fault(regs, error_code); exception_exit(prev_state); } ^ permalink raw reply related [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 15:07 ` Vince Weaver 2014-02-28 15:13 ` H. Peter Anvin 2014-02-28 15:20 ` Steven Rostedt @ 2014-02-28 20:38 ` H. Peter Anvin 2014-02-28 20:46 ` Steven Rostedt 2014-02-28 21:18 ` Vince Weaver 2 siblings, 2 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-28 20:38 UTC (permalink / raw) To: Vince Weaver, Steven Rostedt; +Cc: Peter Zijlstra, Linux Kernel, Ingo Molnar Now we need to figure out if the reboot problem and the segfault problem are actually the same... I have a nasty feeling they might be different problems. On February 28, 2014 7:07:29 AM PST, Vince Weaver <vincent.weaver@maine.edu> wrote: >On Fri, 28 Feb 2014, Steven Rostedt wrote: > >> Interesting. Are you doing a perf function trace? >> >> And just in case, can you add this patch and make sure the copy is >> called by NMI. > >199.900682: function: trace_do_page_fault >199.900683: page_fault_user: address=__per_cpu_end >ip=__per_cpu_end error_code=0x6 >199.900683: function: perf_swevent_get_recursion_context >199.900684: function: perf_tp_event >199.900684: function: perf_swevent_event >199.900684: function: perf_swevent_overflow >199.900684: function: __perf_event_overflow >199.900685: function: perf_prepare_sample >199.900685: function: >__perf_event_header__init_id >199.900685: function: task_tgid_nr_ns >199.900685: function: perf_event_tid >199.900686: function: __task_pid_nr_ns >199.900686: function: perf_callchain >199.900687: function: copy_from_user_nmi >199.900687: function: trace_do_page_fault >199.900687: page_fault_kernel: address=irq_stack_union >ip=copy_user_generic_string error_code=0x0 >199.900688: function: __do_page_fault >199.900688: function: bad_area_nosemaphore >199.900688: function: __bad_area_nosemaphore >199.900689: function: no_context >199.900689: function: fixup_exception >199.900689: function: >search_exception_tables >199.900689: function: search_extable >199.900691: function: copy_user_handle_tail >199.900691: function: trace_do_page_fault >199.900691: page_fault_kernel: address=irq_stack_union >ip=copy_user_handle_tail error_code=0x0 >199.900691: function: __do_page_fault >199.900692: function: bad_area_nosemaphore >199.900692: function: __bad_area_nosemaphore >199.900692: function: no_context >199.900692: function: fixup_exception >199.900692: function: >search_exception_tables >199.900692: function: search_extable >199.900693: function: save_stack_trace >199.900693: function: dump_trace >199.900694: function: print_context_stack >199.900694: function: __kernel_text_address >199.900694: function: is_module_text_address >199.900695: function: __module_text_address >199.900695: function: __module_address >199.900695: function: __kernel_text_address >199.900695: function: is_module_text_address >199.900696: function: __module_text_address >199.900696: function: __module_address >... >199.900705: function: __kernel_text_address >199.900809: kernel_stack: <stack trace> >=> perf_callchain (ffffffff810d35a2) >=> perf_prepare_sample (ffffffff810cfae3) >=> __perf_event_overflow (ffffffff810d02f4) >=> perf_swevent_overflow (ffffffff810d04e3) >=> perf_swevent_event (ffffffff810d0574) >=> perf_tp_event (ffffffff810d070c) >=> perf_trace_x86_exceptions (ffffffff810341b6) >=> trace_do_page_fault (ffffffff81537702) >=> trace_page_fault (ffffffff81534772) >199.900810: function: perf_output_begin >199.900810: function: __do_page_fault >199.900810: function: __perf_sw_event >199.900810: function: >perf_swevent_get_recursion_context >199.900811: function: down_read_trylock >199.900811: function: _cond_resched >199.900811: function: find_vma >199.900811: function: bad_area >199.900812: function: up_read >199.900812: function: __bad_area_nosemaphore >199.900812: function: is_prefetch >199.900812: function: convert_ip_to_linear >199.900813: function: unhandled_signal >199.900813: function: __printk_ratelimit >199.900813: function: _raw_spin_trylock >199.900813: function: _raw_spin_unlock_irqrestore >199.900814: function: printk >199.900814: function: vprintk_emit -- Sent from my mobile phone. Please pardon brevity and lack of formatting. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 20:38 ` H. Peter Anvin @ 2014-02-28 20:46 ` Steven Rostedt 2014-02-28 21:18 ` Vince Weaver 1 sibling, 0 replies; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 20:46 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Vince Weaver, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014 12:38:52 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote: > Now we need to figure out if the reboot problem and the segfault problem are actually the same... I have a nasty feeling they might be different problems. I wonder if there was any recursion problem. Although, I believe perf has recursion checks. But perhaps something recursed before the checks? -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 20:38 ` H. Peter Anvin 2014-02-28 20:46 ` Steven Rostedt @ 2014-02-28 21:18 ` Vince Weaver 2014-02-28 21:30 ` Steven Rostedt 1 sibling, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-28 21:18 UTC (permalink / raw) To: H. Peter Anvin Cc: Vince Weaver, Steven Rostedt, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014, H. Peter Anvin wrote: > Now we need to figure out if the reboot problem and the segfault problem > are actually the same... I have a nasty feeling they might be different > problems. I'm currently running a script that tries setting EBP to all possible 32-bit pages and running the test to see if that triggers anything. If I look at my notes the original reboot crash might have happened when I had the fuzzer also generating overflow signals (my current tests do not) so I'm not sure if having all this mess triggered from inside a signal handler could make it reboot somehow. I was away from the computer this afternoon and of course I have scores of e-mails on this topic now with lots of competing patches. Is there one in particular I'm supposed to be testing? Thanks, Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 21:18 ` Vince Weaver @ 2014-02-28 21:30 ` Steven Rostedt 2014-02-28 23:34 ` Vince Weaver 0 siblings, 1 reply; 115+ messages in thread From: Steven Rostedt @ 2014-02-28 21:30 UTC (permalink / raw) To: Vince Weaver; +Cc: H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014 16:18:23 -0500 (EST) Vince Weaver <vincent.weaver@maine.edu> wrote: > I was away from the computer this afternoon and of course I have scores of > e-mails on this topic now with lots of competing patches. Is there one > in particular I'm supposed to be testing? I was poking fun at you on IRC for this exact reason: <rostedt> poor Vince, I keep sending him new patches. "No, don't test this patch, now test this one. Oh wait, try this one instead" * peterz sees Vince thinking: "stop... sending.. me.. damn... patches... already... !!!11!" <rostedt> or at least, "Let me finish this test before I cancel it again for another damn patch" <rostedt> then he's probably doing "I'm not going to run any tests now, until I wait a while to see if there's a new patch to test" :-) Anyway, I would say wait a bit while we sort this out. At least we have a strong idea of what the bug is. Now we need to agree on the solution. Vince, you have been really terrific in helping us figure out what was wrong. I don't want to waste your time until we can all agree on what the proper fix is. When we do, it would be great if you can test it. But for now, have a great weekend and thanks for all your hard work in narrowing down this issue. -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 21:30 ` Steven Rostedt @ 2014-02-28 23:34 ` Vince Weaver 2014-03-01 0:43 ` H. Peter Anvin 2014-03-01 3:36 ` Steven Rostedt 0 siblings, 2 replies; 115+ messages in thread From: Vince Weaver @ 2014-02-28 23:34 UTC (permalink / raw) To: Steven Rostedt Cc: Vince Weaver, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014, Steven Rostedt wrote: > On Fri, 28 Feb 2014 16:18:23 -0500 (EST) > Vince Weaver <vincent.weaver@maine.edu> wrote: > > > I was away from the computer this afternoon and of course I have scores of > > e-mails on this topic now with lots of competing patches. Is there one > > in particular I'm supposed to be testing? > > I was poking fun at you on IRC for this exact reason: > > <rostedt> poor Vince, I keep sending him new patches. "No, don't test this patch, now test this one. Oh wait, try this one instead" > * peterz sees Vince thinking: "stop... sending.. me.. damn... patches... already... !!!11!" > <rostedt> or at least, "Let me finish this test before I cancel it again for another damn patch" > <rostedt> then he's probably doing "I'm not going to run any tests now, until I wait a while to see if there's a new patch to test" Well while it might appear that I spend all of my days finding perf_event bugs, I actually am a college professor so I do occasionally have to run off to teach a class, meet with students, or write papers/grants for other academics to reject. It's nice others can reproduce the issue now, it would have saved me a lot of trouble, although now in theory I have a much better handle of how to use/abuse ftrace so I guess it was worth it. Once the fix gets into git I'm sure the relentless perf_fuzzer will let us know if there are any other issues left. I do look forward to the day when I can leave it running overnight and have a clean syslog the next morning. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 23:34 ` Vince Weaver @ 2014-03-01 0:43 ` H. Peter Anvin 2014-03-01 3:36 ` Steven Rostedt 1 sibling, 0 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-03-01 0:43 UTC (permalink / raw) To: Vince Weaver, Steven Rostedt; +Cc: Peter Zijlstra, Linux Kernel, Ingo Molnar On 02/28/2014 03:34 PM, Vince Weaver wrote: > > Well while it might appear that I spend all of my days finding perf_event > bugs, I actually am a college professor so I do occasionally have to run > off to teach a class, meet with students, or write papers/grants for other > academics to reject. > We really appreciate your help. This has been really critical. > > It's nice others can reproduce the issue now, it would have saved me a lot > of trouble, although now in theory I have a much better handle of how to > use/abuse ftrace so I guess it was worth it. > > Once the fix gets into git I'm sure the relentless perf_fuzzer will let us > know if there are any other issues left. I do look forward to the day > when I can leave it running overnight and have a clean syslog the next > morning. > We all do, definitely, and your help has been a huge step in that direction. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-28 23:34 ` Vince Weaver 2014-03-01 0:43 ` H. Peter Anvin @ 2014-03-01 3:36 ` Steven Rostedt 2014-03-01 16:24 ` Andi Kleen 2014-03-02 16:02 ` Vince Weaver 1 sibling, 2 replies; 115+ messages in thread From: Steven Rostedt @ 2014-03-01 3:36 UTC (permalink / raw) To: Vince Weaver; +Cc: H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014 18:34:00 -0500 (EST) Vince Weaver <vincent.weaver@maine.edu> wrote: > > I was poking fun at you on IRC for this exact reason: > > > > <rostedt> poor Vince, I keep sending him new patches. "No, don't test this patch, now test this one. Oh wait, try this one instead" > > * peterz sees Vince thinking: "stop... sending.. me.. damn... patches... already... !!!11!" > > <rostedt> or at least, "Let me finish this test before I cancel it again for another damn patch" > > <rostedt> then he's probably doing "I'm not going to run any tests now, until I wait a while to see if there's a new patch to test" I hope you were not offended by this. It was actually a testament to the work you've given us. > > Well while it might appear that I spend all of my days finding perf_event > bugs, I actually am a college professor so I do occasionally have to run > off to teach a class, meet with students, or write papers/grants for other > academics to reject. But perf_event bug finder is a much more prestigious title than "college professor" ;-) > > It's nice others can reproduce the issue now, it would have saved me a lot > of trouble, although now in theory I have a much better handle of how to > use/abuse ftrace so I guess it was worth it. I always enjoy finding bugs in other people's code. That's usually the best way to learn what their code does. > > Once the fix gets into git I'm sure the relentless perf_fuzzer will let us > know if there are any other issues left. I do look forward to the day > when I can leave it running overnight and have a clean syslog the next > morning. BTW, is the perf_fuzzer code posted somewhere? It sounds like it can be really useful for us to do our own testing too. Thanks, -- Steve ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-03-01 3:36 ` Steven Rostedt @ 2014-03-01 16:24 ` Andi Kleen 2014-03-02 15:34 ` Vince Weaver 2014-03-02 16:02 ` Vince Weaver 1 sibling, 1 reply; 115+ messages in thread From: Andi Kleen @ 2014-03-01 16:24 UTC (permalink / raw) To: Steven Rostedt Cc: Vince Weaver, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar Steven Rostedt <rostedt@goodmis.org> writes: > > BTW, is the perf_fuzzer code posted somewhere? It sounds like it can be > really useful for us to do our own testing too. I believe it's part of trinity. http://codemonkey.org.uk/projects/trinity/ Perhaps it should have a "ftracer fuzzer" too? -Andi ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-03-01 16:24 ` Andi Kleen @ 2014-03-02 15:34 ` Vince Weaver 0 siblings, 0 replies; 115+ messages in thread From: Vince Weaver @ 2014-03-02 15:34 UTC (permalink / raw) To: Andi Kleen Cc: Steven Rostedt, Vince Weaver, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Sat, 1 Mar 2014, Andi Kleen wrote: > Steven Rostedt <rostedt@goodmis.org> writes: > > > > BTW, is the perf_fuzzer code posted somewhere? It sounds like it can be > > really useful for us to do our own testing too. > > I believe it's part of trinity. > > http://codemonkey.org.uk/projects/trinity/ > > Perhaps it should have a "ftracer fuzzer" too? It's not part of trinity, it's a separate project that re-uses some of trinity's codebase. http://web.eece.maine.edu/~vweaver/projects/perf_events/fuzzer/ You can get the source as part of the perf_event_tests package https://github.com/deater/perf_event_tests go into the fuzzer directory, run make, then "./perf_fuzzer". You can also run the ./fast_repro.sh script instead, it's what I've been using to track this kernel issue, although it tripped over the issue a lot faster when compiled with -mx32 than it did when complied as native x86_64. I think the code as found in the git tree should be working, I made a lot of changes when tracking this problem and I need to make sure I only check in the proper ones and not let any weird debugging patches slip in. I'm aware of at least two other WARN messages and possibly one or two hard-lockup bugs in addition to the potential system reboot that can be triggered by the fuzzer, it just takes a while to isolate these things. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-03-01 3:36 ` Steven Rostedt 2014-03-01 16:24 ` Andi Kleen @ 2014-03-02 16:02 ` Vince Weaver 1 sibling, 0 replies; 115+ messages in thread From: Vince Weaver @ 2014-03-02 16:02 UTC (permalink / raw) To: Steven Rostedt Cc: Vince Weaver, H. Peter Anvin, Peter Zijlstra, Linux Kernel, Ingo Molnar On Fri, 28 Feb 2014, Steven Rostedt wrote: > On Fri, 28 Feb 2014 18:34:00 -0500 (EST) > Vince Weaver <vincent.weaver@maine.edu> wrote: > > But perf_event bug finder is a much more prestigious title than > "college professor" ;-) yes, it's something to fall back on if/when I get denied tenure :) I do enjoy tracking down bugs like this, it's why I spend a lot of time on it when I should probably be doing other things. > BTW, is the perf_fuzzer code posted somewhere? It sounds like it can be > really useful for us to do our own testing too. yes, code is fully available though possibly not publicized well. I posted the full info later in the thread as I seem to be responding to e-mails in reverse order today for some reason. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-27 22:31 ` Steven Rostedt 2014-02-27 22:52 ` H. Peter Anvin @ 2014-02-28 9:39 ` Peter Zijlstra 1 sibling, 0 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-02-28 9:39 UTC (permalink / raw) To: Steven Rostedt; +Cc: Vince Weaver, H. Peter Anvin, Linux Kernel, Ingo Molnar On Thu, Feb 27, 2014 at 05:31:50PM -0500, Steven Rostedt wrote: > Well, the perf ring buffer is vmalloced, right? That can cause a page > fault too. On x86 they're typically not -- although we have a debug CONFIG option to test that code on x86 too. On SPARC/ARM etc.. we have to use vmalloc_user() because of cache aliasing (yay for VIPT). ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 17:32 ` Vince Weaver 2014-02-24 17:40 ` H. Peter Anvin @ 2014-02-24 17:40 ` Peter Zijlstra 2014-02-24 17:41 ` Vince Weaver 2 siblings, 0 replies; 115+ messages in thread From: Peter Zijlstra @ 2014-02-24 17:40 UTC (permalink / raw) To: Vince Weaver Cc: H. Peter Anvin, Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On Mon, Feb 24, 2014 at 12:32:39PM -0500, Vince Weaver wrote: > I do note that > perf_callchain_user(); > > Does > fp = (void __user *)regs->bp; > > ... > > bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); > > > And in my particular executable RBP has nothing to do with a frame > pointer, but is instead being used as a general purpose register. > > Am I missing something here? Though in that case I'm not sure why this > wouldn't be easier to trigger. Ah, in case the frame doesn't actually exist we would expect to fault and get the fixup treatment, returning a short copy (the return value being bytes _NOT_ copied). When that happens; if (bytes != 0) break; At which point we'll terminate the stack frame iteration. This is where we rely on being able to take a fault from NMI context, the fault iret will re-enable NMIs, necessitating all the magic Steve did. ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 17:32 ` Vince Weaver 2014-02-24 17:40 ` H. Peter Anvin 2014-02-24 17:40 ` Peter Zijlstra @ 2014-02-24 17:41 ` Vince Weaver 2014-02-24 17:42 ` H. Peter Anvin 2 siblings, 1 reply; 115+ messages in thread From: Vince Weaver @ 2014-02-24 17:41 UTC (permalink / raw) To: Vince Weaver Cc: Peter Zijlstra, H. Peter Anvin, Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On Mon, 24 Feb 2014, Vince Weaver wrote: > I do note that > perf_callchain_user(); > > Does > fp = (void __user *)regs->bp; > > ... > > bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); > > > And in my particular executable RBP has nothing to do with a frame > pointer, but is instead being used as a general purpose register. and as a reminder, I'm seeing this on an x32 executable, so perf_callchain_user32() is probably coming into play. So maybe it is an x32 issue after all. Vince ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 17:41 ` Vince Weaver @ 2014-02-24 17:42 ` H. Peter Anvin 0 siblings, 0 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-24 17:42 UTC (permalink / raw) To: Vince Weaver Cc: Peter Zijlstra, Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On 02/24/2014 09:41 AM, Vince Weaver wrote: > On Mon, 24 Feb 2014, Vince Weaver wrote: > >> I do note that >> perf_callchain_user(); >> >> Does >> fp = (void __user *)regs->bp; >> >> ... >> >> bytes = copy_from_user_nmi(&frame, fp, sizeof(frame)); >> >> >> And in my particular executable RBP has nothing to do with a frame >> pointer, but is instead being used as a general purpose register. > > and as a reminder, I'm seeing this on an x32 executable, so > perf_callchain_user32() is probably coming into play. > > So maybe it is an x32 issue after all. > No. if (!test_thread_flag(TIF_IA32)) return 0; TIF_IA32 is clear for an x32 process. -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
* Re: perf_fuzzer compiled for x32 causes reboot 2014-02-24 17:25 ` Peter Zijlstra 2014-02-24 17:32 ` Vince Weaver @ 2014-02-24 17:52 ` H. Peter Anvin 1 sibling, 0 replies; 115+ messages in thread From: H. Peter Anvin @ 2014-02-24 17:52 UTC (permalink / raw) To: Peter Zijlstra, Vince Weaver Cc: Linux Kernel, Ingo Molnar, H.J. Lu, Steven Rostedt On 02/24/2014 09:25 AM, Peter Zijlstra wrote: >> >> What is likely happening is the user page fault is triggering >> code to do a "perf_callchain" dump, which is calling copy_from_user_nmi() >> which calls copy_user_generic_string() which is somehow getting the user >> RBP in the RDI register somehow? > > So that code very much relies on the 'recursive' NMI/iret magic from > Steve, patch 3f3c8b8c4b2a3 (and assorted fixes later). > > If CR2 is getting corrupted; 7fbb98c5cb075 seems relevant. > > Peter, does x32 have a slightly different ABI/calling convention that > would make any of these patches just slightly 'off'? > As long as we're talking kernel code, x32 isn't even involved (we do not support compiling the kernel as x32 and most likely never will.) -hpa ^ permalink raw reply [flat|nested] 115+ messages in thread
end of thread, other threads:[~2014-03-07 23:08 UTC | newest] Thread overview: 115+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-02-28 15:33 [PATCH] x86 trace: Fix page fault tracing bug Jiri Olsa 2014-02-28 15:47 ` Peter Zijlstra 2014-02-28 16:05 ` [PATCHv2] " Jiri Olsa 2014-02-28 16:11 ` H. Peter Anvin 2014-02-28 16:23 ` Steven Rostedt 2014-02-28 16:15 ` Steven Rostedt 2014-03-05 0:03 ` [tip:x86/urgent] x86, trace: Fix CR2 corruption when tracing page faults tip-bot for Jiri Olsa 2014-03-05 11:14 ` Peter Zijlstra 2014-03-05 12:20 ` Steven Rostedt 2014-03-05 12:25 ` Peter Zijlstra 2014-03-05 12:33 ` Steven Rostedt 2014-03-05 12:54 ` Peter Zijlstra 2014-03-05 13:02 ` Peter Zijlstra 2014-03-05 13:07 ` Peter Zijlstra 2014-03-05 12:36 ` Peter Zijlstra 2014-03-05 13:00 ` Steven Rostedt 2014-03-05 13:08 ` Peter Zijlstra 2014-03-05 21:37 ` H. Peter Anvin 2014-03-06 8:40 ` Peter Zijlstra 2014-03-06 11:02 ` Steven Rostedt 2014-03-06 14:53 ` [PATCH] x86: Further robustify CR2 handling vs tracing Peter Zijlstra 2014-03-07 23:07 ` [tip:x86/urgent] x86, trace: " tip-bot for Peter Zijlstra 2014-02-28 15:47 ` [PATCH] x86 trace: Fix page fault tracing bug Jiri Olsa 2014-02-28 16:00 ` Steven Rostedt 2014-02-28 16:01 ` Steven Rostedt -- strict thread matches above, loose matches on Subject: below -- 2014-02-21 20:25 perf_fuzzer causes reboot Vince Weaver 2014-02-21 22:13 ` perf_fuzzer compiled for x32 " Vince Weaver 2014-02-21 22:34 ` Vince Weaver 2014-02-22 4:50 ` Vince Weaver 2014-02-22 5:03 ` H. Peter Anvin 2014-02-22 6:26 ` H. Peter Anvin 2014-02-23 5:18 ` Vince Weaver 2014-02-23 5:24 ` H. Peter Anvin 2014-02-23 6:07 ` H. Peter Anvin 2014-02-23 14:05 ` Vince Weaver 2014-02-24 3:02 ` Vince Weaver 2014-02-24 5:22 ` H. Peter Anvin 2014-02-24 15:35 ` Vince Weaver 2014-02-24 16:34 ` Vince Weaver 2014-02-24 16:47 ` H. Peter Anvin 2014-02-24 17:10 ` Vince Weaver 2014-02-24 17:25 ` Peter Zijlstra 2014-02-24 17:32 ` Vince Weaver 2014-02-24 17:40 ` H. Peter Anvin 2014-02-24 18:00 ` Vince Weaver 2014-02-24 18:07 ` Vince Weaver 2014-02-24 18:34 ` H. Peter Anvin 2014-02-24 19:13 ` Steven Rostedt 2014-02-24 19:15 ` H. Peter Anvin 2014-02-24 19:30 ` Peter Zijlstra 2014-02-24 19:32 ` Steven Rostedt 2014-02-25 3:49 ` H. Peter Anvin 2014-02-25 14:07 ` Vince Weaver 2014-02-25 14:34 ` H. Peter Anvin 2014-02-25 14:43 ` Steven Rostedt 2014-02-25 15:33 ` Vince Weaver 2014-02-26 15:06 ` Vince Weaver 2014-02-27 22:06 ` Vince Weaver 2014-02-27 22:31 ` Steven Rostedt 2014-02-27 22:52 ` H. Peter Anvin 2014-02-27 23:30 ` Steven Rostedt 2014-02-27 23:46 ` H. Peter Anvin 2014-02-28 1:00 ` Vince Weaver 2014-02-28 20:34 ` Paul E. McKenney 2014-02-28 20:47 ` Steven Rostedt 2014-02-28 20:54 ` Peter Zijlstra 2014-02-28 21:17 ` Paul E. McKenney 2014-02-28 21:27 ` Peter Zijlstra 2014-02-28 21:51 ` Paul E. McKenney 2014-02-28 21:55 ` Peter Zijlstra 2014-02-28 22:05 ` Steven Rostedt 2014-02-28 22:23 ` Paul E. McKenney 2014-02-28 1:34 ` Vince Weaver 2014-02-28 2:17 ` H. Peter Anvin 2014-02-28 2:57 ` Steven Rostedt 2014-02-28 11:11 ` Peter Zijlstra 2014-02-28 13:37 ` Steven Rostedt 2014-02-28 14:15 ` Vince Weaver 2014-02-28 14:23 ` Steven Rostedt 2014-02-28 15:07 ` Vince Weaver 2014-02-28 15:13 ` H. Peter Anvin 2014-02-28 15:40 ` Peter Zijlstra 2014-02-28 16:15 ` H. Peter Anvin 2014-02-28 16:29 ` Steven Rostedt 2014-02-28 19:33 ` [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace() Steven Rostedt 2014-02-28 20:46 ` Peter Zijlstra 2014-02-28 20:51 ` Steven Rostedt 2014-02-28 20:58 ` Peter Zijlstra 2014-02-28 21:01 ` Steven Rostedt 2014-02-28 21:17 ` Peter Zijlstra 2014-02-28 20:56 ` perf_fuzzer compiled for x32 causes reboot Peter Zijlstra 2014-02-28 21:06 ` Steven Rostedt 2014-03-01 9:16 ` Ingo Molnar 2014-03-01 9:50 ` Borislav Petkov 2014-03-01 16:50 ` H. Peter Anvin 2014-03-04 23:05 ` Borislav Petkov 2014-03-03 9:16 ` Peter Zijlstra 2014-02-28 20:55 ` Peter Zijlstra 2014-02-28 15:20 ` Steven Rostedt 2014-02-28 15:30 ` Steven Rostedt 2014-02-28 20:38 ` H. Peter Anvin 2014-02-28 20:46 ` Steven Rostedt 2014-02-28 21:18 ` Vince Weaver 2014-02-28 21:30 ` Steven Rostedt 2014-02-28 23:34 ` Vince Weaver 2014-03-01 0:43 ` H. Peter Anvin 2014-03-01 3:36 ` Steven Rostedt 2014-03-01 16:24 ` Andi Kleen 2014-03-02 15:34 ` Vince Weaver 2014-03-02 16:02 ` Vince Weaver 2014-02-28 9:39 ` Peter Zijlstra 2014-02-24 17:40 ` Peter Zijlstra 2014-02-24 17:41 ` Vince Weaver 2014-02-24 17:42 ` H. Peter Anvin 2014-02-24 17:52 ` H. Peter Anvin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).