All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
@ 2019-03-21  2:15 Steven Rostedt
  2019-03-21  8:33 ` Peter Zijlstra
  2019-04-17  1:52 ` He Zhe
  0 siblings, 2 replies; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21  2:15 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, He Zhe,
	Linus Torvalds

From: Steven Rostedt (VMware) <rostedt@goodmis.org>

He Zhe reported a crash by enabling trace events and selecting
"userstacktrace" which will read the stack of userspace for every trace
event recorded. Zhe narrowed it down to:

 c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")

With the problem config, I was able to also reproduce the error. I
narrowed it down to just having to do the following:

 # cd /sys/kernel/tracing
 # echo 1 > options/userstacktrace
 # echo 1 > events/preemptirq/irq_disable/enable

And sure enough, I triggered a crash. Well, it was systemd crashing
with a bad memory access??

 systemd-journal[537]: segfault at ed8cb8 ip 00007f7fffc9fef5 sp 00007ffc4062cb10 error 7

And it would crash similarly each time I tried it, but always at a
different place. After spending the day on this, I finally figured it
out. The bug is happening in entry_64.S right after error_entry.
There's two TRACE_IRQS_OFF in that code path, which if I comment out,
the bug goes away. Then it dawned on me that the crash always happens
when systemd does a normal page fault. We had this bug before, and it
was with the exception trace points.

The issue is that a tracepoint can fault (reading vmalloc or whatever).
And doing a userspace stack trace most definitely will fault. But if we
are coming from a legitimate page fault, the address of that fault (in
the CR2 register) will be lost if we fault before we get to the page
fault handler. That's exactly what is happening.

To solve this, a TRACE_IRQS_OFF_CR2 (and ON for consistency) was added
that saves the CR2 register. A new trace_hardirqs_off_thunk_cr2 is
created that stores the cr2 register, calls the
trace_hardirqs_off_caller, then on return restores the cr2 register if
it changed, before returning.

On my tests this fixes the issue. I just want to know if this is a
legitimate fix or if someone can come up with a better fix?

Note: this also saves the exception context just like the
do_page_fault() function does.

Note2: This only gets enabled when lockdep or irq tracing is enabled,
which is not recommended for production environments.

Link: http://lkml.kernel.org/r/897cf5cf-fc24-8a64-cb28-847f2d2e63d2@windriver.com

Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f47d21..7edffec328e2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -292,6 +292,32 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 
 	syscall_return_slowpath(regs);
 }
+
+extern void trace_hardirqs_on_caller(unsigned long caller_addr);
+__visible void trace_hardirqs_on_caller_cr2(unsigned long caller_addr)
+{
+	unsigned long address = read_cr2(); /* Get the faulting address */
+	enum ctx_state prev_state;
+
+	prev_state = exception_enter();
+	trace_hardirqs_on_caller(caller_addr);
+	if (address != read_cr2())
+		write_cr2(address);
+	exception_exit(prev_state);
+}
+
+extern void trace_hardirqs_off_caller(unsigned long caller_addr);
+__visible void trace_hardirqs_off_caller_cr2(unsigned long caller_addr)
+{
+	unsigned long address = read_cr2(); /* Get the faulting address */
+	enum ctx_state prev_state;
+
+	prev_state = exception_enter();
+	trace_hardirqs_off_caller(caller_addr);
+	if (address != read_cr2())
+		write_cr2(address);
+	exception_exit(prev_state);
+}
 #endif
 
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..73ddf24fed3e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1248,12 +1248,12 @@ ENTRY(error_entry)
 	 * we fix gsbase, and we should do it before enter_from_user_mode
 	 * (which can take locks).
 	 */
-	TRACE_IRQS_OFF
+	TRACE_IRQS_OFF_CR2
 	CALL_enter_from_user_mode
 	ret
 
 .Lerror_entry_done:
-	TRACE_IRQS_OFF
+	TRACE_IRQS_OFF_CR2
 	ret
 
 	/*
diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
index be36bf4e0957..1300b53b98cb 100644
--- a/arch/x86/entry/thunk_64.S
+++ b/arch/x86/entry/thunk_64.S
@@ -41,6 +41,8 @@
 #ifdef CONFIG_TRACE_IRQFLAGS
 	THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
 	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
+	THUNK trace_hardirqs_on_thunk_cr2,trace_hardirqs_on_caller_cr2,1
+	THUNK trace_hardirqs_off_thunk_cr2,trace_hardirqs_off_caller_cr2,1
 #endif
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 058e40fed167..dd511742ca2e 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -172,9 +172,13 @@ static inline int arch_irqs_disabled(void)
 #ifdef CONFIG_TRACE_IRQFLAGS
 #  define TRACE_IRQS_ON		call trace_hardirqs_on_thunk;
 #  define TRACE_IRQS_OFF	call trace_hardirqs_off_thunk;
+#  define TRACE_IRQS_ON_CR2	call trace_hardirqs_on_thunk_cr2;
+#  define TRACE_IRQS_OFF_CR2	call trace_hardirqs_off_thunk_cr2;
 #else
 #  define TRACE_IRQS_ON
 #  define TRACE_IRQS_OFF
+#  define TRACE_IRQS_ON_CR2
+#  define TRACE_IRQS_OFF_CR2
 #endif
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #  ifdef CONFIG_X86_64

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21  2:15 [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry Steven Rostedt
@ 2019-03-21  8:33 ` Peter Zijlstra
  2019-03-21  9:02   ` Peter Zijlstra
  2019-03-21 13:04   ` Steven Rostedt
  2019-04-17  1:52 ` He Zhe
  1 sibling, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21  8:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, He Zhe,
	Linus Torvalds

On Wed, Mar 20, 2019 at 10:15:34PM -0400, Steven Rostedt wrote:

> And it would crash similarly each time I tried it, but always at a
> different place. After spending the day on this, I finally figured it
> out. The bug is happening in entry_64.S right after error_entry.
> There's two TRACE_IRQS_OFF in that code path, which if I comment out,
> the bug goes away. Then it dawned on me that the crash always happens
> when systemd does a normal page fault. We had this bug before, and it
> was with the exception trace points.

0ac09f9f8cd1 ("x86, trace: Fix CR2 corruption when tracing page faults")
d4078e232267 ("x86, trace: Further robustify CR2 handling vs tracing")

Or were you talking about:

70fb74a5420f ("x86: Save cr2 in NMI in case NMIs take a page fault (for i386)")

> The issue is that a tracepoint can fault (reading vmalloc or whatever).
> And doing a userspace stack trace most definitely will fault. But if we
> are coming from a legitimate page fault, the address of that fault (in
> the CR2 register) will be lost if we fault before we get to the page
> fault handler. That's exactly what is happening.

Shees, that could've been written much clearer. So you're saying:

idtentry page_fault             do_page_fault           has_error_code=1
  call error_entry
    TRACE_IRQS_OFF
      call trace_hardirqs_off*
        <tracer stuff>
	  <fault> # modifies CR2
  call do_page_fault
    address = read_cr2(); /* whoopsie */

Right?

> To solve this, a TRACE_IRQS_OFF_CR2 (and ON for consistency) was added
> that saves the CR2 register. A new trace_hardirqs_off_thunk_cr2 is
> created that stores the cr2 register, calls the
> trace_hardirqs_off_caller, then on return restores the cr2 register if
> it changed, before returning.

Yuck.. also, not consistent with the actual patch. The thunk doesn't
save/restore CR2.

I really hate making this special TRACE_IRQS_OFF_CR2 thing, it feels far
too fragile. I'd _much_ rather push the #PF CR2 read much earlier.

Also, argh I fscking hate context tracking. That makes all this so much
more complicated. It if weren't for CALL_enter_from_user_mode, we could
pull that TRACE_IRQS_OFF out of error_entry.

Damn... Andy, any bright ideas?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21  8:33 ` Peter Zijlstra
@ 2019-03-21  9:02   ` Peter Zijlstra
  2019-03-21 10:45     ` Peter Zijlstra
  2019-03-21 13:04   ` Steven Rostedt
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21  9:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, He Zhe,
	Linus Torvalds

On Thu, Mar 21, 2019 at 09:33:17AM +0100, Peter Zijlstra wrote:

I'm thinking this problem wasn't new.

> idtentry page_fault             do_page_fault           has_error_code=1
>   call error_entry
>     TRACE_IRQS_OFF
>       call trace_hardirqs_off*
>         <tracer stuff>
> 	  <fault> # modifies CR2

      CALL_enter_from_user_mode
        __context_tracking_exit()
	  trace_user_exit(0)
	    #PF

>   call do_page_fault
>     address = read_cr2(); /* whoopsie */

And that also isn't fixed by your patch.

I'm trying to make idtentry put cr2 in rdx, such that do_page_fault()
takes address as a 3rd argument, but I'm still fighting that context
tracking nonsense.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21  9:02   ` Peter Zijlstra
@ 2019-03-21 10:45     ` Peter Zijlstra
  2019-03-21 13:32       ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 10:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, He Zhe,
	Linus Torvalds

On Thu, Mar 21, 2019 at 10:02:41AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 21, 2019 at 09:33:17AM +0100, Peter Zijlstra wrote:
> 
> I'm thinking this problem wasn't new.
> 
> > idtentry page_fault             do_page_fault           has_error_code=1
> >   call error_entry
> >     TRACE_IRQS_OFF
> >       call trace_hardirqs_off*
> >         <tracer stuff>
> > 	  <fault> # modifies CR2
> 
>       CALL_enter_from_user_mode
>         __context_tracking_exit()
> 	  trace_user_exit(0)
> 	    #PF
> 
> >   call do_page_fault
> >     address = read_cr2(); /* whoopsie */
> 
> And that also isn't fixed by your patch.
> 
> I'm trying to make idtentry put cr2 in rdx, such that do_page_fault()
> takes address as a 3rd argument, but I'm still fighting that context
> tracking nonsense.
> 

Something a little like so; completely untested and obviously needs
32bit changes too.

---
 arch/x86/entry/entry_64.S | 28 ++++++++++++++++------------
 arch/x86/mm/fault.c       |  3 +--
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..776dbe7ba72e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 read_cr2=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -901,18 +901,28 @@ ENTRY(\sym)
 
 	.if \paranoid
 	call	paranoid_entry
+	/* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */
 	.else
 	call	error_entry
+	/* returned flag: ebx=1: CALL_enter_from_user_mode, ebx=0: don't need it */
 	.endif
 	UNWIND_HINT_REGS
-	/* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */
 
-	.if \paranoid
+	.if \read_cr2
+	mov	%cr2, %rdx			/* XXX paravirt crap */
+	.endif
+
 	.if \shift_ist != -1
 	TRACE_IRQS_OFF_DEBUG			/* reload IDT in case of recursion */
 	.else
 	TRACE_IRQS_OFF
 	.endif
+
+	.if \paranoid == 0
+	testl	%ebx, %ebx
+	jz	.Lno_context_tracking_\@
+	CALL_enter_from_user_mode
+.Lno_context_tracking_\@:
 	.endif
 
 	movq	%rsp, %rdi			/* pt_regs pointer */
@@ -1140,7 +1150,7 @@ idtentry xenint3		do_int3			has_error_code=0
 #endif
 
 idtentry general_protection	do_general_protection	has_error_code=1
-idtentry page_fault		do_page_fault		has_error_code=1
+idtentry page_fault		do_page_fault		has_error_code=1	read_cr2=1
 
 #ifdef CONFIG_KVM_GUEST
 idtentry async_page_fault	do_async_page_fault	has_error_code=1
@@ -1243,17 +1253,11 @@ ENTRY(error_entry)
 	ENCODE_FRAME_POINTER
 	pushq	%r12
 
-	/*
-	 * We need to tell lockdep that IRQs are off.  We can't do this until
-	 * we fix gsbase, and we should do it before enter_from_user_mode
-	 * (which can take locks).
-	 */
-	TRACE_IRQS_OFF
-	CALL_enter_from_user_mode
+	mov	$1, %ebx			/* CALL_enter_from_user_mode */
 	ret
 
 .Lerror_entry_done:
-	TRACE_IRQS_OFF
+	xorl	%ebx, %ebx
 	ret
 
 	/*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 667f1da36208..aac7a74869a2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1560,9 +1560,8 @@ trace_page_fault_entries(unsigned long address, struct pt_regs *regs,
  * exception_{enter,exit}() contains all sorts of tracepoints.
  */
 dotraplinkage void notrace
-do_page_fault(struct pt_regs *regs, unsigned long error_code)
+do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
-	unsigned long address = read_cr2(); /* Get the faulting address */
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21  8:33 ` Peter Zijlstra
  2019-03-21  9:02   ` Peter Zijlstra
@ 2019-03-21 13:04   ` Steven Rostedt
  1 sibling, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, He Zhe,
	Linus Torvalds

On Thu, 21 Mar 2019 09:33:17 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Mar 20, 2019 at 10:15:34PM -0400, Steven Rostedt wrote:
> 
> > And it would crash similarly each time I tried it, but always at a
> > different place. After spending the day on this, I finally figured it
> > out. The bug is happening in entry_64.S right after error_entry.
> > There's two TRACE_IRQS_OFF in that code path, which if I comment out,
> > the bug goes away. Then it dawned on me that the crash always happens
> > when systemd does a normal page fault. We had this bug before, and it
> > was with the exception trace points.  
> 
> 0ac09f9f8cd1 ("x86, trace: Fix CR2 corruption when tracing page faults")
> d4078e232267 ("x86, trace: Further robustify CR2 handling vs tracing")

Probably these two, as I remember more about the discussions around
them, and not the actual commits. Although, I did take a look at the
do_page_fault() code that was added because of them. I just didn't do a
git blame to see what added it.

> 
> Or were you talking about:
> 
> 70fb74a5420f ("x86: Save cr2 in NMI in case NMIs take a page fault (for i386)")
> 
> > The issue is that a tracepoint can fault (reading vmalloc or whatever).
> > And doing a userspace stack trace most definitely will fault. But if we
> > are coming from a legitimate page fault, the address of that fault (in
> > the CR2 register) will be lost if we fault before we get to the page
> > fault handler. That's exactly what is happening.  
> 
> Shees, that could've been written much clearer. So you're saying:

I wrote this just before going to bed. It was the best I could come up
with at the time.

> 
> idtentry page_fault             do_page_fault           has_error_code=1
>   call error_entry
>     TRACE_IRQS_OFF
>       call trace_hardirqs_off*
>         <tracer stuff>
> 	  <fault> # modifies CR2
>   call do_page_fault
>     address = read_cr2(); /* whoopsie */
> 
> Right?

Yes.

> 
> > To solve this, a TRACE_IRQS_OFF_CR2 (and ON for consistency) was added
> > that saves the CR2 register. A new trace_hardirqs_off_thunk_cr2 is
> > created that stores the cr2 register, calls the
> > trace_hardirqs_off_caller, then on return restores the cr2 register if
> > it changed, before returning.  
> 
> Yuck.. also, not consistent with the actual patch. The thunk doesn't
> save/restore CR2.

Well, the thunk calls the caller_cr2 that does, which is just a helper
function for the thunk.

> 
> I really hate making this special TRACE_IRQS_OFF_CR2 thing, it feels far
> too fragile. I'd _much_ rather push the #PF CR2 read much earlier.
> 
> Also, argh I fscking hate context tracking. That makes all this so much
> more complicated. It if weren't for CALL_enter_from_user_mode, we could
> pull that TRACE_IRQS_OFF out of error_entry.

Yeah, and I didn't even test this with context tracking enabled yet.

-- Steve


> 
> Damn... Andy, any bright ideas?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 10:45     ` Peter Zijlstra
@ 2019-03-21 13:32       ` Steven Rostedt
  2019-03-21 13:55         ` Steven Rostedt
  2019-03-21 17:22         ` Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, He Zhe,
	Linus Torvalds

On Thu, 21 Mar 2019 11:45:17 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> -	.if \paranoid
> +	.if \read_cr2
> +	mov	%cr2, %rdx			/* XXX paravirt crap */
> +	.endif
> +

I'm guessing this breaks paravirt, as that's one reason I didn't add
the read_rc in assembly.

I tested your code and it also fixes the issue, but needed to add this
to make it compile:

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7d6f3f3fad78..1530909fbf70 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -81,7 +81,7 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
 void __init trap_init(void);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code);
-dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code);
+dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
 dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
 dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code);
 dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code);


-- Steve

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 13:32       ` Steven Rostedt
@ 2019-03-21 13:55         ` Steven Rostedt
  2019-03-21 17:23           ` Linus Torvalds
  2019-03-21 17:22         ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21 13:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, He Zhe,
	Linus Torvalds

On Thu, 21 Mar 2019 09:32:42 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I tested your code and it also fixes the issue,

Although I just hit this:

------------[ cut here ]------------
General protection fault in user access. Non-canonical address?
WARNING: CPU: 2 PID: 1620 at arch/x86/mm/extable.c:125 ex_handler_uaccess+0xc4/0xf0
Modules linked in: iptable_mangle xt_CHECKSUM tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter snd_hda_codec_hdmi iTCO_wdt snd_hda_codec_realtek snd_hda_codec_generic iTCO_vendor_support wmi_bmof snd_hda_intel snd_hda_codec intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_core coretemp snd_seq crct10dif_pclmul crct10dif_common i915 aesni_intel snd_seq_device snd_pcm aes_x86_64 crypto_simd cryptd snd_timer glue_helper i2c_i801 lpc_ich video wmi pcc_cpufreq ip_tables x_tables e1000e
CPU: 2 PID: 1620 Comm: dhclient Not tainted 5.1.0-rc1-test-yocto-standard+ #42
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
RIP: 0010:ex_handler_uaccess+0xc4/0xf0
Code: 01 00 00 00 31 d2 be 01 00 00 00 48 c7 c7 e8 ca f6 ac c6 05 23 9f 8e 01 01 e8 68 df 11 00 48 c7 c7 20 69 b9 ac e8 4b 42 01 00 <0f> 0b b9 01 00 00 00 31 d2 be 01 00 00 00 48 c7 c7 b8 ca f6 ac e8
RSP: 0018:ffffa4bd409e79a0 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffffffffac602400 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffacf7f118
RBP: ffffa4bd409e79b8 R08: ffffffffad27ba00 R09: 000000000000003f
R10: 0000000000000000 R11: 0000000000000654 R12: 0000000000000001
R13: ffffa4bd409e7a28 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f7e2fe13e80(0000) GS:ffff8b101a880000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001010 CR3: 0000000114d68001 CR4: 00000000001606e0
Call Trace:
 fixup_exception+0x4a/0x61
 do_general_protection+0x50/0x190
 general_protection+0x27/0x30
RIP: 0010:save_stack_trace_user+0xc9/0x190
Code: 0f 96 c6 48 c7 c7 88 6a f6 ac 31 c9 e8 40 e8 14 00 49 39 dc 0f 87 c3 00 00 00 41 83 87 a0 18 00 00 01 0f 1f 00 0f ae e8 31 db <4d> 8b 3c 24 31 f6 85 db ba 01 00 00 00 40 0f 94 c6 48 c7 c7 b8 6a
RSP: 0018:ffffa4bd409e7ad8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffacf66a88
RBP: ffffa4bd409e7b00 R08: 0000000000000000 R09: ffff8b0fb4df1a08
R10: 00000000000009f4 R11: ffff8b0fb4df1a04 R12: 62696c2f7273752f
R13: ffffa4bd409e7f58 R14: ffffa4bd409e7b10 R15: ffff8b1017d53a80
 ? save_stack_trace_user+0xb0/0x190
 ftrace_trace_userstack+0x128/0x1c0
 trace_buffer_unlock_commit_regs+0x83/0xb0
 trace_event_buffer_commit+0x6e/0x1e0
 trace_event_raw_event_preemptirq_template+0x73/0xb0
 ? __get_user_pages+0x2d0/0x860
 ? handle_mm_fault+0xa9/0x3c0
 trace_hardirqs_off+0xbd/0x100
 handle_mm_fault+0xa9/0x3c0
 __get_user_pages+0x2d0/0x860
 get_user_pages_remote+0x169/0x260
 copy_strings.isra.8.part.9+0x18e/0x300
 copy_strings_kernel+0x39/0x50
 __do_execve_file.isra.14+0x5b3/0x9e0
 do_execve+0x25/0x30
 __x64_sys_execve+0x2b/0x40
 do_syscall_64+0x79/0x1f0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f7e30272b0b
Code: 41 89 01 eb da 66 2e 0f 1f 84 00 00 00 00 00 f7 d8 64 41 89 01 eb d6 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4d 63 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc34858f28 EFLAGS: 00000206 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 00005635d0651f60 RCX: 00007f7e30272b0b
RDX: 00005635d0658a60 RSI: 00007ffc34858f40 RDI: 00007ffc3485ae89
RBP: 00007ffc3485ae89 R08: 00005635d05ff290 R09: 0000000000000001
R10: 00007f7e2fe13e80 R11: 0000000000000206 R12: 00005635d0658a60
R13: 0000000000000000 R14: 00005635d05d9be0 R15: 0000000000000136
---[ end trace 0a02ebd5916dacc5 ]---

Looks to be an issue with the save_stack_trace_user() not checking if
the address is canonical before reading it. I guess access_ok() doesn't
check that. Should we add something in save_stack_trace_user() to test
if the frame it reads is canonical or not before reading it. We don't
really want these warnings to happen because the user space stack has a
non-canonical address in it as the stack tracer reads it.

--- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 13:32       ` Steven Rostedt
  2019-03-21 13:55         ` Steven Rostedt
@ 2019-03-21 17:22         ` Peter Zijlstra
  2019-03-21 18:05           ` Andy Lutomirski
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 17:22 UTC (permalink / raw)
  To: Steven Rostedt, Juergen Gross
  Cc: LKML, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, He Zhe,
	Linus Torvalds

On Thu, Mar 21, 2019 at 09:32:42AM -0400, Steven Rostedt wrote:
> On Thu, 21 Mar 2019 11:45:17 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > -	.if \paranoid
> > +	.if \read_cr2
> > +	mov	%cr2, %rdx			/* XXX paravirt crap */
> > +	.endif
> > +
> 
> I'm guessing this breaks paravirt, as that's one reason I didn't add
> the read_rc in assembly.

Still completely missing 32bit support.. but this has paravirt bits on.

It changes the read_cr2 stuff to CALLEE_SAVED and implements it in asm
to only require AX, this means that asm can use it without it clobbering
all volatile regs.

My compiler also tripped over the KVM asyn pagefault stuff, which I also
converted to use the new form.

Again, completely untested. And given that the last one just worked,
this one will _have_ to explode and let the magic smoke out :-)

---
 arch/x86/entry/entry_64.S             | 36 ++++++++++++++++++++++-------------
 arch/x86/include/asm/kvm_para.h       |  2 +-
 arch/x86/include/asm/paravirt.h       | 10 ++++++----
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/include/asm/traps.h          |  2 +-
 arch/x86/kernel/asm-offsets.c         |  1 +
 arch/x86/kernel/head_64.S             |  2 +-
 arch/x86/kernel/kvm.c                 |  8 ++++----
 arch/x86/kernel/paravirt.c            |  2 +-
 arch/x86/mm/fault.c                   |  3 +--
 arch/x86/xen/enlighten_pv.c           |  3 ++-
 arch/x86/xen/mmu_pv.c                 | 12 +-----------
 arch/x86/xen/xen-asm.S                | 25 ++++++++++++++++++++++++
 arch/x86/xen/xen-ops.h                |  3 +++
 14 files changed, 71 insertions(+), 40 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..40a0af6ae862 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -42,6 +42,12 @@
 
 #include "calling.h"
 
+#ifdef CONFIG_PARAVIRT_XXL
+#define GET_CR2_INTO(reg) GET_CR2_INTO_RAX ; movq %rax, reg
+#else
+#define GET_CR2_INTO(reg) movq %cr2, reg
+#endif
+
 .code64
 .section .entry.text, "ax"
 
@@ -879,7 +885,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 read_cr2=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -901,18 +907,28 @@ ENTRY(\sym)
 
 	.if \paranoid
 	call	paranoid_entry
+	/* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */
 	.else
 	call	error_entry
+	/* returned flag: ebx=1: CALL_enter_from_user_mode, ebx=0: don't need it */
 	.endif
 	UNWIND_HINT_REGS
-	/* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */
 
-	.if \paranoid
+	.if \read_cr2
+	GET_CR2_INTO(%rdx);			/* can clobber %rax */
+	.endif
+
 	.if \shift_ist != -1
 	TRACE_IRQS_OFF_DEBUG			/* reload IDT in case of recursion */
 	.else
 	TRACE_IRQS_OFF
 	.endif
+
+	.if \paranoid == 0
+	testl	%ebx, %ebx
+	jz	.Lno_context_tracking_\@
+	CALL_enter_from_user_mode
+.Lno_context_tracking_\@:
 	.endif
 
 	movq	%rsp, %rdi			/* pt_regs pointer */
@@ -1140,10 +1156,10 @@ idtentry xenint3		do_int3			has_error_code=0
 #endif
 
 idtentry general_protection	do_general_protection	has_error_code=1
-idtentry page_fault		do_page_fault		has_error_code=1
+idtentry page_fault		do_page_fault		has_error_code=1	read_cr2=1
 
 #ifdef CONFIG_KVM_GUEST
-idtentry async_page_fault	do_async_page_fault	has_error_code=1
+idtentry async_page_fault	do_async_page_fault	has_error_code=1	read_cr2=1
 #endif
 
 #ifdef CONFIG_X86_MCE
@@ -1243,17 +1259,11 @@ ENTRY(error_entry)
 	ENCODE_FRAME_POINTER
 	pushq	%r12
 
-	/*
-	 * We need to tell lockdep that IRQs are off.  We can't do this until
-	 * we fix gsbase, and we should do it before enter_from_user_mode
-	 * (which can take locks).
-	 */
-	TRACE_IRQS_OFF
-	CALL_enter_from_user_mode
+	mov	$1, %ebx			/* CALL_enter_from_user_mode */
 	ret
 
 .Lerror_entry_done:
-	TRACE_IRQS_OFF
+	xorl	%ebx, %ebx
 	ret
 
 	/*
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 5ed3cf1c3934..9b4df6eaa11a 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -92,7 +92,7 @@ void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-void do_async_page_fault(struct pt_regs *regs, unsigned long error_code);
+void do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void __init kvm_spinlock_init(void);
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..a0f25b9af88c 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -116,7 +116,7 @@ static inline void write_cr0(unsigned long x)
 
 static inline unsigned long read_cr2(void)
 {
-	return PVOP_CALL0(unsigned long, mmu.read_cr2);
+	return PVOP_CALLEE0(unsigned long, mmu.read_cr2);
 }
 
 static inline void write_cr2(unsigned long x)
@@ -911,9 +911,11 @@ extern void default_banner(void);
 		 )
 #endif
 
-#define GET_CR2_INTO_RAX				\
-	ANNOTATE_RETPOLINE_SAFE;				\
-	call PARA_INDIRECT(pv_ops+PV_MMU_read_cr2);
+#define GET_CR2_INTO_RAX						\
+	PARA_SITE(PARA_PATCH(PV_MMU_read_cr2),				\
+		  ANNOTATE_RETPOLINE_SAFE;				\
+		  call PARA_INDIRECT(pv_ops+PV_MMU_read_cr2);		\
+		 )
 
 #ifdef CONFIG_PARAVIRT_XXL
 #define USERGS_SYSRET64							\
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 2474e434a6f7..8e46f1b764a3 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -220,7 +220,7 @@ struct pv_mmu_ops {
 	void (*exit_mmap)(struct mm_struct *mm);
 
 #ifdef CONFIG_PARAVIRT_XXL
-	unsigned long (*read_cr2)(void);
+	struct paravirt_callee_save read_cr2;
 	void (*write_cr2)(unsigned long);
 
 	unsigned long (*read_cr3)(void);
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7d6f3f3fad78..1530909fbf70 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -81,7 +81,7 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
 void __init trap_init(void);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code);
-dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code);
+dotraplinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
 dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *regs, long error_code);
 dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code);
 dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code);
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 168543d077d7..851199e5b478 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -77,6 +77,7 @@ static void __used common(void)
 	BLANK();
 	OFFSET(XEN_vcpu_info_mask, vcpu_info, evtchn_upcall_mask);
 	OFFSET(XEN_vcpu_info_pending, vcpu_info, evtchn_upcall_pending);
+	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
 	BLANK();
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d1dbe8e4eb82..7f21e16315d3 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -323,7 +323,7 @@ END(early_idt_handler_array)
 
 	cmpq $14,%rsi		/* Page fault? */
 	jnz 10f
-	GET_CR2_INTO(%rdi)	/* Can clobber any volatile register if pv */
+	GET_CR2_INTO(%rdi)	/* can clobber %rax if pv */
 	call early_make_pgtable
 	andl %eax,%eax
 	jz 20f			/* All good */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5c93a65ee1e5..6396f1628c63 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -255,23 +255,23 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
 NOKPROBE_SYMBOL(kvm_read_and_reset_pf_reason);
 
 dotraplinkage void
-do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
+do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
 	enum ctx_state prev_state;
 
 	switch (kvm_read_and_reset_pf_reason()) {
 	default:
-		do_page_fault(regs, error_code);
+		do_page_fault(regs, error_code, address);
 		break;
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
 		prev_state = exception_enter();
-		kvm_async_pf_task_wait((u32)read_cr2(), !user_mode(regs));
+		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
 		exception_exit(prev_state);
 		break;
 	case KVM_PV_REASON_PAGE_READY:
 		rcu_irq_enter();
-		kvm_async_pf_task_wake((u32)read_cr2());
+		kvm_async_pf_task_wake((u32)address);
 		rcu_irq_exit();
 		break;
 	}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c0e0101133f3..e853d129f100 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -382,7 +382,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.exit_mmap		= paravirt_nop,
 
 #ifdef CONFIG_PARAVIRT_XXL
-	.mmu.read_cr2		= native_read_cr2,
+	.mmu.read_cr2		= __PV_IS_CALLEE_SAVE(native_read_cr2),
 	.mmu.write_cr2		= native_write_cr2,
 	.mmu.read_cr3		= __native_read_cr3,
 	.mmu.write_cr3		= native_write_cr3,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 667f1da36208..aac7a74869a2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1560,9 +1560,8 @@ trace_page_fault_entries(unsigned long address, struct pt_regs *regs,
  * exception_{enter,exit}() contains all sorts of tracepoints.
  */
 dotraplinkage void notrace
-do_page_fault(struct pt_regs *regs, unsigned long error_code)
+do_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
-	unsigned long address = read_cr2(); /* Get the faulting address */
 	enum ctx_state prev_state;
 
 	prev_state = exception_enter();
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index c54a493e139a..f90a1b67f924 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -998,7 +998,8 @@ void __init xen_setup_vcpu_info_placement(void)
 			__PV_IS_CALLEE_SAVE(xen_irq_disable_direct);
 		pv_ops.irq.irq_enable =
 			__PV_IS_CALLEE_SAVE(xen_irq_enable_direct);
-		pv_ops.mmu.read_cr2 = xen_read_cr2_direct;
+		pv_ops.mmu.read_cr2 =
+			__PV_IS_CALLEE_SAVE(xen_read_cr2_direct);
 	}
 }
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index a21e1734fc1f..921e727305bc 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1307,16 +1307,6 @@ static void xen_write_cr2(unsigned long cr2)
 	this_cpu_read(xen_vcpu)->arch.cr2 = cr2;
 }
 
-static unsigned long xen_read_cr2(void)
-{
-	return this_cpu_read(xen_vcpu)->arch.cr2;
-}
-
-unsigned long xen_read_cr2_direct(void)
-{
-	return this_cpu_read(xen_vcpu_info.arch.cr2);
-}
-
 static noinline void xen_flush_tlb(void)
 {
 	struct mmuext_op *op;
@@ -2399,7 +2389,7 @@ static void xen_leave_lazy_mmu(void)
 }
 
 static const struct pv_mmu_ops xen_mmu_ops __initconst = {
-	.read_cr2 = xen_read_cr2,
+	.read_cr2 = __PV_IS_CALLEE_SAVE(xen_read_cr2),
 	.write_cr2 = xen_write_cr2,
 
 	.read_cr3 = xen_read_cr3,
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 8019edd0125c..673b93d56c8a 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -135,3 +135,28 @@ ENTRY(check_events)
 	FRAME_END
 	ret
 ENDPROC(check_events)
+
+ENTRY(xen_read_cr2)
+	FRAME_BEGIN
+#ifdef CONFIG_X86_64
+	movq	PER_CPU_VAR(xen_vcpu), %rax
+	movq	XEN_vcpu_info_arch_cr2(%rax), %rax
+#else
+	movl	PER_CPU_VAR(xen_vcpu), %eax
+	movl	XEN_vcpu_info_arch_cr2(%eax), %eax
+#endif
+	FRAME_END
+	ret
+	ENDPROC(xen_read_cr2);
+
+ENTRY(xen_read_cr2_direct)
+	FRAME_BEGIN
+#ifdef CONFIG_X86_64
+	movq	PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_arch_cr2, %rax
+#else
+	movl	PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_arch_cr2, %eax
+#endif
+	FRAME_END
+	ret
+	ENDPROC(xen_read_cr2_direct);
+
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 0e60bd918695..38767bbe0034 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -134,6 +134,9 @@ __visible void xen_irq_disable_direct(void);
 __visible unsigned long xen_save_fl_direct(void);
 __visible void xen_restore_fl_direct(unsigned long);
 
+__visible unsigned long xen_read_cr2(void);
+__visible unsigned long xen_read_cr2_direct(void);
+
 /* These are not functions, and cannot be called normally */
 __visible void xen_iret(void);
 __visible void xen_sysret32(void);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 13:55         ` Steven Rostedt
@ 2019-03-21 17:23           ` Linus Torvalds
  0 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2019-03-21 17:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, LKML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Andy Lutomirski, Joel Fernandes,
	He Zhe

On Thu, Mar 21, 2019 at 6:55 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Looks to be an issue with the save_stack_trace_user() not checking if
> the address is canonical before reading it. I guess access_ok() doesn't
> check that.

access_ok() definitely does check for non-canonical.

But it only does so when USER_DS is in effect.

If you use KERNEL_DS, then you have to check the address yourself.

                    Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 17:22         ` Peter Zijlstra
@ 2019-03-21 18:05           ` Andy Lutomirski
  2019-03-21 18:10             ` Steven Rostedt
                               ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-21 18:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds

On Thu, Mar 21, 2019 at 10:22 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Mar 21, 2019 at 09:32:42AM -0400, Steven Rostedt wrote:
> > On Thu, 21 Mar 2019 11:45:17 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > -   .if \paranoid
> > > +   .if \read_cr2
> > > +   mov     %cr2, %rdx                      /* XXX paravirt crap */
> > > +   .endif
> > > +
> >
> > I'm guessing this breaks paravirt, as that's one reason I didn't add
> > the read_rc in assembly.
>
> Still completely missing 32bit support.. but this has paravirt bits on.
>
> It changes the read_cr2 stuff to CALLEE_SAVED and implements it in asm
> to only require AX, this means that asm can use it without it clobbering
> all volatile regs.
>
> My compiler also tripped over the KVM asyn pagefault stuff, which I also
> converted to use the new form.
>
> Again, completely untested. And given that the last one just worked,
> this one will _have_ to explode and let the magic smoke out :-)
>

Ugh.

I certainly agree in principle that sticking the CR2 read into the asm
is the right solution.  But this patch makes the spaghetti even more
tangled.  Maybe we can rearrange the code a bit so that the entry
sequence saves at least one register before calling error_entry, so we
can do it the obvious way.

In the long run, I think the right solution is to rewrite even more of
this mess in C.  We really ought to be able to put the IRQ flag
tracing and the context tracking into C code.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:05           ` Andy Lutomirski
@ 2019-03-21 18:10             ` Steven Rostedt
  2019-03-21 18:27               ` Andy Lutomirski
  2019-03-21 18:28               ` Peter Zijlstra
  2019-03-21 18:18             ` Linus Torvalds
  2019-03-21 18:27             ` Peter Zijlstra
  2 siblings, 2 replies; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21 18:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

On Thu, 21 Mar 2019 11:05:06 -0700
Andy Lutomirski <luto@amacapital.net> wrote:

> In the long run, I think the right solution is to rewrite even more of
> this mess in C.  We really ought to be able to put the IRQ flag
> tracing and the context tracking into C code.

And once we do that, we can work on getting the irq tracing
incorporated into a jump_label type that we could possibly enable
lockdep at start up, and then disable it later, even on production
systems! That is, to be able to turn it off and bring the system back
up to full speed.

It would also allow for irq tracing too.

Some inquiring minds have been asking about this ;-)

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:05           ` Andy Lutomirski
  2019-03-21 18:10             ` Steven Rostedt
@ 2019-03-21 18:18             ` Linus Torvalds
  2019-03-21 18:20               ` Andy Lutomirski
                                 ` (2 more replies)
  2019-03-21 18:27             ` Peter Zijlstra
  2 siblings, 3 replies; 35+ messages in thread
From: Linus Torvalds @ 2019-03-21 18:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Steven Rostedt, Juergen Gross, LKML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joel Fernandes, He Zhe

On Thu, Mar 21, 2019 at 11:05 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> In the long run, I think the right solution is to rewrite even more of
> this mess in C.  We really ought to be able to put the IRQ flag
> tracing and the context tracking into C code.

Tangentially about this long run thing - can we start discussing just
dropping PV mode entirely in the long run? Or even not-so-long run?

What are the pressing advantages of paravirt these days? Because I can
point to a lot of pressing _disadvantages_, and not all of them are
about more complex code as shown by this patch...

             Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:18             ` Linus Torvalds
@ 2019-03-21 18:20               ` Andy Lutomirski
  2019-03-21 18:25                 ` Linus Torvalds
  2019-03-21 18:22               ` hpa
  2019-03-22  5:54               ` Juergen Gross
  2 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-21 18:20 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Cooper, Jan Beulich, Boris Ostrovsky,
	Juergen Gross, Stefano Stabellini
  Cc: Peter Zijlstra, Steven Rostedt, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe

On Thu, Mar 21, 2019 at 11:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Mar 21, 2019 at 11:05 AM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > In the long run, I think the right solution is to rewrite even more of
> > this mess in C.  We really ought to be able to put the IRQ flag
> > tracing and the context tracking into C code.
>
> Tangentially about this long run thing - can we start discussing just
> dropping PV mode entirely in the long run? Or even not-so-long run?
>
> What are the pressing advantages of paravirt these days? Because I can
> point to a lot of pressing _disadvantages_, and not all of them are
> about more complex code as shown by this patch...
>

I dunno.  Lots of people at least use to have serious commercial interest in it.

Hey Xen folks, how close are we to being able to say "if you want to
run a new kernel, you need to switch to PVH or similar"?

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:18             ` Linus Torvalds
  2019-03-21 18:20               ` Andy Lutomirski
@ 2019-03-21 18:22               ` hpa
  2019-03-22  5:54               ` Juergen Gross
  2 siblings, 0 replies; 35+ messages in thread
From: hpa @ 2019-03-21 18:22 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Peter Zijlstra, Steven Rostedt, Juergen Gross, LKML,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe

On March 21, 2019 11:18:53 AM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Thu, Mar 21, 2019 at 11:05 AM Andy Lutomirski <luto@amacapital.net>
>wrote:
>>
>> In the long run, I think the right solution is to rewrite even more
>of
>> this mess in C.  We really ought to be able to put the IRQ flag
>> tracing and the context tracking into C code.
>
>Tangentially about this long run thing - can we start discussing just
>dropping PV mode entirely in the long run? Or even not-so-long run?
>
>What are the pressing advantages of paravirt these days? Because I can
>point to a lot of pressing _disadvantages_, and not all of them are
>about more complex code as shown by this patch...
>
>             Linus

God only knows how nice that would be (for things that aren't naturally driverized.) Last I heard, which was quite a while ago, the big problem was that the Amazon cloud runs Xen in PV mode...
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:20               ` Andy Lutomirski
@ 2019-03-21 18:25                 ` Linus Torvalds
  2019-03-21 18:37                   ` Peter Zijlstra
  2019-03-21 18:38                   ` Andy Lutomirski
  0 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2019-03-21 18:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Cooper, Jan Beulich, Boris Ostrovsky, Juergen Gross,
	Stefano Stabellini, Peter Zijlstra, Steven Rostedt, LKML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joel Fernandes, He Zhe

On Thu, Mar 21, 2019 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> I dunno.  Lots of people at least use to have serious commercial interest in it.

Yes, it used to be a big deal. But full virtualization has gotten a
lot more common and better.

> Hey Xen folks, how close are we to being able to say "if you want to
> run a new kernel, you need to switch to PVH or similar"?

I'd also like to know if we could perhaps at least limit PV to just
the thing that people care most deeply about.

For example, maybe people notice that they really deeply care about
the PV spinlocks because they help a lot for some loads, but don't
care so much about the low-level CPU PV stuff any more because modern
CPUs do _those_ things so well these days.

So it might not be an all-or-nothing thing, but a gradual "let's stop
supporting xyz under PV, because it causes pain and isn't worth it".

                Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:10             ` Steven Rostedt
@ 2019-03-21 18:27               ` Andy Lutomirski
  2019-03-21 20:50                 ` Peter Zijlstra
  2019-03-21 18:28               ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-21 18:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

[-- Attachment #1: Type: text/plain, Size: 1781 bytes --]

On Thu, Mar 21, 2019 at 11:10 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 21 Mar 2019 11:05:06 -0700
> Andy Lutomirski <luto@amacapital.net> wrote:
>
> > In the long run, I think the right solution is to rewrite even more of
> > this mess in C.  We really ought to be able to put the IRQ flag
> > tracing and the context tracking into C code.
>
> And once we do that, we can work on getting the irq tracing
> incorporated into a jump_label type that we could possibly enable
> lockdep at start up, and then disable it later, even on production
> systems! That is, to be able to turn it off and bring the system back
> up to full speed.
>
> It would also allow for irq tracing too.
>
> Some inquiring minds have been asking about this ;-)
>

Well, here's pass zero at this.  Untested, because it obviously
doesn't work.  Here are just a few things that are almost certainly
wrong with it

 - The assertion that entries return with IRQs off will blow up,
because a bunch of the entries can return with IRQs on.  Needs fixing.

 - native_load_gs_index needs to be re-added as a static inline
somewhere or maybe even as a real function because of PV.  Sigh.

 - The IRQ tracing needs to be re-added.

 - Some real semantics need to be defined for precisely what code is
responsible for tracing.

 - We need some asm-callable assertions to check the following
conditions as appropriate:

 (a) that IRQ flags are currently traced as off.
 (b) that IRQ flags are currently traced to match the IRET frame.
 (c) that our context tracking is currently in good shape.  I'm not
100% sure how to define this.


 - We need to do some serious don't-instrument-me stuff to all the C
entries, since we're now in an awful context when calling them.

 - I haven't touched IRQs at all.

[-- Attachment #2: entry.patch --]
[-- Type: text/x-patch, Size: 3740 bytes --]

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..0ae05724eafc 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -167,15 +167,11 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
-	TRACE_IRQS_OFF
-
 	/* IRQs are off. */
 	movq	%rax, %rdi
 	movq	%rsp, %rsi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
-	TRACE_IRQS_IRETQ		/* we're about to change IF */
-
 	/*
 	 * Try to use SYSRET instead of IRET if we're returning to
 	 * a completely clean 64-bit userspace context.  If we're not,
@@ -342,7 +338,6 @@ ENTRY(ret_from_fork)
 	UNWIND_HINT_REGS
 	movq	%rsp, %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
-	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	jmp	swapgs_restore_regs_and_return_to_usermode
 
 1:
@@ -595,7 +590,6 @@ ret_from_intr:
 GLOBAL(retint_user)
 	mov	%rsp,%rdi
 	call	prepare_exit_to_usermode
-	TRACE_IRQS_IRETQ
 
 GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 #ifdef CONFIG_DEBUG_ENTRY
@@ -651,10 +645,6 @@ retint_kernel:
 	jmp	0b
 1:
 #endif
-	/*
-	 * The iretq could re-enable interrupts:
-	 */
-	TRACE_IRQS_IRETQ
 
 GLOBAL(restore_regs_and_return_to_kernel)
 #ifdef CONFIG_DEBUG_ENTRY
@@ -879,15 +869,10 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
-	/* Sanity check */
-	.if \shift_ist != -1 && \paranoid == 0
-	.error "using shift_ist requires paranoid=1"
-	.endif
-
 	ASM_CLAC
 
 	.if \has_error_code == 0
@@ -907,14 +892,6 @@ ENTRY(\sym)
 	UNWIND_HINT_REGS
 	/* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */
 
-	.if \paranoid
-	.if \shift_ist != -1
-	TRACE_IRQS_OFF_DEBUG			/* reload IDT in case of recursion */
-	.else
-	TRACE_IRQS_OFF
-	.endif
-	.endif
-
 	movq	%rsp, %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
@@ -924,17 +901,9 @@ ENTRY(\sym)
 	xorl	%esi, %esi			/* no error code */
 	.endif
 
-	.if \shift_ist != -1
-	subq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
-	.endif
-
 	call	\do_sym
 
-	.if \shift_ist != -1
-	addq	$EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
-	.endif
-
-	/* these procedures expect "no swapgs" flag in ebx */
+	/* paranoid_exit expects "no swapgs" flag in ebx */
 	.if \paranoid
 	jmp	paranoid_exit
 	.else
@@ -986,18 +955,14 @@ idtentry simd_coprocessor_error		do_simd_coprocessor_error	has_error_code=0
 	 * Reload gs selector with exception handling
 	 * edi:  new selector
 	 */
-ENTRY(native_load_gs_index)
+ENTRY(native_load_gs_index_irqs_off)
+	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	FRAME_BEGIN
-	pushfq
-	DISABLE_INTERRUPTS(CLBR_ANY & ~CLBR_RDI)
-	TRACE_IRQS_OFF
 	SWAPGS
 .Lgs_change:
 	movl	%edi, %gs
 2:	ALTERNATIVE "", "mfence", X86_BUG_SWAPGS_FENCE
 	SWAPGS
-	TRACE_IRQS_FLAGS (%rsp)
-	popfq
 	FRAME_END
 	ret
 ENDPROC(native_load_gs_index)
@@ -1243,17 +1208,7 @@ ENTRY(error_entry)
 	ENCODE_FRAME_POINTER
 	pushq	%r12
 
-	/*
-	 * We need to tell lockdep that IRQs are off.  We can't do this until
-	 * we fix gsbase, and we should do it before enter_from_user_mode
-	 * (which can take locks).
-	 */
-	TRACE_IRQS_OFF
-	CALL_enter_from_user_mode
-	ret
-
 .Lerror_entry_done:
-	TRACE_IRQS_OFF
 	ret
 
 	/*
@@ -1306,8 +1261,7 @@ END(error_entry)
 
 ENTRY(error_exit)
 	UNWIND_HINT_REGS
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
+	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	testb	$3, CS(%rsp)
 	jz	retint_kernel
 	jmp	retint_user

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:05           ` Andy Lutomirski
  2019-03-21 18:10             ` Steven Rostedt
  2019-03-21 18:18             ` Linus Torvalds
@ 2019-03-21 18:27             ` Peter Zijlstra
  2019-03-21 18:28               ` Andy Lutomirski
  2 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 18:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds

On Thu, Mar 21, 2019 at 11:05:06AM -0700, Andy Lutomirski wrote:

> Ugh.
> 
> I certainly agree in principle that sticking the CR2 read into the asm
> is the right solution.  But this patch makes the spaghetti even more
> tangled.  Maybe we can rearrange the code a bit so that the entry
> sequence saves at least one register before calling error_entry, so we
> can do it the obvious way.

What is the obvious way? Note that with all the Xen/paravirt-me-harder
bits on, we need to clobber at least RAX to get CR2.

And most of the patch was munging the paravirt bits to make that happen.

Another thing I did notices is that CLAC and CLD are not next to one
another in these paths.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:10             ` Steven Rostedt
  2019-03-21 18:27               ` Andy Lutomirski
@ 2019-03-21 18:28               ` Peter Zijlstra
  2019-03-21 18:55                 ` Steven Rostedt
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 18:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

On Thu, Mar 21, 2019 at 02:10:20PM -0400, Steven Rostedt wrote:
> On Thu, 21 Mar 2019 11:05:06 -0700
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
> > In the long run, I think the right solution is to rewrite even more of
> > this mess in C.  We really ought to be able to put the IRQ flag
> > tracing and the context tracking into C code.
> 
> And once we do that, we can work on getting the irq tracing
> incorporated into a jump_label type that we could possibly enable
> lockdep at start up, and then disable it later, even on production
> systems! That is, to be able to turn it off and bring the system back
> up to full speed.

You forget the stupid amount of data bloat that lockdep brings.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:27             ` Peter Zijlstra
@ 2019-03-21 18:28               ` Andy Lutomirski
  2019-03-21 18:33                 ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-21 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds

On Thu, Mar 21, 2019 at 11:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Mar 21, 2019 at 11:05:06AM -0700, Andy Lutomirski wrote:
>
> > Ugh.
> >
> > I certainly agree in principle that sticking the CR2 read into the asm
> > is the right solution.  But this patch makes the spaghetti even more
> > tangled.  Maybe we can rearrange the code a bit so that the entry
> > sequence saves at least one register before calling error_entry, so we
> > can do it the obvious way.
>
> What is the obvious way? Note that with all the Xen/paravirt-me-harder
> bits on, we need to clobber at least RAX to get CR2.
>
> And most of the patch was munging the paravirt bits to make that happen.
>
> Another thing I did notices is that CLAC and CLD are not next to one
> another in these paths.

The thing I really don't like is all the EBX crud to try to get
everything in to the right order.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:28               ` Andy Lutomirski
@ 2019-03-21 18:33                 ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 18:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds

On Thu, Mar 21, 2019 at 11:28:50AM -0700, Andy Lutomirski wrote:
> On Thu, Mar 21, 2019 at 11:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Mar 21, 2019 at 11:05:06AM -0700, Andy Lutomirski wrote:
> >
> > > Ugh.
> > >
> > > I certainly agree in principle that sticking the CR2 read into the asm
> > > is the right solution.  But this patch makes the spaghetti even more
> > > tangled.  Maybe we can rearrange the code a bit so that the entry
> > > sequence saves at least one register before calling error_entry, so we
> > > can do it the obvious way.
> >
> > What is the obvious way? Note that with all the Xen/paravirt-me-harder
> > bits on, we need to clobber at least RAX to get CR2.
> >
> > And most of the patch was munging the paravirt bits to make that happen.
> >
> > Another thing I did notices is that CLAC and CLD are not next to one
> > another in these paths.
> 
> The thing I really don't like is all the EBX crud to try to get
> everything in to the right order.

Ah yes, granted. I hate that part too but didn't see a way around it :/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:25                 ` Linus Torvalds
@ 2019-03-21 18:37                   ` Peter Zijlstra
  2019-03-21 18:39                     ` Andy Lutomirski
  2019-03-21 18:38                   ` Andy Lutomirski
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andrew Cooper, Jan Beulich, Boris Ostrovsky,
	Juergen Gross, Stefano Stabellini, Steven Rostedt, LKML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joel Fernandes, He Zhe

On Thu, Mar 21, 2019 at 11:25:44AM -0700, Linus Torvalds wrote:
> On Thu, Mar 21, 2019 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I dunno.  Lots of people at least use to have serious commercial interest in it.
> 
> Yes, it used to be a big deal. But full virtualization has gotten a
> lot more common and better.
> 
> > Hey Xen folks, how close are we to being able to say "if you want to
> > run a new kernel, you need to switch to PVH or similar"?
> 
> I'd also like to know if we could perhaps at least limit PV to just
> the thing that people care most deeply about.
> 
> For example, maybe people notice that they really deeply care about
> the PV spinlocks because they help a lot for some loads, but don't
> care so much about the low-level CPU PV stuff any more because modern
> CPUs do _those_ things so well these days.
> 
> So it might not be an all-or-nothing thing, but a gradual "let's stop
> supporting xyz under PV, because it causes pain and isn't worth it".

So Juergen recently introduced PARAVIRT_XXL, which are exactly those
bits of PV we can get rid of.

This paravirt-me-harder config does indeed include the CR2 bits.

I recently talked to Andrew Cooper about this, and he said Xen Dom0
still needs all this :/

But yes, I'd _love_ to delete all that in a hurry.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:25                 ` Linus Torvalds
  2019-03-21 18:37                   ` Peter Zijlstra
@ 2019-03-21 18:38                   ` Andy Lutomirski
  2019-03-21 18:42                     ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-21 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andrew Cooper, Jan Beulich, Boris Ostrovsky,
	Juergen Gross, Stefano Stabellini, Peter Zijlstra,
	Steven Rostedt, LKML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Joel Fernandes, He Zhe

On Thu, Mar 21, 2019 at 11:33 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Mar 21, 2019 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I dunno.  Lots of people at least use to have serious commercial interest in it.
>
> Yes, it used to be a big deal. But full virtualization has gotten a
> lot more common and better.
>
> > Hey Xen folks, how close are we to being able to say "if you want to
> > run a new kernel, you need to switch to PVH or similar"?
>
> I'd also like to know if we could perhaps at least limit PV to just
> the thing that people care most deeply about.
>
> For example, maybe people notice that they really deeply care about
> the PV spinlocks because they help a lot for some loads, but don't
> care so much about the low-level CPU PV stuff any more because modern
> CPUs do _those_ things so well these days.
>
> So it might not be an all-or-nothing thing, but a gradual "let's stop
> supporting xyz under PV, because it causes pain and isn't worth it".
>

I suspect we'll want PV spinlocks and other goodies like PV TLB
shootdown for a long time.  The stuff I'd like to kill eventually is
the PV "I'm not actually at CPL 0 so I'm faking it" part.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:37                   ` Peter Zijlstra
@ 2019-03-21 18:39                     ` Andy Lutomirski
  2019-03-21 20:00                       ` Andrew Cooper
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-21 18:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, Andrew Cooper, Jan Beulich,
	Boris Ostrovsky, Juergen Gross, Stefano Stabellini,
	Steven Rostedt, LKML, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Joel Fernandes, He Zhe

On Thu, Mar 21, 2019 at 11:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Mar 21, 2019 at 11:25:44AM -0700, Linus Torvalds wrote:
> > On Thu, Mar 21, 2019 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > I dunno.  Lots of people at least use to have serious commercial interest in it.
> >
> > Yes, it used to be a big deal. But full virtualization has gotten a
> > lot more common and better.
> >
> > > Hey Xen folks, how close are we to being able to say "if you want to
> > > run a new kernel, you need to switch to PVH or similar"?
> >
> > I'd also like to know if we could perhaps at least limit PV to just
> > the thing that people care most deeply about.
> >
> > For example, maybe people notice that they really deeply care about
> > the PV spinlocks because they help a lot for some loads, but don't
> > care so much about the low-level CPU PV stuff any more because modern
> > CPUs do _those_ things so well these days.
> >
> > So it might not be an all-or-nothing thing, but a gradual "let's stop
> > supporting xyz under PV, because it causes pain and isn't worth it".
>
> So Juergen recently introduced PARAVIRT_XXL, which are exactly those
> bits of PV we can get rid of.
>
> This paravirt-me-harder config does indeed include the CR2 bits.
>
> I recently talked to Andrew Cooper about this, and he said Xen Dom0
> still needs all this :/

There were patches from last year to fix that:

https://lwn.net/Articles/753982/

I have no clue what the status is.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:38                   ` Andy Lutomirski
@ 2019-03-21 18:42                     ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 18:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Andrew Cooper, Jan Beulich, Boris Ostrovsky,
	Juergen Gross, Stefano Stabellini, Steven Rostedt, LKML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joel Fernandes, He Zhe

On Thu, Mar 21, 2019 at 11:38:25AM -0700, Andy Lutomirski wrote:

> I suspect we'll want PV spinlocks and other goodies like PV TLB
> shootdown for a long time.  The stuff I'd like to kill eventually is
> the PV "I'm not actually at CPL 0 so I'm faking it" part.

Right, and it's exactly those bits that are PARAVIRT_XXL AFAIU.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:28               ` Peter Zijlstra
@ 2019-03-21 18:55                 ` Steven Rostedt
  2019-03-21 19:31                   ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21 18:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

On Thu, 21 Mar 2019 19:28:30 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Mar 21, 2019 at 02:10:20PM -0400, Steven Rostedt wrote:
> > On Thu, 21 Mar 2019 11:05:06 -0700
> > Andy Lutomirski <luto@amacapital.net> wrote:
> >   
> > > In the long run, I think the right solution is to rewrite even more of
> > > this mess in C.  We really ought to be able to put the IRQ flag
> > > tracing and the context tracking into C code.  
> > 
> > And once we do that, we can work on getting the irq tracing
> > incorporated into a jump_label type that we could possibly enable
> > lockdep at start up, and then disable it later, even on production
> > systems! That is, to be able to turn it off and bring the system back
> > up to full speed.  
> 
> You forget the stupid amount of data bloat that lockdep brings.

No I didn't. Some users only care about performance, but find memory
cheap.

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:55                 ` Steven Rostedt
@ 2019-03-21 19:31                   ` Peter Zijlstra
  2019-03-21 19:50                     ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 19:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

On Thu, Mar 21, 2019 at 02:55:51PM -0400, Steven Rostedt wrote:
> On Thu, 21 Mar 2019 19:28:30 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Mar 21, 2019 at 02:10:20PM -0400, Steven Rostedt wrote:
> > > On Thu, 21 Mar 2019 11:05:06 -0700
> > > Andy Lutomirski <luto@amacapital.net> wrote:
> > >   
> > > > In the long run, I think the right solution is to rewrite even more of
> > > > this mess in C.  We really ought to be able to put the IRQ flag
> > > > tracing and the context tracking into C code.  
> > > 
> > > And once we do that, we can work on getting the irq tracing
> > > incorporated into a jump_label type that we could possibly enable
> > > lockdep at start up, and then disable it later, even on production
> > > systems! That is, to be able to turn it off and bring the system back
> > > up to full speed.  
> > 
> > You forget the stupid amount of data bloat that lockdep brings.
> 
> No I didn't. Some users only care about performance, but find memory
> cheap.

Because cache-misses are free?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 19:31                   ` Peter Zijlstra
@ 2019-03-21 19:50                     ` Steven Rostedt
  2019-03-21 20:03                       ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21 19:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

On Thu, 21 Mar 2019 20:31:52 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > 
> > No I didn't. Some users only care about performance, but find memory
> > cheap.  
> 
> Because cache-misses are free?

If I ever did implement this, I would try to get all the data out of
line as much as possible, where only a nop would be inserted:

	jmp lockdep_code
	raw_locking
1:

[..]

lockdep_code:
	do all that lockdep needs
	raw_locking
	jmp 1b

and have that converted to:


	nop // jmp lockdep_code
	raw_locking
1:

when disabled. So the only extra pressure on the icache is the nop.

-- Steve

	

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:39                     ` Andy Lutomirski
@ 2019-03-21 20:00                       ` Andrew Cooper
  2019-03-21 20:35                         ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2019-03-21 20:00 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Linus Torvalds, Jan Beulich, Boris Ostrovsky, Juergen Gross,
	Stefano Stabellini, Steven Rostedt, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe

On 21/03/2019 18:39, Andy Lutomirski wrote:
> On Thu, Mar 21, 2019 at 11:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Mar 21, 2019 at 11:25:44AM -0700, Linus Torvalds wrote:
>>> On Thu, Mar 21, 2019 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>> I dunno.  Lots of people at least use to have serious commercial interest in it.
>>> Yes, it used to be a big deal. But full virtualization has gotten a
>>> lot more common and better.
>>>
>>>> Hey Xen folks, how close are we to being able to say "if you want to
>>>> run a new kernel, you need to switch to PVH or similar"?
>>> I'd also like to know if we could perhaps at least limit PV to just
>>> the thing that people care most deeply about.
>>>
>>> For example, maybe people notice that they really deeply care about
>>> the PV spinlocks because they help a lot for some loads, but don't
>>> care so much about the low-level CPU PV stuff any more because modern
>>> CPUs do _those_ things so well these days.
>>>
>>> So it might not be an all-or-nothing thing, but a gradual "let's stop
>>> supporting xyz under PV, because it causes pain and isn't worth it".
>> So Juergen recently introduced PARAVIRT_XXL, which are exactly those
>> bits of PV we can get rid of.
>>
>> This paravirt-me-harder config does indeed include the CR2 bits.
>>
>> I recently talked to Andrew Cooper about this, and he said Xen Dom0
>> still needs all this :/
> There were patches from last year to fix that:
>
> https://lwn.net/Articles/753982/
>
> I have no clue what the status is.

I'm sorry to say that Xen PV is not going to die any time soon (as far
as Xen is concerned).

For customer VMs, using the VT-x/SVM hardware extensions is definitely
the way to go, but for host level operations, the difference between
syscall vs vmexit, or (not) having to do an EPT flush make an
overwhelming difference in performance.

Our PVH virtualisation mode, including for dom0, is making good
progress.  With Xen 4.12 and Linux 4.19+, a lot of basic functionality
now does work.  However, we've got 15 years of legacy problems to try
and untangle while doing this, including (but not limited to) Linux
(rather than Xen) being OSPM, or the fact that using virtual addresses
for hypercalls and mappings should never have propagate beyond the PV
ABI when HVM came along.

Even with the legacy API/ABI issues fixed, the straight performance
difference between root mode operations, and vmexits, means that it is
by no means a certainty that a PVH dom0 will be the best option for
systems to use.

I'm afraid that I've not got the original context of this thread, but
I'm going to guess that something is resulting in a #PF before %cr2
before it gets saved on the #PF path, and using a PVOP causes things to
explode?

If it helps in the short term, Xen will trap and emulate a mov from cr2
instruction correctly.  Performance will surely suck, but it might be an
option if we need to rethink how this information is made available to
guests.

Thanks,

~Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 19:50                     ` Steven Rostedt
@ 2019-03-21 20:03                       ` Peter Zijlstra
  2019-03-21 20:11                         ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 20:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

On Thu, Mar 21, 2019 at 03:50:06PM -0400, Steven Rostedt wrote:
> On Thu, 21 Mar 2019 20:31:52 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > 
> > > No I didn't. Some users only care about performance, but find memory
> > > cheap.  
> > 
> > Because cache-misses are free?
> 
> If I ever did implement this, I would try to get all the data out of
> line as much as possible, where only a nop would be inserted:

Doesn't make sense; you say data, but then talk code and i$.

Not the point, spinlock_t is 4 bytes, but growns into a monster with
lockdep on. There are plenty locations where the spinlock and the data
it protects fit together into a single cacheline, no longer so with
lockdep on.

Another example is split pte locks, without lockdep they are in struct
page, with lockdep, they're a separate allocation, adding pointer
chases.

Also; I do not, and have never done so, understood the desire to have
this unified kernel. Building another kernel just isn't a problem, esp.
not if you're doing kernel development to begin with.

Making debug code complicated, such that you need to spend more time
debugging the debug code, just doens't make sense to me either.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 20:03                       ` Peter Zijlstra
@ 2019-03-21 20:11                         ` Steven Rostedt
  0 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21 20:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

On Thu, 21 Mar 2019 21:03:16 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Doesn't make sense; you say data, but then talk code and i$.

Ah, you meant d$

> 
> Not the point, spinlock_t is 4 bytes, but growns into a monster with
> lockdep on. There are plenty locations where the spinlock and the data
> it protects fit together into a single cacheline, no longer so with
> lockdep on.
> 
> Another example is split pte locks, without lockdep they are in struct
> page, with lockdep, they're a separate allocation, adding pointer
> chases.
> 
> Also; I do not, and have never done so, understood the desire to have
> this unified kernel. Building another kernel just isn't a problem, esp.
> not if you're doing kernel development to begin with.

It's not for the typical kernel developer like you and me. It's about
shipping to the customers. Shipping multiple kernels is a pain. Lockdep
has some features that can be used in production environments, like
seeing lock contention and irq latency. Most of that code is dependent
on lockdep. Having just tracing of locks and such would be useful. But
that also relies heavily on the lockdep infrastructure.

> 
> Making debug code complicated, such that you need to spend more time
> debugging the debug code, just doens't make sense to me either.

Perhaps this can also help clean it up, and organize it. Sometimes
adding features makes the code cleaner (see what RT has done to the
kernel).

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 20:00                       ` Andrew Cooper
@ 2019-03-21 20:35                         ` Steven Rostedt
  0 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2019-03-21 20:35 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Jan Beulich,
	Boris Ostrovsky, Juergen Gross, Stefano Stabellini, LKML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joel Fernandes, He Zhe

On Thu, 21 Mar 2019 20:00:19 +0000
Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> I'm afraid that I've not got the original context of this thread, but
> I'm going to guess that something is resulting in a #PF before %cr2
> before it gets saved on the #PF path, and using a PVOP causes things to
> explode?

Actually it had nothing to do with PVOPs. It was tracing that caused
it. We added tracepoints to irqs off and on (with a config enabled),
and there's an option to allow taking user stack traces at any
tracepoint. The tracing of irqs off happened at the start of the page
fault entry, and by tracing the userspace stack (which can also fault)
we corrupted the cr2 for the original page fault.

The work now is to store that cr2 early on, but that requires doing it
in assembly, and that's where PVOPs makes the solution hard.

-- Steve


> 
> If it helps in the short term, Xen will trap and emulate a mov from cr2
> instruction correctly.  Performance will surely suck, but it might be an
> option if we need to rethink how this information is made available to
> guests.
> 
> Thanks,
> 
> ~Andrew


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:27               ` Andy Lutomirski
@ 2019-03-21 20:50                 ` Peter Zijlstra
  2019-03-22  2:52                   ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2019-03-21 20:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Juergen Gross, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe, Linus Torvalds, Clark Williams

On Thu, Mar 21, 2019 at 11:27:00AM -0700, Andy Lutomirski wrote:
> Well, here's pass zero at this.  Untested, because it obviously
> doesn't work.  Here are just a few things that are almost certainly
> wrong with it

Aah, you're proposing to simply not do TRACE_IRQS_OFF and
CALL_enter_from_user_mode and let \do_sym deal with it all.

Yes, that looks like it could almost work; esp. if you start by only
doing this for the idtentry stuff.

>  - The IRQ tracing needs to be re-added.
> 
>  - Some real semantics need to be defined for precisely what code is
> responsible for tracing.

So we get passed \do_sym, how about we do:

	call	__\do_sym

And then use some CPP magic to generate the those functions such that we
have a consistent part of C glue between our asm and our 'real' C
handler.

This glue can then do the tracing in a consistent manner.

#define IDT_HANDLER(do_sym)						\
asmlinkage __visible notrace void __do_sym(struct pt_regs *regs)	\
{									\
	trace_hardirqs_off();						\
	if (user_mode(regs))						\
		enter_from_user_mode();					\
	do_sym(regs);							\
}

Except more complicated I'm afraid, we need to handle more args etc..

>  - We need some asm-callable assertions to check the following
> conditions as appropriate:
> 
>  (a) that IRQ flags are currently traced as off.

What do you need this for? When returning from do_sym ?

>  (b) that IRQ flags are currently traced to match the IRET frame.

idem. Can't we have our C glue do that?

>  (c) that our context tracking is currently in good shape.  I'm not
> 100% sure how to define this.

So looking at this more; I used the %ebx games employed by
paranoid_entry to convey the state, but I didn't have to do that, the
actual condition seems to be:

	regs->cs & 3

aka. user_mode(regs). In this case our C glue would need to do the
context tracking user exit.

In fact, I can change my patch to use that and reduce the ebx ugly.

>  - We need to do some serious don't-instrument-me stuff to all the C
> entries, since we're now in an awful context when calling them.

Yah, but that's not new. do_page_fault(), sync_regs() at the very least
have this, so we can easily have our C glue have this too.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 20:50                 ` Peter Zijlstra
@ 2019-03-22  2:52                   ` Andy Lutomirski
  0 siblings, 0 replies; 35+ messages in thread
From: Andy Lutomirski @ 2019-03-22  2:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Steven Rostedt, Juergen Gross, LKML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joel Fernandes, He Zhe, Linus Torvalds, Clark Williams

On Thu, Mar 21, 2019 at 1:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Mar 21, 2019 at 11:27:00AM -0700, Andy Lutomirski wrote:
> > Well, here's pass zero at this.  Untested, because it obviously
> > doesn't work.  Here are just a few things that are almost certainly
> > wrong with it
>
> Aah, you're proposing to simply not do TRACE_IRQS_OFF and
> CALL_enter_from_user_mode and let \do_sym deal with it all.
>
> Yes, that looks like it could almost work; esp. if you start by only
> doing this for the idtentry stuff.
>
> >  - The IRQ tracing needs to be re-added.
> >
> >  - Some real semantics need to be defined for precisely what code is
> > responsible for tracing.
>
> So we get passed \do_sym, how about we do:
>
>         call    __\do_sym
>
> And then use some CPP magic to generate the those functions such that we
> have a consistent part of C glue between our asm and our 'real' C
> handler.
>
> This glue can then do the tracing in a consistent manner.
>
> #define IDT_HANDLER(do_sym)                                             \
> asmlinkage __visible notrace void __do_sym(struct pt_regs *regs)        \
> {                                                                       \
>         trace_hardirqs_off();                                           \
>         if (user_mode(regs))                                            \
>                 enter_from_user_mode();                                 \
>         do_sym(regs);                                                   \
> }
>
> Except more complicated I'm afraid, we need to handle more args etc..

Seems reasonable.  We should drop the asmlinkage, and if do_sym is
static, the code should be pretty good.

>
> >  - We need some asm-callable assertions to check the following
> > conditions as appropriate:
> >
> >  (a) that IRQ flags are currently traced as off.
>
> What do you need this for? When returning from do_sym ?
>
> >  (b) that IRQ flags are currently traced to match the IRET frame.
>
> idem. Can't we have our C glue do that?

The idea was to sanity check the C glue.  Maybe this isn't such a big deal.

>
> >  (c) that our context tracking is currently in good shape.  I'm not
> > 100% sure how to define this.
>
> So looking at this more; I used the %ebx games employed by
> paranoid_entry to convey the state, but I didn't have to do that, the
> actual condition seems to be:
>
>         regs->cs & 3
>
> aka. user_mode(regs). In this case our C glue would need to do the
> context tracking user exit.
>
> In fact, I can change my patch to use that and reduce the ebx ugly.
>
> >  - We need to do some serious don't-instrument-me stuff to all the C
> > entries, since we're now in an awful context when calling them.
>
> Yah, but that's not new. do_page_fault(), sync_regs() at the very least
> have this, so we can easily have our C glue have this too.

True.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21 18:18             ` Linus Torvalds
  2019-03-21 18:20               ` Andy Lutomirski
  2019-03-21 18:22               ` hpa
@ 2019-03-22  5:54               ` Juergen Gross
  2 siblings, 0 replies; 35+ messages in thread
From: Juergen Gross @ 2019-03-22  5:54 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Peter Zijlstra, Steven Rostedt, LKML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Joel Fernandes,
	He Zhe

On 21/03/2019 19:18, Linus Torvalds wrote:
> On Thu, Mar 21, 2019 at 11:05 AM Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> In the long run, I think the right solution is to rewrite even more of
>> this mess in C.  We really ought to be able to put the IRQ flag
>> tracing and the context tracking into C code.
> 
> Tangentially about this long run thing - can we start discussing just
> dropping PV mode entirely in the long run? Or even not-so-long run?

As Andrew already said: the work is ongoing, but hard.

What I'm planning to do, however, is dropping 32-bit PV code from the
kernel not too far in the future. This is only a small step, but it is
a move in the right direction.


Juergen

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
  2019-03-21  2:15 [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry Steven Rostedt
  2019-03-21  8:33 ` Peter Zijlstra
@ 2019-04-17  1:52 ` He Zhe
  1 sibling, 0 replies; 35+ messages in thread
From: He Zhe @ 2019-04-17  1:52 UTC (permalink / raw)
  To: Steven Rostedt, LKML
  Cc: Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Andy Lutomirski, Joel Fernandes, Linus Torvalds

From the last replies in the thread, it seems some work is going on. May I ask
when we can possibly roughly have a fix or workaround?

Thanks,
Zhe


On 3/21/19 10:15 AM, Steven Rostedt wrote:
> From: Steven Rostedt (VMware) <rostedt@goodmis.org>
>
> He Zhe reported a crash by enabling trace events and selecting
> "userstacktrace" which will read the stack of userspace for every trace
> event recorded. Zhe narrowed it down to:
>
>  c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
>
> With the problem config, I was able to also reproduce the error. I
> narrowed it down to just having to do the following:
>
>  # cd /sys/kernel/tracing
>  # echo 1 > options/userstacktrace
>  # echo 1 > events/preemptirq/irq_disable/enable
>
> And sure enough, I triggered a crash. Well, it was systemd crashing
> with a bad memory access??
>
>  systemd-journal[537]: segfault at ed8cb8 ip 00007f7fffc9fef5 sp 00007ffc4062cb10 error 7
>
> And it would crash similarly each time I tried it, but always at a
> different place. After spending the day on this, I finally figured it
> out. The bug is happening in entry_64.S right after error_entry.
> There's two TRACE_IRQS_OFF in that code path, which if I comment out,
> the bug goes away. Then it dawned on me that the crash always happens
> when systemd does a normal page fault. We had this bug before, and it
> was with the exception trace points.
>
> The issue is that a tracepoint can fault (reading vmalloc or whatever).
> And doing a userspace stack trace most definitely will fault. But if we
> are coming from a legitimate page fault, the address of that fault (in
> the CR2 register) will be lost if we fault before we get to the page
> fault handler. That's exactly what is happening.
>
> To solve this, a TRACE_IRQS_OFF_CR2 (and ON for consistency) was added
> that saves the CR2 register. A new trace_hardirqs_off_thunk_cr2 is
> created that stores the cr2 register, calls the
> trace_hardirqs_off_caller, then on return restores the cr2 register if
> it changed, before returning.
>
> On my tests this fixes the issue. I just want to know if this is a
> legitimate fix or if someone can come up with a better fix?
>
> Note: this also saves the exception context just like the
> do_page_fault() function does.
>
> Note2: This only gets enabled when lockdep or irq tracing is enabled,
> which is not recommended for production environments.
>
> Link: http://lkml.kernel.org/r/897cf5cf-fc24-8a64-cb28-847f2d2e63d2@windriver.com
>
> Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> ---
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 7bc105f47d21..7edffec328e2 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -292,6 +292,32 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>  
>  	syscall_return_slowpath(regs);
>  }
> +
> +extern void trace_hardirqs_on_caller(unsigned long caller_addr);
> +__visible void trace_hardirqs_on_caller_cr2(unsigned long caller_addr)
> +{
> +	unsigned long address = read_cr2(); /* Get the faulting address */
> +	enum ctx_state prev_state;
> +
> +	prev_state = exception_enter();
> +	trace_hardirqs_on_caller(caller_addr);
> +	if (address != read_cr2())
> +		write_cr2(address);
> +	exception_exit(prev_state);
> +}
> +
> +extern void trace_hardirqs_off_caller(unsigned long caller_addr);
> +__visible void trace_hardirqs_off_caller_cr2(unsigned long caller_addr)
> +{
> +	unsigned long address = read_cr2(); /* Get the faulting address */
> +	enum ctx_state prev_state;
> +
> +	prev_state = exception_enter();
> +	trace_hardirqs_off_caller(caller_addr);
> +	if (address != read_cr2())
> +		write_cr2(address);
> +	exception_exit(prev_state);
> +}
>  #endif
>  
>  #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 1f0efdb7b629..73ddf24fed3e 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1248,12 +1248,12 @@ ENTRY(error_entry)
>  	 * we fix gsbase, and we should do it before enter_from_user_mode
>  	 * (which can take locks).
>  	 */
> -	TRACE_IRQS_OFF
> +	TRACE_IRQS_OFF_CR2
>  	CALL_enter_from_user_mode
>  	ret
>  
>  .Lerror_entry_done:
> -	TRACE_IRQS_OFF
> +	TRACE_IRQS_OFF_CR2
>  	ret
>  
>  	/*
> diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
> index be36bf4e0957..1300b53b98cb 100644
> --- a/arch/x86/entry/thunk_64.S
> +++ b/arch/x86/entry/thunk_64.S
> @@ -41,6 +41,8 @@
>  #ifdef CONFIG_TRACE_IRQFLAGS
>  	THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
>  	THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
> +	THUNK trace_hardirqs_on_thunk_cr2,trace_hardirqs_on_caller_cr2,1
> +	THUNK trace_hardirqs_off_thunk_cr2,trace_hardirqs_off_caller_cr2,1
>  #endif
>  
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 058e40fed167..dd511742ca2e 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -172,9 +172,13 @@ static inline int arch_irqs_disabled(void)
>  #ifdef CONFIG_TRACE_IRQFLAGS
>  #  define TRACE_IRQS_ON		call trace_hardirqs_on_thunk;
>  #  define TRACE_IRQS_OFF	call trace_hardirqs_off_thunk;
> +#  define TRACE_IRQS_ON_CR2	call trace_hardirqs_on_thunk_cr2;
> +#  define TRACE_IRQS_OFF_CR2	call trace_hardirqs_off_thunk_cr2;
>  #else
>  #  define TRACE_IRQS_ON
>  #  define TRACE_IRQS_OFF
> +#  define TRACE_IRQS_ON_CR2
> +#  define TRACE_IRQS_OFF_CR2
>  #endif
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  #  ifdef CONFIG_X86_64
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2019-04-17  1:54 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-21  2:15 [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry Steven Rostedt
2019-03-21  8:33 ` Peter Zijlstra
2019-03-21  9:02   ` Peter Zijlstra
2019-03-21 10:45     ` Peter Zijlstra
2019-03-21 13:32       ` Steven Rostedt
2019-03-21 13:55         ` Steven Rostedt
2019-03-21 17:23           ` Linus Torvalds
2019-03-21 17:22         ` Peter Zijlstra
2019-03-21 18:05           ` Andy Lutomirski
2019-03-21 18:10             ` Steven Rostedt
2019-03-21 18:27               ` Andy Lutomirski
2019-03-21 20:50                 ` Peter Zijlstra
2019-03-22  2:52                   ` Andy Lutomirski
2019-03-21 18:28               ` Peter Zijlstra
2019-03-21 18:55                 ` Steven Rostedt
2019-03-21 19:31                   ` Peter Zijlstra
2019-03-21 19:50                     ` Steven Rostedt
2019-03-21 20:03                       ` Peter Zijlstra
2019-03-21 20:11                         ` Steven Rostedt
2019-03-21 18:18             ` Linus Torvalds
2019-03-21 18:20               ` Andy Lutomirski
2019-03-21 18:25                 ` Linus Torvalds
2019-03-21 18:37                   ` Peter Zijlstra
2019-03-21 18:39                     ` Andy Lutomirski
2019-03-21 20:00                       ` Andrew Cooper
2019-03-21 20:35                         ` Steven Rostedt
2019-03-21 18:38                   ` Andy Lutomirski
2019-03-21 18:42                     ` Peter Zijlstra
2019-03-21 18:22               ` hpa
2019-03-22  5:54               ` Juergen Gross
2019-03-21 18:27             ` Peter Zijlstra
2019-03-21 18:28               ` Andy Lutomirski
2019-03-21 18:33                 ` Peter Zijlstra
2019-03-21 13:04   ` Steven Rostedt
2019-04-17  1:52 ` He Zhe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.