* [patch 0/2] x86: NMI-safe trap handlers
@ 2010-07-14 15:49 Mathieu Desnoyers
  2010-07-14 15:49 ` [patch 1/2] x86_64 page fault NMI-safe Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 15:49 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen

Hi,

There seems to have been some churn regarding Perf problems with per-cpu memory
allocation, which uses vmalloc. Long story short: a faulting NMI handler
re-enables NMIs sooner than expected, because x86 unmasks NMIs at the first
iret encountered (the fault handler's), which leads to nested NMIs.

x86_32 cannot use vmalloc_sync_all() to synchronize the page tables of every
process because the vmalloc area is mapped in a different address space for
each process on this architecture. A second alternative is to duplicate the
per-cpu allocation API with a variant that uses kmalloc only; this would lead
to code and API duplication and should probably be kept as a last resort. A
third solution is to make the page fault handler aware of NMIs and ensure it
can be called from that context. This patchset proposes the third solution.

So I'm respinning this patchset, which has been sitting around for a while: it
has been used for about 1-2 years in the LTTng tree without problems and was
already tested in a -tip sub-branch in the past. It uses a popf/ret instruction
pair instead of iret when it detects that a trap handler is nested over an NMI.
A second patch makes the page fault handler NMI-safe by reading the pgd from
the cr3 register rather than through ->current, which could be in the middle of
being changed by a context switch.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
@ 2010-07-14 15:49 ` Mathieu Desnoyers
  2010-07-14 16:28   ` Linus Torvalds
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
  2010-07-14 17:06 ` [patch 0/2] x86: NMI-safe trap handlers Andi Kleen
  2 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 15:49 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Mathieu Desnoyers, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

[-- Attachment #1: x86_64-page-fault-nmi-safe.patch --]
[-- Type: text/plain, Size: 3314 bytes --]

> I think you're vastly overestimating what is sane to do from an NMI
> context.  It is utterly and totally insane to assume vmalloc is available
> in NMI.
>
>       -hpa
>

OK, please tell me where I am wrong then. Looking into
arch/x86/mm/fault.c, I see that vmalloc_sync_all() touches pgd_list
entries while the pgd_lock spinlock is held, with interrupts disabled.
So it is protected against concurrent pgd_list modification from

a - vmalloc_sync_all() on other CPUs
b - local interrupts

However, a completely normal interrupt can arrive on a remote CPU, run
vmalloc_fault() and issue a set_pgd() concurrently. Therefore I conclude
that this interrupt disabling is not there to ensure any kind of protection
against concurrent updates.
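
(For reference, the structure in question looks roughly like this -- a
simplified, hedged sketch along the lines of the x86_64 vmalloc_sync_all()
being discussed, not the verbatim code:)

void vmalloc_sync_all(void)	/* sketch */
{
	unsigned long address;

	for (address = VMALLOC_START & PGDIR_MASK; address <= VMALLOC_END;
	     address += PGDIR_SIZE) {
		const pgd_t *pgd_ref = pgd_offset_k(address);	/* reference kernel pgd */
		unsigned long flags;
		struct page *page;

		if (pgd_none(*pgd_ref))
			continue;

		/* pgd_lock taken with interrupts disabled, as noted above */
		spin_lock_irqsave(&pgd_lock, flags);
		list_for_each_entry(page, &pgd_list, lru) {
			pgd_t *pgd = (pgd_t *)page_address(page) + pgd_index(address);

			if (pgd_none(*pgd))
				set_pgd(pgd, *pgd_ref);	/* propagate the kernel mapping */
		}
		spin_unlock_irqrestore(&pgd_lock, flags);
	}
}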

Also, we see that vmalloc_fault() has comments such as:

(for x86_32)
         * Do _not_ use "current" here. We might be inside
         * an interrupt in the middle of a task switch..

So it takes the pgd address from cr3, not from current. Using only the
stack/registers makes this NMI-safe even if "current" is invalid when
the NMI comes: __switch_to() updates the registers before updating
current_task, without disabling interrupts.

You are right that x86_64 does not seem to play as safely as x86_32
on this matter; it uses current->mm. It probably shouldn't assume
"current" is valid. Actually, I don't see where x86_64 disables
interrupts around __switch_to(), so this would seem to be a race
condition. Or have I missed something?

(Ingo)
> > the scheduler disables interrupts around __switch_to(). (x86 does 
> > not set __ARCH_WANT_INTERRUPTS_ON_CTXSW)
>
(Mathieu)
> Ok, so I guess it's only useful to NMIs then. However, it makes me
> wonder why this comment was there in the first place on x86_32
> vmalloc_fault() and why it uses read_cr3():
>
>         * Do _not_ use "current" here. We might be inside
>         * an interrupt in the middle of a task switch..
(Ingo)
hm, I guess it's still useful to keep the
__ARCH_WANT_INTERRUPTS_ON_CTXSW case working too. On -rt we used to
enable it to squeeze a tiny bit more latency out of the system.


Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@osdl.org
CC: mingo@elte.hu
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
---
 arch/x86/mm/fault.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/x86/mm/fault.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/mm/fault.c	2010-03-13 16:56:46.000000000 -0500
+++ linux-2.6-lttng/arch/x86/mm/fault.c	2010-03-13 16:57:53.000000000 -0500
@@ -360,6 +360,7 @@ void vmalloc_sync_all(void)
  */
 static noinline __kprobes int vmalloc_fault(unsigned long address)
 {
+	unsigned long pgd_paddr;
 	pgd_t *pgd, *pgd_ref;
 	pud_t *pud, *pud_ref;
 	pmd_t *pmd, *pmd_ref;
@@ -374,7 +375,8 @@ static noinline __kprobes int vmalloc_fa
 	 * happen within a race in page table update. In the later
 	 * case just flush:
 	 */
-	pgd = pgd_offset(current->active_mm, address);
+	pgd_paddr = read_cr3();
+	pgd = __va(pgd_paddr) + pgd_index(address);
 	pgd_ref = pgd_offset_k(address);
 	if (pgd_none(*pgd_ref))
 		return -1;



* [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
  2010-07-14 15:49 ` [patch 1/2] x86_64 page fault NMI-safe Mathieu Desnoyers
@ 2010-07-14 15:49 ` Mathieu Desnoyers
  2010-07-14 16:42   ` Maciej W. Rozycki
  2010-07-16 12:28   ` Avi Kivity
  2010-07-14 17:06 ` [patch 0/2] x86: NMI-safe trap handlers Andi Kleen
  2 siblings, 2 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 15:49 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Mathieu Desnoyers, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

[-- Attachment #1: x86-nmi-safe-int3-and-page-fault.patch --]
[-- Type: text/plain, Size: 21100 bytes --]

Implements an alternative to iret, using popf and a return, so that trap and
exception handlers can return to the NMI handler without issuing iret, which
would re-enable NMIs prematurely. x86_32 uses popf and a far return. x86_64 has
to copy the return instruction pointer to the top of the previous stack, issue
a popf, load the previous rsp and issue a near return (ret).

This allows placing dynamically patched static jumps in asm gotos, which will
be used for optimized tracepoints, in NMI code, since returning from a
breakpoint becomes valid. Accessing vmalloc'd memory also becomes valid, which
allows executing module code or touching vmapped or vmalloc'd areas from NMI
context. This is very useful to tracers like LTTng.

This patch makes all faults, traps and exceptions safe to be called from NMI
context *except* single-stepping, which requires iret to restore the TF (trap
flag) and jump to the return address in a single instruction. Sorry, no kprobes
support in NMI handlers because of this limitation. This cannot be emulated
with popf/lret, because the lret would itself be single-stepped. It does not
apply to "immediate values" because they do not use single-stepping. This code
detects whether the TF flag is set and uses the iret path for single-stepping,
even though it reactivates NMIs prematurely.

The test detecting whether we are nested under an NMI handler is only done
upon return from a trap/exception to kernel space, which is not frequent. Other
return paths (return from trap/exception to userspace, return from interrupt)
keep the exact same behavior (no slowdown).

alpha and avr32 use 0x40000000 (bit 30) of the preempt count for PREEMPT_ACTIVE.
This patch moves it to 0x10000000 (bit 28).

TODO: test alpha and avr32 active count modification
TODO: test with lguest, xen, kvm.

tested on x86_32 (tests implemented in a separate patch):
- instrumented the return path to export the EIP, CS and EFLAGS values when
  it is taken, so we know the return path code has been executed.
- trace_mark, using immediate values, with a 10ms delay and with the breakpoint
  activated. Runs well through the return path.
- tested vmalloc faults in the NMI handler by placing a non-optimized marker in
  the NMI handler (so no breakpoint is executed) and connecting a probe which
  touches every page of a 20MB vmalloc'd buffer. It executes through the return
  path without problem.
- Tested with and without preemption

tested on x86_64:
- instrumented the return path to export the EIP, CS and EFLAGS values when
  it is taken, so we know the return path code has been executed.
- trace_mark, using immediate values, with a 10ms delay and with the breakpoint
  activated. Runs well through the return path.

To test on x86_64:
- Test without preemption
- Test vmalloc faults
- Test on Intel 64-bit CPUs (AMD64 was fine)

Changelog since v1 :
- x86_64 fixes.
Changelog since v2 :
- fix paravirt build
Changelog since v3 :
- Include modifications suggested by Jeremy
Changelog since v4 :
- including hardirq.h in entry_32/64.S is a bad idea (it contains C code not
  guarded for assembly inclusion); define NMI_MASK in the .S files directly.
Changelog since v5 :
- Add NMI_MASK to irq_count() and make die() more verbose for NMIs.
Changelog since v7 :
- Implement paravirtualized nmi_return.
Changelog since v8 :
- refreshed the patch for asm-offsets. Those were left out of v8.
- now depends on "Stringify support commas" patch.
Changelog since v9 :
- Only test the nmi nested preempt count flag upon return from exceptions, not
  on return from interrupts. Only the kernel return path has this test.
- Add Xen, VMI, lguest support. Use their iret paravirt ops in lieu of
  nmi_return.

- update for 2.6.30-rc1: follow the NMI_MASK bits merged in mainline.

- update for 2.6.35-rc4-tip

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@osdl.org
CC: mingo@elte.hu
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
---
 arch/alpha/include/asm/thread_info.h  |    2 -
 arch/avr32/include/asm/thread_info.h  |    2 -
 arch/x86/include/asm/irqflags.h       |   56 ++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/paravirt.h       |    4 ++
 arch/x86/include/asm/paravirt_types.h |    1 
 arch/x86/kernel/asm-offsets_32.c      |    1 
 arch/x86/kernel/asm-offsets_64.c      |    1 
 arch/x86/kernel/dumpstack.c           |    2 +
 arch/x86/kernel/entry_32.S            |   30 ++++++++++++++++++
 arch/x86/kernel/entry_64.S            |   33 ++++++++++++++++++--
 arch/x86/kernel/paravirt.c            |    3 +
 arch/x86/kernel/paravirt_patch_32.c   |    6 +++
 arch/x86/kernel/paravirt_patch_64.c   |    6 +++
 arch/x86/kernel/vmi_32.c              |    2 +
 arch/x86/lguest/boot.c                |    1 
 arch/x86/xen/enlighten.c              |    1 
 16 files changed, 145 insertions(+), 6 deletions(-)

Index: linux.trees.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/entry_32.S	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/entry_32.S	2010-07-14 08:02:11.000000000 -0400
@@ -80,6 +80,8 @@
 
 #define nr_syscalls ((syscall_table_size)/4)
 
+#define NMI_MASK 0x04000000
+
 #ifdef CONFIG_PREEMPT
 #define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
@@ -348,8 +350,32 @@ END(ret_from_fork)
 	# userspace resumption stub bypassing syscall exit tracing
 	ALIGN
 	RING0_PTREGS_FRAME
+
 ret_from_exception:
 	preempt_stop(CLBR_ANY)
+	GET_THREAD_INFO(%ebp)
+	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS and CS
+	movb PT_CS(%esp), %al
+	andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
+	cmpl $USER_RPL, %eax
+	jae resume_userspace	# returning to v8086 or userspace
+	testl $NMI_MASK,TI_preempt_count(%ebp)
+	jz resume_kernel		/* Not nested over NMI ? */
+	testw $X86_EFLAGS_TF, PT_EFLAGS(%esp)
+	jnz resume_kernel		/*
+					 * If single-stepping an NMI handler,
+					 * use the normal iret path instead of
+					 * the popf/lret because lret would be
+					 * single-stepped. It should not
+					 * happen : it will reactivate NMIs
+					 * prematurely.
+					 */
+	TRACE_IRQS_IRET
+	RESTORE_REGS
+	addl $4, %esp			# skip orig_eax/error_code
+	CFI_ADJUST_CFA_OFFSET -4
+	INTERRUPT_RETURN_NMI_SAFE
+
 ret_from_intr:
 	GET_THREAD_INFO(%ebp)
 check_userspace:
@@ -949,6 +975,10 @@ ENTRY(native_iret)
 .previous
 END(native_iret)
 
+ENTRY(native_nmi_return)
+	NATIVE_INTERRUPT_RETURN_NMI_SAFE # Should we deal with popf exception ?
+END(native_nmi_return)
+
 ENTRY(native_irq_enable_sysexit)
 	sti
 	sysexit
Index: linux.trees.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/entry_64.S	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/entry_64.S	2010-07-14 08:02:11.000000000 -0400
@@ -163,6 +163,8 @@ GLOBAL(return_to_handler)
 #endif
 
 
+#define NMI_MASK 0x04000000
+
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif
@@ -875,6 +877,9 @@ ENTRY(native_iret)
 	.section __ex_table,"a"
 	.quad native_iret, bad_iret
 	.previous
+
+ENTRY(native_nmi_return)
+	NATIVE_INTERRUPT_RETURN_NMI_SAFE
 #endif
 
 	.section .fixup,"ax"
@@ -929,6 +934,24 @@ retint_signal:
 	GET_THREAD_INFO(%rcx)
 	jmp retint_with_reschedule
 
+	/* Returning to kernel space from exception. */
+	/* rcx:	 threadinfo. interrupts off. */
+ENTRY(retexc_kernel)
+	testl $NMI_MASK,TI_preempt_count(%rcx)
+	jz retint_kernel		/* Not nested over NMI ? */
+	testw $X86_EFLAGS_TF,EFLAGS-ARGOFFSET(%rsp)	/* trap flag? */
+	jnz retint_kernel		/*
+					 * If single-stepping an NMI handler,
+					 * use the normal iret path instead of
+					 * the popf/lret because lret would be
+					 * single-stepped. It should not
+					 * happen : it will reactivate NMIs
+					 * prematurely.
+					 */
+	RESTORE_ARGS 0,8,0
+	TRACE_IRQS_IRETQ
+	INTERRUPT_RETURN_NMI_SAFE
+
 #ifdef CONFIG_PREEMPT
 	/* Returning to kernel space. Check if we need preemption */
 	/* rcx:	 threadinfo. interrupts off. */
@@ -1375,12 +1398,18 @@ ENTRY(paranoid_exit)
 paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
 	SWAPGS_UNSAFE_STACK
+paranoid_restore_no_nmi:
 	RESTORE_ALL 8
 	jmp irq_return
 paranoid_restore:
+	GET_THREAD_INFO(%rcx)
 	TRACE_IRQS_IRETQ 0
+	testl $NMI_MASK,TI_preempt_count(%rcx)
+	jz paranoid_restore_no_nmi              /* Nested over NMI ? */
+	testw $X86_EFLAGS_TF,EFLAGS-0(%rsp)     /* trap flag? */
+	jnz paranoid_restore_no_nmi
 	RESTORE_ALL 8
-	jmp irq_return
+	INTERRUPT_RETURN_NMI_SAFE
 paranoid_userspace:
 	GET_THREAD_INFO(%rcx)
 	movl TI_flags(%rcx),%ebx
@@ -1479,7 +1508,7 @@ ENTRY(error_exit)
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)
 	testl %eax,%eax
-	jne retint_kernel
+	jne retexc_kernel
 	LOCKDEP_SYS_EXIT_IRQ
 	movl TI_flags(%rcx),%edx
 	movl $_TIF_WORK_MASK,%edi
Index: linux.trees.git/arch/x86/include/asm/irqflags.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/irqflags.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/irqflags.h	2010-07-14 08:02:11.000000000 -0400
@@ -56,6 +56,61 @@ static inline void native_halt(void)
 
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Only returns from a trap or exception to a NMI context (intra-privilege
+ * level near return) to the same SS and CS segments. Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued. It takes care of modifying the eflags, rsp and returning to the
+ * previous function.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(rsp)  RIP
+ * 8(rsp)  CS
+ * 16(rsp) EFLAGS
+ * 24(rsp) RSP
+ * 32(rsp) SS
+ *
+ * Upon execution :
+ * Copy EIP to the top of the return stack
+ * Update top of return stack address
+ * Pop eflags into the eflags register
+ * Make the return stack current
+ * Near return (popping the return address from the return stack)
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushq %rax;		\
+						movq %rsp, %rax;	\
+						movq 24+8(%rax), %rsp;	\
+						pushq 0+8(%rax);	\
+						pushq 16+8(%rax);	\
+						movq (%rax), %rax;	\
+						popfq;			\
+						ret
+#else
+/*
+ * Protected mode only, no V8086. Implies that protected mode must
+ * be entered before NMIs or MCEs are enabled. Only returns from a trap or
+ * exception to a NMI context (intra-privilege level far return). Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(esp) EIP
+ * 4(esp) CS
+ * 8(esp) EFLAGS
+ *
+ * Upon execution :
+ * Copy the stack eflags to top of stack
+ * Pop eflags into the eflags register
+ * Far return: pop EIP and CS into their register, and additionally pop EFLAGS.
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushl 8(%esp);	\
+						popfl;		\
+						lret $4
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -114,6 +169,7 @@ static inline unsigned long __raw_local_
 
 #define ENABLE_INTERRUPTS(x)	sti
 #define DISABLE_INTERRUPTS(x)	cli
+#define INTERRUPT_RETURN_NMI_SAFE	NATIVE_INTERRUPT_RETURN_NMI_SAFE
 
 #ifdef CONFIG_X86_64
 #define SWAPGS	swapgs
Index: linux.trees.git/arch/alpha/include/asm/thread_info.h
===================================================================
--- linux.trees.git.orig/arch/alpha/include/asm/thread_info.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/alpha/include/asm/thread_info.h	2010-07-14 08:02:11.000000000 -0400
@@ -56,7 +56,7 @@ register struct thread_info *__current_t
 #define THREAD_SIZE_ORDER 1
 #define THREAD_SIZE (2*PAGE_SIZE)
 
-#define PREEMPT_ACTIVE		0x40000000
+#define PREEMPT_ACTIVE		0x10000000
 
 /*
  * Thread information flags:
Index: linux.trees.git/arch/avr32/include/asm/thread_info.h
===================================================================
--- linux.trees.git.orig/arch/avr32/include/asm/thread_info.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/avr32/include/asm/thread_info.h	2010-07-14 08:02:11.000000000 -0400
@@ -66,7 +66,7 @@ static inline struct thread_info *curren
 
 #endif /* !__ASSEMBLY__ */
 
-#define PREEMPT_ACTIVE		0x40000000
+#define PREEMPT_ACTIVE		0x10000000
 
 /*
  * Thread information flags
Index: linux.trees.git/arch/x86/include/asm/paravirt.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/paravirt.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/paravirt.h	2010-07-14 08:02:11.000000000 -0400
@@ -943,6 +943,10 @@ extern void default_banner(void);
 	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE,	\
 		  jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_iret))
 
+#define INTERRUPT_RETURN_NMI_SAFE					\
+	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_nmi_return), CLBR_NONE,	\
+		  jmp *%cs:pv_cpu_ops+PV_CPU_nmi_return)
+
 #define DISABLE_INTERRUPTS(clobbers)					\
 	PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
 		  PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);		\
Index: linux.trees.git/arch/x86/kernel/paravirt.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt.c	2010-07-14 08:02:11.000000000 -0400
@@ -156,6 +156,7 @@ unsigned paravirt_patch_default(u8 type,
 		ret = paravirt_patch_ident_64(insnbuf, len);
 
 	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
+		 type == PARAVIRT_PATCH(pv_cpu_ops.nmi_return) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret64))
@@ -204,6 +205,7 @@ static void native_flush_tlb_single(unsi
 
 /* These are in entry.S */
 extern void native_iret(void);
+extern void native_nmi_return(void);
 extern void native_irq_enable_sysexit(void);
 extern void native_usergs_sysret32(void);
 extern void native_usergs_sysret64(void);
@@ -373,6 +375,7 @@ struct pv_cpu_ops pv_cpu_ops = {
 	.usergs_sysret64 = native_usergs_sysret64,
 #endif
 	.iret = native_iret,
+	.nmi_return = native_nmi_return,
 	.swapgs = native_swapgs,
 
 	.set_iopl_mask = native_set_iopl_mask,
Index: linux.trees.git/arch/x86/kernel/paravirt_patch_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt_patch_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt_patch_32.c	2010-07-14 08:02:11.000000000 -0400
@@ -1,10 +1,13 @@
-#include <asm/paravirt.h>
+#include <linux/stringify.h>
+#include <linux/irqflags.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
 DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
 DEF_NATIVE(pv_cpu_ops, iret, "iret");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+	__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
 DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "sti; sysexit");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
@@ -41,6 +44,7 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_irq_ops, restore_fl);
 		PATCH_SITE(pv_irq_ops, save_fl);
 		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, nmi_return);
 		PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
 		PATCH_SITE(pv_mmu_ops, read_cr2);
 		PATCH_SITE(pv_mmu_ops, read_cr3);
Index: linux.trees.git/arch/x86/kernel/paravirt_patch_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt_patch_64.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt_patch_64.c	2010-07-14 08:02:11.000000000 -0400
@@ -1,12 +1,15 @@
+#include <linux/irqflags.h>
+#include <linux/stringify.h>
 #include <asm/paravirt.h>
 #include <asm/asm-offsets.h>
-#include <linux/stringify.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
 DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
 DEF_NATIVE(pv_cpu_ops, iret, "iretq");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+	__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
 DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
@@ -51,6 +54,7 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_irq_ops, irq_enable);
 		PATCH_SITE(pv_irq_ops, irq_disable);
 		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, nmi_return);
 		PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
 		PATCH_SITE(pv_cpu_ops, usergs_sysret32);
 		PATCH_SITE(pv_cpu_ops, usergs_sysret64);
Index: linux.trees.git/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/asm-offsets_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/asm-offsets_32.c	2010-07-14 08:02:11.000000000 -0400
@@ -113,6 +113,7 @@ void foo(void)
 	OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
 	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
 	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
 	OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
 	OFFSET(PV_CPU_read_cr0, pv_cpu_ops, read_cr0);
 #endif
Index: linux.trees.git/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/asm-offsets_64.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/asm-offsets_64.c	2010-07-14 08:02:11.000000000 -0400
@@ -58,6 +58,7 @@ int main(void)
 	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
 	OFFSET(PV_IRQ_adjust_exception_frame, pv_irq_ops, adjust_exception_frame);
 	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
 	OFFSET(PV_CPU_usergs_sysret32, pv_cpu_ops, usergs_sysret32);
 	OFFSET(PV_CPU_usergs_sysret64, pv_cpu_ops, usergs_sysret64);
 	OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
Index: linux.trees.git/arch/x86/xen/enlighten.c
===================================================================
--- linux.trees.git.orig/arch/x86/xen/enlighten.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/xen/enlighten.c	2010-07-14 08:02:12.000000000 -0400
@@ -953,6 +953,7 @@ static const struct pv_cpu_ops xen_cpu_o
 	.read_pmc = native_read_pmc,
 
 	.iret = xen_iret,
+	.nmi_return = xen_iret,
 	.irq_enable_sysexit = xen_sysexit,
 #ifdef CONFIG_X86_64
 	.usergs_sysret32 = xen_sysret32,
Index: linux.trees.git/arch/x86/kernel/vmi_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/vmi_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/vmi_32.c	2010-07-14 08:02:12.000000000 -0400
@@ -154,6 +154,8 @@ static unsigned vmi_patch(u8 type, u16 c
 					      insns, ip);
 		case PARAVIRT_PATCH(pv_cpu_ops.iret):
 			return patch_internal(VMI_CALL_IRET, len, insns, ip);
+		case PARAVIRT_PATCH(pv_cpu_ops.nmi_return):
+			return patch_internal(VMI_CALL_IRET, len, insns, ip);
 		case PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit):
 			return patch_internal(VMI_CALL_SYSEXIT, len, insns, ip);
 		default:
Index: linux.trees.git/arch/x86/lguest/boot.c
===================================================================
--- linux.trees.git.orig/arch/x86/lguest/boot.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/lguest/boot.c	2010-07-14 08:02:12.000000000 -0400
@@ -1270,6 +1270,7 @@ __init void lguest_init(void)
 	pv_cpu_ops.cpuid = lguest_cpuid;
 	pv_cpu_ops.load_idt = lguest_load_idt;
 	pv_cpu_ops.iret = lguest_iret;
+	pv_cpu_ops.nmi_return = lguest_iret;
 	pv_cpu_ops.load_sp0 = lguest_load_sp0;
 	pv_cpu_ops.load_tr_desc = lguest_load_tr_desc;
 	pv_cpu_ops.set_ldt = lguest_set_ldt;
Index: linux.trees.git/arch/x86/kernel/dumpstack.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/dumpstack.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/dumpstack.c	2010-07-14 08:02:12.000000000 -0400
@@ -258,6 +258,8 @@ void __kprobes oops_end(unsigned long fl
 
 	if (!signr)
 		return;
+	if (in_nmi())
+		panic("Fatal exception in non-maskable interrupt");
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
 	if (panic_on_oops)
Index: linux.trees.git/arch/x86/include/asm/paravirt_types.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/paravirt_types.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/paravirt_types.h	2010-07-14 08:02:12.000000000 -0400
@@ -181,6 +181,7 @@ struct pv_cpu_ops {
 	/* Normal iret.  Jump to this with the standard iret stack
 	   frame set up. */
 	void (*iret)(void);
+	void (*nmi_return)(void);
 
 	void (*swapgs)(void);
 



* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 15:49 ` [patch 1/2] x86_64 page fault NMI-safe Mathieu Desnoyers
@ 2010-07-14 16:28   ` Linus Torvalds
  2010-07-14 17:06     ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 16:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Mathieu Desnoyers, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>> I think you're vastly overestimating what is sane to do from an NMI
>> context.  It is utterly and totally insane to assume vmalloc is available
>> in NMI.

I agree that NMI handlers shouldn't touch vmalloc space. But now that
percpu data is mapped through the VM, I do agree that other CPU's may
potentially need to touch that data, and an interrupt (including an
NMI) might be the first to create the mapping.

And that's why the faulting code needs to be interrupt-safe for the
vmalloc area.

However, it does look like the current scheduler should make it safe
to access "current->mm->pgd" from regular interrupts, so the problem
is apparently only an NMI issue. So exactly what are the circumstances
that create and expose percpu data on a CPU _without_ mapping it on
that CPU?

IOW, I'm missing some background here. I agree that at least some
basic percpu data should generally be available for an NMI handler,
but at the same time I wonder how come that basic percpu data wasn't
already mapped?

The traditional irq vmalloc race was something like:
 - one CPU does a "fork()", which copies the basic kernel mappings
 - in another thread a driver does a vmalloc(), which creates a _new_
mapping that didn't get copied.
 - later on a switch_to() switches to the newly forked process that
missed the vmalloc initialization
 - we take an interrupt for the driver that needed the new vmalloc
space, but now it doesn't have it, and we fill it in at run-time for
the (rare) race.

and I'm simply not seeing how fork() could ever race with percpu data setup.
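
(For reference, the run-time fill-in mentioned above amounts to roughly the
following -- a hedged sketch of the pgd-level part of vmalloc_fault(), using
the cr3-based lookup from patch 1/2; the function name is made up, and the
real code also walks the lower page-table levels:)

static int vmalloc_fault_sketch(unsigned long address)
{
	/* Use the page tables the CPU is actually running on (cr3), not
	 * current->mm: "current" may be stale in the middle of a task switch. */
	unsigned long pgd_paddr = read_cr3();
	pgd_t *pgd = (pgd_t *)__va(pgd_paddr) + pgd_index(address);
	pgd_t *pgd_ref = pgd_offset_k(address);	/* reference kernel tables (init_mm) */

	if (pgd_none(*pgd_ref))
		return -1;			/* not a valid vmalloc address */
	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref);		/* fill in the missing kernel mapping */
	return 0;				/* lower levels are handled similarly */
}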

So please just document the sequence that actually needs the page
table setup for the NMI/percpu case.

This patch (1/2) doesn't look horrible per se. I have no problems with
it. I just want to understand why it is needed.

                        Linus


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
@ 2010-07-14 16:42   ` Maciej W. Rozycki
  2010-07-14 18:12     ` Mathieu Desnoyers
  2010-07-16 12:28   ` Avi Kivity
  1 sibling, 1 reply; 168+ messages in thread
From: Maciej W. Rozycki @ 2010-07-14 16:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Mathieu Desnoyers,
	akpm, H. Peter Anvin, Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> This patch makes all faults, traps and exceptions safe to be called from NMI
> context *except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes

 Watch out for the RF flag too, that is not set correctly by POPFD -- that 
may be important for faulting instructions that also have a hardware 
breakpoint set at their address.

> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.

 What about the VM flag for VM86 tasks?  It cannot be changed by POPFD 
either.

 How about only using the special return path when a nested exception is 
about to return to the NMI handler?  You'd then avoid all the odd cases 
that do not happen in the NMI context.

  Maciej


* Re: [patch 0/2] x86: NMI-safe trap handlers
  2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
  2010-07-14 15:49 ` [patch 1/2] x86_64 page fault NMI-safe Mathieu Desnoyers
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
@ 2010-07-14 17:06 ` Andi Kleen
  2010-07-14 17:08   ` Mathieu Desnoyers
  2 siblings, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-14 17:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen

> x86_32 cannot use vmalloc_sync_all() to synchronize the page tables of every
> process because the vmalloc area is mapped in a different address space for

That doesn't make sense. vmalloc_sync_all() should work on 32bit too.
It just needs to walk all processes and fix up every page table.

-Andi


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 16:28   ` Linus Torvalds
@ 2010-07-14 17:06     ` Mathieu Desnoyers
  2010-07-14 18:10       ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:

(I was quoting Peter Anvin below) ;)

> >> I think you're vastly overestimating what is sane to do from an NMI
> >> context.  It is utterly and totally insane to assume vmalloc is available
> >> in NMI.
> 
> I agree that NMI handlers shouldn't touch vmalloc space. But now that
> percpu data is mapped through the VM, I do agree that other CPU's may
> potentially need to touch that data, and an interrupt (including an
> NMI) might be the first to create the mapping.
> 
[...]
> So please just document the sequence that actually needs the page
> table setup for the NMI/percpu case.
> 
> This patch (1/2) doesn't look horrible per se. I have no problems with
> it. I just want to understand why it is needed.

The problem originally addressed by this patch is the case where an NMI handler
tries to access vmalloc'd per-cpu data, which goes as follows:

- One CPU does a fork(), which copies the basic kernel mappings.
- Perf allocates percpu memory for buffer control data structures.
  This mapping does not get copied.
- Tracing is activated.
- switch_to() to the newly forked process which missed the new percpu
  allocation.
- We take an NMI, which touches the vmalloc'd percpu memory in the Perf tracing
  handler, therefore leading to a page fault in NMI context. Here, we might be
  in the middle of switch_to(), where ->current might not be in sync with the
  cr3 register.

The three choices I am aware of for handling this are:
1) supporting page faults in NMI context, which implies removing the ->current
   dependency and supporting an iret-less return path.
2) duplicating the percpu alloc API with a variant that maps to kmalloc.
3) using vmalloc_sync_all() after creating the mapping (only works for x86_64,
   not x86_32).

Choice 3 seems like a no-go on x86_32, and choice 2 seems like a last resort
(it involves API duplication and reserving a fixed amount of per-cpu memory at
boot). Hence the proposal of choice 1.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 0/2] x86: NMI-safe trap handlers
  2010-07-14 17:06 ` [patch 0/2] x86: NMI-safe trap handlers Andi Kleen
@ 2010-07-14 17:08   ` Mathieu Desnoyers
  2010-07-14 18:56     ` Andi Kleen
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 17:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Tejun Heo

* Andi Kleen (andi@firstfloor.org) wrote:
> > x86_32 cannot use vmalloc_sync_all() to synchronize the page tables of every
> > process because the vmalloc area is mapped in a different address space for
> 
> That doesn't make sense. vmalloc_sync_all() should work on 32bit too.
> It just needs to walk all processes and fix up every page table.

Yeah, I was taken aback when Tejun told me that a few moments ago. I
initially thought that vmalloc_sync_all() synchronized the page mappings of
all processes on x86_32, but apparently that is not the case. I'm adding
him in CC.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 17:06     ` Mathieu Desnoyers
@ 2010-07-14 18:10       ` Linus Torvalds
  2010-07-14 18:46         ` Ingo Molnar
  2010-07-14 20:39         ` Mathieu Desnoyers
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 18:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 10:06 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>>
>> This patch (1/2) doesn't look horrible per se. I have no problems with
>> it. I just want to understand why it is needed.

[ And patch 2/2 is much more intrusive, and touches a critical path
too.. If it was just the 1/2 series, I don't think I would care. For
the 2/2, I think I'd want to explore all the alternative options ]

> The problem originally addressed by this patch is the case where an NMI handler
> tries to access vmalloc'd per-cpu data, which goes as follows:
>
> - One CPU does a fork(), which copies the basic kernel mappings.
> - Perf allocates percpu memory for buffer control data structures.
>  This mapping does not get copied.
> - Tracing is activated.
> - switch_to() to the newly forked process which missed the new percpu
>  allocation.
> - We take an NMI, which touches the vmalloc'd percpu memory in the Perf tracing
>  handler, therefore leading to a page fault in NMI context. Here, we might be
>  in the middle of switch_to(), where ->current might not be in sync with the
>  current cr3 register.

Ok. I was wondering why anybody would allocate core percpu variables
so late that this would ever be an issue, but I guess perf is a
reasonable such case. And reasonable to do from NMI.

That said - grr. I really wish there was some other alternative than
adding yet more complexity to the exception return path. That "iret
re-enables NMI's unconditionally" thing annoys me.

In fact, I wonder if we couldn't just do a software NMI disable
instead? Have a per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like

 NMI entry:
 - load percpu NMI stack frame pointer
 - if non-zero we know we're nested, and should ignore this NMI:
    - we're returning to kernel mode, so return immediately by using
"popf/ret", which also keeps NMI's disabled in the hardware until the
"real" NMI iret happens.
    - before the popf/ret, use the NMI stack pointer to make the NMI
return stack be invalid and cause a fault
  - set the NMI stack pointer to the current stack pointer

 NMI exit (not the above "immediate exit because we nested"):
   clear the percpu NMI stack pointer
   Just do the iret.

Now, the thing is, the "iret" is atomic. If we had a nested NMI,
we'll take a fault, and that re-does our "delayed" NMI - and NMI's
will stay masked.

And if we didn't have a nested NMI, that iret will now unmask NMI's,
and everything is happy.

Doesn't the above sound like a good solution? In other words, we solve
the whole problem by simply _fixing_ the crazy Intel "iret-vs-NMI"
semantics. And we don't need to change the hotpath, and we'll just
_allow_ nested faults within NMI's.
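
(A conceptual C rendering of the above, just to make the nesting logic
concrete -- the names, the frame layout and the CS-clearing trick are made up
for illustration; the real thing would have to live in the NMI entry assembly:)

struct nmi_frame {			/* hardware-pushed NMI interrupt frame */
	unsigned long rip, cs, rflags, rsp, ss;
};

static DEFINE_PER_CPU(struct nmi_frame *, nmi_in_progress);

/* Entry-side check: returns true if this NMI was nested and must be
 * dropped (it will be replayed when the outer NMI's iret faults). */
static bool nmi_entry_check(struct nmi_frame *frame)
{
	struct nmi_frame *outer = __get_cpu_var(nmi_in_progress);

	if (outer) {
		outer->cs = 0;	/* invalidate the outer return frame so its
				 * iret faults and the NMI gets re-done */
		return true;	/* caller returns via popf/ret, which keeps
				 * NMIs masked in hardware */
	}
	__get_cpu_var(nmi_in_progress) = frame;	/* mark "inside NMI" */
	return false;
}

/* Exit side (the non-nested case): clear the marker, then plain iret. */
static void nmi_exit_mark(void)
{
	__get_cpu_var(nmi_in_progress) = NULL;
}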

Am I missing something? Maybe I'm not as clever as I think I am... But
I _feel_ clever.

                   Linus


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 16:42   ` Maciej W. Rozycki
@ 2010-07-14 18:12     ` Mathieu Desnoyers
  2010-07-14 19:21       ` Maciej W. Rozycki
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 18:12 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Maciej W. Rozycki (macro@linux-mips.org) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
> 
> > This patch makes all faults, traps and exceptions safe to be called from NMI
> > context *except* single-stepping, which requires iret to restore the TF (trap
> > flag) and jump to the return address in a single instruction. Sorry, no kprobes
> 
>  Watch out for the RF flag too, that is not set correctly by POPFD -- that 
> may be important for faulting instructions that also have a hardware 
> breakpoint set at their address.
> 
> > support in NMI handlers because of this limitation. This cannot be emulated
> > with popf/lret, because lret would be single-stepped. It does not apply to
> > "immediate values" because they do not use single-stepping. This code detects if
> > the TF flag is set and uses the iret path for single-stepping, even if it
> > reactivates NMIs prematurely.
> 
>  What about the VM flag for VM86 tasks?  It cannot be changed by POPFD 
> either.
> 
>  How about only using the special return path when a nested exception is 
> about to return to the NMI handler?  You'd avoid all the odd cases then 
> that do not happen in the NMI context.

This is exactly what this patch does :-)

It selects the return path with

+       testl $NMI_MASK,TI_preempt_count(%ebp)
+       jz resume_kernel                /* Not nested over NMI ? */

In addition, regarding int3 breakpoint use in the kernel, AFAIK the handler does
not explicitly set the RF flag, and the breakpoint instruction (int3) appears
not to set it (from my understanding of the Intel Architecture Software
Developer's Manual, Volume 3: System Programming, 15.3.1.1
INSTRUCTION-BREAKPOINT EXCEPTION CONDITION).

So it should be safe to set an int3 breakpoint in an NMI handler with this patch.
It's just the "single-stepping" feature of kprobes that is problematic.
Luckily, only int3 is needed for the code patching bypass.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 18:10       ` Linus Torvalds
@ 2010-07-14 18:46         ` Ingo Molnar
  2010-07-14 19:14           ` Linus Torvalds
  2010-07-14 20:39         ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Ingo Molnar @ 2010-07-14 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Ok. I was wondering why anybody would allocate core percpu variables so late 
> that this would ever be an issue, but I guess perf is a reasonable such 
> case. And reasonable to do from NMI.

Yeah.

Frederic (re-)discovered this problem via very hard to debug crashes when he 
extended perf call-graph tracing to have a bit larger buffer and used 
percpu_alloc() for it (which is entirely reasonable in itself).

> That said - grr. I really wish there was some other alternative than adding 
> yet more complexity to the exception return path. That "iret re-enables 
> NMI's unconditionally" thing annoys me.

Ok. We can solve it by allocating the space from the non-vmalloc percpu area - 
8K per CPU.

> In fact, I wonder if we couldn't just do a software NMI disable
> instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
> allocated statically) that points to the NMI stack frame, and just
> make the NMI code itself do something like
> 
>  NMI entry:

I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel 
stack already, due to entering via the IST stack mechanism, which is 
non-nesting and which enters at the same point - right?

We could solve that by copying that small stack frame off before entering the 
'generic' NMI routine - but it all feels a bit pulled in by the hair.

I feel uneasy about taking page faults from the NMI handler. Even if we 
implemented it all correctly, who knows what CPU errata are waiting there to 
be discovered, etc ...

I think we should try to muddle through by preventing these situations from 
happening (and adding a WARN_ONCE() to the vmalloc page-fault handler would 
certainly help as well), and only go to more clever schemes if no other option 
looks sane anymore?

Thanks,

	Ingo


* Re: [patch 0/2] x86: NMI-safe trap handlers
  2010-07-14 17:08   ` Mathieu Desnoyers
@ 2010-07-14 18:56     ` Andi Kleen
  2010-07-14 23:29       ` Tejun Heo
  0 siblings, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-14 18:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Tejun Heo

On Wed, Jul 14, 2010 at 01:08:05PM -0400, Mathieu Desnoyers wrote:
> * Andi Kleen (andi@firstfloor.org) wrote:
> > > x86_32 cannot use vmalloc_sync_all() to synchronize the page tables of every
> > > process because the vmalloc area is mapped in a different address space for
> > 
> > That doesn't make sense. vmalloc_sync_all() should work on 32bit too.
> > It just needs to walk all processes and fix up every page table.
> 
> Yeah, I was taken aback when Tejun told me that a few moments ago. I
> initially thought that vmalloc_sync_all() synchronized the page mappings of
> all processes on x86_32, but apparently that is not the case. I'm adding
> him in CC.

Well, it worked when it was originally written. That was for the case
of an NMI handler in a module. If it doesn't work, fix it. I don't 
think the NMI-safe fault handling is really needed with it.

-Andi


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 18:46         ` Ingo Molnar
@ 2010-07-14 19:14           ` Linus Torvalds
  2010-07-14 19:36             ` Frederic Weisbecker
  2010-07-14 19:41             ` Linus Torvalds
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 19:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <mingo@elte.hu> wrote:
>>  NMI entry:
>
> I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
> stack already, due to entering via the IST stack mechanism, which is
> non-nesting and which enters at the same point - right?

Yeah, you're right, but we could easily fix that up. We know we don't
need any stack for the nested case, so all we would need to do is to
just subtract a small bit off %rsp, and copy the three words or so to
create a "new" stack for the non-nested case.

> We could solve that by copying that small stack frame off before entering the
> 'generic' NMI routine - but it all feels a bit pulled in by the hair.

Why? It's much cleaner than making the _real_ codepaths much worse.

                Linus


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 18:12     ` Mathieu Desnoyers
@ 2010-07-14 19:21       ` Maciej W. Rozycki
  2010-07-14 19:58         ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Maciej W. Rozycki @ 2010-07-14 19:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> >  How about only using the special return path when a nested exception is 
> > about to return to the NMI handler?  You'd then avoid all the odd cases 
> > that do not happen in the NMI context.
> 
> This is exactly what this patch does :-)

 Ah, OK then -- I understood you actually tested the value of TF in the 
image to be restored.

> It selects the return path with
> 
> +       testl $NMI_MASK,TI_preempt_count(%ebp)
> +       jz resume_kernel                /* Not nested over NMI ? */
> 
> In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
> not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> not to set it. (from my understanding of Intel's
> Intel Architecture Software Developer’s Manual Volume 3: System Programming
> 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)

 The CPU only sets RF itself in the image saved in certain cases -- you'd 
see it set in the page fault handler for example, so that once the handler 
has finished any instruction breakpoint does not hit (presumably again, 
because the instruction breakpoint debug exception has the highest 
priority).  You mentioned the need to handle these faults.

> So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
> 
> It's just the "single-stepping" feature of kprobes which is problematic.
> Luckily, only int3 is needed for code patching bypass.

 Actually the breakpoint exception handler should probably set RF 
explicitly, but that depends on the exact debugging scenario, so I can't 
comment on it further.  I don't know how INT3 is used in this context, so 
I'm just noting this may be a danger zone.

  Maciej


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 19:14           ` Linus Torvalds
@ 2010-07-14 19:36             ` Frederic Weisbecker
  2010-07-14 19:54               ` Linus Torvalds
  2010-07-14 19:41             ` Linus Torvalds
  1 sibling, 1 reply; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-14 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Mathieu Desnoyers, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 12:14:01PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <mingo@elte.hu> wrote:
> >>  NMI entry:
> >
> > I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel
> > stack already, due to entering via the IST stack mechanism, which is
> > non-nesting and which enters at the same point - right?
> 
> Yeah, you're right, but we could easily fix that up. We know we don't
> need any stack for the nested case, so all we would need to do is to
> just subtract a small bit off %rsp, and copy the three words or so to
> create a "new" stack for the non-nested case.
> 
> > We could solve that by copying that small stack frame off before entering the
> > 'generic' NMI routine - but it all feels a bit pulled in by the hair.
> 
> Why? It's much cleaner than making the _real_ codepaths much worse.



There is also the fact that we need to handle the lost NMI, by deferring its
treatment or so. That adds even more complexity.



* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 19:14           ` Linus Torvalds
  2010-07-14 19:36             ` Frederic Weisbecker
@ 2010-07-14 19:41             ` Linus Torvalds
  2010-07-14 19:56               ` Andi Kleen
  1 sibling, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 19:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 12:14 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jul 14, 2010 at 11:46 AM, Ingo Molnar <mingo@elte.hu> wrote:
>> We could solve that by copying that small stack frame off before entering the
>> 'generic' NMI routine - but it all feels a bit pulled in by the hair.
>
> Why? It's much cleaner than making the _real_ codepaths much worse.

.. but if the option is to never take a fault at all from the NMI
handler, and that is doable, then that would be good, of course.

But that may not be fixable. At least not without much more pain than
just adding a fairly simple hack to the NMI path itself, and keeping
all the NMI pain away from all the other cases.

And doing the per-cpu NMI nesting hack would actually also work as a
way to essentially block NMI's from critical regions. With my NMI
nesting avoidance suggestion, you could literally do something like
this to block NMI's:

  /* This is just a fake stack structure */
  struct nmi_blocker {
     unsigned long rflags;
     unsigned long cs;
     unsigned long rip;
   };

  void block_nmi_on_this_cpu(struct nmi_blocker *blocker)
  {
      get_cpu();
      memset(blocker, 0, sizeof(*blocker));
      per_cpu_nmi_stack_frame = blocker;
  }

  void unblock_nmi_on_this_cpu(struct nmi_blocker *blocker)
  {
     per_cpu_nmi_stack_frame = NULL;
     barrier();
     /* Did an NMI happen? If so, we're now running NMI-blocked by hardware,
      * we need to emulate the NMI and do a real 'iret' here
      */
     if (blocker->cs == INVALID_CS)
        asm volatile(".. build stack frame, call NMI routine ..");
     put_cpu();
  }

or similar. Wouldn't that be nice to have as a capability?
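
For illustration only, a hypothetical caller of the two helpers sketched above
could look like the following; the code-patching body and the function name are
made up for the example:

  /* Hypothetical usage of the sketch above: keep NMIs off this CPU while
   * patching text that an NMI handler might execute.  Purely illustrative. */
  void patch_text_without_nmi(void *addr, const void *opcode, size_t len)
  {
      struct nmi_blocker blocker;

      block_nmi_on_this_cpu(&blocker);
      memcpy(addr, opcode, len);        /* stand-in for the real patching */
      unblock_nmi_on_this_cpu(&blocker);
  }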

                 Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 19:36             ` Frederic Weisbecker
@ 2010-07-14 19:54               ` Linus Torvalds
  2010-07-14 20:17                 ` Mathieu Desnoyers
  2010-07-14 22:14                 ` Frederic Weisbecker
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 19:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Mathieu Desnoyers, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
<fweisbec@gmail.com> wrote:
>
> There is also the fact that we need to handle the lost NMI, by deferring its
> treatment or so. That adds even more complexity.

I don't think you read my proposal very deeply. It already handles
them by taking a fault on the iret of the first one (that's why we
point to the stack frame - so that we can corrupt it and force a
fault).

   Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 19:41             ` Linus Torvalds
@ 2010-07-14 19:56               ` Andi Kleen
  2010-07-14 20:05                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-14 19:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Mathieu Desnoyers, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

> or similar. Wouldn't that be nice to have as a capability?

It means the NMI watchdog would get useless if these areas
become common.

Again I suspect all of this is not really needed anyways if 
vmalloc_sync_all() works properly. That would solve the original
problem Mathieu was trying to solve for per_cpu data. The rule
would be just to call vmalloc_sync_all() properly when changing
per CPU data too.

In fact I'm pretty sure it worked originally. Perhaps it regressed?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 19:21       ` Maciej W. Rozycki
@ 2010-07-14 19:58         ` Mathieu Desnoyers
  2010-07-14 20:36           ` Maciej W. Rozycki
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 19:58 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Maciej W. Rozycki (macro@linux-mips.org) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
> 
> > >  How about only using the special return path when a nested exception is 
> > > about to return to the NMI handler?  You'd avoid all the odd cases then 
> > > that do not happen in the NMI context.
> > 
> > This is exactly what this patch does :-)
> 
>  Ah, OK then -- I understood you actually tested the value of TF in the 
> image to be restored.

It tests it too. When it detects that the return path is about to return to an
NMI handler, it checks if the TF flag is set. If it is set, then "iret" is
really needed, because TF can only single-step an instruction when set by
"iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
follows popf. Anyway, single-stepping is really discouraged in nmi handlers,
because there is no way to go around the iret.

> 
> > It selects the return path with
> > 
> > +       testl $NMI_MASK,TI_preempt_count(%ebp)
> > +       jz resume_kernel                /* Not nested over NMI ? */
> > 
> > In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
> > not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> > not to set it. (from my understanding of Intel's
> > Intel Architecture Software Developer’s Manual Volume 3: System Programming
> > 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)
> 
>  The CPU only sets RF itself in the image saved in certain cases -- you'd 
> see it set in the page fault handler for example, so that once the handler 
> has finished any instruction breakpoint does not hit (presumably again, 
> because the instruction breakpoint debug exception has the highest 
> priority).  You mentioned the need to handle these faults.

Well, the only case where I think it might make sense to allow a breakpoint in
NMI handler code would be to temporarily replace a static branch, which should
in no way be able to trigger any other fault.

> 
> > So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
> > 
> > It's just the "single-stepping" feature of kprobes which is problematic.
> > Luckily, only int3 is needed for code patching bypass.
> 
>  Actually the breakpoint exception handler should actually probably set RF 
> explicitly, but that depends on the exact debugging scenario, so I can't 
> comment on it further.  I don't know how INT3 is used in this context, so 
> I'm just noting this may be a danger zone.

In the case of temporary bypass, the int3 is only there to divert the
instruction execution flow to somewhere else, and we come back to the original
code at the address following the instruction which has the breakpoint. So
basically, we never come back to the original instruction, ever. We might as
well just clear the RF flag from the EFLAGS image before popf.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 19:56               ` Andi Kleen
@ 2010-07-14 20:05                 ` Mathieu Desnoyers
  2010-07-14 20:07                   ` Andi Kleen
  2010-07-14 22:31                   ` Frederic Weisbecker
  0 siblings, 2 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 20:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Ingo Molnar, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

* Andi Kleen (andi@firstfloor.org) wrote:
> > or similar. Wouldn't that be nice to have as a capability?
> 
> It means the NMI watchdog would get useless if these areas
> become common.
> 
> Again I suspect all of this is not really needed anyways if 
> vmalloc_sync_all() works properly. That would solve the original
> problem Mathieu was trying to solve for per_cpu data. The rule
> would be just to call vmalloc_sync_all() properly when changing
> per CPU data too.

Yep, that would solve the page fault in nmi problem altogether without adding
complexity.

> 
> In fact I'm pretty sure it worked originally. Perhaps it regressed?

I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
between percpu data allocation and tracing activation ? The generic ring buffer
library I posted last week does it already as a precaution for this very
specific reason (making sure NMIs never trigger page faults).
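
For illustration, a minimal sketch of that precaution, assuming a hypothetical
per-cpu buffer type and allocation function (the names are made up; this is not
a claim about perf's actual code):

  /* Illustrative sketch only (names made up): allocate per-cpu buffers in
   * vmalloc space, then sync the kernel page tables before any NMI user
   * can dereference them. */
  struct rb_cpu_buffer { char data[8192]; };   /* hypothetical per-cpu state */
  static struct rb_cpu_buffer __percpu *rb_buffers;

  static int rb_alloc_buffers(void)
  {
      rb_buffers = alloc_percpu(struct rb_cpu_buffer);
      if (!rb_buffers)
          return -ENOMEM;
      /* Make every process's kernel mappings cover the freshly vmalloc'ed
       * percpu chunk, so an NMI touching it never faults. */
      vmalloc_sync_all();
      return 0;
  }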

Thanks,

Mathieu

> 
> -Andi
> -- 
> ak@linux.intel.com -- Speaking for myself only.

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 20:05                 ` Mathieu Desnoyers
@ 2010-07-14 20:07                   ` Andi Kleen
  2010-07-14 20:08                     ` H. Peter Anvin
  2010-07-14 22:31                   ` Frederic Weisbecker
  1 sibling, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-14 20:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	H. Peter Anvin, Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

> > In fact I'm pretty sure it worked originally. Perhaps it regressed?
> 
> I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
> between percpu data allocation and tracing activation ? The generic ring buffer
> library I posted last week does it already as a precaution for this very
> specific reason (making sure NMIs never trigger page faults).

I suspect the low level per cpu allocation functions should 
just call it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 20:07                   ` Andi Kleen
@ 2010-07-14 20:08                     ` H. Peter Anvin
  2010-07-14 23:32                       ` Tejun Heo
  0 siblings, 1 reply; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-14 20:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On 07/14/2010 01:07 PM, Andi Kleen wrote:
>>> In fact I'm pretty sure it worked originally. Perhaps it regressed?
>>
>> I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
>> between percpu data allocation and tracing activation ? The generic ring buffer
>> library I posted last week does it already as a precaution for this very
>> specific reason (making sure NMIs never trigger page faults).
> 
> I suspect the low level per cpu allocation functions should 
> just call it.
> 

Yes, specifically the point at which we allocate new per cpu memory blocks.

	-hpa


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 19:54               ` Linus Torvalds
@ 2010-07-14 20:17                 ` Mathieu Desnoyers
  2010-07-14 20:55                   ` Linus Torvalds
  2010-07-14 22:14                 ` Frederic Weisbecker
  1 sibling, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> <fweisbec@gmail.com> wrote:
> >
> > There is also the fact that we need to handle the lost NMI, by deferring its
> > treatment or so. That adds even more complexity.
> 
> I don't think you read my proposal very deeply. It already handles
> them by taking a fault on the iret of the first one (that's why we
> point to the stack frame - so that we can corrupt it and force a
> fault).

It only handles the case of a single NMI coming in. What happens in this
scenario?

- NMI (1) comes in.
- takes a fault
    - iret
- NMI (2) comes in.
  - nesting detected, popf/ret
- takes another fault
- NMI (3) comes in.
  - nesting detected, popf/ret
- iret faults
  - executes only one extra NMI handler

We miss NMI (3) here. I think this is an important change from the semantics where,
AFAIK, the hardware should be allowed to assume that the CPU will execute as
many NMI handlers as there are NMIs acknowledged.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 19:58         ` Mathieu Desnoyers
@ 2010-07-14 20:36           ` Maciej W. Rozycki
  0 siblings, 0 replies; 168+ messages in thread
From: Maciej W. Rozycki @ 2010-07-14 20:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> It tests it too. When it detects that the return path is about to return to an
> NMI handler, it checks if the TF flag is set. If it is set, then "iret" is
> really needed, because TF can only single-step an instruction when set by
> "iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
> follows popf. Anyway, single-stepping is really discouraged in nmi handlers,
> because there is no way to go around the iret.

 Hmm, with Pentium Pro and more recent processors there is actually a 
nasty hack that will let you get away with POPF/RET and TF set. ;)  You 
can try it if you like and can arrange for an appropriate scenario.

> In the case of temporary bypass, the int3 is only there to divert the
> instruction execution flow to somewhere else, and we come back to the original
> code at the address following the instruction which has the breakpoint. So
> basically, we never come back to the original instruction, ever. We might as
> well just clear the RF flag from the EFLAGS image before popf.

 Yes, if you return to elsewhere, then that's actually quite desirable 
IMHO.

 This RF flag is quite complicated to handle and there are some errata 
involved too.  If I understand it correctly, all fault-class exception 
handlers are expected to set it manually in the image to be restored if 
they return to the original faulting instruction (that includes the debug 
exception handler if it was invoked as a fault, i.e. in response to an 
instruction breakpoint).  Then all trap-class exception handlers are 
expected to clear the flag (and that includes the debug exception handler 
if it was invoked as a trap, e.g. in response to a data breakpoint or a 
single step).  I haven't checked if Linux gets these bits right, but it 
may be worth doing so.
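
Expressed on the saved register image, the convention described above would look
roughly like the following sketch; it is illustrative only, not a statement about
what Linux currently does:

  /* Illustrative only: the convention above, applied to the saved register
   * image (struct pt_regs) using X86_EFLAGS_RF. */
  static void fault_class_return_fixup(struct pt_regs *regs)
  {
      /* returning to re-execute the faulting instruction: suppress a
       * pending instruction breakpoint for that one instruction */
      regs->flags |= X86_EFLAGS_RF;
  }

  static void trap_class_return_fixup(struct pt_regs *regs)
  {
      /* trap-class exceptions return to the next instruction, so RF
       * must not be carried over */
      regs->flags &= ~X86_EFLAGS_RF;
  }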

 For the record -- GDB hardly cares, because it removes any instruction 
breakpoints before it is asked to resume execution of an instruction that 
has a breakpoint set at, single-steps the instruction with all the other 
threads locked out and then reinserts the breakpoints so that they can hit 
again.  Then it proceeds with whatever should be done next to fulfil the 
execution request.

  Maciej

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 18:10       ` Linus Torvalds
  2010-07-14 18:46         ` Ingo Molnar
@ 2010-07-14 20:39         ` Mathieu Desnoyers
  2010-07-14 21:23           ` Linus Torvalds
  1 sibling, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 20:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote
[...]
> In fact, I wonder if we couldn't just do a software NMI disable
> instead? Have a per-cpu variable (in the _core_ percpu areas that get
> allocated statically) that points to the NMI stack frame, and just
> make the NMI code itself do something like
> 
>  NMI entry:

Let's try to figure out how far we can go with this idea. First, to answer
Ingo's criticism, let's assume we do a stack frame copy before entering the
"generic" nmi handler routine.

>  - load percpu NMI stack frame pointer
>  - if non-zero we know we're nested, and should ignore this NMI:
>     - we're returning to kernel mode, so return immediately by using
> "popf/ret", which also keeps NMI's disabled in the hardware until the
> "real" NMI iret happens.

Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we
know how many NMIs should be replayed at iret ?

>     - before the popf/iret, use the NMI stack pointer to make the NMI
> return stack be invalid and cause a fault

I assume you mean "popf/ret" here. So assuming we use a frame copy, we should
change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0
NMI iret will trigger the fault.

>   - set the NMI stack pointer to the current stack pointer

That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi
stack copy.

> 
>  NMI exit (not the above "immediate exit because we nested"):
>    clear the percpu NMI stack pointer

This would be rather:
- Copy the nesting 0 stack copy back onto the real nmi stack.
- clear the percpu nmi stack pointer

** !

>    Just do the iret.

Which presumably faults if we changed the return stack for an invalid one and
executes as many NMIs as there are "missed nmis" counted (missed nmis should
probably be read with an xchg() instruction).

So, one question persists, regarding the "** !" comment: what do we do if an NMI
comes in exactly at that point ? I'm afraid it will overwrite the "real" nmi
stack, and therefore drop all the "pending" nmis by setting the nmi stack return
address to a valid one.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 20:17                 ` Mathieu Desnoyers
@ 2010-07-14 20:55                   ` Linus Torvalds
  2010-07-14 21:18                     ` Ingo Molnar
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 20:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 1:17 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> It only handles the case of a single NMI coming in. What happens in this
> scenario?

[ two nested NMI's ]

The _right_ thing happens.

What do you think the hardware would have done itself? The NMI was
blocked. It wouldn't get replayed twice. If you have two NMI's
happening while another NMI is active, you will get a single NMI after
the first NMI has completed.

So stop these _idiotic_ complaints. And face the music:

 - NMI's aren't that important. They are a _hell_ of a lot less
important than the standard page fault path, for example.

 - We do _not_ want to add more NMI magic outside of the NMI
codepaths. It's much better to handle NMI special cases in the NMI
code, rather than sprinkle them in random other codepaths (like percpu
allocators) that have NOTHING WHAT-SO-EVER to do with NMI's!

                        Linus

>
> - NMI (1) comes in.
> - takes a fault
>    - iret
> - NMI (2) comes in.
>  - nesting detected, popf/ret
> - takes another fault
> - NMI (3) comes in.
>  - nesting detected, popf/ret
> - iret faults
>  - executes only one extra NMI handler
>
> We miss NMI (3) here. I think this is an important change from the semantics where,
> AFAIK, the hardware should be allowed to assume that the CPU will execute as
> many NMI handlers as there are NMIs acknowledged.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
>

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 20:55                   ` Linus Torvalds
@ 2010-07-14 21:18                     ` Ingo Molnar
  0 siblings, 0 replies; 168+ messages in thread
From: Ingo Molnar @ 2010-07-14 21:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Frederic Weisbecker, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Jul 14, 2010 at 1:17 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > It only handles the case of a single NMI coming in. What happens in this
> > scenario?
> 
> [ two nested NMI's ]
> 
> The _right_ thing happens.
> 
> What do you think the hardware would have done itself? The NMI was blocked. 
> It wouldn't get replayed twice. If you have two NMI's happening while 
> another NMI is active, you will get a single NMI after the first NMI has 
> completed.

If it ever became an issue, we could even do what softirqs do and re-execute 
the NMI handler. At least for things like PMU NMIs we have to handle them once 
they have been (re-)issued, or we'd get a stuck PMU.

But in any case it should be a non-issue.

	Ingo

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 20:39         ` Mathieu Desnoyers
@ 2010-07-14 21:23           ` Linus Torvalds
  2010-07-14 21:45             ` Maciej W. Rozycki
  2010-07-14 22:21             ` Mathieu Desnoyers
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 21:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 1:39 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
>>  - load percpu NMI stack frame pointer
>>  - if non-zero we know we're nested, and should ignore this NMI:
>>     - we're returning to kernel mode, so return immediately by using
>> "popf/ret", which also keeps NMI's disabled in the hardware until the
>> "real" NMI iret happens.
>
> Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we
> know how many NMIs should be replayed at iret ?

No. As mentioned, there is no such counter in real hardware either.

Look at what happens for the not-nested case:

 - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
NMI's disabled

 - NMI2 triggers. Nothing happens, the NMI's are disabled.

 - NMI3 triggers. Again, nothing happens, the NMI's are still disabled

 - the NMI handler returns.

 - What happens now?

How many NMI interrupts do you get? ONE. Exactly like my "emulate it
in software" approach. The hardware doesn't have any counters for
pending NMI's either. Why should the software emulation have them?

>>     - before the popf/iret, use the NMI stack pointer to make the NMI
>> return stack be invalid and cause a fault
>
> I assume you mean "popf/ret" here.

Yes, that was a typo. The whole point of using popf was obviously to
_avoid_ the iret ;)

> So assuming we use a frame copy, we should
> change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0
> NMI iret will trigger the fault
>
>>   - set the NMI stack pointer to the current stack pointer
>
> That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi
> stack copy.

I think you're confused. Or I am by your question.

The NMI code would literally just do:

 - check if the NMI was nested, by looking at whether the percpu
nmi-stack-pointer is non-NULL

 - if it was nested, do nothing, and return with a popf/ret. The only
stack this sequence might need is to save/restore the register that
we use for the percpu value (although maybe we can just do a "cmpl
$0,%_percpu_seg:nmi_stack_ptr" and not even need that), and it's
atomic because at this point we know that NMI's are disabled (we've
not _yet_ taken any nested faults)

 - if it's a regular (non-nesting) NMI, we'd basically do

     6* pushq 48(%rsp)

   to copy the five words that the NMI pushed (ss/esp/eflags/cs/eip)
and the one we saved ourselves (if we needed any, maybe we can make do
with just 5 words).

 - then we just save that new stack pointer to the percpu thing with a simple

     movq %rsp,%__percpu_seg:nmi_stack_ptr

and we're all done. The final "iret" will do the right thing (either
fault or return), and there are no races that I can see exactly
because we use a single nmi-atomic instruction (the "iret" itself) to
either re-enable NMI's _or_ test whether we should re-do an NMI.

There is a single-instruction window that is interesting in the return
path, which is the window between the two final instructions:

    movl $0,%__percpu_seg:nmi_stack_ptr
    iret

where I wonder what happens if we have re-enabled NMI (due to a fault
in the NMI handler), but we haven't actually taken the NMI itself yet,
so now we _will_ re-use the stack. Hmm. I suspect we need another of
those horrible "if the NMI happens at this particular %rip" cases that
we already have for the sysenter code on x86-32 for the NMI/DEBUG trap
case of fixing up the stack pointer.

And maybe I missed something else. But it does look reasonably simple.
Subtle, but not a lot of code. And the code is all very much about the
NMI itself, not about other random sequences. No?

                Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 21:23           ` Linus Torvalds
@ 2010-07-14 21:45             ` Maciej W. Rozycki
  2010-07-14 21:52               ` Linus Torvalds
  2010-07-14 22:21             ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Maciej W. Rozycki @ 2010-07-14 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, 14 Jul 2010, Linus Torvalds wrote:

> No. As mentioned, there is no such counter in real hardware either.

 There is a 1-bit counter or actually a latch.

> Look at what happens for the not-nested case:
> 
>  - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
> NMI's disabled

 Correct.

>  - NMI2 triggers. Nothing happens, the NMI's are disabled.

 The NMI latch records the second NMI.  Note this is edge-sensitive like 
the NMI line itself.

>  - NMI3 triggers. Again, nothing happens, the NMI's are still disabled

 Correct.

>  - the NMI handler returns.
> 
>  - What happens now?

 NMI2 latched above causes the NMI handler to be invoked as the next 
instruction after IRET.  The latch is cleared as the interrupt is taken.

> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
> in software" approach. The hardware doesn't have any counters for
> pending NMI's either. Why should the software emulation have them?

 Two. :)

  Maciej

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 21:45             ` Maciej W. Rozycki
@ 2010-07-14 21:52               ` Linus Torvalds
  2010-07-14 22:31                 ` Maciej W. Rozycki
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 21:52 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 2:45 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Wed, 14 Jul 2010, Linus Torvalds wrote:
>
>> No. As mentioned, there is no such counter in real hardware either.
>
>  There is a 1-bit counter or actually a latch.

Well, that's what our single-word flag is too.

>> Look at what happens for the not-nested case:
>>
>>  - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
>> NMI's disabled
>
>  Correct.
>
>>  - NMI2 triggers. Nothing happens, the NMI's are disabled.
>
>  The NMI latch records the second NMI.  Note this is edge-sensitive like
> the NMI line itself.
>
>>  - NMI3 triggers. Again, nothing happens, the NMI's are still disabled
>
>  Correct.
>
>>  - the NMI handler returns.
>>
>>  - What happens now?
>
>  NMI2 latched above causes the NMI handler to be invoked as the next
> instruction after IRET.  The latch is cleared as the interrupt is taken.
>
>> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
>> in software" approach. The hardware doesn't have any counters for
>> pending NMI's either. Why should the software emulation have them?
>
>  Two. :)

You just count differently. I don't count the first one (the "real"
NMI). That obviously happens. So I only count how many interrupts we
need to fake. That's my "one". That's the one that happens as a result
of the fault that we take on the iret in the emulated model.

So there is no need to count anything. We take a fault on the iret if
we got a nested NMI (regardless of how _many_ such nested NMI's we
took). That's the "latch", exactly like in the hardware. No counter.

(Yeah, yeah, you can call it a "one-bit counter", but I don't think
that's a counter. It's just a bit of information).

                      Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 19:54               ` Linus Torvalds
  2010-07-14 20:17                 ` Mathieu Desnoyers
@ 2010-07-14 22:14                 ` Frederic Weisbecker
  2010-07-14 22:31                   ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-14 22:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Mathieu Desnoyers, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> <fweisbec@gmail.com> wrote:
> >
> > There is also the fact that we need to handle the lost NMI, by deferring its
> > treatment or so. That adds even more complexity.
> 
> I don't think you read my proposal very deeply. It already handles
> them by taking a fault on the iret of the first one (that's why we
> point to the stack frame - so that we can corrupt it and force a
> fault).


Ah right, I missed this part.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 21:23           ` Linus Torvalds
  2010-07-14 21:45             ` Maciej W. Rozycki
@ 2010-07-14 22:21             ` Mathieu Desnoyers
  2010-07-14 22:37               ` Linus Torvalds
  1 sibling, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 22:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 1:39 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> >>  - load percpu NMI stack frame pointer
> >>  - if non-zero we know we're nested, and should ignore this NMI:
> >>     - we're returning to kernel mode, so return immediately by using
> >> "popf/ret", which also keeps NMI's disabled in the hardware until the
> >> "real" NMI iret happens.
> >
> > Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we
> > know how many NMIs should be replayed at iret ?
> 
> No. As mentioned, there is no such counter in real hardware either.
> 
> Look at what happens for the not-nested case:
> 
>  - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
> NMI's disabled
> 
>  - NMI2 triggers. Nothing happens, the NMI's are disabled.
> 
>  - NMI3 triggers. Again, nothing happens, the NMI's are still disabled
> 
>  - the NMI handler returns.
> 
>  - What happens now?
> 
> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
> in software" approach. The hardware doesn't have any counters for
> pending NMI's either. Why should the software emulation have them?

So I figure that given Maciej's response, we can get at most 2 nested NMIs, no
more. So I was probably going too far with the counter, but we need 2.
However, failure to deliver the second NMI in this case would not match the
hardware expectations (see below).

> 
> >>     - before the popf/iret, use the NMI stack pointer to make the NMI
> >> return stack be invalid and cause a fault
> >
> > I assume you mean "popf/ret" here.
> 
> Yes, that was a typo. The whole point of using popf was obviously to
> _avoid_ the iret ;)
> 
> > So assuming we use a frame copy, we should
> > change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0
> > NMI iret will trigger the fault
> >
> >>   - set the NMI stack pointer to the current stack pointer
> >
> > That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi
> > stack copy.
> 
> I think you're confused. Or I am by your question.
> 
> The NMI code would literally just do:
> 
>  - check if the NMI was nested, by looking at whether the percpu
> nmi-stack-pointer is non-NULL
> 
>  - if it was nested, do nothing, and return with a popf/ret. The only
> stack this sequence might need is to save/restore the register that
> we use for the percpu value (although maybe we can just do a "cmpl
> $0,%_percpu_seg:nmi_stack_ptr" and not even need that), and it's
> atomic because at this point we know that NMI's are disabled (we've
> not _yet_ taken any nested faults)
> 
>  - if it's a regular (non-nesting) NMI, we'd basically do
> 
>      6* pushq 48(%rsp)
> 
>    to copy the five words that the NMI pushed (ss/esp/eflags/cs/eip)
> and the one we saved ourselves (if we needed any, maybe we can make do
> with just 5 words).

Ah, right, you only need to do the copy and use the copy for the nesting level 0
NMI handler. The nested NMI can work on the "real" nmi stack because we never
expect it to fault.

> 
>  - then we just save that new stack pointer to the percpu thing with a simple
> 
>      movq %rsp,%__percpu_seg:nmi_stack_ptr
> 
> and we're all done. The final "iret" will do the right thing (either
> fault or return), and there are no races that I can see exactly
> because we use a single nmi-atomic instruction (the "iret" itself) to
> either re-enable NMI's _or_ test whether we should re-do an NMI.
> 
> There is a single-instruction window that is interesting in the return
> path, which is the window between the two final instructions:
> 
>     movl $0,%__percpu_seg:nmi_stack_ptr
>     iret
> 
> where I wonder what happens if we have re-enabled NMI (due to a fault
> in the NMI handler), but we haven't actually taken the NMI itself yet,
> so now we _will_ re-use the stack. Hmm. I suspect we need another of
> those horrible "if the NMI happens at this particular %rip" cases that
> we already have for the sysenter code on x86-32 for the NMI/DEBUG trap
> case of fixing up the stack pointer.

Yes, this was the exact instruction window I was worried about. I see another
possible failure mode:

- NMI
 - page fault
   - iret
 - NMI
   - set nmi_stack_ptr to 0, popf/lret.
 - page fault (yep, another one!)
   - iret
 - movl $0,%__percpu_seg:nmi_stack_ptr
 - iret

So in this case, movl/iret are executed with NMIs enabled. So if an NMI comes in
after the movl instruction, it will not detect that it is nested and will re-use
the percpu "nmi_stack_ptr" stack, squashing the "faulty" stack ptr with a brand
new one which won't trigger a fault. I'm afraid that in this case, the last NMI
handler will iret to the "nesting 0" handler at the iret instruction, which will
in turn return to itself, letting all hell break loose in the meantime (endless iret
loop).

So this also calls for special-casing an NMI nested on top of the following iret

 - movl $0,%__percpu_seg:nmi_stack_ptr
 - iret   <-----

instruction. At the beginning of the NMI handler, we could detect if we are
either nested over an NMI (checking nmi_stack_ptr != NULL) or if we are at this
specific %rip, and assume we are nested in both cases.

> 
> And maybe I missed something else. But it does look reasonably simple.
> Subtle, but not a lot of code. And the code is all very much about the
> NMI itself, not about other random sequences. No?

If we can find a clean way to handle this NMI vs iret problem outside of the
entry_*.S code, within NMI-specific code, I'm indeed all for it. entry_*.S is
already complicated enough as it is. I think checking the %rip at NMI entry
could work out.
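
A rough sketch of that nesting test, assuming a hypothetical percpu
"nmi_stack_ptr" (the pointer from Linus's proposal) and a hypothetical label
"nmi_final_iret" placed on the iret that follows the clearing movl:

  /* Sketch only: the nesting test at NMI entry.  Both names below are
   * hypothetical -- nmi_stack_ptr is the percpu pointer from the proposal,
   * nmi_final_iret a label on the iret following the clearing movl. */
  extern char nmi_final_iret[];
  static DEFINE_PER_CPU(void *, nmi_stack_ptr);

  static bool nmi_is_nested(struct pt_regs *regs)
  {
      if (__get_cpu_var(nmi_stack_ptr))
          return true;                  /* classic nested case */
      /* the one-instruction window: pointer already cleared, but the
       * outer NMI has not executed its iret yet */
      return regs->ip == (unsigned long)nmi_final_iret;
  }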

Thanks!

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:14                 ` Frederic Weisbecker
@ 2010-07-14 22:31                   ` Mathieu Desnoyers
  2010-07-14 22:48                     ` Frederic Weisbecker
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 22:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Ingo Molnar, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > <fweisbec@gmail.com> wrote:
> > >
> > > There is also the fact that we need to handle the lost NMI, by deferring its
> > > treatment or so. That adds even more complexity.
> > 
> > I don't think you read my proposal very deeply. It already handles
> > them by taking a fault on the iret of the first one (that's why we
> > point to the stack frame - so that we can corrupt it and force a
> > fault).
> 
> 
> Ah right, I missed this part.

Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
data structures exactly ? :)

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 21:52               ` Linus Torvalds
@ 2010-07-14 22:31                 ` Maciej W. Rozycki
  0 siblings, 0 replies; 168+ messages in thread
From: Maciej W. Rozycki @ 2010-07-14 22:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, 14 Jul 2010, Linus Torvalds wrote:

> You just count differently. I don't count the first one (the "real"
> NMI). That obviously happens. So I only count how many interrupts we
> need to fake. That's my "one". That's the one that happens as a result
> of the fault that we take on the iret in the emulated model.

 Ah, I see -- so we are on the same page after all.

> (Yeah, yeah, you can call it a "one-bit counter", but I don't think
> that's a counter. It's just a bit of information).

 Hardware has something like a strapped-high D flip-flop (NMI goes to the 
clock input) with an extra reset input I presume -- this dates back to 
8086 when the transistor count mattered with accuracy higher than 1e6. ;)

  Maciej

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 20:05                 ` Mathieu Desnoyers
  2010-07-14 20:07                   ` Andi Kleen
@ 2010-07-14 22:31                   ` Frederic Weisbecker
  2010-07-14 22:56                     ` Linus Torvalds
  1 sibling, 1 reply; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-14 22:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, Linus Torvalds, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 04:05:52PM -0400, Mathieu Desnoyers wrote:
> * Andi Kleen (andi@firstfloor.org) wrote:
> > > or similar. Wouldn't that be nice to have as a capability?
> > 
> > It means the NMI watchdog would get useless if these areas
> > become common.
> > 
> > Again I suspect all of this is not really needed anyways if 
> > vmalloc_sync_all() works properly. That would solve the original
> > problem Mathieu was trying to solve for per_cpu data. The rule
> > would be just to call vmalloc_sync_all() properly when changing
> > per CPU data too.
> 
> Yep, that would solve the page fault in nmi problem altogether without adding
> complexity.
> 
> > 
> > In fact I'm pretty sure it worked originally. Perhaps it regressed?
> 
> I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all()
> between percpu data allocation and tracing activation ? The generic ring buffer
> library I posted last week does it already as a precaution for this very
> specific reason (making sure NMIs never trigger page faults).


Ok, I should try.

Until now I didn't, because I clearly don't understand the vmalloc internals. I'm
not even quite sure why memory allocated with vmalloc can sometimes be left
unmapped (and then fault once for this to sync). Some people have tried to explain
it to me but the picture is still vague to me.

And moreover, I'm not quite sure whether vmalloc_sync_all() syncs the pgd
for every task... Tejun seemed to say it's not necessarily the case on
x86-32... Again, I think I haven't totally understood the details.

Thanks.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:21             ` Mathieu Desnoyers
@ 2010-07-14 22:37               ` Linus Torvalds
  2010-07-14 22:51                 ` Jeremy Fitzhardinge
  2010-07-15  1:23                 ` Linus Torvalds
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 22:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 3:21 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> If we can find a clean way to handle this NMI vs iret problem outside of the
> entry_*.S code, within NMI-specific code, I'm indeed all for it. entry_*.S is
> already complicated enough as it is. I think checking the %rip at NMI entry
> could work out.

I think the %rip check should be pretty simple - exactly because there
is only a single point where the race is open between that 'mov' and
the 'iret'. So it's simpler than the (similar) thing we do for
debug/nmi stack fixup for sysenter that has to check a range.

The only worry is if that crazy paravirt code wants to paravirtualize
the iretq. Afaik, paravirt does that exactly because they screw up
iret handling themselves. Maybe we could stop doing that stupid iretq
paravirtualization, and just tell the paravirt people to do the same
thing I propose, and just allow nesting.

                           Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:31                   ` Mathieu Desnoyers
@ 2010-07-14 22:48                     ` Frederic Weisbecker
  2010-07-14 23:11                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-14 22:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Ingo Molnar, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > <fweisbec@gmail.com> wrote:
> > > >
> > > > There is also the fact that we need to handle the lost NMI, by deferring its
> > > > treatment or so. That adds even more complexity.
> > > 
> > > I don't think you read my proposal very deeply. It already handles
> > > them by taking a fault on the iret of the first one (that's why we
> > > point to the stack frame - so that we can corrupt it and force a
> > > fault).
> > 
> > 
> > Ah right, I missed this part.
> 
> Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> data structures exactly ? :)
> 
> Mathieu



So, when an event triggers in perf, we sometimes want to capture the stacktrace
that led to the event.

We want this stacktrace (here we call that a callchain) to be recorded
locklessly. So we want this callchain buffer per cpu, with the following
type:

	#define PERF_MAX_STACK_DEPTH		255

	struct perf_callchain_entry {
		__u64				nr;
		__u64				ip[PERF_MAX_STACK_DEPTH];
	};


That makes 2048 bytes. But per cpu is not enough for the callchain to be recorded
locklessly; we also need one buffer per context: task, softirq, hardirq, nmi, as
an event can trigger in any of these.
Since we disable preemption, none of these contexts can nest locally. In
fact hardirqs can nest but we just don't care about this corner case.

So, it makes 2048 * 4 = 8192 bytes. And that per cpu.
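
Concretely, the layout would be something like the sketch below; the context
enum and the array name are illustrative, not perf's actual identifiers:

  /* Sketch of the per-cpu, per-context callchain storage described above
   * (identifiers are illustrative): 4 contexts * 2048 bytes = 8192 bytes
   * per cpu. */
  enum callchain_ctx { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, CTX_NR };

  static DEFINE_PER_CPU(struct perf_callchain_entry, callchain_entries[CTX_NR]);

  /* With preemption disabled, each context simply writes to its own slot,
   * e.g.:  entry = &__get_cpu_var(callchain_entries)[CTX_NMI];  */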


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:37               ` Linus Torvalds
@ 2010-07-14 22:51                 ` Jeremy Fitzhardinge
  2010-07-14 23:02                   ` Linus Torvalds
  2010-07-15  1:23                 ` Linus Torvalds
  1 sibling, 1 reply; 168+ messages in thread
From: Jeremy Fitzhardinge @ 2010-07-14 22:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Frank Ch. Eigler, Tejun Heo

On 07/14/2010 03:37 PM, Linus Torvalds wrote:
> I think the %rip check should be pretty simple - exactly because there
> is only a single point where the race is open between that 'mov' and
> the 'iret'. So it's simpler than the (similar) thing we do for
> debug/nmi stack fixup for sysenter that has to check a range.
>
> The only worry is if that crazy paravirt code wants to paravirtualize
> the iretq. Afaik, paravirt does that exactly because they screw up
> iret handling themselves. Maybe we could stop doing that stupid iretq
> paravirtualization, and just tell the paravirt people to do the same
> thing I propose, and just allow nesting.
>   

We screw around with iret because there's a separate interrupt mask flag
which can't be set atomically with respect to a stack/ring change (well,
there's more to it, but I won't confuse matters).

But it only really matters if the PV guest can also get NMI-like
interrupts.  While Xen supports NMI for PV guests, we don't use it much
and I haven't looked into implementing support for it yet.

    J

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:31                   ` Frederic Weisbecker
@ 2010-07-14 22:56                     ` Linus Torvalds
  2010-07-14 23:09                       ` Andi Kleen
  2010-07-15 14:11                       ` Frederic Weisbecker
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 22:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 3:31 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> Until now I didn't, because I clearly don't understand the vmalloc internals. I'm
> not even quite sure why memory allocated with vmalloc can sometimes be left
> unmapped (and then fault once for this to sync). Some people have tried to explain
> it to me but the picture is still vague to me.

So the issue is that the system can have thousands and thousands of
page tables (one for each process), and what do you do when you add a
new kernel virtual mapping?

You can:

 - make sure that you only ever use _one_ single top-level entry for
all vmalloc issues, and can make sure that all processes are created
with that static entry filled in. This is optimal, but it just doesn't
work on all architectures (eg on 32-bit x86, it would limit the
vmalloc space to 4MB in non-PAE, whatever)

 - at vmalloc time, when adding a new page directory entry, walk all
the tens of thousands of existing page tables under a lock that
guarantees that we don't add any new ones (ie it will lock out fork())
and add the required pgd entry to them.

 - or just take the fault and do the "fill the page tables" on demand.

Quite frankly, most of the time it's probably better to make that last
choice (unless your hardware makes it easy to make the first choice,
which is obviously simplest for everybody). It makes it _much_ cheaper
to do vmalloc. It also avoids that nasty latency issue. And it's just
simpler too, and has no interesting locking issues with how/when you
expose the page tables in fork() etc.

So the only downside is that you do end up taking a fault in the
(rare) case where you have a newly created task that didn't get an
even newer vmalloc entry. And that fault can sometimes be in an
interrupt or an NMI. Normally it's trivial to handle that fairly
simple nested fault. But NMI has that inconvenient "iret unblocks
NMI's, because there is no dedicated 'nmiret' instruction" problem on
x86.
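
A much simplified sketch of that last option, loosely modeled on the x86-64
vmalloc_fault() idea; it only handles the top-level entry and skips the
lower-level consistency checks a real handler would do:

  /* Simplified sketch: on a kernel fault in the vmalloc range, copy the
   * missing pgd entry from the reference (init_mm) page tables into the
   * page table the CPU is actually using (read from cr3, which also works
   * when "current" cannot be trusted). */
  static int vmalloc_fault_sketch(unsigned long address)
  {
      pgd_t *pgd_ref, *pgd;

      if (address < VMALLOC_START || address >= VMALLOC_END)
          return -1;                      /* not a vmalloc address */

      pgd_ref = pgd_offset_k(address);    /* init_mm's entry */
      if (pgd_none(*pgd_ref))
          return -1;                      /* genuinely bad access */

      pgd = (pgd_t *)__va(read_cr3()) + pgd_index(address);
      if (pgd_none(*pgd))
          set_pgd(pgd, *pgd_ref);         /* fill on demand */
      return 0;
  }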

                            Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:51                 ` Jeremy Fitzhardinge
@ 2010-07-14 23:02                   ` Linus Torvalds
  2010-07-14 23:54                     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 23:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 3:51 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
> We screw around with iret because there's a separate interrupt mask flag
> which can't be set atomically with respect to a stack/ring change (well,
> there's more to it, but I won't confuse matters).

Umm, I know. It's what this whole discussion (non-paravirtualized) is
all about. And I have a suggestion that should fix the
non-paravirtualized case _without_ actually touching anything but the
NMI code itself.

What I tried to say is that the paravirtualized people should take a
look at my suggestion, and see if they can apply the logic to their
NMI handling too. And in the process totally remove the need for
paravirtualizing iret, exactly because the approach handles the magic
NMI lock logic entirely in the NMI handler itself.

Because I think the same thing that would make us not need to worry
about nested page faults _during_ NMI (because we could make the NMI
code do the right thing even when the hardware doesn't lock out NMI's
for us) is also the exact same logic that the paravirt monitor could
do for its own NMI handling.

Wouldn't it be nice to be able to remove the need to paravirtualize iret?

                  Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:56                     ` Linus Torvalds
@ 2010-07-14 23:09                       ` Andi Kleen
  2010-07-14 23:22                         ` Linus Torvalds
  2010-07-15 14:11                       ` Frederic Weisbecker
  1 sibling, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-14 23:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frederic Weisbecker, Mathieu Desnoyers, Andi Kleen, Ingo Molnar,
	LKML, Andrew Morton, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	H. Peter Anvin, Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

>  - at vmalloc time, when adding a new page directory entry, walk all
> the tens of thousands of existing page tables under a lock that
> guarantees that we don't add any new ones (ie it will lock out fork())
> and add the required pgd entry to them.
> 
>  - or just take the fault and do the "fill the page tables" on demand.
> 
> Quite frankly, most of the time it's probably better to make that last
> choice (unless your hardware makes it easy to make the first choice,

Adding new PGDs should happen only very rarely (in fact 
at most once per boot on i386-PAE36, which has only 4 entries, most of them 
used by user space); most of the time a vmalloc only changes lower level 
tables. 

The PGD for the kernel mappings is already set up. On x86-64 it can happen 
more often in theory, but in practice it should also be extremely rare because
a PGD entry covers so much address space.

That's why I'm not sure this problem even happened. It should 
be extremely rare that the per cpu allocation is exactly what adds
that PGD. 

It can happen in theory, but for such a rare case taking a lock
and walking everything should be fine.

-Andi


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:48                     ` Frederic Weisbecker
@ 2010-07-14 23:11                       ` Mathieu Desnoyers
  2010-07-14 23:38                         ` Frederic Weisbecker
  2010-07-14 23:40                         ` Steven Rostedt
  0 siblings, 2 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 23:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Ingo Molnar, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > > <fweisbec@gmail.com> wrote:
> > > > >
> > > > > There is also the fact we need to handle the lost NMI, by defering its
> > > > > treatment or so. That adds even more complexity.
> > > > 
> > > > I don't think your read my proposal very deeply. It already handles
> > > > them by taking a fault on the iret of the first one (that's why we
> > > > point to the stack frame - so that we can corrupt it and force a
> > > > fault).
> > > 
> > > 
> > > Ah right, I missed this part.
> > 
> > Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> > data structures exactly ? :)
> > 
> > Mathieu
> 
> 
> 
> So, when an event triggers in perf, we sometimes want to capture the stacktrace
> that led to the event.
> 
> We want this stacktrace (here we call that a callchain) to be recorded
> locklessly. So we want this callchain buffer per cpu, with the following
> type:

Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
How about using a single one that is generic enough to handle perf and ftrace
needs instead ?

(/me runs away quickly before the lightning strikes) ;)

Mathieu


> 
> 	#define PERF_MAX_STACK_DEPTH		255
> 
> 	struct perf_callchain_entry {
> 		__u64				nr;
> 		__u64				ip[PERF_MAX_STACK_DEPTH];
> 	};
> 
> 
> That makes 2048 bytes. But per cpu is not enough for the callchain to be recorded
> locklessly, we also need one buffer per context: task, softirq, hardirq, nmi, as
> an event can trigger in any of these.
> Since we disable preemption, none of these contexts can nest locally. In
> fact hardirqs can nest but we just don't care about this corner case.
> 
> So, it makes 2048 * 4 = 8192 bytes. And that per cpu.
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 23:09                       ` Andi Kleen
@ 2010-07-14 23:22                         ` Linus Torvalds
  0 siblings, 0 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-14 23:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Frederic Weisbecker, Mathieu Desnoyers, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 4:09 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> It can happen in theory, but for such a rare case take a lock
> and walking everything should be fine.

Actually, that's _exactly_ the wrong kind of thinking.

Bad latency is bad latency, even when it happens rarely. So latency
problems kill - even when they are rare. So you want to avoid them.
And walking every possible page table is a _huge_ latency problem when
it happens.

In contrast, what's the advantage of doing things synchronously while
holding a lock? It's that you can avoid a few page faults, and get
better CPU use. But that's _stupid_ if it's something that is very
rare to begin with.

So the very rarity argues for the lazy approach. If it wasn't rare,
there would be a much stronger argument for trying to do things
up-front.

                   Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 0/2] x86: NMI-safe trap handlers
  2010-07-14 18:56     ` Andi Kleen
@ 2010-07-14 23:29       ` Tejun Heo
  0 siblings, 0 replies; 168+ messages in thread
From: Tejun Heo @ 2010-07-14 23:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro

Hello,

On 07/14/2010 08:56 PM, Andi Kleen wrote:
> On Wed, Jul 14, 2010 at 01:08:05PM -0400, Mathieu Desnoyers wrote:
>> * Andi Kleen (andi@firstfloor.org) wrote:
>>>> x86_32 cannot use vmalloc_sync_all() to sychronize the TLBs from
>>>> every processes because the vmalloc area is mapped in a different
>>>> address space for
>>> That doesn't make sense. vmalloc_all_sync() should work on 32bit
>>> too.  It just needs to walk all processes and fix up every page
>>> table.

Yeah, vmalloc_sync_all() synchronizes everything by walking every page
table, so it should work.  I was saying that just flushing the TLB
wouldn't cut it because multiple top level page table entries can be
used to map vmalloc areas.  It seems that both 32 and 64bit do that,
though.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 20:08                     ` H. Peter Anvin
@ 2010-07-14 23:32                       ` Tejun Heo
  0 siblings, 0 replies; 168+ messages in thread
From: Tejun Heo @ 2010-07-14 23:32 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler

Hello,

On 07/14/2010 10:08 PM, H. Peter Anvin wrote:
>> I suspect the low level per cpu allocation functions should 
>> just call it.
>>
> 
> Yes, specifically the point at which we allocate new per cpu memory
> blocks.

We can definitely do that, but walking the whole page table list is too
heavy to do automatically at that level, especially when all users
other than NMI would be fine w/ the default lazy approach.  If Linus'
approach doesn't pan out, I think the right thing to do would be
adding a wrapper to vmalloc_sync_all() and let perf code call it after
percpu allocation.
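
Something along these lines is what I mean (purely illustrative -- the
wrapper name is made up, this is not an existing interface):

#include <linux/percpu.h>
#include <linux/vmalloc.h>

/*
 * Hypothetical wrapper: allocate the percpu memory, then pre-fault the
 * new vmalloc-backed mapping into every page table so that NMI context
 * never takes the lazy fill fault on it.
 */
static void __percpu *perf_alloc_percpu_nmi_safe(size_t size, size_t align)
{
	void __percpu *ptr = __alloc_percpu(size, align);

	if (ptr)
		vmalloc_sync_all();	/* sync the new mapping everywhere */
	return ptr;
}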

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 23:11                       ` Mathieu Desnoyers
@ 2010-07-14 23:38                         ` Frederic Weisbecker
  2010-07-15 16:26                           ` Mathieu Desnoyers
  2010-07-14 23:40                         ` Steven Rostedt
  1 sibling, 1 reply; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-14 23:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Ingo Molnar, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 07:11:17PM -0400, Mathieu Desnoyers wrote:
> * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> > > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > > > <fweisbec@gmail.com> wrote:
> > > > > >
> > > > > > There is also the fact we need to handle the lost NMI, by defering its
> > > > > > treatment or so. That adds even more complexity.
> > > > > 
> > > > > I don't think your read my proposal very deeply. It already handles
> > > > > them by taking a fault on the iret of the first one (that's why we
> > > > > point to the stack frame - so that we can corrupt it and force a
> > > > > fault).
> > > > 
> > > > 
> > > > Ah right, I missed this part.
> > > 
> > > Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> > > data structures exactly ? :)
> > > 
> > > Mathieu
> > 
> > 
> > 
> > So, when an event triggers in perf, we sometimes want to capture the stacktrace
> > that led to the event.
> > 
> > We want this stacktrace (here we call that a callchain) to be recorded
> > locklessly. So we want this callchain buffer per cpu, with the following
> > type:
> 
> Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
> How about using a single one that is generic enough to handle perf and ftrace
> needs instead ?
> 
> (/me runs away quickly before the lightning strikes) ;)
> 
> Mathieu


:-)

That's no ring buffer. It's a temporary linear buffer to fill a stacktrace,
and get its effective size before committing it to the real ring buffer.

Sure that involves two copies.

But I don't have a better solution in mind than using a pre-buffer for that,
since we can't know the size of the stacktrace in advance. We could
always reserve the max stacktrace size, but that would be wasteful.
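
Roughly, the flow is the following (just a sketch -- the helper names here
are made up, this is not the actual perf code):

#include <linux/percpu.h>
#include <linux/perf_event.h>
#include <linux/preempt.h>

/* Illustrative helpers -- these names do not exist, they only stand in
 * for "figure out the context", "walk the stack" and "write the record": */
extern int current_context(void);	/* task / softirq / hardirq / nmi */
extern void fill_stacktrace(struct perf_callchain_entry *e, struct pt_regs *regs);
extern void commit_to_ring_buffer(const void *data, __u64 nr_entries);

enum { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, NR_CTX };

struct callchain_buffers {
	struct perf_callchain_entry ctx[NR_CTX];  /* 4 * 2048 = 8192 bytes per cpu */
};
static DEFINE_PER_CPU(struct callchain_buffers, callchain_bufs);

static void record_callchain(struct pt_regs *regs)
{
	struct perf_callchain_entry *entry;

	preempt_disable();
	/* one buffer per context, so nesting contexts never stomp on each other */
	entry = &__get_cpu_var(callchain_bufs).ctx[current_context()];

	entry->nr = 0;
	fill_stacktrace(entry, regs);	/* first copy: stack into the temp buffer */

	/* now the effective size is known, commit only that much */
	commit_to_ring_buffer(entry, entry->nr);	/* second copy: real ring buffer */
	preempt_enable();
}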


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 23:11                       ` Mathieu Desnoyers
  2010-07-14 23:38                         ` Frederic Weisbecker
@ 2010-07-14 23:40                         ` Steven Rostedt
  1 sibling, 0 replies; 168+ messages in thread
From: Steven Rostedt @ 2010-07-14 23:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

[ /me removes the duplicate email of himself! ]

On Wed, 2010-07-14 at 19:11 -0400, Mathieu Desnoyers wrote:
> > 
> > So, when an event triggers in perf, we sometimes want to capture the stacktrace
> > that led to the event.
> > 
> > We want this stacktrace (here we call that a callchain) to be recorded
> > locklessly. So we want this callchain buffer per cpu, with the following
> > type:
> 
> Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
> How about using a single one that is generic enough to handle perf and ftrace
> needs instead ?
> 
> (/me runs away quickly before the lightning strikes) ;)
> 

To be fair, that's just a temp buffer.

-- Steve

(/me sends this to try to remove the dup email he's getting )


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 23:02                   ` Linus Torvalds
@ 2010-07-14 23:54                     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 168+ messages in thread
From: Jeremy Fitzhardinge @ 2010-07-14 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Frank Ch. Eigler, Tejun Heo

On 07/14/2010 04:02 PM, Linus Torvalds wrote:
> Umm, I know. It's what this whole discussion (non-paravirtualized) is
> all about. And I have a suggestion that should fix the
> non-paravirtualized case _without_ actually touching anything but the
> NMI code itself.
>
> What I tried to say is that the paravirtualized people should take a
> look at my suggestion, and see if they can apply the logic to their
> NMI handling too.

My point is that it's moot (for now) because there is no NMI handling.

>  And in the process totally remove the need for
> paravirtualizing iret, exactly because the approach handles the magic
> NMI lock logic entirely in the NMI handler itself.
>   

Nothing in this thread is ringing any alarm bells from that perspective,
so I don't much mind either way.  When I get around to dealing with
paravirtualized NMI, I'll look at the state of things and see how to go
from there.  (Xen's iret hypercall takes a flag to say whether to unmask
NMIs, which will probably come in handy.)

I don't think any of the other pure PV implementations have NMI either,
so I don't think it affects them.

> Wouldn't it be nice to be able to remove the need to paravirtualize iret?
>   

Of course.  But we also need to do an iret in a hypercall to handle ring
switching in some cases, so we still need it aside from the interrupt issue.

    J


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:37               ` Linus Torvalds
  2010-07-14 22:51                 ` Jeremy Fitzhardinge
@ 2010-07-15  1:23                 ` Linus Torvalds
  2010-07-15  1:45                   ` Linus Torvalds
                                     ` (2 more replies)
  1 sibling, 3 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15  1:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I think the %rip check should be pretty simple - exactly because there
> is only a single point where the race is open between that 'mov' and
> the 'iret'. So it's simpler than the (similar) thing we do for
> debug/nmi stack fixup for sysenter that has to check a range.

So this is what I think it might look like, with the %rip in place.
And I changed the "nmi_stack_ptr" thing to have both the pointer and a
flag - because it turns out that in the single-instruction race case,
we actually want the old pointer.

Totally untested, of course. But _something_ like this might work:

#
# Two per-cpu variables: a "are we nested" flag (one byte), and
# a "if we're nested, what is the %rsp for the nested case".
#
# The reason for why we can't just clear the saved-rsp field and
# use that as the flag is that we actually want to know the saved
# rsp for the special case of having a nested NMI happen on the
# final iret of the unnested case.
#
nmi:
	cmpb $0,%__percpu_seg:nmi_stack_nesting
	jne nmi_nested_corrupt_and_return
	cmpq $nmi_iret_address,0(%rsp)
	je nmi_might_be_nested
	# create new stack
is_unnested_nmi:
	# Save some space for nested NMI's. The exception itself
	# will never use more space, but it might use less (since
	# it will be a kernel-kernel transition). But the nested
	# exception will want two save registers and a place to
	# save the original CS that it will corrupt
	subq $64,%rsp

	# copy the five words of stack info. 96 = 64 + stack
	# offset of ss.
	pushq 96(%rsp)   # ss
	pushq 96(%rsp)   # rsp
	pushq 96(%rsp)   # eflags
	pushq 96(%rsp)   # cs
	pushq 96(%rsp)   # rip

	# and set the nesting flags
	movq %rsp,%__percpu_seg:nmi_stack_ptr
	movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
	...
	# regular NMI code goes here, and can take faults,
	# because this sequence now has proper nested-nmi
	# handling
	...
nmi_exit:
	movb $0,%__percpu_seg:nmi_stack_nesting
nmi_iret_address:
	iret

# The saved rip points to the final NMI iret, after we've cleared
# nmi_stack_ptr. Check the CS segment to make sure.
nmi_might_be_nested:
	cmpw $__KERNEL_CS,8(%rsp)
	jne is_unnested_nmi

# This is the case when we hit just as we're supposed to do the final
# iret of a previous nmi.  We run the NMI using the old return address
# that is still on the stack, rather than copy the new one that is bogus
# and points to where the nested NMI interrupted the original NMI
# handler!
# Easy: just reset the stack pointer to the saved one (this is why
# we use a separate "valid" flag, so that we can still use the saved
# stack pointer)
	movq %__percpu_seg:nmi_stack_ptr,%rsp
	jmp regular_nmi_code

# This is the actual nested case.  Make sure we fault on iret by setting
# CS to zero and saving the old CS.  %rax contains the stack pointer to
# the original code.
nmi_nested_corrupt_and_return:
	pushq %rax
	pushq %rdx
	movq %__percpu_seg:nmi_stack_ptr,%rax
	movq 8(%rax),%rdx	# CS of original NMI
	testq %rdx,%rdx		# CS already zero?
	je nmi_nested_and_already_corrupted
	movq %rdx,40(%rax)	# save old CS away
	movq $0,8(%rax)
nmi_nested_and_already_corrupted:
	popq %rdx
	popq %rax
	popfq
	jmp *(%rsp)

Hmm?

               Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15  1:23                 ` Linus Torvalds
@ 2010-07-15  1:45                   ` Linus Torvalds
  2010-07-15 18:31                     ` Mathieu Desnoyers
  2010-07-15 16:44                   ` Mathieu Desnoyers
  2010-07-18 11:03                   ` Avi Kivity
  2 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15  1:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 6:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So this is what I think it might look like, with the %rip in place.
>  [ ...]
> Hmm?

I didn't fill in the iret fault details, because I thought that would
be trivial. We get an exception, it's a slow case, how hard can it be
to just call the NMI code?

But thinking some more about it, it doesn't feel as trivial any more.
We want to set up that same nesting code for the faked NMI call, but
now I made it be two separate variables, and they need to be set in an
NMI-safe way without us actually having access to the whole NMI
blocking that the CPU does for a real NMI.

So there's a few subtleties there too. Probably need to make the two
percpu values adjacent, and use cmpxchg16b in the "emulate NMI on
exception" code to set them both atomically. Or something. So I think
it's doable, but it's admittedly more complicated than I thought it
would be.
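
Something like this layout, perhaps (names made up, purely illustrative):

/*
 * The two per-cpu values kept adjacent and 16-byte aligned, so that the
 * "emulate NMI on exception" path could update both with one cmpxchg16b.
 */
struct nmi_nest_state {
	void		*nmi_stack_ptr;	/* saved stack frame of the outer NMI  */
	unsigned char	nesting;	/* 0 = not in NMI, 0xff = nesting flag */
	unsigned char	pad[7];		/* pad to 16 bytes for cmpxchg16b      */
} __attribute__((aligned(16)));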

.. and obviously there's nothing that guarantees that the code I
already posted is correct either. The whole concept might be total
crap.

                  Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 22:56                     ` Linus Torvalds
  2010-07-14 23:09                       ` Andi Kleen
@ 2010-07-15 14:11                       ` Frederic Weisbecker
  2010-07-15 14:35                         ` Andi Kleen
                                           ` (2 more replies)
  1 sibling, 3 replies; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-15 14:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 3:31 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >
> > Until now I didn't because I clearly misunderstand the vmalloc internals. I'm
> > not even quite sure why a memory allocated with vmalloc sometimes can be not
> > mapped (and then fault once for this to sync). Some people have tried to explain
> > me but the picture is still vague to me.
> 
> So the issue is that the system can have thousands and thousands of
> page tables (one for each process), and what do you do when you add a
> new kernel virtual mapping?
> 
> You can:
> 
>  - make sure that you only ever use _one_ single top-level entry for
> all vmalloc issues, and can make sure that all processes are created
> with that static entry filled in. This is optimal, but it just doesn't
> work on all architectures (eg on 32-bit x86, it would limit the
> vmalloc space to 4MB in non-PAE, whatever)


But then, even if you ensure that, don't we need to also fill lower level
entries for the new mapping?

Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
risk adding a new memory mapping for new memory allocated with kmalloc?



>  - at vmalloc time, when adding a new page directory entry, walk all
> the tens of thousands of existing page tables under a lock that
> guarantees that we don't add any new ones (ie it will lock out fork())
> and add the required pgd entry to them.
> 
>  - or just take the fault and do the "fill the page tables" on demand.
> 
> Quite frankly, most of the time it's probably better to make that last
> choice (unless your hardware makes it easy to make the first choice,
> which is obviously simplest for everybody). It makes it _much_ cheaper
> to do vmalloc. It also avoids that nasty latency issue. And it's just
> simpler too, and has no interesting locking issues with how/when you
> expose the page tables in fork() etc.
> 
> So the only downside is that you do end up taking a fault in the
> (rare) case where you have a newly created task that didn't get an
> even newer vmalloc entry.


But then how did the previous tasks get this new mapping? You said
we don't walk through every process's page tables for vmalloc.

I would understand this race if we were to walk every process's page
tables and add the new mapping to them, but missed one newly forked
task because we didn't lock (or just use rcu).



> And that fault can sometimes be in an
> interrupt or an NMI. Normally it's trivial to handle that fairly
> simple nested fault. But NMI has that inconvenient "iret unblocks
> NMI's, because there is no dedicated 'nmiret' instruction" problem on
> x86.


Yeah.


So the parts of the problem I don't understand are:

- why don't we have this problem with kmalloc()?
- did I correctly understand the race that makes the fault necessary,
  ie: we walk the tasklist locklessly, add the new mapping if
  not present, but we might miss a recently forked task, and
  the fault will fix that.

Thanks.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 14:11                       ` Frederic Weisbecker
@ 2010-07-15 14:35                         ` Andi Kleen
  2010-07-16 11:21                           ` Frederic Weisbecker
  2010-07-15 14:46                         ` Steven Rostedt
  2010-07-15 14:51                         ` Linus Torvalds
  2 siblings, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-15 14:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process page tables for vmalloc.

No, because those are always shared for the kernel and have been
filled in for init_mm.

Also, most updates only touch the lower tables anyway; top level
updates are extremely rare. In fact on PAE36 they should happen
at most once, if at all, and most likely at early boot anyway,
where you only have a single task. 

On x86-64 they will only happen once every 512GB of vmalloc. 
So for most systems also at most once at early boot.
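
For reference, the back-of-the-envelope arithmetic (assuming the usual x86
PGDIR_SHIFT values; this is just a standalone illustration, not kernel code):

#include <stdio.h>

int main(void)
{
	printf("x86-64   (shift 39): %llu GB per PGD entry\n",
	       (1ULL << 39) >> 30);	/* 512 GB */
	printf("i386 PAE (shift 30): %llu GB per PGD entry (only 4 entries)\n",
	       (1ULL << 30) >> 30);	/* 1 GB */
	printf("i386     (shift 22): %llu MB per PGD entry\n",
	       (1ULL << 22) >> 20);	/* 4 MB */
	return 0;
}
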
> 
> I would understand this race if we were to walk on every processes page
> tables and add the new mapping on them, but we missed one new task that
> forked or so, because we didn't lock (or just rcu).

The new task will always get a copy of the reference init_mm, which
was already updated.

-Andi

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 14:11                       ` Frederic Weisbecker
  2010-07-15 14:35                         ` Andi Kleen
@ 2010-07-15 14:46                         ` Steven Rostedt
  2010-07-16 10:47                           ` Frederic Weisbecker
  2010-07-15 14:51                         ` Linus Torvalds
  2 siblings, 1 reply; 168+ messages in thread
From: Steven Rostedt @ 2010-07-15 14:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:

> >  - make sure that you only ever use _one_ single top-level entry for
> > all vmalloc issues, and can make sure that all processes are created
> > with that static entry filled in. This is optimal, but it just doesn't
> > work on all architectures (eg on 32-bit x86, it would limit the
> > vmalloc space to 4MB in non-PAE, whatever)
> 
> 
> But then, even if you ensure that, don't we need to also fill lower level
> entries for the new mapping.

If I understand your question, you do not need to worry about the lower
level entries because all the processes will share the same top level.

process 1's PGD ------,
                      |
                      +------> PMD --> ...
                      |
process 2's PGD ------'

Thus we have one top-level page directory entry shared by all processes.
The issue happens when the vm space crosses a PMD boundary and we need to
update the PGDs of all processes to point to the new PMD we need to add to
handle the spread of the vm space.

> 
> Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> risk to add a new memory mapping for new memory allocated with kmalloc?

Because all of memory (well, 800-some megs on 32 bit) is mapped into
the kernel address space of all processes. That is, kmalloc only uses this
memory (as does get_free_page()). All processes have a PMD (or PUD,
whatever) that maps this memory. The issue only arises when we use new
virtual memory, which vmalloc does. Vmalloc may map to physical memory
that is already mapped for all processes, but the address that vmalloc
uses to access that memory is not yet mapped.

The usual reason the kernel uses vmalloc is to get a contiguous range of
memory. vmalloc can map several pages as one contiguous piece of
memory that in reality is several different pages scattered around
physical memory. kmalloc can only hand out pages that are contiguous in
physical memory. That is, if kmalloc gets 8192 bytes on an arch with
4096 byte pages, it will allocate two consecutive pages in physical
memory. If two contiguous pages are not available, even if thousands of
single pages are, the kmalloc will fail, whereas the vmalloc will not.

A vmalloc allocation can use two different pages and just set up the
page tables to make them contiguous in the kernel's view. Note, this
comes at a cost. One is that when we do this, we may end up having to
update a bunch of page tables. The other is that we must spend extra
TLB entries to point to these separate pages. Kmalloc and
get_free_page() use the big memory mappings. That is, if the TLB allows
us to map large pages, we can do that for kernel memory since we just
want the contiguous memory as it is in physical memory.

Thus the kernel maps physical memory with as few TLB entries as
possible (large pages and large TLB entries). If we can map 64K pages, we
do that. Then kmalloc just allocates within this range; it does not need
to map any pages, they are already mapped.
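
As a concrete illustration, here's a minimal sketch of the two calls (not
code from this thread, just an example of the difference):

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Two 8K buffers: one must be physically contiguous, one only virtually. */
static void *buf_phys;
static void *buf_virt;

static int alloc_example(void)
{
	/* needs two physically contiguous pages; can fail under fragmentation */
	buf_phys = kmalloc(8192, GFP_KERNEL);

	/* needs any two free pages; made contiguous by filling in new kernel
	 * page table entries -- exactly the mappings being discussed here */
	buf_virt = vmalloc(8192);

	if (!buf_phys || !buf_virt) {
		kfree(buf_phys);	/* both calls are NULL-safe */
		vfree(buf_virt);
		return -ENOMEM;
	}
	return 0;
}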

Does this make a bit more sense?

> 
> 
> 
> >  - at vmalloc time, when adding a new page directory entry, walk all
> > the tens of thousands of existing page tables under a lock that
> > guarantees that we don't add any new ones (ie it will lock out fork())
> > and add the required pgd entry to them.
> > 
> >  - or just take the fault and do the "fill the page tables" on demand.
> > 
> > Quite frankly, most of the time it's probably better to make that last
> > choice (unless your hardware makes it easy to make the first choice,
> > which is obviously simplest for everybody). It makes it _much_ cheaper
> > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > simpler too, and has no interesting locking issues with how/when you
> > expose the page tables in fork() etc.
> > 
> > So the only downside is that you do end up taking a fault in the
> > (rare) case where you have a newly created task that didn't get an
> > even newer vmalloc entry.
> 
> 
> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process page tables for vmalloc.

Actually we don't even need to walk the page tables in the first task
(although we might do that). When the kernel accesses that memory we
take a page fault; the page fault handler will see that this is a vmalloc
address and fill in the page tables for the task at that time.

> 
> I would understand this race if we were to walk on every processes page
> tables and add the new mapping on them, but we missed one new task that
> forked or so, because we didn't lock (or just rcu).
> 
> 
> 
> > And that fault can sometimes be in an
> > interrupt or an NMI. Normally it's trivial to handle that fairly
> > simple nested fault. But NMI has that inconvenient "iret unblocks
> > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > x86.
> 
> 
> Yeah.
> 
> 
> So the parts of the problem I don't understand are:
> 
> - why don't we have this problem with kmalloc() ?

I hope I explained that above.

> - did I understand well the race that makes the fault necessary,
>   ie: we walk the tasklist lockless, add the new mapping if
>   not present, but we might miss a task lately forked, but
>   the fault will fix that.

I'm lost on this race. If we take a lock and walk all page tables, I think
the race goes away. So I don't understand this either.

-- Steve



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 14:11                       ` Frederic Weisbecker
  2010-07-15 14:35                         ` Andi Kleen
  2010-07-15 14:46                         ` Steven Rostedt
@ 2010-07-15 14:51                         ` Linus Torvalds
  2010-07-15 15:38                           ` Linus Torvalds
  2010-07-16 12:00                           ` Frederic Weisbecker
  2 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 14:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 7:11 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
>> You can:
>>
>>  - make sure that you only ever use _one_ single top-level entry for
>> all vmalloc issues, and can make sure that all processes are created
>> with that static entry filled in. This is optimal, but it just doesn't
>> work on all architectures (eg on 32-bit x86, it would limit the
>> vmalloc space to 4MB in non-PAE, whatever)
>
> But then, even if you ensure that, don't we need to also fill lower level
> entries for the new mapping.

Yes, but now they are all mapped by the one *shared* top-level entry.

Think about it.

[ Time passes ]

End result: if you can map the whole vmalloc area with a single
top-level entry that is shared by all processes, and can then just
fill in the lower levels when doing actual allocations, it means that
all processes will automatically get the entries added, and do not
need any fixups.

In other words, the page tables will be automatically correct and
filled in for everybody - without having to traverse any lists,
without any extra locking, and without any races. So this is efficient
and simple, and never needs any faulting to fill in page tables later
on.

(Side note: "single top-level entry" could equally well be "multiple
preallocated entries covering the whole region": the important part is
not really the "single entry", but the "preallocated and filled into
every page directory from the start" part)

> Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> risk to add a new memory mapping for new memory allocated with kmalloc?

No. The kmalloc space is all in the 1:1 kernel mapping, and is always
mapped. Even with PAGEALLOC_DEBUG, it's always mapped at the top
level, and even if a particular page is unmapped/remapped for
debugging, it is done so in the shared kernel page tables (which ends
up being the above trivial case - there is just a single set of page
directory entries that are shared by everybody).

>>  - at vmalloc time, when adding a new page directory entry, walk all
>> the tens of thousands of existing page tables under a lock that
>> guarantees that we don't add any new ones (ie it will lock out fork())
>> and add the required pgd entry to them.
>>
>>  - or just take the fault and do the "fill the page tables" on demand.
>>
>> Quite frankly, most of the time it's probably better to make that last
>> choice (unless your hardware makes it easy to make the first choice,
>> which is obviously simplest for everybody). It makes it _much_ cheaper
>> to do vmalloc. It also avoids that nasty latency issue. And it's just
>> simpler too, and has no interesting locking issues with how/when you
>> expose the page tables in fork() etc.
>>
>> So the only downside is that you do end up taking a fault in the
>> (rare) case where you have a newly created task that didn't get an
>> even newer vmalloc entry.
>
> But then how did the previous tasks get this new mapping? You said
> we don't walk through every process page tables for vmalloc.

We always add the mapping to the "init_mm" page tables when it is
created (just a single mm), and when fork creates a new page table, it
will always copy the kernel mapping parts from the old one. So the
_common_ case is that all normal mappings are already set up in page
tables, including newly created page tables.
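
In code, that fork-time copy is conceptually just this (an illustrative
sketch, not the exact arch code; the boundary constants are the usual x86
ones):

#include <linux/mm.h>
#include <linux/string.h>
#include <asm/pgtable.h>

/*
 * Sketch of "copy the kernel mapping parts" when a new page table is
 * created: the user half starts empty, the kernel half is cloned from
 * the reference init_mm tables, so existing vmalloc mappings come along.
 */
static void copy_kernel_mappings(pgd_t *new_pgd)
{
	memcpy(new_pgd + KERNEL_PGD_BOUNDARY,
	       init_mm.pgd + KERNEL_PGD_BOUNDARY,
	       KERNEL_PGD_PTRS * sizeof(pgd_t));
}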

The uncommon case is when there is a new page table created _and_ a
new vmalloc mapping, and the race that happens between those events.
When that new page table is then later used (and it can be _much_
later, of course: we're now talking process scheduling, so IO delays
etc are relevant), it won't necessarily have the page table entries
for vmalloc stuff that was created since the page tables were created.
So we fill _those_ in dynamically.
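
And the lazy fill itself is about as small -- something like this (again
just an illustrative sketch, not the real fault handler code):

#include <linux/mm.h>
#include <linux/sched.h>
#include <asm/pgtable.h>

/*
 * On a fault at a vmalloc address, copy the missing top-level entry from
 * the reference init_mm page tables into the current page table, then
 * retry the access.
 */
static int lazy_vmalloc_fill(unsigned long address)
{
	pgd_t *pgd_ref, *pgd;

	if (address < VMALLOC_START || address >= VMALLOC_END)
		return -1;			/* not a vmalloc address */

	pgd_ref = pgd_offset_k(address);	/* reference entry in init_mm */
	if (pgd_none(*pgd_ref))
		return -1;			/* init_mm lacks it too: real fault */

	pgd = pgd_offset(current->active_mm, address);
	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref);		/* copy the shared kernel entry */

	return 0;				/* retry the faulting access */
}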

But vmalloc mappings should be reasonably rare, and the actual "fill
them in" cases are much rarer still (since we fill them in a page
directory entry at a time: so even if you make a lot of vmalloc()
calls, we only _fill_ at most once per page directory entry, which is
usually a pretty big chunk). On 32-bit x86, for example, we'd fill
once every 4MB (or 2MB if PAE), and you cannot have a lot of vmalloc
mappings that large (since the VM space is limited).

So the cost of filling things in is basically zero, because it happens
so seldom. And by _allowing_ things to be done lazily, we avoid all
the locking costs, and all the costs of traversing every single
possible mm (and there can be many many thousands of those).

> I would understand this race if we were to walk on every processes page
> tables and add the new mapping on them, but we missed one new task that
> forked or so, because we didn't lock (or just rcu).

.. and how do you keep track of which tasks you missed? And no, it's
not just the new tasks - you have old tasks that have their page
tables built up too, but need to be updated. They may never need the
mapping since they may be sleeping and never using the driver or data
structures that created it (in fact, that's a common case), so filling
them would be pointless. But if we don't do the lazy fill, we'd have
to fill them all, because WE DO NOT KNOW.

> So the parts of the problem I don't understand are:
>
> - why don't we have this problem with kmalloc() ?

Hopefully clarified.

> - did I understand well the race that makes the fault necessary,
>  ie: we walk the tasklist lockless, add the new mapping if
>  not present, but we might miss a task lately forked, but
>  the fault will fix that.

But the _fundamental_ issue is that we do not want to walk the
tasklist (or the mm_list) AT ALL. It's a f*cking waste of time. It's a
long list, and nobody cares. In many cases it won't be needed.

The lazy algorithm is _better_. It's way simpler (we take nested
faults all the time in the kernel, and it's a particularly _easy_ page
fault to handle with no IO or no locking needed), and it does less
work. It really boils down to that.

So it's not the lazy page table fill that is the problem. Never has
been. We've been doing the lazy fill for a long time, and it was
simple and useful way back when.

The problem has always been NMI, and nothing else. NMI's are nasty,
and the x86 NMI blocking is insane and crazy.

Which is why I'm so adamant that this should be fixed in the NMI code,
and we should _not_ talk about trying to screw up other, totally
unrelated, code. The lazy fill really was never the problem.

                        Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 14:51                         ` Linus Torvalds
@ 2010-07-15 15:38                           ` Linus Torvalds
  2010-07-16 12:00                           ` Frederic Weisbecker
  1 sibling, 0 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 15:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 7:51 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So it's not the lazy page table fill that is the problem. Never has
> been. We've been doing the lazy fill for a long time, and it was
> simple and useful way back when.

Btw, this is true to the degree that I would _much_ rather just get
rid of the crazy "vmalloc_sync_all()" crap entirely, and make it clear
that non-lazy vmalloc page table fill is a bug.

Because quite frankly, it _is_ a bug to depend on non-lazy vmalloc.
The whole function is only implemented on 32-bit x86, so any code that
thinks it needs it is either just wrong, or will only work on 32-bit
x86 anyway (and on other architectures by pure chance, likely because
their VM fill granularity is so big that they never saw the problem).

So getting rid of vmalloc_sync_all() entirely would be a good thing.
Then we wouldn't have that silly and pointless interface, and we
wouldn't have that crazy "this only does something on x86-32,
everywhere else it's a placebo".

                      Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-14 23:38                         ` Frederic Weisbecker
@ 2010-07-15 16:26                           ` Mathieu Desnoyers
  2010-08-03 17:18                             ` Peter Zijlstra
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-15 16:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Ingo Molnar, LKML, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Frederic Weisbecker (fweisbec@gmail.com) wrote:
> On Wed, Jul 14, 2010 at 07:11:17PM -0400, Mathieu Desnoyers wrote:
> > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > On Wed, Jul 14, 2010 at 06:31:07PM -0400, Mathieu Desnoyers wrote:
> > > > * Frederic Weisbecker (fweisbec@gmail.com) wrote:
> > > > > On Wed, Jul 14, 2010 at 12:54:19PM -0700, Linus Torvalds wrote:
> > > > > > On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker
> > > > > > <fweisbec@gmail.com> wrote:
> > > > > > >
> > > > > > > There is also the fact we need to handle the lost NMI, by defering its
> > > > > > > treatment or so. That adds even more complexity.
> > > > > > 
> > > > > > I don't think your read my proposal very deeply. It already handles
> > > > > > them by taking a fault on the iret of the first one (that's why we
> > > > > > point to the stack frame - so that we can corrupt it and force a
> > > > > > fault).
> > > > > 
> > > > > 
> > > > > Ah right, I missed this part.
> > > > 
> > > > Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k
> > > > data structures exactly ? :)
> > > > 
> > > > Mathieu
> > > 
> > > 
> > > 
> > > So, when an event triggers in perf, we sometimes want to capture the stacktrace
> > > that led to the event.
> > > 
> > > We want this stacktrace (here we call that a callchain) to be recorded
> > > locklessly. So we want this callchain buffer per cpu, with the following
> > > type:
> > 
> > Ah OK, so you mean that perf now has 2 different ring buffer implementations ?
> > How about using a single one that is generic enough to handle perf and ftrace
> > needs instead ?
> > 
> > (/me runs away quickly before the lightning strikes) ;)
> > 
> > Mathieu
> 
> 
> :-)
> 
> That's no ring buffer. It's a temporary linear buffer to fill a stacktrace,
> and get its effective size before committing it to the real ring buffer.

I was more thinking along the lines of making sure a ring buffer has the proper
support for your use-case. It shares a lot of requirements with a standard ring
buffer:

- Need to be lock-less
- Need to reserve space, write data in a buffer

By configuring a ring buffer with 4k sub-buffer size (that's configurable
dynamically), all we need to add is the ability to squash a previously saved
record from the buffer. I am confident we can provide a clean API for this that
would allow discard of previously committed entry as long as we are still on the
same non-fully-committed sub-buffer. This fits your use-case exactly, so that's
fine.

You could have one 4k ring buffer per cpu per execution context.  I wonder if
each Linux architecture has support for separate thread vs softirq vs irq vs
nmi stacks?  Even then, given you have only one stack shared by all irqs, you
need something that is concurrency-aware at the ring buffer level.

These small stack-like ring buffers could be used to save your temporary stack
copy. When you really need to save it to a larger ring buffer along with a
trace, then you just take a snapshot of the stack ring buffers.

So you get to use one single ring buffer synchronization and memory allocation
mechanism, that everyone has reviewed. The advantage is that we would not be
having this nmi race discussion in the first place: the generic ring buffer uses
"get page" directly rather than relying on vmalloc, because these bugs have
already been identified and dealt with years ago.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15  1:23                 ` Linus Torvalds
  2010-07-15  1:45                   ` Linus Torvalds
@ 2010-07-15 16:44                   ` Mathieu Desnoyers
  2010-07-15 16:49                     ` Linus Torvalds
  2010-07-18 11:03                   ` Avi Kivity
  2 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-15 16:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I think the %rip check should be pretty simple - exactly because there
> > is only a single point where the race is open between that 'mov' and
> > the 'iret'. So it's simpler than the (similar) thing we do for
> > debug/nmi stack fixup for sysenter that has to check a range.
> 
> So this is what I think it might look like, with the %rip in place.
> And I changed the "nmi_stack_ptr" thing to have both the pointer and a
> flag - because it turns out that in the single-instruction race case,
> we actually want the old pointer.
> 
> Totally untested, of course. But _something_ like this might work:
> 
> #
> # Two per-cpu variables: a "are we nested" flag (one byte), and
> # a "if we're nested, what is the %rsp for the nested case".
> #
> # The reason for why we can't just clear the saved-rsp field and
> # use that as the flag is that we actually want to know the saved
> # rsp for the special case of having a nested NMI happen on the
> # final iret of the unnested case.
> #
> nmi:
> 	cmpb $0,%__percpu_seg:nmi_stack_nesting
> 	jne nmi_nested_corrupt_and_return
> 	cmpq $nmi_iret_address,0(%rsp)
> 	je nmi_might_be_nested
> 	# create new stack
> is_unnested_nmi:
> 	# Save some space for nested NMI's. The exception itself
> 	# will never use more space, but it might use less (since
> 	# it will be a kernel-kernel transition). But the nested
> 	# exception will want two save registers and a place to
> 	# save the original CS that it will corrupt
> 	subq $64,%rsp
> 
> 	# copy the five words of stack info. 96 = 64 + stack
> 	# offset of ss.
> 	pushq 96(%rsp)   # ss
> 	pushq 96(%rsp)   # rsp
> 	pushq 96(%rsp)   # eflags
> 	pushq 96(%rsp)   # cs
> 	pushq 96(%rsp)   # rip
> 
> 	# and set the nesting flags
> 	movq %rsp,%__percpu_seg:nmi_stack_ptr
> 	movb $0xff,%__percpu_seg:nmi_stack_nesting
> 
> regular_nmi_code:
> 	...
> 	# regular NMI code goes here, and can take faults,
> 	# because this sequence now has proper nested-nmi
> 	# handling
> 	...
> nmi_exit:
> 	movb $0,%__percpu_seg:nmi_stack_nesting

The first thing that strikes me is that we could be interrupted by a standard
interrupt on top of the iret instruction below. This interrupt handler could in
turn be interrupted by an NMI, so the NMI handler would not know that it is
nested over nmi_iret_address. Maybe we could simply disable interrupts
explicitly at the beginning of the handler, so they get re-enabled by iret below
upon return from nmi ?

Doing that would ensure that only NMIs can interrupt us.

I'll look a bit more at the code and come back with more comments if things come
up.

Thanks,

Mathieu

> nmi_iret_address:
> 	iret
> 
> # The saved rip points to the final NMI iret, after we've cleared
> # nmi_stack_ptr. Check the CS segment to make sure.
> nmi_might_be_nested:
> 	cmpw $__KERNEL_CS,8(%rsp)
> 	jne is_unnested_nmi
> 
> # This is the case when we hit just as we're supposed to do the final
> # iret of a previous nmi.  We run the NMI using the old return address
> # that is still on the stack, rather than copy the new one that is bogus
> # and points to where the nested NMI interrupted the original NMI
> # handler!
> # Easy: just reset the stack pointer to the saved one (this is why
> # we use a separate "valid" flag, so that we can still use the saved
> # stack pointer)
> 	movq %__percpu_seg:nmi_stack_ptr,%rsp
> 	jmp regular_nmi_code
> 
> # This is the actual nested case.  Make sure we fault on iret by setting
> # CS to zero and saving the old CS.  %rax contains the stack pointer to
> # the original code.
> nmi_nested_corrupt_and_return:
> 	pushq %rax
> 	pushq %rdx
> 	movq %__percpu_seg:nmi_stack_ptr,%rax
> 	movq 8(%rax),%rdx	# CS of original NMI
> 	testq %rdx,%rdx		# CS already zero?
> 	je nmi_nested_and_already_corrupted
> 	movq %rdx,40(%rax)	# save old CS away
> 	movq $0,8(%rax)
> nmi_nested_and_already_corrupted:
> 	popq %rdx
> 	popq %rax
> 	popfq
> 	jmp *(%rsp)
> 
> Hmm?
> 
>                Linus

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 16:44                   ` Mathieu Desnoyers
@ 2010-07-15 16:49                     ` Linus Torvalds
  2010-07-15 17:38                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 16:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 9:44 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> The first thing that strikes me is that we could be interrupted by a standard
> interrupt on top of the iret instruction below.

No, that can never happen.

Why? Simple: regular interrupts aren't ever enabled in eflags. So the
only kinds of traps we can get are NMI's (that don't follow the normal
rules), and exceptions.

Of course, if there is some trap that re-enables interrupts even if
the trap happened in an interrupt-disabled region, then that would
change things, but that would be a bad bug regardless (and totally
independently of any NMI issues). So in that sense it's a "could
happen", but it's something that would be a totally separate bug.

                    Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 16:49                     ` Linus Torvalds
@ 2010-07-15 17:38                       ` Mathieu Desnoyers
  2010-07-15 20:44                         ` H. Peter Anvin
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-15 17:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Thu, Jul 15, 2010 at 9:44 AM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > The first thing that strikes me is that we could be interrupted by a standard
> > interrupt on top of the iret instruction below.
> 
> No, that can never happen.
> 
> Why? Simple: regular interrupts aren't ever enabled in eflags. So the
> only kinds of traps we can get are NMI's (that don't follow the normal
> rules), and exceptions.

Ah, you're right, since NMIs are an intr gate, IF is disabled in the EFLAGS all
along.

> 
> Of course, if there is some trap that re-enables interrupts even if
> the trap happened in an interrupt-disabled region, then that would
> change things, but that would be a bad bug regardless (and totally
> independently of any NMI issues). So in that sense it's a "could
> happen", but it's something that would be a totally separate bug.

Yep, this kind of scenario would therefore be a bug that does not belong to the
specific realm of nmis.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15  1:45                   ` Linus Torvalds
@ 2010-07-15 18:31                     ` Mathieu Desnoyers
  2010-07-15 18:43                       ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-15 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 6:23 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So this is what I think it might look like, with the %rip in place.
> >  [ ...]
> > Hmm?
> 
> I didn't fill in the iret fault details, because I thought that would
> be trivial. We get an exception, it's a slow case, how hard can it be
> to just call the NMI code?

I'm wondering if we really have to handle this with a fault. Couldn't we just
make the iret branch to the following nmi handler instead? (chaining nmis)

I think we can even find a way to handle the fact that the fake nmi does not run
with nmis disabled. We could keep the nmi nested bit set if we find out that
iret will branch to the fake nmi. We would then have to make sure the fake nmi
entry point is a little further than the standard nmi entry point: somewhere
after the initial nmi nesting check.

> 
> But thinking some more about it, it doesn't feel as trivial any more.
> We want to set up that same nesting code for the faked NMI call, but
> now I made it be two separate variables, and they need to be set in an
> NMI-safe way without us actually having access to the whole NMI
> blocking that the CPU does for a real NMI.
> 
> So there's a few subtleties there too. Probably need to make the two
> percpu values adjacent, and use cmpxchg16b in the "emulate NMI on
> exception" code to set them both atomically. Or something. So I think
> it's doable, but it's admittedly more complicated than I thought it
> would be.

Hrm, we could probably get away with only keeping the nmi_stack_nested per-cpu
variable. The nmi_stack_ptr could be known statically if we set it at a fixed
offset from the bottom of the stack rather than using an offset relative to the top
(which can change depending on whether we are nested over the kernel or userspace).
We just have to reserve enough space at the bottom of the stack.

> 
> .. and obviously there's nothing that guarantees that the code I
> already posted is correct either. The whole concept might be total
> crap.

Call me optimistic if you want, but I think we're really getting somewhere. :)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 18:31                     ` Mathieu Desnoyers
@ 2010-07-15 18:43                       ` Linus Torvalds
  2010-07-15 18:48                         ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 18:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 11:31 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Hrm, we could probably get away with only keeping the nmi_stack_nested per-cpu
> variable. The nmi_stack_ptr could be known statically if we set it at a fixed
> offset from the bottom of the stack rather than using an offset relative to the top
> (which can change depending on whether we are nested over the kernel or userspace).
> We just have to reserve enough space at the bottom of the stack.

I thought about trying that, but I don't think it's true. At least not
for the 32-bit case.

The thing is, the 32-bit case will re-use the kernel stack if it
happens in kernel space, and will thus start from a random space (and
won't push all the information anyway). So a nested NMI really doesn't
know where the original NMI stack is to be found unless we save it
off.

In the case of x86-64, I think the stack will always be at a fixed
address, and push a fixed amount of data (because we use the IST
thing). So there you could probably just use the flag, but you'd still
have to handle the 32-bit case, and quite frankly, I think it would be
much nicer if the logic could be shared for the 32-bit and 64-bit
cases.

But maybe I'm missing something.

             Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 18:43                       ` Linus Torvalds
@ 2010-07-15 18:48                         ` Linus Torvalds
  2010-07-15 22:01                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 18:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 11:43 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But maybe I'm missing something.

Hmm. Of course - one way of solving this might be to just make the
32-bit case switch stacks in software. That might be a good idea
regardless, and would not be complicated. We already do that for
sysenter, but the NMI case would be simpler because we don't need to
worry about being re-entered by NMI/DEBUG during the stack switch.

And since we have to play some games with moving the data on the stack
around _anyway_, doing the whole "switch stacks entirely rather than
just subtract a bit from the old stack" would be fairly logical.

So I think you may end up being right: we don't need to save the
original NMI stack pointer, because we can make sure that the
replacement stack (that we need anyway) is always deterministic.

                             Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 17:38                       ` Mathieu Desnoyers
@ 2010-07-15 20:44                         ` H. Peter Anvin
  0 siblings, 0 replies; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-15 20:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On 07/15/2010 10:38 AM, Mathieu Desnoyers wrote:
>>
>> Of course, if there is some trap that re-enables interrupts even if
>> the trap happened in an interrupt-disabled region, then that would
>> change things, but that would be a bad bug regardless (and totally
>> independently of any NMI issues). So in that sense it's a "could
>> happen", but it's something that would be a totally separate bug.
> 
> Yep, this kind of scenario would therefore be a bug that does not belong to the
> specific realm of nmis.
> 

Yes, the only specific issue here is NMI -> trap -> IRET -> [nested
NMI], which is what this whole thing is about.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 18:48                         ` Linus Torvalds
@ 2010-07-15 22:01                           ` Mathieu Desnoyers
  2010-07-15 22:16                             ` Linus Torvalds
  2010-07-16 19:13                             ` Mathieu Desnoyers
  0 siblings, 2 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-15 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

Hi Linus,

I modified your code, intending to handle the fake NMI entry gracefully given
that NMIs are not necessarily disabled at the entry point. It uses a "need fake
NMI" flag rather than playing games with CS and faults. When a fake NMI is
needed, it simply jumps back to the beginning of regular nmi code. NMI exit code
and fake NMI entry are made reentrant with respect to NMI handler interruption
by testing, at the very beginning of the NMI handler, if a NMI is nested over
the whole nmi_atomic .. nmi_atomic_end code region. This code assumes NMIs have
a separate stack.

This code is still utterly untested and might eat your Doritos, only provided
for general enjoyment.

Thanks,

Mathieu

#
# Two per-cpu variables: a "are we nested" flag (one byte).
# a "do we need to execute a fake NMI" flag (one byte).
# The %rsp at which the stack copy is saved is at a fixed address, which leaves
# enough room at the bottom of NMI stack for the "real" NMI entry stack. This
# assumes we have a separate NMI stack.
# The NMI stack copy top of stack is at nmi_stack_copy.
# The NMI stack copy "rip" is at nmi_stack_copy_rip, which is set to
# nmi_stack_copy-32.
#
nmi:
	# Test if nested over atomic code.
	cmpq $nmi_atomic,0(%rsp)
	jae nmi_addr_is_ae
	# Test if nested over general NMI code.
	cmpb $0,%__percpu_seg:nmi_stack_nesting
	jne nmi_nested_set_fake_and_return
	# create new stack
is_unnested_nmi:
	# Save some space for nested NMI's. The exception itself
	# will never use more space, but it might use less (since
	# it will be a kernel-kernel transition).

	# Save %rax on top of the stack (need to temporarily use it)
	pushq %rax
	movq %rsp, %rax
	movq $nmi_stack_copy,%rsp

	# copy the five words of stack info. rip starts at 8+0(%rax).
	pushq 8+32(%rax)    # ss
	pushq 8+24(%rax)    # rsp
	pushq 8+16(%rax)    # eflags
	pushq 8+8(%rax)     # cs
	pushq 8+0(%rax)     # rip
	movq 0(%rax),%rax # restore %rax

set_nmi_nesting:
	# and set the nesting flags
	movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
	...
	# regular NMI code goes here, and can take faults,
	# because this sequence now has proper nested-nmi
	# handling
	...

nmi_atomic:
	# An NMI nesting over the whole nmi_atomic .. nmi_atomic_end region will
	# be handled specially. This includes the fake NMI entry point.
	cmpb $0,%__percpu_seg:need_fake_nmi
	jne fake_nmi
	movb $0,%__percpu_seg:nmi_stack_nesting
	iret

	# This is the fake NMI entry point.
fake_nmi:
	movb $0x0,%__percpu_seg:need_fake_nmi
	jmp regular_nmi_code
nmi_atomic_end:

	# Make sure the address is in the nmi_atomic range and in CS segment.
nmi_addr_is_ae:
	cmpq $nmi_atomic_end,0(%rsp)
	jae is_unnested_nmi
	# The saved rip points to the final NMI iret. Check the CS segment to
	# make sure.
	cmpw $__KERNEL_CS,8(%rsp)
	jne is_unnested_nmi

# This is the case when we hit just as we're supposed to do the atomic code
# of a previous nmi.  We run the NMI using the old return address that is still
# on the stack, rather than copy the new one that is bogus and points to where
# the nested NMI interrupted the original NMI handler!
# Easy: just set the stack pointer to point to the stack copy, clear
# need_fake_nmi (because we are directly going to execute the requested NMI) and
# jump to "nesting flag set" (which is followed by regular nmi code execution).
	movq $nmi_stack_copy_rip,%rsp
	movb $0x0,%__percpu_seg:need_fake_nmi
	jmp set_nmi_nesting

# This is the actual nested case. Make sure we branch to the fake NMI handler
# after this handler is done.
nmi_nested_set_fake_and_return:
	movb $0xff,%__percpu_seg:need_fake_nmi
	popfq
	jmp *(%rsp)


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 22:01                           ` Mathieu Desnoyers
@ 2010-07-15 22:16                             ` Linus Torvalds
  2010-07-15 22:24                               ` H. Peter Anvin
                                                 ` (2 more replies)
  2010-07-16 19:13                             ` Mathieu Desnoyers
  1 sibling, 3 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 22:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 3:01 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
>                 . NMI exit code
> and fake NMI entry are made reentrant with respect to NMI handler interruption
> by testing, at the very beginning of the NMI handler, if a NMI is nested over
> the whole nmi_atomic .. nmi_atomic_end code region.

That is totally bogus. The NMI can be nested by exceptions and
function calls - the whole _point_ of this thing. So testing "rip" for
anything else than the specific final "iret" is meaningless. You will
be in an NMI region regardless of what rip is.

> This code assumes NMIs have a separate stack.

It also needs to be made per-cpu (and the flags be per-cpu).

Then you could in fact possibly test the stack pointer for whether it
is in the NMI stack area, and use the value of %rsp itself as the
flag. So you could avoid the flag entirely. Because testing %rsp is
valid - testing %rip is not.

That would also avoid the race, because %rsp (as a flag) now gets
cleared atomically by the "iret". So that might actually solve things.
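Concretely, the nesting test then reduces to a range check of the saved stack
pointer against the per-cpu NMI stack area. A minimal C sketch of that check
(illustrative only: the real thing would be a couple of asm instructions, and
nmi_stack_base/NMI_STACK_SIZE are made-up names, not kernel symbols):

#include <stdbool.h>

/* Illustrative size; the real per-cpu NMI stack size is set elsewhere. */
#define NMI_STACK_SIZE 4096UL

/*
 * True if the stack pointer saved by the incoming NMI already lies inside
 * this CPU's NMI stack area, i.e. we interrupted an NMI that was still
 * running. The "flag" is implicitly cleared when the outer NMI's iret
 * restores the old %rsp.
 */
static bool nmi_is_nested(unsigned long saved_rsp, unsigned long nmi_stack_base)
{
	return saved_rsp >= nmi_stack_base &&
	       saved_rsp <  nmi_stack_base + NMI_STACK_SIZE;
}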

                          Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 22:16                             ` Linus Torvalds
@ 2010-07-15 22:24                               ` H. Peter Anvin
  2010-07-15 22:26                               ` Linus Torvalds
  2010-07-15 22:30                               ` Mathieu Desnoyers
  2 siblings, 0 replies; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-15 22:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On 07/15/2010 03:16 PM, Linus Torvalds wrote:
> 
>> This code assumes NMIs have a separate stack.
> 
> It also needs to be made per-cpu (and the flags be per-cpu).
> 
> Then you could in fact possibly test the stack pointer for whether it
> is in the NMI stack area, and use the value of %rsp itself as the
> flag. So you could avoid the flag entirely. Because testing %rsp is
> valid - testing %rip is not.
> 
> That would also avoid the race, because %rsp (as a flag) now gets
> cleared atomically by the "iret". So that might actually solve things.
> 

This seems really clean to me.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 22:16                             ` Linus Torvalds
  2010-07-15 22:24                               ` H. Peter Anvin
@ 2010-07-15 22:26                               ` Linus Torvalds
  2010-07-15 22:46                                 ` H. Peter Anvin
  2010-07-15 22:58                                 ` Andi Kleen
  2010-07-15 22:30                               ` Mathieu Desnoyers
  2 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 22:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 3:16 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Then you could in fact possibly test the stack pointer for whether it
> is in the NMI stack area, and use the value of %rsp itself as the
> flag. So you could avoid the flag entirely. Because testing %rsp is
> valid - testing %rip is not.
>
> That would also avoid the race, because %rsp (as a flag) now gets
> cleared atomically by the "iret". So that might actually solve things.

Hmm. So on x86-32, it's easy: if the NMI is nested, you can literally
look at the current %rsp value, and see if it's within the NMI stack
region.

But on x86-64, due to IST, you need to look at the saved-rsp value on
the stack, since the %rsp always gets reset to the NMI stack region
regardless of where it was before.

Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
the normal kernel stack mechanisms?

                                 Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 22:16                             ` Linus Torvalds
  2010-07-15 22:24                               ` H. Peter Anvin
  2010-07-15 22:26                               ` Linus Torvalds
@ 2010-07-15 22:30                               ` Mathieu Desnoyers
  2 siblings, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-15 22:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Thu, Jul 15, 2010 at 3:01 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> >                 . NMI exit code
> > and fake NMI entry are made reentrant with respect to NMI handler interruption
> > by testing, at the very beginning of the NMI handler, if a NMI is nested over
> > the whole nmi_atomic .. nmi_atomic_end code region.
> 
> That is totally bogus. The NMI can be nested by exceptions and
> function calls - the whole _point_ of this thing. So testing "rip" for
> anything else than the specific final "iret" is meaningless. You will
> be in an NMI region regardless of what rip is.

There are 2 tests done on NMI handler entry:

1) test if nested over the nmi_atomic region (a very restricted region
around nmi_exit which makes no function calls and takes no traps).
2) test if the per-cpu nmi_nesting flag is set.

Test #2 takes care of NMIs nested over function calls and traps.

> 
> > This code assumes NMIs have a separate stack.
> 
> It also needs to be made per-cpu (and the flags be per-cpu).

Sure, that was implied ;)

> 
> Then you could in fact possibly test the stack pointer for whether it
> is in the NMI stack area, and use the value of %rsp itself as the
> flag. So you could avoid the flag entirely. Because testing %rsp is
> valid - testing %rip is not.

That could be used as a way to detect "nesting over NMI", but I'm not entirely
sure it would deal with the "we need a fake NMI" flag set/clear (more or less
equivalent to setting CS to 0 in your implementation and then back to some other
value). The "set" is done with NMIs disabled, but the "clear" is done at fake
NMI entry, where NMIs are active.

> 
> That would also avoid the race, because %rsp (as a flag) now gets
> cleared atomically by the "iret". So that might actually solve things.

Well, I'm still unconvinced there is anything to solve, as I built my NMI entry
with 2 tests: one for "nmi_atomic" code range and the other for per-cpu nesting
flag. Given that I set/clear the per-cpu nesting flag either with NMIs off or
within the nmi_atomic code range, this should all work fine.

Unless I am missing something else?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 22:26                               ` Linus Torvalds
@ 2010-07-15 22:46                                 ` H. Peter Anvin
  2010-07-15 22:58                                 ` Andi Kleen
  1 sibling, 0 replies; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-15 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On 07/15/2010 03:26 PM, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 3:16 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Then you could in fact possibly test the stack pointer for whether it
>> is in the NMI stack area, and use the value of %rsp itself as the
>> flag. So you could avoid the flag entirely. Because testing %rsp is
>> valid - testing %rip is not.
>>
>> That would also avoid the race, because %rsp (as a flag) now gets
>> cleared atomically by the "iret". So that might actually solve things.
> 
> Hmm. So on x86-32, it's easy: if the NMI is nested, you can literally
> look at the current %rsp value, and see if it's within the NMI stack
> region.
> 
> But on x86-64, due to IST, you need to look at the saved-rsp value on
> the stack, since the %rsp always gets reset to the NMI stack region
> regardless of where it was before.
> 
> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
> the normal kernel stack mechanisms?
> 

The reasons for using TSS (32 bits) or IST (64 bits) are: concern about
the size of the regular kernel stack, and a concern that the kernel
stack pointer may not be in a usable state.  The former is not a problem
here: we're doing a stack switch anyway, and so the additional overhead
on the main stack is pretty minimal, but the latter may be.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 22:26                               ` Linus Torvalds
  2010-07-15 22:46                                 ` H. Peter Anvin
@ 2010-07-15 22:58                                 ` Andi Kleen
  2010-07-15 23:20                                   ` H. Peter Anvin
  1 sibling, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-15 22:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
> the normal kernel stack mechanisms?

If you don't use IST the SYSCALL entry is racy during the window
when RSP is not set up yet (same for MCE etc.)

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 22:58                                 ` Andi Kleen
@ 2010-07-15 23:20                                   ` H. Peter Anvin
  2010-07-15 23:23                                     ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-15 23:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On 07/15/2010 03:58 PM, Andi Kleen wrote:
>> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
>> the normal kernel stack mechanisms?
> 
> If you don't use IST the SYSCALL entry is racy during the window
> when RSP is not set up yet (same for MCE etc.)
> 

Right, the kernel stack is not ready.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 23:20                                   ` H. Peter Anvin
@ 2010-07-15 23:23                                     ` Linus Torvalds
  2010-07-15 23:41                                       ` H. Peter Anvin
  2010-07-15 23:48                                       ` Andi Kleen
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 23:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 4:20 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 07/15/2010 03:58 PM, Andi Kleen wrote:
>>> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
>>> the normal kernel stack mechanisms?
>>
>> If you don't use IST the SYSCALL entry is racy during the window
>> when RSP is not set up yet (same for MCE etc.)
>>
>
> Right, the kernel stack is not ready.

Well, it may not be ready for the _current_ NMI handler, but if we're
going to do a stack switch in sw on NMI anyway... ?

                 Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 23:23                                     ` Linus Torvalds
@ 2010-07-15 23:41                                       ` H. Peter Anvin
  2010-07-15 23:44                                         ` Linus Torvalds
  2010-07-15 23:48                                       ` Andi Kleen
  1 sibling, 1 reply; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-15 23:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On 07/15/2010 04:23 PM, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 4:20 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 07/15/2010 03:58 PM, Andi Kleen wrote:
>>>> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
>>>> the normal kernel stack mechanisms?
>>>
>>> If you don't use IST the SYSCALL entry is racy during the window
>>> when RSP is not set up yet (same for MCE etc.)
>>>
>>
>> Right, the kernel stack is not ready.
> 
> Well, it may not be ready for the _current_ NMI handler, but if we're
> going to do a stack switch in sw on NMI anyway... ?
> 

No, the problem is that without IST it'll try to drop the NMI stack
frame itself *on the user stack*.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 23:41                                       ` H. Peter Anvin
@ 2010-07-15 23:44                                         ` Linus Torvalds
  2010-07-15 23:46                                           ` H. Peter Anvin
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-15 23:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 4:41 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> No, the problem is that without IST it'll try to drop the NMI stack
> frame itself *on the user stack*.

Oh, because SS has already been cleared, but rsp still points to the
user stack? Ok, that does seem insurmountable.

             Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 23:44                                         ` Linus Torvalds
@ 2010-07-15 23:46                                           ` H. Peter Anvin
  0 siblings, 0 replies; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-15 23:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On 07/15/2010 04:44 PM, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 4:41 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> No, the problem is that without IST it'll try to drop the NMI stack
>> frame itself *on the user stack*.
> 
> Oh, because SS has already been cleared, but rsp still points to the
> user stack? Ok, that does seem insurmountable.
> 

Well, SS doesn't matter for 64 bits, but yes, RSP still points to the
user stack.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 23:23                                     ` Linus Torvalds
  2010-07-15 23:41                                       ` H. Peter Anvin
@ 2010-07-15 23:48                                       ` Andi Kleen
  1 sibling, 0 replies; 168+ messages in thread
From: Andi Kleen @ 2010-07-15 23:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andi Kleen, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler,
	Tejun Heo

On Thu, Jul 15, 2010 at 04:23:20PM -0700, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 4:20 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > On 07/15/2010 03:58 PM, Andi Kleen wrote:
> >>> Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
> >>> the normal kernel stack mechanisms?
> >>
> >> If you don't use IST the SYSCALL entry is racy during the window
> >> when RSP is not set up yet (same for MCE etc.)
> >>
> >
> > Right, the kernel stack is not ready.
> 
> Well, it may not be ready for the _current_ NMI handler, but if we're
> going to do a stack switch in sw on NMI anyway... ?

The CPU-written initial stack frame would still go on the wrong stack.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 14:46                         ` Steven Rostedt
@ 2010-07-16 10:47                           ` Frederic Weisbecker
  2010-07-16 11:43                             ` Steven Rostedt
  0 siblings, 1 reply; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-16 10:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 10:46:13AM -0400, Steven Rostedt wrote:
> On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:
> 
> > >  - make sure that you only ever use _one_ single top-level entry for
> > > all vmalloc issues, and can make sure that all processes are created
> > > with that static entry filled in. This is optimal, but it just doesn't
> > > work on all architectures (eg on 32-bit x86, it would limit the
> > > vmalloc space to 4MB in non-PAE, whatever)
> > 
> > 
> > But then, even if you ensure that, don't we need to also fill lower level
> > entries for the new mapping.
> 
> If I understand your question, you do not need to worry about the lower
> level entries because all the processes will share the same top level.
> 
> process 1's PGD ------,
>                       |
>                       +------> PMD --> ...
>                       |
> process 2's PGD ------'
> 
> Thus we have one page entry shared by all processes. The issue happens
> when the vm space crosses the PMD boundary and we need to update all the
> PGDs of all processes to point to the new PMD we need to add to handle
> the spread of the vm space.




Oh right. We point to that PMD, and the update is made inside
the lower-level entries pointed to by the PMD. Indeed.



> 
> > 
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk to add a new memory mapping for new memory allocated with kmalloc?
> 
> Because all of memory (well, 800-some megs on 32-bit) is mapped into the
> address space of all processes. That is, kmalloc only uses this memory (as
> does get_free_page()). All processes have a PMD (or PUD, whatever) that
> maps this memory. The issue only arises when we use new virtual memory,
> which vmalloc does. Vmalloc may map physical memory that is already
> mapped into all processes, but the address that vmalloc uses to access
> that memory is not yet mapped.



Ok I see.




> 
> The usual reason the kernel uses vmalloc is to get a contiguous range of
> memory. The vmalloc can map several pages as one contiguous piece of
> memory that in reality is several different pages scattered around
> physical memory. kmalloc can only map pages that are contiguous in
> physical memory. That is, if kmalloc gets 8192 bytes on an arch with
> 4096 byte pages, it will allocate two consecutive pages in physical
> memory. If two contiguous pages are not available, even if thousands of
> single pages are, the kmalloc will fail, whereas the vmalloc will not.
> 
> An allocation of vmalloc can use two different pages and just map the
> page table to make them contiguous in the kernel's view. Note, this
> comes at a cost. One is that when we do this, we may need to update a
> bunch of page tables. The other is that we must waste
> TLB entries to point to these separate pages. Kmalloc and
> get_free_page() use the big memory mappings. That is, if the TLB allows
> us to map large pages, we can do that for kernel memory since we just
> want the contiguous memory as it is in physical memory.
> 
> Thus the kernel maps the physical memory with the fewest TLB entries as
> needed (large pages and large TLB entries). If we can map 64K pages, we
> do that. Then kmalloc just allocates within this range, it does not need
> to map any pages. They are already mapped.
> 
> Does this make a bit more sense?



Totally! You've made it very clear to me.
Moreover I did not know we could have such variable page sizes. I mean, I thought
we could have variable page sizes, but that it would apply to every page.
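To make the kmalloc/vmalloc contrast above concrete, here is a minimal
module-style sketch (illustrative and untested, using only the standard
allocator APIs):

#include <linux/module.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *k, *v;

static int __init alloc_demo_init(void)
{
	/* Needs two physically contiguous 4K pages. */
	k = kmalloc(8192, GFP_KERNEL);
	/* Needs two pages anywhere; they are mapped contiguously in the
	   vmalloc area, possibly filling in page tables to do so. */
	v = vmalloc(8192);
	if (!k || !v) {
		kfree(k);
		vfree(v);
		return -ENOMEM;
	}
	return 0;
}

static void __exit alloc_demo_exit(void)
{
	kfree(k);
	vfree(v);
}

module_init(alloc_demo_init);
module_exit(alloc_demo_exit);
MODULE_LICENSE("GPL");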





> 
> > 
> > 
> > 
> > >  - at vmalloc time, when adding a new page directory entry, walk all
> > > the tens of thousands of existing page tables under a lock that
> > > guarantees that we don't add any new ones (ie it will lock out fork())
> > > and add the required pgd entry to them.
> > > 
> > >  - or just take the fault and do the "fill the page tables" on demand.
> > > 
> > > Quite frankly, most of the time it's probably better to make that last
> > > choice (unless your hardware makes it easy to make the first choice,
> > > which is obviously simplest for everybody). It makes it _much_ cheaper
> > > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > > simpler too, and has no interesting locking issues with how/when you
> > > expose the page tables in fork() etc.
> > > 
> > > So the only downside is that you do end up taking a fault in the
> > > (rare) case where you have a newly created task that didn't get an
> > > even newer vmalloc entry.
> > 
> > 
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process page tables for vmalloc.
> 
> Actually we don't even need to walk the page tables in the first task
> (although we might do that). When the kernel accesses that memory we
> take the page fault; the page fault will see that this memory is vmalloc
> data and fill in the page tables for the task at that time.



Right.




> > 
> > I would understand this race if we were to walk on every processes page
> > tables and add the new mapping on them, but we missed one new task that
> > forked or so, because we didn't lock (or just rcu).
> > 
> > 
> > 
> > > And that fault can sometimes be in an
> > > interrupt or an NMI. Normally it's trivial to handle that fairly
> > > simple nested fault. But NMI has that inconvenient "iret unblocks
> > > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > > x86.
> > 
> > 
> > Yeah.
> > 
> > 
> > So the parts of the problem I don't understand are:
> > 
> > - why don't we have this problem with kmalloc() ?
> 
> I hope I explained that above.



Yeah :)

Thanks a lot for your explanations!


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 14:35                         ` Andi Kleen
@ 2010-07-16 11:21                           ` Frederic Weisbecker
  0 siblings, 0 replies; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-16 11:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Mathieu Desnoyers, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 04:35:18PM +0200, Andi Kleen wrote:
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process page tables for vmalloc.
> 
> No because those are always shared for the kernel and have been
> filled in for init_mm.
> 
> Also most updates only update the lower tables anyways, top level
> updates are extremly rare. In fact on PAE36 they should only happen
> at most once, if at all, and most likely at early boot anyways
> where you only  have a single task. 
> 
> On x86-64 they will only happen once every 512GB of vmalloc. 
> So for most systems also at most once at early boot.
> > 
> > I would understand this race if we were to walk on every processes page
> > tables and add the new mapping on them, but we missed one new task that
> > forked or so, because we didn't lock (or just rcu).
> 
> The new task will always get a copy of the reference init_mm, which
> was already updated.
> 
> -Andi


Ok, got it.
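To put a number on the "once every 512GB" point, a quick user-space sketch of
the arithmetic (assuming x86-64 4-level paging and the vmalloc base of that
era; pgd_slot() here is just local arithmetic, not a kernel macro):

#include <stdio.h>

#define PGDIR_SHIFT	39			/* each PGD entry spans 512GB */
#define VMALLOC_BASE	0xffffc90000000000UL	/* x86-64 vmalloc start */

static unsigned long pgd_slot(unsigned long addr)
{
	return (addr >> PGDIR_SHIFT) & 511;
}

int main(void)
{
	unsigned long a = VMALLOC_BASE;
	unsigned long b = VMALLOC_BASE + (8UL << 30);	/* 8GB further up */

	/* Same top-level slot: no new PGD entry is needed for b. */
	printf("slot(a)=%lu slot(b)=%lu\n", pgd_slot(a), pgd_slot(b));
	return 0;
}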

But then, in the example here with perf, I'm allocating 8192 bytes per cpu
and my total memory amount is 2 GB.

And it always faults at least once on access, after the allocation.
I really doubt it's because we are adding a new top-level page table,
considering the amount of memory I have.

It seems to me that the mapping of a newly allocated vmalloc area is
always inserted the lazy way (update on fault). Or maybe there is
something I'm missing.

Thanks.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-16 10:47                           ` Frederic Weisbecker
@ 2010-07-16 11:43                             ` Steven Rostedt
  0 siblings, 0 replies; 168+ messages in thread
From: Steven Rostedt @ 2010-07-16 11:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Fri, 2010-07-16 at 12:47 +0200, Frederic Weisbecker wrote:
> > Thus the kernel maps the physical memory with the fewest TLB entries as
> > needed (large pages and large TLB entries). If we can map 64K pages, we
> > do that. Then kmalloc just allocates within this range, it does not need
> > to map any pages. They are already mapped.
> > 
> > Does this make a bit more sense?
> 
> 
> 
> Totally! You've made it very clear to me.
> Moreover I did not know we can have such variable page size. I mean I thought
> we can have variable page size but that would apply to every pages.

In x86_64, if bit 7 in the PDE (Page Directory Entry) is set then it
points to a 2 Meg page. Otherwise it points to a page table which will
have 512 PTE's pointing to 4K pages.

Download:

http://support.amd.com/us/Processor_TechDocs/24593.pdf

It has nice diagrams that explains this. Check out page 207 (fig 5-17)
and 210 (fig 5-22).

The phys_pmd_init() in arch/x86/mm/init_64.c will try to map memory
using 2M pages if it can, otherwise it falls back to 4K pages.

-- Steve


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 14:51                         ` Linus Torvalds
  2010-07-15 15:38                           ` Linus Torvalds
@ 2010-07-16 12:00                           ` Frederic Weisbecker
  2010-07-16 12:54                             ` Steven Rostedt
  1 sibling, 1 reply; 168+ messages in thread
From: Frederic Weisbecker @ 2010-07-16 12:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, Jul 15, 2010 at 07:51:55AM -0700, Linus Torvalds wrote:
> On Thu, Jul 15, 2010 at 7:11 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
> >> You can:
> >>
> >>  - make sure that you only ever use _one_ single top-level entry for
> >> all vmalloc issues, and can make sure that all processes are created
> >> with that static entry filled in. This is optimal, but it just doesn't
> >> work on all architectures (eg on 32-bit x86, it would limit the
> >> vmalloc space to 4MB in non-PAE, whatever)
> >
> > But then, even if you ensure that, don't we need to also fill lower level
> > entries for the new mapping.
> 
> Yes, but now they are all mapped by the one *shared* top-level entry.
> 
> Think about it.
> 
> [ Time passes ]
> 
> End result: if you can map the whole vmalloc area with a single
> top-level entry that is shared by all processes, and can then just
> fill in the lower levels when doing actual allocations, it means that
> all processes will automatically get the entries added, and do not
> need any fixups.
> 
> In other words, the page tables will be automatically correct and
> filled in for everybody - without having to traverse any lists,
> without any extra locking, and without any races. So this is efficient
> and simple, and never needs any faulting to fill in page tables later
> on.
> 
> (Side note: "single top-level entry" could equally well be "multiple
> preallocated entries covering the whole region": the important part is
> not really the "single entry", but the "preallocated and filled into
> every page directory from the start" part)



Right, I got it. Thanks for these explanations.



> 
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk to add a new memory mapping for new memory allocated with kmalloc?
> 
> No. The kmalloc space is all in the 1:1 kernel mapping, and is always
> mapped. Even with PAGEALLOC_DEBUG, it's always mapped at the top
> level, and even if a particular page is unmapped/remapped for
> debugging, it is done so in the shared kernel page tables (which ends
> up being the above trivial case - there is just a single set of page
> directory entries that are shared by everybody).



Ok.



> >>  - at vmalloc time, when adding a new page directory entry, walk all
> >> the tens of thousands of existing page tables under a lock that
> >> guarantees that we don't add any new ones (ie it will lock out fork())
> >> and add the required pgd entry to them.
> >>
> >>  - or just take the fault and do the "fill the page tables" on demand.
> >>
> >> Quite frankly, most of the time it's probably better to make that last
> >> choice (unless your hardware makes it easy to make the first choice,
> >> which is obviously simplest for everybody). It makes it _much_ cheaper
> >> to do vmalloc. It also avoids that nasty latency issue. And it's just
> >> simpler too, and has no interesting locking issues with how/when you
> >> expose the page tables in fork() etc.
> >>
> >> So the only downside is that you do end up taking a fault in the
> >> (rare) case where you have a newly created task that didn't get an
> >> even newer vmalloc entry.
> >
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process page tables for vmalloc.
> 
> We always add the mapping to the "init_mm" page tables when it is
> created (just a single mm), and when fork creates a new page table, it
> will always copy the kernel mapping parts from the old one. So the
> _common_ case is that all normal mappings are already set up in page
> tables, including newly created page tables.
> 
> The uncommon case is when there is a new page table created _and_ a
> new vmalloc mapping, and the race that happens between those events.
> Whent hat new page table is then later used (and it can be _much_
> later, of course: we're now talking process scheduling, so IO delays
> etc are relevant), it won't necessarily have the page table entries
> for vmalloc stuff that was created since the page tables were created.
> So we fill _those_ in dynamically.



Such a newly created page table can only race at the top level, right?
Otherwise it wouldn't race, since the top-level entries are shared and
updates inside the lower-level tables are naturally propagated, if I
understood you well.

So, if only top-level entries that get added can require such a lazy
mapping update, I wonder why I experienced this fault every time with
my patches.

I allocated 8192 bytes per cpu on an x86-32 system that has only 2 GB.
I doubt there is a top-level page table update there at this time with
such a small amount of available memory. But still it faults once on
access.

I have trouble visualizing the race and the problem here.



> 
> But vmalloc mappings should be reasonably rare, and the actual "fill
> them in" cases are much rarer still (since we fill them in page
> directory entries at a time: so even if you make a lot of vmalloc()
> calls, we only _fill_ at most once per page directory entry, which is
> usually a pretty big chunk). On 32-bit x86, for example, we'd fill
> once every 4MB (or 2MB if PAE), and you cannot have a lot of vmalloc
> mappings that large (since the VM space is limited).
> 
> So the cost of filling things in is basically zero, because it happens
> so seldom. And by _allowing_ things to be done lazily, we avoid all
> the locking costs, and all the costs of traversing every single
> possible mm (and there can be many many thousands of those).



Ok.



> > I would understand this race if we were to walk on every processes page
> > tables and add the new mapping on them, but we missed one new task that
> > forked or so, because we didn't lock (or just rcu).
> 
> .. and how do you keep track of which tasks you missed? And no, it's
> not just the new tasks - you have old tasks that have their page
> tables built up too, but need to be updated. They may never need the
> mapping since they may be sleeping and never using the driver or data
> structures that created it (in fact, that's a common case), so filling
> them would be pointless. But if we don't do the lazy fill, we'd have
> to fill them all, because WE DO NOT KNOW.



Right.



> 
> > So the parts of the problem I don't understand are:
> >
> > - why don't we have this problem with kmalloc() ?
> 
> Hopefully clarified.


Indeed.



> > - did I understand well the race that makes the fault necessary,
> >  ie: we walk the tasklist lockless, add the new mapping if
> >  not present, but we might miss a task lately forked, but
> >  the fault will fix that.
> 
> But the _fundamental_ issue is that we do not want to walk the
> tasklist (or the mm_list) AT ALL. It's a f*cking waste of time. It's a
> long list, and nobody cares. In many cases it won't be needed.
> 
> The lazy algorithm is _better_. It's way simpler (we take nested
> faults all the time in the kernel, and it's a particularly _easy_ page
> fault to handle with no IO or no locking needed), and it does less
> work. It really boils down to that.


Yeah, agreed. But I'm still confused about when exactly we need to fault
(the doubts I detailed in my question above).



> So it's not the lazy page table fill that is the problem. Never has
> been. We've been doing the lazy fill for a long time, and it was
> simple and useful way back when.
> 
> The problem has always been NMI, and nothing else. NMI's are nasty,
> and the x86 NMI blocking is insane and crazy.
> 
> Which is why I'm so adamant that this should be fixed in the NMI code,
> and we should _not_ talk about trying to screw up other, totally
> unrelated, code. The lazy fill really was never the problem.


Yeah agreed.

Thanks for your explanations!


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
  2010-07-14 16:42   ` Maciej W. Rozycki
@ 2010-07-16 12:28   ` Avi Kivity
  2010-07-16 14:49     ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 12:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Mathieu Desnoyers,
	akpm, H. Peter Anvin, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
> Implements an alternative iret with popf and return so trap and exception
> handlers can return to the NMI handler without issuing iret. iret would cause
> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
> copy the return instruction pointer to the top of the previous stack, issue a
> popf, loads the previous esp and issue a near return (ret).
>
> It allows placing dynamically patched static jumps in asm gotos, which will be
> used for optimized tracepoints, in NMI code since returning from a breakpoint
> would be valid. Accessing vmalloc'd memory, which allows executing module code
> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
> This is very useful to tracers like LTTng.
>
> This patch makes all faults, traps and exception safe to be called from NMI
> context*except*  single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes
> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.
>    

You need to save/restore cr2 in addition, otherwise the following hits you

- page fault
- processor writes cr2, enters fault handler
- nmi
- page fault
- cr2 overwritten

I guess you would usually not notice the corruption since you'd just see 
a spurious fault on the page the NMI handler touched, but if the first 
fault happened in a kvm guest, then we'd corrupt the guest's cr2.
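Something like this C-level sketch of the wrapper (illustrative only: the real
fix would live in the NMI assembly, and read_cr2()/write_cr2() are assumed to
be the kernel's usual CR2 accessors from the arch headers):

static void nmi_body(void)
{
	/* Stand-in for the real handler work; may touch vmalloc'd memory
	   and therefore take a page fault, which rewrites CR2. */
}

static void nmi_handler_cr2_safe(void)
{
	unsigned long cr2 = read_cr2();	/* snapshot before anything can fault */

	nmi_body();

	write_cr2(cr2);			/* restore the interrupted value */
}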

But the whole thing strikes me as overkill.  If it's 8k per-cpu, what's 
wrong with using a per-cpu pointer to a kmalloc() area?
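For reference, a rough sketch of that alternative (illustrative only; error
unwinding and freeing are omitted):

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/errno.h>

/* One kmalloc()ed 8k buffer per CPU, reachable through a per-cpu pointer,
   so NMI context never has to touch vmalloc space at all. */
static DEFINE_PER_CPU(void *, nmi_trace_buf);

static int __init alloc_nmi_bufs(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		void *buf = kmalloc(8192, GFP_KERNEL);

		if (!buf)
			return -ENOMEM;
		per_cpu(nmi_trace_buf, cpu) = buf;
	}
	return 0;
}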

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-16 12:00                           ` Frederic Weisbecker
@ 2010-07-16 12:54                             ` Steven Rostedt
  0 siblings, 0 replies; 168+ messages in thread
From: Steven Rostedt @ 2010-07-16 12:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Mathieu Desnoyers, Andi Kleen, Ingo Molnar, LKML,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Fri, 2010-07-16 at 14:00 +0200, Frederic Weisbecker wrote:
> On Thu, Jul 15, 2010 at 07:51:55AM -0700, Linus Torvalds wrote:

> 
> Such new page table created that might race is only about top level page
> right? Otherwise it wouldn't race since the top level entries are shared
> and then updates inside lower level pages are naturally propagated, if
> I understood you well.
> 
> So, if only top level pages that gets added can generate such lazily
> mapping update, I wonder why I experienced this fault everytime with
> my patches.
> 
> I allocated 8192 bytes per cpu in a x86-32 system that has only 2 GB.
> I doubt there is a top level page table update there at this time with
> such a small amount of available memory. But still it faults once on
> access.
> 
> I have troubles to visualize the race and the problem here.
> 

A few trace_printks and a tracing_off() on fault would probably show
exactly what was happening ;-)

-- Steve



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 12:28   ` Avi Kivity
@ 2010-07-16 14:49     ` Mathieu Desnoyers
  2010-07-16 15:34       ` Andi Kleen
  2010-07-16 16:47       ` Avi Kivity
  0 siblings, 2 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 14:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Avi Kivity (avi@redhat.com) wrote:
> On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
>> Implements an alternative iret with popf and return so trap and exception
>> handlers can return to the NMI handler without issuing iret. iret would cause
>> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
>> copy the return instruction pointer to the top of the previous stack, issue a
>> popf, loads the previous esp and issue a near return (ret).
>>
>> It allows placing dynamically patched static jumps in asm gotos, which will be
>> used for optimized tracepoints, in NMI code since returning from a breakpoint
>> would be valid. Accessing vmalloc'd memory, which allows executing module code
>> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
>> This is very useful to tracers like LTTng.
>>
>> This patch makes all faults, traps and exception safe to be called from NMI
>> context*except*  single-stepping, which requires iret to restore the TF (trap
>> flag) and jump to the return address in a single instruction. Sorry, no kprobes
>> support in NMI handlers because of this limitation. This cannot be emulated
>> with popf/lret, because lret would be single-stepped. It does not apply to
>> "immediate values" because they do not use single-stepping. This code detects if
>> the TF flag is set and uses the iret path for single-stepping, even if it
>> reactivates NMIs prematurely.
>>    
>
> You need to save/restore cr2 in addition, otherwise the following hits you
>
> - page fault
> - processor writes cr2, enters fault handler
> - nmi
> - page fault
> - cr2 overwritten
>
> I guess you would usually not notice the corruption since you'd just see  
> a spurious fault on the page the NMI handler touched, but if the first  
> fault happened in a kvm guest, then we'd corrupt the guest's cr2.

OK, just to make sure: you mean we'd have to save/restore the cr2 register
at the beginning/end of the NMI handler execution, right ? Then shouldn't we
save/restore cr3 too ?
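
For reference, what I have in mind for that save/restore is roughly the
following untested sketch, assuming it is done around the C-level handler
(the function name is just a stand-in; doing it in the asm entry code would
work equally well):

	void nmi_handler(struct pt_regs *regs)		/* stand-in name */
	{
		unsigned long cr2 = read_cr2();		/* save before anything can fault */

		/* ... NMI work that may take page faults or breakpoints ... */

		write_cr2(cr2);				/* restore before returning */
	}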

> But the whole thing strikes me as overkill.  If it's 8k per-cpu, what's  
> wrong with using a per-cpu pointer to a kmalloc() area?

Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
much more than perf) can potentially cause large latencies, which could be
squashed by allowing page faults in NMI handlers. This looks like a stronger
argument to me.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 14:49     ` Mathieu Desnoyers
@ 2010-07-16 15:34       ` Andi Kleen
  2010-07-16 15:40         ` Mathieu Desnoyers
  2010-07-16 16:47       ` Avi Kivity
  1 sibling, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-16 15:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Avi Kivity, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler

> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be

You need to fix all the other code that walks task lists too, to avoid all those.

% gid for_each_process | wc -l

In fact the mm-struct walk is cheaper than a task-list walk because there
are always fewer mm_structs than tasks.

-Andi

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 15:34       ` Andi Kleen
@ 2010-07-16 15:40         ` Mathieu Desnoyers
  0 siblings, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 15:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Avi Kivity, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, akpm,
	H. Peter Anvin, Jeremy Fitzhardinge, Frank Ch. Eigler

* Andi Kleen (andi@firstfloor.org) wrote:
> > Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> > much more than perf) can potentially cause large latencies, which could be
> 
> You need to fix all the other code that walks task lists too, to avoid all those.
> 
> % gid for_each_process | wc -l

This can very well be done incrementally. And I agree, these should eventually
be targeted too, especially those which hold locks. We've already started hearing
about tasklist lock live-locks in the past year, so I think we're pretty much at
the point where it should be looked at.

Thanks,

Mathieu

> 
> In fact the mm-struct walk is cheaper than a task-list walk because there
> are always fewer mm_structs than tasks.

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 14:49     ` Mathieu Desnoyers
  2010-07-16 15:34       ` Andi Kleen
@ 2010-07-16 16:47       ` Avi Kivity
  2010-07-16 16:58         ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 16:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>
>> You need to save/restore cr2 in addition, otherwise the following hits you
>>
>> - page fault
>> - processor writes cr2, enters fault handler
>> - nmi
>> - page fault
>> - cr2 overwritten
>>
>> I guess you would usually not notice the corruption since you'd just see
>> a spurious fault on the page the NMI handler touched, but if the first
>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>      
> OK, just to make sure: you mean we'd have to save/restore the cr2 register
> at the beginning/end of the NMI handler execution, right ?

Yes.

> Then shouldn't we
> save/restore cr3 too ?
>
>    

No, faults should not change cr3.

>> But the whole thing strikes me as overkill.  If it's 8k per-cpu, what's
>> wrong with using a per-cpu pointer to a kmalloc() area?
>>      
> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be
> squashed by allowing page faults in NMI handlers. This looks like a stronger
> argument to me.

Why is that kernel code calling vmalloc_sync_all()?  If it is only NMI 
which cannot take vmalloc faults, why bother?  If not, why not?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 16:47       ` Avi Kivity
@ 2010-07-16 16:58         ` Mathieu Desnoyers
  2010-07-16 17:54           ` Avi Kivity
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 16:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Avi Kivity (avi@redhat.com) wrote:
> On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>>
>>> You need to save/restore cr2 in addition, otherwise the following hits you
>>>
>>> - page fault
>>> - processor writes cr2, enters fault handler
>>> - nmi
>>> - page fault
>>> - cr2 overwritten
>>>
>>> I guess you would usually not notice the corruption since you'd just see
>>> a spurious fault on the page the NMI handler touched, but if the first
>>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>>      
>> OK, just to make sure: you mean we'd have to save/restore the cr2 register
>> at the beginning/end of the NMI handler execution, right ?
>
> Yes.

OK

>
>> Then shouldn't we
>> save/restore cr3 too ?
>>
>>    
>
> No, faults should not change cr3.

Ah, right.

>
>>> But the whole thing strikes me as overkill.  If it's 8k per-cpu, what's
>>> wrong with using a per-cpu pointer to a kmalloc() area?
>>>      
>> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
>> much more than perf) can potentially cause large latencies, which could be
>> squashed by allowing page faults in NMI handlers. This looks like a stronger
>> argument to me.
>
> Why is that kernel code calling vmalloc_sync_all()?  If it is only NMI  
> which cannot take vmalloc faults, why bother?  If not, why not?

Modules come as yet another example of stuff that is loaded in vmalloc'd space
and can be accessed from NMI context. That would include oprofile, tracers, and
probably others I'm forgetting about.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 16:58         ` Mathieu Desnoyers
@ 2010-07-16 17:54           ` Avi Kivity
  2010-07-16 18:05             ` H. Peter Anvin
  0 siblings, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 17:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>
>> Why is that kernel code calling vmalloc_sync_all()?  If it is only NMI
>> which cannot take vmalloc faults, why bother?  If not, why not?
>>      
> Modules come as yet another example of stuff that is loaded in vmalloc'd space
> and can be accessed from NMI context. That would include oprofile, tracers, and
> probably others I'm forgetting about.
>    

Module loading can certainly take a vmalloc_sync_all() (though I agree 
it's unpleasant).  Anything else?

Note perf is not modular at this time, but could be made so with 
preempt/sched notifiers to hook the context switch.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 17:54           ` Avi Kivity
@ 2010-07-16 18:05             ` H. Peter Anvin
  2010-07-16 18:15               ` Avi Kivity
                                 ` (2 more replies)
  0 siblings, 3 replies; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-16 18:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:54 AM, Avi Kivity wrote:
> On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>>
>>> Why is that kernel code calling vmalloc_sync_all()?  If it is only NMI
>>> which cannot take vmalloc faults, why bother?  If not, why not?
>>>      
>> Modules come as yet another example of stuff that is loaded in vmalloc'd space
>> and can be accessed from NMI context. That would include oprofile, tracers, and
>> probably others I'm forgetting about.
>>    
> 
> Module loading can certainly take a vmalloc_sync_all() (though I agree 
> it's unpleasant).  Anything else?
> 
> Note perf is not modular at this time, but could be made so with 
> preempt/sched notifiers to hook the context switch.
> 

Actually, module loading is already a performance problem; a lot of
distros load sometimes hundreds of modules on startup, and it's heavily
serialized, so I can see this being desirable to skip.

I really hope no one ever gets the idea of touching user space from an
NMI handler, though, and expecting it to work...

	-hpa


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:05             ` H. Peter Anvin
@ 2010-07-16 18:15               ` Avi Kivity
  2010-07-16 18:17                 ` H. Peter Anvin
                                   ` (2 more replies)
  2010-07-16 19:28               ` Andi Kleen
  2010-08-04  9:46               ` Peter Zijlstra
  2 siblings, 3 replies; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 18:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>
>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>> it's unpleasant).  Anything else?
>>
>> Note perf is not modular at this time, but could be made so with
>> preempt/sched notifiers to hook the context switch.
>>
>>      
> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily
> serialized, so I can see this being desirable to skip.
>    

There aren't that many processes at this time (or there shouldn't be, 
don't know how fork-happy udev is at this stage), so the sync should be 
pretty fast.  In any case, we can sync only modules that contain NMI 
handlers.

> I really hope no one ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...
>    

I think the concern here is about an NMI handler's code running in 
vmalloc space, or is it something else?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:15               ` Avi Kivity
@ 2010-07-16 18:17                 ` H. Peter Anvin
  2010-07-16 18:28                   ` Avi Kivity
  2010-07-16 18:22                 ` Mathieu Desnoyers
  2010-07-16 18:25                 ` Linus Torvalds
  2 siblings, 1 reply; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-16 18:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 11:15 AM, Avi Kivity wrote:
> 
> There aren't that many processes at this time (or there shouldn't be, 
> don't know how fork-happy udev is at this stage), so the sync should be 
> pretty fast.  In any case, we can sync only modules that contain NMI 
> handlers.
> 
>> I really hope no one ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>    
> 
> I think the concern here is about an NMI handler's code running in 
> vmalloc space, or is it something else?
> 

Code or data, yes; including percpu data.

	-hpa

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:15               ` Avi Kivity
  2010-07-16 18:17                 ` H. Peter Anvin
@ 2010-07-16 18:22                 ` Mathieu Desnoyers
  2010-07-16 18:32                   ` Avi Kivity
  2010-07-16 18:25                 ` Linus Torvalds
  2 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 18:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

* Avi Kivity (avi@redhat.com) wrote:
> On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>>
>>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>>> it's unpleasant).  Anything else?
>>>
>>> Note perf is not modular at this time, but could be made so with
>>> preempt/sched notifiers to hook the context switch.
>>>
>>>      
>> Actually, module loading is already a performance problem; a lot of
>> distros load sometimes hundreds of modules on startup, and it's heavily
>> serialized, so I can see this being desirable to skip.
>>    
>
> There aren't that many processes at this time (or there shouldn't be,  
> don't know how fork-happy udev is at this stage), so the sync should be  
> pretty fast.  In any case, we can sync only modules that contain NMI  
> handlers.

USB hotplug is a use-case happening randomly after the system is up and
running; I'm afraid this does not fit your module loading expectations. It
triggers tons of events, many of which actually load modules.

Thanks,

Mathieu

>
>> I really hope no one ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>    
>
> I think the concern here is about an NMI handler's code running in  
> vmalloc space, or is it something else?
>
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:15               ` Avi Kivity
  2010-07-16 18:17                 ` H. Peter Anvin
  2010-07-16 18:22                 ` Mathieu Desnoyers
@ 2010-07-16 18:25                 ` Linus Torvalds
  2010-07-16 19:30                   ` Andi Kleen
  2 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-16 18:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <avi@redhat.com> wrote:
>
> I think the concern here is about an NMI handler's code running in vmalloc
> space, or is it something else?

I think the concern was also potentially doing things like backtraces
etc that may need access to the module data structures (I think the
ELF headers end up all being in vmalloc space too, for example).

The whole debugging thing is also an issue. Now, I obviously am not a
big fan of remote debuggers, but everybody tells me I'm wrong. And
putting a breakpoint on NMI is certainly not insane if you are doing
debugging in the first place. So it's not necessarily always about the
page faults.

                               Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:17                 ` H. Peter Anvin
@ 2010-07-16 18:28                   ` Avi Kivity
  2010-07-16 18:37                     ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 18:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:17 PM, H. Peter Anvin wrote:
>
>> I think the concern here is about an NMI handler's code running in
>> vmalloc space, or is it something else?
>>
>>      
> Code or data, yes; including percpu data.
>    

Use kmalloc and percpu pointers, it's not that onerous.
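
Roughly like this, as a sketch only (the names are made up, this is not
perf's actual code):

	static DEFINE_PER_CPU(void *, nmi_buf);	/* static percpu: always mapped */

	static int alloc_nmi_bufs(void)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			void *buf = kmalloc(8192, GFP_KERNEL);

			if (!buf)
				return -ENOMEM;	/* leaks the earlier ones, fine for a sketch */
			per_cpu(nmi_buf, cpu) = buf;
		}
		return 0;
	}

The NMI handler then only ever dereferences __get_cpu_var(nmi_buf), which
never goes anywhere near vmalloc space.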

Oh, and you can access vmalloc space by switching cr3 temporarily to 
init_mm's, no?  Obviously not a very performant solution, at least 
without PCIDs, but can be used if needed.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:22                 ` Mathieu Desnoyers
@ 2010-07-16 18:32                   ` Avi Kivity
  2010-07-16 19:29                     ` H. Peter Anvin
  2010-07-16 19:32                     ` Andi Kleen
  0 siblings, 2 replies; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 18:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote:
>
>> There aren't that many processes at this time (or there shouldn't be,
>> don't know how fork-happy udev is at this stage), so the sync should be
>> pretty fast.  In any case, we can sync only modules that contain NMI
>> handlers.
>>      
> USB hotplug is a use-case happening randomly after the system is well there and
> running; I'm afraid this does not fit in your module loading expectations. It
> triggers tons of events, many of these actually load modules.
>    

How long would vmalloc_sync_all take with a few thousand mm_structs?

We share the pmds, yes?  So it's a few thousand memory accesses.  The 
direct impact is probably negligible, compared to actually loading the 
module from disk.  All we need is to make sure the locking doesn't slow 
down unrelated stuff.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:28                   ` Avi Kivity
@ 2010-07-16 18:37                     ` Linus Torvalds
  2010-07-16 19:26                       ` Avi Kivity
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-16 18:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity <avi@redhat.com> wrote:
>
> Use kmalloc and percpu pointers, it's not that onerous.

What people don't seem to understand is that WE SHOULD NOT MAKE NMI
FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
WHAT-SO-EVER TO DO WITH NMI.

I'm shouting, because this point seems to have been continually
missed. It was missed in the original patches, and it's been missed in
the discussions.

Non-NMI code should simply never have to even _think_ about NMI's. Why
should it? It should just do whatever comes "natural" within its own
context.

This is why I've been pushing for the "let's just fix NMI" approach.
Not adding random hacks to other code sequences that have nothing
what-so-ever to do with NMI.

So don't add NMI code to the page fault code. Not to the debug code,
or to the module loading code. Don't say "use special allocations
because the NMI code may care about these particular data structures".
Because that way lies crap and unmaintainability.

                      Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 22:01                           ` Mathieu Desnoyers
  2010-07-15 22:16                             ` Linus Torvalds
@ 2010-07-16 19:13                             ` Mathieu Desnoyers
  1 sibling, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

Hi Linus,

What I omitted in my original description paragraph is that I also test for NMIs
nested over NMI "regular code" with a "nesting" per-cpu flag, which deals with
the concerns you expressed in your reply about function calls and traps.

I'm self-replying to keep track of Avi's comment about the need to save/restore
cr2 at the beginning/end of the NMI handler, so we don't end up corrupting a VM
CR2 if we have the following scenario: trap in VM, NMI, trap in NMI. So I added
cr2 awareness to the code snippet below, so we should be close to having something
that starts to make sense. (although I'm not saying it's bug-free yet) ;)

Please note that I'll be off on vacation for 2 weeks starting this evening (back
on August 2) without Internet access, so my answers might be delayed.

Thanks !

Mathieu


Code originally written by Linus Torvalds, modified by Mathieu Desnoyers
intending to handle the fake NMI entry gracefully given that NMIs are not
necessarily disabled at the entry point. It uses a "need fake NMI" flag rather
than playing games with CS and faults. When a fake NMI is needed, it simply
jumps back to the beginning of regular nmi code. NMI exit code and fake NMI
entry are made reentrant with respect to NMI handler interruption by testing, at
the very beginning of the NMI handler, if an NMI is nested over the whole
nmi_atomic .. nmi_atomic_end code region. It also tests for nested NMIs by
keeping a per-cpu "nmi nested" flag; this deals with detection of nesting over
the "regular nmi" execution. This code assumes NMIs have a separate stack.

#
# Two per-cpu variables: a "are we nested" flag (one byte).
# a "do we need to execute a fake NMI" flag (one byte).
# The %rsp at which the stack copy is saved is at a fixed address, which leaves
# enough room at the bottom of NMI stack for the "real" NMI entry stack. This
# assumes we have a separate NMI stack.
# The NMI stack copy top of stack is at nmi_stack_copy.
# The NMI stack copy "rip" is at nmi_stack_copy_rip, which is set to
# nmi_stack_copy-48 (six words are copied: rip, cs, eflags, rsp, ss, cr2).
#
nmi:
	# Test if nested over atomic code.
	cmpq $nmi_atomic,0(%rsp)
	jae nmi_addr_is_ae
	# Test if nested over general NMI code.
	cmpb $0,%__percpu_seg:nmi_stack_nesting
	jne nmi_nested_set_fake_and_return
	# create new stack
is_unnested_nmi:
	# Save some space for nested NMI's. The exception itself
	# will never use more space, but it might use less (since
	# it will be a kernel-kernel transition).

	# Save %rax and %rdx on top of the stack (we temporarily need them;
	# %cr2 can only be read through a general-purpose register)
	pushq %rax
	pushq %rdx
	movq %rsp, %rax
	movq %cr2, %rdx
	movq $nmi_stack_copy,%rsp

	# copy the five words of stack info. rip starts at 16+0(%rax).
	# cr2 is saved at nmi_stack_copy_rip+40
	pushq %rdx           # save cr2 to handle nesting over page faults
	pushq 16+32(%rax)    # ss
	pushq 16+24(%rax)    # rsp
	pushq 16+16(%rax)    # eflags
	pushq 16+8(%rax)     # cs
	pushq 16+0(%rax)     # rip
	movq 0(%rax),%rdx    # restore %rdx
	movq 8(%rax),%rax    # restore %rax

set_nmi_nesting:
	# and set the nesting flags
	movb $0xff,%__percpu_seg:nmi_stack_nesting

regular_nmi_code:
	...
	# regular NMI code goes here, and can take faults,
	# because this sequence now has proper nested-nmi
	# handling
	...

nmi_atomic:
	# An NMI nesting over the whole nmi_atomic .. nmi_atomic_end region will
	# be handled specially. This includes the fake NMI entry point.
	cmpb $0,%__percpu_seg:need_fake_nmi
	jne fake_nmi
	movb $0,%__percpu_seg:nmi_stack_nesting
	# restore cr2 (it can only be written through a general-purpose register)
	pushq %rax
	movq 48(%rsp),%rax  # saved cr2: 40 bytes above the saved rip, plus the 8 we just pushed
	movq %rax,%cr2
	popq %rax
	iret

	# This is the fake NMI entry point.
fake_nmi:
	movb $0x0,%__percpu_seg:need_fake_nmi
	jmp regular_nmi_code
nmi_atomic_end:

	# Make sure the address is in the nmi_atomic range and in CS segment.
nmi_addr_is_ae:
	cmpq $nmi_atomic_end,0(%rsp)
	jae is_unnested_nmi
	# The saved rip points to the final NMI iret. Check the CS segment to
	# make sure.
	cmpw $__KERNEL_CS,8(%rsp)
	jne is_unnested_nmi

# This is the case when we hit just as we're supposed to do the atomic code
# of a previous nmi.  We run the NMI using the old return address that is still
# on the stack, rather than copy the new one that is bogus and points to where
# the nested NMI interrupted the original NMI handler!
# Easy: just set the stack pointer to point to the stack copy, clear
# need_fake_nmi (because we are directly going to execute the requested NMI) and
# jump to "nesting flag set" (which is followed by regular nmi code execution).
	movq $nmi_stack_copy_rip,%rsp
	movb $0x0,%__percpu_seg:need_fake_nmi
	jmp set_nmi_nesting

# This is the actual nested case. Make sure we branch to the fake NMI handler
# after this handler is done.
nmi_nested_set_fake_and_return:
	movb $0xff,%__percpu_seg:need_fake_nmi
	popfq
	jmp *(%rsp)


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:37                     ` Linus Torvalds
@ 2010-07-16 19:26                       ` Avi Kivity
  2010-07-16 21:39                         ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:37 PM, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> Use kmalloc and percpu pointers, it's not that onerous.
>>      
> What people don't seem to understand is that WE SHOULD NOT MAKE NMI
> FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
> WHAT-SO-EVER TO DO WITH NMI.
>
> I'm shouting, because this point seems to have been continually
> missed. It was missed in the original patches, and it's been missed in
> the discussions.
>
> Non-NMI code should simply never have to even _think_ about NMI's. Why
> should it? It should just do whatever comes "natural" within its own
> context.
>
>    

But we're not talking about non-NMI code.  The 8k referred to in the 
original patch are buffers used by NMI stack recording.  The module-code
vmalloc_sync_all() is only needed by code that is executed during NMI, and
hence must be NMI aware.

> This is why I've been pushing for the "let's just fix NMI" approach.
> Not adding random hacks to other code sequences that have nothing
> what-so-ever to do with NMI.
>    

"fixing NMI" will result in code that is understandable by maybe three 
people after long and hard thinking.  NMI can happen in too many 
semi-defined contexts, so there will be plenty of edge cases.  I'm not 
sure we can ever trust such trickery.

> So don't add NMI code to the page fault code. Not to the debug code,
> or to the module loading code. Don't say "use special allocations
> because the NMI code may care about these particular data structures".
> Because that way lies crap and unmaintainability.
>    

If NMI code can call random hooks and access random data, yes.  But I 
don't think we're at that point yet.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:05             ` H. Peter Anvin
  2010-07-16 18:15               ` Avi Kivity
@ 2010-07-16 19:28               ` Andi Kleen
  2010-07-16 19:32                 ` Avi Kivity
  2010-08-04  9:46               ` Peter Zijlstra
  2 siblings, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-16 19:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge,
	Frank Ch. Eigler

> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily

On startup you don't have many processes.  If there's a problem
it's surely not the fault of vmalloc_sync_all().

BTW in my experience one reason module loading was traditionally slow was
that it did a stop_machine(). I think(?) that has been fixed
at some point. But even with that, it's more an issue on larger
systems.

> I really hope no one ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...

It can make sense for a backtrace in a profiler.

In fact perf is nearly doing it I believe, but moves
it to the self IPI handler in most cases.
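
The pattern is roughly this (hand-waving sketch, not perf's actual code; the
vector and the backtrace helper are made up):

	static DEFINE_PER_CPU(struct pt_regs, deferred_regs);

	void profiler_nmi(struct pt_regs *regs)
	{
		__get_cpu_var(deferred_regs) = *regs;
		/* made-up vector; its handler runs as a normal interrupt */
		apic->send_IPI_self(PROFILER_PENDING_VECTOR);
	}

	/* runs once NMIs and irqs are enabled again, so it is allowed
	 * to fault on user memory while walking the user stack */
	void profiler_pending_interrupt(void)
	{
		dump_user_backtrace(&__get_cpu_var(deferred_regs)); /* made up */
	}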

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:32                   ` Avi Kivity
@ 2010-07-16 19:29                     ` H. Peter Anvin
  2010-07-16 19:39                       ` Avi Kivity
  2010-07-16 19:32                     ` Andi Kleen
  1 sibling, 1 reply; 168+ messages in thread
From: H. Peter Anvin @ 2010-07-16 19:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 11:32 AM, Avi Kivity wrote:
> 
> How long would vmalloc_sync_all take with a few thousand mm_structs?
> 
> We share the pmds, yes?  So it's a few thousand memory accesses.  The 
> direct impact is probably negligible, compared to actually loading the 
> module from disk.  All we need is to make sure the locking doesn't slow 
> down unrelated stuff.
> 

It's not the memory accesses, it's the need to synchronize all the CPUs.

	-hpa

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:25                 ` Linus Torvalds
@ 2010-07-16 19:30                   ` Andi Kleen
  2010-07-18  9:26                     ` Avi Kivity
  0 siblings, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-16 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge,
	Frank Ch. Eigler

On Fri, Jul 16, 2010 at 11:25:19AM -0700, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <avi@redhat.com> wrote:
> >
> > I think the concern here is about an NMI handler's code running in vmalloc
> > space, or is it something else?
> 
> I think the concern was also potentially doing things like backtraces
> etc that may need access to the module data structures (I think the
> ELF headers end up all being in vmalloc space too, for example).
> 
> The whole debugging thing is also an issue. Now, I obviously am not a
> big fan of remote debuggers, but everybody tells me I'm wrong. And
> putting a breakpoint on NMI is certainly not insane if you are doing
> debugging in the first place. So it's not necessarily always about the
> page faults.

We already have infrastructure for kprobes to prevent breakpoints
on critical code (the __kprobes section). In principle kgdb/kdb
could be taught about honoring those too.
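
i.e. roughly this kind of annotation (made-up example function, just to
illustrate):

	static void __kprobes my_nmi_helper(void)
	{
		/* kprobes will refuse to place an int3 anywhere in here */
	}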

That wouldn't help for truly external JTAG debuggers, but I would assume
those generally can (should) handle any contexts anyways.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:28               ` Andi Kleen
@ 2010-07-16 19:32                 ` Avi Kivity
  2010-07-16 19:34                   ` Andi Kleen
  0 siblings, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 19:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:28 PM, Andi Kleen wrote:
>
>> I really hope no one ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>      
> It can make sense for a backtrace in a profiler.
>
> In fact perf is nearly doing it I believe, but moves
> it to the self IPI handler in most cases.
>    

Interesting, is the self IPI guaranteed to execute synchronously after 
the NMI's IRET?  Or can the core get past the IRET faster than the APIC
delivers the IPI, so we get the backtrace at the wrong place?

(and does it matter? the NMI itself is not always accurate)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:32                   ` Avi Kivity
  2010-07-16 19:29                     ` H. Peter Anvin
@ 2010-07-16 19:32                     ` Andi Kleen
  1 sibling, 0 replies; 168+ messages in thread
From: Andi Kleen @ 2010-07-16 19:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, H. Peter Anvin, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge,
	Frank Ch. Eigler

On Fri, Jul 16, 2010 at 09:32:00PM +0300, Avi Kivity wrote:
> On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote:
> >
> >>There aren't that many processes at this time (or there shouldn't be,
> >>don't know how fork-happy udev is at this stage), so the sync should be
> >>pretty fast.  In any case, we can sync only modules that contain NMI
> >>handlers.
> >USB hotplug is a use-case happening randomly after the system is up and
> >running; I'm afraid this does not fit your module loading expectations. It
> >triggers tons of events, many of which actually load modules.
> 
> How long would vmalloc_sync_all take with a few thousand mm_structs?
> 
> We share the pmds, yes?  So it's a few thousand memory accesses.
> The direct impact is probably negligible, compared to actually
> loading the module from disk.  All we need is to make sure the
> locking doesn't slow down unrelated stuff.

Also you have to remember that vmalloc_sync_all() only does something
when the top level page is actually updated. That is very rare.
(in many cases it should happen at most once per boot)
Most mapping changes update lower levels, and those are already
shared.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:32                 ` Avi Kivity
@ 2010-07-16 19:34                   ` Andi Kleen
  0 siblings, 0 replies; 168+ messages in thread
From: Andi Kleen @ 2010-07-16 19:34 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, akpm, Jeremy Fitzhardinge,
	Frank Ch. Eigler

On Fri, Jul 16, 2010 at 10:32:13PM +0300, Avi Kivity wrote:
> On 07/16/2010 10:28 PM, Andi Kleen wrote:
> >
> >>I really hope no one ever gets the idea of touching user space from an
> >>NMI handler, though, and expecting it to work...
> >It can make sense for a backtrace in a profiler.
> >
> >In fact perf is nearly doing it I believe, but moves
> >it to the self IPI handler in most cases.
> 
> Interesting, is the self IPI guaranteed to execute synchronously
> after the NMI's IRET?  Or can the core IRET faster than the APIC and
> so we get the backtrace at the wrong place?
> 
> (and does it matter? the NMI itself is not always accurate)

self ipi runs after the next STI (or POPF)

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:29                     ` H. Peter Anvin
@ 2010-07-16 19:39                       ` Avi Kivity
  0 siblings, 0 replies; 168+ messages in thread
From: Avi Kivity @ 2010-07-16 19:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:29 PM, H. Peter Anvin wrote:
> On 07/16/2010 11:32 AM, Avi Kivity wrote:
>    
>> How long would vmalloc_sync_all take with a few thousand mm_structs?
>>
>> We share the pmds, yes?  So it's a few thousand memory accesses.  The
>> direct impact is probably negligible, compared to actually loading the
>> module from disk.  All we need is to make sure the locking doesn't slow
>> down unrelated stuff.
>>
>>      
> It's not the memory accesses, it's the need to synchronize all the CPUs.
>    

I'm missing something.  Why do we need to sync all cpus?  The
vmalloc_sync_all() I'm reading doesn't.

Even if we do an on_each_cpu() somewhere, it isn't the end of the world.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:26                       ` Avi Kivity
@ 2010-07-16 21:39                         ` Linus Torvalds
  2010-07-16 22:07                           ` Andi Kleen
  2010-07-18  9:23                           ` Avi Kivity
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-16 21:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity <avi@redhat.com> wrote:
> On 07/16/2010 09:37 PM, Linus Torvalds wrote:
>>
>> Non-NMI code should simply never have to even _think_ about NMI's. Why
>> should it? It should just do whatever comes "natural" within its own
>> context.
>
> But we're not talking about non-NMI code.

Yes, we are. We're talking about breakpoints (look at the subject
line), and you are very much talking about things like that _idiotic_
vmalloc_sync_all() by module loading code etc etc.

Every _single_ "solution" I have seen - apart from my suggestion - has
been about making code "special" because some other code might run in
an NMI. Module init sequences having to do idiotic things just because
they have data structures that might get accessed by NMI.

And the thing is, if we just do NMI's correctly, and allow nesting,
ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
stupid things elsewhere.

In other words, why the hell are you arguing? Help Mathieu write the
low-level NMI handler right, and remove that idiotic
"vmalloc_sync_all()" that is fundamentally broken and should not
exist. Rather than talk about adding more of that kind of crap.

                   Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 21:39                         ` Linus Torvalds
@ 2010-07-16 22:07                           ` Andi Kleen
  2010-07-16 22:26                             ` Linus Torvalds
  2010-07-16 22:40                             ` Mathieu Desnoyers
  2010-07-18  9:23                           ` Avi Kivity
  1 sibling, 2 replies; 168+ messages in thread
From: Andi Kleen @ 2010-07-16 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge,
	Frank Ch. Eigler

> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.

One issue I have with nesting NMIs is that you need 
a nesting limit, otherwise you'll overflow the NMI stack.

We just got rid of nesting for normal interrupts because
of this stack overflow problem which hit in real situations.

In some cases you can get quite high NMI frequencies, e.g. with
performance counters. Now the current performance counter handlers
do not nest by themselves of course, but they might nest 
with other longer running NMI users.

I think none of the current handlers are likely to nest
for very long, but there's more and more NMI code added all the time,
so it's definitely a concern.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:07                           ` Andi Kleen
@ 2010-07-16 22:26                             ` Linus Torvalds
  2010-07-16 22:41                               ` Andi Kleen
  2010-07-16 22:40                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-16 22:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> One issue I have with nesting NMIs is that you need
> a nesting limit, otherwise you'll overflow the NMI stack.

Have you actually looked at the suggestion I (and now Mathieu)
suggested code for?

The nesting is very limited. NMI's would nest just once, and when that
happens, the nested NMI would never use more than something like a
hundred bytes of stack (most of which is what the CPU pushes
directly). And there would be no device interrupts that nest, and
practically the faults that nest obviously aren't going to be complex
faults either (ie the page fault would be the simple case that never
calls 'handle_mm_fault()', but handles it all in
arch/x86/mm/fault.c).

IOW, there is absolutely _no_ issues with nesting. It's two levels
deep, and a much smaller stack footprint than our regular exception
nesting for those two levels too.

And your argument that there would be more and more NMI usage only
makes it more important that we handle NMI's without going crazy. Just
handle them cleanly instead of making them something totally special.

               Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:07                           ` Andi Kleen
  2010-07-16 22:26                             ` Linus Torvalds
@ 2010-07-16 22:40                             ` Mathieu Desnoyers
  1 sibling, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 22:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Avi Kivity, H. Peter Anvin, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Andi Kleen (andi@firstfloor.org) wrote:
> > And the thing is, if we just do NMI's correctly, and allow nesting,
> > ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> > stupid things elsewhere.
> 
> One issue I have with nesting NMIs is that you need 
> a nesting limit, otherwise you'll overflow the NMI stack.
> 
> We just got rid of nesting for normal interrupts because
> of this stack overflow problem which hit in real situations.
> 
> In some cases you can get quite high NMI frequencies, e.g. with
> performance counters. Now the current performance counter handlers
> do not nest by themselves of course, but they might nest 
> with other longer running NMI users.
> 
> I think none of the current handlers are likely to nest
> for very long, but there's more and more NMI coded all the time,
> so it's definitely a concern.

We're not proposing to actually "nest" NMIs per se. We copy the stack at the
beginning of the NMI handler (and then use the copy) to permit nesting of faults
over NMI handlers. Following NMIs that would "try" to nest over the NMI handler
would see their regular execution postponed until the end of the currently
running NMI handler. It's OK for these "nested" NMI handlers to use the bottom
of NMI stack because the NMI handler on which they are trying to nest is only
using the stack copy. These "nested" handlers return to the original NMI handler
very early just after setting a "pending nmi" flag. There is more to it (e.g.
handling NMI handler exit atomically with respect to incoming NMIs); please
refer to the last assembly code snippet I sent to Linus a little earlier today
for details.

Thanks,

Mathieu


> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:26                             ` Linus Torvalds
@ 2010-07-16 22:41                               ` Andi Kleen
  2010-07-17  1:15                                 ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: Andi Kleen @ 2010-07-16 22:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 03:26:32PM -0700, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > One issue I have with nesting NMIs is that you need
> > a nesting limit, otherwise you'll overflow the NMI stack.
> 
> Have you actually looked at the suggestion I (and now Mathieu)
> suggested code for?

Maybe I'm misunderstanding everything (and there have been a lot of emails
in the thread), but the case I was thinking of would be if the second NMI 
faults too, and then another one comes in after the IRET etc.

-Andi

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:41                               ` Andi Kleen
@ 2010-07-17  1:15                                 ` Linus Torvalds
  0 siblings, 0 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-17  1:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 3:41 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Maybe I'm misunderstanding everything (and it has been a lot of emails
> in the thread), but the case I was thinking of would be if the second NMI
> faults too, and then another one comes in after the IRET etc.

No, the nested NMI cannot fault, because it never even enters C code.
It literally just returns immediately after having noticed it is
nested (and corrupted the stack of the original one, so that the
original NMI will re-do itself at return).

So the nested NMI will use some few tens of bytes of stack. In fact,
it will use the stack "above" the stack that the original NMI handler
is using, because it will reset the stack pointer back to the top of
the NMI stack. So in a very real sense, it is not even extending the
stack, it is just re-using a small part of the same stack that the
original NMI used (and that we copied away so that it doesn't matter
that it gets re-used).

As to another small but important detail: the _nested_ NMI actually
returns using "popf+ret", leaving NMI's blocked again. Thus
guaranteeing forward progress and lack of NMI storms.

To summarize:

 - the "original" (first-level) NMI can take faults (like the page
fault to fill in vmalloc pages lazily, or debug faults). That will
actually cause two stack frames (or three, if you debug a page fault
that happened while NMI was active). So there is certainly exception
nesting going on, but we're talking _much_ less stack than normal
stack usage where the nesting can be deep and in complex routines.

 - any "nested" NMI's will not actually use any more stack at all than
a non-nested one, because we've pre-reserved space for them (and we
_had_ to reserve space for them due to IST)

 - even if we get NMI's during the execution of the original NMI,
there can be only one such "spurious" NMI per nested exception. So if
we take a single page fault, that exception will re-enable NMI
(because it returns with "iret"), and as a result we may take a
_single_ new nested NMI until we disable NMI's again.

In other words, the approach is not all that different from doing
"lazy irq disable" like powerpc does for regular interrupts. For
NMI's, we do it because it's impossible (on x86) to disable NMI's
without actually taking one.

                         Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 21:39                         ` Linus Torvalds
  2010-07-16 22:07                           ` Andi Kleen
@ 2010-07-18  9:23                           ` Avi Kivity
  1 sibling, 0 replies; 168+ messages in thread
From: Avi Kivity @ 2010-07-18  9:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/17/2010 12:39 AM, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 07/16/2010 09:37 PM, Linus Torvalds wrote:
>>      
>>> Non-NMI code should simply never have to even _think_ about NMI's. Why
>>> should it? It should just do whatever comes "natural" within its own
>>> context.
>>>        
>> But we're not talking about non-NMI code.
>>      
> Yes, we are. We're talking about breakpoints (look at the subject
> line), and you are very much talking about things like that _idiotic_
> vmalloc_sync_all() by module loading code etc etc.
>    

Well, I'd put it in the nmi handler registration code, but you're 
right.  A user placing breakpoints can't even tell whether the 
breakpoint will be hit by NMI code, especially data breakpoints.

> Every _single_ "solution" I have seen - apart from my suggestion - has
> been about making code "special" because some other code might run in
> an NMI. Module init sequences having to do idiotic things just because
> they have data structures that might get accessed by NMI.
>
> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.
>
> In other words, why the hell are you arguing? Help Mathieu write the
> low-level NMI handler right, and remove that idiotic
> "vmalloc_sync_all()" that is fundamentally broken and should not
> exist. Rather than talk about adding more of that kind of crap.
>    

Well, at least we'll get a good test case for kvm's nmi blocking 
emulation (it's tricky since if we fault on an iret sometimes nmis get 
unblocked even though the instruction did not complete).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:30                   ` Andi Kleen
@ 2010-07-18  9:26                     ` Avi Kivity
  0 siblings, 0 replies; 168+ messages in thread
From: Avi Kivity @ 2010-07-18  9:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:30 PM, Andi Kleen wrote:
> We already have infrastructure for kprobes to prevent breakpoints
> on critical code (the __kprobes section). In principle kgdb/kdb
> could be taught about honoring those too.
>
>    

It doesn't help with NMI code calling other functions, or with data 
breakpoints.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15  1:23                 ` Linus Torvalds
  2010-07-15  1:45                   ` Linus Torvalds
  2010-07-15 16:44                   ` Mathieu Desnoyers
@ 2010-07-18 11:03                   ` Avi Kivity
  2010-07-18 17:36                     ` Linus Torvalds
  2 siblings, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-07-18 11:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On 07/15/2010 04:23 AM, Linus Torvalds wrote:
> On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds
> <torvalds@linux-foundation.org>  wrote:
>    
>> I think the %rip check should be pretty simple - exactly because there
>> is only a single point where the race is open between that 'mov' and
>> the 'iret'. So it's simpler than the (similar) thing we do for
>> debug/nmi stack fixup for sysenter that has to check a range.
>>      
> So this is what I think it might look like, with the %rip in place.
> And I changed the "nmi_stack_ptr" thing to have both the pointer and a
> flag - because it turns out that in the single-instruction race case,
> we actually want the old pointer.
>
> Totally untested, of course. But _something_ like this might work:
>
> #
> # Two per-cpu variables: a "are we nested" flag (one byte), and
> # a "if we're nested, what is the %rsp for the nested case".
> #
> # The reason for why we can't just clear the saved-rsp field and
> # use that as the flag is that we actually want to know the saved
> # rsp for the special case of having a nested NMI happen on the
> # final iret of the unnested case.
> #
> nmi:
> 	cmpb $0,%__percpu_seg:nmi_stack_nesting
> 	jne nmi_nested_corrupt_and_return
> 	cmpq $nmi_iret_address,0(%rsp)
> 	je nmi_might_be_nested
> 	# create new stack
> is_unnested_nmi:
> 	# Save some space for nested NMI's. The exception itself
> 	# will never use more space, but it might use less (since
> 	# it will be a kernel-kernel transition). But the nested
> 	# exception will want two save registers and a place to
> 	# save the original CS that it will corrupt
> 	subq $64,%rsp
>
> 	# copy the five words of stack info. 96 = 64 + stack
> 	# offset of ss.
> 	pushq 96(%rsp)   # ss
> 	pushq 96(%rsp)   # rsp
> 	pushq 96(%rsp)   # eflags
> 	pushq 96(%rsp)   # cs
> 	pushq 96(%rsp)   # rip
>
> 	# and set the nesting flags
> 	movq %rsp,%__percpu_seg:nmi_stack_ptr
> 	movb $0xff,%__percpu_seg:nmi_stack_nesting
>
>    

By trading off some memory, we don't need this trickery.  We can 
allocate two nmi stacks, so the code becomes:

nmi:
     cmpb $0, %__percpu_seg:nmi_stack_nesting
     je unnested_nmi
     cmpq $nmi_iret,(%rsp)
     jne unnested_nmi
     cmpw $__KERNEL_CS,8(%rsp)
     jne unnested_nmi
     popf
     retfq
unnested_nmi:
     xorq $(nmi_stack_1 ^ nmi_stack_2),%__percpu_seg:tss_nmi_ist_entry
     movb $1, __percpu_seg:nmi_stack_nesting
regular_nmi:
     ...
regular_nmi_end:
     movb $0, __percpu_seg:nmi_stack_nesting
nmi_iret:
     iretq




-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-18 11:03                   ` Avi Kivity
@ 2010-07-18 17:36                     ` Linus Torvalds
  2010-07-18 18:04                       ` Avi Kivity
  2010-07-18 18:17                       ` Linus Torvalds
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-18 17:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Sun, Jul 18, 2010 at 4:03 AM, Avi Kivity <avi@redhat.com> wrote:
>
> By trading off some memory, we don't need this trickery.  We can allocate
> two nmi stacks, so the code becomes:

I really don't think you need even that. See earlier in the discussion
about how we could just test %rsp itself. Which makes all the %rip
testing totally unnecessary, because we don't even need any flags, and
we have no races because %rsp is atomically changed with taking the
exception.

Lookie here, the %rsp comparison really isn't that hard:

  nmi:
      pushq %rax
      pushq %rdx
      movq %rsp,%rdx          # current stack top
      movq 40(%rsp),%rax   # old stack top
      xor %rax,%rdx              # same 8kB aligned area?
      shrq $13,%rdx             # ignore low 13 bits
      je it_is_a_nested_nmi   # looks nested..
  non_nested:
      ...
      ... ok, we're not nested, do normal NMI handling ...
      ...
      popq %rdx
      popq %rax
      iret

  it_is_a_nested_nmi:
      cmpw $0,48(%rsp)     # double-check that it really was a nested exception
      jne non_nested           # from user space or something..
      # this is the nested case
      # NOTE! NMI's are blocked, we don't take any exceptions etc etc
      addq $-160,%rax        # 128-byte redzone on the old stack + 4 words
      movq (%rsp),%rdx
      movq %rdx,(%rax)       # old %rdx
      movq 8(%rsp),%rdx
      movq %rdx,8(%rax)     # old %rax
      movq 32(%rsp),%rdx
      movq %rdx,16(%rax)   # old %rflags
      movq 16(%rsp),%rdx
      movq %rdx,24(%rax)   # old %rip
      movq %rax,%rsp
      popq %rdx
      popq %rax
      popf
      ret $128                     # restore %rip and %rsp

doesn't that look pretty simple?

NOTE! OBVIOUSLY TOTALLY UNTESTED!

                            Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-18 17:36                     ` Linus Torvalds
@ 2010-07-18 18:04                       ` Avi Kivity
  2010-07-18 18:22                         ` Linus Torvalds
  2010-07-18 18:17                       ` Linus Torvalds
  1 sibling, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-07-18 18:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On 07/18/2010 08:36 PM, Linus Torvalds wrote:
> On Sun, Jul 18, 2010 at 4:03 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> By trading off some memory, we don't need this trickery.  We can allocate
>> two nmi stacks, so the code becomes:
>>      
> I really don't think you need even that. See earlier in the discussion
> about how we could just test %rsp itself. Which makes all the %rip
> testing totally unnecessary, because we don't even need any flags, and
> we have no races because %rsp is atomically changed with taking the
> exception.
>
> Lookie here, the %rsp comparison really isn't that hard:
>
>    nmi:
>        pushq %rax
>        pushq %rdx
>        movq %rsp,%rdx          # current stack top
>        movq 40(%rsp),%rax   # old stack top
>        xor %rax,%rdx              # same 8kB aligned area?
>        shrq $13,%rdx             # ignore low 13 bits
>        je it_is_a_nested_nmi   # looks nested..
>
>    

...

> doesn't that look pretty simple?
>
>    

Too simple - an MCE will switch to its own stack, failing the test.  Now 
that we have correctable MCEs, that's not a good idea.

So the plain everyday sequence

   NMI
   #PF
   MCE (uncompleted)
   NMI

will fail.

Plus, even in the non-nested case, you have to copy the stack frame, or 
the nested NMI will corrupt it.  With stack switching, the nested NMI is 
allocated its own frame.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-18 17:36                     ` Linus Torvalds
  2010-07-18 18:04                       ` Avi Kivity
@ 2010-07-18 18:17                       ` Linus Torvalds
  2010-07-18 18:43                         ` Steven Rostedt
  1 sibling, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-18 18:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Sun, Jul 18, 2010 at 10:36 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Lookie here, the %rsp comparison really isn't that hard:

A few notes on that (still) untested code suggestion:

>  nmi:
>      pushq %rax
>      pushq %rdx
>      movq %rsp,%rdx          # current stack top
>      movq 40(%rsp),%rax   # old stack top
>      xor %rax,%rdx              # same 8kB aligned area?
>      shrq $13,%rdx             # ignore low 13 bits
>      je it_is_a_nested_nmi   # looks nested..
>  non_nested:
>      ...
>      ... ok, we're not nested, do normal NMI handling ...
>      ...

The non_nested case still needs to start off with moving its stack
frame to a safe area that won't be overwritten by any nesting NMI's
(note that they cannot nest at this point, since we've done nothing
that can fault). So we'd still need that

    7* pushq 48(%rsp)

which copies the five words that got pushed by hardware, and the two
register-save locations that we used for the nesting check and special
return.

After we've done those 7 pushes, we can then run code that may take a
fault. Because when the fault returns with an "iret" and re-enables
NMI's, our nesting code is ready.

So all told, we need a maximum of about 216 bytes of stack for the
nested NMI case: 56 bytes for the seven copied words, and the 160
bytes that we build up _under_ the stack pointer for the nested case.
And we need the NMI stack itself to be aligned in order for that
"ignore low bits" check to work. Although we don't actually have to do
that "xor+shr", we could do the test equally well with a "sub+unsigned
compare against stack size".
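
FWIW the two checks really are interchangeable. A stand-alone C sketch of
the arithmetic (my own illustration only, assuming an 8kB, 8kB-aligned NMI
stack, and reading the "sub + unsigned compare" variant as a mask, a
subtract and one unsigned compare against the stack size):

#include <stdbool.h>
#include <stdint.h>

#define NMI_STACK_SIZE	8192UL	/* assumed: 8kB, 8kB-aligned NMI stack */

/* "xor + shr": same 8kB-aligned area iff all bits above bit 12 match */
static bool same_stack_xor(uintptr_t cur_rsp, uintptr_t old_rsp)
{
	return ((cur_rsp ^ old_rsp) >> 13) == 0;
}

/* the alternative: compute the aligned base of the current stack, then
 * do one subtract plus one unsigned compare against the stack size */
static bool same_stack_sub(uintptr_t cur_rsp, uintptr_t old_rsp)
{
	uintptr_t base = cur_rsp & ~(NMI_STACK_SIZE - 1);

	return old_rsp - base < NMI_STACK_SIZE;
}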

Other than that, I think the extra test that we're really nested might
better be done differently:

>  it_is_a_nested_nmi:
>      cmpw $0,48(%rsp)     # double-check that it really was a nested exception
>      jne non_nested           # from user space or something..
>      # this is the nested case

It might be safer to check the saved CS rather than the saved SS on
the stack to see that we really are in kernel mode. It's possible that
somebody could load a NULL SS in user mode and then just not use the
stack - and try to make it look like they are in kernel mode for when
the NMI happens. Now, I _think_ that loading a zero SS is supposed to
trap, but checking CS is still likely to be the better test for "were
we in kernel mode". That's where the CPL is really encoded, after all.

So that "cmpw $0,48(%rsp)" is probably ok, but it would likely be
better to do it as

   testl $3,24(%rsp)
   jne non_nested

instead. That's what entry_64.S does everywhere else.

Oh, and the non-nested case obviously needs all the regular "make the
kernel state look right" code. Like the swapgs stuff etc if required.
My example code was really meant to just document the nesting
handling, not the existing stuff we already need to do with
save_paranoid etc.

And I really think it should work, but I'd again like to stress that
it's just a RFD code sequence with no testing what-so-ever etc.

                      Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-18 18:04                       ` Avi Kivity
@ 2010-07-18 18:22                         ` Linus Torvalds
  2010-07-19  7:32                           ` Avi Kivity
  0 siblings, 1 reply; 168+ messages in thread
From: Linus Torvalds @ 2010-07-18 18:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Sun, Jul 18, 2010 at 11:04 AM, Avi Kivity <avi@redhat.com> wrote:
>
> Too simple - an MCE will switch to its own stack, failing the test.  Now
> that we have correctable MCEs, that's not a good idea.

Ahh, true. And I think we do DEBUG traps with IST too.

So we do need the explicit flag over the region. Too bad. I was hoping
to handle the nested case without having to set up the percpu segment
(that whole conditional swapgs thing, which is extra painful in NMI).

And at that point, if you require the separate flag anyway, the %rsp
range test is equivalent to the %rip range test.

                Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-18 18:17                       ` Linus Torvalds
@ 2010-07-18 18:43                         ` Steven Rostedt
  2010-07-18 19:26                           ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: Steven Rostedt @ 2010-07-18 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Sun, 2010-07-18 at 11:17 -0700, Linus Torvalds wrote:

> Oh, and the non-nested case obviously needs all the regular "make the
> kernel state look right" code. Like the swapgs stuff etc if required.
> My example code was really meant to just document the nesting
> handling, not the existing stuff we already need to do with
> save_paranoid etc.
> 
> And I really think it should work, but I'd again like to stress that
> it's just a RFD code sequence with no testing what-so-ever etc.
> 

Are you sure you don't want to use Mathieu's 2/2 patch? We are fixing
the x86 problem that iret re-enables NMIs, and you don't want to touch
anything else but the NMI code. But it may be saner to just fix the
places that call iret. We can perhaps encapsulate those into a single
macro that we can get right and will be correct everywhere it is used.

The ugliest part of Mathieu's code is dealing with paravirt, but
paravirt is ugly to begin with.

Doing this prevents nested NMIs as well as all the unknowns that will
come with dealing with nested NMIs. Whereas handling all iret's should
be straightforward, although a bit more intrusive than what we would
like.

Just saying,

-- Steve



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-18 18:43                         ` Steven Rostedt
@ 2010-07-18 19:26                           ` Linus Torvalds
  0 siblings, 0 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-18 19:26 UTC (permalink / raw)
  To: rostedt
  Cc: Avi Kivity, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Sun, Jul 18, 2010 at 11:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Are you sure you don't want to use Mathieu's 2/2 patch?

Yeah, I'm pretty sure. Unless somebody can show that it's faster, I
really don't want to muck with regular iret's. Also, as shown during
the discussion, even with Mathieu's 2/2 patch, we'd _still_ need NMI
to also save cr2 etc.

So the sane thing to do is to put all the NMI crap where it belongs.
NMI's need to know about the fact that them taking exceptions is
special. That whole "vmalloc_sync_all()" is simply pure brokenness.

In other words, it is _not_ just about 'iret' fixup. It's a bigger
thing. NMI's are special, and we don't want to spread that specialness
around.

                  Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-18 18:22                         ` Linus Torvalds
@ 2010-07-19  7:32                           ` Avi Kivity
  0 siblings, 0 replies; 168+ messages in thread
From: Avi Kivity @ 2010-07-19  7:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On 07/18/2010 09:22 PM, Linus Torvalds wrote:
> On Sun, Jul 18, 2010 at 11:04 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> Too simple - an MCE will switch to its own stack, failing the test.  Now
>> that we have correctable MCEs, that's not a good idea.
>>      
> Ahh, true. And I think we do DEBUG traps with IST too.
>
> So we do need the explicit flag over the region. Too bad. I was hoping
> to handle the nested case without having to set up the percpu segment
> (that whole conditional swapgs thing, which is extra painful in NMI).
>    

Well, we have to do that anyway for the non-nested case.  So we just do 
it before checking whether we're nested or not, and undo it on the popf; 
retf path.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-07-15 16:26                           ` Mathieu Desnoyers
@ 2010-08-03 17:18                             ` Peter Zijlstra
  2010-08-03 18:25                               ` Mathieu Desnoyers
  2010-08-03 18:56                               ` Linus Torvalds
  0 siblings, 2 replies; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-03 17:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:

> I was more thinking along the lines of making sure a ring buffer has the proper
> support for your use-case. It shares a lot of requirements with a standard ring
> buffer:
> 
> - Need to be lock-less
> - Need to reserve space, write data in a buffer
> 
> By configuring a ring buffer with 4k sub-buffer size (that's configurable
> dynamically), 

FWIW I really utterly detest the whole concept of sub-buffers.

> all we need to add is the ability to squash a previously saved
> record from the buffer. I am confident we can provide a clean API for this that
> would allow discard of previously committed entry as long as we are still on the
> same non-fully-committed sub-buffer. This fits your use-case exactly, so that's
> fine.

squash? truncate you mean? So we can allocate/reserve the largest
possible event size and write the actual event and then truncate to the
actually used size?

I really dislike how that will end up with huge holes in the buffer when
you get nested events.

Also, I think you're forgetting that doing the stack unwind is a very
costly pointer chase; adding a simple linear copy really doesn't seem
like a problem.

Additionally, if you have multiple consumers you can simply copy the
stacktrace again, avoiding the whole pointer chase exercise. While you
could conceivably copy from one ringbuffer into another, that will result
in very nasty serialization issues.

> You could have one 4k ring buffer per cpu per execution context. 

Why?

>  I wonder if
> each Linux architecture has support for separate thread vs softirq vs irq vs
> nmi stacks ? 

Why would that be relevant? We can have NMI inside IRQ inside soft-IRQ
inside task context in general (dismissing the nested IRQ mess). You
don't need to have a separate stack for each context in order to nest
them.

> Even then, given you have only one stack for all shared irqs, you
> need something that is concurrency-aware at the ring buffer level.

I'm failing to see your point.

> These small stack-like ring buffers could be used to save your temporary stack
> copy. When you really need to save it to a larger ring buffer along with a
> trace, then you just take a snapshot of the stack ring buffers.

OK, why? Your proposal includes the exact same extra copy but introduces
a ton of extra code to effect the same, not a win.

> So you get to use one single ring buffer synchronization and memory allocation
> mechanism, that everyone has reviewed. The advantage is that we would not be
> having this nmi race discussion in the first place: the generic ring buffer uses
> "get page" directly rather than relying on vmalloc, because these bugs have
> already been identified and dealt with years ago.

That's like saying don't use percpu_alloc() but open-code the thing
using kmalloc()/get_pages().. I really don't see any merit in that.



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 17:18                             ` Peter Zijlstra
@ 2010-08-03 18:25                               ` Mathieu Desnoyers
  2010-08-04  6:46                                 ` Peter Zijlstra
  2010-08-03 18:56                               ` Linus Torvalds
  1 sibling, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-03 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
> 
> > I was more thinking along the lines of making sure a ring buffer has the proper
> > support for your use-case. It shares a lot of requirements with a standard ring
> > buffer:
> > 
> > - Need to be lock-less
> > - Need to reserve space, write data in a buffer
> > 
> > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > dynamically), 
> 
> FWIW I really utterly detest the whole concept of sub-buffers.

This reluctance to split a buffer into sub-buffers might help explain the poor
performance experienced with the Perf ring buffer. These "sub-buffers" are
really nothing new: they are called "periods" in the audio world. They help
lower the ring buffer performance overhead because:

1) They allow writing into the ring buffer without SMP-safe synchronization
primitives and memory barriers for each record. Synchronization is only needed
across sub-buffer boundaries, which amortizes the cost over a large number of
events.

2) They are much more splice (and, in general, page-exchange) friendly, because
records written after a synchronization point start at the beginning of a page.
This removes the need for extra copies.
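
To make that concrete, here is a deliberately simplified, single-producer C
sketch (my own illustration, not LTTng or ftrace code) of a reserve path
where the per-record cost is a couple of additions, and the heavier consumer
hand-off only happens when a record would cross a sub-buffer boundary:

#include <stddef.h>
#include <stdint.h>

#define SUBBUF_SIZE	4096u		/* assumed sub-buffer ("period") size */
#define NR_SUBBUFS	8u
#define BUF_SIZE	(SUBBUF_SIZE * NR_SUBBUFS)

struct ring_buffer {
	uint8_t	 data[BUF_SIZE];
	uint32_t write_off;		/* free-running write offset */
};

/* Rare path: hand a completed sub-buffer to the consumer. The memory
 * barriers and wakeups of a real implementation would live here. */
static void deliver_subbuf(struct ring_buffer *rb, uint32_t subbuf_idx)
{
	(void)rb;
	(void)subbuf_idx;
}

/* Reserve 'len' bytes. Records never span a sub-buffer boundary: the
 * tail of a sub-buffer is left as padding, so records written after a
 * synchronization point start at the beginning of the next sub-buffer. */
static void *reserve(struct ring_buffer *rb, uint32_t len)
{
	uint32_t off = rb->write_off;
	uint32_t in_subbuf = off & (SUBBUF_SIZE - 1);

	if (len == 0 || len > SUBBUF_SIZE)
		return NULL;		/* record cannot fit at all */

	if (in_subbuf + len > SUBBUF_SIZE) {
		deliver_subbuf(rb, (off / SUBBUF_SIZE) % NR_SUBBUFS);
		off += SUBBUF_SIZE - in_subbuf;	/* skip the padding */
	}
	rb->write_off = off + len;
	return &rb->data[off & (BUF_SIZE - 1)];
}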

So I have to ask: do you detest the sub-buffer concept only because you are tied
to the current Perf userspace ABI which cannot support this without an ABI
change ?

I'm trying to help out here, but it does not make the task easy if we have both
hands tied behind our back because we have to keep backward ABI compatibility for a
tool (perf) forever, even considering its sources are shipped with the kernel.

> 
> > all we need to add is the ability to squash a previously saved
> > record from the buffer. I am confident we can provide a clean API for this that
> > would allow discard of previously committed entry as long as we are still on the
> > same non-fully-committed sub-buffer. This fits your use-case exactly, so that's
> > fine.
> 
> squash? truncate you mean? So we can allocate/reserve the largest
> possible event size and write the actual event and then truncate to the
> actually used size?

Nope. I'm thinking that we can use a buffer just to save the stack as we call
functions and return, e.g.

call X -> reserve space to save "X" and arguments.
call Y -> same for Y.
call Z -> same for Z.
return -> discard event for Z.
return -> discard event for Y.

if we grab the buffer content at that point, then we have X and its arguments,
which is the function currently executed. That would require the ability to
uncommit and unreserve an event, which is not a problem as long as we have not
committed a full sub-buffer.
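
A rough C sketch of what I mean (purely illustrative, hypothetical names,
fixed-size records to keep it short):

#include <stdint.h>
#include <string.h>

#define STACKBUF_SIZE	4096u	/* one small buffer per cpu, per execution context */

struct call_record {
	uintptr_t func;			/* "X" */
	uintptr_t arg;			/* one argument, for illustration */
};

struct callstack_buf {
	uint8_t  data[STACKBUF_SIZE];
	uint32_t top;			/* bytes currently reserved */
};

/* call X -> reserve space and save "X" and its argument */
static int trace_call(struct callstack_buf *b, uintptr_t func, uintptr_t arg)
{
	struct call_record rec = { .func = func, .arg = arg };

	if (b->top + sizeof(rec) > STACKBUF_SIZE)
		return -1;		/* full: drop */
	memcpy(&b->data[b->top], &rec, sizeof(rec));
	b->top += sizeof(rec);
	return 0;
}

/* return -> discard ("uncommit + unreserve") the matching record */
static void trace_return(struct callstack_buf *b)
{
	if (b->top >= sizeof(struct call_record))
		b->top -= sizeof(struct call_record);
	/* else: nothing left to discard, e.g. right after a snapshot forced
	 * a sub-buffer switch; a real implementation would log a small
	 * "function return" record here instead */
}

Grabbing the current call stack is then just a linear copy of data[0..top),
with no pointer chasing involved.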

> 
> I really dislike how that will end up with huge holes in the buffer when
> you get nested events.

This buffer only works like a stack. I don't think your comment applies.

> 
> Also, I think you're forgetting that doing the stack unwind is a very
> costly pointer chase, adding a simple linear copy really doesn't seem
> like a problem.

I thought that this buffer was chasing the function entry/exits rather than
doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
about his use-case ?

> 
> Additionally, if you have multiple consumers you can simply copy the
> stacktrace again, avoiding the whole pointer chase exercise. While you
> could conceivably copy from one ringbuffer into another that will result
> in very nasty serialization issues.

Assuming Frederic is saving information to this stack-like ring buffer at each
function entry and discarding at each function return, then we don't have the
pointer chase.

What I am proposing does not even involve a copy: when we want to take a
snapshot, we just have to force a sub-buffer switch on the ring buffer. The
"returns" happening at the beginning of the next (empty) sub-buffer would
clearly fail to discard records (expecting non-existing entry records). We would
then have to save a small record saying that a function return occurred. The
current stack frame at the end of the next sub-buffer could be deduced from the
complete collection of stack frame samples.

> 
> > You could have one 4k ring buffer per cpu per execution context. 
> 
> Why?

This seems to fit what Frederic described he needed: he uses one separate buffer
per cpu per execution context at the moment. But we could arguably save
all this stack-shaped information in per-cpu buffers.

> 
> >  I wonder if
> > each Linux architecture has support for separate thread vs softirq vs irq vs
> > nmi stacks ? 
> 
> Why would that be relevant? We can have NMI inside IRQ inside soft-IRQ
> inside task context in general (dismissing the nested IRQ mess). You
> don't need to have a separate stack for each context in order to nest
> them.

I was asking this because Frederic seems to rely on having separate buffers per
cpu and per execution context to deal with concurrency (so not expecting
concurrency from interrupts or NMIs when writing into the softirq per-cpu stack
buffer).

> 
> > Even then, given you have only one stack for all shared irqs, you
> > need something that is concurrency-aware at the ring buffer level.
> 
> I'm failing to see your point.

My point is that we might need to expect concurrency from local execution
contexts (e.g. interrupts nested over other interrupt handlers) in the design of
this stack-like ring buffer. I'm not sure Frederic's approach of using one
buffer per execution context per cpu makes sense for all cases. The memory vs
context isolation trade-off seems rather odd if we have to create e.g. one
buffer per IRQ number.

> 
> > These small stack-like ring buffers could be used to save your temporary stack
> > copy. When you really need to save it to a larger ring buffer along with a
> > trace, then you just take a snapshot of the stack ring buffers.
> 
> OK, why? Your proposal includes the exact same extra copy but introduces
> a ton of extra code to effect the same, not a win.

Please refer to the "no extra copy" solution I explain in the reply here (see
above). I did not want to go into too much detail regarding performance
optimization in the initial mail to Frederic, as these things can be done
incrementally. But given that you insist... :)

> 
> > So you get to use one single ring buffer synchronization and memory allocation
> > mechanism, that everyone has reviewed. The advantage is that we would not be
> > having this nmi race discussion in the first place: the generic ring buffer uses
> > "get page" directly rather than relying on vmalloc, because these bugs have
> > already been identified and dealt with years ago.
> 
> That's like saying don't use percpu_alloc() but open-code the thing
> using kmalloc()/get_pages().. I really don't see any merit in that.

I'm not saying "open-code this". I'm saying "use a specialized library that does
this get_pages() allocation and execution context synchronization for you", so
we stop the code duplication madness.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 17:18                             ` Peter Zijlstra
  2010-08-03 18:25                               ` Mathieu Desnoyers
@ 2010-08-03 18:56                               ` Linus Torvalds
  2010-08-03 19:45                                 ` Mathieu Desnoyers
                                                   ` (2 more replies)
  1 sibling, 3 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-08-03 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> FWIW I really utterly detest the whole concept of sub-buffers.

I'm not quite sure why. Is it something fundamental, or just an
implementation issue?

One thing that I think could easily make sense in a _lot_ of buffering
areas is the notion of a "continuation" buffer. We know we have cases
where we want to attach a lot of data to one particular event, but the
buffering itself is inevitably always going to have some limits on
atomicity etc. And quite often, the event that _generates_ the data is
not necessarily going to have all that data in one contiguous region,
and doing a scatter-gather memcpy to get it that way is not good
either.

At the same time, I do _not_ believe that the kernel ring-buffer code
should handle pointers to sub-buffers etc, or worry about iovec-like
arrays of smaller ranges. So if _that_ is what you mean by "concept of
sub-buffers", then I agree with you.

But what I do think might make a lot of sense is to allow buffer
fragments, and just teach user space to do de-fragmentation. Where it
would be important that the de-fragmentation really is all in user
space, and not really ever visible to the ring-buffer implementation
itself (and there would not, for example, be any guarantees that the
fragments would be contiguous - there could be other events in the
buffer in between fragments).  Maybe we could even say that fragments
might be across different CPU ring-buffers, and user-space needs to
sort it out if it wants to (where "sort it out" literally would mean
having to sort and re-attach them in the right order, since there
wouldn't be any ordering between them).

From a kernel perspective, the only thing you need for fragment
handling would be to have a buffer entry that just says "I'm fragment
number X of event ID Y". Nothing more. Everything else would be up to
the parser in user space to work out.
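
Purely as an illustration of what such an entry could look like (made-up
names and layout, not an actual perf ABI proposal), in C:

#include <stdint.h>

/* A buffer entry that just says "I'm fragment number X of event ID Y". */
struct frag_event {
	uint64_t event_id;	/* cookie shared by all fragments of one event */
	uint16_t frag_nr;	/* 0, 1, 2, ... within that event */
	uint16_t flags;		/* e.g. a "last fragment" bit */
	uint32_t len;		/* payload bytes following this header */
	/* payload follows */
};

/*
 * User space then collects entries by event_id, sorts them by frag_nr and
 * concatenates the payloads. Fragments may be interleaved with unrelated
 * events, or even live in different per-CPU buffers, so the reassembly is
 * entirely the parser's problem.
 */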

In other words - if you have something like the current situation,
where you want to save a whole back-trace, INSTEAD of allocating a
large max-sized buffer for it and "linearizing" the back-trace in
order to then create a backtrace ring event, maybe we could just fill
the ring buffer with lots of small fragments, and do the whole
linearizing in the code that reads it in user space. No temporary
allocations in kernel space at all, no memcpy, let user space sort it
out. Each stack level would just add its own event, and increment the
fragment count it uses.

It's going to be a fairly rare case, so some user space parsers might
just decide to ignore fragmented packets, because they know they
aren't interested in such "complex" events.

I dunno. This thread has kind of devolved into many different details,
and I reacted to just one very small fragment of it. Maybe not even a
very interesting fragment.

               Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 18:56                               ` Linus Torvalds
@ 2010-08-03 19:45                                 ` Mathieu Desnoyers
  2010-08-03 20:02                                   ` Linus Torvalds
  2010-08-04  6:27                                 ` Peter Zijlstra
  2010-08-04  6:46                                 ` Dave Chinner
  2 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-03 19:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
> 
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?

The real issue here, IMHO, is that Perf has tied gory ring buffer implementation
details to the userspace perf ABI, and there is now strong unwillingness from
Perf developers to break this ABI.

About the sub-buffer definition: it only means that a buffer is split into
many regions. Their boundaries are synchronization points between the data
producer and consumer. This involves padding the end of regions when records do
not fit in the remaining space.

I think that the problem lies in that Peter wants all his ring-buffer data to be
side-to-side, without padding. He needs this because the perf ABI, presented to
the user-space perf program, requires this: every implementation detail is
exposed to user-space through a mmap'd memory region (yeah, even the control
data is touched by both the kernel and userland through that shared page).

When Perf was initially proposed, I thought that because the perf user-space
tool is shipped along with the kernel sources, we could change the ABI easily
afterward, but Peter seems to disagree and wants it to stay as it is for
backward compatibility and to avoid offending contributors. If I had known
this when the ABI first came in, I would surely have nack'd it.

Now we are stuck with this ABI, which exposes every tiny ring buffer
implementation detail to userspace and simply kills any future enhancement.

Thanks,

Mathieu

P.S.: I'm holding back reply to the rest of your email to increase focus on the
fundamental perf ABI problem.

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 19:45                                 ` Mathieu Desnoyers
@ 2010-08-03 20:02                                   ` Linus Torvalds
  2010-08-03 20:10                                     ` Ingo Molnar
  2010-08-03 20:54                                     ` Mathieu Desnoyers
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-08-03 20:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> The real issue here, IMHO, is that Perf has tied gory ring buffer implementation
> details to the userspace perf ABI, and there is now strong unwillingness from
> Perf developers to break this ABI.

The thing is - I think my outlined buffer fragmentation model would
work fine with the perf ABI too.  Exactly because there is no deep
structure, just the same "stream of small events" both from a kernel
and a user model standpoint. Sure, the stream would now contain a new
event type, but that's trivial. It would still be _entirely_
reasonable to have the actual data in the exact same ring buffer,
including the whole mmap'ed area.

Of course, when user space actually parses it, user space would have
to eventually defragment the event by allocating a new area and
copying the fragments together in the right order, but that's pretty
trivial to do. It certainly doesn't affect the current mmap'ed
interface in the least.

Now, whether the perf people feel they want that kind of
functionality, I don't know. It's possible that they simply do not
want to handle events that are complex enough that they would have
arbitrary size.

                   Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 20:02                                   ` Linus Torvalds
@ 2010-08-03 20:10                                     ` Ingo Molnar
  2010-08-03 20:21                                       ` Ingo Molnar
  2010-08-03 20:54                                     ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Ingo Molnar @ 2010-08-03 20:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Peter Zijlstra, Frederic Weisbecker, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > The real issue here, IMHO, is that Perf has tied gory ring buffer 
> > implementation details to the userspace perf ABI, and there is now strong 
> > unwillingness from Perf developers to break this ABI.

(Wrong.)

> The thing is - I think my outlined buffer fragmentation model would work 
> fine with the perf ABI too.  Exactly because there is no deep structure, 
> just the same "stream of small events" both from a kernel and a user model 
> standpoint. Sure, the stream would now contain a new event type, but that's 
> trivial. It would still be _entirely_ reasonable to have the actual data in 
> the exact same ring buffer, including the whole mmap'ed area.

Yeah.

> Of course, when user space actually parses it, user space would have to 
> eventually defragment the event by allocating a new area and copying the 
> fragments together in the right order, but that's pretty trivial to do. It 
> certainly doesn't affect the current mmap'ed interface in the least.
> 
> Now, whether the perf people feel they want that kind of functionality, I 
> don't know. It's possible that they simply do not want to handle events that 
> are complex enough that they would have arbitrary size.

Looks useful. There's a steady trickle of new events and we already use type 
encapsulation for things like trace events - which are only made sense of 
later on in user-space.

We may want to add things like a NOP event to pad out the end of a page

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 20:10                                     ` Ingo Molnar
@ 2010-08-03 20:21                                       ` Ingo Molnar
  2010-08-03 21:16                                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Ingo Molnar @ 2010-08-03 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Peter Zijlstra, Frederic Weisbecker, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > >
> > > The real issue here, IMHO, is that Perf has tied gory ring buffer 
> > > implementation details to the userspace perf ABI, and there is now strong 
> > > unwillingness from Perf developers to break this ABI.
> 
> (Wrong.)
> 
> > The thing is - I think my outlined buffer fragmentation model would work 
> > fine with the perf ABI too.  Exactly because there is no deep structure, 
> > just the same "stream of small events" both from a kernel and a user model 
> > standpoint. Sure, the stream would now contain a new event type, but that's 
> > trivial. It would still be _entirely_ reasonable to have the actual data in 
> > the exact same ring buffer, including the whole mmap'ed area.
> 
> Yeah.
> 
> > Of course, when user space actually parses it, user space would have to 
> > eventually defragment the event by allocating a new area and copying the 
> > fragments together in the right order, but that's pretty trivial to do. It 
> > certainly doesn't affect the current mmap'ed interface in the least.
> > 
> > Now, whether the perf people feel they want that kind of functionality, I 
> > don't know. It's possible that they simply do not want to handle events that 
> > are complex enough that they would have arbitrary size.
> 
> Looks useful. There's a steady trickle of new events and we already use type 
> encapsulation for things like trace events - which are only made sense of 
> later on in user-space.
> 
> > We may want to add things like a NOP event to pad out the end of a page

/me once again experiences the subtle difference between 'Y' and 'N' when postponing a mail

So adding fragments would be possible as well. We've got the space for such 
extensions in the ABI and the basic model of streaming information is not 
affected.

[ The control structure of the mmap area is there for performance/wakeup 
  optimizations (and to allow the kernel to lose information on producer 
  overload, while still giving user-space an idea that we lost data and how 
  much) - it does not affect semantics and does not limit us. ]

So there's no design limitation - Peter simply prefers one possible solution 
over another and outlined his reasons - we should hash that out based on the 
technical arguments.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 20:02                                   ` Linus Torvalds
  2010-08-03 20:10                                     ` Ingo Molnar
@ 2010-08-03 20:54                                     ` Mathieu Desnoyers
  1 sibling, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-03 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > The real issue here, IMHO, is that Perf has tied gory ring buffer implementation
> > details to the userspace perf ABI, and there is now strong unwillingness from
> > Perf developers to break this ABI.
> 
> The thing is - I think my outlined buffer fragmentation model would
> work fine with the perf ABI too.  Exactly because there is no deep
> structure, just the same "stream of small events" both from a kernel
> and a user model standpoint. Sure, the stream would now contain a new
> event type, but that's trivial. It would still be _entirely_
> reasonable to have the actual data in the exact same ring buffer,
> including the whole mmap'ed area.

Yes, indeed. Your scheme (using a "cookie" to identify multiple related events,
each of them being the "continuation" of the previous event with the same
cookie) would work on top of basically all ring buffer implementations. We
already use something similar to follow socket buffers and block device buffers
across the kernel in LTTng.

> 
> Of course, when user space actually parses it, user space would have
> to eventually defragment the event by allocating a new area and
> copying the fragments together in the right order, but that's pretty
> trivial to do. It certainly doesn't affect the current mmap'ed
> interface in the least.
> 
> Now, whether the perf people feel they want that kind of
> functionality, I don't know. It's possible that they simply do not
> want to handle events that are complex enough that they would have
> arbitrary size.

I agree. Although I think the scheme you propose can sit on top of the ring
buffer and does not necessarily need to be at the bottom layer. The sub-buffer
disagreement Peter and I have is related to the ring buffer core.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 20:21                                       ` Ingo Molnar
@ 2010-08-03 21:16                                         ` Mathieu Desnoyers
  0 siblings, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-03 21:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Frederic Weisbecker, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
> > > <mathieu.desnoyers@efficios.com> wrote:
> > > >
> > > > The real issue here, IMHO, is that Perf has tied gory ring buffer 
> > > > implementation details to the userspace perf ABI, and there is now strong 
> > > > unwillingness from Perf developers to break this ABI.
> > 
> > (Wrong.)

I am glad to hear this. So should I understand that if we show that the current
perf ABI imposes significant design constraints and results in poor performance
and inability to support flight recorder mode (which is needed to unify the ring
buffers), we can deprecate the ABI ?

[...]


> > We may want to add things like a NOP event to pad out the end of a page

Or simply write the page (or sub-buffer) size information in a page (or
sub-buffer) header. The gain here is that by doing so we don't have to reserve
an event ID for the NOP event, which adds one extra ID reserved in _each_ event
header. You might be tempted to say "oh, it's just a single value, who cares ?",
but with the amount of data we're moving, being able to represent the event
header in a very small number of bits really makes a difference. Bloat creeps in
one single bit at a time until we stop caring about adding whole integers,
and by then the game was lost long ago: performance suffers deeply.

The huge size of the perf event headers is another factor that might explain its
poor performance by the way.
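
As a sketch of what I mean (illustrative layout only, field widths picked
for the example, not LTTng's actual headers):

#include <stdint.h>

/* Per sub-buffer header: the consumer stops reading at 'data_size', so
 * the unused tail of the sub-buffer never needs to be described by a
 * NOP/padding event with its own reserved event ID. */
struct subbuf_header {
	uint64_t timestamp_begin;
	uint32_t data_size;	/* bytes of real records in this sub-buffer */
};

/* The per-event header can then spend its bits on real event IDs and a
 * compressed timestamp only. */
struct event_header {
	unsigned int id : 5;		/* small IDs; larger ones would need an
					 * extended header */
	unsigned int tsc_delta : 27;	/* timestamp delta since sub-buffer start */
};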

[...]

> [ The control structure of the mmap area is there for performance/wakeup 
>   optimizations

I am doubtful about an "optimization" that affects what should be a slow path:
user-space wakeup for delivering multiple events at once. Have you checked if
this leads to an actual noticeable performance increase at all ?

>                 (and to allow the kernel to lose information on producer 
>   overload, while still giving user-space an idea that we lost data and how 
>   much)

This can be performed with a standard system call rather than playing games
with a shared page into which both the kernel and user-space write. The
advantage is that by letting user-space calling the kernel (rather than just
writing "I'm done" in that page by updating the consumer value), we can let the
kernel perform tasks that might enable us to implement flight recorder mode all
within the same ring buffer implementation.

>                  - it does not affect semantics and does not limit us. ]

Well, so far, the main limitation I can see is that it does not allow us to do
flight recorder tracing (a.k.a. overwrite mode).

> 
> So there's no design limitation - Peter simply prefers one possible solution 
> over another and outlined his reasons - we should hash that out based on the 
> technical arguments.

Another argument I've seen from Peter is that he prefers the perf
kernel-userspace interaction to happen through this shared page to diminish the
number of traced events generated by perf activity. But I find this argument
unconvincing, because it really only applies to system call tracing: the rest of
tracing will be affected by the perf user-space process activity. So we might as
well just bite the bullet and accept that the trace is "polluted" by user-space
perf events. It _is_ using up CPU time anyway, so I think it's actually _better_
to know about it, rather than to try to hide the tracer activity. If one really
wants to filter out the tracer activity, it can be done at post-processing
time without problems. But at least the information is there.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 18:56                               ` Linus Torvalds
  2010-08-03 19:45                                 ` Mathieu Desnoyers
@ 2010-08-04  6:27                                 ` Peter Zijlstra
  2010-08-04 14:06                                   ` Mathieu Desnoyers
  2010-08-11 14:34                                   ` Steven Rostedt
  2010-08-04  6:46                                 ` Dave Chinner
  2 siblings, 2 replies; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-04  6:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
> 
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?

The sub-buffer thing that both ftrace and lttng have is creating a large
buffer from a lot of small buffers; I simply don't see the point of
doing that. It adds complexity and limitations for very little gain.

Their benefit is known synchronization points into the stream, so you can
parse each sub-buffer independently. But you can always break up a
continuous stream into smaller parts or use a transport that includes
index points or whatever.

Their downside is that you can never have individual events larger than
the sub-buffer, and you need to be aware of the sub-buffer when reserving
space, etc.



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 18:56                               ` Linus Torvalds
  2010-08-03 19:45                                 ` Mathieu Desnoyers
  2010-08-04  6:27                                 ` Peter Zijlstra
@ 2010-08-04  6:46                                 ` Dave Chinner
  2010-08-04  7:21                                   ` Ingo Molnar
  2 siblings, 1 reply; 168+ messages in thread
From: Dave Chinner @ 2010-08-04  6:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mathieu Desnoyers, Frederic Weisbecker,
	Ingo Molnar, LKML, Andrew Morton, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

On Tue, Aug 03, 2010 at 11:56:11AM -0700, Linus Torvalds wrote:
> On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > FWIW I really utterly detest the whole concept of sub-buffers.
> 
> I'm not quite sure why. Is it something fundamental, or just an
> implementation issue?
> 
> One thing that I think could easily make sense in a _lot_ of buffering
> areas is the notion of a "continuation" buffer. We know we have cases
> where we want to attach a lot of data to one particular event, but the
> buffering itself is inevitably always going to have some limits on
> atomicity etc. And quite often, the event that _generates_ the data is
> not necessarily going to have all that data in one contiguous region,
> and doing a scatter-gather memcpy to get it that way is not good
> either.
> 
> At the same time, I do _not_ believe that the kernel ring-buffer code
> should handle pointers to sub-buffers etc, or worry about iovec-like
> arrays of smaller ranges. So if _that_ is what you mean by "concept of
> sub-buffers", then I agree with you.
> 
> But what I do think might make a lot of sense is to allow buffer
> fragments, and just teach user space to do de-fragmentation. Where it
> would be important that the de-fragmentation really is all in user
> space, and not really ever visible to the ring-buffer implementation
> itself (and there would not, for example, be any guarantees that the
> fragments would be contiguous - there could be other events in the
> buffer in between fragments).  Maybe we could even say that fragments
> might be across different CPU ring-buffers, and user-space needs to
> sort it out if it wants to (where "sort it out" literally would mean
> having to sort and re-attach them in the right order, since there
> wouldn't be any ordering between them).
> 
> From a kernel perspective, the only thing you need for fragment
> handling would be to have a buffer entry that just says "I'm fragment
> number X of event ID Y". Nothing more. Everything else would be up to
> the parser in user space to work out.

Heh. For a moment there I thought you were describing the way
XFS writes transactions into its log. Replace "CPU ring-buffers"
with "in-core log buffers", "userspace parsing" with "log recovery"
and "event ID" with "transaction ID", and the concept you describe
is eerily similar. That includes the fact that transactions are not
contiguous in the log, can interleave fragments between concurrent
transaction commits and they can span multiple log buffers, too. It
works pretty well for scaling concurrent writers....

I'll get back in my box now ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-03 18:25                               ` Mathieu Desnoyers
@ 2010-08-04  6:46                                 ` Peter Zijlstra
  2010-08-04  7:14                                   ` Ingo Molnar
  2010-08-04 14:45                                   ` Mathieu Desnoyers
  0 siblings, 2 replies; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-04  6:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Tue, 2010-08-03 at 14:25 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
> > 
> > > I was more thinking along the lines of making sure a ring buffer has the proper
> > > support for your use-case. It shares a lot of requirements with a standard ring
> > > buffer:
> > > 
> > > - Need to be lock-less
> > > - Need to reserve space, write data in a buffer
> > > 
> > > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > > dynamically), 
> > 
> > FWIW I really utterly detest the whole concept of sub-buffers.
> 
This reluctance against splitting a buffer into sub-buffers might help to
> explain the poor performance experienced with the Perf ring buffer.

That's just unsubstantiated FUD.

>  These
> "sub-buffers" are really nothing new: these are called "periods" in the audio
> world. They help lowering the ring buffer performance overhead because:
> 
> 1) They allow writing into the ring buffer without SMP-safe synchronization
> primitives and memory barriers for each record. Synchronization is only needed
> across sub-buffer boundaries, which amortizes the cost over a large number of
> events.

The only SMP barrier we (should) have is when we update the user visible
head pointer. The buffer code itself uses local{,64}_t for all other
atomic ops.

If you want to amortize that barrier, simply hold off the head update
for a while, no need to introduce sub-buffers.
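As a rough sketch of that amortization (hypothetical fields, not perf's
actual code): keep a writer-private position and only publish the
user-visible head, with the one barrier, every so many bytes. A real
version would also publish on flush/unmap and handle reader wakeups.

/* Sketch: amortize the smp_wmb() by publishing the user-visible head
 * only every PUBLISH_EVERY bytes.  Hypothetical structure, single writer. */
#define PUBLISH_EVERY	4096

struct sketch_rb {
	unsigned long	local_head;	/* writer-private position */
	unsigned long	user_head;	/* position visible to user space */
	unsigned long	published;	/* local_head at the last publish */
};

static void sketch_commit(struct sketch_rb *rb, unsigned long len)
{
	rb->local_head += len;

	if (rb->local_head - rb->published >= PUBLISH_EVERY) {
		smp_wmb();		/* order the data before the head update */
		rb->user_head = rb->local_head;
		rb->published = rb->local_head;
	}
}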

> 2) They are much more splice (and, in general, page-exchange) friendly, because
> records written after a synchronization point start at the beginning of a page.
> This removes the need for extra copies.

This just doesn't make any sense at all, I could splice full pages just
fine, splice keeps page order so these synchronization points aren't
critical in any way.

The only problem I have with splice atm is that we don't have a buffer
interface without mmap() and we cannot splice pages out from under
mmap() on all architectures in a sane manner.

> So I have to ask: do you detest the sub-buffer concept only because you are tied
> to the current Perf userspace ABI which cannot support this without an ABI
> change ?

No because I don't see the point.

> I'm trying to help out here, but it does not make the task easy if we have both
> hands tied in our back because we have to keep backward ABI compatibility for a
> tool (perf) forever, even considering its sources are shipped with the kernel.

Dude, it's a published user<->kernel ABI, also you're not saying why you
would want to break it. In your other email you allude to things like
flight recorder mode, that could be done with the current set-up, no
need to break the ABI at all. All you need to do is track the tail
pointer and publish it.

> Nope. I'm thinking that we can use a buffer just to save the stack as we call
> functions and return, e.g.

We don't have a callback on function entry, and I'm not going to use
mcount for that, that's simply insane.

> call X -> reserve space to save "X" and arguments.
> call Y -> same for Y.
> call Z -> same for Z.
> return -> discard event for Z.
> return -> discard event for Y.
> 
> if we grab the buffer content at that point, then we have X and its arguments,
> which is the function currently executed. That would require the ability to
> uncommit and unreserve an event, which is not a problem as long as we have not
> committed a full sub-buffer.

Again, I'm not really seeing the point of using sub-buffers at all.

Also, what happens when we write an event after Y? Then the discard must
fail or turn Y into a NOP, leaving a hole in the buffer.

> I thought that this buffer was chasing the function entry/exits rather than
> doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
> about his use-case ?

No, it's a pure stack unwind from NMI context. When we get an event (PMI,
tracepoint, whatever) we write out the event, and if the consumer asked for a
stacktrace with each event, we unwind the stack for him.

> > Additionally, if you have multiple consumers you can simply copy the
> > stacktrace again, avoiding the whole pointer chase exercise. While you
> > could conceivably copy from one ringbuffer into another that will result
> > in very nasty serialization issues.
> 
> Assuming Frederic is saving information to this stack-like ring buffer at each
> function entry and discarding at each function return, then we don't have the
> pointer chase.
> 
> What I am proposing does not even involve a copy: when we want to take a
> snapshot, we just have to force a sub-buffer switch on the ring buffer. The
> "returns" happening at the beginning of the next (empty) sub-buffer would
> clearly fail to discard records (expecting non-existing entry records). We would
> then have to save a small record saying that a function return occurred. The
> current stack frame at the end of the next sub-buffer could be deduced from the
> complete collection of stack frame samples.

And suppose the stack-trace was all of 16 entries (not uncommon for a
kernel stack), then you waste a whole page for 128 bytes (assuming your
sub-buffer is page sized). I'll take the memcopy, thank you.



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04  6:46                                 ` Peter Zijlstra
@ 2010-08-04  7:14                                   ` Ingo Molnar
  2010-08-04 14:45                                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 168+ messages in thread
From: Ingo Molnar @ 2010-08-04  7:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Linus Torvalds, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo


* Peter Zijlstra <peterz@infradead.org> wrote:

> > What I am proposing does not even involve a copy: when we want to take a 
> > snapshot, we just have to force a sub-buffer switch on the ring buffer. 
> > The "returns" happening at the beginning of the next (empty) sub-buffer 
> > would clearly fail to discard records (expecting non-existing entry 
> > records). We would then have to save a small record saying that a function 
> > return occurred. The current stack frame at the end of the next sub-buffer 
> > could be deduced from the complete collection of stack frame samples.
> 
> And suppose the stack-trace was all of 16 entries (not uncommon for a kernel 
> stack), then you waste a whole page for 128 bytes (assuming your sub-buffer 
> is page sized). I'll take the memcopy, thank you.

To throw some hard numbers into the discussion, i found two random callgraph 
perf.data's on my boxes (both created prior the start of this discussion) and 
here is the distribution of their call-chain length:

aldebaran:~> perf report -D | grep 'chain: nr:' | cut -d: -f3- | sort -n | uniq -c
      2 4
     21 6
     23 8
     13 9
     20 10
     29 11
     21 12
     25 13
     54 14
    112 15
     72 16
     77 17
     35 18
     38 19
     48 20
     29 21
     10 22
     97 23
      3 24
      1 25
      2 26
      2 28
      2 29
      1 30
      2 31

So the peak/average here is around 15 entries.

The other one:

phoenix:~> perf report -D | grep 'chain: nr:' | cut -d: -f3- | sort -n | uniq -c
      1 2
     70 3
    222 4
    112 5
    116 6
    329 7
    241 8
    163 9
    203 10
    287 11
    159 12
      4 13
      6 14
     22 15
      2 16
     11 17
      5 18

Here the average is even lower - around 8 entries.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04  6:46                                 ` Dave Chinner
@ 2010-08-04  7:21                                   ` Ingo Molnar
  0 siblings, 0 replies; 168+ messages in thread
From: Ingo Molnar @ 2010-08-04  7:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Peter Zijlstra, Mathieu Desnoyers,
	Frederic Weisbecker, LKML, Andrew Morton, Steven Rostedt,
	Steven Rostedt, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo


* Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Aug 03, 2010 at 11:56:11AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> > 
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
> > 
> > One thing that I think could easily make sense in a _lot_ of buffering
> > areas is the notion of a "continuation" buffer. We know we have cases
> > where we want to attach a lot of data to one particular event, but the
> > buffering itself is inevitably always going to have some limits on
> > atomicity etc. And quite often, the event that _generates_ the data is
> > not necessarily going to have all that data in one contiguous region,
> > and doing a scatter-gather memcpy to get it that way is not good
> > either.
> > 
> > At the same time, I do _not_ believe that the kernel ring-buffer code
> > should handle pointers to sub-buffers etc, or worry about iovec-like
> > arrays of smaller ranges. So if _that_ is what you mean by "concept of
> > sub-buffers", then I agree with you.
> > 
> > But what I do think might make a lot of sense is to allow buffer
> > fragments, and just teach user space to do de-fragmentation. Where it
> > would be important that the de-fragmentation really is all in user
> > space, and not really ever visible to the ring-buffer implementation
> > itself (and there would not, for example, be any guarantees that the
> > fragments would be contiguous - there could be other events in the
> > buffer in between fragments).  Maybe we could even say that fragments
> > might be across different CPU ring-buffers, and user-space needs to
> > sort it out if it wants to (where "sort it out" literally would mean
> > having to sort and re-attach them in the right order, since there
> > wouldn't be any ordering between them).
> > 
> > From a kernel perspective, the only thing you need for fragment
> > handling would be to have a buffer entry that just says "I'm fragment
> > number X of event ID Y". Nothing more. Everything else would be up to
> > the parser in user space to work out.
> 
> Heh. For a moment there I thought you were describing the way XFS writes 
> transactions into its log. Replace "CPU ring-buffers" with "in-core log 
> buffers", "userspace parsing" with "log recovery" and "event ID" with 
> "transaction ID", and the concept you describe is eerily similar. That 
> includes the fact that transactions are not contiguous in the log, can 
> interleave fragments between concurrent transaction commits and they can 
> span multiple log buffers, too. It works pretty well for scaling concurrent 
> writers....

That's certainly a good model when you have to stream into a 
persistent-storage transaction log space with multiple writers.

The difference is that with instrumentation we are generally able to make 
things per task or per cpu so there's no real multi-CPU 'concurrent writers' 
concurrency.

You don't have that luxury/simplicity when logging to storage, of course!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:05             ` H. Peter Anvin
  2010-07-16 18:15               ` Avi Kivity
  2010-07-16 19:28               ` Andi Kleen
@ 2010-08-04  9:46               ` Peter Zijlstra
  2010-08-04 20:23                 ` H. Peter Anvin
  2 siblings, 1 reply; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-04  9:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler,
	David Howells

On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote:
> 
> I really hope noone ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work... 

Perf actually already does that to unwind user-space stacks... ;-)

See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users.

What we do is a manual page table walk (using __get_user_pages_fast) and
simply bail when the page is not available.
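In rough outline it looks like the sketch below (simplified, not the
actual code; error handling and the exact kmap interface differ between
kernel versions -- older kernels pass an explicit kmap slot):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>

/* Simplified sketch of an NMI-safe copy from user space: resolve one
 * user page at a time with the lockless fast-path GUP and bail out as
 * soon as a page is not present.  Returns the number of bytes copied. */
static unsigned long sketch_copy_from_user_nmi(void *to,
		const void __user *from, unsigned long n)
{
	unsigned long offset, len, copied = 0;
	struct page *page;
	void *map;

	while (n) {
		if (__get_user_pages_fast((unsigned long)from, 1, 0, &page) != 1)
			break;			/* page not mapped: give up */

		offset = (unsigned long)from & (PAGE_SIZE - 1);
		len = min(n, PAGE_SIZE - offset);

		map = kmap_atomic(page);
		memcpy(to, map + offset, len);
		kunmap_atomic(map);
		put_page(page);

		to += len;
		from += len;
		n -= len;
		copied += len;
	}
	return copied;
}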

That said, I think that the thing that started the whole
per-cpu-per-context temp stack-frame storage story also means that that
function is now broken and can lead to kmap_atomic corruption.

I really should brush up that stack based kmap_atomic thing, last time I
got stuck on FRV wanting things.

Linus, should I refresh that whole series and give FRV a slow but
working implementation, and then let David Howells sort things out if he
cares about that?



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04  6:27                                 ` Peter Zijlstra
@ 2010-08-04 14:06                                   ` Mathieu Desnoyers
  2010-08-04 14:50                                     ` Peter Zijlstra
  2010-08-11 14:34                                   ` Steven Rostedt
  1 sibling, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-04 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> > 
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
> 
> The sub-buffer thing that both ftrace and lttng have is creating a large
> buffer from a lot of small buffers, I simply don't see the point of
> doing that. It adds complexity and limitations for very little gain.

The first major gain is the ability to implement flight recorder tracing
(overwrite mode), which Perf still lacks.

A second major gain: having these sub-buffers lets the trace analyzer seek in
the trace very efficiently by allowing it to perform a binary search for time to
find the appropriate sub-buffer. It becomes immensely useful with large traces.
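To illustrate (a user-space sketch with a made-up layout, not the actual
LTTng trace format): if each sub-buffer header records the timestamp of
its first event, the analyzer can locate the sub-buffer containing time
't' with a plain binary search instead of scanning the event stream.

#include <stddef.h>
#include <stdint.h>

/* Sketch: return the index of the last sub-buffer whose starting
 * timestamp is <= t.  start_ts[] is assumed to come from the sub-buffer
 * headers of an already-mapped trace; names are made up. */
static size_t sketch_find_subbuf(const uint64_t *start_ts, size_t nr_subbufs,
				 uint64_t t)
{
	size_t lo = 0, hi = nr_subbufs;

	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (start_ts[mid] <= t)
			lo = mid;
		else
			hi = mid;
	}
	return lo;	/* the event at time t, if present, starts here or later */
}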

The third major gain: for live streaming of traces, having sub-buffers lets you
"package" the event data you send over the network into sub-buffers. So the
trace analyzer, receiving this information live while the trace is being
recorded, can start using the information when the full sub-buffer is received.
It does not have to play games with the last event (or event header) perhaps
being incompletely sent, which implies that you absolutely _need_ to save the
event size along with each event header (you cannot simply let the analyzer
parse the event payload to determine the size). Here again, space wasted.
Furthermore, this deals with information loss: a trace is still readable even if
a sub-buffer must be discarded.

Making sure events don't cross sub-buffer boundaries simplifies a lot of things,
starting with dealing with "overwritten" sub-buffers in flight recorder mode.
Trying to deal with a partially overwritten event is just insane.

> 
> Their benefit is known synchronization points into the stream, you can
> parse each sub-buffer independently, but you can always break up a
> continuous stream into smaller parts or use a transport that includes
> index points or whatever.

I understand that you could perform amortized synchronization without
sub-buffers. I however don't see how flight recorder, efficient seek on multi-GB
traces (without reading the whole event stream), and live streaming can be
achieved.

> Their down side is that you can never have individual events larger than
> the sub-buffer,

True. But with configurable sub-buffer size (can be from 4kB to many MB), I
don't see the problem.

>                 you need to be aware of the sub-buffer when reserving
> space

Only the ring buffer needs to be aware of that. It returns an error if the event
is larger than the sub-buffer size.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04  6:46                                 ` Peter Zijlstra
  2010-08-04  7:14                                   ` Ingo Molnar
@ 2010-08-04 14:45                                   ` Mathieu Desnoyers
  2010-08-04 14:56                                     ` Peter Zijlstra
  1 sibling, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-04 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2010-08-03 at 14:25 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Thu, 2010-07-15 at 12:26 -0400, Mathieu Desnoyers wrote:
> > > 
> > > > I was more thinking along the lines of making sure a ring buffer has the proper
> > > > support for your use-case. It shares a lot of requirements with a standard ring
> > > > buffer:
> > > > 
> > > > - Need to be lock-less
> > > > - Need to reserve space, write data in a buffer
> > > > 
> > > > By configuring a ring buffer with 4k sub-buffer size (that's configurable
> > > > dynamically), 
> > > 
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> > 
> > This reluctance against splitting a buffer into sub-buffers might help to
> > explain the poor performance experienced with the Perf ring buffer.
> 
> That's just unsubstantiated FUD.

Extracted from:
http://lkml.org/lkml/2010/7/9/368

(executive summary)

* Throughput

   * Flight recorder mode

Ring Buffer Library        83 ns/entry (512kB sub-buffers, no reader)
                           89 ns/entry (512kB sub-buffers: read 0.3M entries/s)


Ftrace Ring Buffer:       103 ns/entry (no reader)
                          187 ns/entry (read by event:     read 0.4M entries/s)

Perf record               (flight recorder mode unavailable)


   * Discard mode

Ring Buffer Library:      96 ns/entry discarded
                          257 ns/entry written (read: 2.8M entries/s)

Perf Ring Buffer:         423 ns/entry written (read: 2.3M entries/s)
(Note that this number is based on the perf event approximation output (based on
a 24 bytes/entry estimation) rather than the benchmark module count due to its
inaccuracy, which is caused by perf not letting the benchmark module know about
discarded events.)

It is really hard to get a clear picture of the data write overhead with perf,
because you _need_ to consume data. Making perf support flight recorder mode
would really help getting benchmarks that are easier to compare.

> 
> >  These
> > "sub-buffers" are really nothing new: these are called "periods" in the audio
> > world. They help lowering the ring buffer performance overhead because:
> > 
> > 1) They allow writing into the ring buffer without SMP-safe synchronization
> > primitives and memory barriers for each record. Synchronization is only needed
> > across sub-buffer boundaries, which amortizes the cost over a large number of
> > events.
> 
> The only SMP barrier we (should) have is when we update the user visible
> head pointer. The buffer code itself uses local{,64}_t for all other
> atomic ops.
> 
> If you want to amortize that barrier, simply hold off the head update
> for a while, no need to introduce sub-buffers.

I understand your point about amortized synchronization. However I still don't
see how you can achieve flight recorder mode, efficient seek on multi-GB traces
without reading the whole event stream, and live streaming without sub-buffers
(and, ideally, without too many headaches involved). ;)

> 
> > 2) They are much more splice (and, in general, page-exchange) friendly, because
> > records written after a synchronization point start at the beginning of a page.
> > This removes the need for extra copies.
> 
> This just doesn't make any sense at all, I could splice full pages just
> fine, splice keeps page order so these synchronization points aren't
> critical in any way.

If you need to read non-filled pages, then you need to splice pages piece-wise.
This does not fit well with flight recorder tracing, for which the solution
Steven and I have found is to atomically exchange pages (for Ftrace) or
sub-buffers (for the generic ring buffer library) between the reader and writer.

> 
> The only problem I have with splice atm is that we don't have a buffer
> interface without mmap() and we cannot splice pages out from under
> mmap() on all architectures in a sane manner.

The problem Perf has is probably more with flight recorder (overwrite) tracing
support than splice() per se, in this you are right.

> 
> > So I have to ask: do you detest the sub-buffer concept only because you are tied
> > to the current Perf userspace ABI which cannot support this without an ABI
> > change ?
> 
> No because I don't see the point.

OK, good to know you are open to ABI changes if I present convincing arguments.

> 
> > I'm trying to help out here, but it does not make the task easy if we have both
> > hands tied in our back because we have to keep backward ABI compatibility for a
> > tool (perf) forever, even considering its sources are shipped with the kernel.
> 
> Dude, it's a published user<->kernel ABI, also you're not saying why you
> would want to break it. In your other email you allude to things like
> flight recorder mode, that could be done with the current set-up, no
> need to break the ABI at all. All you need to do is track the tail
> pointer and publish it.

How do you plan to read the data concurrently with the writer overwriting the
data while you are reading it without corruption ?

> 
> > Nope. I'm thinking that we can use a buffer just to save the stack as we call
> > functions and return, e.g.
> 
> We don't have a callback on function entry, and I'm not going to use
> mcount for that, that's simply insane.

OK, now I get a clearer picture of what Frederic is trying to do.

> 
> > call X -> reserve space to save "X" and arguments.
> > call Y -> same for Y.
> > call Z -> same for Z.
> > return -> discard event for Z.
> > return -> discard event for Y.
> > 
> > if we grab the buffer content at that point, then we have X and its arguments,
> > which is the function currently executed. That would require the ability to
> > uncommit and unreserve an event, which is not a problem as long as we have not
> > committed a full sub-buffer.
> 
> Again, I'm not really seeing the point of using sub-buffers at all.

This part of the email is unrelated to sub-buffers.

> 
> Also, what happens when we write an event after Y? Then the discard must
> fail or turn Y into a NOP, leaving a hole in the buffer.

Given that this buffer is simply used to dump the stack unwind result, I
think my scenario above was simply misled.

> 
> > I thought that this buffer was chasing the function entry/exits rather than
> > doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more
> > about his use-case ?
> 
> No, it's a pure stack unwind from NMI context. When we get an event (PMI,
> tracepoint, whatever) we write out the event, and if the consumer asked for a
> stacktrace with each event, we unwind the stack for him.

So why the copy ? Frederic seems to put the stack unwind in a special temporary
buffer. Why is it not saved directly into the trace buffers ?

> > > Additionally, if you have multiple consumers you can simply copy the
> > > stacktrace again, avoiding the whole pointer chase exercise. While you
> > > could conceivably copy from one ringbuffer into another that will result
> > > in very nasty serialization issues.
> > 
> > Assuming Frederic is saving information to this stack-like ring buffer at each
> > function entry and discarding at each function return, then we don't have the
> > pointer chase.
> > 
> > What I am proposing does not even involve a copy: when we want to take a
> > snapshot, we just have to force a sub-buffer switch on the ring buffer. The
> > "returns" happening at the beginning of the next (empty) sub-buffer would
> > clearly fail to discard records (expecting non-existing entry records). We would
> > then have to save a small record saying that a function return occurred. The
> > current stack frame at the end of the next sub-buffer could be deduced from the
> > complete collection of stack frame samples.
> 
> And suppose the stack-trace was all of 16 entries (not uncommon for a
> kernel stack), then you waste a whole page for 128 bytes (assuming your
> sub-buffer is page sized). I'll take the memcopy, thank you.

Well, now that I understand what you are trying to achieve, I retract my
proposal of using a stack-like ring buffer for this. I think that the stack dump
should simply be saved directly to the ring buffer, without copy. The
dump_stack() functions might have to be extended so they don't just save text
dumbly, but can also be used to save events into the trace in binary format,
perhaps with the continuation cookie Linus was proposing.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04 14:06                                   ` Mathieu Desnoyers
@ 2010-08-04 14:50                                     ` Peter Zijlstra
  2010-08-06  1:42                                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-04 14:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:

> The first major gain is the ability to implement flight recorder tracing
> (overwrite mode), which Perf still lacks.

http://lkml.org/lkml/2009/7/6/178

I've sent out something like that several times, but nobody took it
(that is, tested it and provided a user). Note how it doesn't require
anything like sub-buffers.

> A second major gain: having these sub-buffers lets the trace analyzer seek in
> the trace very efficiently by allowing it to perform a binary search for time to
> find the appropriate sub-buffer. It becomes immensely useful with large traces.

You can add sync events with a specific magic cookie in. Once you find
the cookie you can sync and start reading it reliably -- the advantage
is that sync events are very easy to have as an option and don't
complicate the reserve path.
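To be explicit about what I mean (a user-space sketch, cookie value and
record layout invented for illustration): the reader simply scans the raw
bytes until it sees the magic cookie of a sync event and resumes normal
parsing from there.

#include <stdint.h>
#include <string.h>

#define SYNC_MAGIC	0x53594e43aa55aa55ULL	/* hypothetical cookie */

/* Sketch: find the offset of the next sync event in a raw byte window,
 * or -1 if there is none.  Assumes sync cookies are u64-aligned. */
static long sketch_find_sync(const unsigned char *buf, size_t len)
{
	size_t off;

	for (off = 0; off + sizeof(uint64_t) <= len; off += sizeof(uint64_t)) {
		uint64_t v;

		memcpy(&v, buf + off, sizeof(v));	/* avoid unaligned loads */
		if (v == SYNC_MAGIC)
			return off;	/* parse events normally from here */
	}
	return -1;
}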

> The third major gain: for live streaming of traces, having sub-buffers lets you
> "package" the event data you send over the network into sub-buffers.

See the sync events. Also, a transport can rewrite the stream any which
way it pretty well wants to, as long as the kernel<->user interface is
reliable, an unreliable user<->user transport can repackage it to suit
its needs.

> Making sure events don't cross sub-buffer boundaries simplifies a lot of things,
> starting with dealing with "overwritten" sub-buffers in flight recorder mode.
> Trying to deal with a partially overwritten event is just insane.

See the above patch, simply parse the events and push the tail pointer
ahead of the reservation before you trample on it.
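Roughly this (a sketch, not the patch above; it assumes every event
starts with an explicit size field and that all events in the region
being reclaimed are valid):

/* Sketch: before reserving 'len' bytes at 'head', advance 'tail' over
 * whole events until the new reservation no longer overlaps live data,
 * so a reader never sees a half-overwritten record.  Offsets are
 * free-running and masked into the buffer only when dereferenced. */
struct sketch_hdr {
	unsigned int	size;		/* total event size, header included */
};

struct sketch_owrb {
	void		*base;		/* buffer memory */
	unsigned long	buf_size;	/* power of two */
	unsigned long	tail;		/* oldest valid data */
};

static void sketch_push_tail(struct sketch_owrb *rb, unsigned long head,
			     unsigned long len)
{
	unsigned long tail = rb->tail;

	while (head + len - tail > rb->buf_size) {
		struct sketch_hdr *hdr;

		hdr = rb->base + (tail & (rb->buf_size - 1));
		tail += hdr->size;	/* skip one whole event */
	}
	rb->tail = tail;
}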

If you worry about the cost of parsing the events, you can amortize that
by things like keeping the offset of the first event in every page in
the pageframe, or the offset of the next sync event or whatever scheme
you want.

Again, no need for sub-buffers.

Also, not having sub-buffers makes reservation easier since you don't
need to worry about those empty tails.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04 14:45                                   ` Mathieu Desnoyers
@ 2010-08-04 14:56                                     ` Peter Zijlstra
  2010-08-06  1:49                                       ` Mathieu Desnoyers
  2010-08-06  6:18                                       ` Masami Hiramatsu
  0 siblings, 2 replies; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-04 14:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:

> How do you plan to read the data concurrently with the writer overwriting the
> data while you are reading it without corruption ?

I don't consider reading while writing (in overwrite mode) a valid case.

If you want to use overwrite, stop the writer before reading it.

>  I think that the stack dump
> should simply be saved directly to the ring buffer, without copy. The
> dump_stack() functions might have to be extended so they don't just save text
> dumbly, but can also be used to save events into the trace in binary format,
> perhaps with the continuation cookie Linus was proposing.

Because I don't want to support truncating reservations (because that
leads to large nops for nested events) and when the event needs to go to
multiple buffers you can re-use the stack-dump without having to do the
unwind again.

The problem with the continuation thing Linus suggested is that it would
bloat the output 3 fold. A stack entry is a single u64. If you want to
wrap that in a continuation event you need: a header (u64), a cookie
(u64) and the entry (u64).

Continuation events might make heaps of sense for larger data pieces,
but I don't see them being practical for such small pieces.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-08-04  9:46               ` Peter Zijlstra
@ 2010-08-04 20:23                 ` H. Peter Anvin
  0 siblings, 0 replies; 168+ messages in thread
From: H. Peter Anvin @ 2010-08-04 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler,
	David Howells

On 08/04/2010 02:46 AM, Peter Zijlstra wrote:
> On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote:
>>
>> I really hope noone ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work... 
> 
> Perf actually already does that to unwind user-space stacks... ;-)
> 
> See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users.
> 
> What we do is a manual page table walk (using __get_user_pages_fast) and
> simply bail when the page is not available.
> 

That's not really "touching user space", though.

	-hpa

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04 14:50                                     ` Peter Zijlstra
@ 2010-08-06  1:42                                       ` Mathieu Desnoyers
  2010-08-06 10:11                                         ` Peter Zijlstra
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-06  1:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
> 
> > The first major gain is the ability to implement flight recorder tracing
> > (overwrite mode), which Perf still lacks.
> 
> http://lkml.org/lkml/2009/7/6/178
> 
> I've sent out something like that several times, but nobody took it
> (that is, tested it and provided a user). Note how it doesn't require
> anything like sub-buffers.

+static void perf_output_tail(struct perf_mmap_data *data, unsigned int head)
...
+       unsigned long tail, new;
...
+       unsigned long size;

+       while (tail + size - head < 0) {
....
+       }

How is the while condition ever supposed to be true? tail, size and head are
all unsigned here, so "tail + size - head" can never be negative. I guess
nobody took it because it simply was not ready for testing.

> 
> > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > the trace very efficiently by allowing it to perform a binary search for time to
> > find the appropriate sub-buffer. It becomes immensely useful with large traces.
> 
> You can add sync events with a specific magic cookie in. Once you find
> the cookie you can sync and start reading it reliably

You need to read the whole trace to find these cookies (even if it is just once
at the beginning if you create an index). My experience with users has shown me
that the delay between stopping trace gathering and having the data shown to the
user is very important, because this is repeatedly done while debugging a
problem, and this is time the user is sitting in front of his screen, waiting.

> -- the advantage
> is that sync events are very easy to have as an option and don't
> complicate the reserve path.

Perf, on its reserve/commit fast paths:

perf_output_begin: 543 bytes
  (perf_output_get_handle is inlined)

perf_output_put_handle: 201 bytes
perf_output_end:         77 bytes
  calls perf_output_put_handle

Total for perf:         821 bytes

Generic Ring Buffer Library reserve/commit fast paths:

Reserve:                       511 bytes
Commit:                        266 bytes
Total for Generic Ring Buffer: 777 bytes

So the generic ring buffer is not only faster but also supports sub-buffers (along
with all the nice features this brings); its reserve and commit hot paths
fit in fewer instructions: it is *less* complicated than Perf's.


> 
> > The third major gain: for live streaming of traces, having sub-buffers lets you
> > "package" the event data you send over the network into sub-buffers.
> 
> See the sync events.

I am guessing you plan to rely on these sync events to know which data "blocks"
are fully received. This could possibly be made to work.

> Also, a transport can rewrite the stream any which
> way it pretty well wants to, as long as the kernel<->user interface is
> reliable, an unreliable user<->user transport can repackage it to suit
> its needs.

repackage = copy = poor performance. No thanks.

> 
> > Making sure events don't cross sub-buffer boundaries simplifies a lot of things,
> > starting with dealing with "overwritten" sub-buffers in flight recorder mode.
> > Trying to deal with a partially overwritten event is just insane.
> 
> See the above patch, simply parse the events and push the tail pointer
> ahead of the reservation before you trample on it.

I'm not sure that patch is ready for prime-time yet. As you point out in your
following email, you need to stop tracing to consume data, which does not fit my
users' use-cases.

> 
> If you worry about the cost of parsing the events, you can amortize that
> by things like keeping the offset of the first event in every page in
> the pageframe, or the offset of the next sync event or whatever scheme
> you want.

Hrm ? AFAIK, the page-frame is an internal kernel-only data structure. That
won't be exported to user-space, so how is the parser supposed to see this
information exactly to help it speed up parsing?

> 
> Again, no need for sub-buffers.

I don't see this claim as satisfactorily supported here, sorry.

> 
> Also, not having sub-buffers makes reservation easier since you don't
> need to worry about those empty tails.

So far I've shown that your sub-buffer-less implementation is not even simpler
than an implementation using sub-buffers.

By the way, even with your sub-buffer-free scheme, you cannot write an event
bigger than your buffer size. So you have a similar limitation in terms of
maximum event size (so you already have to test for this on your fast path).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04 14:56                                     ` Peter Zijlstra
@ 2010-08-06  1:49                                       ` Mathieu Desnoyers
  2010-08-06  9:51                                         ` Peter Zijlstra
  2010-08-06  6:18                                       ` Masami Hiramatsu
  1 sibling, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-06  1:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> 
> > How do you plan to read the data concurrently with the writer overwriting the
> > data while you are reading it without corruption ?
> 
> I don't consider reading while writing (in overwrite mode) a valid case.
> 
> If you want to use overwrite, stop the writer before reading it.

How inconvenient. It happens that the relatively large group of users I am
working for do care for this use-case. They cannot afford to stop tracing as
soon as they hit "one bug". This "bug" could be a simple odd scenario that they
want to snapshot, but in all cases they want tracing to continue.

> 
> >  I think that the stack dump
> > should simply be saved directly to the ring buffer, without copy. The
> > dump_stack() functions might have to be extended so they don't just save text
> > dumbly, but can also be used to save events into the trace in binary format,
> > perhaps with the continuation cookie Linus was proposing.
> 
> Because I don't want to support truncating reservations (because that
> leads to large nops for nested events)

Agreed in this case. Truncating reservations might make sense for filtering, but
even there I have a strong preference for filtering directly on the information
received as parameter, before performing buffer space reservation, whenever
possible.

> and when the event needs to go to
> multiple buffers you can re-use the stack-dump without having to do the
> unwind again.
> 
> The problem with the continuation thing Linus suggested is that it would
> bloat the output 3 fold. A stack entry is a single u64. If you want to
> wrap that in a continuation event you need: a header (u64), a cookie
> (u64) and the entry (u64).

Agreed, it's probably not such a good fit for these small pieces of information.

> 
> Continuation events might make heaps of sense for larger data pieces,
> but I don't see them being practical for such small pieces.

Yep.

What I did in a past life in earlier LTTng versions was to use a 2-pass unwind.
The first pass is the most costly because it brings all the data into the L1
cache. This first pass is used to compute the array size you need to save the
whole stack frame, but it does not copy anything. The second pass performs the
copy. This was surprisingly efficient.
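Schematically it looked like this (hypothetical helpers standing in for
the arch unwinder and the ring buffer; this is not the old LTTng code
verbatim):

/* Stand-ins for the arch unwinder and the ring buffer (hypothetical). */
extern unsigned long sketch_next_frame(unsigned long fp);
extern unsigned long sketch_return_address(unsigned long fp);
extern unsigned long *sketch_reserve_event(void *rb, unsigned long size);

/* Sketch of a 2-pass stack dump: pass 1 counts the frames (and pulls
 * them into L1), the reservation is then sized exactly, and pass 2
 * copies the return addresses into the reserved slot. */
static void sketch_dump_stack(void *rb, unsigned long fp)
{
	unsigned long frame, nr = 0;
	unsigned long *slot;

	for (frame = fp; frame; frame = sketch_next_frame(frame))
		nr++;				/* pass 1: count only */

	slot = sketch_reserve_event(rb, nr * sizeof(unsigned long));
	if (!slot)
		return;				/* buffer full: drop the dump */

	for (frame = fp; frame; frame = sketch_next_frame(frame))
		*slot++ = sketch_return_address(frame);	/* pass 2: copy */
}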

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04 14:56                                     ` Peter Zijlstra
  2010-08-06  1:49                                       ` Mathieu Desnoyers
@ 2010-08-06  6:18                                       ` Masami Hiramatsu
  2010-08-06  9:50                                         ` Peter Zijlstra
  1 sibling, 1 reply; 168+ messages in thread
From: Masami Hiramatsu @ 2010-08-06  6:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Linus Torvalds,
	Ingo Molnar, LKML, Andrew Morton, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo, 2nddept-manager

Peter Zijlstra wrote:
> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> 
>> How do you plan to read the data concurrently with the writer overwriting the
>> data while you are reading it without corruption ?
> 
> I don't consider reading while writing (in overwrite mode) a valid case.
> 
> If you want to use overwrite, stop the writer before reading it.

For example, would you like to read the system audit log only after
stopping auditing?

NO, that is one of the most important requirements for tracers, especially for
system admins (they're the most important users of Linux), to check system
health and catch system troubles.

For performance measurement and finding hotspots, one-shot tracing is enough,
but that is just for developers. In real-world computing, Linux is just an OS:
users want to run their system, middleware and applications without trouble.
But when they do hit a problem, they want to shoot it ASAP.
The flight recorder mode is mainly for those users.

Thank you,

-- 
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06  6:18                                       ` Masami Hiramatsu
@ 2010-08-06  9:50                                         ` Peter Zijlstra
  2010-08-06 13:37                                           ` Mathieu Desnoyers
                                                             ` (2 more replies)
  0 siblings, 3 replies; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-06  9:50 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Linus Torvalds,
	Ingo Molnar, LKML, Andrew Morton, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo, 2nddept-manager

On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> Peter Zijlstra wrote:
> > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > 
> >> How do you plan to read the data concurrently with the writer overwriting the
> >> data while you are reading it without corruption ?
> > 
> > I don't consider reading while writing (in overwrite mode) a valid case.
> > 
> > If you want to use overwrite, stop the writer before reading it.
> 
> For example, would you like to read the system audit log only after
> stopping auditing?
> 
> NO, that is one of the most important requirements for tracers, especially for
> system admins (they're the most important users of Linux), to check system
> health and catch system troubles.
> 
> For performance measurement and finding hotspots, one-shot tracing is enough,
> but that is just for developers. In real-world computing, Linux is just an OS:
> users want to run their system, middleware and applications without trouble.
> But when they do hit a problem, they want to shoot it ASAP.
> The flight recorder mode is mainly for those users.

You cannot over-write and consistently read the buffer, that's plain
impossible. With sub-buffers you can swivel a sub-buffer and
consistently read that, but there is no guarantee the next sub-buffer
you steal was indeed adjacent to the previous buffer you stole as that
might have gotten over-written by the active writer while you were
stealing the previous one.

If you want to snapshot buffers, do that, simply swivel the whole trace
buffer, and continue tracing in a new one, then consume the old trace in
a consistent manner.
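Something along these lines (hypothetical structures, not an existing
perf interface; a real version must also wait for writers that are still
committing into the old buffer):

struct sketch_buf;

/* Sketch: snapshot by swapping the whole buffer under the writer.
 * The tracer keeps writing into 'active'; the swap makes the old
 * buffer quiescent so the reader can consume it at leisure. */
struct sketch_snap {
	struct sketch_buf	*active;	/* written by the tracer */
	struct sketch_buf	*spare;		/* idle, owned by the reader */
};

static struct sketch_buf *sketch_take_snapshot(struct sketch_snap *s)
{
	struct sketch_buf *old;

	old = xchg(&s->active, s->spare);	/* writers move to the spare */
	s->spare = NULL;			/* the caller hands it back later */
	/* wait here for in-flight writers before reading 'old' */
	return old;
}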

I really see no value in being able to read unrelated bits and pieces of
a buffer.

So no, I will _not_ support reading an over-write buffer while there is
an active writer.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06  1:49                                       ` Mathieu Desnoyers
@ 2010-08-06  9:51                                         ` Peter Zijlstra
  2010-08-06 13:46                                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-06  9:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, 2010-08-05 at 21:49 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > 
> > > How do you plan to read the data concurrently with the writer overwriting the
> > > data while you are reading it without corruption ?
> > 
> > I don't consider reading while writing (in overwrite mode) a valid case.
> > 
> > If you want to use overwrite, stop the writer before reading it.
> 
> How inconvenient. It happens that the relatively large group of users I am
> working for do care for this use-case. They cannot afford to stop tracing as
> soon as they hit "one bug". This "bug" could be a simple odd scenario that they
> want to snapshot, but in all cases they want tracing to continue.

Snapshot is fine, just swivel the whole buffer.

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06  1:42                                       ` Mathieu Desnoyers
@ 2010-08-06 10:11                                         ` Peter Zijlstra
  2010-08-06 11:14                                           ` Peter Zijlstra
  2010-08-06 14:13                                           ` Mathieu Desnoyers
  0 siblings, 2 replies; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-06 10:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Thu, 2010-08-05 at 21:42 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:
> > On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
> > 
> > > The first major gain is the ability to implement flight recorder tracing
> > > (overwrite mode), which Perf still lacks.
> > 
> > http://lkml.org/lkml/2009/7/6/178
> > 
> > I've sent out something like that several times, but nobody took it
> > (that is, tested it and provided a user). Note how it doesn't require
> > anything like sub-buffers.

> How is the while condition ever supposed to be true? I guess nobody took it
> because it simply was not ready for testing.

I know, I never claimed it was, it was always an illustration of how to
accomplish it. But then, nobody found it important enough to finish.

> > > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > > the trace very efficiently by allowing it to perform a binary search for time to
> > > find the appropriate sub-buffer. It becomes immensely useful with large traces.
> > 
> > You can add sync events with a specific magic cookie in. Once you find
> > the cookie you can sync and start reading it reliably
> 
> You need to read the whole trace to find these cookies (even if it is just once
> at the beginning if you create an index).

Depends on what you want to do, you can start reading at any point in
the stream and be guaranteed to find a sync point within sync-distance
+max-event-size.

> My experience with users has shown me
> that the delay between stopping trace gathering and having the data shown to the
> user is very important, because this is repeatedly done while debugging a
> problem, and this is time the user is sitting in front of his screen, waiting.

Yeah, because after having had to wait for 36h for the problem to
trigger that extra minute really kills.

All I can say is that in my experience brain throughput is the limiting
factor in debugging. Not some ability to draw fancy pictures.

> > -- the advantage
> > is that sync events are very easy to have as an option and don't
> > complicate the reserve path.
> 
> Perf, on its reserve/commit fast paths:
> 
> perf_output_begin: 543 bytes
>   (perf_output_get_handle is inlined)
> 
> perf_output_put_handle: 201 bytes
> perf_output_end:         77 bytes
>   calls perf_output_put_handle
> 
> Total for perf:         821 bytes
> 
> Generic Ring Buffer Library reserve/commit fast paths:
> 
> Reserve:                       511 bytes
> Commit:                        266 bytes
> Total for Generic Ring Buffer: 777 bytes
> 
> So the generic ring buffer is not only faster but also supports sub-buffers (along
> with all the nice features this brings); its reserve and commit hot paths
> fit in fewer instructions: it is *less* complicated than Perf's.

All I can say is that less code doesn't equal less complex (nor faster
per se). Nor have I spent all my time on writing the ring-buffer;
there are more interesting things to do.

And the last time I ran perf on perf, the buffer wasn't the thing that
was taking the most time.

And unlike what you claim below, it most certainly can deal with events
larger than a single page.

> > If you worry about the cost of parsing the events, you can amortize that
> > by things like keeping the offset of the first event in every page in
> > the pageframe, or the offset of the next sync event or whatever scheme
> > you want.
> 
> Hrm ? AFAIK, the page-frame is an internal kernel-only data structure. That
> won't be exported to user-space, so how is the parser supposed to see this
> information exactly to help it speed up parsing?

It's about the kernel parsing the buffer to push the tail ahead of the
reserve window, so that you have a reliable point to start reading the
trace from -- or didn't you actually get the intent of that patch?



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06 10:11                                         ` Peter Zijlstra
@ 2010-08-06 11:14                                           ` Peter Zijlstra
  2010-08-06 14:15                                             ` Mathieu Desnoyers
  2010-08-06 14:13                                           ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Peter Zijlstra @ 2010-08-06 11:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Fri, 2010-08-06 at 12:11 +0200, Peter Zijlstra wrote:
> > You need to read the whole trace to find these cookies (even if it is just once
> > at the beginning if you create an index).

Even if you want to index all sync points you can quickly skip through
the file using the sync-distance, after which you'll have, on average,
only 1/2 avg-event-size to read before you find your next sync point.

So suppose you have a 1M sync-distance, and an effective average event
size of 128 bytes, then for a 4G file, you can find all sync points by
only reading ~262144 bytes (not counting the fact that the pagecache
will bring in full pages, which would result in something like 16M to be
read in total or somesuch -- which, again assumes read-ahead isn't going
to play tricks on you).
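
A quick back-of-the-envelope sketch of that arithmetic, plugging in the example
figures above (purely illustrative, not perf code):

#include <stdio.h>

int main(void)
{
	unsigned long long file_size     = 4ULL << 30;	/* 4G trace file */
	unsigned long long sync_distance = 1ULL << 20;	/* one sync point per 1M */
	unsigned long long avg_event     = 128;		/* average event size */

	unsigned long long nr_sync  = file_size / sync_distance;	/* 4096 sync points */
	unsigned long long per_seek = avg_event / 2;			/* ~64 bytes scanned each */
	unsigned long long total    = nr_sync * per_seek;		/* ~262144 bytes */

	printf("%llu sync points, ~%llu bytes scanned\n", nr_sync, total);
	printf("pagecache worst case: ~%llu MiB\n",
	       nr_sync * 4096 / (1024 * 1024));				/* ~16 MiB of pages */
	return 0;
}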



* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06  9:50                                         ` Peter Zijlstra
@ 2010-08-06 13:37                                           ` Mathieu Desnoyers
  2010-08-07  9:51                                           ` Masami Hiramatsu
  2010-08-09 16:53                                           ` Frederic Weisbecker
  2 siblings, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-06 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, Frederic Weisbecker, Linus Torvalds,
	Ingo Molnar, LKML, Andrew Morton, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo, 2nddept-manager

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> > Peter Zijlstra wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > > 
> > >> How do you plan to read the data concurrently with the writer overwriting the
> > >> data while you are reading it without corruption ?
> > > 
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > > 
> > > If you want to use overwrite, stop the writer before reading it.
> > 
> > For example, would you like to read system audit log always after
> > stop the audit?
> > 
> > NO, that's a most important requirement for tracers, especially for
> > system admins (they're the most important users of Linux) to check
> > the system health and catch system troubles.
> > 
> > For performance measurement and checking hotspot, one-shot tracing
> > is enough. But it's just for developers. But for the real world
> > computing, Linux is just an OS, users want to run their system,
> > middleware and applications, without troubles. But when they hit
> > a trouble, they wanna shoot it ASAP.
> > The flight recorder mode is mainly for those users.
> 
> You cannot over-write and consistently read the buffer, that's plain
> impossible.

If you think it is impossible, then you should really go have a look at the
generic ring buffer library, at LTTng and at Ftrace. It looks like we're all
doing the "impossible".

>             With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.

We don't care about taking the next adjacent sub-buffer. We care about always
grabbing the oldest sub-buffer that has been written, up to the most current
one.

> 
> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.

So you need to allocate many trace buffers to accomplish the same, plus an extra
layer on top that does this buffer exchange; I don't call that "simple". Note
that only two trace buffers might not be enough if you have repeated failures in
a short time window; the consumer might take some time to extract all these.

Compared to that, the sub-buffer scheme only needs a single buffer with 2 (or
more) sub-buffers, plus an extra sub-buffer owned by the reader that we exchange
with the sub-buffer we want to grab for reading. The reader always grabs the
sub-buffer with the oldest data in it. The number of sub-buffers used is the
limit on the number of snapshots that can be taken in a relatively short time
window (the time it takes the reader to consume the data).
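
To make the exchange concrete, here is a minimal sketch of the reader side; the
structure and function names are made up for illustration (this is not the LTTng
or generic ring buffer API), and synchronization with the concurrent writer
(e.g. a cmpxchg on the sub-buffer table entry) is omitted:

#define NR_SUBBUF	4

struct subbuf {
	void		*data;
	unsigned long	seq;	/* commit sequence number, 0 = never written */
};

struct ring {
	struct subbuf	*sub[NR_SUBBUF];	/* table seen by the writer */
	struct subbuf	*spare;			/* extra sub-buffer owned by the reader */
};

/* Grab the oldest written sub-buffer by trading it for the reader's spare. */
static struct subbuf *reader_grab_oldest(struct ring *r)
{
	struct subbuf *taken;
	int oldest = -1;
	unsigned int i;

	for (i = 0; i < NR_SUBBUF; i++) {
		if (!r->sub[i]->seq)
			continue;		/* empty, skip */
		if (oldest < 0 || r->sub[i]->seq < r->sub[oldest]->seq)
			oldest = i;
	}
	if (oldest < 0)
		return NULL;			/* nothing committed yet */

	/* Swap: the writer still owns NR_SUBBUF sub-buffers and never waits. */
	taken = r->sub[oldest];
	r->sub[oldest] = r->spare;
	r->spare = taken;
	return taken;				/* now private to the reader */
}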

> 
> I really see no value in being able to read unrelated bits and pieces of
> a buffer.

Within a sub-buffer, events are adjacent, and between sub-buffers, events are
guaranteed to be in order (oldest to newest event). It is only in the case where
buffers are relatively small compared to the data throughput that the writer can
overwrite information that would have been useful for a snapshot (e.g.
overwriting relatively recent information while the reader reads the oldest
sub-buffer), but in that case users simply have to tune their buffer size
appropriately to match the trace data throughput.

> 
> So no, I will _not_ support reading an over-write buffer while there is
> an active reader.

(I guess you mean active writer)

Here you argue that you don't need to support this feature at the ring buffer
level because you can have a group of ring buffers that does it instead.
How is your multiple-buffer scheme any simpler than sub-buffers ? Either you
have to allocate many of them up front, or, if you want to do it on-demand, you
have to perform memory allocation in NMI context. I don't see any of these two
solutions as particularly appealing.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06  9:51                                         ` Peter Zijlstra
@ 2010-08-06 13:46                                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-06 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Linus Torvalds, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2010-08-05 at 21:49 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > > 
> > > > How do you plan to read the data concurrently with the writer overwriting the
> > > > data while you are reading it without corruption ?
> > > 
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > > 
> > > If you want to use overwrite, stop the writer before reading it.
> > 
> > How inconvenient. It happens that the relatively large group of users I am
> > working for do care for this use-case. They cannot afford to stop tracing as
> > soon as they hit "one bug". This "bug" could be a simple odd scenario that they
> > want to snapshot, but in all cases they want tracing to continue.
> 
> Snapshot is fine, just swivel the whole buffer.

There is a very important trade-off between the amount of information that can
be kept around in memory to take as a snapshot and the amount of system memory
reserved for buffers. The sub-buffer scheme is pretty good at that: the whole
memory reserved (except the extra reader-owned sub-buffer) is available to save
the flight recorder trace.

With the multiple-buffer scheme you propose, only one of the buffers can be used
to save data. This is very limiting, especially for embedded systems in telecom
switches that does not have that much memory: all the memory reserved for the
buffer that is currently inactive is simply wasted. It does not even allow the
user to gather a longer snapshot.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06 10:11                                         ` Peter Zijlstra
  2010-08-06 11:14                                           ` Peter Zijlstra
@ 2010-08-06 14:13                                           ` Mathieu Desnoyers
  2010-08-11 14:44                                             ` Steven Rostedt
  1 sibling, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-06 14:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2010-08-05 at 21:42 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
> > > On Wed, 2010-08-04 at 10:06 -0400, Mathieu Desnoyers wrote:
[...]
> > > > A second major gain: having these sub-buffers lets the trace analyzer seek in
> > > > the trace very efficiently by allowing it to perform a binary search for time to
> > > > find the appropriate sub-buffer. It becomes immensely useful with large traces.
> > > 
> > > You can add sync events with a specific magic cookie in. Once you find
> > > the cookie you can sync and start reading it reliably
> > 
> > You need to read the whole trace to find these cookies (even if it is just once
> > at the beginning if you create an index).
> 
> Depends on what you want to do, you can start reading at any point in
> the stream and be guaranteed to find a sync point within sync-distance
> +max-event-size.

At _any_ point in the stream ?

So if I take, let's say, a few kB of Perf ring buffer data and I choose to
encode them into an event in another buffer (e.g. we're tracing part of the
network traffic). Then we end up in a situation where the event payload will
contain your "so special" sync point data. Basically, you have no guarantee that
you won't mix up standard event data and your synchronization event headers.

Your sync point solution just kills all good encapsulation practices in one go.

> >  My experience with users has shown me
> > that the delay between stopping trace gathering and having the data shown to the
> > user is very important, because this is repeatedly done while debugging a
> > problem, and this is time the user is sitting in front of his screen, waiting.
> 
> Yeah, because after having had to wait for 36h for the problem to
> trigger that extra minute really kills.
> 
> All I can say is that in my experience brain throughput is the limiting
> factor in debugging. Not some ability to draw fancy pictures.

Here I have to bring up the fact that Linux kernel developers are not the only
tracer users.

Multi-GB traces can be generated easily within a few seconds/minutes on many
workloads, so we're not talking about many-hour traces here. But if we need to
read the whole trace before it can be shown, we're adding a significant delay
before the trace can be accessed.

In my experience, both brain and data gathering throughputs are limiting factors
in debugging. Drawing fancy pictures merely helps speed up the brain process
in some cases.


> 
> > > -- the advantage
> > > is that sync events are very easy to have as an option and don't
> > > complicate the reserve path.
> > 
> > Perf, on its reserve/commit fast paths:
> > 
> > perf_output_begin: 543 bytes
> >   (perf_output_get_handle is inlined)
> > 
> > perf_output_put_handle: 201 bytes
> > perf_output_end:         77 bytes
> >   calls perf_output_put_handle
> > 
> > Total for perf:         821 bytes
> > 
> > Generic Ring Buffer Library reserve/commit fast paths:
> > 
> > Reserve:                       511 bytes
> > Commit:                        266 bytes
> > Total for Generic Ring Buffer: 777 bytes
> > 
> > So the generic ring buffer is not only faster and supports sub-buffers (along
> > with all the nice features this brings); its reserve and commit hot paths
> > fit in fewer instructions: it is *less* complicated than Perf's.
> 
> All I can say is that less code doesn't equal less complex (nor faster
> per-se).

Less code = less instruction cache overhead. I've also shown that the LTTng code
is at least twice as fast. In terms of complexity, it is not much more complex; I
also took the extra care of doing the formal proofs to make sure the
corner-cases were dealt with, which I don't reckon either Steven or you
have done.

> Nor have I spent all my time on writing the ring-buffer,
> there are more interesting things to do.

I must admit that I probably spent much more time working on the ring buffer
than you did. It looks like one's interest can only focus on so many areas at
once. So if you are not that interested in ring buffers, can you at least stop
opposing people who care deeply ?

If we agree that we don't care about the same use-cases, there might be room for
many ring buffers in the kernel. It's just a shame that we have to multiply the
amount of code review. We have to note that this goes against Linus' request for
a shared and common ring buffer used by all tracers.


> And the last time I ran perf on perf, the buffer wasn't the thing that
> was taking most time.

Very interesting. I know the trace clock performance is terrible too. But let's
keep that for another discussion please.

> 
> And unlike what you claim below, it most certainly can deal with events
> larger than a single page.

What I said below was: perf cannot write events larger than its buffer size. So
it already has to take that "test" branch for maximum event size. I said nothing
about page size in this context.

> 
> > > If you worry about the cost of parsing the events, you can amortize that
> > > by things like keeping the offset of the first event in every page in
> > > the pageframe, or the offset of the next sync event or whatever scheme
> > > you want.
> > 
> > Hrm ? AFAIK, the page-frame is an internal kernel-only data structure. That
> > won't be exported to user-space, so how is the parser supposed to see this
> > information exactly to help it speeding up parsing ?
> 
> It's about the kernel parsing the buffer to push the tail ahead of the
> reserve window, so that you have a reliable point to start reading the
> trace from -- or didn't you actually get the intent of that patch?

I got the intent of the patch; I just somehow missed that this paragraph
applied to the patch specifically.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06 11:14                                           ` Peter Zijlstra
@ 2010-08-06 14:15                                             ` Mathieu Desnoyers
  0 siblings, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-06 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Frederic Weisbecker, Ingo Molnar, LKML,
	Andrew Morton, Steven Rostedt, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Fri, 2010-08-06 at 12:11 +0200, Peter Zijlstra wrote:
> > > You need to read the whole trace to find these cookies (even if it is just once
> > > at the beginning if you create an index).
> 
> Even if you want to index all sync points you can quickly skip through
> the file using the sync-distance, after which you'll have, on average,
> only 1/2 avg-event-size to read before you find your next sync point.
> 
> So suppose you have a 1M sync-distance, and an effective average event
> size of 128 bytes, then for a 4G file, you can find all sync points by
> only reading ~262144 bytes (not counting the fact that the pagecache
> will bring in full pages, which would result in something like 16M to be
> read in total or somesuch -- which, again assumes read-ahead isn't going
> to play tricks on you).

How do you distinguish between sync events and random payload data ?

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06  9:50                                         ` Peter Zijlstra
  2010-08-06 13:37                                           ` Mathieu Desnoyers
@ 2010-08-07  9:51                                           ` Masami Hiramatsu
  2010-08-09 16:53                                           ` Frederic Weisbecker
  2 siblings, 0 replies; 168+ messages in thread
From: Masami Hiramatsu @ 2010-08-07  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Linus Torvalds,
	Ingo Molnar, LKML, Andrew Morton, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo, 2nddept-manager

Peter Zijlstra wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
>> Peter Zijlstra wrote:
>>> On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
>>>
>>>> How do you plan to read the data concurrently with the writer overwriting the
>>>> data while you are reading it without corruption ?
>>> I don't consider reading while writing (in overwrite mode) a valid case.
>>>
>>> If you want to use overwrite, stop the writer before reading it.
>> For example, would you like to read system audit log always after
>> stop the audit?
>>
>> NO, that's a most important requirement for tracers, especially for
>> system admins (they're the most important users of Linux) to check
>> the system health and catch system troubles.
>>
>> For performance measurement and checking hotspot, one-shot tracing
>> is enough. But it's just for developers. But for the real world
>> computing, Linux is just an OS, users want to run their system,
>> middleware and applications, without troubles. But when they hit
>> a trouble, they wanna shoot it ASAP.
>> The flight recorder mode is mainly for those users.
> 
> You cannot over-write and consistently read the buffer, that's plain
> impossible. With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.

Right, we cannot ensure that. In overwrite mode, the reader could lose
some data because of overwriting by writers (or the writer may fail
to write new data to the buffer in non-overwrite mode).
However, I think that doesn't mean this mode is completely useless.
If we can know when (and where) the data was lost, the rest of the data
is still useful enough in some cases.

> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.

Hmm, would that consume much more memory compared with a sub-buffer
ring buffer if we have spare buffers?
Or, if we allocate it after the reader opens the buffer, will it also slow
down the reader process?

> I really see no value in being able to read unrelated bits and pieces of
> a buffer.

I think there is a trade-off between a perfect snapshot and memory consumption,
and it depends on the use-case in many cases.

> 
> So no, I will _not_ support reading an over-write buffer while there is
> an active reader.
> 

I hope you will reconsider how useful an over-write buffer is, even if
it is far from perfect.

Thank you,

-- 
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06  9:50                                         ` Peter Zijlstra
  2010-08-06 13:37                                           ` Mathieu Desnoyers
  2010-08-07  9:51                                           ` Masami Hiramatsu
@ 2010-08-09 16:53                                           ` Frederic Weisbecker
  2 siblings, 0 replies; 168+ messages in thread
From: Frederic Weisbecker @ 2010-08-09 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
	LKML, Andrew Morton, Steven Rostedt, Steven Rostedt,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo, 2nddept-manager

On Fri, Aug 06, 2010 at 11:50:40AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-08-06 at 15:18 +0900, Masami Hiramatsu wrote:
> > Peter Zijlstra wrote:
> > > On Wed, 2010-08-04 at 10:45 -0400, Mathieu Desnoyers wrote:
> > > 
> > >> How do you plan to read the data concurrently with the writer overwriting the
> > >> data while you are reading it without corruption ?
> > > 
> > > I don't consider reading while writing (in overwrite mode) a valid case.
> > > 
> > > If you want to use overwrite, stop the writer before reading it.
> > 
> > For example, would you like to read system audit log always after
> > stop the audit?
> > 
> > NO, that's a most important requirement for tracers, especially for
> > system admins (they're the most important users of Linux) to check
> > the system health and catch system troubles.
> > 
> > For performance measurement and checking hotspot, one-shot tracing
> > is enough. But it's just for developers. But for the real world
> > computing, Linux is just an OS, users want to run their system,
> > middleware and applications, without troubles. But when they hit
> > a trouble, they wanna shoot it ASAP.
> > The flight recorder mode is mainly for those users.
> 
> You cannot over-write and consistently read the buffer, that's plain
> impossible. With sub-buffers you can swivel a sub-buffer and
> consistently read that, but there is no guarantee the next sub-buffer
> you steal was indeed adjacent to the previous buffer you stole as that
> might have gotten over-written by the active writer while you were
> stealing the previous one.
> 
> If you want to snapshot buffers, do that, simply swivel the whole trace
> buffer, and continue tracing in a new one, then consume the old trace in
> a consistent manner.
> 
> I really see no value in being able to read unrelated bits and pieces of
> a buffer.



It all depends on the frequency of your events and on the amount of memory
used for the buffer.

If you are tracing syscalls on a semi-idle box with a ring buffer of 500 MB
per cpu, you really don't care about the writer catching up with the reader: it
will simply not happen.

OTOH if you are tracing function graphs, no buffer size will ever be enough:
the writer will always be faster and catch up with the reader.

Using the sub-buffer scheme though, and allowing a concurrent writer and reader
in overwrite mode, we can easily tell the user about the writer being
faster and content having been lost. On top of this information, the
user can choose what to do: try with a larger buffer, or so.

See? It's not our role to say: the result might be unreliable if the user
uses silly settings (not enough memory, a reader too slow for random reasons,
too high frequency events, or so...). Let the user deal with that and just
inform him about unreliable results. This is what ftrace does currently.

Also the snapshot thing doesn't look like a replacement. If you are
tracing on a low-memory embedded system, you consume a lot of memory
to keep the snapshot alive, which means the live buffer can be critically
lowered and you might in turn lose traces there.
That said it's an interesting feature that may fit on other kind of
environments or for other needs.


Off-topic: it's sad that, when it comes to tracing, we often have to figure out
the needs of the embedded world from indirect sources. In the end we rarely
hear from them directly, except maybe at confs....



* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-04  6:27                                 ` Peter Zijlstra
  2010-08-04 14:06                                   ` Mathieu Desnoyers
@ 2010-08-11 14:34                                   ` Steven Rostedt
  2010-08-15 13:35                                     ` Mathieu Desnoyers
  2010-08-15 16:33                                     ` Avi Kivity
  1 sibling, 2 replies; 168+ messages in thread
From: Steven Rostedt @ 2010-08-11 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Mathieu Desnoyers, Frederic Weisbecker,
	Ingo Molnar, LKML, Andrew Morton, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

Egad! Go on vacation and the world falls apart.

On Wed, 2010-08-04 at 08:27 +0200, Peter Zijlstra wrote:
> On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > FWIW I really utterly detest the whole concept of sub-buffers.
> > 
> > I'm not quite sure why. Is it something fundamental, or just an
> > implementation issue?
> 
> The sub-buffer thing that both ftrace and lttng have is creating a large
> buffer from a lot of small buffers, I simply don't see the point of
> doing that. It adds complexity and limitations for very little gain.

So, I want to allocate a 10Meg buffer. I need to make sure the kernel
has 10megs of memory available. If the memory is quite fragmented, then
too bad, I lose out.

Oh wait, I could also use vmalloc. But then again, now I'm blasting
valuable TLB entries for a tracing utility, thus making the tracer have
an even bigger impact on the entire system.

BAH!

I originally wanted to go with the contiguous buffer, but I was
convinced, after trying to implement it, that it was a bad choice.
Specifically, because of needing to 1) get large amounts of memory that
are contiguous, or 2) eat up TLB entries and cause the system to
perform more poorly.

I chose page size "sub-buffers" to solve the above. It also made
implementing splice trivial. OK, I admit, I never thought about mmapping
the buffers, just because I figured splice was faster. But I do have
patches that allow a user to mmap the entire ring buffer, but only in a
"producer/consumer" mode.

Note, I use page size sub-buffers, but the design could work with any
size sub-buffers. I just never implemented that (even though, when I
wrote the code it was secretly on my todo list).


> 
> Their benefit is known synchronization points into the stream, you can
> parse each sub-buffer independently, but you can always break up a
> continuous stream into smaller parts or use a transport that includes
> index points or whatever.
> 
> Their down side is that you can never have individual events larger than
> the sub-buffer, you need to be aware of the sub-buffer when reserving
> space etc..

The answer to that is to make a macro to do the assignment of the event,
and add a new API.

	event = ring_buffer_reserve_unlimited();

	ring_buffer_assign(event, data1);
	ring_buffer_assign(event, data2);

	ring_buffer_commit(event);

The ring_buffer_reserve_unlimited() could reserve a bunch of space
beyond one ring buffer. It could reserve data in fragments. Then the
ring_buffer_assign() could either copy directly to the event (if the
event fits on one sub-buffer) or do a piecewise copy if the space was fragmented.

Of course, userspace would need to know how to read it. And it can get
complex due to interrupts coming in and also reserving between
fragments, or what happens if a partial fragment is overwritten. But all
these are not impossible to solve.
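
A rough sketch of what such a cross-fragment assign could look like; the types
and the length parameter are hypothetical additions to the API sketched above,
and interrupt/overwrite handling is left out:

#include <stddef.h>
#include <string.h>

#define MAX_FRAGS	8

struct rb_frag {
	void	*addr;		/* start of this reserved fragment */
	size_t	len;		/* bytes reserved in it */
};

struct rb_event {
	struct rb_frag	frag[MAX_FRAGS];
	unsigned int	nr_frags;
	unsigned int	cur;		/* fragment currently being filled */
	size_t		cur_off;	/* write offset within that fragment */
};

/*
 * Copy @len bytes into the reserved space, crossing fragment boundaries
 * when the reservation did not fit in a single sub-buffer.  Assumes the
 * reservation is large enough for all assigned data.
 */
static void ring_buffer_assign(struct rb_event *ev, const void *data, size_t len)
{
	const char *src = data;

	while (len) {
		struct rb_frag *f = &ev->frag[ev->cur];
		size_t room = f->len - ev->cur_off;
		size_t n = len < room ? len : room;

		memcpy((char *)f->addr + ev->cur_off, src, n);
		src += n;
		len -= n;
		ev->cur_off += n;
		if (ev->cur_off == f->len) {	/* fragment full, move to the next */
			ev->cur++;
			ev->cur_off = 0;
		}
	}
}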

-- Steve





* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-06 14:13                                           ` Mathieu Desnoyers
@ 2010-08-11 14:44                                             ` Steven Rostedt
  0 siblings, 0 replies; 168+ messages in thread
From: Steven Rostedt @ 2010-08-11 14:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Linus Torvalds, Frederic Weisbecker, Ingo Molnar,
	LKML, Andrew Morton, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

On Fri, 2010-08-06 at 10:13 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (peterz@infradead.org) wrote:

> Less code = less instruction cache overhead. I've also shown that the LTTng code
> is at least twice faster. In terms of complexity, it is not much more complex; I
> also took the extra care of doing the formal proofs to make sure the
> corner-cases were dealt with, which I don't reckon neither Steven nor yourself
> have done.

Yes Mathieu, you did a formal proof. Good for you. But honestly, it is
starting to get very annoying to hear you constantly stating that,
because, to most kernel developers, it is meaningless. Any slight
modification of your algorithm renders the proof invalid.

You are not the only one that has done a proof to an algorithm in the
kernel, but you are definitely the only one that constantly reminds
people that you have done so. Congrats on your PhD, and in academia,
proofs are important.

But this is a ring buffer, not a critical part of the workings of the
kernel. There are much more critical and fragile parts of the kernel
that work fine without a formal proof.

Paul McKenney did a proof for RCU not for us, but just to help give him
a warm fuzzy about it. RCU is much more complex than the ftrace ring
buffer, and it also is much more critical. If Paul gets it wrong, a
machine will crash. He's right to worry. And even Paul told me that no
formal proof makes up for large scale testing. Which BTW, the ftrace
ring buffer has gone through.

Someday I may go ahead and do that proof, but I did do a very intensive
state diagram, and I'm quite confident that it works. It's been deployed
for quite a bit, and the design has yet to be a factor in any bug report
of the ring buffer.

-- Steve




* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-11 14:34                                   ` Steven Rostedt
@ 2010-08-15 13:35                                     ` Mathieu Desnoyers
  2010-08-15 16:33                                     ` Avi Kivity
  1 sibling, 0 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-15 13:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Linus Torvalds, Frederic Weisbecker, Ingo Molnar,
	LKML, Andrew Morton, Thomas Gleixner, Christoph Hellwig,
	Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler, Tejun Heo

* Steven Rostedt (rostedt@goodmis.org) wrote:
> Egad! Go on vacation and the world falls apart.
> 
> On Wed, 2010-08-04 at 08:27 +0200, Peter Zijlstra wrote:
> > On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > FWIW I really utterly detest the whole concept of sub-buffers.
> > > 
> > > I'm not quite sure why. Is it something fundamental, or just an
> > > implementation issue?
> > 
> > The sub-buffer thing that both ftrace and lttng have is creating a large
> > buffer from a lot of small buffers, I simply don't see the point of
> > doing that. It adds complexity and limitations for very little gain.
> 
> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
> has 10megs of memory available. If the memory is quite fragmented, then
> too bad, I lose out.
> 
> Oh wait, I could also use vmalloc. But then again, now I'm blasting
> valuable TLB entries for a tracing utility, thus making the tracer have
> a even bigger impact on the entire system.
> 
> BAH!
> 
> I originally wanted to go with the continuous buffer, but I was
> convinced after trying to implement it, that it was a bad choice.
> Specifically, because of needing to 1) get large amounts of memory that
> is continuous, or 2) eating up TLB entries and causing the system to
> perform poorer.
> 
> I chose page size "sub-buffers" to solve the above. It also made
> implementing splice trivial. OK, I admit, I never thought about mmapping
> the buffers, just because I figured splice was faster. But I do have
> patches that allow a user to mmap the entire ring buffer, but only in a
> "producer/consumer" mode.

FYI: the generic ring buffer also implements the mmap() interface for the flight
recorder mode.

> 
> Note, I use page size sub-buffers, but the design could work with any
> size sub-buffers. I just never implemented that (even though, when I
> wrote the code it was secretly on my todo list).

The main difference between our designs is that Ftrace uses a linked list and the
generic ring buffer lib. uses a sub-buffer/page table. Considering the use-case
of reading available flight recorder pages in reverse order I've heard about at
LinuxCon (heard about it both from Peter and Masami, and it actually makes
a whole lot of sense, because the data we care about the most and want to read
ASAP is in the last sub-buffers), I think the page table is more appropriate (and
flexible) than a single-direction linked list, because it allows us to pick a
random page (or sub-buffer) in the buffer without walking over all pages.
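
A toy illustration of why the table layout helps for that access pattern
(hypothetical types, not the actual ftrace or generic ring buffer structures):

struct subbuf {
	struct subbuf	*next;		/* only used by the list layout */
	/* ... sub-buffer payload ... */
};

/* Linked-list layout: reaching the i-th sub-buffer means walking i links. */
static struct subbuf *list_nth(struct subbuf *head, unsigned long i)
{
	while (i--)
		head = head->next;
	return head;
}

/* Table layout: the i-th newest sub-buffer is a single array lookup,
 * which is exactly what reading the most recent data first needs. */
static struct subbuf *table_nth_from_end(struct subbuf **tbl,
					 unsigned long nr, unsigned long i)
{
	return tbl[nr - 1 - i];
}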

> 
> 
> > 
> > Their benefit is known synchronization points into the stream, you can
> > parse each sub-buffer independently, but you can always break up a
> > continuous stream into smaller parts or use a transport that includes
> > index points or whatever.
> > 
> > Their down side is that you can never have individual events larger than
> > the sub-buffer, you need to be aware of the sub-buffer when reserving
> > space etc..
> 
> The answer to that is to make a macro to do the assignment of the event,
> and add a new API.
> 
> 	event = ring_buffer_reserve_unlimited();
> 
> 	ring_buffer_assign(event, data1);
> 	ring_buffer_assign(event, data2);
> 
> 	ring_buffer_commit(event);
> 
> The ring_buffer_reserve_unlimited() could reserve a bunch of space
> beyond one ring buffer. It could reserve data in fragments. Then the
> ring_buffer_assign() could either copy directly to the event (if the
> event fits on one sub-buffer) or do a piecewise copy if the space was fragmented.
> 
> Of course, userspace would need to know how to read it. And it can get
> complex due to interrupts coming in and also reserving between
> fragments, or what happens if a partial fragment is overwritten. But all
> these are not impossible to solve.

Dealing with fragmentation, sub-buffer loss, etc. is then pushed up to the trace
analyzer. While I agree that we have to keep the burden of complexity out of the
kernel as much as possible, I also think that an elegant design at the data
producer level which keeps the trace reader/analyzer simple and reliable should
be favored if it keeps a similar level of complexity in the kernel code.

A good argument supporting this is that some tracing users want to use a mmap()
scheme directly on the trace buffers to analyze the data directly in user-space
on the traced machine. In these cases, the complexity/overhead added to the
analyzer will impact the overall performance of the system being traced.

Thanks,

Mathieu

> 
> -- Steve
> 
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-11 14:34                                   ` Steven Rostedt
  2010-08-15 13:35                                     ` Mathieu Desnoyers
@ 2010-08-15 16:33                                     ` Avi Kivity
  2010-08-15 16:44                                       ` Mathieu Desnoyers
  1 sibling, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-08-15 16:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Linus Torvalds, Mathieu Desnoyers,
	Frederic Weisbecker, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

  On 08/11/2010 05:34 PM, Steven Rostedt wrote:
>
> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
> has 10megs of memory available. If the memory is quite fragmented, then
> too bad, I lose out.

With memory compaction, the cpu churns for a while, then you have your 
buffer.  Of course there's still no guarantee, just a significantly 
higher probability of success.

> Oh wait, I could also use vmalloc. But then again, now I'm blasting
> valuable TLB entries for a tracing utility, thus making the tracer have
> an even bigger impact on the entire system.

Most trace entries will occupy much less than a page, and are accessed 
sequentially, so I don't think this will have a large impact.

-- 
error compiling committee.c: too many arguments to function



* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-15 16:33                                     ` Avi Kivity
@ 2010-08-15 16:44                                       ` Mathieu Desnoyers
  2010-08-15 16:51                                         ` Avi Kivity
  0 siblings, 1 reply; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-15 16:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Steven Rostedt, Peter Zijlstra, Linus Torvalds,
	Frederic Weisbecker, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

* Avi Kivity (avi@redhat.com) wrote:
>  On 08/11/2010 05:34 PM, Steven Rostedt wrote:
>>
>> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
>> has 10megs of memory available. If the memory is quite fragmented, then
>> too bad, I lose out.
>
> With memory compaction, the cpu churns for a while, then you have your  
> buffer.  Of course there's still no guarantee, just a significantly  
> higher probability of success.

The bigger the buffers, the lower the probabilities of success are. My users
often allocate buffers as large as a few GB per cpu. Relying on compaction does
not seem like a viable solution in this case.

>
>> Oh wait, I could also use vmalloc. But then again, now I'm blasting
>> valuable TLB entries for a tracing utility, thus making the tracer have
>> an even bigger impact on the entire system.
>
> Most trace entries will occupy much less than a page, and are accessed  
> sequentially, so I don't think this will have a large impact.

You seem to underestimate the frequency at which trace events can be generated.
E.g., by the time you run the scheduler once (which we can consider a very hot
kernel path), some tracing modes will generate thousands of events, which will
touch a very significant amount of TLB entries.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-15 16:44                                       ` Mathieu Desnoyers
@ 2010-08-15 16:51                                         ` Avi Kivity
  2010-08-15 18:31                                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 168+ messages in thread
From: Avi Kivity @ 2010-08-15 16:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Peter Zijlstra, Linus Torvalds,
	Frederic Weisbecker, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

  On 08/15/2010 07:44 PM, Mathieu Desnoyers wrote:
> * Avi Kivity (avi@redhat.com) wrote:
>>   On 08/11/2010 05:34 PM, Steven Rostedt wrote:
>>> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
>>> has 10megs of memory available. If the memory is quite fragmented, then
>>> too bad, I lose out.
>> With memory compaction, the cpu churns for a while, then you have your
>> buffer.  Of course there's still no guarantee, just a significantly
>> higher probability of success.
> The bigger the buffers, the lower the probabilities of success are. My users
> often allocate buffers as large as a few GB per cpu. Relying on compaction does
> not seem like a viable solution in this case.

Wow.  Even if you could compact that much memory, it would take quite a 
bit of time.

>>> Oh wait, I could also use vmalloc. But then again, now I'm blasting
>>> valuable TLB entries for a tracing utility, thus making the tracer have
>>> an even bigger impact on the entire system.
>> Most trace entries will occupy much less than a page, and are accessed
>> sequentially, so I don't think this will have a large impact.
> You seem to underestimate the frequency at which trace events can be generated.
> E.g., by the time you run the scheduler once (which we can consider a very hot
> kernel path), some tracing modes will generate thousands of events, which will
> touch a very significant amount of TLB entries.

Let's say a trace entry occupies 40 bytes and a TLB miss costs 200 
cycles on average.  So we have 100 entries per page costing 200 cycles; 
amortized each entry costs 2 cycles.
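
Spelled out (illustrative arithmetic only, using the assumed numbers above):

#include <stdio.h>

int main(void)
{
	double entry_bytes = 40.0;	/* assumed trace entry size */
	double miss_cycles = 200.0;	/* assumed average TLB miss cost */
	double page_bytes  = 4096.0;

	double entries_per_page = page_bytes / entry_bytes;	  /* ~102 */
	double cycles_per_entry = miss_cycles / entries_per_page; /* ~2 */

	printf("%.0f entries/page -> %.2f cycles/entry\n",
	       entries_per_page, cycles_per_entry);
	return 0;
}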

There's an additional cost caused by the need to re-fill the TLB later, 
but you incur that anyway if the scheduler caused a context switch.

Of course, my assumptions may be completely off (likely larger entries 
but smaller miss costs).  Has a vmalloc based implementation been 
tested?  It seems so much easier than the other alternatives.

-- 
error compiling committee.c: too many arguments to function



* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-15 16:51                                         ` Avi Kivity
@ 2010-08-15 18:31                                           ` Mathieu Desnoyers
  2010-08-16 10:49                                             ` Avi Kivity
  2010-08-16 11:29                                             ` Avi Kivity
  0 siblings, 2 replies; 168+ messages in thread
From: Mathieu Desnoyers @ 2010-08-15 18:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Steven Rostedt, Peter Zijlstra, Linus Torvalds,
	Frederic Weisbecker, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

* Avi Kivity (avi@redhat.com) wrote:
>  On 08/15/2010 07:44 PM, Mathieu Desnoyers wrote:
>> * Avi Kivity (avi@redhat.com) wrote:
>>>   On 08/11/2010 05:34 PM, Steven Rostedt wrote:
>>>> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
>>>> has 10megs of memory available. If the memory is quite fragmented, then
>>>> too bad, I lose out.
>>> With memory compaction, the cpu churns for a while, then you have your
>>> buffer.  Of course there's still no guarantee, just a significantly
>>> higher probability of success.
>> The bigger the buffers, the lower the probabilities of success are. My users
>> often allocate buffers as large as a few GB per cpu. Relying on compaction does
>> not seem like a viable solution in this case.
>
> Wow.  Even if you could compact that much memory, it would take quite a  
> bit of time.

Yep.

>
>>>> Oh wait, I could also use vmalloc. But then again, now I'm blasting
>>>> valuable TLB entries for a tracing utility, thus making the tracer have
>>>> an even bigger impact on the entire system.
>>> Most trace entries will occupy much less than a page, and are accessed
>>> sequentially, so I don't think this will have a large impact.
>> You seem to underestimate the frequency at which trace events can be generated.
>> E.g., by the time you run the scheduler once (which we can consider a very hot
>> kernel path), some tracing modes will generate thousands of events, which will
>> touch a very significant amount of TLB entries.
>
> Let's say a trace entry occupies 40 bytes and a TLB miss costs 200  
> cycles on average.  So we have 100 entries per page costing 200 cycles;  
> amortized each entry costs 2 cycles.

A quick test (shown below) gives the cost of a TLB miss on the Intel Xeon E5404:

Number of cycles added over test baseline:

tlb and cache hit:            12.42
tlb hit, l2 hit, l1 miss:     17.88
tlb hit, l2+l1 miss:          32.34
tlb and cache miss:          449.58

So it's closer to 500 cycles per TLB miss.

Also, your analysis does not seem to correctly represent the reality of the TLB
thrashing cost. On a workload walking over a large number of random pages (e.g. a
large hash table) all the time, eating just a few more TLB entries will impact
the number of misses over the entire workload.

So it's not so much the misses that we see at the tracing site that are the problem,
but also the extra misses taken by the application, caused by the extra pressure
on the TLB. So just a few more TLB entries taken by the tracer will likely hurt
these workloads.

>
> There's an additional cost caused by the need to re-fill the TLB later,  
> but you incur that anyway if the scheduler caused a context switch.

The performance hit is not taken if the scheduler schedules another thread with
the same mapping, only when it schedules a different process.

>
> Of course, my assumptions may be completely off (likely larger entries  
> but smaller miss costs).

Depending on the tracer design, the avg. event size can range from 12 bytes
(lttng is very aggressive in event size compaction) to about 40 bytes (perf); so
for this you are mostly right. However, as explained above, the TLB miss cost is
higher than you expected.

>  Has a vmalloc based implementation been  
> tested?  It seems so much easier than the other alternatives.

I tested it in the past, and must admit that I changed from a vmalloc-based
implementation to page-based using software cross-page write primitives based on
feedback from Steven and Ingo. Diminishing TLB thrashing seemed like a good
approach, and using vmalloc on 32-bit machines is a pain, because users have to
tweak the vmalloc region size at boot. So all in all, I moved to a vmalloc-less
implementation without much more thought.

If you feel we should test the performance of both approaches, we could do it in
the generic ring buffer library (it allows both types of allocation backends).
However, we'd have to find the right type of TLB-thrashing real-world workload to
have meaningful results. This might be the hardest part.

Thanks,

Mathieu

# tlbmiss.c
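
/*
 * Microbenchmark behind the numbers quoted above.  Each test first dirties
 * one byte in 262144 randomly chosen pages of a 1 GB array, which evicts
 * the test page from the TLB and caches, then times a single store to the
 * test page under four conditions: everything hot, L1 miss only, L1+L2
 * miss, and TLB+cache miss.  Reported values are cycles added over an
 * empty-timing baseline.  The rdtsc/cpuid wrappers below are written for
 * 32-bit x86.
 */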

#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

typedef unsigned long long cycles_t;

#define barrier() __asm__ __volatile__("": : :"memory")

/*
 * Serialize core instruction execution. Also acts as a compiler barrier.
 * On PIC ebx cannot be clobbered
 */
#ifdef __PIC__
#define sync_core()						      \
       asm volatile("push %%ebx; cpuid; pop %%ebx"		       \
		    : : : "memory", "eax", "ecx", "edx");
#endif
#ifndef __PIC__
#define sync_core()						      \
       asm volatile("cpuid" : : : "memory", "eax", "ebx", "ecx", "edx");
#endif

#define mb()	asm volatile("mfence":::"memory")
#define smp_mb()	mb()

#define rdtsc(low,high) \
     __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))

#define rdtscl(low) \
     __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")

#define rdtscll(val) \
     __asm__ __volatile__("rdtsc" : "=A" (val))


#define mb()  asm volatile("mfence":::"memory")

static inline cycles_t get_cycles_sync(void)
{
	unsigned long long ret = 0;

	smp_mb();
	sync_core();
	rdtscll(ret);
	sync_core();
	smp_mb();
	return ret;
}

#define PAGE_SIZE 4096ULL	/* 4k */
#define L1_CACHELINE_SIZE 64
#define L2_CACHELINE_SIZE 128
#define ARRAY_SIZE 262144ULL /* 1 GB */

static char testpage[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));

static unsigned int idx[ARRAY_SIZE];

#define NR_TESTS 100

int main(int argc, char **argv)
{
	struct timeval tv;
	struct timezone tz;
	cycles_t time1, time2;
	double cycles_per_iter;
	unsigned int i, j;
	pid_t pid;
	char *array;
	double baseline;

	printf("number of tests : %lu\n", NR_TESTS);

	srandom(get_cycles_sync());

	array = malloc(sizeof(char) * ARRAY_SIZE * PAGE_SIZE);
	if (!array) {
		perror("malloc");
		return 1;
	}

	for (i=0; i<ARRAY_SIZE; i++)
		idx[i] = random() % ARRAY_SIZE;

	testpage[0] = 1;

	printf("Nothing (baseline)\n");
	cycles_per_iter = 0.0;
	for (i=0; i<NR_TESTS; i++) {
		for (j=0; j<ARRAY_SIZE; j++)
			array[idx[j] * PAGE_SIZE] = 1;
		testpage[0] = 1;
		time1 = get_cycles_sync();
		time2 = get_cycles_sync();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;

	baseline = (double) cycles_per_iter;
	printf("Baseline takes %g cycles\n", baseline);

	printf("TLB and caches hit\n");
	cycles_per_iter = 0.0;
	for (i=0; i<NR_TESTS; i++) {
		for (j=0; j<ARRAY_SIZE; j++)
			array[idx[j] * PAGE_SIZE] = 1;
		testpage[0] = 1;
		time1 = get_cycles_sync();
		testpage[0] = 1;
		time2 = get_cycles_sync();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;

	printf("tlb and cache hit %g cycles (adds %g)\n",
					(double) cycles_per_iter,
					(double) cycles_per_iter - baseline);

	printf("TLB hit, l2 cache hit, l1 cache miss\n");
	cycles_per_iter = 0.0;
	for (i=0; i<NR_TESTS; i++) {
		for (j=0; j<ARRAY_SIZE; j++)
			array[idx[j] * PAGE_SIZE] = 1;
		testpage[0] = 1;
		time1 = get_cycles_sync();
		testpage[L1_CACHELINE_SIZE] = 1;
		time2 = get_cycles_sync();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;

	printf("tlb hit, l2 hit, l1 miss %g cycles (adds %g)\n",
					(double) cycles_per_iter,
					(double) cycles_per_iter - baseline);

	printf("TLB hit, l2 cache miss, l1 cache miss\n");
	cycles_per_iter = 0.0;
	for (i=0; i<NR_TESTS; i++) {
		for (j=0; j<ARRAY_SIZE; j++)
			array[idx[j] * PAGE_SIZE] = 1;
		testpage[0] = 1;
		time1 = get_cycles_sync();
		testpage[L2_CACHELINE_SIZE] = 1;
		time2 = get_cycles_sync();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;

	printf("tlb hit,l2+l1 miss %g cycles (adds %g)\n",
					(double) cycles_per_iter,
					(double) cycles_per_iter - baseline);

	printf("TLB and cache miss\n");
	cycles_per_iter = 0.0;
	for (i=0; i<NR_TESTS; i++) {
		for (j=0; j<ARRAY_SIZE; j++)
			array[idx[j] * PAGE_SIZE] = 1;
		time1 = get_cycles_sync();
		testpage[0] = 1;
		time2 = get_cycles_sync();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;

	printf("tlb and cache miss %g cycles (adds %g)\n",
					(double) cycles_per_iter,
					(double) cycles_per_iter - baseline);
	free(array);

	return 0;
}

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-15 18:31                                           ` Mathieu Desnoyers
@ 2010-08-16 10:49                                             ` Avi Kivity
  2010-08-16 11:29                                             ` Avi Kivity
  1 sibling, 0 replies; 168+ messages in thread
From: Avi Kivity @ 2010-08-16 10:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Peter Zijlstra, Linus Torvalds,
	Frederic Weisbecker, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

  On 08/15/2010 09:31 PM, Mathieu Desnoyers wrote:
>>>
>>> You seem to underestimate the frequency at which trace events can be generated.
>>> E.g., by the time you run the scheduler once (which we can consider a very hot
>>> kernel path), some tracing modes will generate thousands of events, which will
>>> touch a very significant amount of TLB entries.
>> Let's say a trace entry occupies 40 bytes and a TLB miss costs 200
>> cycles on average.  So we have 100 entries per page costing 200 cycles;
>> amortized each entry costs 2 cycles.
> A quick test (shown below) gives the cost of a TLB miss on the Intel Xeon E5404:
>
> Number of cycles added over test baseline:
>
> tlb and cache hit:            12.42
> tlb hit, l2 hit, l1 miss      17.88
> tlb hit,l2+l1 miss            32.34
> tlb and cache miss           449.58
>
> So it's closer to 500 per tlb miss.

The cache miss would not be avoided if the TLB was hit, so it should not 
be accounted as part of the costs (though a TLB miss will increase cache 
pressure).  Also, your test does not allow the cpu to pipeline anything; 
in reality, different workloads have different TLB miss costs:

- random reads (pointer chasing) incur almost the full impact since the 
processor is stalled
- sequential writes can be completely pipelined and suffer almost no impact

Even taking your numbers, it's still 5 cycles per trace entry.


> Also, your analysis does not seem to correctly represent reality of the TLB
> trashing cost. On a workload walking over a large number of random pages (e.g. a
> large hash table) all the time, eating just a few more TLB entries will impact
> the number of misses over the entire workload.

Let's say this doubles the impact.  So 10 cycles per trace entry.  Will 
a non-vmap solution cost less?


> So it's not much the misses that we see at the tracing site that is the problem,
> but also the extra misses taken by the application caused by the extra pressure
> on TLB. So just a few more TLB entries taken by the tracer will likely hurt
> these workloads.
>

I really think this should be benchmarked.

If the user workload thrashes the TLB, it should use huge pages itself, 
that will make it immune from kernel TLB thrashing and give it a nice 
boost besides.
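
For reference, one way a userspace workload can opt into huge pages for its hot
data (an illustrative sketch; MAP_HUGETLB requires hugepages reserved beforehand,
e.g. via /proc/sys/vm/nr_hugepages, and older setups need hugetlbfs instead):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB	0x40000		/* x86 value, in case the headers lack it */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* e.g. a 64 MB hash table */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	/* ... place the TLB-hungry data structure in p ... */
	munmap(p, len);
	return 0;
}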


>> There's an additional cost caused by the need to re-fill the TLB later,
>> but you incur that anyway if the scheduler caused a context switch.
> The performance hit is not taken if the scheduler schedules another thread with
> the same mapping, only when it schedules a different process.

True.

>> Of course, my assumptions may be completely off (likely larger entries
>> but smaller miss costs).
> Depending on the tracer design, the avg. event size can range from 12 bytes
> (lttng is very agressive in event size compaction) to about 40 bytes (perf); so
> for this you are mostly right. However, as explained above, the TLB miss cost is
> higher than you expected.

For the vmalloc area hit, it's lower.  For the user application, it may 
indeed be higher.

>>   Has a vmalloc based implementation been
>> tested?  It seems so much easier than the other alternatives.
> I tested it in the past, and must admit that I changed from a vmalloc-based
> implementation to a page-based one using software cross-page write primitives,
> based on feedback from Steven and Ingo. Reducing TLB thrashing seemed like a
> good approach, and using vmalloc on 32-bit machines is a pain, because users
> have to tweak the vmalloc region size at boot. So all in all, I moved to a
> vmalloc-less implementation without much more thought.
>
> If you feel we should test the performance of both approaches, we could do it in
> the generic ring buffer library (it allows both types of allocation backends).
> However, we'd have to find the right kind of TLB-thrashing real-world workload to
> have meaningful results. This might be the hardest part.
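
As a rough illustration of what those two backends look like in kernel
code (hypothetical helper names, not the generic ring buffer library's
actual API): the vmalloc backend hands back one virtually contiguous
area at the price of extra TLB entries, while the page-array backend
reuses the kernel's existing linear mapping and relies on cross-page
write primitives instead:

#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Backend (a): one virtually contiguous buffer; trivial to write into,
 * but its mappings compete for TLB entries. */
static void *buf_alloc_vmap(size_t size)
{
	return vmalloc(size);
}

/* Backend (b): an array of individually allocated pages; no extra TLB
 * pressure, but writes crossing a page boundary must be split. */
static struct page **buf_alloc_pages(unsigned int nr_pages)
{
	struct page **pages;
	unsigned int i;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;
	for (i = 0; i < nr_pages; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i]) {
			while (i--)
				__free_page(pages[i]);
			kfree(pages);
			return NULL;
		}
	}
	return pages;
}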

SPECjbb is a well-known TLB-intensive workload, known to benefit greatly
from large pages.

<snip test>

For a similar test see http://people.redhat.com/akivity/largepage.c.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 1/2] x86_64 page fault NMI-safe
  2010-08-15 18:31                                           ` Mathieu Desnoyers
  2010-08-16 10:49                                             ` Avi Kivity
@ 2010-08-16 11:29                                             ` Avi Kivity
  1 sibling, 0 replies; 168+ messages in thread
From: Avi Kivity @ 2010-08-16 11:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Peter Zijlstra, Linus Torvalds,
	Frederic Weisbecker, Ingo Molnar, LKML, Andrew Morton,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler, Tejun Heo

  On 08/15/2010 09:31 PM, Mathieu Desnoyers wrote:
>
> I tested it in the past, and must admit that I changed from a vmalloc-based
> implementation to a page-based one using software cross-page write primitives,
> based on feedback from Steven and Ingo. Reducing TLB thrashing seemed like a
> good approach, and using vmalloc on 32-bit machines is a pain, because users
> have to tweak the vmalloc region size at boot. So all in all, I moved to a
> vmalloc-less implementation without much more thought.


Forgot to comment about the i386 issue - that really is a blocker if you 
absolutely need to support large trace buffers on 32-bit machines.  I 
would urge all those people to move to x86_64 and be done with it, but I 
don't know all the use cases.

It's possible to hack this to work by having a private mm_struct and 
switching to it temporarily, but it will be horribly slow.
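
A minimal sketch of that hack, assuming a kernel-thread context where
use_mm()/unuse_mm() are legal and a hypothetical 'trace_mm' that carries
a mapping of buffers too large for the normal i386 kernel address space;
the address-space switch (and the TLB flushes it implies) around every
access is exactly why this would be horribly slow:

#include <linux/mmu_context.h>
#include <linux/string.h>

/* Hypothetical private mm set up elsewhere; the trace buffers are
 * mapped only inside this address space. */
static struct mm_struct *trace_mm;

/* 'dst' is an address that is valid in trace_mm, not in current->mm. */
static void trace_write_via_private_mm(void *dst, const void *event, size_t len)
{
	use_mm(trace_mm);		/* temporarily adopt the private mm */
	memcpy(dst, event, len);
	unuse_mm(trace_mm);		/* switch back; the TLBs take the hit */
}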

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:48   ` Jeffrey Merkey
@ 2010-07-16 22:53     ` Jeffrey Merkey
  0 siblings, 0 replies; 168+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

>
> Well, the way I handled this problem on NetWare SMP and that other
> kernel was to create a pool of TSS descriptors and reload one during
> each exception to swap stacks before any handlers were called.  That
> allowed it to nest until I ran out of TSS descriptors (64 levels).
> Not sure that's the way to go here, but it worked in that case.
>
> Jeff
>

Here is where that old dusty code lives these days - it deals with this problem.

http://open-source-netware.googlecode.com/files/manos-06-26-2010.tar.gz

file to look at is startup.386

;
;   nmi entry code
;

nmi_entry     macro
	cli
	push    ebx
	push    ebp
	mov     ebp, esp
	sub     ebp, SIZE TaskStateSegment
	mov     ebx, ebp

	mov     [ebp].tSS, ss
	mov     [ebp].tGS, gs         ; save segment registers
	mov     [ebp].tFS, fs
	mov     [ebp].tES, es
	mov     [ebp].tDS, ds
	pop     [ebp].tEBP
	mov     [ebp].tEDI, edi
	mov     [ebp].tESI, esi
	mov     [ebp].tEDX, edx
	mov     [ebp].tECX, ecx
	pop     [ebp].tEBX
	mov     [ebp].tEAX, eax

	pop     [ebp].tEIP            ; remove return address
	pop     eax
	mov     [ebp].tCS, ax
	pop     [ebp].tSystemFlags    ; get flags into TSS

	mov     [ebp].tESP, esp       ; save true stack address
	mov     esp, ebx            ; cover stack frame

	mov     eax, CR0
	and     eax, 0FFFFFFF7h     ; clear task switch bit in CR0 to
	mov     CR0, eax            ; avoid NPX exceptions

	xor	eax, eax
	mov	dr7, eax            ; disable breakpoints

	mov     eax, CR3            ;
	mov     [ebp].tCR3, eax     ;
	mov     eax, DebuggerPDE
	mov     CR3, eax

	;
	;   if we do not clear the NESTED_TASK_FLAG, then the IRET
	;   at the end of this function will cause
	;   an invalid TSS exception to be generated because the
	;   task busy bit was cleared earlier
	;

	pushfd
	and	dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
	or	dword ptr [esp], RESUME_FLAG
	popfd

	mov     eax, 0FFFFFFFFh    ; mark as a non-pooled TSS exception
	push    eax

	push    0
	push    0
	push    ebp

	endm

;
;   TSS entry code
;


task_entry     macro
	LOCAL   @TSSNotNested, @NoLTR
	LOCAL   @UsedDefaultSegment
	LOCAL   @UsedPooledSegment
	LOCAL   @EnterTheDebugger

	cli
	xor    eax, eax
	str    ax
	mov    esi, offset SystemGDTTable
	mov    esi, dword ptr [esi + 2]
	lea    ebx, [esi + eax]
	mov    al, [ebx].TSSBase2
	mov    ah, [ebx].TSSBase3
	shl    eax, 16
	mov    ax, [ebx].TSSBase1

	;
	;  eax -> TSS Segment (Current)
	;  ebx -> TSS Descriptor (Current)
	;

	movzx  ecx, word ptr [eax].tBackLink
	or     ecx, ecx
	jz     @TSSNotNested

	mov    esi, offset SystemGDTTable
	mov    esi, dword ptr [esi + 2]
	lea    edx, [esi + ecx]
	mov    cl, [edx].TSSBase2
	mov    ch, [edx].TSSBase3
	shl    ecx, 16
	mov    cx, [edx].TSSBase1

	mov    ebp, ecx

	;
	;  edx -> TSS Descriptor (Previous)
	;  ebp -> TSS Segment (Previous)
	;
	;  clear busy state and reset TSS
	;

	mov     [edx].TSSType, 10001001b

@TSSNotNested:
	mov     [ebx].TSSType, 10001001b

	lgdt    ds: SystemGDTTable     ; reset GDT TSS Busy bit

	movzx   eax, word ptr [eax].tBackLink
	or      eax, eax
	jz      @NoLTR

	ltr     ax

@NoLTR:

	mov     eax, CR0
	and     eax, 0FFFFFFF7h     ; clear task switch bit in CR0 to
	mov     CR0, eax            ; avoid NPX exceptions

	xor	eax, eax
	mov	dr7, eax            ; disable breakpoints

	pushfd
	and	dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
	or	dword ptr [esp], RESUME_FLAG
	popfd

	push    ebp
	call    AllocPooledResource
	pop     ebp

	or      eax, eax
	jz      @UsedDefaultSegment

	lea     ebp, [eax].TSSSegment
	mov     esp, [eax].StackTop

	push    eax                   ; push address of pooled resource
	jmp     @UsedPooledSegment

@UsedDefaultSegment:
	mov     eax, 0FFFFFFFFh       ; push non-pooled marker onto the stack
	push    eax

@UsedPooledSegment:

	push    0
	mov     eax, CR2    ; get fault address
	push    eax
	push    ebp         ;  pass the TSS

	endm

;
;  TSS exit code
;

Jeff

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:22 ` Linus Torvalds
  2010-07-16 22:48   ` Jeffrey Merkey
@ 2010-07-16 22:50   ` Jeffrey Merkey
  1 sibling, 0 replies; 168+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other thigs. Random percpu datastructures used for tracing, kernel
> pointer verification code, yadda yadda.
>
>                  Linus
>

I guess I am a wimp then ... :-)

Jeff

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:22 ` Linus Torvalds
@ 2010-07-16 22:48   ` Jeffrey Merkey
  2010-07-16 22:53     ` Jeffrey Merkey
  2010-07-16 22:50   ` Jeffrey Merkey
  1 sibling, 1 reply; 168+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, Jul 16, 2010 at 4:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <jeffmerkey@gmail.com> wrote:
>>
>> So Linus, my understanding of Intel's processor design is that the
>> processor will NEVER singal a nested NMI until it sees an iret from
>> the first NMI exception.
>
> Wrong.
>
> I like x86, but it has warts. The NMI blocking is one of them.
>
> The NMI's will be blocked until the _next_ "iret", but there is no
> nesting. So if you take a fault during the NMI (debug, page table
> fixup, whatever), the iret in the fault handler will re-enable NMI's
> even though we're still busy with the original NMI. There is no
> nesting, or any way to say that "this is a NMI-releasing iret". They
> could even do it still - make a new "iret that doesn't clear NMI" by
> adding a segment override prefix to iret or whatever. But it's not
> going to happen, and it's just one of those ugly special cases that
> has various historical reasons (recursive faults during NMI sure as
> hell didn't make sense back in the real-mode 8086 days).
>
> So we have to handle it in software. Or not ever trap at all inside
> the NMI handler.
>
> The original patch - and the patch I detest - is to make the normal
> fault paths use a "popf + ret" to emulate iret, but without the NMI
> release.
>
> Now, I could live with that if it's the only solution, but it _is_
> pretty damn ugly.
>
> If somebody shows that it's actually faster to do "popf + ret" when
> returning to kernel space (a poor man's special-case iret), maybe it
> would be worth it, but the really critical code sequence is actually
> not "return to kernel space", but the "return to user space" case that
> really wants the iret. And I just think it's disgusting to add extra
> tests to that path.
>
> The other alternative would be to just make the rule be "NMI can never
> take traps". It's possible to do that, but quite frankly, it's a pain.
> It's a pain for page faults due to the whole vmalloc thing, and it's a
> pain if you ever want to debug an NMI in any way (or put a breakpoint
> on anything that is accessed from an NMI, which could potentially be
> quite a lot of things).
>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other thigs. Random percpu datastructures used for tracing, kernel
> pointer verification code, yadda yadda.
>
>                  Linus
>

Well, the way I handled this problem on NetWare SMP and that other
kernel was to create a pool of TSS descriptors and reload one during
each exception to swap stacks before any handlers were called.  That
allowed it to nest until I ran out of TSS descriptors (64 levels).
Not sure that's the way to go here, but it worked in that case.

Jeff

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:02 [patch 2/2] x86 NMI-safe INT3 and Page Fault Jeffrey Merkey
@ 2010-07-16 22:22 ` Linus Torvalds
  2010-07-16 22:48   ` Jeffrey Merkey
  2010-07-16 22:50   ` Jeffrey Merkey
  0 siblings, 2 replies; 168+ messages in thread
From: Linus Torvalds @ 2010-07-16 22:22 UTC (permalink / raw)
  To: Jeffrey Merkey; +Cc: linux-kernel

On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <jeffmerkey@gmail.com> wrote:
>
> So Linus, my understanding of Intel's processor design is that the
> processor will NEVER signal a nested NMI until it sees an iret from
> the first NMI exception.

Wrong.

I like x86, but it has warts. The NMI blocking is one of them.

The NMI's will be blocked until the _next_ "iret", but there is no
nesting. So if you take a fault during the NMI (debug, page table
fixup, whatever), the iret in the fault handler will re-enable NMI's
even though we're still busy with the original NMI. There is no
nesting, or any way to say that "this is a NMI-releasing iret". They
could even do it still - make a new "iret that doesn't clear NMI" by
adding a segment override prefix to iret or whatever. But it's not
going to happen, and it's just one of those ugly special cases that
has various historical reasons (recursive faults during NMI sure as
hell didn't make sense back in the real-mode 8086 days).

So we have to handle it in software. Or not ever trap at all inside
the NMI handler.

The original patch - and the patch I detest - is to make the normal
fault paths use a "popf + ret" to emulate iret, but without the NMI
release.
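
For readers following along, a minimal sketch of what such a return path
amounts to (illustration only, not the actual patch); it assumes the
handler has already rearranged the stack so that the saved RFLAGS sits
on top, with the saved RIP immediately below it:

/* Restoring RFLAGS with popf and jumping back with ret avoids IRETQ,
 * so the CPU's internal NMI-blocking latch is left alone. */
__asm__(
	".globl nmi_safe_trap_return\n"
	"nmi_safe_trap_return:\n"
	"	popfq\n"	/* restore the interrupted context's RFLAGS */
	"	retq\n"		/* jump to the saved RIP left on the stack */
);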

Now, I could live with that if it's the only solution, but it _is_
pretty damn ugly.

If somebody shows that it's actually faster to do "popf + ret" when
returning to kernel space (a poor man's special-case iret), maybe it
would be worth it, but the really critical code sequence is actually
not "return to kernel space", but the "return to user space" case that
really wants the iret. And I just think it's disgusting to add extra
tests to that path.

The other alternative would be to just make the rule be "NMI can never
take traps". It's possible to do that, but quite frankly, it's a pain.
It's a pain for page faults due to the whole vmalloc thing, and it's a
pain if you ever want to debug an NMI in any way (or put a breakpoint
on anything that is accessed from an NMI, which could potentially be
quite a lot of things).

If it was just the debug issue, I'd say "neener neener, debuggers are
for wimps", but it's clearly not just about debug. It's a whole lot of
other things. Random percpu data structures used for tracing, kernel
pointer verification code, yadda yadda.

                  Linus

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
@ 2010-07-16 22:02 Jeffrey Merkey
  2010-07-16 22:22 ` Linus Torvalds
  0 siblings, 1 reply; 168+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:02 UTC (permalink / raw)
  To: linux-kernel

Date: Fri, 16 Jul 2010 14:39:50 -0700
From: Linus Torvalds

> Linus Torvalds wrote:
>
> > But we're not talking about non-NMI code.
>
> Yes, we are. We're talking about breakpoints (look at the subject
> line), and you are very much talking about things like that _idiotic_
> vmalloc_sync_all() by module loading code etc etc.
>
> Every _single_ "solution" I have seen - apart from my suggestion - has
> been about making code "special" because some other code might run in
> an NMI. Module init sequences having to do idiotic things just because
> they have data structures that might get accessed by NMI.
>
> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.
>
> In other words, why the hell are you arguing? Help Mathieu write the
> low-level NMI handler right, and remove that idiotic
> "vmalloc_sync_all()" that is fundamentally broken and should not
> exist. Rather than talk about adding more of that kind of crap.
>
> Linus

So Linus, my understanding of Intel's processor design is that the
processor will NEVER signal a nested NMI until it sees an iret from
the first NMI exception.  At least that's how the processors were
working when I started this, unless this behavior has changed.  Just
put a gate on the exception that uses its own stack (which I think we
do anyway).

Jeff

^ permalink raw reply	[flat|nested] 168+ messages in thread

end of thread, other threads:[~2010-08-16 11:31 UTC | newest]

Thread overview: 168+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
2010-07-14 15:49 ` [patch 1/2] x86_64 page fault NMI-safe Mathieu Desnoyers
2010-07-14 16:28   ` Linus Torvalds
2010-07-14 17:06     ` Mathieu Desnoyers
2010-07-14 18:10       ` Linus Torvalds
2010-07-14 18:46         ` Ingo Molnar
2010-07-14 19:14           ` Linus Torvalds
2010-07-14 19:36             ` Frederic Weisbecker
2010-07-14 19:54               ` Linus Torvalds
2010-07-14 20:17                 ` Mathieu Desnoyers
2010-07-14 20:55                   ` Linus Torvalds
2010-07-14 21:18                     ` Ingo Molnar
2010-07-14 22:14                 ` Frederic Weisbecker
2010-07-14 22:31                   ` Mathieu Desnoyers
2010-07-14 22:48                     ` Frederic Weisbecker
2010-07-14 23:11                       ` Mathieu Desnoyers
2010-07-14 23:38                         ` Frederic Weisbecker
2010-07-15 16:26                           ` Mathieu Desnoyers
2010-08-03 17:18                             ` Peter Zijlstra
2010-08-03 18:25                               ` Mathieu Desnoyers
2010-08-04  6:46                                 ` Peter Zijlstra
2010-08-04  7:14                                   ` Ingo Molnar
2010-08-04 14:45                                   ` Mathieu Desnoyers
2010-08-04 14:56                                     ` Peter Zijlstra
2010-08-06  1:49                                       ` Mathieu Desnoyers
2010-08-06  9:51                                         ` Peter Zijlstra
2010-08-06 13:46                                           ` Mathieu Desnoyers
2010-08-06  6:18                                       ` Masami Hiramatsu
2010-08-06  9:50                                         ` Peter Zijlstra
2010-08-06 13:37                                           ` Mathieu Desnoyers
2010-08-07  9:51                                           ` Masami Hiramatsu
2010-08-09 16:53                                           ` Frederic Weisbecker
2010-08-03 18:56                               ` Linus Torvalds
2010-08-03 19:45                                 ` Mathieu Desnoyers
2010-08-03 20:02                                   ` Linus Torvalds
2010-08-03 20:10                                     ` Ingo Molnar
2010-08-03 20:21                                       ` Ingo Molnar
2010-08-03 21:16                                         ` Mathieu Desnoyers
2010-08-03 20:54                                     ` Mathieu Desnoyers
2010-08-04  6:27                                 ` Peter Zijlstra
2010-08-04 14:06                                   ` Mathieu Desnoyers
2010-08-04 14:50                                     ` Peter Zijlstra
2010-08-06  1:42                                       ` Mathieu Desnoyers
2010-08-06 10:11                                         ` Peter Zijlstra
2010-08-06 11:14                                           ` Peter Zijlstra
2010-08-06 14:15                                             ` Mathieu Desnoyers
2010-08-06 14:13                                           ` Mathieu Desnoyers
2010-08-11 14:44                                             ` Steven Rostedt
2010-08-11 14:34                                   ` Steven Rostedt
2010-08-15 13:35                                     ` Mathieu Desnoyers
2010-08-15 16:33                                     ` Avi Kivity
2010-08-15 16:44                                       ` Mathieu Desnoyers
2010-08-15 16:51                                         ` Avi Kivity
2010-08-15 18:31                                           ` Mathieu Desnoyers
2010-08-16 10:49                                             ` Avi Kivity
2010-08-16 11:29                                             ` Avi Kivity
2010-08-04  6:46                                 ` Dave Chinner
2010-08-04  7:21                                   ` Ingo Molnar
2010-07-14 23:40                         ` Steven Rostedt
2010-07-14 19:41             ` Linus Torvalds
2010-07-14 19:56               ` Andi Kleen
2010-07-14 20:05                 ` Mathieu Desnoyers
2010-07-14 20:07                   ` Andi Kleen
2010-07-14 20:08                     ` H. Peter Anvin
2010-07-14 23:32                       ` Tejun Heo
2010-07-14 22:31                   ` Frederic Weisbecker
2010-07-14 22:56                     ` Linus Torvalds
2010-07-14 23:09                       ` Andi Kleen
2010-07-14 23:22                         ` Linus Torvalds
2010-07-15 14:11                       ` Frederic Weisbecker
2010-07-15 14:35                         ` Andi Kleen
2010-07-16 11:21                           ` Frederic Weisbecker
2010-07-15 14:46                         ` Steven Rostedt
2010-07-16 10:47                           ` Frederic Weisbecker
2010-07-16 11:43                             ` Steven Rostedt
2010-07-15 14:51                         ` Linus Torvalds
2010-07-15 15:38                           ` Linus Torvalds
2010-07-16 12:00                           ` Frederic Weisbecker
2010-07-16 12:54                             ` Steven Rostedt
2010-07-14 20:39         ` Mathieu Desnoyers
2010-07-14 21:23           ` Linus Torvalds
2010-07-14 21:45             ` Maciej W. Rozycki
2010-07-14 21:52               ` Linus Torvalds
2010-07-14 22:31                 ` Maciej W. Rozycki
2010-07-14 22:21             ` Mathieu Desnoyers
2010-07-14 22:37               ` Linus Torvalds
2010-07-14 22:51                 ` Jeremy Fitzhardinge
2010-07-14 23:02                   ` Linus Torvalds
2010-07-14 23:54                     ` Jeremy Fitzhardinge
2010-07-15  1:23                 ` Linus Torvalds
2010-07-15  1:45                   ` Linus Torvalds
2010-07-15 18:31                     ` Mathieu Desnoyers
2010-07-15 18:43                       ` Linus Torvalds
2010-07-15 18:48                         ` Linus Torvalds
2010-07-15 22:01                           ` Mathieu Desnoyers
2010-07-15 22:16                             ` Linus Torvalds
2010-07-15 22:24                               ` H. Peter Anvin
2010-07-15 22:26                               ` Linus Torvalds
2010-07-15 22:46                                 ` H. Peter Anvin
2010-07-15 22:58                                 ` Andi Kleen
2010-07-15 23:20                                   ` H. Peter Anvin
2010-07-15 23:23                                     ` Linus Torvalds
2010-07-15 23:41                                       ` H. Peter Anvin
2010-07-15 23:44                                         ` Linus Torvalds
2010-07-15 23:46                                           ` H. Peter Anvin
2010-07-15 23:48                                       ` Andi Kleen
2010-07-15 22:30                               ` Mathieu Desnoyers
2010-07-16 19:13                             ` Mathieu Desnoyers
2010-07-15 16:44                   ` Mathieu Desnoyers
2010-07-15 16:49                     ` Linus Torvalds
2010-07-15 17:38                       ` Mathieu Desnoyers
2010-07-15 20:44                         ` H. Peter Anvin
2010-07-18 11:03                   ` Avi Kivity
2010-07-18 17:36                     ` Linus Torvalds
2010-07-18 18:04                       ` Avi Kivity
2010-07-18 18:22                         ` Linus Torvalds
2010-07-19  7:32                           ` Avi Kivity
2010-07-18 18:17                       ` Linus Torvalds
2010-07-18 18:43                         ` Steven Rostedt
2010-07-18 19:26                           ` Linus Torvalds
2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
2010-07-14 16:42   ` Maciej W. Rozycki
2010-07-14 18:12     ` Mathieu Desnoyers
2010-07-14 19:21       ` Maciej W. Rozycki
2010-07-14 19:58         ` Mathieu Desnoyers
2010-07-14 20:36           ` Maciej W. Rozycki
2010-07-16 12:28   ` Avi Kivity
2010-07-16 14:49     ` Mathieu Desnoyers
2010-07-16 15:34       ` Andi Kleen
2010-07-16 15:40         ` Mathieu Desnoyers
2010-07-16 16:47       ` Avi Kivity
2010-07-16 16:58         ` Mathieu Desnoyers
2010-07-16 17:54           ` Avi Kivity
2010-07-16 18:05             ` H. Peter Anvin
2010-07-16 18:15               ` Avi Kivity
2010-07-16 18:17                 ` H. Peter Anvin
2010-07-16 18:28                   ` Avi Kivity
2010-07-16 18:37                     ` Linus Torvalds
2010-07-16 19:26                       ` Avi Kivity
2010-07-16 21:39                         ` Linus Torvalds
2010-07-16 22:07                           ` Andi Kleen
2010-07-16 22:26                             ` Linus Torvalds
2010-07-16 22:41                               ` Andi Kleen
2010-07-17  1:15                                 ` Linus Torvalds
2010-07-16 22:40                             ` Mathieu Desnoyers
2010-07-18  9:23                           ` Avi Kivity
2010-07-16 18:22                 ` Mathieu Desnoyers
2010-07-16 18:32                   ` Avi Kivity
2010-07-16 19:29                     ` H. Peter Anvin
2010-07-16 19:39                       ` Avi Kivity
2010-07-16 19:32                     ` Andi Kleen
2010-07-16 18:25                 ` Linus Torvalds
2010-07-16 19:30                   ` Andi Kleen
2010-07-18  9:26                     ` Avi Kivity
2010-07-16 19:28               ` Andi Kleen
2010-07-16 19:32                 ` Avi Kivity
2010-07-16 19:34                   ` Andi Kleen
2010-08-04  9:46               ` Peter Zijlstra
2010-08-04 20:23                 ` H. Peter Anvin
2010-07-14 17:06 ` [patch 0/2] x86: NMI-safe trap handlers Andi Kleen
2010-07-14 17:08   ` Mathieu Desnoyers
2010-07-14 18:56     ` Andi Kleen
2010-07-14 23:29       ` Tejun Heo
2010-07-16 22:02 [patch 2/2] x86 NMI-safe INT3 and Page Fault Jeffrey Merkey
2010-07-16 22:22 ` Linus Torvalds
2010-07-16 22:48   ` Jeffrey Merkey
2010-07-16 22:53     ` Jeffrey Merkey
2010-07-16 22:50   ` Jeffrey Merkey
