linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] [GIT PULL] ftrace: Fix bug with function tracing and lockdep
@ 2012-05-31  1:28 Steven Rostedt
  2012-05-31  1:28 ` [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints Steven Rostedt
                   ` (5 more replies)
  0 siblings, 6 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31  1:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 5623 bytes --]


Ingo,

Dave Jones reported a bug from ftrace. I saw the same bug but
thought it was a compile problem because it would trigger every time
I booted; after rebuilding the kernel, it stopped triggering.
I mistook this for a build bug (something not cleaned properly).

What also confused me was that Dave also experienced the bug going
away after doing a make clean.

Seems that was just sheer luck!

After rebuilding and booting I was able to get it to trigger again.
I found that something was going wrong with the updating
of the functions via the breakpoints (the removal of stop_machine for
function tracing). I added an mdelay(1) where I usually sync the
updates (add breakpoint, wait 1ms, modify code, wait 1ms, remove breakpoint)
and this happened to trigger the bug 90% of the time.

Finally figured it out. It was a combination of lockdep and the int3
handler:

.macro paranoidzeroentry_ist sym do_sym ist
ENTRY(\sym)
	INTR_FRAME
	PARAVIRT_ADJUST_EXCEPTION_FRAME
	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
	subq $ORIG_RAX-R15, %rsp
	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
	call save_paranoid
	TRACE_IRQS_OFF
	movq %rsp,%rdi		/* pt_regs pointer */
	xorl %esi,%esi		/* no error code */
	subq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
	call \do_sym
	addq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
	jmp paranoid_exit	/* %ebx: no swapgs flag */
	CFI_ENDPROC
END(\sym)
.endm

The above macro is called by the do_int3 code. When an int3 is hit, the
stack jumps to the debug stack (like what happens on NMI). The
 subq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
is a little trick in case \do_sym happens to hit a breakpoint too.
And for the most part, that works.

But the TRACE_IRQS_OFF up there (and several more calls made by
paranoid_exit) is not protected by that trick. Thus, if the tracing
code hits a breakpoint, we reset the debug stack and start clobbering
the previous code's stack.

Usually this is not an issue, unless lockdep sees the TRACE_IRQS_OFF
or TRACE_IRQS_ON while holding locks that have been taken for the first
time (with interrupts disabled). This explains why, if the bug did not
trigger on the first try of starting the function tracer, it usually
would not trigger at all: it requires a new lock to be taken, and the
longer the box has been running, the less likely it is to acquire a
lock for the first time.

Once lockdep sees a new lock changing its interrupt status, it starts
recording it (and calls a lot of code). Unfortunately, some of this
code is traced by the function tracer.

I first started making a lot of this code 'notrace' but found that
it just wasn't reliable enough. The list of functions to mark just
kept growing, and I would only know I got them all if I happened
to run all the paths.

Then I tried to play tricks with switching the stack. I tried to
modify sync_regs(), but it only partially worked, and NMIs seemed
to make it crash as well. I spent too much time on it and finally
decided to go with the easiest approach.

As this is only needed with LOCKDEP (and, just in case, I enabled it
whenever TRACE_IRQFLAGS is set), I figured breakpoint performance wasn't
an issue (LOCKDEP and TRACE_IRQFLAGS have a much bigger overhead).

I decided to reuse the code from the NMI handler to switch the IDT before
calling TRACE_IRQS_ON/OFF and switch it back afterward. It would be nice
to do it just once on entering do_int3 and once on leaving, but because
TRACE_IRQS_ON/OFF and TRACE_IRQS_IRETQ are called in several branches
of paranoid_exit, I decided to keep it simple and switch it every
time.

In the future, we can work to make this rely on modifying_ftrace_code,
but there are races to deal with, and this bug needs to be fixed now.

Also, while debugging this, I found a few more little bugs.

The first patch fixes a race with setting the modifying_ftrace_code
and the executing of the int3 handlers.

The second fixes a bug where we did not do the breakpoint update of
the function tracer callback (the one called from the mcount trampoline).
This could cause a GPF when tracing is running and the function
tracer's callback changes.

The third patch is a minor bug fix in the NMI handling of the IDT switch.
If an NMI preempts an int3 handler, it properly switches the IDT and
sets a variable telling the handler to put it back on exit. The
problem is that the variable never got reset, so every NMI from then on
kept loading the old IDT back. This is both a performance
and a correctness fix.

The last two patches fix the lockdep and breakpoint issue.

Comments are welcome, but if there's no problem then ...

Please pull the latest tip/perf/urgent tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
tip/perf/urgent

Head SHA1: 1f9f5e93a9ab9cdf2504a25c24f3f8715c4ba785


Steven Rostedt (5):
      ftrace: Synchronize variable setting with breakpoints
      ftrace: Use breakpoint method to update ftrace caller
      x86: Reset the debug_stack update counter
      x86: Allow nesting of the debug stack IDT setting
      ftrace/x86: Do not change stacks in DEBUG when calling lockdep

----
 arch/x86/include/asm/ftrace.h |    2 +-
 arch/x86/kernel/cpu/common.c  |    8 ++++-
 arch/x86/kernel/entry_64.S    |   44 ++++++++++++++++++++++++--
 arch/x86/kernel/ftrace.c      |   69 ++++++++++++++++++++++++++++++++++++-----
 arch/x86/kernel/nmi.c         |    4 ++-
 arch/x86/kernel/traps.c       |    3 +-
 6 files changed, 116 insertions(+), 14 deletions(-)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31  1:28 [PATCH 0/5] [GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
@ 2012-05-31  1:28 ` Steven Rostedt
  2012-05-31 11:06   ` Peter Zijlstra
  2012-05-31  1:28 ` [PATCH 2/5] ftrace: Use breakpoint method to update ftrace caller Steven Rostedt
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31  1:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 2667 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

When the function tracer starts modifying the code via breakpoints
it sets a variable (modifying_ftrace_code) to inform the breakpoint
handler to call the ftrace int3 code.

But there's no synchronization between setting this variable and the
handler running, so it is possible for the handler to be called on
another CPU before it sees the variable set. This will cause a kernel
crash, as the int3 handler will not know what to do with the breakpoint.

I originally added smp_mb()'s to force the visibility of the variable
but H. Peter Anvin suggested that I just make it atomic.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/include/asm/ftrace.h |    2 +-
 arch/x86/kernel/ftrace.c      |    6 +++---
 arch/x86/kernel/traps.c       |    3 ++-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 18d9005..b0767bc 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -34,7 +34,7 @@
 
 #ifndef __ASSEMBLY__
 extern void mcount(void);
-extern int modifying_ftrace_code;
+extern atomic_t modifying_ftrace_code;
 
 static inline unsigned long ftrace_call_adjust(unsigned long addr)
 {
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 32ff365..ea9904f 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -168,7 +168,7 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	return ret;
 }
 
-int modifying_ftrace_code __read_mostly;
+atomic_t modifying_ftrace_code __read_mostly;
 
 /*
  * A breakpoint was added to the code address we are about to
@@ -491,11 +491,11 @@ void ftrace_replace_code(int enable)
 
 void arch_ftrace_update_code(int command)
 {
-	modifying_ftrace_code++;
+	atomic_inc(&modifying_ftrace_code);
 
 	ftrace_modify_all_code(command);
 
-	modifying_ftrace_code--;
+	atomic_dec(&modifying_ftrace_code);
 }
 
 int __init ftrace_dyn_arch_init(void *data)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ff08457..32443ae 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -304,7 +304,8 @@ dotraplinkage void __kprobes notrace do_int3(struct pt_regs *regs, long error_co
 {
 #ifdef CONFIG_DYNAMIC_FTRACE
 	/* ftrace must be first, everything else may cause a recursive crash */
-	if (unlikely(modifying_ftrace_code) && ftrace_int3_handler(regs))
+	if (unlikely(atomic_read(&modifying_ftrace_code)) &&
+	    ftrace_int3_handler(regs))
 		return;
 #endif
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
-- 
1.7.10



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* [PATCH 2/5] ftrace: Use breakpoint method to update ftrace caller
  2012-05-31  1:28 [PATCH 0/5] [GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
  2012-05-31  1:28 ` [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints Steven Rostedt
@ 2012-05-31  1:28 ` Steven Rostedt
  2012-05-31 11:18   ` Peter Zijlstra
  2012-05-31  1:28 ` [PATCH 3/5] x86: Reset the debug_stack update counter Steven Rostedt
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31  1:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 4154 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

On boot up and module load, it is fine to modify the code directly,
without the use of breakpoints. This is because boot up modification
is done before SMP is initialized, thus the modification is serial,
and module load is done before the module executes.

But after that, we must use an SMP-safe method to modify running code.

This has been done for changing the nops at all the function call sites,
but the change of the ftrace callback handler itself was still done with
a direct modification. If tracing was enabled and the function callback
was changed, another CPU could fault while still calling the original
callback. This modification must use the breakpoint method too.

Note, the direct method is still used for boot up and module load.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/ftrace.c |   65 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ea9904f..0d50ee2 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -100,7 +100,7 @@ static const unsigned char *ftrace_nop_replace(void)
 }
 
 static int
-ftrace_modify_code(unsigned long ip, unsigned const char *old_code,
+ftrace_modify_code_direct(unsigned long ip, unsigned const char *old_code,
 		   unsigned const char *new_code)
 {
 	unsigned char replaced[MCOUNT_INSN_SIZE];
@@ -141,7 +141,20 @@ int ftrace_make_nop(struct module *mod,
 	old = ftrace_call_replace(ip, addr);
 	new = ftrace_nop_replace();
 
-	return ftrace_modify_code(rec->ip, old, new);
+	/*
+	 * On boot up, and when modules are loaded, the MCOUNT_ADDR
+	 * is converted to a nop, and will never become MCOUNT_ADDR
+	 * again. This code is either running before SMP (on boot up)
+	 * or before the code will ever be executed (module load).
+	 * We do not want to use the breakpoint version in this case,
+	 * just modify the code directly.
+	 */
+	if (addr == MCOUNT_ADDR)
+		return ftrace_modify_code_direct(rec->ip, old, new);
+
+	/* Normal cases use add_brk_on_nop */
+	WARN_ONCE(1, "invalid use of ftrace_make_nop");
+	return -EINVAL;
 }
 
 int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
@@ -152,9 +165,16 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
 	old = ftrace_nop_replace();
 	new = ftrace_call_replace(ip, addr);
 
-	return ftrace_modify_code(rec->ip, old, new);
+	/* Should only be called when module is loaded */
+	return ftrace_modify_code_direct(rec->ip, old, new);
 }
 
+atomic_t modifying_ftrace_code __read_mostly;
+
+static int
+ftrace_modify_code(unsigned long ip, unsigned const char *old_code,
+		   unsigned const char *new_code);
+
 int ftrace_update_ftrace_func(ftrace_func_t func)
 {
 	unsigned long ip = (unsigned long)(&ftrace_call);
@@ -163,13 +183,16 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 
 	memcpy(old, &ftrace_call, MCOUNT_INSN_SIZE);
 	new = ftrace_call_replace(ip, (unsigned long)func);
+
+	atomic_inc(&modifying_ftrace_code);
+
 	ret = ftrace_modify_code(ip, old, new);
 
+	atomic_dec(&modifying_ftrace_code);
+
 	return ret;
 }
 
-atomic_t modifying_ftrace_code __read_mostly;
-
 /*
  * A breakpoint was added to the code address we are about to
  * modify, and this is the handle that will just skip over it.
@@ -489,6 +512,38 @@ void ftrace_replace_code(int enable)
 	}
 }
 
+static int
+ftrace_modify_code(unsigned long ip, unsigned const char *old_code,
+		   unsigned const char *new_code)
+{
+	int ret;
+
+	ret = add_break(ip, old_code);
+	if (ret)
+		goto out;
+
+	run_sync();
+
+	ret = add_update_code(ip, new_code);
+	if (ret)
+		goto fail_update;
+
+	run_sync();
+
+	ret = ftrace_write(ip, new_code, 1);
+	if (ret) {
+		ret = -EPERM;
+		goto out;
+	}
+	run_sync();
+ out:
+	return ret;
+
+ fail_update:
+	probe_kernel_write((void *)ip, &old_code[0], 1);
+	goto out;
+}
+
 void arch_ftrace_update_code(int command)
 {
 	atomic_inc(&modifying_ftrace_code);
-- 
1.7.10



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* [PATCH 3/5] x86: Reset the debug_stack update counter
  2012-05-31  1:28 [PATCH 0/5] [GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
  2012-05-31  1:28 ` [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints Steven Rostedt
  2012-05-31  1:28 ` [PATCH 2/5] ftrace: Use breakpoint method to update ftrace caller Steven Rostedt
@ 2012-05-31  1:28 ` Steven Rostedt
  2012-05-31 19:19   ` H. Peter Anvin
  2012-05-31  1:28 ` [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting Steven Rostedt
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31  1:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1140 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

When an NMI goes off and sees that it preempted the debug stack,
it changes the IDT to point to one that does not modify the stack on
breakpoint, keeping the debug stack safe (and allowing breakpoints in NMIs).

But the variable that is set to request undoing this on exit never gets
cleared. Thus, once it has been set the first time, every NMI resets the
IDT on exit even when it does not need to.

Cc: <stable@vger.kernel.org> # v3.3
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/nmi.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 9087527..c1fffc5 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -450,8 +450,10 @@ static inline void nmi_nesting_preprocess(struct pt_regs *regs)
 
 static inline void nmi_nesting_postprocess(void)
 {
-	if (unlikely(__get_cpu_var(update_debug_stack)))
+	if (unlikely(__get_cpu_var(update_debug_stack))) {
 		debug_stack_reset();
+		__get_cpu_var(update_debug_stack) = 0;
+	}
 }
 #endif
 
-- 
1.7.10



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31  1:28 [PATCH 0/5] [GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
                   ` (2 preceding siblings ...)
  2012-05-31  1:28 ` [PATCH 3/5] x86: Reset the debug_stack update counter Steven Rostedt
@ 2012-05-31  1:28 ` Steven Rostedt
  2012-05-31 18:44   ` H. Peter Anvin
  2012-05-31 18:58   ` H. Peter Anvin
  2012-05-31  1:28 ` [PATCH 5/5] ftrace/x86: Do not change stacks in DEBUG when calling lockdep Steven Rostedt
  2012-05-31  2:13 ` [PATCH 0/5] (for 3.5)[GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
  5 siblings, 2 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31  1:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1754 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

When the NMI handler runs, it checks whether it preempted a debug handler
that is using the debug stack. If so, it changes the IDT table so that a
breakpoint will not reset the stack; otherwise a breakpoint would reset
the debug stack and corrupt the debug handler it preempted.

Now that ftrace uses breakpoints to change functions from nops to
callers, many more places may hit a breakpoint. Unfortunately this
includes some of the calls that lockdep performs, which causes issues
with the debug stack. The debug handler, too, needs to switch the IDT
before tracing.

Allow the debug_stack_set_zero() and debug_stack_reset() to be nested
so that the debug handlers can take advantage of them too.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/cpu/common.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 82f29e7..63fc083 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1101,14 +1101,20 @@ int is_debug_stack(unsigned long addr)
 		 addr > (__get_cpu_var(debug_stack_addr) - DEBUG_STKSZ));
 }
 
+static DEFINE_PER_CPU(atomic_t, debug_idt_zero);
+
 void debug_stack_set_zero(void)
 {
+	atomic_inc(&__get_cpu_var(debug_idt_zero));
 	load_idt((const struct desc_ptr *)&nmi_idt_descr);
 }
 
 void debug_stack_reset(void)
 {
-	load_idt((const struct desc_ptr *)&idt_descr);
+	if (WARN_ON(!atomic_read(&__get_cpu_var(debug_idt_zero))))
+		return;
+	if (atomic_dec_and_test(&__get_cpu_var(debug_idt_zero)))
+		load_idt((const struct desc_ptr *)&idt_descr);
 }
 
 #else	/* CONFIG_X86_64 */
-- 
1.7.10



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* [PATCH 5/5] ftrace/x86: Do not change stacks in DEBUG when calling lockdep
  2012-05-31  1:28 [PATCH 0/5] [GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
                   ` (3 preceding siblings ...)
  2012-05-31  1:28 ` [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting Steven Rostedt
@ 2012-05-31  1:28 ` Steven Rostedt
  2012-05-31  2:13 ` [PATCH 0/5] (for 3.5)[GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
  5 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31  1:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 6310 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

When both DYNAMIC_FTRACE and LOCKDEP are set, the TRACE_IRQS_ON/OFF
will call into the lockdep code. The lockdep code can call lots of
functions that may be traced by ftrace. When ftrace is updating its
code and hits a breakpoint, the breakpoint handler will call into
lockdep. If lockdep happens to call a function that also has a breakpoint
attached, it will jump back into the breakpoint handler, resetting
the stack pointer to the top of the debug stack and corrupting the
contents currently on that stack.

The 'do_sym' call that calls do_int3() is protected by modifying the
IST table to point to a different location if another breakpoint is
hit. But the TRACE_IRQS_OFF/ON are outside that protection, and if
a breakpoint is hit from those, the stack will get corrupted, and
the kernel will crash:

[ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
[ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
[ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
[ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
[ 1013.310600] CPU 2
[ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1013.401848]
[ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
[ 1013.437943] RIP: 8eb8:[<ffff88014630a000>]  [<ffff88014630a000>] 0xffff880146309fff
[ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408  EFLAGS: 00010046
[ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
[ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
[ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
[ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
[ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
[ 1013.586108] FS:  0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
[ 1013.609458] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
[ 1013.717028] Stack:
[ 1013.724131]  ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
[ 1013.745918]  cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
[ 1013.767870]  ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
[ 1013.790021] Call Trace:
[ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
[ 1013.861443] RIP  [<ffff88014630a000>] 0xffff880146309fff
[ 1013.884466]  RSP <ffff88014780f408>
[ 1013.901507] CR2: 0000000000000002

The solution was to reuse the NMI functions that change the IDT table,
so that hitting a breakpoint (in kernel mode) keeps the current stack:

  call debug_stack_set_zero
  TRACE_IRQS_ON
  call debug_stack_reset

If the TRACE_IRQS_ON happens to hit a breakpoint then it will keep the current stack
and not crash the box.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/entry_64.S |   44 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 320852d..7d65133 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -191,6 +191,44 @@ ENDPROC(native_usergs_sysret64)
 .endm
 
 /*
+ * When dynamic function tracer is enabled it will add a breakpoint
+ * to all locations that it is about to modify, sync CPUs, update
+ * all the code, sync CPUs, then remove the breakpoints. In this time
+ * if lockdep is enabled, it might jump back into the debug handler
+ * outside the updating of the IST protection. (TRACE_IRQS_ON/OFF).
+ *
+ * We need to change the IDT table before calling TRACE_IRQS_ON/OFF to
+ * make sure the stack pointer does not get reset back to the top
+ * of the debug stack, and instead just reuses the current stack.
+ */
+#if defined(CONFIG_DYNAMIC_FTRACE) && defined(CONFIG_TRACE_IRQFLAGS)
+
+.macro TRACE_IRQS_OFF_DEBUG
+	call debug_stack_set_zero
+	TRACE_IRQS_OFF
+	call debug_stack_reset
+.endm
+
+.macro TRACE_IRQS_ON_DEBUG
+	call debug_stack_set_zero
+	TRACE_IRQS_ON
+	call debug_stack_reset
+.endm
+
+.macro TRACE_IRQS_IRETQ_DEBUG offset=ARGOFFSET
+	bt   $9,EFLAGS-\offset(%rsp)	/* interrupts off? */
+	jnc  1f
+	TRACE_IRQS_ON_DEBUG
+1:
+.endm
+
+#else
+# define TRACE_IRQS_OFF_DEBUG		TRACE_IRQS_OFF
+# define TRACE_IRQS_ON_DEBUG		TRACE_IRQS_ON
+# define TRACE_IRQS_IRETQ_DEBUG		TRACE_IRQS_IRETQ
+#endif
+
+/*
  * C code is not supposed to know about undefined top of stack. Every time
  * a C function with an pt_regs argument is called from the SYSCALL based
  * fast path FIXUP_TOP_OF_STACK is needed.
@@ -1098,7 +1136,7 @@ ENTRY(\sym)
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 	call save_paranoid
-	TRACE_IRQS_OFF
+	TRACE_IRQS_OFF_DEBUG
 	movq %rsp,%rdi		/* pt_regs pointer */
 	xorl %esi,%esi		/* no error code */
 	subq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
@@ -1393,7 +1431,7 @@ paranoidzeroentry machine_check *machine_check_vector(%rip)
 ENTRY(paranoid_exit)
 	DEFAULT_FRAME
 	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
+	TRACE_IRQS_OFF_DEBUG
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz paranoid_restore
 	testl $3,CS(%rsp)
@@ -1404,7 +1442,7 @@ paranoid_swapgs:
 	RESTORE_ALL 8
 	jmp irq_return
 paranoid_restore:
-	TRACE_IRQS_IRETQ 0
+	TRACE_IRQS_IRETQ_DEBUG 0
 	RESTORE_ALL 8
 	jmp irq_return
 paranoid_userspace:
-- 
1.7.10



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: [PATCH 0/5] (for 3.5)[GIT PULL] ftrace: Fix bug with function tracing and lockdep
  2012-05-31  1:28 [PATCH 0/5] [GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
                   ` (4 preceding siblings ...)
  2012-05-31  1:28 ` [PATCH 5/5] ftrace/x86: Do not change stacks in DEBUG when calling lockdep Steven Rostedt
@ 2012-05-31  2:13 ` Steven Rostedt
  5 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31  2:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> Ingo,
> 

> Please pull the latest tip/perf/urgent tree, which can be found at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> tip/perf/urgent
> 
> Head SHA1: 1f9f5e93a9ab9cdf2504a25c24f3f8715c4ba785
> 

Oh, I forgot to note in my announcement subject that this is for 3.5, as
this bug came in during the current merge window, with the exception of
patch 3, which fixes a bug introduced in 3.3.

-- Steve




* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31  1:28 ` [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints Steven Rostedt
@ 2012-05-31 11:06   ` Peter Zijlstra
  2012-05-31 14:08     ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 11:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@redhat.com>
> 
> When the function tracer starts modifying the code via breakpoints
> it sets a variable (modifying_ftrace_code) to inform the breakpoint
> handler to call the ftrace int3 code.
> 
> But there's no synchronization between setting this code and the
> handler, thus it is possible for the handler to be called on another
> CPU before it sees the variable. This will cause a kernel crash as
> the int3 handler will not know what to do with it.
> 
> I originally added smp_mb()'s to force the visibility of the variable
> but H. Peter Anvin suggested that I just make it atomic.

Uhm, maybe. atomic_{inc,dec}() implies a full memory barrier on x86,
but atomic_read() never has the smp_rmb() required.

Now smp_rmb() is mostly a nop on x86, except for CONFIG_X86_PPRO_FENCE.

So this should mostly work, but yuck.


Also, why does this stuff live in ftrace? I always thought you were
going to replace text_poke() so everybody that uses cross-modifying code
could profit?




* Re: [PATCH 2/5] ftrace: Use breakpoint method to update ftrace caller
  2012-05-31  1:28 ` [PATCH 2/5] ftrace: Use breakpoint method to update ftrace caller Steven Rostedt
@ 2012-05-31 11:18   ` Peter Zijlstra
  2012-05-31 14:19     ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 11:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@redhat.com>
> 
> On boot up and module load, it is fine to modify the code directly,
> without the use of breakpoints. This is because boot up modification
> is done before SMP is initialized, thus the modification is serial,
> and module load is done before the module executes.
> 
> But after that we must use a SMP safe method to modify running code.
> 
> This has been done to change the nops at all the functions, but
> the change of the ftrace callback handler itself was still using a
> direct modification. If tracing was enabled and the function callback
> was changed then another CPU could fault if it was currently calling
> the original callback. This modification must use the breakpoint method
> too.
> 
> Note, the direct method is still used for boot up and module load.

The changelog isn't clear if this is a fix or optimization. I suspect
the latter.

Still, you're now re-inventing text_poke() and text_poke_early().

Why are you keeping all this inside of ftrace?




* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 11:06   ` Peter Zijlstra
@ 2012-05-31 14:08     ` Steven Rostedt
  2012-05-31 15:16       ` Masami Hiramatsu
                         ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 13:06 +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <srostedt@redhat.com>
> > 
> > When the function tracer starts modifying the code via breakpoints
> > it sets a variable (modifying_ftrace_code) to inform the breakpoint
> > handler to call the ftrace int3 code.
> > 
> > But there's no synchronization between setting this code and the
> > handler, thus it is possible for the handler to be called on another
> > CPU before it sees the variable. This will cause a kernel crash as
> > the int3 handler will not know what to do with it.
> > 
> > I originally added smp_mb()'s to force the visibility of the variable
> > but H. Peter Anvin suggested that I just make it atomic.
> 
> Uhm,. maybe. atomic_{inc,dec}() implies a full memory barrier on x86,

Yeah, I believe (and H. Peter can correct me) that this is all that's
required for x86.

> but atomic_read() never has the smp_rmb() required.
> 
> Now smp_rmb() is mostly a nop on x86, except for CONFIG_X86_PPRO_FENCE.

No rmb() is required, as that's supplied by the breakpoint itself.
Basically, rmb() is used for ordering:

	load(A);
	rmb();
	load(B);

To keep the machine from actually doing:

	load(B);
	load(A);

But what this is:

	<breakpoint>
	     |
	     +---------> <handler>
			    |
			load(A);

We need the load(A) to be after the breakpoint. Is it possible for the
machine to do it before?:

		     load(A)
	      |
	      |
	<breakpoint>
	      +----------> test(A)

??

If another breakpoint is hit (one other than one put in by ftrace) then
we don't care. It won't crash the system whether or not A is 1 or 0. We
just need to make sure that a ftrace breakpoint that is hit knows that
it was a ftrace breakpoint (calls the ftrace handler). No other
breakpoint should be on a ftrace nop anyway.
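
In kernel-style pseudocode (a sketch of the idea only, modeled on the patch
series; not the exact mainline code), the two sides of the handshake are:

```c
/* Sketch only -- names follow the patch under discussion. */
atomic_t modifying_ftrace_code;

/* Writer side, run before any int3 is planted.  On x86 the
 * atomic_inc() implies a full barrier, so the counter is visible
 * before any CPU can trap on one of our breakpoints. */
void ftrace_modify_prepare(void)
{
	atomic_inc(&modifying_ftrace_code);
}

/* int3-handler side.  The trap itself supplies the ordering, so a
 * plain atomic_read() is enough: any CPU that hits one of our
 * breakpoints must already see the nonzero counter. */
int ftrace_int3_handler(struct pt_regs *regs)
{
	if (!atomic_read(&modifying_ftrace_code))
		return 0;			/* not an ftrace breakpoint */
	regs->ip += MCOUNT_INSN_SIZE - 1;	/* treat the site as a nop */
	return 1;
}
```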


> 
> So this should mostly work, but yuck.

It should work, not just mostly.

> 
> 
> Also, why does this stuff live in ftrace? I always thought you were
> going to replace text_poke() so everybody that uses cross-modifying code
> could profit?

I discussed this with Masami at Collaboration Summit. The two are
similar but also very different. But we want to start merging the two
together where it makes sense.

Currently text_poke maps 2 pages into a FIXMAP area to make the
modification so that it does not need to change the protection of the
kernel (from ro to rw). Ftrace changes the text protection while it
makes the update then reverts it back when done. The reason for the
difference is that text_poke only handles 1 change at a time. There is a
batch mode, but it does the mapping 1 at a time even for that.

Now ftrace must change 30,000 locations at once!

# cat /debug/tracing/available_filter_functions | wc -l
29467

Could you imagine ftrace mapping 30,000 pages one at a time to do the
update?

Also, the text_poke() batch mode requires allocating an array. Starting
function tracing could then fail simply trying to allocate 30,000
structures. Ftrace has its own tight structure, struct dyn_ftrace, which
is allocated in blocks; on x86_64 each entry is 16 bytes. It is
allocated at boot up and at module load, and is persistent for the
life of the system (or of the loaded module). You can see it in the
dmesg:

ftrace: allocating 20503 entries in 81 pages

 - note this is just core code, before modules are added

The struct dyn_ftrace holds a pointer to the mcount location and a flags
field. The flags tell the update code what to change.

Now Masami said he thinks the text_poke should also do the change of
kernel protection as well, and not do the FIXMAP mapping. At least for
bulk changes. The added breakpoint code that we have can be shared
between ftrace and text_poke. That is something we want to do. After we
get ftrace working, we'll go ahead and make it work for text_poke as
well ;-)

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/5] ftrace: Use breakpoint method to update ftrace caller
  2012-05-31 11:18   ` Peter Zijlstra
@ 2012-05-31 14:19     ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 13:18 +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <srostedt@redhat.com>
> > 
> > On boot up and module load, it is fine to modify the code directly,
> > without the use of breakpoints. This is because boot up modification
> > is done before SMP is initialized, thus the modification is serial,
> > and module load is done before the module executes.
> > 
> > But after that we must use a SMP safe method to modify running code.
> > 
> > This has been done to change the nops at all the functions, but
> > the change of the ftrace callback handler itself was still using a
> > direct modification. If tracing was enabled and the function callback
> > was changed then another CPU could fault if it was currently calling
> > the original callback. This modification must use the breakpoint method
> > too.
> > 
> > Note, the direct method is still used for boot up and module load.
> 
> The changelog isn't clear if this is a fix or optimization. I suspect
> the latter.

It's a bug fix. There's a GPF kernel crash creeping around in the
current code. I can add to the change log what I put into my cover
letter:

"The second fixes a bug where we don't do the breakpoint update of
the function trace callback (the one called in the mcount trampoline).
This could cause a GPF when tracing is running and the function
tracer changes."


> 
> Still, you're now re-inventing text_poke() and text_poke_early().

I explained this in my other email. Also, it wasn't really re-invented,
as I believe ftrace did this first (back in 2007). text_poke() looks to
have come into play around 2008. Ftrace was the first to do live updates
of code while the system was up and running. Before that, the
alternatives code just changed the system at boot up (or going from SMP
to UP or vice versa). It never did the change in an SMP environment.

This patchset is not an implementation of text_poke() but a fix to what
was done earlier. The breakpoint method was something that text_poke()
tried earlier, but H. Peter warned that Intel had not said it was safe
to do so. Thus we waited. When we got the unofficial OK, I added the
breakpoint method to get rid of stop_machine() from ftrace. That was the
main motivation for this.

> 
> Why are you keeping all this inside of ftrace?

I'm not ;-) Well, actually, I won't be. As I explained in my last email,
I talked to Masami about it, and we are discussing ways to change text
poke to do basically what ftrace is doing. I'm doing this with ftrace
first because it removes stop_machine() from it, and it is also a greater
test case. If something is going to go wrong, it is more likely to go
wrong with 30,000 changes at once than with just 100.

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 14:08     ` Steven Rostedt
@ 2012-05-31 15:16       ` Masami Hiramatsu
  2012-05-31 15:29         ` Steven Rostedt
  2012-05-31 17:28       ` Peter Zijlstra
  2012-05-31 17:40       ` Peter Zijlstra
  2 siblings, 1 reply; 47+ messages in thread
From: Masami Hiramatsu @ 2012-05-31 15:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, H. Peter Anvin, Dave Jones, Andi Kleen

(2012/05/31 23:08), Steven Rostedt wrote:
> Now Masami said he thinks the text_poke should also do the change of
> kernel protection as well, and not do the FIXMAP mapping. At least for
> bulk changes. The added breakpoint code that we have can be shared
> between ftrace and text_poke. That is something we want to do. After we
> get ftrace working, we'll go ahead and make it work for text_poke as
> well ;-)

Indeed, ftrace can be a better test case for breakpoint-based
self-modification. So after we make sure it works safely across x86,
it will be better to rewrite text_poke with that method.

Perhaps kprobes *jump* optimization is better placed to handle it,
because the target probe keeps working while it is being optimized
(while its code is modified). This means that if someone hits the
breakpoint of such a kprobe, it must be handled by kprobes, not by
just tweaking the IP address.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 15:16       ` Masami Hiramatsu
@ 2012-05-31 15:29         ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 15:29 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, H. Peter Anvin, Dave Jones, Andi Kleen

On Fri, 2012-06-01 at 00:16 +0900, Masami Hiramatsu wrote:


> Perhaps kprobes *jump* optimization is better placed to handle it,
> because the target probe keeps working while it is being optimized
> (while its code is modified). This means that if someone hits the
> breakpoint of such a kprobe, it must be handled by kprobes, not by
> just tweaking the IP address.

Yeah, ftrace is easier on what the breakpoint does. For kprobes, it's a
bit more invasive. Ftrace only deals with a nop, or a "trace-me" call.
As the modifications are being done, we always treat it as a nop until
the modification is complete.

With kprobes, the text_poke() is a bit more difficult, because it can't
treat the location as a nop, as what is changing actually performs some
task for the kernel. It requires that the breakpoint know what to do
with the text (out-of-line execution, and such). ftrace has the benefit
of just "skip this instruction" and return.

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 14:08     ` Steven Rostedt
  2012-05-31 15:16       ` Masami Hiramatsu
@ 2012-05-31 17:28       ` Peter Zijlstra
  2012-05-31 18:42         ` Steven Rostedt
  2012-05-31 17:40       ` Peter Zijlstra
  2 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 17:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 13:06 +0200, Peter Zijlstra wrote:
> > On Wed, 2012-05-30 at 21:28 -0400, Steven Rostedt wrote:
> > > From: Steven Rostedt <srostedt@redhat.com>
> > > 
> > > When the function tracer starts modifying the code via breakpoints
> > > it sets a variable (modifying_ftrace_code) to inform the breakpoint
> > > handler to call the ftrace int3 code.
> > > 
> > > But there's no synchronization between setting this code and the
> > > handler, thus it is possible for the handler to be called on another
> > > CPU before it sees the variable. This will cause a kernel crash as
> > > the int3 handler will not know what to do with it.
> > > 
> > > I originally added smp_mb()'s to force the visibility of the variable
> > > but H. Peter Anvin suggested that I just make it atomic.
> > 
> > Uhm,. maybe. atomic_{inc,dec}() implies a full memory barrier on x86,
> 
> Yeah, I believe (and H. Peter can correct me) that this is all that's
> required for x86.
> 
> > but atomic_read() never has the smp_rmb() required.
> > 
> > Now smp_rmb() is mostly a nop on x86, except for CONFIG_X86_PPRO_FENCE.
> 
> No rmb() is required, as that's supplied by the breakpoint itself.
> Basically, rmb() is used for ordering:
> 
> 	load(A);
> 	rmb();
> 	load(B);
> 
> To keep the machine from actually doing:
> 
> 	load(B);
> 	load(A);

I know what rmb is for.. I also know you need to pair barriers. Hiding
them in atomic doesn't make the ordering any more obvious.

> But what this is:
> 
> 	<breakpoint>
> 	     |
> 	     +---------> <handler>
> 			    |
> 			load(A);
> 
> We need the load(A) to be after the breakpoint. Is it possible for the
> machine to do it before?:
> 
> 		     load(A)
> 	      |
> 	      |
> 	<breakpoint>
> 	      +----------> test(A)

I don't know, nor did you explain the implicit ordering there. Also in
such diagrams you need the other side as well.

> If another breakpoint is hit (one other than one put in by ftrace) then
> we don't care. It won't crash the system whether or not A is 1 or 0. We
> just need to make sure that a ftrace breakpoint that is hit knows that
> it was a ftrace breakpoint (calls the ftrace handler). No other
> breakpoint should be on a ftrace nop anyway.

So the ordering is like:

---

 CPU-0				CPU-1


 lock inc mod-count /* implicit (w)mb */
 write int3
				<trap-int3> /* implicit (r)mb */
				load mod-count

 sync-ipi-broadcast
 write rest-of-instruction
 sync-ipi-broadcast
 write head-of-instruction
 sync-ipi-broadcast
 lock dec mod-count /* implicit (w)mb */


Such that when we observe the int3 on CPU-1 we also must see the
increment on mod-count.

---

A simple something like the above makes it very clear what we're doing
and what we're expecting. I think a (local) trap should imply a barrier
of sorts but will have to defer to others (hpa?) to confirm. But at the
very least write it down someplace that you are assuming that.
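
The writer side of that diagram corresponds to kernel-style pseudocode
along these lines (a sketch only: run_sync() stands in for the
sync-ipi-broadcast, and the helper names are illustrative):

```c
/* Illustrative only: the three-step int3 update with sync points. */
static void update_ftrace_site(unsigned long ip, const unsigned char *new)
{
	unsigned char int3 = 0xcc;

	atomic_inc(&modifying_ftrace_code);		/* implicit (w)mb */

	probe_kernel_write((void *)ip, &int3, 1);	/* write int3 */
	run_sync();					/* sync-ipi-broadcast */

	probe_kernel_write((void *)(ip + 1), new + 1,	/* rest of insn */
			   MCOUNT_INSN_SIZE - 1);
	run_sync();

	probe_kernel_write((void *)ip, new, 1);		/* head of insn */
	run_sync();

	atomic_dec(&modifying_ftrace_code);		/* implicit (w)mb */
}
```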



fwiw run_sync() could do with a much bigger comment on why it's sane to
enable interrupts.. That simply reeks; enabling interrupts too early can
wreck stuff properly.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 14:08     ` Steven Rostedt
  2012-05-31 15:16       ` Masami Hiramatsu
  2012-05-31 17:28       ` Peter Zijlstra
@ 2012-05-31 17:40       ` Peter Zijlstra
  2012-05-31 17:44         ` Peter Zijlstra
  2012-05-31 17:53         ` Steven Rostedt
  2 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 17:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > Also, why does this stuff live in ftrace? I always thought you were
> > going to replace text_poke() so everybody that uses cross-modifying code
> > could profit?
> 
> I discussed this with Masami at Collaboration Summit. The two are
> similar but also very different. But we want to start merging the two
> together where it makes sense. 

Argh,. I so disagree. You're doing it backwards.

First you merge whatever is there, regardless of who came first.

Then, when everybody doing text modification is using the same
interface, do a second implementation using a Kconfig knob. If the scary
new one breaks, no sweat, flip the config. If it's proven stable, kill
off the old one.

I really don't see why ftrace would be special here.. if you have all
text_poke() users use the magic new way you'll have more coverage and
better chances of hitting any snags if there are any.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 17:40       ` Peter Zijlstra
@ 2012-05-31 17:44         ` Peter Zijlstra
  2012-05-31 17:53         ` Steven Rostedt
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 17:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 19:40 +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > > Also, why does this stuff live in ftrace? I always thought you were
> > > going to replace text_poke() so everybody that uses cross-modifying code
> > > could profit?
> > 
> > I discussed this with Masami at Collaboration Summit. The two are
> > similar but also very different. But we want to start merging the two
> > together where it makes sense. 
> 
> Argh,. I so disagree. You're doing it backwards.
> 
> First you merge whatever is there, regardless of who came first.

The thing is, that's useful even if the new stuff never works. Having
two code-bases doing cross-modifying code in different ways just doesn't
sound right. That stuff is tricky enough as it is.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 17:40       ` Peter Zijlstra
  2012-05-31 17:44         ` Peter Zijlstra
@ 2012-05-31 17:53         ` Steven Rostedt
  2012-05-31 18:03           ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 19:40 +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > > Also, why does this stuff live in ftrace? I always thought you were
> > > going to replace text_poke() so everybody that uses cross-modifying code
> > > could profit?
> > 
> > I discussed this with Masami at Collaboration Summit. The two are
> > similar but also very different. But we want to start merging the two
> > together where it makes sense. 
> 
> Argh,. I so disagree. You're doing it backwards.
> 
> First you merge whatever is there, regardless of who came first.

The comment about coming first was more about 're-inventing' than about
merging. You can't reinvent something that didn't exist.

That said, I didn't even think about text poke while doing this. I was
just simply thinking about removing stop_machine from ftrace, that
required this. It was only an afterthought that text_poke() could do the
same. And this came up at Collab, where I thought, oh yeah! we can
incorporate this with text poke.


> Then, when everybody doing text modification is using the same
> interface, do a second implementation using a Kconfig knob. If the scary
> new one breaks, no sweat, flip the config. If it's proven stable, kill
> off the old one.

What do you suggest then? To revert the code and rewrite it so that
text_poke() does a similar thing?

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 17:53         ` Steven Rostedt
@ 2012-05-31 18:03           ` Peter Zijlstra
  2012-05-31 18:50             ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 18:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 13:53 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 19:40 +0200, Peter Zijlstra wrote:
> > On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > > > Also, why does this stuff live in ftrace? I always thought you were
> > > > going to replace text_poke() so everybody that uses cross-modifying code
> > > > could profit?
> > > 
> > > I discussed this with Masami at Collaboration Summit. The two are
> > > similar but also very different. But we want to start merging the two
> > > together where it makes sense. 
> > 
> > Argh,. I so disagree. You're doing it backwards.
> > 
> > First you merge whatever is there, regardless of who came first.
> 
> The comment about coming first was more about 're-inventing' than about
> merging. You can't reinvent something that didn't exist.
> 
> That said, I didn't even think about text poke while doing this.

Well, the fail is before that, how could we grow two pieces of code
doing similar things in the first place? 

>  I was
> just simply thinking about removing stop_machine from ftrace, that
> required this. It was only an afterthought that text_poke() could do the
> same. And this came up at Collab, where I thought, oh yeah! we can
> incorporate this with text poke.

But but but but.. the thing far back when Mathieu proposed the int3
scheme it was text_poke().. how.. did you not think of it this time!?

> 
> > Then, when everybody doing text modification is using the same
> > interface, do a second implementation using a Kconfig knob. If the scary
> > new one breaks, no sweat, flip the config. If it's proven stable, kill
> > off the old one.
> 
> What do you suggest then? To revert the code and rewrite it so that
> text_poke() does a similar thing?

Too late for that now I guess.. I just wonder why you all thought it was
a good idea to have two pieces of code doing cross-modifying-code. I
always assumed ftrace used text_poke().

I hardly ever use dyn-ftrace but I do use some text_poke() through
jump_labels.

I would still like to end up with one code base doing CMC with two
implementations depending on a Kconfig knob.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 17:28       ` Peter Zijlstra
@ 2012-05-31 18:42         ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 18:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 19:28 +0200, Peter Zijlstra wrote:

> I don't know, nor did you explain the implicit ordering there. Also in
> such diagrams you need the other side as well.

Right, I forgot to add that.

> 
> > If another breakpoint is hit (one other than one put in by ftrace) then
> > we don't care. It won't crash the system whether or not A is 1 or 0. We
> > just need to make sure that a ftrace breakpoint that is hit knows that
> > it was a ftrace breakpoint (calls the ftrace handler). No other
> > breakpoint should be on a ftrace nop anyway.
> 
> So the ordering is like:
> 
> ---
> 
>  CPU-0				CPU-1
> 
> 
>  lock inc mod-count /* implicit (w)mb */
>  write int3
> 				<trap-int3> /* implicit (r)mb */
> 				load mod-count
> 
>  sync-ipi-broadcast
>  write rest-of-instruction
>  sync-ipi-broadcast
>  write head-of-instruction
>  sync-ipi-broadcast
>  lock dec mod-count /* implicit (w)mb */
> 
> 
> Such that when we observe the int3 on CPU-1 we also must see the
> increment on mod-count.
> 
> ---
> 
> A simple something like the above makes it very clear what we're doing
> and what we're expecting.

You are obviously better at drawing such diagrams than I am, which also
explains why I needed the lockdep annotations to understand them
better: I don't have your ability to visualize it in such a nice,
simple diagram. My vision is more of a Bayesian tree than an oak. I
guess that explains why I like complex code so much :-)


>  I think a (local) trap should imply a barrier
> of sorts but will have to defer to others (hpa?) to confirm.

I would think it does, but lets have hpa or others confirm.


>  But at the
> very least write it down someplace that you are assuming that.

I'll cut and paste this into the comments.

> 
> 
> 
> fwiw run_sync() could do with a much bigger comment on why it's sane to
> enable interrupts.. That simply reeks, enabling interrupts too early can
> wreck stuff properly.

Hmm, actually, I think these patches avoid the need for run_sync() at
boot up now anyway. Thus it shouldn't need to enable interrupts.

The ftrace.c code was what disabled interrupts on start up (and
re-enabled them) before calling into this; that is because the old way
needed interrupts disabled, and other archs still use the old way. But
your point stands: it should be commented better.

-- Steve





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31  1:28 ` [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting Steven Rostedt
@ 2012-05-31 18:44   ` H. Peter Anvin
  2012-05-31 18:58   ` H. Peter Anvin
  1 sibling, 0 replies; 47+ messages in thread
From: H. Peter Anvin @ 2012-05-31 18:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On 05/30/2012 06:28 PM, Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@redhat.com>
> 
> When the NMI handler runs, it checks if it preempted a debug
> handler and if that handler is using the debug stack. If it is, it
> changes the IDT table not to update the stack, otherwise it will
> reset the debug stack and corrupt the debug handler it preempted.
> 
> Now that ftrace uses breakpoints to change functions from nops to 
> callers, many more places may hit a breakpoint. Unfortunately this 
> includes some of the calls that lockdep performs. Which causes
> issues with the debug stack. It too needs to change the debug stack
> before tracing (if called from the debug handler).
> 
> Allow the debug_stack_set_zero() and debug_stack_reset() to be
> nested so that the debug handlers can take advantage of them too.
> 

NAK on this in its current form.  Give me a few minutes to write up a
technical review.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 18:03           ` Peter Zijlstra
@ 2012-05-31 18:50             ` Steven Rostedt
  2012-05-31 19:06               ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 18:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 20:03 +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-31 at 13:53 -0400, Steven Rostedt wrote:
> > On Thu, 2012-05-31 at 19:40 +0200, Peter Zijlstra wrote:
> > > On Thu, 2012-05-31 at 10:08 -0400, Steven Rostedt wrote:
> > > > > Also, why does this stuff live in ftrace? I always thought you were
> > > > > going to replace text_poke() so everybody that uses cross-modifying code
> > > > > could profit?
> > > > 
> > > > I discussed this with Masami at Collaboration Summit. The two are
> > > > similar but also very different. But we want to start merging the two
> > > > together where it makes sense. 
> > > 
> > > Argh,. I so disagree. You're doing it backwards.
> > > 
> > > First you merge whatever is there, regardless of who came first.
> > 
> > The comment about coming first was more about 're-inventing' than about
> > merging. You can't reinvent something that didn't exist.
> > 
> > That said, I didn't even think about text poke while doing this.
> 
> Well, the fail is before that, how could we grow two pieces of code
> doing similar things in the first place? 

Again, ftrace is slightly different as it does 30,000 changes at once,
on top of known nops. This was done through stop_machine(), thus any
slowdown was a large hit to system performance. text_poke() took the
approach of mapping a page to do the change, and Mathieu didn't want to change
that (IIRC). But now we want the two to be similar.


> 
> >  I was
> > just simply thinking about removing stop_machine from ftrace, that
> > required this. It was only an afterthought that text_poke() could do the
> > same. And this came up at Collab, where I thought, oh yeah! we can
> > incorporate this with text poke.
> 
> But but but but.. the thing far back when Mathieu proposed the int3
> scheme it was text_poke().. how.. did you not think of it this time!?

Right, but we disagreed on its implementation, and yes, the idea came
from that, but I thought text_poke() was so different that it wouldn't
apply. It wasn't until Masami suggested changing text_poke() to be more
like what ftrace does that I really gave it thought.

> 
> > 
> > > Then, when everybody doing text modification is using the same
> > > interface, do a second implementation using a Kconfig knob. If the scary
> > > new one breaks, no sweat, flip the config. If it's proven stable, kill
> > > off the old one.
> > 
> > What do you suggest then? To revert the code and rewrite it so that
> > text_poke() does a similar thing?
> 
> Too late for that now I guess.. I just wonder why you all thought it was
> a good idea to have two pieces of code doing cross-modifying-code. I
> always assumed ftrace used text_poke().

Well, it was more of two different people working on two different
things. We really didn't look too closely at each others work. It was
more a social failure than a technical one.

> 
> I hardly ever use dyn-ftrace but I do use some text_poke() through
> jump_labels.

You don't use function tracer? That's dyn-ftrace.

But still, we need to keep the record as small as possible because it is
persistent for the life of the running system. Every location
must be recorded and must maintain state (flags).

Text_poke() mostly grew out of the jump-label work. But yes, there's
still a lot that can be shared; the actual code-modification step may be.

> 
> I would still like to end up with one code base doing CMC with two
> implementations depending on a Kconfig knob.

You mean keep stop_machine around?

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31  1:28 ` [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting Steven Rostedt
  2012-05-31 18:44   ` H. Peter Anvin
@ 2012-05-31 18:58   ` H. Peter Anvin
  2012-05-31 19:25     ` Steven Rostedt
  1 sibling, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2012-05-31 18:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On 05/30/2012 06:28 PM, Steven Rostedt wrote:
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 82f29e7..63fc083 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1101,14 +1101,20 @@ int is_debug_stack(unsigned long addr)
>  		addr > (__get_cpu_var(debug_stack_addr) - DEBUG_STKSZ));
>  }
>  
> +static DEFINE_PER_CPU(atomic_t, debug_idt_zero);
> +
>  void debug_stack_set_zero(void)
>  {
> +	atomic_inc(&__get_cpu_var(debug_idt_zero));
>  	load_idt((const struct desc_ptr *)&nmi_idt_descr);
>  }
>  
>  void debug_stack_reset(void)
>  {
> -	load_idt((const struct desc_ptr *)&idt_descr);
> +	if (WARN_ON(!atomic_read(&__get_cpu_var(debug_idt_zero))))
> +		return;
> +	if (atomic_dec_and_test(&__get_cpu_var(debug_idt_zero)))
> +		load_idt((const struct desc_ptr *)&idt_descr);
>  }
> 

What you have here is a nesting counter, which is good, but I have
some objections to the implementation.

The name "debug_idt_zero" doesn't convey a counter in any way, shape
or form; perhaps debug_stack_users or something like that.

Since this is percpu, there is no reason to use the atomic operations
(globally locked!), and since it is on the local CPU there is no need
to manifest a pointer; we can instead use this_cpu_inc/dec_return:

static DEFINE_PER_CPU(u32, debug_stack_use_ctr);

void debug_stack_set_zero(void)
{
	if (this_cpu_inc_return(debug_stack_use_ctr) == 1)
	 	load_idt((const struct desc_ptr *)&nmi_idt_descr);
}

void debug_stack_reset(void)
{
	if (this_cpu_dec_return(debug_stack_use_ctr) == 0)
		load_idt((const struct desc_ptr *)&idt_descr);
}

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 18:50             ` Steven Rostedt
@ 2012-05-31 19:06               ` Peter Zijlstra
  2012-05-31 19:55                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 19:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Masami Hiramatsu, H. Peter Anvin, Dave Jones, Andi Kleen,
	Mathieu Desnoyers

On Thu, 2012-05-31 at 14:50 -0400, Steven Rostedt wrote:

> > Well, the fail is before that, how could we grow two pieces of code
> > doing similar things in the first place? 
> 
> Again, ftrace is slightly different as it does 30,000 changes at once,
> on top of known nops. This was done through stop_machine(), thus any
> slowdown was a large hit to system performance. text_poke() took the way
> of mapping a page to do the change, and Mathieu didn't want to change
> that (IIRC). But now we want the two to be similar.

We could give text_poke a function argument to do the actual
modification, leaving all the magic centralized.

Also, why did Mathieu insist on keeping that kmap()?

> > I hardly ever use dyn-ftrace but I do use some text_poke() through
> > jump_labels.
> 
> You don't use function tracer? That's dyn-ftrace.

Not much, no. I do use trace_printk() and ftrace_dump_on_oops a lot,
though.

> But still, we need to keep the record as small as possible because it is
> persistent throughout the life of the system running. Every location
> must be recorded, and maintain a state (flags).
> 
> Text_poke() mostly grew out of the jump-label work. But yes, there's
> still a lot that can be shared. The actual code modification may be.

Afaicr we didn't change text_poke() for the jump-label stuff, except in
trivial ways (added a #ifdef and exposed a function etc..).

> > I would still like to end up with one code base doing CMC with two
> > implementations depending on a Kconfig knob.
> 
> You mean keep stop_machine around?

Yeah, like have CONFIG_CMC_STOPMACHINE and CONFIG_CMC_FANCY for a little
while.

If we find a problem with the fancy approach going back is easy, once
its proven stable we could remove the stop-machine one.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/5] x86: Reset the debug_stack update counter
  2012-05-31  1:28 ` [PATCH 3/5] x86: Reset the debug_stack update counter Steven Rostedt
@ 2012-05-31 19:19   ` H. Peter Anvin
  2012-05-31 19:26     ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2012-05-31 19:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On 05/30/2012 06:28 PM, Steven Rostedt wrote:
>> 
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index
> 9087527..c1fffc5 100644 --- a/arch/x86/kernel/nmi.c +++
> b/arch/x86/kernel/nmi.c @@ -450,8 +450,10 @@ static inline void
> nmi_nesting_preprocess(struct pt_regs *regs)
> 
> static inline void nmi_nesting_postprocess(void) { -	if
> (unlikely(__get_cpu_var(update_debug_stack))) +	if
> (unlikely(__get_cpu_var(update_debug_stack))) { 
> debug_stack_reset(); +		__get_cpu_var(update_debug_stack) = 0; +	} 
> } #endif
> 

Please don't use __get_cpu_var(); it causes a pointer to be manifest,
which is free or almost free on most architectures but quite expensive
on x86.

Instead use this_cpu_read()/this_cpu_write().

Consider:

#include <linux/types.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(int, percpu_test);

int read_foo(void)
{
	return __get_cpu_var(percpu_test);
}

void write_foo(int x)
{
	__get_cpu_var(percpu_test) = x;
}

int read_bar(void)
{
	return this_cpu_read(percpu_test);
}

void write_bar(int x)
{
	this_cpu_write(percpu_test, x);
}

... and the corresponding assembly code (with gcc boilerplate removed):

read_foo:
        movq    $percpu_test, %rax      #, tcp_ptr__
        add %gs:this_cpu_off, %rax      # this_cpu_off, tcp_ptr__
        movl    (%rax), %eax    # *D.8429_3, *D.8429_3
        ret

write_foo:
        movq    $percpu_test, %rax      #, tcp_ptr__
        add %gs:this_cpu_off, %rax      # this_cpu_off, tcp_ptr__
        movl    %edi, (%rax)    # x, *D.8435_3
        ret

read_bar:
        movl %gs:percpu_test,%eax       # percpu_test, pfo_ret__

write_bar:
        movl %edi,%gs:percpu_test       # x, percpu_test
        ret

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 18:58   ` H. Peter Anvin
@ 2012-05-31 19:25     ` Steven Rostedt
  2012-05-31 19:27       ` H. Peter Anvin
  2012-06-01  2:30       ` Steven Rostedt
  0 siblings, 2 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 19:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 11:58 -0700, H. Peter Anvin wrote:

> What you have here is a nesting counter, which is good, but I have
> some objections to the implementation.
> 
> The name "debug_idt_zero" doesn't convey a counter in any way, shape
> or form; perhaps debug_stack_users or something like that.

Well, it is 'debug_idt_SET_zero' which is what it is doing, regardless
of counter (see below). But perhaps we should change it to:

debug_idt_get()
debug_idt_put()

?

> 
> Since this is percpu, there is no reason to use the atomic operations
> (globally locked!), and since it is on the local CPU there is no need
> to manifest a pointer; we can instead use this_cpu_inc/dec_return:
> 
> static DEFINE_PER_CPU(u32, debug_stack_use_ctr);
> 
> void debug_stack_set_zero(void)
> {
> 	if (this_cpu_inc_return(debug_stack_use_ctr) == 1)

If an NMI comes in here, it will not update the IDT and will corrupt the
stack. The load_idt() must happen for all calls. There's only two static
IDT tables. It's either one or the other. Thus, if it loads it twice, it
will just set it to the same value. The counter is to know when to set
it back.

-- Steve

> 	 	load_idt((const struct desc_ptr *)&nmi_idt_descr);
> }
> 
> void debug_stack_reset(void)
> {
> 	if (this_cpu_dec_return(debug_stack_use_ctr) == 0)
> 		load_idt((const struct desc_ptr *)&idt_descr);
> }
> 
> 	-hpa
> 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/5] x86: Reset the debug_stack update counter
  2012-05-31 19:19   ` H. Peter Anvin
@ 2012-05-31 19:26     ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 19:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 12:19 -0700, H. Peter Anvin wrote:
> On 05/30/2012 06:28 PM, Steven Rostedt wrote:
> >> 
> > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index
> > 9087527..c1fffc5 100644 --- a/arch/x86/kernel/nmi.c +++
> > b/arch/x86/kernel/nmi.c @@ -450,8 +450,10 @@ static inline void
> > nmi_nesting_preprocess(struct pt_regs *regs)
> > 
> > static inline void nmi_nesting_postprocess(void) { -	if
> > (unlikely(__get_cpu_var(update_debug_stack))) +	if
> > (unlikely(__get_cpu_var(update_debug_stack))) { 
> > debug_stack_reset(); +		__get_cpu_var(update_debug_stack) = 0; +	} 
> > } #endif
> > 
> 
> Please don't use __get_cpu_var(); it causes a pointer to be manifest,
> which is free or almost free on most architectures but quite expensive
> on x86.
> 
> Instead use this_cpu_read()/this_cpu_write().

OK, will change.

Thanks!

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 19:25     ` Steven Rostedt
@ 2012-05-31 19:27       ` H. Peter Anvin
  2012-05-31 20:00         ` Steven Rostedt
  2012-06-01  2:30       ` Steven Rostedt
  1 sibling, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2012-05-31 19:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On 05/31/2012 12:25 PM, Steven Rostedt wrote:
>> void debug_stack_set_zero(void)
>> {
>> 	if (this_cpu_inc_return(debug_stack_use_ctr) == 1)
> 
> If an NMI comes in here, it will not update the IDT and will corrupt the
> stack. The load_idt() must happen for all calls. There's only two static
> IDT tables. It's either one or the other. Thus, if it loads it twice, it
> will just set it to the same value. The counter is to know when to set
> it back.
> 

Now I'm really confused.  Why would it have to set it if it is already set?

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 19:06               ` Peter Zijlstra
@ 2012-05-31 19:55                 ` Mathieu Desnoyers
  2012-05-31 20:10                   ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Mathieu Desnoyers @ 2012-05-31 19:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, Masami Hiramatsu, H. Peter Anvin,
	Dave Jones, Andi Kleen

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2012-05-31 at 14:50 -0400, Steven Rostedt wrote:
> 
> > > Well, the fail is before that, how could we grow two pieces of code
> > > doing similar things in the first place? 
> > 
> > Again, ftrace is slightly different as it does 30,000 changes at once,
> > on top of known nops. This was done through stop_machine(), thus any
> > slowdown was a large hit to system performance. text_poke() took the way
> > of mapping a page to do the change, and Mathieu didn't want to change
> > that (IIRC). But now we want the two to be similar.
> 
> We could give text_poke a function argument to do the actual
> modification, leaving all the magic centralized.
> 
> Also, why did Mathieu insist on keeping that kmap()?

Not sure about the entire context here, but the goal of using kmap() is
to allow modification of text in configurations where the kernel text
is read-only: the kmap creates a temporary shadow RW mapping that allows
modification of the text. Presumably ftrace's 30k changes are done
before the kernel text mapping is set to read-only? If that is the
case, then it is similar to text_poke_early(), which doesn't use the
kmap since it runs before kernel text gets write-protected. But
text_poke() has to deal with RO pages.

Hopefully my answer makes sense in the context of your discussion.

Thanks,

Mathieu

> 
> > > I hardly ever use dyn-ftrace but I do use some text_poke() through
> > > jump_labels.
> > 
> > You don't use function tracer? That's dyn-ftrace.
> 
> Not much no.. I do use trace_printk() and ftrace_dump_on_oops a lot
> though.
> 
> > But still, we need to keep the record as small as possible because it is
> > persistent throughout the life of the system running. Every location
> > must be recorded, and maintain a state (flags).
> > 
> > Text_poke() mostly grew out of the jump-label work. But yes, there's
> > still a lot that can be shared. The actual code modification may be.
> 
> Afaicr we didn't change text_poke() for the jump-label stuff, except in
> trivial ways (added a #ifdef and exposed a function etc..).
> 
> > > I would still like to end up with one code base doing CMC with two
> > > implementations depending on a Kconfig knob.
> > 
> > You mean keep stop_machine around?
> 
> Yeah, like have CONFIG_CMC_STOPMACHINE and CONFIG_CMC_FANCY for a little
> while.
> 
> If we find a problem with the fancy approach going back is easy, once
> its proven stable we could remove the stop-machine one.

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 19:27       ` H. Peter Anvin
@ 2012-05-31 20:00         ` Steven Rostedt
  2012-05-31 20:17           ` H. Peter Anvin
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 20:00 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 12:27 -0700, H. Peter Anvin wrote:
> On 05/31/2012 12:25 PM, Steven Rostedt wrote:
> >> void debug_stack_set_zero(void)
> >> {
> >> 	if (this_cpu_inc_return(debug_stack_use_ctr) == 1)
> > 
> > If an NMI comes in here, it will not update the IDT and will corrupt the
> > stack. The load_idt() must happen for all calls. There's only two static
> > IDT tables. It's either one or the other. Thus, if it loads it twice, it
> > will just set it to the same value. The counter is to know when to set
> > it back.
> > 
> 
> Now I'm really confused.  Why would it have to set it if it is already set?
> 

It doesn't ;-)

But we don't know if the IDT is what it needs to be. Just because the
counter is set to 1 doesn't mean the IDT was already set, because we are
dealing with NMIs, which are totally asynchronous, and there is a race
between setting the counter and setting the IDT.

Basically, (as you already know) the IST=0 means to use the same stack
if we hit a breakpoint. But usually it's set to '4' which is an index
into the TSS to find its per CPU stack.

If the IST is set to 4, and a breakpoint is hit, then the stack is set
to a fixed address, even if you are currently using the same stack!

We need to prevent this from happening. There's two places that this can
cause issues. The first is in the NMI handler, which is where this code
was first developed. The idea was to allow NMIs to hit breakpoints. But
if it were to do that after preempting a debug int3 handler, the
breakpoint it hit would reset the stack and the NMI would start
clobbering the stack of the code it preempted.

The NMI code on entering (and before it is allowed to hit any
breakpoints) looks at the stack it preempted. If it is a debug stack,
then it updates the IDT to have the int3 vector use an IST of 0 (keep
the same stack when triggered). Then if the NMI hits a breakpoint, it
just continues to use the NMI stack.

The int3 handler has a little trick that lets the int3 code reuse the
stack. It modifies the TSS to point to another stack before calling the
debug handler. If an NMI triggers now, or a breakpoint is hit again, it
won't corrupt the stack.

	subq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
	call \do_sym
	addq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)

The debug stack is EXCEPTION_STKSZ * 2 size. Before entering the \do_sym
(do_int3 in this case), it moves the TSS to point to another stack for
the int3 handler.

The problem this patch set is trying to fix is the case for lockdep. The
macro TRACE_IRQS_ON/OFF calls into lockdep code. And these are outside
that add/sub TSS trick. If lockdep code hits a breakpoint then we reset
back to the original stack address, and start clobbering the stack. This
is the bug that Dave Jones was triggering.

Now we could do this add/sub TSS trick instead of loading the IDT for all
the cases before calling lockdep in the debug handler. But this means
the paranoid_exit will need to be turned into a macro and we would
require it for each ist set.

Now, back to your original question. Why set it if it is already set.
Well, it doesn't hurt to set it (except for the performance hit we
take), but it does hurt if we should set it but don't, as that means we
can reset the stack and clobber what we preempted.

I'm sure there's room to make this more efficient. But I'm currently
trying to solve a kernel crash first, and then work on cleaning it up
later.

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 19:55                 ` Mathieu Desnoyers
@ 2012-05-31 20:10                   ` Steven Rostedt
  2012-05-31 20:26                     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 20:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, Masami Hiramatsu, H. Peter Anvin,
	Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 15:55 -0400, Mathieu Desnoyers wrote:

> > Also, why did Mathieu insist on keeping that kmap()?
> 
> Not sure about the entire context here, but the goal of using kmap() is
> to allow modification of text in configurations where the kernel text
> is read-only: the kmap does a temporary shadow RW mapping that allows
> modification of the text. Presumably that Ftrace's 30k changes are done
> before the kernel text mapping is set to read-only ? If this is the
> case, then it is similar to text_poke_early, which don't use the kmap
> since it happens before kernel text gets write-protected. But text_poke
> has to deal with RO pages.

No, this is also when ftrace is enabled at runtime. The trick that ftrace
does is to temporarily convert the kernel text from ro to rw, and
then back to ro when done. You can argue that this degrades the security
of the system, but tracing every function in the kernel does too ;-)
That's why it's a root only utility.

Hmm, this brings up another question. By default, perf does not allow
users to profile trace events, correct? IOW, perf does not let
unprivileged users trigger text_poke()? I just tried it and got this:

$ ~/bin/perf record -e sched:sched_switch sleep 1
Permission error - are you root?
Consider tweaking /proc/sys/kernel/perf_event_paranoid:
 -1 - Not paranoid at all
  0 - Disallow raw tracepoint access for unpriv
  1 - Disallow cpu events for unpriv
  2 - Disallow kernel profiling for unpriv


> 
> Hopefully my answer makes sense in the context of your discussion.

Almost,

-- Steve


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 20:00         ` Steven Rostedt
@ 2012-05-31 20:17           ` H. Peter Anvin
  2012-05-31 20:35             ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2012-05-31 20:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On 05/31/2012 01:00 PM, Steven Rostedt wrote:
> 
> It doesn't ;-)
> 
> But we don't know if it is what it needs to be. Just because the counter
> is set to 1, doesn't mean that it was already set. Because we are
> dealing with NMIs, that are totally asynchronous, and the race of
> setting the counter and setting the idt.
> 
> Basically, (as you already know) the IST=0 means to use the same stack
> if we hit a breakpoint. But usually it's set to '4' which is an index
> into the TSS to find its per CPU stack.
> 
> If the IST is set to 4, and a breakpoint is hit, then the stack is set
> to a fixed address, even if you are currently using the same stack!
> 
> We need to prevent this from happening. There's two places that this can
> cause issues. The first is in the NMI handler, which is where this code
> was first developed. The idea was to allow NMIs to hit breakpoints. But
> if it were to do that after preempting a debug int3 handler, the
> breakpoint it hit would reset the stack and the NMI would start
> clobbering the stack of the code it preempted.
> 
> The NMI code on entering (and before it is allowed to hit any
> breakpoints) looks at the stack it preempted. If it is a debug stack,
> then it updates the IDT to have the int3 vector have a IST of 0 (keep
> the same stack when triggered). Then if the NMI hits a breakpoint, it
> just continues to use the NMI stack.
> 
> The int3 handler has a little trick that lets the int3 code reuse the
> stack. It modifies the TSS to point to another stack before calling the
> debug handler. If a NMI triggers now, or a breakpoint is hit again, it
> wont corrupt the stack.
> 
> 	subq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
> 	call \do_sym
> 	addq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
> 
> The debug stack is EXCEPTION_STKSZ * 2 size. Before entering the \do_sym
> (do_int3 in this case), it moves the TSS to point to another stack for
> the int3 handler.
> 
> The problem this patch set is trying to fix is the case for lockdep. The
> macro TRACE_IRQS_ON/OFF calls into lockdep code. And these are outside
> that add/sub TSS trick. If lockdep code hits a breakpoint than we reset
> back to the original stack address, and start clobbering the stack. This
> is the bug that Dave Jones was triggering.
> 
> Now we could do this add/sub TSS trick instead of loading idt for all
> the cases before calling lockdep in the debug handler. But this means
> the paranoid_exit will need to be turned into a macro and we would
> require it for each ist set.
> 
> Now, back to your original question. Why set it if it is already set.
> Well, it doesn't hurt to set it (except for the performance hit we
> take), but it does hurt if we should set it but don't, as that means we
> can reset the stack and clobber what we preempted.
> 
> I'm sure there's room to make this more efficient. But I'm currently
> trying to solve a kernel crash first, and then work on cleaning it up
> later.
> 

Ouch.  This is really way more complex than it has any excuse for being,
and it's the complexity that concerns me, not the performance.

I'd like a chart, or list, of the alternate stack environments we can be
in and what can transfer to what.  I think there might be an easier
solution that is more robust.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 20:10                   ` Steven Rostedt
@ 2012-05-31 20:26                     ` Peter Zijlstra
  2012-05-31 20:37                       ` Steven Rostedt
  2012-06-01  0:45                       ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 20:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, Masami Hiramatsu, H. Peter Anvin,
	Dave Jones, Andi Kleen, acme

On Thu, 2012-05-31 at 16:10 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 15:55 -0400, Mathieu Desnoyers wrote:
> 
> > > Also, why did Mathieu insist on keeping that kmap()?
> > 
> > Not sure about the entire context here, but the goal of using kmap() is
> > to allow modification of text in configurations where the kernel text
> > is read-only: the kmap does a temporary shadow RW mapping that allows
> > modification of the text. Presumably that Ftrace's 30k changes are done
> > before the kernel text mapping is set to read-only ? If this is the
> > case, then it is similar to text_poke_early, which don't use the kmap
> > since it happens before kernel text gets write-protected. But text_poke
> > has to deal with RO pages.
> 
> No this is also when ftrace is enabled at runtime. The trick that ftrace
> does is to temporarially  convert the kernel text from ro to rw, and
> then back to ro when done. You can argue that this degrades the security
> of the system, but tracing every function in the kernel does too ;-)
> That's why it's a root only utility.

Right, but when you lose stop-machine you could simply do 30k
kmap_atomic/kunmap_atomic's consecutively since you're not holding
anybody up.

> Hmm, this brings up another question. By default, perf does not allow
> users to profile trace_events correct? IOW, perf does not let
> unprivileged users call text_poke()? I just tried it and got the:
> 
> $ ~/bin/perf record -e sched:sched_switch sleep 1
> Permission error - are you root?
> Consider tweaking /proc/sys/kernel/perf_event_paranoid:
>  -1 - Not paranoid at all
>   0 - Disallow raw tracepoint access for unpriv
>   1 - Disallow cpu events for unpriv
>   2 - Disallow kernel profiling for unpriv

It would, except tools/perf does stupid, its unconditionally adding
PERF_SAMPLE_RAW (even for non-sampling events), which is the bit that
requires privs.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 20:17           ` H. Peter Anvin
@ 2012-05-31 20:35             ` Steven Rostedt
  2012-05-31 20:39               ` H. Peter Anvin
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 20:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 13:17 -0700, H. Peter Anvin wrote:

> Ouch.  This is really way more complex than it has any excuse for being,
> and it's the complexity that concerns me, not the performance.

Complexity is my Sun, and I am the planet that orbits around it.

> 
> I'd like a chart, or list, of the alternate stack environments we can be
> in and what can transfer to what.  I think there might be an easier
> solution that is more robust.

Well, it's not as bad as one might think:

#define STACKFAULT_STACK 1
#define DOUBLEFAULT_STACK 2
#define NMI_STACK 3
#define DEBUG_STACK 4
#define MCE_STACK 5
#define N_EXCEPTION_STACKS 5  /* hw limit: 7 */

These are the exceptions that have their own stacks.

arch/x86/kernel/traps.c:        set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
arch/x86/kernel/traps.c:        set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
arch/x86/kernel/traps.c:        set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
arch/x86/kernel/traps.c:        set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
arch/x86/kernel/traps.c:        set_intr_gate_ist(X86_TRAP_SS, &stack_segment, STACKFAULT_STACK);
arch/x86/kernel/traps.c:        set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);

We only have two IDT tables that are fixed and are switched via the NMI
handler (debug_stack_set_zero), as well as this patch set.

head_64.S:

ENTRY(idt_table)
	.skip IDT_ENTRIES * 16

	.align L1_CACHE_BYTES
ENTRY(nmi_idt_table)
	.skip IDT_ENTRIES * 16


Only the DEBUG stack has a double size:

arch/x86/kernel/cpu/common.c:

static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
};


Thus only the debug stack does the stack TSS trick.

Is this what you were looking for? (God, it just shows how much time
I've been spending on this crap, as I was able to find all this by
memory and not grepping for it :-p )

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 20:26                     ` Peter Zijlstra
@ 2012-05-31 20:37                       ` Steven Rostedt
  2012-05-31 20:40                         ` Steven Rostedt
  2012-05-31 20:40                         ` Peter Zijlstra
  2012-06-01  0:45                       ` Arnaldo Carvalho de Melo
  1 sibling, 2 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, Masami Hiramatsu, H. Peter Anvin,
	Dave Jones, Andi Kleen, acme

On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:

> Right, but when you lose stop-machine you could simply do 30k
> kmap_atomic/kunmap_atomic's consecutively since you're not holding
> anybody up.

It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
blasting^Wsending to all CPUs.

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 20:35             ` Steven Rostedt
@ 2012-05-31 20:39               ` H. Peter Anvin
  2012-05-31 20:56                 ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2012-05-31 20:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On 05/31/2012 01:35 PM, Steven Rostedt wrote:
> 
> Thus only the debug stack does the stack TSS trick.
> 
> Is this what you were looking for? (God, it just shows how much time
> I've been spending on this crap, as I was able to find all this by
> memory and not grepping for it :-p )
> 

That's why I asked instead of pulled it up myself.

However, what is missing is what the permitted transitions are.  This
affects the options.

	-hpa


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 20:37                       ` Steven Rostedt
@ 2012-05-31 20:40                         ` Steven Rostedt
  2012-05-31 20:40                         ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, Masami Hiramatsu, H. Peter Anvin,
	Dave Jones, Andi Kleen, acme

On Thu, 2012-05-31 at 16:37 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
> 
> > Right, but when you lose stop-machine you could simply do 30k
> > kmap_atomic/kunmap_atomic's consecutively since you're not holding
> > anybody up.
> 
> It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
> blasting^Wsending to all CPUs.
> 

Note, currently ftrace adds breakpoints to all locations, sends out the
IPIs, modifies the code, sends out the IPIs, and then removes the
breakpoints and sends out the IPIs.

Just three IPIs for those 30,000 updates.

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 20:37                       ` Steven Rostedt
  2012-05-31 20:40                         ` Steven Rostedt
@ 2012-05-31 20:40                         ` Peter Zijlstra
  2012-05-31 20:49                           ` Steven Rostedt
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2012-05-31 20:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, Masami Hiramatsu, H. Peter Anvin,
	Dave Jones, Andi Kleen, acme

On Thu, 2012-05-31 at 16:37 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
> 
> > > Right, but when you lose stop-machine you could simply do 30k
> > kmap_atomic/kunmap_atomic's consecutively since you're not holding
> > anybody up.
> 
> It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
> blasting^Wsending to all CPUs.

Uhm, no. 

for_each() {
  kmap_atomic()
  frob int3
  kunmap_atomic();
}

sync-ipi-broadcast();

for_each() {
  kmap_atomic();
  frob tail
  kunmap_atomic();
}

sync-ipi-broadcast();

for_each() {
  kmap_atomic();
  frob head
  kunmap_atomic();
};

sync-ipi-broadcast();

How is that sending 90k ipis?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 20:40                         ` Peter Zijlstra
@ 2012-05-31 20:49                           ` Steven Rostedt
  2012-06-01  4:53                             ` Masami Hiramatsu
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, Masami Hiramatsu, H. Peter Anvin,
	Dave Jones, Andi Kleen, acme

On Thu, 2012-05-31 at 22:40 +0200, Peter Zijlstra wrote:
> On Thu, 2012-05-31 at 16:37 -0400, Steven Rostedt wrote:
> > On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
> > 
> > > Right, but when you lose stop-machine you could simply do 30k
> > > kmap_atomic/kunmap_atomic's consecutively since you're not holding
> > > anybody up.
> > 
> > It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
> > blasting^Wsending to all CPUs.
> 
> Uhm, no. 
> 

----------------------------------+
> for_each() {                    |
>   kmap_atomic()                 |
>   frob int3                     |
>   kunmap_atomic();              |
> }                               |
>                                 |
> sync-ipi-broadcast();           +--- Break points applied
>                                 |
> for_each() {                    |
>   kmap_atomic();                |
>   frob tail                     |
>   kunmap_atomic();              |
> }                               |
----------------------------------+

Note, for the above time, the entire kernel has breakpoints added, and
every function is taking a hit due to it. By slowing down this process,
the rest of the system *will* be impacted. Ideally, we want to finish it
as quickly as possible. Not to mention, the kmap_atomics will be slowed
down by the breakpoints that are attached to them.

-- Steve

> 
> sync-ipi-broadcast();
> 
> for_each() {
>   kmap_atomic();
>   frob head
>   kunmap_atomic();
> };
> 
> sync-ipi-broadcast();
> 
> How is that sending 90k ipis?



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 20:39               ` H. Peter Anvin
@ 2012-05-31 20:56                 ` Steven Rostedt
  2012-05-31 21:09                   ` H. Peter Anvin
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 20:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 13:39 -0700, H. Peter Anvin wrote:
> On 05/31/2012 01:35 PM, Steven Rostedt wrote:
> > 
> > Thus only the debug stack does the stack TSS trick.
> > 
> > Is this what you were looking for? (God, it just shows how much time
> > I've been spending on this crap, as I was able to find all this by
> > memory and not grepping for it :-p )
> > 
> 
> That's why I asked instead of pulled it up myself.
> 
> However, what is missing is what the permitted transitions are.  This
> affects the options.

Only the debug stack changes. These happen in the two places I already
mentioned.

One, the switch of the IDT in NMI (to convert the IST from 4 to 0, and
just keep the same stack).

Two, the trick in the paranoidzeroentry_ist code (adding the debug
stack). That changes the DEBUG stack pointer.

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 20:56                 ` Steven Rostedt
@ 2012-05-31 21:09                   ` H. Peter Anvin
  2012-05-31 21:37                     ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2012-05-31 21:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On 05/31/2012 01:56 PM, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 13:39 -0700, H. Peter Anvin wrote:
>> On 05/31/2012 01:35 PM, Steven Rostedt wrote:
>>>
>>> Thus only the debug stack does the stack TSS trick.
>>>
>>> Is this what you were looking for? (God, it just shows how much time
>>> I've been spending on this crap, as I was able to find all this by
>>> memory and not grepping for it :-p )
>>>
>>
>> That's why I asked instead of pulled it up myself.
>>
>> However, what is missing is what the permitted transitions are.  This
>> affects the options.
> 
> Only the debug stack changes. These happen in the two places I already
> mentioned.
> 
> One, the switch of the IDT in NMI (to convert the IST from 4 to 0, and
> just keep the same stack).
> 
> Two, the trick in the paranoidzeroentry_ist code (adding the debug
> stack). That changes the DEBUG stack pointer.
> 

No, I'm asking what environments (alternate stacks) are permitted where.
 That is the important information.

The interaction of the above two things is part of the problem, of
course, and I think it can (and should) be avoided.

	-hpa


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 21:09                   ` H. Peter Anvin
@ 2012-05-31 21:37                     ` Steven Rostedt
  2012-05-31 21:38                       ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 21:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 14:09 -0700, H. Peter Anvin wrote:

> No, I'm asking what environments (alternate stacks) are permitted where.
>  That is the important information.

I believe that the IST stack switch is just a convenient way to allow
separate stacks for separate events, instead of trying to handle all
events on a single stack and risking stack overflow.

Thus, the debug/int3 and NMI events use a separate stack so as not to
stress the current stack, which is not expected to switch. The external
interrupt stacks are switched on entry of the interrupt, and execution
stays on that stack until the interrupt is finished. There's a check to
see if it is already on that stack before it does the switch, so that it
doesn't have the HW switch issues that NMI and int3 have.

Thus, if we say we must stay on either the NMI stack or the DEBUG stack,
then this is what happens. Well, almost.

For an NMI, if it preempted something on the debug stack, it will
always stay on the NMI stack and never switch. A check is made for both
debug stacks at once:

int is_debug_stack(unsigned long addr)
{
	return __get_cpu_var(debug_stack_usage) ||
		(addr <= __get_cpu_var(debug_stack_addr) &&
		 addr > (__get_cpu_var(debug_stack_addr) - DEBUG_STKSZ));
}

Where:

#define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1)
#define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER)
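The range test in is_debug_stack() can be illustrated in user space like this (a minimal sketch with made-up addresses; debug_stack_addr is the top of the stack, and stacks grow down, so an address is on it when it lies in (top - DEBUG_STKSZ, top]):

```c
#include <assert.h>

/* Same size computation as above: DEBUG_STKSZ is two pages. */
#define PAGE_SIZE		4096UL
#define EXCEPTION_STACK_ORDER	0
#define DEBUG_STACK_ORDER	(EXCEPTION_STACK_ORDER + 1)
#define DEBUG_STKSZ		(PAGE_SIZE << DEBUG_STACK_ORDER)

/* Hypothetical top-of-stack value for the example. */
static unsigned long debug_stack_addr = 0xffff880000010000UL;

/* Mirrors the address-range half of is_debug_stack(): true when addr
 * lies within the DEBUG_STKSZ bytes below (and including) the top. */
int on_debug_stack(unsigned long addr)
{
	return addr <= debug_stack_addr &&
	       addr > debug_stack_addr - DEBUG_STKSZ;
}
```

The debug_stack_usage half of the real check catches the case where the CPU has claimed the stack but not yet switched onto it.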


But if the NMI did not preempt the DEBUG stack, and it hits a
breakpoint, it will switch to the debug stack. And it is possible to
make another switch if the debug code were to somehow hit another
breakpoint.

Thus the transitions are something like this:

NORMAL_STACK (either kernel/user/irq) -> DEBUG1 -> DEBUG2

NORMAL_STACK -> DEBUG1 -> NMI

NORMAL_STACK -> DEBUG1 -> DEBUG2 -> NMI

NORMAL_STACK -> NMI -> DEBUG1 -> DEBUG2

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 21:37                     ` Steven Rostedt
@ 2012-05-31 21:38                       ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-05-31 21:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 17:37 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 14:09 -0700, H. Peter Anvin wrote:
> 
> > No, I'm asking what environments (alternate stacks) are permitted where.
> >  That is the important information.
> 
> I believe that the IST stack switch is just a convenient way to allow
> separate stacks for separate events, instead of trying to handle all
> events on a single stack and risking stack overflow.
> 
> Thus, the debug/int3 and NMI events use a separate stack so as not to
> stress the current stack, which is not expected to switch. The external
> interrupt stacks are switched on entry of the interrupt, and execution
> stays on that stack until the interrupt is finished. There's a check to
> see if it is already on that stack before it does the switch, so that it
> doesn't have the HW switch issues that NMI and int3 have.
> 
> Thus, if we say we must stay on either the NMI stack or the DEBUG stack,
> then this is what happens. Well, almost.
> 
> For an NMI, if it preempted something on the debug stack, it will
> always stay on the NMI stack and never switch. A check is made for both
> debug stacks at once:
> 
> int is_debug_stack(unsigned long addr)
> {
> 	return __get_cpu_var(debug_stack_usage) ||
> 		(addr <= __get_cpu_var(debug_stack_addr) &&
> 		 addr > (__get_cpu_var(debug_stack_addr) - DEBUG_STKSZ));
> }
> 

And, yes, I can send a patch to change these to this_cpu_() if you
like ;-)

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 20:26                     ` Peter Zijlstra
  2012-05-31 20:37                       ` Steven Rostedt
@ 2012-06-01  0:45                       ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 47+ messages in thread
From: Arnaldo Carvalho de Melo @ 2012-06-01  0:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, linux-kernel, Ingo Molnar,
	Andrew Morton, Frederic Weisbecker, Masami Hiramatsu,
	H. Peter Anvin, Dave Jones, Andi Kleen

Em Thu, May 31, 2012 at 10:26:09PM +0200, Peter Zijlstra escreveu:
> > Hmm, this brings up another question. By default, perf does not allow
> > users to profile trace_events correct? IOW, perf does not let
> > unprivileged users call text_poke()? I just tried it and got the:
> > 
> > $ ~/bin/perf record -e sched:sched_switch sleep 1
> > Permission error - are you root?
> > Consider tweaking /proc/sys/kernel/perf_event_paranoid:
> >  -1 - Not paranoid at all
> >   0 - Disallow raw tracepoint access for unpriv
> >   1 - Disallow cpu events for unpriv
> >   2 - Disallow kernel profiling for unpriv
> 
It would, except tools/perf does something stupid: it's unconditionally
adding PERF_SAMPLE_RAW (even for non-sampling events), which is the bit
that requires privs.

Looserland, bah, I'll check that, thanks.

- Arnaldo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting
  2012-05-31 19:25     ` Steven Rostedt
  2012-05-31 19:27       ` H. Peter Anvin
@ 2012-06-01  2:30       ` Steven Rostedt
  1 sibling, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2012-06-01  2:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Masami Hiramatsu, Dave Jones, Andi Kleen

On Thu, 2012-05-31 at 15:25 -0400, Steven Rostedt wrote:
> On Thu, 2012-05-31 at 11:58 -0700, H. Peter Anvin wrote:
> 
> > What you have here is a nesting counter, which is good, but I have
> > some objections to the implementation.
> > 
> > The name "debug_idt_zero" doesn't convey a counter in any way, shape
> > or form; perhaps debug_stack_users or something like that.
> 
> Well, it is 'debug_idt_SET_zero' which is what it is doing, regardless
> of counter (see below). But perhaps we should change it to:
> 
> debug_idt_get()
> debug_idt_put()

Bah, silly me, I misunderstood you. You were talking about the counter
itself, not the function names.

OK, will fix.
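For what it's worth, the nesting counter being discussed might look roughly like this (a hypothetical user-space sketch; the variable and helper names are invented, not the actual kernel symbols):

```c
#include <assert.h>

/* Hypothetical sketch of a debug-stack nesting counter: only the
 * outermost user actually switches the IDT; nested users just bump
 * the count. idt_has_zero_ist models which IDT is currently loaded. */

static int debug_stack_users;
static int idt_has_zero_ist;

void debug_idt_get(void)
{
	if (debug_stack_users++ == 0)
		idt_has_zero_ist = 1;	/* load the zero-IST debug IDT */
}

void debug_idt_put(void)
{
	assert(debug_stack_users > 0);	/* unbalanced put is a bug */
	if (--debug_stack_users == 0)
		idt_has_zero_ist = 0;	/* restore the normal IDT */
}

/* Exercise a nested get/put pair: the IDT must stay switched until
 * the LAST user drops its reference. Returns 1 on success. */
int demo_nesting(void)
{
	int still_switched;

	debug_idt_get();		/* outermost user: IDT switched */
	debug_idt_get();		/* nested user: count only */
	debug_idt_put();		/* still one user left */
	still_switched = idt_has_zero_ist;
	debug_idt_put();		/* last user: IDT restored */
	return still_switched && !idt_has_zero_ist;
}
```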

Thanks,

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-05-31 20:49                           ` Steven Rostedt
@ 2012-06-01  4:53                             ` Masami Hiramatsu
  2012-06-01 11:37                               ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Masami Hiramatsu @ 2012-06-01  4:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mathieu Desnoyers, linux-kernel, Ingo Molnar,
	Andrew Morton, Frederic Weisbecker, H. Peter Anvin, Dave Jones,
	Andi Kleen, acme, yrl.pp-manager.tt

(2012/06/01 5:49), Steven Rostedt wrote:
> On Thu, 2012-05-31 at 22:40 +0200, Peter Zijlstra wrote:
>> On Thu, 2012-05-31 at 16:37 -0400, Steven Rostedt wrote:
>>> On Thu, 2012-05-31 at 22:26 +0200, Peter Zijlstra wrote:
>>>
>>>> Right, but when you lose stop-machine you could simply do 30k
>>>> kmap_atomic/kunmap_atomic's consecutively since you're not holding
>>>> anybody up.
>>>
>>> It requires 3 IPIs per update too. Thus that's 90,000 IPIs you are
>>> blasting^Wsending to all CPUs.
>>
>> Uhm, no. 
>>
> 
> ----------------------------------+
>> for_each() {                     |
>>   kmap_atomic()                  |
>>   frob int3                      |
>>   kunmap_atomic();               |
>> }                                |
>>                                  |
>> sync-ipi-broadcast();            +--- Break points applied
>>                                  |
>> for_each() {                     |
>>   kmap_atomic();                 |
>>   frob tail                      |
>>   kunmap_atomic();               |
>> }                                |
> ----------------------------------+
> 
> Note, for the above time, the entire kernel has breakpoints added, and
> every function is taking a hit due to it. By slowing down this process,
> the rest of the system *will* be impacted. Ideally, we want to finish it
> as quickly as possible. Not to mention, the kmap_atomics will be slowed
> down by the breakpoints that are attached to them.

Hmm, why don't we have two text_poke interfaces, one for single
updates and one for batches? As with dyn_ftrace, since modifying a
massive number of points takes a lot of time, we could have an
additional kconfig, something like CONFIG_QUICK_BATCH_TEXT_POKE,
which switches the text area to rw during a batch text_poke.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-06-01  4:53                             ` Masami Hiramatsu
@ 2012-06-01 11:37                               ` Steven Rostedt
  2012-06-01 12:52                                 ` Masami Hiramatsu
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2012-06-01 11:37 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Mathieu Desnoyers, linux-kernel, Ingo Molnar,
	Andrew Morton, Frederic Weisbecker, H. Peter Anvin, Dave Jones,
	Andi Kleen, acme, yrl.pp-manager.tt

On Fri, 2012-06-01 at 13:53 +0900, Masami Hiramatsu wrote:

> Hmm, why don't we have two text_poke interfaces, one for single
> updates and one for batches? As with dyn_ftrace, since modifying a
> massive number of points takes a lot of time, we could have an
> additional kconfig, something like CONFIG_QUICK_BATCH_TEXT_POKE,
> which switches the text area to rw during a batch text_poke.
> 

I'll be working on patches to consolidate the two after I get everything
else working :-)  I still need to work on the ftrace kprobe stuff too.

I hate having a config option to switch between the two, except for
something that can be there if we find the new approach is buggy (like
we have with lockdep). That may be a solution for this if we don't agree
on one now. That is, bring back stop_machine() when LOCKDEP is enabled.
But that should only be a temporary workaround, not a true fix.

I have no problem with having most of the modifying code be shared
between text_poke and ftrace. I guess the question is, do we want to do
it only with the FIXMAP or do we want text_poke and ftrace to use the rw
method for large batches. Heck, we can set a limit. If we are going to
update more than 100 locations, we switch the kernel to rw; otherwise we
do the FIXMAP update.
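That size-based choice could be sketched as follows (the threshold value comes from the suggestion above; the names and the enum are hypothetical, not any real kernel interface):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy: small batches patch each site through the
 * FIXMAP mapping; large batches flip the kernel text to rw once for
 * the whole run, which amortizes the mapping cost. */

#define BATCH_RW_THRESHOLD	100

enum patch_method {
	PATCH_FIXMAP,	/* per-site temporary writable mapping */
	PATCH_TEXT_RW,	/* remap kernel text rw for the batch */
};

enum patch_method choose_patch_method(size_t nr_updates)
{
	return nr_updates > BATCH_RW_THRESHOLD ? PATCH_TEXT_RW
					       : PATCH_FIXMAP;
}
```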

-- Steve



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints
  2012-06-01 11:37                               ` Steven Rostedt
@ 2012-06-01 12:52                                 ` Masami Hiramatsu
  0 siblings, 0 replies; 47+ messages in thread
From: Masami Hiramatsu @ 2012-06-01 12:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Mathieu Desnoyers, linux-kernel, Ingo Molnar,
	Andrew Morton, Frederic Weisbecker, H. Peter Anvin, Dave Jones,
	Andi Kleen, acme, yrl.pp-manager.tt

(2012/06/01 20:37), Steven Rostedt wrote:
> On Fri, 2012-06-01 at 13:53 +0900, Masami Hiramatsu wrote:
> 
>> Hmm, why don't we have two text_poke interfaces for single and
>> batch? As like dyn_ftrace, since modifying massive points takes
>> a lot time, so we may have additional kconfig something like
>> CONFIG_QUICK_BATCH_TEXT_POKE which switches text area to rw while
>> batch text_poke.
>>
> 
> I'll be working on patches to consolidate the two after I get everything
> else working :-)  I still need to work on the ftrace kprobe stuff too.
> 
> I hate having a config option to switch between the two, except for
> something that can be there if we find the new approach is buggy (like
> we have with lockdep). That may be a solution for this if we don't agree
> on one now. That is, bring back stop_machine() when LOCKDEP is enabled.
> But that should only be a temporary workaround, not a true fix.
> 
> I have no problem with having most of the modifying code be shared
> between text_poke and ftrace. I guess the question is, do we want to do
> it only with the FIXMAP or do we want text_poke and ftrace to use the rw
> method for large batches. Heck, we can set a limit. If we are going to
> update more than 100 locations, we switch the kernel to rw; otherwise we
> do the FIXMAP update.

That's reasonable for me :). You can feel free to change/remove
text_poke_smp_batch which is already old-style interface.

Thanks!

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2012-06-01 12:52 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-31  1:28 [PATCH 0/5] [GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt
2012-05-31  1:28 ` [PATCH 1/5] ftrace: Synchronize variable setting with breakpoints Steven Rostedt
2012-05-31 11:06   ` Peter Zijlstra
2012-05-31 14:08     ` Steven Rostedt
2012-05-31 15:16       ` Masami Hiramatsu
2012-05-31 15:29         ` Steven Rostedt
2012-05-31 17:28       ` Peter Zijlstra
2012-05-31 18:42         ` Steven Rostedt
2012-05-31 17:40       ` Peter Zijlstra
2012-05-31 17:44         ` Peter Zijlstra
2012-05-31 17:53         ` Steven Rostedt
2012-05-31 18:03           ` Peter Zijlstra
2012-05-31 18:50             ` Steven Rostedt
2012-05-31 19:06               ` Peter Zijlstra
2012-05-31 19:55                 ` Mathieu Desnoyers
2012-05-31 20:10                   ` Steven Rostedt
2012-05-31 20:26                     ` Peter Zijlstra
2012-05-31 20:37                       ` Steven Rostedt
2012-05-31 20:40                         ` Steven Rostedt
2012-05-31 20:40                         ` Peter Zijlstra
2012-05-31 20:49                           ` Steven Rostedt
2012-06-01  4:53                             ` Masami Hiramatsu
2012-06-01 11:37                               ` Steven Rostedt
2012-06-01 12:52                                 ` Masami Hiramatsu
2012-06-01  0:45                       ` Arnaldo Carvalho de Melo
2012-05-31  1:28 ` [PATCH 2/5] ftrace: Use breakpoint method to update ftrace caller Steven Rostedt
2012-05-31 11:18   ` Peter Zijlstra
2012-05-31 14:19     ` Steven Rostedt
2012-05-31  1:28 ` [PATCH 3/5] x86: Reset the debug_stack update counter Steven Rostedt
2012-05-31 19:19   ` H. Peter Anvin
2012-05-31 19:26     ` Steven Rostedt
2012-05-31  1:28 ` [PATCH 4/5] x86: Allow nesting of the debug stack IDT setting Steven Rostedt
2012-05-31 18:44   ` H. Peter Anvin
2012-05-31 18:58   ` H. Peter Anvin
2012-05-31 19:25     ` Steven Rostedt
2012-05-31 19:27       ` H. Peter Anvin
2012-05-31 20:00         ` Steven Rostedt
2012-05-31 20:17           ` H. Peter Anvin
2012-05-31 20:35             ` Steven Rostedt
2012-05-31 20:39               ` H. Peter Anvin
2012-05-31 20:56                 ` Steven Rostedt
2012-05-31 21:09                   ` H. Peter Anvin
2012-05-31 21:37                     ` Steven Rostedt
2012-05-31 21:38                       ` Steven Rostedt
2012-06-01  2:30       ` Steven Rostedt
2012-05-31  1:28 ` [PATCH 5/5] ftrace/x86: Do not change stacks in DEBUG when calling lockdep Steven Rostedt
2012-05-31  2:13 ` [PATCH 0/5] (for 3.5)[GIT PULL] ftrace: Fix bug with function tracing and lockdep Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).