* [git pull] changes for tip, and a nasty x86 page table bug
@ 2009-02-20  1:13 Steven Rostedt
  2009-02-20  1:13 ` [PATCH 1/6] x86: check PMD in spurious_fault handler Steven Rostedt
                   ` (6 more replies)
  0 siblings, 7 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  1:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin

Ingo,

This set of changes keeps the kernel text read-only when
CONFIG_DEBUG_RODATA is set, even when DYNAMIC_FTRACE is configured.
What ftrace does now is change the kernel text to writable before
modifying the mcount call sites, and change it back to read-only
when it is finished.
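
As a rough sketch of the flow this describes (illustration only;
patch_mcount_call_sites() is a hypothetical stand-in for the actual
stop_machine-driven update in the patches below):

	/* sketch only: set_kernel_text_rw/ro are introduced in patch 4 */
	static void update_mcount_sites(void)
	{
		set_kernel_text_rw();		/* make kernel text writable */
		patch_mcount_call_sites();	/* hypothetical name for the
						 * modification done under
						 * stop_machine */
		set_kernel_text_ro();		/* restore read-only text */
	}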

In doing this change, I stumbled upon a nasty bug in the page handling
of the x86 code, where we can fall into a state where the PTE
has the RW bit set, but the PMD does not. This will cause an
infinite loop of faults. The reason is that the fault handler
detects "spurious faults" when it hits a page fault but the permissions
appear correct. This test only checks the PTE level and not the
PMD level, so it returns just to fault again.

The first two patches deal with this bug directly; the rest are
ftrace related. I'm thinking ftrace may be the only user to trigger
this bug, but if it is not, then we might want to consider those
changes for 2.6.29. Otherwise, the changes that hit the bug are for
2.6.30, and you can wait on fixing this bug until then.

Please pull the latest tip/tracing/ftrace tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
tip/tracing/ftrace


Steven Rostedt (6):
      x86: check PMD in spurious_fault handler
      x86: keep pmd rw bit set when creating 4K level pages
      ftrace: allow archs to perform pre and post process for code modification
      ftrace, x86: make kernel text writable only for conversions
      ftrace: immediately stop code modification if failure is detected
      ftrace: break out modify loop immediately on detection of error

----
 arch/x86/include/asm/ftrace.h |   10 ++++++++++
 arch/x86/kernel/ftrace.c      |   24 ++++++++++++++++++++++++
 arch/x86/mm/fault.c           |   13 ++++++++++++-
 arch/x86/mm/init_32.c         |   27 ++++++++++++++++++++++++---
 arch/x86/mm/init_64.c         |   29 ++++++++++++++++++++++++-----
 arch/x86/mm/pageattr.c        |    4 +++-
 include/linux/ftrace.h        |    3 +++
 kernel/trace/ftrace.c         |   34 +++++++++++++++++++++++++++++++++-
 8 files changed, 133 insertions(+), 11 deletions(-)
-- 


* [PATCH 1/6] x86: check PMD in spurious_fault handler
  2009-02-20  1:13 [git pull] changes for tip, and a nasty x86 page table bug Steven Rostedt
@ 2009-02-20  1:13 ` Steven Rostedt
  2009-02-20  1:13 ` [PATCH 2/6] x86: keep pmd rw bit set when creating 4K level pages Steven Rostedt
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  1:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

[-- Attachment #1: 0001-x86-check-PMD-in-spurious_fault-handler.patch --]
[-- Type: text/plain, Size: 1435 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: fix to prevent hard lockup on bad PMD permissions

If the PMD does not have the correct permissions for a page access,
but the PTE does, the spurious fault handler will mistake the fault
for a lazy TLB transaction. This results in an infinite loop of:

 fault -> spurious_fault check (pass) -> return to code -> fault

This patch adds a check, and a WARN_ON, for the case where the PTE
passes the permission check but the PMD does not.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/mm/fault.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index c76ef1d..7b579a6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -455,6 +455,7 @@ static int spurious_fault(unsigned long address,
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	int ret;
 
 	/* Reserved-bit violation or user access to kernel space? */
 	if (error_code & (PF_USER | PF_RSVD))
@@ -482,7 +483,17 @@ static int spurious_fault(unsigned long address,
 	if (!pte_present(*pte))
 		return 0;
 
-	return spurious_fault_check(error_code, pte);
+	ret = spurious_fault_check(error_code, pte);
+	if (!ret)
+		return 0;
+
+	/*
+	 * Make sure we have permissions in the PMD.
+	 * If not, then there's a bug in the page tables.
+	 */
+	ret = spurious_fault_check(error_code, (pte_t *) pmd);
+	WARN_ON(!ret);
+	return ret;
 }
 
 /*
-- 
1.5.6.5

-- 


* [PATCH 2/6] x86: keep pmd rw bit set when creating 4K level pages
  2009-02-20  1:13 [git pull] changes for tip, and a nasty x86 page table bug Steven Rostedt
  2009-02-20  1:13 ` [PATCH 1/6] x86: check PMD in spurious_fault handler Steven Rostedt
@ 2009-02-20  1:13 ` Steven Rostedt
  2009-02-20  1:13 ` [PATCH 3/6] ftrace: allow archs to perform pre and post process for code modification Steven Rostedt
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  1:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

[-- Attachment #1: 0002-x86-keep-pmd-rw-bit-set-when-creating-4K-level-page.patch --]
[-- Type: text/plain, Size: 2259 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: fix to set_memory_rw

I was hitting a hard lockup when I set a page range to read-write
and then wrote to it. The lockup happened because the PTE was set
to RW but its PMD was not. This took a page fault, but the page
fault handler mistook it for a spurious fault caused by lazy TLB
transactions, because it only checked the permission bits of the
PTE, which were correct; the PMD's were not. The fault handler
returned, only to take the page fault again.

 fault -> PTE OK must be spurious -> return -> fault -> etc.

What caused this anomaly was this:

1) The kernel pages were set to read-only at the end of boot up.
2) Since the change could keep the large 2M pages, it just cleared
   the RW bit in the large-page entry for the 2M section.
3) The 2M section then needed to be split up so the NX bit could
   be set.
4) The split turned the original large-page entry into a PMD and
   moved the protection bits down to the smaller 4K PTEs. The PMD
   kept its RW bit off.
5) Setting the range of pages back to RW modified only the
   (already split up) PTEs, and not the PMD that contained them.

After that, we were in a state where the PTEs allowed the write but the
PMD did not.
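
For reference, the hardware semantics that make this state fatal, as a
sketch (not kernel code): with CR0.WP set, a kernel write is permitted
only if the RW bit is set at every level of the walk.

	/* illustration only: both levels must allow the write */
	static int write_allowed(pmd_t pmd, pte_t pte)
	{
		/* PMD.RW = 0, PTE.RW = 1  =>  the write still faults */
		return (pmd_val(pmd) & _PAGE_RW) && (pte_val(pte) & _PAGE_RW);
	}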

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/mm/pageattr.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 84ba748..79c700d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -513,11 +513,13 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 	 * On Intel the NX bit of all levels must be cleared to make a
 	 * page executable. See section 4.13.2 of Intel 64 and IA-32
 	 * Architectures Software Developer's Manual).
+	 * The same is true for RW. Let the PTE determine the
+	 * RW protection, and keep the PMD RW set.
 	 *
 	 * Mark the entry present. The current mapping might be
 	 * set to not present, which we preserved above.
 	 */
-	ref_prot = pte_pgprot(pte_mkexec(pte_clrhuge(*kpte)));
+	ref_prot = pte_pgprot(pte_mkwrite(pte_mkexec(pte_clrhuge(*kpte))));
 	pgprot_val(ref_prot) |= _PAGE_PRESENT;
 	__set_pmd_pte(kpte, address, mk_pte(base, ref_prot));
 	base = NULL;
-- 
1.5.6.5

-- 


* [PATCH 3/6] ftrace: allow archs to perform pre and post process for code modification
  2009-02-20  1:13 [git pull] changes for tip, and a nasty x86 page table bug Steven Rostedt
  2009-02-20  1:13 ` [PATCH 1/6] x86: check PMD in spurious_fault handler Steven Rostedt
  2009-02-20  1:13 ` [PATCH 2/6] x86: keep pmd rw bit set when creating 4K level pages Steven Rostedt
@ 2009-02-20  1:13 ` Steven Rostedt
  2009-02-20  1:13 ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Steven Rostedt
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  1:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

[-- Attachment #1: 0003-ftrace-allow-archs-to-preform-pre-and-post-process.patch --]
[-- Type: text/plain, Size: 2042 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

This patch creates two weak functions, ftrace_arch_modify_prepare
and ftrace_arch_modify_post_process, which are called before and
after stop_machine is used to modify the kernel text.

If an arch needs to do pre- or post-processing, it only needs to
define these functions.
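
A hypothetical example of how an arch might use this (sketch only, not
part of this series; x86's actual implementation follows in patch 4):

	/* e.g. an arch that must do extra work around the update */
	int ftrace_arch_modify_prepare(void)
	{
		/* arch-specific preparation before stop_machine runs;
		 * returning non-zero aborts the modification */
		return 0;
	}

	int ftrace_arch_modify_post_process(void)
	{
		/* arch-specific cleanup after the text was modified */
		return 0;
	}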

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ftrace.h |    3 +++
 kernel/trace/ftrace.c  |   28 ++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 9d224c4..644b9a9 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -106,6 +106,9 @@ struct ftrace_func_command {
 /* asm/ftrace.h must be defined for archs supporting dynamic ftrace */
 #include <asm/ftrace.h>
 
+int ftrace_arch_modify_prepare(void);
+int ftrace_arch_modify_post_process(void);
+
 struct seq_file;
 
 struct ftrace_probe_ops {
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 330a059..de3bd93 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -580,6 +580,24 @@ ftrace_code_disable(struct module *mod, struct dyn_ftrace *rec)
 	return 1;
 }
 
+/*
+ * archs can override this function if they must do something
+ * before the code modification is performed.
+ */
+int __weak ftrace_arch_modify_prepare(void)
+{
+	return 0;
+}
+
+/*
+ * archs can override this function if they must do something
+ * after the code modification is performed.
+ */
+int __weak ftrace_arch_modify_post_process(void)
+{
+	return 0;
+}
+
 static int __ftrace_modify_code(void *data)
 {
 	int *command = data;
@@ -602,7 +620,17 @@ static int __ftrace_modify_code(void *data)
 
 static void ftrace_run_update_code(int command)
 {
+	int ret;
+
+	ret = ftrace_arch_modify_prepare();
+	FTRACE_WARN_ON(ret);
+	if (ret)
+		return;
+
 	stop_machine(__ftrace_modify_code, &command, NULL);
+
+	ret = ftrace_arch_modify_post_process();
+	FTRACE_WARN_ON(ret);
 }
 
 static ftrace_func_t saved_ftrace_func;
-- 
1.5.6.5

-- 


* [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-20  1:13 [git pull] changes for tip, and a nasty x86 page table bug Steven Rostedt
                   ` (2 preceding siblings ...)
  2009-02-20  1:13 ` [PATCH 3/6] ftrace: allow archs to perform pre and post process for code modification Steven Rostedt
@ 2009-02-20  1:13 ` Steven Rostedt
  2009-02-20  1:32   ` Andrew Morton
  2009-02-22 17:50   ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Andi Kleen
  2009-02-20  1:13 ` [PATCH 5/6] ftrace: immediately stop code modification if failure is detected Steven Rostedt
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  1:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

[-- Attachment #1: 0004-ftrace-x86-make-kernel-text-writable-only-for-conv.patch --]
[-- Type: text/plain, Size: 5051 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: keep kernel text read only

Because dynamic ftrace converts the calls to mcount into and out of
nops at run time, we previously needed to keep the kernel text
writable at all times.

But this defeats the point of CONFIG_DEBUG_RODATA. This patch converts
the kernel text to writable before ftrace modifies it, and converts
it back to read-only afterward.

The modification is done via stop_machine, and no IPIs may be executed
at that time. The kernel text is set to read-write just before calling
stop_machine and set back to read-only right afterward.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/include/asm/ftrace.h |   10 ++++++++++
 arch/x86/kernel/ftrace.c      |   20 ++++++++++++++++++++
 arch/x86/mm/init_32.c         |   27 ++++++++++++++++++++++++---
 arch/x86/mm/init_64.c         |   29 ++++++++++++++++++++++++-----
 4 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index b55b4a7..5564cf3 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -80,4 +80,14 @@ extern void return_to_handler(void);
 #endif /* __ASSEMBLY__ */
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
+#ifndef __ASSEMBLY__
+#ifdef CONFIG_DEBUG_RODATA
+void set_kernel_text_rw(void);
+void set_kernel_text_ro(void);
+#else
+static inline void set_kernel_text_rw(void) { }
+static inline void set_kernel_text_ro(void) { }
+#endif
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ASM_X86_FTRACE_H */
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 2f9c0c8..05041b0 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -26,6 +26,26 @@
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 
+int ftrace_arch_modify_prepare(void)
+{
+	/* at boot up, we are still writable */
+	if (system_state != SYSTEM_RUNNING)
+		return 0;
+
+	set_kernel_text_rw();
+	return 0;
+}
+
+int ftrace_arch_modify_post_process(void)
+{
+	/* at boot up, we are still writable */
+	if (system_state != SYSTEM_RUNNING)
+		return 0;
+
+	set_kernel_text_ro();
+	return 0;
+}
+
 union ftrace_code_union {
 	char code[MCOUNT_INSN_SIZE];
 	struct {
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 2cef050..bcd7f00 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -1155,13 +1155,35 @@ static noinline int do_test_wp_bit(void)
 const int rodata_test_data = 0xC3;
 EXPORT_SYMBOL_GPL(rodata_test_data);
 
+/* used by ftrace */
+void set_kernel_text_rw(void)
+{
+	unsigned long start = PFN_ALIGN(_text);
+	unsigned long size = PFN_ALIGN(_etext) - start;
+
+	printk(KERN_INFO "Set kernel text: %lx - %lx for read write\n",
+	       start, start+size);
+
+	set_pages_rw(virt_to_page(start), size >> PAGE_SHIFT);
+}
+
+/* used by ftrace */
+void set_kernel_text_ro(void)
+{
+	unsigned long start = PFN_ALIGN(_text);
+	unsigned long size = PFN_ALIGN(_etext) - start;
+
+	printk(KERN_INFO "Set kernel text: %lx - %lx for read only\n",
+	       start, start+size);
+
+	set_pages_ro(virt_to_page(start), size >> PAGE_SHIFT);
+}
+
 void mark_rodata_ro(void)
 {
 	unsigned long start = PFN_ALIGN(_text);
 	unsigned long size = PFN_ALIGN(_etext) - start;
 
-#ifndef CONFIG_DYNAMIC_FTRACE
-	/* Dynamic tracing modifies the kernel text section */
 	set_pages_ro(virt_to_page(start), size >> PAGE_SHIFT);
 	printk(KERN_INFO "Write protecting the kernel text: %luk\n",
 		size >> 10);
@@ -1174,7 +1196,6 @@ void mark_rodata_ro(void)
 	printk(KERN_INFO "Testing CPA: write protecting again\n");
 	set_pages_ro(virt_to_page(start), size>>PAGE_SHIFT);
 #endif
-#endif /* CONFIG_DYNAMIC_FTRACE */
 
 	start += size;
 	size = (unsigned long)__end_rodata - start;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e6d36b4..8c1b5ee 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -986,17 +986,36 @@ void free_initmem(void)
 const int rodata_test_data = 0xC3;
 EXPORT_SYMBOL_GPL(rodata_test_data);
 
+/* used by ftrace */
+void set_kernel_text_rw(void)
+{
+	unsigned long start = PFN_ALIGN(_stext);
+	unsigned long end = PFN_ALIGN(__start_rodata);
+
+	printk(KERN_INFO "Set kernel text: %lx - %lx for read write\n",
+	       start, end);
+
+	set_memory_rw(start, (end - start) >> PAGE_SHIFT);
+}
+
+/* used by ftrace */
+void set_kernel_text_ro(void)
+{
+	unsigned long start = PFN_ALIGN(_stext);
+	unsigned long end = PFN_ALIGN(__start_rodata);
+
+	printk(KERN_INFO "Set kernel text: %lx - %lx for read only\n",
+	       start, end);
+
+	set_memory_ro(start, (end - start) >> PAGE_SHIFT);
+}
+
 void mark_rodata_ro(void)
 {
 	unsigned long start = PFN_ALIGN(_stext), end = PFN_ALIGN(__end_rodata);
 	unsigned long rodata_start =
 		((unsigned long)__start_rodata + PAGE_SIZE - 1) & PAGE_MASK;
 
-#ifdef CONFIG_DYNAMIC_FTRACE
-	/* Dynamic tracing modifies the kernel text section */
-	start = rodata_start;
-#endif
-
 	printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
 	       (end - start) >> 10);
 	set_memory_ro(start, (end - start) >> PAGE_SHIFT);
-- 
1.5.6.5

-- 


* [PATCH 5/6] ftrace: immediately stop code modification if failure is detected
  2009-02-20  1:13 [git pull] changes for tip, and a nasty x86 page table bug Steven Rostedt
                   ` (3 preceding siblings ...)
  2009-02-20  1:13 ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Steven Rostedt
@ 2009-02-20  1:13 ` Steven Rostedt
  2009-02-20  1:13 ` [PATCH 6/6] ftrace: break out modify loop immediately on detection of error Steven Rostedt
  2009-02-20  2:00 ` [git pull] changes for tip, and a nasty x86 page table bug Linus Torvalds
  6 siblings, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  1:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

[-- Attachment #1: 0005-ftrace-immediately-stop-code-modification-if-failur.patch --]
[-- Type: text/plain, Size: 1634 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: fix to prevent NMI lockup

If the page fault handler produces a WARN_ON while the text is being
modified, and the system is set up to take NMIs at a high frequency,
a failure to modify code can lock up the system.

The NMI-safe code modification lets any NMI perform the modification
itself if it is about to execute the code being changed. This prevents
a modifier on one CPU from modifying code running in NMI context on
another CPU. The modification is done through stop_machine, so only
NMIs must be considered.

But if the write causes the page fault handler to produce a warning,
the printout can slow the NMI down enough that another NMI arrives
as soon as it finishes, before control returns to process context.
The new NMI performs the write again, causing another printout, and
this hangs the box.

This patch turns off the writing as soon as a failure is detected,
instead of waiting for process context to turn it off. This keeps
NMIs from getting stuck in this back-and-forth of printouts.
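
For context, a simplified sketch of the NMI side of the protocol (the
real code is in arch/x86/kernel/ftrace.c; the in_nmi accounting and
memory barriers are elided):

	void ftrace_nmi_enter(void)
	{
		/* if a modification is in flight, the NMI performs the
		 * write itself so it never runs half-modified code */
		if (mod_code_write)
			ftrace_mod_code();
	}

With this patch, a failed probe_kernel_write() clears mod_code_write,
so subsequent NMIs skip the write instead of re-faulting and
re-printing forever.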

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/ftrace.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 05041b0..26b64a8 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -131,6 +131,10 @@ static void ftrace_mod_code(void)
 	 */
 	mod_code_status = probe_kernel_write(mod_code_ip, mod_code_newcode,
 					     MCOUNT_INSN_SIZE);
+
+	/* if we fail, then kill any new writers */
+	if (mod_code_status)
+		mod_code_write = 0;
 }
 
 void ftrace_nmi_enter(void)
-- 
1.5.6.5

-- 


* [PATCH 6/6] ftrace: break out modify loop immediately on detection of error
  2009-02-20  1:13 [git pull] changes for tip, and a nasty x86 page table bug Steven Rostedt
                   ` (4 preceding siblings ...)
  2009-02-20  1:13 ` [PATCH 5/6] ftrace: immediately stop code modification if failure is detected Steven Rostedt
@ 2009-02-20  1:13 ` Steven Rostedt
  2009-02-20  2:00 ` [git pull] changes for tip, and a nasty x86 page table bug Linus Torvalds
  6 siblings, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  1:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

[-- Attachment #1: 0006-ftrace-break-out-modify-loop-immediately-on-detecti.patch --]
[-- Type: text/plain, Size: 882 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: added precaution on failure detection

Break out of the modifying loop as soon as a failure is detected.
This is just an added precaution found by code review; it was not
prompted by any actual bug.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/ftrace.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index de3bd93..f9fe29d 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -557,10 +557,14 @@ static void ftrace_replace_code(int enable)
 			if ((system_state == SYSTEM_BOOTING) ||
 			    !core_kernel_text(rec->ip)) {
 				ftrace_free_rec(rec);
-			} else
+			} else {
 				ftrace_bug(failed, rec->ip);
+				goto out;
+			}
 		}
 	} while_for_each_ftrace_rec();
+ out:
+	return;
 }
 
 static int
-- 
1.5.6.5

-- 


* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-20  1:13 ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Steven Rostedt
@ 2009-02-20  1:32   ` Andrew Morton
  2009-02-20  1:44     ` Steven Rostedt
  2009-02-22 17:50   ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Andi Kleen
  1 sibling, 1 reply; 89+ messages in thread
From: Andrew Morton @ 2009-02-20  1:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, mingo, tglx, peterz, fweisbec, torvalds, arjan,
	rusty, mathieu.desnoyers, hpa, srostedt

On Thu, 19 Feb 2009 20:13:20 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> +int ftrace_arch_modify_prepare(void)
> +{
> +	/* at boot up, we are still writable */
> +	if (system_state != SYSTEM_RUNNING)
> +		return 0;
> +
> +	set_kernel_text_rw();
> +	return 0;
> +}
> +
> +int ftrace_arch_modify_post_process(void)
> +{
> +	/* at boot up, we are still writable */
> +	if (system_state != SYSTEM_RUNNING)
> +		return 0;
> +
> +	set_kernel_text_ro();
> +	return 0;
> +}

It would be prudent to avoid using system_state.  People can change the
point at which it transitions and can unexpectedly insert (or move)
code to sites where system_state has new values, etc.  It was a bad
idea.

It would be clearer and more robust to create your own little flag for
this purpose and set it to true and false at the places where that is
appropriate for this application.  It's just one more byte...


* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-20  1:32   ` Andrew Morton
@ 2009-02-20  1:44     ` Steven Rostedt
  2009-02-20  2:05       ` [PATCH][git pull] update to tip/tracing/ftrace Steven Rostedt
  0 siblings, 1 reply; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  1:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, mingo, tglx, peterz, fweisbec, torvalds, arjan,
	rusty, mathieu.desnoyers, hpa, srostedt


On Thu, 19 Feb 2009, Andrew Morton wrote:

> On Thu, 19 Feb 2009 20:13:20 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > +int ftrace_arch_modify_prepare(void)
> > +{
> > +	/* at boot up, we are still writable */
> > +	if (system_state != SYSTEM_RUNNING)
> > +		return 0;
> > +
> > +	set_kernel_text_rw();
> > +	return 0;
> > +}
> > +
> > +int ftrace_arch_modify_post_process(void)
> > +{
> > +	/* at boot up, we are still writable */
> > +	if (system_state != SYSTEM_RUNNING)
> > +		return 0;
> > +
> > +	set_kernel_text_ro();
> > +	return 0;
> > +}
> 
> It would be prudent to avoid using system_state.  People can change the
> point at which it transitions and can unexpectedly insert (or move)
> code to sites where system_state has new values, etc.  It was a bad
> idea.

Good to know.

> 
> It would be clearer and more robust to create your own little flag for
> this purpose and set it to true and false at the places where that is
> appropriate for this application.  It's just one more byte...

I just did not want to set it to read-only before the main text
conversion decided to do so. I could probably move those checks into
the set_kernel_text_* functions in the init_32/64.c files. We should
not convert until after DEBUG_RODATA has done its first conversion.
Yeah, that's a better place for it.

I'll write up another patch.

Thanks,

-- Steve



* Re: [git pull] changes for tip, and a nasty x86 page table bug
  2009-02-20  1:13 [git pull] changes for tip, and a nasty x86 page table bug Steven Rostedt
                   ` (5 preceding siblings ...)
  2009-02-20  1:13 ` [PATCH 6/6] ftrace: break out modify loop immediately on detection of error Steven Rostedt
@ 2009-02-20  2:00 ` Linus Torvalds
  2009-02-20  2:08   ` Steven Rostedt
  6 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2009-02-20  2:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin



On Thu, 19 Feb 2009, Steven Rostedt wrote:
> 
> In doing this change, I stumbled upon a nasty bug in the page handling
> of the x86 code, where we can fall into a state where the PTE
> has the RW bit set, but the PMD does not.

How do we ever have a PMD that is read-only? That sounds like a bug to 
begin with. There's no reason to ever do that.

In fact, it should trigger the pmd_bad() tests if it ever happens,
shouldn't it? We want all the _KERNPG_TABLE bits to always be set, no?
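
For reference, pmd_bad() on x86 in this era was roughly the following
(the exact form varied across versions), so a PMD whose flag bits are
not the full _KERNPG_TABLE set would indeed be flagged:

	static inline int pmd_bad(pmd_t pmd)
	{
		/* anything but a normal kernel page-table entry
		 * (modulo the user bit) is considered bad */
		return (pmd_val(pmd) & ~(PTE_PFN_MASK | _PAGE_USER))
			!= _KERNPG_TABLE;
	}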

		Linus


* [PATCH][git pull] update to tip/tracing/ftrace
  2009-02-20  1:44     ` Steven Rostedt
@ 2009-02-20  2:05       ` Steven Rostedt
  0 siblings, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  2:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, mingo, tglx, peterz, fweisbec, torvalds, arjan,
	rusty, mathieu.desnoyers, hpa, srostedt


Ingo,

Please pull the latest tip/tracing/ftrace tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
tip/tracing/ftrace


Steven Rostedt (1):
      ftrace, x86: do not depend on system state for kernel text info

----
 arch/x86/kernel/ftrace.c |    8 --------
 arch/x86/mm/init_32.c    |   10 ++++++++++
 arch/x86/mm/init_64.c    |   10 ++++++++++
 3 files changed, 20 insertions(+), 8 deletions(-)
---------------------------
commit 5d8ecb6568c598de6b6e678329e2ec0703a821f7
Author: Steven Rostedt <srostedt@redhat.com>
Date:   Thu Feb 19 20:51:45 2009 -0500

    ftrace, x86: do not depend on system state for kernel text info
    
    Andrew Morton pointed out that using system_state is a bad idea,
    since there is no guarantee of what its value will actually be.
    
    Instead, I moved the checks into the set_kernel_text_* functions
    themselves, and use a static flag to determine when it is
    OK to change the kernel text RW permissions.
    
    Reported-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Steven Rostedt <srostedt@redhat.com>

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 26b64a8..4f4e82c 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -28,20 +28,12 @@
 
 int ftrace_arch_modify_prepare(void)
 {
-	/* at boot up, we are still writable */
-	if (system_state != SYSTEM_RUNNING)
-		return 0;
-
 	set_kernel_text_rw();
 	return 0;
 }
 
 int ftrace_arch_modify_post_process(void)
 {
-	/* at boot up, we are still writable */
-	if (system_state != SYSTEM_RUNNING)
-		return 0;
-
 	set_kernel_text_ro();
 	return 0;
 }
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index bcd7f00..9ca4c57 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -1155,12 +1155,17 @@ static noinline int do_test_wp_bit(void)
 const int rodata_test_data = 0xC3;
 EXPORT_SYMBOL_GPL(rodata_test_data);
 
+static int kernel_set_to_readonly;
+
 /* used by ftrace */
 void set_kernel_text_rw(void)
 {
 	unsigned long start = PFN_ALIGN(_text);
 	unsigned long size = PFN_ALIGN(_etext) - start;
 
+	if (!kernel_set_to_readonly)
+		return;
+
 	printk(KERN_INFO "Set kernel text: %lx - %lx for read write\n",
 	       start, start+size);
 
@@ -1173,6 +1178,9 @@ void set_kernel_text_ro(void)
 	unsigned long start = PFN_ALIGN(_text);
 	unsigned long size = PFN_ALIGN(_etext) - start;
 
+	if (!kernel_set_to_readonly)
+		return;
+
 	printk(KERN_INFO "Set kernel text: %lx - %lx for read only\n",
 	       start, start+size);
 
@@ -1188,6 +1196,8 @@ void mark_rodata_ro(void)
 	printk(KERN_INFO "Write protecting the kernel text: %luk\n",
 		size >> 10);
 
+	kernel_set_to_readonly = 1;
+
 #ifdef CONFIG_CPA_DEBUG
 	printk(KERN_INFO "Testing CPA: Reverting %lx-%lx\n",
 		start, start+size);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8c1b5ee..c204433 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -986,12 +986,17 @@ void free_initmem(void)
 const int rodata_test_data = 0xC3;
 EXPORT_SYMBOL_GPL(rodata_test_data);
 
+static int kernel_set_to_readonly;
+
 /* used by ftrace */
 void set_kernel_text_rw(void)
 {
 	unsigned long start = PFN_ALIGN(_stext);
 	unsigned long end = PFN_ALIGN(__start_rodata);
 
+	if (!kernel_set_to_readonly)
+		return;
+
 	printk(KERN_INFO "Set kernel text: %lx - %lx for read write\n",
 	       start, end);
 
@@ -1004,6 +1009,9 @@ void set_kernel_text_ro(void)
 	unsigned long start = PFN_ALIGN(_stext);
 	unsigned long end = PFN_ALIGN(__start_rodata);
 
+	if (!kernel_set_to_readonly)
+		return;
+
 	printk(KERN_INFO "Set kernel text: %lx - %lx for read only\n",
 	       start, end);
 
@@ -1020,6 +1028,8 @@ void mark_rodata_ro(void)
 	       (end - start) >> 10);
 	set_memory_ro(start, (end - start) >> PAGE_SHIFT);
 
+	kernel_set_to_readonly = 1;
+
 	/*
 	 * The rodata section (but not the kernel text!) should also be
 	 * not-executable.




* Re: [git pull] changes for tip, and a nasty x86 page table bug
  2009-02-20  2:00 ` [git pull] changes for tip, and a nasty x86 page table bug Linus Torvalds
@ 2009-02-20  2:08   ` Steven Rostedt
  2009-02-20  3:44     ` Linus Torvalds
  0 siblings, 1 reply; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  2:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin


On Thu, 19 Feb 2009, Linus Torvalds wrote:

> 
> 
> On Thu, 19 Feb 2009, Steven Rostedt wrote:
> > 
> > In doing this change, I stumbled upon a nasty bug in the page handling
> > of the x86 code, where we can fall into a state where the PTE
> > has the RW bit set, but the PMD does not.
> 
> How do we ever have a PMD that is read-only? That sounds like a bug to 
> begin with. There's no reason to ever do that.

Patch 2/6 explains how this happened, and supplies the fix.

> 
> In fact, it should trigger the pmd_bad() tests if it ever happens,
> shouldn't it? We want all the _KERNPG_TABLE bits to always be set, no?

I'm not sure what would do a pmd_bad check.

-- Steve



* Re: [git pull] changes for tip, and a nasty x86 page table bug
  2009-02-20  2:08   ` Steven Rostedt
@ 2009-02-20  3:44     ` Linus Torvalds
  2009-02-20  4:00       ` Steven Rostedt
  2009-02-20  7:29       ` [PATCH] x86: use the right protections for split-up pagetables Ingo Molnar
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2009-02-20  3:44 UTC (permalink / raw)
  To: Steven Rostedt, Huang Ying, Thomas Gleixner, Ingo Molnar
  Cc: Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin



On Thu, 19 Feb 2009, Steven Rostedt wrote:
> > 
> > How do we ever have a PMD that is read-only? That sounds like a bug to 
> > begin with. There's no reason to ever do that.
> 
> Patch 2/6 explains how this happened, and supplies the fix.

I think your fix is for a real bug, but I think it's still bogus.

That whole "ref_prot" code is SH*T. When we do a set_pmd(), the old 
huge-page protections do not matter AT ALL for the new pmd. It matters for 
the new _leaf_ entries (the "ref_prot" 20 lines higher up), but not for 
the upper level. That should have all bits set.

So the whole

        ref_prot = pte_pgprot(pte_mkexec(pte_clrhuge(*kpte)));
        pgprot_val(ref_prot) |= _PAGE_PRESENT;
        __set_pmd_pte(kpte, address, mk_pte(base, ref_prot));

sequence is utter crap, I think. The whole "ref_prot" there should be just 
_pgprot(_KERNPG_TABLE), I think. I don't think there is any other valid 
value.

So I would argue that the comment above that piece of code is total and 
utter crap (all the protection info _and_ all the PAT bits are now in the 
pte, and trying to move them into the pmd is *buggy*), and the three lines 
should basically be

	__set_pmd_pte(kpte, address, mk_pte(base, _pgprot(_KERNPG_TABLE)));

but let's see if somebody can tell me why I'm wrong.

"git blame" attributes this all to Ying Huang and Thomas. And looking at 
the commit that introduced the pte_mkexec(), I really think the code was 
confused and people never thought about it deeply.

Comments?

		Linus


* Re: [git pull] changes for tip, and a nasty x86 page table bug
  2009-02-20  3:44     ` Linus Torvalds
@ 2009-02-20  4:00       ` Steven Rostedt
  2009-02-20  4:17         ` Linus Torvalds
  2009-02-20  7:29       ` [PATCH] x86: use the right protections for split-up pagetables Ingo Molnar
  1 sibling, 1 reply; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  4:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Huang Ying, Thomas Gleixner, Ingo Molnar,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin


On Thu, 19 Feb 2009, Linus Torvalds wrote:

> 
> 
> On Thu, 19 Feb 2009, Steven Rostedt wrote:
> > > 
> > > How do we ever have a PMD that is read-only? That sounds like a bug to 
> > > begin with. There's no reason to ever do that.
> > 
> > Patch 2/6 explains how this happened, and supplies the fix.
> 
> I think your fix is for a real bug, but I think it's still bogus.
> 
> That whole "ref_prot" code is SH*T. When we do a set_pmd(), the old 
> huge-page protections do not matter AT ALL for the new pmd. It matters for 
> the new _leaf_ entries (the "ref_prot" 20 lines higher up), but not for 
> the upper level. That should have all bits set.
> 
> So the whole
> 
>         ref_prot = pte_pgprot(pte_mkexec(pte_clrhuge(*kpte)));
>         pgprot_val(ref_prot) |= _PAGE_PRESENT;
>         __set_pmd_pte(kpte, address, mk_pte(base, ref_prot));
> 
> sequence is utter crap, I think. The whole "ref_prot" there should be just 
> _pgprot(_KERNPG_TABLE), I think. I don't think there is any other valid 
> value.
> 
> So I would argue that the comment above that piece of code is total and 
> utter crap (all the protection info _and_ all the PAT bits are now in the 
> pte, and trying to move them into the pmd is *buggy*), and the three lines 
> should basically be
> 
> 	__set_pmd_pte(kpte, address, mk_pte(base, _pgprot(_KERNPG_TABLE)));
> 
> but let's see if somebody can tell me why I'm wrong.
> 
> "git blame" attributes this all to Ying Huang and Thomas. And looking at 
> the commit that introduced the pte_mkexec(), I really think the code was 
> confused and people never thought about it deeply.
> 
> Comments?

My original change was to just set the PMD to the KERNPG_TABLE entry like 
you suggested, but I was thinking that there was a reason for the madness 
that I did not understand. So I just did the bare minimum to get my code 
working.

Is this something worthy of 29? I could whip up a patch against your 
latest tree.

-- Steve



* Re: [git pull] changes for tip, and a nasty x86 page table bug
  2009-02-20  4:00       ` Steven Rostedt
@ 2009-02-20  4:17         ` Linus Torvalds
  2009-02-20  4:34           ` Steven Rostedt
  2009-02-20  5:02           ` Huang Ying
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2009-02-20  4:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Huang Ying, Thomas Gleixner, Ingo Molnar,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin



On Thu, 19 Feb 2009, Steven Rostedt wrote:
> 
> Is this something worthy of 29? I could whip up a patch against your 
> latest tree.

I think it's a real issue, but I do have to admit that I don't see why it 
would only trigger for you. Is it just because the trace stuff ends up 
setting pages to RW, and you have to have had a lot of read-only stuff to 
get a whole read-only PMD to begin with?

So there's two things that make me nervous:

 - I do think the KERNPG_TABLE thing is the right thing, and I _think_ 
   that code is just confused, and we should just do KERNPG_TABLE rather 
   than play with confused bits one by one (PRESENT, RW, NX) to the point 
   of just making for more confusion.

   But I'd like some of the people involved with that code confirm that. 
   Either a "Yeah, we were just confused" or "No, there's this really 
   subtle thing going on, like this: ..."

 - The fact that apparently you're the first one to hit this. I realize 
   that you do odd things with ftrace. Was it the fact that you made the 
   "set_memory_ro()" area larger, and then more dynamically mark it back 
   to read-write that you hit it? Haven't we done things like that before?

But that said, I'd love to fix this for 2.6.29, especially if somebody 
can resolve the two worries above. I do _not_ want to take your patch that 
makes confused code even more confused, unless somebody really explains 
why a pure KERNPG_TABLE isn't right.

			Linus


* Re: [git pull] changes for tip, and a nasty x86 page table bug
  2009-02-20  4:17         ` Linus Torvalds
@ 2009-02-20  4:34           ` Steven Rostedt
  2009-02-20  5:02           ` Huang Ying
  1 sibling, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20  4:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Huang Ying, Thomas Gleixner, Ingo Molnar,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin


On Thu, 19 Feb 2009, Linus Torvalds wrote:
> 
> On Thu, 19 Feb 2009, Steven Rostedt wrote:
> > 
> > Is this something worthy of 29? I could whip up a patch against your 
> > latest tree.
> 
> I think it's a real issue, but I do have to admit that I don't see why it 
> would only trigegr for you. Is it just because the trace stuff ends up 
> setting pages to RW, and you have to have had a lot of read-only stuff to 
> get a whole read-only PMD to begin with?

The read-only PMD has been there since before ftrace. Setting 
CONFIG_DEBUG_RODATA causes the issue: after the 2M page is set to read 
only, the change that sets the NX bits for the data section creates the 
PMD with the read-write bit cleared.

The thing I do differently is that I need to modify the text section
after this has been set up. ftrace does a mass change upon user request,
so the simple thing was to make the pages read-write, modify them, then
set them back to read-only.

Other code (kprobes and such) uses text_poke to make its changes. That 
goes through the process of creating vmalloc areas that point to the 
code to be changed; the kernel proper page tables are not touched. 
So basically, it uses a back door to make the change, which avoids the 
bug because the PTEs protected by a read-only PMD never need to be 
converted into read-write pages.
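
For reference, a simplified sketch of that back door, modeled on
text_poke() of this era (locking and sync details elided): the write
goes through a temporary writable alias of the same physical pages,
so the kernel mapping itself stays read-only.

	static void poke_text(void *addr, const void *opcode, size_t len)
	{
		struct page *pages[2];
		char *vaddr;

		/* the target may straddle a page boundary */
		pages[0] = virt_to_page(addr);
		pages[1] = virt_to_page(addr + PAGE_SIZE);

		/* map a writable alias of the read-only kernel text */
		vaddr = vmap(pages, 2, VM_MAP, PAGE_KERNEL);
		if (!vaddr)
			return;
		memcpy(vaddr + ((unsigned long)addr & ~PAGE_MASK),
		       opcode, len);
		vunmap(vaddr);
	}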

I hit the bug by trying to write to the addresses protected by the
read only PMD.

> 
> So there's two things that make me nervous:
> 
>  - I do think the KERNPG_TABLE thing is the right thing, and I _think_ 
>    that code is just confused, and we should just do KERNPG_TABLE rather 
>    than play with confused bits one by one (PRESENT, RW, NX) to the point 
>    of just making for more confusion.

I agree with you here. I just did this change on my local tree and my code 
still works.

> 
>    But I'd like some of the people involved with that code confirm that. 
>    Either a "Yeah, we were just confused" or "No, there's this really 
>    subtle thing going on, liek this: ..."
> 
>  - The fact that apparently you're the first one to hit this. I realize 
>    that you do odd things with ftrace. Was it the fact that you made the 
>    "set_memory_ro()" area larger, and then more dynamically mark it back 
>    to read-write that you hit it? Haven't we done things like that before?

No, I was just the first one to try to convert these pages back to rw and
write to them.

> 
> But that said, I'd love to fix this for 2.6.29, especially if somebody 
> can resolve the two worries above. I do _not_ want to take your patch that 
> makes confused code even more confused, unless somebody really explains 
> why a pure KERNPG_TABLE isn't right.

OK, agreed. I'll wait on Thomas et al. for a response; now let me get to 
bed.

-- Steve



* Re: [git pull] changes for tip, and a nasty x86 page table bug
  2009-02-20  4:17         ` Linus Torvalds
  2009-02-20  4:34           ` Steven Rostedt
@ 2009-02-20  5:02           ` Huang Ying
  1 sibling, 0 replies; 89+ messages in thread
From: Huang Ying @ 2009-02-20  5:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Thomas Gleixner, Ingo Molnar,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 1628 bytes --]

Hi, Linus,

On Fri, 2009-02-20 at 12:17 +0800, Linus Torvalds wrote:
> 
> On Thu, 19 Feb 2009, Steven Rostedt wrote:
> > 
> > Is this something worthy of 29? I could whip up a patch against your 
> > latest tree.
> 
> I think it's a real issue, but I do have to admit that I don't see why it 
> would only trigegr for you. Is it just because the trace stuff ends up 
> setting pages to RW, and you have to have had a lot of read-only stuff to 
> get a whole read-only PMD to begin with?
> 
> So there's two things that make me nervous:
> 
>  - I do think the KERNPG_TABLE thing is the right thing, and I _think_ 
>    that code is just confused, and we should just do KERNPG_TABLE rather 
>    than play with confused bits one by one (PRESENT, RW, NX) to the point 
>    of just making for more confusion.
> 
>    But I'd like some of the people involved with that code confirm that. 
>    Either a "Yeah, we were just confused" or "No, there's this really 
>    subtle thing going on, liek this: ..."
> 
>  - The fact that apparently you're the first one to hit this. I realize 
>    that you do odd things with ftrace. Was it the fact that you made the 
>    "set_memory_ro()" area larger, and then more dynamically mark it back 
>    to read-write that you hit it? Haven't we done things like that before?

In fact, I was the first one to hit a similar bug. I do some odd things
with EFI that change the page tables to be executable. Unfortunately, I
fixed that bug in a confused way.

Yes, I think KERNPG_TABLE fixes all these types of bugs in a clearer
way.

Best Regards,
Huang Ying




* [PATCH] x86: use the right protections for split-up pagetables
  2009-02-20  3:44     ` Linus Torvalds
  2009-02-20  4:00       ` Steven Rostedt
@ 2009-02-20  7:29       ` Ingo Molnar
  2009-02-20  7:39         ` [PATCH, v2] " Ingo Molnar
                           ` (2 more replies)
  1 sibling, 3 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-02-20  7:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Huang Ying, Thomas Gleixner,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So the whole
> 
>         ref_prot = pte_pgprot(pte_mkexec(pte_clrhuge(*kpte)));
>         pgprot_val(ref_prot) |= _PAGE_PRESENT;
>         __set_pmd_pte(kpte, address, mk_pte(base, ref_prot));
> 
> sequence is utter crap, I think. The whole "ref_prot" there 
> should be just _pgprot(_KERNPG_TABLE), I think. I don't think 
> there is any other valid value.

Agreed, split_large_page() was just plain confused here - there 
was no hidden reason for this logic. It makes no sense to bring 
any pte level protection information to the PMD level because a 
pmd entry covers a set of 512 ptes so there's no singular 
protection attribute that can be carried to it.

The right solution is what you suggested: to use the most 
permissive protection bits for the pmd, i.e. _KERNPG_TABLE. 
Since the protection bits get combined, this makes the pte 
protections control the final behavior of the mapping - so 
subsequent code patching and similar activities will work fine.
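
A quick worked example of how the protections combine (illustration
only):

	PMD = _KERNPG_TABLE (RW), PTE = PAGE_KERNEL_RO  ->  read-only mapping
	PMD = _KERNPG_TABLE (RW), PTE = PAGE_KERNEL     ->  read-write mapping
	PMD = read-only,          PTE = PAGE_KERNEL     ->  writes fault (the bug)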

The bug was mostly harmless until Steve hacked his kernel to 
have the right (large) size of read-only text and data areas. I 
never hit such an ftrace hang even with allyesconfig bzImage 
bootups [which have obscenely large text and data sections], so I 
think something in Steve's tree was also needed to trigger it: 
an unusually large read-only data section.

I've queued up the fix below in tip:x86/urgent and will send a 
pull request later today if it passes testing. Steve, does this 
solve the bug you've hit?

With this fix I don't think the other bits from Steve's series 
(patches 1-4) are needed at all - those patches expose PMD details 
in various places that iterate over ptes - that's ugly and 
unnecessary as well if the PMD's protection is permissive.

[ Also, since you suggested the fix I've added your Acked-by, 
  let me know if you don't agree with any aspect of the fix. ]

	Ingo

---------------->
From f07eb4c47d5d4a70dc8eb8e2c158741cd6c69948 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 20 Feb 2009 08:04:13 +0100
Subject: [PATCH] x86: use the right protections for split-up pagetables

Steven Rostedt found a bug where, in his modified kernel, ftrace
was unable to modify the kernel text, due to the PMD itself having
been marked read-only as well in split_large_page().

The fix, suggested by Linus, is to not try to 'clone' the
reference protection of a huge-page, but to use the standard
(and permissive) page protection bits of KERNPG_TABLE.

The 'cloning' makes sense for the ptes but it's a confused and
incorrect concept at the page table level - because the pmd entry
covers a whole set of ptes and hence cannot 'clone' any single
protection attribute - the ptes can be any mixture of protections.

With the permissive KERNPG_TABLE, even if the pte protections
get changed after this point (due to ftrace doing code-patching
or other similar activities like kprobes), the resulting combined
protections will still be correct and the pte's restrictive
(or permissive) protections will control it.

Also update the comment.

This bug was there for a long time but has not caused visible
problems before as it needs a rather large read-only area to
trigger. Steve possibly hacked his kernel with some really
large arrays or so. Anyway, the bug is definitely worth fixing.

[ Huang Ying also experienced problems in this area when writing
  the EFI code, but the real bug in split_large_page() was not
  realized back then. ]

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Huang Ying <ying.huang@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/pageattr.c |   15 +++++----------
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 8ca0d85..17d5d1a 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -508,18 +508,13 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 #endif
 
 	/*
-	 * Install the new, split up pagetable. Important details here:
+	 * Install the new, split up pagetable.
 	 *
-	 * On Intel the NX bit of all levels must be cleared to make a
-	 * page executable. See section 4.13.2 of Intel 64 and IA-32
-	 * Architectures Software Developer's Manual).
-	 *
-	 * Mark the entry present. The current mapping might be
-	 * set to not present, which we preserved above.
+	 * We use the standard kernel pagetable protections for the new
+	 * pagetable protections, the actual ptes set above control the
+	 * primary protection behavior:
 	 */
-	ref_prot = pte_pgprot(pte_mkexec(pte_clrhuge(*kpte)));
-	pgprot_val(ref_prot) |= _PAGE_PRESENT;
-	__set_pmd_pte(kpte, address, mk_pte(base, ref_prot));
+	__set_pmd_pte(kpte, address, mk_pte(base, _pgprot(_KERNPG_TABLE)));
 	base = NULL;
 
 out_unlock:


* [PATCH, v2] x86: use the right protections for split-up pagetables
  2009-02-20  7:29       ` [PATCH] x86: use the right protections for split-up pagetables Ingo Molnar
@ 2009-02-20  7:39         ` Ingo Molnar
  2009-02-20  8:02           ` Ingo Molnar
  2009-02-20 13:57         ` [PATCH] " Steven Rostedt
  2009-02-20 15:40         ` Linus Torvalds
  2 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-02-20  7:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Huang Ying, Thomas Gleixner,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin


* Ingo Molnar <mingo@elte.hu> wrote:

> I've queued up the fix below in tip:x86/urgent and will send a 
> pull request later today if it passes testing. Steve, does 
> this solve the bug you've hit?

Updated one below - trivial build fix.

	Ingo

--------------------->
From 07a66d7c53a538e1a9759954a82bb6c07365eff9 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 20 Feb 2009 08:04:13 +0100
Subject: [PATCH] x86: use the right protections for split-up pagetables

Steven Rostedt found a bug where, in his modified kernel, ftrace
was unable to modify the kernel text, due to the PMD itself having
been marked read-only as well in split_large_page().

The fix, suggested by Linus, is to not try to 'clone' the
reference protection of a huge-page, but to use the standard
(and permissive) page protection bits of KERNPG_TABLE.

The 'cloning' makes sense for the ptes but it's a confused and
incorrect concept at the page table level - because the pmd entry
covers a whole set of ptes and hence cannot 'clone' any single
protection attribute - the ptes can be any mixture of protections.

With the permissive KERNPG_TABLE, even if the pte protections
get changed after this point (due to ftrace doing code-patching
or other similar activities like kprobes), the resulting combined
protections will still be correct and the pte's restrictive
(or permissive) protections will control it.

Also update the comment.

This bug was there for a long time but has not caused visible
problems before as it needs a rather large read-only area to
trigger. Steve possibly hacked his kernel with some really
large arrays or so. Anyway, the bug is definitely worth fixing.

[ Huang Ying also experienced problems in this area when writing
  the EFI code, but the real bug in split_large_page() was not
  realized back then. ]

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Huang Ying <ying.huang@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/pageattr.c |   15 +++++----------
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 8ca0d85..7be47d1 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -508,18 +508,13 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 #endif
 
 	/*
-	 * Install the new, split up pagetable. Important details here:
+	 * Install the new, split up pagetable.
 	 *
-	 * On Intel the NX bit of all levels must be cleared to make a
-	 * page executable. See section 4.13.2 of Intel 64 and IA-32
-	 * Architectures Software Developer's Manual).
-	 *
-	 * Mark the entry present. The current mapping might be
-	 * set to not present, which we preserved above.
+	 * We use the standard kernel pagetable protections for the new
+	 * pagetable protections, the actual ptes set above control the
+	 * primary protection behavior:
 	 */
-	ref_prot = pte_pgprot(pte_mkexec(pte_clrhuge(*kpte)));
-	pgprot_val(ref_prot) |= _PAGE_PRESENT;
-	__set_pmd_pte(kpte, address, mk_pte(base, ref_prot));
+	__set_pmd_pte(kpte, address, mk_pte(base, __pgprot(_KERNPG_TABLE)));
 	base = NULL;
 
 out_unlock:


* Re: [PATCH, v2] x86: use the right protections for split-up pagetables
  2009-02-20  7:39         ` [PATCH, v2] " Ingo Molnar
@ 2009-02-20  8:02           ` Ingo Molnar
  2009-02-20 10:24             ` Ingo Molnar
  0 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-02-20  8:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Huang Ying, Thomas Gleixner,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin


* Ingo Molnar <mingo@elte.hu> wrote:

> [ Huang Ying also experienced problems in this area when writing
>   the EFI code, but the real bug in split_large_page() was not
>   realized back then. ]

> -	 * On Intel the NX bit of all levels must be cleared to make a
> -	 * page executable. See section 4.13.2 of Intel 64 and IA-32
> -	 * Architectures Software Developer's Manual).

Hm, in hindsight, we should have noticed this bug sooner - when 
the NX comment above was added. There's never any good reason to 
play protection games with higher-level pagetable entries. We 
don't do it to user-space pagetables either - we just populate 
them with _PAGE_TABLE and that's it.

	Ingo


* Re: [PATCH, v2] x86: use the right protections for split-up pagetables
  2009-02-20  8:02           ` Ingo Molnar
@ 2009-02-20 10:24             ` Ingo Molnar
  0 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-02-20 10:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Huang Ying, Thomas Gleixner,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin


* Ingo Molnar <mingo@elte.hu> wrote:

> [...] There's never any good reason to play protection games 
> with higher-level pagetable entries. We dont do it to 
> user-space pagetables either - we just populate them to 
> _PAGE_TABLE and that's it.

Btw., this means that we could probably even use _PAGE_TABLE 
here (i.e. with the _PAGE_USER bit set), and rely on the PTE 
clearing the user bit ... but in this case that tiny bit of 
paranoia seems justified.

Btw., I also checked when this bug got introduced, and it got 
introduced 5 years ago (in May 2004) in 2.6.7-rc1, via this 
commit [historic-git sha1]:

 fb75a3d: [PATCH] x86-64 updates

 Date:   Fri May 14 20:40:53 2004 -0700

 [...]
     - Handle NX bit for code pages correctly in change_page_attr()
 [...]

-                       set_pte(kpte,mk_pte(split, PAGE_KERNEL));
+                       set_pte(kpte,mk_pte(split, ref_prot));

( That 'set_pte(kpte,...)' sequence is not a pte update but a 
  _pmd_ update, it is the ex-largepage pte, i.e. the pmd. )

So it's an ancient, dormant bug in the CPA code that nobody ever 
triggered, and we didn't notice when we rewrote that code either.

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH] x86: use the right protections for split-up pagetables
  2009-02-20  7:29       ` [PATCH] x86: use the right protections for split-up pagetables Ingo Molnar
  2009-02-20  7:39         ` [PATCH, v2] " Ingo Molnar
@ 2009-02-20 13:57         ` Steven Rostedt
  2009-02-20 15:40         ` Linus Torvalds
  2 siblings, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-20 13:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Huang Ying, Thomas Gleixner,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin


On Fri, 20 Feb 2009, Ingo Molnar wrote:

> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > So the whole
> > 
> >         ref_prot = pte_pgprot(pte_mkexec(pte_clrhuge(*kpte)));
> >         pgprot_val(ref_prot) |= _PAGE_PRESENT;
> >         __set_pmd_pte(kpte, address, mk_pte(base, ref_prot));
> > 
> > sequence is utter crap, I think. The whole "ref_prot" there 
> > should be just _pgprot(_KERNPG_TABLE), I think. I don't think 
> > there is any other valid value.
> 
> Agreed, split_large_page() was just plain confused here - there 
> was no hidden reason for this logic. It makes no sense to bring 
> any pte level protection information to the PMD level because a 
> pmd entry covers a set of 512 ptes so there's no singular 
> protection attribute that can be carried to it.
> 
> The right solution is what you suggested: to use the most 
> permissive protection bits for the pmd, i.e. _KERNPG_TABLE. 
> Since the protection bits get combined, this makes the pte 
> protections control the final behavior of the mapping - so 
> subsequent code patching and similar activities will work fine.
> 
> The bug was mostly harmless until Steve hacked his kernel to 
> have the right (large) size of readonly, text and data areas. I 
> never hit such an ftrace hang even with allyesconfig bzImage 
bootups [which have obscenely large text and data sections], so I 
> think something in Steve's tree was also needed to trigger it: 
> an unusually large readonly data section.
> 
> I've queued up the fix below in tip:x86/urgent and will send a 
> pull request later today if it passes testing. Steve, does this 
> solve the bug you've hit?

Yep, I've already tried this fix. It works fine.

-- Steve

> 
> With this fix I don't think the other bits from Steve's series 
> (patches 1-4) are needed at all - those patches expose PMD details 
> in various places that iterate over ptes - that's ugly and 
> unnecessary as well if the PMD's protection is permissive.
> 
> [ Also, since you suggested the fix I've added your Acked-by, 
>   let me know if you don't agree with any aspect of the fix. ]
> 
> 	Ingo
> 
> ---------------->
> >From f07eb4c47d5d4a70dc8eb8e2c158741cd6c69948 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 20 Feb 2009 08:04:13 +0100
> Subject: [PATCH] x86: use the right protections for split-up pagetables
> 
Steven Rostedt found a bug where, in his modified kernel,
> ftrace was unable to modify the kernel text, due to the PMD
> itself having been marked read-only as well in
> split_large_page().
> 
> The fix, suggested by Linus, is to not try to 'clone' the
> reference protection of a huge-page, but to use the standard
> (and permissive) page protection bits of KERNPG_TABLE.
> 
> The 'cloning' makes sense for the ptes but it's a confused and
> incorrect concept at the page table level - because the
pagetable entry covers a whole set of ptes and hence cannot
> 'clone' any single protection attribute - the ptes can be any
> mixture of protections.
> 
> With the permissive KERNPG_TABLE, even if the pte protections
> get changed after this point (due to ftrace doing code-patching
> or other similar activities like kprobes), the resulting combined
> protections will still be correct and the pte's restrictive
> (or permissive) protections will control it.
> 
> Also update the comment.
> 
> This bug was there for a long time but has not caused visible
> problems before as it needs a rather large read-only area to
> trigger. Steve possibly hacked his kernel with some really
> large arrays or so. Anyway, the bug is definitely worth fixing.
> 
> [ Huang Ying also experienced problems in this area when writing
>   the EFI code, but the real bug in split_large_page() was not
>   realized back then. ]
> 
> Reported-by: Steven Rostedt <rostedt@goodmis.org>
> Reported-by: Huang Ying <ying.huang@intel.com>
> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  arch/x86/mm/pageattr.c |   15 +++++----------
>  1 files changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index 8ca0d85..17d5d1a 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -508,18 +508,13 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  #endif
>  
>  	/*
> -	 * Install the new, split up pagetable. Important details here:
> +	 * Install the new, split up pagetable.
>  	 *
> -	 * On Intel the NX bit of all levels must be cleared to make a
> -	 * page executable. See section 4.13.2 of Intel 64 and IA-32
> -	 * Architectures Software Developer's Manual).
> -	 *
> -	 * Mark the entry present. The current mapping might be
> -	 * set to not present, which we preserved above.
> +	 * We use the standard kernel pagetable protections for the new
> +	 * pagetable protections, the actual ptes set above control the
> +	 * primary protection behavior:
>  	 */
> -	ref_prot = pte_pgprot(pte_mkexec(pte_clrhuge(*kpte)));
> -	pgprot_val(ref_prot) |= _PAGE_PRESENT;
> -	__set_pmd_pte(kpte, address, mk_pte(base, ref_prot));
> +	__set_pmd_pte(kpte, address, mk_pte(base, _pgprot(_KERNPG_TABLE)));
>  	base = NULL;
>  
>  out_unlock:
> 
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH] x86: use the right protections for split-up pagetables
  2009-02-20  7:29       ` [PATCH] x86: use the right protections for split-up pagetables Ingo Molnar
  2009-02-20  7:39         ` [PATCH, v2] " Ingo Molnar
  2009-02-20 13:57         ` [PATCH] " Steven Rostedt
@ 2009-02-20 15:40         ` Linus Torvalds
  2009-02-20 16:59           ` Ingo Molnar
  2009-02-20 18:33           ` H. Peter Anvin
  2 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2009-02-20 15:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Huang Ying, Thomas Gleixner,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin



On Fri, 20 Feb 2009, Ingo Molnar wrote:
> 
> Agreed, split_large_page() was just plain confused here - there 
> was no hidden reason for this logic. It makes no sense to bring 
> any pte level protection information to the PMD level because a 
> pmd entry covers a set of 512 ptes so there's no singular 
> protection attribute that can be carried to it.

Btw, I think split_large_page() is confused in another way too, although 
I'm not entirely sure that it matters. I suspect that it doesn't, if I 
read things correctly.

The confusion? When it moves the 'ref_prot' bits from the upper level, it 
doesn't do the right thing for the PAT bit. That bit is special, and moves 
around depending on level. In the upper levels, it's bit#12, and in the 
final 4k pte level it's bit#7.

So _if_ the PAT bit ever matters, it looks like split_large_page() does 
the wrong thing.
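
If the split ever had to carry ref_prot down with the PAT bit set, it 
would need a bit move along these lines (a sketch; the two defines 
mirror the kernel's _PAGE_PAT/_PAGE_PAT_LARGE values):

        #define PAT_4K    (1UL << 7)    /* PAT bit in a 4K pte      */
        #define PAT_LARGE (1UL << 12)   /* PAT bit in a 2M/1G entry */

        /* What split_large_page() would have to do if the large-page
         * PAT bit were ever set: move it from bit 12 down to bit 7. */
        static unsigned long large_prot_to_4k(unsigned long prot)
        {
                if (prot & PAT_LARGE) {
                        prot &= ~PAT_LARGE;
                        prot |= PAT_4K;
                }
                return prot;
        }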

Now, it looks like we avoid the PAT bit on purpose, and we only ever 
encode four PAT values (ie we use only the PCD/PWT bits, and leave the PAT 
bit clear - we don't need any more cases), _but_ we actually do end up 
looking at the PAT bit anyway in cache_attr(). So it looks like at least 
some of the code is _trying_ to handle the PAT bit, but I can pretty much 
guarantee that at least split_large_page() is broken if it is ever set.

			Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH] x86: use the right protections for split-up pagetables
  2009-02-20 15:40         ` Linus Torvalds
@ 2009-02-20 16:59           ` Ingo Molnar
  2009-02-20 18:33           ` H. Peter Anvin
  1 sibling, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-02-20 16:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Huang Ying, Thomas Gleixner,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 20 Feb 2009, Ingo Molnar wrote:
> > 
> > Agreed, split_large_page() was just plain confused here - there 
> > was no hidden reason for this logic. It makes no sense to bring 
> > any pte level protection information to the PMD level because a 
> > pmd entry covers a set of 512 ptes so there's no singular 
> > protection attribute that can be carried to it.
> 
> Btw, I think split_large_page() is confused in another way 
> too, although I'm not entirely sure that it matters. I suspect 
> that it doesn't, if I read things correctly.
> 
> The confusion? When it moves the 'ref_prot' bits from the 
> upper level, it doesn't do the right thing for the PAT bit. 
> That bit is special, and moves around depending on level. In 
> the upper levels, it's bit#12, and in the final 4k pte level 
> it's bit#7.
> 
> So _if_ the PAT bit ever matters, it looks like 
> split_large_page() does the wrong thing.
> 
> Now, it looks like we avoid the PAT bit on purpose, and we 
> only ever encode four PAT values (ie we use only the PCD/PWT 
> bits, and leave the PAT bit clear - we don't need any more 
> cases), _but_ we actually do end up looking at the PAT bit 
> anyway in cache_attr(). So it looks like at least some of the 
> code is _trying_ to handle the PAT bit, but I can pretty much 
> guarantee that at least split_large_page() is broken if it is 
> ever set.

Yeah. This is our current PAT encodings table:

        /*
         * PTE encoding used in Linux:
         *      PAT
         *      |PCD
         *      ||PWT
         *      |||
         *      000 WB          _PAGE_CACHE_WB
         *      001 WC          _PAGE_CACHE_WC
         *      010 UC-         _PAGE_CACHE_UC_MINUS
         *      011 UC          _PAGE_CACHE_UC
         * PAT bit unused
         */
        pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
              PAT(4, WB) | PAT(5, WC) | PAT(6, UC_MINUS) | PAT(7, UC);

it's intentionally left compressed and the extended PAT bit is 
never set - we only need 4 caching types.

( Speculation: in theory it would be possible for some CPU's
  TLB-fill fastpath to have some small penalty on having a 
  non-zero extended-PAT bit. So eliminating weird bits and 
  compressing pte bit usage is always a good idea. )

Nevertheless you are right that there's a disconnect here and 
that were it ever set we'd unconditionally lift the 2MB/1GB PAT 
bit [bit 12] over into the 4K level.

If we ever set the PAT bit on a large page then the 
split_large_page() behavior would become rather nasty: we'd 
corrupt pte bit 12, i.e. we'd lose linear mappings, we'd map 
every odd page frame twice (and never map the even ones), and 
we'd start corrupting memory and would crash in interesting ways.
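
To see the mechanism: in a 4K pte the physical frame number starts at 
bit 12, so a stale large-page PAT bit lands in the address field (a 
sketch of the arithmetic, not kernel code):

        /* Simplified 4K pte: physical address lives in bits 12 and up. */
        static unsigned long mk_pteval(unsigned long pfn, unsigned long prot)
        {
                return (pfn << 12) | prot;
        }

        /*
         * If prot still carries bit 12 (_PAGE_PAT_LARGE), the OR above
         * forces the low pfn bit to 1: pfns 2n and 2n+1 both end up
         * mapping frame 2n+1 - odd frames aliased, even frames lost.
         */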

There are two solutions:

 - make the ref_prot opt-in and explicitly enumerate all the
   bits we handle correctly today

 - add a debug warning for the bits we know we don't handle

I went for the second as the first one would include basically 
all the meaningful bits we have:

   [_PAGE_BIT_PRESENT]
    _PAGE_BIT_RW
    _PAGE_BIT_USER
    _PAGE_BIT_PWT
    _PAGE_BIT_PCD 
   [_PAGE_BIT_ACCESSED]
   [_PAGE_BIT_DIRTY]
    _PAGE_BIT_GLOBAL 
    _PAGE_BIT_NX

( the ones in brackets are not important because we set/clear 
  them anyway, but they don't hurt either. )

And if we did not include PAT and it got used in the future, the 
function could break too - just in a different way (by not 
carrying over the PAT bit).

So I think it's safest to put in a sharp debug check for the 
known-unhandled bit. I've queued up the fix below in tip:x86/mm - 
do you think this approach is best?
 
	Ingo

-------------------->
>From 7a5714e0186030676d79a7b4b9830c8e45c3b0a1 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 20 Feb 2009 17:44:21 +0100
Subject: [PATCH] x86, pat: add large-PAT check to split_large_page()

Impact: future-proof the split_large_page() function

Linus noticed that split_large_page() is not safe wrt. the
PAT bit: it is bit 12 on the 1GB and 2MB page table level
(_PAGE_BIT_PAT_LARGE), and it is bit 7 on the 4K page
table level (_PAGE_BIT_PAT).

Currently it is not a problem because we never set
_PAGE_BIT_PAT_LARGE on any of the large-page mappings - but
should this happen in the future, split_large_page() would
silently lift bit 12 into the low-level 4K pte and start
corrupting the physical page frame offset. Not fun.

So add a debug warning, to make sure if something ever sets
the PAT bit then this function gets updated too.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/mm/pageattr.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 7be47d1..8253bc9 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -482,6 +482,13 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 	pbase = (pte_t *)page_address(base);
 	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
 	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
+	/*
+	 * If we ever want to utilize the PAT bit, we need to
+	 * update this function to make sure it's converted from
+	 * bit 12 to bit 7 when we cross from the 2MB level to
+	 * the 4K level:
+	 */
+	WARN_ON_ONCE(pgprot_val(ref_prot) & _PAGE_PAT_LARGE);
 
 #ifdef CONFIG_X86_64
 	if (level == PG_LEVEL_1G) {

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: [PATCH] x86: use the right protections for split-up pagetables
  2009-02-20 15:40         ` Linus Torvalds
  2009-02-20 16:59           ` Ingo Molnar
@ 2009-02-20 18:33           ` H. Peter Anvin
  1 sibling, 0 replies; 89+ messages in thread
From: H. Peter Anvin @ 2009-02-20 18:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Steven Rostedt, Huang Ying, Thomas Gleixner,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers

Linus Torvalds wrote:
> 
> The confusion? When it moves the 'ref_prot' bits from the upper level, it 
> doesn't do the right thing for the PAT bit. That bit is special, and moves 
> around depending on level. In the upper levels, it's bit#12, and in the 
> final 4k pte level it's bit#7.
> 

... and in the second level of two-level page tables, it doesn't exist
at all.

Worse, there are errata on some processors (not sure if there are any we
currently don't blacklist) where the PATx bit logic basically gets fed
random data.  Setting up the PAT so that the lower and upper halves alias
works around this.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-20  1:13 ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Steven Rostedt
  2009-02-20  1:32   ` Andrew Morton
@ 2009-02-22 17:50   ` Andi Kleen
  2009-02-22 22:53     ` Steven Rostedt
  2009-02-27 21:08     ` Pavel Machek
  1 sibling, 2 replies; 89+ messages in thread
From: Andi Kleen @ 2009-02-22 17:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Arjan van de Ven, Rusty Russell, Mathieu Desnoyers,
	H. Peter Anvin, Steven Rostedt

Steven Rostedt <rostedt@goodmis.org> writes:

> From: Steven Rostedt <srostedt@redhat.com>
>
> Impact: keep kernel text read only
>
> Because dynamic ftrace converts the calls to mcount into and out of
> nops at run time, we needed to always keep the kernel text writable.
>
> But this defeats the point of CONFIG_DEBUG_RODATA. This patch converts
> the kernel code to writable before ftrace modifies the text, and converts
> it back to read only afterward.
>
> The conversion is done via stop_machine and no IPIs may be executed
> at that time. The kernel text is set to write just before calling
> stop_machine and set to read only again right afterward.

The very old text poke code I had for this just used a dynamic
mapping elsewhere to modify the code instead. That's much less
intrusive than changing the complete mappings. Any reason you can't use 
that too?

-Andi

>
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> ---
>  arch/x86/include/asm/ftrace.h |   10 ++++++++++
>  arch/x86/kernel/ftrace.c      |   20 ++++++++++++++++++++
>  arch/x86/mm/init_32.c         |   27 ++++++++++++++++++++++++---
>  arch/x86/mm/init_64.c         |   29 ++++++++++++++++++++++++-----
>  4 files changed, 78 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index b55b4a7..5564cf3 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -80,4 +80,14 @@ extern void return_to_handler(void);
>  #endif /* __ASSEMBLY__ */
>  #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
>  
> +#ifndef __ASSEMBLY__
> +#ifdef CONFIG_DEBUG_RODATA
> +void set_kernel_text_rw(void);
> +void set_kernel_text_ro(void);
> +#else
> +static inline void set_kernel_text_rw(void) { }
> +static inline void set_kernel_text_ro(void) { }
> +#endif
> +#endif /* __ASSEMBLY__ */
> +

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-22 17:50   ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Andi Kleen
@ 2009-02-22 22:53     ` Steven Rostedt
  2009-02-23  0:29       ` Andi Kleen
  2009-02-23  2:33       ` Mathieu Desnoyers
  2009-02-27 21:08     ` Pavel Machek
  1 sibling, 2 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-22 22:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Arjan van de Ven, Rusty Russell, Mathieu Desnoyers,
	H. Peter Anvin, Steven Rostedt




On Sun, 22 Feb 2009, Andi Kleen wrote:

> Steven Rostedt <rostedt@goodmis.org> writes:
> 
> > From: Steven Rostedt <srostedt@redhat.com>
> >
> > Impact: keep kernel text read only
> >
> > Because dynamic ftrace converts the calls to mcount into and out of
> > nops at run time, we needed to always keep the kernel text writable.
> >
> > But this defeats the point of CONFIG_DEBUG_RODATA. This patch converts
> > the kernel code to writable before ftrace modifies the text, and converts
> > it back to read only afterward.
> >
> > The conversion is done via stop_machine and no IPIs may be executed
> > at that time. The kernel text is set to write just before calling
> > stop_machine and set to read only again right afterward.
> 
> The very old text poke code I had for this just used a dynamic
> mapping elsewhere instead to modify the code. That's much less
> intrusive than changing the complete mappings. Any reason you can't use 
> that too?
> 

We are changing over 19000 locations in the kernel. This touches almost 
all kernel text pages anyway. You want to map a page in and out for over 
19000 locations?

-- Steve


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-22 22:53     ` Steven Rostedt
@ 2009-02-23  0:29       ` Andi Kleen
  2009-02-23  2:33       ` Mathieu Desnoyers
  1 sibling, 0 replies; 89+ messages in thread
From: Andi Kleen @ 2009-02-23  0:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

> We are changing over 19000 locations in the kernel. This touches almost 
> all kernel text pages anyway. You want to map a page in and out for over 
> 19000 locations?

Well, only kernel text size / PAGE_SIZE times, if you sort the locations
first and keep a last-hit cache. Or, if you want to overoptimize, you can
also use 2MB pages when available. Also, it can be done much more cheaply
than a full flush because it doesn't need to be global over all CPUs
(assuming you disable preemption, which you probably do anyway).
And it can use INVLPG on x86 (or similar directed flushes),
which is much, much cheaper than blowing everything away.

I'm not sure which one would be faster, but I suspect the difference
will not be very large. And not changing the kernel mappings has the
advantage that there is no window where the text is unprotected.
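
Something along these lines (a sketch of the scheme; the scratch-mapping
helpers are hypothetical, not an existing kernel API):

        /* Patch sorted call sites through one temporary RW mapping per
         * text page, remapping only when we cross a page boundary: */
        static void patch_sorted_sites(unsigned long *site, int n)
        {
                unsigned long cached_page = 0;
                char *map = NULL;
                int i;

                for (i = 0; i < n; i++) {
                        unsigned long page = site[i] & PAGE_MASK;

                        if (page != cached_page) {      /* last-hit cache */
                                if (map)
                                        unmap_scratch_rw(map); /* local INVLPG */
                                map = map_scratch_rw(page);
                                cached_page = page;
                        }
                        patch_one_site(map + (site[i] & ~PAGE_MASK));
                }
                if (map)
                        unmap_scratch_rw(map);
        }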

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-22 22:53     ` Steven Rostedt
  2009-02-23  0:29       ` Andi Kleen
@ 2009-02-23  2:33       ` Mathieu Desnoyers
  2009-02-23  4:29         ` Steven Rostedt
  2009-02-23  9:02         ` Ingo Molnar
  1 sibling, 2 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-02-23  2:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> 
> 
> On Sun, 22 Feb 2009, Andi Kleen wrote:
> 
> > Steven Rostedt <rostedt@goodmis.org> writes:
> > 
> > > From: Steven Rostedt <srostedt@redhat.com>
> > >
> > > Impact: keep kernel text read only
> > >
> > > Because dynamic ftrace converts the calls to mcount into and out of
> > > nops at run time, we needed to always keep the kernel text writable.
> > >
> > > But this defeats the point of CONFIG_DEBUG_RODATA. This patch converts
> > > the kernel code to writable before ftrace modifies the text, and converts
> > > it back to read only afterward.
> > >
> > > The conversion is done via stop_machine and no IPIs may be executed
> > > at that time. The kernel text is set to write just before calling
> > > stop_machine and set to read only again right afterward.
> > 
> > The very old text poke code I had for this just used a dynamic
> > mapping elsewhere instead to modify the code. That's much less
> > intrusive than changing the complete mappings. Any reason you can't use 
> > that too?
> > 
> 
> We are changing over 19000 locations in the kernel. This touches almost 
> all kernel text pages anyway. You want to map a page in and out for over 
> 19000 locations?
> 
> -- Steve
> 

Hi Steve,

Can you provide numbers to indicate why it's required to be so intrusive
in the kernel mappings while doing these modifications? I think opening
such a time window, where the standard code mapping is writeable globally
in CONFIG_DEBUG_RODATA kernels, could open the door to unexpected
side-effects, so ideally going through the "backdoor" page mapped by
text_poke seems safer. Given similar performance, I would tend to use a
text_poke-like approach.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23  2:33       ` Mathieu Desnoyers
@ 2009-02-23  4:29         ` Steven Rostedt
  2009-02-23  4:53           ` Mathieu Desnoyers
  2009-02-23  9:02         ` Ingo Molnar
  1 sibling, 1 reply; 89+ messages in thread
From: Steven Rostedt @ 2009-02-23  4:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt


On Sun, 22 Feb 2009, Mathieu Desnoyers wrote:
> > 
> > We are changing over 19000 locations in the kernel. This touches almost 
> > all kernel text pages anyway. You want to map a page in and out for over 
> > 19000 locations?
> > 
> > -- Steve
> > 
> 
> Hi Steve,
> 
> Can you provide numbers to indicate why it's required to be so intrusive
> in the kernel mappings while doing these modifications ? I think opening
> such time window where standard code mapping is writeable globally in
> config RO_DATA kernels could open the door to unexpected side-effects,
> so ideally going through the "backdoor" page mapped by text_poke seems
> safer. Given similar performance, I would tend to use a text_poke-like
> approach.
> 

Not sure which numbers you are asking for. The dynamic function tracer 
modifies all mcount calls, which are done by practically every function in 
the kernel. With a normal Fedora kernel (and all its loaded modules), 
that's between 15 and 20 thousand functions, depending on what modules are 
loaded.

At boot up we convert them all to nops, but when we enable the function 
tracer, we convert them back to calls to the function tracer. This is done 
by a privileged user, since the function tracer can add quite a bit of 
overhead when activated.

I do not really see how changing this for the short period of time is any 
different than making another mapping point to the kernel code. If you 
could find a way to break this security, you should be able to break it 
with another mapping as well.

Also note that this dynamic tracing code works not only for x86; it also 
works for PPC, ARM, SuperH and ia64. To use text_poke, that would require 
text_poke to be ported to all of these.

How does text_poke solve the issue of executing code on another CPU that 
is changing?

-- Steve


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23  4:29         ` Steven Rostedt
@ 2009-02-23  4:53           ` Mathieu Desnoyers
  2009-02-23 14:48             ` Steven Rostedt
  0 siblings, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-02-23  4:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Sun, 22 Feb 2009, Mathieu Desnoyers wrote:
> > > 
> > > We are changing over 19000 locations in the kernel. This touches almost 
> > > all kernel text pages anyway. You want to map a page in and out for over 
> > > 19000 locations?
> > > 
> > > -- Steve
> > > 
> > 
> > Hi Steve,
> > 
> > Can you provide numbers to indicate why it's required to be so intrusive
> > in the kernel mappings while doing these modifications ? I think opening
> > such time window where standard code mapping is writeable globally in
> > config RO_DATA kernels could open the door to unexpected side-effects,
> > so ideally going through the "backdoor" page mapped by text_poke seems
> > safer. Given similar performance, I would tend to use a text_poke-like
> > approach.
> > 
> 
> Not sure which numbers you are asking for. The dynamic function tracer 
> modifies all mcount calls, which are done by practically every function in 
> the kernel. With a normal Fedora kernel (and all its loaded modules), 
> that's between 15 to 20 thousand functions, depending on what modules are 
> loaded.
> 

I mean comparing the cost of changing the kernel mappings and doing the
edits (as you do) vs doing it through a text-poke-like mapping.

> At boot up we convert them all to nops, but when we enable the function 
> tracer, we convert them back to calls to the function tracer. This is done 
> by a privileged user, since the function tracer can add quite a bit of 
> overhead when activated.
> 
> I do not really see how changing this for the short period of time is any 
> different than making another mapping point to the kernel code. If you 
> could find a way to break this security, you should be able to break it 
> with another mapping as well.

It's not only about breaking the security. It's mostly to ensure
internal kernel consistency. While you are changing these mappings, you
could possibly have a window where other kernel code is running (irqs,
other CPUs' threads). That code could itself be buggy and use the
writeable window to overwrite some kernel code or RO data. This
side-effect would go undetected while users think the RO data *is* RO,
when in fact it isn't. You also bring a good point about security: if
someone ever relies on CONFIG_DEBUG_RODATA for security reasons, then we
give a big window where kernel text and RO data are writeable at *known*
addresses, while we can randomize the address used for text_poke.

> 
> Also note that this dynamic tracing code works for not only x86, it also 
> works for PPC, ARM, superH and ia64. To use text_poke, that would require 
> all of these to have text_poke ported.
> 

Do these architectures have the DEBUG_RODATA config? If not, then a simple
memcpy is OK.

> How does text_poke solve the issue of executing code on another CPU that 
> is changing?
> 

text_poke itself does not provide that. This must be ensured by the
user on a case-by-case basis. For instance, kprobes is changing code
atomically _and_ just inserting/removing breakpoints. Doing this is fine
with cross-CPU code modification (XMC). Alternative code is only
changed when the CPU is UP, so it's also ok. However, changing
multi-byte instructions without first changing them to a trap-generating
instruction when the CPUs are up (SMP) falls into the errata, and the
code that uses text_poke must carefully perform this modification, e.g.
by doing what I do in my immediate values patchset in the lttng git
tree (using a temporary breakpoint while doing the modification). This
would apply directly to the function tracer, and you could get rid of
this ugly latency-inducing stop_machine() call.
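
Roughly, the breakpoint-based sequence looks like this (a sketch of the
technique only; the helpers are placeholders, not an existing API):

        /* Patch a multi-byte call site on a live SMP system without
         * stop_machine(), using a temporary int3 breakpoint: */
        static void bp_patch_site(unsigned char *addr,
                                  const unsigned char *newcode, int len)
        {
                poke_byte(addr, 0xcc);        /* 1. arm int3; the handler
                                                 resumes past the site   */
                sync_core_all_cpus();         /* 2. serialize every CPU  */
                poke_range(addr + 1, newcode + 1, len - 1);
                sync_core_all_cpus();         /* 3. tail is now in place */
                poke_byte(addr, newcode[0]);  /* 4. restore first byte   */
                sync_core_all_cpus();
        }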

Mathieu

> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23  2:33       ` Mathieu Desnoyers
  2009-02-23  4:29         ` Steven Rostedt
@ 2009-02-23  9:02         ` Ingo Molnar
  1 sibling, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-02-23  9:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Andi Kleen, linux-kernel, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Can you provide numbers to indicate why it's required to be so 
> intrusive in the kernel mappings while doing these 
> modifications ? I think opening such time window where 
> standard code mapping is writeable globally in config RO_DATA 
> kernels could open the door to unexpected side-effects, so 
> ideally going through the "backdoor" page mapped by text_poke 
> seems safer. Given similar performance, I would tend to use a 
> text_poke-like approach.

It's not really an issue - this code is only called during 
normal operation if the admin does it.

As far as scare mongering goes, a "backdoor" page is in fact more 
attackable because it's at a more predictable position, and due 
to text-poke's slowness the window of vulnerability is longer.

Anyway, this is all pretty theoretical and irrelevant. The 
purpose of RODATA is mainly to protect against benign/unintended 
sources of kernel text corruption. An attacker who can modify an 
arbitrary kernel text address can already modify other critical 
data structures to gain access.

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23  4:53           ` Mathieu Desnoyers
@ 2009-02-23 14:48             ` Steven Rostedt
  2009-02-23 15:42               ` Mathieu Desnoyers
  0 siblings, 1 reply; 89+ messages in thread
From: Steven Rostedt @ 2009-02-23 14:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt


On Sun, 22 Feb 2009, Mathieu Desnoyers wrote:

> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > On Sun, 22 Feb 2009, Mathieu Desnoyers wrote:
> > > > 
> > > > We are changing over 19000 locations in the kernel. This touches almost 
> > > > all kernel text pages anyway. You want to map a page in and out for over 
> > > > 19000 locations?
> > > > 
> > > > -- Steve
> > > > 
> > > 
> > > Hi Steve,
> > > 
> > > Can you provide numbers to indicate why it's required to be so intrusive
> > > in the kernel mappings while doing these modifications ? I think opening
> > > such time window where standard code mapping is writeable globally in
> > > config RO_DATA kernels could open the door to unexpected side-effects,
> > > so ideally going through the "backdoor" page mapped by text_poke seems
> > > safer. Given similar performance, I would tend to use a text_poke-like
> > > approach.
> > > 
> > 
> > Not sure which numbers you are asking for. The dynamic function tracer 
> > modifies all mcount calls, which are done by practically every function in 
> > the kernel. With a normal Fedora kernel (and all its loaded modules), 
> > that's between 15 to 20 thousand functions, depending on what modules are 
> > loaded.
> > 
> 
> I mean comparing the cost of changing the kernel mappings and doing the
> edits (as you do) vs doing it through a text-poke-like mapping.

Well, I could try to do the benchmarks, but that would require a bit of 
development (see below).

> 
> > At boot up we convert them all to nops, but when we enable the function 
> > tracer, we convert them back to calls to the function tracer. This is done 
> by a privileged user, since the function tracer can add quite a bit of 
> > overhead when activated.
> > 
> > I do not really see how changing this for the short period of time is any 
> > different than making another mapping point to the kernel code. If you 
> > could find a way to break this security, you should be able to break it 
> > with another mapping as well.
> 
> It's not only about breaking the security. It's mostly to insure
> internal kernel consistency. While you are changing these mappings, you
> could possibly have a window where other kernel code is running (irq,
> other cpus threads). That code could itself be buggy and use the
> writeable window to overwrite some kernel code or RO data. This
> side-effect would go undetected while users think the RO data *is* RO,
> when in fact it isn't. You also bring a good point about security: if
> someone ever relies on CONFIG_DEBUG_RODATA for security reasons, then we
> give a big window where kernel text and RO data are writeable at *known*
> addresses, while we can randomize the address used for text_poke.

As Ingo already mentioned, if an attacker can write to kernel memory then 
the game is pretty much over.

As for RO_DATA and bugs, it is a very small window for this to happen, 
and the sys-admin is the one making the change. This is not some periodic 
update. The sys-admin must be the one to initiate the tracer to modify 
text, i.e. enabling or disabling the function tracer. Which, by the way, 
is something a sys-admin should only do when the system is offline. The 
overhead of all functions being traced is not something you would want 
on a production system, unless you need to analyze something going 
wrong.

> 
> > 
> > Also note that this dynamic tracing code works for not only x86, it also 
> > works for PPC, ARM, superH and ia64. To use text_poke, that would require 
> > all of these to have text_poke ported.
> > 
> 
> Do these architectures have the DEBUG_RODATA config? If not, then a simple
> memcpy is ok.

No, the stop_machine has nothing to do with RODATA config. It has to do 
with a safe way of modifying text that might run on another CPU.

> 
> > How does text_poke solve the issue of executing code on another CPU that 
> > is changing?
> > 
> 
> text_poke itself does not provide that. This must be ensured by the

Exactly! Then I cannot replace stop_machine with "text_poke".

> user on a case-by-case basis. For instance, kprobes is changing code
> atomically _and_ just inserting/removing breakpoints. Doing this is fine
> with cross-cpu code modification (XMC). alternative code is only
> changing code when the CPU is UP, so it's also ok. However, changing
> multi-byte instructions without first changing them to a trap-generating
> instruction when the CPUs are up (SMP) falls into the errata, and the
> code that uses text_poke must carefully perform this modification, e.g.
> by doing like I do in my immediate values patchset in the lttng git
> tree (using a temporary breakpoint while doing the modification). This
> would apply directly to the function tracer, and you could get rid of
> this ugly latency-inducing stop_machine() call.

Then I would need to implement this breakpoint code on every arch. 
Actually, I find the stop_machine quite an elegant solution, and not that 
ugly. Modifying code on an SMP box is very dangerous. The "stop_machine" 
turns the SMP box into a UP box while it is running (with the exception of 
NMIs, but we deal with that separately).

The current method (with DEBUG_RODATA on x86) is to make the kernel text 
writable, call stop_machine to modify all the locations (no 
interruptions), then switch the kernel text back to read only.
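
In code, the sequence is essentially (condensed from this series;
__ftrace_modify_code and 'command' are the names in kernel/trace/ftrace.c):

        set_kernel_text_rw();           /* text briefly writable */
        stop_machine(__ftrace_modify_code, &command, NULL);
        set_kernel_text_ro();           /* back to read-only     */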

What you want me to do is: memory-map a location, set a breakpoint on 
it for anyone that happens to hit it (and do what? call the previous 
command?), modify the code, remove the breakpoint, remove the memory 
mapping, and do that another 19000 times. And this must also be 
implemented on all archs that support dynamic ftrace.

That seems to be much more complex and, quite frankly, much more error 
prone. Tracing's number one priority is stability. The more complex it 
becomes, the less stable it will be.

-- Steve


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 14:48             ` Steven Rostedt
@ 2009-02-23 15:42               ` Mathieu Desnoyers
  2009-02-23 15:51                 ` Steven Rostedt
  0 siblings, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-02-23 15:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Sun, 22 Feb 2009, Mathieu Desnoyers wrote:
> 
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > 
> > > On Sun, 22 Feb 2009, Mathieu Desnoyers wrote:
> > > > > 
> > > > > We are changing over 19000 locations in the kernel. This touches almost 
> > > > > all kernel text pages anyway. You want to map a page in and out for over 
> > > > > 19000 locations?
> > > > > 
> > > > > -- Steve
> > > > > 
> > > > 
> > > > Hi Steve,
> > > > 
> > > > Can you provide numbers to indicate why it's required to be so intrusive
> > > > in the kernel mappings while doing these modifications ? I think opening
> > > > such time window where standard code mapping is writeable globally in
> > > > config RO_DATA kernels could open the door to unexpected side-effects,
> > > > so ideally going through the "backdoor" page mapped by text_poke seems
> > > > safer. Given similar performance, I would tend to use a text_poke-like
> > > > approach.
> > > > 
> > > 
> > > Not sure which numbers you are asking for. The dynamic function tracer 
> > > modifies all mcount calls, which are done by practically every function in 
> > > the kernel. With a normal Fedora kernel (and all its loaded modules), 
> > > that's between 15 to 20 thousand functions, depending on what modules are 
> > > loaded.
> > > 
> > 
> > I mean comparing the cost of changing the kernel mappings and doing the
> > edits (as you do) vs doing it through a text-poke-like mapping.
> 
> Well, I could try to do the benchmarks, but that would require a bit of 
> development (see below).
> 
> > 
> > > At boot up we convert them all to nops, but when we enable the function 
> > > tracer, we convert them back to calls to the function tracer. This is done 
> > > by a priviledge user, since the function tracer can add quite a bit of 
> > > overhead when activated.
> > > 
> > > I do not really see how changing this for the short period of time is any 
> > > different than making another mapping point to the kernel code. If you 
> > > could find a way to break this security, you should be able to break it 
> > > with another mapping as well.
> > 
> > It's not only about breaking the security. It's mostly to insure
> > internal kernel consistency. While you are changing these mappings, you
> > could possibly have a window where other kernel code is running (irq,
> > other cpus threads). That code could itself be buggy and use the
> > writeable window to overwrite some kernel code or RO data. This
> > side-effect would go undetected while users think the RO data *is* RO,
> > when in fact it isn't. You also bring a good point about security: if
> > someone ever relies on CONFIG_DEBUG_RODATA for security reasons, then
> > we give a big window where kernel text and RO data are writeable at
> > *known* addresses, while we can randomize the address used for text_poke.
> 
> As Ingo already mentioned, if an attacker can write to kernel memory then 
> the game is pretty much over.
> 
> As for RO_DATA and bugs, it is a very small window for this to happen, and 
> the sys-admin is the one making the change. This is not some periodical 
> update. The sys-admin must be the one to initiate the tracer to modify 
> text, ie, enabling or disabling the function tracer. Which, by the way, is 
> something a sys-admin should only do when the system is off line. The 
> overhead of all functions being traced, would not be something to be 
> doing on a production system, unless they need to analyze something going 
> wrong.
> 

The argument "not to be used on production systems" is incompatible with
the LTTng view, sorry. If you design your code so it's usable only in
debugging scenarios on development machines and not in the field, then I
doubt LTTng will be able to rely on it. I'm OK with that, as long as
nobody argues that such tracepoints could be replaced by the function
tracer, because tracepoints have to be enabled in the field on production
machines.

I agree that the racy time window is not that large and is not really a
security concern, but it's still just annoying.


> > 
> > > 
> > > Also note that this dynamic tracing code works for not only x86, it also 
> > > works for PPC, ARM, superH and ia64. To use text_poke, that would require 
> > > all of these to have text_poke ported.
> > > 
> > 
> > Do these architectures have the DEBUG_RODATA config? If not, then a simple
> > memcpy is ok.
> 
> No, the stop_machine has nothing to do with RODATA config. It has to do 
> with a safe way of modifying text that might run on another CPU.
> 
> > 
> > > How does text_poke solve the issue of executing code on another CPU that 
> > > is changing?
> > > 
> > 
> > text_poke itself does not provide that. This must be ensured by the
> 
> Exactly! Then I can not replace stop_machine with "text_poke".
> 
> > user on a case-by-case basis. For instance, kprobes is changing code
> > atomically _and_ just inserting/removing breakpoints. Doing this is fine
> > with cross-cpu code modification (XMC). alternative code is only
> > changing code when the CPU is UP, so it's also ok. However, changing
> > multi-byte instructions without first changing them to a trap-generating
> > instruction when the CPUs are up (SMP) falls into the errata, and the
> > code that uses text_poke must carefully perform this modification, e.g.
> > by doing like I do in my immediate values patchset in the lttng git
> > tree (using a temporary breakpoint while doing the modification). This
> > would apply directly to the function tracer, and you could get rid of
> > this ugly latency-inducing stop_machine() call.
> 
> Then I would need to implement this break point code on every arch. 

No, just on every arch which has such XMC errata. Intel and ia64 are
the two I am aware of. But I guess if you want to play safe, doing it on
each architecture makes sense.

> Actually, I find the stop_machine quite an elegant solution, and not that 
> ugly. Modifying code on an SMP box is very dangerous. The "stop_machine" 
> turns the SMP box into a UP box while it is running (with the exception of 
> NMIs, but we deal with that separately).

The whole "we deal with that separately" aspect seems to add complexity
to something that should stay simple, by working around the real
problem.

> 
> The current method (with DEBUG_RODATA on x86), is to make the kernel text 
> writable. Call stop_machine that will modify all the locations (no 
> interruptions), then switch the kernel text back to read only.
> 

I'm pretty sure that leads to unacceptably long interrupt latency on
production machines.

> What you want me to do is, memory map a location, set a break point on 
> it for anyone that happens to hit it (and do what? call the previous 
> command?)

No. iret just after the modified site. You are flipping between "call"
and "nops", so you take the simplest behavior : nops.

> modify the code, remove the break point, remove the memory 
> mapping and do that for another 19000 times. And this must also be 
> implemented on all archs that support dynamic ftrace.
> 

#define BREAKPOINT_INSN ...
#define BREAKPOINT_INSN_LEN ...

This can be abstracted pretty easily.
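
The handler side would then be on the order of (a sketch; register
field and helper names are assumptions):

        /* int3 handler for an armed mcount site: treat the site as a
         * nop by resuming execution just after it, then iret. */
        static int mcount_bp_handler(struct pt_regs *regs,
                                     unsigned long site, int insn_len)
        {
                if (regs->ip != site + 1)       /* ip points past the int3 */
                        return 0;               /* not our breakpoint      */
                regs->ip = site + insn_len;     /* skip the patched site   */
                return 1;                       /* handled - iret          */
        }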


> That seems to be much more complex and quite frankly, much more error 
> prone. Tracing's number one priority is stability. The more complex it 
> becomes the less stable it will be.
> 

Given that you plan to use tracing only in debugging setups, you seem to
miss another very important aspect: tracer intrusiveness. Disabling
interrupts for a few milliseconds in a row on a telecommunication system
is just unacceptable.

I agree that stability is very important, but, as Einstein said:
"Make everything as simple as possible, but not simpler".

Mathieu

> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 15:42               ` Mathieu Desnoyers
@ 2009-02-23 15:51                 ` Steven Rostedt
  2009-02-23 15:55                   ` Steven Rostedt
  2009-02-23 16:13                   ` Mathieu Desnoyers
  0 siblings, 2 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-23 15:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt


On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> > 
> > As for RO_DATA and bugs, it is a very small window for this to happen, and 
> > the sys-admin is the one making the change. This is not some periodical 
> > update. The sys-admin must be the one to initiate the tracer to modify 
> > text, ie, enabling or disabling the function tracer. Which, by the way, is 
> > something a sys-admin should only do when the system is off line. The 
> > overhead of all functions being traced, would not be something to be 
> > doing on a production system, unless they need to analyze something going 
> > wrong.
> > 
> 
> The argument "not to be used on production systems" is incompatible with
> the LTTng view, sorry. If you design your code so it's usable only in
> debugging scenarios on development machines and not in the field, then I
> doubt LTTng will be able to rely on it. I'm OK with that, as long as
> nobody argue that such tracepoint could be replaced by the function
> tracer, because tracepoints has to be enabled in the field on production
> machines.

Please do not confuse ftrace with the function tracer. The stop_machine
is only about the function tracer and has nothing to do with the rest of
ftrace. This is one detail. Yes, tracing EVERY function in the kernel
will add overhead. There's no way around it. It's OK to do it on a
production system, but it WILL add overhead. That's what happens when you
trace EVERY function.

Note, I leave a lot of the other tracers on by default, and those are all
within the noise of overhead. I'm only talking about the function tracer
that is meant to do a lot of tracing. Does LTTng trace EVERY function?

Now, yes, if you only select a few functions, there's no noticeable 
overhead. And yes, then you would need to do the stop_machine anyway, and 
there will be a small window where the kernel text will be writable. 
Tracing only a small set of functions (say a few hundred) is not much of 
an overhead, and I could see that being done on a production system.

> 
> I agree that the racy time window is not that large and is not really a
> security concern, but it's still just annoying.

Annoying? How so?

Again, the stop_machine part has nothing to do with DEBUG_RODATA, it is 
about the safest and easiest way to modify kernel text.

-- Steve


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 15:51                 ` Steven Rostedt
@ 2009-02-23 15:55                   ` Steven Rostedt
  2009-02-23 16:13                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-23 15:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt


On Mon, 23 Feb 2009, Steven Rostedt wrote:

> 
> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> > > 
> > > As for RO_DATA and bugs, it is a very small window for this to happen, and 
> > > the sys-admin is the one making the change. This is not some periodical 
> > > update. The sys-admin must be the one to initiate the tracer to modify 
> > > text, ie, enabling or disabling the function tracer. Which, by the way, is 
> > > something a sys-admin should only do when the system is off line. The 
> > > overhead of all functions being traced, would not be something to be 
> > > doing on a production system, unless they need to analyze something going 
> > > wrong.
> > > 
> > 
> > The argument "not to be used on production systems" is incompatible with
> > the LTTng view, sorry. If you design your code so it's usable only in
> > debugging scenarios on development machines and not in the field, then I
> > doubt LTTng will be able to rely on it. I'm OK with that, as long as
> > nobody argue that such tracepoint could be replaced by the function
> > tracer, because tracepoints has to be enabled in the field on production
> > machines.
> 
> Please do not confuse ftrace with the function tracer. The stop_machine
> is only about the function tracer and has nothing to do with the rest of
> ftrace. This is one detail. Yes, tracing EVERY function in the kernel
> will add an overhead. There's no way around it. It's OK to do it on a
> production system, but it WILL add overhead. That's what happens when you
> trace EVERY function.
> 
> Note, I leave a lot of the other tracers on by default, and those are all
> within the noise of overhead. I'm only talking about the function tracer
> that is meant to do a lot of tracing. Does LTTng trace EVERY function?

BTW, the above is more of an answer to my statement about running on
a production system; below is more of an answer to the above. After 
rereading what I wrote, I did not explain it very well.

-- Steve

> 
> Now, yes, if you only select a few functions, there's no noticeable 
> overhead. And yes then you would need to do the stop_machine anyway, and 
> there will be a small window where the kernel text will be writable. 
> Tracing only a small set of functions (say a few 100) is not much of an 
> overhead, and I could see that being done on a production system.
> 
> > 
> > I agree that the racy time window is not that large and is not really a
> > security concern, but it's still just annoying.
> 
> Annoying? how so?
> 
> Again, the stop_machine part has nothing to do with DEBUG_RODATA, it is 
> about the safest and easiest way to modify kernel text.
> 
> -- Steve
> 
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 15:51                 ` Steven Rostedt
  2009-02-23 15:55                   ` Steven Rostedt
@ 2009-02-23 16:13                   ` Mathieu Desnoyers
  2009-02-23 16:48                     ` Steven Rostedt
  1 sibling, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-02-23 16:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> > > 
> > > As for RO_DATA and bugs, it is a very small window for this to happen, and 
> > > the sys-admin is the one making the change. This is not some periodical 
> > > update. The sys-admin must be the one to initiate the tracer to modify 
> > > text, ie, enabling or disabling the function tracer. Which, by the way, is 
> > > something a sys-admin should only do when the system is off line. The 
> > > overhead of all functions being traced, would not be something to be 
> > > doing on a production system, unless they need to analyze something going 
> > > wrong.
> > > 
> > 
> > The argument "not to be used on production systems" is incompatible with
> > the LTTng view, sorry. If you design your code so it's usable only in
> > debugging scenarios on development machines and not in the field, then I
> > doubt LTTng will be able to rely on it. I'm OK with that, as long as
> > nobody argue that such tracepoint could be replaced by the function
> > tracer, because tracepoints has to be enabled in the field on production
> > machines.
> 
> Please do not confuse ftrace with the function tracer. The stop_machine
> is only about the function tracer and has nothing to do with the rest of
> ftrace. This is one detail. Yes, tracing EVERY function in the kernel
> will add an overhead. There's no way around it. It's OK to do it on a
> production system, but it WILL add overhead. That's what happens when you
> trace EVERY function.
> 

I specifically talked about the function tracer here, so there is no
confusion.

> Note, I leave a lot of the other tracers on by default, and those are all
> within the noise of overhead. I'm only talking about the function tracer
> that is meant to do a lot of tracing. Does LTTng trace EVERY function?
> 

It can, by using your function tracer. It has a mode where it can
enable/disable a filter in a callback connected to tracepoints. This
filter is then used to enable detailed function tracing for a short time
window. Also, you could think of tracing every function call with
LTTng's flight recorder mode, which only spins in memory, overwriting
the oldest information. That would provide on-demand snapshots of the
last functions called.

> Now, yes, if you only select a few functions, there's no noticeable 
> overhead. And yes then you would need to do the stop_machine anyway, and 
> there will be a small window where the kernel text will be writable. 
> Tracing only a small set of functions (say a few 100) is not much of an 
> overhead, and I could see that being done on a production system.
> 

This is what LTTng can do today. But that involves the function tracer
stop_machine() call, which I dislike.

> > 
> > I agree that the racy time window is not that large and is not really a
> > security concern, but it's still just annoying.
> 
> Annoying? how so?
> 
> Again, the stop_machine part has nothing to do with DEBUG_RODATA, it is 
> about the safest and easiest way to modify kernel text.
> 

We are running in circles here because no real argument has been
brought.

1 - You claim that changing the kernel's mapping, which has been
pointed out as an intrusive kernel modification, is faster than using a
text-poke-like approach. Please provide numbers to support such claims.

2 - You claim that using stop_machine is simpler and therefore safer
than using a breakpoint-based approach. I start having some doubts about
simplicity when you start talking about the workarounds you have to do
for NMIs, but more importantly, you seem to recognise that the latency
it induces would be inadequate for production systems. Therefore it's
unusable in some LTTng use-cases just because of that. If you expect the
function tracer to become used more widely in LTTng, these concerns
should be addressed.

If, in the end, your argument is "the function tracer works as-is now,
and I have no time to change it given it represents too much work" or "I
don't care about your use-cases", I'm OK with that. But please then
don't argue that it's the best technical solution when it isn't.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 16:13                   ` Mathieu Desnoyers
@ 2009-02-23 16:48                     ` Steven Rostedt
  2009-02-23 17:31                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 89+ messages in thread
From: Steven Rostedt @ 2009-02-23 16:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt


On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> It can, by using your function tracer. It has a mode where it can
> enable/disable a filter in a callback connected on tracepoints. This
> filter is then used to enable detailed function tracing for a short time
> window. Also, you could think of tracing every function calls with
> LTTng's flight recorder mode, which only spins in memory overwriting the
> oldest information. That would provide snapshots on demand of the last
> functions called.
> 
> > Now, yes, if you only select a few functions, there's no noticeable 
> > overhead. And yes then you would need to do the stop_machine anyway, and 
> > there will be a small window where the kernel text will be writable. 
> > Tracing only a small set of functions (say a few 100) is not much of an 
> > overhead, and I could see that being done on a production system.
> > 
> 
> This is what LTTng can do today. But that involves the function tracer
> stop_machine() call, which I dislike.

What's wrong with stop_machine?  Specifically, what do you dislike about 
it?

> 
> > > 
> > > I agree that the racy time window is not that large and is not really a
> > > security concern, but it's still just annoying.
> > 
> > Annoying? how so?
> > 
> > Again, the stop_machine part has nothing to do with DEBUG_RODATA, it is 
> > about the safest and easiest way to modify kernel text.
> > 
> 
> We are running in circles here because there is no real argument
> brought.
> 
> 1 - You claim that changing the kernel's mapping, which has been
> pointed out as an intrusive kernel modification, is faster than using a
> text-poke-like approach. Please provide numbers to support such claims.

Hmm, let's see. I simply set a bit in the PTE mappings. There are not
many, since a lot are 2M pages on x86_64. Call stop_machine, and now I
can modify 1 or 20,000 locations. Set the PTE bit back. Note, the
changing of the bits is only done when CONFIG_DEBUG_RODATA is set.

text_poke requires allocating a page, mapping the page into memory,
setting up a breakpoint, knowing what to do when that breakpoint is hit
by another process, modifying the one location, unmapping the page,
freeing the page, and removing the breakpoint.

Yes, this may be faster if I only modify one location. I would be hard
pressed to believe this is faster when I modify a few hundred locations.
The stop_machine method does it all at once, not one at a time.

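For concreteness, the flow described above looks roughly like this (a
minimal sketch, assuming the set_kernel_text_rw()/set_kernel_text_ro()
helpers this patch series adds; patch_all_sites() is a hypothetical
stand-in for the actual mcount rewrite loop):

---
#include <linux/stop_machine.h>

/* Hypothetical stand-in for the loop that rewrites the mcount call
 * sites; it runs on one CPU while every other CPU is halted with
 * interrupts off, so plain stores to kernel text are safe. */
static int patch_all_sites(void *data)
{
	/* modify 1 or 20,000 call sites here */
	return 0;
}

static void convert_mcount_sites(void)
{
	set_kernel_text_rw();	/* flip the RW bits (CONFIG_DEBUG_RODATA) */
	stop_machine(patch_all_sites, NULL, NULL);
	set_kernel_text_ro();	/* restore read-only kernel text */
}
---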

> 
> 2 - You claim that using stop_machine is simpler and therefore safer
> than using a breakpoint-based approach. I start having some doubts about
> simplicity when you start talking about the workarounds you have to do
> for NMIs,

I agree, the NMI workaround was tricky, but the final solution (which
we tested vigorously) works well. My claim that it is simpler is not
about the small steps, but rather the number of variables we need to
deal with. Stop machine shuts down all the CPUs and executes my code on
one CPU. Interrupts are disabled on all CPUs, and we only need to worry
about the NMI, which we now do.

Your solution is about mapping another page on a running system, where
anything can happen. The number of variables that can go wrong is much
greater simply because you have no idea what is running at the same
time as you perform your modifications.

With stop_machine, the number of variables is much less, because I know 
everything that is happening when I do the modification. I do not need to 
worry about some strange driver doing some kind of tricks because it 
simply is not running.

> but more importantly, you seem to recognise that the latency
> it induces would be inadequate for production systems.

Wrong. I recognise the latency of tracing all functions on a production 
system. Heck, we trace spin_lock, rcu_read_lock, mutex_lock, and all that 
jazz. Just slowing those functions down a bit will have a noticeable 
impact. I've found that adding those functions to set_ftrace_notrace
drops the function tracer penalty significantly.


> Therefore it's
> unusable in some LTTng use-cases just because of that. If you expect the
> function tracer to become used more widely in LTTng, these concerns
> should be addressed.

If you only want to trace a few hundred functions, then the overhead
with it on should not be significant, depending on which functions you
trace. As mentioned above, tracing only spin_lock can slow the system
down.

Set up the functions you want to trace and enable them. You can keep the
ring buffer disabled (echo 0 > /debug/tracing/tracing_on), turn it on
just for your snapshot, and turn it off when you are done. When all
tracing is done, then disable the function tracing.


> 
> If, in the end, your argument is "the function tracer works as-is now,
> and I have no time to change it given it represents too much work" or "I
> don't care about your use-cases", I'm OK with that. But please then don't
> argue that it's because it's the best technical solution when it isn't.

No, I have yet to hear a valuable argument against stop_machine. You are
pushing the burden of proof onto me, when we have something that does
work on several archs. You want me to redesign the system to be x86
only, and then say, hey, my original code works better.

I do not see text_poke being theoretically better. The only reason you
have given me to use it is that you dislike stop_machine.

-- Steve




^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 16:48                     ` Steven Rostedt
@ 2009-02-23 17:31                       ` Mathieu Desnoyers
  2009-02-23 18:17                         ` Steven Rostedt
  2009-02-23 18:23                         ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Steven Rostedt
  0 siblings, 2 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-02-23 17:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> > It can, by using your function tracer. It has a mode where it can
> > enable/disable a filter in a callback connected on tracepoints. This
> > filter is then used to enable detailed function tracing for a short time
> > window. Also, you could think of tracing every function calls with
> > LTTng's flight recorder mode, which only spins in memory overwriting the
> > oldest information. That would provide snapshots on demand of the last
> > functions called.
> > 
> > > Now, yes, if you only select a few functions, there's no noticeable 
> > > overhead. And yes then you would need to do the stop_machine anyway, and 
> > > there will be a small window where the kernel text will be writable. 
> > > Tracing only a small set of functions (say a few 100) is not much of an 
> > > overhead, and I could see that being done on a production system.
> > > 
> > 
> > This is what LTTng can do today. But that involves the function tracer
> > stop_machine() call, which I dislike.
> 
> What's wrong with stop_machine?  Specifically, what do you dislike about 
> it?
> 
> > 
> > > > 
> > > > I agree that the racy time window is not that large and is not really a
> > > > security concern, but it's still just annoying.
> > > 
> > > Annoying? how so?
> > > 
> > > Again, the stop_machine part has nothing to do with DEBUG_RODATA, it is 
> > > about the safest and easiest way to modify kernel text.
> > > 
> > 
> > We are running in circles here because there is no real argument
> > brought.
> > 
> > 1 - You claim that changing the kernel's mapping, which has been
> > pointed out as an intrusive kernel modification, is faster than using a
> > text-poke-like approach. Please provide numbers to support such claims.
> 
> Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
> since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
> modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
> the bits are only done when CONFIG_DEBUG_RODATA is set.
> 
> text_poke requires allocating a page. Map the page into memory. Set up a 
> break point.

text_poke does not _require_ a breakpoint. text_poke can work with
stop_machine. There are two different problems here:

- How you deal with concurrency
  - you use stop_machine
  - I use breakpoints
- How you deal with RO page mappings
  - you change the kernel page flags
  - I use text_poke

Please don't mix those separate concerns.

> Knowing what to do when that break point is hit by another 
> process. Modify the one location. Unmap the page. Free the page. Remove 
> the breakpoint.
> 
> Yes, this may be faster if I only modify one location. I would be hard 
> pressed that this is faster when I modify a few hundred locations. 
> The stop_machine method does it all at once. Not one at a time.
> 
> 
> > 
> > 2 - You claim that using stop_machine is simpler and therefore safer
> > than using a breakpoint-based approach. I start having some doubts about
> > simplicity when you start talking about the workarounds you have to do
> > for NMIs,
> 
> I agree, the NMI work around was tricky, but the final solution (which
> we tested vigorously) works well. My claim that it is simpler is not about 
> the small steps, but rather the number of variables we need to deal with.
> Stop machine shuts down all the CPUs and executes my code on one CPU. 
> Interrupts are disabled on all CPUs, and we only need to worry about the 
> NMI. Which we now do.
> 
> Your solution is about mapping another page on a running system, where
> anything can happen. The number of variables that can go wrong is much 
> greater simply by the fact that you have no idea as to what is running at 
> the same time as you perform your modifications.
> 
> With stop_machine, the number of variables is much less, because I know 
> everything that is happening when I do the modification. I do not need to 
> worry about some strange driver doing some kind of tricks because it 
> simply is not running.
> 
> > but more importantly, you seem to recognise that the latency
> > it induces would be inadequate for production systems.
> 
> Wrong. I recognise the latency of tracing all functions on a production 
> system. Heck, we trace spin_lock, rcu_read_lock, mutex_lock, and all that 
> jazz. Just slowing those functions down a bit will have a noticeable 
> impact. I've found that adding those functions to set_ftrace_notrace drops 
> the function tracer penalty, significantly.
> 
> 
> > Therefore it's
> > unusable in some LTTng use-cases just because of that. If you expect the
> > function tracer to become used more widely in LTTng, these concerns
> > should be addressed.
> 
> If you only want to trace a few hundred functions, then the overhead with
> it on should not be significant. Depending on which functions you trace. 
> As mentioned above, tracing only spin_lock can slow the system down.
> 
> Set up the functions you want to trace, enable them. You can have the
> ring buffer disabled (echo 0 > /debug/tracing/tracing_on) and just turn on 
> the ring buffer for your snapshot, and turn it off when you are done. When 
> all tracing is done, then disable the function tracing.
> 
> 
> > 
> > If, in the end, your argument is "the function tracer works as-is now,
> > and I have no time to change it given it represents too much work" or "I
> > don't care about your use-cases", I'm OK with that. But please then don't
> > argue that it's because it's the best technical solution when it isn't.
> 
> No, I have yet to hear a valuable argument against stop_machine. You are 
> pushing the burden of proof on me, when we have something that does work, 
> on several archs. You want me to redesign the system to be x86 only, and 
> then say, hey, my original code works better.
> 

stop_machine involves high interrupt latency. This is the argument I've
been repeating for 1-2 emails already. And I have to disagree with you:
we can do this code generically given the right abstractions
(BREAKPOINT_INSN* macros I proposed earlier). Is having something that
"works" your only argument to stop improving it?

> I do not see text_poke being theoretically better. The only reason you 
> given me to use it is because you dislike stop_machine.
> 

There is absolutely no link between stop_machine and text_poke. I argue
against stop_machine saying that the breakpoint approach is less
intrusive because it does not involve disabling interrupts for so long,
and I argue against modifying the kernel page flags because that
modifies the access rights of the core kernel and modules to RO
mappings, which is IMO a side-effect that we should eliminate _if we
can_. Please keep those two concerns separate.

Mathieu

> -- Steve
> 
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 17:31                       ` Mathieu Desnoyers
@ 2009-02-23 18:17                         ` Steven Rostedt
  2009-02-23 18:34                           ` Mathieu Desnoyers
  2009-02-27 17:52                           ` Masami Hiramatsu
  2009-02-23 18:23                         ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Steven Rostedt
  1 sibling, 2 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-23 18:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt


On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> > 
> > Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
> > since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
> > modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
> > the bits are only done when CONFIG_DEBUG_RODATA is set.
> > 
> > text_poke requires allocating a page. Map the page into memory. Set up a 
> > break point.
> 
> text_poke does not _require_ a break point. text_poke can work with
> stop_machine.

It can? Doesn't text_poke require allocating pages? The code called by 
stop_machine is all atomic. vmap does not give an option to allocate with 
GFP_ATOMIC.


> There are two different problems here :

I agree that they are two different problems. The reason I relate them
is that text_poke cannot be called from within stop_machine.

> 
> - How you deal with concurrency
>   - you use stop machine
>   - I use breakpoints
> - How you deal with RO page mappings
>   - you change the kernel page flags
>   - i use text_poke
> 
> Please don't mix those separate concerns.

So you have two different concerns. One is that I use stop_machine
instead of breakpoints; the other is that I modify all kernel text to
make the change.

Let's look at them separately.

stop_machine vs. breakpoints:

Breakpoints are a cool trick, but they are not implemented on all the
archs that dynamic ftrace is.

Breakpoints are performed on a running system. This may mean lower
latency when the tracer is started, but it can create a large number of
variables that cannot all be understood.

stop_machine is quite simple. No need to take traps, no need to handle
what to do when another process runs the code being changed.

When making the hooks, stop_machine can add a bit of interrupt latency.
But this is only when the hooks are added or removed. Why is this such a
big deal?  It is much easier to add the hooks with tracing disabled (via
a simple toggle bit). Then start and stop your tracing by using the
toggle bit. After you are all done, then remove the hooks. Or just keep
them on since they are low overhead anyway (only a few hooks, right?)


CONFIG_DEBUG_RODATA (only an x86 issue at the moment)

text_poke vs changing all pages:

You said this is a separate issue from stop_machine. But that is not
the case. text_poke cannot be done in an atomic section, which rules it
out for use from stop_machine.

As you said, text_poke only handles the RO/RW issue, not the modifying
of code on the fly. Thus, as long as we keep stop_machine, we also
cannot use text_poke.

I guess this takes the text_poke vs changing all pages debate out of the
question. While stop_machine is still being used, we cannot use
text_poke (without rewriting it).

Also, when we want to trace all functions, is it really necessary to
vmap each one at a time? Andi suggested that we could optimise by
mapping larger pages and finding the ones that share a page. This too
would require a rewrite of text_poke.



> > > 
> > > If, in the end, your argument is "the function tracer works as-is now,
> > > and I have no time to change it given it represents too much work" or "I
> > > don't care about your use-cases", I'm OK with that. But please then don't
> > > argue that it's because it's the best technical solution when it isn't.
> > 
> > No, I have yet to hear a valuable argument against stop_machine. You are 
> > pushing the burden of proof on me, when we have something that does work, 
> > on several archs. You want me to redesign the system to be x86 only, and 
> > then say, hey, my original code works better.
> > 
> 
> stop_machine involves high interrupt latency. This is the argument I've
> been repeating for 1-2 emails already. And I have to disagree with you :
> we can do this code generically given the right abstractions
> (BREAKPOINT_INSN* macros I proposed earlier). Is having something that
> "works" your only argument to stop improving it ?

The high interrupt latency only happens at the time we need to hook the
functions. That does not have to be the time we start the tracing; that
can be done separately.

Your only concern is the stop_machine latency? Then you might as well
also prevent module loading, since that uses stop_machine too. Again,
this happens only when the tracer hooks are added or removed. This is
done at a time the sys-admin chooses to activate it. It is not a random
latency incurred by some timer or other asynchronous event.

> 
> > I do not see text_poke being theoretically better. The only reason you 
> > given me to use it is because you dislike stop_machine.
> > 
> 
> There is absolutely no link between stop_machine and text_poke. I argue
> against stop_machine saying that the breakpoint approach is less
> intrusive because it does not involve disabling interrupts for so long,
> and I argue against modifying the kernel page flags because that
> modifies the access rights of the core kernel and modules to RO
> mappings, which is IMO a side-effect that we should eliminate _if we
> can_. Please keep those two concerns separate.

text_poke cannot be executed from stop_machine. There's the link. The
two concerns are not separate.

Your concern with stop_machine is that it will cause an interrupt
latency when the sysadmin enables or disables the functions. There exist
other, asynchronous interrupt latencies that can be worse. Run hackbench
with the irqsoff tracer and see for yourself.

-- Steve


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 17:31                       ` Mathieu Desnoyers
  2009-02-23 18:17                         ` Steven Rostedt
@ 2009-02-23 18:23                         ` Steven Rostedt
  1 sibling, 0 replies; 89+ messages in thread
From: Steven Rostedt @ 2009-02-23 18:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt


On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> 
> There is absolutely no link between stop_machine and text_poke. I argue
> against stop_machine saying that the breakpoint approach is less
> intrusive because it does not involve disabling interrupts for so long,
> and I argue against modifying the kernel page flags because that
> modifies the access rights of the core kernel and modules to RO

One correction: module text is always mapped RW, even with
CONFIG_DEBUG_RODATA. Perhaps we could fix that, but as it is today,
you do not need text_poke to modify module text.

-- Steve

> mappings, which is IMO a side-effect that we should eliminate _if we
> can_. Please keep those two concerns separate.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 18:17                         ` Steven Rostedt
@ 2009-02-23 18:34                           ` Mathieu Desnoyers
  2009-02-27 17:52                           ` Masami Hiramatsu
  1 sibling, 0 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-02-23 18:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> > > 
> > > Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
> > > since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
> > > modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
> > > the bits are only done when CONFIG_DEBUG_RODATA is set.
> > > 
> > > text_poke requires allocating a page. Map the page into memory. Set up a 
> > > break point.
> > 
> > text_poke does not _require_ a break point. text_poke can work with
> > stop_machine.
> 
> It can? Doesn't text_poke require allocating pages? The code called by 
> stop_machine is all atomic. vmap does not give an option to allocate with 
> GFP_ATOMIC.
> 
> 
> > There are two different problems here :
> 
> I agree that they are two different problems. The reason I relate them is 
> because text_poke can not be called from a stop_machine call.
> 
> > 
> > - How you deal with concurrency
> >   - you use stop machine
> >   - I use breakpoints
> > - How you deal with RO page mappings
> >   - you change the kernel page flags
> >   - i use text_poke
> > 
> > Please don't mix those separate concerns.
> 
> So you have two different concerns. One is that I use stop_machine, 
> instead of break points, the other is that I modify all kernel text to
> make the change.
> 
> Lets look at them separately.
> 
> The stop_machine vs. break points.
> 
> breakpoints is a cool trick, but is not implemented on all the archs that 
> dynamic ftrace is.
> 
> break points are performed on a running system. This may be lower in 
> latency tracing when the tracer is started, but can create a large number 
> of variables that can not all be understood.
> 
> stop_machine is quite simple. No need to take traps, no need to handle 
> what to do when another process runs the code being changed.
> 
> When making the hooks, stop_machine can add a bit of a interrupt latency. 
> But this is only when the hooks are added or removed. Why is this such a 
> big deal?

On a live system, adding interrupt latency even when tracing is not
active yet _is_ a big deal.

> It is much easier to add the hooks with tracing disabled (via 
> a simple toggle bit). Then start and stop your tracing by using the toggle 
> bit. After you are all done, then remove the hooks. Or just keep them 
> on since they are low overhead anyway (only a few hooks right?)
> 
> 
> CONFIG_DEBUG_RODATA (only an x86 issue at the moment)
> 
> text_poke vs changing all pages:
> 
> You said this is a separate issue than stop_machine. But that is not the 
> case. text_poke can not be done in an atomic section. This removes it from 
> being used by stop_machine.
> 

Hrm, I wonder if we could create a variant of vmap_ram that would be
atomic? That would clearly fix our problems.

> As you said, text_poke only handles the RO/RW issue, not the modifying of 
> code on the fly. Thus, keeping stop_machine around, we must also not use 
> text_poke.

Not if we modify vmap_ram...

> 
> I guess this takes the text_poke vs changing all pages out of the 
> question. While stop_machine is still being used, we can not use 
> text_poke (without rewriting it).

Where is the problem? Let's improve it if needed.

> 
> Also when we want to trace all functions, is it really necessary to vmap
> each one at a time? Andi suggested that we could optimise by mapping 
> larger pages, and finding the ones that share the page. This too would 
> require a rewrite of text_poke.
> 

This is an optimization; we should see the performance penalty first,
before we start optimizing things too early.

Mathieu

> 
> 
> > > > 
> > > > If, in the end, your argument is "the function tracer works as-is now,
> > > > and I have no time to change it given it represents too much work" or "I
> > > > don't care about your use-cases", I'm OK with that. But please then don't
> > > > argue that it's because it's the best technical solution when it isn't.
> > > 
> > > No, I have yet to hear a valuable argument against stop_machine. You are 
> > > pushing the burden of proof on me, when we have something that does work, 
> > > on several archs. You want me to redesign the system to be x86 only, and 
> > > then say, hey, my original code works better.
> > > 
> > 
> > stop_machine involves high interrupt latency. This is the argument I've
> > been repeating for 1-2 emails already. And I have to disagree with you :
> > we can do this code generically given the right abstractions
> > (BREAKPOINT_INSN* macros I proposed earlier). Is having something that
> > "works" your only argument to stop improving it ?
> 
> The high interrupt latency only happens at the time we need to hook the 
> functions. This does not mean it is the time to start the tracing. That 
> can be done separately.
> 
> Your only concern is the stop_machine latency? Then you might as well also 
> prevent modules, since that uses stop machine too. Again, this happens 
> only when the tracer hooks are added or removed. This is done at a time 
> the sys-admin will activate it. It is not a random latency that is 
> occurred by some timer or other asynchronous event.
> 
> > 
> > > I do not see text_poke being theoretically better. The only reason you 
> > > given me to use it is because you dislike stop_machine.
> > > 
> > 
> > There is absolutely no link between stop_machine and text_poke. I argue
> > against stop_machine saying that the breakpoint approach is less
> > intrusive because it does not involve disabling interrupts for so long,
> > and I argue against modifying the kernel page flags because that
> > modifies the access rights of the core kernel and modules to RO
> > mappings, which is IMO a side-effect that we should eliminate _if we
> > can_. Please keep those two concerns separate.
> 
> text_poke can not be executed from stop_machine. There's the link. The two 
> concerns are not separate.
> 
> Your concern with stop_machine is that it will cause an interrupt latency 
> when the sysadmin enables or disables the functions. There exists other 
> interrupt latencies that can be worst that are asynchronous. Run hackbench 
> with the irqs off tracer and see for yourself.
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-23 18:17                         ` Steven Rostedt
  2009-02-23 18:34                           ` Mathieu Desnoyers
@ 2009-02-27 17:52                           ` Masami Hiramatsu
  2009-02-27 18:07                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-02-27 17:52 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Andi Kleen, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt

[-- Attachment #1: Type: text/plain, Size: 6739 bytes --]

Steven Rostedt wrote:
> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
>>> Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
>>> since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
>>> modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
>>> the bits are only done when CONFIG_DEBUG_RODATA is set.
>>>
>>> text_poke requires allocating a page. Map the page into memory. Set up a 
>>> break point.
>> text_poke does not _require_ a break point. text_poke can work with
>> stop_machine.
> 
> It can? Doesn't text_poke require allocating pages? The code called by 
> stop_machine is all atomic. vmap does not give an option to allocate with 
> GFP_ATOMIC.

Hi,

With my patch, text_poke() never allocates pages any more :)

BTW, IMHO, both of your methods are useful and have trade-offs.

ftrace wants to change a massive amount of code at once. If we do
that with text_poke(), we have to map/unmap pages each time and
it will take a long time -- it might be longer than one
stop_machine_run().

On the other hand, text_poke() users such as kprobes and tracepoints
just want to change a small amount of code at once, and it will be
added/removed incrementally. If we do that with stop_machine_run(),
we'll be annoyed by frequent machine stops. (Moreover, kprobes uses
breakpoints, so it doesn't need stop_machine_run().)


Thank you,

>> There are two different problems here :
> 
> I agree that they are two different problems. The reason I relate them is 
> because text_poke can not be called from a stop_machine call.
> 
>> - How you deal with concurrency
>>   - you use stop machine
>>   - I use breakpoints
>> - How you deal with RO page mappings
>>   - you change the kernel page flags
>>   - i use text_poke
>>
>> Please don't mix those separate concerns.
> 
> So you have two different concerns. One is that I use stop_machine, 
> instead of break points, the other is that I modify all kernel text to
> make the change.
> 
> Lets look at them separately.
> 
> The stop_machine vs. break points.
> 
> breakpoints is a cool trick, but is not implemented on all the archs that 
> dynamic ftrace is.
> 
> break points are performed on a running system. This may be lower in 
> latency tracing when the tracer is started, but can create a large number 
> of variables that can not all be understood.
> 
> stop_machine is quite simple. No need to take traps, no need to handle 
> what to do when another process runs the code being changed.
> 
> When making the hooks, stop_machine can add a bit of a interrupt latency. 
> But this is only when the hooks are added or removed. Why is this such a 
> big deal?  It is much easier to add the hooks with tracing disabled (via 
> a simple toggle bit). Then start and stop your tracing by using the toggle 
> bit. After you are all done, then remove the hooks. Or just keep them 
> on since they are low overhead anyway (only a few hooks right?)
> 
> 
> CONFIG_DEBUG_RODATA (only an x86 issue at the moment)
> 
> text_poke vs changing all pages:
> 
> You said this is a separate issue than stop_machine. But that is not the 
> case. text_poke can not be done in an atomic section. This removes it from 
> being used by stop_machine.
> 
> As you said, text_poke only handles the RO/RW issue, not the modifying of 
> code on the fly. Thus, keeping stop_machine around, we must also not use 
> text_poke.
> 
> I guess this takes the text_poke vs changing all pages out of the 
> question. While stop_machine is still being used, we can not use 
> text_poke (without rewriting it).
> 
> Also when we want to trace all functions, is it really necessary to vmap
> each one at a time? Andi suggested that we could optimise by mapping 
> larger pages, and finding the ones that share the page. This too would 
> require a rewrite of text_poke.
> 
> 
> 
>>>> If, in the end, your argument is "the function tracer works as-is now,
>>>> and I have no time to change it given it represents too much work" or "I
>>>> don't care about your use-cases", I'm OK with that. But please then don't
>>>> argue that it's because it's the best technical solution when it isn't.
>>> No, I have yet to hear a valuable argument against stop_machine. You are 
>>> pushing the burden of proof on me, when we have something that does work, 
>>> on several archs. You want me to redesign the system to be x86 only, and 
>>> then say, hey, my original code works better.
>>>
>> stop_machine involves high interrupt latency. This is the argument I've
>> been repeating for 1-2 emails already. And I have to disagree with you :
>> we can do this code generically given the right abstractions
>> (BREAKPOINT_INSN* macros I proposed earlier). Is having something that
>> "works" your only argument to stop improving it ?
> 
> The high interrupt latency only happens at the time we need to hook the 
> functions. This does not mean it is the time to start the tracing. That 
> can be done separately.
> 
> Your only concern is the stop_machine latency? Then you might as well also 
> prevent modules, since that uses stop machine too. Again, this happens 
> only when the tracer hooks are added or removed. This is done at a time 
> the sys-admin will activate it. It is not a random latency that is 
> occurred by some timer or other asynchronous event.
> 
>>> I do not see text_poke being theoretically better. The only reason you 
>>> given me to use it is because you dislike stop_machine.
>>>
>> There is absolutely no link between stop_machine and text_poke. I argue
>> against stop_machine saying that the breakpoint approach is less
>> intrusive because it does not involve disabling interrupts for so long,
>> and I argue against modifying the kernel page flags because that
>> modifies the access rights of the core kernel and modules to RO
>> mappings, which is IMO a side-effect that we should eliminate _if we
>> can_. Please keep those two concerns separate.
> 
> text_poke can not be executed from stop_machine. There's the link. The two 
> concerns are not separate.
> 
> Your concern with stop_machine is that it will cause an interrupt latency 
> when the sysadmin enables or disables the functions. There exists other 
> interrupt latencies that can be worst that are asynchronous. Run hackbench 
> with the irqs off tracer and see for yourself.
> 
> -- Steve
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


[-- Attachment #2: text_poke-use-own-vmap-area.patch --]
[-- Type: text/plain, Size: 3146 bytes --]

Use map_vm_area() instead of vmap() in text_poke() for avoiding page allocation
and delayed unmapping.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
---
 arch/x86/include/asm/alternative.h |    1 +
 arch/x86/kernel/alternative.c      |   25 ++++++++++++++++++++-----
 init/main.c                        |    3 +++
 3 files changed, 24 insertions(+), 5 deletions(-)

Index: linux-2.6/arch/x86/include/asm/alternative.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/alternative.h
+++ linux-2.6/arch/x86/include/asm/alternative.h
@@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
  * The _early version expects the memory to already be RW.
  */
 
+extern void text_poke_init(void);
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 
Index: linux-2.6/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/alternative.c
+++ linux-2.6/arch/x86/kernel/alternative.c
@@ -485,6 +485,16 @@ void *text_poke_early(void *addr, const 
 	return addr;
 }
 
+static struct vm_struct *text_poke_area[2];
+static DEFINE_SPINLOCK(text_poke_lock);
+
+void __init text_poke_init(void)
+{
+	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
+	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
+	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
+}
+
 /**
  * text_poke - Update instructions on a live kernel
  * @addr: address to modify
@@ -501,8 +511,9 @@ void *__kprobes text_poke(void *addr, co
 	unsigned long flags;
 	char *vaddr;
 	int nr_pages = 2;
-	struct page *pages[2];
-	int i;
+	struct page *pages[2], **pgp = pages;
+	int i, ret;
+	struct vm_struct *vma;
 
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
@@ -515,12 +526,16 @@ void *__kprobes text_poke(void *addr, co
 	BUG_ON(!pages[0]);
 	if (!pages[1])
 		nr_pages = 1;
-	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
-	BUG_ON(!vaddr);
+	spin_lock(&text_poke_lock);
+	vma = text_poke_area[nr_pages-1];
+	ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
+	BUG_ON(ret);
+	vaddr = vma->addr;
 	local_irq_save(flags);
 	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
 	local_irq_restore(flags);
-	vunmap(vaddr);
+	unmap_kernel_range((unsigned long)vma->addr, (unsigned long)vma->size);
+	spin_unlock(&text_poke_lock);
 	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
@@ -528,3 +543,4 @@ void *__kprobes text_poke(void *addr, co
 		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
 	return addr;
 }
+EXPORT_SYMBOL_GPL(text_poke);
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
 	taskstats_init_early();
 	delayacct_init();
 
+#ifdef CONFIG_X86
+	text_poke_init();
+#endif
 	check_bugs();
 
 	acpi_early_init(); /* before LAPIC and SMP init */

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-27 17:52                           ` Masami Hiramatsu
@ 2009-02-27 18:07                             ` Mathieu Desnoyers
  2009-02-27 18:34                               ` Masami Hiramatsu
  0 siblings, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-02-27 18:07 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Andi Kleen, linux-kernel, Ingo Molnar,
	Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Steven Rostedt wrote:
> > On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> >>> Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
> >>> since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
> >>> modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
> >>> the bits are only done when CONFIG_DEBUG_RODATA is set.
> >>>
> >>> text_poke requires allocating a page. Map the page into memory. Set up a 
> >>> break point.
> >> text_poke does not _require_ a break point. text_poke can work with
> >> stop_machine.
> > 
> > It can? Doesn't text_poke require allocating pages? The code called by 
> > stop_machine is all atomic. vmap does not give an option to allocate with 
> > GFP_ATOMIC.
> 
> Hi,
> 
> With my patch, text_poke() never allocate pages any more :)
> 
> BTW, IMHO, both of your methods are useful and have trade-off.
> 
> ftrace wants to change massive amount of code at once. If we do
> that with text_poke(), we have to map/unmap pages each time and
> it will take a long time -- might be longer than one stop_machine_run().
> 
> On the other hand, text_poke() user like as kprobes and tracepoints,
> just want to change a few amount of code at once, and it will be
> added/removed incrementally. If we do that with stop_machine_run(),
> we'll be annoyed by frequent machine stops.(Moreover, kprobes uses
> breakpoint, so it doesn't need stop_machine_run())
> 

Hi Masami,

Is this text_poke version executable in atomic context? If yes, then
it would be good to add a comment saying so. Please see below for
comments.

> 
> Thank you,
> 
[...]
> Use map_vm_area() instead of vmap() in text_poke() for avoiding page allocation
> and delayed unmapping.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> ---
>  arch/x86/include/asm/alternative.h |    1 +
>  arch/x86/kernel/alternative.c      |   25 ++++++++++++++++++++-----
>  init/main.c                        |    3 +++
>  3 files changed, 24 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6/arch/x86/include/asm/alternative.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/alternative.h
> +++ linux-2.6/arch/x86/include/asm/alternative.h
> @@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
>   * The _early version expects the memory to already be RW.
>   */
>  
> +extern void text_poke_init(void);
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  
> Index: linux-2.6/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/alternative.c
> +++ linux-2.6/arch/x86/kernel/alternative.c
> @@ -485,6 +485,16 @@ void *text_poke_early(void *addr, const 
>  	return addr;
>  }
>  
> +static struct vm_struct *text_poke_area[2];
> +static DEFINE_SPINLOCK(text_poke_lock);
> +
> +void __init text_poke_init(void)
> +{
> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);

Why is this text_poke_area[1] 2 * PAGE_SIZE in size? I would have
thought that text_poke_area[0] would be PAGE_SIZE, that text_poke_area[1]
would also be PAGE_SIZE, and that the sum of both would be 2 * PAGE_SIZE.

> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> +}
> +
>  /**
>   * text_poke - Update instructions on a live kernel
>   * @addr: address to modify
> @@ -501,8 +511,9 @@ void *__kprobes text_poke(void *addr, co
>  	unsigned long flags;
>  	char *vaddr;
>  	int nr_pages = 2;
> -	struct page *pages[2];
> -	int i;
> +	struct page *pages[2], **pgp = pages;

Hrm, why do you need **pgp? Could you simply pass &pages to map_vm_area?

Thanks,

Mathieu

> +	int i, ret;
> +	struct vm_struct *vma;
>  
>  	if (!core_kernel_text((unsigned long)addr)) {
>  		pages[0] = vmalloc_to_page(addr);
> @@ -515,12 +526,16 @@ void *__kprobes text_poke(void *addr, co
>  	BUG_ON(!pages[0]);
>  	if (!pages[1])
>  		nr_pages = 1;
> -	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> -	BUG_ON(!vaddr);
> +	spin_lock(&text_poke_lock);
> +	vma = text_poke_area[nr_pages-1];
> +	ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
> +	BUG_ON(ret);
> +	vaddr = vma->addr;
>  	local_irq_save(flags);
>  	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
>  	local_irq_restore(flags);
> -	vunmap(vaddr);
> +	unmap_kernel_range((unsigned long)vma->addr, (unsigned long)vma->size);
> +	spin_unlock(&text_poke_lock);
>  	sync_core();
>  	/* Could also do a CLFLUSH here to speed up CPU recovery; but
>  	   that causes hangs on some VIA CPUs. */
> @@ -528,3 +543,4 @@ void *__kprobes text_poke(void *addr, co
>  		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
>  	return addr;
>  }
> +EXPORT_SYMBOL_GPL(text_poke);
> Index: linux-2.6/init/main.c
> ===================================================================
> --- linux-2.6.orig/init/main.c
> +++ linux-2.6/init/main.c
> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>  	taskstats_init_early();
>  	delayacct_init();
>  
> +#ifdef CONFIG_X86
> +	text_poke_init();
> +#endif
>  	check_bugs();
>  
>  	acpi_early_init(); /* before LAPIC and SMP init */


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-27 18:07                             ` Mathieu Desnoyers
@ 2009-02-27 18:34                               ` Masami Hiramatsu
  2009-02-27 18:53                                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-02-27 18:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Andi Kleen, linux-kernel, Ingo Molnar,
	Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> Steven Rostedt wrote:
>>> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
>>>>> Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
>>>>> since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
>>>>> modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
>>>>> the bits are only done when CONFIG_DEBUG_RODATA is set.
>>>>>
>>>>> text_poke requires allocating a page. Map the page into memory. Set up a 
>>>>> break point.
>>>> text_poke does not _require_ a break point. text_poke can work with
>>>> stop_machine.
>>> It can? Doesn't text_poke require allocating pages? The code called by 
>>> stop_machine is all atomic. vmap does not give an option to allocate with 
>>> GFP_ATOMIC.
>> Hi,
>>
>> With my patch, text_poke() never allocate pages any more :)
>>
>> BTW, IMHO, both of your methods are useful and have trade-off.
>>
>> ftrace wants to change massive amount of code at once. If we do
>> that with text_poke(), we have to map/unmap pages each time and
>> it will take a long time -- might be longer than one stop_machine_run().
>>
>> On the other hand, text_poke() user like as kprobes and tracepoints,
>> just want to change a few amount of code at once, and it will be
>> added/removed incrementally. If we do that with stop_machine_run(),
>> we'll be annoyed by frequent machine stops.(Moreover, kprobes uses
>> breakpoint, so it doesn't need stop_machine_run())
>>
> 
> Hi Masami,
> 
> Is this text_poke version executable in atomic context ? If yes, then
> that would be good to add a comment saying it. Please see below for
> comments.

Thank you for the comments!
I think it could be. Ah, spin_lock might be changed to spin_lock_irqsave()...

>> Thank you,
>>
> [...]
>> Use map_vm_area() instead of vmap() in text_poke() for avoiding page allocation
>> and delayed unmapping.
>>
>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
>> ---
>>  arch/x86/include/asm/alternative.h |    1 +
>>  arch/x86/kernel/alternative.c      |   25 ++++++++++++++++++++-----
>>  init/main.c                        |    3 +++
>>  3 files changed, 24 insertions(+), 5 deletions(-)
>>
>> Index: linux-2.6/arch/x86/include/asm/alternative.h
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/include/asm/alternative.h
>> +++ linux-2.6/arch/x86/include/asm/alternative.h
>> @@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
>>   * The _early version expects the memory to already be RW.
>>   */
>>  
>> +extern void text_poke_init(void);
>>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>  
>> Index: linux-2.6/arch/x86/kernel/alternative.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/alternative.c
>> +++ linux-2.6/arch/x86/kernel/alternative.c
>> @@ -485,6 +485,16 @@ void *text_poke_early(void *addr, const 
>>  	return addr;
>>  }
>>  
>> +static struct vm_struct *text_poke_area[2];
>> +static DEFINE_SPINLOCK(text_poke_lock);
>> +
>> +void __init text_poke_init(void)
>> +{
>> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
>> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> 
> Why is this text_poke_area[1] 2 * PAGE_SIZE in size ? I would have
> thought that text_poke_area[0] would be PAGE_SIZE, text_poke_area[1]
> also be PAGE_SIZE, and that the sum of both would be 2 * PAGE_SIZE..

Unfortunately, the current map_vm_area() tries to map the full size of
the vm_area; this means you can't use a 2-page vm_area for mapping just
1 page... (or maybe we can set pages[1] = pages[0] when the 2nd page
doesn't exist, as sketched below)

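A rough sketch of that aliasing idea (not part of the patch; written
against the pages[] array already in text_poke()):

---
	/* Hypothetical: reuse the 2-page vm_area for a 1-page poke by
	 * mapping the lone page into both slots. */
	if (!pages[1])
		pages[1] = pages[0];
---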

>> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
>> +}
>> +
>>  /**
>>   * text_poke - Update instructions on a live kernel
>>   * @addr: address to modify
>> @@ -501,8 +511,9 @@ void *__kprobes text_poke(void *addr, co
>>  	unsigned long flags;
>>  	char *vaddr;
>>  	int nr_pages = 2;
>> -	struct page *pages[2];
>> -	int i;
>> +	struct page *pages[2], **pgp = pages;
> 
> Hrm, why do you need **pgp ? Could you simply pass &pages to map_vm_area ?

As you know, pages is just the address (value) of an array, so you
can't get the address of the address... (pages and &pages give the same
value.)

        int array[2];
        printf("%p, %p\n", (void *)array, (void *)&array);

please try it :)
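
For anyone who wants to try it, a self-contained version of that
experiment (with the type distinction spelled out in the comment):

---
#include <stdio.h>

int main(void)
{
	int array[2];

	/* Same address value, but different types: array decays to
	 * int *, while &array is int (*)[2]. This is why &pages cannot
	 * serve as the struct page *** that map_vm_area() wants, and
	 * why a separate pgp pointer variable is needed. */
	printf("%p, %p\n", (void *)array, (void *)&array);
	return 0;
}
---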

And actually, map_vm_area() requires the address of a pointer.
---
int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
{
        unsigned long addr = (unsigned long)area->addr;
        unsigned long end = addr + area->size - PAGE_SIZE;
        int err;

        err = vmap_page_range(addr, end, prot, *pages);
        if (err > 0) {
                *pages += err;
                ^^^^^^^^^^^^^^ Here, it tries to add err(=number of mapped pages)
                               to the pages pointer!
                err = 0;
        }

        return err;
}
---


> 
> Thanks,
> 
> Mathieu
> 
>> +	int i, ret;
>> +	struct vm_struct *vma;
>>  
>>  	if (!core_kernel_text((unsigned long)addr)) {
>>  		pages[0] = vmalloc_to_page(addr);
>> @@ -515,12 +526,16 @@ void *__kprobes text_poke(void *addr, co
>>  	BUG_ON(!pages[0]);
>>  	if (!pages[1])
>>  		nr_pages = 1;
>> -	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
>> -	BUG_ON(!vaddr);
>> +	spin_lock(&text_poke_lock);
>> +	vma = text_poke_area[nr_pages-1];
>> +	ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
>> +	BUG_ON(ret);
>> +	vaddr = vma->addr;
>>  	local_irq_save(flags);
>>  	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
>>  	local_irq_restore(flags);
>> -	vunmap(vaddr);
>> +	unmap_kernel_range((unsigned long)vma->addr, (unsigned long)vma->size);
>> +	spin_unlock(&text_poke_lock);
>>  	sync_core();
>>  	/* Could also do a CLFLUSH here to speed up CPU recovery; but
>>  	   that causes hangs on some VIA CPUs. */
>> @@ -528,3 +543,4 @@ void *__kprobes text_poke(void *addr, co
>>  		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
>>  	return addr;
>>  }
>> +EXPORT_SYMBOL_GPL(text_poke);
>> Index: linux-2.6/init/main.c
>> ===================================================================
>> --- linux-2.6.orig/init/main.c
>> +++ linux-2.6/init/main.c
>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>>  	taskstats_init_early();
>>  	delayacct_init();
>>  
>> +#ifdef CONFIG_X86
>> +	text_poke_init();
>> +#endif
>>  	check_bugs();
>>  
>>  	acpi_early_init(); /* before LAPIC and SMP init */
> 
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-27 18:34                               ` Masami Hiramatsu
@ 2009-02-27 18:53                                 ` Mathieu Desnoyers
  2009-02-27 20:57                                   ` Masami Hiramatsu
  0 siblings, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-02-27 18:53 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Andi Kleen, linux-kernel, Ingo Molnar,
	Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Mathieu Desnoyers wrote:
> > * Masami Hiramatsu (mhiramat@redhat.com) wrote:
> >> Steven Rostedt wrote:
> >>> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
> >>>>> Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
> >>>>> since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
> >>>>> modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
> >>>>> the bits are only done when CONFIG_DEBUG_RODATA is set.
> >>>>>
> >>>>> text_poke requires allocating a page. Map the page into memory. Set up a 
> >>>>> break point.
> >>>> text_poke does not _require_ a break point. text_poke can work with
> >>>> stop_machine.
> >>> It can? Doesn't text_poke require allocating pages? The code called by 
> >>> stop_machine is all atomic. vmap does not give an option to allocate with 
> >>> GFP_ATOMIC.
> >> Hi,
> >>
> >> With my patch, text_poke() never allocate pages any more :)
> >>
> >> BTW, IMHO, both of your methods are useful and have trade-off.
> >>
> >> ftrace wants to change massive amount of code at once. If we do
> >> that with text_poke(), we have to map/unmap pages each time and
> >> it will take a long time -- might be longer than one stop_machine_run().
> >>
> >> On the other hand, text_poke() user like as kprobes and tracepoints,
> >> just want to change a few amount of code at once, and it will be
> >> added/removed incrementally. If we do that with stop_machine_run(),
> >> we'll be annoyed by frequent machine stops.(Moreover, kprobes uses
> >> breakpoint, so it doesn't need stop_machine_run())
> >>
> > 
> > Hi Masami,
> > 
> > Is this text_poke version executable in atomic context ? If yes, then
> > that would be good to add a comment saying it. Please see below for
> > comments.
> 
> Thank you for comments!
> I think it could be. ah, spin_lock might be changed to spin_lock_irqsave()...
> 

You are right. If we plan to execute this in both atomic and non-atomic
context, spin_lock_irqsave would make sure we are always busy-looping
with interrupts off.

Having a spinlock taken in _both_ interrupts-on and interrupts-off contexts
leads to higher interrupt latencies, since an interrupts-off waiter can spin
while the lock holder runs with interrupts on.
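
For reference, a minimal sketch of the irqsave form being discussed here (the
lock name matches the patch quoted below):

	unsigned long flags;

	spin_lock_irqsave(&text_poke_lock, flags);	/* usable from atomic context */
	/* ... map the pages, poke the code, unmap ... */
	spin_unlock_irqrestore(&text_poke_lock, flags);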


> >> Thank you,
> >>
> > [...]
> >> Use map_vm_area() instead of vmap() in text_poke() for avoiding page allocation
> >> and delayed unmapping.
> >>
> >> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> >> ---
> >>  arch/x86/include/asm/alternative.h |    1 +
> >>  arch/x86/kernel/alternative.c      |   25 ++++++++++++++++++++-----
> >>  init/main.c                        |    3 +++
> >>  3 files changed, 24 insertions(+), 5 deletions(-)
> >>
> >> Index: linux-2.6/arch/x86/include/asm/alternative.h
> >> ===================================================================
> >> --- linux-2.6.orig/arch/x86/include/asm/alternative.h
> >> +++ linux-2.6/arch/x86/include/asm/alternative.h
> >> @@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
> >>   * The _early version expects the memory to already be RW.
> >>   */
> >>  
> >> +extern void text_poke_init(void);
> >>  extern void *text_poke(void *addr, const void *opcode, size_t len);
> >>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>  
> >> Index: linux-2.6/arch/x86/kernel/alternative.c
> >> ===================================================================
> >> --- linux-2.6.orig/arch/x86/kernel/alternative.c
> >> +++ linux-2.6/arch/x86/kernel/alternative.c
> >> @@ -485,6 +485,16 @@ void *text_poke_early(void *addr, const 
> >>  	return addr;
> >>  }
> >>  
> >> +static struct vm_struct *text_poke_area[2];
> >> +static DEFINE_SPINLOCK(text_poke_lock);
> >> +
> >> +void __init text_poke_init(void)
> >> +{
> >> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> >> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> > 
> > Why is this text_poke_area[1] 2 * PAGE_SIZE in size ? I would have
> > thought that text_poke_area[0] would be PAGE_SIZE, text_poke_area[1]
> > also be PAGE_SIZE, and that the sum of both would be 2 * PAGE_SIZE..
> 
> Unfortunately, current map_vm_area() tries to map the size of vm_area,
> this means, you can't use 2page-size vm_area for mapping just 1 page...
> (or maybe, we can set pages[1] = pages[0] when 2nd page doesn't exist)
> 

OK, given we sometimes have to map only a single page (e.g. at the end
of a text section), we really need both the 1-page and the 2-page mapping.
So I think your solution is good.
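
For reference, the resulting setup from the patch under discussion: one
pre-allocated vm_area per mapping size, selected by the number of pages a
poke actually touches:

	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);	   /* 1-page pokes */
	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC); /* cross-page pokes */
	...
	vma = text_poke_area[nr_pages - 1];	/* pick the matching-size area */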

> 
> >> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> >> +}
> >> +
> >>  /**
> >>   * text_poke - Update instructions on a live kernel
> >>   * @addr: address to modify
> >> @@ -501,8 +511,9 @@ void *__kprobes text_poke(void *addr, co
> >>  	unsigned long flags;
> >>  	char *vaddr;
> >>  	int nr_pages = 2;
> >> -	struct page *pages[2];
> >> -	int i;
> >> +	struct page *pages[2], **pgp = pages;
> > 
> > Hrm, why do you need **pgp ? Could you simply pass &pages to map_vm_area ?
> 
> As you know, pages means just the address(value) of an array, so you can't
> get the address of the address...(pages and &pages are same.)
> 
>         int array[2];
>         printf("%p, %p",array, &array);
> 
> please try it :)
> 
> And actually, map_vm_area() requires the address of a pointer.

Ah yes, thanks for the explanation.
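
(To spell out the types involved, a quick sketch:

	struct page *pages[2];		/* decays to struct page ** in expressions */
	struct page **pgp = pages;	/* a real pointer object */

	map_vm_area(vma, PAGE_KERNEL, &pgp);	/* &pgp has type struct page *** */

&pages would have type struct page *(*)[2], not struct page ***, even though
it holds the same address, which is why the intermediate pgp is needed.)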

After changing the spinlock/irqsave, I think that patch would be good to
merge. And then Steve could use text_poke within stop_machine if he
likes.

Mathieu

> ---
> int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
> {
>         unsigned long addr = (unsigned long)area->addr;
>         unsigned long end = addr + area->size - PAGE_SIZE;
>         int err;
> 
>         err = vmap_page_range(addr, end, prot, *pages);
>         if (err > 0) {
>                 *pages += err;
>                 ^^^^^^^^^^^^^^ Here, it tries to add err(=number of mapped pages)
>                                to the pages pointer!
>                 err = 0;
>         }
> 
>         return err;
> }
> ---
> 
> 
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> >> +	int i, ret;
> >> +	struct vm_struct *vma;
> >>  
> >>  	if (!core_kernel_text((unsigned long)addr)) {
> >>  		pages[0] = vmalloc_to_page(addr);
> >> @@ -515,12 +526,16 @@ void *__kprobes text_poke(void *addr, co
> >>  	BUG_ON(!pages[0]);
> >>  	if (!pages[1])
> >>  		nr_pages = 1;
> >> -	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> >> -	BUG_ON(!vaddr);
> >> +	spin_lock(&text_poke_lock);
> >> +	vma = text_poke_area[nr_pages-1];
> >> +	ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
> >> +	BUG_ON(ret);
> >> +	vaddr = vma->addr;
> >>  	local_irq_save(flags);
> >>  	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> >>  	local_irq_restore(flags);
> >> -	vunmap(vaddr);
> >> +	unmap_kernel_range((unsigned long)vma->addr, (unsigned long)vma->size);
> >> +	spin_unlock(&text_poke_lock);
> >>  	sync_core();
> >>  	/* Could also do a CLFLUSH here to speed up CPU recovery; but
> >>  	   that causes hangs on some VIA CPUs. */
> >> @@ -528,3 +543,4 @@ void *__kprobes text_poke(void *addr, co
> >>  		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
> >>  	return addr;
> >>  }
> >> +EXPORT_SYMBOL_GPL(text_poke);
> >> Index: linux-2.6/init/main.c
> >> ===================================================================
> >> --- linux-2.6.orig/init/main.c
> >> +++ linux-2.6/init/main.c
> >> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
> >>  	taskstats_init_early();
> >>  	delayacct_init();
> >>  
> >> +#ifdef CONFIG_X86
> >> +	text_poke_init();
> >> +#endif
> >>  	check_bugs();
> >>  
> >>  	acpi_early_init(); /* before LAPIC and SMP init */
> > 
> > 
> 
> -- 
> Masami Hiramatsu
> 
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
> 
> e-mail: mhiramat@redhat.com
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-27 18:53                                 ` Mathieu Desnoyers
@ 2009-02-27 20:57                                   ` Masami Hiramatsu
  2009-03-02 17:01                                     ` [RFC][PATCH] x86: make text_poke() atomic Masami Hiramatsu
  0 siblings, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-02-27 20:57 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Andi Kleen, linux-kernel, Ingo Molnar,
	Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>> Steven Rostedt wrote:
>>>>> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
>>>>>>> Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
>>>>>>> since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
>>>>>>> modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
>>>>>>> the bits are only done when CONFIG_DEBUG_RODATA is set.
>>>>>>>
>>>>>>> text_poke requires allocating a page. Map the page into memory. Set up a 
>>>>>>> break point.
>>>>>> text_poke does not _require_ a break point. text_poke can work with
>>>>>> stop_machine.
>>>>> It can? Doesn't text_poke require allocating pages? The code called by 
>>>>> stop_machine is all atomic. vmap does not give an option to allocate with 
>>>>> GFP_ATOMIC.
>>>> Hi,
>>>>
>>>> With my patch, text_poke() never allocate pages any more :)
>>>>
>>>> BTW, IMHO, both of your methods are useful and have trade-off.
>>>>
>>>> ftrace wants to change massive amount of code at once. If we do
>>>> that with text_poke(), we have to map/unmap pages each time and
>>>> it will take a long time -- might be longer than one stop_machine_run().
>>>>
>>>> On the other hand, text_poke() user like as kprobes and tracepoints,
>>>> just want to change a few amount of code at once, and it will be
>>>> added/removed incrementally. If we do that with stop_machine_run(),
>>>> we'll be annoyed by frequent machine stops.(Moreover, kprobes uses
>>>> breakpoint, so it doesn't need stop_machine_run())
>>>>
>>> Hi Masami,
>>>
>>> Is this text_poke version executable in atomic context ? If yes, then
>>> that would be good to add a comment saying it. Please see below for
>>> comments.
>> Thank you for comments!
>> I think it could be. ah, spin_lock might be changed to spin_lock_irqsave()...
>>
> 
> You are right. If we plan to execute this in both atomic and non-atomic
> context, spin_lock_irqsave would make sure we are always busy-looping
> with interrupts off.

Oops, when I tested spin_lock_irqsave, it caused warnings.

------------[ cut here ]------------
WARNING: at /home/mhiramat/ksrc/linux-2.6/kernel/smp.c:329 smp_call_function_man
y+0x37/0x1c9()
Hardware name: Precision WorkStation T5400
Modules linked in:
Pid: 1, comm: swapper Tainted: G        W  2.6.29-rc6 #16
Call Trace:
 [<c042f7b1>] warn_slowpath+0x71/0xa8
 [<c044dccb>] ? trace_hardirqs_on_caller+0x18/0x145
 [<c06dc42f>] ? _spin_unlock_irq+0x22/0x2f
 [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
 [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
 [<c044cfbb>] ? trace_hardirqs_off_caller+0x18/0xa3
 [<c045383b>] smp_call_function_many+0x37/0x1c9
 [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
 [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
 [<c04539e9>] smp_call_function+0x1c/0x23
 [<c0433ee1>] on_each_cpu+0xf/0x3a
 [<c04138c6>] flush_tlb_all+0x14/0x16
 [<c04946f7>] unmap_kernel_range+0xf/0x11
 [<c06dd78a>] text_poke+0xf1/0x12c

unmap_kernel_range() requires irqs to be enabled, but spin_lock_irqsave()
(and stop_machine() too) disables irqs, so we have to solve this issue.

I have some ideas:

- export (just remove static) vunmap_page_range() and don't use
  flush_tlb_all().
 Because this vm_area is not used by other cpus, we don't need
 to flush the TLB of all cpus.

- add an unmap_kernel_range_local() function (a sketch of the first two
  ideas combined follows below).

- extend kmap_atomic_prot() to map a lowmem page when the 'prot'
  is different.
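
A rough sketch of the first two ideas combined (assuming vunmap_page_range()
is exported; unmap_kernel_range_local() is a hypothetical helper, not an
existing kernel function):

	/* Unmap a temporary, CPU-local kernel mapping without the
	 * IPI-based flush_tlb_all() that unmap_kernel_range() does. */
	void unmap_kernel_range_local(unsigned long addr, unsigned long size)
	{
		flush_cache_vunmap(addr, addr + size);
		vunmap_page_range(addr, addr + size);
		local_flush_tlb();	/* only this CPU ever saw the mapping */
	}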


Thanks,

> 
> Having a spinlock taken in _both_ interrupts-on and interrupts-off contexts
> leads to higher interrupt latencies, since an interrupts-off waiter can spin
> while the lock holder runs with interrupts on.
> 
> 
>>>> Thank you,
>>>>
>>> [...]
>>>> Use map_vm_area() instead of vmap() in text_poke() for avoiding page allocation
>>>> and delayed unmapping.
>>>>
>>>> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
>>>> ---
>>>>  arch/x86/include/asm/alternative.h |    1 +
>>>>  arch/x86/kernel/alternative.c      |   25 ++++++++++++++++++++-----
>>>>  init/main.c                        |    3 +++
>>>>  3 files changed, 24 insertions(+), 5 deletions(-)
>>>>
>>>> Index: linux-2.6/arch/x86/include/asm/alternative.h
>>>> ===================================================================
>>>> --- linux-2.6.orig/arch/x86/include/asm/alternative.h
>>>> +++ linux-2.6/arch/x86/include/asm/alternative.h
>>>> @@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
>>>>   * The _early version expects the memory to already be RW.
>>>>   */
>>>>  
>>>> +extern void text_poke_init(void);
>>>>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>>>>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>>>  
>>>> Index: linux-2.6/arch/x86/kernel/alternative.c
>>>> ===================================================================
>>>> --- linux-2.6.orig/arch/x86/kernel/alternative.c
>>>> +++ linux-2.6/arch/x86/kernel/alternative.c
>>>> @@ -485,6 +485,16 @@ void *text_poke_early(void *addr, const 
>>>>  	return addr;
>>>>  }
>>>>  
>>>> +static struct vm_struct *text_poke_area[2];
>>>> +static DEFINE_SPINLOCK(text_poke_lock);
>>>> +
>>>> +void __init text_poke_init(void)
>>>> +{
>>>> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
>>>> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
>>> Why is this text_poke_area[1] 2 * PAGE_SIZE in size ? I would have
>>> thought that text_poke_area[0] would be PAGE_SIZE, text_poke_area[1]
>>> also be PAGE_SIZE, and that the sum of both would be 2 * PAGE_SIZE..
>> Unfortunately, current map_vm_area() tries to map the size of vm_area,
>> this means, you can't use 2page-size vm_area for mapping just 1 page...
>> (or maybe, we can set pages[1] = pages[0] when 2nd page doesn't exist)
>>
> 
> OK, given we sometimes have to map only a single page (e.g. at the end
> of a text section), we really need both the 1-page and the 2-page mapping.
> So I think your solution is good.
> 
>>>> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
>>>> +}
>>>> +
>>>>  /**
>>>>   * text_poke - Update instructions on a live kernel
>>>>   * @addr: address to modify
>>>> @@ -501,8 +511,9 @@ void *__kprobes text_poke(void *addr, co
>>>>  	unsigned long flags;
>>>>  	char *vaddr;
>>>>  	int nr_pages = 2;
>>>> -	struct page *pages[2];
>>>> -	int i;
>>>> +	struct page *pages[2], **pgp = pages;
>>> Hrm, why do you need **pgp ? Could you simply pass &pages to map_vm_area ?
>> As you know, pages means just the address(value) of an array, so you can't
>> get the address of the address...(pages and &pages are same.)
>>
>>         int array[2];
>>         printf("%p, %p",array, &array);
>>
>> please try it :)
>>
>> And actually, map_vm_area() requires the address of a pointer.
> 
> Ah yes, thanks for the explanation.
> 
> After changing the spinlock/irqsave, I think that patch would be good to
> merge. And then Steve could use text_poke within stop_machine if he
> likes.
> 
> Mathieu
> 
>> ---
>> int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
>> {
>>         unsigned long addr = (unsigned long)area->addr;
>>         unsigned long end = addr + area->size - PAGE_SIZE;
>>         int err;
>>
>>         err = vmap_page_range(addr, end, prot, *pages);
>>         if (err > 0) {
>>                 *pages += err;
>>                 ^^^^^^^^^^^^^^ Here, it tries to add err(=number of mapped pages)
>>                                to the pages pointer!
>>                 err = 0;
>>         }
>>
>>         return err;
>> }
>> ---
>>
>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>> +	int i, ret;
>>>> +	struct vm_struct *vma;
>>>>  
>>>>  	if (!core_kernel_text((unsigned long)addr)) {
>>>>  		pages[0] = vmalloc_to_page(addr);
>>>> @@ -515,12 +526,16 @@ void *__kprobes text_poke(void *addr, co
>>>>  	BUG_ON(!pages[0]);
>>>>  	if (!pages[1])
>>>>  		nr_pages = 1;
>>>> -	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
>>>> -	BUG_ON(!vaddr);
>>>> +	spin_lock(&text_poke_lock);
>>>> +	vma = text_poke_area[nr_pages-1];
>>>> +	ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
>>>> +	BUG_ON(ret);
>>>> +	vaddr = vma->addr;
>>>>  	local_irq_save(flags);
>>>>  	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
>>>>  	local_irq_restore(flags);
>>>> -	vunmap(vaddr);
>>>> +	unmap_kernel_range((unsigned long)vma->addr, (unsigned long)vma->size);
>>>> +	spin_unlock(&text_poke_lock);
>>>>  	sync_core();
>>>>  	/* Could also do a CLFLUSH here to speed up CPU recovery; but
>>>>  	   that causes hangs on some VIA CPUs. */
>>>> @@ -528,3 +543,4 @@ void *__kprobes text_poke(void *addr, co
>>>>  		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
>>>>  	return addr;
>>>>  }
>>>> +EXPORT_SYMBOL_GPL(text_poke);
>>>> Index: linux-2.6/init/main.c
>>>> ===================================================================
>>>> --- linux-2.6.orig/init/main.c
>>>> +++ linux-2.6/init/main.c
>>>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>>>>  	taskstats_init_early();
>>>>  	delayacct_init();
>>>>  
>>>> +#ifdef CONFIG_X86
>>>> +	text_poke_init();
>>>> +#endif
>>>>  	check_bugs();
>>>>  
>>>>  	acpi_early_init(); /* before LAPIC and SMP init */
>>>
>> -- 
>> Masami Hiramatsu
>>
>> Software Engineer
>> Hitachi Computer Products (America) Inc.
>> Software Solutions Division
>>
>> e-mail: mhiramat@redhat.com
>>
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-22 17:50   ` [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions Andi Kleen
  2009-02-22 22:53     ` Steven Rostedt
@ 2009-02-27 21:08     ` Pavel Machek
  2009-02-28 16:56       ` Andi Kleen
  1 sibling, 1 reply; 89+ messages in thread
From: Pavel Machek @ 2009-02-27 21:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

On Sun 2009-02-22 18:50:00, Andi Kleen wrote:
> Steven Rostedt <rostedt@goodmis.org> writes:
> 
> > From: Steven Rostedt <srostedt@redhat.com>
> >
> > Impact: keep kernel text read only
> >
> > Because dynamic ftrace converts the calls to mcount into and out of
> > nops at run time, we needed to always keep the kernel text writable.
> >
> > But this defeats the point of CONFIG_DEBUG_RODATA. This patch converts
> > the kernel code to writable before ftrace modifies the text, and converts
> > it back to read only afterward.
> >
> > The conversion is done via stop_machine and no IPIs may be executed
> > at that time. The kernel text is set to write just before calling
> > stop_machine and set to read only again right afterward.
> 
> The very old text poke code I had for this just used a dynamic
> mapping elsewhere instead to modify the code. That's much less
> intrusive than changing the complete mappings. Any reason you can't use 
> that too?

Is it legal to have two mappings of same page with different
attributes? IIRC some processors did not like that...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-27 21:08     ` Pavel Machek
@ 2009-02-28 16:56       ` Andi Kleen
  2009-02-28 22:08         ` Pavel Machek
  0 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2009-02-28 16:56 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andi Kleen, Steven Rostedt, linux-kernel, Ingo Molnar,
	Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

On Fri, Feb 27, 2009 at 10:08:56PM +0100, Pavel Machek wrote:
> On Sun 2009-02-22 18:50:00, Andi Kleen wrote:
> > Steven Rostedt <rostedt@goodmis.org> writes:
> > 
> > > From: Steven Rostedt <srostedt@redhat.com>
> > >
> > > Impact: keep kernel text read only
> > >
> > > Because dynamic ftrace converts the calls to mcount into and out of
> > > nops at run time, we needed to always keep the kernel text writable.
> > >
> > > But this defeats the point of CONFIG_DEBUG_RODATA. This patch converts
> > > the kernel code to writable before ftrace modifies the text, and converts
> > > it back to read only afterward.
> > >
> > > The conversion is done via stop_machine and no IPIs may be executed
> > > at that time. The kernel text is set to write just before calling
> > > stop_machine and set to read only again right afterward.
> > 
> > The very old text poke code I had for this just used a dynamic
> > mapping elsewhere instead to modify the code. That's much less
> > intrusive than changing the complete mappings. Any reason you can't use 
> > that too?
> 
> Is it legal to have two mappings of same page with different
> attributes? IIRC some processors did not like that...

If you mean PAT caching attributes: correct, it is not legal on x86 and
causes problems including data corruption.
If you mean other attributes, like large page vs. small page: it's normally
legal, with a few exceptions.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-28 16:56       ` Andi Kleen
@ 2009-02-28 22:08         ` Pavel Machek
       [not found]           ` <87wsba1a9f.fsf@basil.nowhere.org>
  0 siblings, 1 reply; 89+ messages in thread
From: Pavel Machek @ 2009-02-28 22:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell,
	Mathieu Desnoyers, H. Peter Anvin, Steven Rostedt

> > > The very old text poke code I had for this just used a dynamic
> > > mapping elsewhere instead to modify the code. That's much less
> > > intrusive than changing the complete mappings. Any reason you can't use 
> > > that too?
> > 
> > Is it legal to have two mappings of same page with different
> > attributes? IIRC some processors did not like that...
> 
> If you mean PAT caching attributes: correct, it is not legal on x86 and
> causes problems including data corruption.

Aha, PAT is what I remembered on x86-64.

> If you mean other attributes, like large page vs. small page: it's normally
> legal, with a few exceptions.

...but is it okay on other architectures, like sparc, with funky cache
setups?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
       [not found]           ` <87wsba1a9f.fsf@basil.nowhere.org>
@ 2009-02-28 22:19             ` Pavel Machek
  2009-02-28 23:52               ` Andi Kleen
  0 siblings, 1 reply; 89+ messages in thread
From: Pavel Machek @ 2009-02-28 22:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Sat 2009-02-28 23:12:28, Andi Kleen wrote:
> Pavel Machek <pavel@ucw.cz> writes:
> 
> [trimmed cc list]
> 
> >> If you mean other attributes like large page vs small page: it's normally legal,
> >> with a few exceptions.
> >
> > ...but is it okay on other architectures, like sparc, with funky cache
> > setups?
> 
> I don't know for sure (how about you look it up in a sparc manual?), but
> I would assume it's also not safe there. I know it's not allowed on IA64
> at least, with some even stricter rules than on x86.

So using aliases for kernel text rewriting is a bad idea because it
would break ia64? That was what the thread was about. (Sorry, no ia64
or sparc manuals nearby.)
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 4/6] ftrace, x86: make kernel text writable only for conversions
  2009-02-28 22:19             ` Pavel Machek
@ 2009-02-28 23:52               ` Andi Kleen
  0 siblings, 0 replies; 89+ messages in thread
From: Andi Kleen @ 2009-02-28 23:52 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andi Kleen, linux-kernel

> So using aliases for kernel text rewriting is a bad idea because it
> would break ia64?

Nope.

First, what you do on x86 kernels doesn't matter to ia64.
And then, kernel text rewriting doesn't require any illegal aliases anyway.
It would only be a problem if someone set the kernel text uncached,
which no sane person would do.

> That was what the thread was about. (Sorry, no ia64
> or sparc manuals nearby).

google knows them all.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [RFC][PATCH] x86: make text_poke() atomic
  2009-02-27 20:57                                   ` Masami Hiramatsu
@ 2009-03-02 17:01                                     ` Masami Hiramatsu
  2009-03-02 17:19                                       ` Mathieu Desnoyers
                                                         ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-02 17:01 UTC (permalink / raw)
  To: Mathieu Desnoyers, Ingo Molnar, Andrew Morton, Nick Piggin
  Cc: Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Arjan van de Ven, Rusty Russell, H. Peter Anvin, Steven Rostedt

Masami Hiramatsu wrote:
> Mathieu Desnoyers wrote:
>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>> Mathieu Desnoyers wrote:
>>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>>> Steven Rostedt wrote:
>>>>>> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
>>>>>>>> Hmm, lets see. I simply set a bit in the PTE mappings. There's not many, 
>>>>>>>> since a lot are 2M pages, for x86_64. Call stop_machine, and now I can 
>>>>>>>> modify 1 or 20,000 locations. Set the PTE bit back. Note, the changing of 
>>>>>>>> the bits are only done when CONFIG_DEBUG_RODATA is set.
>>>>>>>>
>>>>>>>> text_poke requires allocating a page. Map the page into memory. Set up a 
>>>>>>>> break point.
>>>>>>> text_poke does not _require_ a break point. text_poke can work with
>>>>>>> stop_machine.
>>>>>> It can? Doesn't text_poke require allocating pages? The code called by 
>>>>>> stop_machine is all atomic. vmap does not give an option to allocate with 
>>>>>> GFP_ATOMIC.
>>>>> Hi,
>>>>>
>>>>> With my patch, text_poke() never allocate pages any more :)
>>>>>
>>>>> BTW, IMHO, both of your methods are useful and have trade-off.
>>>>>
>>>>> ftrace wants to change massive amount of code at once. If we do
>>>>> that with text_poke(), we have to map/unmap pages each time and
>>>>> it will take a long time -- might be longer than one stop_machine_run().
>>>>>
>>>>> On the other hand, text_poke() user like as kprobes and tracepoints,
>>>>> just want to change a few amount of code at once, and it will be
>>>>> added/removed incrementally. If we do that with stop_machine_run(),
>>>>> we'll be annoyed by frequent machine stops.(Moreover, kprobes uses
>>>>> breakpoint, so it doesn't need stop_machine_run())
>>>>>
>>>> Hi Masami,
>>>>
>>>> Is this text_poke version executable in atomic context ? If yes, then
>>>> that would be good to add a comment saying it. Please see below for
>>>> comments.
>>> Thank you for comments!
>>> I think it could be. ah, spin_lock might be changed to spin_lock_irqsave()...
>>>
>> You are right. If we plan to execute this in both atomic and non-atomic
>> context, spin_lock_irqsave would make sure we are always busy-looping
>> with interrupts off.
> 
> Oops, when I tested spin_lock_irqsave, it caused warnings.
> 
> ------------[ cut here ]------------
> WARNING: at /home/mhiramat/ksrc/linux-2.6/kernel/smp.c:329 smp_call_function_man
> y+0x37/0x1c9()
> Hardware name: Precision WorkStation T5400
> Modules linked in:
> Pid: 1, comm: swapper Tainted: G        W  2.6.29-rc6 #16
> Call Trace:
>  [<c042f7b1>] warn_slowpath+0x71/0xa8
>  [<c044dccb>] ? trace_hardirqs_on_caller+0x18/0x145
>  [<c06dc42f>] ? _spin_unlock_irq+0x22/0x2f
>  [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
>  [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
>  [<c044cfbb>] ? trace_hardirqs_off_caller+0x18/0xa3
>  [<c045383b>] smp_call_function_many+0x37/0x1c9
>  [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
>  [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
>  [<c04539e9>] smp_call_function+0x1c/0x23
>  [<c0433ee1>] on_each_cpu+0xf/0x3a
>  [<c04138c6>] flush_tlb_all+0x14/0x16
>  [<c04946f7>] unmap_kernel_range+0xf/0x11
>  [<c06dd78a>] text_poke+0xf1/0x12c
> 
> unmap_kernel_range() requires irqs to be enabled, but spin_lock_irqsave()
> (and stop_machine() too) disables irqs, so we have to solve this issue.
> 
> I have some ideas:
> 
> - export (just remove static) vunmap_page_range() and don't use
>   flush_tlb_all().
>  Because this vm_area is not used by other cpus, we don't need
>  to flush the TLB of all cpus.
> 
> - add an unmap_kernel_range_local() function.
> 
> - extend kmap_atomic_prot() to map a lowmem page when the 'prot'
>   is different.
> 

I updated my patch based on the first idea.
I also checked that text_poke() can be called from stop_machine()
with this patch.
---

Use map_vm_area() instead of vmap() in text_poke() to avoid page allocation
and delayed unmapping, and call vunmap_page_range() and local_flush_tlb()
directly because this mapping is temporary and local.

As a result of the above change, text_poke() becomes atomic and can be called
from stop_machine() etc.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Nick Piggin <npiggin@suse.de>
---
  arch/x86/include/asm/alternative.h |    1 +
  arch/x86/kernel/alternative.c      |   36 +++++++++++++++++++++++++++++-------
  include/linux/vmalloc.h            |    1 +
  init/main.c                        |    3 +++
  mm/vmalloc.c                       |    2 +-
  5 files changed, 35 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/include/asm/alternative.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/alternative.h
+++ linux-2.6/arch/x86/include/asm/alternative.h
@@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
   * The _early version expects the memory to already be RW.
   */

+extern void text_poke_init(void);
  extern void *text_poke(void *addr, const void *opcode, size_t len);
  extern void *text_poke_early(void *addr, const void *opcode, size_t len);

Index: linux-2.6/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/alternative.c
+++ linux-2.6/arch/x86/kernel/alternative.c
@@ -12,6 +12,7 @@
  #include <asm/nmi.h>
  #include <asm/vsyscall.h>
  #include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
  #include <asm/io.h>

  #define MAX_PATCH_LEN (255-1)
@@ -485,6 +486,16 @@ void *text_poke_early(void *addr, const
  	return addr;
  }

+static struct vm_struct *text_poke_area[2];
+static DEFINE_SPINLOCK(text_poke_lock);
+
+void __init text_poke_init(void)
+{
+	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
+	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
+	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
+}
+
  /**
   * text_poke - Update instructions on a live kernel
   * @addr: address to modify
@@ -501,8 +512,9 @@ void *__kprobes text_poke(void *addr, co
  	unsigned long flags;
  	char *vaddr;
  	int nr_pages = 2;
-	struct page *pages[2];
-	int i;
+	struct page *pages[2], **pgp = pages;
+	int i, ret;
+	struct vm_struct *vma;

  	if (!core_kernel_text((unsigned long)addr)) {
  		pages[0] = vmalloc_to_page(addr);
@@ -515,12 +527,22 @@ void *__kprobes text_poke(void *addr, co
  	BUG_ON(!pages[0]);
  	if (!pages[1])
  		nr_pages = 1;
-	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
-	BUG_ON(!vaddr);
-	local_irq_save(flags);
+	spin_lock_irqsave(&text_poke_lock, flags);
+	vma = text_poke_area[nr_pages-1];
+	ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
+	BUG_ON(ret);
+	vaddr = vma->addr;
  	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-	local_irq_restore(flags);
-	vunmap(vaddr);
+	/* Ported from unmap_kernel_range() */
+	flush_cache_vunmap((unsigned long)vma->addr, (unsigned long)vma->size);
+	vunmap_page_range((unsigned long)vma->addr,
+			  (unsigned long)vma->addr + (unsigned long)vma->size);
+	/*
+	 * Since this mapping is temporary, local and protected by spinlock,
+	 * we just need to flush TLB on local processor.
+	 */
+	local_flush_tlb();
+	spin_unlock_irqrestore(&text_poke_lock, flags);
  	sync_core();
  	/* Could also do a CLFLUSH here to speed up CPU recovery; but
  	   that causes hangs on some VIA CPUs. */
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
  	taskstats_init_early();
  	delayacct_init();

+#ifdef CONFIG_X86
+	text_poke_init();
+#endif
  	check_bugs();

  	acpi_early_init(); /* before LAPIC and SMP init */
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -71,7 +71,7 @@ static void vunmap_pud_range(pgd_t *pgd,
  	} while (pud++, addr = next, addr != end);
  }

-static void vunmap_page_range(unsigned long addr, unsigned long end)
+void vunmap_page_range(unsigned long addr, unsigned long end)
  {
  	pgd_t *pgd;
  	unsigned long next;
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h
+++ linux-2.6/include/linux/vmalloc.h
@@ -96,6 +96,7 @@ extern struct vm_struct *remove_vm_area(
  extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
  			struct page ***pages);
  extern void unmap_kernel_range(unsigned long addr, unsigned long size);
+extern void vunmap_page_range(unsigned long addr, unsigned long end);

  /* Allocate/destroy a 'vmalloc' VM area. */
  extern struct vm_struct *alloc_vm_area(size_t size);
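
With this applied, a stop_machine() callback can safely call text_poke(); a
minimal usage sketch (do_poke() and struct poke_args are illustrative only,
not part of the patch):

	struct poke_args {
		void *addr;
		const void *insn;
		size_t len;
	};

	static int do_poke(void *data)
	{
		struct poke_args *p = data;

		text_poke(p->addr, p->insn, p->len);	/* now atomic-safe */
		return 0;
	}

	static struct poke_args args;	/* filled in by the caller */
	/* all other CPUs spin with irqs disabled while do_poke() runs */
	stop_machine(do_poke, &args, NULL);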

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 17:01                                     ` [RFC][PATCH] x86: make text_poke() atomic Masami Hiramatsu
@ 2009-03-02 17:19                                       ` Mathieu Desnoyers
  2009-03-02 22:15                                         ` Masami Hiramatsu
  2009-03-02 18:28                                       ` [RFC][PATCH] x86: make text_poke() atomic Arjan van de Ven
  2009-03-03  4:54                                       ` Nick Piggin
  2 siblings, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-02 17:19 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Masami Hiramatsu wrote:
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>> Mathieu Desnoyers wrote:
>>>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>>>> Steven Rostedt wrote:
>>>>>>> On Mon, 23 Feb 2009, Mathieu Desnoyers wrote:
>>>>>>>>> Hmm, lets see. I simply set a bit in the PTE mappings. 
>>>>>>>>> There's not many, since a lot are 2M pages, for x86_64. 
>>>>>>>>> Call stop_machine, and now I can modify 1 or 20,000 
>>>>>>>>> locations. Set the PTE bit back. Note, the changing of  
>>>>>>>>> the bits are only done when CONFIG_DEBUG_RODATA is set.
>>>>>>>>>
>>>>>>>>> text_poke requires allocating a page. Map the page into 
>>>>>>>>> memory. Set up a break point.
>>>>>>>> text_poke does not _require_ a break point. text_poke can work with
>>>>>>>> stop_machine.
>>>>>>> It can? Doesn't text_poke require allocating pages? The code 
>>>>>>> called by stop_machine is all atomic. vmap does not give an 
>>>>>>> option to allocate with GFP_ATOMIC.
>>>>>> Hi,
>>>>>>
>>>>>> With my patch, text_poke() never allocate pages any more :)
>>>>>>
>>>>>> BTW, IMHO, both of your methods are useful and have trade-off.
>>>>>>
>>>>>> ftrace wants to change massive amount of code at once. If we do
>>>>>> that with text_poke(), we have to map/unmap pages each time and
>>>>>> it will take a long time -- might be longer than one stop_machine_run().
>>>>>>
>>>>>> On the other hand, text_poke() user like as kprobes and tracepoints,
>>>>>> just want to change a few amount of code at once, and it will be
>>>>>> added/removed incrementally. If we do that with stop_machine_run(),
>>>>>> we'll be annoyed by frequent machine stops.(Moreover, kprobes uses
>>>>>> breakpoint, so it doesn't need stop_machine_run())
>>>>>>
>>>>> Hi Masami,
>>>>>
>>>>> Is this text_poke version executable in atomic context ? If yes, then
>>>>> that would be good to add a comment saying it. Please see below for
>>>>> comments.
>>>> Thank you for comments!
>>>> I think it could be. ah, spin_lock might be changed to spin_lock_irqsave()...
>>>>
>>> You are right. If we plan to execute this in both atomic and non-atomic
>>> context, spin_lock_irqsave would make sure we are always busy-looping
>>> with interrupts off.
>>
>> Oops, when I tested spin_lock_irqsave, it caused warnings.
>>
>> ------------[ cut here ]------------
>> WARNING: at /home/mhiramat/ksrc/linux-2.6/kernel/smp.c:329 smp_call_function_man
>> y+0x37/0x1c9()
>> Hardware name: Precision WorkStation T5400
>> Modules linked in:
>> Pid: 1, comm: swapper Tainted: G        W  2.6.29-rc6 #16
>> Call Trace:
>>  [<c042f7b1>] warn_slowpath+0x71/0xa8
>>  [<c044dccb>] ? trace_hardirqs_on_caller+0x18/0x145
>>  [<c06dc42f>] ? _spin_unlock_irq+0x22/0x2f
>>  [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
>>  [<c044efc9>] ? print_lock_contention_bug+0x14/0xd7
>>  [<c044cfbb>] ? trace_hardirqs_off_caller+0x18/0xa3
>>  [<c045383b>] smp_call_function_many+0x37/0x1c9
>>  [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
>>  [<c04138fc>] ? do_flush_tlb_all+0x0/0x3c
>>  [<c04539e9>] smp_call_function+0x1c/0x23
>>  [<c0433ee1>] on_each_cpu+0xf/0x3a
>>  [<c04138c6>] flush_tlb_all+0x14/0x16
>>  [<c04946f7>] unmap_kernel_range+0xf/0x11
>>  [<c06dd78a>] text_poke+0xf1/0x12c
>>
>> unmap_kernel_range() requires irqs to be enabled, but spin_lock_irqsave()
>> (and stop_machine() too) disables irqs, so we have to solve this issue.
>>
>> I have some ideas:
>>
>> - export (just remove static) vunmap_page_range() and don't use
>>   flush_tlb_all().
>>  Because this vm_area is not used by other cpus, we don't need
>>  to flush the TLB of all cpus.
>>
>> - add an unmap_kernel_range_local() function.
>>
>> - extend kmap_atomic_prot() to map a lowmem page when the 'prot'
>>   is different.
>>
>
> I updated my patch based on the first idea.
> I also checked that text_poke() can be called from stop_machine()
> with this patch.
> ---
>
> Use map_vm_area() instead of vmap() in text_poke() to avoid page allocation
> and delayed unmapping, and call vunmap_page_range() and local_flush_tlb()
> directly because this mapping is temporary and local.
>
> As a result of the above change, text_poke() becomes atomic and can be called
> from stop_machine() etc.
>
> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Nick Piggin <npiggin@suse.de>
> ---
>  arch/x86/include/asm/alternative.h |    1 +
>  arch/x86/kernel/alternative.c      |   36 +++++++++++++++++++++++++++++-------
>  include/linux/vmalloc.h            |    1 +
>  init/main.c                        |    3 +++
>  mm/vmalloc.c                       |    2 +-
>  5 files changed, 35 insertions(+), 8 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/alternative.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/alternative.h
> +++ linux-2.6/arch/x86/include/asm/alternative.h
> @@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
>   * The _early version expects the memory to already be RW.
>   */
>
> +extern void text_poke_init(void);
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>
> Index: linux-2.6/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/alternative.c
> +++ linux-2.6/arch/x86/kernel/alternative.c
> @@ -12,6 +12,7 @@
>  #include <asm/nmi.h>
>  #include <asm/vsyscall.h>
>  #include <asm/cacheflush.h>
> +#include <asm/tlbflush.h>
>  #include <asm/io.h>
>
>  #define MAX_PATCH_LEN (255-1)
> @@ -485,6 +486,16 @@ void *text_poke_early(void *addr, const
>  	return addr;
>  }
>
> +static struct vm_struct *text_poke_area[2];
> +static DEFINE_SPINLOCK(text_poke_lock);
> +
> +void __init text_poke_init(void)
> +{
> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> +}
> +
>  /**
>   * text_poke - Update instructions on a live kernel
>   * @addr: address to modify
> @@ -501,8 +512,9 @@ void *__kprobes text_poke(void *addr, co
>  	unsigned long flags;
>  	char *vaddr;
>  	int nr_pages = 2;
> -	struct page *pages[2];
> -	int i;
> +	struct page *pages[2], **pgp = pages;
> +	int i, ret;
> +	struct vm_struct *vma;
>
>  	if (!core_kernel_text((unsigned long)addr)) {
>  		pages[0] = vmalloc_to_page(addr);
> @@ -515,12 +527,22 @@ void *__kprobes text_poke(void *addr, co
>  	BUG_ON(!pages[0]);
>  	if (!pages[1])
>  		nr_pages = 1;
> -	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> -	BUG_ON(!vaddr);
> -	local_irq_save(flags);
> +	spin_lock_irqsave(&text_poke_lock, flags);
> +	vma = text_poke_area[nr_pages-1];
> +	ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
> +	BUG_ON(ret);
> +	vaddr = vma->addr;
>  	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> -	local_irq_restore(flags);
> -	vunmap(vaddr);
> +	/* Ported from unmap_kernel_range() */
> +	flush_cache_vunmap((unsigned long)vma->addr, (unsigned long)vma->size);
> +	vunmap_page_range((unsigned long)vma->addr,
> +			  (unsigned long)vma->addr + (unsigned long)vma->size);
> +	/*
> +	 * Since this mapping is temporary, local and protected by spinlock,
> +	 * we just need to flush TLB on local processor.
> +	 */
> +	local_flush_tlb();
> +	spin_unlock_irqrestore(&text_poke_lock, flags);
>  	sync_core();
>  	/* Could also do a CLFLUSH here to speed up CPU recovery; but
>  	   that causes hangs on some VIA CPUs. */
> Index: linux-2.6/init/main.c
> ===================================================================
> --- linux-2.6.orig/init/main.c
> +++ linux-2.6/init/main.c
> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>  	taskstats_init_early();
>  	delayacct_init();
>
> +#ifdef CONFIG_X86
> +	text_poke_init();
> +#endif

All good, except this above. There should be an empty text_poke_init()
in some header file, and an implementation for the X86 arch, rather than
an ifdef in init/main.c.
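
Something along these lines, perhaps (a sketch only; where exactly the stub
lives is a matter of taste):

	/* in a generic header (hypothetical location) */
	#ifdef CONFIG_X86
	extern void text_poke_init(void);
	#else
	static inline void text_poke_init(void) { }
	#endif

so that init/main.c can simply call text_poke_init() unconditionally.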

Mathieu

>  	check_bugs();
>
>  	acpi_early_init(); /* before LAPIC and SMP init */
> Index: linux-2.6/mm/vmalloc.c
> ===================================================================
> --- linux-2.6.orig/mm/vmalloc.c
> +++ linux-2.6/mm/vmalloc.c
> @@ -71,7 +71,7 @@ static void vunmap_pud_range(pgd_t *pgd,
>  	} while (pud++, addr = next, addr != end);
>  }
>
> -static void vunmap_page_range(unsigned long addr, unsigned long end)
> +void vunmap_page_range(unsigned long addr, unsigned long end)
>  {
>  	pgd_t *pgd;
>  	unsigned long next;
> Index: linux-2.6/include/linux/vmalloc.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmalloc.h
> +++ linux-2.6/include/linux/vmalloc.h
> @@ -96,6 +96,7 @@ extern struct vm_struct *remove_vm_area(
>  extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
>  			struct page ***pages);
>  extern void unmap_kernel_range(unsigned long addr, unsigned long size);
> +extern void vunmap_page_range(unsigned long addr, unsigned long end);
>
>  /* Allocate/destroy a 'vmalloc' VM area. */
>  extern struct vm_struct *alloc_vm_area(size_t size);
>
> -- 
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: mhiramat@redhat.com
>
>

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 17:01                                     ` [RFC][PATCH] x86: make text_poke() atomic Masami Hiramatsu
  2009-03-02 17:19                                       ` Mathieu Desnoyers
@ 2009-03-02 18:28                                       ` Arjan van de Ven
  2009-03-02 18:36                                         ` Mathieu Desnoyers
  2009-03-02 18:42                                         ` Linus Torvalds
  2009-03-03  4:54                                       ` Nick Piggin
  2 siblings, 2 replies; 89+ messages in thread
From: Arjan van de Ven @ 2009-03-02 18:28 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Ingo Molnar, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

> 
> Use map_vm_area() instead of vmap() in text_poke() for avoiding page
> allocation and delayed unmapping, and call vunmap_page_range() and
> local_flush_tlb() directly because this mapping is temporary and
> local.
> 
> At the result of above change, text_poke() becomes atomic and can be
> called from stop_machine() etc.

.... but text_poke() realistically needs to call stop_machine() since
you can't poke live code.... so that makes me wonder how useful this
is...

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 18:28                                       ` [RFC][PATCH] x86: make text_poke() atomic Arjan van de Ven
@ 2009-03-02 18:36                                         ` Mathieu Desnoyers
  2009-03-02 18:55                                           ` Arjan van de Ven
  2009-03-02 18:42                                         ` Linus Torvalds
  1 sibling, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-02 18:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Masami Hiramatsu, Ingo Molnar, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

* Arjan van de Ven (arjan@infradead.org) wrote:
> > 
> > Use map_vm_area() instead of vmap() in text_poke() for avoiding page
> > allocation and delayed unmapping, and call vunmap_page_range() and
> > local_flush_tlb() directly because this mapping is temporary and
> > local.
> > 
> > At the result of above change, text_poke() becomes atomic and can be
> > called from stop_machine() etc.
> 
> .... but text_poke() realistically needs to call stop_machine() since
> you can't poke live code.... so that makes me wonder how useful this
> is...

Hi Arjan,

Stop machine is not required when inserting a breakpoint. And cleverly
using this breakpoint technique can permit modifying other instructions
as well.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 18:28                                       ` [RFC][PATCH] x86: make text_poke() atomic Arjan van de Ven
  2009-03-02 18:36                                         ` Mathieu Desnoyers
@ 2009-03-02 18:42                                         ` Linus Torvalds
  1 sibling, 0 replies; 89+ messages in thread
From: Linus Torvalds @ 2009-03-02 18:42 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Andrew Morton,
	Nick Piggin, Steven Rostedt, Andi Kleen, linux-kernel,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Rusty Russell, H. Peter Anvin, Steven Rostedt



On Mon, 2 Mar 2009, Arjan van de Ven wrote:
> 
> .... but text_poke() realistically needs to call stop_machine() since
> you can't poke live code.... so that makes me wonder how useful this
> is...

Well, not always. There's at least two cases where we don't need it:

 - in the UP -> SMP transition.

 - perhaps more interestingly, we're still kind of waiting for the 
   resolution of the whole "nop out the first byte to a single-byte 'int3' 
   trap instruction, then rewrite the rest of the instruction, and then 
   reset the first byte of the final instruction" thing.

IOW, there are possible non-stop_machine() models where rewriting 
instructions in a live system does work, and quite frankly, I think we 
need them. Making the rule be the (obviously safe) "we can only do this in 
stop_machine" is quite possibly not going to be an acceptable rule and we 
may need alternatives.

			Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 18:36                                         ` Mathieu Desnoyers
@ 2009-03-02 18:55                                           ` Arjan van de Ven
  2009-03-02 19:13                                             ` Masami Hiramatsu
  2009-03-02 19:47                                             ` Mathieu Desnoyers
  0 siblings, 2 replies; 89+ messages in thread
From: Arjan van de Ven @ 2009-03-02 18:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, Ingo Molnar, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

On Mon, 2 Mar 2009 13:36:17 -0500
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Arjan van de Ven (arjan@infradead.org) wrote:
> > > 
> > > Use map_vm_area() instead of vmap() in text_poke() for avoiding
> > > page allocation and delayed unmapping, and call
> > > vunmap_page_range() and local_flush_tlb() directly because this
> > > mapping is temporary and local.
> > > 
> > > At the result of above change, text_poke() becomes atomic and can
> > > be called from stop_machine() etc.
> > 
> > .... but text_poke() realistically needs to call stop_machine()
> > since you can't poke live code.... so that makes me wonder how
> > useful this is...
> 
> Hi Arjan,
> 
> Stop machine is not required when inserting a breakpoint. 

that is your assumption; when I spoke with CPU architects they
cringed ;(



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 18:55                                           ` Arjan van de Ven
@ 2009-03-02 19:13                                             ` Masami Hiramatsu
  2009-03-02 19:23                                               ` H. Peter Anvin
  2009-03-02 19:47                                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-02 19:13 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Mathieu Desnoyers, Ingo Molnar, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Arjan van de Ven wrote:
> On Mon, 2 Mar 2009 13:36:17 -0500
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
>> * Arjan van de Ven (arjan@infradead.org) wrote:
>>>> Use map_vm_area() instead of vmap() in text_poke() for avoiding
>>>> page allocation and delayed unmapping, and call
>>>> vunmap_page_range() and local_flush_tlb() directly because this
>>>> mapping is temporary and local.
>>>>
>>>> At the result of above change, text_poke() becomes atomic and can
>>>> be called from stop_machine() etc.
>>> .... but text_poke() realistically needs to call stop_machine()
>>> since you can't poke live code.... so that makes me wonder how
>>> useful this is...
>> Hi Arjan,
>>
>> Stop machine is not required when inserting a breakpoint. 
> 
> that is your assumption; when I spoke with CPU architects they
> cringed ;(

Is that true even if we modify just one byte (like an int3 insertion)
and don't care about synchronous writes (that is, the code modification
takes effect on other processors only after a while)?

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 19:13                                             ` Masami Hiramatsu
@ 2009-03-02 19:23                                               ` H. Peter Anvin
  0 siblings, 0 replies; 89+ messages in thread
From: H. Peter Anvin @ 2009-03-02 19:23 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Arjan van de Ven, Mathieu Desnoyers, Ingo Molnar, Andrew Morton,
	Nick Piggin, Steven Rostedt, Andi Kleen, linux-kernel,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Rusty Russell, Steven Rostedt

Masami Hiramatsu wrote:
>>>
>>> Stop machine is not required when inserting a breakpoint. 
>>
>> that is your assumption; when I spoke with CPU architects they
>> cringed ;(
> 
> Is that true even if we modify just one byte (like an int3 insertion)
> and don't care about synchronous writes (that is, the code modification
> takes effect on other processors only after a while)?
> 

The problem is that he's using it as part of a sequenced series of 
steps.  The lack of synchronization comes into play there.

	-hpa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 18:55                                           ` Arjan van de Ven
  2009-03-02 19:13                                             ` Masami Hiramatsu
@ 2009-03-02 19:47                                             ` Mathieu Desnoyers
  1 sibling, 0 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-02 19:47 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Masami Hiramatsu, Ingo Molnar, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

* Arjan van de Ven (arjan@infradead.org) wrote:
> On Mon, 2 Mar 2009 13:36:17 -0500
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Arjan van de Ven (arjan@infradead.org) wrote:
> > > > 
> > > > Use map_vm_area() instead of vmap() in text_poke() for avoiding
> > > > page allocation and delayed unmapping, and call
> > > > vunmap_page_range() and local_flush_tlb() directly because this
> > > > mapping is temporary and local.
> > > > 
> > > > At the result of above change, text_poke() becomes atomic and can
> > > > be called from stop_machine() etc.
> > > 
> > > .... but text_poke() realistically needs to call stop_machine()
> > > since you can't poke live code.... so that makes me wonder how
> > > useful this is...
> > 
> > Hi Arjan,
> > 
> > Stop machine is not required when inserting a breakpoint. 
> 
> that is your assumption; when I spoke with CPU architects they
> cringed ;(
> 

Given you are not citing any technical material, I guess you are
referring to :

Intel® Core™2 Duo Processor E8000Δ and E7000Δ Series
http://download.intel.com/design/processor/specupdt/318733.pdf (page 46)

AW75. Unsynchronized Cross-Modifying Code Operations Can Cause
Unexpected Instruction Execution Results

Am I correct ? This errata has been around since the Pentium III and is
still valid today. Other current CPUs with this errata :

Intel® Atom™ Processor Z5xxΔ Series
http://download.intel.com/design/processor/specupdt/319536.pdf (page 22)
AAE18 Unsynchronized Cross-Modifying Code Operations Can Cause
Unexpected Instruction Execution Results


First point: given your statement, kprobes would be buggy on x86 _and_
ia64. If this is true, then it should be addressed. If not, then we
should not worry about it.


The algorithm they propose to work around the architectural limitations
is stated here:
http://download.intel.com/design/PentiumII/manuals/24319202.pdf
7.1.3 Handling Self- and Cross-Modifying Code

It basically implies using something like stop_machine(). However, if we
read carefully the small amount of information available in this erratum:

"The act of a processor writing data into a currently executing code
segment with the intent of executing that data as code is called
self-modifying code. Intel Architecture processors exhibit
model-specific behavior when executing self-modified code, depending
upon how far ahead of the current execution pointer the code has been
modified. As processor architectures become more complex and start to
speculatively execute code ahead of the retirement point (as in the P6
family processors), the rules regarding which code should execute, pre-
or post-modification, become blurred."

Basically, this points to speculative code execution as being the
core of the problems encountered with code modification. But given that
int3 *IS* a _serializing_ instruction, it is not affected by this erratum.
Quoting Richard J Moore from IBM from a discussion we had a few years
ago :

 * "There is another issue to consider when looking into using probes other
 * then int3:
 *
 * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
 * practice of modifying code on one processor where another has prefetched
 * the unmodified version of the code. Intel states that unpredictable general
 * protection faults may result if a synchronizing instruction (iret, int,
 * int3, cpuid, etc ) is not executed on the second processor before it
 * executes the pre-fetched out-of-date copy of the instruction.
 *
 * When we became aware of this I had a long discussion with Intel's
 * microarchitecture guys. It turns out that the reason for this erratum
 * (which incidentally Intel does not intend to fix) is because the trace
 * cache - the stream of micro-ops resulting from instruction interpretation -
 * cannot be guaranteed to be valid. Reading between the lines I assume this
 * issue arises because of optimization done in the trace cache, where it is
 * no longer possible to identify the original instruction boundaries. If the
 * CPU discovers that the trace cache has been invalidated because of
 * unsynchronized cross-modification then instruction execution will be
 * aborted with a GPF. Further discussion with Intel revealed that replacing
 * the first opcode byte with an int3 would not be subject to this erratum.
 *
 * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."

Therefore, I think assuming int3 is safe for _synchronized_ XMC is OK.
The multi-step algorithm I use to perform code modification in my
immediate values patch, based on int3, basically writes the int3, sends an
IPI to _each_ CPU to make sure they issue a synchronizing instruction
(cpuid), and then I can safely proceed to change the instruction,
including the first byte, because I know that all CPUs which could have
potentially seen the old instruction have seen the new version
(breakpoint) and have issued a synchronizing instruction (in that order).
Note that I put a smp_wmb() after the int3 write, and a smp_rmb() in the
IPI handler before the cpuid instruction.
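
A minimal sketch of that sequence, for illustration only. The helper
names (do_sync_core, poke_insn_int3_sketch) are hypothetical;
text_poke(), sync_core() and smp_call_function() are the existing
kernel APIs, but the glue below is an assumption, not the code from
the immediate values patch:

#include <linux/smp.h>
#include <asm/alternative.h>	/* text_poke() */
#include <asm/processor.h>	/* sync_core() */

static void do_sync_core(void *unused)
{
	sync_core();		/* executes cpuid, a serializing insn */
}

/* Hypothetical helper following the breakpoint-first protocol above. */
static void poke_insn_int3_sketch(char *addr, const char *newinsn,
				  size_t len)
{
	unsigned char int3 = 0xcc;

	text_poke(addr, &int3, 1);	/* 1. arm the breakpoint */
	smp_wmb();
	/* 2. force a serializing instruction on every other CPU */
	smp_call_function(do_sync_core, NULL, 1);
	sync_core();
	/* 3. every CPU has now seen the int3: the tail, and then the
	 *    first byte, can be rewritten safely */
	if (len > 1)
		text_poke(addr + 1, newinsn + 1, len - 1);
	text_poke(addr, newinsn, 1);
	smp_call_function(do_sync_core, NULL, 1);
}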

Note that extra care will have to be taken to handle synchronization of
instruction and data caches on the Itanium, but this is a different
architecture and topic, which is not the primary focus of our discussion
here:
Cache Coherency in Itanium® Processor Software
http://cache-www.intel.com/cd/00/00/21/57/215792_215792.pdf

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 17:19                                       ` Mathieu Desnoyers
@ 2009-03-02 22:15                                         ` Masami Hiramatsu
  2009-03-02 22:22                                           ` Ingo Molnar
  0 siblings, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-02 22:15 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Mathieu Desnoyers wrote:
> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>> Index: linux-2.6/init/main.c
>> ===================================================================
>> --- linux-2.6.orig/init/main.c
>> +++ linux-2.6/init/main.c
>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>>  	taskstats_init_early();
>>  	delayacct_init();
>>
>> +#ifdef CONFIG_X86
>> +	text_poke_init();
>> +#endif
> 
> All good, except this above. There should be an empty text_poke_init()
> in some header file, and an implementation for the X86 arch rather than
> a ifdef in init/main.c.

Hmm, I'd rather use a __weak function than define it in some header
file, because text_poke() and the alternatives code exist only on x86.

I know that we need to discuss cross modifying code on x86 with
Arjan or other Intel engineers. This patch may still be useful
for removing unnecessary vm_area allocation in text_poke().

Thank you,

---
Use map_vm_area() instead of vmap() in text_poke() to avoid page allocation
and delayed unmapping, and call vunmap_page_range() and local_flush_tlb()
directly because this mapping is temporary and local.

As a result of the above change, text_poke() becomes atomic and can be called
from stop_machine() etc.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Nick Piggin <npiggin@suse.de>
---
  arch/x86/include/asm/alternative.h |    1 +
  arch/x86/kernel/alternative.c      |   36 +++++++++++++++++++++++++++++-------
  include/linux/vmalloc.h            |    1 +
  init/main.c                        |    5 +++++
  mm/vmalloc.c                       |    2 +-
  5 files changed, 37 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/include/asm/alternative.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/alternative.h
+++ linux-2.6/arch/x86/include/asm/alternative.h
@@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
   * The _early version expects the memory to already be RW.
   */

+extern void text_poke_init(void);
  extern void *text_poke(void *addr, const void *opcode, size_t len);
  extern void *text_poke_early(void *addr, const void *opcode, size_t len);

Index: linux-2.6/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/alternative.c
+++ linux-2.6/arch/x86/kernel/alternative.c
@@ -12,6 +12,7 @@
  #include <asm/nmi.h>
  #include <asm/vsyscall.h>
  #include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
  #include <asm/io.h>

  #define MAX_PATCH_LEN (255-1)
@@ -485,6 +486,16 @@ void *text_poke_early(void *addr, const
  	return addr;
  }

+static struct vm_struct *text_poke_area[2];
+static DEFINE_SPINLOCK(text_poke_lock);
+
+void __init text_poke_init(void)
+{
+	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
+	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
+	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
+}
+
  /**
   * text_poke - Update instructions on a live kernel
   * @addr: address to modify
@@ -501,8 +512,9 @@ void *__kprobes text_poke(void *addr, co
  	unsigned long flags;
  	char *vaddr;
  	int nr_pages = 2;
-	struct page *pages[2];
-	int i;
+	struct page *pages[2], **pgp = pages;
+	int i, ret;
+	struct vm_struct *vma;

  	if (!core_kernel_text((unsigned long)addr)) {
  		pages[0] = vmalloc_to_page(addr);
@@ -515,12 +527,22 @@ void *__kprobes text_poke(void *addr, co
  	BUG_ON(!pages[0]);
  	if (!pages[1])
  		nr_pages = 1;
-	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
-	BUG_ON(!vaddr);
-	local_irq_save(flags);
+	spin_lock_irqsave(&text_poke_lock, flags);
+	vma = text_poke_area[nr_pages-1];
+	ret = map_vm_area(vma, PAGE_KERNEL, &pgp);
+	BUG_ON(ret);
+	vaddr = vma->addr;
  	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-	local_irq_restore(flags);
-	vunmap(vaddr);
+	/* Ported from unmap_kernel_range() */
+	flush_cache_vunmap((unsigned long)vma->addr, (unsigned long)vma->size);
+	vunmap_page_range((unsigned long)vma->addr,
+			  (unsigned long)vma->addr + (unsigned long)vma->size);
+	/*
+	 * Since this mapping is temporary, local and protected by spinlock,
+	 * we just need to flush TLB on local processor.
+	 */
+	local_flush_tlb();
+	spin_unlock_irqrestore(&text_poke_lock, flags);
  	sync_core();
  	/* Could also do a CLFLUSH here to speed up CPU recovery; but
  	   that causes hangs on some VIA CPUs. */
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -526,6 +526,10 @@ void __init __weak thread_info_cache_ini
  {
  }

+void __init __weak text_poke_init(void)
+{
+}
+
  asmlinkage void __init start_kernel(void)
  {
  	char * command_line;
@@ -676,6 +680,7 @@ asmlinkage void __init start_kernel(void
  	taskstats_init_early();
  	delayacct_init();

+	text_poke_init();
  	check_bugs();

  	acpi_early_init(); /* before LAPIC and SMP init */
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -71,7 +71,7 @@ static void vunmap_pud_range(pgd_t *pgd,
  	} while (pud++, addr = next, addr != end);
  }

-static void vunmap_page_range(unsigned long addr, unsigned long end)
+void vunmap_page_range(unsigned long addr, unsigned long end)
  {
  	pgd_t *pgd;
  	unsigned long next;
Index: linux-2.6/include/linux/vmalloc.h
===================================================================
--- linux-2.6.orig/include/linux/vmalloc.h
+++ linux-2.6/include/linux/vmalloc.h
@@ -96,6 +96,7 @@ extern struct vm_struct *remove_vm_area(
  extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
  			struct page ***pages);
  extern void unmap_kernel_range(unsigned long addr, unsigned long size);
+extern void vunmap_page_range(unsigned long addr, unsigned long end);

  /* Allocate/destroy a 'vmalloc' VM area. */
  extern struct vm_struct *alloc_vm_area(size_t size);

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 22:15                                         ` Masami Hiramatsu
@ 2009-03-02 22:22                                           ` Ingo Molnar
  2009-03-02 22:55                                             ` Masami Hiramatsu
  0 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-03-02 22:22 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt


* Masami Hiramatsu <mhiramat@redhat.com> wrote:

> Mathieu Desnoyers wrote:
>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>> Index: linux-2.6/init/main.c
>>> ===================================================================
>>> --- linux-2.6.orig/init/main.c
>>> +++ linux-2.6/init/main.c
>>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>>>  	taskstats_init_early();
>>>  	delayacct_init();
>>>
>>> +#ifdef CONFIG_X86
>>> +	text_poke_init();
>>> +#endif
>>
>> All good, except this above. There should be an empty text_poke_init()
>> in some header file, and an implementation for the X86 arch rather than
>> a ifdef in init/main.c.
>
> Hmm, I'd rather use __weak function instead of defining it in some header
> files, because text_poke() and alternatives exist only on x86.
>
> I know that we need to discuss cross modifying code on x86 with
> Arjan or other Intel engineers. This patch may still be useful
> for removing unnecessary vm_area allocation in text_poke().
>
> Thank you,
>
> ---
>
> Use map_vm_area() instead of vmap() in text_poke() to avoid 
> page allocation and delayed unmapping, and call 
> vunmap_page_range() and local_flush_tlb() directly because 
> this mapping is temporary and local.
>
> As a result of the above change, text_poke() becomes atomic 
> and can be called from stop_machine() etc.

That looks like a good fix in itself - see a few minor details 
below.

(Note, i could not try your patch because it has widespread 
whitespace damage - please watch out for this for future 
patches.)

> +static struct vm_struct *text_poke_area[2];
> +static DEFINE_SPINLOCK(text_poke_lock);
> +
> +void __init text_poke_init(void)
> +{
> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);

BUG_ON() for non-100%-essential init code is a no-no. Please 
change it to WARN_ON() so that people have a chance to report it.

Also, i think all these vma complications came from the decision 
to use vmap - and vmap enhancements in .29 complicated this 
supposedly-simple interface.

So perhaps another approach to (re-)consider would be to go back 
to atomic fixmaps here. It spends 3 slots but that's no big 
deal.

In exchange it will be conceptually simpler, and will also scale 
much better than a global spinlock. What do you think?
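
For illustration, a minimal sketch of the atomic-kmap direction (an
assumption about what such a text_poke() core could look like, not a
tested implementation; kmap_atomic_prot()/kunmap_atomic() are the real
32-bit CONFIG_HIGHMEM APIs, and the helper name is hypothetical):

#include <linux/highmem.h>
#include <linux/string.h>
#include <asm/pgtable.h>	/* PAGE_KERNEL */

/* Patch 'len' bytes inside one page; 32-bit CONFIG_HIGHMEM only. */
static void poke_via_kmap_sketch(struct page *page, unsigned long offset,
				 const void *opcode, size_t len)
{
	char *vaddr;

	/* per-CPU atomic kmap slot: no global lock, scales per CPU */
	vaddr = kmap_atomic_prot(page, KM_USER0, PAGE_KERNEL);
	memcpy(vaddr + offset, opcode, len);
	kunmap_atomic(vaddr, KM_USER0);
}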

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 22:22                                           ` Ingo Molnar
@ 2009-03-02 22:55                                             ` Masami Hiramatsu
  2009-03-02 23:09                                               ` Ingo Molnar
  0 siblings, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-02 22:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt



Ingo Molnar wrote:
> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> 
>> Mathieu Desnoyers wrote:
>>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
>>>> Index: linux-2.6/init/main.c
>>>> ===================================================================
>>>> --- linux-2.6.orig/init/main.c
>>>> +++ linux-2.6/init/main.c
>>>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
>>>>  	taskstats_init_early();
>>>>  	delayacct_init();
>>>>
>>>> +#ifdef CONFIG_X86
>>>> +	text_poke_init();
>>>> +#endif
>>> All good, except this above. There should be an empty text_poke_init()
>>> in some header file, and an implementation for the X86 arch rather than
>>> a ifdef in init/main.c.
>> Hmm, I'd rather use __weak function instead of defining it in some header
>> files, because text_poke() and alternatives exist only on x86.
>>
>> I know that we need to discuss cross modifying code on x86 with
>> Arjan or other Intel engineers. This patch may still be useful
>> for removing unnecessary vm_area allocation in text_poke().
>>
>> Thank you,
>>
>> ---
>>
>> Use map_vm_area() instead of vmap() in text_poke() to avoid 
>> page allocation and delayed unmapping, and call 
>> vunmap_page_range() and local_flush_tlb() directly because 
>> this mapping is temporary and local.
>>
>> As a result of the above change, text_poke() becomes atomic 
>> and can be called from stop_machine() etc.
> 
> That looks like a good fix in itself - see a few minor details 
> below.

Thank you for review,

> 
> (Note, i could not try your patch because it has widespread 
> whitespace damage - please watch out for this for future 
> patches.)

Oops, it was a misconfiguration on my side...

> 
>> +static struct vm_struct *text_poke_area[2];
>> +static DEFINE_SPINLOCK(text_poke_lock);
>> +
>> +void __init text_poke_init(void)
>> +{
>> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
>> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
>> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> 
> BUG_ON() for non-100%-essential init code is a no-no. Please 
> change it to WARN_ON() so that people have a chance to report it.

Sure.

> 
> Also, i think all these vma complications came from the decision 
> to use vmap - and vmap enhancements in .29 complicated this 
> supposedly-simple interface.
> 
> So perhaps another approach to (re-)consider would be to go back 
> to atomic fixmaps here. It spends 3 slots but that's no big 
> deal.

Oh, it's a good idea! fixmaps must make it simpler.

> 
> In exchange it will be conceptually simpler, and will also scale 
> much better than a global spinlock. What do you think?

I think even if I use fixmaps, we have to use a spinlock to protect
the fixmap area from other threads...

> 
> 	Ingo

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 22:55                                             ` Masami Hiramatsu
@ 2009-03-02 23:09                                               ` Ingo Molnar
  2009-03-02 23:38                                                 ` Masami Hiramatsu
  0 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-03-02 23:09 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt


* Masami Hiramatsu <mhiramat@redhat.com> wrote:

> 
> 
> Ingo Molnar wrote:
> > * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> > 
> >> Mathieu Desnoyers wrote:
> >>> * Masami Hiramatsu (mhiramat@redhat.com) wrote:
> >>>> Index: linux-2.6/init/main.c
> >>>> ===================================================================
> >>>> --- linux-2.6.orig/init/main.c
> >>>> +++ linux-2.6/init/main.c
> >>>> @@ -676,6 +676,9 @@ asmlinkage void __init start_kernel(void
> >>>>  	taskstats_init_early();
> >>>>  	delayacct_init();
> >>>>
> >>>> +#ifdef CONFIG_X86
> >>>> +	text_poke_init();
> >>>> +#endif
> >>> All good, except this above. There should be an empty text_poke_init()
> >>> in some header file, and an implementation for the X86 arch rather than
> >>> a ifdef in init/main.c.
> >> Hmm, I'd rather use __weak function instead of defining it in some header
> >> files, because text_poke() and alternatives exist only on x86.
> >>
> >> I know that we need to discuss cross modifying code on x86 with
> >> Arjan or other Intel engineers. This patch may still be useful
> >> for removing unnecessary vm_area allocation in text_poke().
> >>
> >> Thank you,
> >>
> >> ---
> >>
> >> Use map_vm_area() instead of vmap() in text_poke() to avoid 
> >> page allocation and delayed unmapping, and call 
> >> vunmap_page_range() and local_flush_tlb() directly because 
> >> this mapping is temporary and local.
> >>
> >> As a result of the above change, text_poke() becomes atomic 
> >> and can be called from stop_machine() etc.
> > 
> > That looks like a good fix in itself - see a few minor details 
> > below.
> 
> Thank you for review,
> 
> > 
> > (Note, i could not try your patch because it has widespread 
> > whitespace damage - please watch out for this for future 
> > patches.)
> 
> Oops, it was a misconfiguration on my side...
> 
> > 
> >> +static struct vm_struct *text_poke_area[2];
> >> +static DEFINE_SPINLOCK(text_poke_lock);
> >> +
> >> +void __init text_poke_init(void)
> >> +{
> >> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> >> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> >> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> > 
> > BUG_ON() for non-100%-essential init code is a no-no. Please 
> > change it to WARN_ON() so that people have a chance to report it.
> 
> Sure.
> 
> > 
> > Also, i think all these vma complications came from the decision 
> > to use vmap - and vmap enhancements in .29 complicated this 
> > supposedly-simple interface.
> > 
> > So perhaps another approach to (re-)consider would be to go back 
> > to atomic fixmaps here. It spends 3 slots but that's no big 
> > deal.
> 
> Oh, it's a good idea! fixmaps must make it simpler.
> 
> > 
> > In exchange it will be conceptually simpler, and will also scale 
> > much better than a global spinlock. What do you think?
> 
> I think even if I use fixmaps, we have to use a spinlock to protect
> the fixmap area from other threads...

that's why i suggested to use an atomic-kmap, not a fixmap.

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 23:09                                               ` Ingo Molnar
@ 2009-03-02 23:38                                                 ` Masami Hiramatsu
  2009-03-02 23:49                                                   ` Ingo Molnar
  0 siblings, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-02 23:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Ingo Molnar wrote:
>>> So perhaps another approach to (re-)consider would be to go back 
>>> to atomic fixmaps here. It spends 3 slots but that's no big 
>>> deal.
>> Oh, it's a good idea! fixmaps must make it simpler.
>>
>>> In exchange it will be conceptually simpler, and will also scale 
>>> much better than a global spinlock. What do you think?
>> I think even if I use fixmaps, we have to use a spinlock to protect
>> the fixmap area from other threads...
> 
> that's why i suggested to use an atomic-kmap, not a fixmap.

Even if the mapping is atomic, text_poke() has to protect pte
from other text_poke()s while changing code.
AFAIK, atomic-kmap itself doesn't ensure that, does it?

Thank you,

> 
> 	Ingo

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 23:38                                                 ` Masami Hiramatsu
@ 2009-03-02 23:49                                                   ` Ingo Molnar
  2009-03-03  0:00                                                     ` Mathieu Desnoyers
                                                                       ` (3 more replies)
  0 siblings, 4 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-03-02 23:49 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt


* Masami Hiramatsu <mhiramat@redhat.com> wrote:

> Ingo Molnar wrote:
> >>> So perhaps another approach to (re-)consider would be to go back 
> >>> to atomic fixmaps here. It spends 3 slots but that's no big 
> >>> deal.
> >> Oh, it's a good idea! fixmaps must make it simpler.
> >>
> >>> In exchange it will be conceptually simpler, and will also scale 
> >>> much better than a global spinlock. What do you think?
> >> I think even if I use fixmaps, we have to use a spinlock to protect
> >> the fixmap area from other threads...
> > 
> > that's why i suggested to use an atomic-kmap, not a fixmap.
> 
> Even if the mapping is atomic, text_poke() has to protect pte
> from other text_poke()s while changing code.
> AFAIK, atomic-kmap itself doesn't ensure that, does it?

Well, but text_poke() is not a serializing API to begin with. 
It's normally used in code patching sequences when we 'know' 
that there cannot be similar parallel activities. The kprobes 
usage of text_poke() looks unsafe - and that needs to be fixed.

So indeed a new global lock is needed there.

It's fixable and we'll fix it, but text_poke() is really more 
complex than i'd like it to be.

stop_machine_run() is essentially instantaneous in practice and 
obviously serializing so it warrants a second look at least. 
Have you tried to use it in kprobes?
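
For reference, a minimal sketch of what stop_machine()-based patching
could look like (illustration only; stop_machine() is the real API, in
.29 named stop_machine() rather than stop_machine_run(), and the helper
and struct names here are hypothetical):

#include <linux/stop_machine.h>
#include <asm/alternative.h>	/* text_poke() */

struct patch_args {
	void *addr;
	const void *opcode;
	size_t len;
};

/* Runs while every other CPU spins with interrupts disabled. */
static int do_patch_one(void *data)
{
	struct patch_args *args = data;

	text_poke(args->addr, args->opcode, args->len);
	return 0;
}

/* Caller side: stop_machine(do_patch_one, &args, NULL); */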

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 23:49                                                   ` Ingo Molnar
@ 2009-03-03  0:00                                                     ` Mathieu Desnoyers
  2009-03-03  0:00                                                     ` [PATCH] Text Edit Lock - Architecture Independent Code Mathieu Desnoyers
                                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03  0:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Masami Hiramatsu, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> 
> > Ingo Molnar wrote:
> > >>> So perhaps another approach to (re-)consider would be to go back 
> > >>> to atomic fixmaps here. It spends 3 slots but that's no big 
> > >>> deal.
> > >> Oh, it's a good idea! fixmaps must make it simpler.
> > >>
> > >>> In exchange it will be conceptually simpler, and will also scale 
> > >>> much better than a global spinlock. What do you think?
> > >> I think even if I use fixmaps, we have to use a spinlock to protect
> > >> the fixmap area from other threads...
> > > 
> > > that's why i suggested to use an atomic-kmap, not a fixmap.
> > 
> > Even if the mapping is atomic, text_poke() has to protect pte
> > from other text_poke()s while changing code.
> > AFAIK, atomic-kmap itself doesn't ensure that, does it?
> 
> Well, but text_poke() is not a serializing API to begin with. 
> It's normally used in code patching sequences when we 'know' 
> that there cannot be similar parallel activities. The kprobes 
> usage of text_poke() looks unsafe - and that needs to be fixed.
> 
> So indeed a new global lock is needed there.
> 
> It's fixable and we'll fixit, but text_poke() is really more 
> complex than i'd like it to be.
> 
> stop_machine_run() is essentially instantaneous in practice and 
> obviously serializing so it warrants a second look at least. 
> Have you tried to use it in kprobes?
> 
> 	Ingo

This is why I prepared 

text-edit-lock-architecture-independent-code.patch
text-edit-lock-kprobes-architecture-independent-support.patch

A while ago. I'll post them right away.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH] Text Edit Lock - Architecture Independent Code
  2009-03-02 23:49                                                   ` Ingo Molnar
  2009-03-03  0:00                                                     ` Mathieu Desnoyers
@ 2009-03-03  0:00                                                     ` Mathieu Desnoyers
  2009-03-03  0:32                                                       ` Ingo Molnar
  2009-03-03  0:01                                                     ` [PATCH] Text Edit Lock - kprobes architecture independent support Mathieu Desnoyers
  2009-03-03  0:05                                                     ` [RFC][PATCH] x86: make text_poke() atomic Masami Hiramatsu
  3 siblings, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03  0:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Masami Hiramatsu, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

This is an architecture-independent synchronization around kernel text
modifications through the use of a global mutex.

A mutex has been chosen so that kprobes, the main user of this, can sleep during
memory allocation between the memory read of the instructions it must replace
and the memory write of the breakpoint.

Another user of this interface: immediate values.

Paravirt and alternatives are always done when SMP is inactive, so there is no
need to use locks.
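
A minimal usage sketch of the pair introduced below (the caller is
hypothetical; kernel_text_lock()/kernel_text_unlock() are the APIs this
patch adds):

	kernel_text_lock();
	/* read the old insn, allocate, then arm the breakpoint... */
	text_poke(addr, opcode, len);
	kernel_text_unlock();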

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <andi@firstfloor.org>
CC: Ingo Molnar <mingo@elte.hu>
---
 include/linux/memory.h |    7 +++++++
 mm/memory.c            |   34 ++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

Index: linux-2.6-lttng/include/linux/memory.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/memory.h	2009-01-30 09:47:59.000000000 -0500
+++ linux-2.6-lttng/include/linux/memory.h	2009-01-30 10:25:33.000000000 -0500
@@ -99,4 +99,11 @@ enum mem_add_context { BOOT, HOTPLUG };
 #define hotplug_memory_notifier(fn, pri) do { } while (0)
 #endif
 
+/*
+ * Take and release the kernel text modification lock, used for code patching.
+ * Users of this lock can sleep.
+ */
+extern void kernel_text_lock(void);
+extern void kernel_text_unlock(void);
+
 #endif /* _LINUX_MEMORY_H_ */
Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c	2009-01-30 09:47:59.000000000 -0500
+++ linux-2.6-lttng/mm/memory.c	2009-01-30 10:26:01.000000000 -0500
@@ -48,6 +48,8 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/delayacct.h>
+#include <linux/kprobes.h>
+#include <linux/mutex.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
@@ -99,6 +101,12 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
+/*
+ * mutex protecting text section modification (dynamic code patching).
+ * some users need to sleep (allocating memory...) while they hold this lock.
+ */
+static DEFINE_MUTEX(text_mutex);
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
@@ -3192,3 +3200,29 @@ void might_fault(void)
 }
 EXPORT_SYMBOL(might_fault);
 #endif
+
+/**
+ * kernel_text_lock     -   Take the kernel text modification lock
+ *
+ * Ensures mutual write exclusion of live text modification of the kernel
+ * and modules. Should be used for code patching.
+ * Users of this lock can sleep.
+ */
+void __kprobes kernel_text_lock(void)
+{
+	mutex_lock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(kernel_text_lock);
+
+/**
+ * kernel_text_unlock   -   Release the kernel text modification lock
+ *
+ * Ensures mutual write exclusion of live text modification of the kernel
+ * and modules. Should be used for code patching.
+ * Users of this lock can sleep.
+ */
+void __kprobes kernel_text_unlock(void)
+{
+	mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(kernel_text_unlock);

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH] Text Edit Lock - kprobes architecture independent support
  2009-03-02 23:49                                                   ` Ingo Molnar
  2009-03-03  0:00                                                     ` Mathieu Desnoyers
  2009-03-03  0:00                                                     ` [PATCH] Text Edit Lock - Architecture Independent Code Mathieu Desnoyers
@ 2009-03-03  0:01                                                     ` Mathieu Desnoyers
  2009-03-03  0:10                                                       ` Masami Hiramatsu
  2009-03-03  0:05                                                     ` [RFC][PATCH] x86: make text_poke() atomic Masami Hiramatsu
  3 siblings, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03  0:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Masami Hiramatsu, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Use the mutual exclusion provided by the text edit lock in the kprobes code. It
allows coherent manipulation of the kernel code by other subsystems.

Changelog:

Move the kernel_text_lock/unlock out of the for loops.

(applies on 2.6.29-rc6)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
CC: ananth@in.ibm.com
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
CC: Roel Kluin <12o3l@tiscali.nl>
---
 kernel/kprobes.c |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c	2009-01-30 10:24:45.000000000 -0500
+++ linux-2.6-lttng/kernel/kprobes.c	2009-01-30 10:27:56.000000000 -0500
@@ -43,6 +43,7 @@
 #include <linux/seq_file.h>
 #include <linux/debugfs.h>
 #include <linux/kdebug.h>
+#include <linux/memory.h>
 
 #include <asm-generic/sections.h>
 #include <asm/cacheflush.h>
@@ -699,9 +700,10 @@ int __kprobes register_kprobe(struct kpr
 		goto out;
 	}
 
+	kernel_text_lock();
 	ret = arch_prepare_kprobe(p);
 	if (ret)
-		goto out;
+		goto out_unlock_text;
 
 	INIT_HLIST_NODE(&p->hlist);
 	hlist_add_head_rcu(&p->hlist,
@@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
 
 	if (kprobe_enabled)
 		arch_arm_kprobe(p);
-
+out_unlock_text:
+	kernel_text_unlock();
 out:
 	mutex_unlock(&kprobe_mutex);
 
@@ -746,8 +749,11 @@ valid_p:
 		 * enabled and not gone - otherwise, the breakpoint would
 		 * already have been removed. We save on flushing icache.
 		 */
-		if (kprobe_enabled && !kprobe_gone(old_p))
+		if (kprobe_enabled && !kprobe_gone(old_p)) {
+			kernel_text_lock();
 			arch_disarm_kprobe(p);
+			kernel_text_unlock();
+		}
 		hlist_del_rcu(&old_p->hlist);
 	} else {
 		if (p->break_handler && !kprobe_gone(p))
@@ -918,7 +924,6 @@ static int __kprobes pre_handler_kretpro
 		}
 
 		arch_prepare_kretprobe(ri, regs);
-
 		/* XXX(hch): why is there no hlist_move_head? */
 		INIT_HLIST_NODE(&ri->hlist);
 		kretprobe_table_lock(hash, &flags);
@@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
 	if (kprobe_enabled)
 		goto already_enabled;
 
+	kernel_text_lock();
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist)
 			if (!kprobe_gone(p))
 				arch_arm_kprobe(p);
 	}
+	kernel_text_unlock();
 
 	kprobe_enabled = true;
 	printk(KERN_INFO "Kprobes globally enabled\n");
@@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
 
 	kprobe_enabled = false;
 	printk(KERN_INFO "Kprobes globally disabled\n");
+	kernel_text_lock();
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist) {
@@ -1317,6 +1325,7 @@ static void __kprobes disable_all_kprobe
 				arch_disarm_kprobe(p);
 		}
 	}
+	kernel_text_unlock();
 
 	mutex_unlock(&kprobe_mutex);
 	/* Allow all currently running kprobes to complete */
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 23:49                                                   ` Ingo Molnar
                                                                       ` (2 preceding siblings ...)
  2009-03-03  0:01                                                     ` [PATCH] Text Edit Lock - kprobes architecture independent support Mathieu Desnoyers
@ 2009-03-03  0:05                                                     ` Masami Hiramatsu
  2009-03-03  0:22                                                       ` Ingo Molnar
  3 siblings, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-03  0:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt



Ingo Molnar wrote:
> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> 
>> Ingo Molnar wrote:
>>>>> So perhaps another approach to (re-)consider would be to go back 
>>>>> to atomic fixmaps here. It spends 3 slots but that's no big 
>>>>> deal.
>>>> Oh, it's a good idea! fixmaps must make it simpler.
>>>>
>>>>> In exchange it will be conceptually simpler, and will also scale 
>>>>> much better than a global spinlock. What do you think?
>>>> I think even if I use fixmaps, we have to use a spinlock to protect
>>>> the fixmap area from other threads...
>>> that's why i suggested to use an atomic-kmap, not a fixmap.
>> Even if the mapping is atomic, text_poke() has to protect pte
>> from other text_poke()s while changing code.
>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
> 
> Well, but text_poke() is not a serializing API to begin with. 
> It's normally used in code patching sequences when we 'know' 
> that there cannot be similar parallel activities. The kprobes 
> usage of text_poke() looks unsafe - and that needs to be fixed.

Oh, kprobes already prohibits parallel arming/disarming
by using kprobe_mutex. :-)

> So indeed a new global lock is needed there.
> 
> It's fixable and we'll fix it, but text_poke() is really more 
> complex than i'd like it to be.
> 
> stop_machine_run() is essentially instantaneous in practice and 
> obviously serializing so it warrants a second look at least. 
> Have you tried to use it in kprobes?

No, but it seems too costly for incremental use (registration)
of kprobes...

Thank you,

> 
> 	Ingo

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH] Text Edit Lock - kprobes architecture independent support
  2009-03-03  0:01                                                     ` [PATCH] Text Edit Lock - kprobes architecture independent support Mathieu Desnoyers
@ 2009-03-03  0:10                                                       ` Masami Hiramatsu
  0 siblings, 0 replies; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-03  0:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Mathieu Desnoyers wrote:
> Use the mutual exclusion provided by the text edit lock in the kprobes code. It
> allows coherent manipulation of the kernel code by other subsystems.

Oh, I see what you said...
This seems really useful.

Acked-by: Masami Hiramatsu <mhiramat@redhat.com>

> 
> Changelog:
> 
> Move the kernel_text_lock/unlock out of the for loops.
> 
> (applies on 2.6.29-rc6)
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> CC: ananth@in.ibm.com
> CC: anil.s.keshavamurthy@intel.com
> CC: davem@davemloft.net
> CC: Roel Kluin <12o3l@tiscali.nl>
> ---
>  kernel/kprobes.c |   17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6-lttng/kernel/kprobes.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/kprobes.c	2009-01-30 10:24:45.000000000 -0500
> +++ linux-2.6-lttng/kernel/kprobes.c	2009-01-30 10:27:56.000000000 -0500
> @@ -43,6 +43,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/debugfs.h>
>  #include <linux/kdebug.h>
> +#include <linux/memory.h>
>  
>  #include <asm-generic/sections.h>
>  #include <asm/cacheflush.h>
> @@ -699,9 +700,10 @@ int __kprobes register_kprobe(struct kpr
>  		goto out;
>  	}
>  
> +	kernel_text_lock();
>  	ret = arch_prepare_kprobe(p);
>  	if (ret)
> -		goto out;
> +		goto out_unlock_text;
>  
>  	INIT_HLIST_NODE(&p->hlist);
>  	hlist_add_head_rcu(&p->hlist,
> @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
>  
>  	if (kprobe_enabled)
>  		arch_arm_kprobe(p);
> -
> +out_unlock_text:
> +	kernel_text_unlock();
>  out:
>  	mutex_unlock(&kprobe_mutex);
>  
> @@ -746,8 +749,11 @@ valid_p:
>  		 * enabled and not gone - otherwise, the breakpoint would
>  		 * already have been removed. We save on flushing icache.
>  		 */
> -		if (kprobe_enabled && !kprobe_gone(old_p))
> +		if (kprobe_enabled && !kprobe_gone(old_p)) {
> +			kernel_text_lock();
>  			arch_disarm_kprobe(p);
> +			kernel_text_unlock();
> +		}
>  		hlist_del_rcu(&old_p->hlist);
>  	} else {
>  		if (p->break_handler && !kprobe_gone(p))
> @@ -918,7 +924,6 @@ static int __kprobes pre_handler_kretpro
>  		}
>  
>  		arch_prepare_kretprobe(ri, regs);
> -
>  		/* XXX(hch): why is there no hlist_move_head? */
>  		INIT_HLIST_NODE(&ri->hlist);
>  		kretprobe_table_lock(hash, &flags);
> @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
>  	if (kprobe_enabled)
>  		goto already_enabled;
>  
> +	kernel_text_lock();
>  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
>  		head = &kprobe_table[i];
>  		hlist_for_each_entry_rcu(p, node, head, hlist)
>  			if (!kprobe_gone(p))
>  				arch_arm_kprobe(p);
>  	}
> +	kernel_text_unlock();
>  
>  	kprobe_enabled = true;
>  	printk(KERN_INFO "Kprobes globally enabled\n");
> @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
>  
>  	kprobe_enabled = false;
>  	printk(KERN_INFO "Kprobes globally disabled\n");
> +	kernel_text_lock();
>  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
>  		head = &kprobe_table[i];
>  		hlist_for_each_entry_rcu(p, node, head, hlist) {
> @@ -1317,6 +1325,7 @@ static void __kprobes disable_all_kprobe
>  				arch_disarm_kprobe(p);
>  		}
>  	}
> +	kernel_text_unlock();
>  
>  	mutex_unlock(&kprobe_mutex);
>  	/* Allow all currently running kprobes to complete */

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-03  0:05                                                     ` [RFC][PATCH] x86: make text_poke() atomic Masami Hiramatsu
@ 2009-03-03  0:22                                                       ` Ingo Molnar
  2009-03-03  0:31                                                         ` Masami Hiramatsu
  0 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-03-03  0:22 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt


* Masami Hiramatsu <mhiramat@redhat.com> wrote:

> 
> 
> Ingo Molnar wrote:
> > * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> > 
> >> Ingo Molnar wrote:
> >>>>> So perhaps another approach to (re-)consider would be to go back 
> >>>>> to atomic fixmaps here. It spends 3 slots but that's no big 
> >>>>> deal.
> >>>> Oh, it's a good idea! fixmaps must make it simpler.
> >>>>
> >>>>> In exchange it will be conceptually simpler, and will also scale 
> >>>>> much better than a global spinlock. What do you think?
> >>>> I think even if I use fixmaps, we have to use a spinlock to protect
> >>>> the fixmap area from other threads...
> >>> that's why i suggested to use an atomic-kmap, not a fixmap.
> >> Even if the mapping is atomic, text_poke() has to protect pte
> >> from other text_poke()s while changing code.
> >> AFAIK, atomic-kmap itself doesn't ensure that, does it?
> > 
> > Well, but text_poke() is not a serializing API to begin with. 
> > It's normally used in code patching sequences when we 'know' 
> > that there cannot be similar parallel activities. The kprobes 
> > usage of text_poke() looks unsafe - and that needs to be fixed.
> 
> Oh, kprobes already prohibits parallel arming/disarming
> by using kprobe_mutex. :-)

yeah, but still the API is somewhat unsafe.

In any case, you also answered your own question:

> >> Even if the mapping is atomic, text_poke() has to protect pte
> >> from other text_poke()s while changing code.
> >> AFAIK, atomic-kmap itself doesn't ensure that, does it?

kprobe_mutex does that.

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-03  0:22                                                       ` Ingo Molnar
@ 2009-03-03  0:31                                                         ` Masami Hiramatsu
  2009-03-03 16:31                                                           ` [PATCH] x86: make text_poke() atomic using fixmap Masami Hiramatsu
  0 siblings, 1 reply; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-03  0:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Ingo Molnar wrote:
> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> 
>>
>> Ingo Molnar wrote:
>>> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
>>>
>>>> Ingo Molnar wrote:
>>>>>>> So perhaps another approach to (re-)consider would be to go back 
>>>>>>> to atomic fixmaps here. It spends 3 slots but that's no big 
>>>>>>> deal.
>>>>>> Oh, it's a good idea! fixmaps must make it simpler.
>>>>>>
>>>>>>> In exchange it will be conceptually simpler, and will also scale 
>>>>>>> much better than a global spinlock. What do you think?
>>>>>> I think even if I use fixmaps, we have to use a spinlock to protect
>>>>>> the fixmap area from other threads...
>>>>> that's why i suggested to use an atomic-kmap, not a fixmap.
>>>> Even if the mapping is atomic, text_poke() has to protect pte
>>>> from other text_poke()s while changing code.
>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
>>> Well, but text_poke() is not a serializing API to begin with. 
>>> It's normally used in code patching sequences when we 'know' 
>>> that there cannot be similar parallel activities. The kprobes 
>>> usage of text_poke() looks unsafe - and that needs to be fixed.
>> Oh, kprobes already prohibits parallel arming/disarming
>> by using kprobe_mutex. :-)
> 
> yeah, but still the API is somewhat unsafe.

Yeah, kprobe_mutex protects text_poke from other kprobes, but
not from other text_poke() users...

> In any case, you also answered your own question:
> 
>>>> Even if the mapping is atomic, text_poke() has to protect pte
>>>> from other text_poke()s while changing code.
>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
> 
> kprobe_mutex does that.

Anyway, text_edit_lock ensures that.

By the way, I think set_fixmap()/clear_fixmap() seem simpler than the
kmap_atomic() variant. Do you think improving kmap_atomic_prot() would
be better?
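
For concreteness, a sketch of the set_fixmap()-based direction (the
slot name FIX_TEXT_POKE0 and the helper name are assumptions for
illustration; set_fixmap()/clear_fixmap()/fix_to_virt() are the real
fixmap APIs):

#include <linux/string.h>
#include <asm/fixmap.h>
#include <asm/io.h>		/* page_to_phys() */
#include <asm/tlbflush.h>	/* local_flush_tlb() */

/* Patch 'len' bytes inside one page through a reserved fixmap slot. */
static void poke_via_fixmap_sketch(struct page *page, unsigned long offset,
				   const void *opcode, size_t len)
{
	char *vaddr;

	set_fixmap(FIX_TEXT_POKE0, page_to_phys(page));
	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
	memcpy(vaddr + offset, opcode, len);
	clear_fixmap(FIX_TEXT_POKE0);
	local_flush_tlb();	/* the mapping was local and temporary */
}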

> 
> 	Ingo

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH] Text Edit Lock - Architecture Independent Code
  2009-03-03  0:00                                                     ` [PATCH] Text Edit Lock - Architecture Independent Code Mathieu Desnoyers
@ 2009-03-03  0:32                                                       ` Ingo Molnar
  2009-03-03  0:39                                                         ` Mathieu Desnoyers
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-03-03  0:32 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra
  Cc: Masami Hiramatsu, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> +/*
> + * Take and release the kernel text modification lock, used for code patching.
> + * Users of this lock can sleep.
> + */
> +extern void kernel_text_lock(void);
> +extern void kernel_text_unlock(void);

Locking APIs with hidden semantics are very ugly. Remember 
lock_kernel()?

> +/*
> + * mutex protecting text section modification (dynamic code patching).
> + * some users need to sleep (allocating memory...) while they hold this lock.
> + */
> +static DEFINE_MUTEX(text_mutex);

Please update those sites to do an explicit:

	mutex_lock(&text_mutex);

instead.

That way we save a function call, and we'll also see exactly 
what type of lock is being taken, etc.

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH] Text Edit Lock - Architecture Independent Code
  2009-03-03  0:32                                                       ` Ingo Molnar
@ 2009-03-03  0:39                                                         ` Mathieu Desnoyers
  2009-03-03  1:30                                                         ` [PATCH] Text Edit Lock - Architecture Independent Code (v2) Mathieu Desnoyers
  2009-03-03  1:31                                                         ` [PATCH] Text Edit Lock - kprobes architecture independent support (v2) Mathieu Desnoyers
  2 siblings, 0 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03  0:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Masami Hiramatsu, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Arjan van de Ven, Rusty Russell, H. Peter Anvin, Steven Rostedt

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > +/*
> > + * Take and release the kernel text modification lock, used for code patching.
> > + * Users of this lock can sleep.
> > + */
> > +extern void kernel_text_lock(void);
> > +extern void kernel_text_unlock(void);
> 
> Locking APIs with hidden semantics are very ugly. Remember 
> lock_kernel()?
> 
> > +/*
> > + * mutex protecting text section modification (dynamic code patching).
> > + * some users need to sleep (allocating memory...) while they hold this lock.
> > + */
> > +static DEFINE_MUTEX(text_mutex);
> 
> Please update those sites to do an explicit:
> 
> 	mutex_lock(&text_mutex);
> 
> instead.
> 
> That way we save a function call, and we'll also see exactly 
> what type of lock is being taken, etc.
> 

OK. However, we'll have to export the text_mutex symbol and use it in
various locations. As long as we're fine with that, I'll provide an
updated patch.

Mathieu

> 	Ingo

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH] Text Edit Lock - Architecture Independent Code (v2)
  2009-03-03  0:32                                                       ` Ingo Molnar
  2009-03-03  0:39                                                         ` Mathieu Desnoyers
@ 2009-03-03  1:30                                                         ` Mathieu Desnoyers
  2009-03-03  1:31                                                         ` [PATCH] Text Edit Lock - kprobes architecture independent support (v2) Mathieu Desnoyers
  2 siblings, 0 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03  1:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Masami Hiramatsu, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Arjan van de Ven, Rusty Russell, H. Peter Anvin, Steven Rostedt

This is an architecture-independent synchronization around kernel text
modifications through the use of a global mutex.

A mutex has been chosen so that kprobes, the main user of this, can sleep during
memory allocation between the memory read of the instructions it must replace
and the memory write of the breakpoint.

Another user of this interface: immediate values.

Paravirt and alternatives are always done when SMP is inactive, so there is no
need to use locks.

Changelog:
Export text_mutex directly.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <andi@firstfloor.org>
CC: Ingo Molnar <mingo@elte.hu>
---
 include/linux/memory.h |    6 ++++++
 mm/memory.c            |    9 +++++++++
 2 files changed, 15 insertions(+)

Index: linux-2.6-lttng/include/linux/memory.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/memory.h	2009-03-02 13:13:35.000000000 -0500
+++ linux-2.6-lttng/include/linux/memory.h	2009-03-02 19:23:52.000000000 -0500
@@ -99,4 +99,10 @@ enum mem_add_context { BOOT, HOTPLUG };
 #define hotplug_memory_notifier(fn, pri) do { } while (0)
 #endif
 
+/*
+ * Kernel text modification mutex, used for code patching. Users of this lock
+ * can sleep.
+ */
+extern struct mutex text_mutex;
+
 #endif /* _LINUX_MEMORY_H_ */
Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c	2009-03-02 13:13:35.000000000 -0500
+++ linux-2.6-lttng/mm/memory.c	2009-03-02 19:24:33.000000000 -0500
@@ -48,6 +48,8 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/delayacct.h>
+#include <linux/kprobes.h>
+#include <linux/mutex.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
@@ -99,6 +101,13 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
+/*
+ * mutex protecting text section modification (dynamic code patching).
+ * some users need to sleep (allocating memory...) while they hold this lock.
+ */
+DEFINE_MUTEX(text_mutex);
+EXPORT_SYMBOL_GPL(text_mutex);
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH] Text Edit Lock - kprobes architecture independent support (v2)
  2009-03-03  0:32                                                       ` Ingo Molnar
  2009-03-03  0:39                                                         ` Mathieu Desnoyers
  2009-03-03  1:30                                                         ` [PATCH] Text Edit Lock - Architecture Independent Code (v2) Mathieu Desnoyers
@ 2009-03-03  1:31                                                         ` Mathieu Desnoyers
  2009-03-03  9:27                                                           ` Ingo Molnar
  2 siblings, 1 reply; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03  1:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Masami Hiramatsu, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Arjan van de Ven, Rusty Russell, H. Peter Anvin, Steven Rostedt

Use the mutual exclusion provided by the text edit lock in the kprobes code.
It allows coherent manipulation of the kernel code with respect to other
subsystems that patch text.

Changelog:

Move the kernel_text_lock/unlock out of the for loops.
Use text_mutex directly instead of a function.

(applies to 2.6.29-rc6)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
CC: ananth@in.ibm.com
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
CC: Roel Kluin <12o3l@tiscali.nl>
---
 kernel/kprobes.c |   18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c	2009-03-02 19:22:51.000000000 -0500
+++ linux-2.6-lttng/kernel/kprobes.c	2009-03-02 19:27:26.000000000 -0500
@@ -43,6 +43,7 @@
 #include <linux/seq_file.h>
 #include <linux/debugfs.h>
 #include <linux/kdebug.h>
+#include <linux/memory.h>
 
 #include <asm-generic/sections.h>
 #include <asm/cacheflush.h>
@@ -699,9 +700,10 @@ int __kprobes register_kprobe(struct kpr
 		goto out;
 	}
 
+	mutex_lock(&text_mutex);
 	ret = arch_prepare_kprobe(p);
 	if (ret)
-		goto out;
+		goto out_unlock_text;
 
 	INIT_HLIST_NODE(&p->hlist);
 	hlist_add_head_rcu(&p->hlist,
@@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
 
 	if (kprobe_enabled)
 		arch_arm_kprobe(p);
-
+out_unlock_text:
+	mutex_unlock(&text_mutex);
 out:
 	mutex_unlock(&kprobe_mutex);
 
@@ -746,8 +749,11 @@ valid_p:
 		 * enabled and not gone - otherwise, the breakpoint would
 		 * already have been removed. We save on flushing icache.
 		 */
-		if (kprobe_enabled && !kprobe_gone(old_p))
+		if (kprobe_enabled && !kprobe_gone(old_p)) {
+			mutex_lock(&text_mutex);
 			arch_disarm_kprobe(p);
+			mutex_unlock(&text_mutex);
+		}
 		hlist_del_rcu(&old_p->hlist);
 	} else {
 		if (p->break_handler && !kprobe_gone(p))
@@ -918,7 +924,6 @@ static int __kprobes pre_handler_kretpro
 		}
 
 		arch_prepare_kretprobe(ri, regs);
-
 		/* XXX(hch): why is there no hlist_move_head? */
 		INIT_HLIST_NODE(&ri->hlist);
 		kretprobe_table_lock(hash, &flags);
@@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
 	if (kprobe_enabled)
 		goto already_enabled;
 
+	mutex_lock(&text_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist)
 			if (!kprobe_gone(p))
 				arch_arm_kprobe(p);
 	}
+	mutex_unlock(&text_mutex);
 
 	kprobe_enabled = true;
 	printk(KERN_INFO "Kprobes globally enabled\n");
@@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
 
 	kprobe_enabled = false;
 	printk(KERN_INFO "Kprobes globally disabled\n");
+	mutex_lock(&text_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist) {
@@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
 				arch_disarm_kprobe(p);
 		}
 	}
-
+	mutex_unlock(&text_mutex);
 	mutex_unlock(&kprobe_mutex);
 	/* Allow all currently running kprobes to complete */
 	synchronize_sched();
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [RFC][PATCH] x86: make text_poke() atomic
  2009-03-02 17:01                                     ` [RFC][PATCH] x86: make text_poke() atomic Masami Hiramatsu
  2009-03-02 17:19                                       ` Mathieu Desnoyers
  2009-03-02 18:28                                       ` [RFC][PATCH] x86: make text_poke() atomic Arjan van de Ven
@ 2009-03-03  4:54                                       ` Nick Piggin
  2 siblings, 0 replies; 89+ messages in thread
From: Nick Piggin @ 2009-03-03  4:54 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Ingo Molnar, Andrew Morton, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

On Mon, Mar 02, 2009 at 12:01:29PM -0500, Masami Hiramatsu wrote:
> ---
> 
> Use map_vm_area() instead of vmap() in text_poke() for avoiding page allocation
> and delayed unmapping, and call vunmap_page_range() and local_flush_tlb()
> directly because this mapping is temporary and local.
> 
> At the result of above change, text_poke() becomes atomic and can be called
> from stop_machine() etc.
> 
> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Nick Piggin <npiggin@suse.de>
> ---
>  arch/x86/include/asm/alternative.h |    1 +
>  arch/x86/kernel/alternative.c      |   36 +++++++++++++++++++++++++++++-------
>  include/linux/vmalloc.h            |    1 +
>  init/main.c                        |    3 +++
>  mm/vmalloc.c                       |    2 +-
>  5 files changed, 35 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6/arch/x86/include/asm/alternative.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/alternative.h
> +++ linux-2.6/arch/x86/include/asm/alternative.h
> @@ -177,6 +177,7 @@ extern void add_nops(void *insns, unsign
>   * The _early version expects the memory to already be RW.
>   */
> 
> +extern void text_poke_init(void);
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> 
> Index: linux-2.6/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/alternative.c
> +++ linux-2.6/arch/x86/kernel/alternative.c
> @@ -12,6 +12,7 @@
>  #include <asm/nmi.h>
>  #include <asm/vsyscall.h>
>  #include <asm/cacheflush.h>
> +#include <asm/tlbflush.h>
>  #include <asm/io.h>
> 
>  #define MAX_PATCH_LEN (255-1)
> @@ -485,6 +486,16 @@ void *text_poke_early(void *addr, const
>  	return addr;
>  }
> 
> +static struct vm_struct *text_poke_area[2];
> +static DEFINE_SPINLOCK(text_poke_lock);
> +
> +void __init text_poke_init(void)
> +{
> +	text_poke_area[0] = get_vm_area(PAGE_SIZE, VM_ALLOC);
> +	text_poke_area[1] = get_vm_area(2 * PAGE_SIZE, VM_ALLOC);
> +	BUG_ON(!text_poke_area[0] || !text_poke_area[1]);
> +}
> +
>  /**
>   * text_poke - Update instructions on a live kernel
>   * @addr: address to modify
> @@ -501,8 +512,9 @@ void *__kprobes text_poke(void *addr, co
>  	unsigned long flags;
>  	char *vaddr;
>  	int nr_pages = 2;
> -	struct page *pages[2];
> -	int i;
> +	struct page *pages[2], **pgp = pages;
> +	int i, ret;
> +	struct vm_struct *vma;
> 
>  	if (!core_kernel_text((unsigned long)addr)) {
>  		pages[0] = vmalloc_to_page(addr);

This is really good....

> @@ -515,12 +527,22 @@ void *__kprobes text_poke(void *addr, co
>  	BUG_ON(!pages[0]);
>  	if (!pages[1])
>  		nr_pages = 1;
> -	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> -	BUG_ON(!vaddr);
        ^^^^^^^^^^^^^^
This really is a nasty bug in text_poke(), and I never knew why it was
allowed to live for so long!

Thanks,
Nick


* Re: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)
  2009-03-03  1:31                                                         ` [PATCH] Text Edit Lock - kprobes architecture independent support (v2) Mathieu Desnoyers
@ 2009-03-03  9:27                                                           ` Ingo Molnar
  2009-03-03 12:06                                                             ` Ananth N Mavinakayanahalli
  0 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-03-03  9:27 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Masami Hiramatsu, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Arjan van de Ven, Rusty Russell, H. Peter Anvin, Steven Rostedt


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
>  
>  	if (kprobe_enabled)
>  		arch_arm_kprobe(p);

hm, it's cleaner now, but there are serious locking dependency 
problems visible in the patch:

> -
> +out_unlock_text:
> +	mutex_unlock(&text_mutex);
>  out:
>  	mutex_unlock(&kprobe_mutex);

this one creates a (text_mutex -> kprobe_mutex) dependency. 
(also you removed a newline spuriously - don't do that)

> @@ -746,8 +749,11 @@ valid_p:
>  		 * enabled and not gone - otherwise, the breakpoint would
>  		 * already have been removed. We save on flushing icache.
>  		 */
> -		if (kprobe_enabled && !kprobe_gone(old_p))
> +		if (kprobe_enabled && !kprobe_gone(old_p)) {
> +			mutex_lock(&text_mutex);
>  			arch_disarm_kprobe(p);
> +			mutex_unlock(&text_mutex);
> +		}
>  		hlist_del_rcu(&old_p->hlist);

(kprobe_mutex -> text_mutex) dependency. AB-BA deadlock.

>  	} else {
>  		if (p->break_handler && !kprobe_gone(p))
> @@ -918,7 +924,6 @@ static int __kprobes pre_handler_kretpro
>  		}
>  
>  		arch_prepare_kretprobe(ri, regs);
> -

spurious (and wrong) newline removal.

>  		/* XXX(hch): why is there no hlist_move_head? */
>  		INIT_HLIST_NODE(&ri->hlist);
>  		kretprobe_table_lock(hash, &flags);
> @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
>  	if (kprobe_enabled)
>  		goto already_enabled;
>  
> +	mutex_lock(&text_mutex);
>  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
>  		head = &kprobe_table[i];
>  		hlist_for_each_entry_rcu(p, node, head, hlist)
>  			if (!kprobe_gone(p))
>  				arch_arm_kprobe(p);
>  	}
> +	mutex_unlock(&text_mutex);

this one creates a (kprobe_mutex -> text_mutex) dependency 
again.

> @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
>  
>  	kprobe_enabled = false;
>  	printk(KERN_INFO "Kprobes globally disabled\n");
> +	mutex_lock(&text_mutex);
>  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
>  		head = &kprobe_table[i];
>  		hlist_for_each_entry_rcu(p, node, head, hlist) {
> @@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
>  				arch_disarm_kprobe(p);
>  		}
>  	}
> -
> +	mutex_unlock(&text_mutex);
>  	mutex_unlock(&kprobe_mutex);

And this one in the reverse direction again.

The dependencies are totally wrong. The text lock (a low level 
lock) should nest inside the kprobes mutex (which is the higher 
level lock).
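
I.e. the correct nesting, as a minimal illustrative sketch:

	mutex_lock(&kprobe_mutex);	/* high-level lock: taken first  */
	mutex_lock(&text_mutex);	/* low-level lock: nested inside */
	arch_arm_kprobe(p);
	mutex_unlock(&text_mutex);
	mutex_unlock(&kprobe_mutex);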

Have you reviewed the locking dependencies when writing this 
patch, at all? That's one of the first things to do when adding 
a new lock.

	Ingo


* Re: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)
  2009-03-03  9:27                                                           ` Ingo Molnar
@ 2009-03-03 12:06                                                             ` Ananth N Mavinakayanahalli
  2009-03-03 14:28                                                               ` Mathieu Desnoyers
                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Ananth N Mavinakayanahalli @ 2009-03-03 12:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Peter Zijlstra, Masami Hiramatsu,
	Andrew Morton, Nick Piggin, Steven Rostedt, Andi Kleen,
	linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

On Tue, Mar 03, 2009 at 10:27:50AM +0100, Ingo Molnar wrote:
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr

Hi Ingo,

> >  	if (kprobe_enabled)
> >  		arch_arm_kprobe(p);
> 
> hm, it's cleaner now, but there's serious locking dependency 
> problems visible in the patch:
> 
> > -
> > +out_unlock_text:
> > +	mutex_unlock(&text_mutex);
> >  out:
> >  	mutex_unlock(&kprobe_mutex);
> 
> this one creates a (text_mutex -> kprobe_mutex) dependency. 
> (also you removed a newline spuriously - dont do that)

That is a mutex_unlock :-) ...

> > @@ -746,8 +749,11 @@ valid_p:
> >  		 * enabled and not gone - otherwise, the breakpoint would
> >  		 * already have been removed. We save on flushing icache.
> >  		 */
> > -		if (kprobe_enabled && !kprobe_gone(old_p))
> > +		if (kprobe_enabled && !kprobe_gone(old_p)) {
> > +			mutex_lock(&text_mutex);
> >  			arch_disarm_kprobe(p);
> > +			mutex_unlock(&text_mutex);
> > +		}
> >  		hlist_del_rcu(&old_p->hlist);
> 
> (kprobe_mutex -> text_mutex) dependency. AB-BA deadlock.

At this time the kprobe_mutex is already held.

...

> > @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
> >  	if (kprobe_enabled)
> >  		goto already_enabled;
> >  
> > +	mutex_lock(&text_mutex);
> >  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> >  		head = &kprobe_table[i];
> >  		hlist_for_each_entry_rcu(p, node, head, hlist)
> >  			if (!kprobe_gone(p))
> >  				arch_arm_kprobe(p);
> >  	}
> > +	mutex_unlock(&text_mutex);
> 
> this one creates a (kprobe_mutex -> text_mutex) dependency 
> again.

kprobe_mutex is held here too...

> > @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
> >  
> >  	kprobe_enabled = false;
> >  	printk(KERN_INFO "Kprobes globally disabled\n");
> > +	mutex_lock(&text_mutex);
> >  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> >  		head = &kprobe_table[i];
> >  		hlist_for_each_entry_rcu(p, node, head, hlist) {
> > @@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
> >  				arch_disarm_kprobe(p);
> >  		}
> >  	}
> > -
> > +	mutex_unlock(&text_mutex);
> >  	mutex_unlock(&kprobe_mutex);
> 
> And this one in the reverse direction again.

Unlock again :-)

> The dependencies are totally wrong. The text lock (a low level 
> lock) should nest inside the kprobes mutex (which is the higher 
> level lock).

From what I see, Mathieu has done just that and has gotten the order
right in all cases. Or maybe I am missing something?

(I recall having tested this patch with LOCKDEP turned on and it
didn't throw any errors).
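
Lockdep records a dependency only at lock *acquisition* time - taking B
while holding A records A -> B; the order of the mutex_unlock() calls
creates no dependency at all. A minimal sketch of the ordering the patch
actually establishes:

	mutex_lock(&kprobe_mutex);	/* outer lock, taken first            */
	mutex_lock(&text_mutex);	/* records kprobe_mutex -> text_mutex */
	arch_disarm_kprobe(p);
	mutex_unlock(&text_mutex);	/* unlock order is irrelevant         */
	mutex_unlock(&kprobe_mutex);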

Ananth


* Re: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)
  2009-03-03 12:06                                                             ` Ananth N Mavinakayanahalli
@ 2009-03-03 14:28                                                               ` Mathieu Desnoyers
  2009-03-03 14:33                                                               ` [PATCH] Text Edit Lock - kprobes architecture independent support (v3) Mathieu Desnoyers
  2009-03-03 14:53                                                               ` [PATCH] Text Edit Lock - kprobes architecture independent support (v2) Ingo Molnar
  2 siblings, 0 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03 14:28 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli
  Cc: Ingo Molnar, Peter Zijlstra, Masami Hiramatsu, Andrew Morton,
	Nick Piggin, Steven Rostedt, Andi Kleen, linux-kernel,
	Thomas Gleixner, Peter Zijlstra, Frederic Weisbecker,
	Linus Torvalds, Arjan van de Ven, Rusty Russell, H. Peter Anvin,
	Steven Rostedt

* Ananth N Mavinakayanahalli (ananth@in.ibm.com) wrote:
> On Tue, Mar 03, 2009 at 10:27:50AM +0100, Ingo Molnar wrote:
> > 
> > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > 
> > > @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
> 
> Hi Ingo,
> 
> > >  	if (kprobe_enabled)
> > >  		arch_arm_kprobe(p);
> > 
> > hm, it's cleaner now, but there's serious locking dependency 
> > problems visible in the patch:
> > 
> > > -
> > > +out_unlock_text:
> > > +	mutex_unlock(&text_mutex);
> > >  out:
> > >  	mutex_unlock(&kprobe_mutex);
> > 
> > this one creates a (text_mutex -> kprobe_mutex) dependency. 
> > (also you removed a newline spuriously - dont do that)
> 
> That is a mutex_unlock :-) ...
> 
> > > @@ -746,8 +749,11 @@ valid_p:
> > >  		 * enabled and not gone - otherwise, the breakpoint would
> > >  		 * already have been removed. We save on flushing icache.
> > >  		 */
> > > -		if (kprobe_enabled && !kprobe_gone(old_p))
> > > +		if (kprobe_enabled && !kprobe_gone(old_p)) {
> > > +			mutex_lock(&text_mutex);
> > >  			arch_disarm_kprobe(p);
> > > +			mutex_unlock(&text_mutex);
> > > +		}
> > >  		hlist_del_rcu(&old_p->hlist);
> > 
> > (kprobe_mutex -> text_mutex) dependency. AB-BA deadlock.
> 
> At this time the kprobe_mutex is already held.
> 
> ...
> 
> > > @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
> > >  	if (kprobe_enabled)
> > >  		goto already_enabled;
> > >  
> > > +	mutex_lock(&text_mutex);
> > >  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > >  		head = &kprobe_table[i];
> > >  		hlist_for_each_entry_rcu(p, node, head, hlist)
> > >  			if (!kprobe_gone(p))
> > >  				arch_arm_kprobe(p);
> > >  	}
> > > +	mutex_unlock(&text_mutex);
> > 
> > this one creates a (kprobe_mutex -> text_mutex) dependency 
> > again.
> 
> kprobe_mutex his held here too...
> 
> > > @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
> > >  
> > >  	kprobe_enabled = false;
> > >  	printk(KERN_INFO "Kprobes globally disabled\n");
> > > +	mutex_lock(&text_mutex);
> > >  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > >  		head = &kprobe_table[i];
> > >  		hlist_for_each_entry_rcu(p, node, head, hlist) {
> > > @@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
> > >  				arch_disarm_kprobe(p);
> > >  		}
> > >  	}
> > > -
> > > +	mutex_unlock(&text_mutex);
> > >  	mutex_unlock(&kprobe_mutex);
> > 
> > And this one in the reverse direction again.
> 
> Unlock again :-)
> 
> > The dependencies are totally wrong. The text lock (a low level 
> > lock) should nest inside the kprobes mutex (which is the higher 
> > level lock).
> 
> From what I see, Mathieu has done just that and has gotten the order
> right in all cases. Or maybe I am missing something?
> 
> (I recall having tested this patch with LOCKDEP turned on and it
> din't throw any errors).
> 

Yes, I even moved all kprobe_mutex usage out of
arch_arm_kprobe/arch_disarm_kprobe a while ago in preparation for this
patch. :) I can repost without the whitespace modifications.

Mathieu

> Ananth

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* [PATCH] Text Edit Lock - kprobes architecture independent support (v3)
  2009-03-03 12:06                                                             ` Ananth N Mavinakayanahalli
  2009-03-03 14:28                                                               ` Mathieu Desnoyers
@ 2009-03-03 14:33                                                               ` Mathieu Desnoyers
  2009-03-03 14:53                                                               ` [PATCH] Text Edit Lock - kprobes architecture independent support (v2) Ingo Molnar
  2 siblings, 0 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03 14:33 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli, Ingo Molnar
  Cc: Peter Zijlstra, Masami Hiramatsu, Andrew Morton, Nick Piggin,
	Steven Rostedt, Andi Kleen, linux-kernel, Thomas Gleixner,
	Peter Zijlstra, Frederic Weisbecker, Linus Torvalds,
	Arjan van de Ven, Rusty Russell, H. Peter Anvin, Steven Rostedt

Use the mutual exclusion provided by the text edit lock in the kprobes code.
It allows coherent manipulation of the kernel code with respect to other
subsystems that patch text.

Changelog:

Move the kernel_text_lock/unlock out of the for loops.
Use text_mutex directly instead of a function.
Remove whitespace modifications.

(note: kprobe_mutex is always taken outside of text_mutex)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
CC: ananth@in.ibm.com
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
CC: Roel Kluin <12o3l@tiscali.nl>
---
 kernel/kprobes.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===================================================================
--- linux-2.6-lttng.orig/kernel/kprobes.c	2009-03-02 19:22:51.000000000 -0500
+++ linux-2.6-lttng/kernel/kprobes.c	2009-03-03 09:11:17.000000000 -0500
@@ -43,6 +43,7 @@
 #include <linux/seq_file.h>
 #include <linux/debugfs.h>
 #include <linux/kdebug.h>
+#include <linux/memory.h>
 
 #include <asm-generic/sections.h>
 #include <asm/cacheflush.h>
@@ -699,9 +700,10 @@ int __kprobes register_kprobe(struct kpr
 		goto out;
 	}
 
+	mutex_lock(&text_mutex);
 	ret = arch_prepare_kprobe(p);
 	if (ret)
-		goto out;
+		goto out_unlock_text;
 
 	INIT_HLIST_NODE(&p->hlist);
 	hlist_add_head_rcu(&p->hlist,
@@ -710,6 +712,8 @@ int __kprobes register_kprobe(struct kpr
 	if (kprobe_enabled)
 		arch_arm_kprobe(p);
 
+out_unlock_text:
+	mutex_unlock(&text_mutex);
 out:
 	mutex_unlock(&kprobe_mutex);
 
@@ -746,8 +750,11 @@ valid_p:
 		 * enabled and not gone - otherwise, the breakpoint would
 		 * already have been removed. We save on flushing icache.
 		 */
-		if (kprobe_enabled && !kprobe_gone(old_p))
+		if (kprobe_enabled && !kprobe_gone(old_p)) {
+			mutex_lock(&text_mutex);
 			arch_disarm_kprobe(p);
+			mutex_unlock(&text_mutex);
+		}
 		hlist_del_rcu(&old_p->hlist);
 	} else {
 		if (p->break_handler && !kprobe_gone(p))
@@ -1280,12 +1287,14 @@ static void __kprobes enable_all_kprobes
 	if (kprobe_enabled)
 		goto already_enabled;
 
+	mutex_lock(&text_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist)
 			if (!kprobe_gone(p))
 				arch_arm_kprobe(p);
 	}
+	mutex_unlock(&text_mutex);
 
 	kprobe_enabled = true;
 	printk(KERN_INFO "Kprobes globally enabled\n");
@@ -1310,6 +1319,7 @@ static void __kprobes disable_all_kprobe
 
 	kprobe_enabled = false;
 	printk(KERN_INFO "Kprobes globally disabled\n");
+	mutex_lock(&text_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
 		hlist_for_each_entry_rcu(p, node, head, hlist) {
@@ -1318,6 +1328,7 @@ static void __kprobes disable_all_kprobe
 		}
 	}
 
+	mutex_unlock(&text_mutex);
 	mutex_unlock(&kprobe_mutex);
 	/* Allow all currently running kprobes to complete */
 	synchronize_sched();

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] Text Edit Lock - kprobes architecture independent support (v2)
  2009-03-03 12:06                                                             ` Ananth N Mavinakayanahalli
  2009-03-03 14:28                                                               ` Mathieu Desnoyers
  2009-03-03 14:33                                                               ` [PATCH] Text Edit Lock - kprobes architecture independent support (v3) Mathieu Desnoyers
@ 2009-03-03 14:53                                                               ` Ingo Molnar
  2 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2009-03-03 14:53 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli
  Cc: Mathieu Desnoyers, Peter Zijlstra, Masami Hiramatsu,
	Andrew Morton, Nick Piggin, Steven Rostedt, Andi Kleen,
	linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt


* Ananth N Mavinakayanahalli <ananth@in.ibm.com> wrote:

> On Tue, Mar 03, 2009 at 10:27:50AM +0100, Ingo Molnar wrote:
> > 
> > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > 
> > > @@ -709,7 +711,8 @@ int __kprobes register_kprobe(struct kpr
> 
> Hi Ingo,
> 
> > >  	if (kprobe_enabled)
> > >  		arch_arm_kprobe(p);
> > 
> > hm, it's cleaner now, but there's serious locking dependency 
> > problems visible in the patch:
> > 
> > > -
> > > +out_unlock_text:
> > > +	mutex_unlock(&text_mutex);
> > >  out:
> > >  	mutex_unlock(&kprobe_mutex);
> > 
> > this one creates a (text_mutex -> kprobe_mutex) dependency. 
> > (also you removed a newline spuriously - dont do that)
> 
> That is a mutex_unlock :-) ...
> 
> > > @@ -746,8 +749,11 @@ valid_p:
> > >  		 * enabled and not gone - otherwise, the breakpoint would
> > >  		 * already have been removed. We save on flushing icache.
> > >  		 */
> > > -		if (kprobe_enabled && !kprobe_gone(old_p))
> > > +		if (kprobe_enabled && !kprobe_gone(old_p)) {
> > > +			mutex_lock(&text_mutex);
> > >  			arch_disarm_kprobe(p);
> > > +			mutex_unlock(&text_mutex);
> > > +		}
> > >  		hlist_del_rcu(&old_p->hlist);
> > 
> > (kprobe_mutex -> text_mutex) dependency. AB-BA deadlock.
> 
> At this time the kprobe_mutex is already held.
> 
> ...
> 
> > > @@ -1280,12 +1285,14 @@ static void __kprobes enable_all_kprobes
> > >  	if (kprobe_enabled)
> > >  		goto already_enabled;
> > >  
> > > +	mutex_lock(&text_mutex);
> > >  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > >  		head = &kprobe_table[i];
> > >  		hlist_for_each_entry_rcu(p, node, head, hlist)
> > >  			if (!kprobe_gone(p))
> > >  				arch_arm_kprobe(p);
> > >  	}
> > > +	mutex_unlock(&text_mutex);
> > 
> > this one creates a (kprobe_mutex -> text_mutex) dependency 
> > again.
> 
> kprobe_mutex his held here too...
> 
> > > @@ -1310,6 +1317,7 @@ static void __kprobes disable_all_kprobe
> > >  
> > >  	kprobe_enabled = false;
> > >  	printk(KERN_INFO "Kprobes globally disabled\n");
> > > +	mutex_lock(&text_mutex);
> > >  	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
> > >  		head = &kprobe_table[i];
> > >  		hlist_for_each_entry_rcu(p, node, head, hlist) {
> > > @@ -1317,7 +1325,7 @@ static void __kprobes disable_all_kprobe
> > >  				arch_disarm_kprobe(p);
> > >  		}
> > >  	}
> > > -
> > > +	mutex_unlock(&text_mutex);
> > >  	mutex_unlock(&kprobe_mutex);
> > 
> > And this one in the reverse direction again.
> 
> Unlock again :-)
> 
> > The dependencies are totally wrong. The text lock (a low level 
> > lock) should nest inside the kprobes mutex (which is the higher 
> > level lock).
> 
> From what I see, Mathieu has done just that and has gotten the 
> order right in all cases. Or maybe I am missing something?

No, it's fine indeed, I got the locking order messed up ... 
twice :-)

	Ingo


* [PATCH] x86: make text_poke() atomic using fixmap
  2009-03-03  0:31                                                         ` Masami Hiramatsu
@ 2009-03-03 16:31                                                           ` Masami Hiramatsu
  2009-03-03 17:08                                                             ` Mathieu Desnoyers
  2009-03-05 10:38                                                             ` Ingo Molnar
  0 siblings, 2 replies; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-03 16:31 UTC (permalink / raw)
  To: Ingo Molnar, Mathieu Desnoyers
  Cc: Andrew Morton, Nick Piggin, Steven Rostedt, Andi Kleen,
	linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Masami Hiramatsu wrote:
> Ingo Molnar wrote:
>> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
>>
>>> Ingo Molnar wrote:
>>>> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
>>>>
>>>>> Ingo Molnar wrote:
>>>>>>>> So perhaps another approach to (re-)consider would be to go back 
>>>>>>>> to atomic fixmaps here. It spends 3 slots but that's no big 
>>>>>>>> deal.
>>>>>>> Oh, it's a good idea! fixmaps must make it simpler.
>>>>>>>
>>>>>>>> In exchange it will be conceptually simpler, and will also scale 
>>>>>>>> much better than a global spinlock. What do you think?
>>>>>>> I think even if I use fixmaps, we have to use a spinlock to protect
>>>>>>> the fixmap area from other threads...
>>>>>> that's why i suggested to use an atomic-kmap, not a fixmap.
>>>>> Even if the mapping is atomic, text_poke() has to protect pte
>>>>> from other text_poke()s while changing code.
>>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
>>>> Well, but text_poke() is not a serializing API to begin with. 
>>>> It's normally used in code patching sequences when we 'know' 
>>>> that there cannot be similar parallel activities. The kprobes 
>>>> usage of text_poke() looks unsafe - and that needs to be fixed.
>>> Oh, kprobes already prohibited parallel arming/disarming
>>> by using kprobe_mutex. :-)
>> yeah, but still the API is somewhat unsafe.
> 
> Yeah, kprobe_mutex protects text_poke from other kprobes, but
> not from other text_poke() users...
> 
>> In any case, you also answered your own question:
>>
>>>>> Even if the mapping is atomic, text_poke() has to protect pte
>>>>> from other text_poke()s while changing code.
>>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
>> kprobe_mutex does that.
> 
> Anyway, text_edit_lock ensures that.
> 
> By the way, I think set_fixmap/clear_fixmap seems simpler than
> kmap_atomic() variant. Would you think improving kmap_atomic_prot()
> is better?

Hi Ingo,

Here is the patch which uses fixmaps instead of vmap in text_poke().
This made the code much simpler than I thought :).

Thanks,

----
Use fixmaps instead of vmap/vunmap in text_poke() to avoid page allocation
and delayed unmapping.

As a result of the above change, text_poke() becomes atomic and can be called
from stop_machine() etc.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 arch/x86/include/asm/fixmap_32.h |    2 ++
 arch/x86/include/asm/fixmap_64.h |    2 ++
 arch/x86/kernel/alternative.c    |   18 ++++++++++++------
 3 files changed, 16 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/include/asm/fixmap_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/fixmap_32.h
+++ linux-2.6/arch/x86/include/asm/fixmap_32.h
@@ -81,6 +81,8 @@ enum fixed_addresses {
 #ifdef CONFIG_PARAVIRT
 	FIX_PARAVIRT_BOOTMAP,
 #endif
+	FIX_TEXT_POKE0,	/* reserve 2 pages for text_poke() */
+	FIX_TEXT_POKE1,
 	__end_of_permanent_fixed_addresses,
 	/*
 	 * 256 temporary boot-time mappings, used by early_ioremap(),
Index: linux-2.6/arch/x86/include/asm/fixmap_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/fixmap_64.h
+++ linux-2.6/arch/x86/include/asm/fixmap_64.h
@@ -49,6 +49,8 @@ enum fixed_addresses {
 #ifdef CONFIG_PARAVIRT
 	FIX_PARAVIRT_BOOTMAP,
 #endif
+	FIX_TEXT_POKE0,	/* reserve 2 pages for text_poke() */
+	FIX_TEXT_POKE1,
 	__end_of_permanent_fixed_addresses,
 #ifdef CONFIG_ACPI
 	FIX_ACPI_BEGIN,
Index: linux-2.6/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/alternative.c
+++ linux-2.6/arch/x86/kernel/alternative.c
@@ -12,7 +12,9 @@
 #include <asm/nmi.h>
 #include <asm/vsyscall.h>
 #include <asm/cacheflush.h>
+#include <asm/tlbflush.h>
 #include <asm/io.h>
+#include <asm/fixmap.h>

 #define MAX_PATCH_LEN (255-1)

@@ -495,12 +497,13 @@ void *text_poke_early(void *addr, const
  * It means the size must be writable atomically and the address must be aligned
  * in a way that permits an atomic write. It also makes sure we fit on a single
  * page.
+ *
+ * Note: Must be called under text_mutex.
  */
 void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
 {
 	unsigned long flags;
 	char *vaddr;
-	int nr_pages = 2;
 	struct page *pages[2];
 	int i;

@@ -513,14 +516,17 @@ void *__kprobes text_poke(void *addr, co
 		pages[1] = virt_to_page(addr + PAGE_SIZE);
 	}
 	BUG_ON(!pages[0]);
-	if (!pages[1])
-		nr_pages = 1;
-	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
-	BUG_ON(!vaddr);
+	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
+	if (pages[1])
+		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
+	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
 	local_irq_save(flags);
 	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
 	local_irq_restore(flags);
-	vunmap(vaddr);
+	clear_fixmap(FIX_TEXT_POKE0);
+	if (pages[1])
+		clear_fixmap(FIX_TEXT_POKE1);
+	local_flush_tlb();
 	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
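
For reference, a minimal hypothetical caller sketch (addr and brk are
illustrative names); as the comment added above notes, text_poke() must
be called with text_mutex held:

	mutex_lock(&text_mutex);
	text_poke(addr, &brk, 1);	/* e.g. write a breakpoint byte */
	mutex_unlock(&text_mutex);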

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com



* Re: [PATCH] x86: make text_poke() atomic using fixmap
  2009-03-03 16:31                                                           ` [PATCH] x86: make text_poke() atomic using fixmap Masami Hiramatsu
@ 2009-03-03 17:08                                                             ` Mathieu Desnoyers
  2009-03-05 10:38                                                             ` Ingo Molnar
  1 sibling, 0 replies; 89+ messages in thread
From: Mathieu Desnoyers @ 2009-03-03 17:08 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
> Masami Hiramatsu wrote:
> > Ingo Molnar wrote:
> >> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> >>
> >>> Ingo Molnar wrote:
> >>>> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> >>>>
> >>>>> Ingo Molnar wrote:
> >>>>>>>> So perhaps another approach to (re-)consider would be to go back 
> >>>>>>>> to atomic fixmaps here. It spends 3 slots but that's no big 
> >>>>>>>> deal.
> >>>>>>> Oh, it's a good idea! fixmaps must make it simpler.
> >>>>>>>
> >>>>>>>> In exchange it will be conceptually simpler, and will also scale 
> >>>>>>>> much better than a global spinlock. What do you think?
> >>>>>>> I think even if I use fixmaps, we have to use a spinlock to protect
> >>>>>>> the fixmap area from other threads...
> >>>>>> that's why i suggested to use an atomic-kmap, not a fixmap.
> >>>>> Even if the mapping is atomic, text_poke() has to protect pte
> >>>>> from other text_poke()s while changing code.
> >>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
> >>>> Well, but text_poke() is not a serializing API to begin with. 
> >>>> It's normally used in code patching sequences when we 'know' 
> >>>> that there cannot be similar parallel activities. The kprobes 
> >>>> usage of text_poke() looks unsafe - and that needs to be fixed.
> >>> Oh, kprobes already prohibited parallel arming/disarming
> >>> by using kprobe_mutex. :-)
> >> yeah, but still the API is somewhat unsafe.
> > 
> > Yeah, kprobe_mutex protects text_poke from other kprobes, but
> > not from other text_poke() users...
> > 
> >> In any case, you also answered your own question:
> >>
> >>>>> Even if the mapping is atomic, text_poke() has to protect pte
> >>>>> from other text_poke()s while changing code.
> >>>>> AFAIK, atomic-kmap itself doesn't ensure that, does it?
> >> kprobe_mutex does that.
> > 
> > Anyway, text_edit_lock ensures that.
> > 
> > By the way, I think set_fixmap/clear_fixmap seems simpler than
> > kmap_atomic() variant. Would you think improving kmap_atomic_prot()
> > is better?
> 
> Hi Ingo,
> 
> Here is the patch which uses fixmaps instead of vmap in text_poke().
> This made the code much simpler than I thought :).
> 
> Thanks,
> 
> ----
> Use fixmaps instead of vmap/vunmap in text_poke() for avoiding page allocation
> and delayed unmapping.
> 
> At the result of above change, text_poke() becomes atomic and can be called
> from stop_machine() etc.
> 

It looks great, thanks!

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> ---
>  arch/x86/include/asm/fixmap_32.h |    2 ++
>  arch/x86/include/asm/fixmap_64.h |    2 ++
>  arch/x86/kernel/alternative.c    |   18 ++++++++++++------
>  3 files changed, 16 insertions(+), 6 deletions(-)
> 
> Index: linux-2.6/arch/x86/include/asm/fixmap_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/fixmap_32.h
> +++ linux-2.6/arch/x86/include/asm/fixmap_32.h
> @@ -81,6 +81,8 @@ enum fixed_addresses {
>  #ifdef CONFIG_PARAVIRT
>  	FIX_PARAVIRT_BOOTMAP,
>  #endif
> +	FIX_TEXT_POKE0,	/* reserve 2 pages for text_poke() */
> +	FIX_TEXT_POKE1,
>  	__end_of_permanent_fixed_addresses,
>  	/*
>  	 * 256 temporary boot-time mappings, used by early_ioremap(),
> Index: linux-2.6/arch/x86/include/asm/fixmap_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/fixmap_64.h
> +++ linux-2.6/arch/x86/include/asm/fixmap_64.h
> @@ -49,6 +49,8 @@ enum fixed_addresses {
>  #ifdef CONFIG_PARAVIRT
>  	FIX_PARAVIRT_BOOTMAP,
>  #endif
> +	FIX_TEXT_POKE0,	/* reserve 2 pages for text_poke() */
> +	FIX_TEXT_POKE1,
>  	__end_of_permanent_fixed_addresses,
>  #ifdef CONFIG_ACPI
>  	FIX_ACPI_BEGIN,
> Index: linux-2.6/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/alternative.c
> +++ linux-2.6/arch/x86/kernel/alternative.c
> @@ -12,7 +12,9 @@
>  #include <asm/nmi.h>
>  #include <asm/vsyscall.h>
>  #include <asm/cacheflush.h>
> +#include <asm/tlbflush.h>
>  #include <asm/io.h>
> +#include <asm/fixmap.h>
> 
>  #define MAX_PATCH_LEN (255-1)
> 
> @@ -495,12 +497,13 @@ void *text_poke_early(void *addr, const
>   * It means the size must be writable atomically and the address must be aligned
>   * in a way that permits an atomic write. It also makes sure we fit on a single
>   * page.
> + *
> + * Note: Must be called under text_mutex.
>   */
>  void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
>  {
>  	unsigned long flags;
>  	char *vaddr;
> -	int nr_pages = 2;
>  	struct page *pages[2];
>  	int i;
> 
> @@ -513,14 +516,17 @@ void *__kprobes text_poke(void *addr, co
>  		pages[1] = virt_to_page(addr + PAGE_SIZE);
>  	}
>  	BUG_ON(!pages[0]);
> -	if (!pages[1])
> -		nr_pages = 1;
> -	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> -	BUG_ON(!vaddr);
> +	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
> +	if (pages[1])
> +		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
> +	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
>  	local_irq_save(flags);
>  	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
>  	local_irq_restore(flags);
> -	vunmap(vaddr);
> +	clear_fixmap(FIX_TEXT_POKE0);
> +	if (pages[1])
> +		clear_fixmap(FIX_TEXT_POKE1);
> +	local_flush_tlb();
>  	sync_core();
>  	/* Could also do a CLFLUSH here to speed up CPU recovery; but
>  	   that causes hangs on some VIA CPUs. */
> 
> -- 
> Masami Hiramatsu
> 
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
> 
> e-mail: mhiramat@redhat.com
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: [PATCH] x86: make text_poke() atomic using fixmap
  2009-03-03 16:31                                                           ` [PATCH] x86: make text_poke() atomic using fixmap Masami Hiramatsu
  2009-03-03 17:08                                                             ` Mathieu Desnoyers
@ 2009-03-05 10:38                                                             ` Ingo Molnar
  2009-03-06 14:06                                                               ` Ingo Molnar
  1 sibling, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-03-05 10:38 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt


* Masami Hiramatsu <mhiramat@redhat.com> wrote:

> Hi Ingo,
> 
> Here is the patch which uses fixmaps instead of vmap in 
> text_poke(). This made the code much simpler than I thought 
> :).

Looks good to me at a quick glance, albeit Linus had second 
thoughts about using fixmaps for this in the past. But with 
delayed-flush for vmaps I think fixmaps are again the simpler 
and more robust - albeit more limited - choice ...

In any case, the x86 tree already unified fixmap.h so could you 
please resend the whole series as a 0/3, 1/3, 2/3, 3/3 thing 
against tip:master, starting a new thread on lkml? (this thread 
is already way too deep)

	Ingo


* Re: [PATCH] x86: make text_poke() atomic using fixmap
  2009-03-05 10:38                                                             ` Ingo Molnar
@ 2009-03-06 14:06                                                               ` Ingo Molnar
  2009-03-06 14:49                                                                 ` Masami Hiramatsu
  0 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2009-03-06 14:06 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
> 
> > Hi Ingo,
> > 
> > Here is the patch which uses fixmaps instead of vmap in 
> > text_poke(). This made the code much simpler than I thought 
> > :).
> 
> Looks good to me at a quick glance albeit Linus had second 
> thoughts about using fixmaps for this in the past. But with 
> delayed-flush for vmaps i think fixmaps are again the simpler 
> and more robust - albeit more limited - choice ...
> 
> In any case, the x86 tree already unified fixmap.h so could 
> you please resend the whole series as a 0/3, 1/3, 2/3, 3/3 
> thing against tip:master, starting a new thread on lkml? (this 
> thread is already way too deep)

Ping? I think there's agreement and it would be nice to fix this 
in .30. Looks too complex for .29 - maybe backportable to .29.1 
if it stays problem-free in testing.

	Ingo


* Re: [PATCH] x86: make text_poke() atomic using fixmap
  2009-03-06 14:06                                                               ` Ingo Molnar
@ 2009-03-06 14:49                                                                 ` Masami Hiramatsu
  0 siblings, 0 replies; 89+ messages in thread
From: Masami Hiramatsu @ 2009-03-06 14:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Andrew Morton, Nick Piggin, Steven Rostedt,
	Andi Kleen, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Frederic Weisbecker, Linus Torvalds, Arjan van de Ven,
	Rusty Russell, H. Peter Anvin, Steven Rostedt

Hi Ingo,

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> * Masami Hiramatsu <mhiramat@redhat.com> wrote:
>>
>>> Hi Ingo,
>>>
>>> Here is the patch which uses fixmaps instead of vmap in 
>>> text_poke(). This made the code much simpler than I thought 
>>> :).
>> Looks good to me at a quick glance albeit Linus had second 
>> thoughts about using fixmaps for this in the past. But with 
>> delayed-flush for vmaps i think fixmaps are again the simpler 
>> and more robust - albeit more limited - choice ...
>>
>> In any case, the x86 tree already unified fixmap.h so could 
>> you please resend the whole series as a 0/3, 1/3, 2/3, 3/3 
>> thing against tip:master, starting a new thread on lkml? (this 
>> thread is already way too deep)
> 
> Ping? I think there's agreement and it would be nice to fix this 
> in .30. Looks too complex for .29 - maybe backportable to .29.1 
> if it stays problem-free in testing.

Sorry for the delay, I'll post it as soon as possible.

> 
> 	Ingo

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


