* [PATCH 1/2] x86: Add a __copy_from_user_nmi
@ 2015-10-19 22:54 Andi Kleen
  2015-10-19 22:54 ` [PATCH 2/2] x86, perf: Optimize stack walk user accesses Andi Kleen
  0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2015-10-19 22:54 UTC (permalink / raw)
  To: x86; +Cc: peterz, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add an inlined __ variant of copy_from_user_nmi. The inlined variant
allows the caller to:

- batch the access_ok() check for multiple accesses,
- avoid a pagefault_disable()/enable() pair on every access when the
  caller already guarantees that page faults are disabled by its context,
- get all the optimizations in copy_*_user() for small, constant-sized
  transfers.

It is just a #define alias for __copy_from_user_inatomic.
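
For illustration, a caller that batches the checks might look like this
(a hypothetical helper, not code from this patch; the two-word frame
layout is just an example):

/*
 * Hypothetical usage sketch: one access_ok() check and one
 * pagefault_disable() section cover several small, constant-sized
 * __copy_from_user_nmi() calls, each of which the compiler can turn
 * into a single load.
 */
static int read_user_frame(const void __user *fp,
			   unsigned long *next, unsigned long *ret)
{
	int err = -EFAULT;

	if (!access_ok(VERIFY_READ, fp, 2 * sizeof(unsigned long)))
		return err;

	pagefault_disable();
	if (!__copy_from_user_nmi(next, fp, sizeof(*next)) &&
	    !__copy_from_user_nmi(ret, fp + sizeof(*next), sizeof(*ret)))
		err = 0;
	pagefault_enable();

	return err;
}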

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/uaccess.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a8df874..1d0766c 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -745,5 +745,14 @@ copy_to_user(void __user *to, const void *from, unsigned long n)
 #undef __copy_from_user_overflow
 #undef __copy_to_user_overflow
 
+/*
+ * We rely on the nested NMI work to allow atomic faults from the NMI path; the
+ * nested NMI paths are careful to preserve CR2.
+ *
+ * Caller must use pagefault_enable/disable, or run in interrupt context,
+ * and also do an access_ok() check
+ */
+#define __copy_from_user_nmi __copy_from_user_inatomic
+
 #endif /* _ASM_X86_UACCESS_H */
 
-- 
2.4.3


* [PATCH 2/2] x86, perf: Optimize stack walk user accesses
  2015-10-19 22:54 [PATCH 1/2] x86: Add a __copy_from_user_nmi Andi Kleen
@ 2015-10-19 22:54 ` Andi Kleen
  2015-10-20 11:03   ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2015-10-19 22:54 UTC (permalink / raw)
  To: x86; +Cc: peterz, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Change the perf user stack walking to use the new __copy_from_user_nmi,
and split each access into word-sized transfers. This allows the complete
access to be inlined and optimized down to a single load.
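
For reference, the split sizes follow the two-field frame layouts the
stack walker reads (reproduced here as assumed definitions, roughly
matching the x86 stacktrace header):

/* 64-bit user frame: 8-byte next-frame pointer plus 8-byte return
 * address, hence access_ok(..., 16) and two 8-byte copies. */
struct stack_frame {
	struct stack_frame	*next_frame;
	unsigned long		return_address;
};

/* compat (32-bit) user frame: 4 + 4 bytes, hence access_ok(..., 8)
 * and two 4-byte copies in perf_callchain_user32(). */
struct stack_frame_ia32 {
	u32	next_frame;
	u32	return_address;
};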

The main advantage is that this avoids the overhead of double page faults.
When the normal copy_from_user fails, it re-executes the copy to compute
an accurate count of the bytes it could not copy, which means taking
the expensive page fault twice.
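
Conceptually the fallback looks like the sketch below (a simplified
illustration, not the actual x86 fixup code): the byte-wise tail copy
that recomputes the uncopied count walks straight back into the
faulting page.

/*
 * Simplified illustration only (hypothetical helper): after the fast
 * copy faults, the fallback recomputes the uncopied byte count one
 * byte at a time and so runs into the same fault a second time.
 */
static unsigned long copy_tail_sketch(char *to, const char __user *from,
				      unsigned long n)
{
	char c;

	while (n) {
		if (__get_user(c, from))	/* fault-checked byte access */
			break;			/* second fault taken here */
		*to++ = c;
		from++;
		n--;
	}
	return n;				/* bytes left uncopied */
}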

While walking stacks, a fault at some point is relatively common
(typically when some part of the program isn't compiled with frame
pointers), so this is a significant overhead.

The optimized copies avoid this problem because they perform each access
only once. They are also much faster in the non-faulting case, since
they are single instructions rather than complex function calls.

When profiling a kernel build with -g, the patch brings the average time
of the PMI handler down from 966ns to 552ns (-43%).

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4562cf0..15bf8b3 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2255,7 +2255,12 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 		frame.next_frame     = 0;
 		frame.return_address = 0;
 
-		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
+		if (!access_ok(VERIFY_READ, fp, 8))
+			break;
+		bytes = __copy_from_user_nmi(&frame.next_frame, fp, 4);
+		if (bytes != 0)
+			break;
+		bytes = __copy_from_user_nmi(&frame.return_address, fp+4, 4);
 		if (bytes != 0)
 			break;
 
@@ -2307,7 +2312,13 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 		frame.next_frame	     = NULL;
 		frame.return_address = 0;
 
-		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
+		if (!access_ok(VERIFY_READ, fp, 16))
+			break;
+
+		bytes = __copy_from_user_nmi(&frame.next_frame, fp, 8);
+		if (bytes != 0)
+			break;
+		bytes = __copy_from_user_nmi(&frame.return_address, fp+8, 8);
 		if (bytes != 0)
 			break;
 
-- 
2.4.3


* Re: [PATCH 2/2] x86, perf: Optimize stack walk user accesses
  2015-10-19 22:54 ` [PATCH 2/2] x86, perf: Optimize stack walk user accesses Andi Kleen
@ 2015-10-20 11:03   ` Peter Zijlstra
  2015-10-20 17:32     ` Andi Kleen
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2015-10-20 11:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, Andi Kleen

On Mon, Oct 19, 2015 at 03:54:29PM -0700, Andi Kleen wrote:
> @@ -2307,7 +2312,13 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
>  		frame.next_frame	     = NULL;
>  		frame.return_address = 0;
>  
> -		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
> +		if (!access_ok(VERIFY_READ, fp, 16))
> +			break;
> +
> +		bytes = __copy_from_user_nmi(&frame.next_frame, fp, 8);
> +		if (bytes != 0)
> +			break;
> +		bytes = __copy_from_user_nmi(&frame.return_address, fp+8, 8);
>  		if (bytes != 0)
>  			break;
>  

The previous patch that introduces this function states that any caller
must have pagefault_disable() or be from interrupt context.

Perf can call this function from !interrupt context (imagine a
tracepoint or other software event); should we therefore not add a
pagefault_disable()/enable() pair around the entire while() loop?

* Re: [PATCH 2/2] x86, perf: Optimize stack walk user accesses
  2015-10-20 11:03   ` Peter Zijlstra
@ 2015-10-20 17:32     ` Andi Kleen
  0 siblings, 0 replies; 6+ messages in thread
From: Andi Kleen @ 2015-10-20 17:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, x86, linux-kernel, Andi Kleen

> Perf can call this function from !interrupt context (imagine a
> tracepoint or other software event), should we therefore not add a
> pagefault_disable()/enable() pair around the entire while() loop?

That's true. I'll add the pagefault_disable/enable.

-Andi

* [PATCH 2/2] x86, perf: Optimize stack walk user accesses
  2015-11-16 23:23 [PATCH 1/2] x86: Add a __copy_from_user_nmi Andi Kleen
@ 2015-11-16 23:23 ` Andi Kleen
  0 siblings, 0 replies; 6+ messages in thread
From: Andi Kleen @ 2015-11-16 23:23 UTC (permalink / raw)
  To: peterz; +Cc: x86, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Change the perf user stack walking to use the new __copy_from_user_nmi,
and split each access into word-sized transfers. This allows the
complete access to be inlined and optimized down to a single load.

The main advantage is that this avoids the overhead of double page
faults.  When the normal copy_from_user fails, it re-executes the copy
to compute an accurate count of the bytes it could not copy, which
means taking the expensive page fault twice.

While walking stacks, a fault at some point is relatively common
(typically when some part of the program isn't compiled with frame
pointers), so this is a significant overhead.

The optimized copies avoid this problem because they perform each
access only once. They are also much faster in the non-faulting case,
since they are single instructions rather than complex function calls.

When profiling a kernel build with -g, the patch brings the average
time of the PMI handler down from 966ns to 552ns (-43%).

v2:
Disable page faults explicitly to handle software trace points.
Fix a sparse warning.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index a74fab5..c80740a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2253,12 +2253,18 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 	ss_base = get_segment_base(regs->ss);
 
 	fp = compat_ptr(ss_base + regs->bp);
+	pagefault_disable();
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
 		frame.next_frame     = 0;
 		frame.return_address = 0;
 
-		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
+		if (!access_ok(VERIFY_READ, fp, 8))
+			break;
+		bytes = __copy_from_user_nmi(&frame.next_frame, fp, 4);
+		if (bytes != 0)
+			break;
+		bytes = __copy_from_user_nmi(&frame.return_address, fp+4, 4);
 		if (bytes != 0)
 			break;
 
@@ -2268,6 +2274,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
+	pagefault_enable();
 	return 1;
 }
 #else
@@ -2305,12 +2312,19 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 	if (perf_callchain_user32(regs, entry))
 		return;
 
+	pagefault_disable();
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
 		frame.next_frame	     = NULL;
 		frame.return_address = 0;
 
-		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
+		if (!access_ok(VERIFY_READ, fp, 16))
+			break;
+
+		bytes = __copy_from_user_nmi(&frame.next_frame, fp, 8);
+		if (bytes != 0)
+			break;
+		bytes = __copy_from_user_nmi(&frame.return_address, fp+8, 8);
 		if (bytes != 0)
 			break;
 
@@ -2318,8 +2332,9 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 			break;
 
 		perf_callchain_store(entry, frame.return_address);
-		fp = frame.next_frame;
+		fp = (void __user *)frame.next_frame;
 	}
+	pagefault_enable();
 }
 
 /*
-- 
2.4.3


* [PATCH 2/2] x86, perf: Optimize stack walk user accesses
  2015-10-22 22:07 [PATCH 1/2] x86: Add a __copy_from_user_nmi Andi Kleen
@ 2015-10-22 22:07 ` Andi Kleen
  0 siblings, 0 replies; 6+ messages in thread
From: Andi Kleen @ 2015-10-22 22:07 UTC (permalink / raw)
  To: peterz; +Cc: mingo, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Change the perf user stack walking to use the new __copy_from_user_nmi,
and split each access into word-sized transfers. This allows the
complete access to be inlined and optimized down to a single load.

The main advantage is that this avoids the overhead of double page
faults.  When the normal copy_from_user fails, it re-executes the copy
to compute an accurate count of the bytes it could not copy, which
means taking the expensive page fault twice.

While walking stacks, a fault at some point is relatively common
(typically when some part of the program isn't compiled with frame
pointers), so this is a significant overhead.

The optimized copies avoid this problem because they perform each
access only once. They are also much faster in the non-faulting case,
since they are single instructions rather than complex function calls.

When profiling a kernel build with -g, the patch brings the average
time of the PMI handler down from 966ns to 552ns (-43%).

v2:
Disable page faults explicitly to handle software trace points.
Fix a sparse warning.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index a74fab5..c80740a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2253,12 +2253,18 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 	ss_base = get_segment_base(regs->ss);
 
 	fp = compat_ptr(ss_base + regs->bp);
+	pagefault_disable();
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
 		frame.next_frame     = 0;
 		frame.return_address = 0;
 
-		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
+		if (!access_ok(VERIFY_READ, fp, 8))
+			break;
+		bytes = __copy_from_user_nmi(&frame.next_frame, fp, 4);
+		if (bytes != 0)
+			break;
+		bytes = __copy_from_user_nmi(&frame.return_address, fp+4, 4);
 		if (bytes != 0)
 			break;
 
@@ -2268,6 +2274,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
+	pagefault_enable();
 	return 1;
 }
 #else
@@ -2305,12 +2312,19 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 	if (perf_callchain_user32(regs, entry))
 		return;
 
+	pagefault_disable();
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
 		frame.next_frame	     = NULL;
 		frame.return_address = 0;
 
-		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
+		if (!access_ok(VERIFY_READ, fp, 16))
+			break;
+
+		bytes = __copy_from_user_nmi(&frame.next_frame, fp, 8);
+		if (bytes != 0)
+			break;
+		bytes = __copy_from_user_nmi(&frame.return_address, fp+8, 8);
 		if (bytes != 0)
 			break;
 
@@ -2318,8 +2332,9 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 			break;
 
 		perf_callchain_store(entry, frame.return_address);
-		fp = frame.next_frame;
+		fp = (void __user *)frame.next_frame;
 	}
+	pagefault_enable();
 }
 
 /*
-- 
2.4.3

