LKML Archive on lore.kernel.org
 help / color / Atom feed
* [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
@ 2012-07-06  6:20 tip-bot for Peter Zijlstra
  2012-07-06 16:29 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-07-06  6:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, torvalds, a.p.zijlstra,
	fweisbec, akpm, tglx

Commit-ID:  ce5c1fe9a9e059b5c58f0a7e2a3e687d0efac815
Gitweb:     http://git.kernel.org/tip/ce5c1fe9a9e059b5c58f0a7e2a3e687d0efac815
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Wed, 20 Jun 2012 11:11:38 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 5 Jul 2012 20:59:07 +0200

perf/x86: Fix USER/KERNEL tagging of samples

Several perf interrupt handlers (PEBS,IBS,BTS) re-write regs->ip but
do not update the segment registers. So use an regs->ip based test
instead of an regs->cs/regs->flags based test.

Reported-and-tested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-xxrt0a1zronm1sm36obwc2vy@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index c4706cf..6ef9d41 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1863,7 +1863,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 		else
 			misc |= PERF_RECORD_MISC_GUEST_KERNEL;
 	} else {
-		if (user_mode(regs))
+		if (!kernel_ip(regs->ip))
 			misc |= PERF_RECORD_MISC_USER;
 		else
 			misc |= PERF_RECORD_MISC_KERNEL;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-06  6:20 [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples tip-bot for Peter Zijlstra
@ 2012-07-06 16:29 ` Linus Torvalds
  2012-07-06 18:12   ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2012-07-06 16:29 UTC (permalink / raw)
  To: mingo, hpa, eranian, linux-kernel, a.p.zijlstra, torvalds,
	fweisbec, akpm, tglx
  Cc: linux-tip-commits

On Thu, Jul 5, 2012 at 11:20 PM, tip-bot for Peter Zijlstra
<a.p.zijlstra@chello.nl> wrote:
>
> Several perf interrupt handlers (PEBS,IBS,BTS) re-write regs->ip but
> do not update the segment registers. So use an regs->ip based test
> instead of an regs->cs/regs->flags based test.

Christ, people, YOU CANNOT DO THIS!

It is never *ever* valid to test the IP to see if you're in kernel
space or user space. People can do various odd segments etc, the IP is
totally meaningless.

If the perf handlers fake the IP information, they had better fake the
CS/eflags information too. Because it is *wrong* to look at IP. Don't
do it.

                Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-06 16:29 ` Linus Torvalds
@ 2012-07-06 18:12   ` Peter Zijlstra
  2012-07-06 18:16     ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2012-07-06 18:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits

On Fri, 2012-07-06 at 09:29 -0700, Linus Torvalds wrote:
> On Thu, Jul 5, 2012 at 11:20 PM, tip-bot for Peter Zijlstra
> <a.p.zijlstra@chello.nl> wrote:
> >
> > Several perf interrupt handlers (PEBS,IBS,BTS) re-write regs->ip but
> > do not update the segment registers. So use an regs->ip based test
> > instead of an regs->cs/regs->flags based test.
> 
> Christ, people, YOU CANNOT DO THIS!
> 
> It is never *ever* valid to test the IP to see if you're in kernel
> space or user space. People can do various odd segments etc, the IP is
> totally meaningless.
> 
> If the perf handlers fake the IP information, they had better fake the
> CS/eflags information too. Because it is *wrong* to look at IP. Don't
> do it.

PEBS, BTS, LBR don't have CS. PEBS does have eflags.

If we cannot do this I'm not sure what we can do :/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-06 18:12   ` Peter Zijlstra
@ 2012-07-06 18:16     ` Linus Torvalds
  2012-07-06 18:34       ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2012-07-06 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits

On Fri, Jul 6, 2012 at 11:12 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> PEBS, BTS, LBR don't have CS. PEBS does have eflags.
>
> If we cannot do this I'm not sure what we can do :/

Well, you're passed in a "pt_regs". Which *does* have CS, and has it right.

If some code then changes the values in the pt_regs, it is *that* code
that needs to think twice about what it does. Where is that code?

                 Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-06 18:16     ` Linus Torvalds
@ 2012-07-06 18:34       ` Linus Torvalds
  2012-07-06 20:48         ` Peter Zijlstra
  2012-07-09 11:23         ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2012-07-06 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits

On Fri, Jul 6, 2012 at 11:16 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If some code then changes the values in the pt_regs, it is *that* code
> that needs to think twice about what it does. Where is that code?

>From a quick grep it looks like it is __intel_pmu_pebs_event() that does this.

THAT is where you would possibly have a huge honking big comment about
how you have to fake the CS register contents because the PEBS
information is incomplete. But make it clear that it is a total hack.

Also, somebody should check. Is the PEBS information *actually* the
instruction pointer (address within the code segment), or is it the
"linear address" (segment base + rip)? I hope it is the latter,
because in the absense of CS, the segment-based address is very
unclear indeed.

And if it *is* the linear address, then at that point you could do

   regs->cs = kernel_ip(ip) ? __KERNEL_CS : __USER_CS;
   regs->eflags &= ~X86_EFLAGS_VM;

and document this as a "we fake the CS and vm86 mode, using the known
zero-based code segments". At that point it would be technically
correct.

But any code that does "kernel_ip(regs->ip)" is just terminally
confused and can never be sane.

                  Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-06 18:34       ` Linus Torvalds
@ 2012-07-06 20:48         ` Peter Zijlstra
  2012-07-06 20:59           ` Linus Torvalds
  2012-07-09 11:23         ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2012-07-06 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter

On Fri, 2012-07-06 at 11:34 -0700, Linus Torvalds wrote:
> On Fri, Jul 6, 2012 at 11:16 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > If some code then changes the values in the pt_regs, it is *that* code
> > that needs to think twice about what it does. Where is that code?
> 
> From a quick grep it looks like it is __intel_pmu_pebs_event() that does this.

Correct, as well as perf_ibs_handle_irq() and I'll have to have a good
look at the BTS stuff.

> THAT is where you would possibly have a huge honking big comment about
> how you have to fake the CS register contents because the PEBS
> information is incomplete. But make it clear that it is a total hack.

OK.

> Also, somebody should check. Is the PEBS information *actually* the
> instruction pointer (address within the code segment), or is it the
> "linear address" (segment base + rip)? I hope it is the latter,
> because in the absense of CS, the segment-based address is very
> unclear indeed.

I _think_ it is just the RIP since its part of a more general register
dump (which does not include the segment registers).

I can't find anything in the SDM clarifying it one way or the other.
Stephane do you know?

> And if it *is* the linear address, then at that point you could do
> 
>    regs->cs = kernel_ip(ip) ? __KERNEL_CS : __USER_CS;
>    regs->eflags &= ~X86_EFLAGS_VM;
> 
> and document this as a "we fake the CS and vm86 mode, using the known
> zero-based code segments". At that point it would be technically
> correct.

OK, thanks.. At least the AMB IBS case provides the Linear address of
the instruction so the IBS code should do it this way. We'll need to
check with Intel to see what PEBS does.

Supposing the worst case and Intel simply provides the RIP as is, what
should we do then?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-06 20:48         ` Peter Zijlstra
@ 2012-07-06 20:59           ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2012-07-06 20:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter

On Fri, Jul 6, 2012 at 1:48 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2012-07-06 at 11:34 -0700, Linus Torvalds wrote:
>> Also, somebody should check. Is the PEBS information *actually* the
>> instruction pointer (address within the code segment), or is it the
>> "linear address" (segment base + rip)? I hope it is the latter,
>> because in the absense of CS, the segment-based address is very
>> unclear indeed.
>
> I _think_ it is just the RIP since its part of a more general register
> dump (which does not include the segment registers).

Ugh. That would be really sad if true (and I would not be at all
surprised if it *is* true). Sure, flat segments are standard, and once
you get into 64-bit mode I don't think you can even do any CS base,
but afaik the code there is supposed to work on x86-32 too, no?

> Supposing the worst case and Intel simply provides the RIP as is, what
> should we do then?

Well, if the RIP is all you have, the RIP is all you can use, and you
have to just assume flat segments. That will be true in practice, and
if it means that you can't sanely profile vm86 mode etc on x86-32,
that's sad but you have to blame the hardware for that.

And I don't have any huge problems with assuming flat segments and
just "work in the simple cases".

What I reacted negatively to is literally that

    kernel_ip(regs->ip)

test, because once you have "regs", that's absolutely the wrong thing
to do. That kind of test just shows that somebody filled in "regs"
incorrectly. So if you have a pt_regs pointer, that kind of test is
simply crap, and indicates a bug somewhere.

             Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-06 18:34       ` Linus Torvalds
  2012-07-06 20:48         ` Peter Zijlstra
@ 2012-07-09 11:23         ` Peter Zijlstra
  2012-07-09 17:55           ` Linus Torvalds
  2012-07-09 18:41           ` [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples Ingo Molnar
  1 sibling, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2012-07-09 11:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter

On Fri, 2012-07-06 at 11:34 -0700, Linus Torvalds wrote:

> But any code that does "kernel_ip(regs->ip)" is just terminally
> confused and can never be sane.

How about something like the below?

I've also modified perf_instruction_pointer() to account for the VM86
and IA32 non-zero segment base cases. At least, I tried to do so, I've
never had the 'pleasure' of poking at this segment descriptor stuff
before.

Ingo didn't really like doing that though, his suggestion was to kill
all those IPs by mapping them to a special value (~0UL or so).

---
Subject: perf/x86: Fix USER/KERNEL tagging of samples properly

Some PMUs don't provide a full register set for their sample,
specifically 'advanced' PMUs like AMD IBS and Intel PEBS which provide
'better' then regular interrupt accuracy.

In this case we use the interrupt regs as basis and over-write some
fields (typically IP) with different information.

The perf core however users user_mode() to distinguish user/kernel
samples, user_mode() relies on regs->cs. If the interrupt skid pushed us
over a boundary the new IP might not be in the same domain as the
interrupt. 

Commit ce5c1fe9a9e ("perf/x86: Fix USER/KERNEL tagging of samples")
tried to fix this by making the perf core use kernel_ip(). This however
is wrong (TM), as pointed out by Linus, since it doesn't allow for VM86
and non-zero based segments in IA32 mode.

Therefore, provide a new helper to set the regs->ip field,
set_linear_ip(), which massages the regs into a suitable state assuming
the provided IP is in fact a linear address.

Also modify perf_instruction_pointer() to deal with these 'fun' cases.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c          | 58 ++++++++++++++++++++++++++++---
 arch/x86/kernel/cpu/perf_event.h          | 19 ++++++++++
 arch/x86/kernel/cpu/perf_event_amd_ibs.c  |  4 ++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  6 ++--
 4 files changed, 79 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 29557aa..03a474c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -32,6 +32,8 @@
 #include <asm/smp.h>
 #include <asm/alternative.h>
 #include <asm/timer.h>
+#include <asm/desc.h>
+#include <asm/ldt.h>
 
 #include "perf_event.h"
 
@@ -1816,14 +1818,62 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 	}
 }
 
+static unsigned long get_segment_base(unsigned int segment)
+{
+	struct desc_struct *desc;
+	int idx = segment >> 3;
+
+	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+		if (idx > LDT_ENTRIES)
+			return 0;
+
+		desc = current->active_mm->context.ldt;
+	} else {
+		if (idx > GDT_ENTRIES)
+			return 0;
+
+		desc = __this_cpu_ptr(&gdt_page.gdt[0]);
+	}
+
+	return get_desc_base(desc + idx);
+}
+
+static unsigned long code_segment_base(struct pt_regs *regs)
+{
+#ifdef CONFIG_32BIT
+	if (user_mode(regs) && regs->cs != __USER_CS)
+		return get_segment_base(regs->cs);
+#else
+	if (test_thread_flag(TIF_IA32)) {
+		if (user_mode(regs) && regs->cs != __USER32_CS)
+			return get_segment_base(regs->cs);
+	}
+#endif
+	return 0;
+}
+
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
 	unsigned long ip;
 
 	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-		ip = perf_guest_cbs->get_guest_ip();
-	else
-		ip = instruction_pointer(regs);
+		return perf_guest_cbs->get_guest_ip();
+
+	ip = regs->ip;
+
+	if (regs->flags & X86_VM_MASK) {
+		/*
+		 * If we are in VM86 mode, add the segment offset to convert to
+		 * a linear address.
+		 */
+		ip += 0x10 * regs->cs;
+	} else {
+		/*
+		 * For IA32 we look at the GDT/LDT segment base to convert the
+		 * effective IP to a linear address.
+		 */
+		ip += code_segment_base(regs);
+	}
 
 	return ip;
 }
@@ -1838,7 +1888,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 		else
 			misc |= PERF_RECORD_MISC_GUEST_KERNEL;
 	} else {
-		if (!kernel_ip(regs->ip))
+		if (user_mode(regs))
 			misc |= PERF_RECORD_MISC_USER;
 		else
 			misc |= PERF_RECORD_MISC_KERNEL;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index a15df4b..71fa4c6 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -516,6 +516,25 @@ static inline bool kernel_ip(unsigned long ip)
 #endif
 }
 
+/*
+ * Not all PMUs provide the right context information to place the reported IP
+ * into full context. Specifically segment registers are typically not
+ * supplied.
+ *
+ * Assuming the address is a linear address (it is for IBS), we fake the CS and
+ * vm86 mode using the known zero-based code segment and 'fix up' the registers
+ * to reflect this.
+ *
+ * Intel PEBS/LBR appear to typically provide the effective address, nothing
+ * much we can do about that but pray and treat it like a linear address.
+ */
+static inline void set_linear_ip(struct pt_regs *regs, unsigned long ip)
+{
+	regs->cs = kernel_ip(ip) ? __KERNEL_CS : __USER_CS;
+	regs->flags &= ~X86_VM_MASK;
+	regs->ip = ip;
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index da9bcdc..7bfb5be 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -13,6 +13,8 @@
 
 #include <asm/apic.h>
 
+#include "perf_event.h"
+
 static u32 ibs_caps;
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
@@ -536,7 +538,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	if (check_rip && (ibs_data.regs[2] & IBS_RIP_INVALID)) {
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 	} else {
-		instruction_pointer_set(&regs, ibs_data.regs[1]);
+		set_linear_ip(&regs, ibs_data.regs[1]);
 		regs.flags |= PERF_EFLAGS_EXACT;
 	}
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 629ae0b..0549fa9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -499,7 +499,7 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 	 * We sampled a branch insn, rewind using the LBR stack
 	 */
 	if (ip == to) {
-		regs->ip = from;
+		set_linear_ip(regs, from);
 		return 1;
 	}
 
@@ -529,7 +529,7 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 	} while (to < ip);
 
 	if (to == ip) {
-		regs->ip = old_to;
+		set_linear_ip(regs, old_to);
 		return 1;
 	}
 
@@ -569,7 +569,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * A possible PERF_SAMPLE_REGS will have to transfer all regs.
 	 */
 	regs = *iregs;
-	regs.ip = pebs->ip;
+	set_linear_ip(&regs, pebs->ip);
 	regs.bp = pebs->bp;
 	regs.sp = pebs->sp;
 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-09 11:23         ` Peter Zijlstra
@ 2012-07-09 17:55           ` Linus Torvalds
  2012-07-10  9:02             ` Peter Zijlstra
  2012-07-09 18:41           ` [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples Ingo Molnar
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2012-07-09 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter

On Mon, Jul 9, 2012 at 4:23 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> How about something like the below?
>
> I've also modified perf_instruction_pointer() to account for the VM86
> and IA32 non-zero segment base cases. At least, I tried to do so, I've
> never had the 'pleasure' of poking at this segment descriptor stuff
> before.

The code base calculations look correct to me per se, although I'd put
the VM86 check into code_segment_base(). But whatever.

> Ingo didn't really like doing that though, his suggestion was to kill
> all those IPs by mapping them to a special value (~0UL or so).

Hmm. I don't care much one way or another, but I don't really see the
advantage of doing that.

The expensive part is the checking - since in all normal cases we'll
be using the standard CS and not be in vm86 mode. And you have to do
the checking whether you then map it to ~0ul or do a proper segment
base calculation. So why not just do it right? Your code seems sane
enough to me.

Now "regs" always contains a proper register state, and when the perf
code wants a linear address, perf_instruction_pointer() seems to do
the right translation. Looks good to me, although I didn't check all
the uses.

However, it is worth pointing out that sp/bp have exactly the same
segment base issue. So if you do stack tracing into user mode, you
should really do the same thing for those. And quite frankly, at that
point vm86 mode and the stack segment matters in other ways than just
the base pointer: a 16-bit stack segment acts fundamentally
differently from a 32-bit one. So at that point it may well make much
more sense to take the approach Ingo suggests, and simply not follow
stack frames at all.

Anyway, ACK for that patch from me. It may not solve all the problems,
but it looks ok.

                    Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-09 11:23         ` Peter Zijlstra
  2012-07-09 17:55           ` Linus Torvalds
@ 2012-07-09 18:41           ` Ingo Molnar
  2012-07-10  7:54             ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2012-07-09 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2012-07-06 at 11:34 -0700, Linus Torvalds wrote:
> 
> > But any code that does "kernel_ip(regs->ip)" is just 
> > terminally confused and can never be sane.
> 
> How about something like the below?
> 
> I've also modified perf_instruction_pointer() to account for 
> the VM86 and IA32 non-zero segment base cases. At least, I 
> tried to do so, I've never had the 'pleasure' of poking at 
> this segment descriptor stuff before.
> 
> Ingo didn't really like doing that though, his suggestion was 
> to kill all those IPs by mapping them to a special value (~0UL 
> or so).

So, my main worry is that the complexity/actual_use ratio feels 
rather high. Very few (if any) people will explicitly test the 
profiling of segmented x86 code - and even if they sample, 
chances are that it's a Windows COFF/who-knows binary that we 
don't symbol-decode in user-space at the moment.

Open coded calculations like this are easy to get wrong:

> +static unsigned long get_segment_base(unsigned int segment)
> +{
> +	struct desc_struct *desc;
> +	int idx = segment >> 3;
> +
> +	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
> +		if (idx > LDT_ENTRIES)
> +			return 0;
> +
> +		desc = current->active_mm->context.ldt;
> +	} else {
> +		if (idx > GDT_ENTRIES)
> +			return 0;
> +
> +		desc = __this_cpu_ptr(&gdt_page.gdt[0]);
> +	}
> +
> +	return get_desc_base(desc + idx);

Shouldn't idx be checked against active_mm->context.ldt.size, 
not LDT_ENTRIES (which is really just an upper limit)?

> +static unsigned long code_segment_base(struct pt_regs *regs)
> +{
> +#ifdef CONFIG_32BIT
> +	if (user_mode(regs) && regs->cs != __USER_CS)
> +		return get_segment_base(regs->cs);
> +#else
> +	if (test_thread_flag(TIF_IA32)) {
> +		if (user_mode(regs) && regs->cs != __USER32_CS)
> +			return get_segment_base(regs->cs);
> +	}
> +#endif
> +	return 0;
> +}

Will this do the right thing for x32 mode?

>  unsigned long perf_instruction_pointer(struct pt_regs *regs)
>  {
>  	unsigned long ip;
>  
>  	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
> -		ip = perf_guest_cbs->get_guest_ip();
> -	else
> -		ip = instruction_pointer(regs);
> +		return perf_guest_cbs->get_guest_ip();
> +
> +	ip = regs->ip;
> +
> +	if (regs->flags & X86_VM_MASK) {
> +		/*
> +		 * If we are in VM86 mode, add the segment offset to convert to
> +		 * a linear address.
> +		 */
> +		ip += 0x10 * regs->cs;

Sweet nostalgic memories ;-)

> +	} else {
> +		/*
> +		 * For IA32 we look at the GDT/LDT segment base to convert the
> +		 * effective IP to a linear address.
> +		 */
> +		ip += code_segment_base(regs);
> +	}

I'm also not entirely sure about skid across context switches 
and all that, the idx might not relate to the current LDT 
anymore - but I suspect we can ignore that problem.

( Another race is skid across descriptor updates - fortunately 
  sys_modify_ldt() is thick enough to be a practical barrier 
  against that and we were never crazy enough to mmap() portions 
  of the LDT to user-space or so. )

But no big fundamental objections from me, it would just be 
awfully nice to double check all the boundary conditions in this 
new code.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-09 18:41           ` [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples Ingo Molnar
@ 2012-07-10  7:54             ` Peter Zijlstra
  2012-07-10  8:02               ` Ingo Molnar
  2012-07-10  8:21               ` Ingo Molnar
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2012-07-10  7:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter

On Mon, 2012-07-09 at 20:41 +0200, Ingo Molnar wrote:
> > +static unsigned long get_segment_base(unsigned int segment)
> > +{
> > +     struct desc_struct *desc;
> > +     int idx = segment >> 3;
> > +
> > +     if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
> > +             if (idx > LDT_ENTRIES)
> > +                     return 0;
> > +
> > +             desc = current->active_mm->context.ldt;
> > +     } else {
> > +             if (idx > GDT_ENTRIES)
> > +                     return 0;
> > +
> > +             desc = __this_cpu_ptr(&gdt_page.gdt[0]);
> > +     }
> > +
> > +     return get_desc_base(desc + idx);
> 
> Shouldn't idx be checked against active_mm->context.ldt.size, 
> not LDT_ENTRIES (which is really just an upper limit)?

Ah indeed, fixed that.

> > +static unsigned long code_segment_base(struct pt_regs *regs)
> > +{
> > +#ifdef CONFIG_32BIT
> > +     if (user_mode(regs) && regs->cs != __USER_CS)
> > +             return get_segment_base(regs->cs);
> > +#else
> > +     if (test_thread_flag(TIF_IA32)) {
> > +             if (user_mode(regs) && regs->cs != __USER32_CS)
> > +                     return get_segment_base(regs->cs);
> > +     }
> > +#endif
> > +     return 0;
> > +}
> 
> Will this do the right thing for x32 mode? 

hpa? It looks like x32 has TIF_X32, but from the kernel's POV its really
just another 64bit process, so as long as we don't trigger the TIF_IA32
case we'll just return 0. 

set_personality_ia32() looks like TIF_IA32 and TIF_X32 are mutually
exclusive, so I think we do the right thing.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  7:54             ` Peter Zijlstra
@ 2012-07-10  8:02               ` Ingo Molnar
  2012-07-10  8:21               ` Ingo Molnar
  1 sibling, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2012-07-10  8:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > > +static unsigned long code_segment_base(struct pt_regs *regs)
> > > +{
> > > +#ifdef CONFIG_32BIT
> > > +     if (user_mode(regs) && regs->cs != __USER_CS)
> > > +             return get_segment_base(regs->cs);
> > > +#else
> > > +     if (test_thread_flag(TIF_IA32)) {
> > > +             if (user_mode(regs) && regs->cs != __USER32_CS)
> > > +                     return get_segment_base(regs->cs);
> > > +     }
> > > +#endif
> > > +     return 0;
> > > +}
> > 
> > Will this do the right thing for x32 mode? 
> 
> hpa? It looks like x32 has TIF_X32, but from the kernel's POV 
> its really just another 64bit process, so as long as we don't 
> trigger the TIF_IA32 case we'll just return 0.

Maybe add that as a comment or so, to stop future wondering. 
Maybe list all the states an x86 process can be in, and describe 
what we intend to do for each. That will make it easier to match 
up intent with actual code and will make it easier to keep it 
correct in the future as well.

> set_personality_ia32() looks like TIF_IA32 and TIF_X32 are 
> mutually exclusive, so I think we do the right thing.

Ok.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  7:54             ` Peter Zijlstra
  2012-07-10  8:02               ` Ingo Molnar
@ 2012-07-10  8:21               ` Ingo Molnar
  2012-07-10  8:52                 ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2012-07-10  8:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2012-07-09 at 20:41 +0200, Ingo Molnar wrote:
> > > +static unsigned long get_segment_base(unsigned int segment)
> > > +{
> > > +     struct desc_struct *desc;
> > > +     int idx = segment >> 3;
> > > +
> > > +     if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
> > > +             if (idx > LDT_ENTRIES)
> > > +                     return 0;
> > > +
> > > +             desc = current->active_mm->context.ldt;
> > > +     } else {
> > > +             if (idx > GDT_ENTRIES)
> > > +                     return 0;
> > > +
> > > +             desc = __this_cpu_ptr(&gdt_page.gdt[0]);
> > > +     }
> > > +
> > > +     return get_desc_base(desc + idx);
> > 
> > Shouldn't idx be checked against active_mm->context.ldt.size, 
> > not LDT_ENTRIES (which is really just an upper limit)?
> 
> Ah indeed, fixed that.

Another boundary condition would be when we intentionally 
twiddle the GDT: such as during suspend or during BIOS upcalls. 
Can we then get a PMU interrupt? If yes then this will probably 
result in garbage:

> > > +             desc = __this_cpu_ptr(&gdt_page.gdt[0]);

it won't outright crash, we don't ever deallocate our GDT - but 
it will return a garbage RIP.

Then there's also all the Xen craziness with segments ...

Both ought to be rare an uninteresting - but then again, 
segmented execution is already rare and uninteresting to begin 
with.

So, instead of trying to discover all these weird x86 cases - 
with little to no testing done after that - I thought that it 
might be more future proof to just handle the cases we are 
explicitly interested in: flat code, and pounce in some well 
defined way in all the other situations by returning the RIP to 
an empty __X86_LEGACY_SEGMENTED_CODE() symbol.

That way we will at least give *some* useful information to the 
poor segmented code user, if the profile says:

    21.32%  [kernel]      [k] __X86_LEGACY_SEGMENTED_CODE
    11.01%  [kernel]      [k] kallsyms_expand_symbol     
     8.29%  [kernel]      [k] vsnprintf                  
     7.37%  libc-2.15.so  [.] __strcmp_sse42             
     6.93%  perf          [.] symbol_filter              
     4.20%  perf          [.] kallsyms__parse            
     3.92%  [kernel]      [k] format_decode              
     3.62%  [kernel]      [k] string.isra.4              
     3.59%  [kernel]      [k] memcpy                     
     3.11%  [kernel]      [k] strnlen                    

then the user at least knows that there's 21% of overhead in 
some sort of segmented x86 code. Or if they *really* want to 
resolve that, they can take your patch and add symbol decoding 
to user-space and test it all.

KISS and such.

Linus?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  8:21               ` Ingo Molnar
@ 2012-07-10  8:52                 ` Peter Zijlstra
  2012-07-10  9:48                   ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2012-07-10  8:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter, Konrad Rzeszutek Wilk,
	Jeremy Fitzhardinge

On Tue, 2012-07-10 at 10:21 +0200, Ingo Molnar wrote:
> Another boundary condition would be when we intentionally 
> twiddle the GDT: such as during suspend or during BIOS upcalls. 
> Can we then get a PMU interrupt? If yes then this will probably 
> result in garbage:
> 
> > > > +             desc = __this_cpu_ptr(&gdt_page.gdt[0]);
> 
> it won't outright crash, we don't ever deallocate our GDT - but 
> it will return a garbage RIP.

Nothing we can do about that though..

> Then there's also all the Xen craziness with segments ...

I don't think Xen Dom0 has PMU access, that would be their Hyper-visor
thingy's job, no?

Anyway, that seems a problem for Jeremy and Konrad..

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-09 17:55           ` Linus Torvalds
@ 2012-07-10  9:02             ` Peter Zijlstra
  2012-07-10  9:48               ` Ingo Molnar
                                 ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2012-07-10  9:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter

On Mon, 2012-07-09 at 10:55 -0700, Linus Torvalds wrote:
> However, it is worth pointing out that sp/bp have exactly the same
> segment base issue. So if you do stack tracing into user mode, you
> should really do the same thing for those. And quite frankly, at that
> point vm86 mode and the stack segment matters in other ways than just
> the base pointer: a 16-bit stack segment acts fundamentally
> differently from a 32-bit one. So at that point it may well make much
> more sense to take the approach Ingo suggests, and simply not follow
> stack frames at all. 

Right, so I amended the patch to ignore vm86 stacks and added
{cs,ss}_base magic to ia32 stacks.

Ingo, do you want me to do a version where I simply bail on everything
if regs->{cs,ss} != {__USER_CS, __USER32_CS} || regs->flags & VM ?

---
Subject: perf/x86: Fix USER/KERNEL tagging of samples properly
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Jul 10 09:42:15 CEST 2012

Some PMUs don't provide a full register set for their sample,
specifically 'advanced' PMUs like AMD IBS and Intel PEBS which provide
'better' than regular interrupt accuracy.

In this case we use the interrupt regs as basis and over-write some
fields (typically IP) with different information.

The perf core however uses user_mode() to distinguish user/kernel
samples, user_mode() relies on regs->cs. If the interrupt skid pushed
us over a boundary the new IP might not be in the same domain as the
interrupt.

Commit ce5c1fe9a9e ("perf/x86: Fix USER/KERNEL tagging of samples")
tried to fix this by making the perf core use kernel_ip(). This
however is wrong (TM), as pointed out by Linus, since it doesn't allow
for VM86 and non-zero based segments in IA32 mode.

Therefore, provide a new helper to set the regs->ip field,
set_linear_ip(), which massages the regs into a suitable state
assuming the provided IP is in fact a linear address.

Also modify perf_instruction_pointer() and perf_callchain_user() to
deal with segments base offsets.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/include/asm/perf_event.h         |   11 ++-
 arch/x86/kernel/cpu/perf_event.c          |   89 ++++++++++++++++++++++++++----
 arch/x86/kernel/cpu/perf_event.h          |   20 ++++++
 arch/x86/kernel/cpu/perf_event_amd_ibs.c  |    4 +
 arch/x86/kernel/cpu/perf_event_intel_ds.c |    7 +-
 5 files changed, 114 insertions(+), 17 deletions(-)

--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -196,11 +196,16 @@ static inline u32 get_ibs_caps(void) { r
 extern void perf_events_lapic_init(void);
 
 /*
- * Abuse bit 3 of the cpu eflags register to indicate proper PEBS IP fixups.
- * This flag is otherwise unused and ABI specified to be 0, so nobody should
- * care what we do with it.
+ * Abuse bits {3,5} of the cpu eflags register. These flags are otherwise
+ * unused and ABI specified to be 0, so nobody should care what we do with
+ * them.
+ *
+ * EXACT    - the IP points to the exact instruction that triggered the
+ *            event (HW bugs exempt).
+ * NOUNWIND - do no unwind the stack.
  */
 #define PERF_EFLAGS_EXACT	(1UL << 3)
+#define PERF_EFLAGS_NOUNWIND	(1UL << 5)
 
 struct pt_regs;
 extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -32,6 +32,8 @@
 #include <asm/smp.h>
 #include <asm/alternative.h>
 #include <asm/timer.h>
+#include <asm/desc.h>
+#include <asm/ldt.h>
 
 #include "perf_event.h"
 
@@ -1738,6 +1740,29 @@ valid_user_frame(const void __user *fp,
 	return (__range_not_ok(fp, size, TASK_SIZE) == 0);
 }
 
+static unsigned long get_segment_base(unsigned int segment)
+{
+	struct desc_struct *desc;
+	int idx = segment >> 3;
+
+	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+		if (idx > LDT_ENTRIES)
+			return 0;
+
+		if (idx > current->active_mm->context.size)
+			return 0;
+
+		desc = current->active_mm->context.ldt;
+	} else {
+		if (idx > GDT_ENTRIES)
+			return 0;
+
+		desc = __this_cpu_ptr(&gdt_page.gdt[0]);
+	}
+
+	return get_desc_base(desc + idx);
+}
+
 #ifdef CONFIG_COMPAT
 
 #include <asm/compat.h>
@@ -1746,13 +1771,17 @@ static inline int
 perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 {
 	/* 32-bit process in 64-bit kernel. */
+	unsigned long ss_base, cs_base;
 	struct stack_frame_ia32 frame;
 	const void __user *fp;
 
 	if (!test_thread_flag(TIF_IA32))
 		return 0;
 
-	fp = compat_ptr(regs->bp);
+	cs_base = get_segment_base(regs->cs);
+	ss_base = get_segment_base(regs->ss);
+
+	fp = compat_ptr(ss_base + regs->bp);
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
 		frame.next_frame     = 0;
@@ -1765,8 +1794,8 @@ perf_callchain_user32(struct pt_regs *re
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
 
-		perf_callchain_store(entry, frame.return_address);
-		fp = compat_ptr(frame.next_frame);
+		perf_callchain_store(entry, cs_base + frame.return_address);
+		fp = compat_ptr(ss_base + frame.next_frame);
 	}
 	return 1;
 }
@@ -1789,6 +1818,12 @@ perf_callchain_user(struct perf_callchai
 		return;
 	}
 
+	/*
+	 * We don't know what to do with VM86 stacks.. ignore them for now.
+	 */
+	if (regs->flags & (X86_VM_MASK | PERF_EFLAGS_NOUNWIND))
+		return;
+
 	fp = (void __user *)regs->bp;
 
 	perf_callchain_store(entry, regs->ip);
@@ -1816,16 +1851,50 @@ perf_callchain_user(struct perf_callchai
 	}
 }
 
-unsigned long perf_instruction_pointer(struct pt_regs *regs)
+/*
+ * Deal with code segment offsets for the various execution modes:
+ *
+ *   VM86 - the good olde 16 bit days, where the linear address is
+ *          20 bits and we use regs->ip + 0x10 * regs->cs.
+ *
+ *   IA32 - Where we need to look at GDT/LDT segment descriptor tables
+ *          to figure out what the 32bit base address is.
+ *
+ *    X32 - has TIF_X32 set, but is running in x86_64
+ *
+ * X86_64 - CS,DS,SS,ES are all zero based.
+ */
+static unsigned long code_segment_base(struct pt_regs *regs)
 {
-	unsigned long ip;
+	/*
+	 * If we are in VM86 mode, add the segment offset to convert to a
+	 * linear address.
+	 */
+	if (regs->flags & X86_VM_MASK)
+		return 0x10 * regs->cs;
+
+	/*
+	 * For IA32 we look at the GDT/LDT segment base to convert the
+	 * effective IP to a linear address.
+	 */
+#ifdef CONFIG_32BIT
+	if (user_mode(regs) && regs->cs != __USER_CS)
+		return get_segment_base(regs->cs);
+#else
+	if (test_thread_flag(TIF_IA32)) {
+		if (user_mode(regs) && regs->cs != __USER32_CS)
+			return get_segment_base(regs->cs);
+	}
+#endif
+	return 0;
+}
 
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
 	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-		ip = perf_guest_cbs->get_guest_ip();
-	else
-		ip = instruction_pointer(regs);
+		return perf_guest_cbs->get_guest_ip();
 
-	return ip;
+	return regs->ip + code_segment_base(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
@@ -1838,7 +1907,7 @@ unsigned long perf_misc_flags(struct pt_
 		else
 			misc |= PERF_RECORD_MISC_GUEST_KERNEL;
 	} else {
-		if (!kernel_ip(regs->ip))
+		if (user_mode(regs))
 			misc |= PERF_RECORD_MISC_USER;
 		else
 			misc |= PERF_RECORD_MISC_KERNEL;
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -516,6 +516,26 @@ static inline bool kernel_ip(unsigned lo
 #endif
 }
 
+/*
+ * Not all PMUs provide the right context information to place the reported IP
+ * into full context. Specifically segment registers are typically not
+ * supplied.
+ *
+ * Assuming the address is a linear address (it is for IBS), we fake the CS and
+ * vm86 mode using the known zero-based code segment and 'fix up' the registers
+ * to reflect this.
+ *
+ * Intel PEBS/LBR appear to typically provide the effective address, nothing
+ * much we can do about that but pray and treat it like a linear address.
+ */
+static inline void set_linear_ip(struct pt_regs *regs, unsigned long ip)
+{
+	regs->cs = kernel_ip(ip) ? __KERNEL_CS : __USER_CS;
+	if (regs->flags & X86_VM_MASK)
+		regs->flags ^= (PERF_EFLAGS_NOUNWIND | X86_VM_MASK);
+	regs->ip = ip;
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -13,6 +13,8 @@
 
 #include <asm/apic.h>
 
+#include "perf_event.h"
+
 static u32 ibs_caps;
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
@@ -536,7 +538,7 @@ static int perf_ibs_handle_irq(struct pe
 	if (check_rip && (ibs_data.regs[2] & IBS_RIP_INVALID)) {
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 	} else {
-		instruction_pointer_set(&regs, ibs_data.regs[1]);
+		set_linear_ip(&regs, ibs_data.regs[1]);
 		regs.flags |= PERF_EFLAGS_EXACT;
 	}
 
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -499,7 +499,7 @@ static int intel_pmu_pebs_fixup_ip(struc
 	 * We sampled a branch insn, rewind using the LBR stack
 	 */
 	if (ip == to) {
-		regs->ip = from;
+		set_linear_ip(regs, from);
 		return 1;
 	}
 
@@ -529,7 +529,7 @@ static int intel_pmu_pebs_fixup_ip(struc
 	} while (to < ip);
 
 	if (to == ip) {
-		regs->ip = old_to;
+		set_linear_ip(regs, old_to);
 		return 1;
 	}
 
@@ -569,7 +569,8 @@ static void __intel_pmu_pebs_event(struc
 	 * A possible PERF_SAMPLE_REGS will have to transfer all regs.
 	 */
 	regs = *iregs;
-	regs.ip = pebs->ip;
+	regs.flags = pebs->flags;
+	set_linear_ip(&regs, pebs->ip);
 	regs.bp = pebs->bp;
 	regs.sp = pebs->sp;
 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  8:52                 ` Peter Zijlstra
@ 2012-07-10  9:48                   ` Ingo Molnar
  2012-07-31 18:11                     ` H. Peter Anvin
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2012-07-10  9:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter, Konrad Rzeszutek Wilk,
	Jeremy Fitzhardinge


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2012-07-10 at 10:21 +0200, Ingo Molnar wrote:
> > Another boundary condition would be when we intentionally 
> > twiddle the GDT: such as during suspend or during BIOS upcalls. 
> > Can we then get a PMU interrupt? If yes then this will probably 
> > result in garbage:
> > 
> > > > > +             desc = __this_cpu_ptr(&gdt_page.gdt[0]);
> > 
> > it won't outright crash, we don't ever deallocate our GDT - but 
> > it will return a garbage RIP.
> 
> Nothing we can do about that though..

We could read out the current GDT [the SGDT instruction] instead 
of looking at gdt_page.

Then we'd have to decode that descriptor, the limit. Decide 
whether the selector points to the GDT or LDT. All the fun x86 
legacies that we mostly forgot already after two decades of 
running the kernel in flat linear mode...

> > Then there's also all the Xen craziness with segments ...
> 
> I don't think Xen Dom0 has PMU access, that would be their 
> Hyper-visor thingy's job, no?

Unless Xen is destined to become dead code it will eventually be 
interested in PMU based measurements, right?

> Anyway, that seems a problem for Jeremy and Konrad..

Yeah, quite probably so.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  9:02             ` Peter Zijlstra
@ 2012-07-10  9:48               ` Ingo Molnar
  2012-07-10  9:50               ` Peter Zijlstra
  2012-07-31 17:57               ` [tip:perf/urgent] perf/x86: Fix USER/ KERNEL tagging of samples properly tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2012-07-10  9:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2012-07-09 at 10:55 -0700, Linus Torvalds wrote:
> > However, it is worth pointing out that sp/bp have exactly the same
> > segment base issue. So if you do stack tracing into user mode, you
> > should really do the same thing for those. And quite frankly, at that
> > point vm86 mode and the stack segment matters in other ways than just
> > the base pointer: a 16-bit stack segment acts fundamentally
> > differently from a 32-bit one. So at that point it may well make much
> > more sense to take the approach Ingo suggests, and simply not follow
> > stack frames at all. 
> 
> Right, so I amended the patch to ignore vm86 stacks and added 
> {cs,ss}_base magic to ia32 stacks.
> 
> Ingo, do you want me to do a version where I simply bail on 
> everything if regs->{cs,ss} != {__USER_CS, __USER32_CS} || 
> regs->flags & VM ?

Only if it's really simple to do - out of morbid curiosity, to 
compare the two diffstats and such.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  9:02             ` Peter Zijlstra
  2012-07-10  9:48               ` Ingo Molnar
@ 2012-07-10  9:50               ` Peter Zijlstra
  2012-07-10  9:52                 ` Peter Zijlstra
  2012-07-10  9:55                 ` Ingo Molnar
  2012-07-31 17:57               ` [tip:perf/urgent] perf/x86: Fix USER/ KERNEL tagging of samples properly tip-bot for Peter Zijlstra
  2 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2012-07-10  9:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter

On Tue, 2012-07-10 at 11:02 +0200, Peter Zijlstra wrote:
> Ingo, do you want me to do a version where I simply bail on everything
> if regs->{cs,ss} != {__USER_CS, __USER32_CS} || regs->flags & VM ?

Here's a variant that does that.. 

---
Subject: perf/x86: Fix USER/KERNEL tagging of samples properly
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Jul 10 09:42:15 CEST 2012

Some PMUs don't provide a full register set for their sample,
specifically 'advanced' PMUs like AMD IBS and Intel PEBS which provide
'better' than regular interrupt accuracy.

In this case we use the interrupt regs as basis and over-write some
fields (typically IP) with different information.

The perf core however uses user_mode() to distinguish user/kernel
samples, user_mode() relies on regs->cs. If the interrupt skid pushed
us over a boundary the new IP might not be in the same domain as the
interrupt.

Commit ce5c1fe9a9e ("perf/x86: Fix USER/KERNEL tagging of samples")
tried to fix this by making the perf core use kernel_ip(). This
however is wrong (TM), as pointed out by Linus, since it doesn't allow
for VM86 and non-zero based segments in IA32 mode.

Therefore, provide a new helper to set the regs->ip field,
set_linear_ip(), which massages the regs into a suitable state
assuming the provided IP is in fact a linear address.

Also modify perf_instruction_pointer() and perf_callchain_user() to
deal with segments base offsets.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/include/asm/perf_event.h         |   11 +++++--
 arch/x86/kernel/cpu/perf_event.c          |   46 +++++++++++++++++++++++++-----
 arch/x86/kernel/cpu/perf_event.h          |   20 +++++++++++++
 arch/x86/kernel/cpu/perf_event_amd_ibs.c  |    4 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c |    7 ++--
 5 files changed, 74 insertions(+), 14 deletions(-)

--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -196,11 +196,16 @@ static inline u32 get_ibs_caps(void) { r
 extern void perf_events_lapic_init(void);
 
 /*
- * Abuse bit 3 of the cpu eflags register to indicate proper PEBS IP fixups.
- * This flag is otherwise unused and ABI specified to be 0, so nobody should
- * care what we do with it.
+ * Abuse bits {3,5} of the cpu eflags register. These flags are otherwise
+ * unused and ABI specified to be 0, so nobody should care what we do with
+ * them.
+ *
+ * EXACT - the IP points to the exact instruction that triggered the
+ *         event (HW bugs exempt).
+ * VM    - original X86_VM_MASK; see set_linear_ip().
  */
 #define PERF_EFLAGS_EXACT	(1UL << 3)
+#define PERF_EFLAGS_VM		(1UL << 5)
 
 struct pt_regs;
 extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -32,6 +32,8 @@
 #include <asm/smp.h>
 #include <asm/alternative.h>
 #include <asm/timer.h>
+#include <asm/desc.h>
+#include <asm/ldt.h>
 
 #include "perf_event.h"
 
@@ -1752,6 +1754,12 @@ perf_callchain_user32(struct pt_regs *re
 	if (!test_thread_flag(TIF_IA32))
 		return 0;
 
+	if (regs->cs != __USER32_CS && regs->cs != __USER_CS)
+		return 0;
+
+	if (regs->ss != __USER32_DS && regs->ss != __USER_DS)
+		return 0;
+
 	fp = compat_ptr(regs->bp);
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
@@ -1789,6 +1797,12 @@ perf_callchain_user(struct perf_callchai
 		return;
 	}
 
+	/*
+	 * We don't know what to do with VM86 stacks.. ignore them for now.
+	 */
+	if (regs->flags & (X86_VM_MASK | PERF_EFLAGS_VM))
+		return;
+
 	fp = (void __user *)regs->bp;
 
 	perf_callchain_store(entry, regs->ip);
@@ -1816,16 +1830,34 @@ perf_callchain_user(struct perf_callchai
 	}
 }
 
-unsigned long perf_instruction_pointer(struct pt_regs *regs)
+static bool flat_code_segment(struct pt_regs *regs)
 {
-	unsigned long ip;
+	if (regs->flags & (X86_VM_MASK | PERF_EFLAGS_VM))
+		return false;
 
+#ifdef CONFIG_32BIT
+	if (user_mode(regs) && (regs->cs != __USER_CS &&
+				regs->cs != __USER32_CS))
+		return false;
+#else
+	if (test_thread_flag(TIF_IA32)) {
+		if (user_mode(regs) && (regs->cs != __USER_CS &&
+					regs->cs != __USER32_CS))
+			return false;
+	}
+#endif
+	return true; /* X86_64 and X32 are guaranteed flat */
+}
+
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
 	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-		ip = perf_guest_cbs->get_guest_ip();
-	else
-		ip = instruction_pointer(regs);
+		return perf_guest_cbs->get_guest_ip();
+
+	if (flat_code_segment(regs))
+		return regs->ip;
 
-	return ip;
+	return ~0UL;
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
@@ -1838,7 +1870,7 @@ unsigned long perf_misc_flags(struct pt_
 		else
 			misc |= PERF_RECORD_MISC_GUEST_KERNEL;
 	} else {
-		if (!kernel_ip(regs->ip))
+		if (user_mode(regs))
 			misc |= PERF_RECORD_MISC_USER;
 		else
 			misc |= PERF_RECORD_MISC_KERNEL;
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -516,6 +516,26 @@ static inline bool kernel_ip(unsigned lo
 #endif
 }
 
+/*
+ * Not all PMUs provide the right context information to place the reported IP
+ * into full context. Specifically segment registers are typically not
+ * supplied.
+ *
+ * Assuming the address is a linear address (it is for IBS), we fake the CS and
+ * vm86 mode using the known zero-based code segment and 'fix up' the registers
+ * to reflect this.
+ *
+ * Intel PEBS/LBR appear to typically provide the effective address, nothing
+ * much we can do about that but pray and treat it like a linear address.
+ */
+static inline void set_linear_ip(struct pt_regs *regs, unsigned long ip)
+{
+	regs->cs = kernel_ip(ip) ? __KERNEL_CS : __USER_CS;
+	if (regs->flags & X86_VM_MASK)
+		regs->flags ^= (PERF_EFLAGS_VM | X86_VM_MASK);
+	regs->ip = ip;
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -13,6 +13,8 @@
 
 #include <asm/apic.h>
 
+#include "perf_event.h"
+
 static u32 ibs_caps;
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
@@ -536,7 +538,7 @@ static int perf_ibs_handle_irq(struct pe
 	if (check_rip && (ibs_data.regs[2] & IBS_RIP_INVALID)) {
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 	} else {
-		instruction_pointer_set(&regs, ibs_data.regs[1]);
+		set_linear_ip(&regs, ibs_data.regs[1]);
 		regs.flags |= PERF_EFLAGS_EXACT;
 	}
 
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -499,7 +499,7 @@ static int intel_pmu_pebs_fixup_ip(struc
 	 * We sampled a branch insn, rewind using the LBR stack
 	 */
 	if (ip == to) {
-		regs->ip = from;
+		set_linear_ip(regs, from);
 		return 1;
 	}
 
@@ -529,7 +529,7 @@ static int intel_pmu_pebs_fixup_ip(struc
 	} while (to < ip);
 
 	if (to == ip) {
-		regs->ip = old_to;
+		set_linear_ip(regs, old_to);
 		return 1;
 	}
 
@@ -569,7 +569,8 @@ static void __intel_pmu_pebs_event(struc
 	 * A possible PERF_SAMPLE_REGS will have to transfer all regs.
 	 */
 	regs = *iregs;
-	regs.ip = pebs->ip;
+	regs.flags = pebs->flags;
+	set_linear_ip(&regs, pebs->ip);
 	regs.bp = pebs->bp;
 	regs.sp = pebs->sp;
 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  9:50               ` Peter Zijlstra
@ 2012-07-10  9:52                 ` Peter Zijlstra
  2012-07-10  9:55                 ` Ingo Molnar
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2012-07-10  9:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: mingo, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter

On Tue, 2012-07-10 at 11:50 +0200, Peter Zijlstra wrote:
> +       if (regs->cs != __USER32_CS && regs->cs != __USER_CS)
> +               return 0;
> +
> +       if (regs->ss != __USER32_DS && regs->ss != __USER_DS)
> +               return 0; 

Those should be return 1, they are a 32bit callchain, we just don't do
anything with it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  9:50               ` Peter Zijlstra
  2012-07-10  9:52                 ` Peter Zijlstra
@ 2012-07-10  9:55                 ` Ingo Molnar
  1 sibling, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2012-07-10  9:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, hpa, eranian, linux-kernel, fweisbec, akpm, tglx,
	linux-tip-commits, Robert Richter


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2012-07-10 at 11:02 +0200, Peter Zijlstra wrote:
> > Ingo, do you want me to do a version where I simply bail on everything
> > if regs->{cs,ss} != {__USER_CS, __USER32_CS} || regs->flags & VM ?
> 
> Here's a variant that does that.. 

>  arch/x86/include/asm/perf_event.h         |   11 +++++--
>  arch/x86/kernel/cpu/perf_event.c          |   46 +++++++++++++++++++++++++-----
>  arch/x86/kernel/cpu/perf_event.h          |   20 +++++++++++++
>  arch/x86/kernel/cpu/perf_event_amd_ibs.c  |    4 +-
>  arch/x86/kernel/cpu/perf_event_intel_ds.c |    7 ++--
>  5 files changed, 74 insertions(+), 14 deletions(-)

this is the full thing:

>  arch/x86/include/asm/perf_event.h         |   11 ++-
>  arch/x86/kernel/cpu/perf_event.c          |   89 ++++++++++++++++++++++++++----
>  arch/x86/kernel/cpu/perf_event.h          |   20 ++++++
>  arch/x86/kernel/cpu/perf_event_amd_ibs.c  |    4 +
>  arch/x86/kernel/cpu/perf_event_intel_ds.c |    7 +-
>  5 files changed, 114 insertions(+), 17 deletions(-)

so that's 40 LOC difference.

Hm, I expected there to be more of a difference, so let me 
change my mind again in view of the evidence: now I tend to
lean Linus's way, we might as well apply those extra 40 lines
now that you've written them :-)

Even if it is not enough to do proper segmented profiling, 
should anyone be interested in such a profiling mode they'll 
have a much easier job making it work, the rest looks mostly a 
user space side job. Your larger patch looks safe enough at the 
boundaries.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:perf/urgent] perf/x86: Fix USER/ KERNEL tagging of samples properly
  2012-07-10  9:02             ` Peter Zijlstra
  2012-07-10  9:48               ` Ingo Molnar
  2012-07-10  9:50               ` Peter Zijlstra
@ 2012-07-31 17:57               ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-07-31 17:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx

Commit-ID:  d07bdfd322d307789f15b427dbcc39257665356f
Gitweb:     http://git.kernel.org/tip/d07bdfd322d307789f15b427dbcc39257665356f
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 10 Jul 2012 09:42:15 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 31 Jul 2012 17:02:04 +0200

perf/x86: Fix USER/KERNEL tagging of samples properly

Some PMUs don't provide a full register set for their sample,
specifically 'advanced' PMUs like AMD IBS and Intel PEBS which provide
'better' than regular interrupt accuracy.

In this case we use the interrupt regs as basis and over-write some
fields (typically IP) with different information.

The perf core however uses user_mode() to distinguish user/kernel
samples, user_mode() relies on regs->cs. If the interrupt skid pushed
us over a boundary the new IP might not be in the same domain as the
interrupt.

Commit ce5c1fe9a9e ("perf/x86: Fix USER/KERNEL tagging of samples")
tried to fix this by making the perf core use kernel_ip(). This
however is wrong (TM), as pointed out by Linus, since it doesn't allow
for VM86 and non-zero based segments in IA32 mode.

Therefore, provide a new helper to set the regs->ip field,
set_linear_ip(), which massages the regs into a suitable state
assuming the provided IP is in fact a linear address.

Also modify perf_instruction_pointer() and perf_callchain_user() to
deal with segments base offsets.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341910954.3462.102.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/perf_event.h         |   11 +++-
 arch/x86/kernel/cpu/perf_event.c          |   89 +++++++++++++++++++++++++---
 arch/x86/kernel/cpu/perf_event.h          |   20 +++++++
 arch/x86/kernel/cpu/perf_event_amd_ibs.c  |    4 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c |    7 +-
 5 files changed, 114 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index dab3935..cb4e43b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -196,11 +196,16 @@ static inline u32 get_ibs_caps(void) { return 0; }
 extern void perf_events_lapic_init(void);
 
 /*
- * Abuse bit 3 of the cpu eflags register to indicate proper PEBS IP fixups.
- * This flag is otherwise unused and ABI specified to be 0, so nobody should
- * care what we do with it.
+ * Abuse bits {3,5} of the cpu eflags register. These flags are otherwise
+ * unused and ABI specified to be 0, so nobody should care what we do with
+ * them.
+ *
+ * EXACT - the IP points to the exact instruction that triggered the
+ *         event (HW bugs exempt).
+ * VM    - original X86_VM_MASK; see set_linear_ip().
  */
 #define PERF_EFLAGS_EXACT	(1UL << 3)
+#define PERF_EFLAGS_VM		(1UL << 5)
 
 struct pt_regs;
 extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 29557aa..915b876 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -32,6 +32,8 @@
 #include <asm/smp.h>
 #include <asm/alternative.h>
 #include <asm/timer.h>
+#include <asm/desc.h>
+#include <asm/ldt.h>
 
 #include "perf_event.h"
 
@@ -1738,6 +1740,29 @@ valid_user_frame(const void __user *fp, unsigned long size)
 	return (__range_not_ok(fp, size, TASK_SIZE) == 0);
 }
 
+static unsigned long get_segment_base(unsigned int segment)
+{
+	struct desc_struct *desc;
+	int idx = segment >> 3;
+
+	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+		if (idx > LDT_ENTRIES)
+			return 0;
+
+		if (idx > current->active_mm->context.size)
+			return 0;
+
+		desc = current->active_mm->context.ldt;
+	} else {
+		if (idx > GDT_ENTRIES)
+			return 0;
+
+		desc = __this_cpu_ptr(&gdt_page.gdt[0]);
+	}
+
+	return get_desc_base(desc + idx);
+}
+
 #ifdef CONFIG_COMPAT
 
 #include <asm/compat.h>
@@ -1746,13 +1771,17 @@ static inline int
 perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 {
 	/* 32-bit process in 64-bit kernel. */
+	unsigned long ss_base, cs_base;
 	struct stack_frame_ia32 frame;
 	const void __user *fp;
 
 	if (!test_thread_flag(TIF_IA32))
 		return 0;
 
-	fp = compat_ptr(regs->bp);
+	cs_base = get_segment_base(regs->cs);
+	ss_base = get_segment_base(regs->ss);
+
+	fp = compat_ptr(ss_base + regs->bp);
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
 		frame.next_frame     = 0;
@@ -1765,8 +1794,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
 
-		perf_callchain_store(entry, frame.return_address);
-		fp = compat_ptr(frame.next_frame);
+		perf_callchain_store(entry, cs_base + frame.return_address);
+		fp = compat_ptr(ss_base + frame.next_frame);
 	}
 	return 1;
 }
@@ -1789,6 +1818,12 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 		return;
 	}
 
+	/*
+	 * We don't know what to do with VM86 stacks.. ignore them for now.
+	 */
+	if (regs->flags & (X86_VM_MASK | PERF_EFLAGS_VM))
+		return;
+
 	fp = (void __user *)regs->bp;
 
 	perf_callchain_store(entry, regs->ip);
@@ -1816,16 +1851,50 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 	}
 }
 
-unsigned long perf_instruction_pointer(struct pt_regs *regs)
+/*
+ * Deal with code segment offsets for the various execution modes:
+ *
+ *   VM86 - the good olde 16 bit days, where the linear address is
+ *          20 bits and we use regs->ip + 0x10 * regs->cs.
+ *
+ *   IA32 - Where we need to look at GDT/LDT segment descriptor tables
+ *          to figure out what the 32bit base address is.
+ *
+ *    X32 - has TIF_X32 set, but is running in x86_64
+ *
+ * X86_64 - CS,DS,SS,ES are all zero based.
+ */
+static unsigned long code_segment_base(struct pt_regs *regs)
 {
-	unsigned long ip;
+	/*
+	 * If we are in VM86 mode, add the segment offset to convert to a
+	 * linear address.
+	 */
+	if (regs->flags & X86_VM_MASK)
+		return 0x10 * regs->cs;
+
+	/*
+	 * For IA32 we look at the GDT/LDT segment base to convert the
+	 * effective IP to a linear address.
+	 */
+#ifdef CONFIG_X86_32
+	if (user_mode(regs) && regs->cs != __USER_CS)
+		return get_segment_base(regs->cs);
+#else
+	if (test_thread_flag(TIF_IA32)) {
+		if (user_mode(regs) && regs->cs != __USER32_CS)
+			return get_segment_base(regs->cs);
+	}
+#endif
+	return 0;
+}
 
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
 	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-		ip = perf_guest_cbs->get_guest_ip();
-	else
-		ip = instruction_pointer(regs);
+		return perf_guest_cbs->get_guest_ip();
 
-	return ip;
+	return regs->ip + code_segment_base(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
@@ -1838,7 +1907,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 		else
 			misc |= PERF_RECORD_MISC_GUEST_KERNEL;
 	} else {
-		if (!kernel_ip(regs->ip))
+		if (user_mode(regs))
 			misc |= PERF_RECORD_MISC_USER;
 		else
 			misc |= PERF_RECORD_MISC_KERNEL;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 821d53b..6605a81 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -516,6 +516,26 @@ static inline bool kernel_ip(unsigned long ip)
 #endif
 }
 
+/*
+ * Not all PMUs provide the right context information to place the reported IP
+ * into full context. Specifically segment registers are typically not
+ * supplied.
+ *
+ * Assuming the address is a linear address (it is for IBS), we fake the CS and
+ * vm86 mode using the known zero-based code segment and 'fix up' the registers
+ * to reflect this.
+ *
+ * Intel PEBS/LBR appear to typically provide the effective address, nothing
+ * much we can do about that but pray and treat it like a linear address.
+ */
+static inline void set_linear_ip(struct pt_regs *regs, unsigned long ip)
+{
+	regs->cs = kernel_ip(ip) ? __KERNEL_CS : __USER_CS;
+	if (regs->flags & X86_VM_MASK)
+		regs->flags ^= (PERF_EFLAGS_VM | X86_VM_MASK);
+	regs->ip = ip;
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index da9bcdc..7bfb5be 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -13,6 +13,8 @@
 
 #include <asm/apic.h>
 
+#include "perf_event.h"
+
 static u32 ibs_caps;
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
@@ -536,7 +538,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	if (check_rip && (ibs_data.regs[2] & IBS_RIP_INVALID)) {
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 	} else {
-		instruction_pointer_set(&regs, ibs_data.regs[1]);
+		set_linear_ip(&regs, ibs_data.regs[1]);
 		regs.flags |= PERF_EFLAGS_EXACT;
 	}
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 629ae0b..e38d97b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -499,7 +499,7 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 	 * We sampled a branch insn, rewind using the LBR stack
 	 */
 	if (ip == to) {
-		regs->ip = from;
+		set_linear_ip(regs, from);
 		return 1;
 	}
 
@@ -529,7 +529,7 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 	} while (to < ip);
 
 	if (to == ip) {
-		regs->ip = old_to;
+		set_linear_ip(regs, old_to);
 		return 1;
 	}
 
@@ -569,7 +569,8 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * A possible PERF_SAMPLE_REGS will have to transfer all regs.
 	 */
 	regs = *iregs;
-	regs.ip = pebs->ip;
+	regs.flags = pebs->flags;
+	set_linear_ip(&regs, pebs->ip);
 	regs.bp = pebs->bp;
 	regs.sp = pebs->sp;
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples
  2012-07-10  9:48                   ` Ingo Molnar
@ 2012-07-31 18:11                     ` H. Peter Anvin
  0 siblings, 0 replies; 22+ messages in thread
From: H. Peter Anvin @ 2012-07-31 18:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, eranian, linux-kernel, fweisbec,
	akpm, tglx, linux-tip-commits, Robert Richter,
	Konrad Rzeszutek Wilk, Jeremy Fitzhardinge

On 07/10/2012 02:48 AM, Ingo Molnar wrote:
>
> We could read out the current GDT [the SGDT instruction] instead
> of looking at gdt_page.
>
> Then we'd have to decode that descriptor, the limit. Decide
> whether the selector points to the GDT or LDT. All the fun x86
> legacies that we mostly forgot already after two decades of
> running the kernel in flat linear mode...
>

In 32-bit mode there are actually instructions that do that work for you.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, back to index

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-06  6:20 [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples tip-bot for Peter Zijlstra
2012-07-06 16:29 ` Linus Torvalds
2012-07-06 18:12   ` Peter Zijlstra
2012-07-06 18:16     ` Linus Torvalds
2012-07-06 18:34       ` Linus Torvalds
2012-07-06 20:48         ` Peter Zijlstra
2012-07-06 20:59           ` Linus Torvalds
2012-07-09 11:23         ` Peter Zijlstra
2012-07-09 17:55           ` Linus Torvalds
2012-07-10  9:02             ` Peter Zijlstra
2012-07-10  9:48               ` Ingo Molnar
2012-07-10  9:50               ` Peter Zijlstra
2012-07-10  9:52                 ` Peter Zijlstra
2012-07-10  9:55                 ` Ingo Molnar
2012-07-31 17:57               ` [tip:perf/urgent] perf/x86: Fix USER/ KERNEL tagging of samples properly tip-bot for Peter Zijlstra
2012-07-09 18:41           ` [tip:perf/core] perf/x86: Fix USER/KERNEL tagging of samples Ingo Molnar
2012-07-10  7:54             ` Peter Zijlstra
2012-07-10  8:02               ` Ingo Molnar
2012-07-10  8:21               ` Ingo Molnar
2012-07-10  8:52                 ` Peter Zijlstra
2012-07-10  9:48                   ` Ingo Molnar
2012-07-31 18:11                     ` H. Peter Anvin

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git
	git clone --mirror https://lore.kernel.org/lkml/10 lkml/git/10.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git