All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
@ 2014-12-23  6:22 root
  2014-12-23  8:30 ` Andy Lutomirski
       [not found] ` <c027bde0-5f4f-441f-8d45-3e7f6f702231@alibaba-inc.com>
  0 siblings, 2 replies; 23+ messages in thread
From: root @ 2014-12-23  6:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chenggang Qin, Andrew Morton, Arjan van de Ven, David Ahern,
	Ingo Molnar, Mike Galbraith, Namhyung Kim, Paul Mackerras,
	Peter Zijlstra, Wu Fengguang, Yanmin Zhang

From: Chenggang Qin <chenggang.qcg@taobao.com>

For x86_64, the exact value of user stack's esp should be got by KSTK_ESP(current).
current->thread.usersp is copied from PDA while enter ring0. 
Now, we output the value of sp from pt_regs. But pt_regs->sp has changed before
it was pushed into kernel stack.

So, we cannot get the correct callchain while unwind some user stacks.
For example, if the stack contains __lll_unlock_wake()/__lll_lock_wait(), the
callchain will break some times with the latest version of libunwind.
The root cause is the sp that is used by libunwind may be wrong.

If we use KSTK_ESP(current), the correct callchain can be got everytime.
Other architectures also have KSTK_ESP() macro.

Signed-off-by: Chenggang Qin <chenggang.qcg@taobao.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Yanmin Zhang <yanmin.zhang@intel.com>

---
 arch/x86/kernel/perf_regs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index e309cc5..5da8df8 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -60,6 +60,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
 		return 0;
 
+	if (idx == PERF_REG_X86_SP)
+		return KSTK_ESP(current);
+
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2014-12-23  6:22 [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs root
@ 2014-12-23  8:30 ` Andy Lutomirski
       [not found] ` <c027bde0-5f4f-441f-8d45-3e7f6f702231@alibaba-inc.com>
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2014-12-23  8:30 UTC (permalink / raw)
  To: root, linux-kernel
  Cc: Chenggang Qin, Andrew Morton, Arjan van de Ven, David Ahern,
	Ingo Molnar, Mike Galbraith, Namhyung Kim, Paul Mackerras,
	Peter Zijlstra, Wu Fengguang, Yanmin Zhang

On 12/22/2014 10:22 PM, root wrote:
> From: Chenggang Qin <chenggang.qcg@taobao.com>
> 
> For x86_64, the exact value of user stack's esp should be got by KSTK_ESP(current).
> current->thread.usersp is copied from PDA while enter ring0. 
> Now, we output the value of sp from pt_regs. But pt_regs->sp has changed before
> it was pushed into kernel stack.
> 
> So, we cannot get the correct callchain while unwind some user stacks.
> For example, if the stack contains __lll_unlock_wake()/__lll_lock_wait(), the
> callchain will break some times with the latest version of libunwind.
> The root cause is the sp that is used by libunwind may be wrong.
> 
> If we use KSTK_ESP(current), the correct callchain can be got everytime.
> Other architectures also have KSTK_ESP() macro.
> 
> Signed-off-by: Chenggang Qin <chenggang.qcg@taobao.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Namhyung Kim <namhyung@gmail.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: Yanmin Zhang <yanmin.zhang@intel.com>
> 
> ---
>  arch/x86/kernel/perf_regs.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
> index e309cc5..5da8df8 100644
> --- a/arch/x86/kernel/perf_regs.c
> +++ b/arch/x86/kernel/perf_regs.c
> @@ -60,6 +60,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>  	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
>  		return 0;
>  
> +	if (idx == PERF_REG_X86_SP)
> +		return KSTK_ESP(current);
> +

This patch is probably fine, but KSTK_ESP seems to be bogus:

unsigned long KSTK_ESP(struct task_struct *task)
{
	return (test_tsk_thread_flag(task, TIF_IA32)) ?
			(task_pt_regs(task)->sp) : ((task)->thread.usersp);
}

I swear that every time I've looked at anything that references TIF_IA32
in the last two weeks, it's been wrong.  This should be something like:

if (task_thread_info(task)->status & TS_COMPAT)
	return task_pt_regs(task)->sp;
else if (task == current && task is in a syscall)
	return current_user_stack_pointer();
else if (task is not running && task is in a syscall)
	return task->thread.usersp;
else if (task is not in a syscall)
	return task_pt_regs(task)->sp;
else
	we're confused; give up.

What context are you using KSTK_ESP in?

--Andy

>  	return regs_get_register(regs, pt_regs_offset[idx]);
>  }
>  
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
       [not found] ` <c027bde0-5f4f-441f-8d45-3e7f6f702231@alibaba-inc.com>
@ 2014-12-25 15:48   ` Andy Lutomirski
  2014-12-25 16:21     ` Andy Lutomirski
  2014-12-30 19:03     ` Peter Zijlstra
  0 siblings, 2 replies; 23+ messages in thread
From: Andy Lutomirski @ 2014-12-25 15:48 UTC (permalink / raw)
  To: 秦承刚(承刚)
  Cc: root, linux-kernel, 秦承刚(承刚),
	Andrew Morton, Arjan van de Ven, David Ahern, Ingo Molnar,
	Mike Galbraith, Namhyung Kim, Paul Mackerras, Peter Zijlstra,
	Wu Fengguang, Yanmin Zhang

On Thu, Dec 25, 2014 at 4:13 AM, 秦承刚(承刚) <chenggang.qcg@alibaba-inc.com> wrote:
> The context is NMI (PMU) or IRQ (hrtimer). It is a bit complex. The process
> we want to sample is the current, so it is always running.
> We need to distinguish between IRQ context, syscall or user context that are
> interrupted by NMI.

Oh.  So you really are trying to get the user regs from NMI context
even if you interrupted the kernel.  This will be an unpleasant thing
to do correctly.

> Syscall: sp = current->thread.usersp;

Nope.  For x86_64 at least, sp is in old_rsp, not usersp, *if* the
syscall you interrupted was syscall64 instead of one of the ia32
entries, which are differently strange.  If you've context switched
out and back, then usersp will match it.  If not, then you're looking
at some random stale sp value.

Keep in mind that there is absolutely no guarantee that TIF_IA32
matches the syscall entry type.

>              old_rsp always point to current->thread.usersp. May be we
> shouldn't use current_user_stack_pointer();
> User: sp = task_pt_regs(task)->sp;
>           current's pt_regs are stored in kernel stack while NMI or IRQ
> occured.

This is the only easy case.

> IRQ: sp = task_pt_regs(task)->sp;
>           current's pt_regs are stored in kernel stack while IRQ which was
> interrupted occured.

Sort of.  It's true by the time you actually execute the IRQ handler.

I think that trying to do this is doomed to either failure or extreme
complexity.  You're in an NMI, so you could be part-way through a
context switch or you could be in the very first instruction of the
syscall handler.

On a quick look, there are plenty of other bugs in there besides just
the stack pointer issue.  The ABI check that uses TIF_IA32 in the perf
core is completely wrong.  TIF_IA32 may be equal to the actual
userspace bitness by luck, but, if so, that's more or less just luck.
And there's a user_mode test that should be user_mode_vm.

Also, it's not just sp that's wrong.  There are various places that
you can interrupt in which many of the registers have confusing
locations.  You could try using the cfi unwind data, but that's
unlikely to work for regs like cs and ss, and, during context switch,
this has very little chance of working.

What's the point of this feature?  Honestly, my suggestion would be to
delete it instead of trying to fix it.  It's also not clear to me that
there aren't serious security problems here -- it's entirely possible
for sensitive *kernel* values to and up in task_pt_regs at certain
times, and if you run during context switch and there's no code to
suppress this dump during context switch, then you could be showing
regs that belong to the wrong task.

--Andy

>
> Regards
> Chenggang
>
> ------------------------------------------------------------------
> 发件人:Andy Lutomirski <luto@amacapital.net>
> 发送时间:2014年12月23日(星期二) 16:30
> 收件人:root <chenggang.qin@gmail.com>,linux-kernel
> <linux-kernel@vger.kernel.org>
> 抄 送:秦承刚(承刚) <chenggang.qcg@taobao.com>,Andrew Morton
> <akpm@linux-foundation.org>,Arjan van de Ven <arjan@linux.intel.com>,David
> Ahern <dsahern@gmail.com>,Ingo Molnar <mingo@redhat.com>,Mike Galbraith
> <efault@gmx.de>,Namhyung Kim <namhyung@gmail.com>,Paul Mackerras
> <paulus@samba.org>,Peter Zijlstra <a.p.zijlstra@chello.nl>,Wu Fengguang
> <fengguang.wu@intel.com>,Yanmin Zhang <yanmin.zhang@intel.com>
> 主 题:Re: [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while
> output user regs
>
> On 12/22/2014 10:22 PM, root wrote:
>> From: Chenggang Qin <chenggang.qcg@taobao.com>
>>
>> For x86_64, the exact value of user stack's esp should be got by
>> KSTK_ESP(current).
>> current->thread.usersp is copied from PDA while enter ring0.
>> Now, we output the value of sp from pt_regs. But pt_regs->sp has changed
>> before
>> it was pushed into kernel stack.
>>
>> So, we cannot get the correct callchain while unwind some user stacks.
>> For example, if the stack contains __lll_unlock_wake()/__lll_lock_wait(),
>> the
>> callchain will break some times with the latest version of libunwind.
>> The root cause is the sp that is used by libunwind may be wrong.
>>
>> If we use KSTK_ESP(current), the correct callchain can be got everytime.
>> Other architectures also have KSTK_ESP() macro.
>>
>> Signed-off-by: Chenggang Qin <chenggang.qcg@taobao.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Arjan van de Ven <arjan@linux.intel.com>
>> Cc: David Ahern <dsahern@gmail.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Namhyung Kim <namhyung@gmail.com>
>> Cc: Paul Mackerras <paulus@samba.org>
>> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>> Cc: Yanmin Zhang <yanmin.zhang@intel.com>
>>
>> ---
>> arch/x86/kernel/perf_regs.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
>> index e309cc5..5da8df8 100644
>> --- a/arch/x86/kernel/perf_regs.c
>> +++ b/arch/x86/kernel/perf_regs.c
>> @@ -60,6 +60,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>> if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
>> return 0;
>>
>> + if (idx == PERF_REG_X86_SP)
>> + return KSTK_ESP(current);
>> +
>
> This patch is probably fine, but KSTK_ESP seems to be bogus:
>
> unsigned long KSTK_ESP(struct task_struct *task)
> {
> return (test_tsk_thread_flag(task, TIF_IA32)) ?
> (task_pt_regs(task)->sp) : ((task)->thread.usersp);
> }
>
> I swear that every time I've looked at anything that references TIF_IA32
> in the last two weeks, it's been wrong. This should be something like:
>
> if (task_thread_info(task)->status & TS_COMPAT)
> return task_pt_regs(task)->sp;
> else if (task == current && task is in a syscall)
> return current_user_stack_pointer();
> else if (task is not running && task is in a syscall)
> return task->thread.usersp;
> else if (task is not in a syscall)
> return task_pt_regs(task)->sp;
> else
> we're confused; give up.
>
> What context are you using KSTK_ESP in?
>
> --Andy
>
>> return regs_get_register(regs, pt_regs_offset[idx]);
>> }
>>
>>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2014-12-25 15:48   ` 答复:[PATCH] " Andy Lutomirski
@ 2014-12-25 16:21     ` Andy Lutomirski
  2014-12-30 19:03     ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2014-12-25 16:21 UTC (permalink / raw)
  To: 秦承刚(承刚)
  Cc: root, linux-kernel, 秦承刚(承刚),
	Andrew Morton, Arjan van de Ven, David Ahern, Ingo Molnar,
	Mike Galbraith, Namhyung Kim, Paul Mackerras, Peter Zijlstra,
	Wu Fengguang, Yanmin Zhang

On Thu, Dec 25, 2014 at 7:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Dec 25, 2014 at 4:13 AM, 秦承刚(承刚) <chenggang.qcg@alibaba-inc.com> wrote:
>> The context is NMI (PMU) or IRQ (hrtimer). It is a bit complex. The process
>> we want to sample is the current, so it is always running.
>> We need to distinguish between IRQ context, syscall or user context that are
>> interrupted by NMI.
>
> Oh.  So you really are trying to get the user regs from NMI context
> even if you interrupted the kernel.  This will be an unpleasant thing
> to do correctly.
>
>> Syscall: sp = current->thread.usersp;
>
> Nope.  For x86_64 at least, sp is in old_rsp, not usersp, *if* the
> syscall you interrupted was syscall64 instead of one of the ia32
> entries, which are differently strange.  If you've context switched
> out and back, then usersp will match it.  If not, then you're looking
> at some random stale sp value.
>
> Keep in mind that there is absolutely no guarantee that TIF_IA32
> matches the syscall entry type.
>
>>              old_rsp always point to current->thread.usersp. May be we
>> shouldn't use current_user_stack_pointer();
>> User: sp = task_pt_regs(task)->sp;
>>           current's pt_regs are stored in kernel stack while NMI or IRQ
>> occured.
>
> This is the only easy case.
>
>> IRQ: sp = task_pt_regs(task)->sp;
>>           current's pt_regs are stored in kernel stack while IRQ which was
>> interrupted occured.
>
> Sort of.  It's true by the time you actually execute the IRQ handler.
>
> I think that trying to do this is doomed to either failure or extreme
> complexity.  You're in an NMI, so you could be part-way through a
> context switch or you could be in the very first instruction of the
> syscall handler.
>
> On a quick look, there are plenty of other bugs in there besides just
> the stack pointer issue.  The ABI check that uses TIF_IA32 in the perf
> core is completely wrong.  TIF_IA32 may be equal to the actual
> userspace bitness by luck, but, if so, that's more or less just luck.
> And there's a user_mode test that should be user_mode_vm.
>
> Also, it's not just sp that's wrong.  There are various places that
> you can interrupt in which many of the registers have confusing
> locations.  You could try using the cfi unwind data, but that's
> unlikely to work for regs like cs and ss, and, during context switch,
> this has very little chance of working.

Even the unwinder won't be able to get rbx, rbp, r12, r13, r14, and
r15 right -- good luck handling FORK_LIKE, PTREGSCALL, etc.

--Andy

>
> What's the point of this feature?  Honestly, my suggestion would be to
> delete it instead of trying to fix it.  It's also not clear to me that
> there aren't serious security problems here -- it's entirely possible
> for sensitive *kernel* values to and up in task_pt_regs at certain
> times, and if you run during context switch and there's no code to
> suppress this dump during context switch, then you could be showing
> regs that belong to the wrong task.
>
> --Andy
>
>>
>> Regards
>> Chenggang
>>
>> ------------------------------------------------------------------
>> 发件人:Andy Lutomirski <luto@amacapital.net>
>> 发送时间:2014年12月23日(星期二) 16:30
>> 收件人:root <chenggang.qin@gmail.com>,linux-kernel
>> <linux-kernel@vger.kernel.org>
>> 抄 送:秦承刚(承刚) <chenggang.qcg@taobao.com>,Andrew Morton
>> <akpm@linux-foundation.org>,Arjan van de Ven <arjan@linux.intel.com>,David
>> Ahern <dsahern@gmail.com>,Ingo Molnar <mingo@redhat.com>,Mike Galbraith
>> <efault@gmx.de>,Namhyung Kim <namhyung@gmail.com>,Paul Mackerras
>> <paulus@samba.org>,Peter Zijlstra <a.p.zijlstra@chello.nl>,Wu Fengguang
>> <fengguang.wu@intel.com>,Yanmin Zhang <yanmin.zhang@intel.com>
>> 主 题:Re: [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while
>> output user regs
>>
>> On 12/22/2014 10:22 PM, root wrote:
>>> From: Chenggang Qin <chenggang.qcg@taobao.com>
>>>
>>> For x86_64, the exact value of user stack's esp should be got by
>>> KSTK_ESP(current).
>>> current->thread.usersp is copied from PDA while enter ring0.
>>> Now, we output the value of sp from pt_regs. But pt_regs->sp has changed
>>> before
>>> it was pushed into kernel stack.
>>>
>>> So, we cannot get the correct callchain while unwind some user stacks.
>>> For example, if the stack contains __lll_unlock_wake()/__lll_lock_wait(),
>>> the
>>> callchain will break some times with the latest version of libunwind.
>>> The root cause is the sp that is used by libunwind may be wrong.
>>>
>>> If we use KSTK_ESP(current), the correct callchain can be got everytime.
>>> Other architectures also have KSTK_ESP() macro.
>>>
>>> Signed-off-by: Chenggang Qin <chenggang.qcg@taobao.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Arjan van de Ven <arjan@linux.intel.com>
>>> Cc: David Ahern <dsahern@gmail.com>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Mike Galbraith <efault@gmx.de>
>>> Cc: Namhyung Kim <namhyung@gmail.com>
>>> Cc: Paul Mackerras <paulus@samba.org>
>>> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>>> Cc: Yanmin Zhang <yanmin.zhang@intel.com>
>>>
>>> ---
>>> arch/x86/kernel/perf_regs.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
>>> index e309cc5..5da8df8 100644
>>> --- a/arch/x86/kernel/perf_regs.c
>>> +++ b/arch/x86/kernel/perf_regs.c
>>> @@ -60,6 +60,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>>> if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
>>> return 0;
>>>
>>> + if (idx == PERF_REG_X86_SP)
>>> + return KSTK_ESP(current);
>>> +
>>
>> This patch is probably fine, but KSTK_ESP seems to be bogus:
>>
>> unsigned long KSTK_ESP(struct task_struct *task)
>> {
>> return (test_tsk_thread_flag(task, TIF_IA32)) ?
>> (task_pt_regs(task)->sp) : ((task)->thread.usersp);
>> }
>>
>> I swear that every time I've looked at anything that references TIF_IA32
>> in the last two weeks, it's been wrong. This should be something like:
>>
>> if (task_thread_info(task)->status & TS_COMPAT)
>> return task_pt_regs(task)->sp;
>> else if (task == current && task is in a syscall)
>> return current_user_stack_pointer();
>> else if (task is not running && task is in a syscall)
>> return task->thread.usersp;
>> else if (task is not in a syscall)
>> return task_pt_regs(task)->sp;
>> else
>> we're confused; give up.
>>
>> What context are you using KSTK_ESP in?
>>
>> --Andy
>>
>>> return regs_get_register(regs, pt_regs_offset[idx]);
>>> }
>>>
>>>
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2014-12-25 15:48   ` 答复:[PATCH] " Andy Lutomirski
  2014-12-25 16:21     ` Andy Lutomirski
@ 2014-12-30 19:03     ` Peter Zijlstra
  2014-12-30 23:29       ` Andy Lutomirski
  2015-01-04 16:10       ` Jiri Olsa
  1 sibling, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2014-12-30 19:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: 秦承刚(承刚),
	root, linux-kernel, 秦承刚(承刚),
	Andrew Morton, Arjan van de Ven, David Ahern, Ingo Molnar,
	Mike Galbraith, Namhyung Kim, Paul Mackerras, Wu Fengguang,
	Yanmin Zhang, jolsa, Stephane Eranian

On Thu, Dec 25, 2014 at 07:48:28AM -0800, Andy Lutomirski wrote:
> On a quick look, there are plenty of other bugs in there besides just
> the stack pointer issue.  The ABI check that uses TIF_IA32 in the perf
> core is completely wrong.  TIF_IA32 may be equal to the actual
> userspace bitness by luck, but, if so, that's more or less just luck.
> And there's a user_mode test that should be user_mode_vm.
> 
> Also, it's not just sp that's wrong.  There are various places that
> you can interrupt in which many of the registers have confusing
> locations.  You could try using the cfi unwind data, but that's
> unlikely to work for regs like cs and ss, and, during context switch,
> this has very little chance of working.
> 
> What's the point of this feature?  Honestly, my suggestion would be to
> delete it instead of trying to fix it.  It's also not clear to me that
> there aren't serious security problems here -- it's entirely possible
> for sensitive *kernel* values to and up in task_pt_regs at certain
> times, and if you run during context switch and there's no code to
> suppress this dump during context switch, then you could be showing
> regs that belong to the wrong task.

Of course the people who actually wrote the code are not on CC :/

There's two users of this iirc;

 1) the dwarf stack unwinder thingy, which basically dumps the userspace
 regs and the top of userspace stack on 'event'.

 2) the recent sample_regs_intr, which dumps the register set at
 'event', be it kernel or userspace.


The first is somewhat usable when lacking framepointers while still
desiring some unwind information, the second is useful to things like
call argument profiling and the like.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2014-12-30 19:03     ` Peter Zijlstra
@ 2014-12-30 23:29       ` Andy Lutomirski
  2014-12-31  2:00         ` Andy Lutomirski
  2015-01-04 16:10       ` Jiri Olsa
  1 sibling, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2014-12-30 23:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras,
	秦承刚(承刚),
	Yanmin Zhang

On Dec 30, 2014 11:03 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
> On Thu, Dec 25, 2014 at 07:48:28AM -0800, Andy Lutomirski wrote:
> > On a quick look, there are plenty of other bugs in there besides just
> > the stack pointer issue.  The ABI check that uses TIF_IA32 in the perf
> > core is completely wrong.  TIF_IA32 may be equal to the actual
> > userspace bitness by luck, but, if so, that's more or less just luck.
> > And there's a user_mode test that should be user_mode_vm.
> >
> > Also, it's not just sp that's wrong.  There are various places that
> > you can interrupt in which many of the registers have confusing
> > locations.  You could try using the cfi unwind data, but that's
> > unlikely to work for regs like cs and ss, and, during context switch,
> > this has very little chance of working.
> >
> > What's the point of this feature?  Honestly, my suggestion would be to
> > delete it instead of trying to fix it.  It's also not clear to me that
> > there aren't serious security problems here -- it's entirely possible
> > for sensitive *kernel* values to and up in task_pt_regs at certain
> > times, and if you run during context switch and there's no code to
> > suppress this dump during context switch, then you could be showing
> > regs that belong to the wrong task.
>
> Of course the people who actually wrote the code are not on CC :/
>
> There's two users of this iirc;
>
>  1) the dwarf stack unwinder thingy, which basically dumps the userspace
>  regs and the top of userspace stack on 'event'.
>

Given how the x86_64* entry code works, using task_pt_regs from
anywhere except explicitly supported contexts (including exceptions
that originated in userspace and a small handful of system calls) is
asking for trouble.  NMI context is especially bad.

How important is this feature, and which registers matter?  It might
be possible to use a dwarf unwinder on the kernel call stack to get
most of the regs from most contexts, and it might also be possible to
make small changes to the entry code to make it possible to get some
of the registers reliably, but it's not currently possible to safely
use task_pt_regs *at all* from NMI context unless you've at least
blacklisted a handful of origin RIP values that give dangerously bogus
results.  (Using do_nmi's regs parameter if user_mode_vm(regs) is a
different story.)

* I'm not nearly as familiar with the 32-bit entry code, so I don't
know whether we have the same issues there.

>  2) the recent sample_regs_intr, which dumps the register set at
>  'event', be it kernel or userspace.
>

What's wrong with the PMI's pt_regs for that?  If we interrupted the
kernel, they'll be kernel regs (with all their attendant security
issues) and, if we interrupted userspace, then they'll be the full,
correct userspace registers.

--Andy

>
> The first is somewhat usable when lacking framepointers while still
> desiring some unwind information, the second is useful to things like
> call argument profiling and the like.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2014-12-30 23:29       ` Andy Lutomirski
@ 2014-12-31  2:00         ` Andy Lutomirski
  2015-01-02 16:11           ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2014-12-31  2:00 UTC (permalink / raw)
  To: Peter Zijlstra, Jan Beulich
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras,
	秦承刚(承刚),
	Yanmin Zhang

On Tue, Dec 30, 2014 at 3:29 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Dec 30, 2014 11:03 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>>
>> On Thu, Dec 25, 2014 at 07:48:28AM -0800, Andy Lutomirski wrote:
>> > On a quick look, there are plenty of other bugs in there besides just
>> > the stack pointer issue.  The ABI check that uses TIF_IA32 in the perf
>> > core is completely wrong.  TIF_IA32 may be equal to the actual
>> > userspace bitness by luck, but, if so, that's more or less just luck.
>> > And there's a user_mode test that should be user_mode_vm.
>> >
>> > Also, it's not just sp that's wrong.  There are various places that
>> > you can interrupt in which many of the registers have confusing
>> > locations.  You could try using the cfi unwind data, but that's
>> > unlikely to work for regs like cs and ss, and, during context switch,
>> > this has very little chance of working.
>> >
>> > What's the point of this feature?  Honestly, my suggestion would be to
>> > delete it instead of trying to fix it.  It's also not clear to me that
>> > there aren't serious security problems here -- it's entirely possible
>> > for sensitive *kernel* values to and up in task_pt_regs at certain
>> > times, and if you run during context switch and there's no code to
>> > suppress this dump during context switch, then you could be showing
>> > regs that belong to the wrong task.
>>
>> Of course the people who actually wrote the code are not on CC :/
>>
>> There's two users of this iirc;
>>
>>  1) the dwarf stack unwinder thingy, which basically dumps the userspace
>>  regs and the top of userspace stack on 'event'.
>>
>
> Given how the x86_64* entry code works, using task_pt_regs from
> anywhere except explicitly supported contexts (including exceptions
> that originated in userspace and a small handful of system calls) is
> asking for trouble.  NMI context is especially bad.
>
> How important is this feature, and which registers matter?  It might
> be possible to use a dwarf unwinder on the kernel call stack to get
> most of the regs from most contexts, and it might also be possible to
> make small changes to the entry code to make it possible to get some
> of the registers reliably, but it's not currently possible to safely
> use task_pt_regs *at all* from NMI context unless you've at least
> blacklisted a handful of origin RIP values that give dangerously bogus
> results.  (Using do_nmi's regs parameter if user_mode_vm(regs) is a
> different story.)

It's actually worse than just knowing the interrupted kernel RIP.  If
the call chain goes usermode -> IST exception -> NMI, then
task_pt_regs is entirely uninitialized.  Assuming all the CFI
annotations are correct, the unwinder could still do it from the
kernel.

Note that, as far as I know, Jan Beulich is the only person who uses
the unwinder on kernel code.  Jan, how do you do this?

>
> * I'm not nearly as familiar with the 32-bit entry code, so I don't
> know whether we have the same issues there.
>
>>  2) the recent sample_regs_intr, which dumps the register set at
>>  'event', be it kernel or userspace.
>>
>
> What's wrong with the PMI's pt_regs for that?  If we interrupted the
> kernel, they'll be kernel regs (with all their attendant security
> issues) and, if we interrupted userspace, then they'll be the full,
> correct userspace registers.
>
> --Andy
>
>>
>> The first is somewhat usable when lacking framepointers while still
>> desiring some unwind information, the second is useful to things like
>> call argument profiling and the like.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2014-12-31  2:00         ` Andy Lutomirski
@ 2015-01-02 16:11           ` Jan Beulich
  2015-01-02 18:03             ` Andy Lutomirski
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2015-01-02 16:11 UTC (permalink / raw)
  To: luto, peterz
  Cc: chenggang.qcg, chenggang.qin, dsahern, namhyung, efault, eranian,
	fengguang.wu, yanmin.zhang, akpm, arjan, jolsa, mingo, paulus,
	chenggang.qcg, linux-kernel

>>> Andy Lutomirski <luto@amacapital.net> 12/31/14 3:00 AM >>>
>On Tue, Dec 30, 2014 at 3:29 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> Given how the x86_64* entry code works, using task_pt_regs from
>> anywhere except explicitly supported contexts (including exceptions
>> that originated in userspace and a small handful of system calls) is
>> asking for trouble.  NMI context is especially bad.
>>
>> How important is this feature, and which registers matter?  It might
>> be possible to use a dwarf unwinder on the kernel call stack to get
>> most of the regs from most contexts, and it might also be possible to
>> make small changes to the entry code to make it possible to get some
>> of the registers reliably, but it's not currently possible to safely
>> use task_pt_regs *at all* from NMI context unless you've at least
>> blacklisted a handful of origin RIP values that give dangerously bogus
>> results.  (Using do_nmi's regs parameter if user_mode_vm(regs) is a
>> different story.)
>
>It's actually worse than just knowing the interrupted kernel RIP.  If
>the call chain goes usermode -> IST exception -> NMI, then
>task_pt_regs is entirely uninitialized.  Assuming all the CFI
>annotations are correct, the unwinder could still do it from the
>kernel.
>
>Note that, as far as I know, Jan Beulich is the only person who uses
>the unwinder on kernel code.  Jan, how do you do this?

Trying to guess what you mean by "this": A stack switch gets expressed by
CFI annotations just like any other frame pointer adjustments. See for example
the CFI_DEF_CFA_REGISTER use in the SAVE_ARGS_IRQ macro.

If that wasn't your question, please be more precise.

Jan


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2015-01-02 16:11           ` Jan Beulich
@ 2015-01-02 18:03             ` Andy Lutomirski
  2015-01-05  8:47               ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2015-01-02 18:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Chenggang Qin, Namhyung Kim, Paul Mackerras, Arjan van de Ven,
	Stephane Eranian, Jiri Olsa, Andrew Morton, Yanmin Zhang,
	Peter Zijlstra, linux-kernel, Mike Galbraith,
	秦承刚(承刚),
	root, Fengguang Wu, David Ahern, Ingo Molnar

On Jan 2, 2015 8:11 AM, "Jan Beulich" <jbeulich@suse.com> wrote:
>
> >>> Andy Lutomirski <luto@amacapital.net> 12/31/14 3:00 AM >>>
> >On Tue, Dec 30, 2014 at 3:29 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> Given how the x86_64* entry code works, using task_pt_regs from
> >> anywhere except explicitly supported contexts (including exceptions
> >> that originated in userspace and a small handful of system calls) is
> >> asking for trouble.  NMI context is especially bad.
> >>
> >> How important is this feature, and which registers matter?  It might
> >> be possible to use a dwarf unwinder on the kernel call stack to get
> >> most of the regs from most contexts, and it might also be possible to
> >> make small changes to the entry code to make it possible to get some
> >> of the registers reliably, but it's not currently possible to safely
> >> use task_pt_regs *at all* from NMI context unless you've at least
> >> blacklisted a handful of origin RIP values that give dangerously bogus
> >> results.  (Using do_nmi's regs parameter if user_mode_vm(regs) is a
> >> different story.)
> >
> >It's actually worse than just knowing the interrupted kernel RIP.  If
> >the call chain goes usermode -> IST exception -> NMI, then
> >task_pt_regs is entirely uninitialized.  Assuming all the CFI
> >annotations are correct, the unwinder could still do it from the
> >kernel.
> >
> >Note that, as far as I know, Jan Beulich is the only person who uses
> >the unwinder on kernel code.  Jan, how do you do this?
>
> Trying to guess what you mean by "this": A stack switch gets expressed by
> CFI annotations just like any other frame pointer adjustments. See for example
> the CFI_DEF_CFA_REGISTER use in the SAVE_ARGS_IRQ macro.
>
> If that wasn't your question, please be more precise.

Sorry, my question was vague.

Is there any way to consume these annotations at runtime in the
kernel?  The goal would be for perf's NMI handler to consume the CFI
data to figure out the userspace registers.  I'm guessing that the
answer might be no, because we seem to be compiling with
-fno-asynchronous-unwind-tables and we don't seem to be putting any
.eh_frame stuff into the final kernel image.

I had thought that someone implemented runtime DWARF unwinding, though.

--Andy

>
> Jan
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2014-12-30 19:03     ` Peter Zijlstra
  2014-12-30 23:29       ` Andy Lutomirski
@ 2015-01-04 16:10       ` Jiri Olsa
  2015-01-04 17:18         ` Andy Lutomirski
  1 sibling, 1 reply; 23+ messages in thread
From: Jiri Olsa @ 2015-01-04 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, 秦承刚(承刚),
	root, linux-kernel, 秦承刚(承刚),
	Andrew Morton, Arjan van de Ven, David Ahern, Ingo Molnar,
	Mike Galbraith, Namhyung Kim, Paul Mackerras, Wu Fengguang,
	Yanmin Zhang, Stephane Eranian

On Tue, Dec 30, 2014 at 08:03:27PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 25, 2014 at 07:48:28AM -0800, Andy Lutomirski wrote:
> > On a quick look, there are plenty of other bugs in there besides just
> > the stack pointer issue.  The ABI check that uses TIF_IA32 in the perf
> > core is completely wrong.  TIF_IA32 may be equal to the actual
> > userspace bitness by luck, but, if so, that's more or less just luck.
> > And there's a user_mode test that should be user_mode_vm.
> > 
> > Also, it's not just sp that's wrong.  There are various places that
> > you can interrupt in which many of the registers have confusing
> > locations.  You could try using the cfi unwind data, but that's
> > unlikely to work for regs like cs and ss, and, during context switch,
> > this has very little chance of working.
> > 
> > What's the point of this feature?  Honestly, my suggestion would be to
> > delete it instead of trying to fix it.  It's also not clear to me that
> > there aren't serious security problems here -- it's entirely possible
> > for sensitive *kernel* values to and up in task_pt_regs at certain
> > times, and if you run during context switch and there's no code to
> > suppress this dump during context switch, then you could be showing
> > regs that belong to the wrong task.
> 
> Of course the people who actually wrote the code are not on CC :/
> 
> There's two users of this iirc;
> 
>  1) the dwarf stack unwinder thingy, which basically dumps the userspace
>  regs and the top of userspace stack on 'event'.

looks like this solves the issue I was trying to fix
long time ago:
  http://marc.info/?l=linux-kernel&m=134934717011451&w=2

this seems a lot simpler ;-) I'll test..

jirka

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2015-01-04 16:10       ` Jiri Olsa
@ 2015-01-04 17:18         ` Andy Lutomirski
  2015-01-04 17:41           ` Jiri Olsa
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2015-01-04 17:18 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Stephane Eranian, Ingo Molnar, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Peter Zijlstra,
	Arjan van de Ven, linux-kernel, David Ahern, Paul Mackerras,
	秦承刚(承刚),
	Yanmin Zhang

On Jan 4, 2015 8:11 AM, "Jiri Olsa" <jolsa@redhat.com> wrote:
>
> On Tue, Dec 30, 2014 at 08:03:27PM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 25, 2014 at 07:48:28AM -0800, Andy Lutomirski wrote:
> > > On a quick look, there are plenty of other bugs in there besides just
> > > the stack pointer issue.  The ABI check that uses TIF_IA32 in the perf
> > > core is completely wrong.  TIF_IA32 may be equal to the actual
> > > userspace bitness by luck, but, if so, that's more or less just luck.
> > > And there's a user_mode test that should be user_mode_vm.
> > >
> > > Also, it's not just sp that's wrong.  There are various places that
> > > you can interrupt in which many of the registers have confusing
> > > locations.  You could try using the cfi unwind data, but that's
> > > unlikely to work for regs like cs and ss, and, during context switch,
> > > this has very little chance of working.
> > >
> > > What's the point of this feature?  Honestly, my suggestion would be to
> > > delete it instead of trying to fix it.  It's also not clear to me that
> > > there aren't serious security problems here -- it's entirely possible
> > > for sensitive *kernel* values to and up in task_pt_regs at certain
> > > times, and if you run during context switch and there's no code to
> > > suppress this dump during context switch, then you could be showing
> > > regs that belong to the wrong task.
> >
> > Of course the people who actually wrote the code are not on CC :/
> >
> > There's two users of this iirc;
> >
> >  1) the dwarf stack unwinder thingy, which basically dumps the userspace
> >  regs and the top of userspace stack on 'event'.
>
> looks like this solves the issue I was trying to fix
> long time ago:
>   http://marc.info/?l=linux-kernel&m=134934717011451&w=2
>
> this seems a lot simpler ;-) I'll test..

I suspect that, if it works, then it's pure luck.
(task)->thread.usersp in KSTK_ESP is bogus -- your code was more
accurate.

I think we should seriously consider making use of this feature by
non-root users require an explicit sysctl.  Sending values to user
code that are, at best, free of sensitive kernel data most of the time
is IMO inappropriate for an unprivileged API.

I'm currently working on a patch to try to clean this up better.

--Andy

>
> jirka

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2015-01-04 17:18         ` Andy Lutomirski
@ 2015-01-04 17:41           ` Jiri Olsa
  2015-01-04 18:36             ` [PATCH 0/2] perf: Improve user regs sampling Andy Lutomirski
  0 siblings, 1 reply; 23+ messages in thread
From: Jiri Olsa @ 2015-01-04 17:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stephane Eranian, Ingo Molnar, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Peter Zijlstra,
	Arjan van de Ven, linux-kernel, David Ahern, Paul Mackerras,
	秦承刚(承刚),
	Yanmin Zhang

On Sun, Jan 04, 2015 at 09:18:59AM -0800, Andy Lutomirski wrote:
> On Jan 4, 2015 8:11 AM, "Jiri Olsa" <jolsa@redhat.com> wrote:
> >
> > On Tue, Dec 30, 2014 at 08:03:27PM +0100, Peter Zijlstra wrote:
> > > On Thu, Dec 25, 2014 at 07:48:28AM -0800, Andy Lutomirski wrote:
> > > > On a quick look, there are plenty of other bugs in there besides just
> > > > the stack pointer issue.  The ABI check that uses TIF_IA32 in the perf
> > > > core is completely wrong.  TIF_IA32 may be equal to the actual
> > > > userspace bitness by luck, but, if so, that's more or less just luck.
> > > > And there's a user_mode test that should be user_mode_vm.
> > > >
> > > > Also, it's not just sp that's wrong.  There are various places that
> > > > you can interrupt in which many of the registers have confusing
> > > > locations.  You could try using the cfi unwind data, but that's
> > > > unlikely to work for regs like cs and ss, and, during context switch,
> > > > this has very little chance of working.
> > > >
> > > > What's the point of this feature?  Honestly, my suggestion would be to
> > > > delete it instead of trying to fix it.  It's also not clear to me that
> > > > there aren't serious security problems here -- it's entirely possible
> > > > for sensitive *kernel* values to and up in task_pt_regs at certain
> > > > times, and if you run during context switch and there's no code to
> > > > suppress this dump during context switch, then you could be showing
> > > > regs that belong to the wrong task.
> > >
> > > Of course the people who actually wrote the code are not on CC :/
> > >
> > > There's two users of this iirc;
> > >
> > >  1) the dwarf stack unwinder thingy, which basically dumps the userspace
> > >  regs and the top of userspace stack on 'event'.
> >
> > looks like this solves the issue I was trying to fix
> > long time ago:
> >   http://marc.info/?l=linux-kernel&m=134934717011451&w=2
> >
> > this seems a lot simpler ;-) I'll test..
> 
> I suspect that, if it works, then it's pure luck.
> (task)->thread.usersp in KSTK_ESP is bogus -- your code was more
> accurate.
> 
> I think we should seriously consider making use of this feature by
> non-root users require an explicit sysctl.  Sending values to user
> code that are, at best, free of sensitive kernel data most of the time
> is IMO inappropriate for an unprivileged API.
> 
> I'm currently working on a patch to try to clean this up better.

ook.. well FWIW I tested and if I used KSTK_ESP as a perf user stack
pointer (attached patch) I've got all callchains resolved properly,
as when I tested my original patch

NOTE the patch from Chenggang Qin isnt enough to fix my issue for perf
     dwarf unwind, I needed to use attached change

I'd be happy to test any change you make for this,
because this has been pain for a long time :-\

thanks,
jirka


---
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 5da8df8303cf..eb7c385a6f8e 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -66,6 +66,11 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 	return regs_get_register(regs, pt_regs_offset[idx]);
 }
 
+unsigned long arch_perf_user_stack_pointer(struct pt_regs *regs)
+{
+	return KSTK_ESP(current);
+}
+
 #define REG_RESERVED (~((1ULL << PERF_REG_X86_MAX) - 1ULL))
 
 #ifdef CONFIG_X86_32
diff --git a/kernel/events/core.c b/kernel/events/core.c
index af0a5ba4e21d..d99236173f3b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4486,6 +4486,19 @@ static void perf_sample_regs_intr(struct perf_regs *regs_intr,
 	regs_intr->abi  = perf_reg_abi(current);
 }
 
+#ifdef CONFIG_HAVE_PERF_USER_STACK_DUMP
+unsigned long __weak arch_perf_user_stack_pointer(struct pt_regs *regs)
+{
+        return user_stack_pointer(regs);
+}
+
+static unsigned long perf_user_stack_pointer(struct pt_regs *regs)
+{
+        return arch_perf_user_stack_pointer(regs);
+}
+#else
+#define perf_user_stack_pointer(regs) 0
+#endif /* CONFIG_HAVE_PERF_USER_STACK_DUMP */
 
 /*
  * Get remaining task size from user stack pointer.
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782ad..b85e4fd52980 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -180,19 +180,17 @@ static inline void put_recursion_context(int *recursion, int rctx)
 }
 
 #ifdef CONFIG_HAVE_PERF_USER_STACK_DUMP
+unsigned long arch_perf_user_stack_pointer(struct pt_regs *regs);
+
 static inline bool arch_perf_have_user_stack_dump(void)
 {
 	return true;
 }
-
-#define perf_user_stack_pointer(regs) user_stack_pointer(regs)
 #else
 static inline bool arch_perf_have_user_stack_dump(void)
 {
 	return false;
 }
-
-#define perf_user_stack_pointer(regs) 0
 #endif /* CONFIG_HAVE_PERF_USER_STACK_DUMP */
 
 #endif /* _KERNEL_EVENTS_INTERNAL_H */

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 0/2] perf: Improve user regs sampling
  2015-01-04 17:41           ` Jiri Olsa
@ 2015-01-04 18:36             ` Andy Lutomirski
  2015-01-04 18:36               ` [PATCH 1/2] perf: Move task_pt_regs sampling into arch code Andy Lutomirski
                                 ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Andy Lutomirski @ 2015-01-04 18:36 UTC (permalink / raw)
  To: Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	秦承刚(承刚)
  Cc: Andy Lutomirski, Stephane Eranian, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras, Yanmin Zhang

This seems to improve the situation on x86_64 from almost entirely
broken to mostly working.  Tested with:

perf record -e cycles --call-graph=dwarf ls

Andy Lutomirski (2):
  perf: Move task_pt_regs sampling into arch code
  x86_64, perf: Improve user regs sampling

 arch/arm/kernel/perf_regs.c   |  8 ++++
 arch/arm64/kernel/perf_regs.c |  8 ++++
 arch/x86/kernel/perf_regs.c   | 90 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/perf_event.h    | 12 +++---
 include/linux/perf_regs.h     | 16 ++++++++
 kernel/events/core.c          | 19 ++++-----
 6 files changed, 137 insertions(+), 16 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] perf: Move task_pt_regs sampling into arch code
  2015-01-04 18:36             ` [PATCH 0/2] perf: Improve user regs sampling Andy Lutomirski
@ 2015-01-04 18:36               ` Andy Lutomirski
  2015-01-05 14:07                 ` Peter Zijlstra
  2015-01-09 12:32                 ` [tip:perf/urgent] " tip-bot for Andy Lutomirski
  2015-01-04 18:36               ` [PATCH 2/2] x86_64, perf: Improve user regs sampling Andy Lutomirski
  2015-01-05 10:46               ` [PATCH 0/2] perf: " Jiri Olsa
  2 siblings, 2 replies; 23+ messages in thread
From: Andy Lutomirski @ 2015-01-04 18:36 UTC (permalink / raw)
  To: Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	秦承刚(承刚)
  Cc: Andy Lutomirski, Stephane Eranian, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras, Yanmin Zhang

On x86_64, at least, task_pt_regs may be only partially initialized
in many contexts, so x86_64 should not use it without extra care
from interrupt context, let alone NMI context.

This will allow x86_64 to override the logic and will supply some
scratch space to use to make a cleaner copy of user regs.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/arm/kernel/perf_regs.c   |  8 ++++++++
 arch/arm64/kernel/perf_regs.c |  8 ++++++++
 arch/x86/kernel/perf_regs.c   | 16 ++++++++++++++++
 include/linux/perf_event.h    | 12 +++++++-----
 include/linux/perf_regs.h     | 16 ++++++++++++++++
 kernel/events/core.c          | 19 ++++++++-----------
 6 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 6e4379c67cbc..592dda3f21ff 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -28,3 +28,11 @@ u64 perf_reg_abi(struct task_struct *task)
 {
 	return PERF_SAMPLE_REGS_ABI_32;
 }
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 6762ad705587..3f62b35fb6f1 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -50,3 +50,11 @@ u64 perf_reg_abi(struct task_struct *task)
 	else
 		return PERF_SAMPLE_REGS_ABI_64;
 }
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index e309cc5c276e..3bbbb1a4fb52 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -78,6 +78,14 @@ u64 perf_reg_abi(struct task_struct *task)
 {
 	return PERF_SAMPLE_REGS_ABI_32;
 }
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
 #else /* CONFIG_X86_64 */
 #define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \
 		       (1ULL << PERF_REG_X86_ES) | \
@@ -102,4 +110,12 @@ u64 perf_reg_abi(struct task_struct *task)
 	else
 		return PERF_SAMPLE_REGS_ABI_64;
 }
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
 #endif /* CONFIG_X86_32 */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 486e84ccb1f9..4f7a61ca4b39 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -79,11 +79,6 @@ struct perf_branch_stack {
 	struct perf_branch_entry	entries[0];
 };
 
-struct perf_regs {
-	__u64		abi;
-	struct pt_regs	*regs;
-};
-
 struct task_struct;
 
 /*
@@ -610,7 +605,14 @@ struct perf_sample_data {
 		u32	reserved;
 	}				cpu_entry;
 	struct perf_callchain_entry	*callchain;
+
+	/*
+	 * regs_user may point to task_pt_regs or to regs_user_copy, depending
+	 * on arch details.
+	 */
 	struct perf_regs		regs_user;
+	struct pt_regs			regs_user_copy;
+
 	struct perf_regs		regs_intr;
 	u64				stack_user_size;
 } ____cacheline_aligned;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 3c73d5fe18be..a5f98d53d732 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -1,11 +1,19 @@
 #ifndef _LINUX_PERF_REGS_H
 #define _LINUX_PERF_REGS_H
 
+struct perf_regs {
+	__u64		abi;
+	struct pt_regs	*regs;
+};
+
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
 u64 perf_reg_value(struct pt_regs *regs, int idx);
 int perf_reg_validate(u64 mask);
 u64 perf_reg_abi(struct task_struct *task);
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy);
 #else
 static inline u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
@@ -21,5 +29,13 @@ static inline u64 perf_reg_abi(struct task_struct *task)
 {
 	return PERF_SAMPLE_REGS_ABI_NONE;
 }
+
+static inline void perf_get_regs_user(struct perf_regs *regs_user,
+				      struct pt_regs *regs,
+				      struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
 #endif /* CONFIG_HAVE_PERF_REGS */
 #endif /* _LINUX_PERF_REGS_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c1ee7f2bebc..882f835a0d85 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4461,18 +4461,14 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 }
 
 static void perf_sample_regs_user(struct perf_regs *regs_user,
-				  struct pt_regs *regs)
+				  struct pt_regs *regs,
+				  struct pt_regs *regs_user_copy)
 {
-	if (!user_mode(regs)) {
-		if (current->mm)
-			regs = task_pt_regs(current);
-		else
-			regs = NULL;
-	}
-
-	if (regs) {
-		regs_user->abi  = perf_reg_abi(current);
+	if (user_mode(regs)) {
+		regs_user->abi = perf_reg_abi(current);
 		regs_user->regs = regs;
+	} else if (current->mm) {
+		perf_get_regs_user(regs_user, regs, regs_user_copy);
 	} else {
 		regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
 		regs_user->regs = NULL;
@@ -4951,7 +4947,8 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 
 	if (sample_type & (PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER))
-		perf_sample_regs_user(&data->regs_user, regs);
+		perf_sample_regs_user(&data->regs_user, regs,
+				      &data->regs_user_copy);
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
 		/* regs dump ABI info */
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/2] x86_64, perf: Improve user regs sampling
  2015-01-04 18:36             ` [PATCH 0/2] perf: Improve user regs sampling Andy Lutomirski
  2015-01-04 18:36               ` [PATCH 1/2] perf: Move task_pt_regs sampling into arch code Andy Lutomirski
@ 2015-01-04 18:36               ` Andy Lutomirski
  2015-01-09 12:32                 ` [tip:perf/urgent] perf/x86_64: " tip-bot for Andy Lutomirski
  2015-01-05 10:46               ` [PATCH 0/2] perf: " Jiri Olsa
  2 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2015-01-04 18:36 UTC (permalink / raw)
  To: Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	秦承刚(承刚)
  Cc: Andy Lutomirski, Stephane Eranian, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras, Yanmin Zhang

Perf reports user regs for kernel-mode samples so that samples can
be backtraced through user code.  The old code was very broken in
syscall context, resulting in useless backtraces.

The new code, in contrast, is still dangerously racy, but it should
at least work most of the time.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/perf_regs.c | 78 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 76 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 3bbbb1a4fb52..781861cc5ee8 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -115,7 +115,81 @@ void perf_get_regs_user(struct perf_regs *regs_user,
 			struct pt_regs *regs,
 			struct pt_regs *regs_user_copy)
 {
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
+	struct pt_regs *user_regs = task_pt_regs(current);
+
+	/*
+	 * If we're in an NMI that interrupted task_pt_regs setup, then
+	 * we can't sample user regs at all.  This check isn't really
+	 * sufficient, though, as we could be in an NMI inside an interrupt
+	 * that happened during task_pt_regs setup.
+	 */
+	if (regs->sp > (unsigned long)&user_regs->r11 &&
+	    regs->sp <= (unsigned long)(user_regs + 1)) {
+		regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
+		regs_user->regs = NULL;
+		return;
+	}
+
+	/*
+	 * RIP, flags, and the argument registers are usually saved.
+	 * orig_ax is probably okay, too.
+	 */
+	regs_user_copy->ip = user_regs->ip;
+	regs_user_copy->cx = user_regs->cx;
+	regs_user_copy->dx = user_regs->dx;
+	regs_user_copy->si = user_regs->si;
+	regs_user_copy->di = user_regs->di;
+	regs_user_copy->r8 = user_regs->r8;
+	regs_user_copy->r9 = user_regs->r9;
+	regs_user_copy->r10 = user_regs->r10;
+	regs_user_copy->r11 = user_regs->r11;
+	regs_user_copy->orig_ax = user_regs->orig_ax;
+	regs_user_copy->flags = user_regs->flags;
+
+	/*
+	 * Don't even try to report the "rest" regs.
+	 */
+	regs_user_copy->bx = -1;
+	regs_user_copy->bp = -1;
+	regs_user_copy->r12 = -1;
+	regs_user_copy->r13 = -1;
+	regs_user_copy->r14 = -1;
+	regs_user_copy->r15 = -1;
+
+	/*
+	 * For this to be at all useful, we need a reasonable guess for
+	 * sp and the ABI.  Be careful: we're in NMI context, and we're
+	 * considering current to be the current task, so we should
+	 * be careful not to look at any other percpu variables that might
+	 * change during context switches.
+	 */
+	if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
+	    task_thread_info(current)->status & TS_COMPAT) {
+		/* Easy case: we're in a compat syscall. */
+		regs_user->abi = PERF_SAMPLE_REGS_ABI_32;
+		regs_user_copy->sp = user_regs->sp;
+		regs_user_copy->cs = user_regs->cs;
+		regs_user_copy->ss = user_regs->ss;
+	} else if (user_regs->orig_ax != -1) {
+		/*
+		 * We're probably in a 64-bit syscall.
+		 * Warning: this code is severely racy.  At least it's better
+		 * than just blindly copying user_regs.
+		 */
+		regs_user->abi = PERF_SAMPLE_REGS_ABI_64;
+		regs_user_copy->sp = this_cpu_read(old_rsp);
+		regs_user_copy->cs = __USER_CS;
+		regs_user_copy->ss = __USER_DS;
+		regs_user_copy->cx = -1;  /* usually contains garbage */
+	} else {
+		/* We're probably in an interrupt or exception. */
+		regs_user->abi = user_64bit_mode(user_regs) ?
+			PERF_SAMPLE_REGS_ABI_64 : PERF_SAMPLE_REGS_ABI_32;
+		regs_user_copy->sp = user_regs->sp;
+		regs_user_copy->cs = user_regs->cs;
+		regs_user_copy->ss = user_regs->ss;
+	}
+
+	regs_user->regs = regs_user_copy;
 }
 #endif /* CONFIG_X86_32 */
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: 答复:[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs
  2015-01-02 18:03             ` Andy Lutomirski
@ 2015-01-05  8:47               ` Jan Beulich
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2015-01-05  8:47 UTC (permalink / raw)
  To: luto
  Cc: chenggang.qcg, chenggang.qin, dsahern, namhyung, efault, eranian,
	peterz, fengguang.wu, yanmin.zhang, akpm, arjan, jolsa, mingo,
	paulus, chenggang.qcg, linux-kernel

>>> Andy Lutomirski <luto@amacapital.net> 01/02/15 7:03 PM >>>
>On Jan 2, 2015 8:11 AM, "Jan Beulich" <jbeulich@suse.com> wrote:
>> Trying to guess what you mean by "this": A stack switch gets expressed by
>> CFI annotations just like any other frame pointer adjustments. See for example
>> the CFI_DEF_CFA_REGISTER use in the SAVE_ARGS_IRQ macro.
>>
>> If that wasn't your question, please be more precise.
>
>Sorry, my question was vague.
>
>Is there any way to consume these annotations at runtime in the
>kernel?  The goal would be for perf's NMI handler to consume the CFI
>data to figure out the userspace registers.  I'm guessing that the
>answer might be no, because we seem to be compiling with
>-fno-asynchronous-unwind-tables and we don't seem to be putting any
>.eh_frame stuff into the final kernel image.
>
>I had thought that someone implemented runtime DWARF unwinding, though.

Yes, we do have such code in our kernels, but Linus continues to veto it going
into his tree.

Jan


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/2] perf: Improve user regs sampling
  2015-01-04 18:36             ` [PATCH 0/2] perf: Improve user regs sampling Andy Lutomirski
  2015-01-04 18:36               ` [PATCH 1/2] perf: Move task_pt_regs sampling into arch code Andy Lutomirski
  2015-01-04 18:36               ` [PATCH 2/2] x86_64, perf: Improve user regs sampling Andy Lutomirski
@ 2015-01-05 10:46               ` Jiri Olsa
  2 siblings, 0 replies; 23+ messages in thread
From: Jiri Olsa @ 2015-01-05 10:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Peter Zijlstra,
	秦承刚(承刚),
	Stephane Eranian, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras, Yanmin Zhang

On Sun, Jan 04, 2015 at 10:36:18AM -0800, Andy Lutomirski wrote:
> This seems to improve the situation on x86_64 from almost entirely
> broken to mostly working.  Tested with:
> 
> perf record -e cycles --call-graph=dwarf ls

looks good to me.. also:

Tested-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

> 
> Andy Lutomirski (2):
>   perf: Move task_pt_regs sampling into arch code
>   x86_64, perf: Improve user regs sampling
> 
>  arch/arm/kernel/perf_regs.c   |  8 ++++
>  arch/arm64/kernel/perf_regs.c |  8 ++++
>  arch/x86/kernel/perf_regs.c   | 90 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/perf_event.h    | 12 +++---
>  include/linux/perf_regs.h     | 16 ++++++++
>  kernel/events/core.c          | 19 ++++-----
>  6 files changed, 137 insertions(+), 16 deletions(-)
> 
> -- 
> 2.1.0
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] perf: Move task_pt_regs sampling into arch code
  2015-01-04 18:36               ` [PATCH 1/2] perf: Move task_pt_regs sampling into arch code Andy Lutomirski
@ 2015-01-05 14:07                 ` Peter Zijlstra
  2015-01-05 16:13                   ` Andy Lutomirski
  2015-01-09 12:32                 ` [tip:perf/urgent] " tip-bot for Andy Lutomirski
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-01-05 14:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Olsa, Ingo Molnar,
	秦承刚(承刚),
	Stephane Eranian, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Namhyung Kim, Mike Galbraith, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras, Yanmin Zhang

On Sun, Jan 04, 2015 at 10:36:19AM -0800, Andy Lutomirski wrote:
> On x86_64, at least, task_pt_regs may be only partially initialized
> in many contexts, so x86_64 should not use it without extra care
> from interrupt context, let alone NMI context.
> 
> This will allow x86_64 to override the logic and will supply some
> scratch space to use to make a cleaner copy of user regs.

Just wondering how bad it would be to fill out the actual pt_regs that
was previously partially initialized?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] perf: Move task_pt_regs sampling into arch code
  2015-01-05 14:07                 ` Peter Zijlstra
@ 2015-01-05 16:13                   ` Andy Lutomirski
  2015-01-05 16:44                     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2015-01-05 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Mike Galbraith, Namhyung Kim, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras,
	秦承刚(承刚),
	Yanmin Zhang

n Jan 5, 2015 6:08 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
> On Sun, Jan 04, 2015 at 10:36:19AM -0800, Andy Lutomirski wrote:
> > On x86_64, at least, task_pt_regs may be only partially initialized
> > in many contexts, so x86_64 should not use it without extra care
> > from interrupt context, let alone NMI context.
> >
> > This will allow x86_64 to override the logic and will supply some
> > scratch space to use to make a cleaner copy of user regs.
>
> Just wondering how bad it would be to fill out the actual pt_regs that
> was previously partially initialized?

Bad, for at least two reasons.

One is that we don't actually know which regs are initialized.  bx,
bp, r12 etc are particularly bad in this regard, due to the FORK_LIKE
mechanism and similar optimizations.

The other is that the uninitialized part of task_pt_regs can be used
for something else entirely.  If we have a syscall instruction
immediately followed by a regular interrupt, then the interrupt's
hardware frame will overlap task_pt_regs.  (I'm not going to claim
that this design is sensible, but it is what it is.  IIRC Denys
Vlasenko had some patches to partially clean this up.)

It would be possible to rework the code to avoid an extra pt_regs
copy, but I don't see an obvious way to do it cleanly.

--Andy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] perf: Move task_pt_regs sampling into arch code
  2015-01-05 16:13                   ` Andy Lutomirski
@ 2015-01-05 16:44                     ` Peter Zijlstra
  2015-01-05 18:28                       ` Andy Lutomirski
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-01-05 16:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Mike Galbraith, Namhyung Kim, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras,
	秦承刚(承刚),
	Yanmin Zhang

On Mon, Jan 05, 2015 at 08:13:49AM -0800, Andy Lutomirski wrote:
> > Just wondering how bad it would be to fill out the actual pt_regs that
> > was previously partially initialized?
> 
> Bad, for at least two reasons.
> 
> One is that we don't actually know which regs are initialized.  bx,
> bp, r12 etc are particularly bad in this regard, due to the FORK_LIKE
> mechanism and similar optimizations.

Right, but you need to deal with that anyhow.

> The other is that the uninitialized part of task_pt_regs can be used
> for something else entirely.  If we have a syscall instruction
> immediately followed by a regular interrupt, then the interrupt's
> hardware frame will overlap task_pt_regs.  (I'm not going to claim
> that this design is sensible, but it is what it is.  IIRC Denys
> Vlasenko had some patches to partially clean this up.)

Ah, urgh. Yes painful that.

> It would be possible to rework the code to avoid an extra pt_regs
> copy, but I don't see an obvious way to do it cleanly.

Yeah, we'll see how this works, I was just curious on the exact need for
the copy, but if as you say, the original structure might not even exist
properly (even though we have a pointer to it) that's bad (TM).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] perf: Move task_pt_regs sampling into arch code
  2015-01-05 16:44                     ` Peter Zijlstra
@ 2015-01-05 18:28                       ` Andy Lutomirski
  0 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2015-01-05 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, root, Andrew Morton,
	秦承刚(承刚),
	Wu Fengguang, Mike Galbraith, Namhyung Kim, Arjan van de Ven,
	linux-kernel, David Ahern, Paul Mackerras,
	秦承刚(承刚),
	Yanmin Zhang

On Mon, Jan 5, 2015 at 8:44 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jan 05, 2015 at 08:13:49AM -0800, Andy Lutomirski wrote:
>> > Just wondering how bad it would be to fill out the actual pt_regs that
>> > was previously partially initialized?
>>
>> Bad, for at least two reasons.
>>
>> One is that we don't actually know which regs are initialized.  bx,
>> bp, r12 etc are particularly bad in this regard, due to the FORK_LIKE
>> mechanism and similar optimizations.
>
> Right, but you need to deal with that anyhow.

I cheated by assuming they're uninitialized.  If we actually wrote -1
over them and they were initialized, then we'd be screwed.

>
>> The other is that the uninitialized part of task_pt_regs can be used
>> for something else entirely.  If we have a syscall instruction
>> immediately followed by a regular interrupt, then the interrupt's
>> hardware frame will overlap task_pt_regs.  (I'm not going to claim
>> that this design is sensible, but it is what it is.  IIRC Denys
>> Vlasenko had some patches to partially clean this up.)
>
> Ah, urgh. Yes painful that.
>
>> It would be possible to rework the code to avoid an extra pt_regs
>> copy, but I don't see an obvious way to do it cleanly.
>
> Yeah, we'll see how this works, I was just curious on the exact need for
> the copy, but if as you say, the original structure might not even exist
> properly (even though we have a pointer to it) that's bad (TM).

How much freedom do we have to redesign the whole mechanism?  Instead
of recording the user state from the PMI, we could set a flag so that
we'd sample the user state before return to user mode.  This would
reliably get all of the registers, except in the case where the PMI
was part way through the return-to-user code.  In that case, we'd want
to set another flag to catch the next entry to avoid writing a bogus
sample.

I don't think we could do this without changing the ABI, though.

This would waste ~300 cycles if we set the flag from a fast-path
syscall, at least until my magic sysret optimization goes in.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip:perf/urgent] perf: Move task_pt_regs sampling into arch code
  2015-01-04 18:36               ` [PATCH 1/2] perf: Move task_pt_regs sampling into arch code Andy Lutomirski
  2015-01-05 14:07                 ` Peter Zijlstra
@ 2015-01-09 12:32                 ` tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-01-09 12:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: namhyung, acme, linux, catalin.marinas, efault, eranian, msalter,
	luto, jolsa, torvalds, dsahern, tglx, will.deacon, mingo, hpa,
	jean.pihet, fengguang.wu, arjan, peterz, linux-kernel

Commit-ID:  88a7c26af8dab2f2d69f5a6067eb670694ec38c0
Gitweb:     http://git.kernel.org/tip/88a7c26af8dab2f2d69f5a6067eb670694ec38c0
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Sun, 4 Jan 2015 10:36:19 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 9 Jan 2015 11:12:28 +0100

perf: Move task_pt_regs sampling into arch code

On x86_64, at least, task_pt_regs may be only partially initialized
in many contexts, so x86_64 should not use it without extra care
from interrupt context, let alone NMI context.

This will allow x86_64 to override the logic and will supply some
scratch space to use to make a cleaner copy of user regs.

Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: chenggang.qcg@taobao.com
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/arm/kernel/perf_regs.c   |  8 ++++++++
 arch/arm64/kernel/perf_regs.c |  8 ++++++++
 arch/x86/kernel/perf_regs.c   | 16 ++++++++++++++++
 include/linux/perf_event.h    | 12 +++++++-----
 include/linux/perf_regs.h     | 16 ++++++++++++++++
 kernel/events/core.c          | 19 ++++++++-----------
 6 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 6e4379c..592dda3 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -28,3 +28,11 @@ u64 perf_reg_abi(struct task_struct *task)
 {
 	return PERF_SAMPLE_REGS_ABI_32;
 }
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 6762ad7..3f62b35 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -50,3 +50,11 @@ u64 perf_reg_abi(struct task_struct *task)
 	else
 		return PERF_SAMPLE_REGS_ABI_64;
 }
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index e309cc5..3bbbb1a 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -78,6 +78,14 @@ u64 perf_reg_abi(struct task_struct *task)
 {
 	return PERF_SAMPLE_REGS_ABI_32;
 }
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
 #else /* CONFIG_X86_64 */
 #define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \
 		       (1ULL << PERF_REG_X86_ES) | \
@@ -102,4 +110,12 @@ u64 perf_reg_abi(struct task_struct *task)
 	else
 		return PERF_SAMPLE_REGS_ABI_64;
 }
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
 #endif /* CONFIG_X86_32 */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 486e84c..4f7a61c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -79,11 +79,6 @@ struct perf_branch_stack {
 	struct perf_branch_entry	entries[0];
 };
 
-struct perf_regs {
-	__u64		abi;
-	struct pt_regs	*regs;
-};
-
 struct task_struct;
 
 /*
@@ -610,7 +605,14 @@ struct perf_sample_data {
 		u32	reserved;
 	}				cpu_entry;
 	struct perf_callchain_entry	*callchain;
+
+	/*
+	 * regs_user may point to task_pt_regs or to regs_user_copy, depending
+	 * on arch details.
+	 */
 	struct perf_regs		regs_user;
+	struct pt_regs			regs_user_copy;
+
 	struct perf_regs		regs_intr;
 	u64				stack_user_size;
 } ____cacheline_aligned;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 3c73d5f..a5f98d5 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -1,11 +1,19 @@
 #ifndef _LINUX_PERF_REGS_H
 #define _LINUX_PERF_REGS_H
 
+struct perf_regs {
+	__u64		abi;
+	struct pt_regs	*regs;
+};
+
 #ifdef CONFIG_HAVE_PERF_REGS
 #include <asm/perf_regs.h>
 u64 perf_reg_value(struct pt_regs *regs, int idx);
 int perf_reg_validate(u64 mask);
 u64 perf_reg_abi(struct task_struct *task);
+void perf_get_regs_user(struct perf_regs *regs_user,
+			struct pt_regs *regs,
+			struct pt_regs *regs_user_copy);
 #else
 static inline u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
@@ -21,5 +29,13 @@ static inline u64 perf_reg_abi(struct task_struct *task)
 {
 	return PERF_SAMPLE_REGS_ABI_NONE;
 }
+
+static inline void perf_get_regs_user(struct perf_regs *regs_user,
+				      struct pt_regs *regs,
+				      struct pt_regs *regs_user_copy)
+{
+	regs_user->regs = task_pt_regs(current);
+	regs_user->abi = perf_reg_abi(current);
+}
 #endif /* CONFIG_HAVE_PERF_REGS */
 #endif /* _LINUX_PERF_REGS_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c1ee7f..882f835 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4461,18 +4461,14 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 }
 
 static void perf_sample_regs_user(struct perf_regs *regs_user,
-				  struct pt_regs *regs)
+				  struct pt_regs *regs,
+				  struct pt_regs *regs_user_copy)
 {
-	if (!user_mode(regs)) {
-		if (current->mm)
-			regs = task_pt_regs(current);
-		else
-			regs = NULL;
-	}
-
-	if (regs) {
-		regs_user->abi  = perf_reg_abi(current);
+	if (user_mode(regs)) {
+		regs_user->abi = perf_reg_abi(current);
 		regs_user->regs = regs;
+	} else if (current->mm) {
+		perf_get_regs_user(regs_user, regs, regs_user_copy);
 	} else {
 		regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
 		regs_user->regs = NULL;
@@ -4951,7 +4947,8 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 
 	if (sample_type & (PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER))
-		perf_sample_regs_user(&data->regs_user, regs);
+		perf_sample_regs_user(&data->regs_user, regs,
+				      &data->regs_user_copy);
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
 		/* regs dump ABI info */

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:perf/urgent] perf/x86_64: Improve user regs sampling
  2015-01-04 18:36               ` [PATCH 2/2] x86_64, perf: Improve user regs sampling Andy Lutomirski
@ 2015-01-09 12:32                 ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-01-09 12:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jolsa, linux-kernel, tglx, peterz, hpa, arjan, torvalds, dsahern,
	namhyung, mingo, luto, akpm, fengguang.wu, efault, eranian

Commit-ID:  86c269fea37334687b1c0789e6444be0d750e8a6
Gitweb:     http://git.kernel.org/tip/86c269fea37334687b1c0789e6444be0d750e8a6
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Sun, 4 Jan 2015 10:36:20 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 9 Jan 2015 11:12:29 +0100

perf/x86_64: Improve user regs sampling

Perf reports user regs for kernel-mode samples so that samples can
be backtraced through user code.  The old code was very broken in
syscall context, resulting in useless backtraces.

The new code, in contrast, is still dangerously racy, but it should
at least work most of the time.

Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: chenggang.qcg@taobao.com
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/243560c26ff0f739978e2459e203f6515367634d.1420396372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/perf_regs.c | 78 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 76 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 3bbbb1a..781861c 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -115,7 +115,81 @@ void perf_get_regs_user(struct perf_regs *regs_user,
 			struct pt_regs *regs,
 			struct pt_regs *regs_user_copy)
 {
-	regs_user->regs = task_pt_regs(current);
-	regs_user->abi = perf_reg_abi(current);
+	struct pt_regs *user_regs = task_pt_regs(current);
+
+	/*
+	 * If we're in an NMI that interrupted task_pt_regs setup, then
+	 * we can't sample user regs at all.  This check isn't really
+	 * sufficient, though, as we could be in an NMI inside an interrupt
+	 * that happened during task_pt_regs setup.
+	 */
+	if (regs->sp > (unsigned long)&user_regs->r11 &&
+	    regs->sp <= (unsigned long)(user_regs + 1)) {
+		regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
+		regs_user->regs = NULL;
+		return;
+	}
+
+	/*
+	 * RIP, flags, and the argument registers are usually saved.
+	 * orig_ax is probably okay, too.
+	 */
+	regs_user_copy->ip = user_regs->ip;
+	regs_user_copy->cx = user_regs->cx;
+	regs_user_copy->dx = user_regs->dx;
+	regs_user_copy->si = user_regs->si;
+	regs_user_copy->di = user_regs->di;
+	regs_user_copy->r8 = user_regs->r8;
+	regs_user_copy->r9 = user_regs->r9;
+	regs_user_copy->r10 = user_regs->r10;
+	regs_user_copy->r11 = user_regs->r11;
+	regs_user_copy->orig_ax = user_regs->orig_ax;
+	regs_user_copy->flags = user_regs->flags;
+
+	/*
+	 * Don't even try to report the "rest" regs.
+	 */
+	regs_user_copy->bx = -1;
+	regs_user_copy->bp = -1;
+	regs_user_copy->r12 = -1;
+	regs_user_copy->r13 = -1;
+	regs_user_copy->r14 = -1;
+	regs_user_copy->r15 = -1;
+
+	/*
+	 * For this to be at all useful, we need a reasonable guess for
+	 * sp and the ABI.  Be careful: we're in NMI context, and we're
+	 * considering current to be the current task, so we should
+	 * be careful not to look at any other percpu variables that might
+	 * change during context switches.
+	 */
+	if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
+	    task_thread_info(current)->status & TS_COMPAT) {
+		/* Easy case: we're in a compat syscall. */
+		regs_user->abi = PERF_SAMPLE_REGS_ABI_32;
+		regs_user_copy->sp = user_regs->sp;
+		regs_user_copy->cs = user_regs->cs;
+		regs_user_copy->ss = user_regs->ss;
+	} else if (user_regs->orig_ax != -1) {
+		/*
+		 * We're probably in a 64-bit syscall.
+		 * Warning: this code is severely racy.  At least it's better
+		 * than just blindly copying user_regs.
+		 */
+		regs_user->abi = PERF_SAMPLE_REGS_ABI_64;
+		regs_user_copy->sp = this_cpu_read(old_rsp);
+		regs_user_copy->cs = __USER_CS;
+		regs_user_copy->ss = __USER_DS;
+		regs_user_copy->cx = -1;  /* usually contains garbage */
+	} else {
+		/* We're probably in an interrupt or exception. */
+		regs_user->abi = user_64bit_mode(user_regs) ?
+			PERF_SAMPLE_REGS_ABI_64 : PERF_SAMPLE_REGS_ABI_32;
+		regs_user_copy->sp = user_regs->sp;
+		regs_user_copy->cs = user_regs->cs;
+		regs_user_copy->ss = user_regs->ss;
+	}
+
+	regs_user->regs = regs_user_copy;
 }
 #endif /* CONFIG_X86_32 */

^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-01-09 12:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-23  6:22 [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs root
2014-12-23  8:30 ` Andy Lutomirski
     [not found] ` <c027bde0-5f4f-441f-8d45-3e7f6f702231@alibaba-inc.com>
2014-12-25 15:48   ` 答复:[PATCH] " Andy Lutomirski
2014-12-25 16:21     ` Andy Lutomirski
2014-12-30 19:03     ` Peter Zijlstra
2014-12-30 23:29       ` Andy Lutomirski
2014-12-31  2:00         ` Andy Lutomirski
2015-01-02 16:11           ` Jan Beulich
2015-01-02 18:03             ` Andy Lutomirski
2015-01-05  8:47               ` Jan Beulich
2015-01-04 16:10       ` Jiri Olsa
2015-01-04 17:18         ` Andy Lutomirski
2015-01-04 17:41           ` Jiri Olsa
2015-01-04 18:36             ` [PATCH 0/2] perf: Improve user regs sampling Andy Lutomirski
2015-01-04 18:36               ` [PATCH 1/2] perf: Move task_pt_regs sampling into arch code Andy Lutomirski
2015-01-05 14:07                 ` Peter Zijlstra
2015-01-05 16:13                   ` Andy Lutomirski
2015-01-05 16:44                     ` Peter Zijlstra
2015-01-05 18:28                       ` Andy Lutomirski
2015-01-09 12:32                 ` [tip:perf/urgent] " tip-bot for Andy Lutomirski
2015-01-04 18:36               ` [PATCH 2/2] x86_64, perf: Improve user regs sampling Andy Lutomirski
2015-01-09 12:32                 ` [tip:perf/urgent] perf/x86_64: " tip-bot for Andy Lutomirski
2015-01-05 10:46               ` [PATCH 0/2] perf: " Jiri Olsa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.