* Random shadow stack pointer corruption
@ 2020-07-18 17:57 Yu-cheng Yu
  2020-07-18 18:00 ` Andy Lutomirski
  0 siblings, 1 reply; 9+ messages in thread
From: Yu-cheng Yu @ 2020-07-18 17:57 UTC (permalink / raw)
  To: LKML, x86, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	H.J. Lu, Ingo Molnar, Ravi V. Shankar, Sebastian Andrzej Siewior,
	Tony Luck, Thomas Gleixner, Peter Zijlstra, Weijiang Yang

Hi,

My shadow stack tests started showing random shadow stack pointer corruption on
kernels after v5.7 (v5.7 itself is unaffected).  The symptom looks like a locking
issue, or as if the kernel is confused about which CPU a task is on.  On later
tip/master, it can be triggered by creating two tasks that each call
pthread_create()/pthread_join() in a continuous loop.  If the kernel is booted
with maxcpus=1, the issue goes away.  I also checked XSAVES/XRSTORS, but the
problem does not seem to come from there.

The tests I run take a long time to complete, and some commits visited during
bisection do not show failures right away.  However, the issue becomes much
easier to trigger after this commit:

d77290507ab2 x86/entry/32: Convert IRET exception to IDTENTRY_SW

Can anyone suggest where to look?

Thanks,
Yu-cheng
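
The reproducer described above (two tasks, each continuously calling
pthread_create()/pthread_join()) might look roughly like the following.  This
is a minimal sketch, not the actual test from the report: the names thread_fn,
create_join_loop and run_stress are hypothetical, using fork() to create the
two tasks is an assumption, and the iteration count is arbitrary.  On a kernel
without the out-of-tree shadow stack patches it should simply run to
completion.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body: return immediately so the loop stresses create/join only. */
static void *thread_fn(void *arg)
{
	return arg;
}

/* Create and join one thread per iteration; returns 0 on success. */
static int create_join_loop(long iterations)
{
	for (long i = 0; i < iterations; i++) {
		pthread_t tid;

		if (pthread_create(&tid, NULL, thread_fn, NULL) != 0)
			return -1;
		if (pthread_join(tid, NULL) != 0)
			return -1;
	}
	return 0;
}

/*
 * Fork ntasks children (an assumption about how the "tasks" are created),
 * run the create/join loop in each, and return the number of children that
 * could not be started, exited with an error, or were killed by a signal.
 */
static int run_stress(int ntasks, long iterations)
{
	int failures = 0;
	int started = 0;

	assert(ntasks > 0 && iterations > 0);

	for (int i = 0; i < ntasks; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			failures++;
			continue;
		}
		if (pid == 0)
			_exit(create_join_loop(iterations) == 0 ? 0 : 1);
		started++;
	}

	for (int i = 0; i < started; i++) {
		int status;

		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failures++;
	}
	return failures;
}
```

Calling run_stress(2, 1000000) from main() and building with -pthread
approximates the two-task setup; a nonzero return value means at least one
child failed or was killed by a signal.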



* Re: Random shadow stack pointer corruption
  2020-07-18 17:57 Random shadow stack pointer corruption Yu-cheng Yu
@ 2020-07-18 18:00 ` Andy Lutomirski
  2020-07-18 18:24   ` Yu-cheng Yu
  0 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2020-07-18 18:00 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: LKML, X86 ML, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	H.J. Lu, Ingo Molnar, Ravi V. Shankar, Sebastian Andrzej Siewior,
	Tony Luck, Thomas Gleixner, Peter Zijlstra, Weijiang Yang

On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> Hi,
>
> My shadow stack tests start to have random shadow stack pointer corruption after
> v5.7 (excluding).  The symptom looks like some locking issue or the kernel is
> confused about which CPU a task is on.  In later tip/master, this can be
> triggered by creating two tasks and each does continuous
> pthread_create()/pthread_join().  If the kernel has max_cpus=1, the issue goes
> away.  I also checked XSAVES/XRSTORS, but this does not seem to be an issue
> coming from there.

What do you mean by "shadow stack pointer corruption"?  Is SSP itself
corrupt while running in the kernel?  Is one of the MSRs getting
corrupted?  Is the memory to which the shadow stack points getting
corrupted?  Is the CPU rejecting an attempt to change SSP?

--Andy

>
> The tests I run take a long time to complete, and some commit points in bisect
> do not show failures right away.  However, the issue can be more easily
> triggered after the point of:
>
> d77290507ab2 x86/entry/32: Convert IRET exception to IDTENTRY_SW
>
> Can anyone help me find places to look at?
>
> Thanks,
> Yu-cheng
>


* Re: Random shadow stack pointer corruption
  2020-07-18 18:00 ` Andy Lutomirski
@ 2020-07-18 18:24   ` Yu-cheng Yu
  2020-07-18 18:27     ` Andy Lutomirski
  2020-07-18 22:41     ` Dave Hansen
  0 siblings, 2 replies; 9+ messages in thread
From: Yu-cheng Yu @ 2020-07-18 18:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Borislav Petkov, Dave Hansen, H.J. Lu, Ingo Molnar,
	Ravi V. Shankar, Sebastian Andrzej Siewior, Tony Luck,
	Thomas Gleixner, Peter Zijlstra, Weijiang Yang

On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote:
> On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > Hi,
> > 
> > My shadow stack tests start to have random shadow stack pointer corruption after
> > v5.7 (excluding).  The symptom looks like some locking issue or the kernel is
> > confused about which CPU a task is on.  In later tip/master, this can be
> > triggered by creating two tasks and each does continuous
> > pthread_create()/pthread_join().  If the kernel has max_cpus=1, the issue goes
> > away.  I also checked XSAVES/XRSTORS, but this does not seem to be an issue
> > coming from there.
> 
> What do you mean "shadow stack pointer corruption"?  Is SSP itself
> corrupt while running in the kernel?  Is one of the MSRs getting
> corrupted?  Is the memory to which the shadow stack points getting
> corrupted? Is the CPU rejecting an attempt to change SSP?

What I see is that after a new thread goes through ret_from_fork() and IRETs
back to ring 3, its shadow stack pointer (MSR_IA32_PL3_SSP) is corrupted.





* Re: Random shadow stack pointer corruption
  2020-07-18 18:24   ` Yu-cheng Yu
@ 2020-07-18 18:27     ` Andy Lutomirski
  2020-07-18 22:41     ` Dave Hansen
  1 sibling, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2020-07-18 18:27 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Andy Lutomirski, LKML, X86 ML, Borislav Petkov, Dave Hansen,
	H.J. Lu, Ingo Molnar, Ravi V. Shankar, Sebastian Andrzej Siewior,
	Tony Luck, Thomas Gleixner, Peter Zijlstra, Weijiang Yang

On Sat, Jul 18, 2020 at 11:25 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote:
> > On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > Hi,
> > >
> > > My shadow stack tests start to have random shadow stack pointer corruption after
> > > v5.7 (excluding).  The symptom looks like some locking issue or the kernel is
> > > confused about which CPU a task is on.  In later tip/master, this can be
> > > triggered by creating two tasks and each does continuous
> > > pthread_create()/pthread_join().  If the kernel has max_cpus=1, the issue goes
> > > away.  I also checked XSAVES/XRSTORS, but this does not seem to be an issue
> > > coming from there.
> >
> > What do you mean "shadow stack pointer corruption"?  Is SSP itself
> > corrupt while running in the kernel?  Is one of the MSRs getting
> > corrupted?  Is the memory to which the shadow stack points getting
> > corrupted? Is the CPU rejecting an attempt to change SSP?
>
> What I see is, a new thread after ret_from_fork() and iret back to ring-3,
> its shadow stack pointer (MSR_IA32_PL3_SSP) is corrupted.
>
>
>

This is going to be impossible to diagnose, given that the upstream
kernel doesn't know about these MSRs at all.  If you point to a git
tree, maybe I can spot the issue.


* Re: Random shadow stack pointer corruption
  2020-07-18 18:24   ` Yu-cheng Yu
  2020-07-18 18:27     ` Andy Lutomirski
@ 2020-07-18 22:41     ` Dave Hansen
  2020-07-18 23:04       ` H.J. Lu
  2020-07-18 23:34       ` Yu-cheng Yu
  1 sibling, 2 replies; 9+ messages in thread
From: Dave Hansen @ 2020-07-18 22:41 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: LKML, X86 ML, Borislav Petkov, Dave Hansen, H.J. Lu, Ingo Molnar,
	Ravi V. Shankar, Sebastian Andrzej Siewior, Tony Luck,
	Thomas Gleixner, Peter Zijlstra, Weijiang Yang

On 7/18/20 11:24 AM, Yu-cheng Yu wrote:
> On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote:
>> On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>> Hi,
>>>
>>> My shadow stack tests start to have random shadow stack pointer corruption after
>>> v5.7 (excluding).  The symptom looks like some locking issue or the kernel is
>>> confused about which CPU a task is on.  In later tip/master, this can be
>>> triggered by creating two tasks and each does continuous
>>> pthread_create()/pthread_join().  If the kernel has max_cpus=1, the issue goes
>>> away.  I also checked XSAVES/XRSTORS, but this does not seem to be an issue
>>> coming from there.
>>
>> What do you mean "shadow stack pointer corruption"?  Is SSP itself
>> corrupt while running in the kernel?  Is one of the MSRs getting
>> corrupted?  Is the memory to which the shadow stack points getting
>> corrupted? Is the CPU rejecting an attempt to change SSP?
> 
> What I see is, a new thread after ret_from_fork() and iret back to ring-3, 
> its shadow stack pointer (MSR_IA32_PL3_SSP) is corrupted.

Does corrupt mean random?  Or is it a valid stack address, just not for
_this_ thread?  Or NULL?  Or is it a kernel address?  Have you tried
tracing *ALL* the WRMSR's and XRSTOR's that write to the MSR?
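
The questions above amount to classifying the bad SSP value.  A minimal sketch
of such a check follows, assuming the caller already knows each task's shadow
stack range.  The struct shstk_range type, the enum names and classify_ssp()
are hypothetical helpers, not kernel APIs.  The sketch treats any address with
bit 63 set as a kernel address, which holds for canonical x86-64 addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ssp_class {
	SSP_NULL,	/* SSP is zero */
	SSP_KERNEL,	/* upper canonical half, i.e. a kernel address */
	SSP_OWN,	/* inside this thread's shadow stack */
	SSP_OTHER_TASK,	/* inside another task's shadow stack */
	SSP_UNKNOWN,	/* none of the above, e.g. a random value */
};

/* Hypothetical description of one shadow stack mapping. */
struct shstk_range {
	uint64_t base;
	uint64_t size;
};

/*
 * The top of the range is included on purpose: an empty shadow stack has
 * SSP == base + size.
 */
static int in_range(const struct shstk_range *r, uint64_t addr)
{
	return addr >= r->base && addr - r->base <= r->size;
}

static enum ssp_class classify_ssp(uint64_t ssp,
				   const struct shstk_range *own,
				   const struct shstk_range *others,
				   size_t n_others)
{
	assert(own != NULL);

	if (ssp == 0)
		return SSP_NULL;
	if (ssp >> 63)
		return SSP_KERNEL;
	if (in_range(own, ssp))
		return SSP_OWN;
	for (size_t i = 0; i < n_others; i++)
		if (in_range(&others[i], ssp))
			return SSP_OTHER_TASK;
	return SSP_UNKNOWN;
}
```

A result of SSP_OTHER_TASK would correspond to the "valid stack address, just
not for this thread" case.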


* Re: Random shadow stack pointer corruption
  2020-07-18 22:41     ` Dave Hansen
@ 2020-07-18 23:04       ` H.J. Lu
  2020-07-18 23:34       ` Yu-cheng Yu
  1 sibling, 0 replies; 9+ messages in thread
From: H.J. Lu @ 2020-07-18 23:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, Andy Lutomirski, LKML, X86 ML, Borislav Petkov,
	Dave Hansen, Ingo Molnar, Ravi V. Shankar,
	Sebastian Andrzej Siewior, Tony Luck, Thomas Gleixner,
	Peter Zijlstra, Weijiang Yang

On Sat, Jul 18, 2020 at 3:41 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 7/18/20 11:24 AM, Yu-cheng Yu wrote:
> > On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote:
> >> On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> >>> Hi,
> >>>
> >>> My shadow stack tests start to have random shadow stack pointer corruption after
> >>> v5.7 (excluding).  The symptom looks like some locking issue or the kernel is
> >>> confused about which CPU a task is on.  In later tip/master, this can be
> >>> triggered by creating two tasks and each does continuous
> >>> pthread_create()/pthread_join().  If the kernel has max_cpus=1, the issue goes
> >>> away.  I also checked XSAVES/XRSTORS, but this does not seem to be an issue
> >>> coming from there.
> >>
> >> What do you mean "shadow stack pointer corruption"?  Is SSP itself
> >> corrupt while running in the kernel?  Is one of the MSRs getting
> >> corrupted?  Is the memory to which the shadow stack points getting
> >> corrupted? Is the CPU rejecting an attempt to change SSP?
> >
> > What I see is, a new thread after ret_from_fork() and iret back to ring-3,
> > its shadow stack pointer (MSR_IA32_PL3_SSP) is corrupted.
>
> Does corrupt mean random?  Or is it a valid stack address, just not for
> _this_ thread?  Or NULL?  Or is it a kernel address?  Have you tried
> tracing *ALL* the WRMSR's and XRSTOR's that write to the MSR?

Another data point: when the memory corruption happened, no core dump was
produced at all.  We verified that core dumps were enabled, and we did
get core dumps for other programs.


-- 
H.J.


* Re: Random shadow stack pointer corruption
  2020-07-18 22:41     ` Dave Hansen
  2020-07-18 23:04       ` H.J. Lu
@ 2020-07-18 23:34       ` Yu-cheng Yu
  2020-07-29  0:35         ` H.J. Lu
  1 sibling, 1 reply; 9+ messages in thread
From: Yu-cheng Yu @ 2020-07-18 23:34 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: LKML, X86 ML, Borislav Petkov, Dave Hansen, H.J. Lu, Ingo Molnar,
	Ravi V. Shankar, Sebastian Andrzej Siewior, Tony Luck,
	Thomas Gleixner, Peter Zijlstra, Weijiang Yang

On Sat, 2020-07-18 at 15:41 -0700, Dave Hansen wrote:
> On 7/18/20 11:24 AM, Yu-cheng Yu wrote:
> > On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote:
> > > On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > > Hi,
> > > > 
> > > > My shadow stack tests start to have random shadow stack pointer corruption after
> > > > v5.7 (excluding).  The symptom looks like some locking issue or the kernel is
> > > > confused about which CPU a task is on.  In later tip/master, this can be
> > > > triggered by creating two tasks and each does continuous
> > > > pthread_create()/pthread_join().  If the kernel has max_cpus=1, the issue goes
> > > > away.  I also checked XSAVES/XRSTORS, but this does not seem to be an issue
> > > > coming from there.
> > > 
> > > What do you mean "shadow stack pointer corruption"?  Is SSP itself
> > > corrupt while running in the kernel?  Is one of the MSRs getting
> > > corrupted?  Is the memory to which the shadow stack points getting
> > > corrupted? Is the CPU rejecting an attempt to change SSP?
> > 
> > What I see is, a new thread after ret_from_fork() and iret back to ring-3, 
> > its shadow stack pointer (MSR_IA32_PL3_SSP) is corrupted.
> 
> Does corrupt mean random?  Or is it a valid stack address, just not for
> _this_ thread?  Or NULL?  Or is it a kernel address?  Have you tried
> tracing *ALL* the WRMSR's and XRSTOR's that write to the MSR?

When the shadow stack pointer gets changed, the new address appears to belong
to another task.  I traced all WRMSRs and XRSTORs, and also verified that there
has not been any XRSTORS from a wrong buffer.  When rc6 is tagged, I will
rebase, test, and share the current patches.



* Re: Random shadow stack pointer corruption
  2020-07-18 23:34       ` Yu-cheng Yu
@ 2020-07-29  0:35         ` H.J. Lu
  2020-07-29  0:56           ` Andy Lutomirski
  0 siblings, 1 reply; 9+ messages in thread
From: H.J. Lu @ 2020-07-29  0:35 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Dave Hansen, Andy Lutomirski, LKML, X86 ML, Borislav Petkov,
	Dave Hansen, Ingo Molnar, Ravi V. Shankar,
	Sebastian Andrzej Siewior, Tony Luck, Thomas Gleixner,
	Peter Zijlstra, Weijiang Yang

On Sat, Jul 18, 2020 at 4:35 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> On Sat, 2020-07-18 at 15:41 -0700, Dave Hansen wrote:
> > On 7/18/20 11:24 AM, Yu-cheng Yu wrote:
> > > On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote:
> > > > On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > > > Hi,
> > > > >
> > > > > My shadow stack tests start to have random shadow stack pointer corruption after
> > > > > v5.7 (excluding).  The symptom looks like some locking issue or the kernel is
> > > > > confused about which CPU a task is on.  In later tip/master, this can be
> > > > > triggered by creating two tasks and each does continuous
> > > > > pthread_create()/pthread_join().  If the kernel has max_cpus=1, the issue goes
> > > > > away.  I also checked XSAVES/XRSTORS, but this does not seem to be an issue
> > > > > coming from there.
> > > >
> > > > What do you mean "shadow stack pointer corruption"?  Is SSP itself
> > > > corrupt while running in the kernel?  Is one of the MSRs getting
> > > > corrupted?  Is the memory to which the shadow stack points getting
> > > > corrupted? Is the CPU rejecting an attempt to change SSP?
> > >
> > > What I see is, a new thread after ret_from_fork() and iret back to ring-3,
> > > its shadow stack pointer (MSR_IA32_PL3_SSP) is corrupted.
> >
> > Does corrupt mean random?  Or is it a valid stack address, just not for
> > _this_ thread?  Or NULL?  Or is it a kernel address?  Have you tried
> > tracing *ALL* the WRMSR's and XRSTOR's that write to the MSR?
>
> When a shadow stack address is changed, the address appears to be other task's.
> I traced all WRMSR's and XRSTOR's.  I also verified there have not been any
> XRSTORS from a wrong buffer.  When rc6 is tagged, I will re-base, test, and
> share current patches.
>

We have identified that

commit 91eeafea1e4b7c95cc4f38af186d7d48fceef89a
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu May 21 22:05:28 2020 +0200

    x86/entry: Switch page fault exception to IDTENTRY_RAW

    Convert page fault exceptions to IDTENTRY_RAW:

      - Implement the C entry point with DEFINE_IDTENTRY_RAW
      - Add the CR2 read into the exception handler
      - Add the idtentry_enter/exit_cond_rcu() invocations in
        in the regular page fault handler and in the async PF
        part.
      - Emit the ASM stub with DECLARE_IDTENTRY_RAW
      - Remove the ASM idtentry in 64-bit
      - Remove the CR2 read from 64-bit
      - Remove the open coded ASM entry code in 32-bit
      - Fix up the XEN/PV code
      - Remove the old prototypes

    No functional change.

triggered the shadow stack corruption when the process returned from a syscall.
The SSP MSR was somehow changed between the point where it is set and the IRET.
Could there be a page fault between setting the SSP MSR and IRET?

-- 
H.J.


* Re: Random shadow stack pointer corruption
  2020-07-29  0:35         ` H.J. Lu
@ 2020-07-29  0:56           ` Andy Lutomirski
  0 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2020-07-29  0:56 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Yu-cheng Yu, Dave Hansen, Andy Lutomirski, LKML, X86 ML,
	Borislav Petkov, Dave Hansen, Ingo Molnar, Ravi V. Shankar,
	Sebastian Andrzej Siewior, Tony Luck, Thomas Gleixner,
	Peter Zijlstra, Weijiang Yang



> On Jul 28, 2020, at 5:36 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> 
> On Sat, Jul 18, 2020 at 4:35 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>> 
>>> On Sat, 2020-07-18 at 15:41 -0700, Dave Hansen wrote:
>>> On 7/18/20 11:24 AM, Yu-cheng Yu wrote:
>>>> On Sat, 2020-07-18 at 11:00 -0700, Andy Lutomirski wrote:
>>>>> On Sat, Jul 18, 2020 at 10:58 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> My shadow stack tests start to have random shadow stack pointer corruption after
>>>>>> v5.7 (excluding).  The symptom looks like some locking issue or the kernel is
>>>>>> confused about which CPU a task is on.  In later tip/master, this can be
>>>>>> triggered by creating two tasks and each does continuous
>>>>>> pthread_create()/pthread_join().  If the kernel has max_cpus=1, the issue goes
>>>>>> away.  I also checked XSAVES/XRSTORS, but this does not seem to be an issue
>>>>>> coming from there.
>>>>> 
>>>>> What do you mean "shadow stack pointer corruption"?  Is SSP itself
>>>>> corrupt while running in the kernel?  Is one of the MSRs getting
>>>>> corrupted?  Is the memory to which the shadow stack points getting
>>>>> corrupted? Is the CPU rejecting an attempt to change SSP?
>>>> 
>>>> What I see is, a new thread after ret_from_fork() and iret back to ring-3,
>>>> its shadow stack pointer (MSR_IA32_PL3_SSP) is corrupted.
>>> 
>>> Does corrupt mean random?  Or is it a valid stack address, just not for
>>> _this_ thread?  Or NULL?  Or is it a kernel address?  Have you tried
>>> tracing *ALL* the WRMSR's and XRSTOR's that write to the MSR?
>> 
>> When a shadow stack address is changed, the address appears to be other task's.
>> I traced all WRMSR's and XRSTOR's.  I also verified there have not been any
>> XRSTORS from a wrong buffer.  When rc6 is tagged, I will re-base, test, and
>> share current patches.
>> 
> 
> We have identified that
> 
> commit 91eeafea1e4b7c95cc4f38af186d7d48fceef89a
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Thu May 21 22:05:28 2020 +0200
> 
>    x86/entry: Switch page fault exception to IDTENTRY_RAW
> 
>    Convert page fault exceptions to IDTENTRY_RAW:
> 
>      - Implement the C entry point with DEFINE_IDTENTRY_RAW
>      - Add the CR2 read into the exception handler
>      - Add the idtentry_enter/exit_cond_rcu() invocations in
>        in the regular page fault handler and in the async PF
>        part.
>      - Emit the ASM stub with DECLARE_IDTENTRY_RAW
>      - Remove the ASM idtentry in 64-bit
>      - Remove the CR2 read from 64-bit
>      - Remove the open coded ASM entry code in 32-bit
>      - Fix up the XEN/PV code
>      - Remove the old prototypes
> 
>    No functional change.
> 
> triggered the shadow stack corruption when the process returned from syscall.
> SSP MSR somehow was changed between setting SSP MSR and IRET.    Could
> there be a page fault between setting SSP MSR and IRET?

Not upstream, because there’s no SSP MSR there.

> 
> -- 
> H.J.

