All of lore.kernel.org
 help / color / mirror / Atom feed
* XSAVE / RDPKRU on Intel 11th Gen Core CPUs
@ 2021-11-08 17:37 Brian Geffon
  2021-11-08 19:37 ` Dave Hansen
  0 siblings, 1 reply; 12+ messages in thread
From: Brian Geffon @ 2021-11-08 17:37 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Guenter Roeck, Borislav Petkov, Andy Lutomirski, stable

Hi Thomas,

We (ChromeOS) have run into an issue which we believe is related to
the following errata on 11th Gen Intel Core CPUs:

"TGL034 A SYSENTER FOLLOWING AN XSAVE OR A VZEROALL MAY LEAD TO
UNEXPECTED SYSTEM BEHAVIOR" [1]

Essentially we notice that the value returned by a RDPKRU instruction
will flip after some amount of time when running on kernels earlier
than 5.14. I have a simple repro that can be used [2].

After a little digging it appears a lot of work was done to refactor
that code and I bisected to the following commit which fixes the
issue:

  commit 954436989cc550dd91aab98363240c9c0a4b7e23
  Author: Thomas Gleixner <tglx@linutronix.de>
  Date:   Wed Jun 23 14:02:21 2021 +0200

    x86/fpu: Remove PKRU handling from switch_fpu_finish()

I backported this patch to 5.4 and it does appear to fix the issue
because it avoids XSAVE. However, I have no idea if it's actually
fixing anything or if the behavior is working as intended. So we're
curious, does it make sense to pull back that patch, would that patch
be enough? Any guidance here would be appreciated because this does
seem broken (because of how it was previously implemented) for those
CPUs prior to 5.14, which is why I'm CCing stable@.

Thanks in advance,
Brian

1. https://cdrdv2.intel.com/v1/dl/getContent/631123?explicitVersion=true
2. https://gist.github.com/bgaff/9f8cbfc8dd22e60f9492e4f0aff8f04f

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-08 17:37 XSAVE / RDPKRU on Intel 11th Gen Core CPUs Brian Geffon
@ 2021-11-08 19:37 ` Dave Hansen
  2021-11-08 22:00   ` Dave Hansen
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2021-11-08 19:37 UTC (permalink / raw)
  To: Brian Geffon, Thomas Gleixner
  Cc: Guenter Roeck, Borislav Petkov, Andy Lutomirski, stable,
	the arch/x86 maintainers, LKML

... adding LKML and x86@

On 11/8/21 9:37 AM, Brian Geffon wrote:
> We (ChromeOS) have run into an issue which we believe is related to
> the following errata on 11th Gen Intel Core CPUs:
> 
> "TGL034 A SYSENTER FOLLOWING AN XSAVE OR A VZEROALL MAY LEAD TO
> UNEXPECTED SYSTEM BEHAVIOR" [1]

I'm struggling to figure out what that has to do with PKRU, though.  I
don't think that erratum is related at all to the issue you're seeing.

> Essentially we notice that the value returned by a RDPKRU instruction
> will flip after some amount of time when running on kernels earlier
> than 5.14. I have a simple repro that can be used [2].

What does it flip to, btw?  Can you dump the whole register state?

> After a little digging it appears a lot of work was done to refactor
> that code and I bisected to the following commit which fixes the
> issue:
> 
>   commit 954436989cc550dd91aab98363240c9c0a4b7e23
>   Author: Thomas Gleixner <tglx@linutronix.de>
>   Date:   Wed Jun 23 14:02:21 2021 +0200
> 
>     x86/fpu: Remove PKRU handling from switch_fpu_finish()
> 
> I backported this patch to 5.4 and it does appear to fix the issue
> because it avoids XSAVE. However, I have no idea if it's actually
> fixing anything or if the behavior is working as intended. So we're
> curious, does it make sense to pull back that patch, would that patch
> be enough? Any guidance here would be appreciated because this does
> seem broken (because of how it was previously implemented) for those
> CPUs prior to 5.14, which is why I'm CCing stable@.

I suspect what you're seeing is that the:

-       __write_pkru(pkru_val);

in that commit was somehow writing a bad value which was read out of the
XSAVE buffer.  That commit stops reading PKRU out of the XSAVE buffer,
which probably has bad state.  Just backporting this patch won't do you
any good.  You'll need to also backport the stuff that stops using the
XSAVE buffer for PKRU in the first place.

The code doesn't bite you until the task context switches.  It probably
has to switch to some pkey-using task and then back to your test app.
I'd randomly guess that your test app is getting a "leaked" PKRU from
another app.  It's _probably_ not a stale PKRU value (like from reading
a PKRU!=0 value from the XSAVE buffer when XSTATE_BV[PKRU]=0) because
your test app should have PKRU=0 set at all times.

Is KVM active on your test system?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-08 19:37 ` Dave Hansen
@ 2021-11-08 22:00   ` Dave Hansen
  2021-11-08 23:20     ` Brian Geffon
  2021-11-09  1:47     ` Brian Geffon
  0 siblings, 2 replies; 12+ messages in thread
From: Dave Hansen @ 2021-11-08 22:00 UTC (permalink / raw)
  To: Brian Geffon, Thomas Gleixner
  Cc: Guenter Roeck, Borislav Petkov, Andy Lutomirski, stable,
	the arch/x86 maintainers, LKML

One more thing...  Does the protection_keys kernel selftest hit any
errors on this same setup?  It does a lot of PKRU sanity checking and
I'm a bit surprised it hasn't caught something yet.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-08 22:00   ` Dave Hansen
@ 2021-11-08 23:20     ` Brian Geffon
  2021-11-09  1:47     ` Brian Geffon
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Geffon @ 2021-11-08 23:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Guenter Roeck, Borislav Petkov, Andy Lutomirski,
	stable, the arch/x86 maintainers, LKML

> One more thing...  Does the protection_keys kernel selftest hit any
> errors on this same setup?  It does a lot of PKRU sanity checking and
> I'm a bit surprised it hasn't caught something yet.

I'll do a little more testing and get back to you.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-08 22:00   ` Dave Hansen
  2021-11-08 23:20     ` Brian Geffon
@ 2021-11-09  1:47     ` Brian Geffon
  2021-11-09  6:49       ` Dave Hansen
  1 sibling, 1 reply; 12+ messages in thread
From: Brian Geffon @ 2021-11-09  1:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Guenter Roeck, Borislav Petkov, Andy Lutomirski,
	stable, the arch/x86 maintainers, LKML

> One more thing...  Does the protection_keys kernel selftest hit any
> errors on this same setup?  It does a lot of PKRU sanity checking and
> I'm a bit surprised it hasn't caught something yet.

Hi Dave,

This issue does reproduce with the self tests too, my simple test
program also fails consistently [1], all it does is spin executing
RDPKRU waiting for a context switch to clobber the value.

$ ./test
unexpected value on iteration 3772082 value:0x55555554 expected:0x55555550

==========================
self tests:

$ ./protection_keys_64
has pkeys: 1
startup pkey_reg: 0000000055555550
WARNING: not run as root, can not do hugetlb test
test 0 PASSED (iteration 1)
test 1 PASSED (iteration 1)
test 2 PASSED (iteration 1)
test 3 PASSED (iteration 1)
test 4 PASSED (iteration 1)
test 5 PASSED (iteration 1)
protection_keys_64: pkey-helpers.h:127: _read_pkey_reg: Assertion
`pkey_reg == shadow_pkey_reg' failed.
Aborted (core dumped)

$ uname -a
Linux localhost 5.13.0-17189-g62fb9874f5da #12 SMP PREEMPT Tue Nov 9
01:29:44 UTC 2021 x86_64 11th Gen Intel(R) Core(TM) i5-1135G7 @
2.40GHz GenuineIntel GNU/Linux

Let me know if I can provide anything else, I'm happy to help troubleshoot this.

Thanks,
Brian

1. https://gist.github.com/bgaff/e4b5457ab1cf5126fea6327666c63441

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-09  1:47     ` Brian Geffon
@ 2021-11-09  6:49       ` Dave Hansen
  2021-11-09 13:43         ` Brian Geffon
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2021-11-09  6:49 UTC (permalink / raw)
  To: Brian Geffon
  Cc: Thomas Gleixner, Guenter Roeck, Borislav Petkov, Andy Lutomirski,
	stable, the arch/x86 maintainers, LKML

On 11/8/21 5:47 PM, Brian Geffon wrote:
>> One more thing...  Does the protection_keys kernel selftest hit any
>> errors on this same setup?  It does a lot of PKRU sanity checking and
>> I'm a bit surprised it hasn't caught something yet.
> Hi Dave,
> 
> This issue does reproduce with the self tests too, my simple test
> program also fails consistently [1], all it does is spin executing
> RDPKRU waiting for a context switch to clobber the value.
> 
> $ ./test
> unexpected value on iteration 3772082 value:0x55555554 expected:0x55555550

Well, gosh, it's making it back to the software init value.  If you do:

	echo 0x15555554 > /sys/kernel/debug/x86/init_pkru

do you end up with 0x15555554 as the value?

The other thing you can try is to turn on all the
/sys/kernel/debug/tracing/events/x86_fpu tracepoints, pin your test
program to one CPU, then dump the trace buffer out for that CPU.  That
probably won't tell us too much for PKRU since it's generally not ever
in its init state.  But, another test would be to use XRSTOR to *get* it
into its init state then see how long it stays there.

Another thing you could do is figure out if pkeys _ever_ worked on that
hardware.  If so, a (long) bisect could figure out what broke it between
its introduction and the 5.13 kernel that you've been testing.
A random 5.11-based distro kernel that I have running on a Cascade Lake
Xeon doesn't seem to have any issues.

Does your system have any more XSAVE support than mine?

> kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
> kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
> kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
> kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
> kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> kernel: [    0.000000] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
> kernel: [    0.000000] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
> kernel: [    0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
> kernel: [    0.000000] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
> kernel: [    0.000000] x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-09  6:49       ` Dave Hansen
@ 2021-11-09 13:43         ` Brian Geffon
  2021-11-09 14:14           ` Brian Geffon
  2021-11-09 14:57           ` Andy Lutomirski
  0 siblings, 2 replies; 12+ messages in thread
From: Brian Geffon @ 2021-11-09 13:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Guenter Roeck, Borislav Petkov, Andy Lutomirski,
	stable, the arch/x86 maintainers, LKML

Hi Dave,

On Tue, Nov 9, 2021 at 1:49 AM Dave Hansen <dave.hansen@intel.com> wrote:
> Well, gosh, it's making it back to the software init value.  If you do:
>
>         echo 0x15555554 > /sys/kernel/debug/x86/init_pkru
>
> do you end up with 0x15555554 as the value?

What's interesting is that writing to init_pkru fails with -EINVAL for me,
and I've traced it down to get_xsave_addr() returning NULL on the following
check:

  /*
  * This assumes the last 'xsave*' instruction to
  * have requested that 'xfeature_nr' be saved.
  * If it did not, we might be seeing and old value
  * of the field in the buffer.
  *
  * This can happen because the last 'xsave' did not
  * request that this feature be saved (unlikely)
  * or because the "init optimization" caused it
  * to not be saved.
  */
  if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
     return NULL;

And that's why I thought this was possibly related to that erratum
that I shared before.

> Does your system have any more XSAVE support than mine?
>
> > kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> > kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> > kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> > kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
> > kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
> > kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
> > kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
> > kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> > kernel: [    0.000000] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
> > kernel: [    0.000000] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
> > kernel: [    0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
> > kernel: [    0.000000] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
> > kernel: [    0.000000] x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.

No, it's pretty much identical

INFO kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87
floating point registers'
INFO kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE
registers'
INFO kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX
registers'
INFO kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x020:
'AVX-512 opmask'
INFO kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x040:
'AVX-512 Hi256'
INFO kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x080:
'AVX-512 ZMM_Hi256'
INFO kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x200:
'Protection Keys User registers'
INFO kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
INFO kernel: [ 0.000000] x86/fpu: xstate_offset[5]: 832, xstate_sizes[5]: 64
INFO kernel: [ 0.000000] x86/fpu: xstate_offset[6]: 896, xstate_sizes[6]: 512
INFO kernel: [ 0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
INFO kernel: [ 0.000000] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]: 8
INFO kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x2e7,
context size is 2440 bytes, using 'compacted' format.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-09 13:43         ` Brian Geffon
@ 2021-11-09 14:14           ` Brian Geffon
  2021-11-09 14:57           ` Andy Lutomirski
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Geffon @ 2021-11-09 14:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Guenter Roeck, Borislav Petkov, Andy Lutomirski,
	stable, the arch/x86 maintainers, LKML

On Tue, Nov 9, 2021 at 8:43 AM Brian Geffon <bgeffon@google.com> wrote:

> What's interesting is that writing to init_pkru fails with -EINVAL for me,
> and I've traced it down to get_xsave_addr() returning NULL on the following
> check:
>
>   /*
>   * This assumes the last 'xsave*' instruction to
>   * have requested that 'xfeature_nr' be saved.
>   * If it did not, we might be seeing and old value
>   * of the field in the buffer.
>   *
>   * This can happen because the last 'xsave' did not
>   * request that this feature be saved (unlikely)
>   * or because the "init optimization" caused it
>   * to not be saved.
>   */
>   if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
>      return NULL;
>

Sorry, I should have probably also shared the value of xfeatures at
this point is 0x3: which appears to be: (X86_FEATURE_FPU |
X86_FEATURE_XMM)

Brian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-09 13:43         ` Brian Geffon
  2021-11-09 14:14           ` Brian Geffon
@ 2021-11-09 14:57           ` Andy Lutomirski
  2021-11-09 18:58             ` Brian Geffon
  1 sibling, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2021-11-09 14:57 UTC (permalink / raw)
  To: Brian Geffon, Dave Hansen
  Cc: Thomas Gleixner, Guenter Roeck, Borislav Petkov, stable,
	the arch/x86 maintainers, Linux Kernel Mailing List


On Tue, Nov 9, 2021, at 5:43 AM, Brian Geffon wrote:
> Hi Dave,
>
> On Tue, Nov 9, 2021 at 1:49 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> Well, gosh, it's making it back to the software init value.  If you do:
>>
>>         echo 0x15555554 > /sys/kernel/debug/x86/init_pkru
>>
>> do you end up with 0x15555554 as the value?
>
> What's interesting is that writing to init_pkru fails with -EINVAL for me,
> and I've traced it down to get_xsave_addr() returning NULL on the following
> check:
>
>   /*
>   * This assumes the last 'xsave*' instruction to
>   * have requested that 'xfeature_nr' be saved.
>   * If it did not, we might be seeing and old value
>   * of the field in the buffer.
>   *
>   * This can happen because the last 'xsave' did not
>   * request that this feature be saved (unlikely)
>   * or because the "init optimization" caused it
>   * to not be saved.
>   */
>   if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
>      return NULL;

Here's an excerpt from an old email that I, perhaps unwisely, sent to Dave but not to a public list:

static inline void write_pkru(u32 pkru)
{
        struct pkru_state *pk;

        if (!boot_cpu_has(X86_FEATURE_OSPKE))
                return;

        pk = get_xsave_addr(&current->thread.fpu.state.xsave,
XFEATURE_PKRU);

        /*
         * The PKRU value in xstate needs to be in sync with the value
that is
         * written to the CPU. The FPU restore on return to userland would
         * otherwise load the previous value again.
         */
        fpregs_lock();
        if (pk)
                pk->pkru = pkru;

^^^
else we just write to the PKRU register but leave XINUSE[PKRU] clear on
return to usermode?  That seems... unwise.

        __write_pkru(pkru);
        fpregs_unlock();
}

I bet you're hitting exactly this bug.  The fix ended up being a whole series of patches, but the gist of it is that the write_pkru() slow path needs to set the xfeature bit in the xsave buffer and then do the write.  It should be possible to make a little patch to do just this in a couple lines of code.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-09 14:57           ` Andy Lutomirski
@ 2021-11-09 18:58             ` Brian Geffon
  2021-11-09 19:25               ` Brian Geffon
  2021-11-09 19:29               ` Dave Hansen
  0 siblings, 2 replies; 12+ messages in thread
From: Brian Geffon @ 2021-11-09 18:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Gleixner, Guenter Roeck, Borislav Petkov,
	stable, the arch/x86 maintainers, Linux Kernel Mailing List

Hi Andy,

On Tue, Nov 9, 2021 at 9:58 AM Andy Lutomirski <luto@kernel.org> wrote:
> Here's an excerpt from an old email that I, perhaps unwisely, sent to Dave but not to a public list:
>
> static inline void write_pkru(u32 pkru)
> {
>         struct pkru_state *pk;
>
>         if (!boot_cpu_has(X86_FEATURE_OSPKE))
>                 return;
>
>         pk = get_xsave_addr(&current->thread.fpu.state.xsave,
> XFEATURE_PKRU);
>
>         /*
>          * The PKRU value in xstate needs to be in sync with the value
> that is
>          * written to the CPU. The FPU restore on return to userland would
>          * otherwise load the previous value again.
>          */
>         fpregs_lock();
>         if (pk)
>                 pk->pkru = pkru;
>
> ^^^
> else we just write to the PKRU register but leave XINUSE[PKRU] clear on
> return to usermode?  That seems... unwise.
>
>         __write_pkru(pkru);
>         fpregs_unlock();
> }
>
> I bet you're hitting exactly this bug.  The fix ended up being a whole series of patches, but the gist of it is that the write_pkru() slow path needs to set the xfeature bit in the xsave buffer and then do the write.  It should be possible to make a little patch to do just this in a couple lines of code.

I think you've got the right idea, the following patch does seem to
fix the problem on this CPU, this is based on 5.13. It seems the
changes to asm/pgtable.h were not enough, I also had to modify
fpu/internal.h to get it working properly.

From e5e184d68ac6ca93c3cd2cc88d61af3260d1c014 Mon Sep 17 00:00:00 2001
From: Brian Geffon <bgeffon@google.com>
Date: Tue, 9 Nov 2021 17:08:25 +0000
Subject: [PATCH] x86/fpu: Set XFEATURE_PKRU when writing to pkru

On kernels prior to 5.14 the write_pkru path could
end up writing to the pkru register without updating
the corresponding state.

Signed-off-by: Brian Geffon <bgeffon@google.com>
---
 arch/x86/include/asm/fpu/internal.h | 22 ++++++++++------------
 arch/x86/include/asm/pgtable.h      |  5 +++--
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h
b/arch/x86/include/asm/fpu/internal.h
index 16bf4d4a8159..ed2ce7d1afeb 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -564,18 +564,16 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
         * PKRU state is switched eagerly because it needs to be valid before we
         * return to userland e.g. for a copy_to_user() operation.
         */
-       if (!(current->flags & PF_KTHREAD)) {
-               /*
-                * If the PKRU bit in xsave.header.xfeatures is not set,
-                * then the PKRU component was in init state, which means
-                * XRSTOR will set PKRU to 0. If the bit is not set then
-                * get_xsave_addr() will return NULL because the PKRU value
-                * in memory is not valid. This means pkru_val has to be
-                * set to 0 and not to init_pkru_value.
-                */
-               pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
-               pkru_val = pk ? pk->pkru : 0;
-       }
+       /*
+        * If the PKRU bit in xsave.header.xfeatures is not set,
+        * then the PKRU component was in init state, which means
+        * XRSTOR will set PKRU to 0. If the bit is not set then
+        * get_xsave_addr() will return NULL because the PKRU value
+        * in memory is not valid. This means pkru_val has to be
+        * set to 0 and not to init_pkru_value.
+        */
+       pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
+       pkru_val = pk ? pk->pkru : 0;
        __write_pkru(pkru_val);
 }

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..d00fc2df4cfe 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -137,18 +137,19 @@ static inline u32 read_pkru(void)
 static inline void write_pkru(u32 pkru)
 {
        struct pkru_state *pk;
+       struct fpu *fpu = &current->thread.fpu;

        if (!boot_cpu_has(X86_FEATURE_OSPKE))
                return;

-       pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
-
        /*
         * The PKRU value in xstate needs to be in sync with the value that is
         * written to the CPU. The FPU restore on return to userland would
         * otherwise load the previous value again.
         */
+       fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
        fpregs_lock();
+       pk = get_xsave_addr(&fpu->state.xsave, XFEATURE_PKRU);
        if (pk)
                pk->pkru = pkru;
        __write_pkru(pkru);
-- 
2.34.0.rc0.344.g81b53c2807-goog

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-09 18:58             ` Brian Geffon
@ 2021-11-09 19:25               ` Brian Geffon
  2021-11-09 19:29               ` Dave Hansen
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Geffon @ 2021-11-09 19:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Gleixner, Guenter Roeck, Borislav Petkov,
	stable, the arch/x86 maintainers, Linux Kernel Mailing List

On Tue, Nov 9, 2021 at 1:58 PM Brian Geffon <bgeffon@google.com> wrote:
>
> Hi Andy,
>
> On Tue, Nov 9, 2021 at 9:58 AM Andy Lutomirski <luto@kernel.org> wrote:
> > Here's an excerpt from an old email that I, perhaps unwisely, sent to Dave but not to a public list:
> >
> > static inline void write_pkru(u32 pkru)
> > {
> >         struct pkru_state *pk;
> >
> >         if (!boot_cpu_has(X86_FEATURE_OSPKE))
> >                 return;
> >
> >         pk = get_xsave_addr(&current->thread.fpu.state.xsave,
> > XFEATURE_PKRU);
> >
> >         /*
> >          * The PKRU value in xstate needs to be in sync with the value
> > that is
> >          * written to the CPU. The FPU restore on return to userland would
> >          * otherwise load the previous value again.
> >          */
> >         fpregs_lock();
> >         if (pk)
> >                 pk->pkru = pkru;
> >
> > ^^^
> > else we just write to the PKRU register but leave XINUSE[PKRU] clear on
> > return to usermode?  That seems... unwise.
> >
> >         __write_pkru(pkru);
> >         fpregs_unlock();
> > }
> >
> > I bet you're hitting exactly this bug.  The fix ended up being a whole series of patches, but the gist of it is that the write_pkru() slow path needs to set the xfeature bit in the xsave buffer and then do the write.  It should be possible to make a little patch to do just this in a couple lines of code.
>
> I think you've got the right idea, the following patch does seem to
> fix the problem on this CPU, this is based on 5.13. It seems the
> changes to asm/pgtable.h were not enough, I also had to modify
> fpu/internal.h to get it working properly.
>

Actually, it seems that only the changes to fpu/internal.h seem
necessary. I guess the switch_fpu_finish explains how it's reverting
to the initial value.

diff --git a/arch/x86/include/asm/fpu/internal.h
b/arch/x86/include/asm/fpu/internal.h
index 16bf4d4a8159..ed2ce7d1afeb 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -564,18 +564,16 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
         * PKRU state is switched eagerly because it needs to be valid before we
         * return to userland e.g. for a copy_to_user() operation.
         */
-       if (!(current->flags & PF_KTHREAD)) {
-               /*
-                * If the PKRU bit in xsave.header.xfeatures is not set,
-                * then the PKRU component was in init state, which means
-                * XRSTOR will set PKRU to 0. If the bit is not set then
-                * get_xsave_addr() will return NULL because the PKRU value
-                * in memory is not valid. This means pkru_val has to be
-                * set to 0 and not to init_pkru_value.
-                */
-               pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
-               pkru_val = pk ? pk->pkru : 0;
-       }
+       /*
+        * If the PKRU bit in xsave.header.xfeatures is not set,
+        * then the PKRU component was in init state, which means
+        * XRSTOR will set PKRU to 0. If the bit is not set then
+        * get_xsave_addr() will return NULL because the PKRU value
+        * in memory is not valid. This means pkru_val has to be
+        * set to 0 and not to init_pkru_value.
+        */
+       pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
+       pkru_val = pk ? pk->pkru : 0;
        __write_pkru(pkru_val);
 }

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: XSAVE / RDPKRU on Intel 11th Gen Core CPUs
  2021-11-09 18:58             ` Brian Geffon
  2021-11-09 19:25               ` Brian Geffon
@ 2021-11-09 19:29               ` Dave Hansen
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2021-11-09 19:29 UTC (permalink / raw)
  To: Brian Geffon, Andy Lutomirski
  Cc: Thomas Gleixner, Guenter Roeck, Borislav Petkov, stable,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 11/9/21 10:58 AM, Brian Geffon wrote:
> Hi Andy,
> 
> On Tue, Nov 9, 2021 at 9:58 AM Andy Lutomirski <luto@kernel.org> wrote:
>> Here's an excerpt from an old email that I, perhaps unwisely, sent to Dave but not to a public list:
>>
>> static inline void write_pkru(u32 pkru)
>> {
>>         struct pkru_state *pk;
>>
>>         if (!boot_cpu_has(X86_FEATURE_OSPKE))
>>                 return;
>>
>>         pk = get_xsave_addr(&current->thread.fpu.state.xsave,
>> XFEATURE_PKRU);
>>
>>         /*
>>          * The PKRU value in xstate needs to be in sync with the value
>> that is
>>          * written to the CPU. The FPU restore on return to userland would
>>          * otherwise load the previous value again.
>>          */
>>         fpregs_lock();
>>         if (pk)
>>                 pk->pkru = pkru;
>>
>> ^^^
>> else we just write to the PKRU register but leave XINUSE[PKRU] clear on
>> return to usermode?  That seems... unwise.
>>
>>         __write_pkru(pkru);
>>         fpregs_unlock();
>> }
>>
>> I bet you're hitting exactly this bug.  The fix ended up being a whole series of patches, but the gist of it is that the write_pkru() slow path needs to set the xfeature bit in the xsave buffer and then do the write.  It should be possible to make a little patch to do just this in a couple lines of code.
> 
> I think you've got the right idea, the following patch does seem to
> fix the problem on this CPU, this is based on 5.13. It seems the
> changes to asm/pgtable.h were not enough, I also had to modify
> fpu/internal.h to get it working properly.
> 
>>From e5e184d68ac6ca93c3cd2cc88d61af3260d1c014 Mon Sep 17 00:00:00 2001
> From: Brian Geffon <bgeffon@google.com>
> Date: Tue, 9 Nov 2021 17:08:25 +0000
> Subject: [PATCH] x86/fpu: Set XFEATURE_PKRU when writing to pkru
> 
> On kernels prior to 5.14 the write_pkru path could
> end up writing to the pkru register without updating
> the corresponding state.
> 
> Signed-off-by: Brian Geffon <bgeffon@google.com>
> ---
>  arch/x86/include/asm/fpu/internal.h | 22 ++++++++++------------
>  arch/x86/include/asm/pgtable.h      |  5 +++--
>  2 files changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/fpu/internal.h
> b/arch/x86/include/asm/fpu/internal.h
> index 16bf4d4a8159..ed2ce7d1afeb 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -564,18 +564,16 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>          * PKRU state is switched eagerly because it needs to be valid before we
>          * return to userland e.g. for a copy_to_user() operation.
>          */
> -       if (!(current->flags & PF_KTHREAD)) {
> -               /*
> -                * If the PKRU bit in xsave.header.xfeatures is not set,
> -                * then the PKRU component was in init state, which means
> -                * XRSTOR will set PKRU to 0. If the bit is not set then
> -                * get_xsave_addr() will return NULL because the PKRU value
> -                * in memory is not valid. This means pkru_val has to be
> -                * set to 0 and not to init_pkru_value.
> -                */
> -               pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> -               pkru_val = pk ? pk->pkru : 0;
> -       }
> +       /*
> +        * If the PKRU bit in xsave.header.xfeatures is not set,
> +        * then the PKRU component was in init state, which means
> +        * XRSTOR will set PKRU to 0. If the bit is not set then
> +        * get_xsave_addr() will return NULL because the PKRU value
> +        * in memory is not valid. This means pkru_val has to be
> +        * set to 0 and not to init_pkru_value.
> +        */
> +       pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> +       pkru_val = pk ? pk->pkru : 0;
>         __write_pkru(pkru_val);
>  }

This hunk doesn't make any sense to me.  new_fpu should be for
'current', and if 'current' is a PF_KTHREAD, it shouldn't be using PKRU.

Why does this hunk matter?

> index b1099f2d9800..d00fc2df4cfe 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -137,18 +137,19 @@ static inline u32 read_pkru(void)
>  static inline void write_pkru(u32 pkru)
>  {
>         struct pkru_state *pk;
> +       struct fpu *fpu = &current->thread.fpu;
> 
>         if (!boot_cpu_has(X86_FEATURE_OSPKE))
>                 return;
> 
> -       pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> -
>         /*
>          * The PKRU value in xstate needs to be in sync with the value that is
>          * written to the CPU. The FPU restore on return to userland would
>          * otherwise load the previous value again.
>          */
> +       fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;

This needs to be inside fpregs_lock().  This task can get preempted at
any time and the xfeatures bit is not stable.

>         fpregs_lock();
> +       pk = get_xsave_addr(&fpu->state.xsave, XFEATURE_PKRU);
>         if (pk)
>                 pk->pkru = pkru;
>         __write_pkru(pkru);

I still don't think this quite matches up with the symptoms that you are
seeing.  This fix would ensure that we don't forget to update the XSAVE
buffer when it is in the init state.  But, if we did that, we would see
PKRU *going* to the init state: all 0's.  What you were seeing instead
was it going from 0x55555550 to 0x55555554, not 0x0.

I don't doubt that this makes the symptoms away, I just don't think this
really explains what's going on thoroughly enough.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-11-09 19:29 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-08 17:37 XSAVE / RDPKRU on Intel 11th Gen Core CPUs Brian Geffon
2021-11-08 19:37 ` Dave Hansen
2021-11-08 22:00   ` Dave Hansen
2021-11-08 23:20     ` Brian Geffon
2021-11-09  1:47     ` Brian Geffon
2021-11-09  6:49       ` Dave Hansen
2021-11-09 13:43         ` Brian Geffon
2021-11-09 14:14           ` Brian Geffon
2021-11-09 14:57           ` Andy Lutomirski
2021-11-09 18:58             ` Brian Geffon
2021-11-09 19:25               ` Brian Geffon
2021-11-09 19:29               ` Dave Hansen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.