linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Yao, Yuan" <yuan.yao@intel.com>
To: Dave Hansen <dave.hansen@linux.intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "Bae, Chang Seok" <chang.seok.bae@intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"Hansen, Dave" <dave.hansen@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: RE: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate
Date: Thu, 13 Oct 2022 03:35:20 +0000	[thread overview]
Message-ID: <BYAPR11MB3717EDEF2351C958F2C86EED95259@BYAPR11MB3717.namprd11.prod.outlook.com> (raw)
In-Reply-To: <20221011222425.866137-1-dave.hansen@linux.intel.com>

>-----Original Message-----
>From: Dave Hansen <dave.hansen@linux.intel.com>
>Sent: Wednesday, October 12, 2022 06:24
>To: linux-kernel@vger.kernel.org
>Cc: Bae, Chang Seok <chang.seok.bae@intel.com>; x86@kernel.org; Yao, Yuan <yuan.yao@intel.com>; Hansen, Dave
><dave.hansen@intel.com>; Thomas Gleixner <tglx@linutronix.de>
>Subject: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate
>
>From: Yuan Yao <yuan.yao@intel.com>
>
>This was found a couple of months ago in a big old AMX
>backport.  But, it looks to be a problem in mainline too.
>Please let me know if this looks OK.  I'd also especially
>appreciate some testing from folks that have AMX hardware
>handy.
>
>Builds and survives a quick boot test on non-AMX hardware.
>
>--
>
>== Background ==
>
>'init_fpstate' is a sort of template for all of the fpstates
>that come after it.  It is copied when new processes are
>execve()'d or XRSTOR'd to get fpregs into a known state.
>
>That means that it represents the *starting* state for a
>process's fpstate which includes only the 'default' features.
>Dynamic features can, of course, be acquired later, but
>processes start with only default_features.
>
>During boot the kernel decides whether all fpstates will be
>kept in the compacted or uncompacted format.  This choice is
>communicated to the hardware via the XCOMP_BV field in all
>XSAVE buffers, including 'init_fpstate'.
>
>== Problem ==
>
>But, the existing XCOMP_BV calculation is incorrect.  It uses
>the set of 'max_features', not the default features.
>
>As a result, when XRSTOR operates on buffers derived from
>'init_fpstate', it may attempt to "tickle" dynamic features which
>are at offsets for which there is no space allocated in the
>fpstate.
>
>== Scope ==
>
>This normally results in a relatively harmless out-of-bounds
>memory read.  It's harmless because it never gets consumed.  But,
>if the fpstate is next to some unmapped memory, this "tickle" can
>cause a page fault and an oops.
>
>This only causes issues on systems when dynamic features are
>available and when an XSAVE buffer is next to uninitialized
>memory.  In other words, it only affects recent Intel server
>CPUs, and in relatively few memory locations.
>
>Those two things are why it took relatively long to catch this.
>
>== Solution ==
>
>Use 'default_features' to establish the init_fpstate
>xcomp_bv value.  Reset individual fpstate xcomp_bv values
>when the rest of the fpstate is reset.
>
>[ dhansen: add reset code from tglx, rewrites
>	   commit message and comments ]
>
>Fixes: 1c253ff2287f ("x86/fpu: Move xstate feature masks to fpu_*_cfg")
>Suggested-by: Dave Hansen <dave.hansen@intel.com>
>Suggested-by: Thomas Gleixner <tglx@linutronix.de>
>Signed-off-by: Yuan Yao <yuan.yao@intel.com>
>Cc: stable@vger.kernel.org
>---
> arch/x86/kernel/fpu/core.c   | 3 +++
> arch/x86/kernel/fpu/xstate.c | 7 ++++++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>index 3b28c5b25e12..4d64de34da12 100644
>--- a/arch/x86/kernel/fpu/core.c
>+++ b/arch/x86/kernel/fpu/core.c
>@@ -526,6 +526,9 @@ static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
> 	fpstate->xfeatures	= fpu_kernel_cfg.default_features;
> 	fpstate->user_xfeatures	= fpu_user_cfg.default_features;
> 	fpstate->xfd		= xfd;
>+
>+	/* Ensure that xcomp_bv matches ->xfeatures */
>+	xstate_init_xcomp_bv(&fpstate->regs.xsave, fpstate->xfeatures);
> }
>
> void fpstate_reset(struct fpu *fpu)
>diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>index c8340156bfd2..f9f45610c72f 100644
>--- a/arch/x86/kernel/fpu/xstate.c
>+++ b/arch/x86/kernel/fpu/xstate.c
>@@ -360,7 +360,12 @@ static void __init setup_init_fpu_buf(void)
>
> 	print_xstate_features();
>
>-	xstate_init_xcomp_bv(&init_fpstate.regs.xsave, fpu_kernel_cfg.max_features);
>+	/*
>+	 * 'init_fpstate' is sized for the default feature
>+	 * set without any of the dynamic features.
>+	 */
>+	xstate_init_xcomp_bv(&init_fpstate.regs.xsave,
>+			     fpu_kernel_cfg.default_features);

Below trace is observed on host kernel with this patch applied when create VM.

The reason is __copy_xstate_to_uabi_buf() copies data from &init_fpstate when the component
is not existed in the source kernel fpstate (here is the AMX tile component), but the
AMX TILE bit is removed from init_fpstate due to this patch, so the WARN is triggered and return
NULL which causes kernel NULL pointer dereference later.

I considered 2 possible ways to fix this without big change:
1. For such non-exist component, we can fill the buffer in target fpstate to 0 and don't WARN if the bit is not set
    in init_fpstate.
2. Enlarge the init_fpstate to content the dynamic components and still keel fpu_kernel_cfg.max_features
    in init_fpstate (but still remove the dynamic bit from new cloned thread), so we can use it safely in above case,
    but near 2 pages (8K) is wasted.

[  100.989998] ------------[ cut here ]------------
[  100.995332] WARNING: CPU: 109 PID: 2910 at arch/x86/kernel/fpu/xstate.c:944 __raw_xsave_addr+0x49/0x50
[  101.005948] Modules linked in: kvm_intel kvm x86_pkg_temp_thermal snd_pcm input_leds snd_timer led_class joydev mac_hid irqbypass efi_pstore snd soundcore button sch_fq_codel ip_tables x_tables ixgbe mdio mdio_devres libphy igc [last unloaded: kvm]
[  101.030924] CPU: 109 PID: 2910 Comm: CPU 0/KVM Tainted: G          I        6.0.0-rc6-kmv-upstream-queue+ #23
[  101.054199] RIP: 0010:__raw_xsave_addr+0x49/0x50
[  101.059511] Code: 85 15 b3 44 39 01 74 1c 0f 1f 44 00 00 48 0f a3 f7 73 17 e8 d9 fe ff ff 89 c0 48 01 d8 5b 5d c3 cc cc cc cc 0f 0b 31 c0 eb f3 <0f> 0b 31 c0 eb ed 90 0f 1f 44 00 00 55 49 89 d0 48 89 f1 48 89 e5
[  101.080865] RSP: 0018:ffa000000b8cbc30 EFLAGS: 00010206
[  101.086870] RAX: 0000000000002000 RBX: ffffffff823be3c0 RCX: 0000000000000012
[  101.095038] RDX: 0000000000040000 RSI: 0000000000000012 RDI: 80000000000206e7
[  101.103177] RBP: ffa000000b8cbc38 R08: ffffffff823bf400 R09: 0000000000000001
[  101.111350] R10: ffa000000b8cbc70 R11: ff110008ef3f4008 R12: ff110008ef3f4b00
[  101.119500] R13: 0000000000002000 R14: 0000000000000012 R15: 0000000000000012
[  101.127678] FS:  00007fca15639700(0000) GS:ff1100104fd40000(0000) knlGS:0000000000000000
[  101.136922] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.143512] CR2: 00007fca202a9001 CR3: 00000008f49a8005 CR4: 0000000000773ee0
[  101.151658] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  101.159817] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  101.167961] PKRU: 55555554
[  101.171110] Call Trace:
[  101.173972]  <TASK>
[  101.176435]  __copy_xstate_to_uabi_buf+0x33b/0x870
[  101.181936]  fpu_copy_guest_fpstate_to_uabi+0x28/0x80
[  101.187720]  kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
[  101.196409]  ? __this_cpu_preempt_check+0x13/0x20
[  101.204924]  ? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
[  101.213531]  kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[  101.221507]  ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[  101.229661]  ? __fget_light+0xd4/0x130
[  101.237011]  __x64_sys_ioctl+0xe3/0x910
[  101.244395]  ? debug_smp_processor_id+0x17/0x20
[  101.252463]  ? fpregs_assert_state_consistent+0x27/0x50
[  101.261228]  do_syscall_64+0x3f/0x90
[  101.268090]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  101.276574] RIP: 0033:0x7fca1b8f362b
[  101.283312] Code: 0f 1e fa 48 8b 05 5d b8 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d b8 2c 00 f7 d8 64 89 01 48
[  101.312528] RSP: 002b:00007fca15638488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  101.323793] RAX: ffffffffffffffda RBX: 000055b3c88ea9c0 RCX: 00007fca1b8f362b
[  101.334583] RDX: 00007fca0c001000 RSI: ffffffff9000aecf RDI: 000000000000000e
[  101.345320] RBP: 00007fca15638580 R08: 000055b3c80d95f0 R09: 000055b3c8820320
[  101.355994] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffeed19156e
[  101.366611] R13: 00007ffeed19156f R14: 00007ffeed191630 R15: 00007fca15638880
[  101.377221]  </TASK>
[  101.382225] ---[ end trace 0000000000000000 ]---
[  101.390004] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  101.400468] #PF: supervisor read access in kernel mode
[  101.408908] #PF: error_code(0x0000) - not-present page
[  101.417319] PGD 88bd3e067 P4D 0
[  101.423541] Oops: 0000 [#1] PREEMPT SMP
[  101.430410] CPU: 109 PID: 2910 Comm: CPU 0/KVM Tainted: G        W I        6.0.0-rc6-kmv-upstream-queue+ #23
[  101.463968] RIP: 0010:memcpy_erms+0x6/0x10
[  101.471389] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 cc cc cc cc 66 90 48 89 f8 48 89 d1 <f3> a4 c3 cc cc cc cc 0f 1f 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[  101.501238] RSP: 0018:ffa000000b8cbc40 EFLAGS: 00010246
[  101.510134] RAX: ff110008ef3f4b00 RBX: 0000000000002000 RCX: 0000000000002000
[  101.521233] RDX: 0000000000002000 RSI: 0000000000000000 RDI: ff110008ef3f4b00
[  101.532327] RBP: ffa000000b8cbce0 R08: ffffffff823bf400 R09: 0000000000000001
[  101.543425] R10: ffa000000b8cbc70 R11: ff110008ef3f4008 R12: ff110008ef3f6b00
[  101.554530] R13: 0000000000000000 R14: 0000000000000012 R15: 0000000000000012
[  101.565652] FS:  00007fca15639700(0000) GS:ff1100104fd40000(0000) knlGS:0000000000000000
[  101.577915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.587555] CR2: 0000000000000000 CR3: 00000008f49a8005 CR4: 0000000000773ee0
[  101.598786] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  101.609993] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  101.621163] PKRU: 55555554
[  101.627358] Call Trace:
[  101.633163]  <TASK>
[  101.638474]  ? __copy_xstate_to_uabi_buf+0x381/0x870
[  101.646977]  fpu_copy_guest_fpstate_to_uabi+0x28/0x80
[  101.655526]  kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
[  101.663832]  ? __this_cpu_preempt_check+0x13/0x20
[  101.671938]  ? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
[  101.680146]  kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[  101.687746]  ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[  101.695513]  ? __fget_light+0xd4/0x130
[  101.702466]  __x64_sys_ioctl+0xe3/0x910
[  101.709507]  ? debug_smp_processor_id+0x17/0x20
[  101.717289]  ? fpregs_assert_state_consistent+0x27/0x50
[  101.725823]  do_syscall_64+0x3f/0x90
[  101.732497]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  101.740842] RIP: 0033:0x7fca1b8f362b
[  101.747555] Code: 0f 1e fa 48 8b 05 5d b8 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d b8 2c 00 f7 d8 64 89 01 48
[  101.776677] RSP: 002b:00007fca15638488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  101.787889] RAX: ffffffffffffffda RBX: 000055b3c88ea9c0 RCX: 00007fca1b8f362b
[  101.798637] RDX: 00007fca0c001000 RSI: ffffffff9000aecf RDI: 000000000000000e
[  101.809379] RBP: 00007fca15638580 R08: 000055b3c80d95f0 R09: 000055b3c8820320
[  101.820038] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffeed19156e
[  101.830642] R13: 00007ffeed19156f R14: 00007ffeed191630 R15: 00007fca15638880
[  101.841228]  </TASK>
[  101.846230] Modules linked in: kvm_intel kvm x86_pkg_temp_thermal snd_pcm input_leds snd_timer led_class joydev mac_hid irqbypass efi_pstore snd soundcore button sch_fq_codel ip_tables x_tables ixgbe mdio mdio_devres libphy igc [last unloaded: kvm]
[  101.878964] CR2: 0000000000000000
[  101.885472] ---[ end trace 0000000000000000 ]---

>
> 	/*
> 	 * Init all the features state with header.xfeatures being 0x0
>--
>2.34.1


  parent reply	other threads:[~2022-10-13  3:35 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-11 22:24 [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate Dave Hansen
2022-10-11 22:47 ` Chang S. Bae
2022-10-13  1:33 ` Yao, Yuan
2022-10-13  1:47 ` Chang S. Bae
2022-10-13  3:35 ` Yao, Yuan [this message]
2022-10-13 16:23   ` Chang S. Bae
2022-10-13 17:21     ` Dave Hansen
2022-10-13 17:33       ` Chang S. Bae
2022-10-13 17:44         ` Dave Hansen
2022-10-14  3:53           ` Chang S. Bae
2022-10-14  4:10             ` Yao, Yuan
2022-10-14  4:26               ` Chang S. Bae
2022-10-14  4:03     ` Yao, Yuan
2022-10-13 18:04   ` Dave Hansen
2022-10-17 22:39 ` Chang S. Bae

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BYAPR11MB3717EDEF2351C958F2C86EED95259@BYAPR11MB3717.namprd11.prod.outlook.com \
    --to=yuan.yao@intel.com \
    --cc=chang.seok.bae@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).