[PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
@ 2019-09-06 21:29 Steve Wahl
  2019-09-09  8:14 ` Kirill A. Shutemov
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Wahl @ 2019-09-06 21:29 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Kirill A. Shutemov, Brijesh Singh,
	Steve Wahl, Jordan Borgner, Feng Tang, linux-kernel
  Cc: Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

Our hardware (UV aka Superdome Flex) has address ranges marked
reserved by the BIOS. These ranges can cause the system to halt if
accessed.

During kernel initialization, the processor was speculating into
reserved memory causing system halts.  The processor speculation is
enabled because the reserved memory is being mapped by the kernel.

The page table level2_kernel_pgt is 1 GiB in size, and had all pages
initially marked as valid, and the kernel is placed anywhere in this
range depending on the virtual address selected by KASLR.  Later on in
the boot process, the valid area gets trimmed back to the space
occupied by the kernel.

But during the interval of time when the full 1 GiB space was marked
as valid, if the kernel physical address chosen by KASLR was close
enough to our reserved memory regions, the valid pages outside the
actual kernel space were allowing the processor to issue speculative
accesses to the reserved space, causing the system to halt.

This was encountered somewhat rarely on a normal system boot, and
somewhat more often when starting the crash kernel if
"crashkernel=512M,high" was specified on the command line (because
this heavily restricts the physical address of the crash kernel,
usually to within 1 GiB of our reserved space).

The answer is to invalidate the pages of this table outside the
address range occupied by the kernel before the page table is
activated.  This patch has been validated to fix this problem on our
hardware.

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/head64.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 29ffa495bd1c..31f89a5defa3 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -225,10 +225,15 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	 */

 	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
-	for (i = 0; i < PTRS_PER_PMD; i++) {
+	for (i = 0; i < pmd_index((unsigned long)_text); i++)
+		pmd[i] &= ~_PAGE_PRESENT;
+
+	for (; i <= pmd_index((unsigned long)_end); i++)
 		if (pmd[i] & _PAGE_PRESENT)
 			pmd[i] += load_delta;
-	}
+
+	for (; i < PTRS_PER_PMD; i++)
+		pmd[i] &= ~_PAGE_PRESENT;

 	/*
 	 * Fixup phys_base - remove the memory encryption mask to obtain
-- 
2.21.0

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-06 21:29 [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area Steve Wahl
@ 2019-09-09  8:14 ` Kirill A. Shutemov
  2019-09-10  6:18   ` Ingo Molnar
  2019-09-10 14:28   ` Steve Wahl
  0 siblings, 2 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2019-09-09  8:14 UTC (permalink / raw)
  To: Steve Wahl
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Brijesh Singh, Jordan Borgner, Feng Tang,
	linux-kernel, Baoquan He, russ.anderson, dimitri.sivanich,
	mike.travis

On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> Our hardware (UV aka Superdome Flex) has address ranges marked
> reserved by the BIOS. These ranges can cause the system to halt if
> accessed.
> 
> During kernel initialization, the processor was speculating into
> reserved memory causing system halts.  The processor speculation is
> enabled because the reserved memory is being mapped by the kernel.
> 
> The page table level2_kernel_pgt is 1 GiB in size, and had all pages
> initially marked as valid, and the kernel is placed anywhere in this
> range depending on the virtual address selected by KASLR.  Later on in
> the boot process, the valid area gets trimmed back to the space
> occupied by the kernel.
> 
> But during the interval of time when the full 1 GiB space was marked
> as valid, if the kernel physical address chosen by KASLR was close
> enough to our reserved memory regions, the valid pages outside the
> actual kernel space were allowing the processor to issue speculative
> accesses to the reserved space, causing the system to halt.
> 
> This was encountered somewhat rarely on a normal system boot, and
> somewhat more often when starting the crash kernel if
> "crashkernel=512M,high" was specified on the command line (because
> this heavily restricts the physical address of the crash kernel,
> usually to within 1 GiB of our reserved space).
> 
> The answer is to invalidate the pages of this table outside the
> address range occupied by the kernel before the page table is
> activated.  This patch has been validated to fix this problem on our
> hardware.

If the goal is to avoid *any* mapping of the reserved region to stop
speculation, I don't think this patch will do the job. We still (likely)
have the same memory mapped as part of the identity mapping. And it
happens at least in two places: here and before on decompression stage.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-09  8:14 ` Kirill A. Shutemov
@ 2019-09-10  6:18   ` Ingo Molnar
  2019-09-10 14:38     ` Steve Wahl
  2019-09-10 14:28   ` Steve Wahl
  1 sibling, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2019-09-10  6:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Juergen Gross, Brijesh Singh,
	Jordan Borgner, Feng Tang, linux-kernel, Baoquan He,
	russ.anderson, dimitri.sivanich, mike.travis


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> > Our hardware (UV aka Superdome Flex) has address ranges marked
> > reserved by the BIOS. These ranges can cause the system to halt if
> > accessed.
> > 
> > During kernel initialization, the processor was speculating into
> > reserved memory causing system halts.  The processor speculation is
> > enabled because the reserved memory is being mapped by the kernel.
> > 
> > The page table level2_kernel_pgt is 1 GiB in size, and had all pages
> > initially marked as valid, and the kernel is placed anywhere in this
> > range depending on the virtual address selected by KASLR.  Later on in
> > the boot process, the valid area gets trimmed back to the space
> > occupied by the kernel.
> > 
> > But during the interval of time when the full 1 GiB space was marked
> > as valid, if the kernel physical address chosen by KASLR was close
> > enough to our reserved memory regions, the valid pages outside the
> > actual kernel space were allowing the processor to issue speculative
> > accesses to the reserved space, causing the system to halt.
> > 
> > This was encountered somewhat rarely on a normal system boot, and
> > somewhat more often when starting the crash kernel if
> > "crashkernel=512M,high" was specified on the command line (because
> > this heavily restricts the physical address of the crash kernel,
> > usually to within 1 GiB of our reserved space).
> > 
> > The answer is to invalidate the pages of this table outside the
> > address range occupied by the kernel before the page table is
> > activated.  This patch has been validated to fix this problem on our
> > hardware.
> 
> If the goal is to avoid *any* mapping of the reserved region to stop
> speculation, I don't think this patch will do the job. We still (likely)
> have the same memory mapped as part of the identity mapping. And it
> happens at least in two places: here and before on decompression stage.

Yeah, this really needs a fix at the KASLR level: it should only ever map 
into regions that are fully RAM backed.

Is the problem that the 1 GiB mapping is a direct mapping, which can be 
speculated into? I presume KASLR won't accidentally map the kernel into 
the reserved region, right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-09  8:14 ` Kirill A. Shutemov
  2019-09-10  6:18   ` Ingo Molnar
@ 2019-09-10 14:28   ` Steve Wahl
  2019-09-11  0:28     ` Kirill A. Shutemov
  1 sibling, 1 reply; 14+ messages in thread
From: Steve Wahl @ 2019-09-10 14:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Juergen Gross, Brijesh Singh,
	Jordan Borgner, Feng Tang, linux-kernel, Baoquan He,
	russ.anderson, dimitri.sivanich, mike.travis

On Mon, Sep 09, 2019 at 11:14:14AM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> > ...
> > The answer is to invalidate the pages of this table outside the
> > address range occupied by the kernel before the page table is
> > activated.  This patch has been validated to fix this problem on our
> > hardware.
> 
> If the goal is to avoid *any* mapping of the reserved region to stop
> speculation, I don't think this patch will do the job. We still (likely)
> have the same memory mapped as part of the identity mapping. And it
> happens at least in two places: here and before on decompression stage.

I imagine you are likely correct, ideally you would not map any
reserved pages in these spaces.

I've been reading the code to try to understand what you say above.
For identity mappings in the kernel, I see level2_ident_pgt mapping
the first 1G.  And I see early_dyanmic_pgts being set up with an
identity mapping of the kernel that seems to be pretty well restricted
to the range _text through _end.

Within the decompression code, I see an identity mapping of the first
4G set up within the 32 bit code.  I believe we go past that to the
startup_64 entry point.  (I don't know how common that path is, but I
don't have a way to test it without figuring out how to force it.)

From a pragmatic standpoint, the guy who can verify this for me is on
vacation, but I believe our BIOS won't ever place the halt-causing
ranges in a space below 4GiB.  Which explains why this patch works for
our hardware.  (We do have reserved regions below 4G, just not the
ones that hardware causes a halt for accessing.)

In case it helps you picture the situation, our hardware takes a small
portion of RAM from the end of each NUMA node (or it might be pairs or
quads of NUMA nodes, I'm not entirely clear on this at the moment) for
its own purposes.  Here's a section of our e820 table:

[    0.000000] BIOS-e820: [mem 0x000000007c000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000002f7fffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000002f80000000-0x000000303fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000003040000000-0x0000005f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000005f7c000000-0x000000603fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000006040000000-0x0000008f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000008f7c000000-0x000000903fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000009040000000-0x000000bf7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x000000bf7c000000-0x000000c03fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000c040000000-0x000000ef7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x000000ef7c000000-0x000000f03fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000f040000000-0x0000011f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000011f7c000000-0x000001203fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000012040000000-0x0000014f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000014f7c000000-0x000001503fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000015040000000-0x0000017f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000017f7c000000-0x000001803fffffff] reserved

Our problem occurs when KASLR (or kexec) places the kernel close
enough to the end of one of the usable sections, and the 1G of 1:1
mapped space includes a portion of the following reserved section, and
speculation touches the reserved area.

--> Steve Wahl
-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-10  6:18   ` Ingo Molnar
@ 2019-09-10 14:38     ` Steve Wahl
  0 siblings, 0 replies; 14+ messages in thread
From: Steve Wahl @ 2019-09-10 14:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Steve Wahl, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Juergen Gross,
	Brijesh Singh, Jordan Borgner, Feng Tang, linux-kernel,
	Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

On Tue, Sep 10, 2019 at 08:18:15AM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> > > Our hardware (UV aka Superdome Flex) has address ranges marked
> > > reserved by the BIOS. These ranges can cause the system to halt if
> > > accessed.
> > > 
> > > During kernel initialization, the processor was speculating into
> > > reserved memory causing system halts.  The processor speculation is
> > > enabled because the reserved memory is being mapped by the kernel.
> > > 
> > > The page table level2_kernel_pgt is 1 GiB in size, and had all pages
> > > initially marked as valid, and the kernel is placed anywhere in this
> > > range depending on the virtual address selected by KASLR.  Later on in
> > > the boot process, the valid area gets trimmed back to the space
> > > occupied by the kernel.
> > > 
> > > But during the interval of time when the full 1 GiB space was marked
> > > as valid, if the kernel physical address chosen by KASLR was close
> > > enough to our reserved memory regions, the valid pages outside the
> > > actual kernel space were allowing the processor to issue speculative
> > > accesses to the reserved space, causing the system to halt.
> > > 
> > > This was encountered somewhat rarely on a normal system boot, and
> > > somewhat more often when starting the crash kernel if
> > > "crashkernel=512M,high" was specified on the command line (because
> > > this heavily restricts the physical address of the crash kernel,
> > > usually to within 1 GiB of our reserved space).
> > > 
> > > The answer is to invalidate the pages of this table outside the
> > > address range occupied by the kernel before the page table is
> > > activated.  This patch has been validated to fix this problem on our
> > > hardware.
> > 
> > If the goal is to avoid *any* mapping of the reserved region to stop
> > speculation, I don't think this patch will do the job. We still (likely)
> > have the same memory mapped as part of the identity mapping. And it
> > happens at least in two places: here and before on decompression stage.
> 
> Yeah, this really needs a fix at the KASLR level: it should only ever map 
> into regions that are fully RAM backed.
> 
> Is the problem that the 1 GiB mapping is a direct mapping, which can be 
> speculated into? I presume KASLR won't accidentally map the kernel into 
> the reserved region, right?

I believe you are correct.  There is code that limits KASLR's choice
of phyiscal addresses to valid RAM locations.  There are no bugs in it
that I've seen.

It's just that the 1G mapping includes wide regions of physical
address space on either or both sides of the chosen physical space for
the kernel, which are not limited to valid RAM regions, allowing
speculative accesses into reserved regions if the chosen kernel
physical address is close enough to one of them.

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-10 14:28   ` Steve Wahl
@ 2019-09-11  0:28     ` Kirill A. Shutemov
  2019-09-11 20:08       ` Steve Wahl
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill A. Shutemov @ 2019-09-11  0:28 UTC (permalink / raw)
  To: Steve Wahl
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Brijesh Singh, Jordan Borgner, Feng Tang,
	linux-kernel, Baoquan He, russ.anderson, dimitri.sivanich,
	mike.travis

On Tue, Sep 10, 2019 at 09:28:10AM -0500, Steve Wahl wrote:
> On Mon, Sep 09, 2019 at 11:14:14AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> > > ...
> > > The answer is to invalidate the pages of this table outside the
> > > address range occupied by the kernel before the page table is
> > > activated.  This patch has been validated to fix this problem on our
> > > hardware.
> > 
> > If the goal is to avoid *any* mapping of the reserved region to stop
> > speculation, I don't think this patch will do the job. We still (likely)
> > have the same memory mapped as part of the identity mapping. And it
> > happens at least in two places: here and before on decompression stage.
> 
> I imagine you are likely correct, ideally you would not map any
> reserved pages in these spaces.
> 
> I've been reading the code to try to understand what you say above.
> For identity mappings in the kernel, I see level2_ident_pgt mapping
> the first 1G.

This is for XEN case. Not sure how relevant it is for you.

> And I see early_dyanmic_pgts being set up with an identity mapping of
> the kernel that seems to be pretty well restricted to the range _text
> through _end.

Right, but rounded to 2M around the place kernel was decompressed to.
Some of reserved areas from the listing below are smaller then 2M or not
aligned to 2M.

> Within the decompression code, I see an identity mapping of the first
> 4G set up within the 32 bit code.  I believe we go past that to the
> startup_64 entry point.  (I don't know how common that path is, but I
> don't have a way to test it without figuring out how to force it.)

Kernel can start in 64-bit mode directly and in this case we inherit page
tables from bootloader/BIOS. They trusted to provide identity mapping to
cover at least kernel (plus some more essential stuff), but it's free to
map more.

> From a pragmatic standpoint, the guy who can verify this for me is on
> vacation, but I believe our BIOS won't ever place the halt-causing
> ranges in a space below 4GiB.  Which explains why this patch works for
> our hardware.  (We do have reserved regions below 4G, just not the
> ones that hardware causes a halt for accessing.)
> 
> In case it helps you picture the situation, our hardware takes a small
> portion of RAM from the end of each NUMA node (or it might be pairs or
> quads of NUMA nodes, I'm not entirely clear on this at the moment) for
> its own purposes.  Here's a section of our e820 table:
> 
> [    0.000000] BIOS-e820: [mem 0x000000007c000000-0x000000008fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000002f7fffffff] usable
> [    0.000000] BIOS-e820: [mem 0x0000002f80000000-0x000000303fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000003040000000-0x0000005f7bffffff] usable
> [    0.000000] BIOS-e820: [mem 0x0000005f7c000000-0x000000603fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000006040000000-0x0000008f7bffffff] usable
> [    0.000000] BIOS-e820: [mem 0x0000008f7c000000-0x000000903fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000009040000000-0x000000bf7bffffff] usable
> [    0.000000] BIOS-e820: [mem 0x000000bf7c000000-0x000000c03fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x000000c040000000-0x000000ef7bffffff] usable
> [    0.000000] BIOS-e820: [mem 0x000000ef7c000000-0x000000f03fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x000000f040000000-0x0000011f7bffffff] usable
> [    0.000000] BIOS-e820: [mem 0x0000011f7c000000-0x000001203fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000012040000000-0x0000014f7bffffff] usable
> [    0.000000] BIOS-e820: [mem 0x0000014f7c000000-0x000001503fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000015040000000-0x0000017f7bffffff] usable
> [    0.000000] BIOS-e820: [mem 0x0000017f7c000000-0x000001803fffffff] reserved

It would be interesting to know which of them are problematic.

> Our problem occurs when KASLR (or kexec) places the kernel close
> enough to the end of one of the usable sections, and the 1G of 1:1
> mapped space includes a portion of the following reserved section, and
> speculation touches the reserved area.

Are you sure that it's speculative access to blame? Speculative access
must not cause change in architectural state.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-11  0:28     ` Kirill A. Shutemov
@ 2019-09-11 20:08       ` Steve Wahl
  2019-09-12 10:19         ` Kirill A. Shutemov
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Wahl @ 2019-09-11 20:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Juergen Gross, Brijesh Singh,
	Jordan Borgner, Feng Tang, linux-kernel, Baoquan He,
	russ.anderson, dimitri.sivanich, mike.travis

On Wed, Sep 11, 2019 at 03:28:56AM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 10, 2019 at 09:28:10AM -0500, Steve Wahl wrote:
> > On Mon, Sep 09, 2019 at 11:14:14AM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> > > > ...
> > > > The answer is to invalidate the pages of this table outside the
> > > > address range occupied by the kernel before the page table is
> > > > activated.  This patch has been validated to fix this problem on our
> > > > hardware.
> > > 
> > > If the goal is to avoid *any* mapping of the reserved region to stop
> > > speculation, I don't think this patch will do the job. We still (likely)
> > > have the same memory mapped as part of the identity mapping. And it
> > > happens at least in two places: here and before on decompression stage.
> > 
> > I imagine you are likely correct, ideally you would not map any
> > reserved pages in these spaces.
> > 
> > I've been reading the code to try to understand what you say above.
> > For identity mappings in the kernel, I see level2_ident_pgt mapping
> > the first 1G.
> 
> This is for XEN case. Not sure how relevant it is for you.

I don't have much familiarity with XEN, and I'm not using it, but it
does seem to be enabled for the distribution kernels we deal with.
However, it is below 4G.

> > And I see early_dyanmic_pgts being set up with an identity mapping of
> > the kernel that seems to be pretty well restricted to the range _text
> > through _end.
> 
> Right, but rounded to 2M around the place kernel was decompressed to.
> Some of reserved areas from the listing below are smaller then 2M or not
> aligned to 2M.

The problematic reserved regions are aligned to 2M or greater.  See
the answer to "which reserved regions" below.

> > Within the decompression code, I see an identity mapping of the first
> > 4G set up within the 32 bit code.  I believe we go past that to the
> > startup_64 entry point.  (I don't know how common that path is, but I
> > don't have a way to test it without figuring out how to force it.)
> 
> Kernel can start in 64-bit mode directly and in this case we inherit page
> tables from bootloader/BIOS. They trusted to provide identity mapping to
> cover at least kernel (plus some more essential stuff), but it's free to
> map more.

I haven't looked at the bootloader, at least not yet.

If tables supplied by the BIOS don't follow the rules, that is
somebody else's problem.  (And if needed I'll hunt them down.)

> > From a pragmatic standpoint, the guy who can verify this for me is on
> > vacation, but I believe our BIOS won't ever place the halt-causing
> > ranges in a space below 4GiB.  Which explains why this patch works for
> > our hardware.  (We do have reserved regions below 4G, just not the
> > ones that hardware causes a halt for accessing.)
> > 
> > In case it helps you picture the situation, our hardware takes a small
> > portion of RAM from the end of each NUMA node (or it might be pairs or
> > quads of NUMA nodes, I'm not entirely clear on this at the moment) for
> > its own purposes.  Here's a section of our e820 table:
> > 
> > [    0.000000] BIOS-e820: [mem 0x000000007c000000-0x000000008fffffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000002f7fffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000002f80000000-0x000000303fffffff] reserved *
> > [    0.000000] BIOS-e820: [mem 0x0000003040000000-0x0000005f7bffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000005f7c000000-0x000000603fffffff] reserved *
> > [    0.000000] BIOS-e820: [mem 0x0000006040000000-0x0000008f7bffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000008f7c000000-0x000000903fffffff] reserved *
> > [    0.000000] BIOS-e820: [mem 0x0000009040000000-0x000000bf7bffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x000000bf7c000000-0x000000c03fffffff] reserved *
> > [    0.000000] BIOS-e820: [mem 0x000000c040000000-0x000000ef7bffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x000000ef7c000000-0x000000f03fffffff] reserved *
> > [    0.000000] BIOS-e820: [mem 0x000000f040000000-0x0000011f7bffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000011f7c000000-0x000001203fffffff] reserved *
> > [    0.000000] BIOS-e820: [mem 0x0000012040000000-0x0000014f7bffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000014f7c000000-0x000001503fffffff] reserved *
> > [    0.000000] BIOS-e820: [mem 0x0000015040000000-0x0000017f7bffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000017f7c000000-0x000001803fffffff] reserved *
> 
> It would be interesting to know which of them are problematic.

It's pretty much all of the reserved regions above 4G, I edited the
table above and put asterisks on the lines describing the problematic
regions.  The number and size of them will vary based on the number of
NUMA nodes and other factors.

The alignment of these... While examining the values above, I realized
I'm working with a BIOS version that has already tried to work around
this problem by aligning to 1GiB.  My expert is still on vacation, but
I and a coworker looked at the BIOS source, and we're 99% certain the
alignment and granularity without the workaround code would be 64MiB.
Which accomodates the 2MiB alignment issues discussed above.

I didn't want to delay my response until my expert was back.  I will
send another message when I can get this confirmed.

> > Our problem occurs when KASLR (or kexec) places the kernel close
> > enough to the end of one of the usable sections, and the 1G of 1:1
> > mapped space includes a portion of the following reserved section, and
> > speculation touches the reserved area.
> 
> Are you sure that it's speculative access to blame? Speculative access
> must not cause change in architectural state.

That's a hard one to prove.  However, our hardware stops detecting
accesses to these areas (and halting) when we stop including them in
valid page table entries.  With that change, any non-speculative
accesses to these areas should have transformed into an access to an
invalid page / page fault in kernel mode / oops.  Instead, they just
go away.  That says to me that they are speculative accesses.

Thank you for your time looking into this with me!

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-11 20:08       ` Steve Wahl
@ 2019-09-12 10:19         ` Kirill A. Shutemov
  2019-09-13 15:14           ` Steve Wahl
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill A. Shutemov @ 2019-09-12 10:19 UTC (permalink / raw)
  To: Steve Wahl
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Brijesh Singh, Jordan Borgner, Feng Tang,
	linux-kernel, Baoquan He, russ.anderson, dimitri.sivanich,
	mike.travis

On Wed, Sep 11, 2019 at 03:08:35PM -0500, Steve Wahl wrote:
> Thank you for your time looking into this with me!

With all this explanation the original patch looks sane to me.

But I would like to see more information from this thread in the commit
message and some comments in the code on why it's crucial not to map more
than needed.

I think we also need to make it clear that this is workaround for a broken
hardware: speculative execution must not trigger a halt.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-12 10:19         ` Kirill A. Shutemov
@ 2019-09-13 15:14           ` Steve Wahl
  2019-09-16  9:00             ` Kirill A. Shutemov
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Wahl @ 2019-09-13 15:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Juergen Gross, Brijesh Singh,
	Jordan Borgner, Feng Tang, linux-kernel, Baoquan He,
	russ.anderson, dimitri.sivanich, mike.travis

On Thu, Sep 12, 2019 at 01:19:17PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 11, 2019 at 03:08:35PM -0500, Steve Wahl wrote:
> > Thank you for your time looking into this with me!
> 
> With all this explanation the original patch looks sane to me.
> 
> But I would like to see more information from this thread in the commit
> message and some comments in the code on why it's crucial not to map more
> than needed.

I am working on this.

> I think we also need to make it clear that this is workaround for a broken
> hardware: speculative execution must not trigger a halt.

I think the word broken is a bit loaded here.  According to the UEFI
spec (version 2.8, page 167), "Regions that are backed by the physical
hardware, but are not supposed to be accessed by the OS, must be
returned as EfiReservedMemoryType."  Our interpretation is that
includes speculative accesses.

I'd lean more toward "tightening of adherence to the spec required by
some particular hardware."  Does that work for you?

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-13 15:14           ` Steve Wahl
@ 2019-09-16  9:00             ` Kirill A. Shutemov
  2019-09-16 14:25               ` Dave Hansen
  2019-09-16 16:17               ` Steve Wahl
  0 siblings, 2 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2019-09-16  9:00 UTC (permalink / raw)
  To: Steve Wahl, Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Brijesh Singh, Jordan Borgner, Feng Tang,
	linux-kernel, Baoquan He, russ.anderson, dimitri.sivanich,
	mike.travis

On Fri, Sep 13, 2019 at 10:14:15AM -0500, Steve Wahl wrote:
> On Thu, Sep 12, 2019 at 01:19:17PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 11, 2019 at 03:08:35PM -0500, Steve Wahl wrote:
> > > Thank you for your time looking into this with me!
> > 
> > With all this explanation the original patch looks sane to me.
> > 
> > But I would like to see more information from this thread in the commit
> > message and some comments in the code on why it's crucial not to map more
> > than needed.
> 
> I am working on this.
> 
> > I think we also need to make it clear that this is workaround for a broken
> > hardware: speculative execution must not trigger a halt.
> 
> I think the word broken is a bit loaded here.  According to the UEFI
> spec (version 2.8, page 167), "Regions that are backed by the physical
> hardware, but are not supposed to be accessed by the OS, must be
> returned as EfiReservedMemoryType."  Our interpretation is that
> includes speculative accesses.

+Dave.

I don't think it is. Speculative access is done by hardware, not OS.

BTW, isn't it a BIOS issue?

I believe it should have a way to hide a range of physical address space
from OS or force a caching mode on to exclude it from speculative
execution. Like setup MTRRs or something.

> I'd lean more toward "tightening of adherence to the spec required by
> some particular hardware."  Does that work for you?

Not really, no. I still believe it's issue of the platform, not OS.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-16  9:00             ` Kirill A. Shutemov
@ 2019-09-16 14:25               ` Dave Hansen
  2019-09-16 17:14                 ` Steve Wahl
  2019-09-16 16:17               ` Steve Wahl
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2019-09-16 14:25 UTC (permalink / raw)
  To: Kirill A. Shutemov, Steve Wahl, Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Brijesh Singh, Jordan Borgner, Feng Tang,
	linux-kernel, Baoquan He, russ.anderson, dimitri.sivanich,
	mike.travis

On 9/16/19 2:00 AM, Kirill A. Shutemov wrote:
>>> I think we also need to make it clear that this is workaround for a broken
>>> hardware: speculative execution must not trigger a halt.
>> I think the word broken is a bit loaded here.  According to the UEFI
>> spec (version 2.8, page 167), "Regions that are backed by the physical
>> hardware, but are not supposed to be accessed by the OS, must be
>> returned as EfiReservedMemoryType."  Our interpretation is that
>> includes speculative accesses.
> +Dave.
> 
> I don't think it is. Speculative access is done by hardware, not OS.
> 
> BTW, isn't it a BIOS issue?
> 
> I believe it should have a way to hide a range of physical address space
> from OS or force a caching mode on to exclude it from speculative
> execution. Like setup MTRRs or something.

Ugh.  I bet that was a fun one to chase down.  Have the hardware
engineers learned a lesson or are they hiding behind the EFI spec in an
act of pure cowardice? ;)

The patch is small and fixes a real problem.  The changelog is OK,
although I'd prefer some differentiation between "occupied by the
kernel" and the kernel *image*.  The code is also gloriously free of any
comments about what it's doing or why.

But, I'm left with lots of questions:

Why do PMD-level changes fix this?  Is it because we 2MB pad the kernel
image?  Why can't we still get within 2MB of the memory address in
question?  Is it in the lower 1MB, by chance?  If this is all about
avoiding EFI reserved ranges, why doesn't the patch *LOOK* At EFI
reserved ranges?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-16  9:00             ` Kirill A. Shutemov
  2019-09-16 14:25               ` Dave Hansen
@ 2019-09-16 16:17               ` Steve Wahl
  1 sibling, 0 replies; 14+ messages in thread
From: Steve Wahl @ 2019-09-16 16:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Steve Wahl, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Juergen Gross,
	Brijesh Singh, Jordan Borgner, Feng Tang, linux-kernel,
	Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

On Mon, Sep 16, 2019 at 12:00:58PM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 13, 2019 at 10:14:15AM -0500, Steve Wahl wrote:
> > On Thu, Sep 12, 2019 at 01:19:17PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Sep 11, 2019 at 03:08:35PM -0500, Steve Wahl wrote:
> > > > Thank you for your time looking into this with me!
> > > 
> > > With all this explanation the original patch looks sane to me.
> > > 
> > > But I would like to see more information from this thread in the commit
> > > message and some comments in the code on why it's crucial not to map more
> > > than needed.
> > 
> > I am working on this.
> > 
> > > I think we also need to make it clear that this is workaround for a broken
> > > hardware: speculative execution must not trigger a halt.
> > 
> > I think the word broken is a bit loaded here.  According to the UEFI
> > spec (version 2.8, page 167), "Regions that are backed by the physical
> > hardware, but are not supposed to be accessed by the OS, must be
> > returned as EfiReservedMemoryType."  Our interpretation is that
> > includes speculative accesses.
> 
> +Dave.
> 
> I don't think it is. Speculative access is done by hardware, not OS.

But the OS controls speculative access to physical addresses by their
presence or absence in the page tables.  Speculation is done with
calculations based on virtual addresses, without a valid translation
to physical addresses it doesn't progress to an externally visible
action.

> BTW, isn't it a BIOS issue?
> 
> I believe it should have a way to hide a range of physical address space
> from OS or force a caching mode on to exclude it from speculative
> execution. Like setup MTRRs or something.

This is their intent in marking it as reserved in the memory tables.
They have explored many other avenues (I've even suggested a couple
since pinning down this problem) and for each one there is a reason it
doesn't work.

> > I'd lean more toward "tightening of adherence to the spec required by
> > some particular hardware."  Does that work for you?
> 
> Not really, no. I still believe it's issue of the platform, not OS.

My current wording for a comment to go above the code is:

/*
 * Only the region occupied by the kernel has so far been checked against
 * the table of usable memory regions provided by the firmware, so
 * invalidate pages outside that region.  A page table entry that maps to
 * a reserved area of memory would allow processor speculation into that
 * area, and on some hardware (particularly the UV platform) speculation
 * into reserved areas can cause a system halt.
 */

How close does that come to working for you?  (I'm going to dive
specifically into the phrase "region occupied by the kernel" in a
reply to Dave in another message; I understand that phrase may need
work.  I'm more asking about the remainder of it here.)

--> Steve

> -- 
>  Kirill A. Shutemov

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-16 14:25               ` Dave Hansen
@ 2019-09-16 17:14                 ` Steve Wahl
  2019-09-16 22:19                   ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Steve Wahl @ 2019-09-16 17:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Steve Wahl, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, x86, Juergen Gross,
	Brijesh Singh, Jordan Borgner, Feng Tang, linux-kernel,
	Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

On Mon, Sep 16, 2019 at 07:25:48AM -0700, Dave Hansen wrote:
> On 9/16/19 2:00 AM, Kirill A. Shutemov wrote:
> >>> I think we also need to make it clear that this is workaround for a broken
> >>> hardware: speculative execution must not trigger a halt.
> >> I think the word broken is a bit loaded here.  According to the UEFI
> >> spec (version 2.8, page 167), "Regions that are backed by the physical
> >> hardware, but are not supposed to be accessed by the OS, must be
> >> returned as EfiReservedMemoryType."  Our interpretation is that
> >> includes speculative accesses.
> > +Dave.
> > 
> > I don't think it is. Speculative access is done by hardware, not OS.
> > 
> > BTW, isn't it a BIOS issue?
> > 
> > I believe it should have a way to hide a range of physical address space
> > from OS or force a caching mode on to exclude it from speculative
> > execution. Like setup MTRRs or something.
> 
> Ugh.  I bet that was a fun one to chase down.  Have the hardware
> engineers learned a lesson or are they hiding behind the EFI spec in an
> act of pure cowardice? ;)

Yes, it was fun.  My main BIOS contact has explained to me how they are
stuck between a rock and a hard place on any other options for this.

> The patch is small and fixes a real problem.  The changelog is OK,
> although I'd prefer some differentiation between "occupied by the
> kernel" and the kernel *image*.

OK, is the phrase "kernel image" generally understood to cover
everything from _text to _end, including the bss?  As long as that's
true, I will adopt this phrase.

> The code is also gloriously free of any comments about what it's
> doing or why.

I'm intending to add something like this in the next version:

/*
 * Only the region occupied by the kernel image has so far been checked against
 * the table of usable memory regions provided by the firmware, so
 * invalidate pages outside that region.  A page table entry that maps to
 * a reserved area of memory would allow processor speculation into that
 * area, and on some hardware (particularly the UV platform) speculation
 * into reserved areas can cause a system halt.
 */

> But, I'm left with lots of questions:
> 
> Why do PMD-level changes fix this?  Is it because we 2MB pad the kernel
> image?  Why can't we still get within 2MB of the memory address in
> question?

This fix works for our hardware because the problematic reserved
regions are 64M aligned, and going up to the next 2MB boundary from
_end is not going to cross the next 64M boundary.

One could argue the next step would be going into
boot/compressed/{kaslr.c, misc.c} and rounding the size of the kernel
up to the next 2MB boundary to ensure the chosen random location is
covered by usable RAM up to the next PMD-level boundary.  I did not go
there because for us it is not necessary.

> Is it in the lower 1MB, by chance?

No, this is a reserved range at the top physical addresses for each
NUMA node in a collection of them.  

> If this is all about avoiding EFI reserved ranges, why doesn't the
> patch *LOOK* At EFI reserved ranges?

Because the range the kernel image is located in is already checked
against them in boot/compressed/kaslr.c.  This will now be explained
in the comment I mention above, which you had not yet seen.

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-16 17:14                 ` Steve Wahl
@ 2019-09-16 22:19                   ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2019-09-16 22:19 UTC (permalink / raw)
  To: Steve Wahl
  Cc: Kirill A. Shutemov, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Juergen Gross,
	Brijesh Singh, Jordan Borgner, Feng Tang, linux-kernel,
	Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

On 9/16/19 10:14 AM, Steve Wahl wrote:
> I'm intending to add something like this in the next version:
> 
> /*
>  * Only the region occupied by the kernel image has so far been checked against
>  * the table of usable memory regions provided by the firmware, so
>  * invalidate pages outside that region.  A page table entry that maps to
>  * a reserved area of memory would allow processor speculation into that
>  * area, and on some hardware (particularly the UV platform) speculation
>  * into reserved areas can cause a system halt.
>  */

This is a good start, but it's a bit imprecise.

The KASLR code ensures that the kernel image itself doesn't overlap a
reserved area, but it appears that we're mapping more than strictly just
the kernel image.

>> But, I'm left with lots of questions:
>>
>> Why do PMD-level changes fix this?  Is it because we 2MB pad the kernel
>> image?  Why can't we still get within 2MB of the memory address in
>> question?
> 
> This fix works for our hardware because the problematic reserved
> regions are 64M aligned, and going up to the next 2MB boundary from
> _end is not going to cross the next 64M boundary.

I'd really prefer that we fix this once and for all rather than kicking
the can down the road to the next hardware manufacturer that changes the
alignment.

> One could argue the next step would be going into
> boot/compressed/{kaslr.c, misc.c} and rounding the size of the kernel
> up to the next 2MB boundary to ensure the chosen random location is
> covered by usable RAM up to the next PMD-level boundary.  I did not go
> there because for us it is not necessary.

Doing this would *fully* fix this issue, right?  Seems like the right
thing to do to me.

>> If this is all about avoiding EFI reserved ranges, why doesn't the
>> patch *LOOK* At EFI reserved ranges?
> 
> Because the range the kernel image is located in is already checked
> against them in boot/compressed/kaslr.c.  This will now be explained
> in the comment I mention above, which you had not yet seen.

Ahh, thanks for the explanation.  This will make for good changelog
material.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-09-16 22:19 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-06 21:29 [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area Steve Wahl
2019-09-09  8:14 ` Kirill A. Shutemov
2019-09-10  6:18   ` Ingo Molnar
2019-09-10 14:38     ` Steve Wahl
2019-09-10 14:28   ` Steve Wahl
2019-09-11  0:28     ` Kirill A. Shutemov
2019-09-11 20:08       ` Steve Wahl
2019-09-12 10:19         ` Kirill A. Shutemov
2019-09-13 15:14           ` Steve Wahl
2019-09-16  9:00             ` Kirill A. Shutemov
2019-09-16 14:25               ` Dave Hansen
2019-09-16 17:14                 ` Steve Wahl
2019-09-16 22:19                   ` Dave Hansen
2019-09-16 16:17               ` Steve Wahl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).