linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] x86/boot/64: Avoid mapping reserved ranges in early page tables.
@ 2019-09-23 18:14 Steve Wahl
  2019-09-23 18:15 ` [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area Steve Wahl
  2019-09-23 18:15 ` [PATCH v2 2/2] x86/boot/64: round memory hole size up to next PMD page Steve Wahl
  0 siblings, 2 replies; 10+ messages in thread
From: Steve Wahl @ 2019-09-23 18:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Kirill A. Shutemov, Brijesh Singh, Steve Wahl,
	Jordan Borgner, Feng Tang, linux-kernel, Chao Fan,
	Zhenzhong Duan
  Cc: Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

This patch set narrows the range of addresses mapped by the page
table level2_kernel_pgt so that it contains only ranges that have
been checked against the "usable RAM" list provided by the BIOS.

Prior to this, mappings that were larger than needed would
occasionally cross into regions marked reserved, allowing the
processor to access these reserved regions; such accesses were caught
by the hardware and caused the BIOS to halt on our platform (UV).
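The over-mapping here is a consequence of 2 MiB (PMD) granularity:
the first and last PMD entries that cover the kernel image also map
whatever physical memory happens to share those 2 MiB frames.  A
minimal user-space sketch of the arithmetic involved (illustrative
only; the constants mirror the kernel's PMD definitions, the example
addresses are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE (2ULL * 1024 * 1024)	/* one PMD entry maps 2 MiB */
#define PMD_MASK (~(PMD_SIZE - 1))

/*
 * Physical span actually mapped when [start, end) is covered with
 * PMD-sized pages: it can extend below 'start' and above 'end',
 * possibly into neighboring reserved ranges.
 */
static uint64_t pmd_span_start(uint64_t start)
{
	return start & PMD_MASK;
}

static uint64_t pmd_span_end(uint64_t end)
{
	return (end + PMD_SIZE - 1) & PMD_MASK;
}
```

For a hypothetical image at 0x73a30000..0x74c00001, the mapped span
runs from 0x73a00000 to 0x74e00000; on some memory maps that extra
head or tail can land inside a reserved range.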

Changes since v1:

* Cover letter added because there are now two patches.

* Patch 1: Added comment and re-worked changelog text.

* Patch 2: New change requested by Dave Hansen to handle the case that
  the mapping of the last PMD page for the kernel image could cross a
  reserved region boundary.
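The rounding in patch 2 amounts to the kernel's ALIGN() macro applied
with MIN_KERNEL_ALIGN, which is PMD_SIZE (2 MiB) on x86_64.  A
user-space sketch of the power-of-two round-up (illustrative only,
not the kernel's macro itself):

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE 0x200000ULL	/* 2 MiB; MIN_KERNEL_ALIGN on x86_64 */

/*
 * Round 'size' up to the next multiple of the power-of-two
 * alignment 'a', mirroring the kernel's ALIGN() macro.
 */
static uint64_t align_up(uint64_t size, uint64_t a)
{
	return (size + a - 1) & ~(a - 1);
}
```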


Steve Wahl (2):
  x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  x86/boot/64: round memory hole size up to next PMD page.

 arch/x86/boot/compressed/misc.c | 25 +++++++++++++++++++------
 arch/x86/kernel/head64.c        | 18 ++++++++++++++++--
 2 files changed, 35 insertions(+), 8 deletions(-)

-- 
2.21.0


-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-23 18:14 [PATCH v2 0/2] x86/boot/64: Avoid mapping reserved ranges in early page tables Steve Wahl
@ 2019-09-23 18:15 ` Steve Wahl
  2019-09-23 18:49   ` hpa
  2019-09-23 21:19   ` Dave Hansen
  2019-09-23 18:15 ` [PATCH v2 2/2] x86/boot/64: round memory hole size up to next PMD page Steve Wahl
  1 sibling, 2 replies; 10+ messages in thread
From: Steve Wahl @ 2019-09-23 18:15 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Kirill A. Shutemov, Brijesh Singh,
	Steve Wahl, Jordan Borgner, Feng Tang, linux-kernel, Chao Fan,
	Zhenzhong Duan
  Cc: Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

Our hardware (UV aka Superdome Flex) has address ranges marked
reserved by the BIOS. Access to these ranges is caught as an error,
causing the BIOS to halt the system.

Initial page tables mapped a large range of physical addresses that
were not checked against the list of BIOS reserved addresses, and
sometimes included reserved addresses in part of the mapped range.
Including the reserved range in the map allowed processor speculative
accesses to the reserved range, triggering a BIOS halt.

Used early in booting, the page table level2_kernel_pgt addresses 1
GiB divided into 2 MiB pages, and it was set up to linearly map a full
1 GiB of physical addresses that included the physical address range
of the kernel image, as chosen by KASLR.  But this also included a
large range of unused addresses on either side of the kernel image.
And unlike the kernel image's physical address range, this extra
mapped space was not checked against the BIOS tables of usable RAM
addresses.  So there were times when the addresses chosen by KASLR
would result in processor-accessible mappings of BIOS-reserved
physical addresses.

The kernel code did not directly access any of this extra mapped
space, but having it mapped allowed the processor to issue speculative
accesses into reserved memory, causing system halts.

This was encountered somewhat rarely on a normal system boot, and much
more often when starting the crash kernel if "crashkernel=512M,high"
was specified on the command line (this heavily restricts the physical
address of the crash kernel, in our case usually within 1 GiB of
reserved space).

The solution is to invalidate the pages of this table outside the
kernel image's space before the page table is activated.  This patch
has been validated to fix this problem on our hardware.

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Cc: stable@vger.kernel.org
---
Changes since v1:
  * Added comment.
  * Reworked changelog text.

 arch/x86/kernel/head64.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 29ffa495bd1c..ee9d0e3e0c46 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -222,13 +222,27 @@ unsigned long __head __startup_64(unsigned long physaddr,
 	 * we might write invalid pmds, when the kernel is relocated
 	 * cleanup_highmap() fixes this up along with the mappings
 	 * beyond _end.
+	 *
+	 * Only the region occupied by the kernel image has so far
+	 * been checked against the table of usable memory regions
+	 * provided by the firmware, so invalidate pages outside that
+	 * region.  A page table entry that maps to a reserved area of
+	 * memory would allow processor speculation into that area,
+	 * and on some hardware (particularly the UV platform) even
+	 * speculative access to some reserved areas is caught as an
+	 * error, causing the BIOS to halt the system.
 	 */
 
 	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
-	for (i = 0; i < PTRS_PER_PMD; i++) {
+	for (i = 0; i < pmd_index((unsigned long)_text); i++)
+		pmd[i] &= ~_PAGE_PRESENT;
+
+	for (; i <= pmd_index((unsigned long)_end); i++)
 		if (pmd[i] & _PAGE_PRESENT)
 			pmd[i] += load_delta;
-	}
+
+	for (; i < PTRS_PER_PMD; i++)
+		pmd[i] &= ~_PAGE_PRESENT;
 
 	/*
 	 * Fixup phys_base - remove the memory encryption mask to obtain
-- 
2.21.0


-- 
Steve Wahl, Hewlett Packard Enterprise


* [PATCH v2 2/2] x86/boot/64: round memory hole size up to next PMD page.
  2019-09-23 18:14 [PATCH v2 0/2] x86/boot/64: Avoid mapping reserved ranges in early page tables Steve Wahl
  2019-09-23 18:15 ` [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area Steve Wahl
@ 2019-09-23 18:15 ` Steve Wahl
  2019-09-23 21:20   ` Dave Hansen
  1 sibling, 1 reply; 10+ messages in thread
From: Steve Wahl @ 2019-09-23 18:15 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Kirill A. Shutemov, Brijesh Singh,
	Steve Wahl, Jordan Borgner, Feng Tang, linux-kernel, Chao Fan,
	Zhenzhong Duan
  Cc: Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

The kernel image map is created using PMD pages, which can include
some extra space beyond what's actually needed.  Round the size of the
memory hole we search for up to the next PMD boundary, to be certain
all of the space to be mapped is usable RAM and includes no reserved
areas.

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Cc: stable@vger.kernel.org
---
Changes since v1:
  * This patch is completely new to this version.

 arch/x86/boot/compressed/misc.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 53ac0cb2396d..9652d5c2afda 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -345,6 +345,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 {
 	const unsigned long kernel_total_size = VO__end - VO__text;
 	unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
+	unsigned long needed_size;
 
 	/* Retain x86 boot parameters pointer passed from startup_32/64. */
 	boot_params = rmode;
@@ -379,26 +380,38 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	free_mem_ptr     = heap;	/* Heap */
 	free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
 
+	/*
+	 * The memory hole needed for the kernel is the larger of either
+	 * the entire decompressed kernel plus relocation table, or the
+	 * entire decompressed kernel plus .bss and .brk sections.
+	 *
+	 * On X86_64, the memory is mapped with PMD pages. Round the
+	 * size up so that the full extent of PMD pages mapped is
+	 * included in the check against the valid memory table
+	 * entries. This ensures the full mapped area is usable RAM
+	 * and doesn't include any reserved areas.
+	 */
+	needed_size = max(output_len, kernel_total_size);
+#ifdef CONFIG_X86_64
+	needed_size = ALIGN(needed_size, MIN_KERNEL_ALIGN);
+#endif
+
 	/* Report initial kernel position details. */
 	debug_putaddr(input_data);
 	debug_putaddr(input_len);
 	debug_putaddr(output);
 	debug_putaddr(output_len);
 	debug_putaddr(kernel_total_size);
+	debug_putaddr(needed_size);
 
 #ifdef CONFIG_X86_64
 	/* Report address of 32-bit trampoline */
 	debug_putaddr(trampoline_32bit);
 #endif
 
-	/*
-	 * The memory hole needed for the kernel is the larger of either
-	 * the entire decompressed kernel plus relocation table, or the
-	 * entire decompressed kernel plus .bss and .brk sections.
-	 */
 	choose_random_location((unsigned long)input_data, input_len,
 				(unsigned long *)&output,
-				max(output_len, kernel_total_size),
+				needed_size,
 				&virt_addr);
 
 	/* Validate memory location choices. */
-- 
2.21.0


-- 
Steve Wahl, Hewlett Packard Enterprise


* Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-23 18:15 ` [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area Steve Wahl
@ 2019-09-23 18:49   ` hpa
  2019-09-24 16:32     ` Steve Wahl
  2019-09-23 21:19   ` Dave Hansen
  1 sibling, 1 reply; 10+ messages in thread
From: hpa @ 2019-09-23 18:49 UTC (permalink / raw)
  To: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	Juergen Gross, Kirill A. Shutemov, Brijesh Singh, Jordan Borgner,
	Feng Tang, linux-kernel, Chao Fan, Zhenzhong Duan
  Cc: Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

On September 23, 2019 11:15:20 AM PDT, Steve Wahl <steve.wahl@hpe.com> wrote:
>Our hardware (UV aka Superdome Flex) has address ranges marked
>reserved by the BIOS. Access to these ranges is caught as an error,
>causing the BIOS to halt the system.
>
>[...]
What does your MTRR setup look like, and what memory map do you present, in exact detail? The BIOS is normally expected to mark the relevant ranges as UC in the MTRRs (that is the remaining, legitimate usage of MTRRs.)

I'm somewhat sceptical that chopping off potentially several megabytes is a good thing. We also have the memory type interfaces which can be used to map these as UC in the page tables.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-23 18:15 ` [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area Steve Wahl
  2019-09-23 18:49   ` hpa
@ 2019-09-23 21:19   ` Dave Hansen
  2019-09-24 20:04     ` Steve Wahl
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Hansen @ 2019-09-23 21:19 UTC (permalink / raw)
  To: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Juergen Gross, Kirill A. Shutemov,
	Brijesh Singh, Jordan Borgner, Feng Tang, linux-kernel, Chao Fan,
	Zhenzhong Duan
  Cc: Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

On 9/23/19 11:15 AM, Steve Wahl wrote:
>  	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> -	for (i = 0; i < PTRS_PER_PMD; i++) {
> +	for (i = 0; i < pmd_index((unsigned long)_text); i++)
> +		pmd[i] &= ~_PAGE_PRESENT;
> +
> +	for (; i <= pmd_index((unsigned long)_end); i++)
>  		if (pmd[i] & _PAGE_PRESENT)
>  			pmd[i] += load_delta;
> -	}
> +
> +	for (; i < PTRS_PER_PMD; i++)
> +		pmd[i] &= ~_PAGE_PRESENT;

This is looking a bunch better.  The broken-up loop could probably use
some comments, or you could combine it back to a single loop like this:

	int text_start_pmd_index = pmd_index((unsigned long)_text);
	int text_end_pmd_index   = pmd_index((unsigned long)_end);

	for (i = 0; i < PTRS_PER_PMD; i++) {
		if ((i < text_start_pmd_index) ||
		    (i > text_end_pmd_index)) {
			/* Unmap entries not mapping the kernel image */
			pmd[i] &= ~_PAGE_PRESENT;
		} else if (pmd[i] & _PAGE_PRESENT) {
 			pmd[i] += load_delta;
		}
	}

Although I'd prefer it get commented or rewritten, it's passable like
this, so:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>


* Re: [PATCH v2 2/2] x86/boot/64: round memory hole size up to next PMD page.
  2019-09-23 18:15 ` [PATCH v2 2/2] x86/boot/64: round memory hole size up to next PMD page Steve Wahl
@ 2019-09-23 21:20   ` Dave Hansen
  2019-09-24 19:48     ` Steve Wahl
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Hansen @ 2019-09-23 21:20 UTC (permalink / raw)
  To: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Juergen Gross, Kirill A. Shutemov,
	Brijesh Singh, Jordan Borgner, Feng Tang, linux-kernel, Chao Fan,
	Zhenzhong Duan
  Cc: Baoquan He, russ.anderson, dimitri.sivanich, mike.travis

On 9/23/19 11:15 AM, Steve Wahl wrote:
> The kernel image map is created using PMD pages, which can include
> some extra space beyond what's actually needed.  Round the size of the
> memory hole we search for up to the next PMD boundary, to be certain
> all of the space to be mapped is usable RAM and includes no reserved
> areas.

This looks good.  It also fully closes any possibility that anyone's
future hardware will hit issues like this as long as they mark the
memory reserved, right?

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>


* Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-23 18:49   ` hpa
@ 2019-09-24 16:32     ` Steve Wahl
  0 siblings, 0 replies; 10+ messages in thread
From: Steve Wahl @ 2019-09-24 16:32 UTC (permalink / raw)
  To: hpa
  Cc: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	Juergen Gross, Kirill A. Shutemov, Brijesh Singh, Jordan Borgner,
	Feng Tang, linux-kernel, Chao Fan, Zhenzhong Duan, Baoquan He,
	russ.anderson, dimitri.sivanich, mike.travis, Dave Hansen

On Mon, Sep 23, 2019 at 11:49:44AM -0700, hpa@zytor.com wrote:
> On September 23, 2019 11:15:20 AM PDT, Steve Wahl <steve.wahl@hpe.com> wrote:
> >Our hardware (UV aka Superdome Flex) has address ranges marked
> >reserved by the BIOS. Access to these ranges is caught as an error,
> >causing the BIOS to halt the system.
> >
> >[...]
> 
> What does your MTRR setup look like, and what memory map do you
> present, in exact detail?

We set the MTRR default to writeback cacheable, and then use variable
MTRRs to mark certain parts of the address map as uncacheable.  The
uncacheable regions are at the top of the address map (AEP mailboxes,
PCIe config segments other than segment zero, MMRs, 64-bit MMIO) and
from 2GB to 4GB.  There are also uncacheable ranges below 1MB
controlled by the fixed MTRRs.

Details from the e820 map:

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000005efff] usable
[    0.000000] BIOS-e820: [mem 0x000000000005f000-0x000000000005ffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000060000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000072c4dfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000072c4e000-0x0000000073a2ffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000073a30000-0x000000007445ffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000074460000-0x000000007599ffff] ACPI data
[    0.000000] BIOS-e820: [mem 0x00000000759a0000-0x0000000075aecfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000075aed000-0x0000000075aedfff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000075aee000-0x0000000075b09fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000075b0a000-0x0000000075d49fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000075d4a000-0x0000000079201fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000079202000-0x0000000079202fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000079203000-0x000000007bffffff] usable
[    0.000000] BIOS-e820: [mem 0x000000007c000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000002f7fffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000002f80000000-0x000000303fffffff] reserved *
[    0.000000] BIOS-e820: [mem 0x0000003040000000-0x0000005f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000005f7c000000-0x000000603fffffff] reserved *
[    0.000000] BIOS-e820: [mem 0x0000006040000000-0x0000008f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000008f7c000000-0x000000903fffffff] reserved *
[    0.000000] BIOS-e820: [mem 0x0000009040000000-0x000000bf7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x000000bf7c000000-0x000000c03fffffff] reserved *
[    0.000000] BIOS-e820: [mem 0x000000c040000000-0x000000ef7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x000000ef7c000000-0x000000f03fffffff] reserved *
[    0.000000] BIOS-e820: [mem 0x000000f040000000-0x0000011f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000011f7c000000-0x000001203fffffff] reserved *
[    0.000000] BIOS-e820: [mem 0x0000012040000000-0x0000014f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000014f7c000000-0x000001503fffffff] reserved *
[    0.000000] BIOS-e820: [mem 0x0000015040000000-0x0000017f7bffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000017f7c000000-0x000001803fffffff] reserved *
[    0.000000] BIOS-e820: [mem 0x00000fff00000000-0x00000fff0fefffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fff10000000-0x00000fff1fefffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fff20000000-0x00000fff2fefffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fff30000000-0x00000fff3fefffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fff40000000-0x00000fff4fefffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fff50000000-0x00000fff5fefffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fff60000000-0x00000fff6fefffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fff70000000-0x00000fff7fefffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fffa0000000-0x00000fffa2ffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fffa4000000-0x00000fffa6ffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fffa8000000-0x00000fffaaffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fffac000000-0x00000fffaeffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fffb0000000-0x00000fffb2ffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fffb4000000-0x00000fffb6ffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fffb8000000-0x00000fffbaffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000fffbc000000-0x00000fffbeffffff] reserved

I've added asterisks after the entries that represent the reserved
areas that weren't being respected.  This map is from a relatively
small system.  The number of these reserved areas and their size vary
based on the number of NUMA nodes and, if I remember correctly, the
amount of memory in each node as well.

> The BIOS is normally expected to mark the relevant ranges as UC in
> the MTRRs (that is the remaining, legitimate usage of MTRRs.)

The MTRRs do not scale for this purpose; there are not enough of them
left over to handle a reserved area for each of the NUMA nodes in a
fully configured system.  Also, the pieces we reserve do not usually
follow the alignment requirements of the MTRRs, so using them would
require tossing out a good deal of usable memory.
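Taking one of the starred regions above as an example: the block at
0x2f80000000 is 3 GiB (0xc0000000 bytes), not a power of two, so
marking it UC takes at least two variable MTRRs even though its base
happens to be 2 GiB aligned.  A rough user-space sketch of the usual
greedy MTRR split (illustrative only, not kernel code; a handful of
such regions would exhaust the ten or so variable MTRRs typical parts
provide):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of variable MTRRs needed to cover [base, base + size) with
 * naturally aligned power-of-two blocks, which is what the hardware
 * requires of each variable-range MTRR.  Standard greedy split.
 */
static int mtrrs_needed(uint64_t base, uint64_t size)
{
	int count = 0;

	while (size) {
		/* Largest power of two that 'base' is aligned to. */
		uint64_t align = base ? (base & -base) : (1ULL << 63);
		/* Largest power of two not exceeding 'size'. */
		uint64_t block = 1ULL << (63 - __builtin_clzll(size));

		if (block > align)
			block = align;
		base += block;
		size -= block;
		count++;
	}
	return count;
}
```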

> I'm somewhat sceptical that chopping off potentially several
> megabytes is a good thing.

I think you may be misreading what the patches do.  They do not change
any allocation of memory, only (1) reduce the valid area of the page
tables used for early mapping of the kernel image down to the
addresses actually occupied by the kernel image (to the granularity of
2MB PMD pages, anyway), and (2) ensure that KASLR doesn't place the
kernel image where the last 2MB page used to map it would also cross a
reserved space boundary (and therefore include some BIOS reserved space
in the mapped region).

If I missed something and I am actually chopping off a bunch of
memory, please let me know.

Later refinement of the page tables set up in early boot has not
changed.

> We also have the memory type interfaces which can be used to map
> these as UC in the page tables.

Those are set up much later; this problem occurs very early in booting. As
far as I have seen, all of the later code that sets up memory respects
the EFI and/or E820 tables the BIOS passes in.  This early boot code
did not, and left a small window where the reserved addresses get
mapped and cause problems.

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise


* Re: [PATCH v2 2/2] x86/boot/64: round memory hole size up to next PMD page.
  2019-09-23 21:20   ` Dave Hansen
@ 2019-09-24 19:48     ` Steve Wahl
  0 siblings, 0 replies; 10+ messages in thread
From: Steve Wahl @ 2019-09-24 19:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Juergen Gross, Kirill A. Shutemov,
	Brijesh Singh, Jordan Borgner, Feng Tang, linux-kernel, Chao Fan,
	Zhenzhong Duan, Baoquan He, russ.anderson, dimitri.sivanich,
	mike.travis

On Mon, Sep 23, 2019 at 02:20:35PM -0700, Dave Hansen wrote:
> On 9/23/19 11:15 AM, Steve Wahl wrote:
> > The kernel image map is created using PMD pages, which can include
> > some extra space beyond what's actually needed.  Round the size of the
> > memory hole we search for up to the next PMD boundary, to be certain
> > all of the space to be mapped is usable RAM and includes no reserved
> > areas.
> 
> This looks good.  It also fully closes any possibility that anyone's
> future hardware will hit issues like this as long as they mark the
> memory reserved, right?

I believe that is true.  Thanks!

> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise


* Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-23 21:19   ` Dave Hansen
@ 2019-09-24 20:04     ` Steve Wahl
  2019-09-24 20:15       ` Dave Hansen
  0 siblings, 1 reply; 10+ messages in thread
From: Steve Wahl @ 2019-09-24 20:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Steve Wahl, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Juergen Gross, Kirill A. Shutemov,
	Brijesh Singh, Jordan Borgner, Feng Tang, linux-kernel, Chao Fan,
	Zhenzhong Duan, Baoquan He, russ.anderson, dimitri.sivanich,
	mike.travis

On Mon, Sep 23, 2019 at 02:19:46PM -0700, Dave Hansen wrote:
> On 9/23/19 11:15 AM, Steve Wahl wrote:
> >  	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> > -	for (i = 0; i < PTRS_PER_PMD; i++) {
> > +	for (i = 0; i < pmd_index((unsigned long)_text); i++)
> > +		pmd[i] &= ~_PAGE_PRESENT;
> > +
> > +	for (; i <= pmd_index((unsigned long)_end); i++)
> >  		if (pmd[i] & _PAGE_PRESENT)
> >  			pmd[i] += load_delta;
> > -	}
> > +
> > +	for (; i < PTRS_PER_PMD; i++)
> > +		pmd[i] &= ~_PAGE_PRESENT;
> 
> This is looking a bunch better.  The broken-up loop could probably use
> some comments, or you could combine it back to a single loop like this:
> 
> 	int text_start_pmd_index = pmd_index((unsigned long)_text);
> 	int text_end_pmd_index   = pmd_index((unsigned long)_end);
> 
> 	for (i = 0; i < PTRS_PER_PMD; i++) {
> 		if ((i < text_start_pmd_index) ||
> 		    (i > text_end_pmd_index)) {
> 			/* Unmap entries not mapping the kernel image */
> 			pmd[i] &= ~_PAGE_PRESENT;
> 		} else if (pmd[i] & _PAGE_PRESENT) {
>  			pmd[i] += load_delta;
> 		}
> 	}

That's funny: I wrote it very close to that way initially, and
rewrote it in the broken-up loop style because I thought that better
conveyed my intention.  (Mark pages before the kernel image as
invalid, fix up the pages that are in the kernel image range, mark
pages after the kernel image as invalid.)

Given that, would you feel better with A) the way I have it, B) your
rewrite, or C) with an inline comment for each part of the loop:

	pmd = fixup_pointer(level2_kernel_pgt, physaddr);

	/* invalidate pages before the kernel image */
	for (i = 0; i < pmd_index((unsigned long)_text); i++)
		pmd[i] &= ~_PAGE_PRESENT;

	/* fixup pages that are part of the kernel image */
	for (; i <= pmd_index((unsigned long)_end); i++)
		if (pmd[i] & _PAGE_PRESENT)
			pmd[i] += load_delta;

	/* invalidate pages after the kernel image */
	for (; i < PTRS_PER_PMD; i++)
		pmd[i] &= ~_PAGE_PRESENT;

> Although I'd prefer it get commented or rewritten, it's passable like
> this, so:
> 
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks for your input!  

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise


* Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
  2019-09-24 20:04     ` Steve Wahl
@ 2019-09-24 20:15       ` Dave Hansen
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Hansen @ 2019-09-24 20:15 UTC (permalink / raw)
  To: Steve Wahl
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Juergen Gross, Kirill A. Shutemov, Brijesh Singh,
	Jordan Borgner, Feng Tang, linux-kernel, Chao Fan,
	Zhenzhong Duan, Baoquan He, russ.anderson, dimitri.sivanich,
	mike.travis

On 9/24/19 1:04 PM, Steve Wahl wrote:
> Given that, would you feel better with A) the way I have it, B) your
> rewrite, or C) with an inline comment for each part of the loop:
> 
> 	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> 
> 	/* invalidate pages before the kernel image */
> 	for (i = 0; i < pmd_index((unsigned long)_text); i++)
> 		pmd[i] &= ~_PAGE_PRESENT;
> 
> 	/* fixup pages that are part of the kernel image */
> 	for (; i <= pmd_index((unsigned long)_end); i++)
> 		if (pmd[i] & _PAGE_PRESENT)
> 			pmd[i] += load_delta;
> 
> 	/* invalidate pages after the kernel image */
> 	for (; i < PTRS_PER_PMD; i++)
> 		pmd[i] &= ~_PAGE_PRESENT;

I don't feel super strongly either way, but I'd prefer doing B or C over
nothing.

