All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86/ebda: If the EBDA is in lowmem, reserve only 4k for the EBDA
@ 2016-07-21  1:32 Andy Lutomirski
  2016-07-21  8:14 ` [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2016-07-21  1:32 UTC (permalink / raw)
  To: H. Peter Anvin, x86
  Cc: Mario Limonciello, Matthew Garrett, linux-kernel, Andy Lutomirski

Under some conditions, my Dell XPS 13 9350 puts the EBDA at 0x2c000
but reports the lowmem cutoff as 0.  The old code reserves
everything above 0x2c000 and I can't boot [1].

Due to an old quirk, we assume that lowmem ends at 0x9f000.  Nonetheless,
the old code would reserve everything from 0x2c000 to 0xfffff.

Be a little less conservative: when the EBDA is in lowmem, reserve
4k the EBDA and reserve highmem separately.  On my laptop, this ends
up being more or less a no-op: the EBDA shows up as a single-page of
runtime data in the EFI memmap.  Go Dell for getting this right.

[1] This only breaks boot in practice when some other firmware or
    GRUB oddity that I don't fully understand kicks in causing the
    memory below 0x2c000 to be unusable.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

This is intentionally not tagged for -stable.  I think it's -stable
material *eventually*, but the problem that's fixed is not widespread
(it's apparently just me for now) and there's plenty of potential to
regress something that was worked around by the old code.

arch/x86/kernel/ebda.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ebda.c b/arch/x86/kernel/ebda.c
index afe65dffee80..2183a7eac646 100644
--- a/arch/x86/kernel/ebda.c
+++ b/arch/x86/kernel/ebda.c
@@ -62,10 +62,12 @@ void __init reserve_ebda_region(void)
 	if (lowmem < INSANE_CUTOFF)
 		lowmem = LOWMEM_CAP;
 
-	/* Use the lower of the lowmem and EBDA markers as the cutoff */
-	lowmem = min(lowmem, ebda_addr);
 	lowmem = min(lowmem, LOWMEM_CAP); /* Absolute cap */
 
 	/* reserve all memory between lowmem and the 1MB mark */
 	memblock_reserve(lowmem, 0x100000 - lowmem);
+
+	/* if the EBDA is in lowmem, reserve it separately. */
+	if (ebda_addr < lowmem)
+		memblock_reserve(ebda_addr, 4096);
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21  1:32 [PATCH] x86/ebda: If the EBDA is in lowmem, reserve only 4k for the EBDA Andy Lutomirski
@ 2016-07-21  8:14 ` Ingo Molnar
  2016-07-21  8:29   ` H. Peter Anvin
                     ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Ingo Molnar @ 2016-07-21  8:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, x86, Mario Limonciello, Matthew Garrett,
	linux-kernel, Kees Cook, Linus Torvalds, Thomas Gleixner,
	Andrew Morton, Peter Zijlstra


* Andy Lutomirski <luto@kernel.org> wrote:

> Under some conditions, my Dell XPS 13 9350 puts the EBDA at 0x2c000
> but reports the lowmem cutoff as 0.  The old code reserves
> everything above 0x2c000 and I can't boot [1].

> [1] This only breaks boot in practice when some other firmware or
>     GRUB oddity that I don't fully understand kicks in causing the
>     memory below 0x2c000 to be unusable.

Exactly why can't Linux boot if *more* memory is reserved? Is it perhaps the SMP 
trampoline that cannot be allocated? Or does this shift some firmware memory 
corruption pattern that makes it more lethal than it already is?

Without more information we don't really know the root cause here and I'm hesitant 
to touch such really old quirks without first understanding what's going on on 
your machine.

Is the boot failure deterministic - if yes, could you try to dig a bit more into 
this?

My guess it's the SMP trampoline, and I think we should robustify that in a 
different way: lets put it aside very early as a reservation (possibly in this 
very function), to guarantee that we have a below 1MB buffer for the SMP 
trampoline. This would be a lot more robust ...

> --- a/arch/x86/kernel/ebda.c
> +++ b/arch/x86/kernel/ebda.c
> @@ -62,10 +62,12 @@ void __init reserve_ebda_region(void)
>  	if (lowmem < INSANE_CUTOFF)
>  		lowmem = LOWMEM_CAP;
>  
> -	/* Use the lower of the lowmem and EBDA markers as the cutoff */
> -	lowmem = min(lowmem, ebda_addr);
>  	lowmem = min(lowmem, LOWMEM_CAP); /* Absolute cap */
>  
>  	/* reserve all memory between lowmem and the 1MB mark */
>  	memblock_reserve(lowmem, 0x100000 - lowmem);
> +
> +	/* if the EBDA is in lowmem, reserve it separately. */
> +	if (ebda_addr < lowmem)
> +		memblock_reserve(ebda_addr, 4096);

Ugh, this code is a mess, and your fix does not help make it better:

- Firstly, could we please organize this code so that we first calculate ebda_addr
  and then lowmem - not this interleaved mess of first lowmem, then ebda_addr, 
  then lowmem tweaks...

- Also, could we please rename things to make it more obvious what's going on, and 
  then apply changes on top? Here's a list of problems:

- 'lowmem' here means 'super low mem' - i.e. 16-bit addressable memory. In other 
  parts of the x86 code 'lowmem' means 32-bit addressable memory... This makes it 
  super confusing to read.

- It does not help at all that we have various memory range markers, half of which 
  are 'start of range', half of which are 'end of range' - but this crucial 
  property is not obvious in the naming at all ... gave me a headache trying to 
  understand all this.

- Also, the 'ebda_addr' name sucks: it highlights that it's an address (which is 
  obvious, all values here are addresses!), while it does not highlight that it's 
  the start of the EBDA region ...

- 'BIOS_LOWMEM_KILOBYTES' says a lot of things, except that this is the only value
  that is a pointer to a value, not a memory range address.

- The function name itself is a misnomer: it says 'reserve_ebda_region()' while
  its main purpose is to reserve all the firmware ROM typically between 640K and
  1MB, while the 'EBDA' part is only a small part of that ... It should be renamed 
  to something like reserve_bios_regions(). (The plural is intentional, to signal
  that this is potentially about multiple memory regions.)

- Likewise, the paravirt quirk flag name 'ebda_search' is misleading as well: this 
  too should be about whether to reserve firmware areas in the paravirt case. 
  Something like ::reserve_bios_regions would work for me.

- In fact thinking about this as 'end of RAM' is confusing: what this function 
  *really* wants to reserve is firmware data and code areas! Once the thinking is 
  inverted from a mixed 'ram' and 'reserved firmware area' notion to a pure 
  'reserved area' notion everything becomes a lot clearer.

So something like this, names ordered in (probable) address order:

	BIOS_RAM_SIZE_KB_PTR		// was: BIOS_LOWMEM_KILOBYTES

	BIOS_RAM_END_MIN		// was: INSANE_CUTOFF

	ebda_start			// was: ebda_addr

	bios_ram_end			// was: lowmem

	BIOS_RAM_END_MAX		// was: LOWMEM_CAP

would work for me. Also the comments could need some fixing, refreshing and 
general harmonization.

Blah - the patch below does this all (untested). Just look how much more readable 
it all became!

In fact I believe we should try another patch on top of this and massively 
simplify the whole EBDA mess:

 - Find a suitable address for the SMP trampoline and remember it for the
   SMP boot code.

 - Reserve *ALL* of the first 2MB of RAM.

 - (are there any other pieces of kernel code that require memory below 1MB/2MB?)

The thing is, x86 CPUs are already treating the first 1MB of RAM in a special way: 
for example 2MB TLBs get split up into 4K TLBs regardless of what the pagetable or 
the MTRRs say. So it's a substandard piece of RAM already that we want to skip for 
performance reasons.

Furthermore we had various problems in the past where firmware corrupted the first 
1MB of RAM.

So could we please just get rid of this class of problems and reserve it all?

The only thing we need from there is a suitable place for the SMP trampoline. We 
should try to allocate that somewhere in the middle of the (suspected...) 
available RAM range based on BIOS_RAM_SIZE_KB_PTR (no EBDA quirking at this 
point), where the probability of firmware interference is the lowest.

If the EBDA pointer happens to point to the exact same location we should bump the 
trampoline address down so that it's below the EBDA. (and should generate a 
low-key KERN_INFO printk that we've done so.)

This strategy should solve your boot problem too I believe.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21  8:14 ` [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code Ingo Molnar
@ 2016-07-21  8:29   ` H. Peter Anvin
  2016-07-21  9:11     ` Ingo Molnar
  2016-07-21  9:14     ` Ingo Molnar
  2016-07-21  8:31   ` Ingo Molnar
  2016-07-21 14:58   ` Andy Lutomirski
  2 siblings, 2 replies; 16+ messages in thread
From: H. Peter Anvin @ 2016-07-21  8:29 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: x86, Mario Limonciello, Matthew Garrett, linux-kernel, Kees Cook,
	Linus Torvalds, Thomas Gleixner, Andrew Morton, Peter Zijlstra

We already reserve the first megabyte by default.  There is something really bizarre about this; the bug report simply doesn't seem to make any sense.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21  8:14 ` [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code Ingo Molnar
  2016-07-21  8:29   ` H. Peter Anvin
@ 2016-07-21  8:31   ` Ingo Molnar
  2016-07-21 14:58   ` Andy Lutomirski
  2 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2016-07-21  8:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, x86, Mario Limonciello, Matthew Garrett,
	linux-kernel, Kees Cook, Linus Torvalds, Thomas Gleixner,
	Andrew Morton, Peter Zijlstra


* Ingo Molnar <mingo@kernel.org> wrote:

> Blah - the patch below does this all (untested). Just look how much more readable 
> it all became!

I guess it helps if I attach the patch for real :) ...

=====================>
>From edce21216a8887bf06ba85ee49a00695e44c4341 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 21 Jul 2016 09:53:52 +0200
Subject: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code

So the reserve_ebda_region() code has accumulated a number of
problems over the years that make it really difficult to read
and understand:

- The calculation of 'lowmem' and 'ebda_addr' is an unnecessarily
  interleaved mess of first lowmem, then ebda_addr, then lowmem tweaks...

- 'lowmem' here means 'super low mem' - i.e. 16-bit addressable memory. In other
  parts of the x86 code 'lowmem' means 32-bit addressable memory... This makes it
  super confusing to read.

- It does not help at all that we have various memory range markers, half of which
  are 'start of range', half of which are 'end of range' - but this crucial
  property is not obvious in the naming at all ... gave me a headache trying to
  understand all this.

- Also, the 'ebda_addr' name sucks: it highlights that it's an address (which is
  obvious, all values here are addresses!), while it does not highlight that it's
  the _start_ of the EBDA region ...

- 'BIOS_LOWMEM_KILOBYTES' says a lot of things, except that this is the only value
  that is a pointer to a value, not a memory range address!

- The function name itself is a misnomer: it says 'reserve_ebda_region()' while
  its main purpose is to reserve all the firmware ROM typically between 640K and
  1MB, while the 'EBDA' part is only a small part of that ...

- Likewise, the paravirt quirk flag name 'ebda_search' is misleading as well: this
  too should be about whether to reserve firmware areas in the paravirt case.

- In fact thinking about this as 'end of RAM' is confusing: what this function
  *really* wants to reserve is firmware data and code areas! Once the thinking is
  inverted from a mixed 'ram' and 'reserved firmware area' notion to a pure
  'reserved area' notion everything becomes a lot clearer.

To improve all this rewrite the whole code (without changing the logic):

- Firstly invert the naming from 'lowmem end' to 'BIOS reserved area start'
  and propagate this concept through all the variable names and constants.

	BIOS_RAM_SIZE_KB_PTR		// was: BIOS_LOWMEM_KILOBYTES

	BIOS_START_MIN			// was: INSANE_CUTOFF

	ebda_start			// was: ebda_addr
	bios_start			// was: lowmem

	BIOS_START_MAX			// was: LOWMEM_CAP

- Then clean up the name of the function itself by renaming it
  to reserve_bios_regions() and renaming the ::ebda_search paravirt
  flag to ::reserve_bios_regions.

- Fix up all the comments (fix typos), harmonize and simplify their
  formulation and remove comments that become unnecessary due to
  the much better naming all around.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/bios_ebda.h  |   2 +-
 arch/x86/include/asm/x86_init.h   |   4 +-
 arch/x86/kernel/ebda.c            | 124 +++++++++++++++++++++++++-------------
 arch/x86/kernel/head32.c          |   2 +-
 arch/x86/kernel/head64.c          |   2 +-
 arch/x86/kernel/platform-quirks.c |   4 +-
 6 files changed, 88 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/bios_ebda.h b/arch/x86/include/asm/bios_ebda.h
index 2b00c776f223..4b7b8e71607e 100644
--- a/arch/x86/include/asm/bios_ebda.h
+++ b/arch/x86/include/asm/bios_ebda.h
@@ -17,7 +17,7 @@ static inline unsigned int get_bios_ebda(void)
 	return address;	/* 0 means none */
 }
 
-void reserve_ebda_region(void);
+void reserve_bios_regions(void);
 
 #ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
 /*
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 4dcdf74dfed8..c519c052700a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -168,14 +168,14 @@ struct x86_legacy_devices {
  * struct x86_legacy_features - legacy x86 features
  *
  * @rtc: this device has a CMOS real-time clock present
- * @ebda_search: it's safe to search for the EBDA signature in the hardware's
+ * @reserve_bios_regions: it's safe to search for the EBDA signature in the hardware's
  * 	low RAM
  * @devices: legacy x86 devices, refer to struct x86_legacy_devices
  * 	documentation for further details.
  */
 struct x86_legacy_features {
 	int rtc;
-	int ebda_search;
+	int reserve_bios_regions;
 	struct x86_legacy_devices devices;
 };
 
diff --git a/arch/x86/kernel/ebda.c b/arch/x86/kernel/ebda.c
index afe65dffee80..6219eef20e2e 100644
--- a/arch/x86/kernel/ebda.c
+++ b/arch/x86/kernel/ebda.c
@@ -6,66 +6,104 @@
 #include <asm/bios_ebda.h>
 
 /*
+ * This function reserves all conventional PC system BIOS related
+ * firmware memory areas (some of which are data, some of which
+ * are code), that must not be used by the kernel as available
+ * RAM.
+ *
  * The BIOS places the EBDA/XBDA at the top of conventional
  * memory, and usually decreases the reported amount of
- * conventional memory (int 0x12) too. This also contains a
- * workaround for Dell systems that neglect to reserve EBDA.
- * The same workaround also avoids a problem with the AMD768MPX
- * chipset: reserve a page before VGA to prevent PCI prefetch
- * into it (errata #56). Usually the page is reserved anyways,
- * unless you have no PS/2 mouse plugged in.
+ * conventional memory (int 0x12) too.
+ *
+ * This means that as a first approximation on most systems we can
+ * guess the reserved BIOS area by looking at the low BIOS RAM size
+ * value and assume that everything above that value (up to 1MB) is
+ * reserved.
+ *
+ * But life in firmware country is not that simple:
+ *
+ * - This code also contains a quirk for Dell systems that neglect
+ *   to reserve the EBDA area in the 'RAM size' value ...
+ *
+ * - The same quirk also avoids a problem with the AMD768MPX
+ *   chipset: reserve a page before VGA to prevent PCI prefetch
+ *   into it (errata #56). (Usually the page is reserved anyways,
+ *   unless you have no PS/2 mouse plugged in.)
+ *
+ * - Plus paravirt systems don't have a reliable value in the
+ *   'BIOS RAM size' pointer we can rely on, so we must quirk
+ *   them too.
+ *
+ * Due to those various problems this function is deliberately
+ * very conservative and tries to err on the side of reserving
+ * too much, to not risk reserving too little.
+ *
+ * Losing a small amount of memory in the bottom megabyte is
+ * rarely a problem, as long as we have enough memory to install
+ * the SMP bootup trampoline which *must* be in this area.
  *
- * This functions is deliberately very conservative.  Losing
- * memory in the bottom megabyte is rarely a problem, as long
- * as we have enough memory to install the trampoline.  Using
- * memory that is in use by the BIOS or by some DMA device
- * the BIOS didn't shut down *is* a big problem.
+ * Using memory that is in use by the BIOS or by some DMA device
+ * the BIOS didn't shut down *is* a big problem to the kernel,
+ * obviously.
  */
 
-#define BIOS_LOWMEM_KILOBYTES	0x413
-#define LOWMEM_CAP		0x9f000U	/* Absolute maximum */
-#define INSANE_CUTOFF		0x20000U	/* Less than this = insane */
+#define BIOS_RAM_SIZE_KB_PTR	0x413
 
-void __init reserve_ebda_region(void)
+#define BIOS_START_MIN		0x20000U	/* 128K, less than this is insane */
+#define BIOS_START_MAX		0x9f000U	/* 640K, absolute maximum */
+
+void __init reserve_bios_regions(void)
 {
-	unsigned int lowmem, ebda_addr;
+	unsigned int bios_start, ebda_start;
 
 	/*
-	 * To determine the position of the EBDA and the
-	 * end of conventional memory, we need to look at
-	 * the BIOS data area. In a paravirtual environment
-	 * that area is absent. We'll just have to assume
-	 * that the paravirt case can handle memory setup
-	 * correctly, without our help.
+	 * NOTE: In a paravirtual environment the BIOS reserved
+	 * area is absent. We'll just have to assume that the
+	 * paravirt case can handle memory setup correctly,
+	 * without our help.
 	 */
-	if (!x86_platform.legacy.ebda_search)
+	if (!x86_platform.legacy.reserve_bios_regions)
 		return;
 
-	/* end of low (conventional) memory */
-	lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
-	lowmem <<= 10;
-
-	/* start of EBDA area */
-	ebda_addr = get_bios_ebda();
+	/* Get the start address of the EBDA page: */
+	ebda_start = get_bios_ebda();
 
 	/*
-	 * Note: some old Dells seem to need 4k EBDA without
-	 * reporting so, so just consider the memory above 0x9f000
-	 * to be off limits (bugzilla 2990).
+	 * Quirk: some old Dells seem to have a 4k EBDA without
+	 * reporting so in their BIOS RAM size value, so just
+	 * consider the memory above 640K to be off limits
+	 * (bugzilla 2990).
+	 *
+	 * We detect this case by filtering for nonsensical EBDA
+	 * addresses below 128K, where we can assume that they
+	 * are bogus and bump it up to a fixed 640K value:
 	 */
+	if (ebda_start < BIOS_START_MIN)
+		ebda_start = BIOS_START_MAX;
 
-	/* If the EBDA address is below 128K, assume it is bogus */
-	if (ebda_addr < INSANE_CUTOFF)
-		ebda_addr = LOWMEM_CAP;
+	/*
+	 * BIOS RAM size is encoded in kilobytes, convert it
+	 * to bytes to get a first guess at where the BIOS
+	 * firmware area starts:
+	 */
+	bios_start = *(unsigned short *)__va(BIOS_RAM_SIZE_KB_PTR);
+	bios_start <<= 10;
 
-	/* If lowmem is less than 128K, assume it is bogus */
-	if (lowmem < INSANE_CUTOFF)
-		lowmem = LOWMEM_CAP;
+	/*
+	 * If bios_start is less than 128K, assume it is bogus
+	 * and bump it up to 640K:
+	 */
+	if (bios_start < BIOS_START_MIN)
+		bios_start = BIOS_START_MAX;
 
-	/* Use the lower of the lowmem and EBDA markers as the cutoff */
-	lowmem = min(lowmem, ebda_addr);
-	lowmem = min(lowmem, LOWMEM_CAP); /* Absolute cap */
+	/*
+	 * Use the lower of the bios_start and ebda_start
+	 * as the starting point, but don't allow it to
+	 * go beyond 640K:
+	 */
+	bios_start = min(bios_start, ebda_start);
+	bios_start = min(bios_start, BIOS_START_MAX);
 
-	/* reserve all memory between lowmem and the 1MB mark */
-	memblock_reserve(lowmem, 0x100000 - lowmem);
+	/* Reserve all memory between bios_start and the 1MB mark: */
+	memblock_reserve(bios_start, 0x100000 - bios_start);
 }
diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index d784bb547a9d..2dda0bc4576e 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -26,7 +26,7 @@ static void __init i386_default_early_setup(void)
 	x86_init.resources.reserve_resources = i386_reserve_resources;
 	x86_init.mpparse.setup_ioapic_ids = setup_ioapic_ids_from_mpc;
 
-	reserve_ebda_region();
+	reserve_bios_regions();
 }
 
 asmlinkage __visible void __init i386_start_kernel(void)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b72fb0b71dd1..99d48e7d2974 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -183,7 +183,7 @@ void __init x86_64_start_reservations(char *real_mode_data)
 		copy_bootdata(__va(real_mode_data));
 
 	x86_early_init_platform_quirks();
-	reserve_ebda_region();
+	reserve_bios_regions();
 
 	switch (boot_params.hdr.hardware_subarch) {
 	case X86_SUBARCH_INTEL_MID:
diff --git a/arch/x86/kernel/platform-quirks.c b/arch/x86/kernel/platform-quirks.c
index b2f8a33b36ff..24a50301f150 100644
--- a/arch/x86/kernel/platform-quirks.c
+++ b/arch/x86/kernel/platform-quirks.c
@@ -7,12 +7,12 @@
 void __init x86_early_init_platform_quirks(void)
 {
 	x86_platform.legacy.rtc = 1;
-	x86_platform.legacy.ebda_search = 0;
+	x86_platform.legacy.reserve_bios_regions = 0;
 	x86_platform.legacy.devices.pnpbios = 1;
 
 	switch (boot_params.hdr.hardware_subarch) {
 	case X86_SUBARCH_PC:
-		x86_platform.legacy.ebda_search = 1;
+		x86_platform.legacy.reserve_bios_regions = 1;
 		break;
 	case X86_SUBARCH_XEN:
 	case X86_SUBARCH_LGUEST:

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21  8:29   ` H. Peter Anvin
@ 2016-07-21  9:11     ` Ingo Molnar
  2016-07-21 12:32       ` H. Peter Anvin
  2016-07-21  9:14     ` Ingo Molnar
  1 sibling, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2016-07-21  9:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, x86, Mario Limonciello, Matthew Garrett,
	linux-kernel, Kees Cook, Linus Torvalds, Thomas Gleixner,
	Andrew Morton, Peter Zijlstra


* H. Peter Anvin <hpa@zytor.com> wrote:

> We already reserve the first megabyte by default.  There is something really 
> bizarre about this; the bug report simply doesn't seem to make any sense.

Maybe the first megabyte reservation happens after the SMP trampoline allocation 
and hence the EBDA/BIOS reservation matters to our ability to allocate the SMP 
trampoline specifically?

Anyway, agreed that we need more information.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21  8:29   ` H. Peter Anvin
  2016-07-21  9:11     ` Ingo Molnar
@ 2016-07-21  9:14     ` Ingo Molnar
  1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2016-07-21  9:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, x86, Mario Limonciello, Matthew Garrett,
	linux-kernel, Kees Cook, Linus Torvalds, Thomas Gleixner,
	Andrew Morton, Peter Zijlstra


* H. Peter Anvin <hpa@zytor.com> wrote:

> We already reserve the first megabyte by default.  There is something really bizarre about this; the bug report simply doesn't seem to make any sense.

Btw., are you sure about this? We do reserve the first 64KB:

config X86_RESERVE_LOW
        int "Amount of low memory, in kilobytes, to reserve for the BIOS"
        default 64
        range 4 640
        ---help---
          Specify the amount of low memory to reserve for the BIOS.

But I cannot find the place that allocates the first 1MB.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21  9:11     ` Ingo Molnar
@ 2016-07-21 12:32       ` H. Peter Anvin
  0 siblings, 0 replies; 16+ messages in thread
From: H. Peter Anvin @ 2016-07-21 12:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, x86, Mario Limonciello, Matthew Garrett,
	linux-kernel, Kees Cook, Linus Torvalds, Thomas Gleixner,
	Andrew Morton, Peter Zijlstra

On July 21, 2016 2:11:32 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* H. Peter Anvin <hpa@zytor.com> wrote:
>
>> We already reserve the first megabyte by default.  There is something
>really 
>> bizarre about this; the bug report simply doesn't seem to make any
>sense.
>
>Maybe the first megabyte reservation happens after the SMP trampoline
>allocation 
>and hence the EBDA/BIOS reservation matters to our ability to allocate
>the SMP 
>trampoline specifically?
>
>Anyway, agreed that we need more information.
>
>Thanks,
>
>	Ingo

Yes, of course - it has to.  However, 0x2c000 although braindead shouldn't be a problem.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21  8:14 ` [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code Ingo Molnar
  2016-07-21  8:29   ` H. Peter Anvin
  2016-07-21  8:31   ` Ingo Molnar
@ 2016-07-21 14:58   ` Andy Lutomirski
  2016-07-21 16:18     ` Ingo Molnar
  2 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2016-07-21 14:58 UTC (permalink / raw)
  To: Ingo Molnar, Matt Fleming
  Cc: Thomas Gleixner, Mario Limonciello, Kees Cook, linux-kernel,
	Andrew Morton, Matthew Garrett, Peter Zijlstra, X86 ML,
	H. Peter Anvin, Linus Torvalds

On Jul 21, 2016 1:14 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
> > Under some conditions, my Dell XPS 13 9350 puts the EBDA at 0x2c000
> > but reports the lowmem cutoff as 0.  The old code reserves
> > everything above 0x2c000 and I can't boot [1].
>
> > [1] This only breaks boot in practice when some other firmware or
> >     GRUB oddity that I don't fully understand kicks in causing the
> >     memory below 0x2c000 to be unusable.
>
> Exactly why can't Linux boot if *more* memory is reserved? Is it perhaps the SMP
> trampoline that cannot be allocated?

Yes, exactly.

>
> Is the boot failure deterministic - if yes, could you try to dig a bit more into
> this?

It's mostly deterministic.  I hit it every time if I use Dell's latest
BIOS (1.4.4), enable SGX in BIOS (no SGX kernel patches involved), and
boot using Fedora's grub2-efi on the hard disk.  I don't hit it on a
USB stick or if I boot using the EFI stub via the EFI shell.  Using
EFI shell causes 1000-27fff to be conventional memory instead of boot
data -- see below.

Here's my memory map:

[    0.000000] efi: mem00: [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|
WC|UC] range=[0x0000000000000000-0x0000000000000fff] (0MB)
[    0.000000] efi: mem01: [Boot Data          |   |  |  |  |  |  |  |
  |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000027fff] (0MB)
[    0.000000] efi: mem02: [Loader Data        |   |  |  |  |  |  |  |
  |WB|WT|WC|UC] range=[0x0000000000028000-0x0000000000029fff] (0MB)
[    0.000000] efi: mem03: [Reserved           |   |  |  |  |  |  |  |
  |WB|WT|WC|UC] range=[0x000000000002a000-0x000000000002bfff] (0MB)
[    0.000000] efi: mem04: [Runtime Data       |RUN|  |  |  |  |  |  |
  |WB|WT|WC|UC] range=[0x000000000002c000-0x000000000002cfff] (0MB)
[    0.000000] efi: mem05: [Loader Data        |   |  |  |  |  |  |  |
  |WB|WT|WC|UC] range=[0x000000000002d000-0x000000000002dfff] (0MB)
[    0.000000] efi: mem06: [Conventional Memory|   |  |  |  |  |  |  |
  |WB|WT|WC|UC] range=[0x000000000002e000-0x0000000000057fff] (0MB)
[    0.000000] efi: mem07: [Reserved           |   |  |  |  |  |  |  |
  |WB|WT|WC|UC] range=[0x0000000000058000-0x0000000000058fff] (0MB)
[    0.000000] efi: mem08: [Conventional Memory|   |  |  |  |  |  |  |
  |WB|WT|WC|UC] range=[0x0000000000059000-0x000000000009ffff] (0MB)
[

The EFI quirk to reserve boot data kills 1000-27fff.  The EBDA
reservation code kills the rest, leaving no <1MB memory at all.

>
> My guess it's the SMP trampoline, and I think we should robustify that in a
> different way: lets put it aside very early as a reservation (possibly in this
> very function), to guarantee that we have a below 1MB buffer for the SMP
> trampoline. This would be a lot more robust ...
>

If we really want to robustify that, I would suggest that we change
the way that the trampoline works.  In particular, I don't see any
reason why we need to call setup_real_mode until we're actually ready
to initialize APs, and we should be done with the boot services data
quirk by then (am I right, Matt?).  So if we can get the allocation
code right, we shouldn't have any problem putting the trampoline in
the boot services range.

It would be very easy to implement this if we could handle overlapping
memblocks precisely or set a lower limit on the memblock allocator.
Then we could block off everything below 1MB or 2MB very early and
then unblock it or temporarily change the lower limit and ask for a
single page for the trampoline after that.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21 14:58   ` Andy Lutomirski
@ 2016-07-21 16:18     ` Ingo Molnar
  2016-07-21 21:08       ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2016-07-21 16:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matt Fleming, Thomas Gleixner, Mario Limonciello, Kees Cook,
	linux-kernel, Andrew Morton, Matthew Garrett, Peter Zijlstra,
	X86 ML, H. Peter Anvin, Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> It would be very easy to implement this if we could handle overlapping memblocks 
> precisely or set a lower limit on the memblock allocator. Then we could block 
> off everything below 1MB or 2MB very early and then unblock it or temporarily 
> change the lower limit and ask for a single page for the trampoline after that.

So my suggestion was/is to _permanently_ allocate the SMP trampoline page, and 
leave it also reserved.

'Reserving' a memory area is really just a kernel internal matter. We can still 
use it. No need to unreserve/allocate/re-reserve ... unless I'm missing something.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21 16:18     ` Ingo Molnar
@ 2016-07-21 21:08       ` Andy Lutomirski
  2016-07-21 21:28         ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2016-07-21 21:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matt Fleming, Thomas Gleixner, Mario Limonciello, Kees Cook,
	linux-kernel, Andrew Morton, Matthew Garrett, Peter Zijlstra,
	X86 ML, H. Peter Anvin, Linus Torvalds

On Thu, Jul 21, 2016 at 9:18 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> It would be very easy to implement this if we could handle overlapping memblocks
>> precisely or set a lower limit on the memblock allocator. Then we could block
>> off everything below 1MB or 2MB very early and then unblock it or temporarily
>> change the lower limit and ask for a single page for the trampoline after that.
>
> So my suggestion was/is to _permanently_ allocate the SMP trampoline page, and
> leave it also reserved.
>
> 'Reserving' a memory area is really just a kernel internal matter. We can still
> use it. No need to unreserve/allocate/re-reserve ... unless I'm missing something.
>

I don't think you're missing anything particularly deep.  I'm just
talking about an implementation issue.  We need to make sure that the
page we pick for the trampoline isn't reserved in the memory map or by
some other quirk (including the EBDA).  The kernel currently uses
memblock for this, which means that we should probably play nicely
with the memblock code.

To fix my laptop, though, I think we either need to change the EBDA
reservation (i.e. be willing to pick a page above the EBDA but below
the BIOS) or rework the code so that it can use a BOOT_SERVICES_DATA
page.

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21 21:08       ` Andy Lutomirski
@ 2016-07-21 21:28         ` H. Peter Anvin
  2016-07-21 21:48           ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2016-07-21 21:28 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: Matt Fleming, Thomas Gleixner, Mario Limonciello, Kees Cook,
	linux-kernel, Andrew Morton, Matthew Garrett, Peter Zijlstra,
	X86 ML, Linus Torvalds

On July 21, 2016 2:08:12 PM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>On Thu, Jul 21, 2016 at 9:18 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>> It would be very easy to implement this if we could handle
>overlapping memblocks
>>> precisely or set a lower limit on the memblock allocator. Then we
>could block
>>> off everything below 1MB or 2MB very early and then unblock it or
>temporarily
>>> change the lower limit and ask for a single page for the trampoline
>after that.
>>
>> So my suggestion was/is to _permanently_ allocate the SMP trampoline
>page, and
>> leave it also reserved.
>>
>> 'Reserving' a memory area is really just a kernel internal matter. We
>can still
>> use it. No need to unreserve/allocate/re-reserve ... unless I'm
>missing something.
>>
>
>I don't think you're missing anything particularly deep.  I'm just
>talking about an implementation issue.  We need to make sure that the
>page we pick for the trampoline isn't reserved in the memory map or by
>some other quirk (including the EBDA).  The kernel currently uses
>memblock for this, which means that we should probably play nicely
>with the memblock code.
>
>To fix my laptop, though, I think we either need to change the EBDA
>reservation (i.e. be willing to pick a page above the EBDA but below
>the BIOS) or rework the code so that it can use a BOOT_SERVICES_DATA
>page.
>
>--Andy

Oh... this is booting in EFI mode.  Now it makes a bit more sense.  This is a headache because I believe the same company has put out systems which crash if boot services memory is reclaimed before driver initialization, which is the only reason we can't just treat it as plain memory immediately after invoking ExitBootServices.

In theory EBDA in EFI mode is nonsense, too, but not trying to be smart about it might break some systems due to interaction with ACPI.

The real mode trampoline code isn't all that large; it is fundamentally limited to 64K but I think it is about 24K right now.  This is why we have the 128K "insane" threshold: it is still big enough that we can fit.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21 21:28         ` H. Peter Anvin
@ 2016-07-21 21:48           ` Andy Lutomirski
  2016-07-21 22:45             ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2016-07-21 21:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Matt Fleming, Thomas Gleixner, Mario Limonciello,
	Kees Cook, linux-kernel, Andrew Morton, Matthew Garrett,
	Peter Zijlstra, X86 ML, Linus Torvalds

On Thu, Jul 21, 2016 at 2:28 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On July 21, 2016 2:08:12 PM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>>On Thu, Jul 21, 2016 at 9:18 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>> * Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>>> It would be very easy to implement this if we could handle
>>overlapping memblocks
>>>> precisely or set a lower limit on the memblock allocator. Then we
>>could block
>>>> off everything below 1MB or 2MB very early and then unblock it or
>>temporarily
>>>> change the lower limit and ask for a single page for the trampoline
>>after that.
>>>
>>> So my suggestion was/is to _permanently_ allocate the SMP trampoline
>>page, and
>>> leave it also reserved.
>>>
>>> 'Reserving' a memory area is really just a kernel internal matter. We
>>can still
>>> use it. No need to unreserve/allocate/re-reserve ... unless I'm
>>missing something.
>>>
>>
>>I don't think you're missing anything particularly deep.  I'm just
>>talking about an implementation issue.  We need to make sure that the
>>page we pick for the trampoline isn't reserved in the memory map or by
>>some other quirk (including the EBDA).  The kernel currently uses
>>memblock for this, which means that we should probably play nicely
>>with the memblock code.
>>
>>To fix my laptop, though, I think we either need to change the EBDA
>>reservation (i.e. be willing to pick a page above the EBDA but below
>>the BIOS) or rework the code so that it can use a BOOT_SERVICES_DATA
>>page.
>>
>>--Andy
>
> Oh... this is booting in EFI mode.  Now it makes a bit more sense.  This is a headache because I believe the same company has put out systems which crash if boot services memory is reclaimed before driver initialization, which is the only reason we can't just treat it as plain memory immediately after invoking ExitBootServices.
>
> In theory EBDA in EFI mode is nonsense, too, but not trying to be smart about it might break some systems due to interaction with ACPI.
>

It's easy to make reserve_bios_regions() do nothing if we have an EFI
memory map.  Is there any decent reason *not* to do that?  Or we could
be more conservative: if the EFI memory map reserves 640K-1M and
reserves the EBDA (if any), then don't reserve the range between the
EBDA and 640K.  (That would fix my laptop, too.)

Allowing the trampoline to use boot services addresses is fine, too,
but that looks harder to implement.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21 21:48           ` Andy Lutomirski
@ 2016-07-21 22:45             ` Andy Lutomirski
  2016-07-22 13:00               ` Matt Fleming
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2016-07-21 22:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Matt Fleming, Thomas Gleixner, Mario Limonciello,
	Kees Cook, linux-kernel, Andrew Morton, Matthew Garrett,
	Peter Zijlstra, X86 ML, Linus Torvalds

On Thu, Jul 21, 2016 at 2:48 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jul 21, 2016 at 2:28 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On July 21, 2016 2:08:12 PM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>>>On Thu, Jul 21, 2016 at 9:18 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>>>
>>>> * Andy Lutomirski <luto@amacapital.net> wrote:
>>>>
>>>>> It would be very easy to implement this if we could handle
>>>overlapping memblocks
>>>>> precisely or set a lower limit on the memblock allocator. Then we
>>>could block
>>>>> off everything below 1MB or 2MB very early and then unblock it or
>>>temporarily
>>>>> change the lower limit and ask for a single page for the trampoline
>>>after that.
>>>>
>>>> So my suggestion was/is to _permanently_ allocate the SMP trampoline
>>>page, and
>>>> leave it also reserved.
>>>>
>>>> 'Reserving' a memory area is really just a kernel internal matter. We
>>>can still
>>>> use it. No need to unreserve/allocate/re-reserve ... unless I'm
>>>missing something.
>>>>
>>>
>>>I don't think you're missing anything particularly deep.  I'm just
>>>talking about an implementation issue.  We need to make sure that the
>>>page we pick for the trampoline isn't reserved in the memory map or by
>>>some other quirk (including the EBDA).  The kernel currently uses
>>>memblock for this, which means that we should probably play nicely
>>>with the memblock code.
>>>
>>>To fix my laptop, though, I think we either need to change the EBDA
>>>reservation (i.e. be willing to pick a page above the EBDA but below
>>>the BIOS) or rework the code so that it can use a BOOT_SERVICES_DATA
>>>page.
>>>
>>>--Andy
>>
>> Oh... this is booting in EFI mode.  Now it makes a bit more sense.  This is a headache because I believe the same company has put out systems which crash if boot services memory is reclaimed before driver initialization, which is the only reason we can't just treat it as plain memory immediately after invoking ExitBootServices.
>>
>> In theory EBDA in EFI mode is nonsense, too, but not trying to be smart about it might break some systems due to interaction with ACPI.
>>
>
> It's easy to make reserve_bios_regions() do nothing if we have an EFI
> memory map.  Is there any decent reason *not* to do that?  Or we could
> be more conservative: if the EFI memory map reserves 640K-1M and
> reserves the EBDA (if any), then don't reserve the range between the
> EBDA and 640K.  (That would fix my laptop, too.)
>
> Allowing the trampoline to use boot services addresses is fine, too,
> but that looks harder to implement.

I looked at the code some more.  The boot services quirk is weird and
maybe buggy.  trim_snb_memory uses memblock_reserve to reserve the
bottom 1MB.  If efi_reserve_real_mode has already reserved that range,
then trim_snb_memory's reservation will have no effect because the efi
code will just free it later on.  The same issue will hit any code
that reserves >1MB memory after efi has tried to temporarily reserve
it.

I don't have any great suggestions for cleaning it up.  Perhaps the
efi code should instead skip adding boot services memory to the memory
map in the first place and then add it late and hand any unreserved
bits to the buddy allocator?

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-21 22:45             ` Andy Lutomirski
@ 2016-07-22 13:00               ` Matt Fleming
  2016-07-23  1:16                 ` Linus Torvalds
  2016-07-26  0:41                 ` Andy Lutomirski
  0 siblings, 2 replies; 16+ messages in thread
From: Matt Fleming @ 2016-07-22 13:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Mario Limonciello,
	Kees Cook, linux-kernel, Andrew Morton, Matthew Garrett,
	Peter Zijlstra, X86 ML, Linus Torvalds, Josh Triplett,
	Peter Jones

On Thu, 21 Jul, at 03:45:14PM, Andy Lutomirski wrote:
> 
> I looked at the code some more.  The boot services quirk is weird and
> maybe buggy.  trim_snb_memory uses memblock_reserve to reserve the
> bottom 1MB.  If efi_reserve_real_mode has already reserved that range,
> then trim_snb_memory's reservation will have no effect because the efi
> code will just free it later on.  The same issue will hit any code
> that reserves >1MB memory after efi has tried to temporarily reserve
> it.
 
Yeah, that looks like a bug. memblock_reserve() reference counting,
anyone?

> I don't have any great suggestions for cleaning it up.  Perhaps the
> efi code should instead skip adding boot services memory to the memory
> map in the first place and then add it late and hand any unreserved
> bits to the buddy allocator?

The issue is that some data required at runtime may be contained in
those boot services data regions; the EFI System Resource Table is a
good example or the ACPI BGRT table. esrt_init() happens pretty early
but efi_bgrt_init() is really late in boot because we need the ACPI
subsystem to have been brought up.

Fundamentally, you can't know whether you can use the boot services
regions for allocation until after SetVirtualAddressMap() has been
called (the original bug that required the reservation quirks occurs
at SVAM time) and after drivers have read the EFI config tables and
marked their regions as reserved.

I suppose we could rewrite the page table mapping for those precious
<1MB regions to coerce the firmware into accessing different pages
instead of the 1:1 addresses and copy the regions elsewhere. Maybe.
That assumes we don't hit other firmware bugs though.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-22 13:00               ` Matt Fleming
@ 2016-07-23  1:16                 ` Linus Torvalds
  2016-07-26  0:41                 ` Andy Lutomirski
  1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2016-07-23  1:16 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Andy Lutomirski, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Mario Limonciello, Kees Cook, linux-kernel, Andrew Morton,
	Matthew Garrett, Peter Zijlstra, X86 ML, Josh Triplett,
	Peter Jones

On Fri, Jul 22, 2016 at 10:00 PM, Matt Fleming <matt@codeblueprint.co.uk> wrote:
>
> I suppose we could rewrite the page table mapping for those precious
> <1MB regions to coerce the firmware into accessing different pages
> instead of the 1:1 addresses and copy the regions elsewhere. Maybe.
> That assumes we don't hit other firmware bugs though.

.. it also assumes that firmware honors our page tables. Which at a
minimum won't be true for things like SMM, which happens with paging
entirely disabled.

And that's exactly the kind of code that touches some of the memory in
the low 1M region - things like the legacy keyboard state bits in the
low 4kB etc, for crazy old DOS days, iirc.

                Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
  2016-07-22 13:00               ` Matt Fleming
  2016-07-23  1:16                 ` Linus Torvalds
@ 2016-07-26  0:41                 ` Andy Lutomirski
  1 sibling, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2016-07-26  0:41 UTC (permalink / raw)
  To: Matt Fleming
  Cc: H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Mario Limonciello,
	Kees Cook, linux-kernel, Andrew Morton, Matthew Garrett,
	Peter Zijlstra, X86 ML, Linus Torvalds, Josh Triplett,
	Peter Jones

On Fri, Jul 22, 2016 at 6:00 AM, Matt Fleming <matt@codeblueprint.co.uk> wrote:
> On Thu, 21 Jul, at 03:45:14PM, Andy Lutomirski wrote:
>>
>> I looked at the code some more.  The boot services quirk is weird and
>> maybe buggy.  trim_snb_memory uses memblock_reserve to reserve the
>> bottom 1MB.  If efi_reserve_real_mode has already reserved that range,
>> then trim_snb_memory's reservation will have no effect because the efi
>> code will just free it later on.  The same issue will hit any code
>> that reserves >1MB memory after efi has tried to temporarily reserve
>> it.
>
> Yeah, that looks like a bug. memblock_reserve() reference counting,
> anyone?
>
>> I don't have any great suggestions for cleaning it up.  Perhaps the
>> efi code should instead skip adding boot services memory to the memory
>> map in the first place and then add it late and hand any unreserved
>> bits to the buddy allocator?
>
> The issue is that some data required at runtime may be contained in
> those boot services data regions; the EFI System Resource Table is a
> good example or the ACPI BGRT table. esrt_init() happens pretty early
> but efi_bgrt_init() is really late in boot because we need the ACPI
> subsystem to have been brought up.
>

I still think my suggestion works.  Let me clarify it:

The memblock allocator (AFAICT) has separate tracking of ranges that
exist and ranges that are reserved.  That is, there are four possible
states a range can be in:

existing, non-reserved: these are available for use
existing, reserved: these ranges are present but either in use or blacklisted
non-existent, reserved: not present but blacklisted anyway
non-existent, non-reserved: nothing here

Currently, boot services data is marked as existing (because the e820
code thinks it's real memory) and reserved (because the EFI code
reserves it).

I'm proposing that it work the other way around: the EFI and e820 code
should, during early boot, treat it just like runtime data, reserved
space, or non-present space: simply don't add it to the memory map in
the first place.  That will cause it to be non-existent and
non-reserved.  Nothing will clobber it because it's not available to
the memblock allocator.

Then, in late boot, either add it back in to the memblock allocator.
Then blacklisted portions will be reserved and non-blacklisted
portions will be non-reserved.  If this is after we switch from
memblock to the normal page allocator, then the code will have to be
structured differently, but the same concept applies.

IOW, just pretend that the boot services memory is initially not
present and then treat it as hot-added memory after SVAM is done.

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-07-26  0:41 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-21  1:32 [PATCH] x86/ebda: If the EBDA is in lowmem, reserve only 4k for the EBDA Andy Lutomirski
2016-07-21  8:14 ` [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code Ingo Molnar
2016-07-21  8:29   ` H. Peter Anvin
2016-07-21  9:11     ` Ingo Molnar
2016-07-21 12:32       ` H. Peter Anvin
2016-07-21  9:14     ` Ingo Molnar
2016-07-21  8:31   ` Ingo Molnar
2016-07-21 14:58   ` Andy Lutomirski
2016-07-21 16:18     ` Ingo Molnar
2016-07-21 21:08       ` Andy Lutomirski
2016-07-21 21:28         ` H. Peter Anvin
2016-07-21 21:48           ` Andy Lutomirski
2016-07-21 22:45             ` Andy Lutomirski
2016-07-22 13:00               ` Matt Fleming
2016-07-23  1:16                 ` Linus Torvalds
2016-07-26  0:41                 ` Andy Lutomirski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.