linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap
@ 2020-03-02 14:14 Mika Westerberg
  2020-04-16 11:55 ` Mika Westerberg
  2020-06-26 22:43 ` Bjorn Helgaas
  0 siblings, 2 replies; 5+ messages in thread
From: Mika Westerberg @ 2020-03-02 14:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
  Cc: Borislav Petkov, x86, Benoit Grégoire, Mika Westerberg,
	juhapekka.heikkila, linux-kernel

Commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address
space") made the resource allocation code to avoid all regions that are
in E820 table. This prevents the kernel to assign MMIO resources to
regions that may be real RAM for example.

However, at least with Lenovo Yoca C940 and S740 this causes problems
when allocating resources for PCIe devices behind Thunderbolt port(s).

On Yoga S740 the E820 table contains an entry like this:

  BIOS-e820: [mem 0x000000002bc50000-0x00000000cfffffff] reserved

and ACPI _CRS method for the host bridge returns these windows:

  pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
  pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
  pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
  pci_bus 0000:00: root bus resource [mem 0x45400000-0xbfffffff window]
  pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]

Note that the 0x45400000-0xbfffffff entry is also included in the E820
table and marked as "reserved".

When Thunderbolt device is connected and the PCIe gets tunneled PCI core
tries to allocate memory for the new devices but it fails because all
the resources are inside this reserved region so arch_remove_reservations()
clips them which makes the resource assignment fail as in below log:

  pci 0000:00:07.0: PCI bridge to [bus 01-2a]
  pci 0000:00:07.0:   bridge window [mem 0x46000000-0x521fffff]
  pci 0000:00:07.0:   bridge window [mem 0x6000000000-0x601bffffff 64bit pref]
  ...
  pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] to [bus 07-2a] add_size 100000 add_align 100000
  pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] to [bus 07-2a] add_size 100000 add_align 100000
  pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] to [bus 02-2a] add_size 100000 add_align 100000
  pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] to [bus 02-2a] add_size 100000 add_align 100000
  pci 0000:01:00.0: bridge window [io  0x1000-0x5fff] shrunken by 0x0000000000004000
  pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] extended by 0x000000000bd00000
  pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] extended by 0x000000001bb00000
  pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] extended by 0x000000000bd00000
  pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] extended by 0x000000001bb00000
  pci 0000:01:00.0: BAR 8: no space for [mem size 0x0c200000]
  pci 0000:01:00.0: BAR 8: failed to assign [mem size 0x0c200000]
  pci 0000:01:00.0: BAR 9: assigned [mem 0x6000000000-0x601bffffff 64bit pref]
  pci 0000:01:00.0: BAR 7: assigned [io  0x4000-0x4fff]

The 01:00.0 is the upstream port of the PCIe switch that is connected to
the PCIe root port (00:07.1) over Thunderbolt link.

If I add "efi=debug" to the command line I can see that the EFI memory
map actually contains several entries:

  [Reserved           |   |  |  |  |  |  |  | |   |WB|WT|WC|UC] range=[0x000000002bc50000-0x000000003fffffff] (323MB)
  [Reserved           |   |  |  |  |  |  |  | |   |WB|  |  |UC] range=[0x0000000040000000-0x0000000040ffffff] (16MB)
  [Reserved           |   |  |  |  |  |  |  | |   |  |  |  |  ] range=[0x0000000041000000-0x00000000453fffff] (68MB)
  [Memory Mapped I/O  |RUN|  |  |  |  |  |  | |   |  |  |  |UC] range=[0x0000000045400000-0x00000000cfffffff] (2220MB)

I think the EFI stub merges these consecutive entries into that single
E820 entry showed above. The last region marked as EFI_MEMORY_MAPPED_IO
actually covers the PCI host bridge window entirely. However, since
there is corresponding E820 type for this it is simply marked as
E820_TYPE_RESERVED.

All in all, I think we can fix this by modifying arch_remove_reservations()
to check the EFI type as well and if it is EFI_MEMORY_MAPPED_IO skip the
clipping in that case.

Reported-by: Benoit Grégoire <benoitg@coeus.ca>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206459
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 arch/x86/kernel/resource.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
index 9b9fb7882c20..c0bc9117dd7d 100644
--- a/arch/x86/kernel/resource.c
+++ b/arch/x86/kernel/resource.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/efi.h>
 #include <linux/ioport.h>
 #include <asm/e820/api.h>
 
@@ -36,6 +37,36 @@ static void remove_e820_regions(struct resource *avail)
 	}
 }
 
+#ifdef CONFIG_EFI
+static bool efi_mmio_mem(const struct resource *avail)
+{
+	resource_size_t start, end;
+	efi_memory_desc_t desc;
+
+	if (!efi_enabled(EFI_MEMMAP) ||
+	    efi_mem_desc_lookup(avail->start, &desc))
+		return false;
+
+	start = desc.phys_addr;
+	end = desc.phys_addr + (desc.num_pages << EFI_PAGE_SHIFT) - 1;
+
+	/*
+	 * No need to clip the resource if it is fully contained in an
+	 * EFI memory mapped region.
+	 */
+	if (avail->start >= start && avail->end <= end &&
+	    desc.type == EFI_MEMORY_MAPPED_IO)
+		return true;
+
+	return false;
+}
+#else
+static inline bool efi_mmio_mem(const struct resource *avail)
+{
+	return false;
+}
+#endif
+
 void arch_remove_reservations(struct resource *avail)
 {
 	/*
@@ -46,6 +77,7 @@ void arch_remove_reservations(struct resource *avail)
 	if (avail->flags & IORESOURCE_MEM) {
 		resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);
 
-		remove_e820_regions(avail);
+		if (!efi_mmio_mem(avail))
+			remove_e820_regions(avail);
 	}
 }
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap
  2020-03-02 14:14 [PATCH] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap Mika Westerberg
@ 2020-04-16 11:55 ` Mika Westerberg
  2020-06-26 22:43 ` Bjorn Helgaas
  1 sibling, 0 replies; 5+ messages in thread
From: Mika Westerberg @ 2020-04-16 11:55 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
  Cc: Borislav Petkov, x86, Benoit Grégoire, juhapekka.heikkila,
	linux-kernel

On Mon, Mar 02, 2020 at 05:14:51PM +0300, Mika Westerberg wrote:
> Commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address
> space") made the resource allocation code to avoid all regions that are
> in E820 table. This prevents the kernel to assign MMIO resources to
> regions that may be real RAM for example.
> 
> However, at least with Lenovo Yoca C940 and S740 this causes problems
> when allocating resources for PCIe devices behind Thunderbolt port(s).
> 
> On Yoga S740 the E820 table contains an entry like this:
> 
>   BIOS-e820: [mem 0x000000002bc50000-0x00000000cfffffff] reserved
> 
> and ACPI _CRS method for the host bridge returns these windows:
> 
>   pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
>   pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
>   pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
>   pci_bus 0000:00: root bus resource [mem 0x45400000-0xbfffffff window]
>   pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]
> 
> Note that the 0x45400000-0xbfffffff entry is also included in the E820
> table and marked as "reserved".
> 
> When Thunderbolt device is connected and the PCIe gets tunneled PCI core
> tries to allocate memory for the new devices but it fails because all
> the resources are inside this reserved region so arch_remove_reservations()
> clips them which makes the resource assignment fail as in below log:
> 
>   pci 0000:00:07.0: PCI bridge to [bus 01-2a]
>   pci 0000:00:07.0:   bridge window [mem 0x46000000-0x521fffff]
>   pci 0000:00:07.0:   bridge window [mem 0x6000000000-0x601bffffff 64bit pref]
>   ...
>   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] to [bus 07-2a] add_size 100000 add_align 100000
>   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] to [bus 07-2a] add_size 100000 add_align 100000
>   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] to [bus 02-2a] add_size 100000 add_align 100000
>   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] to [bus 02-2a] add_size 100000 add_align 100000
>   pci 0000:01:00.0: bridge window [io  0x1000-0x5fff] shrunken by 0x0000000000004000
>   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] extended by 0x000000000bd00000
>   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] extended by 0x000000001bb00000
>   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] extended by 0x000000000bd00000
>   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] extended by 0x000000001bb00000
>   pci 0000:01:00.0: BAR 8: no space for [mem size 0x0c200000]
>   pci 0000:01:00.0: BAR 8: failed to assign [mem size 0x0c200000]
>   pci 0000:01:00.0: BAR 9: assigned [mem 0x6000000000-0x601bffffff 64bit pref]
>   pci 0000:01:00.0: BAR 7: assigned [io  0x4000-0x4fff]
> 
> The 01:00.0 is the upstream port of the PCIe switch that is connected to
> the PCIe root port (00:07.1) over Thunderbolt link.
> 
> If I add "efi=debug" to the command line I can see that the EFI memory
> map actually contains several entries:
> 
>   [Reserved           |   |  |  |  |  |  |  | |   |WB|WT|WC|UC] range=[0x000000002bc50000-0x000000003fffffff] (323MB)
>   [Reserved           |   |  |  |  |  |  |  | |   |WB|  |  |UC] range=[0x0000000040000000-0x0000000040ffffff] (16MB)
>   [Reserved           |   |  |  |  |  |  |  | |   |  |  |  |  ] range=[0x0000000041000000-0x00000000453fffff] (68MB)
>   [Memory Mapped I/O  |RUN|  |  |  |  |  |  | |   |  |  |  |UC] range=[0x0000000045400000-0x00000000cfffffff] (2220MB)
> 
> I think the EFI stub merges these consecutive entries into that single
> E820 entry showed above. The last region marked as EFI_MEMORY_MAPPED_IO
> actually covers the PCI host bridge window entirely. However, since
> there is corresponding E820 type for this it is simply marked as
> E820_TYPE_RESERVED.
> 
> All in all, I think we can fix this by modifying arch_remove_reservations()
> to check the EFI type as well and if it is EFI_MEMORY_MAPPED_IO skip the
> clipping in that case.
> 
> Reported-by: Benoit Grégoire <benoitg@coeus.ca>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206459
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Hello, any comments for this?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap
  2020-03-02 14:14 [PATCH] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap Mika Westerberg
  2020-04-16 11:55 ` Mika Westerberg
@ 2020-06-26 22:43 ` Bjorn Helgaas
  2020-06-30 12:13   ` Mika Westerberg
  1 sibling, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2020-06-26 22:43 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas,
	Borislav Petkov, x86, Benoit Grégoire, juhapekka.heikkila,
	linux-kernel

On Mon, Mar 02, 2020 at 05:14:51PM +0300, Mika Westerberg wrote:
> Commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address
> space") made the resource allocation code to avoid all regions that are
> in E820 table. This prevents the kernel to assign MMIO resources to
> regions that may be real RAM for example.
> 
> However, at least with Lenovo Yoca C940 and S740 this causes problems
> when allocating resources for PCIe devices behind Thunderbolt port(s).
> 
> On Yoga S740 the E820 table contains an entry like this:
> 
>   BIOS-e820: [mem 0x000000002bc50000-0x00000000cfffffff] reserved
> 
> and ACPI _CRS method for the host bridge returns these windows:
> 
>   pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
>   pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
>   pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
>   pci_bus 0000:00: root bus resource [mem 0x45400000-0xbfffffff window]
>   pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]
> 
> Note that the 0x45400000-0xbfffffff entry is also included in the E820
> table and marked as "reserved".
> 
> When Thunderbolt device is connected and the PCIe gets tunneled PCI core
> tries to allocate memory for the new devices but it fails because all
> the resources are inside this reserved region so arch_remove_reservations()
> clips them which makes the resource assignment fail as in below log:
> 
>   pci 0000:00:07.0: PCI bridge to [bus 01-2a]
>   pci 0000:00:07.0:   bridge window [mem 0x46000000-0x521fffff]
>   pci 0000:00:07.0:   bridge window [mem 0x6000000000-0x601bffffff 64bit pref]
>   ...
>   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] to [bus 07-2a] add_size 100000 add_align 100000
>   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] to [bus 07-2a] add_size 100000 add_align 100000
>   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] to [bus 02-2a] add_size 100000 add_align 100000
>   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] to [bus 02-2a] add_size 100000 add_align 100000
>   pci 0000:01:00.0: bridge window [io  0x1000-0x5fff] shrunken by 0x0000000000004000
>   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] extended by 0x000000000bd00000
>   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] extended by 0x000000001bb00000
>   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] extended by 0x000000000bd00000
>   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] extended by 0x000000001bb00000
>   pci 0000:01:00.0: BAR 8: no space for [mem size 0x0c200000]
>   pci 0000:01:00.0: BAR 8: failed to assign [mem size 0x0c200000]
>   pci 0000:01:00.0: BAR 9: assigned [mem 0x6000000000-0x601bffffff 64bit pref]
>   pci 0000:01:00.0: BAR 7: assigned [io  0x4000-0x4fff]
> 
> The 01:00.0 is the upstream port of the PCIe switch that is connected to
> the PCIe root port (00:07.1) over Thunderbolt link.
> 
> If I add "efi=debug" to the command line I can see that the EFI memory
> map actually contains several entries:
> 
>   [Reserved           |   |  |  |  |  |  |  | |   |WB|WT|WC|UC] range=[0x000000002bc50000-0x000000003fffffff] (323MB)
>   [Reserved           |   |  |  |  |  |  |  | |   |WB|  |  |UC] range=[0x0000000040000000-0x0000000040ffffff] (16MB)
>   [Reserved           |   |  |  |  |  |  |  | |   |  |  |  |  ] range=[0x0000000041000000-0x00000000453fffff] (68MB)
>   [Memory Mapped I/O  |RUN|  |  |  |  |  |  | |   |  |  |  |UC] range=[0x0000000045400000-0x00000000cfffffff] (2220MB)
> 
> I think the EFI stub merges these consecutive entries into that single
> E820 entry showed above. The last region marked as EFI_MEMORY_MAPPED_IO
> actually covers the PCI host bridge window entirely. However, since
> there is corresponding E820 type for this it is simply marked as
> E820_TYPE_RESERVED.

Is this supposed to say "since there is *no* corresponding E820 type"?

I guess this refers to setup_e820()?  I see the switch case there that
sets e820_type to E820_TYPE_RESERVED for EFI_MEMORY_MAPPED_IO.

I'm really not familiar with the EFI stub, but I guess that when
booting with EFI, we basically build our own E820 table based on the
EFI memory map, then use that constructed E820 table to initialize
iomem_resource via e820__reserve_resources()?

I wonder why we don't just use the EFI memory map to initialize
iomem_resource directly.  It seems like the EFI map -> E820 table ->
iomem_resource path adds complication that makes issues like this more
likely.

IIUC, this patch allows Linux to use EFI_MEMORY_MAPPED_IO for PCI MMIO
space, i.e., if "avail" is in a EFI_MEMORY_MAPPED_IO region, we don't
need to consider it "reserved" and we can put PCI resources there.

UEFI v2.8, Table 30 (doc p 160, PDF p 233) says about
EfiMemoryMappedIO that

  before ExitBootServices():

    Used by system firmware to request that a memory-mapped IO region
    be mapped by the OS to a virtual address so it can be accessed by
    EFI runtime services.

  after ExitBootServices():

    This memory is not used by the OS. All system memory-mapped IO
    information should come from ACPI tables.

I would read that as basically saying that the OS must virtually map
EfiMemoryMappedIO for use by EFI runtime services (I don't know if
Linux uses any of those).

The "after ExitBootServices()" part doesn't seem very clear to me.
"This memory is not used by the OS" isn't quite stated as a
requirement for the OS.  And "All .. MMIO information should come from
ACPI tables" sounds like the OS should *ignore* this EFI map in favor
of ACPI.

So I don't know how to fix this.  I just feel like this patch adds
complication to an already very messy flow.  We may have to do that,
but it always makes me sad ;)

> All in all, I think we can fix this by modifying arch_remove_reservations()
> to check the EFI type as well and if it is EFI_MEMORY_MAPPED_IO skip the
> clipping in that case.
> 
> Reported-by: Benoit Grégoire <benoitg@coeus.ca>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206459
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
>  arch/x86/kernel/resource.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
> index 9b9fb7882c20..c0bc9117dd7d 100644
> --- a/arch/x86/kernel/resource.c
> +++ b/arch/x86/kernel/resource.c
> @@ -1,4 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0
> +#include <linux/efi.h>
>  #include <linux/ioport.h>
>  #include <asm/e820/api.h>
>  
> @@ -36,6 +37,36 @@ static void remove_e820_regions(struct resource *avail)
>  	}
>  }
>  
> +#ifdef CONFIG_EFI
> +static bool efi_mmio_mem(const struct resource *avail)
> +{
> +	resource_size_t start, end;
> +	efi_memory_desc_t desc;
> +
> +	if (!efi_enabled(EFI_MEMMAP) ||
> +	    efi_mem_desc_lookup(avail->start, &desc))
> +		return false;
> +
> +	start = desc.phys_addr;
> +	end = desc.phys_addr + (desc.num_pages << EFI_PAGE_SHIFT) - 1;
> +
> +	/*
> +	 * No need to clip the resource if it is fully contained in an
> +	 * EFI memory mapped region.
> +	 */
> +	if (avail->start >= start && avail->end <= end &&
> +	    desc.type == EFI_MEMORY_MAPPED_IO)
> +		return true;
> +
> +	return false;
> +}
> +#else
> +static inline bool efi_mmio_mem(const struct resource *avail)
> +{
> +	return false;
> +}
> +#endif
> +
>  void arch_remove_reservations(struct resource *avail)
>  {
>  	/*
> @@ -46,6 +77,7 @@ void arch_remove_reservations(struct resource *avail)
>  	if (avail->flags & IORESOURCE_MEM) {
>  		resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);
>  
> -		remove_e820_regions(avail);
> +		if (!efi_mmio_mem(avail))
> +			remove_e820_regions(avail);
>  	}
>  }
> -- 
> 2.25.0
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap
  2020-06-26 22:43 ` Bjorn Helgaas
@ 2020-06-30 12:13   ` Mika Westerberg
  2020-06-30 15:54     ` Bjorn Helgaas
  0 siblings, 1 reply; 5+ messages in thread
From: Mika Westerberg @ 2020-06-30 12:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas,
	Borislav Petkov, x86, Benoit Grégoire, juhapekka.heikkila,
	linux-kernel

On Fri, Jun 26, 2020 at 05:43:44PM -0500, Bjorn Helgaas wrote:
> On Mon, Mar 02, 2020 at 05:14:51PM +0300, Mika Westerberg wrote:
> > Commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address
> > space") made the resource allocation code to avoid all regions that are
> > in E820 table. This prevents the kernel to assign MMIO resources to
> > regions that may be real RAM for example.
> > 
> > However, at least with Lenovo Yoca C940 and S740 this causes problems
> > when allocating resources for PCIe devices behind Thunderbolt port(s).
> > 
> > On Yoga S740 the E820 table contains an entry like this:
> > 
> >   BIOS-e820: [mem 0x000000002bc50000-0x00000000cfffffff] reserved
> > 
> > and ACPI _CRS method for the host bridge returns these windows:
> > 
> >   pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
> >   pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
> >   pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> >   pci_bus 0000:00: root bus resource [mem 0x45400000-0xbfffffff window]
> >   pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]
> > 
> > Note that the 0x45400000-0xbfffffff entry is also included in the E820
> > table and marked as "reserved".
> > 
> > When Thunderbolt device is connected and the PCIe gets tunneled PCI core
> > tries to allocate memory for the new devices but it fails because all
> > the resources are inside this reserved region so arch_remove_reservations()
> > clips them which makes the resource assignment fail as in below log:
> > 
> >   pci 0000:00:07.0: PCI bridge to [bus 01-2a]
> >   pci 0000:00:07.0:   bridge window [mem 0x46000000-0x521fffff]
> >   pci 0000:00:07.0:   bridge window [mem 0x6000000000-0x601bffffff 64bit pref]
> >   ...
> >   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] to [bus 07-2a] add_size 100000 add_align 100000
> >   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] to [bus 07-2a] add_size 100000 add_align 100000
> >   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] to [bus 02-2a] add_size 100000 add_align 100000
> >   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] to [bus 02-2a] add_size 100000 add_align 100000
> >   pci 0000:01:00.0: bridge window [io  0x1000-0x5fff] shrunken by 0x0000000000004000
> >   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] extended by 0x000000000bd00000
> >   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] extended by 0x000000001bb00000
> >   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] extended by 0x000000000bd00000
> >   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] extended by 0x000000001bb00000
> >   pci 0000:01:00.0: BAR 8: no space for [mem size 0x0c200000]
> >   pci 0000:01:00.0: BAR 8: failed to assign [mem size 0x0c200000]
> >   pci 0000:01:00.0: BAR 9: assigned [mem 0x6000000000-0x601bffffff 64bit pref]
> >   pci 0000:01:00.0: BAR 7: assigned [io  0x4000-0x4fff]
> > 
> > The 01:00.0 is the upstream port of the PCIe switch that is connected to
> > the PCIe root port (00:07.1) over Thunderbolt link.
> > 
> > If I add "efi=debug" to the command line I can see that the EFI memory
> > map actually contains several entries:
> > 
> >   [Reserved           |   |  |  |  |  |  |  | |   |WB|WT|WC|UC] range=[0x000000002bc50000-0x000000003fffffff] (323MB)
> >   [Reserved           |   |  |  |  |  |  |  | |   |WB|  |  |UC] range=[0x0000000040000000-0x0000000040ffffff] (16MB)
> >   [Reserved           |   |  |  |  |  |  |  | |   |  |  |  |  ] range=[0x0000000041000000-0x00000000453fffff] (68MB)
> >   [Memory Mapped I/O  |RUN|  |  |  |  |  |  | |   |  |  |  |UC] range=[0x0000000045400000-0x00000000cfffffff] (2220MB)
> > 
> > I think the EFI stub merges these consecutive entries into that single
> > E820 entry showed above. The last region marked as EFI_MEMORY_MAPPED_IO
> > actually covers the PCI host bridge window entirely. However, since
> > there is corresponding E820 type for this it is simply marked as
> > E820_TYPE_RESERVED.
> 
> Is this supposed to say "since there is *no* corresponding E820 type"?

Yes, I think it makes more sense that way :)

> I guess this refers to setup_e820()?  I see the switch case there that
> sets e820_type to E820_TYPE_RESERVED for EFI_MEMORY_MAPPED_IO.
> 
> I'm really not familiar with the EFI stub, but I guess that when
> booting with EFI, we basically build our own E820 table based on the
> EFI memory map, then use that constructed E820 table to initialize
> iomem_resource via e820__reserve_resources()?
> 
> I wonder why we don't just use the EFI memory map to initialize
> iomem_resource directly.  It seems like the EFI map -> E820 table ->
> iomem_resource path adds complication that makes issues like this more
> likely.
>
> IIUC, this patch allows Linux to use EFI_MEMORY_MAPPED_IO for PCI MMIO
> space, i.e., if "avail" is in a EFI_MEMORY_MAPPED_IO region, we don't
> need to consider it "reserved" and we can put PCI resources there.
> 
> UEFI v2.8, Table 30 (doc p 160, PDF p 233) says about
> EfiMemoryMappedIO that
> 
>   before ExitBootServices():
> 
>     Used by system firmware to request that a memory-mapped IO region
>     be mapped by the OS to a virtual address so it can be accessed by
>     EFI runtime services.
> 
>   after ExitBootServices():
> 
>     This memory is not used by the OS. All system memory-mapped IO
>     information should come from ACPI tables.
> 
> I would read that as basically saying that the OS must virtually map
> EfiMemoryMappedIO for use by EFI runtime services (I don't know if
> Linux uses any of those).
> 
> The "after ExitBootServices()" part doesn't seem very clear to me.
> "This memory is not used by the OS" isn't quite stated as a
> requirement for the OS.  And "All .. MMIO information should come from
> ACPI tables" sounds like the OS should *ignore* this EFI map in favor
> of ACPI.

It is not clear to me either and I'm not 100% sure the patch here is the
correct thing to do. I think we may need some help from x86 maintainers
for a proper fix.

> So I don't know how to fix this.  I just feel like this patch adds
> complication to an already very messy flow.  We may have to do that,
> but it always makes me sad ;)
> 
> > All in all, I think we can fix this by modifying arch_remove_reservations()
> > to check the EFI type as well and if it is EFI_MEMORY_MAPPED_IO skip the
> > clipping in that case.
> > 
> > Reported-by: Benoit Grégoire <benoitg@coeus.ca>
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=206459
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > ---
> >  arch/x86/kernel/resource.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
> > index 9b9fb7882c20..c0bc9117dd7d 100644
> > --- a/arch/x86/kernel/resource.c
> > +++ b/arch/x86/kernel/resource.c
> > @@ -1,4 +1,5 @@
> >  // SPDX-License-Identifier: GPL-2.0
> > +#include <linux/efi.h>
> >  #include <linux/ioport.h>
> >  #include <asm/e820/api.h>
> >  
> > @@ -36,6 +37,36 @@ static void remove_e820_regions(struct resource *avail)
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_EFI
> > +static bool efi_mmio_mem(const struct resource *avail)
> > +{
> > +	resource_size_t start, end;
> > +	efi_memory_desc_t desc;
> > +
> > +	if (!efi_enabled(EFI_MEMMAP) ||
> > +	    efi_mem_desc_lookup(avail->start, &desc))
> > +		return false;
> > +
> > +	start = desc.phys_addr;
> > +	end = desc.phys_addr + (desc.num_pages << EFI_PAGE_SHIFT) - 1;
> > +
> > +	/*
> > +	 * No need to clip the resource if it is fully contained in an
> > +	 * EFI memory mapped region.
> > +	 */
> > +	if (avail->start >= start && avail->end <= end &&
> > +	    desc.type == EFI_MEMORY_MAPPED_IO)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +#else
> > +static inline bool efi_mmio_mem(const struct resource *avail)
> > +{
> > +	return false;
> > +}
> > +#endif
> > +
> >  void arch_remove_reservations(struct resource *avail)
> >  {
> >  	/*
> > @@ -46,6 +77,7 @@ void arch_remove_reservations(struct resource *avail)
> >  	if (avail->flags & IORESOURCE_MEM) {
> >  		resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);
> >  
> > -		remove_e820_regions(avail);
> > +		if (!efi_mmio_mem(avail))
> > +			remove_e820_regions(avail);
> >  	}
> >  }
> > -- 
> > 2.25.0
> > 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap
  2020-06-30 12:13   ` Mika Westerberg
@ 2020-06-30 15:54     ` Bjorn Helgaas
  0 siblings, 0 replies; 5+ messages in thread
From: Bjorn Helgaas @ 2020-06-30 15:54 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas,
	Borislav Petkov, x86, Benoit Grégoire, juhapekka.heikkila,
	linux-kernel

On Tue, Jun 30, 2020 at 03:13:00PM +0300, Mika Westerberg wrote:
> On Fri, Jun 26, 2020 at 05:43:44PM -0500, Bjorn Helgaas wrote:
> > On Mon, Mar 02, 2020 at 05:14:51PM +0300, Mika Westerberg wrote:
> > > Commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address
> > > space") made the resource allocation code to avoid all regions that are
> > > in E820 table. This prevents the kernel to assign MMIO resources to
> > > regions that may be real RAM for example.
> > > 
> > > However, at least with Lenovo Yoca C940 and S740 this causes problems
> > > when allocating resources for PCIe devices behind Thunderbolt port(s).
> > > 
> > > On Yoga S740 the E820 table contains an entry like this:
> > > 
> > >   BIOS-e820: [mem 0x000000002bc50000-0x00000000cfffffff] reserved
> > > 
> > > and ACPI _CRS method for the host bridge returns these windows:
> > > 
> > >   pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
> > >   pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
> > >   pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
> > >   pci_bus 0000:00: root bus resource [mem 0x45400000-0xbfffffff window]
> > >   pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]
> > > 
> > > Note that the 0x45400000-0xbfffffff entry is also included in the E820
> > > table and marked as "reserved".
> > > 
> > > When Thunderbolt device is connected and the PCIe gets tunneled PCI core
> > > tries to allocate memory for the new devices but it fails because all
> > > the resources are inside this reserved region so arch_remove_reservations()
> > > clips them which makes the resource assignment fail as in below log:
> > > 
> > >   pci 0000:00:07.0: PCI bridge to [bus 01-2a]
> > >   pci 0000:00:07.0:   bridge window [mem 0x46000000-0x521fffff]
> > >   pci 0000:00:07.0:   bridge window [mem 0x6000000000-0x601bffffff 64bit pref]
> > >   ...
> > >   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] to [bus 07-2a] add_size 100000 add_align 100000
> > >   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] to [bus 07-2a] add_size 100000 add_align 100000
> > >   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] to [bus 02-2a] add_size 100000 add_align 100000
> > >   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] to [bus 02-2a] add_size 100000 add_align 100000
> > >   pci 0000:01:00.0: bridge window [io  0x1000-0x5fff] shrunken by 0x0000000000004000
> > >   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff] extended by 0x000000000bd00000
> > >   pci 0000:01:00.0: bridge window [mem 0x00100000-0x005fffff 64bit pref] extended by 0x000000001bb00000
> > >   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff] extended by 0x000000000bd00000
> > >   pci 0000:02:04.0: bridge window [mem 0x00100000-0x001fffff 64bit pref] extended by 0x000000001bb00000
> > >   pci 0000:01:00.0: BAR 8: no space for [mem size 0x0c200000]
> > >   pci 0000:01:00.0: BAR 8: failed to assign [mem size 0x0c200000]
> > >   pci 0000:01:00.0: BAR 9: assigned [mem 0x6000000000-0x601bffffff 64bit pref]
> > >   pci 0000:01:00.0: BAR 7: assigned [io  0x4000-0x4fff]
> > > 
> > > The 01:00.0 is the upstream port of the PCIe switch that is connected to
> > > the PCIe root port (00:07.1) over Thunderbolt link.
> > > 
> > > If I add "efi=debug" to the command line I can see that the EFI memory
> > > map actually contains several entries:
> > > 
> > >   [Reserved           |   |  |  |  |  |  |  | |   |WB|WT|WC|UC] range=[0x000000002bc50000-0x000000003fffffff] (323MB)
> > >   [Reserved           |   |  |  |  |  |  |  | |   |WB|  |  |UC] range=[0x0000000040000000-0x0000000040ffffff] (16MB)
> > >   [Reserved           |   |  |  |  |  |  |  | |   |  |  |  |  ] range=[0x0000000041000000-0x00000000453fffff] (68MB)
> > >   [Memory Mapped I/O  |RUN|  |  |  |  |  |  | |   |  |  |  |UC] range=[0x0000000045400000-0x00000000cfffffff] (2220MB)
> > > 
> > > I think the EFI stub merges these consecutive entries into that single
> > > E820 entry showed above. The last region marked as EFI_MEMORY_MAPPED_IO
> > > actually covers the PCI host bridge window entirely. However, since
> > > there is corresponding E820 type for this it is simply marked as
> > > E820_TYPE_RESERVED.
> > 
> > Is this supposed to say "since there is *no* corresponding E820 type"?
> 
> Yes, I think it makes more sense that way :)
> 
> > I guess this refers to setup_e820()?  I see the switch case there that
> > sets e820_type to E820_TYPE_RESERVED for EFI_MEMORY_MAPPED_IO.
> > 
> > I'm really not familiar with the EFI stub, but I guess that when
> > booting with EFI, we basically build our own E820 table based on the
> > EFI memory map, then use that constructed E820 table to initialize
> > iomem_resource via e820__reserve_resources()?
> > 
> > I wonder why we don't just use the EFI memory map to initialize
> > iomem_resource directly.  It seems like the EFI map -> E820 table ->
> > iomem_resource path adds complication that makes issues like this more
> > likely.
> >
> > IIUC, this patch allows Linux to use EFI_MEMORY_MAPPED_IO for PCI MMIO
> > space, i.e., if "avail" is in a EFI_MEMORY_MAPPED_IO region, we don't
> > need to consider it "reserved" and we can put PCI resources there.
> > 
> > UEFI v2.8, Table 30 (doc p 160, PDF p 233) says about
> > EfiMemoryMappedIO that
> > 
> >   before ExitBootServices():
> > 
> >     Used by system firmware to request that a memory-mapped IO region
> >     be mapped by the OS to a virtual address so it can be accessed by
> >     EFI runtime services.
> > 
> >   after ExitBootServices():
> > 
> >     This memory is not used by the OS. All system memory-mapped IO
> >     information should come from ACPI tables.
> > 
> > I would read that as basically saying that the OS must virtually map
> > EfiMemoryMappedIO for use by EFI runtime services (I don't know if
> > Linux uses any of those).
> > 
> > The "after ExitBootServices()" part doesn't seem very clear to me.
> > "This memory is not used by the OS" isn't quite stated as a
> > requirement for the OS.  And "All .. MMIO information should come from
> > ACPI tables" sounds like the OS should *ignore* this EFI map in favor
> > of ACPI.
> 
> It is not clear to me either and I'm not 100% sure the patch here is the
> correct thing to do. I think we may need some help from x86 maintainers
> for a proper fix.

FWIW, I think we have two pretty fundamental issues:

  1) The E820 map isn't a very good match for the Linux resource
  manager because Linux assumes resources can't overlap unless they're
  strictly hierarchical.  The E820 and EFI memory maps make no such
  assumption.

  2) We only look at ACPI _CRS information on an ad hoc basis, so we
  ignore most of the detailed address space information from firmware.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-06-30 15:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-02 14:14 [PATCH] x86/resource: Do not exclude regions that are marked as MMIO in EFI memmap Mika Westerberg
2020-04-16 11:55 ` Mika Westerberg
2020-06-26 22:43 ` Bjorn Helgaas
2020-06-30 12:13   ` Mika Westerberg
2020-06-30 15:54     ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).