linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: kexec_file overwrites reserved EFI ESRT memory
       [not found]         ` <20191202085829.GA15808@dhcp-128-65.nay.redhat.com>
@ 2019-12-02  9:05           ` Dave Young
  2019-12-02 23:45             ` Michael Weiser
  2019-12-03 10:01             ` Ard Biesheuvel
  0 siblings, 2 replies; 7+ messages in thread
From: Dave Young @ 2019-12-02  9:05 UTC (permalink / raw)
  To: Michael Weiser
  Cc: Eric W. Biederman, linux-efi, kexec, Ard Biesheuvel, x86,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Linus Torvalds, linux-kernel

Add more cc
On 12/02/19 at 04:58pm, Dave Young wrote:
> On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > Hello Dave,
> > 
> > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > 
> > > > > Fundamentally when deciding where to place a new kernel kexec (either
> > > > > user space or the in kernel kexec_file implementation) needs to be able
> > > > > to ask the question which memory ares are reserved.
> > [...]
> > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > /proc/iomem?
> > > > 
> > > > My guess is that the focus was that some EFI structures need to be kept
> > > > around accross the life cycle of *one* running kernel and
> > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > or Matt would know.
> > > Can you check your un-reserved memory, if your memory falls into EFI
> > > BOOT* then in X86 you can use something like below if it is not covered:
> > 
> > > void __init efi_esrt_init(void)
> > > {
> > > ...
> > > 	pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > > 	if (md.type == EFI_BOOT_SERVICES_DATA)
> > > 		efi_mem_reserve(esrt_data, esrt_data_size);
> > > ...
> > > }
> > 
> > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > the esrt module reports at boot:
> > 
> > [    0.001244] esrt: Reserving ESRT space from 0x0000000074dd2f98 to 0x0000000074dd2fd0.
> > 
> > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > code you quote reserve it using memblock_reserve() shown by
> > memblock=debug:
> > 
> > [    0.001246] memblock_reserve: [0x0000000074dd2f98-0x0000000074dd2fcf] efi_mem_reserve+0x1d/0x2b
> > 
> > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > as shown by efi=debug:
> > 
> > [    0.178111] efi: mem10: [Boot Data          |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000074dd3000-0x0000000075becfff] (14MB)
> > [    0.178113] efi: mem11: [Boot Data          |RUN|  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000074dd2000-0x0000000074dd2fff] (0MB)
> > [    0.178114] efi: mem12: [Boot Data          |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000006d635000-0x0000000074dd1fff] (119MB)
> > 
> > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > not report this area as being free'd while the surrounding ones are:
> > 
> > [    0.178369] __memblock_free_late: [0x0000000074dd3000-0x0000000075becfff] efi_free_boot_services+0x126/0x1f8
> > [    0.178658] __memblock_free_late: [0x000000006d635000-0x0000000074dd1fff] efi_free_boot_services+0x126/0x1f8
> > 
> > The esrt area does not show up in /proc/iomem though:
> > 
> > 00100000-763f5fff : System RAM
> >   62000000-62a00d80 : Kernel code
> >   62c00000-62f15fff : Kernel rodata
> >   63000000-630ea8bf : Kernel data
> >   63fed000-641fffff : Kernel bss
> >   65000000-6affffff : Crash kernel
> > 
> > And thus kexec loads the new kernel right over that area as shown when
> > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x73000000
> > and 0x73000000+0x24be000 = 0x754be000):
> > 
> > [  650.007695] kexec_file: Loading segment 0: buf=0x000000003a9c84d6 bufsz=0x5000 mem=0x98000 memsz=0x6000
> > [  650.007699] kexec_file: Loading segment 1: buf=0x0000000017b2b9e6 bufsz=0x1240 mem=0x96000 memsz=0x2000
> > [  650.007703] kexec_file: Loading segment 2: buf=0x00000000fdf72ba2 bufsz=0x1150888 mem=0x73000000 memsz=0x24be000
> > 
> > ... because it looks for any memory hole large enough in iomem resources
> > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > excluded from on my system.
> > 
> > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > the area with efi.memmap and installs it using efi_memmap_install().
> > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > *not* reserve that memory in any way.
> > 
> > > Unfortunately I noticed there are different requirements/ways for
> > > different types of "reserved" memory.  But that is another topic..
> > 
> > I tried to reserve the area with something like this:
> > 
> > t a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 4de244683a7e..b86a5df027a2 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> >         efi_memory_desc_t md;
> >         int num_entries;
> >         void *new;
> > +       struct resource *res;
> >  
> >         if (efi_mem_desc_lookup(addr, &md) ||
> >             md.type != EFI_BOOT_SERVICES_DATA) {
> > @@ -294,6 +295,21 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> >         early_memunmap(new, new_size);
> >  
> >         efi_memmap_install(new_phys, num_entries);
> > +
> > +       res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES);
> > +       if (!res) {
> > +               pr_err("Failed to allocate EFI io resource allocator for "
> > +                               "0x%llx:0x%llx", mr.range.start, mr.range.end);
> > +               return;
> > +       }
> > +
> > +       res->start      = mr.range.start;
> > +       res->end        = mr.range.end;
> > +       res->name       = "EFI runtime";
> > +       res->flags      = IORESOURCE_MEM | IORESOURCE_BUSY;
> > +       res->desc       = IORES_DESC_NONE;
> > +
> > +       insert_resource(&iomem_resource, res);
> >  }
> >  
> >  /*
> > 
> > ... but failed miserably in terms of the kernel not booting because I
> > have no experience whatsoever in programming and debugging early kernel
> > init. But I am somewhat keen to ride the learning curve here. :)
> > 
> > Am I on the right track or were you a couple of leaps ahead of me
> > already and I just didn't get the question?
> 
> It seems a serious problem, the EFI modified memmap does not get an
> /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> X86.
> 
> Can you try below diff see if it works for you? (not tested, and need
> explicitly 'add_efi_memmap' in kernel cmdline param)
> 
> There is an question from Sai about why add_efi_memmap is not enabled by
> default:
> https://www.spinics.net/lists/linux-mm/msg185166.html
> 
> Long time ago the add_efi_memmap is only enabled in case we explict
> enable it on cmdline, I'm not sure if we can do it by default, maybe we
> should.   Need opinion from X86 maintainers..
> 
> diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> index 43a82e59c59d..eddaac6131cf 100644
> --- a/arch/x86/include/asm/efi.h
> +++ b/arch/x86/include/asm/efi.h
> @@ -243,6 +243,7 @@ static inline bool efi_is_64bit(void)
>  
>  extern bool efi_reboot_required(void);
>  extern bool efi_is_table_address(unsigned long phys_addr);
> +extern void do_add_efi_memmap(void);
>  
>  #else
>  static inline void parse_efi_setup(u64 phys_addr, u32 data_len) {}
> diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> index 425e025341db..39e28ec76522 100644
> --- a/arch/x86/platform/efi/efi.c
> +++ b/arch/x86/platform/efi/efi.c
> @@ -149,10 +149,12 @@ void __init efi_find_mirror(void)
>   * (zeropage) memory map.
>   */
>  
> -static void __init do_add_efi_memmap(void)
> +void __init do_add_efi_memmap(void)
>  {
>  	efi_memory_desc_t *md;
>  
> +	if (!add_efi_memmap)
> +		return;
>  	for_each_efi_memory_desc(md) {
>  		unsigned long long start = md->phys_addr;
>  		unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
> @@ -224,8 +226,7 @@ int __init efi_memblock_x86_reserve_range(void)
>  	if (rv)
>  		return rv;
>  
> -	if (add_efi_memmap)
> -		do_add_efi_memmap();
> +	do_add_efi_memmap();
>  
>  	WARN(efi.memmap.desc_version != 1,
>  	     "Unexpected EFI_MEMORY_DESCRIPTOR version %ld",
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 3b9fd679cea9..cfda591e51e3 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -496,6 +496,7 @@ void __init efi_free_boot_services(void)
>  		pr_err("Could not install new EFI memmap\n");
>  		return;
>  	}
> +	do_add_efi_memmap();
>  }
>  
>  /*


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kexec_file overwrites reserved EFI ESRT memory
  2019-12-02  9:05           ` kexec_file overwrites reserved EFI ESRT memory Dave Young
@ 2019-12-02 23:45             ` Michael Weiser
  2019-12-03 11:54               ` Dave Young
  2019-12-03 10:01             ` Ard Biesheuvel
  1 sibling, 1 reply; 7+ messages in thread
From: Michael Weiser @ 2019-12-02 23:45 UTC (permalink / raw)
  To: Dave Young
  Cc: linux-efi, Ard Biesheuvel, x86, kexec, linux-kernel, Ingo Molnar,
	Borislav Petkov, Eric W. Biederman, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds

Hi Dave,

On Mon, Dec 02, 2019 at 05:05:20PM +0800, Dave Young wrote:

> > It seems a serious problem, the EFI modified memmap does not get an
> > /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> > X86.
> > 
> > There is an question from Sai about why add_efi_memmap is not enabled by
> > default:
> > https://www.spinics.net/lists/linux-mm/msg185166.html

Incidentally, a data point I did not think to mention: I do boot the
kernel as EFI application directly from the firmware as a boot entry
with compiled in initrd and command line:

$ grep EFI nobak/kernel/linux/.config
CONFIG_EFI=y
CONFIG_EFI_STUB=y
# CONFIG_EFI_MIXED is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# EFI (Extensible Firmware Interface) Support
CONFIG_EFI_VARS=m
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=m
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_EFI_RCI2_TABLE is not set
# end of EFI (Extensible Firmware Interface) Support
CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y
CONFIG_EFI_EARLYCON=y
CONFIG_EFI_PARTITION=y
CONFIG_FB_EFI=y
CONFIG_EFIVAR_FS=y
# CONFIG_EFI_PGT_DUMP is not set

$ grep CMDLINE nobak/kernel/linux/.config
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="root=UUID=97[...]e4 rd.luks.uuid=8a[...]c3 rd.luks.allow-discards=8a[...]c3 mem_sleep_default=deep resume=UUID=97[...]e4 resume_offset=96256 efi=debug memblock=debug"
CONFIG_CMDLINE_OVERRIDE=y
# CONFIG_BLK_CMDLINE_PARSER is not set
# CONFIG_CMDLINE_PARTITION is not set
CONFIG_FB_CMDLINE=y

$ efibootmgr -v
BootCurrent: 000A
Timeout: 2 seconds
BootOrder: 000A,0009,0008,0005,0007,0006,0004,0002,0001,0000,0003
[...]
Boot0005* gentoo-5.4.0-next-20191127+-clear
HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.0-next-20191127+-clear)
[...]
Boot000A* gentoo-5.4.1-gentoo
HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.1-gentoo)

So there's no boot loader that could construct an e820 table for the
kernel to consume. I understand it's then up to the EFI stub to come up
with a e820 table from the EFI memory map.

> > Long time ago the add_efi_memmap is only enabled in case we explict
> > enable it on cmdline, I'm not sure if we can do it by default, maybe we
> > should.   Need opinion from X86 maintainers..
> > Can you try below diff see if it works for you? (not tested, and need
> > explicitly 'add_efi_memmap' in kernel cmdline param)

Neither adding add_efi_memmap nor adding your patch and setting that option
does make the ESRT memory region appear in /proc/iomem. kexec_file still
loads the kernel across the ESRT region.

What occurs to me is that nowhere does the ESRT memory region appear in
any externally provided memory map. Neither e820 nor EFI seem to declare
it. Is that expected or a bug of my particular system?

For example, the e820 map (derived from the EFI map by the EFI stub?)
has these regions:

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000763f5fff] usable
BIOS-e820: [mem 0x00000000763f6000-0x0000000079974fff] reserved
BIOS-e820: [mem 0x0000000079975000-0x00000000799f1fff] ACPI data
BIOS-e820: [mem 0x00000000799f2000-0x0000000079aa6fff] ACPI NVS
BIOS-e820: [mem 0x0000000079aa7000-0x000000007a40dfff] reserved
BIOS-e820: [mem 0x000000007a40e000-0x000000007a40efff] usable
BIOS-e820: [mem 0x000000007a40f000-0x000000007fffffff] reserved
BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000047dffffff] usable

The ESRT region sits smack in the middle of a large system RAM region:

BIOS-e820: [mem 0x0000000000100000-0x00000000763f5fff] usable

Consequently, the relevant part of /proc/iomem looks like this:

00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
  000a0000-000bffff : PCI Bus 0000:00
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000e8000-000ebfff : PCI Bus 0000:00
  000ec000-000effff : PCI Bus 0000:00
  000f0000-000fffff : PCI Bus 0000:00
    000f0000-000fffff : System ROM
00100000-763f5fff : System RAM
  65000000-6affffff : Crash kernel
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00

What it would need to look like for kexec to leave ESRT alone, I guess, is:

00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
  000a0000-000bffff : PCI Bus 0000:00
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000e8000-000ebfff : PCI Bus 0000:00
  000ec000-000effff : PCI Bus 0000:00
  000f0000-000fffff : PCI Bus 0000:00
    000f0000-000fffff : System ROM
00100000-74dd1fff : System RAM         <---- split System RAM
  65000000-6affffff : Crash kernel
74dd2000-74dd2fff : Reserved           <---- ESRT
74dd3000-763f5fff : System RAM         <---- split System RAM
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00

But since System RAM is set up from the e820 table very early on, the
e820 table would need to be patched before that or the already present
System RAM root resource split into three individual System
RAM/Reserved/System RAM root resources later. Correct?

I've played some more with the resource API and can now make /proc/iomem
look like this, with "EFI runtime" marked reserved:

00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
  000a0000-000bffff : PCI Bus 0000:00
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000e8000-000ebfff : PCI Bus 0000:00
  000ec000-000effff : PCI Bus 0000:00
  000f0000-000fffff : PCI Bus 0000:00
    000f0000-000fffff : System ROM
00100000-763f5fff : System RAM
  65000000-6affffff : Crash kernel
  74dd2000-74dd2fff : EFI runtime
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00

But kexec_file seems to only look at the 00100000-763f5fff System RAM
root resource and still loads the kernel right across the ESRT region.
Or should it walk down the hierarchy and exclude reserved child nodes?

arm64 seems to have run into similar issues and solved them wholesale:
https://lkml.org/lkml/2018/6/19/131
https://elixir.bootlin.com/linux/v5.4.1/source/arch/arm64/kernel/setup.c#L249

I tried to apply that to x86:

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 3b9fd679cea9..458731f49484 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -295,6 +295,48 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 	efi_memmap_install(new_phys, num_entries);
 }
 
+static int __init efi_arch_mem_reserve_runtime(void)
+{
+	efi_memory_desc_t *md;
+	struct resource *res, *mem;
+	resource_size_t start, end;
+
+	for_each_efi_memory_desc(md) {
+		if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
+		    (md->type == EFI_BOOT_SERVICES_CODE ||
+		     md->type == EFI_BOOT_SERVICES_DATA))
+			continue;
+
+		res = kzalloc(sizeof(*res), GFP_ATOMIC);
+		if (WARN_ON(!res))
+			return -ENOMEM;
+
+		start	= md->phys_addr;
+		end	= md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
+		res->start	= start;
+		res->end	= end;
+		res->name	= "EFI runtime";
+		res->flags	= IORESOURCE_MEM;
+
+		mem = request_resource_conflict(&iomem_resource, res);
+		/* all is well if there's no conflict */
+		if (!mem) {
+			pr_debug("Reserved io resource for runtime region 0x%llx-0x%llx\n",
+					start, end);
+			continue;
+		}
+		kfree(res);
+
+		/* otherwise go on to split up the conflicting region */
+		pr_debug("Splitting 0x%llx-0x%llx to reserve 0x%llx-0x%llx\n",
+				mem->start, mem->end, start, end);
+		reserve_region_with_split(mem, start, end, "EFI Runtime");
+	}
+
+	return 0;
+}
+arch_initcall(efi_arch_mem_reserve_runtime);
+
 /*
  * Helper function for efi_reserve_boot_services() to figure out if we
  * can free regions in efi_free_boot_services().

That comes back with these in dmesg:

[    0.190280] efi: Reserved io resource for runtime region 0xff000000-0xffffffff
[    0.190280] efi: Reserved io resource for runtime region 0xfee00000-0xfee00fff
[    0.190280] efi: Reserved io resource for runtime region 0xfed00000-0xfed03fff
[    0.190280] efi: Reserved io resource for runtime region 0xfec00000-0xfec00fff
[    0.190280] efi: Reserved io resource for runtime region 0xfe000000-0xfe010fff
[    0.190280] efi: Reserved io resource for runtime region 0xf0000000-0xf7ffffff
[    0.190280] efi: Reserved io resource for runtime region 0x7a317000-0x7a40dfff
[    0.190280] efi: Reserved io resource for runtime region 0x79aa7000-0x7a316fff
[    0.190280] efi: Splitting 0x100000-0x763f5fff to reserve 0x74dd2000-0x74dd2fff

... but still only end up with new child resources and kexec_file
still overwriting ESRT:

00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
  000a0000-000bffff : PCI Bus 0000:00
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000e8000-000ebfff : PCI Bus 0000:00
  000ec000-000effff : PCI Bus 0000:00
  000f0000-000fffff : PCI Bus 0000:00
    000f0000-000fffff : System ROM
00100000-763f5fff : System RAM
  65000000-6affffff : Crash kernel
  74dd2000-74dd2fff : EFI Runtime
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00
79aa7000-7a40dfff : Reserved
  79aa7000-7a316fff : EFI runtime
  7a317000-7a40dfff : EFI runtime
7a40e000-7a40efff : System RAM

My kexec_file debugging currently looks like this:

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 79f252af7dee..3913129a6e22 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -542,6 +542,8 @@ static int locate_mem_hole_callback(struct resource *res, void *arg)
        if (end < kbuf->buf_min || start > kbuf->buf_max)
                return 0;
 
+       pr_debug("Considering 0x%llx-0x%llx\n", start, end);
+
        /*
         * Allocate memory top down with-in ram range. Otherwise bottom up
         * allocation.

... and outputs:

[ 1103.941425] kexec-bzImage64: Loaded purgatory at 0x98000
[ 1103.941428] kexec_file: Considering 0x1000-0x9efff
[ 1103.941428] kexec-bzImage64: Loaded boot_param, command line and misc at 0x96000 bufsz=0x1240 memsz=0x1240
[ 1103.941429] kexec_file: Considering 0x100000-0x763f5fff
[ 1103.941430] kexec-bzImage64: Loaded 64bit kernel at 0x73000000 bufsz=0x1140888 memsz=0x24b7000
[ 1103.941430] kexec-bzImage64: Final command line is: 
[ 1104.017909] kexec_file: Loading segment 0: buf=0x00000000d7398bfe bufsz=0x5000 mem=0x98000 memsz=0x6000
[ 1104.017936] kexec_file: Loading segment 1: buf=0x000000007238663b bufsz=0x1240 mem=0x96000 memsz=0x2000
[ 1104.017938] kexec_file: Loading segment 2: buf=0x00000000bb108eb1 bufsz=0x1140888 mem=0x73000000 memsz=0x24b7000
-- 
Thank you for your patience,
Michael

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: kexec_file overwrites reserved EFI ESRT memory
  2019-12-02  9:05           ` kexec_file overwrites reserved EFI ESRT memory Dave Young
  2019-12-02 23:45             ` Michael Weiser
@ 2019-12-03 10:01             ` Ard Biesheuvel
  2019-12-03 12:01               ` Dave Young
  1 sibling, 1 reply; 7+ messages in thread
From: Ard Biesheuvel @ 2019-12-03 10:01 UTC (permalink / raw)
  To: Dave Young, James Morse
  Cc: Michael Weiser, Eric W. Biederman, linux-efi, Kexec Mailing List,
	the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Linux Kernel Mailing List

On Mon, 2 Dec 2019 at 09:05, Dave Young <dyoung@redhat.com> wrote:
>
> Add more cc
> On 12/02/19 at 04:58pm, Dave Young wrote:
> > On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > > Hello Dave,
> > >
> > > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > >
> > > > > > Fundamentally when deciding where to place a new kernel kexec (either
> > > > > > user space or the in kernel kexec_file implementation) needs to be able
> > > > > > to ask the question which memory ares are reserved.
> > > [...]
> > > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > > /proc/iomem?
> > > > >
> > > > > My guess is that the focus was that some EFI structures need to be kept
> > > > > around accross the life cycle of *one* running kernel and
> > > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > > or Matt would know.
> > > > Can you check your un-reserved memory, if your memory falls into EFI
> > > > BOOT* then in X86 you can use something like below if it is not covered:
> > >
> > > > void __init efi_esrt_init(void)
> > > > {
> > > > ...
> > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > >           efi_mem_reserve(esrt_data, esrt_data_size);
> > > > ...
> > > > }
> > >
> > > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > > the esrt module reports at boot:
> > >
> > > [    0.001244] esrt: Reserving ESRT space from 0x0000000074dd2f98 to 0x0000000074dd2fd0.
> > >
> > > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > > code you quote reserve it using memblock_reserve() shown by
> > > memblock=debug:
> > >
> > > [    0.001246] memblock_reserve: [0x0000000074dd2f98-0x0000000074dd2fcf] efi_mem_reserve+0x1d/0x2b
> > >
> > > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > > as shown by efi=debug:
> > >
> > > [    0.178111] efi: mem10: [Boot Data          |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000074dd3000-0x0000000075becfff] (14MB)
> > > [    0.178113] efi: mem11: [Boot Data          |RUN|  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000074dd2000-0x0000000074dd2fff] (0MB)
> > > [    0.178114] efi: mem12: [Boot Data          |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000006d635000-0x0000000074dd1fff] (119MB)
> > >
> > > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > > not report this area as being free'd while the surrounding ones are:
> > >
> > > [    0.178369] __memblock_free_late: [0x0000000074dd3000-0x0000000075becfff] efi_free_boot_services+0x126/0x1f8
> > > [    0.178658] __memblock_free_late: [0x000000006d635000-0x0000000074dd1fff] efi_free_boot_services+0x126/0x1f8
> > >
> > > The esrt area does not show up in /proc/iomem though:
> > >
> > > 00100000-763f5fff : System RAM
> > >   62000000-62a00d80 : Kernel code
> > >   62c00000-62f15fff : Kernel rodata
> > >   63000000-630ea8bf : Kernel data
> > >   63fed000-641fffff : Kernel bss
> > >   65000000-6affffff : Crash kernel
> > >
> > > And thus kexec loads the new kernel right over that area as shown when
> > > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x73000000
> > > and 0x73000000+0x24be000 = 0x754be000):
> > >
> > > [  650.007695] kexec_file: Loading segment 0: buf=0x000000003a9c84d6 bufsz=0x5000 mem=0x98000 memsz=0x6000
> > > [  650.007699] kexec_file: Loading segment 1: buf=0x0000000017b2b9e6 bufsz=0x1240 mem=0x96000 memsz=0x2000
> > > [  650.007703] kexec_file: Loading segment 2: buf=0x00000000fdf72ba2 bufsz=0x1150888 mem=0x73000000 memsz=0x24be000
> > >
> > > ... because it looks for any memory hole large enough in iomem resources
> > > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > > excluded from on my system.
> > >
> > > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > > the area with efi.memmap and installs it using efi_memmap_install().
> > > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > > *not* reserve that memory in any way.
> > >
> > > > Unfortunately I noticed there are different requirements/ways for
> > > > different types of "reserved" memory.  But that is another topic..
> > >
> > > I tried to reserve the area with something like this:
> > >
> > > t a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > > index 4de244683a7e..b86a5df027a2 100644
> > > --- a/arch/x86/platform/efi/quirks.c
> > > +++ b/arch/x86/platform/efi/quirks.c
> > > @@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> > >         efi_memory_desc_t md;
> > >         int num_entries;
> > >         void *new;
> > > +       struct resource *res;
> > >
> > >         if (efi_mem_desc_lookup(addr, &md) ||
> > >             md.type != EFI_BOOT_SERVICES_DATA) {
> > > @@ -294,6 +295,21 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> > >         early_memunmap(new, new_size);
> > >
> > >         efi_memmap_install(new_phys, num_entries);
> > > +
> > > +       res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES);
> > > +       if (!res) {
> > > +               pr_err("Failed to allocate EFI io resource allocator for "
> > > +                               "0x%llx:0x%llx", mr.range.start, mr.range.end);
> > > +               return;
> > > +       }
> > > +
> > > +       res->start      = mr.range.start;
> > > +       res->end        = mr.range.end;
> > > +       res->name       = "EFI runtime";
> > > +       res->flags      = IORESOURCE_MEM | IORESOURCE_BUSY;
> > > +       res->desc       = IORES_DESC_NONE;
> > > +
> > > +       insert_resource(&iomem_resource, res);
> > >  }
> > >
> > >  /*
> > >
> > > ... but failed miserably in terms of the kernel not booting because I
> > > have no experience whatsoever in programming and debugging early kernel
> > > init. But I am somewhat keen to ride the learning curve here. :)
> > >
> > > Am I on the right track or were you a couple of leaps ahead of me
> > > already and I just didn't get the question?
> >
> > It seems a serious problem, the EFI modified memmap does not get an
> > /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> > X86.
> >
> > Can you try below diff see if it works for you? (not tested, and need
> > explicitly 'add_efi_memmap' in kernel cmdline param)
> >
> > There is an question from Sai about why add_efi_memmap is not enabled by
> > default:
> > https://www.spinics.net/lists/linux-mm/msg185166.html
> >
> > Long time ago the add_efi_memmap is only enabled in case we explict
> > enable it on cmdline, I'm not sure if we can do it by default, maybe we
> > should.   Need opinion from X86 maintainers..
> >
> > diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> > index 43a82e59c59d..eddaac6131cf 100644
> > --- a/arch/x86/include/asm/efi.h
> > +++ b/arch/x86/include/asm/efi.h
> > @@ -243,6 +243,7 @@ static inline bool efi_is_64bit(void)
> >
> >  extern bool efi_reboot_required(void);
> >  extern bool efi_is_table_address(unsigned long phys_addr);
> > +extern void do_add_efi_memmap(void);
> >
> >  #else
> >  static inline void parse_efi_setup(u64 phys_addr, u32 data_len) {}
> > diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> > index 425e025341db..39e28ec76522 100644
> > --- a/arch/x86/platform/efi/efi.c
> > +++ b/arch/x86/platform/efi/efi.c
> > @@ -149,10 +149,12 @@ void __init efi_find_mirror(void)
> >   * (zeropage) memory map.
> >   */
> >
> > -static void __init do_add_efi_memmap(void)
> > +void __init do_add_efi_memmap(void)
> >  {
> >       efi_memory_desc_t *md;
> >
> > +     if (!add_efi_memmap)
> > +             return;
> >       for_each_efi_memory_desc(md) {
> >               unsigned long long start = md->phys_addr;
> >               unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
> > @@ -224,8 +226,7 @@ int __init efi_memblock_x86_reserve_range(void)
> >       if (rv)
> >               return rv;
> >
> > -     if (add_efi_memmap)
> > -             do_add_efi_memmap();
> > +     do_add_efi_memmap();
> >
> >       WARN(efi.memmap.desc_version != 1,
> >            "Unexpected EFI_MEMORY_DESCRIPTOR version %ld",
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 3b9fd679cea9..cfda591e51e3 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -496,6 +496,7 @@ void __init efi_free_boot_services(void)
> >               pr_err("Could not install new EFI memmap\n");
> >               return;
> >       }
> > +     do_add_efi_memmap();
> >  }
> >
> >  /*
>

We are seeing related issues on ARM where memory referenced by UEFI
configuration tables is clobbered by the kexec tools.

Given that these tables may be located in EFI boot services data
regions, which the kernel itself knows not to touch during early boot,
I think the solution here is to teach the kexec userland tools to
avoid such regions when placing the kernel, initrd and other bits
(such as the DT on ARM) in memory.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kexec_file overwrites reserved EFI ESRT memory
  2019-12-02 23:45             ` Michael Weiser
@ 2019-12-03 11:54               ` Dave Young
  2019-12-03 21:11                 ` Michael Weiser
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Young @ 2019-12-03 11:54 UTC (permalink / raw)
  To: Michael Weiser
  Cc: linux-efi, Ard Biesheuvel, x86, kexec, linux-kernel, Ingo Molnar,
	Borislav Petkov, Eric W. Biederman, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds

On 12/03/19 at 12:45am, Michael Weiser wrote:
> Hi Dave,
> 
> On Mon, Dec 02, 2019 at 05:05:20PM +0800, Dave Young wrote:
> 
> > > It seems a serious problem, the EFI modified memmap does not get an
> > > /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> > > X86.
> > > 
> > > There is an question from Sai about why add_efi_memmap is not enabled by
> > > default:
> > > https://www.spinics.net/lists/linux-mm/msg185166.html
> 
> Incidentally, a data point I did not think to mention: I do boot the
> kernel as EFI application directly from the firmware as a boot entry
> with compiled in initrd and command line:
> 
> $ grep EFI nobak/kernel/linux/.config
> CONFIG_EFI=y
> CONFIG_EFI_STUB=y
> # CONFIG_EFI_MIXED is not set
> CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
> # EFI (Extensible Firmware Interface) Support
> CONFIG_EFI_VARS=m
> CONFIG_EFI_ESRT=y
> CONFIG_EFI_VARS_PSTORE=m
> # CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
> CONFIG_EFI_RUNTIME_MAP=y
> # CONFIG_EFI_FAKE_MEMMAP is not set
> CONFIG_EFI_RUNTIME_WRAPPERS=y
> # CONFIG_EFI_BOOTLOADER_CONTROL is not set
> # CONFIG_EFI_CAPSULE_LOADER is not set
> # CONFIG_EFI_TEST is not set
> # CONFIG_EFI_RCI2_TABLE is not set
> # end of EFI (Extensible Firmware Interface) Support
> CONFIG_UEFI_CPER=y
> CONFIG_UEFI_CPER_X86=y
> CONFIG_EFI_EARLYCON=y
> CONFIG_EFI_PARTITION=y
> CONFIG_FB_EFI=y
> CONFIG_EFIVAR_FS=y
> # CONFIG_EFI_PGT_DUMP is not set
> 
> $ grep CMDLINE nobak/kernel/linux/.config
> CONFIG_CMDLINE_BOOL=y
> CONFIG_CMDLINE="root=UUID=97[...]e4 rd.luks.uuid=8a[...]c3 rd.luks.allow-discards=8a[...]c3 mem_sleep_default=deep resume=UUID=97[...]e4 resume_offset=96256 efi=debug memblock=debug"
> CONFIG_CMDLINE_OVERRIDE=y
> # CONFIG_BLK_CMDLINE_PARSER is not set
> # CONFIG_CMDLINE_PARTITION is not set
> CONFIG_FB_CMDLINE=y
> 
> $ efibootmgr -v
> BootCurrent: 000A
> Timeout: 2 seconds
> BootOrder: 000A,0009,0008,0005,0007,0006,0004,0002,0001,0000,0003
> [...]
> Boot0005* gentoo-5.4.0-next-20191127+-clear
> HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.0-next-20191127+-clear)
> [...]
> Boot000A* gentoo-5.4.1-gentoo
> HD(1,GPT,e7[...]f2,0x800,0x64000)/File(\kernel-5.4.1-gentoo)
> 
> So there's no boot loader that could construct an e820 table for the
> kernel to consume. I understand it's then up to the EFI stub to come up
> with a e820 table from the EFI memory map.
> 
> > > Long time ago the add_efi_memmap is only enabled in case we explict
> > > enable it on cmdline, I'm not sure if we can do it by default, maybe we
> > > should.   Need opinion from X86 maintainers..
> > > Can you try below diff see if it works for you? (not tested, and need
> > > explicitly 'add_efi_memmap' in kernel cmdline param)
> 
> Neither adding add_efi_memmap nor adding your patch and setting that option
> does make the ESRT memory region appear in /proc/iomem. kexec_file still
> loads the kernel across the ESRT region.
> 

Hmm, sorry, my bad, actuall add_efi_memmap does not consider the
EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.

Will read your replied information later, did not get time today, but
probably below chunk can help?

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 3b9fd679cea9..516307617621 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
 	early_memunmap(new, new_size);
 
 	efi_memmap_install(new_phys, num_entries);
+	e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
+	e820__update_table(e820_table);
 }
 
 /*

Thanks
Dave


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: kexec_file overwrites reserved EFI ESRT memory
  2019-12-03 10:01             ` Ard Biesheuvel
@ 2019-12-03 12:01               ` Dave Young
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Young @ 2019-12-03 12:01 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: James Morse, Michael Weiser, Eric W. Biederman, linux-efi,
	Kexec Mailing List, the arch/x86 maintainers, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Linus Torvalds,
	Linux Kernel Mailing List

On 12/03/19 at 10:01am, Ard Biesheuvel wrote:
> On Mon, 2 Dec 2019 at 09:05, Dave Young <dyoung@redhat.com> wrote:
> >
> > Add more cc
> > On 12/02/19 at 04:58pm, Dave Young wrote:
> > > On 11/29/19 at 04:27pm, Michael Weiser wrote:
> > > > Hello Dave,
> > > >
> > > > On Mon, Nov 25, 2019 at 01:52:01PM +0800, Dave Young wrote:
> > > >
> > > > > > > Fundamentally when deciding where to place a new kernel kexec (either
> > > > > > > user space or the in kernel kexec_file implementation) needs to be able
> > > > > > > to ask the question which memory ares are reserved.
> > > > [...]
> > > > > > > So my question is why doesn't the ESRT reservation wind up in
> > > > > > > /proc/iomem?
> > > > > >
> > > > > > My guess is that the focus was that some EFI structures need to be kept
> > > > > > around accross the life cycle of *one* running kernel and
> > > > > > memblock_reserve() was enough for that. Marking them so they survive
> > > > > > kexecing another kernel might just never have cropped up thus far. Ard
> > > > > > or Matt would know.
> > > > > Can you check your un-reserved memory, if your memory falls into EFI
> > > > > BOOT* then in X86 you can use something like below if it is not covered:
> > > >
> > > > > void __init efi_esrt_init(void)
> > > > > {
> > > > > ...
> > > > >   pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end);
> > > > >   if (md.type == EFI_BOOT_SERVICES_DATA)
> > > > >           efi_mem_reserve(esrt_data, esrt_data_size);
> > > > > ...
> > > > > }
> > > >
> > > > Please bear with me if I'm a bit slow on the uptake here: On my machine,
> > > > the esrt module reports at boot:
> > > >
> > > > [    0.001244] esrt: Reserving ESRT space from 0x0000000074dd2f98 to 0x0000000074dd2fd0.
> > > >
> > > > This area is of type "Boot Data" (== BOOT_SERVICES_DATA) which makes the
> > > > code you quote reserve it using memblock_reserve() shown by
> > > > memblock=debug:
> > > >
> > > > [    0.001246] memblock_reserve: [0x0000000074dd2f98-0x0000000074dd2fcf] efi_mem_reserve+0x1d/0x2b
> > > >
> > > > It also calls into arch/x86/platform/efi/quirks.c:efi_arch_mem_reserve()
> > > > which tags it as EFI_MEMORY_RUNTIME while the surrounding ones aren't
> > > > as shown by efi=debug:
> > > >
> > > > [    0.178111] efi: mem10: [Boot Data          |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000074dd3000-0x0000000075becfff] (14MB)
> > > > [    0.178113] efi: mem11: [Boot Data          |RUN|  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000074dd2000-0x0000000074dd2fff] (0MB)
> > > > [    0.178114] efi: mem12: [Boot Data          |   |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000006d635000-0x0000000074dd1fff] (119MB)
> > > >
> > > > This prevents arch/x86/platform/efi/quirks.c:efi_free_boot_services()
> > > > from calling __memblock_free_late() on it. And indeed, memblock=debug does
> > > > not report this area as being free'd while the surrounding ones are:
> > > >
> > > > [    0.178369] __memblock_free_late: [0x0000000074dd3000-0x0000000075becfff] efi_free_boot_services+0x126/0x1f8
> > > > [    0.178658] __memblock_free_late: [0x000000006d635000-0x0000000074dd1fff] efi_free_boot_services+0x126/0x1f8
> > > >
> > > > The esrt area does not show up in /proc/iomem though:
> > > >
> > > > 00100000-763f5fff : System RAM
> > > >   62000000-62a00d80 : Kernel code
> > > >   62c00000-62f15fff : Kernel rodata
> > > >   63000000-630ea8bf : Kernel data
> > > >   63fed000-641fffff : Kernel bss
> > > >   65000000-6affffff : Crash kernel
> > > >
> > > > And thus kexec loads the new kernel right over that area as shown when
> > > > enabling -DDEBUG on kexec_file.c (0x74dd3000 being inbetween 0x73000000
> > > > and 0x73000000+0x24be000 = 0x754be000):
> > > >
> > > > [  650.007695] kexec_file: Loading segment 0: buf=0x000000003a9c84d6 bufsz=0x5000 mem=0x98000 memsz=0x6000
> > > > [  650.007699] kexec_file: Loading segment 1: buf=0x0000000017b2b9e6 bufsz=0x1240 mem=0x96000 memsz=0x2000
> > > > [  650.007703] kexec_file: Loading segment 2: buf=0x00000000fdf72ba2 bufsz=0x1150888 mem=0x73000000 memsz=0x24be000
> > > >
> > > > ... because it looks for any memory hole large enough in iomem resources
> > > > tagged as System RAM, which 0x74dd2000-0x74dd2fff would then need to be
> > > > excluded from on my system.
> > > >
> > > > Looking some more at efi_arch_mem_reserve() I see that it also registers
> > > > the area with efi.memmap and installs it using efi_memmap_install().
> > > > which seems to call memremap(MEMREMAP_WB) on it. From my understanding
> > > > of the comments in the source of memremap(), MEMREMAP_WB does specifically
> > > > *not* reserve that memory in any way.
> > > >
> > > > > Unfortunately I noticed there are different requirements/ways for
> > > > > different types of "reserved" memory.  But that is another topic..
> > > >
> > > > I tried to reserve the area with something like this:
> > > >
> > > > t a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > > > index 4de244683a7e..b86a5df027a2 100644
> > > > --- a/arch/x86/platform/efi/quirks.c
> > > > +++ b/arch/x86/platform/efi/quirks.c
> > > > @@ -249,6 +249,7 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> > > >         efi_memory_desc_t md;
> > > >         int num_entries;
> > > >         void *new;
> > > > +       struct resource *res;
> > > >
> > > >         if (efi_mem_desc_lookup(addr, &md) ||
> > > >             md.type != EFI_BOOT_SERVICES_DATA) {
> > > > @@ -294,6 +295,21 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> > > >         early_memunmap(new, new_size);
> > > >
> > > >         efi_memmap_install(new_phys, num_entries);
> > > > +
> > > > +       res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES);
> > > > +       if (!res) {
> > > > +               pr_err("Failed to allocate EFI io resource allocator for "
> > > > +                               "0x%llx:0x%llx", mr.range.start, mr.range.end);
> > > > +               return;
> > > > +       }
> > > > +
> > > > +       res->start      = mr.range.start;
> > > > +       res->end        = mr.range.end;
> > > > +       res->name       = "EFI runtime";
> > > > +       res->flags      = IORESOURCE_MEM | IORESOURCE_BUSY;
> > > > +       res->desc       = IORES_DESC_NONE;
> > > > +
> > > > +       insert_resource(&iomem_resource, res);
> > > >  }
> > > >
> > > >  /*
> > > >
> > > > ... but failed miserably in terms of the kernel not booting because I
> > > > have no experience whatsoever in programming and debugging early kernel
> > > > init. But I am somewhat keen to ride the learning curve here. :)
> > > >
> > > > Am I on the right track or were you a couple of leaps ahead of me
> > > > already and I just didn't get the question?
> > >
> > > It seems a serious problem, the EFI modified memmap does not get an
> > > /proc/iomem resource update, but kexec_file relies on /proc/iomem in
> > > X86.
> > >
> > > Can you try below diff see if it works for you? (not tested, and need
> > > explicitly 'add_efi_memmap' in kernel cmdline param)
> > >
> > > There is an question from Sai about why add_efi_memmap is not enabled by
> > > default:
> > > https://www.spinics.net/lists/linux-mm/msg185166.html
> > >
> > > Long time ago the add_efi_memmap is only enabled in case we explict
> > > enable it on cmdline, I'm not sure if we can do it by default, maybe we
> > > should.   Need opinion from X86 maintainers..
> > >
> > > diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
> > > index 43a82e59c59d..eddaac6131cf 100644
> > > --- a/arch/x86/include/asm/efi.h
> > > +++ b/arch/x86/include/asm/efi.h
> > > @@ -243,6 +243,7 @@ static inline bool efi_is_64bit(void)
> > >
> > >  extern bool efi_reboot_required(void);
> > >  extern bool efi_is_table_address(unsigned long phys_addr);
> > > +extern void do_add_efi_memmap(void);
> > >
> > >  #else
> > >  static inline void parse_efi_setup(u64 phys_addr, u32 data_len) {}
> > > diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> > > index 425e025341db..39e28ec76522 100644
> > > --- a/arch/x86/platform/efi/efi.c
> > > +++ b/arch/x86/platform/efi/efi.c
> > > @@ -149,10 +149,12 @@ void __init efi_find_mirror(void)
> > >   * (zeropage) memory map.
> > >   */
> > >
> > > -static void __init do_add_efi_memmap(void)
> > > +void __init do_add_efi_memmap(void)
> > >  {
> > >       efi_memory_desc_t *md;
> > >
> > > +     if (!add_efi_memmap)
> > > +             return;
> > >       for_each_efi_memory_desc(md) {
> > >               unsigned long long start = md->phys_addr;
> > >               unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
> > > @@ -224,8 +226,7 @@ int __init efi_memblock_x86_reserve_range(void)
> > >       if (rv)
> > >               return rv;
> > >
> > > -     if (add_efi_memmap)
> > > -             do_add_efi_memmap();
> > > +     do_add_efi_memmap();
> > >
> > >       WARN(efi.memmap.desc_version != 1,
> > >            "Unexpected EFI_MEMORY_DESCRIPTOR version %ld",
> > > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > > index 3b9fd679cea9..cfda591e51e3 100644
> > > --- a/arch/x86/platform/efi/quirks.c
> > > +++ b/arch/x86/platform/efi/quirks.c
> > > @@ -496,6 +496,7 @@ void __init efi_free_boot_services(void)
> > >               pr_err("Could not install new EFI memmap\n");
> > >               return;
> > >       }
> > > +     do_add_efi_memmap();
> > >  }
> > >
> > >  /*
> >
> 
> We are seeing related issues on ARM where memory referenced by UEFI
> configuration tables is clobbered by the kexec tools.

Oh, the arm implementation is quite different, not sure if these are
same issues although both looks reserved memory related :)

> 
> Given that these tables may be located in EFI boot services data
> regions, which the kernel itself knows not to touch during early boot,
> I think the solution here is to teach the kexec userland tools to
> avoid such regions when placing the kernel, initrd and other bits
> (such as the DT on ARM) in memory.
> 

It seems hard to do so, the boot services data is only awared as EFI
memory descriptors, and some of them are freed and regarded as System
RAM in /proc/iomem,  but part of them eg. BGRT image part are kept as
reserved and with md attribute EFI_MEMORY_RUNTIME, these reserved part
are not showed as Reserved in /proc/iomem resources.

Also we have both userland kexec loader and the in kernel loader
kexec_file_load, so it sounds better if we can fix in kernel to mark them as reserved
resources?

Thanks
Dave


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kexec_file overwrites reserved EFI ESRT memory
  2019-12-03 11:54               ` Dave Young
@ 2019-12-03 21:11                 ` Michael Weiser
  2019-12-04  5:22                   ` Dave Young
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Weiser @ 2019-12-03 21:11 UTC (permalink / raw)
  To: Dave Young
  Cc: linux-efi, Ard Biesheuvel, x86, kexec, linux-kernel, Ingo Molnar,
	Borislav Petkov, Eric W. Biederman, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds

Hi Dave,

On Tue, Dec 03, 2019 at 07:54:35PM +0800, Dave Young wrote:

> > Neither adding add_efi_memmap nor adding your patch and setting that option
> > does make the ESRT memory region appear in /proc/iomem. kexec_file still
> > loads the kernel across the ESRT region.
> Hmm, sorry, my bad, actuall add_efi_memmap does not consider the
> EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.

> Will read your replied information later, did not get time today, but
> probably below chunk can help?

> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 3b9fd679cea9..516307617621 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
>  	early_memunmap(new, new_size);

>  	efi_memmap_install(new_phys, num_entries);
> +	e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> +	e820__update_table(e820_table);
>  }

>  /*

Yes, that did it:

00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
  000a0000-000bffff : PCI Bus 0000:00
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000e8000-000ebfff : PCI Bus 0000:00
  000ec000-000effff : PCI Bus 0000:00
  000f0000-000fffff : PCI Bus 0000:00
    000f0000-000fffff : System ROM
00100000-74dd1fff : System RAM
  65000000-6affffff : Crash kernel
74dd2000-74dd2fff : Reserved                   <----- ESRT
74dd3000-763f5fff : System RAM
763f6000-79974fff : Reserved
79975000-799f1fff : ACPI Tables
799f2000-79aa6fff : ACPI Non-volatile Storage
  79a17000-79a17fff : USBC000:00

[    0.001381] esrt: Reserving ESRT space from 0x0000000074dd2f98 to 0x0000000074dd2fd0.
[    0.001382] memblock_reserve: [0x0000000074dd2f98-0x0000000074dd2fcf] efi_mem_reserve+0x1d/0x2b
[    0.001383] memblock_reserve: [0x000000000009e640-0x000000000009efcf] memblock_alloc_range_nid+0x93/0xfa
[    0.001384] e820: update [mem 0x74dd2000-0x74dd2fff] usable ==> reserved
[...]
[    0.043610] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.043611] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x6b/0xca
[    0.043612] memblock_reserve: [0x000000047dff95c0-0x000000047dff95df] memblock_alloc_range_nid+0x93/0xfa
[    0.043613] PM: Registered nosave memory: [mem 0x0009f000-0x000fffff]
[    0.043615] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x6b/0xca
[    0.043616] memblock_reserve: [0x000000047dff9580-0x000000047dff959f] memblock_alloc_range_nid+0x93/0xfa
[    0.043617] PM: Registered nosave memory: [mem 0x74dd2000-0x74dd2fff]       <---- ESRT
[    0.043618] memblock_alloc_try_nid: 32 bytes align=0x40 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 __register_nosave_region+0x6b/0xca
[    0.043619] memblock_reserve: [0x000000047dff9540-0x000000047dff955f] memblock_alloc_range_nid+0x93/0xfa
[    0.043620] PM: Registered nosave memory: [mem 0x763f6000-0x79974fff]
[    0.043620] PM: Registered nosave memory: [mem 0x79975000-0x799f1fff]
[    0.043621] PM: Registered nosave memory: [mem 0x799f2000-0x79aa6fff]
[    0.043621] PM: Registered nosave memory: [mem 0x79aa7000-0x7a40dfff]
[...]
[    5.993928] PCI: pci_cache_line_size set to 64 bytes
[    5.994563] e820: reserve RAM buffer [mem 0x0009f000-0x0009ffff]
[    5.994565] e820: reserve RAM buffer [mem 0x74dd2000-0x77ffffff]            <----- ESRT
[    5.994565] e820: reserve RAM buffer [mem 0x763f6000-0x77ffffff]
[    5.994566] e820: reserve RAM buffer [mem 0x7a40f000-0x7bffffff]
[    5.994567] e820: reserve RAM buffer [mem 0x47e000000-0x47fffffff]
[    5.995513] acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[    5.995549] acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
[...]
[   86.508053] kexec-bzImage64: Loaded purgatory at 0x98000
[   86.508056] kexec_file: Considering 0x1000-0x9efff
[   86.508057] kexec-bzImage64: Loaded boot_param, command line and misc at 0x96000 bufsz=0x1240 memsz=0x1240
[   86.508057] kexec_file: Considering 0x100000-0x74dd1fff
[   86.508058] kexec-bzImage64: Loaded 64bit kernel at 0x72000000 bufsz=0x1140888 memsz=0x24b7000
[   86.508058] kexec-bzImage64: Final command line is: 
[   86.584668] kexec_file: Loading segment 0: buf=0x00000000d5ec82bc bufsz=0x5000 mem=0x98000 memsz=0x6000
[   86.584672] kexec_file: Loading segment 1: buf=0x00000000af539c69 bufsz=0x1240 mem=0x96000 memsz=0x2000
[   86.584674] kexec_file: Loading segment 2: buf=0x0000000029f9b9a8 bufsz=0x1140888 mem=0x72000000 memsz=0x24b7000           <---- not ESRT :)

And no more invalid version error message from the kexec'd kernel.
-- 
Thanks,
Michael

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kexec_file overwrites reserved EFI ESRT memory
  2019-12-03 21:11                 ` Michael Weiser
@ 2019-12-04  5:22                   ` Dave Young
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Young @ 2019-12-04  5:22 UTC (permalink / raw)
  To: Michael Weiser
  Cc: linux-efi, Ard Biesheuvel, x86, kexec, linux-kernel, Ingo Molnar,
	Borislav Petkov, Eric W. Biederman, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds

On 12/03/19 at 10:11pm, Michael Weiser wrote:
> Hi Dave,
> 
> On Tue, Dec 03, 2019 at 07:54:35PM +0800, Dave Young wrote:
> 
> > > Neither adding add_efi_memmap nor adding your patch and setting that option
> > > does make the ESRT memory region appear in /proc/iomem. kexec_file still
> > > loads the kernel across the ESRT region.
> > Hmm, sorry, my bad, actuall add_efi_memmap does not consider the
> > EFI_MEMORY_RUNTIME attribute, it only reads the memory descriptor types.
> 
> > Will read your replied information later, did not get time today, but
> > probably below chunk can help?
> 
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 3b9fd679cea9..516307617621 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -293,6 +293,8 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> >  	early_memunmap(new, new_size);
> 
> >  	efi_memmap_install(new_phys, num_entries);
> > +	e820__range_update(addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
> > +	e820__update_table(e820_table);
> >  }
> 
> >  /*
> 
> Yes, that did it:
> 
> 00000000-00000fff : Reserved
> 00001000-0009efff : System RAM
> 0009f000-000fffff : Reserved
>   000a0000-000bffff : PCI Bus 0000:00
>   000e0000-000e3fff : PCI Bus 0000:00
>   000e4000-000e7fff : PCI Bus 0000:00
>   000e8000-000ebfff : PCI Bus 0000:00
>   000ec000-000effff : PCI Bus 0000:00
>   000f0000-000fffff : PCI Bus 0000:00
>     000f0000-000fffff : System ROM
> 00100000-74dd1fff : System RAM
>   65000000-6affffff : Crash kernel
> 74dd2000-74dd2fff : Reserved                   <----- ESRT
> 74dd3000-763f5fff : System RAM
> 763f6000-79974fff : Reserved
> 79975000-799f1fff : ACPI Tables
> 799f2000-79aa6fff : ACPI Non-volatile Storage
>   79a17000-79a17fff : USBC000:00

Ok, good to know it works.  I will think about it and file a patch
later.  There are more things to consider, eg. kexec reboot multiple
times, userspace kexec loader etc.

If we choose to fix it in kexec_file path to avoid those region then we
need to do same in userspace, there will be compatibility issues so I
would still prefer to go with this way you tested.

BTW, on my laptop the ESRT stays in EFI runtime area so I do not see the
problem.  This should be machine/firmware specific.

Here is the info on my laptop:
[    0.000000] efi: mem34: [Runtime Data       |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000007a4b0000-0x000000007a676fff] (1MB)
[    0.020670] esrt: Reserving ESRT space from 0x000000007a4ec000 to 0x000000007a4ec088.

Thanks
Dave


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-12-04  5:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20191122180552.GA32104@weiser.dinsnail.net>
     [not found] ` <87blt3y949.fsf@x220.int.ebiederm.org>
     [not found]   ` <20191122210702.GE32104@weiser.dinsnail.net>
     [not found]     ` <20191125055201.GA6569@dhcp-128-65.nay.redhat.com>
     [not found]       ` <20191129152700.GA8286@weiser.dinsnail.net>
     [not found]         ` <20191202085829.GA15808@dhcp-128-65.nay.redhat.com>
2019-12-02  9:05           ` kexec_file overwrites reserved EFI ESRT memory Dave Young
2019-12-02 23:45             ` Michael Weiser
2019-12-03 11:54               ` Dave Young
2019-12-03 21:11                 ` Michael Weiser
2019-12-04  5:22                   ` Dave Young
2019-12-03 10:01             ` Ard Biesheuvel
2019-12-03 12:01               ` Dave Young

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).