All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] arm64: export memblock_reserve()d regions via /proc/iomem
@ 2018-04-25 13:22 James Morse
  2018-04-26 17:44 ` Tyler Baicar
  2018-05-02 10:35 ` James Morse
  0 siblings, 2 replies; 5+ messages in thread
From: James Morse @ 2018-04-25 13:22 UTC (permalink / raw)
  To: linux-arm-kernel

There has been some confusion around what is necessary to prevent kexec
overwriting important memory regions. memblock: reserve, or nomap?
Only memblock nomap regions are reported via /proc/iomem, kexec's
user-space doesn't know about memblock_reserve()d regions.

Until commit f56ab9a5b73ca ("efi/arm: Don't mark ACPI reclaim memory
as MEMBLOCK_NOMAP") the ACPI tables were nomap, now they are reserved
and thus possible for kexec to overwrite with the new kernel or initrd.
But this was always broken, as the UEFI memory map is also reserved
and not marked as nomap.

It turns out that while kexec-tools will pick up reserved sections in
iomem that look like:
| 80000000-dfffffff : System RAM
|   81000000-8158ffff : reserved

The reserved section is ignored by its 'locate_hole()' code. To fix
this, we need to describe memblock_reserved() and nomap regions as
'reserved' at the top level:
| 80000000-80ffffff : System RAM
| 81000000-8158ffff : reserved
| 81590000-dfffffff : System RAM

To complicate matters, our existing named sections are described as
being part of 'System RAM', but they are also memblock_reserve()d.
We need to keep this in-case something is depending on it. To do this
involves walking memblock multiple times:

First add the 'System RAM' sections that are memory and not-reserved.
These may be smaller than a page if part of the page is reserved. In
this case we want to describe the page as reserved, so we round these
regions down to the smallest page-size region, which may be empty.
(We round-up the memblock_reserved() regions to fill in the gaps).

The boundaries for kernel_data are changed because paging_init() punches
holes in the _sdata -> _edata region, and this code can't add a named
region that crosses memblock_reserve()d<->normal-memory regions. The
new helpers will catch any more overlapping regions that occur.

Lastly, we add the memblock_reserved() regions using
reserve_region_with_split(), which will fill in the gaps between the
existing named regions. (e.g. the regions occupied by the __init code).
This call uses the slab allocator, so has to run from an initcall.

Reported-by: Bhupesh Sharma <bhupesh.linux@gmail.com>
Reported-by: Tyler Baicar <tbaicar@codeaurora.org>
Suggested-by: Akashi Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: James Morse <james.morse@arm.com>
CC: Ard Biesheuvel <ard.biesheuvel@linaro.org>
CC: Mark Rutland <mark.rutland@arm.com>

---
If we do send this to stable:
Fixes: d28f6df1305a ("arm64/kexec: Add core kexec support")

If we're happy to modify user-sapce, we can do much neater things.

It looks like UEFI's careful 'memory map not mapped' code had me convinced
it was nomap.

 arch/arm64/kernel/setup.c | 136 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 113 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 30ad2f085d1f..e82c0d5c70f8 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -202,45 +202,135 @@ static void __init setup_machine_fdt(phys_addr_t dt_phys)
 	dump_stack_set_arch_desc("%s (DT)", name);
 }
 
+static struct resource * __init add_standard_resources(phys_addr_t start,
+						       phys_addr_t end,
+						       bool reserved)
+{
+	struct resource *res;
+
+	res = alloc_bootmem_low(sizeof(*res));
+
+	if (reserved) {
+		res->name  = "reserved";
+		res->flags = IORESOURCE_MEM;
+	} else {
+		res->name  = "System RAM";
+		res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+	}
+	res->start = start;
+	res->end = end;
+
+	if (request_resource_conflict(&iomem_resource, res)) {
+		pr_warn_once("Attempted to add overlapping resources\n");
+		return NULL;
+	}
+
+	return res;
+}
+
+static void __init add_named_resources(struct resource *named_resource)
+{
+	phys_addr_t start, end;
+	struct resource *res;
+
+	start = __pfn_to_phys(PFN_DOWN(named_resource->start));
+	end = __pfn_to_phys(PFN_UP(named_resource->end)) - 1;
+	res = add_standard_resources(start, end, false);
+	if (res)
+		request_resource(res, named_resource);
+}
+
 static void __init request_standard_resources(void)
 {
+	phys_addr_t start, end;
 	struct memblock_region *region;
 	struct resource *res;
+	u64 i;
+	int num_res = 0;
 
 	kernel_code.start   = __pa_symbol(_text);
 	kernel_code.end     = __pa_symbol(__init_begin - 1);
 	kernel_data.start   = __pa_symbol(_sdata);
-	kernel_data.end     = __pa_symbol(_end - 1);
+	kernel_data.end     = __pa_symbol(_edata - 1);
 
-	for_each_memblock(memory, region) {
-		res = alloc_bootmem_low(sizeof(*res));
-		if (memblock_is_nomap(region)) {
-			res->name  = "reserved";
-			res->flags = IORESOURCE_MEM;
-		} else {
-			res->name  = "System RAM";
-			res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-		}
-		res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
-		res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
+	/*
+	 * We can't allocate memory while walking free memory, count the number
+	 * of struct resource's we will need. Round start/end to the smallest
+	 * page-size region as we round the reserved regions up.
+	 */
+	for_each_free_mem_range(i, NUMA_NO_NODE, 0, &start, &end, NULL) {
+		start = ALIGN(start, PAGE_SIZE);
+		end = ALIGN_DOWN(end, PAGE_SIZE) - 1;
+		if (end > start)
+			num_res++;
+	}
+
+	/* our allocation may split a free memblock */
+	num_res++;
+	res = alloc_bootmem_low(num_res * sizeof(*res));
 
-		request_resource(&iomem_resource, res);
+	/*
+	 * Add the non-reserved memory regions. flag=0 means we skip nomap
+	 * regions too.
+	 */
+	for_each_free_mem_range(i, NUMA_NO_NODE, 0, &start, &end, NULL) {
+		if (WARN_ON(!num_res))
+			return;
+
+		res->name  = "System RAM";
+		res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+
+		res->start = ALIGN(start, PAGE_SIZE);
+		res->end = ALIGN_DOWN(end, PAGE_SIZE) - 1;
+		if (res->end > res->start) {
+			request_resource(&iomem_resource, res);
+			res++;
+			num_res--;
+		}
+	}
 
-		if (kernel_code.start >= res->start &&
-		    kernel_code.end <= res->end)
-			request_resource(res, &kernel_code);
-		if (kernel_data.start >= res->start &&
-		    kernel_data.end <= res->end)
-			request_resource(res, &kernel_data);
+	/* Add the named reserved regions and their system-ram parents */
+	add_named_resources(&kernel_code);
+	add_named_resources(&kernel_data);
 #ifdef CONFIG_KEXEC_CORE
-		/* Userspace will find "Crash kernel" region in /proc/iomem. */
-		if (crashk_res.end && crashk_res.start >= res->start &&
-		    crashk_res.end <= res->end)
-			request_resource(res, &crashk_res);
+	if (crashk_res.end)
+		add_named_resources(&crashk_res);
 #endif
+
+	/* Add the nomap regions */
+	for_each_memblock(memory, region) {
+		if (!memblock_is_nomap(region))
+			continue;
+
+		start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
+		end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
+		add_standard_resources(start, end, true);
 	}
 }
 
+static int __init reserve_memblock_reserved_regions(void)
+{
+	phys_addr_t start, end, roundup_end = 0;
+	u64 i;
+
+	for_each_reserved_mem_region(i, &start, &end) {
+		if (end <= roundup_end)
+			continue; /* done already */
+
+		start = __pfn_to_phys(PFN_DOWN(start));
+		end = __pfn_to_phys(PFN_UP(end)) - 1;
+		roundup_end = end;
+
+		reserve_region_with_split(&iomem_resource, start, end,
+					  "reserved");
+	}
+
+	return 0;
+}
+/* reserve_region_with_split() requires the slab allocator: */
+arch_initcall(reserve_memblock_reserved_regions);
+
+
 u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
 
 void __init setup_arch(char **cmdline_p)
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH] arm64: export memblock_reserve()d regions via /proc/iomem
  2018-04-25 13:22 [PATCH] arm64: export memblock_reserve()d regions via /proc/iomem James Morse
@ 2018-04-26 17:44 ` Tyler Baicar
  2018-05-02 10:35 ` James Morse
  1 sibling, 0 replies; 5+ messages in thread
From: Tyler Baicar @ 2018-04-26 17:44 UTC (permalink / raw)
  To: linux-arm-kernel

On 4/25/2018 9:22 AM, James Morse wrote:
> There has been some confusion around what is necessary to prevent kexec
> overwriting important memory regions. memblock: reserve, or nomap?
> Only memblock nomap regions are reported via /proc/iomem, kexec's
> user-space doesn't know about memblock_reserve()d regions.
>
> Until commit f56ab9a5b73ca ("efi/arm: Don't mark ACPI reclaim memory
> as MEMBLOCK_NOMAP") the ACPI tables were nomap, now they are reserved
> and thus possible for kexec to overwrite with the new kernel or initrd.
> But this was always broken, as the UEFI memory map is also reserved
> and not marked as nomap.
>
> It turns out that while kexec-tools will pick up reserved sections in
> iomem that look like:
> | 80000000-dfffffff : System RAM
> |   81000000-8158ffff : reserved
>
> The reserved section is ignored by its 'locate_hole()' code. To fix
> this, we need to describe memblock_reserved() and nomap regions as
> 'reserved' at the top level:
> | 80000000-80ffffff : System RAM
> | 81000000-8158ffff : reserved
> | 81590000-dfffffff : System RAM
>
> To complicate matters, our existing named sections are described as
> being part of 'System RAM', but they are also memblock_reserve()d.
> We need to keep this in-case something is depending on it. To do this
> involves walking memblock multiple times:
>
> First add the 'System RAM' sections that are memory and not-reserved.
> These may be smaller than a page if part of the page is reserved. In
> this case we want to describe the page as reserved, so we round these
> regions down to the smallest page-size region, which may be empty.
> (We round-up the memblock_reserved() regions to fill in the gaps).
>
> The boundaries for kernel_data are changed because paging_init() punches
> holes in the _sdata -> _edata region, and this code can't add a named
> region that crosses memblock_reserve()d<->normal-memory regions. The
> new helpers will catch any more overlapping regions that occur.
>
> Lastly, we add the memblock_reserved() regions using
> reserve_region_with_split(), which will fill in the gaps between the
> existing named regions. (e.g. the regions occupied by the __init code).
> This call uses the slab allocator, so has to run from an initcall.
>
> Reported-by: Bhupesh Sharma <bhupesh.linux@gmail.com>
> Reported-by: Tyler Baicar <tbaicar@codeaurora.org>
> Suggested-by: Akashi Takahiro <takahiro.akashi@linaro.org>
> Signed-off-by: James Morse <james.morse@arm.com>
> CC: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> CC: Mark Rutland <mark.rutland@arm.com>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>

This patch resolves the overwriting of the ESRT memory region without the need
to edit/update kexec.

Thanks!
>
> ---
> If we do send this to stable:
> Fixes: d28f6df1305a ("arm64/kexec: Add core kexec support")
>
> If we're happy to modify user-sapce, we can do much neater things.
>
> It looks like UEFI's careful 'memory map not mapped' code had me convinced
> it was nomap.
>
>   arch/arm64/kernel/setup.c | 136 ++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 113 insertions(+), 23 deletions(-)
>
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 30ad2f085d1f..e82c0d5c70f8 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -202,45 +202,135 @@ static void __init setup_machine_fdt(phys_addr_t dt_phys)
>   	dump_stack_set_arch_desc("%s (DT)", name);
>   }
>   
> +static struct resource * __init add_standard_resources(phys_addr_t start,
> +						       phys_addr_t end,
> +						       bool reserved)
> +{
> +	struct resource *res;
> +
> +	res = alloc_bootmem_low(sizeof(*res));
> +
> +	if (reserved) {
> +		res->name  = "reserved";
> +		res->flags = IORESOURCE_MEM;
> +	} else {
> +		res->name  = "System RAM";
> +		res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> +	}
> +	res->start = start;
> +	res->end = end;
> +
> +	if (request_resource_conflict(&iomem_resource, res)) {
> +		pr_warn_once("Attempted to add overlapping resources\n");
> +		return NULL;
> +	}
> +
> +	return res;
> +}
> +
> +static void __init add_named_resources(struct resource *named_resource)
> +{
> +	phys_addr_t start, end;
> +	struct resource *res;
> +
> +	start = __pfn_to_phys(PFN_DOWN(named_resource->start));
> +	end = __pfn_to_phys(PFN_UP(named_resource->end)) - 1;
> +	res = add_standard_resources(start, end, false);
> +	if (res)
> +		request_resource(res, named_resource);
> +}
> +
>   static void __init request_standard_resources(void)
>   {
> +	phys_addr_t start, end;
>   	struct memblock_region *region;
>   	struct resource *res;
> +	u64 i;
> +	int num_res = 0;
>   
>   	kernel_code.start   = __pa_symbol(_text);
>   	kernel_code.end     = __pa_symbol(__init_begin - 1);
>   	kernel_data.start   = __pa_symbol(_sdata);
> -	kernel_data.end     = __pa_symbol(_end - 1);
> +	kernel_data.end     = __pa_symbol(_edata - 1);
>   
> -	for_each_memblock(memory, region) {
> -		res = alloc_bootmem_low(sizeof(*res));
> -		if (memblock_is_nomap(region)) {
> -			res->name  = "reserved";
> -			res->flags = IORESOURCE_MEM;
> -		} else {
> -			res->name  = "System RAM";
> -			res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -		}
> -		res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
> -		res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
> +	/*
> +	 * We can't allocate memory while walking free memory, count the number
> +	 * of struct resource's we will need. Round start/end to the smallest
> +	 * page-size region as we round the reserved regions up.
> +	 */
> +	for_each_free_mem_range(i, NUMA_NO_NODE, 0, &start, &end, NULL) {
> +		start = ALIGN(start, PAGE_SIZE);
> +		end = ALIGN_DOWN(end, PAGE_SIZE) - 1;
> +		if (end > start)
> +			num_res++;
> +	}
> +
> +	/* our allocation may split a free memblock */
> +	num_res++;
> +	res = alloc_bootmem_low(num_res * sizeof(*res));
>   
> -		request_resource(&iomem_resource, res);
> +	/*
> +	 * Add the non-reserved memory regions. flag=0 means we skip nomap
> +	 * regions too.
> +	 */
> +	for_each_free_mem_range(i, NUMA_NO_NODE, 0, &start, &end, NULL) {
> +		if (WARN_ON(!num_res))
> +			return;
> +
> +		res->name  = "System RAM";
> +		res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> +
> +		res->start = ALIGN(start, PAGE_SIZE);
> +		res->end = ALIGN_DOWN(end, PAGE_SIZE) - 1;
> +		if (res->end > res->start) {
> +			request_resource(&iomem_resource, res);
> +			res++;
> +			num_res--;
> +		}
> +	}
>   
> -		if (kernel_code.start >= res->start &&
> -		    kernel_code.end <= res->end)
> -			request_resource(res, &kernel_code);
> -		if (kernel_data.start >= res->start &&
> -		    kernel_data.end <= res->end)
> -			request_resource(res, &kernel_data);
> +	/* Add the named reserved regions and their system-ram parents */
> +	add_named_resources(&kernel_code);
> +	add_named_resources(&kernel_data);
>   #ifdef CONFIG_KEXEC_CORE
> -		/* Userspace will find "Crash kernel" region in /proc/iomem. */
> -		if (crashk_res.end && crashk_res.start >= res->start &&
> -		    crashk_res.end <= res->end)
> -			request_resource(res, &crashk_res);
> +	if (crashk_res.end)
> +		add_named_resources(&crashk_res);
>   #endif
> +
> +	/* Add the nomap regions */
> +	for_each_memblock(memory, region) {
> +		if (!memblock_is_nomap(region))
> +			continue;
> +
> +		start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
> +		end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
> +		add_standard_resources(start, end, true);
>   	}
>   }
>   
> +static int __init reserve_memblock_reserved_regions(void)
> +{
> +	phys_addr_t start, end, roundup_end = 0;
> +	u64 i;
> +
> +	for_each_reserved_mem_region(i, &start, &end) {
> +		if (end <= roundup_end)
> +			continue; /* done already */
> +
> +		start = __pfn_to_phys(PFN_DOWN(start));
> +		end = __pfn_to_phys(PFN_UP(end)) - 1;
> +		roundup_end = end;
> +
> +		reserve_region_with_split(&iomem_resource, start, end,
> +					  "reserved");
> +	}
> +
> +	return 0;
> +}
> +/* reserve_region_with_split() requires the slab allocator: */
> +arch_initcall(reserve_memblock_reserved_regions);
> +
> +
>   u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
>   
>   void __init setup_arch(char **cmdline_p)

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] arm64: export memblock_reserve()d regions via /proc/iomem
  2018-04-25 13:22 [PATCH] arm64: export memblock_reserve()d regions via /proc/iomem James Morse
  2018-04-26 17:44 ` Tyler Baicar
@ 2018-05-02 10:35 ` James Morse
  2018-05-07  2:40   ` Akashi Takahiro
  1 sibling, 1 reply; 5+ messages in thread
From: James Morse @ 2018-05-02 10:35 UTC (permalink / raw)
  To: linux-arm-kernel

Hi guys,

On 25/04/18 14:22, James Morse wrote:
> There has been some confusion around what is necessary to prevent kexec
> overwriting important memory regions. memblock: reserve, or nomap?
> Only memblock nomap regions are reported via /proc/iomem, kexec's
> user-space doesn't know about memblock_reserve()d regions.
> 
> Until commit f56ab9a5b73ca ("efi/arm: Don't mark ACPI reclaim memory
> as MEMBLOCK_NOMAP") the ACPI tables were nomap, now they are reserved
> and thus possible for kexec to overwrite with the new kernel or initrd.
> But this was always broken, as the UEFI memory map is also reserved
> and not marked as nomap.
> 
> It turns out that while kexec-tools will pick up reserved sections in
> iomem that look like:
> | 80000000-dfffffff : System RAM
> |   81000000-8158ffff : reserved
> 
> The reserved section is ignored by its 'locate_hole()' code. To fix
> this, we need to describe memblock_reserved() and nomap regions as
> 'reserved' at the top level:
> | 80000000-80ffffff : System RAM
> | 81000000-8158ffff : reserved
> | 81590000-dfffffff : System RAM
> 
> To complicate matters, our existing named sections are described as
> being part of 'System RAM', but they are also memblock_reserve()d.
> We need to keep this in-case something is depending on it. To do this
> involves walking memblock multiple times:
> 
> First add the 'System RAM' sections that are memory and not-reserved.
> These may be smaller than a page if part of the page is reserved. In
> this case we want to describe the page as reserved, so we round these
> regions down to the smallest page-size region, which may be empty.
> (We round-up the memblock_reserved() regions to fill in the gaps).
> 
> The boundaries for kernel_data are changed because paging_init() punches
> holes in the _sdata -> _edata region, and this code can't add a named
> region that crosses memblock_reserve()d<->normal-memory regions. The
> new helpers will catch any more overlapping regions that occur.
> 
> Lastly, we add the memblock_reserved() regions using
> reserve_region_with_split(), which will fill in the gaps between the
> existing named regions. (e.g. the regions occupied by the __init code).
> This call uses the slab allocator, so has to run from an initcall.

Re-reading Akashi's description of how kdump generates the ELF headers for
/proc/vmcore, this change might break kdump, as now the memblock_reserved()
regions may be missing from /proc/vmcore. (I'll test that).

Unless there is a 'name' for a region that kexec-tools interprets as "don't
overwrite this, but do include it in the ELF header", then we're stuck. We can't
fix kexec without breaking kdump, we have to change user-space.

Of the two, missing data from kdump vmcore would be preferable to failed-to-boot
kexec. If we have to change user-space, I'd like to make use of the kernels
ability to generate the ELF header itself, (e.g. kexec_file_load()).


> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 30ad2f085d1f..e82c0d5c70f8 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -202,45 +202,135 @@ static void __init setup_machine_fdt(phys_addr_t dt_phys)

>  static void __init request_standard_resources(void)
>  {

[...]

> +	/*
> +	 * We can't allocate memory while walking free memory, count the number
> +	 * of struct resource's we will need. Round start/end to the smallest
> +	 * page-size region as we round the reserved regions up.
> +	 */
> +	for_each_free_mem_range(i, NUMA_NO_NODE, 0, &start, &end, NULL) {

Nit: That 0 should be MEMBLOCK_NONE



Thanks,

James

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] arm64: export memblock_reserve()d regions via /proc/iomem
  2018-05-02 10:35 ` James Morse
@ 2018-05-07  2:40   ` Akashi Takahiro
  2018-05-15 17:10     ` James Morse
  0 siblings, 1 reply; 5+ messages in thread
From: Akashi Takahiro @ 2018-05-07  2:40 UTC (permalink / raw)
  To: linux-arm-kernel

James,

# I was off the last week.

On Wed, May 02, 2018 at 11:35:04AM +0100, James Morse wrote:
> Hi guys,
> 
> On 25/04/18 14:22, James Morse wrote:
> > There has been some confusion around what is necessary to prevent kexec
> > overwriting important memory regions. memblock: reserve, or nomap?
> > Only memblock nomap regions are reported via /proc/iomem, kexec's
> > user-space doesn't know about memblock_reserve()d regions.
> > 
> > Until commit f56ab9a5b73ca ("efi/arm: Don't mark ACPI reclaim memory
> > as MEMBLOCK_NOMAP") the ACPI tables were nomap, now they are reserved
> > and thus possible for kexec to overwrite with the new kernel or initrd.
> > But this was always broken, as the UEFI memory map is also reserved
> > and not marked as nomap.
> > 
> > It turns out that while kexec-tools will pick up reserved sections in
> > iomem that look like:
> > | 80000000-dfffffff : System RAM
> > |   81000000-8158ffff : reserved
> > 
> > The reserved section is ignored by its 'locate_hole()' code. To fix
> > this, we need to describe memblock_reserved() and nomap regions as
> > 'reserved' at the top level:
> > | 80000000-80ffffff : System RAM
> > | 81000000-8158ffff : reserved
> > | 81590000-dfffffff : System RAM
> > 
> > To complicate matters, our existing named sections are described as
> > being part of 'System RAM', but they are also memblock_reserve()d.
> > We need to keep this in-case something is depending on it. To do this
> > involves walking memblock multiple times:
> > 
> > First add the 'System RAM' sections that are memory and not-reserved.
> > These may be smaller than a page if part of the page is reserved. In
> > this case we want to describe the page as reserved, so we round these
> > regions down to the smallest page-size region, which may be empty.
> > (We round-up the memblock_reserved() regions to fill in the gaps).
> > 
> > The boundaries for kernel_data are changed because paging_init() punches
> > holes in the _sdata -> _edata region, and this code can't add a named
> > region that crosses memblock_reserve()d<->normal-memory regions. The
> > new helpers will catch any more overlapping regions that occur.
> > 
> > Lastly, we add the memblock_reserved() regions using
> > reserve_region_with_split(), which will fill in the gaps between the
> > existing named regions. (e.g. the regions occupied by the __init code).
> > This call uses the slab allocator, so has to run from an initcall.
> 
> Re-reading Akashi's description of how kdump generates the ELF headers for
> /proc/vmcore, this change might break kdump, as now the memblock_reserved()
> regions may be missing from /proc/vmcore. (I'll test that).

I also tested your patch, and there seems to be something wrong.
Actually, crash utility fails to read the core:
        crash-arm64: read error: kernel virtual address: ffff0000091fa998  \
                                        type: "shadow_timekeeper xtime_sec"
        crash-arm64: read error: kernel virtual address: ffff0000091fda48  \
                                        type: "high_memory"
Meanwhile,
	$ cat /proc/iomem
	40000000-4007ffff : System RAM
	40080000-40f5ffff : System RAM
	  40080000-40f5ffff : Kernel code
	40f60000-4107ffff : reserved
	41080000-411c9fff : System RAM
	  41080000-411c99ff : Kernel data
	411ca000-4122ffff : reserved		<=== (A)
	41230000-483fffff : System RAM
	48400000-583fffff : System RAM
	  48400000-583fffff : Crash kernel
	58400000-5858ffff : System RAM
	58590000-585effff : reserved
	...

Adding some debug messages shows that "shadow_timekeeper" variable
belongs to the range (A), which is originally part of "Kernel data"
and should have been unmarked from "reserved" list.

> Unless there is a 'name' for a region that kexec-tools interprets as "don't
> overwrite this, but do include it in the ELF header", then we're stuck. We can't
> fix kexec without breaking kdump, we have to change user-space.
> 
> Of the two, missing data from kdump vmcore would be preferable to failed-to-boot
> kexec.

Well, kexec has had this potential bug since the day One in mainline.
(we were then using "boot wrapper" instead of normal boot loaders though.)

Even if we accept your assertion above now, future new-comers may
"rediscover" a kdump bug and make a complaint time to time.
So I think that we would be better off fixing the issue right now
completely with accompanying user-space change.

Thanks,
-Takahiro AKASHI

> If we have to change user-space, I'd like to make use of the kernels
> ability to generate the ELF header itself, (e.g. kexec_file_load()).
> 
> 
> > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > index 30ad2f085d1f..e82c0d5c70f8 100644
> > --- a/arch/arm64/kernel/setup.c
> > +++ b/arch/arm64/kernel/setup.c
> > @@ -202,45 +202,135 @@ static void __init setup_machine_fdt(phys_addr_t dt_phys)
> 
> >  static void __init request_standard_resources(void)
> >  {
> 
> [...]
> 
> > +	/*
> > +	 * We can't allocate memory while walking free memory, count the number
> > +	 * of struct resource's we will need. Round start/end to the smallest
> > +	 * page-size region as we round the reserved regions up.
> > +	 */
> > +	for_each_free_mem_range(i, NUMA_NO_NODE, 0, &start, &end, NULL) {
> 
> Nit: That 0 should be MEMBLOCK_NONE
> 
> 
> 
> Thanks,
> 
> James
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] arm64: export memblock_reserve()d regions via /proc/iomem
  2018-05-07  2:40   ` Akashi Takahiro
@ 2018-05-15 17:10     ` James Morse
  0 siblings, 0 replies; 5+ messages in thread
From: James Morse @ 2018-05-15 17:10 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Akashi,

On 07/05/18 03:40, Akashi Takahiro wrote:
> On Wed, May 02, 2018 at 11:35:04AM +0100, James Morse wrote:
>> On 25/04/18 14:22, James Morse wrote:
>>> There has been some confusion around what is necessary to prevent kexec
>>> overwriting important memory regions. memblock: reserve, or nomap?
>>> Only memblock nomap regions are reported via /proc/iomem, kexec's
>>> user-space doesn't know about memblock_reserve()d regions.

>>> But this was always broken, as the UEFI memory map is also reserved
>>> and not marked as nomap.

>>> Lastly, we add the memblock_reserved() regions using
>>> reserve_region_with_split(), which will fill in the gaps between the
>>> existing named regions. (e.g. the regions occupied by the __init code).
>>> This call uses the slab allocator, so has to run from an initcall.

>> Re-reading Akashi's description of how kdump generates the ELF headers for
>> /proc/vmcore, this change might break kdump, as now the memblock_reserved()
>> regions may be missing from /proc/vmcore. (I'll test that).

> I also tested your patch, and there seems to be something wrong.

(excellent, I've been trying to remember how to build crash!)


> Actually, crash utility fails to read the core:
>         crash-arm64: read error: kernel virtual address: ffff0000091fa998  \
>                                         type: "shadow_timekeeper xtime_sec"
>         crash-arm64: read error: kernel virtual address: ffff0000091fda48  \
>                                         type: "high_memory"

> Adding some debug messages shows that "shadow_timekeeper" variable
> belongs to the range (A), which is originally part of "Kernel data"
> and should have been unmarked from "reserved" list.

Heh, so it caught fire even earlier. I was expecting only the per-cpu regions to
go missing.

Yes, the problems are going to be spurious-reserved and regions that aren't
reserved anymore, but were when the /proc/iomem list was generated.


>> Unless there is a 'name' for a region that kexec-tools interprets as "don't
>> overwrite this, but do include it in the ELF header", then we're stuck. We can't
>> fix kexec without breaking kdump, we have to change user-space.
>>
>> Of the two, missing data from kdump vmcore would be preferable to failed-to-boot
>> kexec.
> 
> Well, kexec has had this potential bug since the day One in mainline.
> (we were then using "boot wrapper" instead of normal boot loaders though.)

> Even if we accept your assertion above now, future new-comers may
> "rediscover" a kdump bug and make a complaint time to time.
> So I think that we would be better off fixing the issue right now
> completely with accompanying user-space change.

I agree we need to fix this 'now'. But changing user-space means people still
hit this bug, and are told 'update both user-space and your kernel'.

This is because kexec-tools is using the /proc/iomem list for two things:
1 don't write here, you need it to boot.
2 preserve this memory, you need it in the vmcore.

I think kdump is less common and more best-effort, my preference would be:

>> If we have to change user-space, I'd like to make use of the kernels
>> ability to generate the ELF header itself, (e.g. kexec_file_load()).

For kexec-tools to load the vmcore data it currently generates itself from
/sys/kernel/kdump_elf_header or similar.

This splits the 1&2 cases above, meaning the kernel can export information for
(1) via /proc/iomem and (2) via the new file.

old kexec-tools still works for kexec, but hits the problem you found for kdump.
new kexec-tools aware of (2) works in both cases.

But merging this patch to fix old kexec-tools will prevent us doing anything
'neater', and any fix for old-kexec is going to break kdump with the same
user-space.


For your proposed kexec-tool and kernel changes, how will old kexec-tools work?
Is it going to be any worse than today?



Thanks,

James

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-05-15 17:10 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-25 13:22 [PATCH] arm64: export memblock_reserve()d regions via /proc/iomem James Morse
2018-04-26 17:44 ` Tyler Baicar
2018-05-02 10:35 ` James Morse
2018-05-07  2:40   ` Akashi Takahiro
2018-05-15 17:10     ` James Morse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.