linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
@ 2020-03-26 18:07 James Morse
  2020-03-26 18:07 ` [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image James Morse
                   ` (7 more replies)
  0 siblings, 8 replies; 92+ messages in thread
From: James Morse @ 2020-03-26 18:07 UTC (permalink / raw)
  To: kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma, James Morse

Hello!

arm64 recently queued support for memory hotremove, which led to some
new corner cases for kexec.

If the kexec segments are loaded for a removable region, that region may
be removed before kexec actually occurs. This causes the first kernel to
lockup when applying the relocations. (I've triggered this on x86 too).

The first patch adds a memory notifier for kexec so that it can refuse
to allow in-use regions to be taken offline.


This doesn't solve the problem for arm64, where the new kernel must
initially rely on the data structures from the first boot to describe
memory. These don't describe hotpluggable memory.
If kexec places the kernel in one of these regions, it must also provide
a DT that describes the region in which the kernel was mapped as memory.
(and somehow ensure its always present in the future...)

To prevent this from happening accidentally with unaware user-space,
patches two and three allow arm64 to give these regions a different
name.

This is a change in behaviour for arm64 as memory hotadd and hotremove
were added separately.


I haven't tried kdump.
Unaware kdump from user-space probably won't describe the hotplug
regions if the name is different, which saves us from problems if
the memory is no longer present at kdump time, but means the vmcore
is incomplete.


These patches are based on arm64's for-next/core branch, but can all
be merged independently.

Thanks,

James Morse (3):
  kexec: Prevent removal of memory in use by a loaded kexec image
  mm/memory_hotplug: Allow arch override of non boot memory resource
    names
  arm64: memory: Give hotplug memory a different resource name

 arch/arm64/include/asm/memory.h | 11 +++++++
 kernel/kexec_core.c             | 56 +++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c             |  6 +++-
 3 files changed, 72 insertions(+), 1 deletion(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
@ 2020-03-26 18:07 ` James Morse
  2020-03-27  0:43   ` Anshuman Khandual
                     ` (3 more replies)
  2020-03-26 18:07 ` [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names James Morse
                   ` (6 subsequent siblings)
  7 siblings, 4 replies; 92+ messages in thread
From: James Morse @ 2020-03-26 18:07 UTC (permalink / raw)
  To: kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma, James Morse

An image loaded for kexec is not stored in place, instead its segments
are scattered through memory, and are re-assembled when needed. In the
meantime, the target memory may have been removed.

Because mm is not aware that this memory is still in use, it allows it
to be removed.

Add a memory notifier to prevent the removal of memory regions that
overlap with a loaded kexec image segment. e.g., when triggered from the
Qemu console:
| kexec_core: memory region in use
| memory memory32: Offline failed.

Signed-off-by: James Morse <james.morse@arm.com>
---
 kernel/kexec_core.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c19c0dad1ebe..ba1d91e868ca 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -12,6 +12,7 @@
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/kexec.h>
+#include <linux/memory.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/highmem.h>
@@ -22,10 +23,12 @@
 #include <linux/elf.h>
 #include <linux/elfcore.h>
 #include <linux/utsname.h>
+#include <linux/notifier.h>
 #include <linux/numa.h>
 #include <linux/suspend.h>
 #include <linux/device.h>
 #include <linux/freezer.h>
+#include <linux/pfn.h>
 #include <linux/pm.h>
 #include <linux/cpu.h>
 #include <linux/uaccess.h>
@@ -1219,3 +1222,56 @@ void __weak arch_kexec_protect_crashkres(void)
 
 void __weak arch_kexec_unprotect_crashkres(void)
 {}
+
+/*
+ * If user-space wants to offline memory that is in use by a loaded kexec
+ * image, it should unload the image first.
+ */
+static int mem_remove_cb(struct notifier_block *nb, unsigned long action,
+			 void *data)
+{
+	int rv = NOTIFY_OK, i;
+	struct memory_notify *arg = data;
+	unsigned long pfn = arg->start_pfn;
+	unsigned long nr_segments, sstart, send;
+	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
+
+	might_sleep();
+
+	if (action != MEM_GOING_OFFLINE)
+		return NOTIFY_DONE;
+
+	mutex_lock(&kexec_mutex);
+	if (kexec_image) {
+		nr_segments = kexec_image->nr_segments;
+
+		for (i = 0; i < nr_segments; i++) {
+			sstart = PFN_DOWN(kexec_image->segment[i].mem);
+			send = PFN_UP(kexec_image->segment[i].mem +
+				      kexec_image->segment[i].memsz);
+
+			if ((pfn <= sstart && sstart < end_pfn) ||
+			    (pfn <= send && send < end_pfn)) {
+				pr_warn("Memory region in use\n");
+				rv = NOTIFY_BAD;
+				break;
+			}
+		}
+	}
+	mutex_unlock(&kexec_mutex);
+
+	return rv;
+}
+
+static struct notifier_block mem_remove_nb = {
+	.notifier_call = mem_remove_cb,
+};
+
+static int __init register_mem_remove_cb(void)
+{
+	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
+		return register_memory_notifier(&mem_remove_nb);
+
+	return 0;
+}
+device_initcall(register_mem_remove_cb);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
  2020-03-26 18:07 ` [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image James Morse
@ 2020-03-26 18:07 ` James Morse
  2020-03-27  9:59   ` David Hildenbrand
                     ` (3 more replies)
  2020-03-26 18:07 ` [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name James Morse
                   ` (5 subsequent siblings)
  7 siblings, 4 replies; 92+ messages in thread
From: James Morse @ 2020-03-26 18:07 UTC (permalink / raw)
  To: kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma, James Morse

Memory added to the system by hotplug has a 'System RAM' resource created
for it. This is exposed to user-space via /proc/iomem.

This poses problems for kexec on arm64. If kexec decides to place the
kernel in one of these newly onlined regions, the new kernel will find
itself booting from a region not described as memory in the firmware
tables.

Arm64 doesn't have a structure like the e820 memory map that can be
re-written when memory is brought online. Instead arm64 uses the UEFI
memory map, or the memory node from the DT, sometimes both. We never
rewrite these.

Allow an architecture to specify a different name for these hotplug
regions.

Signed-off-by: James Morse <james.morse@arm.com>
---
 mm/memory_hotplug.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..69b03dd7fc74 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,10 @@
 #include "internal.h"
 #include "shuffle.h"
 
+#ifndef MEMORY_HOTPLUG_RES_NAME
+#define MEMORY_HOTPLUG_RES_NAME "System RAM"
+#endif
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -103,7 +107,7 @@ static struct resource *register_memory_resource(u64 start, u64 size)
 {
 	struct resource *res;
 	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	char *resource_name = "System RAM";
+	char *resource_name = MEMORY_HOTPLUG_RES_NAME;
 
 	if (start + size > max_mem_size)
 		return ERR_PTR(-E2BIG);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name
  2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
  2020-03-26 18:07 ` [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image James Morse
  2020-03-26 18:07 ` [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names James Morse
@ 2020-03-26 18:07 ` James Morse
  2020-03-30 19:01   ` David Hildenbrand
  2020-04-15 20:37   ` Eric W. Biederman
  2020-03-27  2:11 ` [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use Baoquan He
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 92+ messages in thread
From: James Morse @ 2020-03-26 18:07 UTC (permalink / raw)
  To: kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma, James Morse

If kexec chooses to place the kernel in a memory region that was
added after boot, we fail to boot as the kernel is running from a
location that is not described as memory by the UEFI memory map or
the original DT.

To prevent unaware user-space kexec from doing this accidentally,
give these regions a different name.

Signed-off-by: James Morse <james.morse@arm.com>
---
This is a change in behaviour as seen by user-space, because memory hot-add
has already been merged.

 arch/arm64/include/asm/memory.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 2be67b232499..ef1686518469 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -166,6 +166,17 @@
 #define IOREMAP_MAX_ORDER	(PMD_SHIFT)
 #endif
 
+/*
+ * Memory hotplug allows new regions of 'System RAM' to be added to the system.
+ * These aren't described as memory by the UEFI memory map, or DT memory node.
+ * If we kexec from one of these regions, the new kernel boots from a location
+ * that isn't described as RAM.
+ *
+ * Give these resources a different name, so unaware kexec doesn't do this by
+ * accident.
+ */
+#define MEMORY_HOTPLUG_RES_NAME "System RAM (hotplug)"
+
 #ifndef __ASSEMBLY__
 extern u64			vabits_actual;
 #define PAGE_END		(_PAGE_END(vabits_actual))
-- 
2.25.1



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-26 18:07 ` [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image James Morse
@ 2020-03-27  0:43   ` Anshuman Khandual
  2020-03-27  2:54     ` Baoquan He
  2020-03-27 15:46     ` James Morse
  2020-03-27  2:34   ` Baoquan He
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 92+ messages in thread
From: Anshuman Khandual @ 2020-03-27  0:43 UTC (permalink / raw)
  To: James Morse, kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Bhupesh Sharma



On 03/26/2020 11:37 PM, James Morse wrote:
> An image loaded for kexec is not stored in place, instead its segments
> are scattered through memory, and are re-assembled when needed. In the
> meantime, the target memory may have been removed.
> 
> Because mm is not aware that this memory is still in use, it allows it
> to be removed.

Why the isolation process does not fail when these pages are currently
being used by kexec ?

> 
> Add a memory notifier to prevent the removal of memory regions that
> overlap with a loaded kexec image segment. e.g., when triggered from the
> Qemu console:
> | kexec_core: memory region in use
> | memory memory32: Offline failed.

Yes this is definitely an added protection for these kexec loaded kernels
memory areas from being offlined but I would have expected the preceding
offlining to have failed as well.

> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  kernel/kexec_core.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index c19c0dad1ebe..ba1d91e868ca 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -12,6 +12,7 @@
>  #include <linux/slab.h>
>  #include <linux/fs.h>
>  #include <linux/kexec.h>
> +#include <linux/memory.h>
>  #include <linux/mutex.h>
>  #include <linux/list.h>
>  #include <linux/highmem.h>
> @@ -22,10 +23,12 @@
>  #include <linux/elf.h>
>  #include <linux/elfcore.h>
>  #include <linux/utsname.h>
> +#include <linux/notifier.h>
>  #include <linux/numa.h>
>  #include <linux/suspend.h>
>  #include <linux/device.h>
>  #include <linux/freezer.h>
> +#include <linux/pfn.h>
>  #include <linux/pm.h>
>  #include <linux/cpu.h>
>  #include <linux/uaccess.h>
> @@ -1219,3 +1222,56 @@ void __weak arch_kexec_protect_crashkres(void)
>  
>  void __weak arch_kexec_unprotect_crashkres(void)
>  {}
> +
> +/*
> + * If user-space wants to offline memory that is in use by a loaded kexec
> + * image, it should unload the image first.
> + */

Probably this would need kexec user manual and related system call man pages
update as well.

> +static int mem_remove_cb(struct notifier_block *nb, unsigned long action,
> +			 void *data)
> +{
> +	int rv = NOTIFY_OK, i;
> +	struct memory_notify *arg = data;
> +	unsigned long pfn = arg->start_pfn;
> +	unsigned long nr_segments, sstart, send;
> +	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
> +
> +	might_sleep();

Required ?

> +
> +	if (action != MEM_GOING_OFFLINE)
> +		return NOTIFY_DONE;
> +
> +	mutex_lock(&kexec_mutex);
> +	if (kexec_image) {
> +		nr_segments = kexec_image->nr_segments;
> +
> +		for (i = 0; i < nr_segments; i++) {
> +			sstart = PFN_DOWN(kexec_image->segment[i].mem);
> +			send = PFN_UP(kexec_image->segment[i].mem +
> +				      kexec_image->segment[i].memsz);
> +
> +			if ((pfn <= sstart && sstart < end_pfn) ||
> +			    (pfn <= send && send < end_pfn)) {
> +				pr_warn("Memory region in use\n");
> +				rv = NOTIFY_BAD;
> +				break;
> +			}
> +		}
> +	}
> +	mutex_unlock(&kexec_mutex);
> +
> +	return rv;

Variable 'rv' is redundant, should use NOTIFY_[BAD|OK] directly instead.

> +}
> +
> +static struct notifier_block mem_remove_nb = {
> +	.notifier_call = mem_remove_cb,
> +};
> +
> +static int __init register_mem_remove_cb(void)
> +{
> +	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))

Should not all these new code here be wrapped with CONFIG_MEMORY_HOTREMOVE
to reduce the scope as well as final code size when the config is disabled.

> +		return register_memory_notifier(&mem_remove_nb);
> +
> +	return 0;
> +}
> +device_initcall(register_mem_remove_cb);
> 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
                   ` (2 preceding siblings ...)
  2020-03-26 18:07 ` [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name James Morse
@ 2020-03-27  2:11 ` Baoquan He
  2020-03-27 15:40   ` James Morse
  2020-03-27  9:27 ` David Hildenbrand
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-03-27  2:11 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

On 03/26/20 at 06:07pm, James Morse wrote:
> Hello!
> 
> arm64 recently queued support for memory hotremove, which led to some
> new corner cases for kexec.
> 
> If the kexec segments are loaded for a removable region, that region may
> be removed before kexec actually occurs. This causes the first kernel to
> lockup when applying the relocations. (I've triggered this on x86 too).

Do you mean you use 'kexec -l /boot/vmlinuz-xxxx --initrd ...' to load a
kernel, next you hot remove some memory regions, then you execute
'kexec -e' to trigger kexec reboot?

I may not get the point clearly, but we usually do the loading and
triggering of kexec-ed kernel at the same time. 

> 
> The first patch adds a memory notifier for kexec so that it can refuse
> to allow in-use regions to be taken offline.
> 
> 
> This doesn't solve the problem for arm64, where the new kernel must
> initially rely on the data structures from the first boot to describe
> memory. These don't describe hotpluggable memory.
> If kexec places the kernel in one of these regions, it must also provide
> a DT that describes the region in which the kernel was mapped as memory.
> (and somehow ensure its always present in the future...)
> 
> To prevent this from happening accidentally with unaware user-space,
> patches two and three allow arm64 to give these regions a different
> name.
> 
> This is a change in behaviour for arm64 as memory hotadd and hotremove
> were added separately.
> 
> 
> I haven't tried kdump.
> Unaware kdump from user-space probably won't describe the hotplug
> regions if the name is different, which saves us from problems if
> the memory is no longer present at kdump time, but means the vmcore
> is incomplete.

Currently, we will monitor udev events of mem hot add/remove, then
reload kdump kernel. That reloading is only update the elfcorehdr,
because crashkernel has to be reserved during 1st kernel bootup. I don't
think this will have problem.

> 
> 
> These patches are based on arm64's for-next/core branch, but can all
> be merged independently.
> 
> Thanks,
> 
> James Morse (3):
>   kexec: Prevent removal of memory in use by a loaded kexec image
>   mm/memory_hotplug: Allow arch override of non boot memory resource
>     names
>   arm64: memory: Give hotplug memory a different resource name
> 
>  arch/arm64/include/asm/memory.h | 11 +++++++
>  kernel/kexec_core.c             | 56 +++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c             |  6 +++-
>  3 files changed, 72 insertions(+), 1 deletion(-)
> 
> -- 
> 2.25.1
> 
> 



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-26 18:07 ` [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image James Morse
  2020-03-27  0:43   ` Anshuman Khandual
@ 2020-03-27  2:34   ` Baoquan He
  2020-03-27  9:30   ` David Hildenbrand
  2020-04-15 20:33   ` Eric W. Biederman
  3 siblings, 0 replies; 92+ messages in thread
From: Baoquan He @ 2020-03-27  2:34 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

On 03/26/20 at 06:07pm, James Morse wrote:
> An image loaded for kexec is not stored in place, instead its segments
> are scattered through memory, and are re-assembled when needed. In the
> meantime, the target memory may have been removed.
> 
> Because mm is not aware that this memory is still in use, it allows it
> to be removed.
> 
> Add a memory notifier to prevent the removal of memory regions that
> overlap with a loaded kexec image segment. e.g., when triggered from the
> Qemu console:
> | kexec_core: memory region in use
> | memory memory32: Offline failed.

As I replied to the cover letter, usually we do loading and juming of
kexec-ed kernel at one time. If we expect to do both of them at
different time, I agree we should do something to make thing safer if
someone really want to do since it's allowed, can we do anything with
the existing notifier?

Mem hotplug has got a notifier to notice it will offline a memory
region.
memory_notify(MEM_OFFLINE, &arg);

> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  kernel/kexec_core.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index c19c0dad1ebe..ba1d91e868ca 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -12,6 +12,7 @@
>  #include <linux/slab.h>
>  #include <linux/fs.h>
>  #include <linux/kexec.h>
> +#include <linux/memory.h>
>  #include <linux/mutex.h>
>  #include <linux/list.h>
>  #include <linux/highmem.h>
> @@ -22,10 +23,12 @@
>  #include <linux/elf.h>
>  #include <linux/elfcore.h>
>  #include <linux/utsname.h>
> +#include <linux/notifier.h>
>  #include <linux/numa.h>
>  #include <linux/suspend.h>
>  #include <linux/device.h>
>  #include <linux/freezer.h>
> +#include <linux/pfn.h>
>  #include <linux/pm.h>
>  #include <linux/cpu.h>
>  #include <linux/uaccess.h>
> @@ -1219,3 +1222,56 @@ void __weak arch_kexec_protect_crashkres(void)
>  
>  void __weak arch_kexec_unprotect_crashkres(void)
>  {}
> +
> +/*
> + * If user-space wants to offline memory that is in use by a loaded kexec
> + * image, it should unload the image first.
> + */
> +static int mem_remove_cb(struct notifier_block *nb, unsigned long action,
> +			 void *data)
> +{
> +	int rv = NOTIFY_OK, i;
> +	struct memory_notify *arg = data;
> +	unsigned long pfn = arg->start_pfn;
> +	unsigned long nr_segments, sstart, send;
> +	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
> +
> +	might_sleep();
> +
> +	if (action != MEM_GOING_OFFLINE)
> +		return NOTIFY_DONE;
> +
> +	mutex_lock(&kexec_mutex);
> +	if (kexec_image) {
> +		nr_segments = kexec_image->nr_segments;
> +
> +		for (i = 0; i < nr_segments; i++) {
> +			sstart = PFN_DOWN(kexec_image->segment[i].mem);
> +			send = PFN_UP(kexec_image->segment[i].mem +
> +				      kexec_image->segment[i].memsz);
> +
> +			if ((pfn <= sstart && sstart < end_pfn) ||
> +			    (pfn <= send && send < end_pfn)) {
> +				pr_warn("Memory region in use\n");
> +				rv = NOTIFY_BAD;
> +				break;
> +			}
> +		}
> +	}
> +	mutex_unlock(&kexec_mutex);
> +
> +	return rv;
> +}
> +
> +static struct notifier_block mem_remove_nb = {
> +	.notifier_call = mem_remove_cb,
> +};
> +
> +static int __init register_mem_remove_cb(void)
> +{
> +	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
> +		return register_memory_notifier(&mem_remove_nb);
> +
> +	return 0;
> +}
> +device_initcall(register_mem_remove_cb);
> -- 
> 2.25.1
> 
> 



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-27  0:43   ` Anshuman Khandual
@ 2020-03-27  2:54     ` Baoquan He
  2020-03-27 15:46     ` James Morse
  1 sibling, 0 replies; 92+ messages in thread
From: Baoquan He @ 2020-03-27  2:54 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: James Morse, kexec, linux-mm, linux-arm-kernel, Eric Biederman,
	Andrew Morton, Catalin Marinas, Will Deacon, Bhupesh Sharma

On 03/27/20 at 06:13am, Anshuman Khandual wrote:
> 
> 
> On 03/26/2020 11:37 PM, James Morse wrote:
> > An image loaded for kexec is not stored in place, instead its segments
> > are scattered through memory, and are re-assembled when needed. In the
> > meantime, the target memory may have been removed.
> > 
> > Because mm is not aware that this memory is still in use, it allows it
> > to be removed.
> 
> Why the isolation process does not fail when these pages are currently
> being used by kexec ?

That is trick of kexec implementaiton. When loading kexec-ed kernel, it
just reads in the kenrel image in user space, then searches a place
where it's going to put kernel image in the whole system RAM, but won't
put kernel image in the searched region immediately, just book ahead a
room. When you execute 'kexec -e' to trigger kexec jumping, it will copy
kernel image into the booked room. So the booking is only recorded in
kexec's own data structure.

> 
> > 
> > Add a memory notifier to prevent the removal of memory regions that
> > overlap with a loaded kexec image segment. e.g., when triggered from the
> > Qemu console:
> > | kexec_core: memory region in use
> > | memory memory32: Offline failed.
> 
> Yes this is definitely an added protection for these kexec loaded kernels
> memory areas from being offlined but I would have expected the preceding
> offlining to have failed as well.
> 
> > 
> > Signed-off-by: James Morse <james.morse@arm.com>
> > ---
> >  kernel/kexec_core.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 56 insertions(+)
> > 
> > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > index c19c0dad1ebe..ba1d91e868ca 100644
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/slab.h>
> >  #include <linux/fs.h>
> >  #include <linux/kexec.h>
> > +#include <linux/memory.h>
> >  #include <linux/mutex.h>
> >  #include <linux/list.h>
> >  #include <linux/highmem.h>
> > @@ -22,10 +23,12 @@
> >  #include <linux/elf.h>
> >  #include <linux/elfcore.h>
> >  #include <linux/utsname.h>
> > +#include <linux/notifier.h>
> >  #include <linux/numa.h>
> >  #include <linux/suspend.h>
> >  #include <linux/device.h>
> >  #include <linux/freezer.h>
> > +#include <linux/pfn.h>
> >  #include <linux/pm.h>
> >  #include <linux/cpu.h>
> >  #include <linux/uaccess.h>
> > @@ -1219,3 +1222,56 @@ void __weak arch_kexec_protect_crashkres(void)
> >  
> >  void __weak arch_kexec_unprotect_crashkres(void)
> >  {}
> > +
> > +/*
> > + * If user-space wants to offline memory that is in use by a loaded kexec
> > + * image, it should unload the image first.
> > + */
> 
> Probably this would need kexec user manual and related system call man pages
> update as well.
> 
> > +static int mem_remove_cb(struct notifier_block *nb, unsigned long action,
> > +			 void *data)
> > +{
> > +	int rv = NOTIFY_OK, i;
> > +	struct memory_notify *arg = data;
> > +	unsigned long pfn = arg->start_pfn;
> > +	unsigned long nr_segments, sstart, send;
> > +	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
> > +
> > +	might_sleep();
> 
> Required ?
> 
> > +
> > +	if (action != MEM_GOING_OFFLINE)
> > +		return NOTIFY_DONE;
> > +
> > +	mutex_lock(&kexec_mutex);
> > +	if (kexec_image) {
> > +		nr_segments = kexec_image->nr_segments;
> > +
> > +		for (i = 0; i < nr_segments; i++) {
> > +			sstart = PFN_DOWN(kexec_image->segment[i].mem);
> > +			send = PFN_UP(kexec_image->segment[i].mem +
> > +				      kexec_image->segment[i].memsz);
> > +
> > +			if ((pfn <= sstart && sstart < end_pfn) ||
> > +			    (pfn <= send && send < end_pfn)) {
> > +				pr_warn("Memory region in use\n");
> > +				rv = NOTIFY_BAD;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +	mutex_unlock(&kexec_mutex);
> > +
> > +	return rv;
> 
> Variable 'rv' is redundant, should use NOTIFY_[BAD|OK] directly instead.
> 
> > +}
> > +
> > +static struct notifier_block mem_remove_nb = {
> > +	.notifier_call = mem_remove_cb,
> > +};
> > +
> > +static int __init register_mem_remove_cb(void)
> > +{
> > +	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
> 
> Should not all these new code here be wrapped with CONFIG_MEMORY_HOTREMOVE
> to reduce the scope as well as final code size when the config is disabled.
> 
> > +		return register_memory_notifier(&mem_remove_nb);
> > +
> > +	return 0;
> > +}
> > +device_initcall(register_mem_remove_cb);
> > 
> 



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
                   ` (3 preceding siblings ...)
  2020-03-27  2:11 ` [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use Baoquan He
@ 2020-03-27  9:27 ` David Hildenbrand
  2020-03-27 15:42   ` James Morse
  2020-03-30 13:55 ` Baoquan He
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-03-27  9:27 UTC (permalink / raw)
  To: James Morse, kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma

On 26.03.20 19:07, James Morse wrote:
> Hello!
> 
> arm64 recently queued support for memory hotremove, which led to some
> new corner cases for kexec.
> 
> If the kexec segments are loaded for a removable region, that region may
> be removed before kexec actually occurs. This causes the first kernel to
> lockup when applying the relocations. (I've triggered this on x86 too).
> 
> The first patch adds a memory notifier for kexec so that it can refuse
> to allow in-use regions to be taken offline.

IIRC other architectures handle that by setting the affected pages
PageReserved. Any reason why to not stick to the same?

> 
> 
> This doesn't solve the problem for arm64, where the new kernel must
> initially rely on the data structures from the first boot to describe
> memory. These don't describe hotpluggable memory.
> If kexec places the kernel in one of these regions, it must also provide
> a DT that describes the region in which the kernel was mapped as memory.
> (and somehow ensure its always present in the future...)
> 
> To prevent this from happening accidentally with unaware user-space,
> patches two and three allow arm64 to give these regions a different
> name.
> 
> This is a change in behaviour for arm64 as memory hotadd and hotremove
> were added separately.
> 
> 
> I haven't tried kdump.
> Unaware kdump from user-space probably won't describe the hotplug
> regions if the name is different, which saves us from problems if
> the memory is no longer present at kdump time, but means the vmcore
> is incomplete.

Whenever memory is added/removed, kdump.service is to be restarted from
user space, which will fixup the data structures such that kdump will
not try to dump unplugged memory. Also, makedumpfile will check if the
sections are still around IIRC.

Not sure what you mean by "Unaware kdump from user-space".


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-26 18:07 ` [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image James Morse
  2020-03-27  0:43   ` Anshuman Khandual
  2020-03-27  2:34   ` Baoquan He
@ 2020-03-27  9:30   ` David Hildenbrand
  2020-03-27 16:56     ` James Morse
  2020-04-15 20:33   ` Eric W. Biederman
  3 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-03-27  9:30 UTC (permalink / raw)
  To: James Morse, kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma

On 26.03.20 19:07, James Morse wrote:
> An image loaded for kexec is not stored in place, instead its segments
> are scattered through memory, and are re-assembled when needed. In the
> meantime, the target memory may have been removed.
> 
> Because mm is not aware that this memory is still in use, it allows it
> to be removed.
> 
> Add a memory notifier to prevent the removal of memory regions that
> overlap with a loaded kexec image segment. e.g., when triggered from the
> Qemu console:
> | kexec_core: memory region in use
> | memory memory32: Offline failed.
> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  kernel/kexec_core.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index c19c0dad1ebe..ba1d91e868ca 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -12,6 +12,7 @@
>  #include <linux/slab.h>
>  #include <linux/fs.h>
>  #include <linux/kexec.h>
> +#include <linux/memory.h>
>  #include <linux/mutex.h>
>  #include <linux/list.h>
>  #include <linux/highmem.h>
> @@ -22,10 +23,12 @@
>  #include <linux/elf.h>
>  #include <linux/elfcore.h>
>  #include <linux/utsname.h>
> +#include <linux/notifier.h>
>  #include <linux/numa.h>
>  #include <linux/suspend.h>
>  #include <linux/device.h>
>  #include <linux/freezer.h>
> +#include <linux/pfn.h>
>  #include <linux/pm.h>
>  #include <linux/cpu.h>
>  #include <linux/uaccess.h>
> @@ -1219,3 +1222,56 @@ void __weak arch_kexec_protect_crashkres(void)
>  
>  void __weak arch_kexec_unprotect_crashkres(void)
>  {}
> +
> +/*
> + * If user-space wants to offline memory that is in use by a loaded kexec
> + * image, it should unload the image first.
> + */
> +static int mem_remove_cb(struct notifier_block *nb, unsigned long action,
> +			 void *data)
> +{
> +	int rv = NOTIFY_OK, i;
> +	struct memory_notify *arg = data;
> +	unsigned long pfn = arg->start_pfn;
> +	unsigned long nr_segments, sstart, send;
> +	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
> +
> +	might_sleep();
> +
> +	if (action != MEM_GOING_OFFLINE)
> +		return NOTIFY_DONE;
> +
> +	mutex_lock(&kexec_mutex);
> +	if (kexec_image) {
> +		nr_segments = kexec_image->nr_segments;
> +
> +		for (i = 0; i < nr_segments; i++) {
> +			sstart = PFN_DOWN(kexec_image->segment[i].mem);
> +			send = PFN_UP(kexec_image->segment[i].mem +
> +				      kexec_image->segment[i].memsz);
> +
> +			if ((pfn <= sstart && sstart < end_pfn) ||
> +			    (pfn <= send && send < end_pfn)) {
> +				pr_warn("Memory region in use\n");
> +				rv = NOTIFY_BAD;
> +				break;
> +			}
> +		}
> +	}
> +	mutex_unlock(&kexec_mutex);
> +
> +	return rv;
> +}
> +
> +static struct notifier_block mem_remove_nb = {
> +	.notifier_call = mem_remove_cb,
> +};
> +
> +static int __init register_mem_remove_cb(void)
> +{
> +	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
> +		return register_memory_notifier(&mem_remove_nb);
> +
> +	return 0;
> +}
> +device_initcall(register_mem_remove_cb);
> 

E.g., in kernel/kexec_core.c:kimage_alloc_pages()

"SetPageReserved(pages + i);"

Pages that are reserved cannot get offlined. How are you able to trigger
that before this patch? (where is the allocation path for kexec, which
will not set the pages reserved?)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-03-26 18:07 ` [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names James Morse
@ 2020-03-27  9:59   ` David Hildenbrand
  2020-03-27 15:39     ` James Morse
  2020-04-02  5:49   ` Dave Young
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-03-27  9:59 UTC (permalink / raw)
  To: James Morse, kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma

On 26.03.20 19:07, James Morse wrote:
> Memory added to the system by hotplug has a 'System RAM' resource created
> for it. This is exposed to user-space via /proc/iomem.
> 
> This poses problems for kexec on arm64. If kexec decides to place the
> kernel in one of these newly onlined regions, the new kernel will find
> itself booting from a region not described as memory in the firmware
> tables.
> 
> Arm64 doesn't have a structure like the e820 memory map that can be
> re-written when memory is brought online. Instead arm64 uses the UEFI
> memory map, or the memory node from the DT, sometimes both. We never
> rewrite these.
> 
> Allow an architecture to specify a different name for these hotplug
> regions.
> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  mm/memory_hotplug.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0a54ffac8c68..69b03dd7fc74 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,10 @@
>  #include "internal.h"
>  #include "shuffle.h"
>  
> +#ifndef MEMORY_HOTPLUG_RES_NAME
> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
> +#endif

So I assume changing this for all architectures would result in some
user space tool breaking? Are we aware of any?

I do wonder if we should simply change it for all architectures if possible.


> +
>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -103,7 +107,7 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>  {
>  	struct resource *res;
>  	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -	char *resource_name = "System RAM";
> +	char *resource_name = MEMORY_HOTPLUG_RES_NAME;
>  
>  	if (start + size > max_mem_size)
>  		return ERR_PTR(-E2BIG);
> 


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-03-27  9:59   ` David Hildenbrand
@ 2020-03-27 15:39     ` James Morse
  2020-03-30 13:23       ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-03-27 15:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi David,

On 3/27/20 9:59 AM, David Hildenbrand wrote:
> On 26.03.20 19:07, James Morse wrote:
>> Memory added to the system by hotplug has a 'System RAM' resource created
>> for it. This is exposed to user-space via /proc/iomem.
>>
>> This poses problems for kexec on arm64. If kexec decides to place the
>> kernel in one of these newly onlined regions, the new kernel will find
>> itself booting from a region not described as memory in the firmware
>> tables.
>>
>> Arm64 doesn't have a structure like the e820 memory map that can be
>> re-written when memory is brought online. Instead arm64 uses the UEFI
>> memory map, or the memory node from the DT, sometimes both. We never
>> rewrite these.
>>
>> Allow an architecture to specify a different name for these hotplug
>> regions.


>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 0a54ffac8c68..69b03dd7fc74 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -42,6 +42,10 @@
>>  #include "internal.h"
>>  #include "shuffle.h"
>>  
>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>> +#endif
> 
> So I assume changing this for all architectures would result in some
> user space tool breaking? Are we aware of any?

Last time we had to touch arm64's /proc/iomem strings I went through debian's
codesearch for stuff that reads it, kexec-tools was the only thing that parsed
it in anger. (From memory, the other tools were looking for PCIe windows to do
firmware flashing..)

Looking again, having qualifiers on the end of 'System RAM' looks like it could
confuse 's390-tools's detect_mem_chunks parser.

It looks like the strings that come out of 'FIRMWARE_MEMMAP' are a duplicate set.


> I do wonder if we should simply change it for all architectures if possible.

If its possible that would be great. But I suspect that ship has sailed,
changing it on other architectures could break some fragile parsing code.

I'm wary of changing it on arm64, the only thing that makes it tolerable is that
memory hot-add was relatively recently merged, and we don't anticipate it being
widely used until you can remove memory as well.

Changing it on arm64 is to prevent today's versions of kexec-tools from
accidentally placing the new kernel in memory that wasn't described at boot.
This leads to an unhandled exception during boot[0] because the kernel can't
access itself via the mapping of all memory. (hotpluggable regions are only
discovered by suitably configured ACPI systems much later)


Thanks,

James

[0]
| NUMA: NODE_DATA [mem 0x7fdf1780-0x7fdf3fff]
| Unable to handle kernel paging request at virtual address ffff00004230aff8
| Mem abort info:
|   ESR = 0x96000006
|   EC = 0x25: DABT (current EL), IL = 32 bits
|   SET = 0, FnV = 0
|   EA = 0, S1PTW = 0
| Data abort info:
|   ISV = 0, ISS = 0x00000006
|   CM = 0, WnR = 0
| swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000008181d000
| [ffff00004230aff8] pgd=000000007fff9003, pud=000000007fdf7003,
pmd=0000000000000000
| Internal error: Oops: 96000006 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 0 PID: 0 Comm: swapper Not tainted 5.6.0-rc3-00098-g3f6c690f5dfe
#11618
| Hardware name: linux,dummy-virt (DT)
| pstate: 80400085 (Nzcv daIf +PAN -UAO BTYPE=--)
| pc : vmemmap_pud_populate+0x2c/0xa0
| lr : vmemmap_populate+0x78/0x154

| Call trace:
|  vmemmap_pud_populate+0x2c/0xa0
|  vmemmap_populate+0x78/0x154
|  __populate_section_memmap+0x3c/0x60
|  sparse_init_nid+0x29c/0x414
|  sparse_init+0x154/0x170
|  bootmem_init+0x78/0xdc
|  setup_arch+0x280/0x5d0
|  start_kernel+0x98/0x4f8
| Code: f9469a84 92748e73 8b010e61 cb040033 (f9400261)
| random: get_random_bytes called from print_oops_end_marker+0x34/0x60
with crng_init=0
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: Attempted to kill the idle task!
| ---[ end Kernel panic - not syncing: Attempted to kill the idle task!




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-27  2:11 ` [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use Baoquan He
@ 2020-03-27 15:40   ` James Morse
  0 siblings, 0 replies; 92+ messages in thread
From: James Morse @ 2020-03-27 15:40 UTC (permalink / raw)
  To: Baoquan He
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi Baoquan,

On 3/27/20 2:11 AM, Baoquan He wrote:
> On 03/26/20 at 06:07pm, James Morse wrote:
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).

> Do you mean you use 'kexec -l /boot/vmlinuz-xxxx --initrd ...' to load a
> kernel, next you hot remove some memory regions, then you execute
> 'kexec -e' to trigger kexec reboot?

Yes. But to make it more fun, get someone else to trigger the hot-remove behind
your back!


> I may not get the point clearly, but we usually do the loading and
> triggering of kexec-ed kernel at the same time. 

But its two syscalls. Should the second one fail if the memory layout has
changed since the first?

(UEFI does this for exit-boot-services, there is handshake to prove you know
what the current memory map is)


>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.
>>
>>
>> This doesn't solve the problem for arm64, where the new kernel must
>> initially rely on the data structures from the first boot to describe
>> memory. These don't describe hotpluggable memory.
>> If kexec places the kernel in one of these regions, it must also provide
>> a DT that describes the region in which the kernel was mapped as memory.
>> (and somehow ensure its always present in the future...)
>>
>> To prevent this from happening accidentally with unaware user-space,
>> patches two and three allow arm64 to give these regions a different
>> name.
>>
>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>> were added separately.
>>
>>
>> I haven't tried kdump.
>> Unaware kdump from user-space probably won't describe the hotplug
>> regions if the name is different, which saves us from problems if
>> the memory is no longer present at kdump time, but means the vmcore
>> is incomplete.

> Currently, we will monitor udev events of mem hot add/remove, then
> reload kdump kernel. That reloading is only update the elfcorehdr,
> because crashkernel has to be reserved during 1st kernel bootup. I don't
> think this will have problem.

Great. I don't think there is much the kernel can do for the kdump case, so its
good to know the tools already exist for detecting and restarting the kdump load
when the memory layout changes.

For kdump via kexec-file-load, we would need to regenerate the elfcorehdr, I'm
hoping that can be done in core code.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-27  9:27 ` David Hildenbrand
@ 2020-03-27 15:42   ` James Morse
  2020-03-30 13:18     ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-03-27 15:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi David,

On 3/27/20 9:27 AM, David Hildenbrand wrote:
> On 26.03.20 19:07, James Morse wrote:
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).
>>
>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.

> IIRC other architectures handle that by setting the affected pages
> PageReserved. Any reason why to not stick to the same?

Hmm, I didn't spot this. How come core code doesn't do it if its needed?

Doesn't PG_Reserved prevent the page from being used for regular allocations?
(or is that only if its done early)

I prefer the runtime check as the dmesg output gives the user some chance of
knowing why their memory-offline failed, and doing something about it!


>> This doesn't solve the problem for arm64, where the new kernel must
>> initially rely on the data structures from the first boot to describe
>> memory. These don't describe hotpluggable memory.
>> If kexec places the kernel in one of these regions, it must also provide
>> a DT that describes the region in which the kernel was mapped as memory.
>> (and somehow ensure its always present in the future...)
>>
>> To prevent this from happening accidentally with unaware user-space,
>> patches two and three allow arm64 to give these regions a different
>> name.
>>
>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>> were added separately.
>>
>>
>> I haven't tried kdump.
>> Unaware kdump from user-space probably won't describe the hotplug
>> regions if the name is different, which saves us from problems if
>> the memory is no longer present at kdump time, but means the vmcore
>> is incomplete.

> Whenever memory is added/removed, kdump.service is to be restarted from
> user space, which will fixup the data structures such that kdump will
> not try to dump unplugged memory.

Cunning.


> Also, makedumpfile will check if the
> sections are still around IIRC.

Curious. I thought the vmcore was virtually addressed, how does it know which
linear-map portions correspond to sysfs memory nodes with KASLR?


> Not sure what you mean by "Unaware kdump from user-space".

The existing kexec-tools binaries, that (I assume) don't go probing to find out
if 'System RAM' is removable or not, loading a kdump kernel, along with the
user-space generated blob that describes the first kernel's memory usage to the
second kernel.

'user-space' here to distinguish all this from kexec_file_load().



Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-27  0:43   ` Anshuman Khandual
  2020-03-27  2:54     ` Baoquan He
@ 2020-03-27 15:46     ` James Morse
  1 sibling, 0 replies; 92+ messages in thread
From: James Morse @ 2020-03-27 15:46 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Bhupesh Sharma

Hi Anshuman,

On 3/27/20 12:43 AM, Anshuman Khandual wrote:
> On 03/26/2020 11:37 PM, James Morse wrote:
>> An image loaded for kexec is not stored in place, instead its segments
>> are scattered through memory, and are re-assembled when needed. In the
>> meantime, the target memory may have been removed.
>>
>> Because mm is not aware that this memory is still in use, it allows it
>> to be removed.

> Why the isolation process does not fail when these pages are currently
> being used by kexec ?

Kexec isn't using them right now, but it will once kexec is triggered.
Those physical addresses are held in some internal kexec data structures until
kexec is triggered.


>> Add a memory notifier to prevent the removal of memory regions that
>> overlap with a loaded kexec image segment. e.g., when triggered from the
>> Qemu console:
>> | kexec_core: memory region in use
>> | memory memory32: Offline failed.
> 
> Yes this is definitely an added protection for these kexec loaded kernels
> memory areas from being offlined but I would have expected the preceding
> offlining to have failed as well.

kexec hasn't allocate the memory, part of the regions user-space may specify for
the next kernel may be in use. There is nothing to stop the memory being used in
the meantime.


Any other way of doing this would prevent us saying why it failed.
Like this, the user can spot the 'kexec: Memory region in use message', and
unload kexec.


>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index c19c0dad1ebe..ba1d91e868ca 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -1219,3 +1222,56 @@ void __weak arch_kexec_protect_crashkres(void)
>>  
>>  void __weak arch_kexec_unprotect_crashkres(void)
>>  {}
>> +
>> +/*
>> + * If user-space wants to offline memory that is in use by a loaded kexec
>> + * image, it should unload the image first.
>> + */

> Probably this would need kexec user manual and related system call man pages
> update as well.

I can't see anything relevant under Documentation. (kdump yes, kexec no...)


>> +static int mem_remove_cb(struct notifier_block *nb, unsigned long action,
>> +			 void *data)
>> +{
>> +	int rv = NOTIFY_OK, i;
>> +	struct memory_notify *arg = data;
>> +	unsigned long pfn = arg->start_pfn;
>> +	unsigned long nr_segments, sstart, send;
>> +	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
>> +
>> +	might_sleep();
> 
> Required ?

Habit, and I think best practice. We take a mutex, so might_sleep(), but we also
conditionally return before lockdep would see the mutex. Having this annotation
means a dangerous change to the way this is called triggers a warning without
having to test memory hotplug explicitly.


>> +
>> +	if (action != MEM_GOING_OFFLINE)
>> +		return NOTIFY_DONE;
>> +
>> +	mutex_lock(&kexec_mutex);
>> +	if (kexec_image) {
>> +		nr_segments = kexec_image->nr_segments;
>> +
>> +		for (i = 0; i < nr_segments; i++) {
>> +			sstart = PFN_DOWN(kexec_image->segment[i].mem);
>> +			send = PFN_UP(kexec_image->segment[i].mem +
>> +				      kexec_image->segment[i].memsz);
>> +
>> +			if ((pfn <= sstart && sstart < end_pfn) ||
>> +			    (pfn <= send && send < end_pfn)) {
>> +				pr_warn("Memory region in use\n");
>> +				rv = NOTIFY_BAD;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	mutex_unlock(&kexec_mutex);
>> +
>> +	return rv;
> 
> Variable 'rv' is redundant, should use NOTIFY_[BAD|OK] directly instead.

You'd prefer a mutex_unlock() in the middle of the loop? ... or goto?
(I'm not convinced)

>> +}
>> +
>> +static struct notifier_block mem_remove_nb = {
>> +	.notifier_call = mem_remove_cb,
>> +};
>> +
>> +static int __init register_mem_remove_cb(void)
>> +{
>> +	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
> 
> Should not all these new code here be wrapped with CONFIG_MEMORY_HOTREMOVE
> to reduce the scope as well as final code size when the config is disabled.

The compiler is really good at this. "if (false)" means this is all dead code,
and static means its not exported, so the compiler is free to remove it.

Not #ifdef-ing it makes it much more readable, and means the compiler checks its
valid C before it removes it. This avoids weird header include problems that
only show up on some rand-config builds.


Thanks,

James

>> +		return register_memory_notifier(&mem_remove_nb);
>> +
>> +	return 0;
>> +}
>> +device_initcall(register_mem_remove_cb);
>>
> 



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-27  9:30   ` David Hildenbrand
@ 2020-03-27 16:56     ` James Morse
  2020-03-27 17:06       ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-03-27 16:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi David,

On 3/27/20 9:30 AM, David Hildenbrand wrote:
> On 26.03.20 19:07, James Morse wrote:
>> An image loaded for kexec is not stored in place, instead its segments
>> are scattered through memory, and are re-assembled when needed. In the
>> meantime, the target memory may have been removed.
>>
>> Because mm is not aware that this memory is still in use, it allows it
>> to be removed.
>>
>> Add a memory notifier to prevent the removal of memory regions that
>> overlap with a loaded kexec image segment. e.g., when triggered from the
>> Qemu console:
>> | kexec_core: memory region in use
>> | memory memory32: Offline failed.
>>
>> Signed-off-by: James Morse <james.morse@arm.com>
>> ---
>>  kernel/kexec_core.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 56 insertions(+)
>>
>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>> index c19c0dad1ebe..ba1d91e868ca 100644
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c

> E.g., in kernel/kexec_core.c:kimage_alloc_pages()
> 
> "SetPageReserved(pages + i);"
> 
> Pages that are reserved cannot get offlined. How are you able to trigger
> that before this patch? (where is the allocation path for kexec, which
> will not set the pages reserved?)

This sets page reserved on the memory it gets back from
alloc_pages() in kimage_alloc_pages(). This is when you load the image[0].

The problem I see is for the target or destination memory once you execute the
image. Once machine_kexec() runs, it tries to write to this, assuming it is
still present...


How can I make the commit message clearer?

're-assembled' and 'target memory' aren't quite cutting it, is there are a
correct term to use?
(destination?)



Thanks,

James


[0] Just to convince myself:
| kimage_alloc_pages+0x30/0x15c
| kimage_alloc_page+0x210/0x7d8
| kimage_load_segment+0x14c/0x8c8
| __arm64_sys_kexec_load+0x4f0/0x720
| do_el0_svc+0x13c/0x3c0
| el0_sync_handler+0x9c/0x3c0
| el0_sync+0x158/0x180


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-27 16:56     ` James Morse
@ 2020-03-27 17:06       ` David Hildenbrand
  2020-03-27 18:07         ` James Morse
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-03-27 17:06 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

On 27.03.20 17:56, James Morse wrote:
> Hi David,
> 
> On 3/27/20 9:30 AM, David Hildenbrand wrote:
>> On 26.03.20 19:07, James Morse wrote:
>>> An image loaded for kexec is not stored in place, instead its segments
>>> are scattered through memory, and are re-assembled when needed. In the
>>> meantime, the target memory may have been removed.
>>>
>>> Because mm is not aware that this memory is still in use, it allows it
>>> to be removed.
>>>
>>> Add a memory notifier to prevent the removal of memory regions that
>>> overlap with a loaded kexec image segment. e.g., when triggered from the
>>> Qemu console:
>>> | kexec_core: memory region in use
>>> | memory memory32: Offline failed.
>>>
>>> Signed-off-by: James Morse <james.morse@arm.com>
>>> ---
>>>  kernel/kexec_core.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 56 insertions(+)
>>>
>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>> index c19c0dad1ebe..ba1d91e868ca 100644
>>> --- a/kernel/kexec_core.c
>>> +++ b/kernel/kexec_core.c
> 
>> E.g., in kernel/kexec_core.c:kimage_alloc_pages()
>>
>> "SetPageReserved(pages + i);"
>>
>> Pages that are reserved cannot get offlined. How are you able to trigger
>> that before this patch? (where is the allocation path for kexec, which
>> will not set the pages reserved?)
> 
> This sets page reserved on the memory it gets back from
> alloc_pages() in kimage_alloc_pages(). This is when you load the image[0].
> 
> The problem I see is for the target or destination memory once you execute the
> image. Once machine_kexec() runs, it tries to write to this, assuming it is
> still present...

Let's recap

1. You load the image. You allocate memory for e.g., the kexec kernel.
The pages will be marked PG_reserved, so they cannot be offlined.

2. You do the kexec. The kexec kernel will only operate on a reserved
memory region (reserved via e.g., kernel cmdline crashkernel=128M).

Is it that in 2., the reserved memory region (for the crashkernel) could
have been offlined in the meantime?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-27 17:06       ` David Hildenbrand
@ 2020-03-27 18:07         ` James Morse
  2020-03-27 18:52           ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-03-27 18:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi David,

On 3/27/20 5:06 PM, David Hildenbrand wrote:
> On 27.03.20 17:56, James Morse wrote:
>> On 3/27/20 9:30 AM, David Hildenbrand wrote:
>>> On 26.03.20 19:07, James Morse wrote:
>>>> An image loaded for kexec is not stored in place, instead its segments
>>>> are scattered through memory, and are re-assembled when needed. In the
>>>> meantime, the target memory may have been removed.
>>>>
>>>> Because mm is not aware that this memory is still in use, it allows it
>>>> to be removed.
>>>>
>>>> Add a memory notifier to prevent the removal of memory regions that
>>>> overlap with a loaded kexec image segment. e.g., when triggered from the
>>>> Qemu console:
>>>> | kexec_core: memory region in use
>>>> | memory memory32: Offline failed.

>>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>>> index c19c0dad1ebe..ba1d91e868ca 100644
>>>> --- a/kernel/kexec_core.c
>>>> +++ b/kernel/kexec_core.c
>>
>>> E.g., in kernel/kexec_core.c:kimage_alloc_pages()
>>>
>>> "SetPageReserved(pages + i);"
>>>
>>> Pages that are reserved cannot get offlined. How are you able to trigger
>>> that before this patch? (where is the allocation path for kexec, which
>>> will not set the pages reserved?)
>>
>> This sets page reserved on the memory it gets back from
>> alloc_pages() in kimage_alloc_pages(). This is when you load the image[0].
>>
>> The problem I see is for the target or destination memory once you execute the
>> image. Once machine_kexec() runs, it tries to write to this, assuming it is
>> still present...

> Let's recap
> 
> 1. You load the image. You allocate memory for e.g., the kexec kernel.
> The pages will be marked PG_reserved, so they cannot be offlined.
> 
> 2. You do the kexec. The kexec kernel will only operate on a reserved
> memory region (reserved via e.g., kernel cmdline crashkernel=128M).

I think you are merging the kexec and kdump behaviours.
(Wrong terminology? The things behind 'kexec -l Image' and 'kexec -p Image')

For kdump, yes, the new kernel is loaded into the crashkernel reservation, and
confined to it.


For regular kexec, the new kernel can be loaded any where in memory. There might
be a difference with how this works on arm64....

The regular kexec kernel isn't stored in its final location when its loaded, its
relocated there when the image is executed. The target/destination memory may
have been removed in the meantime.

(an example recipe below should clarify this)


> Is it that in 2., the reserved memory region (for the crashkernel) could
> have been offlined in the meantime?

No, for kdump: the crashkernel reservation is PG_reserved, and its not something
mm knows how to move, so that region can't be taken offline.

(On arm64 we additionally prevent the boot-memory from being removed as it is
all described as present by UEFI. The crashkernel reservation would always be
from this type of memory)


This is about a regular kexec, any crashdump reservation is irrelevant.
This kexec kernel is temporarily stored out of line, then relocated when executed.

A recipe so that we're at least on the same terminal! This is on a TX2 running
arm64's for-next/core using Qemu-TCG to emulate x86. (Sorry for the bizarre
config, its because Qemu supports hotremove on x86, but not yet on arm64).


Insert the memory:
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

| root@vm:~# free -m
|               total        used        free      shared  ...
| Mem:            918          52         814           0  ...
| Swap:             0           0           0


Bring it online:
| root@vm:~# cd /sys/devices/system/memory/
| root@vm:/sys/devices/system/memory# for F in memory3*; do echo \
| online_movable > $F/state; done

| Built 1 zonelists, mobility grouping on.  Total pages: 251049
| Policy zone: DMA32

| -bash: echo: write error: Invalid argument
| root@vm:/sys/devices/system/memory# free -m
|               total        used        free      shared  ...
| Mem:           1942          53        1836           0  ...
| Swap:             0           0           0


Load kexec:
| root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline

Press the Attention button to request removal:

(qemu) device_del dimm1

| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Built 1 zonelists, mobility grouping on.  Total pages: 233728
| Policy zone: DMA32

The memory is gone:
| root@vm:/sys/devices/system/memory# free -m
|               total        used        free      shared  ...
| Mem:            918          89         769           0  ...
| Swap:             0           0           0

Trigger kexec:
| root@vm:/sys/devices/system/memory# kexec -e

[...]

| sd 0:0:0:0: [sda] Synchronizing SCSI cache
| kexec_core: Starting new kernel

... and Qemu restarts the platform firmware instead of proceeding with kexec.
(I assume this is a triple fault)

You can use mem-min and mem-max to control where kexec's user space will place
the memory.


If you apply this patch, the above sequence will fail at the device remove step,
as the physical addresses match the loaded kexec image:

| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| kexec_core: Memory region in use
| kexec_core: Memory region in use
| memory memory39: Offline failed.
| Built 1 zonelists, mobility grouping on.  Total pages: 299212
| Policy zone: Normal

| root@vm:/sys/devices/system/memory# free -m
|               total        used        free      shared  ...
| Mem:           1942          90        1793           0    ...
| Swap:             0           0           0

I can't remove the DIMM, because we failed to offline it:

(qemu) object_del mem1
object 'mem1' is in use, can not be deleted

and I can trigger kexec and boot the new kernel.

kexec user-space here comes from debian bullseye. It picked the removable memory
all by itself without any additional arguments.


(a different issue that can be ignored for now: x86 additionally fails to reboot
if I remove memory, even if its not in use by the kexec image. This doesn't
cause qemu to reboot via firmware, I think it dies before the console. It
doesn't happen on arm64. I suspect the memory map is snapshotted and assumed to
still be correct when the image is executed.)


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-27 18:07         ` James Morse
@ 2020-03-27 18:52           ` David Hildenbrand
  2020-03-30 13:00             ` James Morse
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-03-27 18:52 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

>> 2. You do the kexec. The kexec kernel will only operate on a reserved
>> memory region (reserved via e.g., kernel cmdline crashkernel=128M).
> 
> I think you are merging the kexec and kdump behaviours.
> (Wrong terminology? The things behind 'kexec -l Image' and 'kexec -p Image')

Oh, I see - I think your example below clarifies things. Something like
that should go in the cover letter if we end up in this patch being
required :)

(I missed that the problematic part is "random" addresses passed by user
space to the kernel, where it wants data to be loaded to on kexec -e)

> 
> For kdump, yes, the new kernel is loaded into the crashkernel reservation, and
> confined to it.
> 
> 
> For regular kexec, the new kernel can be loaded any where in memory. There might
> be a difference with how this works on arm64....
> 
> The regular kexec kernel isn't stored in its final location when its loaded, its
> relocated there when the image is executed. The target/destination memory may
> have been removed in the meantime.
> 
> (an example recipe below should clarify this)
> 
> 
>> Is it that in 2., the reserved memory region (for the crashkernel) could
>> have been offlined in the meantime?
> 
> No, for kdump: the crashkernel reservation is PG_reserved, and its not something
> mm knows how to move, so that region can't be taken offline.
> 
> (On arm64 we additionally prevent the boot-memory from being removed as it is
> all described as present by UEFI. The crashkernel reservation would always be
> from this type of memory)

Right.

> 
> 
> This is about a regular kexec, any crashdump reservation is irrelevant.
> This kexec kernel is temporarily stored out of line, then relocated when executed.
> 
> A recipe so that we're at least on the same terminal! This is on a TX2 running
> arm64's for-next/core using Qemu-TCG to emulate x86. (Sorry for the bizarre
> config, its because Qemu supports hotremove on x86, but not yet on arm64).
> 
> 
> Insert the memory:
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> 
> | root@vm:~# free -m
> |               total        used        free      shared  ...
> | Mem:            918          52         814           0  ...
> | Swap:             0           0           0
> 
> 
> Bring it online:
> | root@vm:~# cd /sys/devices/system/memory/
> | root@vm:/sys/devices/system/memory# for F in memory3*; do echo \
> | online_movable > $F/state; done
> 
> | Built 1 zonelists, mobility grouping on.  Total pages: 251049
> | Policy zone: DMA32
> 
> | -bash: echo: write error: Invalid argument
> | root@vm:/sys/devices/system/memory# free -m
> |               total        used        free      shared  ...
> | Mem:           1942          53        1836           0  ...
> | Swap:             0           0           0
> 
> 
> Load kexec:
> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
> 

I assume this will trigger

kexec_load -> do_kexec_load -> kimage_load_segment ->
kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages

Which will just allocate a bunch of pages and mark them reserved.

Now, AFAIKs, all allocations will be unmovable. So none of the kexec
segment allocations will actually end up on your DIMM (as it is onlined
online_movable).

So, the loaded image (with its segments) from user won't be problematic
and not get placed on your DIMM.


Now, the problematic part is (via man kexec_load) "mem and memsz specify
a physical address range that is the target of the copy."

So the place where the image will be "assembled" at when doing the
reboot. Understood :)

> Press the Attention button to request removal:
> 
> (qemu) device_del dimm1
> 
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Built 1 zonelists, mobility grouping on.  Total pages: 233728
> | Policy zone: DMA32
> 
> The memory is gone:
> | root@vm:/sys/devices/system/memory# free -m
> |               total        used        free      shared  ...
> | Mem:            918          89         769           0  ...
> | Swap:             0           0           0
> 
> Trigger kexec:
> | root@vm:/sys/devices/system/memory# kexec -e
> 
> [...]
> 
> | sd 0:0:0:0: [sda] Synchronizing SCSI cache
> | kexec_core: Starting new kernel
> 
> ... and Qemu restarts the platform firmware instead of proceeding with kexec.
> (I assume this is a triple fault)
> 
> You can use mem-min and mem-max to control where kexec's user space will place
> the memory.
> 
> 
> If you apply this patch, the above sequence will fail at the device remove step,
> as the physical addresses match the loaded kexec image:
> 
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | Offlined Pages 32768
> | kexec_core: Memory region in use
> | kexec_core: Memory region in use

Okay, so I assume the kexec userspace tool provided target kernel
addresses for segments that reside on the DIMM.


> | memory memory39: Offline failed.
> | Built 1 zonelists, mobility grouping on.  Total pages: 299212
> | Policy zone: Normal
> 
> | root@vm:/sys/devices/system/memory# free -m
> |               total        used        free      shared  ...
> | Mem:           1942          90        1793           0    ...
> | Swap:             0           0           0
> 
> I can't remove the DIMM, because we failed to offline it:

I wonder if we should instead make the "kexec -e" fail. It tries to
touch random system memory.

Denying to offline MOVABLE memory should be avoided - and what kexec
does here sounds dangerous to me (allowing it to write random system
memory).

Roughly what I am thinking is this:

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index ba1d91e868ca..70c39a5307e5 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1135,6 +1135,10 @@ int kernel_kexec(void)
                error = -EINVAL;
                goto Unlock;
        }
+       if (!kexec_image_validate()) {
+               error = -EINVAL;
+               goto Unlock;
+       }

 #ifdef CONFIG_KEXEC_JUMP
        if (kexec_image->preserve_context) {


kexec_image_validate() would go over all segments and validate that the
involved pages are actual valid memory (pfn_to_online_page()).

All we have to do is protect from memory hotplug until we switch to the
new kernel.

Will probably need some thought. But it will actually also bail out when
user space passes wrong physical memory addresses, instead of
triple-faulting silently.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-27 18:52           ` David Hildenbrand
@ 2020-03-30 13:00             ` James Morse
  2020-03-30 13:13               ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-03-30 13:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi David,

On 3/27/20 6:52 PM, David Hildenbrand wrote:
>>> 2. You do the kexec. The kexec kernel will only operate on a reserved
>>> memory region (reserved via e.g., kernel cmdline crashkernel=128M).
>>
>> I think you are merging the kexec and kdump behaviours.
>> (Wrong terminology? The things behind 'kexec -l Image' and 'kexec -p Image')
> 
> Oh, I see - I think your example below clarifies things. Something like
> that should go in the cover letter if we end up in this patch being
> required :)

Do you mean the commit message? I think its far too long...

Adding a sentence about the way kexec load works may help, the first paragraph
would read:

| Kexec allows user-space to specify the address that the kexec image should be
| loaded to. Because this memory may be in use, an image loaded for kexec is not
| stored in place, instead its segments are scattered through memory, and are
| re-assembled when needed. In the meantime, the target memory may have been
| removed.

Do you think thats clearer?


> (I missed that the problematic part is "random" addresses passed by user
> space to the kernel, where it wants data to be loaded to on kexec -e)

[...]

>> Load kexec:
>> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
>>
> 
> I assume this will trigger
> 
> kexec_load -> do_kexec_load -> kimage_load_segment ->
> kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages
> 
> Which will just allocate a bunch of pages and mark them reserved.
> 
> Now, AFAIKs, all allocations will be unmovable. So none of the kexec
> segment allocations will actually end up on your DIMM (as it is onlined
> online_movable).
> 
> So, the loaded image (with its segments) from user won't be problematic
> and not get placed on your DIMM.
> 
> 
> Now, the problematic part is (via man kexec_load) "mem and memsz specify
> a physical address range that is the target of the copy."
> 
> So the place where the image will be "assembled" at when doing the
> reboot. Understood :)

Yup.

[...]

> I wonder if we should instead make the "kexec -e" fail. It tries to
> touch random system memory.

Heh, isn't touching random system memory what kexec does?!

Its all described to user-space as 'System RAM'. Teaching it to probe
/sys/devices/memory/... would require a user-space change.


> Denying to offline MOVABLE memory should be avoided - and what kexec
> does here sounds dangerous to me (allowing it to write random system
> memory).

> Roughly what I am thinking is this:
> 
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index ba1d91e868ca..70c39a5307e5 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1135,6 +1135,10 @@ int kernel_kexec(void)
>                 error = -EINVAL;
>                 goto Unlock;
>         }
> +       if (!kexec_image_validate()) {
> +               error = -EINVAL;
> +               goto Unlock;
> +       }
> 
>  #ifdef CONFIG_KEXEC_JUMP
>         if (kexec_image->preserve_context) {
> 
> 
> kexec_image_validate() would go over all segments and validate that the
> involved pages are actual valid memory (pfn_to_online_page()).
> 
> All we have to do is protect from memory hotplug until we switch to the
> new kernel.

(migrate_to_reboot_cpu() can sleep), I think you'd end up with something like
this patch, but only while kexec_in_progress. I don't think letting kexec fail
if the events occur in a different order is good for user-space.


> Will probably need some thought. But it will actually also bail out when
> user space passes wrong physical memory addresses, instead of
> triple-faulting silently.

With this change, the reboot(LINUX_REBOOT_CMD_KEXEC), call would fail. This
thing doesn't usually return, so we're likely to trigger error-handling that has
never run before.

(Last time I debugged one of these, it turned out kexec had taken the network
interfaces down, meaning the nfsroot was no longer accessible)

How can user-space know whether kexec is going to succeed, or fail like this?
Any loaded kexec kernel could secretly be in this broken state.

Can user-space know what caused this to become unreliable? (without reading the
kernel source)


Given kexec can be unloaded by user-space, I think its better to prevent us
getting into the broken state, preferably giving the hint that kexec us using
that memory. The user can 'kexec -u', then retry removing the memory.

I think forbidding the memory-offline is simpler for user-space to deal with.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-30 13:00             ` James Morse
@ 2020-03-30 13:13               ` David Hildenbrand
  2020-03-30 17:17                 ` James Morse
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-03-30 13:13 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

> Adding a sentence about the way kexec load works may help, the first paragraph
> would read:
> 
> | Kexec allows user-space to specify the address that the kexec image should be
> | loaded to. Because this memory may be in use, an image loaded for kexec is not
> | stored in place, instead its segments are scattered through memory, and are
> | re-assembled when needed. In the meantime, the target memory may have been
> | removed.
> 
> Do you think thats clearer?

Yes, very much. Maybe add, that the target is described by user space
during kexec_load() and that user space - right now - parses /proc/iomem
to find applicable system memory.

> [...]
> 
>>> Load kexec:
>>> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
>>>
>>
>> I assume this will trigger
>>
>> kexec_load -> do_kexec_load -> kimage_load_segment ->
>> kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages
>>
>> Which will just allocate a bunch of pages and mark them reserved.
>>
>> Now, AFAIKs, all allocations will be unmovable. So none of the kexec
>> segment allocations will actually end up on your DIMM (as it is onlined
>> online_movable).
>>
>> So, the loaded image (with its segments) from user won't be problematic
>> and not get placed on your DIMM.
>>
>>
>> Now, the problematic part is (via man kexec_load) "mem and memsz specify
>> a physical address range that is the target of the copy."
>>
>> So the place where the image will be "assembled" at when doing the
>> reboot. Understood :)
> 
> Yup.
> 
> [...]
> 
>> I wonder if we should instead make the "kexec -e" fail. It tries to
>> touch random system memory.
> 
> Heh, isn't touching random system memory what kexec does?!

Having a racy user interface that can trigger kernel crashes feels very
wrong. We should limit the impact.

> 
> Its all described to user-space as 'System RAM'. Teaching it to probe
> /sys/devices/memory/... would require a user-space change.

I think we should really rename hotplugged memory on all architectures.

Especially also relevant for virtio-mem/hyper-v balloon, where some
pieces of (hotplugged )memory blocks are partially unavailable and
should not be touched - accessing them results in unpredictable behavior
(e.g., crashes or discarded writes).

[...]

>> Will probably need some thought. But it will actually also bail out when
>> user space passes wrong physical memory addresses, instead of
>> triple-faulting silently.
> 
> With this change, the reboot(LINUX_REBOOT_CMD_KEXEC), call would fail. This
> thing doesn't usually return, so we're likely to trigger error-handling that has
> never run before.
> 
> (Last time I debugged one of these, it turned out kexec had taken the network
> interfaces down, meaning the nfsroot was no longer accessible)
> 
> How can user-space know whether kexec is going to succeed, or fail like this?
> Any loaded kexec kernel could secretly be in this broken state.
> 
> Can user-space know what caused this to become unreliable? (without reading the
> kernel source)
> 
> 
> Given kexec can be unloaded by user-space, I think its better to prevent us
> getting into the broken state, preferably giving the hint that kexec us using
> that memory. The user can 'kexec -u', then retry removing the memory.
> 
> I think forbidding the memory-offline is simpler for user-space to deal with.

I thought about this over the weekend, and I don't think it's the right
approach.

1. It's racy. If memory is getting offlined/unplugged just while user
space is about to trigger the kexec_load(), you end up with the very
same triple-fault.

2. It's semantically wrong. kexec does not need online memory ("managed
by the buddy"), but still you disallow offlining memory.


I would really much rather want to see user-space choosing boot memory
(e.g., renaming hotplugged memory on all architectures), and checking
during "kexec -e" if the selected memory is actually "there", before
trying to write to it.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-27 15:42   ` James Morse
@ 2020-03-30 13:18     ` David Hildenbrand
  0 siblings, 0 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-03-30 13:18 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

On 27.03.20 16:42, James Morse wrote:
> Hi David,
> 
> On 3/27/20 9:27 AM, David Hildenbrand wrote:
>> On 26.03.20 19:07, James Morse wrote:
>>> arm64 recently queued support for memory hotremove, which led to some
>>> new corner cases for kexec.
>>>
>>> If the kexec segments are loaded for a removable region, that region may
>>> be removed before kexec actually occurs. This causes the first kernel to
>>> lockup when applying the relocations. (I've triggered this on x86 too).
>>>
>>> The first patch adds a memory notifier for kexec so that it can refuse
>>> to allow in-use regions to be taken offline.
> 
>> IIRC other architectures handle that by setting the affected pages
>> PageReserved. Any reason why to not stick to the same?
> 
> Hmm, I didn't spot this. How come core code doesn't do it if its needed?
> 
> Doesn't PG_Reserved prevent the page from being used for regular allocations?
> (or is that only if its done early)
> 
> I prefer the runtime check as the dmesg output gives the user some chance of
> knowing why their memory-offline failed, and doing something about it!

I was confused which memory we are trying to protect. Understood now,
that you are dealing with the target physical memory described during
described during kexec_load.

[...]

> 
>> Also, makedumpfile will check if the
>> sections are still around IIRC.
> 
> Curious. I thought the vmcore was virtually addressed, how does it know which
> linear-map portions correspond to sysfs memory nodes with KASLR?

That's a very interesting question. I remember there was KASLR support
being implemented specifically for that - but I don't know any details.

>> Not sure what you mean by "Unaware kdump from user-space".
> 
> The existing kexec-tools binaries, that (I assume) don't go probing to find out
> if 'System RAM' is removable or not, loading a kdump kernel, along with the
> user-space generated blob that describes the first kernel's memory usage to the
> second kernel.

Finally understood how kexec without kdump works, thanks.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-03-27 15:39     ` James Morse
@ 2020-03-30 13:23       ` David Hildenbrand
  2020-03-30 17:17         ` James Morse
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-03-30 13:23 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma,
	Vitaly Kuznetsov

>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 0a54ffac8c68..69b03dd7fc74 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -42,6 +42,10 @@
>>>  #include "internal.h"
>>>  #include "shuffle.h"
>>>  
>>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>>> +#endif
>>
>> So I assume changing this for all architectures would result in some
>> user space tool breaking? Are we aware of any?
> 
> Last time we had to touch arm64's /proc/iomem strings I went through debian's
> codesearch for stuff that reads it, kexec-tools was the only thing that parsed
> it in anger. (From memory, the other tools were looking for PCIe windows to do
> firmware flashing..)
> 
> Looking again, having qualifiers on the end of 'System RAM' looks like it could
> confuse 's390-tools's detect_mem_chunks parser.

Good to know, we should find out if this could work.

> 
> It looks like the strings that come out of 'FIRMWARE_MEMMAP' are a duplicate set.
> 
> 
>> I do wonder if we should simply change it for all architectures if possible.
> 
> If its possible that would be great. But I suspect that ship has sailed,
> changing it on other architectures could break some fragile parsing code.

I assume any parser has to be prepared for new types showing up.
Otherwise these would not be future proof. The question is if a common
prefix is problematic.

E.g., Use "Hotplugged System RAM" instead of "System RAM (hotplug)"

> 
> I'm wary of changing it on arm64, the only thing that makes it tolerable is that
> memory hot-add was relatively recently merged, and we don't anticipate it being
> widely used until you can remove memory as well.
> 
> Changing it on arm64 is to prevent today's versions of kexec-tools from
> accidentally placing the new kernel in memory that wasn't described at boot.
> This leads to an unhandled exception during boot[0] because the kernel can't
> access itself via the mapping of all memory. (hotpluggable regions are only
> discovered by suitably configured ACPI systems much later)

I want the very same for virtio-mem (initially x86-only, but later open
for all archs). Can also be interesting for Hyper-V. kexec should not
try to use hotplugged memory as kexec target, as memory blocks can be
partially inaccessible.

Of course, I can provide an interface to override the name via
add_memory(), but having it on all architectures handled in a similar
way right from the start would be nicer.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
                   ` (4 preceding siblings ...)
  2020-03-27  9:27 ` David Hildenbrand
@ 2020-03-30 13:55 ` Baoquan He
  2020-03-30 17:17   ` James Morse
  2020-03-31  3:38 ` Dave Young
  2020-04-15 20:29 ` Eric W. Biederman
  7 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-03-30 13:55 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi James,

On 03/26/20 at 06:07pm, James Morse wrote:
> Hello!
> 
> arm64 recently queued support for memory hotremove, which led to some
> new corner cases for kexec.
> 
> If the kexec segments are loaded for a removable region, that region may
> be removed before kexec actually occurs. This causes the first kernel to
> lockup when applying the relocations. (I've triggered this on x86 too).
> 
> The first patch adds a memory notifier for kexec so that it can refuse
> to allow in-use regions to be taken offline.

I talked about this with Dave Young. Currently, we tend to use
kexec_file_load more in the future since most of its implementation is
in kernel, we can get information about kernel more easilier. For the
kexec kernel loaded into hotpluggable area, we can fix it in
kexec_file_load side, we know the MOVABLE zone's start and end. As for
the old kexec_load, we would like to keep it for back compatibility. At
least in our distros, we have switched to kexec_file_load, will
gradually obsolete kexec_load. So for this one, I suggest avoiding those
MOVZBLE memory region when searching place for kexec kernel.

Not sure if arm64 will still have difficulty.

> 
> 
> This doesn't solve the problem for arm64, where the new kernel must
> initially rely on the data structures from the first boot to describe
> memory. These don't describe hotpluggable memory.
> If kexec places the kernel in one of these regions, it must also provide
> a DT that describes the region in which the kernel was mapped as memory.
> (and somehow ensure its always present in the future...)
> 
> To prevent this from happening accidentally with unaware user-space,
> patches two and three allow arm64 to give these regions a different
> name.
> 
> This is a change in behaviour for arm64 as memory hotadd and hotremove
> were added separately.
> 
> 
> I haven't tried kdump.
> Unaware kdump from user-space probably won't describe the hotplug
> regions if the name is different, which saves us from problems if
> the memory is no longer present at kdump time, but means the vmcore
> is incomplete.
> 
> 
> These patches are based on arm64's for-next/core branch, but can all
> be merged independently.
> 
> Thanks,
> 
> James Morse (3):
>   kexec: Prevent removal of memory in use by a loaded kexec image
>   mm/memory_hotplug: Allow arch override of non boot memory resource
>     names
>   arm64: memory: Give hotplug memory a different resource name
> 
>  arch/arm64/include/asm/memory.h | 11 +++++++
>  kernel/kexec_core.c             | 56 +++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c             |  6 +++-
>  3 files changed, 72 insertions(+), 1 deletion(-)
> 
> -- 
> 2.25.1
> 
> 



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-30 13:55 ` Baoquan He
@ 2020-03-30 17:17   ` James Morse
  2020-03-31  3:46     ` Dave Young
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-03-30 17:17 UTC (permalink / raw)
  To: Baoquan He
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi Baoquan,

On 3/30/20 2:55 PM, Baoquan He wrote:
> On 03/26/20 at 06:07pm, James Morse wrote:
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).
>>
>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.
> 
> I talked about this with Dave Young. Currently, we tend to use
> kexec_file_load more in the future since most of its implementation is
> in kernel, we can get information about kernel more easilier. For the
> kexec kernel loaded into hotpluggable area, we can fix it in
> kexec_file_load side, we know the MOVABLE zone's start and end. As for
> the old kexec_load, we would like to keep it for back compatibility. At
> least in our distros, we have switched to kexec_file_load, will
> gradually obsolete kexec_load.

> So for this one, I suggest avoiding those
> MOVZBLE memory region when searching place for kexec kernel.

How does today's user-space know?


> Not sure if arm64 will still have difficulty.

arm64 added support for kexec_load first, then kexec_file_load. (evidently a
mistake).
kexec_file_load support was only added in the last year or so, I'd hazard most
people using this, are using the regular load kind. (and probably don't know or
care).



Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-03-30 13:23       ` David Hildenbrand
@ 2020-03-30 17:17         ` James Morse
  0 siblings, 0 replies; 92+ messages in thread
From: James Morse @ 2020-03-30 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma,
	Vitaly Kuznetsov

Hi David,

On 3/30/20 2:23 PM, David Hildenbrand wrote:
>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>> index 0a54ffac8c68..69b03dd7fc74 100644
>>>> --- a/mm/memory_hotplug.c
>>>> +++ b/mm/memory_hotplug.c
>>>> @@ -42,6 +42,10 @@
>>>>  #include "internal.h"
>>>>  #include "shuffle.h"
>>>>  
>>>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>>>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>>>> +#endif
>>>
>>> So I assume changing this for all architectures would result in some
>>> user space tool breaking? Are we aware of any?
>>
>> Last time we had to touch arm64's /proc/iomem strings I went through debian's
>> codesearch for stuff that reads it, kexec-tools was the only thing that parsed
>> it in anger. (From memory, the other tools were looking for PCIe windows to do
>> firmware flashing..)
>>
>> Looking again, having qualifiers on the end of 'System RAM' looks like it could
>> confuse 's390-tools's detect_mem_chunks parser.
> 
> Good to know, we should find out if this could work.
> 
>>
>> It looks like the strings that come out of 'FIRMWARE_MEMMAP' are a duplicate set.
>>
>>
>>> I do wonder if we should simply change it for all architectures if possible.
>>
>> If its possible that would be great. But I suspect that ship has sailed,
>> changing it on other architectures could break some fragile parsing code.
> 
> I assume any parser has to be prepared for new types showing up.
> Otherwise these would not be future proof. The question is if a common
> prefix is problematic.
> 
> E.g., Use "Hotplugged System RAM" instead of "System RAM (hotplug)"

Aha, I went for a (suffix) because that is what 32bit Arm did for the boot alias.


>> I'm wary of changing it on arm64, the only thing that makes it tolerable is that
>> memory hot-add was relatively recently merged, and we don't anticipate it being
>> widely used until you can remove memory as well.
>>
>> Changing it on arm64 is to prevent today's versions of kexec-tools from
>> accidentally placing the new kernel in memory that wasn't described at boot.
>> This leads to an unhandled exception during boot[0] because the kernel can't
>> access itself via the mapping of all memory. (hotpluggable regions are only
>> discovered by suitably configured ACPI systems much later)

> I want the very same for virtio-mem (initially x86-only, but later open
> for all archs). Can also be interesting for Hyper-V. kexec should not
> try to use hotplugged memory as kexec target, as memory blocks can be
> partially inaccessible.

Great! I assumed these placement requirements would be arm64 specific.


> Of course, I can provide an interface to override the name via
> add_memory(), but having it on all architectures handled in a similar
> way right from the start would be nicer.

I agree having it the same on all architectures would be good.

It sounds like virtio-mem is a better argument for doing this than arm64's
firmware memory description.

I'll have a read, and maybe post something to linux-arch to do this at rc1.
(I assume we'll have a few weeks to make sure arm64 at least uses the same name
if it goes on longer)


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-30 13:13               ` David Hildenbrand
@ 2020-03-30 17:17                 ` James Morse
  2020-03-30 18:14                   ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-03-30 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi David,

On 3/30/20 2:13 PM, David Hildenbrand wrote:
>> Adding a sentence about the way kexec load works may help, the first paragraph
>> would read:
>>
>> | Kexec allows user-space to specify the address that the kexec image should be
>> | loaded to. Because this memory may be in use, an image loaded for kexec is not
>> | stored in place, instead its segments are scattered through memory, and are
>> | re-assembled when needed. In the meantime, the target memory may have been
>> | removed.
>>
>> Do you think thats clearer?
> 
> Yes, very much. Maybe add, that the target is described by user space
> during kexec_load() and that user space - right now - parses /proc/iomem
> to find applicable system memory.

(I don't think x86 parses /proc/iomem anymore). I'll repost this patch with that
expanded commit message, once we've agreed this is the right thing to do!


>>> I wonder if we should instead make the "kexec -e" fail. It tries to
>>> touch random system memory.
>>
>> Heh, isn't touching random system memory what kexec does?!
> 
> Having a racy user interface that can trigger kernel crashes feels very
> wrong. We should limit the impact.


>> Its all described to user-space as 'System RAM'. Teaching it to probe
>> /sys/devices/memory/... would require a user-space change.
> 
> I think we should really rename hotplugged memory on all architectures.
> 
> Especially also relevant for virtio-mem/hyper-v balloon, where some
> pieces of (hotplugged )memory blocks are partially unavailable and
> should not be touched - accessing them results in unpredictable behavior
> (e.g., crashes or discarded writes).

I'll need to look into these. I'd assume for KVM that virtio-mem can be brought
back when its accessed ... its just going to be slow.


>>> Will probably need some thought. But it will actually also bail out when
>>> user space passes wrong physical memory addresses, instead of
>>> triple-faulting silently.
>>
>> With this change, the reboot(LINUX_REBOOT_CMD_KEXEC), call would fail. This
>> thing doesn't usually return, so we're likely to trigger error-handling that has
>> never run before.
>>
>> (Last time I debugged one of these, it turned out kexec had taken the network
>> interfaces down, meaning the nfsroot was no longer accessible)
>>
>> How can user-space know whether kexec is going to succeed, or fail like this?
>> Any loaded kexec kernel could secretly be in this broken state.
>>
>> Can user-space know what caused this to become unreliable? (without reading the
>> kernel source)
>>
>>
>> Given kexec can be unloaded by user-space, I think its better to prevent us
>> getting into the broken state, preferably giving the hint that kexec us using
>> that memory. The user can 'kexec -u', then retry removing the memory.
>>
>> I think forbidding the memory-offline is simpler for user-space to deal with.
> 
> I thought about this over the weekend, and I don't think it's the right
> approach.


> 1. It's racy. If memory is getting offlined/unplugged just while user
> space is about to trigger the kexec_load(), you end up with the very
> same triple-fault.

load? How is this different to user-space providing a bogus address?

Sure, user-space may take a nap between parsing /proc/iomem and calling
kexec_load(), but the kernel should reject these as they would never work.

(I can't see where sanity_check_segment_list() considers the platform's memory.
If it doesn't, we should fix it)

Once the image is loaded, and clashes with a request to remove the memory there
are two choices: secretly unload the image, or prevent the memory being taken
offline.


> 2. It's semantically wrong. kexec does not need online memory ("managed
> by the buddy"), but still you disallow offlining memory.

It does need the memory if you want 'kexec -e' to succeed.
If there were any sanity tests, they should have happened at load time.

The memory is effectively in use by the loaded kexec image. User-space told the
kernel to use this memory, you should not be able to then remove it, without
unloading the kexec image first.


Are you saying feeding bogus addresses to kexec_load() is _expected_ to blow up
like this?


> I would really much rather want to see user-space choosing boot memory
> (e.g., renaming hotplugged memory on all architectures), and checking
> during "kexec -e" if the selected memory is actually "there", before
> trying to write to it.

How does 'kexec -e' know where the kexec kernel was loaded? You'd need to pass
something between 'load' and 'exec'. How do you keep existing user-space working
as much as possible?

What do you do if the memory isn't there? User-space just called reboot(), it
would be better to avoid getting into the situation where we have to fail that call.


Solving the bigger problem, would add a 'kexec_it_now' flag to the kexec_load()
call. This would make the window where 'stuff' can change much smaller. Things
changing while user-space sleeps isn't a solvable problem, these would need to
be rejected by sanity tests at load time.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-30 17:17                 ` James Morse
@ 2020-03-30 18:14                   ` David Hildenbrand
  2020-04-10 19:10                     ` Andrew Morton
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-03-30 18:14 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman, Andrew Morton,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

On 30.03.20 19:17, James Morse wrote:
> Hi David,
> 
> On 3/30/20 2:13 PM, David Hildenbrand wrote:
>>> Adding a sentence about the way kexec load works may help, the first paragraph
>>> would read:
>>>
>>> | Kexec allows user-space to specify the address that the kexec image should be
>>> | loaded to. Because this memory may be in use, an image loaded for kexec is not
>>> | stored in place, instead its segments are scattered through memory, and are
>>> | re-assembled when needed. In the meantime, the target memory may have been
>>> | removed.
>>>
>>> Do you think thats clearer?
>>
>> Yes, very much. Maybe add, that the target is described by user space
>> during kexec_load() and that user space - right now - parses /proc/iomem
>> to find applicable system memory.
> 
> (I don't think x86 parses /proc/iomem anymore). I'll repost this patch with that
> expanded commit message, once we've agreed this is the right thing to do!

Right, I can see kexec-tools parsing /sys/firmware/memmap first.
Unfortunately, all hotplugged memory (via add_memory()) is indicated
there as System RAM ... including memory added by virtio-mem.

I think we should adapt the type there as well. (in your patch #2)

	firmware_map_add_hotplug(start, start + size, "System RAM");

> 
> 
>>>> I wonder if we should instead make the "kexec -e" fail. It tries to
>>>> touch random system memory.
>>>
>>> Heh, isn't touching random system memory what kexec does?!
>>
>> Having a racy user interface that can trigger kernel crashes feels very
>> wrong. We should limit the impact.
> 
> 
>>> Its all described to user-space as 'System RAM'. Teaching it to probe
>>> /sys/devices/memory/... would require a user-space change.
>>
>> I think we should really rename hotplugged memory on all architectures.
>>
>> Especially also relevant for virtio-mem/hyper-v balloon, where some
>> pieces of (hotplugged )memory blocks are partially unavailable and
>> should not be touched - accessing them results in unpredictable behavior
>> (e.g., crashes or discarded writes).
> 
> I'll need to look into these. I'd assume for KVM that virtio-mem can be brought
> back when its accessed ... its just going to be slow.

Touching unplugged virtio-mem memory can result in unpredictable
behavior. Touching (some) unplugged Hyper-V memory will be handled
similarly AFAIK.

[...]

>> 1. It's racy. If memory is getting offlined/unplugged just while user
>> space is about to trigger the kexec_load(), you end up with the very
>> same triple-fault.
> 
> load? How is this different to user-space providing a bogus address?

I guess it's not different. It's just racy because user space with good
intend could crash the system :)

> 
> Sure, user-space may take a nap between parsing /proc/iomem and calling
> kexec_load(), but the kernel should reject these as they would never work.
> 
> (I can't see where sanity_check_segment_list() considers the platform's memory.
> If it doesn't, we should fix it)

Right, that's what I meant. I was not able to find any sanity checks.
Maybe they are in place but I was not able to spot them.

> 
> Once the image is loaded, and clashes with a request to remove the memory there
> are two choices: secretly unload the image, or prevent the memory being taken
> offline.

Exactly. Or make "kexec -e" fail.

> 
> 
>> 2. It's semantically wrong. kexec does not need online memory ("managed
>> by the buddy"), but still you disallow offlining memory.
> 
> It does need the memory if you want 'kexec -e' to succeed.
> If there were any sanity tests, they should have happened at load time.

Offlining != removing. That's the point I was trying to make. (and we
don't want to block removing of memory in the kernel any other way)

> 
> The memory is effectively in use by the loaded kexec image. User-space told the
> kernel to use this memory, you should not be able to then remove it, without
> unloading the kexec image first.

It's not in use before you do the "kexec -e" IMHO.

> Are you saying feeding bogus addresses to kexec_load() is _expected_ to blow up
> like this?

No, not at all. I think this should be fixed if this is possible.

> 
>> I would really much rather want to see user-space choosing boot memory
>> (e.g., renaming hotplugged memory on all architectures), and checking
>> during "kexec -e" if the selected memory is actually "there", before
>> trying to write to it.
> 
> How does 'kexec -e' know where the kexec kernel was loaded? You'd need to pass
> something between 'load' and 'exec'. How do you keep existing user-space working
> as much as possible?

If we use new types (e.g., "System RAM (hotplugged)"), looks like most
of kexec will continue working (memory will be treated like
RANGE_RESERVED or ignored).

I guess we would still have to teach kexec-tools the new types,
primarily to keep the crash memory ranges from getting detected
properly. (no idea how they are used, will have to take a closer look)

> 
> What do you do if the memory isn't there? User-space just called reboot(), it
> would be better to avoid getting into the situation where we have to fail that call.

In kernel_kexec() we already fail if there is no kernel image loaded, so
we can similarly simply fail if the kernel image cannot be moved to the
target memory IMHO.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name
  2020-03-26 18:07 ` [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name James Morse
@ 2020-03-30 19:01   ` David Hildenbrand
  2020-04-15 20:37   ` Eric W. Biederman
  1 sibling, 0 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-03-30 19:01 UTC (permalink / raw)
  To: James Morse, kexec, linux-mm, linux-arm-kernel
  Cc: Eric Biederman, Andrew Morton, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma

On 26.03.20 19:07, James Morse wrote:
> If kexec chooses to place the kernel in a memory region that was
> added after boot, we fail to boot as the kernel is running from a
> location that is not described as memory by the UEFI memory map or
> the original DT.
> 
> To prevent unaware user-space kexec from doing this accidentally,
> give these regions a different name.
> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> This is a change in behaviour as seen by user-space, because memory hot-add
> has already been merged.
> 
>  arch/arm64/include/asm/memory.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 2be67b232499..ef1686518469 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -166,6 +166,17 @@
>  #define IOREMAP_MAX_ORDER	(PMD_SHIFT)
>  #endif
>  
> +/*
> + * Memory hotplug allows new regions of 'System RAM' to be added to the system.
> + * These aren't described as memory by the UEFI memory map, or DT memory node.
> + * If we kexec from one of these regions, the new kernel boots from a location
> + * that isn't described as RAM.
> + *
> + * Give these resources a different name, so unaware kexec doesn't do this by
> + * accident.
> + */
> +#define MEMORY_HOTPLUG_RES_NAME "System RAM (hotplug)"
> +
>  #ifndef __ASSEMBLY__
>  extern u64			vabits_actual;
>  #define PAGE_END		(_PAGE_END(vabits_actual))
> 

(While I am familiar with makedumpfile in the crash kernel, I am not yet
familiar with kexec, so bare with me)


Looking at kexec:arch/arm64/crashdump-arm64.c

load_crashdump_segments() -> crash_get_memory_ranges() ->
kexec_iomem_for_each_line() -> iomem_range_callback()


#define SYSTEM_RAM "System RAM\n"

...
} else if (strncmp(str, SYSTEM_RAM, strlen(SYSTEM_RAM)) == 0) {
	return mem_regions_add(&system_memory_rgns, ...);
}


The hotplugged memory will no longer be detected as a crashdump segment,
consequently (AFAIU) not be described in the elf header, and therefore
also no longer dumped (e.g., by makedumpfile).

I assume you'll have to adapt kexec-tools to still consider this memory
for dumping, correct? Or am I missing something?


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
                   ` (5 preceding siblings ...)
  2020-03-30 13:55 ` Baoquan He
@ 2020-03-31  3:38 ` Dave Young
  2020-04-15 20:29 ` Eric W. Biederman
  7 siblings, 0 replies; 92+ messages in thread
From: Dave Young @ 2020-03-31  3:38 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Eric Biederman, Andrew Morton,
	Will Deacon

Hi James,
On 03/26/20 at 06:07pm, James Morse wrote:
> Hello!
> 
> arm64 recently queued support for memory hotremove, which led to some
> new corner cases for kexec.
> 
> If the kexec segments are loaded for a removable region, that region may
> be removed before kexec actually occurs. This causes the first kernel to
> lockup when applying the relocations. (I've triggered this on x86 too).

Does a kexec reload work for your case?   If yes then I would suggest to
do it in userspace,  for example have a udev rule to reload kexec if
needed.

Actually we have a rule to restart kdump loading, but not for kexec, it
sounds also need a service to load kexec, and an udev rule to reload for
memory hotplug.

> 
> The first patch adds a memory notifier for kexec so that it can refuse
> to allow in-use regions to be taken offline.
> 
> 
> This doesn't solve the problem for arm64, where the new kernel must
> initially rely on the data structures from the first boot to describe
> memory. These don't describe hotpluggable memory.
> If kexec places the kernel in one of these regions, it must also provide
> a DT that describes the region in which the kernel was mapped as memory.
> (and somehow ensure its always present in the future...)
> 
> To prevent this from happening accidentally with unaware user-space,
> patches two and three allow arm64 to give these regions a different
> name.
> 
> This is a change in behaviour for arm64 as memory hotadd and hotremove
> were added separately.
> 
> 
> I haven't tried kdump.
> Unaware kdump from user-space probably won't describe the hotplug
> regions if the name is different, which saves us from problems if
> the memory is no longer present at kdump time, but means the vmcore
> is incomplete.
> 
> 
> These patches are based on arm64's for-next/core branch, but can all
> be merged independently.
> 
> Thanks,
> 
> James Morse (3):
>   kexec: Prevent removal of memory in use by a loaded kexec image
>   mm/memory_hotplug: Allow arch override of non boot memory resource
>     names
>   arm64: memory: Give hotplug memory a different resource name
> 
>  arch/arm64/include/asm/memory.h | 11 +++++++
>  kernel/kexec_core.c             | 56 +++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c             |  6 +++-
>  3 files changed, 72 insertions(+), 1 deletion(-)
> 
> -- 
> 2.25.1
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 

Thanks
Dave



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-30 17:17   ` James Morse
@ 2020-03-31  3:46     ` Dave Young
  2020-04-14 17:31       ` James Morse
  0 siblings, 1 reply; 92+ messages in thread
From: Dave Young @ 2020-03-31  3:46 UTC (permalink / raw)
  To: James Morse
  Cc: Baoquan He, Anshuman Khandual, Catalin Marinas, Bhupesh Sharma,
	kexec, linux-mm, Eric Biederman, Andrew Morton, Will Deacon,
	linux-arm-kernel

Hi James,
On 03/30/20 at 06:17pm, James Morse wrote:
> Hi Baoquan,
> 
> On 3/30/20 2:55 PM, Baoquan He wrote:
> > On 03/26/20 at 06:07pm, James Morse wrote:
> >> arm64 recently queued support for memory hotremove, which led to some
> >> new corner cases for kexec.
> >>
> >> If the kexec segments are loaded for a removable region, that region may
> >> be removed before kexec actually occurs. This causes the first kernel to
> >> lockup when applying the relocations. (I've triggered this on x86 too).
> >>
> >> The first patch adds a memory notifier for kexec so that it can refuse
> >> to allow in-use regions to be taken offline.
> > 
> > I talked about this with Dave Young. Currently, we tend to use
> > kexec_file_load more in the future since most of its implementation is
> > in kernel, we can get information about kernel more easilier. For the
> > kexec kernel loaded into hotpluggable area, we can fix it in
> > kexec_file_load side, we know the MOVABLE zone's start and end. As for
> > the old kexec_load, we would like to keep it for back compatibility. At
> > least in our distros, we have switched to kexec_file_load, will
> > gradually obsolete kexec_load.
> 
> > So for this one, I suggest avoiding those
> > MOVZBLE memory region when searching place for kexec kernel.
> 
> How does today's user-space know?
> 
> 
> > Not sure if arm64 will still have difficulty.
> 
> arm64 added support for kexec_load first, then kexec_file_load. (evidently a
> mistake).
> kexec_file_load support was only added in the last year or so, I'd hazard most
> people using this, are using the regular load kind. (and probably don't know or
> care).

I agreed that file load is still not widely used,  but in the long run
we should not maintain both of them all the future time.  Especially
when some kernel-userspace interfaces need to be introduced, file load
will have the natural advantage.  We may keep the kexec_load for other
misc usecases, but we can use file load for the major modern
linux-to-linux loading.  I'm not saying we can do it immediately, just
thought we should reduce the duplicate effort and try to avoid hacking if
possible.

Anyway about this particular issue, I wonder if we can just reload with
a udev rule as replied in another mail.

Thanks
Dave



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-03-26 18:07 ` [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names James Morse
  2020-03-27  9:59   ` David Hildenbrand
@ 2020-04-02  5:49   ` Dave Young
  2020-04-02  6:12     ` piliu
  2020-04-15 20:36   ` Eric W. Biederman
  2020-05-09  0:45   ` Andrew Morton
  3 siblings, 1 reply; 92+ messages in thread
From: Dave Young @ 2020-04-02  5:49 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Eric Biederman, Andrew Morton,
	Will Deacon, piliu, Hari Bathini

On 03/26/20 at 06:07pm, James Morse wrote:
> Memory added to the system by hotplug has a 'System RAM' resource created
> for it. This is exposed to user-space via /proc/iomem.
> 
> This poses problems for kexec on arm64. If kexec decides to place the
> kernel in one of these newly onlined regions, the new kernel will find
> itself booting from a region not described as memory in the firmware
> tables.
> 
> Arm64 doesn't have a structure like the e820 memory map that can be
> re-written when memory is brought online. Instead arm64 uses the UEFI
> memory map, or the memory node from the DT, sometimes both. We never
> rewrite these.

Could arm64 use similar way to update DT, or a cooked UEFI maps?

Add pingfan in cc, he said ppc64 update the DT after a memremove thus it
would be good to just redo a kexec load.

Added Pingfan and Hari for comments and corrections.

> 
> Allow an architecture to specify a different name for these hotplug
> regions.
> 
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  mm/memory_hotplug.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0a54ffac8c68..69b03dd7fc74 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,10 @@
>  #include "internal.h"
>  #include "shuffle.h"
>  
> +#ifndef MEMORY_HOTPLUG_RES_NAME
> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
> +#endif
> +
>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -103,7 +107,7 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>  {
>  	struct resource *res;
>  	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -	char *resource_name = "System RAM";
> +	char *resource_name = MEMORY_HOTPLUG_RES_NAME;
>  
>  	if (start + size > max_mem_size)
>  		return ERR_PTR(-E2BIG);
> -- 
> 2.25.1
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-04-02  5:49   ` Dave Young
@ 2020-04-02  6:12     ` piliu
  2020-04-14 17:21       ` James Morse
  0 siblings, 1 reply; 92+ messages in thread
From: piliu @ 2020-04-02  6:12 UTC (permalink / raw)
  To: Dave Young, James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Eric Biederman, Andrew Morton,
	Will Deacon, Hari Bathini



On 04/02/2020 01:49 PM, Dave Young wrote:
> On 03/26/20 at 06:07pm, James Morse wrote:
>> Memory added to the system by hotplug has a 'System RAM' resource created
>> for it. This is exposed to user-space via /proc/iomem.
>>
>> This poses problems for kexec on arm64. If kexec decides to place the
>> kernel in one of these newly onlined regions, the new kernel will find
>> itself booting from a region not described as memory in the firmware
>> tables.
>>
>> Arm64 doesn't have a structure like the e820 memory map that can be
>> re-written when memory is brought online. Instead arm64 uses the UEFI
>> memory map, or the memory node from the DT, sometimes both. We never
>> rewrite these.
> 
> Could arm64 use similar way to update DT, or a cooked UEFI maps?
> 
> Add pingfan in cc, he said ppc64 update the DT after a memremove thus it
> would be good to just redo a kexec load.
> 
Yes, the memory changes will be observed through device-node under
/proc/device-tree/ (which is for powerpc).

Later if running kexec -l/-p , it can build new dtb with the latest info
from /proc/device-tree
> Added Pingfan and Hari for comments and corrections.
> 
>>
>> Allow an architecture to specify a different name for these hotplug
>> regions.
>>
>> Signed-off-by: James Morse <james.morse@arm.com>
>> ---
>>  mm/memory_hotplug.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 0a54ffac8c68..69b03dd7fc74 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -42,6 +42,10 @@
>>  #include "internal.h"
>>  #include "shuffle.h"
>>  
>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>> +#endif
>> +
>>  /*
>>   * online_page_callback contains pointer to current page onlining function.
>>   * Initially it is generic_online_page(). If it is required it could be
>> @@ -103,7 +107,7 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>>  {
>>  	struct resource *res;
>>  	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
>> -	char *resource_name = "System RAM";
>> +	char *resource_name = MEMORY_HOTPLUG_RES_NAME;
>>  
>>  	if (start + size > max_mem_size)
>>  		return ERR_PTR(-E2BIG);
>> -- 
>> 2.25.1
>>
>>
>> _______________________________________________
>> kexec mailing list
>> kexec@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
>>



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-30 18:14                   ` David Hildenbrand
@ 2020-04-10 19:10                     ` Andrew Morton
  2020-04-11  3:44                       ` Baoquan He
  2020-04-14  7:05                       ` David Hildenbrand
  0 siblings, 2 replies; 92+ messages in thread
From: Andrew Morton @ 2020-04-10 19:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: James Morse, kexec, linux-mm, linux-arm-kernel, Eric Biederman,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

It's unclear (to me) what is the status of this patchset.  But it does appear that
an new version can be expected?


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-10 19:10                     ` Andrew Morton
@ 2020-04-11  3:44                       ` Baoquan He
  2020-04-11  9:30                         ` Russell King - ARM Linux admin
  2020-04-14  7:05                       ` David Hildenbrand
  1 sibling, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-11  3:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, James Morse, kexec, linux-mm,
	linux-arm-kernel, Eric Biederman, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Bhupesh Sharma

On 04/10/20 at 12:10pm, Andrew Morton wrote:
> It's unclear (to me) what is the status of this patchset.  But it does appear that
> an new version can be expected?

As we discussed in the thread of replying to the cover letter, the
idea of this patchset is not good. 

Because We tend to use kexec_file_load more and improve/enhance it in the
future, and gradually obsolete the old kexec_load interface which this
patchset is trying to fix on. 

And the issue James spot is a very corner case, we have suggested
another easier way to avoid it by adding systemd service to load kexec
and monitor memory adding/removing uevent, juas as we have done for
kdump loading. Bhupesh is working on this to add a service in Fedora
and test, and will put it to RHEL too if nobody is unsatisfied.

Thanks
Baoquan



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-11  3:44                       ` Baoquan He
@ 2020-04-11  9:30                         ` Russell King - ARM Linux admin
  2020-04-11  9:58                           ` David Hildenbrand
  2020-04-12  5:35                           ` Baoquan He
  0 siblings, 2 replies; 92+ messages in thread
From: Russell King - ARM Linux admin @ 2020-04-11  9:30 UTC (permalink / raw)
  To: Baoquan He
  Cc: Andrew Morton, David Hildenbrand, Catalin Marinas,
	Bhupesh Sharma, Anshuman Khandual, kexec, linux-mm, James Morse,
	Eric Biederman, Will Deacon, linux-arm-kernel

On Sat, Apr 11, 2020 at 11:44:14AM +0800, Baoquan He wrote:
> Because We tend to use kexec_file_load more and improve/enhance it in the
> future, and gradually obsolete the old kexec_load interface which this
> patchset is trying to fix on. 

That's not going to happen; 32-bit ARM kexec uses the kexec_load
interface rather than the kexec_file_load version, and I see no one
with any interest in changing that - and there's users of the former.

I don't see how it's possible to convert 32-bit ARM kexec to the
kexec_file_load interface - this assumes that all you have are the
kernel, initrd, and commandline, but on 32-bit ARM kexec, we have
kernel, initrd and the dtb blob which the user can specify.

So, if we wanted to obsolete the kexec_load interface, _first_ there
needs to be a way to provide users with the existing functionality
they have already in place on 32-bit ARM - otherwise we're looking
at a userspace regression.  Especially as kexec_file_load takes
precedence on some distro patched versions of the kexec tool,
irrespective of which interface the user requests of the tool.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 10.2Mbps down 587kbps up


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-11  9:30                         ` Russell King - ARM Linux admin
@ 2020-04-11  9:58                           ` David Hildenbrand
  2020-04-12  5:35                           ` Baoquan He
  1 sibling, 0 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-04-11  9:58 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Baoquan He, Andrew Morton, David Hildenbrand, Catalin Marinas,
	Bhupesh Sharma, Anshuman Khandual, kexec, linux-mm, James Morse,
	Eric Biederman, Will Deacon, linux-arm-kernel



> Am 11.04.2020 um 11:40 schrieb Russell King - ARM Linux admin <linux@armlinux.org.uk>:
> 
> On Sat, Apr 11, 2020 at 11:44:14AM +0800, Baoquan He wrote:
>> Because We tend to use kexec_file_load more and improve/enhance it in the
>> future, and gradually obsolete the old kexec_load interface which this
>> patchset is trying to fix on. 
> 
> That's not going to happen; 32-bit ARM kexec uses the kexec_load
> interface rather than the kexec_file_load version, and I see no one
> with any interest in changing that - and there's users of the former.
> 
> I don't see how it's possible to convert 32-bit ARM kexec to the
> kexec_file_load interface - this assumes that all you have are the
> kernel, initrd, and commandline, but on 32-bit ARM kexec, we have
> kernel, initrd and the dtb blob which the user can specify.
> 
> So, if we wanted to obsolete the kexec_load interface, _first_ there
> needs to be a way to provide users with the existing functionality
> they have already in place on 32-bit ARM - otherwise we're looking
> at a userspace regression.  Especially as kexec_file_load takes
> precedence on some distro patched versions of the kexec tool,
> irrespective of which interface the user requests of the tool.
> 

On 32bit architectures we usually don‘t really care about memory hotplug. So we could deprecate it only for 64bit architectures AFAIKS.


> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 10.2Mbps down 587kbps up
> 



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-11  9:30                         ` Russell King - ARM Linux admin
  2020-04-11  9:58                           ` David Hildenbrand
@ 2020-04-12  5:35                           ` Baoquan He
  2020-04-12  8:08                             ` Russell King - ARM Linux admin
  1 sibling, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-12  5:35 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Andrew Morton, David Hildenbrand, Catalin Marinas,
	Bhupesh Sharma, Anshuman Khandual, kexec, linux-mm, James Morse,
	Eric Biederman, Will Deacon, linux-arm-kernel

On 04/11/20 at 10:30am, Russell King - ARM Linux admin wrote:
> On Sat, Apr 11, 2020 at 11:44:14AM +0800, Baoquan He wrote:
> > Because We tend to use kexec_file_load more and improve/enhance it in the
> > future, and gradually obsolete the old kexec_load interface which this
> > patchset is trying to fix on. 
> 
> That's not going to happen; 32-bit ARM kexec uses the kexec_load
> interface rather than the kexec_file_load version, and I see no one
> with any interest in changing that - and there's users of the former.
> 
> I don't see how it's possible to convert 32-bit ARM kexec to the
> kexec_file_load interface - this assumes that all you have are the
> kernel, initrd, and commandline, but on 32-bit ARM kexec, we have
> kernel, initrd and the dtb blob which the user can specify.

Well, I understand what you said about 32-bit ARM support with only
kexec_old support thing. That's why I said we tend to obsolete it
'GRADUALLY'. It's the existing users who are using kexec_load, and the
ARCHes which only has kexec_load, make us have to transfer to
kexec_file_load gradually.

Comparing with kexec_load, kexec_file_load has only one disadvantage,
that is some ARCHes only have kexec_load. Otherwise, kexec_file_load
benefits kexec/kdump developping/maintaining very much. The loading job
of kexec_file_load is mostly done in kernel, we can get whatever we
want about kernel information very conveniently to do anything needed.
For the kexec_load interface, the loading job is mostly done in
userspace, we have to export kernel information to procfs, sysfs, etc,
then parse them in kexec_tools, finally passed it to kernel part of
kexec loading.

The gradual obsoleting means we may only add
feature/improvement/enhancement to kexec_file_load. And if a bug fix is
needed for both kexec_load and kexec_file_load, and the fix is very
complicated, we may only fix it in kexec_file_load too. Kexec_file_load
interface is suggested to add if does't have, just port user space part
to kernel as x86/s390/arm64 have done.

Surely, it doesn't mean we don't fix the critical/blocker bug with
kexec_load loading. We still try to do, just are not so eager. In the
existing product environment, the kexec_load is used, just keep using
it. Do we bother to change it to kexec_file_load, e.g in our RHEL7
distros? Certainly not. But in our new product, we will change to use
kexec_file_load interface. I guess this is similar with arm64. The
advantage and benefit have been told in the 2nd paragraph.


As for 32-bit ARM, is it like the old product, we have many in-use systems
deployed in customers' laboratory? Wondering if ARM continues designing
new 32-bit ARM cpu, and some companies continue producing tons of 32-bit ARM
cpus. If yes, I think we need continue taking care of kexec_load if
32-bit ARM can't convert to kexec_file_load. If not, it may be not a
barrier when we consider converting kexec_load to kexec_file_load in
other ARCHes. We just need keep using it, try to fix those critical/blocker
bug in kexec_load interface if encountered.

Finally, comning back to this patchset itself, the issue James spotted
is not so ciritical, I would say. When I do kexec jumping, I will do
loading firstly, then trigge jumping. I can think of the case that
people may load kexec-ed kernel, then do something else, later she/he
triggers the kexec jumping. These are not necessary steps. As Dave and I
replied to James in the cover-letter thread, adding a systemd service of
kexec loading, monitor hotplug uevent, reload it if any hot remove
happened. This is quite easy to do, I don't see any problem with it, and
why we don't do like this. 

My personal opinion, please tell if I miss anything.

> 
> So, if we wanted to obsolete the kexec_load interface, _first_ there
> needs to be a way to provide users with the existing functionality
> they have already in place on 32-bit ARM - otherwise we're looking
> at a userspace regression.  Especially as kexec_file_load takes
> precedence on some distro patched versions of the kexec tool,
> irrespective of which interface the user requests of the tool.
> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 10.2Mbps down 587kbps up
> 



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-12  5:35                           ` Baoquan He
@ 2020-04-12  8:08                             ` Russell King - ARM Linux admin
  2020-04-12 19:52                               ` Eric W. Biederman
  0 siblings, 1 reply; 92+ messages in thread
From: Russell King - ARM Linux admin @ 2020-04-12  8:08 UTC (permalink / raw)
  To: Baoquan He
  Cc: Anshuman Khandual, Catalin Marinas, Bhupesh Sharma,
	David Hildenbrand, kexec, linux-mm, James Morse, Eric Biederman,
	Andrew Morton, Will Deacon, linux-arm-kernel

On Sun, Apr 12, 2020 at 01:35:07PM +0800, Baoquan He wrote:
> On 04/11/20 at 10:30am, Russell King - ARM Linux admin wrote:
> > On Sat, Apr 11, 2020 at 11:44:14AM +0800, Baoquan He wrote:
> > > Because We tend to use kexec_file_load more and improve/enhance it in the
> > > future, and gradually obsolete the old kexec_load interface which this
> > > patchset is trying to fix on. 
> > 
> > That's not going to happen; 32-bit ARM kexec uses the kexec_load
> > interface rather than the kexec_file_load version, and I see no one
> > with any interest in changing that - and there's users of the former.
> > 
> > I don't see how it's possible to convert 32-bit ARM kexec to the
> > kexec_file_load interface - this assumes that all you have are the
> > kernel, initrd, and commandline, but on 32-bit ARM kexec, we have
> > kernel, initrd and the dtb blob which the user can specify.
> 
> Well, I understand what you said about 32-bit ARM support with only
> kexec_old support thing. That's why I said we tend to obsolete it
> 'GRADUALLY'. It's the existing users who are using kexec_load, and the
> ARCHes which only has kexec_load, make us have to transfer to
> kexec_file_load gradually.
> 
> Comparing with kexec_load, kexec_file_load has only one disadvantage,
> that is some ARCHes only have kexec_load. Otherwise, kexec_file_load
> benefits kexec/kdump developping/maintaining very much. The loading job
> of kexec_file_load is mostly done in kernel, we can get whatever we
> want about kernel information very conveniently to do anything needed.
> For the kexec_load interface, the loading job is mostly done in
> userspace, we have to export kernel information to procfs, sysfs, etc,
> then parse them in kexec_tools, finally passed it to kernel part of
> kexec loading.
> 
> The gradual obsoleting means we may only add
> feature/improvement/enhancement to kexec_file_load. And if a bug fix is
> needed for both kexec_load and kexec_file_load, and the fix is very
> complicated, we may only fix it in kexec_file_load too. Kexec_file_load
> interface is suggested to add if does't have, just port user space part
> to kernel as x86/s390/arm64 have done.
> 
> Surely, it doesn't mean we don't fix the critical/blocker bug with
> kexec_load loading. We still try to do, just are not so eager. In the
> existing product environment, the kexec_load is used, just keep using
> it. Do we bother to change it to kexec_file_load, e.g in our RHEL7
> distros? Certainly not. But in our new product, we will change to use
> kexec_file_load interface. I guess this is similar with arm64. The
> advantage and benefit have been told in the 2nd paragraph.
> 
> 
> As for 32-bit ARM, is it like the old product, we have many in-use systems
> deployed in customers' laboratory? Wondering if ARM continues designing
> new 32-bit ARM cpu, and some companies continue producing tons of 32-bit ARM
> cpus. If yes, I think we need continue taking care of kexec_load if
> 32-bit ARM can't convert to kexec_file_load. If not, it may be not a
> barrier when we consider converting kexec_load to kexec_file_load in
> other ARCHes. We just need keep using it, try to fix those critical/blocker
> bug in kexec_load interface if encountered.
> 
> Finally, comning back to this patchset itself, the issue James spotted
> is not so ciritical, I would say. When I do kexec jumping, I will do
> loading firstly, then trigge jumping. I can think of the case that
> people may load kexec-ed kernel, then do something else, later she/he
> triggers the kexec jumping. These are not necessary steps. As Dave and I
> replied to James in the cover-letter thread, adding a systemd service of
> kexec loading, monitor hotplug uevent, reload it if any hot remove
> happened. This is quite easy to do, I don't see any problem with it, and
> why we don't do like this. 
> 
> My personal opinion, please tell if I miss anything.

All that opinion and hand waving about the benefits of the new
interface is totally irrelevent for 32-bit ARM for the reasons
I stated in my email to which you replied.

Gradual obsolecence or not, the file interface can't be supported
on 32-bit ARM as-is - it is totally inadequate and inferior as an
API compared to the functionality we have with plain kexec_load.
Without that point addressed, kexec_file_load is meaningless for
32-bit ARM.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 10.2Mbps down 587kbps up


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-12  8:08                             ` Russell King - ARM Linux admin
@ 2020-04-12 19:52                               ` Eric W. Biederman
  2020-04-12 20:37                                 ` Bhupesh SHARMA
  2020-04-13  2:37                                 ` Baoquan He
  0 siblings, 2 replies; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-12 19:52 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Baoquan He, Anshuman Khandual, Catalin Marinas, Bhupesh Sharma,
	David Hildenbrand, kexec, linux-mm, James Morse, Andrew Morton,
	Will Deacon, linux-arm-kernel


The only benefit of kexec_file_load is that it is simple enough from a
kernel perspective that signatures can be checked.

kexec_load in every other respect is the more capable and functional
interface.  It makes no sense to get rid of it.

It does make sense to reload with a loaded kernel on memory hotplug.
That is simple and easy.  If we are going to handle something in the
kernel it should simple an automated unloading of the kernel on memory
hotplug.


I think it would be irresponsible to deprecate kexec_load on any
platform.

I also suspect that kexec_file_load could be taught to copy the dtb
on arm32 if someone wants to deal with signatures.

We definitely can not even think of deprecating kexec_load until
architecture that supports it also supports kexec_file_load and everyone
is happy with that interface.  That is Linus's no regression rule.


Eric


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-12 19:52                               ` Eric W. Biederman
@ 2020-04-12 20:37                                 ` Bhupesh SHARMA
  2020-04-13  2:37                                 ` Baoquan He
  1 sibling, 0 replies; 92+ messages in thread
From: Bhupesh SHARMA @ 2020-04-12 20:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Russell King - ARM Linux admin, Baoquan He, David Hildenbrand,
	Catalin Marinas, Bhupesh Sharma, Anshuman Khandual, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel

On Mon, Apr 13, 2020 at 1:26 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>
> The only benefit of kexec_file_load is that it is simple enough from a
> kernel perspective that signatures can be checked.
>
> kexec_load in every other respect is the more capable and functional
> interface.  It makes no sense to get rid of it.
>
> It does make sense to reload with a loaded kernel on memory hotplug.
> That is simple and easy.  If we are going to handle something in the
> kernel it should simple an automated unloading of the kernel on memory
> hotplug.
>
>
> I think it would be irresponsible to deprecate kexec_load on any
> platform.
>
> I also suspect that kexec_file_load could be taught to copy the dtb
> on arm32 if someone wants to deal with signatures.
>
> We definitely can not even think of deprecating kexec_load until
> architecture that supports it also supports kexec_file_load and everyone
> is happy with that interface.  That is Linus's no regression rule.

TBH, I have seen several active users of kexec_load on arm32
environments and we have been trying to help them with kexec issues on
arm32 in recent past as well.

So, I agree with Eric's view that probably deprecating this in favour
of kexec_file_load will break these existing environment.

I tried to do some work at the start of this year to add
kexec_file_load support for arm32 in my spare cycles, but I gave up as
the arm32 hardware had a broken firmware and couldn't boot latest
upstream kernel.

May be I try to find some spare cycles in the coming days to do it.

But I think since kexec_load is an important interface on these arm32
boards for supporting existing kexec-based bootloaders, we should
continue supporting the same until kexec_file_load is supported/mature
enough for arm32.

Thanks,
Bhupesh


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-12 19:52                               ` Eric W. Biederman
  2020-04-12 20:37                                 ` Bhupesh SHARMA
@ 2020-04-13  2:37                                 ` Baoquan He
  2020-04-13 13:15                                   ` Eric W. Biederman
  1 sibling, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-13  2:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Russell King - ARM Linux admin, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, David Hildenbrand, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel

On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> 
> The only benefit of kexec_file_load is that it is simple enough from a
> kernel perspective that signatures can be checked.

We don't have this restriction any more with below commit:

commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
and KEXEC_SIG_FORCE")

With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
secure boot or legacy system for kexec/kdump. Being simple enough is
enough to astract and convince us to use it instead. And kexec_file_load
has been in use for several years on systems with secure boot, since
added in 2014, on x86_64.

> 
> kexec_load in every other respect is the more capable and functional
> interface.  It makes no sense to get rid of it.
> 
> It does make sense to reload with a loaded kernel on memory hotplug.
> That is simple and easy.  If we are going to handle something in the
> kernel it should simple an automated unloading of the kernel on memory
> hotplug.
> 
> 
> I think it would be irresponsible to deprecate kexec_load on any
> platform.
> 
> I also suspect that kexec_file_load could be taught to copy the dtb
> on arm32 if someone wants to deal with signatures.
> 
> We definitely can not even think of deprecating kexec_load until
> architecture that supports it also supports kexec_file_load and everyone
> is happy with that interface.  That is Linus's no regression rule.

I should pick a milder word to express our tendency and tell our plan
then 'obsolete'. Even though I added 'gradually', seems it doesn't help
much. I didn't mean to say 'deprecate' at all when replied.

The situation and trend I understand about kexec_load and kexec_file_load
are:

1) Supporting kexec_file_load is suggested to add in ARCHes which don't
have yet, just as x86_64, arm64 and s390 have done;
 
2) kexec_file_load is suggested to use, and take precedence over
kexec_load in the future, if both are supported in one ARCH.

3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
and by ARCHes for back compatibility w/ kexec_file_load support.

For 1) and 2), I think the reason is obvious as Eric said,
kexec_file_load is simple enough. And currently, whenever we got a bug
report, we may need fix them twice, for kexec_load and kexec_file_load.
If kexec_file_load is made by default, e.g on x86_64, we will change it
in kernel space only, for kexec_file_load. This is what I meant about
'obsolete gradually'. I think for arm64, s390, they will do these too.
Unless there's some critical/blocker bug in kexec_load, to corrupt the
old kexec_load interface in old product.

For 3), people can still use kexec_load and develop/fix for it, if no
kexec_file_load supported. But 32-bit arm should be a different one,
more like i386, we will leave it as is, and fix anything which could
break it. But people really expects to improve or add feature to it? E.g
in this patchset, the mem hotplug issue James raised, I assume James is
focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
another reply, people even don't agree to continue supporting memory
hotplug on 32-bit system. We ever took effort to fix a memory hotplug
bug on i386 with a patch, but people would rather set it as BROKEN.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-13  2:37                                 ` Baoquan He
@ 2020-04-13 13:15                                   ` Eric W. Biederman
  2020-04-13 23:01                                     ` Andrew Morton
                                                       ` (2 more replies)
  0 siblings, 3 replies; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-13 13:15 UTC (permalink / raw)
  To: Baoquan He
  Cc: Russell King - ARM Linux admin, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, David Hildenbrand, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel

Baoquan He <bhe@redhat.com> writes:

> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
>> 
>> The only benefit of kexec_file_load is that it is simple enough from a
>> kernel perspective that signatures can be checked.
>
> We don't have this restriction any more with below commit:
>
> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> and KEXEC_SIG_FORCE")
>
> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> secure boot or legacy system for kexec/kdump. Being simple enough is
> enough to astract and convince us to use it instead. And kexec_file_load
> has been in use for several years on systems with secure boot, since
> added in 2014, on x86_64.

No.  Actaully kexec_file_load is the less capable interface, and less
flexible interface.  Which is why it is appropriate for signature
verification.

>> kexec_load in every other respect is the more capable and functional
>> interface.  It makes no sense to get rid of it.
>> 
>> It does make sense to reload with a loaded kernel on memory hotplug.
>> That is simple and easy.  If we are going to handle something in the
>> kernel it should simple an automated unloading of the kernel on memory
>> hotplug.
>> 
>> 
>> I think it would be irresponsible to deprecate kexec_load on any
>> platform.
>> 
>> I also suspect that kexec_file_load could be taught to copy the dtb
>> on arm32 if someone wants to deal with signatures.
>> 
>> We definitely can not even think of deprecating kexec_load until
>> architecture that supports it also supports kexec_file_load and everyone
>> is happy with that interface.  That is Linus's no regression rule.
>
> I should pick a milder word to express our tendency and tell our plan
> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> much. I didn't mean to say 'deprecate' at all when replied.
>
> The situation and trend I understand about kexec_load and kexec_file_load
> are:
>
> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> have yet, just as x86_64, arm64 and s390 have done;
>  
> 2) kexec_file_load is suggested to use, and take precedence over
> kexec_load in the future, if both are supported in one ARCH.

The deep problem is that kexec_file_load is distinctly less expressive
than kexec_load.

> 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
> and by ARCHes for back compatibility w/ kexec_file_load support.
>
> For 1) and 2), I think the reason is obvious as Eric said,
> kexec_file_load is simple enough. And currently, whenever we got a bug
> report, we may need fix them twice, for kexec_load and kexec_file_load.
> If kexec_file_load is made by default, e.g on x86_64, we will change it
> in kernel space only, for kexec_file_load. This is what I meant about
> 'obsolete gradually'. I think for arm64, s390, they will do these too.
> Unless there's some critical/blocker bug in kexec_load, to corrupt the
> old kexec_load interface in old product.

Maybe.  The code that kexec_file_load sucked into the kernel is quite
stable and rarely needs changes except during a port of kexec to
another architecture.

Last I looked the real maintenance effor of kexec and kexec on panic was
in the drivers.  So I don't think we can use maintenance to do anything.

> For 3), people can still use kexec_load and develop/fix for it, if no
> kexec_file_load supported. But 32-bit arm should be a different one,
> more like i386, we will leave it as is, and fix anything which could
> break it. But people really expects to improve or add feature to it? E.g
> in this patchset, the mem hotplug issue James raised, I assume James is
> focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> another reply, people even don't agree to continue supporting memory
> hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> bug on i386 with a patch, but people would rather set it as BROKEN.

For memory hotplug just reload.  Userspace already gets good events.

We should not expect anything except a panic kernel to be loaded over a
memory hotplug event. The kexec on panic code should actually be loaded
in a location that we don't reliquish if asked for it.

Quite frankly at this point I would love to see the signature fad die,
which would allow us to remove kexec_file_load.  I still have not seen
the signature code used anywhere except by people anticipating trouble.

Given that Microsoft has already directly signed a malicous bootloader.
(Not in the Linux ecosystem).  I don't even know if any of the reasons
for having kexec_file_load are legtimate.


If someone wants to do the work and ensure everything that is possible
to load with kexec_load is possible to load with kexec_file_load.
Kernels supporting the multi-boot protocol etc.  Then we can consider
deprecating kexec_load.


I think it took me about 15 years to remove the sysctl system call and
it only ever had about 10 users.  If you want to go through that kind of
work to make certain there are no more users and that everything they
could do with the old interface is doable with the new interface then
please be my guest.  Until then we need to fully support kexec_load.

Eric










^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-13 13:15                                   ` Eric W. Biederman
@ 2020-04-13 23:01                                     ` Andrew Morton
  2020-04-14  6:13                                       ` Eric W. Biederman
  2020-04-14  6:40                                     ` Baoquan He
  2020-04-14  9:16                                     ` Dave Young
  2 siblings, 1 reply; 92+ messages in thread
From: Andrew Morton @ 2020-04-13 23:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Baoquan He, Russell King - ARM Linux admin, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, David Hildenbrand, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel

On Mon, 13 Apr 2020 08:15:23 -0500 ebiederm@xmission.com (Eric W. Biederman) wrote:

> > For 3), people can still use kexec_load and develop/fix for it, if no
> > kexec_file_load supported. But 32-bit arm should be a different one,
> > more like i386, we will leave it as is, and fix anything which could
> > break it. But people really expects to improve or add feature to it? E.g
> > in this patchset, the mem hotplug issue James raised, I assume James is
> > focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> > another reply, people even don't agree to continue supporting memory
> > hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> > bug on i386 with a patch, but people would rather set it as BROKEN.
> 
> For memory hotplug just reload.  Userspace already gets good events.
> 
> We should not expect anything except a panic kernel to be loaded over a
> memory hotplug event. The kexec on panic code should actually be loaded
> in a location that we don't reliquish if asked for it.

Is that a nack for James's patchset?


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-13 23:01                                     ` Andrew Morton
@ 2020-04-14  6:13                                       ` Eric W. Biederman
  0 siblings, 0 replies; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-14  6:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Baoquan He, Russell King - ARM Linux admin, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, David Hildenbrand, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 13 Apr 2020 08:15:23 -0500 ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> > For 3), people can still use kexec_load and develop/fix for it, if no
>> > kexec_file_load supported. But 32-bit arm should be a different one,
>> > more like i386, we will leave it as is, and fix anything which could
>> > break it. But people really expects to improve or add feature to it? E.g
>> > in this patchset, the mem hotplug issue James raised, I assume James is
>> > focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
>> > another reply, people even don't agree to continue supporting memory
>> > hotplug on 32-bit system. We ever took effort to fix a memory hotplug
>> > bug on i386 with a patch, but people would rather set it as BROKEN.
>> 
>> For memory hotplug just reload.  Userspace already gets good events.
>> 
>> We should not expect anything except a panic kernel to be loaded over a
>> memory hotplug event. The kexec on panic code should actually be loaded
>> in a location that we don't reliquish if asked for it.
>
> Is that a nack for James's patchset?

I have just read the end of the thread and I have the sense that the
patchset had already been rejected.  I will see if I can go back and
read the beginning.

I was mostly reacting to the idea that you could stop maintaining an
interface that people are actively using because there is a newer
interface.

Eric



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-13 13:15                                   ` Eric W. Biederman
  2020-04-13 23:01                                     ` Andrew Morton
@ 2020-04-14  6:40                                     ` Baoquan He
  2020-04-14  6:51                                       ` Baoquan He
  2020-04-14  8:00                                       ` David Hildenbrand
  2020-04-14  9:16                                     ` Dave Young
  2 siblings, 2 replies; 92+ messages in thread
From: Baoquan He @ 2020-04-14  6:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Russell King - ARM Linux admin, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, David Hildenbrand, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel

On 04/13/20 at 08:15am, Eric W. Biederman wrote:
> Baoquan He <bhe@redhat.com> writes:
> 
> > On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> >> 
> >> The only benefit of kexec_file_load is that it is simple enough from a
> >> kernel perspective that signatures can be checked.
> >
> > We don't have this restriction any more with below commit:
> >
> > commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> > and KEXEC_SIG_FORCE")
> >
> > With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> > secure boot or legacy system for kexec/kdump. Being simple enough is
> > enough to astract and convince us to use it instead. And kexec_file_load
> > has been in use for several years on systems with secure boot, since
> > added in 2014, on x86_64.
> 
> No.  Actaully kexec_file_load is the less capable interface, and less
> flexible interface.  Which is why it is appropriate for signature
> verification.

Well, everyone has a stance and the corresponding view. You could have
wider view from long time maintenance and in upstrem position, and think
kexec_file_load is horrible. But I can only see from our work as a front
line engineer to maintain/develop kexec/kdump in RHEL, and think
kexec_file_load is easier to maintain.

Surely except of multiple kernel image format support. No matter it is
kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
This is produced from kerel building by default. We have no way to
support it in our distros and add it into kexec_file_load.

[RFC PATCH] x86/boot: make ELF kernel multiboot-able
https://lkml.org/lkml/2017/2/15/654

> 
> >> kexec_load in every other respect is the more capable and functional
> >> interface.  It makes no sense to get rid of it.
> >> 
> >> It does make sense to reload with a loaded kernel on memory hotplug.
> >> That is simple and easy.  If we are going to handle something in the
> >> kernel it should simple an automated unloading of the kernel on memory
> >> hotplug.
> >> 
> >> 
> >> I think it would be irresponsible to deprecate kexec_load on any
> >> platform.
> >> 
> >> I also suspect that kexec_file_load could be taught to copy the dtb
> >> on arm32 if someone wants to deal with signatures.
> >> 
> >> We definitely can not even think of deprecating kexec_load until
> >> architecture that supports it also supports kexec_file_load and everyone
> >> is happy with that interface.  That is Linus's no regression rule.
> >
> > I should pick a milder word to express our tendency and tell our plan
> > then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> > much. I didn't mean to say 'deprecate' at all when replied.
> >
> > The situation and trend I understand about kexec_load and kexec_file_load
> > are:
> >
> > 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> > have yet, just as x86_64, arm64 and s390 have done;
> >  
> > 2) kexec_file_load is suggested to use, and take precedence over
> > kexec_load in the future, if both are supported in one ARCH.
> 
> The deep problem is that kexec_file_load is distinctly less expressive
> than kexec_load.
> 
> > 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
> > and by ARCHes for back compatibility w/ kexec_file_load support.
> >
> > For 1) and 2), I think the reason is obvious as Eric said,
> > kexec_file_load is simple enough. And currently, whenever we got a bug
> > report, we may need fix them twice, for kexec_load and kexec_file_load.
> > If kexec_file_load is made by default, e.g on x86_64, we will change it
> > in kernel space only, for kexec_file_load. This is what I meant about
> > 'obsolete gradually'. I think for arm64, s390, they will do these too.
> > Unless there's some critical/blocker bug in kexec_load, to corrupt the
> > old kexec_load interface in old product.
> 
> Maybe.  The code that kexec_file_load sucked into the kernel is quite
> stable and rarely needs changes except during a port of kexec to
> another architecture.
> 
> Last I looked the real maintenance effor of kexec and kexec on panic was
> in the drivers.  So I don't think we can use maintenance to do anything.

Not sure if I got it. But if check Lianbo's patches, a lot of effort has
been taken to make SEV work well on kexec_file_load. And we have
switched to use kexec_file_load in the newly published  Fedora release
on x86_64 by default. Before this, Lianbo has investigated and done many
experiments to make sure the switching is safe. We finally made this
decision. Next we will do the switch in Enterprise distros. Once these
are proved safe, we will suggest customers to use kexec_file_load for
kexec rebooting too. In the future, we will only care about
kexec_file_load if everying is going well. But as I have explained
repeatedly, only caring about kexec_file_load means we will leave
kexec_load as is, we will not add new feature or improvement patches
for it.

commit 6a20bd54473e11011bf2b47efb52d0759d412854
Author: Lianbo Jiang <lijiang@redhat.com>
Date:   Thu Jan 16 13:47:35 2020 +0800

    kdump-lib: switch to the kexec_file_load() syscall on x86_64 by default

> 
> > For 3), people can still use kexec_load and develop/fix for it, if no
> > kexec_file_load supported. But 32-bit arm should be a different one,
> > more like i386, we will leave it as is, and fix anything which could
> > break it. But people really expects to improve or add feature to it? E.g
> > in this patchset, the mem hotplug issue James raised, I assume James is
> > focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> > another reply, people even don't agree to continue supporting memory
> > hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> > bug on i386 with a patch, but people would rather set it as BROKEN.
> 
> For memory hotplug just reload.  Userspace already gets good events.

Kexec_file_load is easy to maintain. This is an example.

Lock the hotplug area where kexed-ed kernel is targeted in this patchset,
it's obviously not right. We can't disable memory hotplug just because
kexec-ed kernel is loaded ahead of time. 

Reloading is also not a good fix. Kexec-ed kernel is targeted at a
movable area, reloading can avoid kexec rebooting corruption if that
area is hot removed. But if that area is not removed, locating kernel
into the hotpluggable area will change the area into ummovable zone.
Unless we decide to not support memory hotplug in kexec-ed kernel, I
guess it's very hard. Now in our distros kexec rebooting has been
supported, the big cloud providers are deploying linux in guest, bugs on
kexec reboot failure has been reported. They need the memory hotplug to
increase/decrease memory.

The root cause is kexec-ed kernel is targeted at hotpluggable memory
region. Just avoiding the movable area can fix it. In kexec_file_load(),
just checking or picking those unmovable region to put kernel/initrd in
function locate_mem_hole_callback() can fix it. The page or pageblock's
zone is movable or not, it's easy to know. This fix doesn't need to
bother other component.

> 
> We should not expect anything except a panic kernel to be loaded over a
> memory hotplug event. The kexec on panic code should actually be loaded
> in a location that we don't reliquish if asked for it.
> 
> Quite frankly at this point I would love to see the signature fad die,
> which would allow us to remove kexec_file_load.  I still have not seen
> the signature code used anywhere except by people anticipating trouble.
> 
> Given that Microsoft has already directly signed a malicous bootloader.
> (Not in the Linux ecosystem).  I don't even know if any of the reasons
> for having kexec_file_load are legtimate.
> 
> 
> If someone wants to do the work and ensure everything that is possible
> to load with kexec_load is possible to load with kexec_file_load.
> Kernels supporting the multi-boot protocol etc.  Then we can consider
> deprecating kexec_load.
> 
> 
> I think it took me about 15 years to remove the sysctl system call and
> it only ever had about 10 users.  If you want to go through that kind of
> work to make certain there are no more users and that everything they
> could do with the old interface is doable with the new interface then
> please be my guest.  Until then we need to fully support kexec_load.

I want to clarify again, we have no plan to deprecate kexec_load.
We just plan to use kexec_file_load more in our distros, for both legacy
system or system with secure boot.

Eric, I am glad to see you told your opinion about kexec_file_load.
Without the discussion in this thread, we may not know it. So I have one
question, seems kexec_file_load will continue existing, the ARCHes our
distros is supporting, x86_64, s390, ppc, arm64, all have kexec_file_load,
do you object us to continue using kexec_file_load, for signature
verification and normal kexec/kdump booting? Or you plan to deprecate
kexec_file_load?



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14  6:40                                     ` Baoquan He
@ 2020-04-14  6:51                                       ` Baoquan He
  2020-04-14  8:00                                       ` David Hildenbrand
  1 sibling, 0 replies; 92+ messages in thread
From: Baoquan He @ 2020-04-14  6:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Russell King - ARM Linux admin, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, David Hildenbrand, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel

On 04/14/20 at 02:40pm, Baoquan He wrote:
> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
> > Baoquan He <bhe@redhat.com> writes:
> > 
> > > On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> > >> 
> > >> The only benefit of kexec_file_load is that it is simple enough from a
> > >> kernel perspective that signatures can be checked.
> > >
> > > We don't have this restriction any more with below commit:
> > >
> > > commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> > > and KEXEC_SIG_FORCE")
> > >
> > > With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> > > secure boot or legacy system for kexec/kdump. Being simple enough is
> > > enough to astract and convince us to use it instead. And kexec_file_load
> > > has been in use for several years on systems with secure boot, since
> > > added in 2014, on x86_64.
> > 
> > No.  Actaully kexec_file_load is the less capable interface, and less
> > flexible interface.  Which is why it is appropriate for signature
> > verification.
> 
> Well, everyone has a stance and the corresponding view. You could have
> wider view from long time maintenance and in upstrem position, and think
> kexec_file_load is horrible. But I can only see from our work as a front
> line engineer to maintain/develop kexec/kdump in RHEL, and think
> kexec_file_load is easier to maintain.
> 
> Surely except of multiple kernel image format support. No matter it is
> kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
> This is produced from kerel building by default. We have no way to
> support it in our distros and add it into kexec_file_load.
> 
> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> https://lkml.org/lkml/2017/2/15/654
> 
> > 
> > >> kexec_load in every other respect is the more capable and functional
> > >> interface.  It makes no sense to get rid of it.
> > >> 
> > >> It does make sense to reload with a loaded kernel on memory hotplug.
> > >> That is simple and easy.  If we are going to handle something in the
> > >> kernel it should simple an automated unloading of the kernel on memory
> > >> hotplug.
> > >> 
> > >> 
> > >> I think it would be irresponsible to deprecate kexec_load on any
> > >> platform.
> > >> 
> > >> I also suspect that kexec_file_load could be taught to copy the dtb
> > >> on arm32 if someone wants to deal with signatures.
> > >> 
> > >> We definitely can not even think of deprecating kexec_load until
> > >> architecture that supports it also supports kexec_file_load and everyone
> > >> is happy with that interface.  That is Linus's no regression rule.
> > >
> > > I should pick a milder word to express our tendency and tell our plan
> > > then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> > > much. I didn't mean to say 'deprecate' at all when replied.
> > >
> > > The situation and trend I understand about kexec_load and kexec_file_load
> > > are:
> > >
> > > 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> > > have yet, just as x86_64, arm64 and s390 have done;
> > >  
> > > 2) kexec_file_load is suggested to use, and take precedence over
> > > kexec_load in the future, if both are supported in one ARCH.
> > 
> > The deep problem is that kexec_file_load is distinctly less expressive
> > than kexec_load.
> > 
> > > 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
> > > and by ARCHes for back compatibility w/ kexec_file_load support.
> > >
> > > For 1) and 2), I think the reason is obvious as Eric said,
> > > kexec_file_load is simple enough. And currently, whenever we got a bug
> > > report, we may need fix them twice, for kexec_load and kexec_file_load.
> > > If kexec_file_load is made by default, e.g on x86_64, we will change it
> > > in kernel space only, for kexec_file_load. This is what I meant about
> > > 'obsolete gradually'. I think for arm64, s390, they will do these too.
> > > Unless there's some critical/blocker bug in kexec_load, to corrupt the
> > > old kexec_load interface in old product.
> > 
> > Maybe.  The code that kexec_file_load sucked into the kernel is quite
> > stable and rarely needs changes except during a port of kexec to
> > another architecture.
> > 
> > Last I looked the real maintenance effor of kexec and kexec on panic was
> > in the drivers.  So I don't think we can use maintenance to do anything.
> 
> Not sure if I got it. But if check Lianbo's patches, a lot of effort has
> been taken to make SEV work well on kexec_file_load. And we have
> switched to use kexec_file_load in the newly published  Fedora release
> on x86_64 by default. Before this, Lianbo has investigated and done many
> experiments to make sure the switching is safe. We finally made this
> decision. Next we will do the switch in Enterprise distros. Once these
> are proved safe, we will suggest customers to use kexec_file_load for
> kexec rebooting too. In the future, we will only care about
> kexec_file_load if everying is going well. But as I have explained
> repeatedly, only caring about kexec_file_load means we will leave
> kexec_load as is, we will not add new feature or improvement patches
> for it.
> 
> commit 6a20bd54473e11011bf2b47efb52d0759d412854
> Author: Lianbo Jiang <lijiang@redhat.com>
> Date:   Thu Jan 16 13:47:35 2020 +0800
> 
>     kdump-lib: switch to the kexec_file_load() syscall on x86_64 by default
> 
> > 
> > > For 3), people can still use kexec_load and develop/fix for it, if no
> > > kexec_file_load supported. But 32-bit arm should be a different one,
> > > more like i386, we will leave it as is, and fix anything which could
> > > break it. But people really expects to improve or add feature to it? E.g
> > > in this patchset, the mem hotplug issue James raised, I assume James is
> > > focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> > > another reply, people even don't agree to continue supporting memory
> > > hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> > > bug on i386 with a patch, but people would rather set it as BROKEN.
> > 
> > For memory hotplug just reload.  Userspace already gets good events.
> 
> Kexec_file_load is easy to maintain. This is an example.
> 
> Lock the hotplug area where kexed-ed kernel is targeted in this patchset,
> it's obviously not right. We can't disable memory hotplug just because
> kexec-ed kernel is loaded ahead of time. 
> 
> Reloading is also not a good fix. Kexec-ed kernel is targeted at a
> movable area, reloading can avoid kexec rebooting corruption if that
> area is hot removed. But if that area is not removed, locating kernel
> into the hotpluggable area will change the area into ummovable zone.

Here I mean if kexec kernel is targeted at a hotplggable memory region,
after kexec rebooting, that region will become unmovable. People can't
hot remove it in kexec-ed kernel.

> Unless we decide to not support memory hotplug in kexec-ed kernel, I
> guess it's very hard. Now in our distros kexec rebooting has been
> supported, the big cloud providers are deploying linux in guest, bugs on
> kexec reboot failure has been reported. They need the memory hotplug to
> increase/decrease memory.
> 
> The root cause is kexec-ed kernel is targeted at hotpluggable memory
> region. Just avoiding the movable area can fix it. In kexec_file_load(),
> just checking or picking those unmovable region to put kernel/initrd in
> function locate_mem_hole_callback() can fix it. The page or pageblock's
> zone is movable or not, it's easy to know. This fix doesn't need to
> bother other component.
> 
> > 
> > We should not expect anything except a panic kernel to be loaded over a
> > memory hotplug event. The kexec on panic code should actually be loaded
> > in a location that we don't reliquish if asked for it.
> > 
> > Quite frankly at this point I would love to see the signature fad die,
> > which would allow us to remove kexec_file_load.  I still have not seen
> > the signature code used anywhere except by people anticipating trouble.
> > 
> > Given that Microsoft has already directly signed a malicous bootloader.
> > (Not in the Linux ecosystem).  I don't even know if any of the reasons
> > for having kexec_file_load are legtimate.
> > 
> > 
> > If someone wants to do the work and ensure everything that is possible
> > to load with kexec_load is possible to load with kexec_file_load.
> > Kernels supporting the multi-boot protocol etc.  Then we can consider
> > deprecating kexec_load.
> > 
> > 
> > I think it took me about 15 years to remove the sysctl system call and
> > it only ever had about 10 users.  If you want to go through that kind of
> > work to make certain there are no more users and that everything they
> > could do with the old interface is doable with the new interface then
> > please be my guest.  Until then we need to fully support kexec_load.
> 
> I want to clarify again, we have no plan to deprecate kexec_load.
> We just plan to use kexec_file_load more in our distros, for both legacy
> system or system with secure boot.
> 
> Eric, I am glad to see you told your opinion about kexec_file_load.
> Without the discussion in this thread, we may not know it. So I have one
> question, seems kexec_file_load will continue existing, the ARCHes our
> distros is supporting, x86_64, s390, ppc, arm64, all have kexec_file_load,
> do you object us to continue using kexec_file_load, for signature
> verification and normal kexec/kdump booting? Or you plan to deprecate
> kexec_file_load?



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-10 19:10                     ` Andrew Morton
  2020-04-11  3:44                       ` Baoquan He
@ 2020-04-14  7:05                       ` David Hildenbrand
  2020-04-14 16:55                         ` James Morse
  1 sibling, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-14  7:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Morse, kexec, linux-mm, linux-arm-kernel, Eric Biederman,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

On 10.04.20 21:10, Andrew Morton wrote:
> It's unclear (to me) what is the status of this patchset.  But it does appear that
> an new version can be expected?
> 

I'd suggest to unqueue the patches until we have a consensus.

While there are a couple of ideas floating around here, my current
suggestion would be either

1. Indicate all hotplugged memory as "System RAM (hotplugged)" in
/proc/iomem and the firmware memmap (on all architectures). This will
require kexec changes, but I would have assume that kexec has to be
updated in lock-step with the kernel just like e.g., makedumpfile.
Modify kexec() to not place the kexec kernel on these areas (easy) but
still consider them as crash regions to dump. When loading a kexec
kernel, validate in the kernel that the memory is appropriate.

2. Make kexec() reload the the kernel whenever we e.g., get a udev event
for removal of memory in /sys/devices/system/memory/. On every
remove_memory(), invalidate the loaded kernel in the kernel.


As I mentioned somewhere, 1. will be interesting for virtio-mem, where
we don't want any kexec kernel to be placed on virtio-mem-added memory.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14  6:40                                     ` Baoquan He
  2020-04-14  6:51                                       ` Baoquan He
@ 2020-04-14  8:00                                       ` David Hildenbrand
  2020-04-14  9:22                                         ` Baoquan He
  1 sibling, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-14  8:00 UTC (permalink / raw)
  To: Baoquan He, Eric W. Biederman
  Cc: Russell King - ARM Linux admin, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, kexec, linux-mm, James Morse,
	Andrew Morton, Will Deacon, linux-arm-kernel

On 14.04.20 08:40, Baoquan He wrote:
> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
>> Baoquan He <bhe@redhat.com> writes:
>>
>>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
>>>>
>>>> The only benefit of kexec_file_load is that it is simple enough from a
>>>> kernel perspective that signatures can be checked.
>>>
>>> We don't have this restriction any more with below commit:
>>>
>>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
>>> and KEXEC_SIG_FORCE")
>>>
>>> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
>>> secure boot or legacy system for kexec/kdump. Being simple enough is
>>> enough to astract and convince us to use it instead. And kexec_file_load
>>> has been in use for several years on systems with secure boot, since
>>> added in 2014, on x86_64.
>>
>> No.  Actaully kexec_file_load is the less capable interface, and less
>> flexible interface.  Which is why it is appropriate for signature
>> verification.
> 
> Well, everyone has a stance and the corresponding view. You could have
> wider view from long time maintenance and in upstrem position, and think
> kexec_file_load is horrible. But I can only see from our work as a front
> line engineer to maintain/develop kexec/kdump in RHEL, and think
> kexec_file_load is easier to maintain.
> 
> Surely except of multiple kernel image format support. No matter it is
> kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
> This is produced from kerel building by default. We have no way to
> support it in our distros and add it into kexec_file_load.
> 
> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> https://lkml.org/lkml/2017/2/15/654
> 
>>
>>>> kexec_load in every other respect is the more capable and functional
>>>> interface.  It makes no sense to get rid of it.
>>>>
>>>> It does make sense to reload with a loaded kernel on memory hotplug.
>>>> That is simple and easy.  If we are going to handle something in the
>>>> kernel it should simple an automated unloading of the kernel on memory
>>>> hotplug.
>>>>
>>>>
>>>> I think it would be irresponsible to deprecate kexec_load on any
>>>> platform.
>>>>
>>>> I also suspect that kexec_file_load could be taught to copy the dtb
>>>> on arm32 if someone wants to deal with signatures.
>>>>
>>>> We definitely can not even think of deprecating kexec_load until
>>>> architecture that supports it also supports kexec_file_load and everyone
>>>> is happy with that interface.  That is Linus's no regression rule.
>>>
>>> I should pick a milder word to express our tendency and tell our plan
>>> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
>>> much. I didn't mean to say 'deprecate' at all when replied.
>>>
>>> The situation and trend I understand about kexec_load and kexec_file_load
>>> are:
>>>
>>> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
>>> have yet, just as x86_64, arm64 and s390 have done;
>>>  
>>> 2) kexec_file_load is suggested to use, and take precedence over
>>> kexec_load in the future, if both are supported in one ARCH.
>>
>> The deep problem is that kexec_file_load is distinctly less expressive
>> than kexec_load.
>>
>>> 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
>>> and by ARCHes for back compatibility w/ kexec_file_load support.
>>>
>>> For 1) and 2), I think the reason is obvious as Eric said,
>>> kexec_file_load is simple enough. And currently, whenever we got a bug
>>> report, we may need fix them twice, for kexec_load and kexec_file_load.
>>> If kexec_file_load is made by default, e.g on x86_64, we will change it
>>> in kernel space only, for kexec_file_load. This is what I meant about
>>> 'obsolete gradually'. I think for arm64, s390, they will do these too.
>>> Unless there's some critical/blocker bug in kexec_load, to corrupt the
>>> old kexec_load interface in old product.
>>
>> Maybe.  The code that kexec_file_load sucked into the kernel is quite
>> stable and rarely needs changes except during a port of kexec to
>> another architecture.
>>
>> Last I looked the real maintenance effor of kexec and kexec on panic was
>> in the drivers.  So I don't think we can use maintenance to do anything.
> 
> Not sure if I got it. But if check Lianbo's patches, a lot of effort has
> been taken to make SEV work well on kexec_file_load. And we have
> switched to use kexec_file_load in the newly published  Fedora release
> on x86_64 by default. Before this, Lianbo has investigated and done many
> experiments to make sure the switching is safe. We finally made this
> decision. Next we will do the switch in Enterprise distros. Once these
> are proved safe, we will suggest customers to use kexec_file_load for
> kexec rebooting too. In the future, we will only care about
> kexec_file_load if everying is going well. But as I have explained
> repeatedly, only caring about kexec_file_load means we will leave
> kexec_load as is, we will not add new feature or improvement patches
> for it.
> 
> commit 6a20bd54473e11011bf2b47efb52d0759d412854
> Author: Lianbo Jiang <lijiang@redhat.com>
> Date:   Thu Jan 16 13:47:35 2020 +0800
> 
>     kdump-lib: switch to the kexec_file_load() syscall on x86_64 by default
> 
>>
>>> For 3), people can still use kexec_load and develop/fix for it, if no
>>> kexec_file_load supported. But 32-bit arm should be a different one,
>>> more like i386, we will leave it as is, and fix anything which could
>>> break it. But people really expects to improve or add feature to it? E.g
>>> in this patchset, the mem hotplug issue James raised, I assume James is
>>> focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
>>> another reply, people even don't agree to continue supporting memory
>>> hotplug on 32-bit system. We ever took effort to fix a memory hotplug
>>> bug on i386 with a patch, but people would rather set it as BROKEN.
>>
>> For memory hotplug just reload.  Userspace already gets good events.
> 
> Kexec_file_load is easy to maintain. This is an example.
> 
> Lock the hotplug area where kexed-ed kernel is targeted in this patchset,
> it's obviously not right. We can't disable memory hotplug just because
> kexec-ed kernel is loaded ahead of time. 
> 
> Reloading is also not a good fix. Kexec-ed kernel is targeted at a
> movable area, reloading can avoid kexec rebooting corruption if that
> area is hot removed. But if that area is not removed, locating kernel
> into the hotpluggable area will change the area into ummovable zone.
> Unless we decide to not support memory hotplug in kexec-ed kernel, I
> guess it's very hard. Now in our distros kexec rebooting has been
> supported, the big cloud providers are deploying linux in guest, bugs on
> kexec reboot failure has been reported. They need the memory hotplug to
> increase/decrease memory.
> 
> The root cause is kexec-ed kernel is targeted at hotpluggable memory
> region. Just avoiding the movable area can fix it. In kexec_file_load(),
> just checking or picking those unmovable region to put kernel/initrd in
> function locate_mem_hole_callback() can fix it. The page or pageblock's
> zone is movable or not, it's easy to know. This fix doesn't need to
> bother other component.

I don't fully agree. E.g., just because memory is onlined to ZONE_NORMAL
does not imply that it cannot get offlined and removed e.g., this is
heavily used on ppc64, with 16MB sections.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-13 13:15                                   ` Eric W. Biederman
  2020-04-13 23:01                                     ` Andrew Morton
  2020-04-14  6:40                                     ` Baoquan He
@ 2020-04-14  9:16                                     ` Dave Young
  2020-04-14  9:38                                       ` Dave Young
  2 siblings, 1 reply; 92+ messages in thread
From: Dave Young @ 2020-04-14  9:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Baoquan He, David Hildenbrand, Catalin Marinas, Bhupesh Sharma,
	Anshuman Khandual, kexec, Russell King - ARM Linux admin,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel

On 04/13/20 at 08:15am, Eric W. Biederman wrote:
> Baoquan He <bhe@redhat.com> writes:
> 
> > On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> >> 
> >> The only benefit of kexec_file_load is that it is simple enough from a
> >> kernel perspective that signatures can be checked.
> >
> > We don't have this restriction any more with below commit:
> >
> > commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> > and KEXEC_SIG_FORCE")
> >
> > With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> > secure boot or legacy system for kexec/kdump. Being simple enough is
> > enough to astract and convince us to use it instead. And kexec_file_load
> > has been in use for several years on systems with secure boot, since
> > added in 2014, on x86_64.
> 
> No.  Actaully kexec_file_load is the less capable interface, and less
> flexible interface.  Which is why it is appropriate for signature
> verification.

I agreed that the user space design is more flexible,  but as for the
common use case of loading bzImage (say x86 as an example) the
kexec_file_load is good enough.   We could have other potential improvement based
on kexec_file_load.  For example we could use it to do some early kdump
loading,  eg. try to load an attached kdump kernel immediately once
the crashkernel memory get reserved.

> 
> >> kexec_load in every other respect is the more capable and functional
> >> interface.  It makes no sense to get rid of it.

We do not remove kexec_load at all, it is indeed helpful in many cases
as all agreed.  But if we have a bug reported for both we may fix
kexec_file_load first because it is usually easier, also do not need to
worry about too much about old kernel and new kernel compatibility.

For example the recent breakage we found in efi path, kexec_file_load
just work after the efi cleanup, but kexec_load depends on the ABI we
added, so we must fix it as below:
https://lore.kernel.org/linux-efi/20200410135644.GB6772@dhcp-128-65.nay.redhat.com/ 

> >> 
> >> It does make sense to reload with a loaded kernel on memory hotplug.
> >> That is simple and easy.  If we are going to handle something in the
> >> kernel it should simple an automated unloading of the kernel on memory
> >> hotplug.
> >> 
> >> 
> >> I think it would be irresponsible to deprecate kexec_load on any
> >> platform.
> >> 
> >> I also suspect that kexec_file_load could be taught to copy the dtb
> >> on arm32 if someone wants to deal with signatures.
> >> 
> >> We definitely can not even think of deprecating kexec_load until
> >> architecture that supports it also supports kexec_file_load and everyone
> >> is happy with that interface.  That is Linus's no regression rule.
> >
> > I should pick a milder word to express our tendency and tell our plan
> > then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> > much. I didn't mean to say 'deprecate' at all when replied.
> >
> > The situation and trend I understand about kexec_load and kexec_file_load
> > are:
> >
> > 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> > have yet, just as x86_64, arm64 and s390 have done;
> >  
> > 2) kexec_file_load is suggested to use, and take precedence over
> > kexec_load in the future, if both are supported in one ARCH.
> 
> The deep problem is that kexec_file_load is distinctly less expressive
> than kexec_load.
> 
> > 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
> > and by ARCHes for back compatibility w/ kexec_file_load support.
> >
> > For 1) and 2), I think the reason is obvious as Eric said,
> > kexec_file_load is simple enough. And currently, whenever we got a bug
> > report, we may need fix them twice, for kexec_load and kexec_file_load.
> > If kexec_file_load is made by default, e.g on x86_64, we will change it
> > in kernel space only, for kexec_file_load. This is what I meant about
> > 'obsolete gradually'. I think for arm64, s390, they will do these too.
> > Unless there's some critical/blocker bug in kexec_load, to corrupt the
> > old kexec_load interface in old product.
> 
> Maybe.  The code that kexec_file_load sucked into the kernel is quite
> stable and rarely needs changes except during a port of kexec to
> another architecture.
> 
> Last I looked the real maintenance effor of kexec and kexec on panic was
> in the drivers.  So I don't think we can use maintenance to do anything.
> 
> > For 3), people can still use kexec_load and develop/fix for it, if no
> > kexec_file_load supported. But 32-bit arm should be a different one,
> > more like i386, we will leave it as is, and fix anything which could
> > break it. But people really expects to improve or add feature to it? E.g
> > in this patchset, the mem hotplug issue James raised, I assume James is
> > focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> > another reply, people even don't agree to continue supporting memory
> > hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> > bug on i386 with a patch, but people would rather set it as BROKEN.
> 
> For memory hotplug just reload.  Userspace already gets good events.
> 
> We should not expect anything except a panic kernel to be loaded over a
> memory hotplug event. The kexec on panic code should actually be loaded
> in a location that we don't reliquish if asked for it.
> 
> Quite frankly at this point I would love to see the signature fad die,
> which would allow us to remove kexec_file_load.  I still have not seen
> the signature code used anywhere except by people anticipating trouble.

Same to me, I also hate the Secure Boot, and I also do not like the
trouble added by signature verification.   But still we found that
beyond of Secure Boot use cases it is also useful in other usual cases.

And since kernel has the lockdown supported we have to leave with it.

Thanks
Dave



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14  8:00                                       ` David Hildenbrand
@ 2020-04-14  9:22                                         ` Baoquan He
  2020-04-14  9:37                                           ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-14  9:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel, linuxppc-dev

On 04/14/20 at 10:00am, David Hildenbrand wrote:
> On 14.04.20 08:40, Baoquan He wrote:
> > On 04/13/20 at 08:15am, Eric W. Biederman wrote:
> >> Baoquan He <bhe@redhat.com> writes:
> >>
> >>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> >>>>
> >>>> The only benefit of kexec_file_load is that it is simple enough from a
> >>>> kernel perspective that signatures can be checked.
> >>>
> >>> We don't have this restriction any more with below commit:
> >>>
> >>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> >>> and KEXEC_SIG_FORCE")
> >>>
> >>> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> >>> secure boot or legacy system for kexec/kdump. Being simple enough is
> >>> enough to astract and convince us to use it instead. And kexec_file_load
> >>> has been in use for several years on systems with secure boot, since
> >>> added in 2014, on x86_64.
> >>
> >> No.  Actaully kexec_file_load is the less capable interface, and less
> >> flexible interface.  Which is why it is appropriate for signature
> >> verification.
> > 
> > Well, everyone has a stance and the corresponding view. You could have
> > wider view from long time maintenance and in upstrem position, and think
> > kexec_file_load is horrible. But I can only see from our work as a front
> > line engineer to maintain/develop kexec/kdump in RHEL, and think
> > kexec_file_load is easier to maintain.
> > 
> > Surely except of multiple kernel image format support. No matter it is
> > kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
> > This is produced from kerel building by default. We have no way to
> > support it in our distros and add it into kexec_file_load.
> > 
> > [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> > https://lkml.org/lkml/2017/2/15/654
> > 
> >>
> >>>> kexec_load in every other respect is the more capable and functional
> >>>> interface.  It makes no sense to get rid of it.
> >>>>
> >>>> It does make sense to reload with a loaded kernel on memory hotplug.
> >>>> That is simple and easy.  If we are going to handle something in the
> >>>> kernel it should simple an automated unloading of the kernel on memory
> >>>> hotplug.
> >>>>
> >>>>
> >>>> I think it would be irresponsible to deprecate kexec_load on any
> >>>> platform.
> >>>>
> >>>> I also suspect that kexec_file_load could be taught to copy the dtb
> >>>> on arm32 if someone wants to deal with signatures.
> >>>>
> >>>> We definitely can not even think of deprecating kexec_load until
> >>>> architecture that supports it also supports kexec_file_load and everyone
> >>>> is happy with that interface.  That is Linus's no regression rule.
> >>>
> >>> I should pick a milder word to express our tendency and tell our plan
> >>> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> >>> much. I didn't mean to say 'deprecate' at all when replied.
> >>>
> >>> The situation and trend I understand about kexec_load and kexec_file_load
> >>> are:
> >>>
> >>> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> >>> have yet, just as x86_64, arm64 and s390 have done;
> >>>  
> >>> 2) kexec_file_load is suggested to use, and take precedence over
> >>> kexec_load in the future, if both are supported in one ARCH.
> >>
> >> The deep problem is that kexec_file_load is distinctly less expressive
> >> than kexec_load.
> >>
> >>> 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
> >>> and by ARCHes for back compatibility w/ kexec_file_load support.
> >>>
> >>> For 1) and 2), I think the reason is obvious as Eric said,
> >>> kexec_file_load is simple enough. And currently, whenever we got a bug
> >>> report, we may need fix them twice, for kexec_load and kexec_file_load.
> >>> If kexec_file_load is made by default, e.g on x86_64, we will change it
> >>> in kernel space only, for kexec_file_load. This is what I meant about
> >>> 'obsolete gradually'. I think for arm64, s390, they will do these too.
> >>> Unless there's some critical/blocker bug in kexec_load, to corrupt the
> >>> old kexec_load interface in old product.
> >>
> >> Maybe.  The code that kexec_file_load sucked into the kernel is quite
> >> stable and rarely needs changes except during a port of kexec to
> >> another architecture.
> >>
> >> Last I looked the real maintenance effor of kexec and kexec on panic was
> >> in the drivers.  So I don't think we can use maintenance to do anything.
> > 
> > Not sure if I got it. But if check Lianbo's patches, a lot of effort has
> > been taken to make SEV work well on kexec_file_load. And we have
> > switched to use kexec_file_load in the newly published  Fedora release
> > on x86_64 by default. Before this, Lianbo has investigated and done many
> > experiments to make sure the switching is safe. We finally made this
> > decision. Next we will do the switch in Enterprise distros. Once these
> > are proved safe, we will suggest customers to use kexec_file_load for
> > kexec rebooting too. In the future, we will only care about
> > kexec_file_load if everying is going well. But as I have explained
> > repeatedly, only caring about kexec_file_load means we will leave
> > kexec_load as is, we will not add new feature or improvement patches
> > for it.
> > 
> > commit 6a20bd54473e11011bf2b47efb52d0759d412854
> > Author: Lianbo Jiang <lijiang@redhat.com>
> > Date:   Thu Jan 16 13:47:35 2020 +0800
> > 
> >     kdump-lib: switch to the kexec_file_load() syscall on x86_64 by default
> > 
> >>
> >>> For 3), people can still use kexec_load and develop/fix for it, if no
> >>> kexec_file_load supported. But 32-bit arm should be a different one,
> >>> more like i386, we will leave it as is, and fix anything which could
> >>> break it. But people really expects to improve or add feature to it? E.g
> >>> in this patchset, the mem hotplug issue James raised, I assume James is
> >>> focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> >>> another reply, people even don't agree to continue supporting memory
> >>> hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> >>> bug on i386 with a patch, but people would rather set it as BROKEN.
> >>
> >> For memory hotplug just reload.  Userspace already gets good events.
> > 
> > Kexec_file_load is easy to maintain. This is an example.
> > 
> > Lock the hotplug area where kexed-ed kernel is targeted in this patchset,
> > it's obviously not right. We can't disable memory hotplug just because
> > kexec-ed kernel is loaded ahead of time. 
> > 
> > Reloading is also not a good fix. Kexec-ed kernel is targeted at a
> > movable area, reloading can avoid kexec rebooting corruption if that
> > area is hot removed. But if that area is not removed, locating kernel
> > into the hotpluggable area will change the area into ummovable zone.
> > Unless we decide to not support memory hotplug in kexec-ed kernel, I
> > guess it's very hard. Now in our distros kexec rebooting has been
> > supported, the big cloud providers are deploying linux in guest, bugs on
> > kexec reboot failure has been reported. They need the memory hotplug to
> > increase/decrease memory.
> > 
> > The root cause is kexec-ed kernel is targeted at hotpluggable memory
> > region. Just avoiding the movable area can fix it. In kexec_file_load(),
> > just checking or picking those unmovable region to put kernel/initrd in
> > function locate_mem_hole_callback() can fix it. The page or pageblock's
> > zone is movable or not, it's easy to know. This fix doesn't need to
> > bother other component.
> 
> I don't fully agree. E.g., just because memory is onlined to ZONE_NORMAL
> does not imply that it cannot get offlined and removed e.g., this is
> heavily used on ppc64, with 16MB sections.

Really? I just know there are two kinds of mem hoplug in ppc, but don't
know the details. So in this case, is there any flag or a way to know
those memory block are hotpluggable? I am curious how those kernel data
is avoided to be put in this area. Or ppc just freely uses it for kernel
data or user space data, then try to migrate when hot remove?



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14  9:22                                         ` Baoquan He
@ 2020-04-14  9:37                                           ` David Hildenbrand
  2020-04-14 14:39                                             ` Baoquan He
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-14  9:37 UTC (permalink / raw)
  To: Baoquan He
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel, linuxppc-dev

On 14.04.20 11:22, Baoquan He wrote:
> On 04/14/20 at 10:00am, David Hildenbrand wrote:
>> On 14.04.20 08:40, Baoquan He wrote:
>>> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
>>>> Baoquan He <bhe@redhat.com> writes:
>>>>
>>>>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
>>>>>>
>>>>>> The only benefit of kexec_file_load is that it is simple enough from a
>>>>>> kernel perspective that signatures can be checked.
>>>>>
>>>>> We don't have this restriction any more with below commit:
>>>>>
>>>>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
>>>>> and KEXEC_SIG_FORCE")
>>>>>
>>>>> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
>>>>> secure boot or legacy system for kexec/kdump. Being simple enough is
>>>>> enough to astract and convince us to use it instead. And kexec_file_load
>>>>> has been in use for several years on systems with secure boot, since
>>>>> added in 2014, on x86_64.
>>>>
>>>> No.  Actaully kexec_file_load is the less capable interface, and less
>>>> flexible interface.  Which is why it is appropriate for signature
>>>> verification.
>>>
>>> Well, everyone has a stance and the corresponding view. You could have
>>> wider view from long time maintenance and in upstrem position, and think
>>> kexec_file_load is horrible. But I can only see from our work as a front
>>> line engineer to maintain/develop kexec/kdump in RHEL, and think
>>> kexec_file_load is easier to maintain.
>>>
>>> Surely except of multiple kernel image format support. No matter it is
>>> kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
>>> This is produced from kerel building by default. We have no way to
>>> support it in our distros and add it into kexec_file_load.
>>>
>>> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
>>> https://lkml.org/lkml/2017/2/15/654
>>>
>>>>
>>>>>> kexec_load in every other respect is the more capable and functional
>>>>>> interface.  It makes no sense to get rid of it.
>>>>>>
>>>>>> It does make sense to reload with a loaded kernel on memory hotplug.
>>>>>> That is simple and easy.  If we are going to handle something in the
>>>>>> kernel it should simple an automated unloading of the kernel on memory
>>>>>> hotplug.
>>>>>>
>>>>>>
>>>>>> I think it would be irresponsible to deprecate kexec_load on any
>>>>>> platform.
>>>>>>
>>>>>> I also suspect that kexec_file_load could be taught to copy the dtb
>>>>>> on arm32 if someone wants to deal with signatures.
>>>>>>
>>>>>> We definitely can not even think of deprecating kexec_load until
>>>>>> architecture that supports it also supports kexec_file_load and everyone
>>>>>> is happy with that interface.  That is Linus's no regression rule.
>>>>>
>>>>> I should pick a milder word to express our tendency and tell our plan
>>>>> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
>>>>> much. I didn't mean to say 'deprecate' at all when replied.
>>>>>
>>>>> The situation and trend I understand about kexec_load and kexec_file_load
>>>>> are:
>>>>>
>>>>> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
>>>>> have yet, just as x86_64, arm64 and s390 have done;
>>>>>  
>>>>> 2) kexec_file_load is suggested to use, and take precedence over
>>>>> kexec_load in the future, if both are supported in one ARCH.
>>>>
>>>> The deep problem is that kexec_file_load is distinctly less expressive
>>>> than kexec_load.
>>>>
>>>>> 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
>>>>> and by ARCHes for back compatibility w/ kexec_file_load support.
>>>>>
>>>>> For 1) and 2), I think the reason is obvious as Eric said,
>>>>> kexec_file_load is simple enough. And currently, whenever we got a bug
>>>>> report, we may need fix them twice, for kexec_load and kexec_file_load.
>>>>> If kexec_file_load is made by default, e.g on x86_64, we will change it
>>>>> in kernel space only, for kexec_file_load. This is what I meant about
>>>>> 'obsolete gradually'. I think for arm64, s390, they will do these too.
>>>>> Unless there's some critical/blocker bug in kexec_load, to corrupt the
>>>>> old kexec_load interface in old product.
>>>>
>>>> Maybe.  The code that kexec_file_load sucked into the kernel is quite
>>>> stable and rarely needs changes except during a port of kexec to
>>>> another architecture.
>>>>
>>>> Last I looked the real maintenance effor of kexec and kexec on panic was
>>>> in the drivers.  So I don't think we can use maintenance to do anything.
>>>
>>> Not sure if I got it. But if check Lianbo's patches, a lot of effort has
>>> been taken to make SEV work well on kexec_file_load. And we have
>>> switched to use kexec_file_load in the newly published  Fedora release
>>> on x86_64 by default. Before this, Lianbo has investigated and done many
>>> experiments to make sure the switching is safe. We finally made this
>>> decision. Next we will do the switch in Enterprise distros. Once these
>>> are proved safe, we will suggest customers to use kexec_file_load for
>>> kexec rebooting too. In the future, we will only care about
>>> kexec_file_load if everying is going well. But as I have explained
>>> repeatedly, only caring about kexec_file_load means we will leave
>>> kexec_load as is, we will not add new feature or improvement patches
>>> for it.
>>>
>>> commit 6a20bd54473e11011bf2b47efb52d0759d412854
>>> Author: Lianbo Jiang <lijiang@redhat.com>
>>> Date:   Thu Jan 16 13:47:35 2020 +0800
>>>
>>>     kdump-lib: switch to the kexec_file_load() syscall on x86_64 by default
>>>
>>>>
>>>>> For 3), people can still use kexec_load and develop/fix for it, if no
>>>>> kexec_file_load supported. But 32-bit arm should be a different one,
>>>>> more like i386, we will leave it as is, and fix anything which could
>>>>> break it. But people really expects to improve or add feature to it? E.g
>>>>> in this patchset, the mem hotplug issue James raised, I assume James is
>>>>> focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
>>>>> another reply, people even don't agree to continue supporting memory
>>>>> hotplug on 32-bit system. We ever took effort to fix a memory hotplug
>>>>> bug on i386 with a patch, but people would rather set it as BROKEN.
>>>>
>>>> For memory hotplug just reload.  Userspace already gets good events.
>>>
>>> Kexec_file_load is easy to maintain. This is an example.
>>>
>>> Lock the hotplug area where kexed-ed kernel is targeted in this patchset,
>>> it's obviously not right. We can't disable memory hotplug just because
>>> kexec-ed kernel is loaded ahead of time. 
>>>
>>> Reloading is also not a good fix. Kexec-ed kernel is targeted at a
>>> movable area, reloading can avoid kexec rebooting corruption if that
>>> area is hot removed. But if that area is not removed, locating kernel
>>> into the hotpluggable area will change the area into ummovable zone.
>>> Unless we decide to not support memory hotplug in kexec-ed kernel, I
>>> guess it's very hard. Now in our distros kexec rebooting has been
>>> supported, the big cloud providers are deploying linux in guest, bugs on
>>> kexec reboot failure has been reported. They need the memory hotplug to
>>> increase/decrease memory.
>>>
>>> The root cause is kexec-ed kernel is targeted at hotpluggable memory
>>> region. Just avoiding the movable area can fix it. In kexec_file_load(),
>>> just checking or picking those unmovable region to put kernel/initrd in
>>> function locate_mem_hole_callback() can fix it. The page or pageblock's
>>> zone is movable or not, it's easy to know. This fix doesn't need to
>>> bother other component.
>>
>> I don't fully agree. E.g., just because memory is onlined to ZONE_NORMAL
>> does not imply that it cannot get offlined and removed e.g., this is
>> heavily used on ppc64, with 16MB sections.
> 
> Really? I just know there are two kinds of mem hoplug in ppc, but don't
> know the details. So in this case, is there any flag or a way to know
> those memory block are hotpluggable? I am curious how those kernel data
> is avoided to be put in this area. Or ppc just freely uses it for kernel
> data or user space data, then try to migrate when hot remove?

See
arch/powerpc/platforms/pseries/hotplug-memory.c:dlpar_memory_remove_by_count()

Under DLAPR, it can remove memory in LMB granularity, which is usually
16MB (== single section on ppc64). DLPAR will directly online all
hotplugged memory (LMBs) from the kernel using device_online(), which
will go to ZONE_NORMAL.

When trying to remove memory, it simply scans for offlineable 16MB
memory blocks (==section == LMB), offlines and removes them. No need for
the movable zone and all the involved issues.

Now, the interesting question is, can we have LMBs added during boot
(not via add_memory()), that will later be removed via remove_memory().
IIRC, we had BUGs related to that, so I think yes. If a section contains
no unmovable allocations (after boot), it can get removed.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14  9:16                                     ` Dave Young
@ 2020-04-14  9:38                                       ` Dave Young
  0 siblings, 0 replies; 92+ messages in thread
From: Dave Young @ 2020-04-14  9:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Baoquan He, David Hildenbrand, Catalin Marinas, Bhupesh Sharma,
	Anshuman Khandual, kexec, Russell King - ARM Linux admin,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel

> We do not remove kexec_load at all, it is indeed helpful in many cases
> as all agreed.  But if we have a bug reported for both we may fix
> kexec_file_load first because it is usually easier, also do not need to
> worry about too much about old kernel and new kernel compatibility.
> 
> For example the recent breakage we found in efi path, kexec_file_load
> just work after the efi cleanup, but kexec_load depends on the ABI we
> added, so we must fix it as below:
> https://lore.kernel.org/linux-efi/20200410135644.GB6772@dhcp-128-65.nay.redhat.com/ 

Also, we have some specific sysfs files exported for kexec-tools use
/sys/firmware/efi/runtime-map/* and a few other table addresses:
fw_vendor  runtime and config_table under /sys/firmware/efi

That is only used by userspace kexec_tools for kexec_load, now the
runtime field is useless because of Ard's cleanup in efi code, but we
have to keep it there, older kexec-tools will need it.

In this case kexec_file_load do not need those hacks at all.  So in the
future if we have to invent some kernel/userspace abi only for
kexec_load we should be careful and maybe reject if no strong reason. 

Thanks
Dave



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14  9:37                                           ` David Hildenbrand
@ 2020-04-14 14:39                                             ` Baoquan He
  2020-04-14 14:49                                               ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-14 14:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel, linuxppc-dev, piliu

On 04/14/20 at 11:37am, David Hildenbrand wrote:
> On 14.04.20 11:22, Baoquan He wrote:
> > On 04/14/20 at 10:00am, David Hildenbrand wrote:
> >> On 14.04.20 08:40, Baoquan He wrote:
> >>> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
> >>>> Baoquan He <bhe@redhat.com> writes:
> >>>>
> >>>>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
> >>>>>>
> >>>>>> The only benefit of kexec_file_load is that it is simple enough from a
> >>>>>> kernel perspective that signatures can be checked.
> >>>>>
> >>>>> We don't have this restriction any more with below commit:
> >>>>>
> >>>>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
> >>>>> and KEXEC_SIG_FORCE")
> >>>>>
> >>>>> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
> >>>>> secure boot or legacy system for kexec/kdump. Being simple enough is
> >>>>> enough to astract and convince us to use it instead. And kexec_file_load
> >>>>> has been in use for several years on systems with secure boot, since
> >>>>> added in 2014, on x86_64.
> >>>>
> >>>> No.  Actaully kexec_file_load is the less capable interface, and less
> >>>> flexible interface.  Which is why it is appropriate for signature
> >>>> verification.
> >>>
> >>> Well, everyone has a stance and the corresponding view. You could have
> >>> wider view from long time maintenance and in upstrem position, and think
> >>> kexec_file_load is horrible. But I can only see from our work as a front
> >>> line engineer to maintain/develop kexec/kdump in RHEL, and think
> >>> kexec_file_load is easier to maintain.
> >>>
> >>> Surely except of multiple kernel image format support. No matter it is
> >>> kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
> >>> This is produced from kerel building by default. We have no way to
> >>> support it in our distros and add it into kexec_file_load.
> >>>
> >>> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
> >>> https://lkml.org/lkml/2017/2/15/654
> >>>
> >>>>
> >>>>>> kexec_load in every other respect is the more capable and functional
> >>>>>> interface.  It makes no sense to get rid of it.
> >>>>>>
> >>>>>> It does make sense to reload with a loaded kernel on memory hotplug.
> >>>>>> That is simple and easy.  If we are going to handle something in the
> >>>>>> kernel it should simple an automated unloading of the kernel on memory
> >>>>>> hotplug.
> >>>>>>
> >>>>>>
> >>>>>> I think it would be irresponsible to deprecate kexec_load on any
> >>>>>> platform.
> >>>>>>
> >>>>>> I also suspect that kexec_file_load could be taught to copy the dtb
> >>>>>> on arm32 if someone wants to deal with signatures.
> >>>>>>
> >>>>>> We definitely can not even think of deprecating kexec_load until
> >>>>>> architecture that supports it also supports kexec_file_load and everyone
> >>>>>> is happy with that interface.  That is Linus's no regression rule.
> >>>>>
> >>>>> I should pick a milder word to express our tendency and tell our plan
> >>>>> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
> >>>>> much. I didn't mean to say 'deprecate' at all when replied.
> >>>>>
> >>>>> The situation and trend I understand about kexec_load and kexec_file_load
> >>>>> are:
> >>>>>
> >>>>> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
> >>>>> have yet, just as x86_64, arm64 and s390 have done;
> >>>>>  
> >>>>> 2) kexec_file_load is suggested to use, and take precedence over
> >>>>> kexec_load in the future, if both are supported in one ARCH.
> >>>>
> >>>> The deep problem is that kexec_file_load is distinctly less expressive
> >>>> than kexec_load.
> >>>>
> >>>>> 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
> >>>>> and by ARCHes for back compatibility w/ kexec_file_load support.
> >>>>>
> >>>>> For 1) and 2), I think the reason is obvious as Eric said,
> >>>>> kexec_file_load is simple enough. And currently, whenever we got a bug
> >>>>> report, we may need fix them twice, for kexec_load and kexec_file_load.
> >>>>> If kexec_file_load is made by default, e.g on x86_64, we will change it
> >>>>> in kernel space only, for kexec_file_load. This is what I meant about
> >>>>> 'obsolete gradually'. I think for arm64, s390, they will do these too.
> >>>>> Unless there's some critical/blocker bug in kexec_load, to corrupt the
> >>>>> old kexec_load interface in old product.
> >>>>
> >>>> Maybe.  The code that kexec_file_load sucked into the kernel is quite
> >>>> stable and rarely needs changes except during a port of kexec to
> >>>> another architecture.
> >>>>
> >>>> Last I looked the real maintenance effor of kexec and kexec on panic was
> >>>> in the drivers.  So I don't think we can use maintenance to do anything.
> >>>
> >>> Not sure if I got it. But if check Lianbo's patches, a lot of effort has
> >>> been taken to make SEV work well on kexec_file_load. And we have
> >>> switched to use kexec_file_load in the newly published  Fedora release
> >>> on x86_64 by default. Before this, Lianbo has investigated and done many
> >>> experiments to make sure the switching is safe. We finally made this
> >>> decision. Next we will do the switch in Enterprise distros. Once these
> >>> are proved safe, we will suggest customers to use kexec_file_load for
> >>> kexec rebooting too. In the future, we will only care about
> >>> kexec_file_load if everying is going well. But as I have explained
> >>> repeatedly, only caring about kexec_file_load means we will leave
> >>> kexec_load as is, we will not add new feature or improvement patches
> >>> for it.
> >>>
> >>> commit 6a20bd54473e11011bf2b47efb52d0759d412854
> >>> Author: Lianbo Jiang <lijiang@redhat.com>
> >>> Date:   Thu Jan 16 13:47:35 2020 +0800
> >>>
> >>>     kdump-lib: switch to the kexec_file_load() syscall on x86_64 by default
> >>>
> >>>>
> >>>>> For 3), people can still use kexec_load and develop/fix for it, if no
> >>>>> kexec_file_load supported. But 32-bit arm should be a different one,
> >>>>> more like i386, we will leave it as is, and fix anything which could
> >>>>> break it. But people really expects to improve or add feature to it? E.g
> >>>>> in this patchset, the mem hotplug issue James raised, I assume James is
> >>>>> focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> >>>>> another reply, people even don't agree to continue supporting memory
> >>>>> hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> >>>>> bug on i386 with a patch, but people would rather set it as BROKEN.
> >>>>
> >>>> For memory hotplug just reload.  Userspace already gets good events.
> >>>
> >>> Kexec_file_load is easy to maintain. This is an example.
> >>>
> >>> Lock the hotplug area where kexed-ed kernel is targeted in this patchset,
> >>> it's obviously not right. We can't disable memory hotplug just because
> >>> kexec-ed kernel is loaded ahead of time. 
> >>>
> >>> Reloading is also not a good fix. Kexec-ed kernel is targeted at a
> >>> movable area, reloading can avoid kexec rebooting corruption if that
> >>> area is hot removed. But if that area is not removed, locating kernel
> >>> into the hotpluggable area will change the area into ummovable zone.
> >>> Unless we decide to not support memory hotplug in kexec-ed kernel, I
> >>> guess it's very hard. Now in our distros kexec rebooting has been
> >>> supported, the big cloud providers are deploying linux in guest, bugs on
> >>> kexec reboot failure has been reported. They need the memory hotplug to
> >>> increase/decrease memory.
> >>>
> >>> The root cause is kexec-ed kernel is targeted at hotpluggable memory
> >>> region. Just avoiding the movable area can fix it. In kexec_file_load(),
> >>> just checking or picking those unmovable region to put kernel/initrd in
> >>> function locate_mem_hole_callback() can fix it. The page or pageblock's
> >>> zone is movable or not, it's easy to know. This fix doesn't need to
> >>> bother other component.
> >>
> >> I don't fully agree. E.g., just because memory is onlined to ZONE_NORMAL
> >> does not imply that it cannot get offlined and removed e.g., this is
> >> heavily used on ppc64, with 16MB sections.
> > 
> > Really? I just know there are two kinds of mem hoplug in ppc, but don't
> > know the details. So in this case, is there any flag or a way to know
> > those memory block are hotpluggable? I am curious how those kernel data
> > is avoided to be put in this area. Or ppc just freely uses it for kernel
> > data or user space data, then try to migrate when hot remove?
> 
> See
> arch/powerpc/platforms/pseries/hotplug-memory.c:dlpar_memory_remove_by_count()
> 
> Under DLAPR, it can remove memory in LMB granularity, which is usually
> 16MB (== single section on ppc64). DLPAR will directly online all
> hotplugged memory (LMBs) from the kernel using device_online(), which
> will go to ZONE_NORMAL.
> 
> When trying to remove memory, it simply scans for offlineable 16MB
> memory blocks (==section == LMB), offlines and removes them. No need for
> the movable zone and all the involved issues.

Yes, this is a different one, thanks for pointing it out. It sounds like
balloon driver in virt platform, doesn't it?

Avoiding to put kexec kernel into movable zone can't solve this DLPAR
case as you said.

> 
> Now, the interesting question is, can we have LMBs added during boot
> (not via add_memory()), that will later be removed via remove_memory().
> IIRC, we had BUGs related to that, so I think yes. If a section contains
> no unmovable allocations (after boot), it can get removed.

I do want to ask this question. If we can add LMB into system RAM, then
reload kexec can solve it. 

Another better way is adding a common function to filter out the
movable zone when search position for kexec kernel, use a arch specific
funciton to filter out DLPAR memory blocks for ppc only. Over there,
we can simply use for_each_drmem_lmb() to do that.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14 14:39                                             ` Baoquan He
@ 2020-04-14 14:49                                               ` David Hildenbrand
  2020-04-15  2:35                                                 ` Baoquan He
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-14 14:49 UTC (permalink / raw)
  To: Baoquan He
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel, linuxppc-dev, piliu

On 14.04.20 16:39, Baoquan He wrote:
> On 04/14/20 at 11:37am, David Hildenbrand wrote:
>> On 14.04.20 11:22, Baoquan He wrote:
>>> On 04/14/20 at 10:00am, David Hildenbrand wrote:
>>>> On 14.04.20 08:40, Baoquan He wrote:
>>>>> On 04/13/20 at 08:15am, Eric W. Biederman wrote:
>>>>>> Baoquan He <bhe@redhat.com> writes:
>>>>>>
>>>>>>> On 04/12/20 at 02:52pm, Eric W. Biederman wrote:
>>>>>>>>
>>>>>>>> The only benefit of kexec_file_load is that it is simple enough from a
>>>>>>>> kernel perspective that signatures can be checked.
>>>>>>>
>>>>>>> We don't have this restriction any more with below commit:
>>>>>>>
>>>>>>> commit 99d5cadfde2b ("kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG
>>>>>>> and KEXEC_SIG_FORCE")
>>>>>>>
>>>>>>> With KEXEC_SIG_FORCE not set, we can use kexec_load_file to cover both
>>>>>>> secure boot or legacy system for kexec/kdump. Being simple enough is
>>>>>>> enough to astract and convince us to use it instead. And kexec_file_load
>>>>>>> has been in use for several years on systems with secure boot, since
>>>>>>> added in 2014, on x86_64.
>>>>>>
>>>>>> No.  Actaully kexec_file_load is the less capable interface, and less
>>>>>> flexible interface.  Which is why it is appropriate for signature
>>>>>> verification.
>>>>>
>>>>> Well, everyone has a stance and the corresponding view. You could have
>>>>> wider view from long time maintenance and in upstrem position, and think
>>>>> kexec_file_load is horrible. But I can only see from our work as a front
>>>>> line engineer to maintain/develop kexec/kdump in RHEL, and think
>>>>> kexec_file_load is easier to maintain.
>>>>>
>>>>> Surely except of multiple kernel image format support. No matter it is
>>>>> kexec_load and kexec_file_load, e.g in x86_64, we only support bzImage.
>>>>> This is produced from kerel building by default. We have no way to
>>>>> support it in our distros and add it into kexec_file_load.
>>>>>
>>>>> [RFC PATCH] x86/boot: make ELF kernel multiboot-able
>>>>> https://lkml.org/lkml/2017/2/15/654
>>>>>
>>>>>>
>>>>>>>> kexec_load in every other respect is the more capable and functional
>>>>>>>> interface.  It makes no sense to get rid of it.
>>>>>>>>
>>>>>>>> It does make sense to reload with a loaded kernel on memory hotplug.
>>>>>>>> That is simple and easy.  If we are going to handle something in the
>>>>>>>> kernel it should simple an automated unloading of the kernel on memory
>>>>>>>> hotplug.
>>>>>>>>
>>>>>>>>
>>>>>>>> I think it would be irresponsible to deprecate kexec_load on any
>>>>>>>> platform.
>>>>>>>>
>>>>>>>> I also suspect that kexec_file_load could be taught to copy the dtb
>>>>>>>> on arm32 if someone wants to deal with signatures.
>>>>>>>>
>>>>>>>> We definitely can not even think of deprecating kexec_load until
>>>>>>>> architecture that supports it also supports kexec_file_load and everyone
>>>>>>>> is happy with that interface.  That is Linus's no regression rule.
>>>>>>>
>>>>>>> I should pick a milder word to express our tendency and tell our plan
>>>>>>> then 'obsolete'. Even though I added 'gradually', seems it doesn't help
>>>>>>> much. I didn't mean to say 'deprecate' at all when replied.
>>>>>>>
>>>>>>> The situation and trend I understand about kexec_load and kexec_file_load
>>>>>>> are:
>>>>>>>
>>>>>>> 1) Supporting kexec_file_load is suggested to add in ARCHes which don't
>>>>>>> have yet, just as x86_64, arm64 and s390 have done;
>>>>>>>  
>>>>>>> 2) kexec_file_load is suggested to use, and take precedence over
>>>>>>> kexec_load in the future, if both are supported in one ARCH.
>>>>>>
>>>>>> The deep problem is that kexec_file_load is distinctly less expressive
>>>>>> than kexec_load.
>>>>>>
>>>>>>> 3) Kexec_load is kept being used by ARCHes w/o kexc_file_load support,
>>>>>>> and by ARCHes for back compatibility w/ kexec_file_load support.
>>>>>>>
>>>>>>> For 1) and 2), I think the reason is obvious as Eric said,
>>>>>>> kexec_file_load is simple enough. And currently, whenever we got a bug
>>>>>>> report, we may need fix them twice, for kexec_load and kexec_file_load.
>>>>>>> If kexec_file_load is made by default, e.g on x86_64, we will change it
>>>>>>> in kernel space only, for kexec_file_load. This is what I meant about
>>>>>>> 'obsolete gradually'. I think for arm64, s390, they will do these too.
>>>>>>> Unless there's some critical/blocker bug in kexec_load, to corrupt the
>>>>>>> old kexec_load interface in old product.
>>>>>>
>>>>>> Maybe.  The code that kexec_file_load sucked into the kernel is quite
>>>>>> stable and rarely needs changes except during a port of kexec to
>>>>>> another architecture.
>>>>>>
>>>>>> Last I looked the real maintenance effor of kexec and kexec on panic was
>>>>>> in the drivers.  So I don't think we can use maintenance to do anything.
>>>>>
>>>>> Not sure if I got it. But if check Lianbo's patches, a lot of effort has
>>>>> been taken to make SEV work well on kexec_file_load. And we have
>>>>> switched to use kexec_file_load in the newly published  Fedora release
>>>>> on x86_64 by default. Before this, Lianbo has investigated and done many
>>>>> experiments to make sure the switching is safe. We finally made this
>>>>> decision. Next we will do the switch in Enterprise distros. Once these
>>>>> are proved safe, we will suggest customers to use kexec_file_load for
>>>>> kexec rebooting too. In the future, we will only care about
>>>>> kexec_file_load if everying is going well. But as I have explained
>>>>> repeatedly, only caring about kexec_file_load means we will leave
>>>>> kexec_load as is, we will not add new feature or improvement patches
>>>>> for it.
>>>>>
>>>>> commit 6a20bd54473e11011bf2b47efb52d0759d412854
>>>>> Author: Lianbo Jiang <lijiang@redhat.com>
>>>>> Date:   Thu Jan 16 13:47:35 2020 +0800
>>>>>
>>>>>     kdump-lib: switch to the kexec_file_load() syscall on x86_64 by default
>>>>>
>>>>>>
>>>>>>> For 3), people can still use kexec_load and develop/fix for it, if no
>>>>>>> kexec_file_load supported. But 32-bit arm should be a different one,
>>>>>>> more like i386, we will leave it as is, and fix anything which could
>>>>>>> break it. But people really expects to improve or add feature to it? E.g
>>>>>>> in this patchset, the mem hotplug issue James raised, I assume James is
>>>>>>> focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
>>>>>>> another reply, people even don't agree to continue supporting memory
>>>>>>> hotplug on 32-bit system. We ever took effort to fix a memory hotplug
>>>>>>> bug on i386 with a patch, but people would rather set it as BROKEN.
>>>>>>
>>>>>> For memory hotplug just reload.  Userspace already gets good events.
>>>>>
>>>>> Kexec_file_load is easy to maintain. This is an example.
>>>>>
>>>>> Lock the hotplug area where kexed-ed kernel is targeted in this patchset,
>>>>> it's obviously not right. We can't disable memory hotplug just because
>>>>> kexec-ed kernel is loaded ahead of time. 
>>>>>
>>>>> Reloading is also not a good fix. Kexec-ed kernel is targeted at a
>>>>> movable area, reloading can avoid kexec rebooting corruption if that
>>>>> area is hot removed. But if that area is not removed, locating kernel
>>>>> into the hotpluggable area will change the area into ummovable zone.
>>>>> Unless we decide to not support memory hotplug in kexec-ed kernel, I
>>>>> guess it's very hard. Now in our distros kexec rebooting has been
>>>>> supported, the big cloud providers are deploying linux in guest, bugs on
>>>>> kexec reboot failure has been reported. They need the memory hotplug to
>>>>> increase/decrease memory.
>>>>>
>>>>> The root cause is kexec-ed kernel is targeted at hotpluggable memory
>>>>> region. Just avoiding the movable area can fix it. In kexec_file_load(),
>>>>> just checking or picking those unmovable region to put kernel/initrd in
>>>>> function locate_mem_hole_callback() can fix it. The page or pageblock's
>>>>> zone is movable or not, it's easy to know. This fix doesn't need to
>>>>> bother other component.
>>>>
>>>> I don't fully agree. E.g., just because memory is onlined to ZONE_NORMAL
>>>> does not imply that it cannot get offlined and removed e.g., this is
>>>> heavily used on ppc64, with 16MB sections.
>>>
>>> Really? I just know there are two kinds of mem hoplug in ppc, but don't
>>> know the details. So in this case, is there any flag or a way to know
>>> those memory block are hotpluggable? I am curious how those kernel data
>>> is avoided to be put in this area. Or ppc just freely uses it for kernel
>>> data or user space data, then try to migrate when hot remove?
>>
>> See
>> arch/powerpc/platforms/pseries/hotplug-memory.c:dlpar_memory_remove_by_count()
>>
>> Under DLAPR, it can remove memory in LMB granularity, which is usually
>> 16MB (== single section on ppc64). DLPAR will directly online all
>> hotplugged memory (LMBs) from the kernel using device_online(), which
>> will go to ZONE_NORMAL.
>>
>> When trying to remove memory, it simply scans for offlineable 16MB
>> memory blocks (==section == LMB), offlines and removes them. No need for
>> the movable zone and all the involved issues.
> 
> Yes, this is a different one, thanks for pointing it out. It sounds like
> balloon driver in virt platform, doesn't it?

With DLPAR there is a hypervisor involved (which manages the actual HW
DIMMs), so yes.

> 
> Avoiding to put kexec kernel into movable zone can't solve this DLPAR
> case as you said.
> 
>>
>> Now, the interesting question is, can we have LMBs added during boot
>> (not via add_memory()), that will later be removed via remove_memory().
>> IIRC, we had BUGs related to that, so I think yes. If a section contains
>> no unmovable allocations (after boot), it can get removed.
> 
> I do want to ask this question. If we can add LMB into system RAM, then
> reload kexec can solve it. 
> 
> Another better way is adding a common function to filter out the
> movable zone when search position for kexec kernel, use a arch specific
> funciton to filter out DLPAR memory blocks for ppc only. Over there,
> we can simply use for_each_drmem_lmb() to do that.

I was thinking about something similar. Maybe something like a notifier
that can be used to test if selected memory can be used for kexec
images. It would apply to

- arm64 and filter out all hotadded memory (IIRC, only boot memory can
  be used).
- powerpc to filter out all LMBs that can be removed (assuming not all
  memory corresponds to LMBs that can be removed, otherwise we're in
  trouble ... :) )
- virtio-mem to filter out all memory it added.
- hyper-v to filter out partially backed memory blocks (esp. the last
  memory block it added and only partially backed it by memory).

This would make it work for kexec_file_load(), however, I do wonder how
we would want to approach that from userspace kexec-tools when handling
it from kexec_load().

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14  7:05                       ` David Hildenbrand
@ 2020-04-14 16:55                         ` James Morse
  2020-04-14 17:41                           ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-04-14 16:55 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

Hi guys,

On 14/04/2020 08:05, David Hildenbrand wrote:
> On 10.04.20 21:10, Andrew Morton wrote:
>> It's unclear (to me) what is the status of this patchset.  But it does appear that
>> an new version can be expected?

> I'd suggest to unqueue the patches until we have a consensus.

Certainly!


> While there are a couple of ideas floating around here, my current
> suggestion would be either
> 
> 1. Indicate all hotplugged memory as "System RAM (hotplugged)" in
> /proc/iomem and the firmware memmap (on all architectures). This will
> require kexec changes,

> but I would have assume that kexec has to be
> updated in lock-step with the kernel

News to me: I was using the version I first built when arm64's support was new. I've only
had to update it once when we had to change user-space.

I don't think debian updates kexec-tools when it updates the kernel.


Making changes to /proc/iomem means updating user-space again, (for kdump). I'd like to
avoid that if its at all possible.


> just like e.g., makedumpfile.
> Modify kexec() to not place the kexec kernel on these areas (easy) but
> still consider them as crash regions to dump. When loading a kexec
> kernel, validate in the kernel that the memory is appropriate.


> 2. Make kexec() reload the the kernel whenever we e.g., get a udev event
> for removal of memory in /sys/devices/system/memory/.

I don't think we can rely on user-space to do something,


> On every remove_memory(), invalidate the loaded kernel in the kernel.

This is an option, ... but its a change of behaviour. If user-space asks for two
impossible things, the second request should fail. Having the first-one disappear is a bit
spooky...

Fortunately user-space checks the 'kexec -l' bit happened before it calls reboot() behind
'kexec -e'. So this works, but is not intuitive.

("Did I load it? What changed and when? oh, half a mile up in dmesg is a message saying
the kernel discarded the kexec kernel last wednesday.")


> As I mentioned somewhere, 1. will be interesting for virtio-mem, where
> we don't want any kexec kernel to be placed on virtio-mem-added memory.

Do these virtio-mem-added regions need to be accessible by kdump?
(do we already need a user-space change for that?)


A third option, along the line of what I posted:

Split the 'offline' and 'removed' ideas, which David mentioned somewhere. We'd end up with
(yet) another notifier chain, that prevents the memory being removed, but you can still
mark it as offline in /sys/. (...I'm not quite sure why you would do that...)

This would need hooking up for ACPI (which covers x86 and arm64), and other architectures
mechanisms for doing this...
arm64 can then switch is arch hook that prevents 'bootmem' being removed to this new
notifier chain, as the kernel can only boot from that was present at boot.


My preference is 3, then 2.

I think 1 is slightly less desirable than a message at kexec time that the memory layout
has changed since load, and this might not work...



Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-04-02  6:12     ` piliu
@ 2020-04-14 17:21       ` James Morse
  0 siblings, 0 replies; 92+ messages in thread
From: James Morse @ 2020-04-14 17:21 UTC (permalink / raw)
  To: piliu
  Cc: Dave Young, kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Eric Biederman, Andrew Morton,
	Will Deacon, Hari Bathini

Hi Dave, Pingfan,

On 02/04/2020 07:12, piliu wrote:
> On 04/02/2020 01:49 PM, Dave Young wrote:
>> On 03/26/20 at 06:07pm, James Morse wrote:
>>> Memory added to the system by hotplug has a 'System RAM' resource created
>>> for it. This is exposed to user-space via /proc/iomem.
>>>
>>> This poses problems for kexec on arm64. If kexec decides to place the
>>> kernel in one of these newly onlined regions, the new kernel will find
>>> itself booting from a region not described as memory in the firmware
>>> tables.
>>>
>>> Arm64 doesn't have a structure like the e820 memory map that can be
>>> re-written when memory is brought online. Instead arm64 uses the UEFI
>>> memory map, or the memory node from the DT, sometimes both. We never
>>> rewrite these.
>>
>> Could arm64 use similar way to update DT, or a cooked UEFI maps?

>> Add pingfan in cc, he said ppc64 update the DT after a memremove thus it
>> would be good to just redo a kexec load.

> Yes, the memory changes will be observed through device-node under
> /proc/device-tree/ (which is for powerpc).
> 
> Later if running kexec -l/-p , it can build new dtb with the latest info
> from /proc/device-tree
For arm64, the device-tree is set in stone. We don't have the runtime parts of
open-firmware that powerpc does. (my knowledge in this area is extremely sparse)
arm64 platforms where stuff like this changes tend to use ACPI instead, and these all have
to boot with UEFI, which means its the UEFI memory map that has authority.

We don't cook a fake UEFI memory map when things change because we treat it like the
set-in-stone DT. This means we only have discrepancies in firmware to workaround, instead
of any we introduce ourselves.

One of the UEFI configuration tables describes addresses Linux programmed into hardware
that can't be reset. Newer versions of Linux know how to pick these up on kexec... but
older versions don't know how to parse/rewrite/move that table. Cooking up new versions of
these tables would prevent us doing stuff like this, which we need to workaround hardware
that didn't get the 'kexec exists' memo.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-31  3:46     ` Dave Young
@ 2020-04-14 17:31       ` James Morse
  0 siblings, 0 replies; 92+ messages in thread
From: James Morse @ 2020-04-14 17:31 UTC (permalink / raw)
  To: Dave Young
  Cc: Baoquan He, Anshuman Khandual, Catalin Marinas, Bhupesh Sharma,
	kexec, linux-mm, Eric Biederman, Andrew Morton, Will Deacon,
	linux-arm-kernel

Hi Dave,

On 31/03/2020 04:46, Dave Young wrote:
> I agreed that file load is still not widely used,  but in the long run
> we should not maintain both of them all the future time.  Especially
> when some kernel-userspace interfaces need to be introduced, file load
> will have the natural advantage.  We may keep the kexec_load for other
> misc usecases, but we can use file load for the major modern
> linux-to-linux loading.  I'm not saying we can do it immediately, just
> thought we should reduce the duplicate effort and try to avoid hacking if
> possible.

Sure. My aim here is to never debug this problem again.


> Anyway about this particular issue, I wonder if we can just reload with
> a udev rule as replied in another mail.

What if it doesn't? I can't find such a rule on my debian machine.
I don't think user-space can be relied on for something like this.

The best we could hope for here is a dying gasp from the old kernel:
| kexec: memory layout changed since kexec load, this may not work.
| Bye!

... assuming anyone sees such a message.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14 16:55                         ` James Morse
@ 2020-04-14 17:41                           ` David Hildenbrand
  0 siblings, 0 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-04-14 17:41 UTC (permalink / raw)
  To: James Morse, Andrew Morton
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

>> While there are a couple of ideas floating around here, my current
>> suggestion would be either
>>
>> 1. Indicate all hotplugged memory as "System RAM (hotplugged)" in
>> /proc/iomem and the firmware memmap (on all architectures). This will
>> require kexec changes,
> 
>> but I would have assume that kexec has to be
>> updated in lock-step with the kernel
> 
> News to me: I was using the version I first built when arm64's support was new. I've only
> had to update it once when we had to change user-space.
> 
> I don't think debian updates kexec-tools when it updates the kernel.

I would assume they are also not pushing the latest-greatest kernel in
their current release, after settling on a kexec version, no?

I think you can assume new kernels to require new kexec-tools versions
to provide all features.

> Making changes to /proc/iomem means updating user-space again, (for kdump). I'd like to
> avoid that if its at all possible.

Yes, it's not desirable, but if all that's not working is a "not all
memory will be dumped out of the box", at least I think this is
tolerable. It's not like we're completely breaking kexec.

Your current arm64 patches require the same change AFAIKS - and I think
we already have arm64 hotplug support in Linux distros.

As I said, similarly, makedumpfile has to be upgraded with every kernel
release to make kdump work as expected. And that is no big news I hope :)

>> just like e.g., makedumpfile.
>> Modify kexec() to not place the kexec kernel on these areas (easy) but
>> still consider them as crash regions to dump. When loading a kexec
>> kernel, validate in the kernel that the memory is appropriate.
> 
> 
>> 2. Make kexec() reload the the kernel whenever we e.g., get a udev event
>> for removal of memory in /sys/devices/system/memory/.
> 
> I don't think we can rely on user-space to do something,
> 
> 
>> On every remove_memory(), invalidate the loaded kernel in the kernel.
> 
> This is an option, ... but its a change of behaviour. If user-space asks for two
> impossible things, the second request should fail. Having the first-one disappear is a bit
> spooky...

We are talking about corner cases that are already broken, no?

> 
> Fortunately user-space checks the 'kexec -l' bit happened before it calls reboot() behind
> 'kexec -e'. So this works, but is not intuitive.
> 
> ("Did I load it? What changed and when? oh, half a mile up in dmesg is a message saying
> the kernel discarded the kexec kernel last wednesday.")
> 
> 
>> As I mentioned somewhere, 1. will be interesting for virtio-mem, where
>> we don't want any kexec kernel to be placed on virtio-mem-added memory.
> 
> Do these virtio-mem-added regions need to be accessible by kdump?
> (do we already need a user-space change for that?)

Yes, they have to be accessible by kdump. Currently, they are also
exported as "System RAM" via /proc/iomem - which is why dumping works
e.g., on x86-64 (we'll have to increase the #of memory resources that
can be considered in the future, but that's a different story and only
applies when adding more than 100GB of memory via virtio-mem or so)

But as virtio-mem is fairly new (IOW, about to get queued for
integration soonish), I could still change the memory resources to show
up differently ("System RAM (hotplugged)", "System RAM (virtio-mem)",
etc.) and teach kexec about them.

But learning that we are having similar problems on arm64 (and
theoretically on Hyper-V), I think it makes sense to discuss a solution
that will solve the other issues as well.

> 
> 
> A third option, along the line of what I posted:
> 
> Split the 'offline' and 'removed' ideas, which David mentioned somewhere. We'd end up with
> (yet) another notifier chain, that prevents the memory being removed, but you can still

I dislike limiting memory unplug - and especially making remove_memory()
fail - just because somebody once thought it would be a good place to
load - in the future - some kexec binary onto it.

> mark it as offline in /sys/. (...I'm not quite sure why you would do that...)
> 
> This would need hooking up for ACPI (which covers x86 and arm64), and other architectures
> mechanisms for doing this...
> arm64 can then switch is arch hook that prevents 'bootmem' being removed to this new
> notifier chain, as the kernel can only boot from that was present at boot.


We have two different problems here, right?

1. Don't place kexec binaries on specific memory areas (e.g., arm64,
virtio-mem, hyper-v, ...)

2. Figure out what to do when unplugging memory that was selected as a
target for kexec binaries.


For 1, I have a feeling that /proc/iomem could be the right solution,
eventually requiring kexec changes to handle kdump properly (IOW, dump
all memory). Indicating all hotplugged memory as "System RAM
(hotplugged)" would be the way to go here.

For 2, I think we should unload all kexec images in case they overlap
with memory to be removed (e.g., remove_memory() notifier, which cannot
stop removal, it's only an indication), and make userspace reload kexec
via udev events.


Also, we have to think about kexec_file_load() to deal with 1.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-14 14:49                                               ` David Hildenbrand
@ 2020-04-15  2:35                                                 ` Baoquan He
  2020-04-16 13:31                                                   ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-15  2:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel, linuxppc-dev, piliu

On 04/14/20 at 04:49pm, David Hildenbrand wrote:
> >>>>> The root cause is kexec-ed kernel is targeted at hotpluggable memory
> >>>>> region. Just avoiding the movable area can fix it. In kexec_file_load(),
> >>>>> just checking or picking those unmovable region to put kernel/initrd in
> >>>>> function locate_mem_hole_callback() can fix it. The page or pageblock's
> >>>>> zone is movable or not, it's easy to know. This fix doesn't need to
> >>>>> bother other component.
> >>>>
> >>>> I don't fully agree. E.g., just because memory is onlined to ZONE_NORMAL
> >>>> does not imply that it cannot get offlined and removed e.g., this is
> >>>> heavily used on ppc64, with 16MB sections.
> >>>
> >>> Really? I just know there are two kinds of mem hoplug in ppc, but don't
> >>> know the details. So in this case, is there any flag or a way to know
> >>> those memory block are hotpluggable? I am curious how those kernel data
> >>> is avoided to be put in this area. Or ppc just freely uses it for kernel
> >>> data or user space data, then try to migrate when hot remove?
> >>
> >> See
> >> arch/powerpc/platforms/pseries/hotplug-memory.c:dlpar_memory_remove_by_count()
> >>
> >> Under DLAPR, it can remove memory in LMB granularity, which is usually
> >> 16MB (== single section on ppc64). DLPAR will directly online all
> >> hotplugged memory (LMBs) from the kernel using device_online(), which
> >> will go to ZONE_NORMAL.
> >>
> >> When trying to remove memory, it simply scans for offlineable 16MB
> >> memory blocks (==section == LMB), offlines and removes them. No need for
> >> the movable zone and all the involved issues.
> > 
> > Yes, this is a different one, thanks for pointing it out. It sounds like
> > balloon driver in virt platform, doesn't it?
> 
> With DLPAR there is a hypervisor involved (which manages the actual HW
> DIMMs), so yes.
> 
> > 
> > Avoiding to put kexec kernel into movable zone can't solve this DLPAR
> > case as you said.
> > 
> >>
> >> Now, the interesting question is, can we have LMBs added during boot
> >> (not via add_memory()), that will later be removed via remove_memory().
> >> IIRC, we had BUGs related to that, so I think yes. If a section contains
> >> no unmovable allocations (after boot), it can get removed.
> > 
> > I do want to ask this question. If we can add LMB into system RAM, then
> > reload kexec can solve it. 
> > 
> > Another better way is adding a common function to filter out the
> > movable zone when search position for kexec kernel, use a arch specific
> > funciton to filter out DLPAR memory blocks for ppc only. Over there,
> > we can simply use for_each_drmem_lmb() to do that.
> 
> I was thinking about something similar. Maybe something like a notifier
> that can be used to test if selected memory can be used for kexec

Not sure if I get the notifier idea clearly. If you mean 

1) Add a common function to pick memory in unmovable zone;
2) Let DLPAR, balloon register with notifier;
3) In the common function, ask notified part to check if the picked
   unmovable memory is available for locating kexec kernel;

Sounds doable to me, and not complicated.

> images. It would apply to
> 
> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
>   be used).

Do you mean hot added memory after boot can't be recognized and added
into system RAM on arm64?


> - powerpc to filter out all LMBs that can be removed (assuming not all
>   memory corresponds to LMBs that can be removed, otherwise we're in
>   trouble ... :) )
> - virtio-mem to filter out all memory it added.
> - hyper-v to filter out partially backed memory blocks (esp. the last
>   memory block it added and only partially backed it by memory).
> 
> This would make it work for kexec_file_load(), however, I do wonder how
> we would want to approach that from userspace kexec-tools when handling
> it from kexec_load().

Let's make kexec_file_load work firstly. Since this work is only first
step to make kexec-ed kernel not break memory hotplug. After kexec
rebooting, the KASLR may locate kernel into hotpluggable area too.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
                   ` (6 preceding siblings ...)
  2020-03-31  3:38 ` Dave Young
@ 2020-04-15 20:29 ` Eric W. Biederman
  2020-04-22 12:14   ` James Morse
  7 siblings, 1 reply; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-15 20:29 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

James Morse <james.morse@arm.com> writes:

> Hello!
>
> arm64 recently queued support for memory hotremove, which led to some
> new corner cases for kexec.
>
> If the kexec segments are loaded for a removable region, that region may
> be removed before kexec actually occurs. This causes the first kernel to
> lockup when applying the relocations. (I've triggered this on x86 too).
>
> The first patch adds a memory notifier for kexec so that it can refuse
> to allow in-use regions to be taken offline.
>
>
> This doesn't solve the problem for arm64, where the new kernel must
> initially rely on the data structures from the first boot to describe
> memory. These don't describe hotpluggable memory.
> If kexec places the kernel in one of these regions, it must also provide
> a DT that describes the region in which the kernel was mapped as memory.
> (and somehow ensure its always present in the future...)
>
> To prevent this from happening accidentally with unaware user-space,
> patches two and three allow arm64 to give these regions a different
> name.
>
> This is a change in behaviour for arm64 as memory hotadd and hotremove
> were added separately.
>
>
> I haven't tried kdump.
> Unaware kdump from user-space probably won't describe the hotplug
> regions if the name is different, which saves us from problems if
> the memory is no longer present at kdump time, but means the vmcore
> is incomplete.
>
>
> These patches are based on arm64's for-next/core branch, but can all
> be merged independently.

So I just looked through these quickly and I think there are real
problems here we can fix, and that are worth fixing.

However I am not thrilled with the fixes you propose.

Eric


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-03-26 18:07 ` [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image James Morse
                     ` (2 preceding siblings ...)
  2020-03-27  9:30   ` David Hildenbrand
@ 2020-04-15 20:33   ` Eric W. Biederman
  2020-04-22 12:28     ` James Morse
  3 siblings, 1 reply; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-15 20:33 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

James Morse <james.morse@arm.com> writes:

> An image loaded for kexec is not stored in place, instead its segments
> are scattered through memory, and are re-assembled when needed. In the
> meantime, the target memory may have been removed.
>
> Because mm is not aware that this memory is still in use, it allows it
> to be removed.
>
> Add a memory notifier to prevent the removal of memory regions that
> overlap with a loaded kexec image segment. e.g., when triggered from the
> Qemu console:
> | kexec_core: memory region in use
> | memory memory32: Offline failed.
>
> Signed-off-by: James Morse <james.morse@arm.com>

Given that we are talking about the destination pages for kexec
not where the loaded kernel is currently stored the description is
confusing.

Beyond that I think it would be better to simply unload the loaded
kernel at memory hotunplug time.  Usually somewhere in the loaded image
is a copy of the memory map at the time the kexec kernel was loaded.
That will invalidate the memory map as well.

All of this should be for a very brief window of a few seconds, as
the loaded kexec image is quite short.

So instead of failing in the notifier, if you could simply unload the
loaded image in the notifier I think that would be simpler and more
robust.  While still preventing the loaded image from falling over
when it starts executing.

Eric


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-03-26 18:07 ` [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names James Morse
  2020-03-27  9:59   ` David Hildenbrand
  2020-04-02  5:49   ` Dave Young
@ 2020-04-15 20:36   ` Eric W. Biederman
  2020-04-22 12:14     ` James Morse
  2020-05-09  0:45   ` Andrew Morton
  3 siblings, 1 reply; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-15 20:36 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

James Morse <james.morse@arm.com> writes:

> Memory added to the system by hotplug has a 'System RAM' resource created
> for it. This is exposed to user-space via /proc/iomem.
>
> This poses problems for kexec on arm64. If kexec decides to place the
> kernel in one of these newly onlined regions, the new kernel will find
> itself booting from a region not described as memory in the firmware
> tables.
>
> Arm64 doesn't have a structure like the e820 memory map that can be
> re-written when memory is brought online. Instead arm64 uses the UEFI
> memory map, or the memory node from the DT, sometimes both. We never
> rewrite these.
>
> Allow an architecture to specify a different name for these hotplug
> regions.

Gah.  No.

Please find a way to pass the current memory map to the loaded kexec'd
kernel.

Starting a kernel with no way for it to know what the current memory map
is just plain scary.

Eric


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name
  2020-03-26 18:07 ` [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name James Morse
  2020-03-30 19:01   ` David Hildenbrand
@ 2020-04-15 20:37   ` Eric W. Biederman
  2020-04-22 12:14     ` James Morse
  1 sibling, 1 reply; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-15 20:37 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

James Morse <james.morse@arm.com> writes:

> If kexec chooses to place the kernel in a memory region that was
> added after boot, we fail to boot as the kernel is running from a
> location that is not described as memory by the UEFI memory map or
> the original DT.
>
> To prevent unaware user-space kexec from doing this accidentally,
> give these regions a different name.

Please fix the problem and don't hack around it.

Eric


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-15  2:35                                                 ` Baoquan He
@ 2020-04-16 13:31                                                   ` David Hildenbrand
  2020-04-16 14:02                                                     ` Baoquan He
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-16 13:31 UTC (permalink / raw)
  To: Baoquan He
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel, linuxppc-dev, piliu

> Not sure if I get the notifier idea clearly. If you mean 
> 
> 1) Add a common function to pick memory in unmovable zone;

Not strictly required IMHO. But, minor detail.

> 2) Let DLPAR, balloon register with notifier;

Yeah, or virtio-mem, or any other technology that adds/removes memory
dynamically.

> 3) In the common function, ask notified part to check if the picked
>    unmovable memory is available for locating kexec kernel;

Yeah.

> 
> Sounds doable to me, and not complicated.
> 
>> images. It would apply to
>>
>> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
>>   be used).
> 
> Do you mean hot added memory after boot can't be recognized and added
> into system RAM on arm64?

See patch #3 of this patch set, which wants to avoid placing kexec
binaries on hotplugged memory. But I have no idea what the current plan
regarding arm64 is (this thread exploded :) ).

I would assume that we don't want to place kexec images on any
hotplugged (or rather: hot(un)pluggable) memory - on any architecture.

> 
> 
>> - powerpc to filter out all LMBs that can be removed (assuming not all
>>   memory corresponds to LMBs that can be removed, otherwise we're in
>>   trouble ... :) )
>> - virtio-mem to filter out all memory it added.
>> - hyper-v to filter out partially backed memory blocks (esp. the last
>>   memory block it added and only partially backed it by memory).
>>
>> This would make it work for kexec_file_load(), however, I do wonder how
>> we would want to approach that from userspace kexec-tools when handling
>> it from kexec_load().
> 
> Let's make kexec_file_load work firstly. Since this work is only first
> step to make kexec-ed kernel not break memory hotplug. After kexec
> rebooting, the KASLR may locate kernel into hotpluggable area too.

Can you elaborate how that would work?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-16 13:31                                                   ` David Hildenbrand
@ 2020-04-16 14:02                                                     ` Baoquan He
  2020-04-16 14:09                                                       ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-16 14:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel, linuxppc-dev, piliu

On 04/16/20 at 03:31pm, David Hildenbrand wrote:
> > Not sure if I get the notifier idea clearly. If you mean 
> > 
> > 1) Add a common function to pick memory in unmovable zone;
> 
> Not strictly required IMHO. But, minor detail.
> 
> > 2) Let DLPAR, balloon register with notifier;
> 
> Yeah, or virtio-mem, or any other technology that adds/removes memory
> dynamically.
> 
> > 3) In the common function, ask notified part to check if the picked
> >    unmovable memory is available for locating kexec kernel;
> 
> Yeah.

These may not be needed, please see below comment.

> 
> > 
> > Sounds doable to me, and not complicated.
> > 
> >> images. It would apply to
> >>
> >> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
> >>   be used).
> > 
> > Do you mean hot added memory after boot can't be recognized and added
> > into system RAM on arm64?
> 
> See patch #3 of this patch set, which wants to avoid placing kexec
> binaries on hotplugged memory. But I have no idea what the current plan
> regarding arm64 is (this thread exploded :) ).
> 
> I would assume that we don't want to place kexec images on any
> hotplugged (or rather: hot(un)pluggable) memory - on any architecture.

Yes, noticed that and James replied to DaveY.

Later, when I was considering to make a draft patch to do the picking of
memory from normal zone, and add a notifier, as we discussed at above, I
suddenly realized that kexec_file_load doesn't have this issue. It
traverse system RAM bottom up to get an available region to put
kernel/initrd/boot_param, etc. I can't think of a system where its
low memory could be unavailable.
> 
> > 
> > 
> >> - powerpc to filter out all LMBs that can be removed (assuming not all
> >>   memory corresponds to LMBs that can be removed, otherwise we're in
> >>   trouble ... :) )
> >> - virtio-mem to filter out all memory it added.
> >> - hyper-v to filter out partially backed memory blocks (esp. the last
> >>   memory block it added and only partially backed it by memory).
> >>
> >> This would make it work for kexec_file_load(), however, I do wonder how
> >> we would want to approach that from userspace kexec-tools when handling
> >> it from kexec_load().
> > 
> > Let's make kexec_file_load work firstly. Since this work is only first
> > step to make kexec-ed kernel not break memory hotplug. After kexec
> > rebooting, the KASLR may locate kernel into hotpluggable area too.
> 
> Can you elaborate how that would work?

Well, boot memory can be hotplugged or not after boot, they are marked
in uefi tables, the current kexec doesn't save and pass them into 2nd
kenrel, when kexec kernel bootup, it need read them and avoid them to
randomize kernel into.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-16 14:02                                                     ` Baoquan He
@ 2020-04-16 14:09                                                       ` David Hildenbrand
  2020-04-16 14:36                                                         ` Baoquan He
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-16 14:09 UTC (permalink / raw)
  To: Baoquan He
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Andrew Morton, Will Deacon,
	linux-arm-kernel, linuxppc-dev, piliu

>>> Sounds doable to me, and not complicated.
>>>
>>>> images. It would apply to
>>>>
>>>> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
>>>>   be used).
>>>
>>> Do you mean hot added memory after boot can't be recognized and added
>>> into system RAM on arm64?
>>
>> See patch #3 of this patch set, which wants to avoid placing kexec
>> binaries on hotplugged memory. But I have no idea what the current plan
>> regarding arm64 is (this thread exploded :) ).
>>
>> I would assume that we don't want to place kexec images on any
>> hotplugged (or rather: hot(un)pluggable) memory - on any architecture.
> 
> Yes, noticed that and James replied to DaveY.
> 
> Later, when I was considering to make a draft patch to do the picking of
> memory from normal zone, and add a notifier, as we discussed at above, I
> suddenly realized that kexec_file_load doesn't have this issue. It
> traverse system RAM bottom up to get an available region to put
> kernel/initrd/boot_param, etc. I can't think of a system where its
> low memory could be unavailable.

kexec_walk_memblock() has the option for "kbuf->top_down". Only
kexec_walk_resources() seems to ignore it.

So I think in case of memblocks (e.g., arm64), this still applies?

>>
>>>
>>>
>>>> - powerpc to filter out all LMBs that can be removed (assuming not all
>>>>   memory corresponds to LMBs that can be removed, otherwise we're in
>>>>   trouble ... :) )
>>>> - virtio-mem to filter out all memory it added.
>>>> - hyper-v to filter out partially backed memory blocks (esp. the last
>>>>   memory block it added and only partially backed it by memory).
>>>>
>>>> This would make it work for kexec_file_load(), however, I do wonder how
>>>> we would want to approach that from userspace kexec-tools when handling
>>>> it from kexec_load().
>>>
>>> Let's make kexec_file_load work firstly. Since this work is only first
>>> step to make kexec-ed kernel not break memory hotplug. After kexec
>>> rebooting, the KASLR may locate kernel into hotpluggable area too.
>>
>> Can you elaborate how that would work?
> 
> Well, boot memory can be hotplugged or not after boot, they are marked
> in uefi tables, the current kexec doesn't save and pass them into 2nd
> kenrel, when kexec kernel bootup, it need read them and avoid them to
> randomize kernel into.

What about e.g., memory hotplugged by ACPI? I would assume, that the
kexec kernel will not make use of that (IOW detected that) until the
ACPI driver comes up and re-detects + adds that memory.

Or how would that machinery work in case we have a DIMM hotplugged via ACPI?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-16 14:09                                                       ` David Hildenbrand
@ 2020-04-16 14:36                                                         ` Baoquan He
  2020-04-16 14:47                                                           ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-16 14:36 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

On 04/16/20 at 04:09pm, David Hildenbrand wrote:
> >>> Sounds doable to me, and not complicated.
> >>>
> >>>> images. It would apply to
> >>>>
> >>>> - arm64 and filter out all hotadded memory (IIRC, only boot memory can
> >>>>   be used).
> >>>
> >>> Do you mean hot added memory after boot can't be recognized and added
> >>> into system RAM on arm64?
> >>
> >> See patch #3 of this patch set, which wants to avoid placing kexec
> >> binaries on hotplugged memory. But I have no idea what the current plan
> >> regarding arm64 is (this thread exploded :) ).
> >>
> >> I would assume that we don't want to place kexec images on any
> >> hotplugged (or rather: hot(un)pluggable) memory - on any architecture.
> > 
> > Yes, noticed that and James replied to DaveY.
> > 
> > Later, when I was considering to make a draft patch to do the picking of
> > memory from normal zone, and add a notifier, as we discussed at above, I
> > suddenly realized that kexec_file_load doesn't have this issue. It
> > traverse system RAM bottom up to get an available region to put
> > kernel/initrd/boot_param, etc. I can't think of a system where its
> > low memory could be unavailable.
> 
> kexec_walk_memblock() has the option for "kbuf->top_down". Only
> kexec_walk_resources() seems to ignore it.

Yeah, that top down searching is done in a found low mem area. Means
firstly search an available region bottom up, then put kernel top down
in that region. The reason is our iomem res is linked with singly linked
list. So we can only search bottom up efficiently.

kexec_load is doing the real top down searching, so kernel will be put
at the top of system ram. I ever tried to change it to support top down
searching for kexec_file_load too with patches, since QE and customers
are often confused with this difference when debugging.

Andrew may remeber this, he suggested me to change the singly linked list 
to doubly linked list for iomem res, then do the top down searching for
kexec_file_load. I tried with some effort, the change introduced too much
code change, I just gave up finally.

http://archive.lwn.net:8080/devicetree/20180718024944.577-1-bhe@redhat.com/

I can see that top down searching for kexec can avoid the highly used
low memory region, esp under 4G, for dma, kinds of firmware reserving,
etc. And customers/QE of kexec get used to it. I can change kexec_file_load
to top down too with a simple way if people really complain it. But now, 
seems bottom up is not bad too.

> 
> So I think in case of memblocks (e.g., arm64), this still applies?

Yeah, aren't you trying to remove it? I haven't read your patches
carefully, maybe I got it wrong. And arm64 even can't support the hot added
memory being able to recorded into firmware, seems it's not so ready, 
won't they change that design in the future?
> 
> >>
> >>>
> >>>
> >>>> - powerpc to filter out all LMBs that can be removed (assuming not all
> >>>>   memory corresponds to LMBs that can be removed, otherwise we're in
> >>>>   trouble ... :) )
> >>>> - virtio-mem to filter out all memory it added.
> >>>> - hyper-v to filter out partially backed memory blocks (esp. the last
> >>>>   memory block it added and only partially backed it by memory).
> >>>>
> >>>> This would make it work for kexec_file_load(), however, I do wonder how
> >>>> we would want to approach that from userspace kexec-tools when handling
> >>>> it from kexec_load().
> >>>
> >>> Let's make kexec_file_load work firstly. Since this work is only first
> >>> step to make kexec-ed kernel not break memory hotplug. After kexec
> >>> rebooting, the KASLR may locate kernel into hotpluggable area too.
> >>
> >> Can you elaborate how that would work?
> > 
> > Well, boot memory can be hotplugged or not after boot, they are marked
> > in uefi tables, the current kexec doesn't save and pass them into 2nd
> > kenrel, when kexec kernel bootup, it need read them and avoid them to
> > randomize kernel into.
> 
> What about e.g., memory hotplugged by ACPI? I would assume, that the
> kexec kernel will not make use of that (IOW detected that) until the
> ACPI driver comes up and re-detects + adds that memory.
> 
> Or how would that machinery work in case we have a DIMM hotplugged via ACPI?

ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
pass the efi, it won't get the SRAT table correctly, if I remember
correctly. Yeah, I remeber kvm guest can get memory hotplugged with
ACPI only, this won't happen on bare metal though. Need check carefully. 
I have been using kvm guest with uefi firmwire recently.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-16 14:36                                                         ` Baoquan He
@ 2020-04-16 14:47                                                           ` David Hildenbrand
  2020-04-21 13:29                                                             ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-16 14:47 UTC (permalink / raw)
  To: Baoquan He, Andrew Morton
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

>> kexec_walk_memblock() has the option for "kbuf->top_down". Only
>> kexec_walk_resources() seems to ignore it.
> 
> Yeah, that top down searching is done in a found low mem area. Means
> firstly search an available region bottom up, then put kernel top down
> in that region. The reason is our iomem res is linked with singly linked
> list. So we can only search bottom up efficiently.
> 
> kexec_load is doing the real top down searching, so kernel will be put
> at the top of system ram. I ever tried to change it to support top down
> searching for kexec_file_load too with patches, since QE and customers
> are often confused with this difference when debugging.
> 
> Andrew may remeber this, he suggested me to change the singly linked list 
> to doubly linked list for iomem res, then do the top down searching for
> kexec_file_load. I tried with some effort, the change introduced too much
> code change, I just gave up finally.

Well, at least right now this seems to be the right approach (hotplug),
lol :)

> 
> http://archive.lwn.net:8080/devicetree/20180718024944.577-1-bhe@redhat.com/
> 
> I can see that top down searching for kexec can avoid the highly used
> low memory region, esp under 4G, for dma, kinds of firmware reserving,
> etc. And customers/QE of kexec get used to it. I can change kexec_file_load
> to top down too with a simple way if people really complain it. But now, 
> seems bottom up is not bad too.

Ah, I understand the problem. Maybe a simple "optimization" would be to
start searching bottom-up from e.g.,2GB/4GB first. If nothing was found,
search botoom-up from 0-2GB/4GB etc.

> 
>>
>> So I think in case of memblocks (e.g., arm64), this still applies?
> 
> Yeah, aren't you trying to remove it? I haven't read your patches
> carefully, maybe I got it wrong. And arm64 even can't support the hot added

For arm64 we're still creating memblocks for hotplugged memory, but I
guess it's not too hard to stop doing that.

> memory being able to recorded into firmware, seems it's not so ready, 
> won't they change that design in the future?

It seems to be incomplete, yes. No idea if it's fixable, no arm64 expert ...


>>>>>> - powerpc to filter out all LMBs that can be removed (assuming not all
>>>>>>   memory corresponds to LMBs that can be removed, otherwise we're in
>>>>>>   trouble ... :) )
>>>>>> - virtio-mem to filter out all memory it added.
>>>>>> - hyper-v to filter out partially backed memory blocks (esp. the last
>>>>>>   memory block it added and only partially backed it by memory).
>>>>>>
>>>>>> This would make it work for kexec_file_load(), however, I do wonder how
>>>>>> we would want to approach that from userspace kexec-tools when handling
>>>>>> it from kexec_load().
>>>>>
>>>>> Let's make kexec_file_load work firstly. Since this work is only first
>>>>> step to make kexec-ed kernel not break memory hotplug. After kexec
>>>>> rebooting, the KASLR may locate kernel into hotpluggable area too.
>>>>
>>>> Can you elaborate how that would work?
>>>
>>> Well, boot memory can be hotplugged or not after boot, they are marked
>>> in uefi tables, the current kexec doesn't save and pass them into 2nd
>>> kenrel, when kexec kernel bootup, it need read them and avoid them to
>>> randomize kernel into.
>>
>> What about e.g., memory hotplugged by ACPI? I would assume, that the
>> kexec kernel will not make use of that (IOW detected that) until the
>> ACPI driver comes up and re-detects + adds that memory.
>>
>> Or how would that machinery work in case we have a DIMM hotplugged via ACPI?
> 
> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
> pass the efi, it won't get the SRAT table correctly, if I remember
> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
> ACPI only, this won't happen on bare metal though. Need check carefully. 
> I have been using kvm guest with uefi firmwire recently.

Yeah, I can imagine that bare metal is different. kvm only uses ACPI.

I'm also asking because of virtio-mem. Memory added via virtio-mem is
not part of any efi tables or whatsoever. So I assume the kexec kernel
will not detect it automatically (good!), instead load the virtio-mem
driver and let it add memory back to the system.

I should probably play with kexec and virtio-mem once I have some spare
cycles ... to find out what's broken and needs to be addressed :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-16 14:47                                                           ` David Hildenbrand
@ 2020-04-21 13:29                                                             ` David Hildenbrand
  2020-04-21 13:57                                                               ` David Hildenbrand
                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-04-21 13:29 UTC (permalink / raw)
  To: Baoquan He, Andrew Morton
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

>> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
>> pass the efi, it won't get the SRAT table correctly, if I remember
>> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
>> ACPI only, this won't happen on bare metal though. Need check carefully. 
>> I have been using kvm guest with uefi firmwire recently.
> 
> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> 
> I'm also asking because of virtio-mem. Memory added via virtio-mem is
> not part of any efi tables or whatsoever. So I assume the kexec kernel
> will not detect it automatically (good!), instead load the virtio-mem
> driver and let it add memory back to the system.
> 
> I should probably play with kexec and virtio-mem once I have some spare
> cycles ... to find out what's broken and needs to be addressed :)

FWIW, I just gave virtio-mem and kexec/kdump a try.

a) kdump seems to work. Memory added by virtio-mem is getting dumped.
The kexec kernel only uses memory in the crash region. The virtio-mem
driver properly bails out due to is_kdump_kernel().

b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
to get placed on virtio-mem memory (pure luck due to the left-to-right
search). Memory added by virtio-mem is not getting added to the e820
map. Once the virtio-mem driver comes back up in the kexec kernel, the
right memory is readded.

c) "kexec -c -l" does not work properly. All memory added by virtio-mem
is added to the e820 map, which is wrong. Memory that should not be
touched will be touched by the kexec kernel. I assume kexec-tools just
goes ahead and adds anything it can find in /proc/iomem (or
/sys/firmware/memmap/) to the e820 map of the new kernel.

Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
similarly added to the e820 map and, therefore, won't be able to be
onlined MOVABLE easily.


At least for virtio-mem, I would either have to
a) Not support "kexec -c -l". A viable option if we would be planning on
not supporting it either way in the long term. I could block this
in-kernel somehow eventually.

b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
indicating it in /proc/iomem in a special way ("System RAM
(hotplugged)"/"System RAM (virtio-mem)").

Baoquan, any opinion on that?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-21 13:29                                                             ` David Hildenbrand
@ 2020-04-21 13:57                                                               ` David Hildenbrand
  2020-04-21 13:59                                                               ` Eric W. Biederman
  2020-04-22  9:17                                                               ` Baoquan He
  2 siblings, 0 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-04-21 13:57 UTC (permalink / raw)
  To: Baoquan He, Andrew Morton
  Cc: Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

On 21.04.20 15:29, David Hildenbrand wrote:
>>> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
>>> pass the efi, it won't get the SRAT table correctly, if I remember
>>> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
>>> ACPI only, this won't happen on bare metal though. Need check carefully. 
>>> I have been using kvm guest with uefi firmwire recently.
>>
>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>>
>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>> will not detect it automatically (good!), instead load the virtio-mem
>> driver and let it add memory back to the system.
>>
>> I should probably play with kexec and virtio-mem once I have some spare
>> cycles ... to find out what's broken and needs to be addressed :)
> 
> FWIW, I just gave virtio-mem and kexec/kdump a try.
> 
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().
> 
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.
> 
> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
> 
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.
> 
> 
> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.
> 
> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").

I just realized, that *not* creating /sys/firmware/memmap/ entries for
virtio-mem memory seems to be the right thing to do.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-21 13:29                                                             ` David Hildenbrand
  2020-04-21 13:57                                                               ` David Hildenbrand
@ 2020-04-21 13:59                                                               ` Eric W. Biederman
  2020-04-21 14:30                                                                 ` David Hildenbrand
  2020-04-22  9:17                                                               ` Baoquan He
  2 siblings, 1 reply; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-21 13:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Baoquan He, Andrew Morton, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

David Hildenbrand <david@redhat.com> writes:

>>> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
>>> pass the efi, it won't get the SRAT table correctly, if I remember
>>> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
>>> ACPI only, this won't happen on bare metal though. Need check carefully. 
>>> I have been using kvm guest with uefi firmwire recently.
>> 
>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>> 
>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>> will not detect it automatically (good!), instead load the virtio-mem
>> driver and let it add memory back to the system.
>> 
>> I should probably play with kexec and virtio-mem once I have some spare
>> cycles ... to find out what's broken and needs to be addressed :)
>
> FWIW, I just gave virtio-mem and kexec/kdump a try.
>
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().
>
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.

This sounds like a bug.

> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
>
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.

This sounds like correct behavior to me.  If you add memory to the
system it is treated as memory to the system.

If we need to make it a special kind of memory with special rules we can
have some kind of special marking for the memory.  But hotplugged is not
in itself a sufficient criteria to say don't use this as normal memory.

If take a huge server and I plug in an extra dimm it is just memory.

For a similarly huge server I might want to have memory that the system
booted with unpluggable, in case hardware error reporting notices
a dimm generating a lot of memory errors.

Now perhaps virtualization needs a special tier of memory that should
only be used for cases where the memory is easily movable.

I am not familiar with virtio-mem but my skim of the initial design
is that virtio-mem was not designed to be such a special tier of memory.
Perhaps something has changed?
https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html


> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.

No.

> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").

How does the kernel memory allocator treat this memory?

The logic is simple.  If the kernel memory allocator treats that memory
as ordinary memory available for all uses it should be presented as
ordinary memory available for all uses.

If the kernel memory allocator treats that memory as special memory
only available for uses that we can easily free later and give back to
the system.  AKA it is special and not oridinary memory we should mark
it as such.

Eric

p.s.  Please excuse me for jumping in I may be missing some important
context, but what I read when I saw this message in my inbox just seemed
very wrong.




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-21 13:59                                                               ` Eric W. Biederman
@ 2020-04-21 14:30                                                                 ` David Hildenbrand
  0 siblings, 0 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-04-21 14:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Baoquan He, Andrew Morton, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu


>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
>> to get placed on virtio-mem memory (pure luck due to the left-to-right
>> search). Memory added by virtio-mem is not getting added to the e820
>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
>> right memory is readded.
> 
> This sounds like a bug.

This is how virtio-mem wants its memory to get handled.

> 
>> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
>> is added to the e820 map, which is wrong. Memory that should not be
>> touched will be touched by the kexec kernel. I assume kexec-tools just
>> goes ahead and adds anything it can find in /proc/iomem (or
>> /sys/firmware/memmap/) to the e820 map of the new kernel.
>>
>> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
>> similarly added to the e820 map and, therefore, won't be able to be
>> onlined MOVABLE easily.
> 
> This sounds like correct behavior to me.  If you add memory to the
> system it is treated as memory to the system.

Yeah, I would agree if we are talking about DIMMs, but this memory is
special. It's added via a paravirtualized interface and will contain
holes, especially after unplug. While memory in these holes can usually
be read, it should not be written. More on that below.

> 
> If we need to make it a special kind of memory with special rules we can
> have some kind of special marking for the memory.  But hotplugged is not
> in itself a sufficient criteria to say don't use this as normal memory.

Agreed. It is special, though.

> 
> If take a huge server and I plug in an extra dimm it is just memory.

Agreed.

[...]

> 
> Now perhaps virtualization needs a special tier of memory that should
> only be used for cases where the memory is easily movable.
> 
> I am not familiar with virtio-mem but my skim of the initial design
> is that virtio-mem was not designed to be such a special tier of memory.
> Perhaps something has changed?
> https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html

Yes, a lot changed. See
https://lkml.kernel.org/r/20200311171422.10484-1-david@redhat.com for
the latest-greatest design overview.


> 
>> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
>> indicating it in /proc/iomem in a special way ("System RAM
>> (hotplugged)"/"System RAM (virtio-mem)").
> 
> How does the kernel memory allocator treat this memory?

So what virtio-mem does is add memory sections on demand and populate
within these sections the requested amount of memory. E.g., if 64MB are
requested, it will add a 128MB section/resource but only make the first
64MB accessible (via the hypervisor) and only give the first 64MB to the
buddy. This way of adding memory is similar to what XEN and hypver-v
balloon drivers do when hotplugging memory.

When requested to plug more memory, it might go ahead and make (parts
of) the remaining 64MB accessible and give them to the buddy. In case it
cannot "fill any holes", it will add a new section.

When requested to unplug memory, it will try to remove memory from the
added (here 64MB) memory from the buddy and tell the hypervisor about it.

So, it has some similarity to ballooning in virtual environment,
however, it manages its own device memory only and can therefore give
better guarantees and detect malicious guests.

Right now, I think the right approach would be to not create
/sys/firmware/memmap entries from memory virtio-mem added.

[...]

> 
> p.s.  Please excuse me for jumping in I may be missing some important
> context, but what I read when I saw this message in my inbox just seemed
> very wrong.

Yeah, still, thanks for having a look. Please let me know if you need
more information.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-21 13:29                                                             ` David Hildenbrand
  2020-04-21 13:57                                                               ` David Hildenbrand
  2020-04-21 13:59                                                               ` Eric W. Biederman
@ 2020-04-22  9:17                                                               ` Baoquan He
  2020-04-22  9:24                                                                 ` David Hildenbrand
  2 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-22  9:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
> >> pass the efi, it won't get the SRAT table correctly, if I remember
> >> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
> >> ACPI only, this won't happen on bare metal though. Need check carefully. 
> >> I have been using kvm guest with uefi firmwire recently.
> > 
> > Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> > 
> > I'm also asking because of virtio-mem. Memory added via virtio-mem is
> > not part of any efi tables or whatsoever. So I assume the kexec kernel
> > will not detect it automatically (good!), instead load the virtio-mem
> > driver and let it add memory back to the system.
> > 
> > I should probably play with kexec and virtio-mem once I have some spare
> > cycles ... to find out what's broken and needs to be addressed :)
> 
> FWIW, I just gave virtio-mem and kexec/kdump a try.
> 
> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> The kexec kernel only uses memory in the crash region. The virtio-mem
> driver properly bails out due to is_kdump_kernel().

Right, kdump is not impacted later added memory.

> 
> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> to get placed on virtio-mem memory (pure luck due to the left-to-right
> search). Memory added by virtio-mem is not getting added to the e820
> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> right memory is readded.

kexec_file_load just behaves as you tested. It doesn't collect later
added memory to e820 because it uses e820_table_kexec directly to pass
e820 to kexec-ed kernel. However, this e820_table_kexec is only updated
during boot stage. I tried hot adding DIMM after boot, kexec-ed kernel
doesn't have it in e820 during bootup, but it's recoginized and added
when ACPI scanning. I think we should update e820_table_kexec when hot
add/remove memory, at least for DIMM. Not sure if DLPAR, virtio-mem,
balloon will need be added into e820_table_kexec too, and if this is
expected behaviour.

But whatever we do, it won't impact the kexec file_loading, because of
the searching strategy bottom up. Just adding them into e820_table_kexec
will make it consistent with cold reboot which get recognizes and get
them into e820 during bootup.
> 
> c) "kexec -c -l" does not work properly. All memory added by virtio-mem
> is added to the e820 map, which is wrong. Memory that should not be
> touched will be touched by the kexec kernel. I assume kexec-tools just
> goes ahead and adds anything it can find in /proc/iomem (or
> /sys/firmware/memmap/) to the e820 map of the new kernel.
> 
> Due to c), I assume all hotplugged memory (e.g., ACPI DIMMs) is
> similarly added to the e820 map and, therefore, won't be able to be
> onlined MOVABLE easily.

Yes, kexec_load will read memory regions from /sys/firmware/memmap/ or
/proc/iomem. Making it right seems a little harder, we can export them
to /proc/iomem or /sys/firmware/memmap/ with mark them with 'hotplug',
but the attribute that which zone they belongs to is not easy to tell.

We are proactive on widely testing kexec_file_load on x86_64, s390,
arm64 by adding test cases into CKI.

> 
> 
> At least for virtio-mem, I would either have to
> a) Not support "kexec -c -l". A viable option if we would be planning on
> not supporting it either way in the long term. I could block this
> in-kernel somehow eventually.
> 
> b) Teach kexec-tools to leave virtio-mem added memory alone. E.g., by
> indicating it in /proc/iomem in a special way ("System RAM
> (hotplugged)"/"System RAM (virtio-mem)").
> 
> Baoquan, any opinion on that?
> 
> -- 
> Thanks,
> 
> David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-22  9:17                                                               ` Baoquan He
@ 2020-04-22  9:24                                                                 ` David Hildenbrand
  2020-04-22  9:57                                                                   ` Baoquan He
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-22  9:24 UTC (permalink / raw)
  To: Baoquan He
  Cc: Andrew Morton, Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

On 22.04.20 11:17, Baoquan He wrote:
> On 04/21/20 at 03:29pm, David Hildenbrand wrote:
>>>> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
>>>> pass the efi, it won't get the SRAT table correctly, if I remember
>>>> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
>>>> ACPI only, this won't happen on bare metal though. Need check carefully. 
>>>> I have been using kvm guest with uefi firmwire recently.
>>>
>>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>>>
>>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>>> will not detect it automatically (good!), instead load the virtio-mem
>>> driver and let it add memory back to the system.
>>>
>>> I should probably play with kexec and virtio-mem once I have some spare
>>> cycles ... to find out what's broken and needs to be addressed :)
>>
>> FWIW, I just gave virtio-mem and kexec/kdump a try.
>>
>> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
>> The kexec kernel only uses memory in the crash region. The virtio-mem
>> driver properly bails out due to is_kdump_kernel().
> 
> Right, kdump is not impacted later added memory.
> 
>>
>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
>> to get placed on virtio-mem memory (pure luck due to the left-to-right
>> search). Memory added by virtio-mem is not getting added to the e820
>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
>> right memory is readded.
> 
> kexec_file_load just behaves as you tested. It doesn't collect later
> added memory to e820 because it uses e820_table_kexec directly to pass
> e820 to kexec-ed kernel. However, this e820_table_kexec is only updated
> during boot stage. I tried hot adding DIMM after boot, kexec-ed kernel
> doesn't have it in e820 during bootup, but it's recoginized and added
> when ACPI scanning. I think we should update e820_table_kexec when hot
> add/remove memory, at least for DIMM. Not sure if DLPAR, virtio-mem,
> balloon will need be added into e820_table_kexec too, and if this is
> expected behaviour.
> 
> But whatever we do, it won't impact the kexec file_loading, because of
> the searching strategy bottom up. Just adding them into e820_table_kexec
> will make it consistent with cold reboot which get recognizes and get
> them into e820 during bootup.

Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
kernel should see. Not more, not less.

Regarding virtio-mem: Not in e820 on cold-boot.
Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
IIRC. I think on real HW it can be different.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-22  9:24                                                                 ` David Hildenbrand
@ 2020-04-22  9:57                                                                   ` Baoquan He
  2020-04-22 10:05                                                                     ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: Baoquan He @ 2020-04-22  9:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

On 04/22/20 at 11:24am, David Hildenbrand wrote:
> On 22.04.20 11:17, Baoquan He wrote:
> > On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >>>> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
> >>>> pass the efi, it won't get the SRAT table correctly, if I remember
> >>>> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
> >>>> ACPI only, this won't happen on bare metal though. Need check carefully. 
> >>>> I have been using kvm guest with uefi firmwire recently.
> >>>
> >>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> >>>
> >>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
> >>> not part of any efi tables or whatsoever. So I assume the kexec kernel
> >>> will not detect it automatically (good!), instead load the virtio-mem
> >>> driver and let it add memory back to the system.
> >>>
> >>> I should probably play with kexec and virtio-mem once I have some spare
> >>> cycles ... to find out what's broken and needs to be addressed :)
> >>
> >> FWIW, I just gave virtio-mem and kexec/kdump a try.
> >>
> >> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> >> The kexec kernel only uses memory in the crash region. The virtio-mem
> >> driver properly bails out due to is_kdump_kernel().
> > 
> > Right, kdump is not impacted later added memory.
> > 
> >>
> >> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> >> to get placed on virtio-mem memory (pure luck due to the left-to-right
> >> search). Memory added by virtio-mem is not getting added to the e820
> >> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> >> right memory is readded.
> > 
> > kexec_file_load just behaves as you tested. It doesn't collect later
> > added memory to e820 because it uses e820_table_kexec directly to pass
> > e820 to kexec-ed kernel. However, this e820_table_kexec is only updated
> > during boot stage. I tried hot adding DIMM after boot, kexec-ed kernel
> > doesn't have it in e820 during bootup, but it's recoginized and added
> > when ACPI scanning. I think we should update e820_table_kexec when hot
> > add/remove memory, at least for DIMM. Not sure if DLPAR, virtio-mem,
> > balloon will need be added into e820_table_kexec too, and if this is
> > expected behaviour.
> > 
> > But whatever we do, it won't impact the kexec file_loading, because of
> > the searching strategy bottom up. Just adding them into e820_table_kexec
> > will make it consistent with cold reboot which get recognizes and get
> > them into e820 during bootup.
> 
> Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
> kernel should see. Not more, not less.
> 
> Regarding virtio-mem: Not in e820 on cold-boot.
> Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
> IIRC. I think on real HW it can be different.

Yeah, DIMMs under KVM won't show up in e820 map. While this is not feature
of QEMU/KVM, but a defect of it. I ever asked Igor who is developer of
QEMU/KVM guest in this area, why we don't make kvm guest recognize
hotpluggable DIMM and add it into e820 map, he said he had tried to make
it, but this will corrupt guest on HyperV. So he had to revert the
commit on qemu. So I think we can leave it for now for both real HW and
kvm, or update the e820_table_kexec to include added DIMM for both real
HW and KVM. I hope one day KVM dev will find a way to conquer the defect
on HyperV and make the e820map consistent with bare metal. After all,
kvm guest is trying to imitate real HW for the most part.

Anyway, I will think about the e820_table_kexec updating. See if we can
do something about it.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-22  9:57                                                                   ` Baoquan He
@ 2020-04-22 10:05                                                                     ` David Hildenbrand
  2020-04-22 10:36                                                                       ` Baoquan He
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-22 10:05 UTC (permalink / raw)
  To: Baoquan He
  Cc: Andrew Morton, Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

On 22.04.20 11:57, Baoquan He wrote:
> On 04/22/20 at 11:24am, David Hildenbrand wrote:
>> On 22.04.20 11:17, Baoquan He wrote:
>>> On 04/21/20 at 03:29pm, David Hildenbrand wrote:
>>>>>> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
>>>>>> pass the efi, it won't get the SRAT table correctly, if I remember
>>>>>> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
>>>>>> ACPI only, this won't happen on bare metal though. Need check carefully. 
>>>>>> I have been using kvm guest with uefi firmwire recently.
>>>>>
>>>>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
>>>>>
>>>>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
>>>>> not part of any efi tables or whatsoever. So I assume the kexec kernel
>>>>> will not detect it automatically (good!), instead load the virtio-mem
>>>>> driver and let it add memory back to the system.
>>>>>
>>>>> I should probably play with kexec and virtio-mem once I have some spare
>>>>> cycles ... to find out what's broken and needs to be addressed :)
>>>>
>>>> FWIW, I just gave virtio-mem and kexec/kdump a try.
>>>>
>>>> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
>>>> The kexec kernel only uses memory in the crash region. The virtio-mem
>>>> driver properly bails out due to is_kdump_kernel().
>>>
>>> Right, kdump is not impacted later added memory.
>>>
>>>>
>>>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
>>>> to get placed on virtio-mem memory (pure luck due to the left-to-right
>>>> search). Memory added by virtio-mem is not getting added to the e820
>>>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
>>>> right memory is readded.
>>>
>>> kexec_file_load just behaves as you tested. It doesn't collect later
>>> added memory to e820 because it uses e820_table_kexec directly to pass
>>> e820 to kexec-ed kernel. However, this e820_table_kexec is only updated
>>> during boot stage. I tried hot adding DIMM after boot, kexec-ed kernel
>>> doesn't have it in e820 during bootup, but it's recoginized and added
>>> when ACPI scanning. I think we should update e820_table_kexec when hot
>>> add/remove memory, at least for DIMM. Not sure if DLPAR, virtio-mem,
>>> balloon will need be added into e820_table_kexec too, and if this is
>>> expected behaviour.
>>>
>>> But whatever we do, it won't impact the kexec file_loading, because of
>>> the searching strategy bottom up. Just adding them into e820_table_kexec
>>> will make it consistent with cold reboot which get recognizes and get
>>> them into e820 during bootup.
>>
>> Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
>> kernel should see. Not more, not less.
>>
>> Regarding virtio-mem: Not in e820 on cold-boot.
>> Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
>> IIRC. I think on real HW it can be different.
> 
> Yeah, DIMMs under KVM won't show up in e820 map. While this is not feature
> of QEMU/KVM, but a defect of it. I ever asked Igor who is developer of
> QEMU/KVM guest in this area, why we don't make kvm guest recognize
> hotpluggable DIMM and add it into e820 map, he said he had tried to make
> it, but this will corrupt guest on HyperV. So he had to revert the

Yeah, I remember that this had to be reverted due to something breaking.
But OTOH, it allows us to online coldplugged DIMMs online_movable
easily, so I'd say it's even a feature (although, does not behave like
real HW we have).

I use this extensively when testing memory hot(un)plug via coldplugged
DIMMs.

I do wonder if there is real HW, where this is also the case.

> commit on qemu. So I think we can leave it for now for both real HW and
> kvm, or update the e820_table_kexec to include added DIMM for both real
> HW and KVM. I hope one day KVM dev will find a way to conquer the defect
> on HyperV and make the e820map consistent with bare metal. After all,
> kvm guest is trying to imitate real HW for the most part.
> 
> Anyway, I will think about the e820_table_kexec updating. See if we can
> do something about it.

Yeah, for DIMMs on real HW it might definitely make sense. We might be
able to hook into updates of /sys/firmware/memmap on memory add/remove.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-22 10:05                                                                     ` David Hildenbrand
@ 2020-04-22 10:36                                                                       ` Baoquan He
  0 siblings, 0 replies; 92+ messages in thread
From: Baoquan He @ 2020-04-22 10:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Russell King - ARM Linux admin,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma, kexec,
	linux-mm, James Morse, Will Deacon, linux-arm-kernel,
	linuxppc-dev, piliu

On 04/22/20 at 12:05pm, David Hildenbrand wrote:
> On 22.04.20 11:57, Baoquan He wrote:
> > On 04/22/20 at 11:24am, David Hildenbrand wrote:
> >> On 22.04.20 11:17, Baoquan He wrote:
> >>> On 04/21/20 at 03:29pm, David Hildenbrand wrote:
> >>>>>> ACPI SRAT is embeded into efi, need read out the rsdp pointer. If we don't
> >>>>>> pass the efi, it won't get the SRAT table correctly, if I remember
> >>>>>> correctly. Yeah, I remeber kvm guest can get memory hotplugged with
> >>>>>> ACPI only, this won't happen on bare metal though. Need check carefully. 
> >>>>>> I have been using kvm guest with uefi firmwire recently.
> >>>>>
> >>>>> Yeah, I can imagine that bare metal is different. kvm only uses ACPI.
> >>>>>
> >>>>> I'm also asking because of virtio-mem. Memory added via virtio-mem is
> >>>>> not part of any efi tables or whatsoever. So I assume the kexec kernel
> >>>>> will not detect it automatically (good!), instead load the virtio-mem
> >>>>> driver and let it add memory back to the system.
> >>>>>
> >>>>> I should probably play with kexec and virtio-mem once I have some spare
> >>>>> cycles ... to find out what's broken and needs to be addressed :)
> >>>>
> >>>> FWIW, I just gave virtio-mem and kexec/kdump a try.
> >>>>
> >>>> a) kdump seems to work. Memory added by virtio-mem is getting dumped.
> >>>> The kexec kernel only uses memory in the crash region. The virtio-mem
> >>>> driver properly bails out due to is_kdump_kernel().
> >>>
> >>> Right, kdump is not impacted later added memory.
> >>>
> >>>>
> >>>> b) "kexec -s -l" seems to work fine. For now, the kernel does not seem
> >>>> to get placed on virtio-mem memory (pure luck due to the left-to-right
> >>>> search). Memory added by virtio-mem is not getting added to the e820
> >>>> map. Once the virtio-mem driver comes back up in the kexec kernel, the
> >>>> right memory is readded.
> >>>
> >>> kexec_file_load just behaves as you tested. It doesn't collect later
> >>> added memory to e820 because it uses e820_table_kexec directly to pass
> >>> e820 to kexec-ed kernel. However, this e820_table_kexec is only updated
> >>> during boot stage. I tried hot adding DIMM after boot, kexec-ed kernel
> >>> doesn't have it in e820 during bootup, but it's recoginized and added
> >>> when ACPI scanning. I think we should update e820_table_kexec when hot
> >>> add/remove memory, at least for DIMM. Not sure if DLPAR, virtio-mem,
> >>> balloon will need be added into e820_table_kexec too, and if this is
> >>> expected behaviour.
> >>>
> >>> But whatever we do, it won't impact the kexec file_loading, because of
> >>> the searching strategy bottom up. Just adding them into e820_table_kexec
> >>> will make it consistent with cold reboot which get recognizes and get
> >>> them into e820 during bootup.
> >>
> >> Yeah, I think whatever a cold-booted kernel will see is what kexec-ed
> >> kernel should see. Not more, not less.
> >>
> >> Regarding virtio-mem: Not in e820 on cold-boot.
> >> Regarding DIMMs: DIMMs under KVM will never show up in the e820 map
> >> IIRC. I think on real HW it can be different.
> > 
> > Yeah, DIMMs under KVM won't show up in e820 map. While this is not feature
> > of QEMU/KVM, but a defect of it. I ever asked Igor who is developer of
> > QEMU/KVM guest in this area, why we don't make kvm guest recognize
> > hotpluggable DIMM and add it into e820 map, he said he had tried to make
> > it, but this will corrupt guest on HyperV. So he had to revert the
> 
> Yeah, I remember that this had to be reverted due to something breaking.
> But OTOH, it allows us to online coldplugged DIMMs online_movable
> easily, so I'd say it's even a feature (although, does not behave like
> real HW we have).
> 
> I use this extensively when testing memory hot(un)plug via coldplugged
> DIMMs.
> 
> I do wonder if there is real HW, where this is also the case.

None for what I know. Hotplug on real HW includes two parts, the boot
mem being hotpluggable is more flexiable one. It allows people to
replace bad DIMM. And you can see code in boot stage has been adjusted a
lot on this purpose, at that time, people haven't thought about kvm
guest.

> 
> > commit on qemu. So I think we can leave it for now for both real HW and
> > kvm, or update the e820_table_kexec to include added DIMM for both real
> > HW and KVM. I hope one day KVM dev will find a way to conquer the defect
> > on HyperV and make the e820map consistent with bare metal. After all,
> > kvm guest is trying to imitate real HW for the most part.
> > 
> > Anyway, I will think about the e820_table_kexec updating. See if we can
> > do something about it.
> 
> Yeah, for DIMMs on real HW it might definitely make sense. We might be
> able to hook into updates of /sys/firmware/memmap on memory add/remove.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-04-15 20:36   ` Eric W. Biederman
@ 2020-04-22 12:14     ` James Morse
  0 siblings, 0 replies; 92+ messages in thread
From: James Morse @ 2020-04-22 12:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

Hi Eric,

On 15/04/2020 21:36, Eric W. Biederman wrote:
> James Morse <james.morse@arm.com> writes:
> 
>> Memory added to the system by hotplug has a 'System RAM' resource created
>> for it. This is exposed to user-space via /proc/iomem.
>>
>> This poses problems for kexec on arm64. If kexec decides to place the
>> kernel in one of these newly onlined regions, the new kernel will find
>> itself booting from a region not described as memory in the firmware
>> tables.
>>
>> Arm64 doesn't have a structure like the e820 memory map that can be
>> re-written when memory is brought online. Instead arm64 uses the UEFI
>> memory map, or the memory node from the DT, sometimes both. We never
>> rewrite these.
>>
>> Allow an architecture to specify a different name for these hotplug
>> regions.
> 
> Gah.  No.
> 
> Please find a way to pass the current memory map to the loaded kexec'd
> kernel.

> Starting a kernel with no way for it to know what the current memory map
> is just plain scary.

We have one. Firmware tables are the source of all this information. We don't tamper with
them.

Firmware describes memory present at boot in the UEFI memory map or DT. On systems with
ACPI, regions that were added after booting are discovered by running AML methods. (for
which we need to allocate memory, so you can't describe boot memory like this)

This doesn't work if you kexec from a hot-added region. You've booted from memory that
wasn't present at boot.

I don't think this is fixable with the set of constraints.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-04-15 20:29 ` Eric W. Biederman
@ 2020-04-22 12:14   ` James Morse
  2020-04-22 13:04     ` Eric W. Biederman
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-04-22 12:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

Hi Eric,

On 15/04/2020 21:29, Eric W. Biederman wrote:
> James Morse <james.morse@arm.com> writes:
> 
>> Hello!
>>
>> arm64 recently queued support for memory hotremove, which led to some
>> new corner cases for kexec.
>>
>> If the kexec segments are loaded for a removable region, that region may
>> be removed before kexec actually occurs. This causes the first kernel to
>> lockup when applying the relocations. (I've triggered this on x86 too).
>>
>> The first patch adds a memory notifier for kexec so that it can refuse
>> to allow in-use regions to be taken offline.
>>
>>
>> This doesn't solve the problem for arm64, where the new kernel must
>> initially rely on the data structures from the first boot to describe
>> memory. These don't describe hotpluggable memory.
>> If kexec places the kernel in one of these regions, it must also provide
>> a DT that describes the region in which the kernel was mapped as memory.
>> (and somehow ensure its always present in the future...)
>>
>> To prevent this from happening accidentally with unaware user-space,
>> patches two and three allow arm64 to give these regions a different
>> name.
>>
>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>> were added separately.
>>
>>
>> I haven't tried kdump.
>> Unaware kdump from user-space probably won't describe the hotplug
>> regions if the name is different, which saves us from problems if
>> the memory is no longer present at kdump time, but means the vmcore
>> is incomplete.
>>
>>
>> These patches are based on arm64's for-next/core branch, but can all
>> be merged independently.
> 
> So I just looked through these quickly and I think there are real
> problems here we can fix, and that are worth fixing.
> 
> However I am not thrilled with the fixes you propose.

Sure. Unfortunately /proc/iomem is the only trick arm64 has to keep the existing
kexec-tools working.
(We've had 'unthrilling' patches like this before to prevent user-space from loading the
kernel over the top of the in-memory firmware tables.)


arm64 expects the description of memory to come from firmware, be that UEFI for memory
present at boot, or the ACPI AML methods for memory that was added later.

On arm64 there is no standard location for memory. The kernel has to be handed a pointer
to the firmware tables that describe it. The kernel expects to boot from memory that was
present at boot.

Modifying the firmware tables at runtime doesn't solve the problem as we may need to move
the firmware-reserved memory region that describes memory. User-space may still load and
kexec either side of that update.

Even if we could modify the structures at runtime, we can't update a loaded kexec image.
We have no idea which blob from userspace is the DT. It may not even be linux that has
been loaded.

We can't emulate parts of UEFI's handover because kexec's purgatory isn't an EFI program.


I can't see a path through all this. If we have to modify existing user-space, I'd rather
leave it broken. We can detect the problem in the arch code and print a warning at load time.


James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name
  2020-04-15 20:37   ` Eric W. Biederman
@ 2020-04-22 12:14     ` James Morse
  0 siblings, 0 replies; 92+ messages in thread
From: James Morse @ 2020-04-22 12:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

Hi Eric,

On 15/04/2020 21:37, Eric W. Biederman wrote:
> James Morse <james.morse@arm.com> writes:
> 
>> If kexec chooses to place the kernel in a memory region that was
>> added after boot, we fail to boot as the kernel is running from a
>> location that is not described as memory by the UEFI memory map or
>> the original DT.
>>
>> To prevent unaware user-space kexec from doing this accidentally,
>> give these regions a different name.
> 
> Please fix the problem and don't hack around it.

The problem is firmware didn't describe memory that wasn't present at boot.

arm64 relies on the firmware description of memory well before it can go poking around in
ACPI to find out where extra memory was added to the system.

We already need kexec to not overwrite in-memory structures left by firmware. (like, the
memory map). We do this by naming them reserved in /proc/iomem.

Doing the same for hotadded memory means existing kexec user-space can't do this
accidentally. The shape of /proc/iomem is the only trick in the book for arm64's kexec
userspace, as its the only thing it looks at.



Thanks,

James







^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-15 20:33   ` Eric W. Biederman
@ 2020-04-22 12:28     ` James Morse
  2020-04-22 15:25       ` Eric W. Biederman
  0 siblings, 1 reply; 92+ messages in thread
From: James Morse @ 2020-04-22 12:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

Hi Eric,

On 15/04/2020 21:33, Eric W. Biederman wrote:
> James Morse <james.morse@arm.com> writes:
> 
>> An image loaded for kexec is not stored in place, instead its segments
>> are scattered through memory, and are re-assembled when needed. In the
>> meantime, the target memory may have been removed.
>>
>> Because mm is not aware that this memory is still in use, it allows it
>> to be removed.
>>
>> Add a memory notifier to prevent the removal of memory regions that
>> overlap with a loaded kexec image segment. e.g., when triggered from the
>> Qemu console:
>> | kexec_core: memory region in use
>> | memory memory32: Offline failed.
>>
>> Signed-off-by: James Morse <james.morse@arm.com>
> 
> Given that we are talking about the destination pages for kexec
> not where the loaded kernel is currently stored the description is
> confusing.

I think David has some better wording to cover this. I thought I had it with 'scattered
and re-assembled'.


> Beyond that I think it would be better to simply unload the loaded
> kernel at memory hotunplug time.

Unconditionally, or if it aliases the removed region?

I don't particular like it. User-space has asked for two impossible things, we are
changing the answer to the first when we see the second. Its a bit spooky. (maybe no one
will notice)


> Usually somewhere in the loaded image
> is a copy of the memory map at the time the kexec kernel was loaded.
> That will invalidate the memory map as well.

Ah, unconditionally. Sure, x86 needs this.
(arm64 re-discovers the memory map from firmware tables after kexec)

If that's an acceptable change in behaviour, sure, lets do that.


> All of this should be for a very brief window of a few seconds, as
> the loaded kexec image is quite short.

It seems I'm the outlier anticipating anything could happen between those syscalls.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-04-22 12:14   ` James Morse
@ 2020-04-22 13:04     ` Eric W. Biederman
  2020-04-22 15:40       ` James Morse
  0 siblings, 1 reply; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-22 13:04 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

James Morse <james.morse@arm.com> writes:

> Hi Eric,
>
> On 15/04/2020 21:29, Eric W. Biederman wrote:
>> James Morse <james.morse@arm.com> writes:
>> 
>>> Hello!
>>>
>>> arm64 recently queued support for memory hotremove, which led to some
>>> new corner cases for kexec.
>>>
>>> If the kexec segments are loaded for a removable region, that region may
>>> be removed before kexec actually occurs. This causes the first kernel to
>>> lockup when applying the relocations. (I've triggered this on x86 too).
>>>
>>> The first patch adds a memory notifier for kexec so that it can refuse
>>> to allow in-use regions to be taken offline.
>>>
>>>
>>> This doesn't solve the problem for arm64, where the new kernel must
>>> initially rely on the data structures from the first boot to describe
>>> memory. These don't describe hotpluggable memory.
>>> If kexec places the kernel in one of these regions, it must also provide
>>> a DT that describes the region in which the kernel was mapped as memory.
>>> (and somehow ensure its always present in the future...)
>>>
>>> To prevent this from happening accidentally with unaware user-space,
>>> patches two and three allow arm64 to give these regions a different
>>> name.
>>>
>>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>>> were added separately.
>>>
>>>
>>> I haven't tried kdump.
>>> Unaware kdump from user-space probably won't describe the hotplug
>>> regions if the name is different, which saves us from problems if
>>> the memory is no longer present at kdump time, but means the vmcore
>>> is incomplete.
>>>
>>>
>>> These patches are based on arm64's for-next/core branch, but can all
>>> be merged independently.
>> 
>> So I just looked through these quickly and I think there are real
>> problems here we can fix, and that are worth fixing.
>> 
>> However I am not thrilled with the fixes you propose.
>
> Sure. Unfortunately /proc/iomem is the only trick arm64 has to keep the existing
> kexec-tools working.
> (We've had 'unthrilling' patches like this before to prevent user-space from loading the
> kernel over the top of the in-memory firmware tables.)
>
> arm64 expects the description of memory to come from firmware, be that UEFI for memory
> present at boot, or the ACPI AML methods for memory that was added
> later.
>
> On arm64 there is no standard location for memory. The kernel has to be handed a pointer
> to the firmware tables that describe it. The kernel expects to boot from memory that was
> present at boot.

What do you do when the firmware is wrong?  Does arm64 support the
mem=xxx@yyy kernel command line options?

If you want to handle the general case of memory hotplug having a
limitation that you have to boot from memory that was present at boot is
a bug, because the memory might not be there.

> Modifying the firmware tables at runtime doesn't solve the problem as we may need to move
> the firmware-reserved memory region that describes memory. User-space may still load and
> kexec either side of that update.
>
> Even if we could modify the structures at runtime, we can't update a loaded kexec image.
> We have no idea which blob from userspace is the DT. It may not even be linux that has
> been loaded.

What can be done and very reasonably so is on memory hotplug:
- Unloaded any loaded kexec image.
- Block loading any new image until the hotplug operation completes.

That is simple and generic, and can be done for all architectures.

This doesn't apply to kexec on panic kernel because it fundamentally
needs to figure out how to limp along (or reliably stop) when it has the
wrong memory map.

> We can't emulate parts of UEFI's handover because kexec's purgatory
> isn't an EFI program.

Plus much of EFI is unusable after ExitBootServices is called.

> I can't see a path through all this. If we have to modify existing user-space, I'd rather
> leave it broken. We can detect the problem in the arch code and print a warning at load time.

The weirdest thing to me in all of this is that you have been wanting to
handle memory hotplug.  But you don't want to change or deal with the
memory map changing when hotplug occurs.  The memory map changing is
fundamentally memory hotplug does.

So I think it is fundamental to figure out how to pass the updated
memory map.  Either through command line mem=xxx@yyy command line
options or through another option.

If you really want to keep the limitation that you have to have the
kernel in the initial memory map you can compare that map to the
efi tables when selecting the load address.

Expecting userspace to reload the loaded kernel after memory hotplug is
completely reasonable.

Unless I am mistaken memory hotplug is expected to be a rare event not
something that happens every day, certainly not something that happens
every minute.

Eric




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-22 12:28     ` James Morse
@ 2020-04-22 15:25       ` Eric W. Biederman
  2020-04-22 16:40         ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-22 15:25 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

James Morse <james.morse@arm.com> writes:

> Hi Eric,
>
> On 15/04/2020 21:33, Eric W. Biederman wrote:
>> James Morse <james.morse@arm.com> writes:
>> 
>>> An image loaded for kexec is not stored in place, instead its segments
>>> are scattered through memory, and are re-assembled when needed. In the
>>> meantime, the target memory may have been removed.
>>>
>>> Because mm is not aware that this memory is still in use, it allows it
>>> to be removed.
>>>
>>> Add a memory notifier to prevent the removal of memory regions that
>>> overlap with a loaded kexec image segment. e.g., when triggered from the
>>> Qemu console:
>>> | kexec_core: memory region in use
>>> | memory memory32: Offline failed.
>>>
>>> Signed-off-by: James Morse <james.morse@arm.com>
>> 
>> Given that we are talking about the destination pages for kexec
>> not where the loaded kernel is currently stored the description is
>> confusing.
>
> I think David has some better wording to cover this. I thought I had it with 'scattered
> and re-assembled'.

The confusing part was talking about memory being still in use,
that is actually scheduled for use in the future.

>> Usually somewhere in the loaded image
>> is a copy of the memory map at the time the kexec kernel was loaded.
>> That will invalidate the memory map as well.
>
> Ah, unconditionally. Sure, x86 needs this.
> (arm64 re-discovers the memory map from firmware tables after kexec)
>
> If that's an acceptable change in behaviour, sure, lets do that.

Yes.


>> All of this should be for a very brief window of a few seconds, as
>> the loaded kexec image is quite short.
>
> It seems I'm the outlier anticipating anything could happen between
> those syscalls.

The design is:
	sys_kexec_load()
	shutdown scripts
        sys_reboot(LINUX_REBOOT_CMD_KEXEC);

There are two system call simply so that the shutdown scripts can run.
Now maybe someone somewhere does something different but that is not
expected.

Only the kexec on panic kernel is expected to persist somewhat
indefinitely.  But that should be in memory that is reserved from boot
time, and so the memory hotplug should have enough visibility to not
allow that memory to be given up.

Eric


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use
  2020-04-22 13:04     ` Eric W. Biederman
@ 2020-04-22 15:40       ` James Morse
  0 siblings, 0 replies; 92+ messages in thread
From: James Morse @ 2020-04-22 15:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

Hi Eric,

On 22/04/2020 14:04, Eric W. Biederman wrote:
> James Morse <james.morse@arm.com> writes:
>> On 15/04/2020 21:29, Eric W. Biederman wrote:
>>> James Morse <james.morse@arm.com> writes:
>>>> arm64 recently queued support for memory hotremove, which led to some
>>>> new corner cases for kexec.
>>>>
>>>> If the kexec segments are loaded for a removable region, that region may
>>>> be removed before kexec actually occurs. This causes the first kernel to
>>>> lockup when applying the relocations. (I've triggered this on x86 too).
>>>>
>>>> The first patch adds a memory notifier for kexec so that it can refuse
>>>> to allow in-use regions to be taken offline.
>>>>
>>>>
>>>> This doesn't solve the problem for arm64, where the new kernel must
>>>> initially rely on the data structures from the first boot to describe
>>>> memory. These don't describe hotpluggable memory.
>>>> If kexec places the kernel in one of these regions, it must also provide
>>>> a DT that describes the region in which the kernel was mapped as memory.
>>>> (and somehow ensure its always present in the future...)
>>>>
>>>> To prevent this from happening accidentally with unaware user-space,
>>>> patches two and three allow arm64 to give these regions a different
>>>> name.
>>>>
>>>> This is a change in behaviour for arm64 as memory hotadd and hotremove
>>>> were added separately.
>>>>
>>>>
>>>> I haven't tried kdump.
>>>> Unaware kdump from user-space probably won't describe the hotplug
>>>> regions if the name is different, which saves us from problems if
>>>> the memory is no longer present at kdump time, but means the vmcore
>>>> is incomplete.
>>>>
>>>>
>>>> These patches are based on arm64's for-next/core branch, but can all
>>>> be merged independently.
>>>
>>> So I just looked through these quickly and I think there are real
>>> problems here we can fix, and that are worth fixing.
>>>
>>> However I am not thrilled with the fixes you propose.
>>
>> Sure. Unfortunately /proc/iomem is the only trick arm64 has to keep the existing
>> kexec-tools working.
>> (We've had 'unthrilling' patches like this before to prevent user-space from loading the
>> kernel over the top of the in-memory firmware tables.)
>>
>> arm64 expects the description of memory to come from firmware, be that UEFI for memory
>> present at boot, or the ACPI AML methods for memory that was added
>> later.
>>
>> On arm64 there is no standard location for memory. The kernel has to be handed a pointer
>> to the firmware tables that describe it. The kernel expects to boot from memory that was
>> present at boot.

> What do you do when the firmware is wrong? 

The firmware gets fixed. Its the only source of facts about the platform.


> Does arm64 support the
> mem=xxx@yyy kernel command line options?

Only the debug option to reduce the available memory.


> If you want to handle the general case of memory hotplug having a
> limitation that you have to boot from memory that was present at boot is
> a bug, because the memory might not be there.

arm64's arch code prevents the memory described by the UEFI memory map from being taken
offline/removed.

Memory present at boot may have firmware reservations, that are being used by some other
agent in the system. firmware-first RAS errors are one, the interrupt controllers'
property and pending tables are another.

The UEFI memory map's description of memory may have been incomplete, as there may have
been regions carved-out, not described at all instead of described as reserved.

The UEFI runtime services will live in memory described by the UEFI memory map.


>> Modifying the firmware tables at runtime doesn't solve the problem as we may need to move
>> the firmware-reserved memory region that describes memory. User-space may still load and
>> kexec either side of that update.
>>
>> Even if we could modify the structures at runtime, we can't update a loaded kexec image.
>> We have no idea which blob from userspace is the DT. It may not even be linux that has
>> been loaded.
> 
> What can be done and very reasonably so is on memory hotplug:
> - Unloaded any loaded kexec image.
> - Block loading any new image until the hotplug operation completes.
> 
> That is simple and generic, and can be done for all architectures.

Yes, certainly.


> This doesn't apply to kexec on panic kernel because it fundamentally
> needs to figure out how to limp along (or reliably stop) when it has the
> wrong memory map.
> 
>> We can't emulate parts of UEFI's handover because kexec's purgatory
>> isn't an EFI program.
> 
> Plus much of EFI is unusable after ExitBootServices is called.

Of course, we even overwrite its code when allocating memory for the kernel.

I bring it up because it is our only way of handing over the memory map of the system.


>> I can't see a path through all this. If we have to modify existing user-space, I'd rather
>> leave it broken. We can detect the problem in the arch code and print a warning at load time.

> The weirdest thing to me in all of this is that you have been wanting to
> handle memory hotplug.  But you don't want to change or deal with the
> memory map changing when hotplug occurs.  The memory map changing is
> fundamentally memory hotplug does.

arm64 doesn't have a 'the memory map', just what came from firmware. The memory map linux
uses is built from these firmware descriptions.

Memory is discovered from:
early: The DT memory node.
early: The UEFI memory map.
later: ACPI hotplug memory.

Later kexec()d or kdump'd kernels rebuild the memory map from the firmware description.
This means kexec is totally invisible. Not changing these descriptions is important to
ensure we don't accidentally corrupt them, or make up some property that isn't true.

Your request to 'change' the memory map involves creating a new UEFI memory map that
describes the memory we found via ACPI hotplug.
arm64 doesn't do this because we expect the next kernel to re-discover this memory via
ACPI hotplug.

Generally, arm64 expects a kexec'd kernel to learn and discover things in exactly the same
way that it would have done if it were the first kernel to have been booted.


> So I think it is fundamental to figure out how to pass the updated
> memory map.  Either through command line mem=xxx@yyy command line
> options or through another option.

We re-discover it from firmware. Booting from memory that is not described as memory early
enough is the second problem addressed by this series.


> If you really want to keep the limitation that you have to have the
> kernel in the initial memory map you can compare that map to the
> efi tables when selecting the load address.

Great. How can user-space know the contents of that map?
It only reads /proc/iomem today. On a system that doesn't support APCI memory hotplug,
/proc/iomem describes the memory present at boot. These things have never been different
before.


> Expecting userspace to reload the loaded kernel after memory hotplug is
> completely reasonable.

I'm sold on this, it implicitly solves the 'kexec image wants to be copied into removed
memory' problem.


> Unless I am mistaken memory hotplug is expected to be a rare event not
> something that happens every day, certainly not something that happens
> every minute.

One of the motivations for supporting memory hotplug is for VMs. Container projects like
to create VMs in advance, then reconfigure them just before they are used. This saves the
time taken by the hypervisor to do its work.

Hitting the 'not booted from boot memory' is now just using kexec in a VM deployed like this.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-22 15:25       ` Eric W. Biederman
@ 2020-04-22 16:40         ` David Hildenbrand
  2020-04-23 16:29           ` Eric W. Biederman
  2020-05-01 16:55           ` James Morse
  0 siblings, 2 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-04-22 16:40 UTC (permalink / raw)
  To: Eric W. Biederman, James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon


> The confusing part was talking about memory being still in use,
> that is actually scheduled for use in the future.

+1

> 
>>> Usually somewhere in the loaded image
>>> is a copy of the memory map at the time the kexec kernel was loaded.
>>> That will invalidate the memory map as well.
>>
>> Ah, unconditionally. Sure, x86 needs this.
>> (arm64 re-discovers the memory map from firmware tables after kexec)

Does this include hotplugged DIMMs e.g., under KVM?
[...]

>>> All of this should be for a very brief window of a few seconds, as
>>> the loaded kexec image is quite short.
>>
>> It seems I'm the outlier anticipating anything could happen between
>> those syscalls.
> 
> The design is:
> 	sys_kexec_load()
> 	shutdown scripts
>         sys_reboot(LINUX_REBOOT_CMD_KEXEC);
> 
> There are two system call simply so that the shutdown scripts can run.
> Now maybe someone somewhere does something different but that is not
> expected.
> 
> Only the kexec on panic kernel is expected to persist somewhat
> indefinitely.  But that should be in memory that is reserved from boot
> time, and so the memory hotplug should have enough visibility to not
> allow that memory to be given up.

Yes, and AFAIK, memory blocks which hold the reserved crashkernel area
can usually not get offlined and, therefore, the memory cannot get removed.

Interestingly, s390x even has a hotplug notifier for that

arch/s390/kernel/setup.c:kdump_mem_notifier()

(offlining of memory on s390x can result in memory getting depopulated
in the hypervisor, so after it would have been offlined, it would no
longer be accessible. I somewhat doubt that this notifier is really
needed - all pages in the crashkernel area should look like ordinary
allocated pages when the area is reserved early during boot via the
memblock allocator, and therefore offlining cannot succeed. But that's a
different story - and I suspect this is a leftover from pre-memblock times.)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-22 16:40         ` David Hildenbrand
@ 2020-04-23 16:29           ` Eric W. Biederman
  2020-04-24  7:39             ` David Hildenbrand
  2020-05-01 16:55           ` James Morse
  1 sibling, 1 reply; 92+ messages in thread
From: Eric W. Biederman @ 2020-04-23 16:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: James Morse, kexec, linux-mm, linux-arm-kernel,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma,
	Andrew Morton, Will Deacon

David Hildenbrand <david@redhat.com> writes:

>> The confusing part was talking about memory being still in use,
>> that is actually scheduled for use in the future.
>
> +1
>
>> 
>>>> Usually somewhere in the loaded image
>>>> is a copy of the memory map at the time the kexec kernel was loaded.
>>>> That will invalidate the memory map as well.
>>>
>>> Ah, unconditionally. Sure, x86 needs this.
>>> (arm64 re-discovers the memory map from firmware tables after kexec)
>
> Does this include hotplugged DIMMs e.g., under KVM?
> [...]

As far as I know.  If the memory map changes we need to drop the loaded
image.


Having thought about it a little more I suspect it would be the
other way and just block all hotplug actions after a kexec_load.
As all we expect to happen is running shutdown scripts.

If blocking the hotplug action uses printk to print a nice message
saying something like: "Hotplug blocked because of a loaded kexec image",
then people will be able to figure out what is going on and
call kexec -u if they haven't started the shutdown scripts yet.


Either way it is something simple and unconditional that will make
things work.

>>>> All of this should be for a very brief window of a few seconds, as
>>>> the loaded kexec image is quite short.
>>>
>>> It seems I'm the outlier anticipating anything could happen between
>>> those syscalls.
>> 
>> The design is:
>> 	sys_kexec_load()
>> 	shutdown scripts
>>         sys_reboot(LINUX_REBOOT_CMD_KEXEC);
>> 
>> There are two system call simply so that the shutdown scripts can run.
>> Now maybe someone somewhere does something different but that is not
>> expected.
>> 
>> Only the kexec on panic kernel is expected to persist somewhat
>> indefinitely.  But that should be in memory that is reserved from boot
>> time, and so the memory hotplug should have enough visibility to not
>> allow that memory to be given up.
>
> Yes, and AFAIK, memory blocks which hold the reserved crashkernel area
> can usually not get offlined and, therefore, the memory cannot get removed.
>
> Interestingly, s390x even has a hotplug notifier for that
>
> arch/s390/kernel/setup.c:kdump_mem_notifier()
>
> (offlining of memory on s390x can result in memory getting depopulated
> in the hypervisor, so after it would have been offlined, it would no
> longer be accessible. I somewhat doubt that this notifier is really
> needed - all pages in the crashkernel area should look like ordinary
> allocated pages when the area is reserved early during boot via the
> memblock allocator, and therefore offlining cannot succeed. But that's a
> different story - and I suspect this is a leftover from pre-memblock times.)

It might be worth seeing if that is true, or if we need to generalize the
s390x code.

Eric



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-23 16:29           ` Eric W. Biederman
@ 2020-04-24  7:39             ` David Hildenbrand
  2020-04-24  7:41               ` David Hildenbrand
  0 siblings, 1 reply; 92+ messages in thread
From: David Hildenbrand @ 2020-04-24  7:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Morse, kexec, linux-mm, linux-arm-kernel,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma,
	Andrew Morton, Will Deacon

On 23.04.20 18:29, Eric W. Biederman wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>>> The confusing part was talking about memory being still in use,
>>> that is actually scheduled for use in the future.
>>
>> +1
>>
>>>
>>>>> Usually somewhere in the loaded image
>>>>> is a copy of the memory map at the time the kexec kernel was loaded.
>>>>> That will invalidate the memory map as well.
>>>>
>>>> Ah, unconditionally. Sure, x86 needs this.
>>>> (arm64 re-discovers the memory map from firmware tables after kexec)
>>
>> Does this include hotplugged DIMMs e.g., under KVM?
>> [...]
> 
> As far as I know.  If the memory map changes we need to drop the loaded
> image.
> 
> 
> Having thought about it a little more I suspect it would be the
> other way and just block all hotplug actions after a kexec_load.
> As all we expect to happen is running shutdown scripts.
> 
> If blocking the hotplug action uses printk to print a nice message
> saying something like: "Hotplug blocked because of a loaded kexec image",
> then people will be able to figure out what is going on and
> call kexec -u if they haven't started the shutdown scripts yet.
> 
> 
> Either way it is something simple and unconditional that will make
> things work.
> 

Personally, I consider memory hotplug more important than keeping loaded
kexec data alive (just because somebody once decided to do a "kexec -l"
and never did a "kexec -e" we should not block any memory hot(un)plug -
especially in virtualized environments - for all eternity).

So IMHO we would invalidate loaded kexec data (not the crashkernel, of
course) on memory hot(un)plug and print a warning. In addition, we can
let kexec-tools try to reload whatever they loaded after getting
notified that something changed.

The "something changed" is visible to user space e.g., via udev events
for /sys/devices/memory/memoryX/

>>>>> All of this should be for a very brief window of a few seconds, as
>>>>> the loaded kexec image is quite short.
>>>>
>>>> It seems I'm the outlier anticipating anything could happen between
>>>> those syscalls.
>>>
>>> The design is:
>>> 	sys_kexec_load()
>>> 	shutdown scripts
>>>         sys_reboot(LINUX_REBOOT_CMD_KEXEC);
>>>
>>> There are two system call simply so that the shutdown scripts can run.
>>> Now maybe someone somewhere does something different but that is not
>>> expected.
>>>
>>> Only the kexec on panic kernel is expected to persist somewhat
>>> indefinitely.  But that should be in memory that is reserved from boot
>>> time, and so the memory hotplug should have enough visibility to not
>>> allow that memory to be given up.
>>
>> Yes, and AFAIK, memory blocks which hold the reserved crashkernel area
>> can usually not get offlined and, therefore, the memory cannot get removed.
>>
>> Interestingly, s390x even has a hotplug notifier for that
>>
>> arch/s390/kernel/setup.c:kdump_mem_notifier()
>>
>> (offlining of memory on s390x can result in memory getting depopulated
>> in the hypervisor, so after it would have been offlined, it would no
>> longer be accessible. I somewhat doubt that this notifier is really
>> needed - all pages in the crashkernel area should look like ordinary
>> allocated pages when the area is reserved early during boot via the
>> memblock allocator, and therefore offlining cannot succeed. But that's a
>> different story - and I suspect this is a leftover from pre-memblock times.)
> 
> It might be worth seeing if that is true, or if we need to generalize the
> s390x code.

I'll try to find some time to test if the s390x handler is still relevant.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-24  7:39             ` David Hildenbrand
@ 2020-04-24  7:41               ` David Hildenbrand
  0 siblings, 0 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-04-24  7:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: James Morse, kexec, linux-mm, linux-arm-kernel,
	Anshuman Khandual, Catalin Marinas, Bhupesh Sharma,
	Andrew Morton, Will Deacon

On 24.04.20 09:39, David Hildenbrand wrote:
> On 23.04.20 18:29, Eric W. Biederman wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>>> The confusing part was talking about memory being still in use,
>>>> that is actually scheduled for use in the future.
>>>
>>> +1
>>>
>>>>
>>>>>> Usually somewhere in the loaded image
>>>>>> is a copy of the memory map at the time the kexec kernel was loaded.
>>>>>> That will invalidate the memory map as well.
>>>>>
>>>>> Ah, unconditionally. Sure, x86 needs this.
>>>>> (arm64 re-discovers the memory map from firmware tables after kexec)
>>>
>>> Does this include hotplugged DIMMs e.g., under KVM?
>>> [...]
>>
>> As far as I know.  If the memory map changes we need to drop the loaded
>> image.
>>
>>
>> Having thought about it a little more I suspect it would be the
>> other way and just block all hotplug actions after a kexec_load.
>> As all we expect to happen is running shutdown scripts.
>>
>> If blocking the hotplug action uses printk to print a nice message
>> saying something like: "Hotplug blocked because of a loaded kexec image",
>> then people will be able to figure out what is going on and
>> call kexec -u if they haven't started the shutdown scripts yet.
>>
>>
>> Either way it is something simple and unconditional that will make
>> things work.
>>
> 
> Personally, I consider memory hotplug more important than keeping loaded
> kexec data alive (just because somebody once decided to do a "kexec -l"
> and never did a "kexec -e" we should not block any memory hot(un)plug -
> especially in virtualized environments - for all eternity).
> 
> So IMHO we would invalidate loaded kexec data (not the crashkernel, of
> course) on memory hot(un)plug and print a warning. In addition, we can
> let kexec-tools try to reload whatever they loaded after getting
> notified that something changed.
> 
> The "something changed" is visible to user space e.g., via udev events
> for /sys/devices/memory/memoryX/

/sys/devices/system/memory/memoryX/ ...


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image
  2020-04-22 16:40         ` David Hildenbrand
  2020-04-23 16:29           ` Eric W. Biederman
@ 2020-05-01 16:55           ` James Morse
  1 sibling, 0 replies; 92+ messages in thread
From: James Morse @ 2020-05-01 16:55 UTC (permalink / raw)
  To: David Hildenbrand, Eric W. Biederman
  Cc: kexec, linux-mm, linux-arm-kernel, Anshuman Khandual,
	Catalin Marinas, Bhupesh Sharma, Andrew Morton, Will Deacon

Hi guys,

On 22/04/2020 17:40, David Hildenbrand wrote:
>>>> Usually somewhere in the loaded image
>>>> is a copy of the memory map at the time the kexec kernel was loaded.
>>>> That will invalidate the memory map as well.
>>>
>>> Ah, unconditionally. Sure, x86 needs this.
>>> (arm64 re-discovers the memory map from firmware tables after kexec)

> Does this include hotplugged DIMMs e.g., under KVM?

If you advertise hotplugged memory to the guest using ACPI, yes.

We don't have a practical mechanism to pass 'fact's about the platform between kernels,
instead we rely on those facts being discoverable, or described by firmware.


>>>> All of this should be for a very brief window of a few seconds, as
>>>> the loaded kexec image is quite short.
>>>
>>> It seems I'm the outlier anticipating anything could happen between
>>> those syscalls.
>>
>> The design is:
>> 	sys_kexec_load()
>> 	shutdown scripts
>>         sys_reboot(LINUX_REBOOT_CMD_KEXEC);
>>
>> There are two system call simply so that the shutdown scripts can run.
>> Now maybe someone somewhere does something different but that is not
>> expected.

[...]

> Yes, and AFAIK, memory blocks which hold the reserved crashkernel area
> can usually not get offlined and, therefore, the memory cannot get removed.

The crashkernel area on arm64 will always land in un-removable memory. We set PG_Reserved
on it too.


Thanks,

James


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-03-26 18:07 ` [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names James Morse
                     ` (2 preceding siblings ...)
  2020-04-15 20:36   ` Eric W. Biederman
@ 2020-05-09  0:45   ` Andrew Morton
  2020-05-11  8:35     ` David Hildenbrand
  3 siblings, 1 reply; 92+ messages in thread
From: Andrew Morton @ 2020-05-09  0:45 UTC (permalink / raw)
  To: James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

On Thu, 26 Mar 2020 18:07:29 +0000 James Morse <james.morse@arm.com> wrote:

> Memory added to the system by hotplug has a 'System RAM' resource created
> for it. This is exposed to user-space via /proc/iomem.
> 
> This poses problems for kexec on arm64. If kexec decides to place the
> kernel in one of these newly onlined regions, the new kernel will find
> itself booting from a region not described as memory in the firmware
> tables.
> 
> Arm64 doesn't have a structure like the e820 memory map that can be
> re-written when memory is brought online. Instead arm64 uses the UEFI
> memory map, or the memory node from the DT, sometimes both. We never
> rewrite these.
> 
> Allow an architecture to specify a different name for these hotplug
> regions.
> 
> ...
>
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,10 @@
>  #include "internal.h"
>  #include "shuffle.h"
>  
> +#ifndef MEMORY_HOTPLUG_RES_NAME
> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
> +#endif
> +
>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -103,7 +107,7 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>  {
>  	struct resource *res;
>  	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -	char *resource_name = "System RAM";
> +	char *resource_name = MEMORY_HOTPLUG_RES_NAME;
>  
>  	if (start + size > max_mem_size)
>  		return ERR_PTR(-E2BIG);

I suppose we should do this as well:

--- a/mm/memory_hotplug.c~mm-memory_hotplug-allow-arch-override-of-non-boot-memory-resource-names-fix
+++ a/mm/memory_hotplug.c
@@ -129,7 +129,8 @@ static struct resource *register_memory_
 			       resource_name, flags);
 
 	if (!res) {
-		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+		pr_debug("Unable to reserve " MEMORY_HOTPLUG_RES_NAME
+				" region: %016llx->%016llx\n",
 				start, start + size);
 		return ERR_PTR(-EEXIST);
 	}

It assumes that MEMORY_HOTPLUG_RES_NAME will be a literal string, which
is the case in [3/3].



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names
  2020-05-09  0:45   ` Andrew Morton
@ 2020-05-11  8:35     ` David Hildenbrand
  0 siblings, 0 replies; 92+ messages in thread
From: David Hildenbrand @ 2020-05-11  8:35 UTC (permalink / raw)
  To: Andrew Morton, James Morse
  Cc: kexec, linux-mm, linux-arm-kernel, Eric Biederman,
	Catalin Marinas, Will Deacon, Anshuman Khandual, Bhupesh Sharma

On 09.05.20 02:45, Andrew Morton wrote:
> On Thu, 26 Mar 2020 18:07:29 +0000 James Morse <james.morse@arm.com> wrote:
> 
>> Memory added to the system by hotplug has a 'System RAM' resource created
>> for it. This is exposed to user-space via /proc/iomem.
>>
>> This poses problems for kexec on arm64. If kexec decides to place the
>> kernel in one of these newly onlined regions, the new kernel will find
>> itself booting from a region not described as memory in the firmware
>> tables.
>>
>> Arm64 doesn't have a structure like the e820 memory map that can be
>> re-written when memory is brought online. Instead arm64 uses the UEFI
>> memory map, or the memory node from the DT, sometimes both. We never
>> rewrite these.
>>
>> Allow an architecture to specify a different name for these hotplug
>> regions.
>>
>> ...
>>
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -42,6 +42,10 @@
>>  #include "internal.h"
>>  #include "shuffle.h"
>>  
>> +#ifndef MEMORY_HOTPLUG_RES_NAME
>> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
>> +#endif
>> +
>>  /*
>>   * online_page_callback contains pointer to current page onlining function.
>>   * Initially it is generic_online_page(). If it is required it could be
>> @@ -103,7 +107,7 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>>  {
>>  	struct resource *res;
>>  	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
>> -	char *resource_name = "System RAM";
>> +	char *resource_name = MEMORY_HOTPLUG_RES_NAME;
>>  
>>  	if (start + size > max_mem_size)
>>  		return ERR_PTR(-E2BIG);
> 
> I suppose we should do this as well:
> 
> --- a/mm/memory_hotplug.c~mm-memory_hotplug-allow-arch-override-of-non-boot-memory-resource-names-fix
> +++ a/mm/memory_hotplug.c
> @@ -129,7 +129,8 @@ static struct resource *register_memory_
>  			       resource_name, flags);
>  
>  	if (!res) {
> -		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
> +		pr_debug("Unable to reserve " MEMORY_HOTPLUG_RES_NAME
> +				" region: %016llx->%016llx\n",
>  				start, start + size);
>  		return ERR_PTR(-EEXIST);
>  	}
> 
> It assumes that MEMORY_HOTPLUG_RES_NAME will be a literal string, which
> is the case in [3/3].

@Andrew, as discussed in this thread already [1], I suggest to drop this
series from -mm tree for now.

[1] https://lkml.kernel.org/r/2e3419b2-d00c-51c3-9b45-9de114608cdf@arm.com

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2020-05-11  8:35 UTC | newest]

Thread overview: 92+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-26 18:07 [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use James Morse
2020-03-26 18:07 ` [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image James Morse
2020-03-27  0:43   ` Anshuman Khandual
2020-03-27  2:54     ` Baoquan He
2020-03-27 15:46     ` James Morse
2020-03-27  2:34   ` Baoquan He
2020-03-27  9:30   ` David Hildenbrand
2020-03-27 16:56     ` James Morse
2020-03-27 17:06       ` David Hildenbrand
2020-03-27 18:07         ` James Morse
2020-03-27 18:52           ` David Hildenbrand
2020-03-30 13:00             ` James Morse
2020-03-30 13:13               ` David Hildenbrand
2020-03-30 17:17                 ` James Morse
2020-03-30 18:14                   ` David Hildenbrand
2020-04-10 19:10                     ` Andrew Morton
2020-04-11  3:44                       ` Baoquan He
2020-04-11  9:30                         ` Russell King - ARM Linux admin
2020-04-11  9:58                           ` David Hildenbrand
2020-04-12  5:35                           ` Baoquan He
2020-04-12  8:08                             ` Russell King - ARM Linux admin
2020-04-12 19:52                               ` Eric W. Biederman
2020-04-12 20:37                                 ` Bhupesh SHARMA
2020-04-13  2:37                                 ` Baoquan He
2020-04-13 13:15                                   ` Eric W. Biederman
2020-04-13 23:01                                     ` Andrew Morton
2020-04-14  6:13                                       ` Eric W. Biederman
2020-04-14  6:40                                     ` Baoquan He
2020-04-14  6:51                                       ` Baoquan He
2020-04-14  8:00                                       ` David Hildenbrand
2020-04-14  9:22                                         ` Baoquan He
2020-04-14  9:37                                           ` David Hildenbrand
2020-04-14 14:39                                             ` Baoquan He
2020-04-14 14:49                                               ` David Hildenbrand
2020-04-15  2:35                                                 ` Baoquan He
2020-04-16 13:31                                                   ` David Hildenbrand
2020-04-16 14:02                                                     ` Baoquan He
2020-04-16 14:09                                                       ` David Hildenbrand
2020-04-16 14:36                                                         ` Baoquan He
2020-04-16 14:47                                                           ` David Hildenbrand
2020-04-21 13:29                                                             ` David Hildenbrand
2020-04-21 13:57                                                               ` David Hildenbrand
2020-04-21 13:59                                                               ` Eric W. Biederman
2020-04-21 14:30                                                                 ` David Hildenbrand
2020-04-22  9:17                                                               ` Baoquan He
2020-04-22  9:24                                                                 ` David Hildenbrand
2020-04-22  9:57                                                                   ` Baoquan He
2020-04-22 10:05                                                                     ` David Hildenbrand
2020-04-22 10:36                                                                       ` Baoquan He
2020-04-14  9:16                                     ` Dave Young
2020-04-14  9:38                                       ` Dave Young
2020-04-14  7:05                       ` David Hildenbrand
2020-04-14 16:55                         ` James Morse
2020-04-14 17:41                           ` David Hildenbrand
2020-04-15 20:33   ` Eric W. Biederman
2020-04-22 12:28     ` James Morse
2020-04-22 15:25       ` Eric W. Biederman
2020-04-22 16:40         ` David Hildenbrand
2020-04-23 16:29           ` Eric W. Biederman
2020-04-24  7:39             ` David Hildenbrand
2020-04-24  7:41               ` David Hildenbrand
2020-05-01 16:55           ` James Morse
2020-03-26 18:07 ` [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names James Morse
2020-03-27  9:59   ` David Hildenbrand
2020-03-27 15:39     ` James Morse
2020-03-30 13:23       ` David Hildenbrand
2020-03-30 17:17         ` James Morse
2020-04-02  5:49   ` Dave Young
2020-04-02  6:12     ` piliu
2020-04-14 17:21       ` James Morse
2020-04-15 20:36   ` Eric W. Biederman
2020-04-22 12:14     ` James Morse
2020-05-09  0:45   ` Andrew Morton
2020-05-11  8:35     ` David Hildenbrand
2020-03-26 18:07 ` [PATCH 3/3] arm64: memory: Give hotplug memory a different resource name James Morse
2020-03-30 19:01   ` David Hildenbrand
2020-04-15 20:37   ` Eric W. Biederman
2020-04-22 12:14     ` James Morse
2020-03-27  2:11 ` [PATCH 0/3] kexec/memory_hotplug: Prevent removal and accidental use Baoquan He
2020-03-27 15:40   ` James Morse
2020-03-27  9:27 ` David Hildenbrand
2020-03-27 15:42   ` James Morse
2020-03-30 13:18     ` David Hildenbrand
2020-03-30 13:55 ` Baoquan He
2020-03-30 17:17   ` James Morse
2020-03-31  3:46     ` Dave Young
2020-04-14 17:31       ` James Morse
2020-03-31  3:38 ` Dave Young
2020-04-15 20:29 ` Eric W. Biederman
2020-04-22 12:14   ` James Morse
2020-04-22 13:04     ` Eric W. Biederman
2020-04-22 15:40       ` James Morse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).