Linux-HyperV Archive on lore.kernel.org
* [PATCH v2 0/3] mm/memory_hotplug: Allow to not create firmware memmap entries
@ 2020-04-30 10:29 David Hildenbrand
  2020-04-30 10:29 ` [PATCH v2 1/3] mm/memory_hotplug: Prepare passing flags to add_memory() and friends David Hildenbrand
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: David Hildenbrand @ 2020-04-30 10:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, virtio-dev, virtualization, linuxppc-dev, linux-acpi,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Andrew Morton, Michael S . Tsirkin, David Hildenbrand,
	Baoquan He, Benjamin Herrenschmidt, Boris Ostrovsky,
	Christian Borntraeger, Dan Williams, Dave Hansen, Dave Jiang,
	Eric Biederman, Greg Kroah-Hartman, Haiyang Zhang,
	Heiko Carstens, Jason Wang, Juergen Gross, K. Y. Srinivasan,
	Len Brown, Leonardo Bras, Michael Ellerman, Michal Hocko,
	Nathan Lynch, Oscar Salvador, Pankaj Gupta, Paul Mackerras,
	Pavel Tatashin, Pingfan Liu, Rafael J. Wysocki,
	Stefano Stabellini, Stephen Hemminger, Thomas Gleixner,
	Vasily Gorbik, Vishal Verma, Wei Liu, Wei Yang

This is the follow-up to [1]:
	[PATCH v1 0/3] mm/memory_hotplug: Make virtio-mem play nicely with
	kexec-tools

I realized that this is not only helpful for virtio-mem, but also for
dax/kmem - it's a fix for that use case (see patch #3) of persistent
memory.

Also, while testing, I discovered that kexec-tools will *not* add dax/kmem
memory (anything not directly under the root when parsing /proc/iomem) to
the elfcorehdr, so this memory will never get included in a dump. This
probably has to be fixed in kexec-tools - virtio-mem will require this as
well.

v1 -> v2:
- Don't change the resource name
- Rename the flag to MHP_NO_FIRMWARE_MEMMAP to reflect what it is doing
- Rephrase subjects/descriptions
- Use the flag for dax/kmem

I'll have to rebase virtio-mem on these changes, there will be a resend.

[1] https://lkml.kernel.org/r/20200429160803.109056-1-david@redhat.com

David Hildenbrand (3):
  mm/memory_hotplug: Prepare passing flags to add_memory() and friends
  mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP

 arch/powerpc/platforms/powernv/memtrace.c       |  2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
 drivers/acpi/acpi_memhotplug.c                  |  2 +-
 drivers/base/memory.c                           |  2 +-
 drivers/dax/kmem.c                              |  3 ++-
 drivers/hv/hv_balloon.c                         |  2 +-
 drivers/s390/char/sclp_cmd.c                    |  2 +-
 drivers/xen/balloon.c                           |  2 +-
 include/linux/memory_hotplug.h                  | 15 ++++++++++++---
 mm/memory_hotplug.c                             | 14 ++++++++------
 10 files changed, 29 insertions(+), 17 deletions(-)

-- 
2.25.3



* [PATCH v2 1/3] mm/memory_hotplug: Prepare passing flags to add_memory() and friends
  2020-04-30 10:29 [PATCH v2 0/3] mm/memory_hotplug: Allow to not create firmware memmap entries David Hildenbrand
@ 2020-04-30 10:29 ` David Hildenbrand
  2020-04-30 10:29 ` [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP David Hildenbrand
  2020-04-30 10:29 ` [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP David Hildenbrand
  2 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2020-04-30 10:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, virtio-dev, virtualization, linuxppc-dev, linux-acpi,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Andrew Morton, Michael S . Tsirkin, David Hildenbrand, Wei Liu,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Jason Wang, Boris Ostrovsky,
	Juergen Gross, Stefano Stabellini, Thomas Gleixner, Pingfan Liu,
	Leonardo Bras, Nathan Lynch, Oscar Salvador, Michal Hocko,
	Baoquan He, Wei Yang, Pankaj Gupta, Eric Biederman

We soon want to pass flags - prepare for that.

This patch is based on a similar patch by Oscar Salvador:

https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de

Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Leonardo Bras <leobras.c@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/platforms/powernv/memtrace.c       |  2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
 drivers/acpi/acpi_memhotplug.c                  |  2 +-
 drivers/base/memory.c                           |  2 +-
 drivers/dax/kmem.c                              |  2 +-
 drivers/hv/hv_balloon.c                         |  2 +-
 drivers/s390/char/sclp_cmd.c                    |  2 +-
 drivers/xen/balloon.c                           |  2 +-
 include/linux/memory_hotplug.h                  |  7 ++++---
 mm/memory_hotplug.c                             | 11 ++++++-----
 10 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 13b369d2cc45..a7475d18c671 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -224,7 +224,7 @@ static int memtrace_online(void)
 			ent->mem = 0;
 		}
 
-		if (add_memory(ent->nid, ent->start, ent->size)) {
+		if (add_memory(ent->nid, ent->start, ent->size, 0)) {
 			pr_err("Failed to add trace memory to node %d\n",
 				ent->nid);
 			ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5ace2f9a277e..ae44eba46ca0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -646,7 +646,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	block_sz = memory_block_size_bytes();
 
 	/* Add the memory */
-	rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
+	rc = __add_memory(lmb->nid, lmb->base_addr, block_sz, 0);
 	if (rc) {
 		invalidate_lmb_associativity_index(lmb);
 		return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index e294f44a7850..d91b3584d4b2 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -207,7 +207,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length);
+		result = __add_memory(node, info->start_addr, info->length, 0);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2b09b68b9f78..c0ef7d9e310a 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -432,7 +432,7 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = __add_memory(nid, phys_addr,
-			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			   MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0);
 
 	if (ret)
 		goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 3d0a7e702c94..e159184e0ba0 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -65,7 +65,7 @@ int dev_dax_kmem_probe(struct device *dev)
 	new_res->flags = IORESOURCE_SYSTEM_RAM;
 	new_res->name = dev_name(dev);
 
-	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+	rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
 	if (rc) {
 		release_resource(new_res);
 		kfree(new_res);
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 32e3bc0aa665..0194bed1a573 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -726,7 +726,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				(HA_CHUNK << PAGE_SHIFT), 0);
 
 		if (ret) {
 			pr_err("hot_add memory failed error is %d\n", ret);
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index a864b21af602..a6a908244c74 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -406,7 +406,7 @@ static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(0, addr, block_size);
+		add_memory(0, addr, block_size, 0);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 0c142bcab79d..6ec0373fa624 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -347,7 +347,7 @@ static enum bp_state reserve_additional_memory(void)
 	mutex_unlock(&balloon_mutex);
 	/* add_memory_resource() requires the device_hotplug lock */
 	lock_device_hotplug();
-	rc = add_memory_resource(nid, resource);
+	rc = add_memory_resource(nid, resource, 0);
 	unlock_device_hotplug();
 	mutex_lock(&balloon_mutex);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7dca9cd6076b..0151fb935c09 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -339,9 +339,10 @@ extern void set_zone_contiguous(struct zone *zone);
 extern void clear_zone_contiguous(struct zone *zone);
 
 extern void __ref free_area_init_core_hotplug(int nid);
-extern int __add_memory(int nid, u64 start, u64 size);
-extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource);
+extern int __add_memory(int nid, u64 start, u64 size, unsigned long flags);
+extern int add_memory(int nid, u64 start, u64 size, unsigned long flags);
+extern int add_memory_resource(int nid, struct resource *resource,
+			       unsigned long flags);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern void remove_pfn_range_from_zone(struct zone *zone,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 555137bd0882..c01be92693e3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1004,7 +1004,8 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int __ref add_memory_resource(int nid, struct resource *res)
+int __ref add_memory_resource(int nid, struct resource *res,
+			      unsigned long flags)
 {
 	struct mhp_params params = { .pgprot = PAGE_KERNEL };
 	u64 start, size;
@@ -1082,7 +1083,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
 }
 
 /* requires device_hotplug_lock, see add_memory_resource() */
-int __ref __add_memory(int nid, u64 start, u64 size)
+int __ref __add_memory(int nid, u64 start, u64 size, unsigned long flags)
 {
 	struct resource *res;
 	int ret;
@@ -1091,18 +1092,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res);
+	ret = add_memory_resource(nid, res, flags);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
 }
 
-int add_memory(int nid, u64 start, u64 size)
+int add_memory(int nid, u64 start, u64 size, unsigned long flags)
 {
 	int rc;
 
 	lock_device_hotplug();
-	rc = __add_memory(nid, start, size);
+	rc = __add_memory(nid, start, size, flags);
 	unlock_device_hotplug();
 
 	return rc;
-- 
2.25.3



* [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 10:29 [PATCH v2 0/3] mm/memory_hotplug: Allow to not create firmware memmap entries David Hildenbrand
  2020-04-30 10:29 ` [PATCH v2 1/3] mm/memory_hotplug: Prepare passing flags to add_memory() and friends David Hildenbrand
@ 2020-04-30 10:29 ` David Hildenbrand
  2020-04-30 15:38   ` Eric W. Biederman
  2020-04-30 10:29 ` [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP David Hildenbrand
  2 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-04-30 10:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, virtio-dev, virtualization, linuxppc-dev, linux-acpi,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Andrew Morton, Michael S . Tsirkin, David Hildenbrand,
	Michal Hocko, Pankaj Gupta, Wei Yang, Baoquan He, Eric Biederman

Some devices/drivers that add memory via add_memory() and friends (e.g.,
dax/kmem, but also virtio-mem in the future) don't want to create entries
in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
memory to the boot memmap of the kexec kernel.

In fact, such memory is never exposed via the firmware memmap as System
RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
wrong:
 "kexec needs the raw firmware-provided memory map to setup the
  parameter segment of the kernel that should be booted with
  kexec. Also, the raw memory map is useful for debugging. For
  that reason, /sys/firmware/memmap is an interface that provides
  the raw memory map to userspace." [1]

We don't have to worry about firmware_map_remove() on the removal path.
If there is no entry, it will simply return with -EINVAL.
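
To illustrate the add and removal paths described above, here is a minimal userspace sketch. It is not kernel code: mock_add_memory(), mock_firmware_map_add(), mock_firmware_map_remove() and the fixed-size mock_map table are hypothetical stand-ins for add_memory_resource(), firmware_map_add_hotplug() and firmware_map_remove(). Only the flag name and value are taken from this patch, and the -EINVAL behavior mirrors what the description states.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Flag name and value as introduced by this patch. */
#define MHP_NO_FIRMWARE_MEMMAP 1

/* Hypothetical stand-in for the kernel's list of firmware memmap entries. */
#define MOCK_MAX_ENTRIES 8

struct mock_entry {
	uint64_t start;
	uint64_t end;
	int used;
};

static struct mock_entry mock_map[MOCK_MAX_ENTRIES];

/* Mock of firmware_map_add_hotplug(): record a range in a free slot. */
static int mock_firmware_map_add(uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < MOCK_MAX_ENTRIES; i++) {
		if (!mock_map[i].used) {
			mock_map[i].start = start;
			mock_map[i].end = end;
			mock_map[i].used = 1;
			return 0;
		}
	}
	return -ENOMEM;
}

/* Mock of firmware_map_remove(): -EINVAL if the range was never recorded. */
static int mock_firmware_map_remove(uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < MOCK_MAX_ENTRIES; i++) {
		if (mock_map[i].used && mock_map[i].start == start &&
		    mock_map[i].end == end) {
			mock_map[i].used = 0;
			return 0;
		}
	}
	return -EINVAL;
}

/* Mock add path: only create the memmap entry when the flag is not set. */
static int mock_add_memory(uint64_t start, uint64_t size, unsigned long flags)
{
	if (!(flags & MHP_NO_FIRMWARE_MEMMAP))
		return mock_firmware_map_add(start, start + size);
	return 0;
}
```

In this sketch, memory added with MHP_NO_FIRMWARE_MEMMAP leaves no entry behind, so a later mock_firmware_map_remove() on the same range just returns -EINVAL, which the removal path can ignore.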

[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memory_hotplug.h | 8 ++++++++
 mm/memory_hotplug.c            | 3 ++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 0151fb935c09..4ca418a731eb 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -68,6 +68,14 @@ struct mhp_params {
 	pgprot_t pgprot;
 };
 
+/* Flags used for add_memory() and friends. */
+
+/*
+ * Don't create entries in /sys/firmware/memmap/. The memory is detected and
+ * added via a device driver, not via the initial (firmware) memmap.
+ */
+#define MHP_NO_FIRMWARE_MEMMAP		1
+
 /*
  * Zone resizing functions
  *
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c01be92693e3..e94ede9cad00 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1062,7 +1062,8 @@ int __ref add_memory_resource(int nid, struct resource *res,
 	BUG_ON(ret);
 
 	/* create new memmap entry */
-	firmware_map_add_hotplug(start, start + size, "System RAM");
+	if (!(flags & MHP_NO_FIRMWARE_MEMMAP))
+		firmware_map_add_hotplug(start, start + size, "System RAM");
 
 	/* device_online() will take the lock when calling online_pages() */
 	mem_hotplug_done();
-- 
2.25.3



* [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 10:29 [PATCH v2 0/3] mm/memory_hotplug: Allow to not create firmware memmap entries David Hildenbrand
  2020-04-30 10:29 ` [PATCH v2 1/3] mm/memory_hotplug: Prepare passing flags to add_memory() and friends David Hildenbrand
  2020-04-30 10:29 ` [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP David Hildenbrand
@ 2020-04-30 10:29 ` David Hildenbrand
  2020-04-30 11:23   ` Dave Hansen
  2 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-04-30 10:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, virtio-dev, virtualization, linuxppc-dev, linux-acpi,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Andrew Morton, Michael S . Tsirkin, David Hildenbrand,
	Michal Hocko, Pankaj Gupta, Wei Yang, Baoquan He, Dave Hansen,
	Eric Biederman, Pavel Tatashin, Dan Williams

Currently, when adding memory, we create entries in /sys/firmware/memmap/
as "System RAM". This does not reflect reality and will lead
kexec-tools to add that memory to the fixed-up initial memmap for a
kexec kernel (loaded via kexec_load()). The memory will be considered
initial System RAM by the kexec kernel.

We should let the kexec kernel decide how to use that memory - just as
we do during an ordinary reboot.

Before configuring the namespace:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-33fffffff : namespace0.0
	3280000000-32ffffffff : PCI Bus 0000:00

After configuring the namespace:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  148200000-33fffffff : dax0.0
	3280000000-32ffffffff : PCI Bus 0000:00

After loading kmem:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  150000000-33fffffff : dax0.0
	    150000000-33fffffff : System RAM
	3280000000-32ffffffff : PCI Bus 0000:00

After a proper reboot:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  148200000-33fffffff : dax0.0
	3280000000-32ffffffff : PCI Bus 0000:00

Within the kexec kernel before this change:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  150000000-33fffffff : System RAM
	3280000000-32ffffffff : PCI Bus 0000:00

Within the kexec kernel after this change:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  148200000-33fffffff : dax0.0
	3280000000-32ffffffff : PCI Bus 0000:00

/sys/firmware/memmap/ before this change:
	0000000000000000-000000000009fc00 (System RAM)
	000000000009fc00-00000000000a0000 (Reserved)
	00000000000f0000-0000000000100000 (Reserved)
	0000000000100000-00000000bffdf000 (System RAM)
	00000000bffdf000-00000000c0000000 (Reserved)
	00000000feffc000-00000000ff000000 (Reserved)
	00000000fffc0000-0000000100000000 (Reserved)
	0000000100000000-0000000140000000 (System RAM)
	0000000150000000-0000000340000000 (System RAM)

/sys/firmware/memmap/ after a proper reboot:
	0000000000000000-000000000009fc00 (System RAM)
	000000000009fc00-00000000000a0000 (Reserved)
	00000000000f0000-0000000000100000 (Reserved)
	0000000000100000-00000000bffdf000 (System RAM)
	00000000bffdf000-00000000c0000000 (Reserved)
	00000000feffc000-00000000ff000000 (Reserved)
	00000000fffc0000-0000000100000000 (Reserved)
	0000000100000000-0000000140000000 (System RAM)

/sys/firmware/memmap/ after this change:
	0000000000000000-000000000009fc00 (System RAM)
	000000000009fc00-00000000000a0000 (Reserved)
	00000000000f0000-0000000000100000 (Reserved)
	0000000000100000-00000000bffdf000 (System RAM)
	00000000bffdf000-00000000c0000000 (Reserved)
	00000000feffc000-00000000ff000000 (Reserved)
	00000000fffc0000-0000000100000000 (Reserved)
	0000000100000000-0000000140000000 (System RAM)

kexec-tools already seems to basically ignore any System RAM that's not
at the top level when searching for areas to place kexec images - but also
for determining crash areas to dump via kdump. This behavior is not
changed by this patch. kexec-tools probably has to be fixed to also
include this memory in system dumps.
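
As a rough illustration of the top-level-only behavior described above, here is a small sketch of how a parser might treat /proc/iomem lines. This is not the kexec-tools implementation; parse_top_level_system_ram() is a hypothetical helper, and it assumes that nested entries are indented with spaces or tabs as in the listings above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if a /proc/iomem-style line describes a top-level "System RAM"
 * range, 0 otherwise. On a match, *start and *end hold the (inclusive)
 * bounds printed on the line.
 *
 * Nested resources are indented, so any leading space or tab means the
 * entry is a child of another resource and gets skipped here.
 */
static int parse_top_level_system_ram(const char *line,
				      unsigned long long *start,
				      unsigned long long *end)
{
	char name[64];

	/* Check indentation first: %llx would silently skip whitespace. */
	if (line[0] == ' ' || line[0] == '\t')
		return 0;

	if (sscanf(line, "%llx-%llx : %63[^\n]", start, end, name) != 3)
		return 0;

	return strcmp(name, "System RAM") == 0;
}
```

With a filter like this, the "System RAM" child of dax0.0 shown in the "After loading kmem" listing would never be considered, neither for placing kexec images nor for the dump, which matches the behavior observed during testing.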

Note: kexec_file_load() does the right thing already within the kernel.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/dax/kmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index e159184e0ba0..929823a79816 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -65,7 +65,8 @@ int dev_dax_kmem_probe(struct device *dev)
 	new_res->flags = IORESOURCE_SYSTEM_RAM;
 	new_res->name = dev_name(dev);
 
-	rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
+	rc = add_memory(numa_node, new_res->start, resource_size(new_res),
+			MHP_NO_FIRMWARE_MEMMAP);
 	if (rc) {
 		release_resource(new_res);
 		kfree(new_res);
-- 
2.25.3



* Re: [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 10:29 ` [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP David Hildenbrand
@ 2020-04-30 11:23   ` Dave Hansen
  2020-04-30 15:28     ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2020-04-30 11:23 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, virtio-dev, virtualization, linuxppc-dev, linux-acpi,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Andrew Morton, Michael S . Tsirkin, Michal Hocko, Pankaj Gupta,
	Wei Yang, Baoquan He, Dave Hansen, Eric Biederman,
	Pavel Tatashin, Dan Williams

On 4/30/20 3:29 AM, David Hildenbrand wrote:
> Currently, when adding memory, we create entries in /sys/firmware/memmap/
> as "System RAM". This does not reflect reality and will lead
> kexec-tools to add that memory to the fixed-up initial memmap for a
> kexec kernel (loaded via kexec_load()). The memory will be considered
> initial System RAM by the kexec kernel.
> 
> We should let the kexec kernel decide how to use that memory - just as
> we do during an ordinary reboot.
...
> -	rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
> +	rc = add_memory(numa_node, new_res->start, resource_size(new_res),
> +			MHP_NO_FIRMWARE_MEMMAP);

Looks fine.  But, if you send another revision, could you add a comment
about the actual goal of MHP_NO_FIRMWARE_MEMMAP?  Maybe:

	/*
	 * MHP_NO_FIRMWARE_MEMMAP ensures that future
	 * kexec'd kernels will not treat this as RAM.
	 */

Not a biggie, though.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>


* Re: [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 11:23   ` Dave Hansen
@ 2020-04-30 15:28     ` David Hildenbrand
  0 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2020-04-30 15:28 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: linux-mm, virtio-dev, virtualization, linuxppc-dev, linux-acpi,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Andrew Morton, Michael S . Tsirkin, Michal Hocko, Pankaj Gupta,
	Wei Yang, Baoquan He, Dave Hansen, Eric Biederman,
	Pavel Tatashin, Dan Williams

On 30.04.20 13:23, Dave Hansen wrote:
> On 4/30/20 3:29 AM, David Hildenbrand wrote:
>> Currently, when adding memory, we create entries in /sys/firmware/memmap/
>> as "System RAM". This does not reflect reality and will lead
>> kexec-tools to add that memory to the fixed-up initial memmap for a
>> kexec kernel (loaded via kexec_load()). The memory will be considered
>> initial System RAM by the kexec kernel.
>>
>> We should let the kexec kernel decide how to use that memory - just as
>> we do during an ordinary reboot.
> ...
>> -	rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
>> +	rc = add_memory(numa_node, new_res->start, resource_size(new_res),
>> +			MHP_NO_FIRMWARE_MEMMAP);
> 
> Looks fine.  But, if you send another revision, could you add a comment
> about the actual goal of MHP_NO_FIRMWARE_MEMMAP?  Maybe:
> 
> 	/*
> 	 * MHP_NO_FIRMWARE_MEMMAP ensures that future
> 	 * kexec'd kernels will not treat this as RAM.
> 	 */
> 
> Not a biggie, though.

Sure, maybe Andrew can fixup when applying (if no resend is necessary).

Thanks Dave!

> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> 


-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 10:29 ` [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP David Hildenbrand
@ 2020-04-30 15:38   ` Eric W. Biederman
  2020-04-30 15:52     ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Eric W. Biederman @ 2020-04-30 15:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, virtio-dev, virtualization, linuxppc-dev,
	linux-acpi, linux-nvdimm, linux-hyperv, linux-s390, xen-devel,
	Michal Hocko, Andrew Morton, Michael S . Tsirkin, Michal Hocko,
	Pankaj Gupta, Wei Yang, Baoquan He

David Hildenbrand <david@redhat.com> writes:

> Some devices/drivers that add memory via add_memory() and friends (e.g.,
> dax/kmem, but also virtio-mem in the future) don't want to create entries
> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
> memory to the boot memmap of the kexec kernel.
>
> In fact, such memory is never exposed via the firmware memmap as System
> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
> wrong:
>  "kexec needs the raw firmware-provided memory map to setup the
>   parameter segment of the kernel that should be booted with
>   kexec. Also, the raw memory map is useful for debugging. For
>   that reason, /sys/firmware/memmap is an interface that provides
>   the raw memory map to userspace." [1]
>
> We don't have to worry about firmware_map_remove() on the removal path.
> If there is no entry, it will simply return with -EINVAL.
>
> [1]
> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap


You know what, this justification is rubbish, and I have previously
explained why it is rubbish.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

This needs to be based on whether the added memory is ultimately normal
ram or is something special.

At least when we are talking memory resources.  Keeping it out of the
firmware map that is fine.

If the hotplugged memory is the result of plugging a stick of ram
into the kernel and can and should be used like any other memory
it should be treated like any normal memory.

If the hotplugged memory is something special it should be treated as
something special.

Justifying behavior by documentation that does not consider memory
hotplug is bad thinking.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> Cc: Wei Yang <richard.weiyang@gmail.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/memory_hotplug.h | 8 ++++++++
>  mm/memory_hotplug.c            | 3 ++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 0151fb935c09..4ca418a731eb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -68,6 +68,14 @@ struct mhp_params {
>  	pgprot_t pgprot;
>  };
>  
> +/* Flags used for add_memory() and friends. */
> +
> +/*
> + * Don't create entries in /sys/firmware/memmap/. The memory is detected and
> + * added via a device driver, not via the initial (firmware) memmap.
> + */
> +#define MHP_NO_FIRMWARE_MEMMAP		1
> +
>  /*
>   * Zone resizing functions
>   *
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c01be92693e3..e94ede9cad00 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1062,7 +1062,8 @@ int __ref add_memory_resource(int nid, struct resource *res,
>  	BUG_ON(ret);
>  
>  	/* create new memmap entry */
> -	firmware_map_add_hotplug(start, start + size, "System RAM");
> +	if (!(flags & MHP_NO_FIRMWARE_MEMMAP))
> +		firmware_map_add_hotplug(start, start + size, "System RAM");
>  
>  	/* device_online() will take the lock when calling online_pages() */
>  	mem_hotplug_done();


* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 15:38   ` Eric W. Biederman
@ 2020-04-30 15:52     ` David Hildenbrand
  2020-04-30 16:04       ` Dave Hansen
  2020-04-30 16:33       ` Eric W. Biederman
  0 siblings, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2020-04-30 15:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-mm, virtio-dev, virtualization, linuxppc-dev,
	linux-acpi, linux-nvdimm, linux-hyperv, linux-s390, xen-devel,
	Michal Hocko, Andrew Morton, Michael S . Tsirkin, Michal Hocko,
	Pankaj Gupta, Wei Yang, Baoquan He

On 30.04.20 17:38, Eric W. Biederman wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> Some devices/drivers that add memory via add_memory() and friends (e.g.,
>> dax/kmem, but also virtio-mem in the future) don't want to create entries
>> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
>> memory to the boot memmap of the kexec kernel.
>>
>> In fact, such memory is never exposed via the firmware memmap as System
>> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
>> wrong:
>>  "kexec needs the raw firmware-provided memory map to setup the
>>   parameter segment of the kernel that should be booted with
>>   kexec. Also, the raw memory map is useful for debugging. For
>>   that reason, /sys/firmware/memmap is an interface that provides
>>   the raw memory map to userspace." [1]
>>
>> We don't have to worry about firmware_map_remove() on the removal path.
>> If there is no entry, it will simply return with -EINVAL.
>>
>> [1]
>> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
> 
> 
> You know what, this justification is rubbish, and I have previously
> explained why it is rubbish.

Actually, no, I don't think it is rubbish. See patch #3 and the cover
letter why this is the right thing to do *for special memory*, *not
ordinary DIMMs*.

And to be quite honest, I think your response is a little harsh. I don't
recall you replying to my virtio-mem-related comments.

> 
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> This needs to be based on whether the added memory is ultimately normal
> ram or is something special.

Yes, that's what the callers are expected to decide, see patch #3.

kexec should try to be as close as possible to a real reboot - IMHO.

> 
> At least when we are talking memory resources.  Keeping it out of the
> firmware map that is fine.
> 
> If the hotplugged memory is the result of plugging a stick of ram
> into the kernel and can and should be used like any other memory
> it should be treated like any normal memory.
> 
> If the hotplugged memory is something special it should be treated as
> something special.

I am really sorry, I can't make sense of what you are trying to say here.

> 
> Justifying behavior by documentation that does not consider memory
> hotplug is bad thinking.

Are you maybe confusing this patch series with the arm64 approach? This
is not about ordinary hotplugged DIMMs.

I'd love to get Dan's, Dave's and Michal's opinion.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 15:52     ` David Hildenbrand
@ 2020-04-30 16:04       ` Dave Hansen
  2020-04-30 16:33       ` Eric W. Biederman
  1 sibling, 0 replies; 30+ messages in thread
From: Dave Hansen @ 2020-04-30 16:04 UTC (permalink / raw)
  To: David Hildenbrand, Eric W. Biederman
  Cc: linux-kernel, linux-mm, virtio-dev, virtualization, linuxppc-dev,
	linux-acpi, linux-nvdimm, linux-hyperv, linux-s390, xen-devel,
	Michal Hocko, Andrew Morton, Michael S . Tsirkin, Michal Hocko,
	Pankaj Gupta, Wei Yang, Baoquan He

On 4/30/20 8:52 AM, David Hildenbrand wrote:
>> Justifying behavior by documentation that does not consider memory
>> hotplug is bad thinking.
> Are you maybe confusing this patch series with the arm64 approach? This
> is not about ordinary hotplugged DIMMs.
> 
> I'd love to get Dan's, Dave's and Michal's opinion.

The impact on kexec from the DAX "kmem" driver's use of hotplug was
inadvertent and unfortunate.

The problem statement and solution seem pretty sane to me.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 15:52     ` David Hildenbrand
  2020-04-30 16:04       ` Dave Hansen
@ 2020-04-30 16:33       ` Eric W. Biederman
  2020-04-30 16:49         ` David Hildenbrand
  1 sibling, 1 reply; 30+ messages in thread
From: Eric W. Biederman @ 2020-04-30 16:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, virtio-dev, virtualization, linuxppc-dev,
	linux-acpi, linux-nvdimm, linux-hyperv, linux-s390, xen-devel,
	Michal Hocko, Andrew Morton, Michael S . Tsirkin, Michal Hocko,
	Pankaj Gupta, Wei Yang, Baoquan He

David Hildenbrand <david@redhat.com> writes:

> On 30.04.20 17:38, Eric W. Biederman wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>>> Some devices/drivers that add memory via add_memory() and friends (e.g.,
>>> dax/kmem, but also virtio-mem in the future) don't want to create entries
>>> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
>>> memory to the boot memmap of the kexec kernel.
>>>
>>> In fact, such memory is never exposed via the firmware memmap as System
>>> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
>>> wrong:
>>>  "kexec needs the raw firmware-provided memory map to setup the
>>>   parameter segment of the kernel that should be booted with
>>>   kexec. Also, the raw memory map is useful for debugging. For
>>>   that reason, /sys/firmware/memmap is an interface that provides
>>>   the raw memory map to userspace." [1]
>>>
>>> We don't have to worry about firmware_map_remove() on the removal path.
>>> If there is no entry, it will simply return with -EINVAL.
>>>
>>> [1]
>>> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
>> 
>> 
>> You know what this justification is rubbish, and I have previously
>> explained why it is rubbish.
>
> Actually, no, I don't think it is rubbish. See patch #3 and the cover
> letter why this is the right thing to do *for special memory*, *not
> ordinary DIMMs*.
>
> And to be quite honest, I think your response is a little harsh. I don't
> recall you replying to my virtio-mem-related comments.
>
>> 
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> 
>> This needs to be based on whether the added memory is ultimately normal
>> ram or is something special.
>
> Yes, that's what the callers are expected to decide, see patch #3.
>
> kexec should try to be as close as possible to a real reboot - IMHO.

That is very fuzzy in terms of hotplug memory.  The kexec'd kernel
should see the hotplugged memory assuming it is ordinary memory.

But kexec is not a reboot although it is quite similar.   Kexec is
swapping one running kernel and its state for another kernel without
rebooting.

>> Justifying behavior by documentation that does not consider memory
>> hotplug is bad thinking.
>
> Are you maybe confusing this patch series with the arm64 approach? This
> is not about ordinary hotplugged DIMMs.

I think I am.

My challenge is that I don't see anything in the description that says
this isn't about ordinary hotplugged DIMMs.  All I saw was hotplug
memory.

If the class of memory is different then please by all means let's mark
it differently in struct resource so everyone knows it is different.
But that difference needs to be more than hotplug.

That difference needs to be the hypervisor loaned us memory and might
take it back at any time, or this memory is persistent and so it has
these different characteristics so don't use it as ordinary ram.

That information is also useful to other people looking at the system
and seeing what is going on.

Just please don't muddle the concepts, or assume that whatever subset of
hotplug memory you are dealing with is the only subset.

I didn't see that flag making the distinction about the kind of memory
it is.

Eric





^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 16:33       ` Eric W. Biederman
@ 2020-04-30 16:49         ` David Hildenbrand
  2020-04-30 18:06           ` Eric W. Biederman
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-04-30 16:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-mm, virtio-dev, virtualization, linuxppc-dev,
	linux-acpi, linux-nvdimm, linux-hyperv, linux-s390, xen-devel,
	Michal Hocko, Andrew Morton, Michael S . Tsirkin, Michal Hocko,
	Pankaj Gupta, Wei Yang, Baoquan He

On 30.04.20 18:33, Eric W. Biederman wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 30.04.20 17:38, Eric W. Biederman wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> Some devices/drivers that add memory via add_memory() and friends (e.g.,
>>>> dax/kmem, but also virtio-mem in the future) don't want to create entries
>>>> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
>>>> memory to the boot memmap of the kexec kernel.
>>>>
>>>> In fact, such memory is never exposed via the firmware memmap as System
>>>> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
>>>> wrong:
>>>>  "kexec needs the raw firmware-provided memory map to setup the
>>>>   parameter segment of the kernel that should be booted with
>>>>   kexec. Also, the raw memory map is useful for debugging. For
>>>>   that reason, /sys/firmware/memmap is an interface that provides
>>>>   the raw memory map to userspace." [1]
>>>>
>>>> We don't have to worry about firmware_map_remove() on the removal path.
>>>> If there is no entry, it will simply return with -EINVAL.
>>>>
>>>> [1]
>>>> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
>>>
>>>
>>> You know what this justification is rubbish, and I have previously
>>> explained why it is rubbish.
>>
>> Actually, no, I don't think it is rubbish. See patch #3 and the cover
>> letter why this is the right thing to do *for special memory*, *not
>> ordinary DIMMs*.
>>
>> And to be quite honest, I think your response is a little harsh. I don't
>> recall you replying to my virtio-mem-related comments.
>>
>>>
>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>> This needs to be based on whether the added memory is ultimately normal
>>> ram or is something special.
>>
>> Yes, that's what the callers are expected to decide, see patch #3.
>>
>> kexec should try to be as close as possible to a real reboot - IMHO.
> 
> That is very fuzzy in terms of hotplug memory.  The kexec'd kernel
> should see the hotplugged memory assuming it is ordinary memory.
> 
> But kexec is not a reboot although it is quite similar.   Kexec is
> swapping one running kernel and its state for another kernel without
> rebooting.

I agree (especially regarding the arm64 DIMM hotplug discussion).
However, for the two cases

a) dax/kmem
b) virtio-mem

We really want to let the driver take back control and figure out "what
to do with the memory".

> 
>>> Justifying behavior by documentation that does not consider memory
>>> hotplug is bad thinking.
>>
>> Are you maybe confusing this patch series with the arm64 approach? This
>> is not about ordinary hotplugged DIMMs.
> 
> I think I am.
> 
> My challenge is that I don't see anything in the description that says
> this isn't about ordinary hotplugged DIMMs.  All I saw was hotplug
> memory.

I'm sorry if that was confusing, I tried to stress that kmem and
virtio-mem are special in the description.

I squeezed a lot of that information into the cover letter and into
patch #3.

> 
> If the class of memory is different then please by all means let's mark
> it differently in struct resource so everyone knows it is different.
> But that difference needs to be more than hotplug.
> 
> That difference needs to be the hypervisor loaned us memory and might
> take it back at any time, or this memory is persistent and so it has
> these different characteristics so don't use it as ordinary ram.

Yes, and I think kmem took an excellent approach of explicitly putting
that "System RAM" into a resource hierarchy. That "System RAM" won't
show up as a root node under /proc/iomem (see patch #3), which already
results in kexec-tools treating it in a special way. I am thinking about
doing the same for virtio-mem.
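
To illustrate the root-node point only (this is not part of the patch series, and it is not how kexec-tools is actually implemented), here is a minimal userspace C sketch that tells a root-level "System RAM" line in /proc/iomem apart from a nested one. It assumes the usual /proc/iomem layout of two spaces of indentation per nesting level and a " : " separator before the resource name; all helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Nesting depth of one /proc/iomem line, assuming two spaces per level. */
static int iomem_depth(const char *line)
{
	int spaces = 0;

	assert(line != NULL);
	while (line[spaces] == ' ')
		spaces++;
	return spaces / 2;
}

/* True if the resource name on this line begins with "System RAM". */
static bool iomem_is_system_ram(const char *line)
{
	const char *sep;

	assert(line != NULL);
	sep = strstr(line, " : ");
	return sep != NULL && strncmp(sep + 3, "System RAM", 10) == 0;
}

/*
 * Hypothetical helper: a parser that only considers root entries would
 * treat the line as ordinary RAM only if this returns true.
 */
static bool iomem_is_root_system_ram(const char *line)
{
	return iomem_depth(line) == 0 && iomem_is_system_ram(line);
}
```

A "System RAM" entry nested under a device resource has a depth greater than zero here, so iomem_is_root_system_ram() rejects it even though iomem_is_system_ram() accepts it.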

> 
> That information is also useful to other people looking at the system
> and seeing what is going on.
> 
> Just please don't muddle the concepts, or assume that whatever subset of
> hotplug memory you are dealing with is the only subset.

I can certainly rephrase the subject/description/comment, stating that
this is not to be used for ordinary hotplugged DIMMs - only when the
device driver is in control of deciding what to do with that memory -
especially when kexec'ing.

(previously, I called this flag MHP_DRIVER_MANAGED, but I think
MHP_NO_FIRMWARE_MEMMAP is clearer, we just need a better description)

Would that make it clearer?

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 16:49         ` David Hildenbrand
@ 2020-04-30 18:06           ` Eric W. Biederman
  2020-04-30 18:43             ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Eric W. Biederman @ 2020-04-30 18:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, virtio-dev, virtualization, linuxppc-dev,
	linux-acpi, linux-nvdimm, linux-hyperv, linux-s390, xen-devel,
	Michal Hocko, Andrew Morton, Michael S . Tsirkin, Michal Hocko,
	Pankaj Gupta, Wei Yang, Baoquan He

David Hildenbrand <david@redhat.com> writes:

> On 30.04.20 18:33, Eric W. Biederman wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>>> On 30.04.20 17:38, Eric W. Biederman wrote:
>>>> David Hildenbrand <david@redhat.com> writes:
>>>>
>>>>> Some devices/drivers that add memory via add_memory() and friends (e.g.,
>>>>> dax/kmem, but also virtio-mem in the future) don't want to create entries
>>>>> in /sys/firmware/memmap/ - primarily to hinder kexec from adding this
>>>>> memory to the boot memmap of the kexec kernel.
>>>>>
>>>>> In fact, such memory is never exposed via the firmware memmap as System
>>>>> RAM (e.g., e820), so exposing this memory via /sys/firmware/memmap/ is
>>>>> wrong:
>>>>>  "kexec needs the raw firmware-provided memory map to setup the
>>>>>   parameter segment of the kernel that should be booted with
>>>>>   kexec. Also, the raw memory map is useful for debugging. For
>>>>>   that reason, /sys/firmware/memmap is an interface that provides
>>>>>   the raw memory map to userspace." [1]
>>>>>
>>>>> We don't have to worry about firmware_map_remove() on the removal path.
>>>>> If there is no entry, it will simply return with -EINVAL.
>>>>>
>>>>> [1]
>>>>> https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
>>>>
>>>>
>>>> You know what this justification is rubbish, and I have previously
>>>> explained why it is rubbish.
>>>
>>> Actually, no, I don't think it is rubbish. See patch #3 and the cover
>>> letter why this is the right thing to do *for special memory*, *not
>>> ordinary DIMMs*.
>>>
>>> And to be quite honest, I think your response is a little harsh. I don't
>>> recall you replying to my virtio-mem-related comments.
>>>
>>>>
>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>
>>>> This needs to be based on whether the added memory is ultimately normal
>>>> ram or is something special.
>>>
>>> Yes, that's what the callers are expected to decide, see patch #3.
>>>
>>> kexec should try to be as close as possible to a real reboot - IMHO.
>> 
>> That is very fuzzy in terms of hotplug memory.  The kexec'd kernel
>> should see the hotplugged memory assuming it is ordinary memory.
>> 
>> But kexec is not a reboot although it is quite similar.   Kexec is
>> swapping one running kernel and its state for another kernel without
>> rebooting.
>
> I agree (especially regarding the arm64 DIMM hotplug discussion).
> However, for the two cases
>
> a) dax/kmem
> b) virtio-mem
>
> We really want to let the driver take back control and figure out "what
> to do with the memory".

From reading your v1 cover letter (the description appears missing in
v2) I see what you are talking about with respect to virtio-mem.

So I will count virtio-mem as something different.

>>>> Justifying behavior by documentation that does not consider memory
>>>> hotplug is bad thinking.
>>>
>>> Are you maybe confusing this patch series with the arm64 approach? This
>>> is not about ordinary hotplugged DIMMs.
>> 
>> I think I am.
>> 
>> My challenge is that I don't see anything in the description that says
>> this isn't about ordinary hotplugged DIMMs.  All I saw was hotplug
>> memory.
>
> I'm sorry if that was confusing, I tried to stress that kmem and
> virtio-mem are special in the description.
>
> I squeezed a lot of that information into the cover letter and into
> patch #3.


>> If the class of memory is different then please by all means let's mark
>> it differently in struct resource so everyone knows it is different.
>> But that difference needs to be more than hotplug.
>> 
>> That difference needs to be the hypervisor loaned us memory and might
>> take it back at any time, or this memory is persistent and so it has
>> these different characteristics so don't use it as ordinary ram.
>
> Yes, and I think kmem took an excellent approach of explicitly putting
> that "System RAM" into a resource hierarchy. That "System RAM" won't
> show up as a root node under /proc/iomem (see patch #3), which already
> results in kexec-tools to treat it in a special way. I am thinking about
> doing the same for virtio-mem.

Reading this and your patch cover letters again my concern is that
the justification seems to be letting the tail wag the dog.

You want kexec-tools to behave in a certain way so you are changing the
kernel.

Rather, it should be: change the kernel to clearly reflect reality, and if
you can get away without a change to kexec-tools that is a bonus.

>> That information is also useful to other people looking at the system
>> and seeing what is going on.
>> 
>> Just please don't muddle the concepts, or assume that whatever subset of
>> hotplug memory you are dealing with is the only subset.
>
> I can certainly rephrase the subject/description/comment, stating that
> this is not to be used for ordinary hotplugged DIMMs - only when the
> device driver is under control to decide what to do with that memory -
> especially when kexec'ing.
>
> (previously, I called this flag MHP_DRIVER_MANAGED, but I think
> MHP_NO_FIRMWARE_MEMMAP is clearer, we just need a better description)
>
> Would that make it clearer?

I am not certain, but Andrew Morton deliberately added that
firmware_map_add_hotplug call.  Which means that there is a reason
for putting hotplugged memory in the firmware map.

So the justification needs to take that reason into account.  The
justification can not be it is hotplugged therefore it should not belong
in the firmware memory map.  Unless you can show that
firmware_map_add_hotplug was actually a bug and should be removed.
But as it has been that way since 2010 that seems like a long shot.

So my question is what is right for the firmware map?

Why does the firmware map support hotplug entries?

Once we have the answers to those questions we can figure out what logic
the special kinds of memory hotplug need.

Ref: d96ae5309165 ("memory-hotplug: create /sys/firmware/memmap entry for new memory")

Eric


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 18:06           ` Eric W. Biederman
@ 2020-04-30 18:43             ` David Hildenbrand
  2020-04-30 18:58               ` Dan Williams
  2020-04-30 22:24               ` Andrew Morton
  0 siblings, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2020-04-30 18:43 UTC (permalink / raw)
  To: Eric W. Biederman, Andrew Morton
  Cc: linux-kernel, linux-mm, virtio-dev, virtualization, linuxppc-dev,
	linux-acpi, linux-nvdimm, linux-hyperv, linux-s390, xen-devel,
	Michal Hocko, Michael S . Tsirkin, Michal Hocko, Pankaj Gupta,
	Wei Yang, Baoquan He

>>> If the class of memory is different then please by all means let's mark
>>> it differently in struct resource so everyone knows it is different.
>>> But that difference needs to be more than hotplug.
>>>
>>> That difference needs to be the hypervisor loaned us memory and might
>>> take it back at any time, or this memory is persistent and so it has
>>> these different characteristics so don't use it as ordinary ram.
>>
>> Yes, and I think kmem took an excellent approach of explicitly putting
>> that "System RAM" into a resource hierarchy. That "System RAM" won't
>> show up as a root node under /proc/iomem (see patch #3), which already
>> results in kexec-tools to treat it in a special way. I am thinking about
>> doing the same for virtio-mem.
> 
> Reading this and your patch cover letters again my concern is that
> the justification seems to be letting the tail wag the dog.
> 
> You want kexec-tools to behave in a certain way so you are changing the
> kernel.
> 
> Rather it should be change the kernel to clearly reflect reality and if
> you can get away without a change to kexec-tools that is a bonus.
> 

Right, because user space has to have a way to figure out what to do.

But talking about the firmware memmap, indicating something via a "raw
firmware-provided memory map", that is not actually in the "raw
firmware-provided memory map" feels wrong to me. (below)


>>> That information is also useful to other people looking at the system
>>> and seeing what is going on.
>>>
>>> Just please don't muddle the concepts, or assume that whatever subset of
>>> hotplug memory you are dealing with is the only subset.
>>
>> I can certainly rephrase the subject/description/comment, stating that
>> this is not to be used for ordinary hotplugged DIMMs - only when the
>> device driver is under control to decide what to do with that memory -
>> especially when kexec'ing.
>>
>> (previously, I called this flag MHP_DRIVER_MANAGED, but I think
>> MHP_NO_FIRMWARE_MEMMAP is clearer, we just need a better description)
>>
>> Would that make it clearer?
> 
> I am not certain, but Andrew Morton deliberately added that
> firmware_map_add_hotplug call.  Which means that there is a reason
> for putting hotplugged memory in the firmware map.
> 
> So the justification needs to take that reason into account.  The
> justification can not be it is hotplugged therefore it should not belong
> in the firmware memory map.  Unless you can show that
> firmware_map_add_hotplug was actually a bug and should be removed.
> But as it has been that way since 2010 that seems like a long shot.
> 
> So my question is what is right for the firmware map?

We have had documentation for that since 2008. Andrew's patch is from 2010.

Documentation/ABI/testing/sysfs-firmware-memmap

It clearly talks about "raw firmware-provided memory map" and why the
interface was introduced at all ("on most architectures that
firmware-provided memory map is modified afterwards by the kernel itself").

> 
> Why does the firmware map support hotplug entries?

I assume:

The firmware memmap was added primarily for x86-64 kexec (and still, is
mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
get hotplugged on real HW, they get added to e820. Same applies to
memory added via HyperV balloon (unless memory is unplugged via
ballooning and you reboot ... the e820 is changed as well). I assume
we wanted to be able to reflect that, to make kexec look like a real reboot.

This worked for a while. Then came dax/kmem. Now comes virtio-mem.


But I assume only Andrew can enlighten us.

@Andrew, any guidance here? Should we really add all memory to the
firmware memmap, even if this contradicts the existing
documentation? (especially, if the actual firmware memmap will *not*
contain that memory after a reboot)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 18:43             ` David Hildenbrand
@ 2020-04-30 18:58               ` Dan Williams
  2020-04-30 22:24               ` Andrew Morton
  1 sibling, 0 replies; 30+ messages in thread
From: Dan Williams @ 2020-04-30 18:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eric W. Biederman, Andrew Morton, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On Thu, Apr 30, 2020 at 11:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>> If the class of memory is different then please by all means let's mark
> >>> it differently in struct resource so everyone knows it is different.
> >>> But that difference needs to be more than hotplug.
> >>>
> >>> That difference needs to be the hypervisor loaned us memory and might
> >>> take it back at any time, or this memory is persistent and so it has
> >>> these different characteristics so don't use it as ordinary ram.
> >>
> >> Yes, and I think kmem took an excellent approach of explicitly putting
> >> that "System RAM" into a resource hierarchy. That "System RAM" won't
> >> show up as a root node under /proc/iomem (see patch #3), which already
> >> results in kexec-tools to treat it in a special way. I am thinking about
> >> doing the same for virtio-mem.
> >
> > Reading this and your patch cover letters again my concern is that
> > the justification seems to be letting the tail wag the dog.
> >
> > You want kexec-tools to behave in a certain way so you are changing the
> > kernel.
> >
> > Rather it should be change the kernel to clearly reflect reality and if
> > you can get away without a change to kexec-tools that is a bonus.
> >
>
> Right, because user space has to have a way to figure out what to do.
>
> But talking about the firmware memmap, indicating something via a "raw
> firmware-provided memory map", that is not actually in the "raw
> firmware-provided memory map" feels wrong to me. (below)
>
>
> >>> That information is also useful to other people looking at the system
> >>> and seeing what is going on.
> >>>
> >>> Just please don't muddle the concepts, or assume that whatever subset of
> >>> hotplug memory you are dealing with is the only subset.
> >>
> >> I can certainly rephrase the subject/description/comment, stating that
> >> this is not to be used for ordinary hotplugged DIMMs - only when the
> >> device driver is under control to decide what to do with that memory -
> >> especially when kexec'ing.
> >>
> >> (previously, I called this flag MHP_DRIVER_MANAGED, but I think
> >> MHP_NO_FIRMWARE_MEMMAP is clearer, we just need a better description)
> >>
> >> Would that make it clearer?
> >
> > I am not certain, but Andrew Morton deliberately added that
> > firmware_map_add_hotplug call.  Which means that there is a reason
> > for putting hotplugged memory in the firmware map.
> >
> > So the justification needs to take that reason into account.  The
> > justification can not be it is hotplugged therefore it should not belong
> > in the firmware memory map.  Unless you can show that
> > firmware_map_add_hotplug was actually a bug and should be removed.
> > But as it has been that way since 2010 that seems like a long shot.
> >
> > So my question is what is right for the firmware map?
>
> We have had documentation for that since 2008. Andrew's patch is from 2010.
>
> Documentation/ABI/testing/sysfs-firmware-memmap
>
> It clearly talks about "raw firmware-provided memory map" and why the
> interface was introduced at all ("on most architectures that
> firmware-provided memory map is modified afterwards by the kernel itself").
>
> >
> > Why does the firmware map support hotplug entries?
>
> I assume:
>
> The firmware memmap was added primarily for x86-64 kexec (and still, is
> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> get hotplugged on real HW, they get added to e820. Same applies to
> memory added via HyperV balloon (unless memory is unplugged via
> ballooning and you reboot ... the e820 is changed as well). I assume
> we wanted to be able to reflect that, to make kexec look like a real reboot.

I can at least say that this breakdown makes sense to me. Traditional
memory hotplug results in a permanent change to the raw firmware memory
map reported by the host at next reboot. These device-driver-owned
memory regions really want a hotplug policy per-kernel boot instance
and should fall back to the default reserved state at reboot (kexec or
otherwise). When I say hotplug-policy I mean whether the current
kernel wants to treat the device range as System RAM or leave it as
device-managed. The intent is that the follow-on kernel needs to
re-decide the device policy.

>
> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 18:43             ` David Hildenbrand
  2020-04-30 18:58               ` Dan Williams
@ 2020-04-30 22:24               ` Andrew Morton
  2020-05-01  9:34                 ` David Hildenbrand
  1 sibling, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2020-04-30 22:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eric W. Biederman, linux-kernel, linux-mm, virtio-dev,
	virtualization, linuxppc-dev, linux-acpi, linux-nvdimm,
	linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:

> > 
> > Why does the firmware map support hotplug entries?
> 
> I assume:
> 
> The firmware memmap was added primarily for x86-64 kexec (and still, is
> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> get hotplugged on real HW, they get added to e820. Same applies to
> memory added via HyperV balloon (unless memory is unplugged via
> ballooning and you reboot ... the e820 is changed as well). I assume
> we wanted to be able to reflect that, to make kexec look like a real reboot.
> 
> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> 
> 
> But I assume only Andrew can enlighten us.
> 
> @Andrew, any guidance here? Should we really add all memory to the
> firmware memmap, even if this contradicts with the existing
> documentation? (especially, if the actual firmware memmap will *not*
> contain that memory after a reboot)

For some reason that patch is misattributed - it was authored by
Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
a decade.  I looked through the email discussion from that time and I'm
not seeing anything useful.  But I wasn't able to locate Dave Hansen's
review comments.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-04-30 22:24               ` Andrew Morton
@ 2020-05-01  9:34                 ` David Hildenbrand
  2020-05-01 16:56                   ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-05-01  9:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric W. Biederman, linux-kernel, linux-mm, virtio-dev,
	virtualization, linuxppc-dev, linux-acpi, linux-nvdimm,
	linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On 01.05.20 00:24, Andrew Morton wrote:
> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>>>
>>> Why does the firmware map support hotplug entries?
>>
>> I assume:
>>
>> The firmware memmap was added primarily for x86-64 kexec (and still, is
>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
>> get hotplugged on real HW, they get added to e820. Same applies to
>> memory added via HyperV balloon (unless memory is unplugged via
>> ballooning and you reboot ... the e820 is changed as well). I assume
>> we wanted to be able to reflect that, to make kexec look like a real reboot.
>>
>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>>
>>
>> But I assume only Andrew can enlighten us.
>>
>> @Andrew, any guidance here? Should we really add all memory to the
>> firmware memmap, even if this contradicts with the existing
>> documentation? (especially, if the actual firmware memmap will *not*
>> contain that memory after a reboot)
> 
> For some reason that patch is misattributed - it was authored by
> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
> a decade.  I looked through the email discussion from that time and I'm
> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> review comments.

Okay, thanks for checking. I think the documentation from 2008 is pretty
clear about what has to be done here. I will add some of these details to
the patch description.

Also, now that I know that esp. kexec-tools already doesn't consider
dax/kmem memory properly (memory will not get dumped via kdump) and
won't really suffer from a name change in /proc/iomem, I will go back to
the MHP_DRIVER_MANAGED approach and
1. Don't create firmware memmap entries
2. Name the resource "System RAM (driver managed)"
3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
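
As a rough illustration of the three points above (a userspace model only, not kernel code and not the eventual patch), the decision could be sketched in C like this. The flag bit, the struct and the function name are all hypothetical; only the resource name string is taken from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit, modelled on the MHP_DRIVER_MANAGED idea. */
#define SKETCH_MHP_DRIVER_MANAGED (1u << 0)

/* What the hotplug path would do for one add_memory()-style request. */
struct sketch_hotplug_plan {
	bool create_firmware_memmap_entry;	/* point 1 */
	const char *resource_name;		/* point 2 */
	bool flag_resource_driver_managed;	/* point 3 */
};

static struct sketch_hotplug_plan sketch_plan_add_memory(unsigned int mhp_flags)
{
	struct sketch_hotplug_plan plan;
	bool driver_managed;

	/* This sketch knows exactly one flag; reject anything else. */
	assert((mhp_flags & ~SKETCH_MHP_DRIVER_MANAGED) == 0);

	driver_managed = (mhp_flags & SKETCH_MHP_DRIVER_MANAGED) != 0;

	plan.create_firmware_memmap_entry = !driver_managed;
	plan.resource_name = driver_managed ? "System RAM (driver managed)"
					    : "System RAM";
	plan.flag_resource_driver_managed = driver_managed;
	return plan;
}
```

Ordinary hotplugged DIMMs would pass no flag and keep today's behavior; only callers such as dax/kmem or virtio-mem would pass the driver-managed flag.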

This way, kernel users and user space can figure out that this memory
has different semantics and handle it accordingly - I think that was
what Eric was asking for.

Of course, open for suggestions.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01  9:34                 ` David Hildenbrand
@ 2020-05-01 16:56                   ` Dan Williams
  2020-05-01 17:21                     ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-05-01 16:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.05.20 00:24, Andrew Morton wrote:
> > On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
> >
> >>>
> >>> Why does the firmware map support hotplug entries?
> >>
> >> I assume:
> >>
> >> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >> get hotplugged on real HW, they get added to e820. Same applies to
> >> memory added via HyperV balloon (unless memory is unplugged via
> >> ballooning and you reboot ... the the e820 is changed as well). I assume
> >> we wanted to be able to reflect that, to make kexec look like a real reboot.
> >>
> >> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>
> >>
> >> But I assume only Andrew can enlighten us.
> >>
> >> @Andrew, any guidance here? Should we really add all memory to the
> >> firmware memmap, even if this contradicts with the existing
> >> documentation? (especially, if the actual firmware memmap will *not*
> >> contain that memory after a reboot)
> >
> > For some reason that patch is misattributed - it was authored by
> > Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
> > a decade.  I looked through the email discussion from that time and I'm
> > not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> > review comments.
>
> Okay, thanks for checking. I think the documentation from 2008 is pretty
> clear what has to be done here. I will add some of these details to the
> patch description.
>
> Also, now that I know that esp. kexec-tools already don't consider
> dax/kmem memory properly (memory will not get dumped via kdump) and
> won't really suffer from a name change in /proc/iomem, I will go back to
> the MHP_DRIVER_MANAGED approach and
> 1. Don't create firmware memmap entries
> 2. Name the resource "System RAM (driver managed)"
> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>
> This way, kernel users and user space can figure out that this memory
> has different semantics and handle it accordingly - I think that was
> what Eric was asking for.
>
> Of course, open for suggestions.

I'm still more of a fan of this being communicated by "System RAM"
being parented, especially because that tells you something about how
the memory is driver-managed and which mechanism might be in play.
What about adding an optional /sys/firmware/memmap/X/parent attribute?
This lets tooling check via that interface whether it cares, and lets it
look up the related infrastructure to interact with if it would do
something different for virtio-mem vs. dax/kmem.
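
As a rough illustration of what tooling would have to do under this
proposal, here is a minimal sketch that walks the numbered entries below
/sys/firmware/memmap and reads their "start", "end" and "type" files. The
"parent" file is only the attribute proposed above; it is hypothetical and
does not exist today, so the sketch treats it as optional. The function and
helper names are made up for the example.

```python
import os

def _read_attr(entry_dir, name):
    """Return the stripped contents of one sysfs attribute, or None if absent."""
    try:
        with open(os.path.join(entry_dir, name)) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None

def read_firmware_memmap(root="/sys/firmware/memmap"):
    """Return a list of dicts, one per numbered entry below root."""
    entries = []
    # Entries are directories named 0, 1, 2, ...; sort them numerically.
    for entry in sorted((e for e in os.listdir(root) if e.isdigit()), key=int):
        entry_dir = os.path.join(root, entry)
        entries.append({
            "start": int(_read_attr(entry_dir, "start"), 16),
            "end": int(_read_attr(entry_dir, "end"), 16),
            "type": _read_attr(entry_dir, "type"),
            # Hypothetical: "parent" is only the attribute proposed above,
            # so on existing kernels this is always None.
            "parent": _read_attr(entry_dir, "parent"),
        })
    return entries
```

A tool that does not care about the parent can ignore the extra key; one
that does can use it to find the owning driver.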

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 16:56                   ` Dan Williams
@ 2020-05-01 17:21                     ` David Hildenbrand
  2020-05-01 17:39                       ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-05-01 17:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On 01.05.20 18:56, Dan Williams wrote:
> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.05.20 00:24, Andrew Morton wrote:
>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
>>>
>>>>>
>>>>> Why does the firmware map support hotplug entries?
>>>>
>>>> I assume:
>>>>
>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
>>>> get hotplugged on real HW, they get added to e820. Same applies to
>>>> memory added via HyperV balloon (unless memory is unplugged via
>>>> ballooning and you reboot ... the the e820 is changed as well). I assume
>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
>>>>
>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>>>>
>>>>
>>>> But I assume only Andrew can enlighten us.
>>>>
>>>> @Andrew, any guidance here? Should we really add all memory to the
>>>> firmware memmap, even if this contradicts with the existing
>>>> documentation? (especially, if the actual firmware memmap will *not*
>>>> contain that memory after a reboot)
>>>
>>> For some reason that patch is misattributed - it was authored by
>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
>>> a decade.  I looked through the email discussion from that time and I'm
>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
>>> review comments.
>>
>> Okay, thanks for checking. I think the documentation from 2008 is pretty
>> clear what has to be done here. I will add some of these details to the
>> patch description.
>>
>> Also, now that I know that esp. kexec-tools already don't consider
>> dax/kmem memory properly (memory will not get dumped via kdump) and
>> won't really suffer from a name change in /proc/iomem, I will go back to
>> the MHP_DRIVER_MANAGED approach and
>> 1. Don't create firmware memmap entries
>> 2. Name the resource "System RAM (driver managed)"
>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>>
>> This way, kernel users and user space can figure out that this memory
>> has different semantics and handle it accordingly - I think that was
>> what Eric was asking for.
>>
>> Of course, open for suggestions.
> 
> I'm still more of a fan of this being communicated by "System RAM"

I mentioned somewhere in this thread that "System RAM" inside a
hierarchy (like dax/kmem) will already be basically ignored by
kexec-tools. So, placing it inside a hierarchy already makes it look
special.

But after all, as we have to change kexec-tools either way, we can
directly go ahead and flag it properly as special (in case there will
ever be other cases where we could no longer distinguish it).

> being parented especially because that tells you something about how
> the memory is driver-managed and which mechanism might be in play.

That could be communicated to some degree via the resource hierarchy.

E.g.,

            [root@localhost ~]# cat /proc/iomem
            ...
            140000000-33fffffff : Persistent Memory
              140000000-1481fffff : namespace0.0
              150000000-33fffffff : dax0.0
                150000000-33fffffff : System RAM (driver managed)

vs.

           :/# cat /proc/iomem
            [...]
            140000000-333ffffff : virtio-mem (virtio0)
              140000000-147ffffff : System RAM (driver managed)
              148000000-14fffffff : System RAM (driver managed)
              150000000-157ffffff : System RAM (driver managed)

Good enough for my taste.
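
To make the point about the hierarchy concrete, here is a minimal sketch
of how a user-space tool could recover the parent of each driver-managed
range from /proc/iomem-style text. It infers nesting purely from
indentation. The function names are invented for this example, and the
resource name is the one proposed in this mail, not something current
kernels are guaranteed to emit.

```python
import re

# One /proc/iomem line: optional indent, "start-end : name" (hex addresses).
LINE_RE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")

def parse_iomem(text):
    """Return a list of (start, end, name, parent_names) tuples."""
    entries = []
    stack = []  # (indent, name) of the currently open ancestors
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip shell prompts, "..." and blank lines
        indent = len(m.group(1))
        start, end = int(m.group(2), 16), int(m.group(3), 16)
        name = m.group(4).strip()
        # Close every ancestor that is not less indented than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        entries.append((start, end, name, [n for _, n in stack]))
        stack.append((indent, name))
    return entries

def driver_managed(entries):
    """Yield (start, end, immediate_parent) for driver-managed System RAM."""
    for start, end, name, parents in entries:
        if name == "System RAM (driver managed)":
            yield start, end, (parents[-1] if parents else None)

SAMPLE = """\
140000000-33fffffff : Persistent Memory
  140000000-1481fffff : namespace0.0
  150000000-33fffffff : dax0.0
    150000000-33fffffff : System RAM (driver managed)
"""

for start, end, parent in driver_managed(parse_iomem(SAMPLE)):
    print(f"{start:x}-{end:x} managed via {parent}")
```

With the virtio-mem layout, the same code would report "virtio-mem
(virtio0)" as the parent of each range, which is the distinction between
the two mechanisms discussed here.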

> What about adding an optional /sys/firmware/memmap/X/parent attribute.

I really don't want any firmware memmap entries for something that is
not part of the firmware-provided memmap. In addition,
/sys/firmware/memmap/ is still a fairly x86_64-specific thing. Only mips
and two arm configs enable it at all.

So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 17:21                     ` David Hildenbrand
@ 2020-05-01 17:39                       ` Dan Williams
  2020-05-01 17:45                         ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-05-01 17:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.05.20 18:56, Dan Williams wrote:
> > On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 01.05.20 00:24, Andrew Morton wrote:
> >>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>>>>
> >>>>> Why does the firmware map support hotplug entries?
> >>>>
> >>>> I assume:
> >>>>
> >>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>> ballooning and you reboot ... the the e820 is changed as well). I assume
> >>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
> >>>>
> >>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>
> >>>>
> >>>> But I assume only Andrew can enlighten us.
> >>>>
> >>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>> firmware memmap, even if this contradicts with the existing
> >>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>> contain that memory after a reboot)
> >>>
> >>> For some reason that patch is misattributed - it was authored by
> >>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
> >>> a decade.  I looked through the email discussion from that time and I'm
> >>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>> review comments.
> >>
> >> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >> clear what has to be done here. I will add some of these details to the
> >> patch description.
> >>
> >> Also, now that I know that esp. kexec-tools already don't consider
> >> dax/kmem memory properly (memory will not get dumped via kdump) and
> >> won't really suffer from a name change in /proc/iomem, I will go back to
> >> the MHP_DRIVER_MANAGED approach and
> >> 1. Don't create firmware memmap entries
> >> 2. Name the resource "System RAM (driver managed)"
> >> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>
> >> This way, kernel users and user space can figure out that this memory
> >> has different semantics and handle it accordingly - I think that was
> >> what Eric was asking for.
> >>
> >> Of course, open for suggestions.
> >
> > I'm still more of a fan of this being communicated by "System RAM"
>
> I was mentioning somewhere in this thread that "System RAM" inside a
> hierarchy (like dax/kmem) will already be basically ignored by
> kexec-tools. So, placing it inside a hierarchy already makes it look
> special already.
>
> But after all, as we have to change kexec-tools either way, we can
> directly go ahead and flag it properly as special (in case there will
> ever be other cases where we could no longer distinguish it).
>
> > being parented especially because that tells you something about how
> > the memory is driver-managed and which mechanism might be in play.
>
> The could be communicated to some degree via the resource hierarchy.
>
> E.g.,
>
>             [root@localhost ~]# cat /proc/iomem
>             ...
>             140000000-33fffffff : Persistent Memory
>               140000000-1481fffff : namespace0.0
>               150000000-33fffffff : dax0.0
>                 150000000-33fffffff : System RAM (driver managed)
>
> vs.
>
>            :/# cat /proc/iomem
>             [...]
>             140000000-333ffffff : virtio-mem (virtio0)
>               140000000-147ffffff : System RAM (driver managed)
>               148000000-14fffffff : System RAM (driver managed)
>               150000000-157ffffff : System RAM (driver managed)
>
> Good enough for my taste.
>
> > What about adding an optional /sys/firmware/memmap/X/parent attribute.
>
> I really don't want any firmware memmap entries for something that is
> not part of the firmware provided memmap. In addition,
> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
> and two arm configs enable it at all.
>
> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.

I think that's a policy decision and policy decisions do not belong in
the kernel. Give the tooling the opportunity to decide whether System
RAM stays that way over a kexec. The parenthetical reference otherwise
looks out of place to me in the /proc/iomem output. What makes it
"driver managed" is how the kernel handles it, not how the kernel
names it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 17:39                       ` Dan Williams
@ 2020-05-01 17:45                         ` David Hildenbrand
  2020-05-01 17:51                           ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-05-01 17:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On 01.05.20 19:39, Dan Williams wrote:
> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.05.20 18:56, Dan Williams wrote:
>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 01.05.20 00:24, Andrew Morton wrote:
>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>>>>
>>>>>>> Why does the firmware map support hotplug entries?
>>>>>>
>>>>>> I assume:
>>>>>>
>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
>>>>>> memory added via HyperV balloon (unless memory is unplugged via
>>>>>> ballooning and you reboot ... the the e820 is changed as well). I assume
>>>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
>>>>>>
>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>>>>>>
>>>>>>
>>>>>> But I assume only Andrew can enlighten us.
>>>>>>
>>>>>> @Andrew, any guidance here? Should we really add all memory to the
>>>>>> firmware memmap, even if this contradicts with the existing
>>>>>> documentation? (especially, if the actual firmware memmap will *not*
>>>>>> contain that memory after a reboot)
>>>>>
>>>>> For some reason that patch is misattributed - it was authored by
>>>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
>>>>> a decade.  I looked through the email discussion from that time and I'm
>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
>>>>> review comments.
>>>>
>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
>>>> clear what has to be done here. I will add some of these details to the
>>>> patch description.
>>>>
>>>> Also, now that I know that esp. kexec-tools already don't consider
>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
>>>> won't really suffer from a name change in /proc/iomem, I will go back to
>>>> the MHP_DRIVER_MANAGED approach and
>>>> 1. Don't create firmware memmap entries
>>>> 2. Name the resource "System RAM (driver managed)"
>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>>>>
>>>> This way, kernel users and user space can figure out that this memory
>>>> has different semantics and handle it accordingly - I think that was
>>>> what Eric was asking for.
>>>>
>>>> Of course, open for suggestions.
>>>
>>> I'm still more of a fan of this being communicated by "System RAM"
>>
>> I was mentioning somewhere in this thread that "System RAM" inside a
>> hierarchy (like dax/kmem) will already be basically ignored by
>> kexec-tools. So, placing it inside a hierarchy already makes it look
>> special already.
>>
>> But after all, as we have to change kexec-tools either way, we can
>> directly go ahead and flag it properly as special (in case there will
>> ever be other cases where we could no longer distinguish it).
>>
>>> being parented especially because that tells you something about how
>>> the memory is driver-managed and which mechanism might be in play.
>>
>> The could be communicated to some degree via the resource hierarchy.
>>
>> E.g.,
>>
>>             [root@localhost ~]# cat /proc/iomem
>>             ...
>>             140000000-33fffffff : Persistent Memory
>>               140000000-1481fffff : namespace0.0
>>               150000000-33fffffff : dax0.0
>>                 150000000-33fffffff : System RAM (driver managed)
>>
>> vs.
>>
>>            :/# cat /proc/iomem
>>             [...]
>>             140000000-333ffffff : virtio-mem (virtio0)
>>               140000000-147ffffff : System RAM (driver managed)
>>               148000000-14fffffff : System RAM (driver managed)
>>               150000000-157ffffff : System RAM (driver managed)
>>
>> Good enough for my taste.
>>
>>> What about adding an optional /sys/firmware/memmap/X/parent attribute.
>>
>> I really don't want any firmware memmap entries for something that is
>> not part of the firmware provided memmap. In addition,
>> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
>> and two arm configs enable it at all.
>>
>> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
> 
> I think that's a policy decision and policy decisions do not belong in
> the kernel. Give the tooling the opportunity to decide whether System
> RAM stays that way over a kexec. The parenthetical reference otherwise
> looks out of place to me in the /proc/iomem output. What makes it
> "driver managed" is how the kernel handles it, not how the kernel
> names it.

At least, virtio-mem is different. It really *has to be handled* by the
driver. This is not a policy. It's how it works.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 17:45                         ` David Hildenbrand
@ 2020-05-01 17:51                           ` David Hildenbrand
  2020-05-01 18:03                             ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-05-01 17:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On 01.05.20 19:45, David Hildenbrand wrote:
> On 01.05.20 19:39, Dan Williams wrote:
>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 01.05.20 18:56, Dan Williams wrote:
>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 01.05.20 00:24, Andrew Morton wrote:
>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>>>>
>>>>>>>> Why does the firmware map support hotplug entries?
>>>>>>>
>>>>>>> I assume:
>>>>>>>
>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
>>>>>>> ballooning and you reboot ... the the e820 is changed as well). I assume
>>>>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
>>>>>>>
>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>>>>>>>
>>>>>>>
>>>>>>> But I assume only Andrew can enlighten us.
>>>>>>>
>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
>>>>>>> firmware memmap, even if this contradicts with the existing
>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
>>>>>>> contain that memory after a reboot)
>>>>>>
>>>>>> For some reason that patch is misattributed - it was authored by
>>>>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
>>>>>> a decade.  I looked through the email discussion from that time and I'm
>>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
>>>>>> review comments.
>>>>>
>>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
>>>>> clear what has to be done here. I will add some of these details to the
>>>>> patch description.
>>>>>
>>>>> Also, now that I know that esp. kexec-tools already don't consider
>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
>>>>> won't really suffer from a name change in /proc/iomem, I will go back to
>>>>> the MHP_DRIVER_MANAGED approach and
>>>>> 1. Don't create firmware memmap entries
>>>>> 2. Name the resource "System RAM (driver managed)"
>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>>>>>
>>>>> This way, kernel users and user space can figure out that this memory
>>>>> has different semantics and handle it accordingly - I think that was
>>>>> what Eric was asking for.
>>>>>
>>>>> Of course, open for suggestions.
>>>>
>>>> I'm still more of a fan of this being communicated by "System RAM"
>>>
>>> I was mentioning somewhere in this thread that "System RAM" inside a
>>> hierarchy (like dax/kmem) will already be basically ignored by
>>> kexec-tools. So, placing it inside a hierarchy already makes it look
>>> special already.
>>>
>>> But after all, as we have to change kexec-tools either way, we can
>>> directly go ahead and flag it properly as special (in case there will
>>> ever be other cases where we could no longer distinguish it).
>>>
>>>> being parented especially because that tells you something about how
>>>> the memory is driver-managed and which mechanism might be in play.
>>>
>>> The could be communicated to some degree via the resource hierarchy.
>>>
>>> E.g.,
>>>
>>>             [root@localhost ~]# cat /proc/iomem
>>>             ...
>>>             140000000-33fffffff : Persistent Memory
>>>               140000000-1481fffff : namespace0.0
>>>               150000000-33fffffff : dax0.0
>>>                 150000000-33fffffff : System RAM (driver managed)
>>>
>>> vs.
>>>
>>>            :/# cat /proc/iomem
>>>             [...]
>>>             140000000-333ffffff : virtio-mem (virtio0)
>>>               140000000-147ffffff : System RAM (driver managed)
>>>               148000000-14fffffff : System RAM (driver managed)
>>>               150000000-157ffffff : System RAM (driver managed)
>>>
>>> Good enough for my taste.
>>>
>>>> What about adding an optional /sys/firmware/memmap/X/parent attribute.
>>>
>>> I really don't want any firmware memmap entries for something that is
>>> not part of the firmware provided memmap. In addition,
>>> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
>>> and two arm configs enable it at all.
>>>
>>> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
>>
>> I think that's a policy decision and policy decisions do not belong in
>> the kernel. Give the tooling the opportunity to decide whether System
>> RAM stays that way over a kexec. The parenthetical reference otherwise
>> looks out of place to me in the /proc/iomem output. What makes it
>> "driver managed" is how the kernel handles it, not how the kernel
>> names it.
> 
> At least, virtio-mem is different. It really *has to be handled* by the
> driver. This is not a policy. It's how it works.
> 

Oh, and I don't see why "System RAM (driver managed)" would hinder any
policy in user space from still doing what it thinks is the right thing
to do (e.g., for dax).

"System RAM (driver managed)" would mean: Memory is not part of the raw
firmware memmap. It was detected and added by a driver. Handle with
care, this is special.
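
A minimal sketch of how user space could still apply its own policy on
top of that name. The decision table is an assumption made for
illustration, not what kexec-tools does today: plain "System RAM" is both
handed to the kexec'ed kernel and dumped, while driver-managed memory is
left for the driver to re-add but is still included in a dump. A tool is
free to pick a different table, e.g., for dax.

```python
DRIVER_MANAGED = "System RAM (driver managed)"

def kexec_policy(name):
    """Return (pass_in_memmap, include_in_dump) for a /proc/iomem name.

    Assumed policy for this example only; real tools may decide otherwise.
    """
    if name == "System RAM":
        return (True, True)
    if name == DRIVER_MANAGED:
        # Not part of the raw firmware memmap: let the driver in the new
        # kernel detect and add it again, but do not lose it in a dump.
        return (False, True)
    # Anything else (device ranges, reserved areas) is left alone here.
    return (False, False)
```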

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 17:51                           ` David Hildenbrand
@ 2020-05-01 18:03                             ` Dan Williams
  2020-05-01 18:14                               ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-05-01 18:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On Fri, May 1, 2020 at 10:51 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.05.20 19:45, David Hildenbrand wrote:
> > On 01.05.20 19:39, Dan Williams wrote:
> >> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 01.05.20 18:56, Dan Williams wrote:
> >>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
> >>>>>>
> >>>>>>>>
> >>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>
> >>>>>>> I assume:
> >>>>>>>
> >>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>> ballooning and you reboot ... the the e820 is changed as well). I assume
> >>>>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
> >>>>>>>
> >>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>
> >>>>>>>
> >>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>
> >>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>> contain that memory after a reboot)
> >>>>>>
> >>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
> >>>>>> a decade.  I looked through the email discussion from that time and I'm
> >>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>>>>> review comments.
> >>>>>
> >>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >>>>> clear what has to be done here. I will add some of these details to the
> >>>>> patch description.
> >>>>>
> >>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>> won't really suffer from a name change in /proc/iomem, I will go back to
> >>>>> the MHP_DRIVER_MANAGED approach and
> >>>>> 1. Don't create firmware memmap entries
> >>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>
> >>>>> This way, kernel users and user space can figure out that this memory
> >>>>> has different semantics and handle it accordingly - I think that was
> >>>>> what Eric was asking for.
> >>>>>
> >>>>> Of course, open for suggestions.
> >>>>
> >>>> I'm still more of a fan of this being communicated by "System RAM"
> >>>
> >>> I was mentioning somewhere in this thread that "System RAM" inside a
> >>> hierarchy (like dax/kmem) will already be basically ignored by
> >>> kexec-tools. So, placing it inside a hierarchy already makes it look
> >>> special already.
> >>>
> >>> But after all, as we have to change kexec-tools either way, we can
> >>> directly go ahead and flag it properly as special (in case there will
> >>> ever be other cases where we could no longer distinguish it).
> >>>
> >>>> being parented especially because that tells you something about how
> >>>> the memory is driver-managed and which mechanism might be in play.
> >>>
> >>> The could be communicated to some degree via the resource hierarchy.
> >>>
> >>> E.g.,
> >>>
> >>>             [root@localhost ~]# cat /proc/iomem
> >>>             ...
> >>>             140000000-33fffffff : Persistent Memory
> >>>               140000000-1481fffff : namespace0.0
> >>>               150000000-33fffffff : dax0.0
> >>>                 150000000-33fffffff : System RAM (driver managed)
> >>>
> >>> vs.
> >>>
> >>>            :/# cat /proc/iomem
> >>>             [...]
> >>>             140000000-333ffffff : virtio-mem (virtio0)
> >>>               140000000-147ffffff : System RAM (driver managed)
> >>>               148000000-14fffffff : System RAM (driver managed)
> >>>               150000000-157ffffff : System RAM (driver managed)
> >>>
> >>> Good enough for my taste.
> >>>
> >>>> What about adding an optional /sys/firmware/memmap/X/parent attribute.
> >>>
> >>> I really don't want any firmware memmap entries for something that is
> >>> not part of the firmware provided memmap. In addition,
> >>> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
> >>> and two arm configs enable it at all.
> >>>
> >>> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
> >>
> >> I think that's a policy decision and policy decisions do not belong in
> >> the kernel. Give the tooling the opportunity to decide whether System
> >> RAM stays that way over a kexec. The parenthetical reference otherwise
> >> looks out of place to me in the /proc/iomem output. What makes it
> >> "driver managed" is how the kernel handles it, not how the kernel
> >> names it.
> >
> > At least, virtio-mem is different. It really *has to be handled* by the
> > driver. This is not a policy. It's how it works.

...but that's not necessarily how dax/kmem works.

> >
>
> Oh, and I don't see why "System RAM (driver managed)" would hinder any
> policy in user case to still do what it thinks is the right thing to do
> (e.g., for dax).
>
> "System RAM (driver managed)" would mean: Memory is not part of the raw
> firmware memmap. It was detected and added by a driver. Handle with
> care, this is special.

Oh, no, I was more reacting to your "don't update
/sys/firmware/memmap for the (driver managed) range" choice as being a
policy decision. Otherwise, it feels to me that "System RAM (driver
managed)" adds confusion for casual users of /proc/iomem, and clued-in
tools already have the parent association to decide policy.


* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 18:03                             ` Dan Williams
@ 2020-05-01 18:14                               ` David Hildenbrand
  2020-05-01 18:43                                 ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-05-01 18:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On 01.05.20 20:03, Dan Williams wrote:
> On Fri, May 1, 2020 at 10:51 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.05.20 19:45, David Hildenbrand wrote:
>>> On 01.05.20 19:39, Dan Williams wrote:
>>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 01.05.20 18:56, Dan Williams wrote:
>>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
>>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Why does the firmware map support hotplug entries?
>>>>>>>>>
>>>>>>>>> I assume:
>>>>>>>>>
>>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
>>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
>>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
>>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
>>>>>>>>> ballooning and you reboot ... then the e820 is changed as well). I assume
>>>>>>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
>>>>>>>>>
>>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> But I assume only Andrew can enlighten us.
>>>>>>>>>
>>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
>>>>>>>>> firmware memmap, even if this contradicts with the existing
>>>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
>>>>>>>>> contain that memory after a reboot)
>>>>>>>>
>>>>>>>> For some reason that patch is misattributed - it was authored by
>>>>>>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
>>>>>>>> a decade.  I looked through the email discussion from that time and I'm
>>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
>>>>>>>> review comments.
>>>>>>>
>>>>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
>>>>>>> clear what has to be done here. I will add some of these details to the
>>>>>>> patch description.
>>>>>>>
>>>>>>> Also, now that I know that esp. kexec-tools already don't consider
>>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
>>>>>>> won't really suffer from a name change in /proc/iomem, I will go back to
>>>>>>> the MHP_DRIVER_MANAGED approach and
>>>>>>> 1. Don't create firmware memmap entries
>>>>>>> 2. Name the resource "System RAM (driver managed)"
>>>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>>>>>>>
>>>>>>> This way, kernel users and user space can figure out that this memory
>>>>>>> has different semantics and handle it accordingly - I think that was
>>>>>>> what Eric was asking for.
>>>>>>>
>>>>>>> Of course, open for suggestions.
>>>>>>
>>>>>> I'm still more of a fan of this being communicated by "System RAM"
>>>>>
>>>>> I was mentioning somewhere in this thread that "System RAM" inside a
>>>>> hierarchy (like dax/kmem) will already be basically ignored by
>>>>> kexec-tools. So, placing it inside a hierarchy already makes it look
>>>>> special.
>>>>>
>>>>> But after all, as we have to change kexec-tools either way, we can
>>>>> directly go ahead and flag it properly as special (in case there will
>>>>> ever be other cases where we could no longer distinguish it).
>>>>>
>>>>>> being parented especially because that tells you something about how
>>>>>> the memory is driver-managed and which mechanism might be in play.
>>>>>
>>>>> That could be communicated to some degree via the resource hierarchy.
>>>>>
>>>>> E.g.,
>>>>>
>>>>>             [root@localhost ~]# cat /proc/iomem
>>>>>             ...
>>>>>             140000000-33fffffff : Persistent Memory
>>>>>               140000000-1481fffff : namespace0.0
>>>>>               150000000-33fffffff : dax0.0
>>>>>                 150000000-33fffffff : System RAM (driver managed)
>>>>>
>>>>> vs.
>>>>>
>>>>>            :/# cat /proc/iomem
>>>>>             [...]
>>>>>             140000000-333ffffff : virtio-mem (virtio0)
>>>>>               140000000-147ffffff : System RAM (driver managed)
>>>>>               148000000-14fffffff : System RAM (driver managed)
>>>>>               150000000-157ffffff : System RAM (driver managed)
>>>>>
>>>>> Good enough for my taste.
>>>>>
>>>>>> What about adding an optional /sys/firmware/memmap/X/parent attribute.
>>>>>
>>>>> I really don't want any firmware memmap entries for something that is
>>>>> not part of the firmware provided memmap. In addition,
>>>>> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
>>>>> and two arm configs enable it at all.
>>>>>
>>>>> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
>>>>
>>>> I think that's a policy decision and policy decisions do not belong in
>>>> the kernel. Give the tooling the opportunity to decide whether System
>>>> RAM stays that way over a kexec. The parenthetical reference otherwise
>>>> looks out of place to me in the /proc/iomem output. What makes it
>>>> "driver managed" is how the kernel handles it, not how the kernel
>>>> names it.
>>>
>>> At least, virtio-mem is different. It really *has to be handled* by the
>>> driver. This is not a policy. It's how it works.
> 
> ...but that's not necessarily how dax/kmem works.
> 

Yes, and user space could still take that memory and add it to the
firmware memmap if it really wants to. It knows that it is special. It
can figure out that it belongs to a dax device using /proc/iomem.
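
As a rough illustration of that lookup, here is a minimal user-space
sketch in Python. It is not what kexec-tools does today. It assumes the
"System RAM (driver managed)" name proposed in this thread and the
two-space-per-level indentation of /proc/iomem; parse_iomem() and
driver_managed() are hypothetical helper names. Note that unprivileged
readers of /proc/iomem typically see zeroed addresses, so a real tool
would need to run as root.

```python
import re

# One /proc/iomem line: optional indentation, then "start-end : name".
LINE_RE = re.compile(r"^(\s*)([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def parse_iomem(text):
    """Return a list of (start, end, name, ancestors) tuples.

    ancestors lists the enclosing resource names, outermost first,
    derived purely from the indentation of each line.
    """
    entries = []
    stack = []  # (indent, name) of the currently open ancestors
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip "..." placeholders and blank lines
        indent = len(m.group(1))
        # Close every ancestor that is not shallower than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        name = m.group(4).strip()
        entries.append((int(m.group(2), 16), int(m.group(3), 16), name,
                        [n for _, n in stack]))
        stack.append((indent, name))
    return entries


def driver_managed(entries):
    """Yield (start, end, ancestors) for driver-managed System RAM."""
    for start, end, name, ancestors in entries:
        if name == "System RAM (driver managed)":
            yield start, end, ancestors


# Sample taken from the dax/kmem example quoted in this thread.
SAMPLE = """\
140000000-33fffffff : Persistent Memory
  140000000-1481fffff : namespace0.0
  150000000-33fffffff : dax0.0
    150000000-33fffffff : System RAM (driver managed)
"""

if __name__ == "__main__":
    for start, end, ancestors in driver_managed(parse_iomem(SAMPLE)):
        print(f"{start:x}-{end:x} under {' / '.join(ancestors)}")
```

The ancestor list is what lets a tool tell a dax-backed range from a
virtio-mem one without any extra sysfs attribute.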

>>>
>>
>> Oh, and I don't see why "System RAM (driver managed)" would hinder any
>> policy in user space to still do what it thinks is the right thing to do
>> (e.g., for dax).
>>
>> "System RAM (driver managed)" would mean: Memory is not part of the raw
>> firmware memmap. It was detected and added by a driver. Handle with
>> care, this is special.
> 
> Oh, no, I was more reacting to your, "don't update
> /sys/firmware/memmap for the (driver managed) range" choice as being a
> policy decision. It otherwise feels to me "System RAM (driver
> managed)" adds confusion for casual users of /proc/iomem and for clued
> in tools they have the parent association to decide policy.

Not sure if I understand correctly, so bear with me :).

Adding or not adding stuff to /sys/firmware/memmap is not a policy
decision. If it's not part of the raw firmware-provided memmap, it does
not belong in /sys/firmware/memmap. That's what the documentation
from 2008 tells us.
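
To make the distinction concrete, a tool could check whether a given
range is covered by the firmware-provided map at all. The following is
a minimal Python sketch, assuming the documented sysfs layout of one
numbered directory per entry with "start", "end" and "type" files, and
assuming "end" is inclusive. read_firmware_memmap() and
in_firmware_memmap() are hypothetical helper names, and the root path
is a parameter so the code can be exercised against a fake tree.

```python
import os


def read_firmware_memmap(root="/sys/firmware/memmap"):
    """Return a sorted list of (start, end, type) from the firmware memmap."""
    entries = []
    for entry in os.listdir(root):
        path = os.path.join(root, entry)
        if not os.path.isdir(path):
            continue

        def read(attr):
            # Each attribute is a one-line text file, e.g. "0x100000".
            with open(os.path.join(path, attr)) as f:
                return f.read().strip()

        entries.append((int(read("start"), 16), int(read("end"), 16),
                        read("type")))
    return sorted(entries)


def in_firmware_memmap(entries, start, end):
    """True if [start, end] lies fully inside a single firmware entry."""
    return any(s <= start and end <= e for s, e, _ in entries)
```

With the behaviour argued for here, in_firmware_memmap() would return
False for a dax/kmem or virtio-mem range, because no entry was ever
created for it.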

Again, my point is that we don't create /sys/firmware/memmap entries for
dax/kmem and virtio-mem memory - because it's not part of the raw
firmware-provided memmap. I was not suggesting adding something like
"System RAM (driver managed)" there instead; maybe that part was confusing.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 18:14                               ` David Hildenbrand
@ 2020-05-01 18:43                                 ` Dan Williams
  2020-05-01 19:17                                   ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-05-01 18:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On Fri, May 1, 2020 at 11:14 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.05.20 20:03, Dan Williams wrote:
> > On Fri, May 1, 2020 at 10:51 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 01.05.20 19:45, David Hildenbrand wrote:
> >>> On 01.05.20 19:39, Dan Williams wrote:
> >>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>
> >>>>> On 01.05.20 18:56, Dan Williams wrote:
> >>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>>>
> >>>>>>>>> I assume:
> >>>>>>>>>
> >>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>>>> ballooning and you reboot ... then the e820 is changed as well). I assume
> >>>>>>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
> >>>>>>>>>
> >>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>>>
> >>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>>>> contain that memory after a reboot)
> >>>>>>>>
> >>>>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
> >>>>>>>> a decade.  I looked through the email discussion from that time and I'm
> >>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>>>>>>> review comments.
> >>>>>>>
> >>>>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >>>>>>> clear what has to be done here. I will add some of these details to the
> >>>>>>> patch description.
> >>>>>>>
> >>>>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>>>> won't really suffer from a name change in /proc/iomem, I will go back to
> >>>>>>> the MHP_DRIVER_MANAGED approach and
> >>>>>>> 1. Don't create firmware memmap entries
> >>>>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>>>
> >>>>>>> This way, kernel users and user space can figure out that this memory
> >>>>>>> has different semantics and handle it accordingly - I think that was
> >>>>>>> what Eric was asking for.
> >>>>>>>
> >>>>>>> Of course, open for suggestions.
> >>>>>>
> >>>>>> I'm still more of a fan of this being communicated by "System RAM"
> >>>>>
> >>>>> I was mentioning somewhere in this thread that "System RAM" inside a
> >>>>> hierarchy (like dax/kmem) will already be basically ignored by
> >>>>> kexec-tools. So, placing it inside a hierarchy already makes it look
> >>>>> special.
> >>>>>
> >>>>> But after all, as we have to change kexec-tools either way, we can
> >>>>> directly go ahead and flag it properly as special (in case there will
> >>>>> ever be other cases where we could no longer distinguish it).
> >>>>>
> >>>>>> being parented especially because that tells you something about how
> >>>>>> the memory is driver-managed and which mechanism might be in play.
> >>>>>
> >>>>> That could be communicated to some degree via the resource hierarchy.
> >>>>>
> >>>>> E.g.,
> >>>>>
> >>>>>             [root@localhost ~]# cat /proc/iomem
> >>>>>             ...
> >>>>>             140000000-33fffffff : Persistent Memory
> >>>>>               140000000-1481fffff : namespace0.0
> >>>>>               150000000-33fffffff : dax0.0
> >>>>>                 150000000-33fffffff : System RAM (driver managed)
> >>>>>
> >>>>> vs.
> >>>>>
> >>>>>            :/# cat /proc/iomem
> >>>>>             [...]
> >>>>>             140000000-333ffffff : virtio-mem (virtio0)
> >>>>>               140000000-147ffffff : System RAM (driver managed)
> >>>>>               148000000-14fffffff : System RAM (driver managed)
> >>>>>               150000000-157ffffff : System RAM (driver managed)
> >>>>>
> >>>>> Good enough for my taste.
> >>>>>
> >>>>>> What about adding an optional /sys/firmware/memmap/X/parent attribute.
> >>>>>
> >>>>> I really don't want any firmware memmap entries for something that is
> >>>>> not part of the firmware provided memmap. In addition,
> >>>>> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
> >>>>> and two arm configs enable it at all.
> >>>>>
> >>>>> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
> >>>>
> >>>> I think that's a policy decision and policy decisions do not belong in
> >>>> the kernel. Give the tooling the opportunity to decide whether System
> >>>> RAM stays that way over a kexec. The parenthetical reference otherwise
> >>>> looks out of place to me in the /proc/iomem output. What makes it
> >>>> "driver managed" is how the kernel handles it, not how the kernel
> >>>> names it.
> >>>
> >>> At least, virtio-mem is different. It really *has to be handled* by the
> >>> driver. This is not a policy. It's how it works.
> >
> > ...but that's not necessarily how dax/kmem works.
> >
>
> Yes, and user space could still take that memory and add it to the
> firmware memmap if it really wants to. It knows that it is special. It
> can figure out that it belongs to a dax device using /proc/iomem.
>
> >>>
> >>
> >> Oh, and I don't see why "System RAM (driver managed)" would hinder any
> >> policy in user space to still do what it thinks is the right thing to do
> >> (e.g., for dax).
> >>
> >> "System RAM (driver managed)" would mean: Memory is not part of the raw
> >> firmware memmap. It was detected and added by a driver. Handle with
> >> care, this is special.
> >
> > Oh, no, I was more reacting to your, "don't update
> > /sys/firmware/memmap for the (driver managed) range" choice as being a
> > policy decision. It otherwise feels to me "System RAM (driver
> > managed)" adds confusion for casual users of /proc/iomem and for clued
> > in tools they have the parent association to decide policy.
>
> Not sure if I understand correctly, so bear with me :).
>
> Adding or not adding stuff to /sys/firmware/memmap is not a policy
> decision. If it's not part of the raw firmware-provided memmap, it has
> nothing to do in /sys/firmware/memmap. That's what the documentation
> from 2008 tells us.

It just occurs to me that there are valid cases for both wanting to
start over with driver managed memory with a kexec and keeping it in
the map. Consider the case of EFI Special Purpose (SP) Memory that is
marked EFI Conventional Memory with the SP attribute. In that case the
firmware memory map marked it as conventional RAM, but the kernel
optionally marks it as System RAM vs Soft Reserved. The 2008 patch
simply does not consider that case. I'm not sure strict textualism
works for coding decisions.


* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 18:43                                 ` Dan Williams
@ 2020-05-01 19:17                                   ` David Hildenbrand
  2020-05-01 20:12                                     ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-05-01 19:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On 01.05.20 20:43, Dan Williams wrote:
> On Fri, May 1, 2020 at 11:14 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.05.20 20:03, Dan Williams wrote:
>>> On Fri, May 1, 2020 at 10:51 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 01.05.20 19:45, David Hildenbrand wrote:
>>>>> On 01.05.20 19:39, Dan Williams wrote:
>>>>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>> On 01.05.20 18:56, Dan Williams wrote:
>>>>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
>>>>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Why does the firmware map support hotplug entries?
>>>>>>>>>>>
>>>>>>>>>>> I assume:
>>>>>>>>>>>
>>>>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
>>>>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
>>>>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
>>>>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
>>>>>>>>>>> ballooning and you reboot ... then the e820 is changed as well). I assume
>>>>>>>>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
>>>>>>>>>>>
>>>>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But I assume only Andrew can enlighten us.
>>>>>>>>>>>
>>>>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
>>>>>>>>>>> firmware memmap, even if this contradicts with the existing
>>>>>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
>>>>>>>>>>> contain that memory after a reboot)
>>>>>>>>>>
>>>>>>>>>> For some reason that patch is misattributed - it was authored by
>>>>>>>>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
>>>>>>>>>> a decade.  I looked through the email discussion from that time and I'm
>>>>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
>>>>>>>>>> review comments.
>>>>>>>>>
>>>>>>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
>>>>>>>>> clear what has to be done here. I will add some of these details to the
>>>>>>>>> patch description.
>>>>>>>>>
>>>>>>>>> Also, now that I know that esp. kexec-tools already don't consider
>>>>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
>>>>>>>>> won't really suffer from a name change in /proc/iomem, I will go back to
>>>>>>>>> the MHP_DRIVER_MANAGED approach and
>>>>>>>>> 1. Don't create firmware memmap entries
>>>>>>>>> 2. Name the resource "System RAM (driver managed)"
>>>>>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>>>>>>>>>
>>>>>>>>> This way, kernel users and user space can figure out that this memory
>>>>>>>>> has different semantics and handle it accordingly - I think that was
>>>>>>>>> what Eric was asking for.
>>>>>>>>>
>>>>>>>>> Of course, open for suggestions.
>>>>>>>>
>>>>>>>> I'm still more of a fan of this being communicated by "System RAM"
>>>>>>>
>>>>>>> I was mentioning somewhere in this thread that "System RAM" inside a
>>>>>>> hierarchy (like dax/kmem) will already be basically ignored by
>>>>>>> kexec-tools. So, placing it inside a hierarchy already makes it look
>>>>>>> special.
>>>>>>>
>>>>>>> But after all, as we have to change kexec-tools either way, we can
>>>>>>> directly go ahead and flag it properly as special (in case there will
>>>>>>> ever be other cases where we could no longer distinguish it).
>>>>>>>
>>>>>>>> being parented especially because that tells you something about how
>>>>>>>> the memory is driver-managed and which mechanism might be in play.
>>>>>>>
>>>>>>> That could be communicated to some degree via the resource hierarchy.
>>>>>>>
>>>>>>> E.g.,
>>>>>>>
>>>>>>>             [root@localhost ~]# cat /proc/iomem
>>>>>>>             ...
>>>>>>>             140000000-33fffffff : Persistent Memory
>>>>>>>               140000000-1481fffff : namespace0.0
>>>>>>>               150000000-33fffffff : dax0.0
>>>>>>>                 150000000-33fffffff : System RAM (driver managed)
>>>>>>>
>>>>>>> vs.
>>>>>>>
>>>>>>>            :/# cat /proc/iomem
>>>>>>>             [...]
>>>>>>>             140000000-333ffffff : virtio-mem (virtio0)
>>>>>>>               140000000-147ffffff : System RAM (driver managed)
>>>>>>>               148000000-14fffffff : System RAM (driver managed)
>>>>>>>               150000000-157ffffff : System RAM (driver managed)
>>>>>>>
>>>>>>> Good enough for my taste.
>>>>>>>
>>>>>>>> What about adding an optional /sys/firmware/memmap/X/parent attribute.
>>>>>>>
>>>>>>> I really don't want any firmware memmap entries for something that is
>>>>>>> not part of the firmware provided memmap. In addition,
>>>>>>> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
>>>>>>> and two arm configs enable it at all.
>>>>>>>
>>>>>>> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
>>>>>>
>>>>>> I think that's a policy decision and policy decisions do not belong in
>>>>>> the kernel. Give the tooling the opportunity to decide whether System
>>>>>> RAM stays that way over a kexec. The parenthetical reference otherwise
>>>>>> looks out of place to me in the /proc/iomem output. What makes it
>>>>>> "driver managed" is how the kernel handles it, not how the kernel
>>>>>> names it.
>>>>>
>>>>> At least, virtio-mem is different. It really *has to be handled* by the
>>>>> driver. This is not a policy. It's how it works.
>>>
>>> ...but that's not necessarily how dax/kmem works.
>>>
>>
>> Yes, and user space could still take that memory and add it to the
>> firmware memmap if it really wants to. It knows that it is special. It
>> can figure out that it belongs to a dax device using /proc/iomem.
>>
>>>>>
>>>>
>>>> Oh, and I don't see why "System RAM (driver managed)" would hinder any
>>>> policy in user space to still do what it thinks is the right thing to do
>>>> (e.g., for dax).
>>>>
>>>> "System RAM (driver managed)" would mean: Memory is not part of the raw
>>>> firmware memmap. It was detected and added by a driver. Handle with
>>>> care, this is special.
>>>
>>> Oh, no, I was more reacting to your, "don't update
>>> /sys/firmware/memmap for the (driver managed) range" choice as being a
>>> policy decision. It otherwise feels to me "System RAM (driver
>>> managed)" adds confusion for casual users of /proc/iomem and for clued
>>> in tools they have the parent association to decide policy.
>>
>> Not sure if I understand correctly, so bear with me :).
>>
>> Adding or not adding stuff to /sys/firmware/memmap is not a policy
>> decision. If it's not part of the raw firmware-provided memmap, it has
>> nothing to do in /sys/firmware/memmap. That's what the documentation
>> from 2008 tells us.
> 
> It just occurs to me that there are valid cases for both wanting to
> start over with driver managed memory with a kexec and keeping it in
> the map.

Yes, there might be valid cases. My gut feeling is that in the general
case, you want to let the kexec kernel implement a policy, or let the user
in the new system decide.

But as I said, you can implement in kexec-tools whatever policy you
want. It has access to all information.
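
For illustration only, such a user-space policy could look like the
Python sketch below. It is not how kexec-tools works today (as noted
earlier in the thread, it currently ignores "System RAM" that is not
directly under the root). select_for_kexec() and keep_parents are
hypothetical names, and the input is assumed to be a list of
(start, end, name, ancestors) tuples already derived from /proc/iomem,
using the resource name proposed in this thread.

```python
def select_for_kexec(entries, keep_parents=("dax",)):
    """Pick the ranges a hypothetical tool would describe to the next kernel.

    entries: iterable of (start, end, name, ancestors), with ancestors
    listed outermost first.
    keep_parents: tuple of name prefixes; driver-managed memory below a
    matching parent is kept, everything else is left for its driver to
    add again after kexec.
    """
    selected = []
    for start, end, name, ancestors in entries:
        if name == "System RAM" and not ancestors:
            # Ordinary boot memory directly under the root.
            selected.append((start, end))
        elif name == "System RAM (driver managed)":
            # Policy decision made in user space, based on the parent.
            if any(a.startswith(keep_parents) for a in ancestors):
                selected.append((start, end))
    return selected
```

Passing keep_parents=() gives the conservative default of handing over
boot memory only, which matches letting the new kernel and its user
decide.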

> Consider the case of EFI Special Purpose (SP) Memory that is
> marked EFI Conventional Memory with the SP attribute. In that case the
> firmware memory map marked it as conventional RAM, but the kernel
> optionally marks it as System RAM vs Soft Reserved. The 2008 patch
> simply does not consider that case. I'm not sure strict textualism
> works for coding decisions.

I am no expert on that matter (esp. EFI). But looking at the users of
firmware_map_add_early(), the single user is in arch/x86/kernel/e820.c.
So the single source of /sys/firmware/memmap is (besides hotplug) e820.

"'e820_table_firmware': the original firmware version passed to us by
the bootloader - not modified by the kernel. ... inform the user about
the firmware's notion of memory layout via /sys/firmware/memmap"
(arch/x86/kernel/e820.c)

How is the EFI Special Purpose (SP) Memory represented in e820?

/sys/firmware/memmap is really simple: just a dump of e820. No policies, IIUC.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 19:17                                   ` David Hildenbrand
@ 2020-05-01 20:12                                     ` Dan Williams
  2020-05-01 21:10                                       ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-05-01 20:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On Fri, May 1, 2020 at 12:18 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.05.20 20:43, Dan Williams wrote:
> > On Fri, May 1, 2020 at 11:14 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 01.05.20 20:03, Dan Williams wrote:
> >>> On Fri, May 1, 2020 at 10:51 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 01.05.20 19:45, David Hildenbrand wrote:
> >>>>> On 01.05.20 19:39, Dan Williams wrote:
> >>>>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 01.05.20 18:56, Dan Williams wrote:
> >>>>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>>>>>
> >>>>>>>>>>> I assume:
> >>>>>>>>>>>
> >>>>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >>>>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >>>>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>>>>>> ballooning and you reboot ... then the e820 is changed as well). I assume
> >>>>>>>>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
> >>>>>>>>>>>
> >>>>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>>>>>
> >>>>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>>>>>> contain that memory after a reboot)
> >>>>>>>>>>
> >>>>>>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>>>>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
> >>>>>>>>>> a decade.  I looked through the email discussion from that time and I'm
> >>>>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>>>>>>>>> review comments.
> >>>>>>>>>
> >>>>>>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >>>>>>>>> clear what has to be done here. I will add some of these details to the
> >>>>>>>>> patch description.
> >>>>>>>>>
> >>>>>>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>>>>>> won't really suffer from a name change in /proc/iomem, I will go back to
> >>>>>>>>> the MHP_DRIVER_MANAGED approach and
> >>>>>>>>> 1. Don't create firmware memmap entries
> >>>>>>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>>>>>
> >>>>>>>>> This way, kernel users and user space can figure out that this memory
> >>>>>>>>> has different semantics and handle it accordingly - I think that was
> >>>>>>>>> what Eric was asking for.
> >>>>>>>>>
> >>>>>>>>> Of course, open for suggestions.
> >>>>>>>>
> >>>>>>>> I'm still more of a fan of this being communicated by "System RAM"
> >>>>>>>
> >>>>>>> I was mentioning somewhere in this thread that "System RAM" inside a
> >>>>>>> hierarchy (like dax/kmem) will already be basically ignored by
> >>>>>>> kexec-tools. So, placing it inside a hierarchy already makes it look
> >>>>>>> special.
> >>>>>>>
> >>>>>>> But after all, as we have to change kexec-tools either way, we can
> >>>>>>> directly go ahead and flag it properly as special (in case there will
> >>>>>>> ever be other cases where we could no longer distinguish it).
> >>>>>>>
> >>>>>>>> being parented especially because that tells you something about how
> >>>>>>>> the memory is driver-managed and which mechanism might be in play.
> >>>>>>>
> >>>>>>> That could be communicated to some degree via the resource hierarchy.
> >>>>>>>
> >>>>>>> E.g.,
> >>>>>>>
> >>>>>>>             [root@localhost ~]# cat /proc/iomem
> >>>>>>>             ...
> >>>>>>>             140000000-33fffffff : Persistent Memory
> >>>>>>>               140000000-1481fffff : namespace0.0
> >>>>>>>               150000000-33fffffff : dax0.0
> >>>>>>>                 150000000-33fffffff : System RAM (driver managed)
> >>>>>>>
> >>>>>>> vs.
> >>>>>>>
> >>>>>>>            :/# cat /proc/iomem
> >>>>>>>             [...]
> >>>>>>>             140000000-333ffffff : virtio-mem (virtio0)
> >>>>>>>               140000000-147ffffff : System RAM (driver managed)
> >>>>>>>               148000000-14fffffff : System RAM (driver managed)
> >>>>>>>               150000000-157ffffff : System RAM (driver managed)
> >>>>>>>
> >>>>>>> Good enough for my taste.
> >>>>>>>
> >>>>>>>> What about adding an optional /sys/firmware/memmap/X/parent attribute.
> >>>>>>>
> >>>>>>> I really don't want any firmware memmap entries for something that is
> >>>>>>> not part of the firmware provided memmap. In addition,
> >>>>>>> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
> >>>>>>> and two arm configs enable it at all.
> >>>>>>>
> >>>>>>> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
> >>>>>>
> >>>>>> I think that's a policy decision and policy decisions do not belong in
> >>>>>> the kernel. Give the tooling the opportunity to decide whether System
> >>>>>> RAM stays that way over a kexec. The parenthetical reference otherwise
> >>>>>> looks out of place to me in the /proc/iomem output. What makes it
> >>>>>> "driver managed" is how the kernel handles it, not how the kernel
> >>>>>> names it.
> >>>>>
> >>>>> At least, virtio-mem is different. It really *has to be handled* by the
> >>>>> driver. This is not a policy. It's how it works.
> >>>
> >>> ...but that's not necessarily how dax/kmem works.
> >>>
> >>
> >> Yes, and user space could still take that memory and add it to the
> >> firmware memmap if it really wants to. It knows that it is special. It
> >> can figure out that it belongs to a dax device using /proc/iomem.
> >>
> >>>>>
> >>>>
> >>>> Oh, and I don't see why "System RAM (driver managed)" would hinder any
> >>>> policy in user space to still do what it thinks is the right thing to do
> >>>> (e.g., for dax).
> >>>>
> >>>> "System RAM (driver managed)" would mean: Memory is not part of the raw
> >>>> firmware memmap. It was detected and added by a driver. Handle with
> >>>> care, this is special.
> >>>
> >>> Oh, no, I was more reacting to your, "don't update
> >>> /sys/firmware/memmap for the (driver managed) range" choice as being a
> >>> policy decision. It otherwise feels to me "System RAM (driver
> >>> managed)" adds confusion for casual users of /proc/iomem and for clued
> >>> in tools they have the parent association to decide policy.
> >>
> >> Not sure if I understand correctly, so bear with me :).
> >>
> >> Adding or not adding stuff to /sys/firmware/memmap is not a policy
> >> decision. If it's not part of the raw firmware-provided memmap, it has
> >> nothing to do in /sys/firmware/memmap. That's what the documentation
> >> from 2008 tells us.
> >
> > It just occurs to me that there are valid cases for both wanting to
> > start over with driver managed memory with a kexec and keeping it in
> > the map.
>
> Yes, there might be valid cases. My gut feeling is that in the general
> case, you want to let the kexec kernel implement a policy/ let the user
> in the new system decide.
>
> But as I said, you can implement in kexec-tools whatever policy you
> want. It has access to all information.

Right, so why is a new type needed if all the information is there by
other means?

> > Consider the case of EFI Special Purpose (SP) Memory that is
> > marked EFI Conventional Memory with the SP attribute. In that case the
> > firmware memory map marked it as conventional RAM, but the kernel
> > optionally marks it as System RAM vs Soft Reserved. The 2008 patch
> > simply does not consider that case. I'm not sure strict textualism
> > works for coding decisions.
>
> I am no expert on that matter (esp EFI). But looking at the users of
> firmware_map_add_early(), the single user is in arch/x86/kernel/e820.c
> . So the single source of /sys/firmware/memmap is (besides hotplug) e820.
>
> "'e820_table_firmware': the original firmware version passed to us by
> the bootloader - not modified by the kernel. ... inform the user about
> the firmware's notion of memory layout via /sys/firmware/memmap"
> (arch/x86/kernel/e820.c)
>
> How is the EFI Special Purpose (SP) Memory represented in e820?
> /sys/firmware/memmap is really simple: just dump in e820. No policies IIUC.

e820 now has a Soft Reserved translation for this which means "try to
reserve, but treat as System RAM is ok too". It seems generically
useful to me that the toggle for determining whether Soft Reserved or
System RAM shows up in /sys/firmware/memmap is a determination that
policy can make. The kernel need not preemptively block it.

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 20:12                                     ` Dan Williams
@ 2020-05-01 21:10                                       ` David Hildenbrand
  2020-05-01 21:52                                         ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-05-01 21:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On 01.05.20 22:12, Dan Williams wrote:
> On Fri, May 1, 2020 at 12:18 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.05.20 20:43, Dan Williams wrote:
>>> On Fri, May 1, 2020 at 11:14 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 01.05.20 20:03, Dan Williams wrote:
>>>>> On Fri, May 1, 2020 at 10:51 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 01.05.20 19:45, David Hildenbrand wrote:
>>>>>>> On 01.05.20 19:39, Dan Williams wrote:
>>>>>>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 01.05.20 18:56, Dan Williams wrote:
>>>>>>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
>>>>>>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand <david@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why does the firmware map support hotplug entries?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I assume:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
>>>>>>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
>>>>>>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
>>>>>>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
>>>>>>>>>>>>> ballooning and you reboot ... then the e820 is changed as well). I assume
>>>>>>>>>>>>> we wanted to be able to reflect that, to make kexec look like a real reboot.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> But I assume only Andrew can enlighten us.
>>>>>>>>>>>>>
>>>>>>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
>>>>>>>>>>>>> firmware memmap, even if this contradicts with the existing
>>>>>>>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
>>>>>>>>>>>>> contain that memory after a reboot)
>>>>>>>>>>>>
>>>>>>>>>>>> For some reason that patch is misattributed - it was authored by
>>>>>>>>>>>> Shaohui Zheng <shaohui.zheng@intel.com>, who hasn't been heard from in
>>>>>>>>>>>> a decade.  I looked through the email discussion from that time and I'm
>>>>>>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
>>>>>>>>>>>> review comments.
>>>>>>>>>>>
>>>>>>>>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
>>>>>>>>>>> clear what has to be done here. I will add some of these details to the
>>>>>>>>>>> patch description.
>>>>>>>>>>>
>>>>>>>>>>> Also, now that I know that esp. kexec-tools already don't consider
>>>>>>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
>>>>>>>>>>> won't really suffer from a name change in /proc/iomem, I will go back to
>>>>>>>>>>> the MHP_DRIVER_MANAGED approach and
>>>>>>>>>>> 1. Don't create firmware memmap entries
>>>>>>>>>>> 2. Name the resource "System RAM (driver managed)"
>>>>>>>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
>>>>>>>>>>>
>>>>>>>>>>> This way, kernel users and user space can figure out that this memory
>>>>>>>>>>> has different semantics and handle it accordingly - I think that was
>>>>>>>>>>> what Eric was asking for.
>>>>>>>>>>>
>>>>>>>>>>> Of course, open for suggestions.
>>>>>>>>>>
>>>>>>>>>> I'm still more of a fan of this being communicated by "System RAM"
>>>>>>>>>
>>>>>>>>> I was mentioning somewhere in this thread that "System RAM" inside a
>>>>>>>>> hierarchy (like dax/kmem) will already be basically ignored by
>>>>>>>>> kexec-tools. So, placing it inside a hierarchy already makes it look
>>>>>>>>> special already.
>>>>>>>>>
>>>>>>>>> But after all, as we have to change kexec-tools either way, we can
>>>>>>>>> directly go ahead and flag it properly as special (in case there will
>>>>>>>>> ever be other cases where we could no longer distinguish it).
>>>>>>>>>
>>>>>>>>>> being parented especially because that tells you something about how
>>>>>>>>>> the memory is driver-managed and which mechanism might be in play.
>>>>>>>>>
>>>>>>>>> That could be communicated to some degree via the resource hierarchy.
>>>>>>>>>
>>>>>>>>> E.g.,
>>>>>>>>>
>>>>>>>>>             [root@localhost ~]# cat /proc/iomem
>>>>>>>>>             ...
>>>>>>>>>             140000000-33fffffff : Persistent Memory
>>>>>>>>>               140000000-1481fffff : namespace0.0
>>>>>>>>>               150000000-33fffffff : dax0.0
>>>>>>>>>                 150000000-33fffffff : System RAM (driver managed)
>>>>>>>>>
>>>>>>>>> vs.
>>>>>>>>>
>>>>>>>>>            :/# cat /proc/iomem
>>>>>>>>>             [...]
>>>>>>>>>             140000000-333ffffff : virtio-mem (virtio0)
>>>>>>>>>               140000000-147ffffff : System RAM (driver managed)
>>>>>>>>>               148000000-14fffffff : System RAM (driver managed)
>>>>>>>>>               150000000-157ffffff : System RAM (driver managed)
>>>>>>>>>
>>>>>>>>> Good enough for my taste.
>>>>>>>>>
>>>>>>>>>> What about adding an optional /sys/firmware/memmap/X/parent attribute.
>>>>>>>>>
>>>>>>>>> I really don't want any firmware memmap entries for something that is
>>>>>>>>> not part of the firmware provided memmap. In addition,
>>>>>>>>> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
>>>>>>>>> and two arm configs enable it at all.
>>>>>>>>>
>>>>>>>>> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.
>>>>>>>>
>>>>>>>> I think that's a policy decision and policy decisions do not belong in
>>>>>>>> the kernel. Give the tooling the opportunity to decide whether System
>>>>>>>> RAM stays that way over a kexec. The parenthetical reference otherwise
>>>>>>>> looks out of place to me in the /proc/iomem output. What makes it
>>>>>>>> "driver managed" is how the kernel handles it, not how the kernel
>>>>>>>> names it.
>>>>>>>
>>>>>>> At least, virtio-mem is different. It really *has to be handled* by the
>>>>>>> driver. This is not a policy. It's how it works.
>>>>>
>>>>> ...but that's not necessarily how dax/kmem works.
>>>>>
>>>>
>>>> Yes, and user space could still take that memory and add it to the
>>>> firmware memmap if it really wants to. It knows that it is special. It
>>>> can figure out that it belongs to a dax device using /proc/iomem.
>>>>
>>>>>>>
>>>>>>
>>>>>> Oh, and I don't see why "System RAM (driver managed)" would hinder any
>>>>>> policy in user space to still do what it thinks is the right thing to do
>>>>>> (e.g., for dax).
>>>>>>
>>>>>> "System RAM (driver managed)" would mean: Memory is not part of the raw
>>>>>> firmware memmap. It was detected and added by a driver. Handle with
>>>>>> care, this is special.
>>>>>
>>>>> Oh, no, I was more reacting to your, "don't update
>>>>> /sys/firmware/memmap for the (driver managed) range" choice as being a
>>>>> policy decision. It otherwise feels to me "System RAM (driver
>>>>> managed)" adds confusion for casual users of /proc/iomem and for clued
>>>>> in tools they have the parent association to decide policy.
>>>>
>>>> Not sure if I understand correctly, so bear with me :).
>>>>
>>>> Adding or not adding stuff to /sys/firmware/memmap is not a policy
>>>> decision. If it's not part of the raw firmware-provided memmap, it has
>>>> nothing to do in /sys/firmware/memmap. That's what the documentation
>>>> from 2008 tells us.
>>>
>>> It just occurs to me that there are valid cases for both wanting to
>>> start over with driver managed memory with a kexec and keeping it in
>>> the map.
>>
>> Yes, there might be valid cases. My gut feeling is that in the general
>> case, you want to let the kexec kernel implement a policy/ let the user
>> in the new system decide.
>>
>> But as I said, you can implement in kexec-tools whatever policy you
>> want. It has access to all information.
> 
> Right, so why is a new type needed if all the information is there by
> other means?

You mean "System RAM (driver managed)" in /proc/iomem? See below for more.

> 
>>> Consider the case of EFI Special Purpose (SP) Memory that is
>>> marked EFI Conventional Memory with the SP attribute. In that case the
>>> firmware memory map marked it as conventional RAM, but the kernel
>>> optionally marks it as System RAM vs Soft Reserved. The 2008 patch
>>> simply does not consider that case. I'm not sure strict textualism
>>> works for coding decisions.
>>
>> I am no expert on that matter (esp EFI). But looking at the users of
>> firmware_map_add_early(), the single user is in arch/x86/kernel/e820.c
>> . So the single source of /sys/firmware/memmap is (besides hotplug) e820.
>>
>> "'e820_table_firmware': the original firmware version passed to us by
>> the bootloader - not modified by the kernel. ... inform the user about
>> the firmware's notion of memory layout via /sys/firmware/memmap"
>> (arch/x86/kernel/e820.c)
>>
>> How is the EFI Special Purpose (SP) Memory represented in e820?
>> /sys/firmware/memmap is really simple: just dump in e820. No policies IIUC.
> 
> e820 now has a Soft Reserved translation for this which means "try to
> reserve, but treat as System RAM is ok too". It seems generically
> useful to me that the toggle for determining whether Soft Reserved or
> System RAM shows up in /sys/firmware/memmap is a determination that
> policy can make. The kernel need not preemptively block it.

So, I think I have to clarify something here. We have two ways to kexec:

1. kexec_load(): User space (kexec-tools) crafts the memmap (e.g., using
/sys/firmware/memmap on x86-64) and selects memory where to place the
kexec images (e.g., using /proc/iomem)

2. kexec_file_load(): The kernel reuses the (basically) raw firmware
memmap and selects memory where to place kexec images.

We are talking about changing 1. to behave like 2. with regard to
dax/kmem. 2. currently does not add any hotplugged memory to the
fixed-up e820, and it should be fixed regarding hotplugged DIMMs that
would appear in e820 after a reboot.

Now, all these policy discussions are nice and fun, but I don't really
see a good reason to (ab)use /sys/firmware/memmap for that (e.g., parent
properties). If you want to be able to make this configurable, then
e.g., add a way to configure this in the kernel (for example along with
kmem) to make 1. and 2. behave the same way. Otherwise, you can really
only change 1.


Now, let's clarify what I want regarding virtio-mem:

1. kexec should not add virtio-mem memory to the initial firmware
   memmap. The driver has to be in charge as discussed.
2. kexec should not place kexec images onto virtio-mem memory. That
   would end badly.
3. kexec should still dump virtio-mem memory via kdump.

This has to work when using kexec_load() or kexec_file_load(). This has
to theoretically work on different architectures (especially, without
/sys/firmware/memmap). kexec-tools has to have access to that
information to figure out what to do.

Regarding 1:
- kexec_file_load(): works out of the box currently.
- kexec_load(): Don't create entries in /sys/firmware/memmap (for
  reasons discussed)
Regarding 2:
- kexec_file_load(): tag the resources as IORESOURCE_MEM_DRIVER_MANAGED
  (inspired by Eric)
- kexec_load(): indicate the memory as "System RAM (driver managed)"
Regarding 3:
- Same as 2. kexec-tools needs to be taught to properly consider the
  memory during kdump.
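
To spell the three requirements out, here is a minimal sketch in plain
C. It is not kexec-tools or kernel code; the type and function names
(ram_range, kexec_decision, kexec_decide) are made up for illustration,
and it assumes the tool has already found out that a range is driver
managed:

```c
#include <stdbool.h>

/* Hypothetical description of a System RAM range as seen by kexec. */
struct ram_range {
	bool driver_managed;	/* e.g., added by virtio-mem */
};

/* What kexec may do with a range; field names are illustrative only. */
struct kexec_decision {
	bool add_to_initial_memmap;	/* requirement 1 */
	bool may_hold_kexec_image;	/* requirement 2 */
	bool include_in_kdump;		/* requirement 3 */
};

/* Map a range to the three decisions listed above. */
struct kexec_decision kexec_decide(const struct ram_range *r)
{
	struct kexec_decision d = {
		.add_to_initial_memmap = !r->driver_managed,
		.may_hold_kexec_image = !r->driver_managed,
		.include_in_kdump = true,
	};
	return d;
}
```

Regular System RAM keeps today's behaviour in this sketch; only
driver-managed ranges are kept out of the initial memmap and out of
image placement, while both kinds are still dumped.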

Now, you are asking, "why System RAM (driver managed)". I don't think
it's strictly needed right now, but it feels cleaner. E.g., for
virtio-mem the current plan is to have /proc/iomem look like

           :/# cat /proc/iomem
            [...]
            140000000-333ffffff : virtio-mem (virtio0)
              140000000-147ffffff : System RAM (driver managed)
              148000000-14fffffff : System RAM (driver managed)
              150000000-157ffffff : System RAM (driver managed)

One could judge by looking at the hierarchy, that this memory is
special. kexec-tools will skip it currently in either form.
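
Judging by the hierarchy or by the name could look roughly like the
following user-space sketch. It assumes the usual /proc/iomem line
format (two spaces of indentation per nesting level, then
"start-end : name"). The struct and function names are hypothetical and
not taken from kexec-tools:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed /proc/iomem line. Illustrative type only. */
struct iomem_entry {
	unsigned long long start;
	unsigned long long end;
	int depth;		/* nesting level, assuming 2 spaces per level */
	char name[64];
};

/* Parse a line such as "  140000000-147ffffff : System RAM (driver managed)". */
bool parse_iomem_line(const char *line, struct iomem_entry *e)
{
	int indent = 0;

	while (line[indent] == ' ')
		indent++;
	e->depth = indent / 2;

	return sscanf(line + indent, "%llx-%llx : %63[^\n]",
		      &e->start, &e->end, e->name) == 3;
}

/* True for "System RAM" and for "System RAM (anything)". */
bool is_system_ram(const struct iomem_entry *e)
{
	return strncmp(e->name, "System RAM", 10) == 0 &&
	       (e->name[10] == '\0' || strncmp(e->name + 10, " (", 2) == 0);
}

/* Special: nested below another resource, or carrying a "(...)" suffix. */
bool is_special_system_ram(const struct iomem_entry *e)
{
	return is_system_ram(e) && (e->depth > 0 || e->name[10] != '\0');
}
```

A tool would feed each line of /proc/iomem to parse_iomem_line() and
treat anything for which is_special_system_ram() returns true with
care. Either criterion alone (depth or name suffix) is enough to flag
the virtio-mem ranges shown above.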

If we all agree here, that we can drop it, then let's drop it,
especially if it would allow dax/kmem to use the same mechanism I am
proposing here for virtio-mem.


Now, it would be fairly simple to add a config option for dax/kmem,
making it configurable in the kernel, whether to add memory via
MHP_DRIVER_MANAGED or just as we do now. It would contradict the
"raw firmware/prov..." description of /sys/firmware/memmap, but hey,
somebody explicitly configured it, so it can't be wrong.

-- 
Thanks,

David / dhildenb


* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 21:10                                       ` David Hildenbrand
@ 2020-05-01 21:52                                         ` Dan Williams
  2020-05-02  9:26                                           ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Dan Williams @ 2020-05-01 21:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He

On Fri, May 1, 2020 at 2:11 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.05.20 22:12, Dan Williams wrote:
[..]
> >>> Consider the case of EFI Special Purpose (SP) Memory that is
> >>> marked EFI Conventional Memory with the SP attribute. In that case the
> >>> firmware memory map marked it as conventional RAM, but the kernel
> >>> optionally marks it as System RAM vs Soft Reserved. The 2008 patch
> >>> simply does not consider that case. I'm not sure strict textualism
> >>> works for coding decisions.
> >>
> >> I am no expert on that matter (esp EFI). But looking at the users of
> >> firmware_map_add_early(), the single user is in arch/x86/kernel/e820.c
> >> . So the single source of /sys/firmware/memmap is (besides hotplug) e820.
> >>
> >> "'e820_table_firmware': the original firmware version passed to us by
> >> the bootloader - not modified by the kernel. ... inform the user about
> >> the firmware's notion of memory layout via /sys/firmware/memmap"
> >> (arch/x86/kernel/e820.c)
> >>
> >> How is the EFI Special Purpose (SP) Memory represented in e820?
> >> /sys/firmware/memmap is really simple: just dump in e820. No policies IIUC.
> >
> > e820 now has a Soft Reserved translation for this which means "try to
> > reserve, but treat as System RAM is ok too". It seems generically
> > useful to me that the toggle for determining whether Soft Reserved or
> > System RAM shows up in /sys/firmware/memmap is a determination that
> > policy can make. The kernel need not preemptively block it.
>
> So, I think I have to clarify something here. We do have two ways to kexec
>
> 1. kexec_load(): User space (kexec-tools) crafts the memmap (e.g., using
> /sys/firmware/memmap on x86-64) and selects memory where to place the
> kexec images (e.g., using /proc/iomem)
>
> 2. kexec_file_load(): The kernel reuses the (basically) raw firmware
> memmap and selects memory where to place kexec images.
>
> We are talking about changing 1, to behave like 2 in regards to
> dax/kmem. 2. does currently not add any hotplugged memory to the
> fixed-up e820, and it should be fixed regarding hotplugged DIMMs that
> would appear in e820 after a reboot.
>
> Now, all these policy discussions are nice and fun, but I don't really
> see a good reason to (ab)use /sys/firmware/memmap for that (e.g., parent
> properties). If you want to be able to make this configurable, then
> e.g., add a way to configure this in the kernel (for example along with
> kmem) to make 1. and 2. behave the same way. Otherwise, you really only
> can change 1.

That's clearer.

>
>
> Now, let's clarify what I want regarding virtio-mem:
>
> 1. kexec should not add virtio-mem memory to the initial firmware
>    memmap. The driver has to be in charge as discussed.
> 2. kexec should not place kexec images onto virtio-mem memory. That
>    would end badly.
> 3. kexec should still dump virtio-mem memory via kdump.

Ok, but that then seems to say to me that dax/kmem is a different type of
(driver managed) than virtio-mem and it's confusing to try to apply
the same meaning. Why not just call your type for the distinct type it
is "System RAM (virtio-mem)" and let any other driver managed memory
follow the same "System RAM ($driver)" format if it wants?

* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-01 21:52                                         ` Dan Williams
@ 2020-05-02  9:26                                           ` David Hildenbrand
  2020-05-02 18:03                                             ` Dan Williams
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2020-05-02  9:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He, Dave Hansen

>> Now, let's clarify what I want regarding virtio-mem:
>>
>> 1. kexec should not add virtio-mem memory to the initial firmware
>>    memmap. The driver has to be in charge as discussed.
>> 2. kexec should not place kexec images onto virtio-mem memory. That
>>    would end badly.
>> 3. kexec should still dump virtio-mem memory via kdump.
> 
> Ok, but that then seems to say to me that dax/kmem is a different type of
> (driver managed) than virtio-mem and it's confusing to try to apply
> the same meaning. Why not just call your type for the distinct type it
> is "System RAM (virtio-mem)" and let any other driver managed memory
> follow the same "System RAM ($driver)" format if it wants?

I had the same idea but discarded it because it seemed to uglify the
add_memory() interface (passing yet another parameter only relevant for
driver managed memory). Maybe we really want a new one, because I like
that idea:

/*
 * Add special, driver-managed memory to the system as system ram.
 * The resource_name is expected to have the name format "System RAM
 * ($DRIVER)", so user space (esp. kexec-tools) can special-case it.
 *
 * For this memory, no entries in /sys/firmware/memmap are created,
 * as this memory won't be part of the raw firmware-provided memory map
 * e.g., after a reboot. Also, the created memory resource is flagged
 * with IORESOURCE_MEM_DRIVER_MANAGED, so in-kernel users can special-
 * case this memory (e.g., not place kexec images onto it).
 */
int add_memory_driver_managed(int nid, u64 start, u64 size,
			      const char *resource_name);


If we'd ever have to special case it even more in the kernel, we could
allow specifying further resource flags. While passing the driver name
instead of the resource_name would be an option, this way we don't have
to hand craft new resource strings for added memory resources.
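
For illustration, user space could check the proposed name format with
something like the helper below. This is only a sketch of the
"System RAM ($DRIVER)" convention from the comment above;
parse_driver_managed_name() is a hypothetical name, not an existing
kernel or kexec-tools function:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Check that name looks like "System RAM ($DRIVER)" with a non-empty
 * $DRIVER and copy $DRIVER into drv. Returns false if the format does
 * not match or drv is too small.
 */
bool parse_driver_managed_name(const char *name, char *drv, size_t drv_len)
{
	static const char prefix[] = "System RAM (";
	size_t plen = sizeof(prefix) - 1;
	size_t nlen = strlen(name);
	size_t dlen;

	/* Need the prefix, at least one driver character, and ")". */
	if (nlen < plen + 2 || strncmp(name, prefix, plen) != 0)
		return false;
	if (name[nlen - 1] != ')')
		return false;

	dlen = nlen - plen - 1;	/* characters between "(" and ")" */
	if (dlen >= drv_len)
		return false;
	memcpy(drv, name + plen, dlen);
	drv[dlen] = '\0';
	return true;
}
```

With a freeform resource_name, such a check only works as long as
every caller sticks to the format. Passing just the driver name would
make the format impossible to get wrong, at the cost of building the
string inside the helper.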

Thoughts?

-- 
Thanks,

David / dhildenb


* Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP
  2020-05-02  9:26                                           ` David Hildenbrand
@ 2020-05-02 18:03                                             ` Dan Williams
  0 siblings, 0 replies; 30+ messages in thread
From: Dan Williams @ 2020-05-02 18:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List,
	Linux MM, virtio-dev, virtualization, linuxppc-dev, Linux ACPI,
	linux-nvdimm, linux-hyperv, linux-s390, xen-devel, Michal Hocko,
	Michael S . Tsirkin, Michal Hocko, Pankaj Gupta, Wei Yang,
	Baoquan He, Dave Hansen

On Sat, May 2, 2020 at 2:27 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> Now, let's clarify what I want regarding virtio-mem:
> >>
> >> 1. kexec should not add virtio-mem memory to the initial firmware
> >>    memmap. The driver has to be in charge as discussed.
> >> 2. kexec should not place kexec images onto virtio-mem memory. That
> >>    would end badly.
> >> 3. kexec should still dump virtio-mem memory via kdump.
> >
> > Ok, but that then seems to say to me that dax/kmem is a different type of
> > (driver managed) than virtio-mem and it's confusing to try to apply
> > the same meaning. Why not just call your type for the distinct type it
> > is "System RAM (virtio-mem)" and let any other driver managed memory
> > follow the same "System RAM ($driver)" format if it wants?
>
> I had the same idea but discarded it because it seemed to uglify the
> add_memory() interface (passing yet another parameter only relevant for
> driver managed memory). Maybe we really want a new one, because I like
> that idea:
>
> /*
>  * Add special, driver-managed memory to the system as system ram.
>  * The resource_name is expected to have the name format "System RAM
>  * ($DRIVER)", so user space (esp. kexec-tools) can special-case it.
>  *
>  * For this memory, no entries in /sys/firmware/memmap are created,
>  * as this memory won't be part of the raw firmware-provided memory map
>  * e.g., after a reboot. Also, the created memory resource is flagged
>  * with IORESOURCE_MEM_DRIVER_MANAGED, so in-kernel users can special-
>  * case this memory (e.g., not place kexec images onto it).
>  */
> int add_memory_driver_managed(int nid, u64 start, u64 size,
>                               const char *resource_name);
>
>
> If we'd ever have to special case it even more in the kernel, we could
> allow to specify further resource flags. While passing the driver name
> instead of the resource_name would be an option, this way we don't have
> to hand craft new resource strings for added memory resources.
>
> Thoughts?

Looks useful to me and simplifies walking /proc/iomem. I personally
like the safety of the string just being the $driver component of the
name, but I won't lose sleep if the interface stays freeform like you
propose.

Thread overview: 30+ messages
2020-04-30 10:29 [PATCH v2 0/3] mm/memory_hotplug: Allow to not create firmware memmap entries David Hildenbrand
2020-04-30 10:29 ` [PATCH v2 1/3] mm/memory_hotplug: Prepare passing flags to add_memory() and friends David Hildenbrand
2020-04-30 10:29 ` [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP David Hildenbrand
2020-04-30 15:38   ` Eric W. Biederman
2020-04-30 15:52     ` David Hildenbrand
2020-04-30 16:04       ` Dave Hansen
2020-04-30 16:33       ` Eric W. Biederman
2020-04-30 16:49         ` David Hildenbrand
2020-04-30 18:06           ` Eric W. Biederman
2020-04-30 18:43             ` David Hildenbrand
2020-04-30 18:58               ` Dan Williams
2020-04-30 22:24               ` Andrew Morton
2020-05-01  9:34                 ` David Hildenbrand
2020-05-01 16:56                   ` Dan Williams
2020-05-01 17:21                     ` David Hildenbrand
2020-05-01 17:39                       ` Dan Williams
2020-05-01 17:45                         ` David Hildenbrand
2020-05-01 17:51                           ` David Hildenbrand
2020-05-01 18:03                             ` Dan Williams
2020-05-01 18:14                               ` David Hildenbrand
2020-05-01 18:43                                 ` Dan Williams
2020-05-01 19:17                                   ` David Hildenbrand
2020-05-01 20:12                                     ` Dan Williams
2020-05-01 21:10                                       ` David Hildenbrand
2020-05-01 21:52                                         ` Dan Williams
2020-05-02  9:26                                           ` David Hildenbrand
2020-05-02 18:03                                             ` Dan Williams
2020-04-30 10:29 ` [PATCH v2 3/3] device-dax: Add system ram (add_memory()) with MHP_NO_FIRMWARE_MEMMAP David Hildenbrand
2020-04-30 11:23   ` Dave Hansen
2020-04-30 15:28     ` David Hildenbrand
