linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/4] mm/memory_hotplug: Interface to add driver-managed system ram
@ 2020-05-08  8:42 David Hildenbrand
  2020-05-08  8:42 ` [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem David Hildenbrand
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: David Hildenbrand @ 2020-05-08  8:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-nvdimm, kexec, Vishal Verma, Dave Jiang,
	Pavel Tatashin, David Hildenbrand, Andrew Morton, Baoquan He,
	Dan Williams, Dave Hansen, Eric Biederman, Michal Hocko,
	Pankaj Gupta, Wei Yang

I did some more testing to v3 and found issues with unloading the kmem
module, followed by reconfiguring the namespace.

kexec (via kexec_load()) can currently not properly handle memory added via
dax/kmem, and will have similar issues with virtio-mem. kexec-tools will
currently add all memory to the fixed-up initial firmware memmap. In case
of dax/kmem, this means that - in contrast to a proper reboot - how that
persistent memory will be used can no longer be configured by the kexec'd
kernel. In case of virtio-mem it will be harmful, because that memory
might contain inaccessible pieces that require coordination with hypervisor
first.

In both cases, we want to let the driver in the kexec'd kernel handle
detecting and adding the memory, like during an ordinary reboot.
Introduce add_memory_driver_managed(). More on the samentics are in patch
#1.

In the future, we might want to make this behavior configurable for
dax/kmem- either by configuring it in the kernel (which would then also
allow to configure kexec_file_load()) or in kexec-tools by also adding
"System RAM (kmem)" memory from /proc/iomem to the fixed-up initial
firmware memmap.

More on the motivation can be found in [1] and [2].

v3 -> v4:
- "device-dax: Don't leak kernel memory to user space after unloading kmem"
-- Added
- "device-dax: Add memory via add_memory_driver_managed()"
-- kstrdup_const() the resource name to be used for added memory
-- Remember if any hotremove failed / we still have memory added to the
   system and conditionally kfree_const().

v2 -> v3:
- Don't use flags for add_memory() and friends, provide
  add_memory_driver_managed() instead.
- Flag memory resources via IORESOURCE_MEM_DRIVER_MANAGED and handle them
  in kexec.
- Name memory resources "System RAM ($DRIVER)", visible via /proc/iomem
- Added more details to the patch descriptions, especially regarding the
  history of /sys/firmware/memmap
- Add a comment to the device-dax change. Dropped Dave's Ack as the

v1 -> v2:
- Don't change the resource name
- Rename the flag to MHP_NO_FIRMWARE_MEMMAP to reflect what it is doing
- Rephrase subjects/descriptions
- Use the flag for dax/kmem

[1] https://lkml.kernel.org/r/20200429160803.109056-1-david@redhat.com
[2] https://lkml.kernel.org/r/20200430102908.10107-1-david@redhat.com


David Hildenbrand (4):
  device-dax: Don't leak kernel memory to user space after unloading
    kmem
  mm/memory_hotplug: Introduce add_memory_driver_managed()
  kexec_file: Don't place kexec images on IORESOURCE_MEM_DRIVER_MANAGED
  device-dax: Add memory via add_memory_driver_managed()

 drivers/dax/dax-private.h      |  1 +
 drivers/dax/kmem.c             | 42 ++++++++++++++++++++---
 include/linux/ioport.h         |  1 +
 include/linux/memory_hotplug.h |  2 ++
 kernel/kexec_file.c            |  5 +++
 mm/memory_hotplug.c            | 62 +++++++++++++++++++++++++++++++---
 6 files changed, 104 insertions(+), 9 deletions(-)

-- 
2.25.4



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem
  2020-05-08  8:42 [PATCH v4 0/4] mm/memory_hotplug: Interface to add driver-managed system ram David Hildenbrand
@ 2020-05-08  8:42 ` David Hildenbrand
  2020-05-08 23:53   ` Andrew Morton
  2020-05-08  8:42 ` [PATCH v4 2/4] mm/memory_hotplug: Introduce add_memory_driver_managed() David Hildenbrand
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2020-05-08  8:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-nvdimm, kexec, Vishal Verma, Dave Jiang,
	Pavel Tatashin, David Hildenbrand, stable, Dan Williams,
	Andrew Morton

Assume we have kmem configured and loaded:
  [root@localhost ~]# cat /proc/iomem
  ...
  140000000-33fffffff : Persistent Memory$
    140000000-1481fffff : namespace0.0
    150000000-33fffffff : dax0.0
      150000000-33fffffff : System RAM

Assume we try to unload kmem. This force-unloading will work, even if
memory cannot get removed from the system.
  [root@localhost ~]# rmmod kmem
  [   86.380228] removing memory fails, because memory [0x0000000150000000-0x0000000157ffffff] is onlined
  ...
  [   86.431225] kmem dax0.0: DAX region [mem 0x150000000-0x33fffffff] cannot be hotremoved until the next reboot

Now, we can reconfigure the namespace:
  [root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax
  [  131.409351] nd_pmem namespace0.0: could not reserve region [mem 0x140000000-0x33fffffff]dax
  [  131.410147] nd_pmem: probe of namespace0.0 failed with error -16namespace0.0 --mode=devdax
  ...

This fails as expected due to the busy memory resource, and the memory
cannot be used. However, the dax0.0 device is removed, and along its name.

The name of the memory resource now points at freed memory (name of the
device).
  [root@localhost ~]# cat /proc/iomem
  ...
  140000000-33fffffff : Persistent Memory
    140000000-1481fffff : namespace0.0
    150000000-33fffffff : �_�^7_��/_��wR��WQ���^��� ...
    150000000-33fffffff : System RAM

We have to make sure to duplicate the string. While at it, remove the
superfluous setting of the name and fixup a stale comment.

Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used like normal RAM")
Cc: stable@vger.kernel.org # v5.3
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/dax/kmem.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 3d0a7e702c94..1e678bdf5aed 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -22,6 +22,7 @@ int dev_dax_kmem_probe(struct device *dev)
 	resource_size_t kmem_size;
 	resource_size_t kmem_end;
 	struct resource *new_res;
+	const char *new_res_name;
 	int numa_node;
 	int rc;
 
@@ -48,11 +49,16 @@ int dev_dax_kmem_probe(struct device *dev)
 	kmem_size &= ~(memory_block_size_bytes() - 1);
 	kmem_end = kmem_start + kmem_size;
 
-	/* Region is permanently reserved.  Hot-remove not yet implemented. */
-	new_res = request_mem_region(kmem_start, kmem_size, dev_name(dev));
+	new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
+	if (!new_res_name)
+		return -ENOMEM;
+
+	/* Region is permanently reserved if hotremove fails. */
+	new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
 	if (!new_res) {
 		dev_warn(dev, "could not reserve region [%pa-%pa]\n",
 			 &kmem_start, &kmem_end);
+		kfree(new_res_name);
 		return -EBUSY;
 	}
 
@@ -63,12 +69,12 @@ int dev_dax_kmem_probe(struct device *dev)
 	 * unknown to us that will break add_memory() below.
 	 */
 	new_res->flags = IORESOURCE_SYSTEM_RAM;
-	new_res->name = dev_name(dev);
 
 	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
 	if (rc) {
 		release_resource(new_res);
 		kfree(new_res);
+		kfree(new_res_name);
 		return rc;
 	}
 	dev_dax->dax_kmem_res = new_res;
@@ -83,6 +89,7 @@ static int dev_dax_kmem_remove(struct device *dev)
 	struct resource *res = dev_dax->dax_kmem_res;
 	resource_size_t kmem_start = res->start;
 	resource_size_t kmem_size = resource_size(res);
+	const char *res_name = res->name;
 	int rc;
 
 	/*
@@ -102,6 +109,7 @@ static int dev_dax_kmem_remove(struct device *dev)
 	/* Release and free dax resources */
 	release_resource(res);
 	kfree(res);
+	kfree(res_name);
 	dev_dax->dax_kmem_res = NULL;
 
 	return 0;
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v4 2/4] mm/memory_hotplug: Introduce add_memory_driver_managed()
  2020-05-08  8:42 [PATCH v4 0/4] mm/memory_hotplug: Interface to add driver-managed system ram David Hildenbrand
  2020-05-08  8:42 ` [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem David Hildenbrand
@ 2020-05-08  8:42 ` David Hildenbrand
  2020-05-08  8:42 ` [PATCH v4 3/4] kexec_file: Don't place kexec images on IORESOURCE_MEM_DRIVER_MANAGED David Hildenbrand
  2020-05-08  8:42 ` [PATCH v4 4/4] device-dax: Add memory via add_memory_driver_managed() David Hildenbrand
  3 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2020-05-08  8:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-nvdimm, kexec, Vishal Verma, Dave Jiang,
	Pavel Tatashin, David Hildenbrand, Pankaj Gupta, Andrew Morton,
	Michal Hocko, Wei Yang, Baoquan He, Dave Hansen, Eric Biederman,
	Dan Williams

Some device drivers rely on memory they managed to not get added to the
initial (firmware) memmap as system RAM - so it's not used as initial
system RAM by the kernel and the driver is under control. While this is the
case during cold boot and after a reboot, kexec is not aware of that and
might add such memory to the initial (firmware) memmap of the kexec kernel.
We need ways to teach kernel and userspace that this system ram is
different.

For example, dax/kmem allows to decide at runtime if persistent memory is
to be used as system ram. Another future user is virtio-mem, which has to
coordinate with its hypervisor to deal with inaccessible parts within
memory resources.

We want to let users in the kernel (esp. kexec) but also user space
(esp. kexec-tools) know that this memory has different semantics and
needs to be handled differently:
1. Don't create entries in /sys/firmware/memmap/
2. Name the memory resource "System RAM ($DRIVER)" (exposed via
   /proc/iomem) ($DRIVER might be "kmem", "virtio_mem").
3. Flag the memory resource IORESOURCE_MEM_DRIVER_MANAGED

/sys/firmware/memmap/ [1] represents the "raw firmware-provided memory map"
because "on most architectures that firmware-provided memory map is
modified afterwards by the kernel itself". The primary user is kexec on
x86-64. Since commit d96ae5309165 ("memory-hotplug: create
/sys/firmware/memmap entry for new memory"), we add all hotplugged
memory to that firmware memmap - which makes perfect sense for traditional
memory hotplug on x86-64, where real HW will also add hotplugged DIMMs to
the firmware memmap. We replicate what the "raw firmware-provided memory
map" looks like after hot(un)plug.

To keep things simple, let the user provide the full resource name
instead of only the driver name - this way, we don't have to manually
allocate/craft strings for memory resources. Also use the resource
name to make decisions, to avoid passing additional flags. In case the
name isn't "System RAM", it's special.

We don't have to worry about firmware_map_remove() on the removal path. If
there is no entry, it will simply return with -EINVAL.

We'll adapt dax/kmem in a follow-up patch.

[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap

Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/ioport.h         |  1 +
 include/linux/memory_hotplug.h |  2 ++
 mm/memory_hotplug.c            | 62 +++++++++++++++++++++++++++++++---
 3 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index a9b9170b5dd2..cc9a5b4593ca 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -103,6 +103,7 @@ struct resource {
 #define IORESOURCE_MEM_32BIT		(3<<3)
 #define IORESOURCE_MEM_SHADOWABLE	(1<<5)	/* dup: IORESOURCE_SHADOWABLE */
 #define IORESOURCE_MEM_EXPANSIONROM	(1<<6)
+#define IORESOURCE_MEM_DRIVER_MANAGED	(1<<7)
 
 /* PnP I/O specific bits (IORESOURCE_BITS) */
 #define IORESOURCE_IO_16BIT_ADDR	(1<<0)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 93d9ada74ddd..4abb87b328fd 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -349,6 +349,8 @@ extern void __ref free_area_init_core_hotplug(int nid);
 extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource);
+extern int add_memory_driver_managed(int nid, u64 start, u64 size,
+				     const char *resource_name);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern void remove_pfn_range_from_zone(struct zone *zone,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fc0aad0bc1f5..c2d084d7c9c0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -98,11 +98,14 @@ void mem_hotplug_done(void)
 u64 max_mem_size = U64_MAX;
 
 /* add this memory to iomem resource */
-static struct resource *register_memory_resource(u64 start, u64 size)
+static struct resource *register_memory_resource(u64 start, u64 size,
+						 const char *resource_name)
 {
 	struct resource *res;
 	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	char *resource_name = "System RAM";
+
+	if (strcmp(resource_name, "System RAM"))
+		flags |= IORESOURCE_MEM_DRIVER_MANAGED;
 
 	/*
 	 * Make sure value parsed from 'mem=' only restricts memory adding
@@ -1060,7 +1063,8 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	BUG_ON(ret);
 
 	/* create new memmap entry */
-	firmware_map_add_hotplug(start, start + size, "System RAM");
+	if (!strcmp(res->name, "System RAM"))
+		firmware_map_add_hotplug(start, start + size, "System RAM");
 
 	/* device_online() will take the lock when calling online_pages() */
 	mem_hotplug_done();
@@ -1085,7 +1089,7 @@ int __ref __add_memory(int nid, u64 start, u64 size)
 	struct resource *res;
 	int ret;
 
-	res = register_memory_resource(start, size);
+	res = register_memory_resource(start, size, "System RAM");
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
@@ -1107,6 +1111,56 @@ int add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(add_memory);
 
+/*
+ * Add special, driver-managed memory to the system as system RAM. Such
+ * memory is not exposed via the raw firmware-provided memmap as system
+ * RAM, instead, it is detected and added by a driver - during cold boot,
+ * after a reboot, and after kexec.
+ *
+ * Reasons why this memory should not be used for the initial memmap of a
+ * kexec kernel or for placing kexec images:
+ * - The booting kernel is in charge of determining how this memory will be
+ *   used (e.g., use persistent memory as system RAM)
+ * - Coordination with a hypervisor is required before this memory
+ *   can be used (e.g., inaccessible parts).
+ *
+ * For this memory, no entries in /sys/firmware/memmap ("raw firmware-provided
+ * memory map") are created. Also, the created memory resource is flagged
+ * with IORESOURCE_MEM_DRIVER_MANAGED, so in-kernel users can special-case
+ * this memory as well (esp., not place kexec images onto it).
+ *
+ * The resource_name (visible via /proc/iomem) has to have the format
+ * "System RAM ($DRIVER)".
+ */
+int add_memory_driver_managed(int nid, u64 start, u64 size,
+			      const char *resource_name)
+{
+	struct resource *res;
+	int rc;
+
+	if (!resource_name ||
+	    strstr(resource_name, "System RAM (") != resource_name ||
+	    resource_name[strlen(resource_name) - 1] != ')')
+		return -EINVAL;
+
+	lock_device_hotplug();
+
+	res = register_memory_resource(start, size, resource_name);
+	if (IS_ERR(res)) {
+		rc = PTR_ERR(res);
+		goto out_unlock;
+	}
+
+	rc = add_memory_resource(nid, res);
+	if (rc < 0)
+		release_memory_resource(res);
+
+out_unlock:
+	unlock_device_hotplug();
+	return rc;
+}
+EXPORT_SYMBOL_GPL(add_memory_driver_managed);
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
  * A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v4 3/4] kexec_file: Don't place kexec images on IORESOURCE_MEM_DRIVER_MANAGED
  2020-05-08  8:42 [PATCH v4 0/4] mm/memory_hotplug: Interface to add driver-managed system ram David Hildenbrand
  2020-05-08  8:42 ` [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem David Hildenbrand
  2020-05-08  8:42 ` [PATCH v4 2/4] mm/memory_hotplug: Introduce add_memory_driver_managed() David Hildenbrand
@ 2020-05-08  8:42 ` David Hildenbrand
  2020-05-08  8:42 ` [PATCH v4 4/4] device-dax: Add memory via add_memory_driver_managed() David Hildenbrand
  3 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2020-05-08  8:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-nvdimm, kexec, Vishal Verma, Dave Jiang,
	Pavel Tatashin, David Hildenbrand, Andrew Morton, Michal Hocko,
	Pankaj Gupta, Wei Yang, Baoquan He, Dave Hansen, Eric Biederman,
	Dan Williams

Memory flagged with IORESOURCE_MEM_DRIVER_MANAGED is special - it won't be
part of the initial memmap of the kexec kernel and not all memory might be
accessible. Don't place any kexec images onto it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/kexec_file.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index faa74d5f6941..bb05fd52de85 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -540,6 +540,11 @@ static int locate_mem_hole_callback(struct resource *res, void *arg)
 	unsigned long sz = end - start + 1;
 
 	/* Returning 0 will take to next memory range */
+
+	/* Don't use memory that will be detected and handled by a driver. */
+	if (res->flags & IORESOURCE_MEM_DRIVER_MANAGED)
+		return 0;
+
 	if (sz < kbuf->memsz)
 		return 0;
 
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v4 4/4] device-dax: Add memory via add_memory_driver_managed()
  2020-05-08  8:42 [PATCH v4 0/4] mm/memory_hotplug: Interface to add driver-managed system ram David Hildenbrand
                   ` (2 preceding siblings ...)
  2020-05-08  8:42 ` [PATCH v4 3/4] kexec_file: Don't place kexec images on IORESOURCE_MEM_DRIVER_MANAGED David Hildenbrand
@ 2020-05-08  8:42 ` David Hildenbrand
  3 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2020-05-08  8:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-nvdimm, kexec, Vishal Verma, Dave Jiang,
	Pavel Tatashin, David Hildenbrand, Pankaj Gupta, Andrew Morton,
	Michal Hocko, Wei Yang, Baoquan He, Dave Hansen, Eric Biederman,
	Dan Williams

Currently, when adding memory, we create entries in /sys/firmware/memmap/
as "System RAM". This will lead to kexec-tools to add that memory to the
fixed-up initial memmap for a kexec kernel (loaded via kexec_load()). The
memory will be considered initial System RAM by the kexec'd kernel and
can no longer be reconfigured. This is not what happens during a real
reboot.

Let's add our memory via add_memory_driver_managed() now, so we won't
create entries in /sys/firmware/memmap/ and indicate the memory as
"System RAM (kmem)" in /proc/iomem. This allows everybody (especially
kexec-tools) to identify that this memory is special and has to be treated
differently than ordinary (hotplugged) System RAM.

Before configuring the namespace:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-33fffffff : namespace0.0
	3280000000-32ffffffff : PCI Bus 0000:00

After configuring the namespace:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  148200000-33fffffff : dax0.0
	3280000000-32ffffffff : PCI Bus 0000:00

After loading kmem before this change:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  150000000-33fffffff : dax0.0
	    150000000-33fffffff : System RAM
	3280000000-32ffffffff : PCI Bus 0000:00

After loading kmem after this change:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  150000000-33fffffff : dax0.0
	    150000000-33fffffff : System RAM (kmem)
	3280000000-32ffffffff : PCI Bus 0000:00

After a proper reboot:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  148200000-33fffffff : dax0.0
	3280000000-32ffffffff : PCI Bus 0000:00

Within the kexec kernel before this change:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  150000000-33fffffff : System RAM
	3280000000-32ffffffff : PCI Bus 0000:00

Within the kexec kernel after this change:
	[root@localhost ~]# cat /proc/iomem
	...
	140000000-33fffffff : Persistent Memory
	  140000000-1481fffff : namespace0.0
	  148200000-33fffffff : dax0.0
	3280000000-32ffffffff : PCI Bus 0000:00

/sys/firmware/memmap/ before this change:
	0000000000000000-000000000009fc00 (System RAM)
	000000000009fc00-00000000000a0000 (Reserved)
	00000000000f0000-0000000000100000 (Reserved)
	0000000000100000-00000000bffdf000 (System RAM)
	00000000bffdf000-00000000c0000000 (Reserved)
	00000000feffc000-00000000ff000000 (Reserved)
	00000000fffc0000-0000000100000000 (Reserved)
	0000000100000000-0000000140000000 (System RAM)
	0000000150000000-0000000340000000 (System RAM)

/sys/firmware/memmap/ after a proper reboot:
	0000000000000000-000000000009fc00 (System RAM)
	000000000009fc00-00000000000a0000 (Reserved)
	00000000000f0000-0000000000100000 (Reserved)
	0000000000100000-00000000bffdf000 (System RAM)
	00000000bffdf000-00000000c0000000 (Reserved)
	00000000feffc000-00000000ff000000 (Reserved)
	00000000fffc0000-0000000100000000 (Reserved)
	0000000100000000-0000000140000000 (System RAM)

/sys/firmware/memmap/ after this change:
	0000000000000000-000000000009fc00 (System RAM)
	000000000009fc00-00000000000a0000 (Reserved)
	00000000000f0000-0000000000100000 (Reserved)
	0000000000100000-00000000bffdf000 (System RAM)
	00000000bffdf000-00000000c0000000 (Reserved)
	00000000feffc000-00000000ff000000 (Reserved)
	00000000fffc0000-0000000100000000 (Reserved)
	0000000100000000-0000000140000000 (System RAM)

kexec-tools already seem to basically ignore any System RAM that's not
on top level when searching for areas to place kexec images - but also
for determining crash areas to dump via kdump. Changing the resource name
won't have an impact.

Handle unloading of the driver after memory hotremove failed properly, by
duplicating the string if necessary.

Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/dax/dax-private.h |  1 +
 drivers/dax/kmem.c        | 28 ++++++++++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 3107ce80e809..16850d5388ab 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -44,6 +44,7 @@ struct dax_region {
  * @dev - device core
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
  * @dax_mem_res: physical address range of hotadded DAX memory
+ * @dax_mem_name: name for hotadded DAX memory via add_memory_driver_managed()
  */
 struct dev_dax {
 	struct dax_region *region;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 1e678bdf5aed..275aa5f87399 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -14,6 +14,11 @@
 #include "dax-private.h"
 #include "bus.h"
 
+/* Memory resource name used for add_memory_driver_managed(). */
+static const char *kmem_name;
+/* Set if any memory will remain added when the driver will be unloaded. */
+static bool any_hotremove_failed;
+
 int dev_dax_kmem_probe(struct device *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
@@ -70,7 +75,12 @@ int dev_dax_kmem_probe(struct device *dev)
 	 */
 	new_res->flags = IORESOURCE_SYSTEM_RAM;
 
-	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+	/*
+	 * Ensure that future kexec'd kernels will not treat this as RAM
+	 * automatically.
+	 */
+	rc = add_memory_driver_managed(numa_node, new_res->start,
+				       resource_size(new_res), kmem_name);
 	if (rc) {
 		release_resource(new_res);
 		kfree(new_res);
@@ -100,6 +110,7 @@ static int dev_dax_kmem_remove(struct device *dev)
 	 */
 	rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
 	if (rc) {
+		any_hotremove_failed = true;
 		dev_err(dev,
 			"DAX region %pR cannot be hotremoved until the next reboot\n",
 			res);
@@ -124,6 +135,7 @@ static int dev_dax_kmem_remove(struct device *dev)
 	 * permanently pinned as reserved by the unreleased
 	 * request_mem_region().
 	 */
+	any_hotremove_failed = true;
 	return 0;
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -137,12 +149,24 @@ static struct dax_device_driver device_dax_kmem_driver = {
 
 static int __init dax_kmem_init(void)
 {
-	return dax_driver_register(&device_dax_kmem_driver);
+	int rc;
+
+	/* Resource name is permanently allocated if any hotremove fails. */
+	kmem_name = kstrdup_const("System RAM (kmem)", GFP_KERNEL);
+	if (!kmem_name)
+		return -ENOMEM;
+
+	rc = dax_driver_register(&device_dax_kmem_driver);
+	if (rc)
+		kfree_const(kmem_name);
+	return rc;
 }
 
 static void __exit dax_kmem_exit(void)
 {
 	dax_driver_unregister(&device_dax_kmem_driver);
+	if (!any_hotremove_failed)
+		kfree_const(kmem_name);
 }
 
 MODULE_AUTHOR("Intel Corporation");
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem
  2020-05-08  8:42 ` [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem David Hildenbrand
@ 2020-05-08 23:53   ` Andrew Morton
  2020-05-09  5:40     ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2020-05-08 23:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-nvdimm, kexec, Vishal Verma,
	Dave Jiang, Pavel Tatashin, stable, Dan Williams

On Fri,  8 May 2020 10:42:14 +0200 David Hildenbrand <david@redhat.com> wrote:

> Assume we have kmem configured and loaded:
>   [root@localhost ~]# cat /proc/iomem
>   ...
>   140000000-33fffffff : Persistent Memory$
>     140000000-1481fffff : namespace0.0
>     150000000-33fffffff : dax0.0
>       150000000-33fffffff : System RAM
> 
> Assume we try to unload kmem. This force-unloading will work, even if
> memory cannot get removed from the system.
>   [root@localhost ~]# rmmod kmem
>   [   86.380228] removing memory fails, because memory [0x0000000150000000-0x0000000157ffffff] is onlined
>   ...
>   [   86.431225] kmem dax0.0: DAX region [mem 0x150000000-0x33fffffff] cannot be hotremoved until the next reboot
> 
> Now, we can reconfigure the namespace:
>   [root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax
>   [  131.409351] nd_pmem namespace0.0: could not reserve region [mem 0x140000000-0x33fffffff]dax
>   [  131.410147] nd_pmem: probe of namespace0.0 failed with error -16namespace0.0 --mode=devdax
>   ...
> 
> This fails as expected due to the busy memory resource, and the memory
> cannot be used. However, the dax0.0 device is removed, and along its name.
> 
> The name of the memory resource now points at freed memory (name of the
> device).
>   [root@localhost ~]# cat /proc/iomem
>   ...
>   140000000-33fffffff : Persistent Memory
>     140000000-1481fffff : namespace0.0
>     150000000-33fffffff : �_�^7_��/_��wR��WQ���^��� ...
>     150000000-33fffffff : System RAM
> 
> We have to make sure to duplicate the string. While at it, remove the
> superfluous setting of the name and fixup a stale comment.
> 
> Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used like normal RAM")
> Cc: stable@vger.kernel.org # v5.3

hm.

Is this really -stable material?  These are all privileged operations,
I expect?

Assuming "yes", I've queued this separately, staged for 5.7-rcX.  I'll
redo patches 2-4 as a three-patch series for 5.8-rc1.

> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>

Reviewers, please ;)




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem
  2020-05-08 23:53   ` Andrew Morton
@ 2020-05-09  5:40     ` David Hildenbrand
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2020-05-09  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, linux-kernel, linux-mm, linux-nvdimm, kexec,
	Vishal Verma, Dave Jiang, Pavel Tatashin, stable, Dan Williams



> Am 09.05.2020 um 01:53 schrieb Andrew Morton <akpm@linux-foundation.org>:
> 
> On Fri,  8 May 2020 10:42:14 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>> Assume we have kmem configured and loaded:
>>  [root@localhost ~]# cat /proc/iomem
>>  ...
>>  140000000-33fffffff : Persistent Memory$
>>    140000000-1481fffff : namespace0.0
>>    150000000-33fffffff : dax0.0
>>      150000000-33fffffff : System RAM
>> 
>> Assume we try to unload kmem. This force-unloading will work, even if
>> memory cannot get removed from the system.
>>  [root@localhost ~]# rmmod kmem
>>  [   86.380228] removing memory fails, because memory [0x0000000150000000-0x0000000157ffffff] is onlined
>>  ...
>>  [   86.431225] kmem dax0.0: DAX region [mem 0x150000000-0x33fffffff] cannot be hotremoved until the next reboot
>> 
>> Now, we can reconfigure the namespace:
>>  [root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax
>>  [  131.409351] nd_pmem namespace0.0: could not reserve region [mem 0x140000000-0x33fffffff]dax
>>  [  131.410147] nd_pmem: probe of namespace0.0 failed with error -16namespace0.0 --mode=devdax
>>  ...
>> 
>> This fails as expected due to the busy memory resource, and the memory
>> cannot be used. However, the dax0.0 device is removed, and along its name.
>> 
>> The name of the memory resource now points at freed memory (name of the
>> device).
>>  [root@localhost ~]# cat /proc/iomem
>>  ...
>>  140000000-33fffffff : Persistent Memory
>>    140000000-1481fffff : namespace0.0
>>    150000000-33fffffff : �_�^7_��/_��wR��WQ���^��� ...
>>    150000000-33fffffff : System RAM
>> 
>> We have to make sure to duplicate the string. While at it, remove the
>> superfluous setting of the name and fixup a stale comment.
>> 
>> Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used like normal RAM")
>> Cc: stable@vger.kernel.org # v5.3
> 
> hm.
> 
> Is this really -stable material?  These are all privileged operations,
> I expect?

Yes, my thought was rather that an admin could bring the system into such a state (by mistake?). Let‘s see if somebody has a suggestion.

I guess if we were really unlucky, we could access invalid memory and trigger a BUG (e.g., page at the end of memory and does not contain a 0 byte).

> 
> Assuming "yes", I've queued this separately, staged for 5.7-rcX.  I'll
> redo patches 2-4 as a three-patch series for 5.8-rc1.

Make sense, let‘s wait for review feedback, thanks!



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-05-09  5:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-08  8:42 [PATCH v4 0/4] mm/memory_hotplug: Interface to add driver-managed system ram David Hildenbrand
2020-05-08  8:42 ` [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem David Hildenbrand
2020-05-08 23:53   ` Andrew Morton
2020-05-09  5:40     ` David Hildenbrand
2020-05-08  8:42 ` [PATCH v4 2/4] mm/memory_hotplug: Introduce add_memory_driver_managed() David Hildenbrand
2020-05-08  8:42 ` [PATCH v4 3/4] kexec_file: Don't place kexec images on IORESOURCE_MEM_DRIVER_MANAGED David Hildenbrand
2020-05-08  8:42 ` [PATCH v4 4/4] device-dax: Add memory via add_memory_driver_managed() David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).